==================================
CBE Pt.7: Pointers and Arrays in C [TOC]
==================================

---- VIDEO ------------------------------
https://www.youtube.com/embed/yv2TWgQ5ZYs
-----------------------------------------

Pointers and Arrays in the "hello, world" Program
=================================================

Usually the first "hello, world" program in C is presented in this form (or
with `printf` instead of `puts`):

---- CODE (file=session09/c/hello.c) -------------------------------------------
#include <stdio.h>

int main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

That is probably a good idea because the include directive hides the
declaration of `puts`, a function that takes a pointer as its parameter. But we
know what is hidden behind the include directive `#include <stdio.h>`. So
essentially this is the code that the compiler processes:

---- CODE (file=session09/c/hello1.c) ------------------------------------------
int puts(const char *s);

int main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

You can see that the pointer parameter of `puts` also has the type qualifier
`const`. Here it just means that the value of the dereferenced pointer is
supposed to be read-only (the function should not modify what the pointer
points to). So it is now clear that `puts` does not get the complete string as
parameter. It just gets a pointer to a character, namely the address of the
first character in the string.

Now where is the array hidden in this code? You have to know that the C
compiler kind of rewrites your code. Every string literal becomes an array of
characters. The compiler uses a unique internal name for each string literal.
Usually these internal names contain a dot, e.g. `str.L1234`, so that no name
conflict can arise: identifiers that you can use in C cannot contain a dot.

So basically the source file above is treated as if you had written this code:

---- CODE (file=session09/c/hello2.c) ------------------------------------------
int puts(const char *s);

char Lstr[14] = "hello, world!"; // length of string + 1 (string + zero byte)

int main()
{
    puts(Lstr);
}
--------------------------------------------------------------------------------

Or like this:

---- CODE (file=session09/c/hello3.c) ------------------------------------------
int puts(const char *s);

char Lstr[] = "hello, world!"; // array size determined by initializer

int main()
{
    puts(Lstr);
}
--------------------------------------------------------------------------------

Because the size of the array can be determined from the initializer, you can
omit the size in the array declaration. The compiler fills in the blank. Recall
that the identifier of an array is also the address of its first element. So
with `puts(Lstr)` the function `puts` gets the address of the first character
in `"hello, world!"`.

Now also have a look at the assembly code generated by the ULM compiler:

--- SHELL (path=session09/c/) -------------------------------------------------
ucc -S hello.c
--------------------------------------------------------------------------------

The name of the array is `str.L0`. It is a label in the data segment. After
this label the assembler generates, with `.string "hello, world!"`, a
zero-terminated array of characters:

:import: session09/c/hello.s

Fun and Food for Thoughts
=========================

Because string literals are treated like arrays, you can use them like arrays.
For example, like this:

---- CODE (file=session09/c/hello_fun.c) ---------------------------------------
int puts(const char *s);

int main()
{
    "hello, world!"[3] = 'x';
    puts("hello, world!");
}
--------------------------------------------------------------------------------

The compiler treats it basically as if you had written this code:

---- CODE (file=session09/c/hello_fun_rewritten.c) -----------------------------
int puts(const char *s);

char Lstr[] = "hello, world!";

int main()
{
    Lstr[3] = 'x';
    puts(Lstr);
}
--------------------------------------------------------------------------------

Now let's compile it with the ULM compiler and gcc:

--- SHELL (path=session09/c/) -------------------------------------------------
ucc -o hello_fun_ucc hello_fun.c
gcc -o hello_fun_gcc hello_fun.c
--------------------------------------------------------------------------------

So both compilers translate the code into executables. Well, gcc gives a
warning but still compiles it because it is legal C code.

Running the Fun Program on the ULM
----------------------------------

We actually see that the string literal was modified:

--- SHELL (path=session09/c/) -------------------------------------------------
./hello_fun_ucc
-------------------------------------------------------------------------------

Running the Fun Program on Solaris
----------------------------------

The program gets executed but it crashes when the code tries to modify the
string literal:

--- SHELL (path=session09/c/) -------------------------------------------------
./hello_fun_gcc
-------------------------------------------------------------------------------

The reason for the crash on Solaris (and for the program running fine on the
ULM) is that virtual memory on Solaris supports read-only segments (and the ULM
does not). Because gcc places the character array in a read-only segment, the
attempted write triggers an error (and it is a good thing to get an error).

Bottom Line
-----------

Modifying string literals should be disallowed. But the C language does not
forbid code that modifies a string literal (the C standard merely leaves the
result undefined), because in general it would require some runtime overhead to
check whether your code tries to do that. Such protection either has to come
from the hardware or you are on your own.

Grammar for Pascal (Just for Comparison)
========================================

Here is the complete __Pascal Syntax in BNF__. The __BNF (Backus-Naur-Form)__
is a notation for describing context-free grammars. But it's not about the
notation. Just observe (before you have a look at the grammar for declarations
in C) that you could print out the complete grammar on two or three pages. This
language was designed for teaching, so a simple grammar was a design goal.

:links: Pascal Syntax in BNF -> https://condor.depaul.edu/ichu/csc447/notes/wk2/pascal.html
        BNF \(Backus-Naur-Form\) -> https://en.wikipedia.org/wiki/Backus–Naur_form

Grammar for Swift (Just for Comparison)
=======================================

In __Swift__ the grammar is more complex. But compared to Pascal or C the
language is very high-level and therefore provides more language features.
Hence, the __Grammar of Swift__ requires a few pages but is still simple and
easy to read. This language was designed for programming iOS apps, so it was
designed to attract programmers. This also requires a grammar that is simple to
understand.

:links: Swift -> https://en.wikipedia.org/wiki/Swift_(programming_language)
        Grammar of Swift -> https://docs.swift.org/swift-book/ReferenceManual/AboutTheLanguageReference.html

Grammar for Declarations in C
=============================

C was designed to be a portable assembler. So a completely different design
goal ;-) Because in C a program is a list of declarations and function
definitions, the grammar for declarations is a bit more complex than the
grammar for expressions.
This is actually just the top part of the grammar (below you see what is
relevant for declaring a variable of type `int` or type `int *`):

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
\langle\text{declaration}\rangle & \to & \langle\text{declaration-specifiers}\rangle\; \langle\text{initialized-declarator-list}\rangle\; \textbf{;}\; \\
\langle\text{declaration-specifiers}\rangle\; & \to & \langle\text{storage-class-specifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\
& \to & \langle\text{type-specifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\
& \to & \langle\text{type-qualifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\
& \to & \langle\text{function-qualifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\
\langle\text{storage-class-specifier}\rangle & \to & \langle\text{typedef}\rangle\; \\
& \to & \langle\text{extern}\rangle\; \\
& \to & \langle\text{static}\rangle\; \\
& \to & \langle\text{auto}\rangle\; \\
& \to & \langle\text{register}\rangle_\text{not supported by ucc}\; \\
\langle\text{type-qualifier}\rangle & \to & \langle\text{const}\rangle\; \\
& \to & \langle\text{volatile}\rangle_\text{not supported by ucc}\; \\
& \to & \langle\text{restrict}\rangle_\text{not supported by ucc}\; \\
\langle\text{type-specifier}\rangle & \to & \textbf{void} \\
& \to & \langle\text{integer-type-specifier}\rangle\; \\
& \to & \langle\text{floating-point-type-specifier}\rangle\; \\
& \to & \langle\text{enumeration-type-specifier}\rangle\; \\
& \to & \langle\text{structure-type-specifier}\rangle\; \\
& \to & \langle\text{union-type-specifier}\rangle\; \\
& \to & \langle\text{typedef-specifier}\rangle\; \\
\langle\text{initialized-declarator-list}\rangle & \to & \langle\text{initialized-declarator}\rangle \\
& \to & \langle\text{initialized-declarator}\rangle\; \textbf{,}\; \langle\text{initialized-declarator-list}\rangle \\
\langle\text{initialized-declarator}\rangle & \to & \langle\text{declarator}\rangle \\
& \to & \langle\text{declarator}\rangle\; \textbf{=}\; \langle\text{initializer}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------

The Relevant Part for Declaring Variables of Type `int` or `int *`
==================================================================

In the two declarations

---- CODE (type=c) -------------------------------------------------------------
int q;
int *p;
--------------------------------------------------------------------------------

the token `int` is the $\langle\text{integer-type-specifier}\rangle$ and, more
precisely, a $\langle\text{signed-type-specifier}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
\langle\text{integer-type-specifier}\rangle & \to & \langle\text{signed-type-specifier}\rangle\; \\
& \to & \langle\text{unsigned-type-specifier}\rangle\; \\
& \to & \langle\text{character-type-specifier}\rangle\; \\
& \to & \langle\text{bool-type-specifier}\rangle\; \\
\langle\text{signed-type-specifier}\rangle & \to & \langle\text{int}\rangle\; \\
& \to & \dots \\
\end{array}
--------------------------------------------------------------------------------

When parsing `int q` the $\langle\text{signed-type-specifier}\rangle$ is
followed by a $\langle\text{direct-declarator}\rangle$, whereas in `int *p` it
is followed by a $\langle\text{pointer}\rangle$ and then by a
$\langle\text{direct-declarator}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
\langle\text{declarator}\rangle & \to & \langle\text{direct-declarator}\rangle\; \\
& \to & \langle\text{pointer}\rangle\; \langle\text{direct-declarator}\rangle\; \\
\langle\text{pointer}\rangle & \to & \textbf{*}\; \\
& \to & \textbf{*}\; \langle\text{pointer}\rangle\; \\
& \to & \textbf{*}\; \langle\text{type-qualifier-list}\rangle\;
\\
& \to & \textbf{*}\; \langle\text{type-qualifier-list}\rangle\; \langle\text{pointer}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------

In both cases the $\langle\text{direct-declarator}\rangle$ is just an
$\langle\text{identifier}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
\langle\text{direct-declarator}\rangle & \to & \langle\text{simple-declarator}\rangle\; \\
& \to & \textbf{(}\; \langle\text{simple-declarator}\rangle\; \textbf{)}\; \\
& \to & \langle\text{function-declarator}\rangle\; \\
& \to & \langle\text{array-declarator}\rangle\; \\
\langle\text{simple-declarator}\rangle\; & \to & \langle\text{identifier}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------