==================================
CBE Pt.7: Pointers and Arrays in C					[TOC]
==================================

---- VIDEO ------------------------------
https://www.youtube.com/embed/yv2TWgQ5ZYs
-----------------------------------------

Pointers and Arrays in the "hello, world" Program
=================================================

Usually the first "hello, world" program in C is presented in this form (or with
a `printf` instead of `puts`):

---- CODE (file=session09/c/hello.c) -------------------------------------------
#include <stdio.h>

int
main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

That's probably a good idea because it hides the declaration of `puts`, a
function that receives a pointer as its parameter. But we know what is hidden
behind the include directive `#include <stdio.h>`. So we can see that
essentially this code is processed by the compiler:


---- CODE (file=session09/c/hello1.c) ------------------------------------------
int puts(const char *s);

int
main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

You can see that the pointer parameter of `puts` also has the type qualifier
`const`. Here it just means that the value of the dereferenced pointer is
supposed to be read-only (the function should not modify what the pointer
points to).

So it is now clear that `puts` does not get the complete string as parameter.
It just gets a pointer to a character. You will see it is the address of the
first character in the string.

Now where is the array hidden in this code? You have to know that the C compiler
kind of rewrites your code. Every string literal becomes an array of characters.
The compiler will use some unique internal name for each string literal.
Usually these internal names contain a dot, e.g. `str.L1234`, so no name
conflict can arise, because identifiers that you can use cannot contain a dot.

So basically the source file above is treated as if you had written this
code:

---- CODE (file=session09/c/hello2.c) ------------------------------------------
int puts(const char *s);


char Lstr[14] = "hello, world!"; // length of string + 1 (string + zero byte)

int
main()
{
    puts(Lstr);
}
--------------------------------------------------------------------------------

Or like that:

---- CODE (file=session09/c/hello3.c) ------------------------------------------
int puts(const char *s);

char Lstr[] = "hello, world!"; // array size determined by initializer

int
main()
{
    puts(Lstr);
}
--------------------------------------------------------------------------------

Because the size of the array can be determined by the initializer you can
omit the size in the array declaration. The compiler can fill out the blank.

Recall that the identifier of an array, when used in an expression like this
one, is converted to the address of the first element. So with `puts(Lstr)` the
function `puts` gets the address of the first character in `"hello, world!"`.


Now also have a look at the assembly code generated by the ULM compiler:

--- SHELL (path=session09/c/) -------------------------------------------------
ucc -S hello.c
--------------------------------------------------------------------------------

The name of the array is `str.L0`. It is a label in the data segment. Following
this label, the directive `.string "hello, world!"` makes the assembler
generate a zero-terminated array of characters:

:import: session09/c/hello.s

Fun and Food for Thought
========================
Because string literals are treated like arrays you can use them like
arrays. For example like here:

---- CODE (file=session09/c/hello_fun.c) ---------------------------------------
int puts(const char *s);

int
main()
{
    "hello, world!"[3] = 'x';
    puts("hello, world!");
}
--------------------------------------------------------------------------------

The compiler will treat it basically as if you had written this code:

---- CODE (file=session09/c/hello_fun_rewritten.c) -----------------------------
int puts(const char *s);

char Lstr[] = "hello, world!";

int
main()
{
    Lstr[3] = 'x';
    puts(Lstr);
}
--------------------------------------------------------------------------------

Now let's compile it with the ULM compiler and gcc:

--- SHELL (path=session09/c/) -------------------------------------------------
ucc -o hello_fun_ucc hello_fun.c
gcc -o hello_fun_gcc hello_fun.c
--------------------------------------------------------------------------------

So both compilers translate the code into executables. Well, gcc gives a warning
but still compiles it, because nothing in the code violates a rule that the
compiler is required to enforce.

Running the Fun Program on the ULM
----------------------------------
We actually see that the string literal was modified:

--- SHELL (path=session09/c/) -------------------------------------------------
./hello_fun_ucc
-------------------------------------------------------------------------------

Running the Fun Program on Solaris
----------------------------------
The program gets executed but it crashes when the code tries to modify the
string literal:

--- SHELL (path=session09/c/) -------------------------------------------------
./hello_fun_gcc
-------------------------------------------------------------------------------

The reason for the crash on Solaris (and why the program works fine on the ULM)
is that the virtual memory on Solaris supports read-only segments (and the ULM
does not). Because gcc places the character array in a read-only segment, the
attempt to write to it triggers an error at run time (and it is a good thing to
get an error).

Bottom Line
-----------
Modifying string literals should be disallowed. But the C language does not
forbid modifying a string literal (the C standard merely leaves the behavior
undefined), because in general it would require some runtime overhead to check
whether your code tries to do that. Such a protection either has to come from
the hardware or you are on your own.


Grammar for Pascal (Just for Comparison)
========================================
Here is the complete __Pascal Syntax in BNF__. The __BNF (Backus-Naur-Form)__ is
a notation for describing context-free grammars. But it's not about the
notation. Just observe (before you have a look at the grammar for declarations
in C) that you could print out the complete grammar on two or three pages. This
language was designed for teaching, so a simple grammar was a design goal.

:links: Pascal Syntax in BNF -> https://condor.depaul.edu/ichu/csc447/notes/wk2/pascal.html
	BNF \(Backus-Naur-Form\) -> https://en.wikipedia.org/wiki/Backus–Naur_form


Grammar for Swift (Just for Comparison)
=======================================
In __Swift__ the grammar is more complex. But compared to Pascal or C the
language is very high-level and therefore provides more language features.
Hence, the __Grammar of Swift__ requires a few pages but is still simple and
easy to read. This language was designed for programming iOS apps, so it was
designed to attract programmers. This also requires a grammar that is simple to
understand.

:links: Swift -> https://en.wikipedia.org/wiki/Swift_(programming_language)
        Grammar of Swift -> https://docs.swift.org/swift-book/ReferenceManual/AboutTheLanguageReference.html

Grammar for Declarations in C
=============================
C was designed to be a portable assembler. So a completely different design goal
;-)


Because in C a program is a list of declarations and function definitions, the
grammar for declarations is a bit more complex than the grammar for
expressions. This is actually just the top part of the grammar (below you see
what is relevant for declaring a variable of type `int` or type `int *`):

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
    \langle\text{declaration}\rangle
	& \to &
	\langle\text{declaration-specifiers}\rangle\;
	\langle\text{initialized-declarator-list}\rangle\;
	\textbf{;}\;
	\\
    \langle\text{declaration-specifiers}\rangle\;
	& \to &
	\langle\text{storage-class-specifier}\rangle\;
	\langle\text{declaration-specifiers}\rangle_\text{opt}
	\\
	& \to &
	\langle\text{type-specifier}\rangle\;
	\langle\text{declaration-specifiers}\rangle_\text{opt}
	\\
	& \to &
	\langle\text{type-qualifier}\rangle\;
	\langle\text{declaration-specifiers}\rangle_\text{opt}
	\\
	& \to &
	\langle\text{function-qualifier}\rangle\;
	\langle\text{declaration-specifiers}\rangle_\text{opt}
	\\
    \langle\text{storage-class-specifier}\rangle
	& \to &
	\langle\text{typedef}\rangle\;
	\\
	& \to &
	\langle\text{extern}\rangle\;
	\\
	& \to &
	\langle\text{static}\rangle\;
	\\
	& \to &
	\langle\text{auto}\rangle\;
	\\
	& \to &
	\langle\text{register}\rangle_\text{not supported by ucc}\;
	\\
    \langle\text{type-qualifier}\rangle
	& \to &
	\langle\text{const}\rangle\;
	\\
	& \to &
	\langle\text{volatile}\rangle_\text{not supported by ucc}\;
	\\
	& \to &
	\langle\text{restrict}\rangle_\text{not supported by ucc}\;
	\\
    \langle\text{type-specifier}\rangle
	& \to &
	\textbf{void}
	\\
	& \to &
	\langle\text{integer-type-specifier}\rangle\;
	\\
	&\to &
	\langle\text{floating-point-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{enumeration-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{structure-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{union-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{typedef-specifier}\rangle\;
	\\
    \langle\text{initialized-declarator-list}\rangle
	& \to &
	\langle\text{initialized-declarator}\rangle
	\\
	& \to &
	\langle\text{initialized-declarator}\rangle\;
	\textbf{,}\;
	\langle\text{initialized-declarator-list}\rangle
	\\
    \langle\text{initialized-declarator}\rangle
	& \to &
	\langle\text{declarator}\rangle
	\\
	& \to &
	\langle\text{declarator}\rangle\;
	\textbf{=}\;
	\langle\text{initializer}\rangle\;
	\\
\end{array}
--------------------------------------------------------------------------------


The Relevant Part for Declaring Variables of Type `int` or `int *`
==================================================================

In the two declarations

---- CODE (type=c) -------------------------------------------------------------
int q;
int *p;
--------------------------------------------------------------------------------

the token `int` is the $\langle\text{integer-type-specifier}\rangle$ and, more
precisely, a $\langle\text{signed-type-specifier}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
    \langle\text{integer-type-specifier}\rangle
	& \to &
	\langle\text{signed-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{unsigned-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{character-type-specifier}\rangle\;
	\\
	& \to &
	\langle\text{bool-type-specifier}\rangle\;
	\\
    \langle\text{signed-type-specifier}\rangle
	& \to &
	\langle\text{int}\rangle\;
	\\
	& \to &
	\dots
	\\
\end{array}
--------------------------------------------------------------------------------

When parsing `int q` the $\langle\text{signed-type-specifier}\rangle$ is
followed by a $\langle\text{direct-declarator}\rangle$, whereas in `int *p`
it is followed by a $\langle\text{pointer}\rangle$ and then by a
$\langle\text{direct-declarator}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
    \langle\text{declarator}\rangle
	& \to &
	\langle\text{direct-declarator}\rangle\;
	\\
	& \to &
	\langle\text{pointer}\rangle\;
	\langle\text{direct-declarator}\rangle\;
	\\
    \langle\text{pointer}\rangle
	& \to &
	\textbf{*}\;
	\\
	& \to &
	\textbf{*}\;
	\langle\text{pointer}\rangle\;
	\\
	& \to &
	\textbf{*}\;
	\langle\text{type-qualifier-list}\rangle\;
	\\
	& \to &
	\textbf{*}\;
	\langle\text{type-qualifier-list}\rangle\;
	\langle\text{pointer}\rangle\;
	\\
\end{array}
--------------------------------------------------------------------------------

In both cases the $\langle\text{direct-declarator}\rangle$ is just an
$\langle\text{identifier}\rangle$:

---- LATEX ---------------------------------------------------------------------
\begin{array}{rcl}
    \langle\text{direct-declarator}\rangle
	& \to &
	\langle\text{simple-declarator}\rangle\;
	\\
	& \to &
	\textbf{(}\;
	\langle\text{simple-declarator}\rangle\;
	\textbf{)}\;
	\\
	& \to &
	\langle\text{function-declarator}\rangle\;
	\\
	& \to &
	\langle\text{array-declarator}\rangle\;
	\\
    \langle\text{simple-declarator}\rangle\;
	& \to &
	\langle\text{identifier}\rangle\;
	\\
\end{array}
--------------------------------------------------------------------------------