CBE Pt.7: Pointers and Arrays in C

Pointers and Arrays in the “hello, world” Program

Usually the first “hello, world” program in C is presented in this form (or with printf instead of puts):

#include <stdio.h>

int
main()
{
    puts("hello, world!");
}

That is probably a good idea because it hides the declaration of puts, which is a function that takes a pointer as its parameter. But we know what is hidden behind the include directive #include <stdio.h>, so we can see that the compiler essentially processes this code:

int puts(const char *s);

int
main()
{
    puts("hello, world!");
}

As you can see, the pointer parameter of puts also has the type qualifier const. Here it just means that the value obtained by dereferencing the pointer is supposed to be read-only (the function should not modify what the pointer points to).

So it is now clear that puts does not receive the complete string as its parameter. It just gets a pointer to a character. As you will see, this is the address of the first character in the string.

Now where is the array hidden in this code? You have to know that the C compiler in effect rewrites your code. Every string literal becomes an array of characters. The compiler uses a unique internal name for each string literal. Usually these internal names contain a dot, e.g. str.L1234, so no name conflict can arise, because identifiers you can use cannot contain a dot.

So basically the source file above is treated as if you had written this code:

int puts(const char *s);


char Lstr[14] = "hello, world!"; // length of string + 1 (string + zero byte)

int
main()
{
    puts(Lstr);
}

Or like that:

int puts(const char *s);

char Lstr[] = "hello, world!"; // array size determined by initializer

int
main()
{
    puts(Lstr);
}

Because the size of the array can be determined from the initializer, you can omit the size in the array declaration. The compiler fills in the blank.

Recall that the identifier of an array also stands for the address of its first element. So with puts(Lstr) the function puts gets the address of the first character in "hello, world!".

Now also have a look at the assembly code generated by the ULM compiler:

theon$ ucc -S hello.c
theon$ 

The name of the array is str.L0. It is a label in the data segment. After this label, the directive .string "hello, world!" makes the assembler generate a zero-terminated array of characters:

        .equ     ZERO,   0
        .equ     FP,     1
        .equ     SP,     2
        .equ     RET,    3


        .data
str.L0:
        .string         "hello, world!"

        .text
/*
 * function main
 **/
        .globl  main
main:
        // function prologue
        movq     %RET,   (%SP)
        movq     %FP,    8(%SP)
        addq     0,      %SP,    %FP
        subq     0,      %SP,    %SP

        // begin of the function body
        #0 {...
        # puts("hello, world!");
        subq     32,     %SP,    %SP
        ldzwq    @w3(str.L0),    %4
        shldwq   @w2(str.L0),    %4
        shldwq   @w1(str.L0),    %4
        shldwq   @w0(str.L0),    %4
        movq     %4,     24(%SP)
        ldzwq    @w3(puts),      %4
        shldwq   @w2(puts),      %4
        shldwq   @w1(puts),      %4
        shldwq   @w0(puts),      %4
        jmp      %4,     %RET
        movswq   16(%SP),        %4
        addq     32,     %SP,    %SP
        #0 ...}
        // end of the function body

        // function epilogue
        // 'main' returns 0 if there is no return statement
        movw     %0,     16(%FP)
leave.L1:
        addq     0,      %FP,    %SP
        movq     8(%SP),         %FP
        movq     (%SP),  %RET
        jmp      %RET,   %0

Fun and Food for Thought

Because string literals are treated like arrays, you can use them like arrays. For example:

int puts(const char *s);

int
main()
{
    "hello, world!"[3] = 'x';
    puts("hello, world!");
}

The compiler treats it basically as if you had written this code:

int puts(const char *s);

char Lstr[] = "hello, world!";

int
main()
{
    Lstr[3] = 'x';
    puts(Lstr);
}

Now let's compile it with the ULM compiler and gcc:

theon$ ucc -o hello_fun_ucc hello_fun.c
theon$ gcc -o hello_fun_gcc hello_fun.c
hello_fun.c: In function 'main':
hello_fun.c:6:5: warning: assignment of read-only location '"hello, world!"[3]'
    6 |     "hello, world!"[3] = 'x';
      |     ^~~~~~~~~~~~~~~
theon$ 

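So both compilers translate the code into executables. gcc gives a warning but still compiles it, because the code is syntactically valid C: a string literal has an array type whose elements are not const-qualified. The C standard nevertheless leaves the effect of modifying a string literal undefined.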

Running the Fun Program on the ULM

We actually see that the string literal was modified:

theon$ ./hello_fun_ucc
helxo, world!
theon$ 

Running the Fun Program on Solaris

The program gets executed but it crashes when the code tries to modify the string literal:

theon$ ./hello_fun_gcc
theon$ 

The reason for the crash on Solaris (and for the program working fine on the ULM) is that virtual memory on Solaris supports read-only segments, whereas the ULM does not. Because gcc places the character array in a read-only segment, the write triggers an error (and getting an error here is a good thing).

Bottom Line

Modifying string literals should be disallowed. But the C language does not forbid it at compile time; the standard merely leaves the behavior undefined. In general, checking whether your code might attempt such a modification would require some runtime overhead. So the protection either has to come from the hardware or you are on your own.

Grammar for Pascal (Just for Comparison)

Here is the complete Pascal syntax in BNF. BNF (Backus-Naur Form) is a notation for describing context-free grammars. But this is not about the notation. Just observe (before you have a look at the grammar for declarations in C) that you could print out the complete grammar on two or three pages. Pascal was designed for teaching, so having a simple grammar was a design goal.

Grammar for Swift (Just for Comparison)

The grammar of Swift is more complex. But compared to Pascal or C, Swift is very high-level and therefore provides more language features. Hence, the grammar of Swift requires a few pages but is still simple and easy to read. Swift was designed for programming iOS apps, so it was designed to attract programmers. That also requires a grammar that is simple to understand.

Grammar for Declarations in C

C was designed to be a portable assembler. So a completely different design goal ;-)

Because a C program is a list of declarations and function definitions, the grammar for declarations is a bit more complex than the grammar for expressions. This is actually just the top part of the grammar (below you see what is relevant for declaring a variable of type int or type int *):

\[\begin{array}{rcl} \langle\text{declaration}\rangle & \to & \langle\text{declaration-specifiers}\rangle\; \langle\text{initialized-declarator-list}\rangle\; \textbf{;}\; \\ \langle\text{declaration-specifiers}\rangle\; & \to & \langle\text{storage-class-specifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\ & \to & \langle\text{type-specifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\ & \to & \langle\text{type-qualifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\ & \to & \langle\text{function-qualifier}\rangle\; \langle\text{declaration-specifiers}\rangle_\text{opt} \\ \langle\text{storage-class-specifier}\rangle & \to & \langle\text{typedef}\rangle\; \\ & \to & \langle\text{extern}\rangle\; \\ & \to & \langle\text{static}\rangle\; \\ & \to & \langle\text{auto}\rangle\; \\ & \to & \langle\text{register}\rangle_\text{not supported by ucc}\; \\ \langle\text{type-qualifier}\rangle & \to & \langle\text{const}\rangle\; \\ & \to & \langle\text{volatile}\rangle_\text{not supported by ucc}\; \\ & \to & \langle\text{restrict}\rangle_\text{not supported by ucc}\; \\ \langle\text{type-specifier}\rangle & \to & \textbf{void} \\ & \to & \langle\text{integer-type-specifier}\rangle\; \\ &\to & \langle\text{floating-point-type-specifier}\rangle\; \\ & \to & \langle\text{enumeration-type-specifier}\rangle\; \\ & \to & \langle\text{structure-type-specifier}\rangle\; \\ & \to & \langle\text{union-type-specifier}\rangle\; \\ & \to & \langle\text{typedef-specifier}\rangle\; \\ \langle\text{initialized-declarator-list}\rangle & \to & \langle\text{initialized-declarator}\rangle \\ & \to & \langle\text{initialized-declarator}\rangle\; \textbf{,}\; \langle\text{initialized-declarator-list}\rangle \\ \langle\text{initialized-declarator}\rangle & \to & \langle\text{declarator}\rangle \\ & \to & \langle\text{declarator}\rangle\; \textbf{=}\; \langle\text{initializer}\rangle\; \\\end{array}\]

The Relevant Part for Declaring Variables of Type int or int *

In the two declarations

int q;
int *p;

the token int is an \(\langle\text{integer-type-specifier}\rangle\) and, more precisely, a \(\langle\text{signed-type-specifier}\rangle\):

\[\begin{array}{rcl} \langle\text{integer-type-specifier}\rangle & \to & \langle\text{signed-type-specifier}\rangle\; \\ & \to & \langle\text{unsigned-type-specifier}\rangle\; \\ & \to & \langle\text{character-type-specifier}\rangle\; \\ & \to & \langle\text{bool-type-specifier}\rangle\; \\ \langle\text{signed-type-specifier}\rangle & \to & \langle\text{int}\rangle\; \\ & \to & \dots \\\end{array}\]

When parsing int q the \(\langle\text{signed-type-specifier}\rangle\) is followed by a \(\langle\text{direct-declarator}\rangle\), whereas in int *p it is followed by a \(\langle\text{pointer}\rangle\) and then by a \(\langle\text{direct-declarator}\rangle\):

\[\begin{array}{rcl} \langle\text{declarator}\rangle & \to & \langle\text{direct-declarator}\rangle\; \\ & \to & \langle\text{pointer}\rangle\; \langle\text{direct-declarator}\rangle\; \\ \langle\text{pointer}\rangle & \to & \textbf{*}\; \\ & \to & \textbf{*}\; \langle\text{pointer}\rangle\; \\ & \to & \textbf{*}\; \langle\text{type-qualifier-list}\rangle\; \\ & \to & \textbf{*}\; \langle\text{type-qualifier-list}\rangle\; \langle\text{pointer}\rangle\; \\\end{array}\]

In both cases the \(\langle\text{direct-declarator}\rangle\) is just an \(\langle\text{identifier}\rangle\):

\[\begin{array}{rcl} \langle\text{direct-declarator}\rangle & \to & \langle\text{simple-declarator}\rangle\; \\ & \to & \textbf{(}\; \langle\text{simple-declarator}\rangle\; \textbf{)}\; \\ & \to & \langle\text{function-declarator}\rangle\; \\ & \to & \langle\text{array-declarator}\rangle\; \\ \langle\text{simple-declarator}\rangle\; & \to & \langle\text{identifier}\rangle\; \\\end{array}\]