What we can learn from the Exercise

Content

First Steps with C99
What we can learn from the Exercise

A compiler like gcc is an important tool for us. And in short, we can learn:

how to use such a tool,
structural understanding of what such a tool will do, can do and can not do when we use it.

About using the tools

We can see the result of the C Preprocessor (CPP) after processing processing the source file vec_sum.c with

gcc -E vec_sum.c

It will print the result on the terminal (on stdout) by default. If you want to have the result in a file you can use the -o flag, e.g.

gcc -E -o ver_sum.after_cpp vec_sum.c

or redirect the output with >, e.g.

gcc -E vec_sum.c > vec_sum.after_cpp

By the way, other flags like -Wall, -O1 etc. are in this case not relevant. These flags are only relevant for the actual C compiler, but we stop the translation process right after the preprocessing.
The result of the C Compiler is assembly code. Using the -S flag like in

gcc -S vec_sum.c

will create a file vec_sum.s with the produced assembly code. So by default a new file gets created. Using the -o flag we could specify a different name for the output, e.g. with

gcc -S -o vec_sum.asm vec_sum.c

the result gets stored in vec_sum.asm.

In this case flags like -Wall or the optimization flags -O1, -O2, -O3 will matter. In particular, using different optimization flags produces different

About `vec_sum.c`

Looking at the C code in

session06/vec_sum.c

#define N   5

unsigned char x[N] = {1,2,3,4,5};

int main()
{
    register unsigned long long  res = 0;
    for (register unsigned long long i=0; i<N; ++i) {
        res += x[i];
    }
    return res;
}

we see a for loop in Line 8. We know that the compiler actually sees the code that was first processed by the preprocessor, that is

heim$ gcc -E vec_sum.c
# 1 "vec_sum.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "vec_sum.c"


unsigned char x[5] = {1,2,3,4,5};

int main()
{
    register unsigned long long res = 0;
    for (register unsigned long long i=0; i<5; ++i) {
        res += x[i];
    }
    return res;
}
heim$

How a C compiler for the ULM could translate the C Code

It is important that we consider the result of the preprocessor when we think about how the C code could be translated into assembly code for the ULM. The result could be something like this:

session06/ulm/vec_sum_ulm.s

.equ    X       1
.equ    RES     2
.equ    I       3
.equ    TMP     4

.data
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
.text
        load    $x,     %X

        # RES = 0
        load    $0,     %RES

        # for (I=0; I<5; ++I)
        load    $0,     %I
        jmp     check
loop:
        movzbq  (%X,    %I),    %TMP            # TMP =  X[I]
        addq    %TMP,   %RES,   %RES            # RES += TMP
        addq    $1,     %I,     %I
check:
        subq    $5,     %I,     %0
        jb      loop

        # return RES
        halt %RES

Structure of assembly code produced by gcc

So the loop the compiler sees is for (register unsigned long long i=0; i<5; ++i) { ... }. Hence, if the C compiler would produce assembly code for the ULM we would expect that it had a structure like this:

...
.text
    ...

    # init a result-register res with 0
    # init a counter-register i with 0
    jmp check
loop:
    # add somehow x[i] to the result-register res
    # increment the counter-register i
check:
    # subtract 5 from the counter-register i (i.e. compute i-5)
    jb loop

    ...
    # halt with result

Now we generate the assembly code with

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer vec_sum.c
heim$

and have a closer look:

session06/vec_sum.s

        .file   "vec_sum.c"
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp
        pushq   %rbx
        movl    $0, %ebp
        movl    $0, %ebx
        jmp     .L2
.L3:
        movzbl  x(%rbx), %eax
        movzbl  %al, %eax
        addq    %rax, %rbp
        addq    $1, %rbx
.L2:
        cmpq    $4, %rbx
        jbe     .L3
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

We guessed a few things right. In the following extract the observed structure is pointed out in the comments:

        .text
        ...
main:
        movl    $0, %ebp              #   init a result-register res with 0
        movl    $0, %ebx              #   init a counter-register i with 0
        jmp     .L2                   #   jmp check
.L3:                                  # loop:
        movzbl  x(%rbx), %eax
        movzbl  %al, %eax
        addq    %rax, %rbpa           #   add x[i] to the result-register
        addq    $1, %rbx              #   increment the counter-register i
.L2:                                  # check:
        cmpq    $4, %rbx              #   subtract 4 from counter-regsiter
        jbe     .L3                   #   jbe loop
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret

However, the compiler treats the loop as the equivalent loop for (register unsigned long long i=0; i<=4; ++i).

Of course we also observe that there are some details that we can not understand at the moment as they are Intel64 specific. For example the counter register is sometimes denoted as %ebx and sometimes as %rbx.

Influence of optimization flag `-O1`

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o vec_sum_opt1.s vec_sum.c
heim$

session06/vec_sum_opt1.s

        .file   "vec_sum.c"
        .text
        .globl  main
        .type   main, @function
main:
        movl    $0, %edx
        movl    $0, %eax
.L2:
        movzbl  x(%rdx), %ecx
        addq    %rcx, %rax
        addq    $1, %rdx
        cmpq    $5, %rdx
        jne     .L2
        rep ret
        .size   main, .-main
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

With the -O1 optimization the loop was realized different. The assembly code has no longer the unconditional jump. Instead of the unconditional jbe is uses a jne (which is equivalent to jnz). The code has following structure:

        .text
        .globl  main
        .type   main, @function
main:
        movl    $0, %edx                #   init a result-register res with 0
        movl    $0, %eax                #   init a counter-register i with 0
.L2:                                    # loop:
        movzbl  x(%rdx), %ecx
        addq    %rcx, %rax              #   add x[i] to the result-register
        addq    $1, %rdx                #   increment the counter-register i
        cmpq    $5, %rdx                #   subtract 5 from counter-regsiter
        jne     .L2                     #   jne loop
        ...

In C we would express the structure of the loop as

   res = 0;
   do {
       res += x[i];
       i = i+1;
   } while (i!=5);

Even if you don't now the do-while loop of C, the difference to for loops and while loops is that the loop body is always executed at least once. And this property can be seen from the assembly code.

Influence of optimization flag `-O2`

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O2 -o vec_sum_opt2.s vec_sum.c
heim$

Execept for some minor details (that in this case really don't matter) the -O2 optimization gives the same code:

session06/vec_sum_opt2.s

        .file   "vec_sum.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .section        .text.startup,"ax",@progbits
.LHOTB0:
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
        xorl    %edx, %edx
        xorl    %eax, %eax
.L2:
        movzbl  x(%rdx), %ecx
        addq    $1, %rdx
        addq    %rcx, %rax
        cmpq    $5, %rdx
        jne     .L2
        rep ret
        .size   main, .-main
        .section        .text.unlikely
.LCOLDE0:
        .section        .text.startup
.LHOTE0:
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

Influence of optimization flag `-O3` (and your discovery of loop unrolling)

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O3 -o vec_sum_opt3.s vec_sum.c
heim$

Now this is interesting! There are no jump instructions at all:

session06/vec_sum_opt3.s

        .file   "vec_sum.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .section        .text.startup,"ax",@progbits
.LHOTB0:
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
        movzbl  x+1(%rip), %eax
        movzbl  x(%rip), %edx
        addq    %rax, %rdx
        movzbl  x+2(%rip), %eax
        addq    %rax, %rdx
        movzbl  x+3(%rip), %eax
        addq    %rax, %rdx
        movzbl  x+4(%rip), %eax
        addq    %rdx, %rax
        ret
        .size   main, .-main
        .section        .text.unlikely
.LCOLDE0:
        .section        .text.startup
.LHOTE0:
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

The code just fetches data from the memory and adds it:

    movzbl  x+1(%rip), %eax
    movzbl  x(%rip), %edx
    addq    %rax, %rdx
    movzbl  x+2(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+3(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+4(%rip), %eax
    addq    %rdx, %rax

We would expect such a code structure from C code like this:

  res = x[0];
  res = res + x[1];
  res = res + x[2];
  res = res + x[3];
  res = res + x[4];
  return res;

And that is how the compiler treated the loop. This is called loop unrolling.

About `for_loop.c

session06/for_loop.c

#define N   5

int main()
{
    register unsigned long long  res = 1;
    for (register unsigned long long i=2; i<=N; ++i) {
        res *= i;
    }
    return res;
}

heim$ gcc -E for_loop.c
# 1 "for_loop.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "for_loop.c"


int main()
{
    register unsigned long long res = 1;
    for (register unsigned long long i=2; i<=5; ++i) {
        res *= i;
    }
    return res;
}
heim$

How a C compiler for the ULM could translate the C Code

session06/ulm/for_loop_ulm.s

.equ    RES     1
.equ    I       2


.text
        # RES = 1
        load    $1,     %RES

        # for (I=2; I<=5; ++I)
        load    $2,     %I
loop:
        mulq    %I,     %RES,   %RES
        addq    $1,     %I,     %I
check:
        subq    $5,     %I,     %0
        jbe     loop

        # return RES
        halt    %RES

Structure of the produced assembly code

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer for_loop.c
heim$

You see again the structure of the for loop:

session06/for_loop.s

        .file   "for_loop.c"
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp
        pushq   %rbx
        movl    $1, %ebp
        movl    $2, %ebx
        jmp     .L2
.L3:
        imulq   %rbx, %rbp
        addq    $1, %rbx
.L2:
        cmpq    $5, %rbx
        jbe     .L3
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

Influence of optimization flag `-O1` (and your discovery of Compile time function execution)

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o for_loop_opt1.s for_loop.c
heim$

In this example we already achieve the best possible optimization with -O1 (but see for your self that -O2 and -O3 will produce the same result). The compiler already computed the factorial at compile time as you can see for yourself. The assembly code contains the result \(5! = 120\):

session06/for_loop_opt1.s

        .file   "for_loop.c"
        .text
        .globl  main
        .type   main, @function
main:
        movl    $120, %eax
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

About `while_loop.c` (and your discovery of how the preprocessor can be confusing)

This is an example for a not uncommon problem for beginners in C. The programmer uses N as if it was a variable. However, as N is expanded by the preprocessor it is actually a literal integer constant:

heim$ gcc -E while_loop.c
# 1 "while_loop.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "while_loop.c"


int main()
{
    register unsigned long long res = 1;
    while (5>1) {
        res *= 5;
        --5;
    }
    return res;
}
heim$

So it is understandable that the compiler complains I can not decrement the literal 5. However, the compiler uses the term lvalue. This is a term denotes in programming an expression that can stand on the left-hand side of an assignment (and therefore it value can be changed):

heim$ gcc -Wall -std=c11  while_loop.c
while_loop.c: In function ‘main’:
while_loop.c:8:9: error: lvalue required as decrement operand
         --N;
         ^
heim$

A simple fix could be to change N into a variable:

unsigned long long N = 5;

int main()
{
    register unsigned long long  res = 1;
    while (N>1) {
        res *= N;
        --N;
    }
    return res;
}

What we can learn from the Exercise

Content

About using the tools

About vec_sum.c

How a C compiler for the ULM could translate the C Code

Structure of assembly code produced by gcc

Influence of optimization flag -O1

Influence of optimization flag -O2

Influence of optimization flag -O3 (and your discovery of loop unrolling)

About `for_loop.c

How a C compiler for the ULM could translate the C Code

Structure of the produced assembly code

Influence of optimization flag -O1 (and your discovery of Compile time function execution)

About while_loop.c (and your discovery of how the preprocessor can be confusing)

About `vec_sum.c`

Influence of optimization flag `-O1`

Influence of optimization flag `-O2`

Influence of optimization flag `-O3` (and your discovery of loop unrolling)

Influence of optimization flag `-O1` (and your discovery of Compile time function execution)

About `while_loop.c` (and your discovery of how the preprocessor can be confusing)