===================================
What we can learn from the Exercise                                        [TOC]
===================================

A compiler like `gcc` is an important tool for us. And in short, we can learn:

- how to use such a tool,
- structural understanding of what such a tool will do, can do and can not do
  when we use it.

About using the tools
=====================

- We can see the result of the *C Preprocessor (CPP)* after processing
  processing the source file `vec_sum.c` with

  `gcc -E vec_sum.c`

  It will print the result on the terminal (on `stdout`) by default.  If you
  want to have the result in a file you can use the `-o` flag, e.g.

  `gcc -E -o ver_sum.after_cpp vec_sum.c`

  or redirect the output with `>`, e.g.

  `gcc -E vec_sum.c > vec_sum.after_cpp`

  By the way, other flags like `-Wall`, `-O1` etc. are in this case not
  relevant.  These flags are only relevant for the actual C compiler, but we
  stop the translation process right after the preprocessing.

- The result of the *C Compiler* is assembly code.  Using the `-S` flag like
  in

  `gcc -S vec_sum.c`

  will create a file `vec_sum.s` with the produced assembly code.  So by default
  a new file gets created.  Using the `-o` flag we could specify a different
  name for the output, e.g. with

  `gcc -S -o vec_sum.asm  vec_sum.c`

  the result gets stored in `vec_sum.asm`.

  In this case flags like `-Wall` or the optimization flags  `-O1`, `-O2`,
  `-O3` will matter.  In particular, using different optimization flags
  produces different 

About `vec_sum.c`
=================

Looking at the C code in

:import: session06/vec_sum.c [linenumbers]

we see a for loop in Line 8. We know that the compiler actually sees the code
that was first processed by the preprocessor, that is

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -E vec_sum.c
--------------------------------------------------------------------------------

How a C compiler for the ULM could translate the C Code
-------------------------------------------------------
It is important that we consider the result of the preprocessor when we think
about how the C code could be translated into assembly code for the ULM.  The
result could be something like this:

:import: session06/ulm/vec_sum_ulm.s


Structure of assembly code produced by gcc
------------------------------------------

So the loop the compiler sees is `for (register unsigned long long i=0; i<5;
++i) { ... }`.  Hence, if the C compiler would produce assembly code for the
ULM we would expect that it had a structure like this:

---- CODE(type=s) --------------------------------------------------------------
...
.text
    ...

    # init a result-register res with 0
    # init a counter-register i with 0
    jmp check
loop:
    # add somehow x[i] to the result-register res
    # increment the counter-register i
check:
    # subtract 5 from the counter-register i (i.e. compute i-5)
    jb loop

    ...
    # halt with result
--------------------------------------------------------------------------------

Now we generate the assembly code with

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    vec_sum.c
--------------------------------------------------------------------------------

and have a closer look:

:import: session06/vec_sum.s [fold]

We guessed a few things right.  In the following extract the observed structure
is pointed out in the comments:

---- CODE(type=s) --------------------------------------------------------------
        .text
        ...
main:
        movl    $0, %ebp              #   init a result-register res with 0
        movl    $0, %ebx              #   init a counter-register i with 0
        jmp     .L2                   #   jmp check
.L3:                                  # loop:
        movzbl  x(%rbx), %eax
        movzbl  %al, %eax
        addq    %rax, %rbpa           #   add x[i] to the result-register
        addq    $1, %rbx              #   increment the counter-register i
.L2:                                  # check:
        cmpq    $4, %rbx              #   subtract 4 from counter-regsiter
        jbe     .L3                   #   jbe loop
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret
--------------------------------------------------------------------------------

However, the compiler treats the loop as the equivalent loop `for (register
unsigned long long i=0; i<=4; ++i)`.

Of course we also observe that there are some details that we can not
understand at the moment as they are Intel64 specific.  For example the counter
register is sometimes denoted as `%ebx` and sometimes as `%rbx`.

Influence of optimization flag `-O1`
------------------------------------

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    -O1 -o vec_sum_opt1.s vec_sum.c
--------------------------------------------------------------------------------

:import: session06/vec_sum_opt1.s [fold]

With the `-O1` optimization the loop was realized different.  The assembly code
has no longer the unconditional jump.  Instead of the unconditional `jbe` is
uses a `jne` (which is equivalent to `jnz`).  The code has following structure:

---- CODE(type=s) --------------------------------------------------------------
        .text
        .globl  main
        .type   main, @function
main:
        movl    $0, %edx                #   init a result-register res with 0
        movl    $0, %eax                #   init a counter-register i with 0
.L2:                                    # loop:
        movzbl  x(%rdx), %ecx
        addq    %rcx, %rax              #   add x[i] to the result-register
        addq    $1, %rdx                #   increment the counter-register i
        cmpq    $5, %rdx                #   subtract 5 from counter-regsiter
        jne     .L2                     #   jne loop
        ...
--------------------------------------------------------------------------------

In C we would express the structure of the loop as

---- CODE(type=c) --------------------------------------------------------------
   res = 0;
   do {
       res += x[i];
       i = i+1;
   } while (i!=5);
--------------------------------------------------------------------------------

Even if you don't now the do-while loop of C, the difference to for loops and
while loops is that the loop body is always executed at least once.  And this
property can be seen from the assembly code.


Influence of optimization flag `-O2`
------------------------------------

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    -O2 -o vec_sum_opt2.s vec_sum.c
--------------------------------------------------------------------------------

Execept for some minor details (that in this case really don't matter) the
`-O2` optimization gives the same code:

:import: session06/vec_sum_opt2.s [fold]

Influence of optimization flag `-O3` (and your discovery of loop unrolling)
---------------------------------------------------------------------------

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    -O3 -o vec_sum_opt3.s vec_sum.c
--------------------------------------------------------------------------------

Now this is interesting!  There are no jump instructions at all:

:import: session06/vec_sum_opt3.s [fold]

The code just fetches data from the memory and adds it:

---- CODE(type=s) --------------------------------------------------------------
    movzbl  x+1(%rip), %eax
    movzbl  x(%rip), %edx
    addq    %rax, %rdx
    movzbl  x+2(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+3(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+4(%rip), %eax
    addq    %rdx, %rax
--------------------------------------------------------------------------------

We would expect such a code structure from C code like this:

---- CODE(type=c) --------------------------------------------------------------
  res = x[0];
  res = res + x[1];
  res = res + x[2];
  res = res + x[3];
  res = res + x[4];
  return res;
--------------------------------------------------------------------------------

And that is how the compiler treated the loop.  This is called *loop unrolling*.

About `for_loop.c`
=================

:import: session06/for_loop.c [linenumbers]

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -E for_loop.c
--------------------------------------------------------------------------------

How a C compiler for the ULM could translate the C Code
-------------------------------------------------------

:import: session06/ulm/for_loop_ulm.s


Structure of the produced assembly code
---------------------------------------

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    for_loop.c
--------------------------------------------------------------------------------

You see again the structure of the for loop:

:import: session06/for_loop.s


Influence of optimization flag `-O1` (and your discovery of Compile time function execution)
--------------------------------------------------------------------------------------------

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer   +++
    -O1 -o for_loop_opt1.s for_loop.c
--------------------------------------------------------------------------------

In this example we already achieve the best possible optimization with `-O1`
(but see for your self that `-O2` and `-O3` will produce the same result).  The
compiler already computed the factorial *at compile time* as you can see for
yourself.  The assembly code contains the result $5! = 120$:

:import: session06/for_loop_opt1.s

About `while_loop.c` (and your discovery of how the preprocessor can be confusing)
==================================================================================

This is an example for a not uncommon problem for beginners in C.  The
programmer uses `N` as if it was a variable.  However, as `N` is expanded by
the preprocessor it is actually a literal integer constant:

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -E while_loop.c
--------------------------------------------------------------------------------

So it is understandable that the compiler complains _I can not decrement the
literal 5_.  However, the compiler uses the term `lvalue`.  This is a term
denotes in programming an expression that can stand on the *left-hand side* of
an assignment (and therefore it value can be changed):

---- SHELL(path=session06,hostname=heim) ---------------------------------------
gcc -Wall -std=c11  while_loop.c
--------------------------------------------------------------------------------

A simple fix could be to change `N` into a variable:

---- CODE(type=c) --------------------------------------------------------------
unsigned long long N = 5;

int main()
{
    register unsigned long long  res = 1;
    while (N>1) {
        res *= N;
        --N;
    }
    return res;
}
--------------------------------------------------------------------------------