What we can learn from the Exercise

Content

A compiler like gcc is an important tool for us. And in short, we can learn:

About using the tools

About vec_sum.c

Looking at the C code in

      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
#define N   5

unsigned char x[N] = {1,2,3,4,5};

int main()
{
    register unsigned long long  res = 0;
    for (register unsigned long long i=0; i<N; ++i) {
        res += x[i];
    }
    return res;
}

we see a for loop in Line 8. We know that the compiler actually sees the code that was first processed by the preprocessor, that is

heim$ gcc -E vec_sum.c
# 1 "vec_sum.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "vec_sum.c"


unsigned char x[5] = {1,2,3,4,5};

int main()
{
    register unsigned long long res = 0;
    for (register unsigned long long i=0; i<5; ++i) {
        res += x[i];
    }
    return res;
}
heim$ 

How a C compiler for the ULM could translate the C Code

It is important that we consider the result of the preprocessor when we think about how the C code could be translated into assembly code for the ULM. The result could be something like this:

.equ    X       1
.equ    RES     2
.equ    I       3
.equ    TMP     4

.data
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
.text
        load    $x,     %X

        # RES = 0
        load    $0,     %RES

        # for (I=0; I<5; ++I)
        load    $0,     %I
        jmp     check
loop:
        movzbq  (%X,    %I),    %TMP            # TMP =  X[I]
        addq    %TMP,   %RES,   %RES            # RES += TMP
        addq    $1,     %I,     %I
check:
        subq    $5,     %I,     %0
        jb      loop

        # return RES
        halt %RES

Structure of assembly code produced by gcc

So the loop the compiler sees is for (register unsigned long long i=0; i<5; ++i) { ... }. Hence, if the C compiler would produce assembly code for the ULM we would expect that it had a structure like this:

...
.text
    ...

    # init a result-register res with 0
    # init a counter-register i with 0
    jmp check
loop:
    # add somehow x[i] to the result-register res
    # increment the counter-register i
check:
    # subtract 5 from the counter-register i (i.e. compute i-5)
    jb loop

    ...
    # halt with result

Now we generate the assembly code with

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer vec_sum.c
heim$ 

and have a closer look:

        .file   "vec_sum.c"
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp
        pushq   %rbx
        movl    $0, %ebp
        movl    $0, %ebx
        jmp     .L2
.L3:
        movzbl  x(%rbx), %eax
        movzbl  %al, %eax
        addq    %rax, %rbp
        addq    $1, %rbx
.L2:
        cmpq    $4, %rbx
        jbe     .L3
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

We guessed a few things right. In the following extract the observed structure is pointed out in the comments:

        .text
        ...
main:
        movl    $0, %ebp              #   init a result-register res with 0
        movl    $0, %ebx              #   init a counter-register i with 0
        jmp     .L2                   #   jmp check
.L3:                                  # loop:
        movzbl  x(%rbx), %eax
        movzbl  %al, %eax
        addq    %rax, %rbpa           #   add x[i] to the result-register
        addq    $1, %rbx              #   increment the counter-register i
.L2:                                  # check:
        cmpq    $4, %rbx              #   subtract 4 from counter-regsiter
        jbe     .L3                   #   jbe loop
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret

However, the compiler treats the loop as the equivalent loop for (register unsigned long long i=0; i<=4; ++i).

Of course we also observe that there are some details that we can not understand at the moment as they are Intel64 specific. For example the counter register is sometimes denoted as %ebx and sometimes as %rbx.

Influence of optimization flag -O1

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o vec_sum_opt1.s vec_sum.c
heim$ 
        .file   "vec_sum.c"
        .text
        .globl  main
        .type   main, @function
main:
        movl    $0, %edx
        movl    $0, %eax
.L2:
        movzbl  x(%rdx), %ecx
        addq    %rcx, %rax
        addq    $1, %rdx
        cmpq    $5, %rdx
        jne     .L2
        rep ret
        .size   main, .-main
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

With the -O1 optimization the loop was realized different. The assembly code has no longer the unconditional jump. Instead of the unconditional jbe is uses a jne (which is equivalent to jnz). The code has following structure:

        .text
        .globl  main
        .type   main, @function
main:
        movl    $0, %edx                #   init a result-register res with 0
        movl    $0, %eax                #   init a counter-register i with 0
.L2:                                    # loop:
        movzbl  x(%rdx), %ecx
        addq    %rcx, %rax              #   add x[i] to the result-register
        addq    $1, %rdx                #   increment the counter-register i
        cmpq    $5, %rdx                #   subtract 5 from counter-regsiter
        jne     .L2                     #   jne loop
        ...

In C we would express the structure of the loop as

   res = 0;
   do {
       res += x[i];
       i = i+1;
   } while (i!=5);

Even if you don't now the do-while loop of C, the difference to for loops and while loops is that the loop body is always executed at least once. And this property can be seen from the assembly code.

Influence of optimization flag -O2

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O2 -o vec_sum_opt2.s vec_sum.c
heim$ 

Execept for some minor details (that in this case really don't matter) the -O2 optimization gives the same code:

        .file   "vec_sum.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .section        .text.startup,"ax",@progbits
.LHOTB0:
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
        xorl    %edx, %edx
        xorl    %eax, %eax
.L2:
        movzbl  x(%rdx), %ecx
        addq    $1, %rdx
        addq    %rcx, %rax
        cmpq    $5, %rdx
        jne     .L2
        rep ret
        .size   main, .-main
        .section        .text.unlikely
.LCOLDE0:
        .section        .text.startup
.LHOTE0:
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

Influence of optimization flag -O3 (and your discovery of loop unrolling)

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O3 -o vec_sum_opt3.s vec_sum.c
heim$ 

Now this is interesting! There are no jump instructions at all:

        .file   "vec_sum.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .section        .text.startup,"ax",@progbits
.LHOTB0:
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
        movzbl  x+1(%rip), %eax
        movzbl  x(%rip), %edx
        addq    %rax, %rdx
        movzbl  x+2(%rip), %eax
        addq    %rax, %rdx
        movzbl  x+3(%rip), %eax
        addq    %rax, %rdx
        movzbl  x+4(%rip), %eax
        addq    %rdx, %rax
        ret
        .size   main, .-main
        .section        .text.unlikely
.LCOLDE0:
        .section        .text.startup
.LHOTE0:
        .globl  x
        .data
        .type   x, @object
        .size   x, 5
x:
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

The code just fetches data from the memory and adds it:

    movzbl  x+1(%rip), %eax
    movzbl  x(%rip), %edx
    addq    %rax, %rdx
    movzbl  x+2(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+3(%rip), %eax
    addq    %rax, %rdx
    movzbl  x+4(%rip), %eax
    addq    %rdx, %rax

We would expect such a code structure from C code like this:

  res = x[0];
  res = res + x[1];
  res = res + x[2];
  res = res + x[3];
  res = res + x[4];
  return res;

And that is how the compiler treated the loop. This is called loop unrolling.

About `for_loop.c

     1
     2
     3
     4
     5
     6
     7
     8
     9
     10
#define N   5

int main()
{
    register unsigned long long  res = 1;
    for (register unsigned long long i=2; i<=N; ++i) {
        res *= i;
    }
    return res;
}
heim$ gcc -E for_loop.c
# 1 "for_loop.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "for_loop.c"


int main()
{
    register unsigned long long res = 1;
    for (register unsigned long long i=2; i<=5; ++i) {
        res *= i;
    }
    return res;
}
heim$ 

How a C compiler for the ULM could translate the C Code

.equ    RES     1
.equ    I       2


.text
        # RES = 1
        load    $1,     %RES

        # for (I=2; I<=5; ++I)
        load    $2,     %I
loop:
        mulq    %I,     %RES,   %RES
        addq    $1,     %I,     %I
check:
        subq    $5,     %I,     %0
        jbe     loop

        # return RES
        halt    %RES

Structure of the produced assembly code

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer for_loop.c
heim$ 

You see again the structure of the for loop:

        .file   "for_loop.c"
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp
        pushq   %rbx
        movl    $1, %ebp
        movl    $2, %ebx
        jmp     .L2
.L3:
        imulq   %rbx, %rbp
        addq    $1, %rbx
.L2:
        cmpq    $5, %rbx
        jbe     .L3
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

Influence of optimization flag -O1 (and your discovery of Compile time function execution)

heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o for_loop_opt1.s for_loop.c
heim$ 

In this example we already achieve the best possible optimization with -O1 (but see for your self that -O2 and -O3 will produce the same result). The compiler already computed the factorial at compile time as you can see for yourself. The assembly code contains the result \(5! = 120\):

        .file   "for_loop.c"
        .text
        .globl  main
        .type   main, @function
main:
        movl    $120, %eax
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2"
        .section        .note.GNU-stack,"",@progbits

About while_loop.c (and your discovery of how the preprocessor can be confusing)

This is an example for a not uncommon problem for beginners in C. The programmer uses N as if it was a variable. However, as N is expanded by the preprocessor it is actually a literal integer constant:

heim$ gcc -E while_loop.c
# 1 "while_loop.c"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "while_loop.c"


int main()
{
    register unsigned long long res = 1;
    while (5>1) {
        res *= 5;
        --5;
    }
    return res;
}
heim$ 

So it is understandable that the compiler complains I can not decrement the literal 5. However, the compiler uses the term lvalue. This is a term denotes in programming an expression that can stand on the left-hand side of an assignment (and therefore it value can be changed):

heim$ gcc -Wall -std=c11  while_loop.c
while_loop.c: In function ‘main’:
while_loop.c:8:9: error: lvalue required as decrement operand
         --N;
         ^
heim$ 

A simple fix could be to change N into a variable:

unsigned long long N = 5;

int main()
{
    register unsigned long long  res = 1;
    while (N>1) {
        res *= N;
        --N;
    }
    return res;
}