What we can learn from the Exercise
Content |
A compiler like gcc is an important tool for us. And in short, we can learn:
-
how to use such a tool,
-
structural understanding of what such a tool will do, can do and can not do when we use it.
About using the tools
-
We can see the result of the C Preprocessor (CPP) after processing processing the source file vec_sum.c with
gcc -E vec_sum.c
It will print the result on the terminal (on stdout) by default. If you want to have the result in a file you can use the -o flag, e.g.
gcc -E -o ver_sum.after_cpp vec_sum.c
or redirect the output with >, e.g.
gcc -E vec_sum.c > vec_sum.after_cpp
By the way, other flags like -Wall, -O1 etc. are in this case not relevant. These flags are only relevant for the actual C compiler, but we stop the translation process right after the preprocessing.
-
The result of the C Compiler is assembly code. Using the -S flag like in
gcc -S vec_sum.c
will create a file vec_sum.s with the produced assembly code. So by default a new file gets created. Using the -o flag we could specify a different name for the output, e.g. with
gcc -S -o vec_sum.asm vec_sum.c
the result gets stored in vec_sum.asm.
In this case flags like -Wall or the optimization flags -O1, -O2, -O3 will matter. In particular, using different optimization flags produces different
About vec_sum.c
Looking at the C code in
#define N 5 unsigned char x[N] = {1,2,3,4,5}; int main() { register unsigned long long res = 0; for (register unsigned long long i=0; i<N; ++i) { res += x[i]; } return res; } |
we see a for loop in Line 8. We know that the compiler actually sees the code that was first processed by the preprocessor, that is
heim$ gcc -E vec_sum.c # 1 "vec_sum.c" # 1 "" # 1 " " # 1 "/usr/include/stdc-predef.h" 1 3 4 # 1 " " 2 # 1 "vec_sum.c" unsigned char x[5] = {1,2,3,4,5}; int main() { register unsigned long long res = 0; for (register unsigned long long i=0; i<5; ++i) { res += x[i]; } return res; } heim$
How a C compiler for the ULM could translate the C Code
It is important that we consider the result of the preprocessor when we think about how the C code could be translated into assembly code for the ULM. The result could be something like this:
.equ X 1 .equ RES 2 .equ I 3 .equ TMP 4 .data x: .byte 1 .byte 2 .byte 3 .byte 4 .byte 5 .text load $x, %X # RES = 0 load $0, %RES # for (I=0; I<5; ++I) load $0, %I jmp check loop: movzbq (%X, %I), %TMP # TMP = X[I] addq %TMP, %RES, %RES # RES += TMP addq $1, %I, %I check: subq $5, %I, %0 jb loop # return RES halt %RES
Structure of assembly code produced by gcc
So the loop the compiler sees is for (register unsigned long long i=0; i<5; ++i) { ... }. Hence, if the C compiler would produce assembly code for the ULM we would expect that it had a structure like this:
... .text ... # init a result-register res with 0 # init a counter-register i with 0 jmp check loop: # add somehow x[i] to the result-register res # increment the counter-register i check: # subtract 5 from the counter-register i (i.e. compute i-5) jb loop ... # halt with result
Now we generate the assembly code with
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer vec_sum.c heim$
and have a closer look:
.file "vec_sum.c" .globl x .data .type x, @object .size x, 5 x: .byte 1 .byte 2 .byte 3 .byte 4 .byte 5 .text .globl main .type main, @function main: pushq %rbp pushq %rbx movl $0, %ebp movl $0, %ebx jmp .L2 .L3: movzbl x(%rbx), %eax movzbl %al, %eax addq %rax, %rbp addq $1, %rbx .L2: cmpq $4, %rbx jbe .L3 movl %ebp, %eax popq %rbx popq %rbp ret .size main, .-main .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
We guessed a few things right. In the following extract the observed structure is pointed out in the comments:
.text ... main: movl $0, %ebp # init a result-register res with 0 movl $0, %ebx # init a counter-register i with 0 jmp .L2 # jmp check .L3: # loop: movzbl x(%rbx), %eax movzbl %al, %eax addq %rax, %rbpa # add x[i] to the result-register addq $1, %rbx # increment the counter-register i .L2: # check: cmpq $4, %rbx # subtract 4 from counter-regsiter jbe .L3 # jbe loop movl %ebp, %eax popq %rbx popq %rbp ret
However, the compiler treats the loop as the equivalent loop for (register unsigned long long i=0; i<=4; ++i).
Of course we also observe that there are some details that we can not understand at the moment as they are Intel64 specific. For example the counter register is sometimes denoted as %ebx and sometimes as %rbx.
Influence of optimization flag -O1
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o vec_sum_opt1.s vec_sum.c heim$
.file "vec_sum.c" .text .globl main .type main, @function main: movl $0, %edx movl $0, %eax .L2: movzbl x(%rdx), %ecx addq %rcx, %rax addq $1, %rdx cmpq $5, %rdx jne .L2 rep ret .size main, .-main .globl x .data .type x, @object .size x, 5 x: .byte 1 .byte 2 .byte 3 .byte 4 .byte 5 .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
With the -O1 optimization the loop was realized different. The assembly code has no longer the unconditional jump. Instead of the unconditional jbe is uses a jne (which is equivalent to jnz). The code has following structure:
.text .globl main .type main, @function main: movl $0, %edx # init a result-register res with 0 movl $0, %eax # init a counter-register i with 0 .L2: # loop: movzbl x(%rdx), %ecx addq %rcx, %rax # add x[i] to the result-register addq $1, %rdx # increment the counter-register i cmpq $5, %rdx # subtract 5 from counter-regsiter jne .L2 # jne loop ...
In C we would express the structure of the loop as
res = 0; do { res += x[i]; i = i+1; } while (i!=5);
Even if you don't now the do-while loop of C, the difference to for loops and while loops is that the loop body is always executed at least once. And this property can be seen from the assembly code.
Influence of optimization flag -O2
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O2 -o vec_sum_opt2.s vec_sum.c heim$
Execept for some minor details (that in this case really don't matter) the -O2 optimization gives the same code:
.file "vec_sum.c" .section .text.unlikely,"ax",@progbits .LCOLDB0: .section .text.startup,"ax",@progbits .LHOTB0: .p2align 4,,15 .globl main .type main, @function main: xorl %edx, %edx xorl %eax, %eax .L2: movzbl x(%rdx), %ecx addq $1, %rdx addq %rcx, %rax cmpq $5, %rdx jne .L2 rep ret .size main, .-main .section .text.unlikely .LCOLDE0: .section .text.startup .LHOTE0: .globl x .data .type x, @object .size x, 5 x: .byte 1 .byte 2 .byte 3 .byte 4 .byte 5 .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
Influence of optimization flag -O3 (and your discovery of loop unrolling)
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O3 -o vec_sum_opt3.s vec_sum.c heim$
Now this is interesting! There are no jump instructions at all:
.file "vec_sum.c" .section .text.unlikely,"ax",@progbits .LCOLDB0: .section .text.startup,"ax",@progbits .LHOTB0: .p2align 4,,15 .globl main .type main, @function main: movzbl x+1(%rip), %eax movzbl x(%rip), %edx addq %rax, %rdx movzbl x+2(%rip), %eax addq %rax, %rdx movzbl x+3(%rip), %eax addq %rax, %rdx movzbl x+4(%rip), %eax addq %rdx, %rax ret .size main, .-main .section .text.unlikely .LCOLDE0: .section .text.startup .LHOTE0: .globl x .data .type x, @object .size x, 5 x: .byte 1 .byte 2 .byte 3 .byte 4 .byte 5 .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
The code just fetches data from the memory and adds it:
movzbl x+1(%rip), %eax movzbl x(%rip), %edx addq %rax, %rdx movzbl x+2(%rip), %eax addq %rax, %rdx movzbl x+3(%rip), %eax addq %rax, %rdx movzbl x+4(%rip), %eax addq %rdx, %rax
We would expect such a code structure from C code like this:
res = x[0]; res = res + x[1]; res = res + x[2]; res = res + x[3]; res = res + x[4]; return res;
And that is how the compiler treated the loop. This is called loop unrolling.
About `for_loop.c
#define N 5 int main() { register unsigned long long res = 1; for (register unsigned long long i=2; i<=N; ++i) { res *= i; } return res; } |
heim$ gcc -E for_loop.c # 1 "for_loop.c" # 1 "" # 1 " " # 1 "/usr/include/stdc-predef.h" 1 3 4 # 1 " " 2 # 1 "for_loop.c" int main() { register unsigned long long res = 1; for (register unsigned long long i=2; i<=5; ++i) { res *= i; } return res; } heim$
How a C compiler for the ULM could translate the C Code
.equ RES 1 .equ I 2 .text # RES = 1 load $1, %RES # for (I=2; I<=5; ++I) load $2, %I loop: mulq %I, %RES, %RES addq $1, %I, %I check: subq $5, %I, %0 jbe loop # return RES halt %RES
Structure of the produced assembly code
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer for_loop.c heim$
You see again the structure of the for loop:
.file "for_loop.c" .text .globl main .type main, @function main: pushq %rbp pushq %rbx movl $1, %ebp movl $2, %ebx jmp .L2 .L3: imulq %rbx, %rbp addq $1, %rbx .L2: cmpq $5, %rbx jbe .L3 movl %ebp, %eax popq %rbx popq %rbp ret .size main, .-main .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
Influence of optimization flag -O1 (and your discovery of Compile time function execution)
heim$ gcc -Wall -std=c11 -S -fno-asynchronous-unwind-tables -fomit-frame-pointer -O1 -o for_loop_opt1.s for_loop.c heim$
In this example we already achieve the best possible optimization with -O1 (but see for your self that -O2 and -O3 will produce the same result). The compiler already computed the factorial at compile time as you can see for yourself. The assembly code contains the result \(5! = 120\):
.file "for_loop.c" .text .globl main .type main, @function main: movl $120, %eax ret .size main, .-main .ident "GCC: (Debian 4.9.2-10+deb8u1) 4.9.2" .section .note.GNU-stack,"",@progbits
About while_loop.c (and your discovery of how the preprocessor can be confusing)
This is an example for a not uncommon problem for beginners in C. The programmer uses N as if it was a variable. However, as N is expanded by the preprocessor it is actually a literal integer constant:
heim$ gcc -E while_loop.c # 1 "while_loop.c" # 1 "" # 1 " " # 1 "/usr/include/stdc-predef.h" 1 3 4 # 1 " " 2 # 1 "while_loop.c" int main() { register unsigned long long res = 1; while (5>1) { res *= 5; --5; } return res; } heim$
So it is understandable that the compiler complains I can not decrement the literal 5. However, the compiler uses the term lvalue. This is a term denotes in programming an expression that can stand on the left-hand side of an assignment (and therefore it value can be changed):
heim$ gcc -Wall -std=c11 while_loop.c while_loop.c: In function ‘main’: while_loop.c:8:9: error: lvalue required as decrement operand --N; ^ heim$
A simple fix could be to change N into a variable:
unsigned long long N = 5; int main() { register unsigned long long res = 1; while (N>1) { res *= N; --N; } return res; }