Supplements

Here is some additional information about related topics. I will also collect questions that arise.

How the `jmp %CALL, %RET` and `jmp %RET, %0` pattern is related to ARM

The ARM architecture provides the “branch and link” instruction BL for calling a function:

    BL function_label

This stores the return address in a register called lr (for link register) and then jumps to the address referred to by the label function_label. So it is pretty much the same as jmp %CALL, %RET, except that on the ULM you have to specify the return register explicitly. This is sometimes annoying, as you have to make sure that all parties agree on the same return register. But I think for educational purposes a more explicit notation (that requires and enforces more discipline in programming) is ok.

Returning from the function is done on ARM with a move instruction. You simply overwrite the instruction pointer (on ARM it is called PC for program counter, on the ULM it is `%IP`) with the link register.

    MOV     pc, lr            /* Return from subroutine.
                                 Note: The MOV copies the right-hand-side to the
                                 left-hand-side
                              */

Note that on the ULM you cannot use %IP explicitly in an instruction, but jmp %RET, %0 implicitly does exactly what the MOV does on ARM.
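To make the call/return pattern a bit more tangible, here is a minimal sketch in C that models only the two registers involved. The struct `cpu` and the two function names are hypothetical, and the sketch assumes that every instruction is 4 bytes long, so the return address is the address of the call plus 4.

```c
#include <stdint.h>

/* Hypothetical model: only the program counter and the link register. */
struct cpu {
    uint64_t pc;    /* address of the instruction being executed */
    uint64_t lr;    /* return address (like %RET on the ULM)     */
};

/* Models "BL target" (or "jmp %CALL, %RET" with target in %CALL). */
static void
branch_and_link(struct cpu *c, uint64_t target)
{
    c->lr = c->pc + 4;  /* assumption: instructions are 4 bytes long */
    c->pc = target;     /* continue execution in the function        */
}

/* Models "MOV pc, lr" (or "jmp %RET, %0"). */
static void
return_from_subroutine(struct cpu *c)
{
    c->pc = c->lr;      /* continue right after the call instruction */
}
```

If the call instruction is located at address 0x100, then after the call the link register contains 0x104, and returning sets the program counter to exactly that address.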

Control structures

If-then-else in an algorithm

Flow chart (Variant 1)

Flow chart (Variant 2)

Loading a function address into `%CALL`

In general the address of a function foo can be \(2^{16}\) or larger. Then using

    ldzwq foo, %CALL

will fail. You can simulate this by trying to load a literal value that does not fit into 16 bits, e.g. 0x12345, into a register:

session09/load/ldzwq_fail.s

    ldzwq 0x12345, %1

You will get the following error from the code generator:

theon$ ulmas ldzwq_fail.s
ulmas: ldzwq_fail.s:1.5-26: value 74565 out of range [0, 2^16)
theon$

So how to solve this?

Besides ldzwq you need two more ingredients, the immediate operators @w0, ..., @w3 and the shldwq instruction:

The operator @w0 picks the least significant word (16 bits) of a literal (or label). So for example @w0(0x12345) picks 0x2345. And with the other operators you can pick the other words. In general, if \(X\) is some bit pattern then
\[\begin{array}{lcl}\text{@w0}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 0} \right\rfloor \bmod 2^{16} \\\text{@w1}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 1} \right\rfloor \bmod 2^{16} \\\text{@w2}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 2} \right\rfloor \bmod 2^{16} \\\text{@w3}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 3} \right\rfloor \bmod 2^{16} \\\end{array}\]
The shldwq (shift left load) instruction shifts the content of a register %Z 16 positions to the left and inserts the 16-bit pattern XY into the least significant bits. Bits shifted out at the left end of the 64-bit register are lost:
\[\left(u(\%\text{Z}) \cdot 2^{16} + u(\text{XY})\right) \bmod 2^{64} \to u(\%\text{Z})\]
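The two formulas can be sketched in C, where a right shift on an unsigned integer does the division with rounding down and a conversion to `uint16_t` does the modulo. The function names `word`, `ldzwq` and `shldwq` are just hypothetical helpers for illustration, and the sketch assumes 64-bit registers.

```c
#include <stdint.h>

/* Hypothetical helper: models @w0 .. @w3 for k = 0 .. 3. */
static uint16_t
word(uint64_t x, unsigned k)
{
    /* shifting right by 16*k bits drops the k lower words, the
       conversion to uint16_t keeps only the lowest 16 bits */
    return (uint16_t)(x >> (16 * k));
}

/* Hypothetical helper: models ldzwq, the 16-bit pattern gets
   extended with zeros to 64 bits. */
static uint64_t
ldzwq(uint16_t xy)
{
    return xy;
}

/* Hypothetical helper: models shldwq, returns the new content of
   the register that previously contained z. */
static uint64_t
shldwq(uint16_t xy, uint64_t z)
{
    /* bits shifted out on the left are lost (64-bit register) */
    return (z << 16) | xy;
}
```

For example, `word(0x12345, 0)` gives 0x2345 and `word(0x12345, 1)` gives 0x1, and `shldwq(0x2345, ldzwq(0x1))` gives back 0x12345.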

So for example

session09/load/load32.s

    ldzwq   @w1(0x12345),    %1
    shldwq  @w0(0x12345),    %1

loads the bit pattern 0x12345 into %1. And a 64-bit literal can be loaded like this:

session09/load/load64.s

    ldzwq   @w3(0x1234567890123456),    %1
    shldwq  @w2(0x1234567890123456),    %1
    shldwq  @w1(0x1234567890123456),    %1
    shldwq  @w0(0x1234567890123456),    %1

When you look at the generated machine code you can easily see which bit patterns were picked from the literal:

theon$ ulmas -o load64 load64.s
theon$ cat load64
#TEXT 4
0x0000000000000000: 56 12 34 01 #       ldzwq w3(0x1234567890123456), %1
0x0000000000000004: 5D 56 78 01 #       shldwq w2(0x1234567890123456), %1
0x0000000000000008: 5D 90 12 01 #       shldwq w1(0x1234567890123456), %1
0x000000000000000C: 5D 34 56 01 #       shldwq w0(0x1234567890123456), %1
theon$
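As a cross-check, the four instruction words from this listing can be replayed in C. This is only a sketch: the layout (one opcode byte, two bytes for the 16-bit pattern with the most significant byte first, one byte for the register) and the opcodes 0x56 for ldzwq and 0x5D for shldwq are read off the listing above, and the names `load64_text` and `replay` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The four instruction words copied from the listing of load64. */
static const uint8_t load64_text[4][4] = {
    {0x56, 0x12, 0x34, 0x01},   /* ldzwq  @w3(...), %1 */
    {0x5D, 0x56, 0x78, 0x01},   /* shldwq @w2(...), %1 */
    {0x5D, 0x90, 0x12, 0x01},   /* shldwq @w1(...), %1 */
    {0x5D, 0x34, 0x56, 0x01},   /* shldwq @w0(...), %1 */
};

/* Replays n instructions and returns the final register content.
   Simplification: the register byte is ignored, i.e. all instructions
   are assumed to use the same destination register. */
static uint64_t
replay(const uint8_t (*text)[4], size_t n)
{
    uint64_t reg = 0;

    for (size_t i = 0; i < n; ++i) {
        /* bytes 1 and 2 hold the 16-bit pattern */
        uint16_t imm = (uint16_t)((text[i][1] << 8) | text[i][2]);

        if (text[i][0] == 0x56) {           /* ldzwq  */
            reg = imm;
        } else if (text[i][0] == 0x5D) {    /* shldwq */
            reg = (reg << 16) | imm;
        }
    }
    return reg;
}
```

Calling `replay(load64_text, 4)` rebuilds the literal 0x1234567890123456 from the four 16-bit patterns.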

General pattern for loading a label (or address)

In general four instructions are needed to load an arbitrary 64-bit address or literal into a register, e.g.

session09/load/load_label.s

    ldzwq   @w3(some_label),    %1
    shldwq  @w2(some_label),    %1
    shldwq  @w1(some_label),    %1
    shldwq  @w0(some_label),    %1
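Four instructions always work. If the value is already known when you write the code, fewer can be enough, as in load32.s where two instructions were sufficient for 0x12345. The following sketch in C counts how many instructions (one ldzwq plus the required shldwq instructions) such a known value needs. The function name `instructions_needed` is hypothetical; for a label whose address you do not know in advance the four-instruction pattern remains the safe choice.

```c
#include <stdint.h>

/* Hypothetical helper: one instruction per 16-bit word, counted up to
   and including the most significant non-zero word (at least one). */
static unsigned
instructions_needed(uint64_t value)
{
    unsigned n = 1;                 /* the initial ldzwq            */

    while ((value >>= 16) != 0) {   /* another non-zero word left?  */
        ++n;                        /* then one more shldwq         */
    }
    return n;
}
```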

That is the price to pay for the simplicity and efficiency of a RISC architecture. On a CISC architecture you would just provide an instruction that is encoded with more bytes... On the other hand a RISC architecture is simpler, and that can mean less energy consumption, a higher clock rate, etc. And that usually pays off ... (Even the Intel64 architecture only looks like a CISC architecture to the outside world but internally has some RISC core).

Function calls on Intel64

In order to show you that our calling convention is relevant in practice, here is an example from the real world.

`foo.s` (Callee code)

The instruction format looks a bit different. For example there is an instruction addq, but it only takes two register operands. When executed the first register gets added to the second, and the second gets overwritten with the result (as if you wrote addq %1, %2, %2 on the ULM).

Also the register names are different (and strange). Our %CALLEE0 is here %rdi and our %CALLEE1 is here %rsi. The result of a function gets returned by writing it into %rax. And there is also this .globl directive, which is needed when functions are defined in separate source files and need to be linked.

But don't get confused by the details. Look at this code:

session09/gcc/foo.s

        .text
        .globl  foo
foo:
        addq    %rdi,   %rsi
        movq    %rsi,   %rax
        ret

Can you see the similarities to assembly code for the ULM like the following?

    .text
    .globl foo
foo:
    addq    %CALLEE0,   %CALLEE1,   %CALLEE1    # like addq    %rdi,   %rsi
    addq    %CALLEE1,   %0,         %CALLEE0    # like movq    %rsi,   %rax
    jmp     %RET,       %0                      # like ret
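Both assembly variants add the two arguments and return the sum. In C the same function could be described roughly as follows. This is only a sketch under the assumption that both arguments and the result are unsigned 64-bit integers; a compiler is free to generate different instructions for it than the hand-written ones above.

```c
#include <stdint.h>

/* Sketch: what the callee computes, described in C.
   Assumption: arguments and result are unsigned 64-bit integers. */
uint64_t
foo(uint64_t a, uint64_t b)
{
    return a + b;   /* first argument in %rdi, second in %rsi,
                       result in %rax on Intel64 */
}
```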

You can translate this assembly code into machine code by using gcc as a convenient front end for the GNU assembler (otherwise you need to know about several options for using it on theon):

theon$ gcc -c foo.s
theon$

The generated machine code gets written into the (object) file foo.o. To look at the machine code you can use the program objdump like this:

theon$ objdump -d foo.o

foo.o:     file format elf64-x86-64-sol2


Disassembly of section .text:

0000000000000000 <foo>:
   0:   48 01 fe                add    %rdi,%rsi
   3:   48 89 f0                mov    %rsi,%rax
   6:   c3                      retq   
theon$

Looks kind of familiar! Although there are differences, e.g. that instructions have different sizes.

`main.c` (Caller code)

Now some code that calls this function. This code is described in C as follows (and I say “described” on purpose, because we describe what machine code should be generated from that):

session09/gcc/main.c

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t
foo(uint64_t a, uint64_t b);

int
main()
{
    uint64_t x = 3;
    uint64_t y = 4;

    uint64_t res = foo(x, y);

    printf("foo(%" PRIu64 ", %" PRIu64 ") returned %" PRIu64 "\n", x, y, res);
}

Analogously this can be translated into machine code, this time using the GNU C compiler and subsequently the GNU assembler. This tool chain gets invoked by

theon$ gcc -c -O3 main.c
theon$

With the additional option -O3 I turned on some optimizations. You can skip that; I only used it so that the generated machine code is a bit shorter. Again you can look at the generated machine code with objdump:

theon$ objdump -d main.o

main.o:     file format elf64-x86-64-sol2


Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   be 04 00 00 00          mov    $0x4,%esi
   6:   bf 03 00 00 00          mov    $0x3,%edi
   b:   48 89 e5                mov    %rsp,%rbp
   e:   e8 00 00 00 00          callq  13 <main+0x13>
  13:   ba 04 00 00 00          mov    $0x4,%edx
  18:   48 89 c1                mov    %rax,%rcx
  1b:   be 03 00 00 00          mov    $0x3,%esi
  20:   bf 00 00 00 00          mov    $0x0,%edi
  25:   31 c0                   xor    %eax,%eax
  27:   e8 00 00 00 00          callq  2c <main+0x2c>
  2c:   31 c0                   xor    %eax,%eax
  2e:   5d                      pop    %rbp
  2f:   c3                      retq   
theon$

Let's link and run that thing

You can combine (the right expression is to link) these two pieces of machine code in main.o and foo.o by using gcc again as a convenient front-end for the Solaris linker ld:

theon$ gcc main.o foo.o
theon$

As I mentioned in the introduction video to Session 8 actually more than just these two object files get linked. But let's ignore the details for the moment and look at the generated executable a.out:

theon$ a.out
foo(3, 4) returned 7
theon$

You wanna play?

Replace addq %rdi, %rsi with imulq %rdi, %rsi. I guess you know what would happen ;-)