===========
Supplements								[TOC]
===========

Here is some additional information about related topics. I will also collect
questions that arise.

How the `jmp %CALL, %RET` and `jmp %RET, %0` pattern is related to ARM
======================================================================

The __ARM architecture__ provides the "branch and link" instruction `BL` for
calling a function:

---- CODE (type=s) -------------------------------------------------------------
    BL	function_label
--------------------------------------------------------------------------------

This stores the return address in a register called `lr` (for _link register_)
and then jumps to the address referred to by the label `function_label`. So it
is pretty much the same as `jmp %CALL, %RET`, except that on the ULM you have to
specify the return register explicitly. This is sometimes annoying, as you have
to make sure that all parties agree on the same return register. But I think
for educational purposes a more explicit notation (that requires and enforces
more discipline in programming) is ok.

Returning from the function is done on ARM with a move instruction. You simply
overwrite the instruction pointer (on ARM it's called `PC` for program counter,
on the ULM it is `%IP`) with the link register.

---- CODE (type=s) -------------------------------------------------------------
    MOV     pc, lr            /* Return from subroutine.
				 Note: The MOV copies the right-hand-side to the
				 left-hand-side
			      */
--------------------------------------------------------------------------------


Note that on the ULM you cannot use `%IP` explicitly in an instruction, but
`jmp %RET, %0` implicitly does exactly the same thing as the `MOV` does on ARM.
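
The correspondence can be made concrete with a small simulation. The following
Python sketch is neither ULM nor ARM code. It models a hypothetical machine with
a program counter and one return register, where `call` behaves like `BL` or
`jmp %CALL, %RET` and `ret` behaves like `MOV pc, lr` or `jmp %RET, %0`. The
instruction tuples, the function `run` and the instruction size of 4 bytes are
assumptions made for illustration only.

```python
# Toy model of "branch and link": a call stores the return address in a
# register and jumps; a return copies that register back into the program
# counter. All names here are made up for this sketch.

INSTR_SIZE = 4  # assumed: every instruction occupies 4 bytes


def run(program, start=0):
    """Execute `program` (dict: address -> instruction) and return a trace."""
    pc = start          # program counter (PC on ARM, %IP on the ULM)
    ret = None          # return register (lr on ARM, %RET on the ULM)
    trace = []
    while True:
        op, *args = program[pc]
        trace.append((pc, op))
        if op == "call":        # like BL label  or  jmp %CALL, %RET
            ret = pc + INSTR_SIZE
            pc = args[0]
        elif op == "ret":       # like MOV pc, lr  or  jmp %RET, %0
            pc = ret
        elif op == "halt":
            return trace
        else:                   # any other instruction: just go on
            pc += INSTR_SIZE


program = {
    0:  ("nop",),
    4:  ("call", 16),   # call the function at address 16
    8:  ("halt",),      # execution continues here after the return
    16: ("nop",),       # function body
    20: ("ret",),
}

for address, op in run(program):
    print(f"{address:2d}: {op}")
```

The trace visits the addresses 0, 4, 16, 20 and 8, i.e. the return lands on the
instruction right after the call.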


Control structures
==================
* If-then-else in an algorithm
  
  ---- TIKZ -----------
  \begin{adjustbox}{}
  \textcolor{white}{.}
  \begin{varwidth}{7cm}
  \begin{algorithmic}
  \State $A$
  \If{$\text{cond}$}
      \State $B$
  \Else
      \State $C$
  \EndIf
  \State $D$
  \end{algorithmic}
  \end{varwidth}
  \end{adjustbox}
  ---------------------

* Flow chart (Variant 1)

  ---- TIKZ --------------
  \begin{tikzpicture}
  \input{flowchart.tex}
  \SetMargin{1}{1}{0}{5}
 
  \PutStatement{0}{A}
  \PutJump{1}{$\text{cond}$}
  \PutStatement{2}{C}
  \PutJump{3}{}
  \PutStatement{4}{B}
  \PutStatement{5}{D}
 
  \AddPath{0}{1}
  \AddPath{1}{2}
  \AddCondJumpPath{1}{4}
  \AddPath{2}{3}
  \AddJumpPathLeft{3}{5}
  \AddPath{4}{5}
 
  \end{tikzpicture}
  ------------------------

* Flow chart (Variant 2)

  ---- TIKZ ---------------
  \begin{tikzpicture}
  \input{flowchart.tex}
  \SetMargin{1}{1}{0}{5}

  \PutStatement{0}{A}
  \PutJump{1}{$\lnot \text{cond}$}
  \PutStatement{2}{B}
  \PutJump{3}{}
  \PutStatement{4}{C}
  \PutStatement{5}{D}

  \AddPath{0}{1}
  \AddPath{1}{2}
  \AddCondJumpPath{1}{4}
  \AddPath{2}{3}
  \AddJumpPathLeft{3}{5}
  \AddPath{4}{5}

  \end{tikzpicture}
  ------------------------
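
Both variants describe the same algorithm. They only differ in which branch is
placed directly after the conditional jump. The following Python sketch models
the two flow charts as lists of boxes (numbered 0 to 5 like in the charts) and
follows the paths. The encoding of the boxes and the function `run` are
assumptions made for illustration only.

```python
# Each flow chart is a list of boxes, indexed 0..5 like in the pictures:
#   ("stmt", name)             execute a statement, continue with the next box
#   ("jump_if", test, target)  jump to `target` if test(cond) is true
#   ("jump", target)           unconditional jump


def run(chart, cond):
    """Follow the paths of `chart` and return the executed statements."""
    executed = []
    box = 0
    while box < len(chart):
        kind, *args = chart[box]
        if kind == "stmt":
            executed.append(args[0])
            box += 1
        elif kind == "jump_if":
            test, target = args
            box = target if test(cond) else box + 1
        else:  # "jump"
            box = args[0]
    return executed


variant1 = [
    ("stmt", "A"),
    ("jump_if", lambda cond: cond, 4),      # jump if cond
    ("stmt", "C"),
    ("jump", 5),
    ("stmt", "B"),
    ("stmt", "D"),
]

variant2 = [
    ("stmt", "A"),
    ("jump_if", lambda cond: not cond, 4),  # jump if not cond
    ("stmt", "B"),
    ("jump", 5),
    ("stmt", "C"),
    ("stmt", "D"),
]

for cond in (True, False):
    print(cond, run(variant1, cond), run(variant2, cond))
```

For a true condition both variants execute A, B, D, and for a false condition
both execute A, C, D.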


Loading a function address into `%CALL`
=======================================
In general the address of a function `foo` does not have to fit into 16 bits,
i.e. it can be $2^{16}$ or larger. Then using

---- CODE (type=s) -------------------------------------------------------------
    ldzwq   foo,    %CALL
--------------------------------------------------------------------------------

will fail. You can simulate this by trying to load a literal value that does
not fit into 16 bits, e.g. 0x12345, into a register:

---- CODE (file=session09/load/ldzwq_fail.s) -----------------------------------
    ldzwq   0x12345,    %1
--------------------------------------------------------------------------------

You will get the following error from the code generator:

---- SHELL (path=session09/load) -----------------------------------------------
ulmas ldzwq_fail.s
--------------------------------------------------------------------------------

So how to solve this?
---------------------

Besides __ldzwq__ you need two more ingredients: the immediate operators `@w0`,
..., `@w3` and the __shldwq__ instruction:

- The operator `@w0` picks the least significant word of a literal (or label). So
  for example `@w0(0x12345)` picks 0x2345. And with the other operators you can
  pick the other words.  In general, if $X$ is some bit pattern then

  ---- LATEX -------------------------------------------------------------------
  \begin{array}{lcl}
  \text{@w0}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 0} \right\rfloor \bmod 2^{16} \\
  \text{@w1}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 1} \right\rfloor \bmod 2^{16} \\
  \text{@w2}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 2} \right\rfloor \bmod 2^{16} \\
  \text{@w3}(X) & = & \left\lfloor u(X) \cdot 2^{-16 \cdot 3} \right\rfloor \bmod 2^{16} \\
  \end{array}
  ------------------------------------------------------------------------------

- The `shldwq` (shift left load) instruction shifts the content of a register
  `%Z` 16 positions to the left and inserts into the least significant bits a
  16-bit pattern `XY`:

  ---- LATEX -------------------------------------------------------------------
  \left( u(\%\text{Z}) \cdot 2^{16} + u(\text{XY}) \right) \bmod 2^{64} \to u(\%\text{Z})
  ------------------------------------------------------------------------------

So for example

---- CODE (file=session09/load/load32.s) -----------------------------------
    ldzwq   @w1(0x12345),    %1
    shldwq  @w0(0x12345),    %1
--------------------------------------------------------------------------------

loads the bit pattern 0x12345 into `%1`. And a 64-bit literal can be loaded
like this:

---- CODE (file=session09/load/load64.s) ---------------------------------------
    ldzwq   @w3(0x1234567890123456),    %1
    shldwq  @w2(0x1234567890123456),    %1
    shldwq  @w1(0x1234567890123456),    %1
    shldwq  @w0(0x1234567890123456),    %1
--------------------------------------------------------------------------------

When you look at the generated machine code you can easily see which bit
patterns were picked from the literal:

---- SHELL (path=session09/load) -----------------------------------------------
ulmas -o load64 load64.s
cat load64
--------------------------------------------------------------------------------
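
The arithmetic behind the two examples can be checked with a few lines of
Python. This is only a sketch: `w`, `ldzwq` and `shldwq` are Python functions
that mimic the operators and instructions described above, assuming 64-bit
registers and assuming that `ldzwq` zero-extends its 16-bit operand. The
`ValueError` merely stands in for the assembler error and is not its actual
message.

```python
MASK16 = 0xFFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def w(k, x):
    """Model of @w0 .. @w3: the k-th 16-bit word of x (k = 0 is the lowest)."""
    return (x >> (16 * k)) & MASK16


def ldzwq(xy):
    """Model of ldzwq: zero-extend a 16-bit pattern to 64 bits."""
    if not 0 <= xy <= MASK16:
        raise ValueError(f"{xy:#x} does not fit into 16 bits")
    return xy


def shldwq(xy, z):
    """Model of shldwq: shift z left by 16 bits and insert xy at the bottom."""
    if not 0 <= xy <= MASK16:
        raise ValueError(f"{xy:#x} does not fit into 16 bits")
    return ((z << 16) | xy) & MASK64


# like load32.s
r1 = ldzwq(w(1, 0x12345))
r1 = shldwq(w(0, 0x12345), r1)
print(hex(r1))

# like load64.s
lit = 0x1234567890123456
r1 = ldzwq(w(3, lit))
for k in (2, 1, 0):
    r1 = shldwq(w(k, lit), r1)
print(hex(r1))
```

Calling `ldzwq(0x12345)` directly raises the `ValueError`, which corresponds to
the failing example from the beginning of this section.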


General pattern for loading a label (or address)
------------------------------------------------

In general four instructions are needed to load an arbitrary 64-bit address or
literal into a register, e.g.

---- CODE (file=session09/load/load_label.s) -----------------------------------
    ldzwq   @w3(some_label),    %1
    shldwq  @w2(some_label),    %1
    shldwq  @w1(some_label),    %1
    shldwq  @w0(some_label),    %1
--------------------------------------------------------------------------------

That is the price to pay for the simplicity and efficiency of a RISC
architecture. On a CISC architecture you would just provide an instruction that
is encoded with more bytes... On the other hand, a RISC architecture is simpler,
and that can mean less energy consumption, a higher clock rate, etc. And that
usually pays off... (Even the Intel64 architecture only looks like a CISC
architecture to the outside world but internally has some RISC core).
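
Not every literal needs all four instructions. As with 0x12345 above, leading
zero words can be skipped if `ldzwq` clears the upper bits (an assumption based
on its name). The following Python sketch generates such a shortened sequence
for a given literal. The function name `load_sequence` is hypothetical. For a
label, whose value may not be known when the code is written, the full
four-instruction pattern remains the safe choice.

```python
def load_sequence(value, reg="%1"):
    """Return ULM assembly lines that load the 64-bit literal `value`."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit into 64 bits")
    # index of the most significant non-zero 16-bit word (0 for value == 0)
    top = max((value.bit_length() - 1) // 16, 0)
    lines = [f"    ldzwq   @w{top}({value:#x}),    {reg}"]
    for k in range(top - 1, -1, -1):
        lines.append(f"    shldwq  @w{k}({value:#x}),    {reg}")
    return lines


for literal in (42, 0x12345, 0x1234567890123456):
    print("\n".join(load_sequence(literal)))
    print()
```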


Function calls on Intel64
=========================
In order to show you that our calling convention is relevant for the real world
I will just show you an example from the real world.

`foo.s` (Callee code)
---------------------
The instruction format looks a bit different. For example there is an
instruction `addq` but it only takes two register operands. When executed the
first register gets added to the second and the second gets overwritten with
the result (like on the ULM when you would write `addq %1, %2, %2`).

Also the register names are different (and strange). Our `%CALLEE0` is here
`%rdi` and `%CALLEE1` is here `%rsi`. And the result of a function gets
returned by writing it into `%rax`.  And there is also this `.globl` directive
that is needed when functions are defined in separate source files and need
to be linked.

But don't get confused by the details. Look at this code:

:import: session09/gcc/foo.s

Can you see similarities to assembly code for the ULM like this:

---- CODE (type=s) -------------------------------------------------------------
    .text
    .globl foo
foo:
    addq    %CALLEE0,	%CALLEE1,   %CALLEE1	# like addq    %rdi,   %rsi
    addq    %CALLEE1,	%0,	    %CALLEE0	# like movq    %rsi,   %rax
    jmp	    %RET,	%0			# like ret
--------------------------------------------------------------------------------
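
The comments in the ULM code already spell out the correspondence. As a sanity
check, the following Python sketch models both variants with the registers kept
in a dictionary. It only covers the instructions named in the comments (the
return itself is left out) and assumes 64-bit wrap-around arithmetic and that
`%0` always contains zero. All helper names are made up for this sketch.

```python
MASK64 = 2**64 - 1


def addq_intel(regs, src, dst):
    """Intel64 style with two operands: dst = dst + src."""
    regs[dst] = (regs[dst] + regs[src]) & MASK64


def movq_intel(regs, src, dst):
    """Intel64 style: dst = src."""
    regs[dst] = regs[src]


def addq_ulm(regs, x, y, z):
    """ULM style with three operands: z = x + y."""
    regs[z] = (regs[x] + regs[y]) & MASK64


def foo_intel(a, b):
    regs = {"%rdi": a, "%rsi": b, "%rax": 0}
    addq_intel(regs, "%rdi", "%rsi")            # addq %rdi, %rsi
    movq_intel(regs, "%rsi", "%rax")            # movq %rsi, %rax
    return regs["%rax"]


def foo_ulm(a, b):
    regs = {"%0": 0, "%CALLEE0": a, "%CALLEE1": b}
    addq_ulm(regs, "%CALLEE0", "%CALLEE1", "%CALLEE1")
    addq_ulm(regs, "%CALLEE1", "%0", "%CALLEE0")
    return regs["%CALLEE0"]


print(foo_intel(3, 4), foo_ulm(3, 4))
```

Both functions compute the sum of their two arguments. Adding the zero register
plays the role of the move instruction.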


You can translate this assembly code into machine code by using `gcc` as a
convenient front end for the GNU assembler (otherwise you need to know about
several options for using it on `theon`):

---- SHELL (path=session09/gcc) ------------------------------------------------
gcc -c foo.s
--------------------------------------------------------------------------------

The generated machine code gets written into (an object) file `foo.o`. To look
at the machine code you can use the program `objdump` like this:

---- SHELL (path=session09/gcc) ------------------------------------------------
objdump -d foo.o
--------------------------------------------------------------------------------

Looks kind of familiar! Although there are differences, e.g. that instructions
have different sizes.


`main.c` (Caller code)
----------------------
Now some code that calls this function. This code is described in C as
follows (and I say "described" on purpose, because we describe what machine code
should be generated from that):

:import: session09/gcc/main.c

Analogously this can be translated into machine code, this time using the
GNU C compiler and subsequently the GNU assembler. This tool chain gets invoked
by

---- SHELL (path=session09/gcc) ------------------------------------------------
gcc -c -O3 main.c
--------------------------------------------------------------------------------

With the additional option `-O3` I turned on some optimizations. You can skip
that; I only used it so that the generated machine code is a bit shorter.
Again you can look at the generated machine code with `objdump`:

---- SHELL (path=session09/gcc, fold) ------------------------------------------
objdump -d main.o
--------------------------------------------------------------------------------

Let's link and run that thing
-----------------------------

You can combine (the right expression is to link) these two pieces of machine
code in `main.o` and `foo.o` by using `gcc` again as a convenient front-end for
the Solaris linker `ld`:

---- SHELL (path=session09/gcc) ------------------------------------------------
gcc main.o foo.o
--------------------------------------------------------------------------------

As I mentioned in the introduction video to Session 8, actually more than just
these two object files get linked. But let's ignore the details for the moment
and run the generated executable `a.out`:

---- SHELL (path=session09/gcc) ------------------------------------------------
a.out
--------------------------------------------------------------------------------

You wanna play?
---------------

Replace `addq %rdi, %rsi` with `imulq %rdi, %rsi`. I guess you know what would
happen ;-)


:links: ldzwq -> http://www.mathematik.uni-ulm.de/numerik/hpc/ss20/hpc0/ulm.pdf#page=28
	shldwq -> http://www.mathematik.uni-ulm.de/numerik/hpc/ss20/hpc0/ulm.pdf#page=46
	ARM architecture -> https://en.wikipedia.org/wiki/ARM_architecture

:navigate:  up      -> doc:index
            back    -> doc:session09/page03