===========================================
Assembly: Loading a Literal into a Register				[TOC]
===========================================

---- VIDEO ------------------------------
https://www.youtube.com/embed/-7a6L675GCI
-----------------------------------------

New `subq` Instruction: Fixing a Problem in `print_uint64`
==========================================================
Currently, the instruction

---- CODE (type=s) -------------------------------------------------------------
        subq    .print_uint64.buf,      %p,     %0
--------------------------------------------------------------------------------

is specified in the instruction set as follows:

---- CODE (type=txt) -----------------------------------------------------------
RRR     (OP u 8) (X u 8) (Y u 8) (Z u 8)

# ...

0x05    RRR
:   subq    X, %Y, %Z
    ulm_sub64(X, ulm_regVal(Y), Z);
--------------------------------------------------------------------------------

Hence, only 8 bits are available for the literal `.print_uint64.buf`. Sooner
or later this will cause a problem in our current implementation of
`print_uint64`, because of this code fragment:

---- CODE (type=s) -------------------------------------------------------------
print_uint64:

// ...

        .bss
.print_uint64.buf:
        .space  20

        .text

// ...

        subq    .print_uint64.buf,      %p,     %0

// ...

        ret     %RET_ADDR

--------------------------------------------------------------------------------

If our program grows, the text segment will soon exceed 256 bytes. That means
the addresses in the data and BSS segments can no longer be encoded with
8 bits.
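
The 8-bit limit can be made concrete with a minimal Python sketch. It assumes
that the four fields of the `RRR` format are packed into one 32-bit word with
`OP` in the most significant byte. The helper `encode_rrr`, the register number
`P_REG` and both addresses are hypothetical and only serve as an illustration:

```python
def encode_rrr(op: int, x: int, y: int, z: int) -> int:
    """Pack (OP u 8) (X u 8) (Y u 8) (Z u 8) into one 32-bit word.

    Assumption: OP occupies the most significant byte.
    """
    for name, value in (("OP", op), ("X", x), ("Y", y), ("Z", z)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} = {value:#x} does not fit into 8 unsigned bits")
    return (op << 24) | (x << 16) | (y << 8) | z


P_REG = 6            # hypothetical register number for %p
SMALL_ADDR = 0xF8    # hypothetical address of the buffer in a tiny program
LARGE_ADDR = 0x100   # hypothetical address once the text segment has grown

# An address below 256 still fits into the X field of `subq X, %Y, %Z`.
print(hex(encode_rrr(0x05, SMALL_ADDR, P_REG, 0)))

# An address of 256 or more cannot be encoded.
try:
    encode_rrr(0x05, LARGE_ADDR, P_REG, 0)
except ValueError as err:
    print("cannot encode:", err)
```

The second call fails because `0x100` needs 9 bits, which is the situation
described above for a grown text segment.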

In order to fix that we need two things:

- Another subtraction instruction that subtracts a register `%X` from `%Y`:

  ---- CODE (type=txt) ---------------------------------------------------------
    0x18    RRR
  :   subq    %X, %Y, %Z
      ulm_sub64(ulm_regVal(X), ulm_regVal(Y), Z);
  ------------------------------------------------------------------------------

- A fix to the code of `print_uint64` so that the literal `.print_uint64.buf`
  is available in a register. Using the `@w[0-3]` operators we could achieve
  this as follows:

  ---- CODE (type=s) -----------------------------------------------------------
  print_uint64:
          .data
          .equ    val,    PARAM0
          .equ    digit,  CALLEE1
          .equ    p,      CALLEE2
          .equ    buf,    CALLEE3
  
          .bss
  .print_uint64.buf:
          .space  20
  
          .text
  
  	  # load .print_uint64.buf into %p
          ldzwq   @w3(.print_uint64.buf), %p
          shldwq  @w2(.print_uint64.buf), %p
          shldwq  @w1(.print_uint64.buf), %p
          shldwq  @w0(.print_uint64.buf), %p
  
  	  # copy %p to %buf
          movq    %p,     %buf
  
  	  # subtract .print_uint64.buf (stored in %buf) from %p
          subq    %buf,   %p,     %0
  
  	  // ...
  
          ret     %RET_ADDR
  ------------------------------------------------------------------------------

  Always using four instructions (one `ldzwq` and three `shldwq`) is
  inconvenient in handwritten code. It is also often unnecessary: our addresses
  in the text, data and BSS segments will probably always fit into 16 bits. But
  merely hoping that 16 bits will suffice is like calling for __hit me again__.

  :links: hit me again -> https://youtu.be/rVV0Cty4lMw

  The clean solution is to use a literal pool. As it contains just one
  literal, we can simplify the bookkeeping a bit:

  ---- CODE (type=s) -----------------------------------------------------------
  print_uint64:
          .data
          .equ    val,    PARAM0
          .equ    digit,  CALLEE1
          .equ    p,      CALLEE2
          .equ    buf,    CALLEE3
  
          .bss
  .print_uint64.buf:
          .space  20
  
          .text
  
  	  # load .print_uint64.buf into %p
	  ldpa	  .print_uint.pool.buf,	%p
	  ldfp	  0(%p),		%p
  
  	  # copy %p to %buf
          movq    %p,     %buf
  
  	  # subtract .print_uint64.buf (stored in %buf) from %p
          subq    %buf,   %p,     %0
  
  	  // ...
  
          ret     %RET_ADDR

	  .align  8
  .print_uint.pool.buf:
	  .quad   .print_uint64.buf
  ------------------------------------------------------------------------------
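
How the four-instruction variant with `@w3` to `@w0` assembles a 64-bit
literal can be simulated with a few lines of Python. This is only a sketch
under assumptions: `@wk(x)` is taken to be the k-th 16-bit word of `x` with
`@w0` the least significant one, `ldzwq` is modelled as a zero-extending load,
and `shldwq` as "shift left by 16 bits, then put the immediate into the low
word". The function names and the address are made up for the example:

```python
MASK64 = (1 << 64) - 1


def w(k: int, value: int) -> int:
    """Model of @wk(value): k-th 16-bit word, k = 0 is least significant (assumption)."""
    return (value >> (16 * k)) & 0xFFFF


def ldzwq(imm16: int) -> int:
    """Load a 16-bit immediate, zero-extended to 64 bits."""
    return imm16 & 0xFFFF


def shldwq(imm16: int, reg: int) -> int:
    """Shift the register left by 16 bits and put the immediate into the low word."""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64


def load_literal(value: int) -> int:
    """One ldzwq followed by three shldwq, from the most significant word down."""
    p = ldzwq(w(3, value))
    for k in (2, 1, 0):
        p = shldwq(w(k, value), p)
    return p


addr = 0x0000_0001_0000_2A40      # hypothetical address of .print_uint64.buf
assert load_literal(addr) == addr
print(hex(load_literal(addr)))
```

The sketch also shows why all four words are needed in general: dropping the
upper three steps would only reproduce the low 16 bits of the address.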

Of course, we want to support the special case where the displacement _Y_ in
_ldfp Y(%X), %Z_ equals zero in a more convenient way. As for _movq Y(%X),
%Z_, we simply provide an alternative in the instruction set where _Y_ is
omitted:

---- CODE (type=txt) -----------------------------------------------------------
0x17    RRR
:   ldfp    Y(%X), %Z
:   ldfp    (%X), %Z
    ulm_fetch64(Y * 8, X, 0, 0, ULM_ZERO_EXT, 8, Z);
--------------------------------------------------------------------------------

Now, in function _print_uint64_, we can change the line

---- CODE (type=s) -------------------------------------------------------------
	ldfp	  0(%p),	%p
--------------------------------------------------------------------------------

to

---- CODE (type=s) -------------------------------------------------------------
	ldfp	  (%p),	%p
--------------------------------------------------------------------------------
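
The address computation of `ldfp` can be illustrated with a small Python model.
It follows the call `ulm_fetch64(Y * 8, X, 0, 0, ULM_ZERO_EXT, 8, Z)` from the
instruction set in that the displacement is scaled by 8 and added to `%X`.
Everything else is an assumption made for the sketch: big-endian byte order,
the memory layout, the second pool entry, the register `%q`, and that `ldpa`
has already left the pool address in `%p`:

```python
def ldfp(memory, regs, y, x, z):
    """Model of `ldfp Y(%X), %Z`: fetch the quad word at address Y * 8 + %X."""
    addr = y * 8 + regs[x]
    regs[z] = int.from_bytes(memory[addr:addr + 8], "big")  # assumption: big endian


# Hypothetical memory image with a literal pool of two entries at address 16.
POOL = 16
BUF_ADDR = 0x1_0000      # stands in for the address .print_uint64.buf
OTHER = 0xDEAD_BEEF      # made-up second pool entry
memory = bytearray(32)
memory[POOL:POOL + 8] = BUF_ADDR.to_bytes(8, "big")
memory[POOL + 8:POOL + 16] = OTHER.to_bytes(8, "big")

regs = {"p": POOL, "q": 0}        # %p as assumed to be set by ldpa
ldfp(memory, regs, 1, "p", "q")   # ldfp 1(%p), %q  -> second pool entry
ldfp(memory, regs, 0, "p", "p")   # ldfp (%p), %p   -> first pool entry

assert regs["p"] == BUF_ADDR
assert regs["q"] == OTHER
```

With a displacement of zero the scaled offset vanishes, so `ldfp (%p), %p` and
`ldfp 0(%p), %p` fetch the same pool entry.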


Provided Material
=================

Here is the __ULM Instruction Set__ and its `isa.txt` source code, which
contains all the instructions shown in the video (and, for _ldfp_, the
alternative with a zero displacement):

:import: session13/load64/0_ulm_variants/load64/isa.txt [fold]

---- SHELL (path=session13/load64/, hide) --------------------------------------
make
make refman
mkdir -p /home/www/htdocs/numerik/hpc/ss22/hpc0/session13/load64/
cp 1_ulm_build/load64/refman.pdf /home/www/htdocs/numerik/hpc/ss22/hpc0/session13/load64/
--------------------------------------------------------------------------------

:links: ULM Instruction Set -> https://www.mathematik.uni-ulm.de/numerik/hpc/ss22/hpc0/session13/load64/refman.pdf


Quiz 13: Computing the greatest common divisor
==============================================
Write a program `gcd.s` that computes the greatest common divisor (gcd) of two
64-bit unsigned integers. Give the user a nice experience, i.e. using the
program should look like this:

---- CODE (type=txt) -----------------------------------------------------------
theon$ 1_ulm_build/load64/ulm gcd
a = 18
b = 12
gcd(18, 12) = 6
theon$ 1_ulm_build/load64/ulm gcd
a = 350982
b = 822647
gcd(350982, 822647) = 527
--------------------------------------------------------------------------------

Use the following algorithm for your implementation:
  
--- TIKZ ---------------------------------------------------------------------
\begin{adjustbox}{}
\textcolor{white}{.}
\begin{varwidth}{10cm}

\begin{algorithmic}
    \Function{gcd}{$a, b$}
    \If{$a = 0 \;\lor\ b=0$}
	\State \textbf{return} $0$
    \EndIf
    \While{$a\not=b$}
	\If{$a > b$}
	    \State $a\gets a - b$
	\Else
	    \State $b\gets b - a$
	\EndIf
    \EndWhile
    \State \textbf{return} $b$
    \EndFunction
\end{algorithmic}

\end{varwidth}
\end{adjustbox}
------------------------------------------------------------------------------
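
For checking the results of your assembly program, the pseudocode above can be
transcribed into a short Python reference. It is only meant as a test aid, not
as part of the submission, and it does not model the 64-bit register width:

```python
def gcd(a: int, b: int) -> int:
    """Subtraction-based gcd, transcribed line by line from the pseudocode."""
    if a == 0 or b == 0:
        return 0
    while a != b:
        if a > b:
            a = a - b
        else:
            b = b - a
    return b


# The two runs shown in the example session.
for a, b in ((18, 12), (350982, 822647)):
    print(f"gcd({a}, {b}) = {gcd(a, b)}")
```

Both calls reproduce the values from the example session, 6 and 527.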

In general, all labels are 64-bit literals. If you need a 64-bit literal as an
absolute address, use a literal pool. In particular, this means you should
also fix the problem in `print_uint64` as outlined above.

Your program should have an exit code of 0.

Submit your program with

---- CODE (type=txt) ----------
submit hpc quiz13 isa.txt gcd.s
-------------------------------