More about the assembly language

For our next project you have to write an assembly program for the ULM that computes the factorial of an unsigned integer. This project consists of three subprojects: implementing an algorithm for computing the factorial, reading in an integer and printing an integer.

Video tutorial

In the video some features of the assembly language (e.g. labels and directives) are introduced.

Here is the “hello, world!” program with labels and .equ directives as shown in the video:

        .equ       p,    1
        .equ       ch,   2

        .data
msg     .string    "hello, world!\n"

        .text
        ldzwq      msg,  %p
load    movzbq     (%p), %ch
        subq       0,    %ch,  %0
        jz         halt
        putc       %ch
        addq       1,    %p,   %p
        jmp        load
halt    halt       0

This program is equivalent (not just similar) to the “hello, world!” program from Session 6. However, when you translate the program with

theon$ ulmas -o hello hello.s
theon$ 

you see that the assembler output contains more information than the code in Session 6:

#TEXT 4
0x0000000000000000: 56 00 20 01 #       ldzwq msg, %p
0x0000000000000004: 1B 00 01 02 #       movzbq (%p), %ch
0x0000000000000008: 39 00 02 00 #       subq 0, %ch, %0
0x000000000000000C: 42 00 00 04 #       jz halt
0x0000000000000010: 61 02 00 00 #       putc %ch
0x0000000000000014: 38 01 01 01 #       addq 1, %p, %p
0x0000000000000018: 41 FF FF FB #       jmp load
0x000000000000001C: 09 00 00 00 #       halt 0
#DATA 1
0x0000000000000020: 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A 00 #      .string "hello, world!\n"
#SYMTAB
a ch                          0x0000000000000002
t halt                        0x000000000000001C
t load                        0x0000000000000004
d msg                         0x0000000000000000
a p                           0x0000000000000001
#FIXUPS
text 0x0000000000000000 1 2 absolute [data]

Recall from Session 5.2 how the sequential execution of a program was explained for the ULM: The program gets loaded into memory, the instruction pointer gets initialized with zero, and then the ULM performs Von Neumann cycles until a halt instruction gets executed. Loading the program and initializing the instruction pointer is done internally by the loader of the ULM (in the Mini-ULM example you had to press the “load” button to trigger the loader explicitly). The loader recognizes the format of the assembler output and initializes the memory accordingly: here the text segment is copied to address 0 and the data segment follows at address 0x20.

More details about the assembler format and how the assembler works are explained in the following sections.

Format of the assembler output

The generated output consists of different sections. Each of these sections starts with a header (which also separates the section from a previous section). In this case the assembler output has the following 4 sections:

#TEXT <alignment>

is the header for the text segment. This section contains the instructions for the “hello, world” program.

The alignment parameter is (for convenience) a decimal numeral and in this case 4. It specifies that the loader has to copy this section to a memory block with a start address that is a multiple of 4. For the moment this is not relevant as in this case the start address is 0 (and hence a multiple of any integer that is not zero).

Lines of the text segment have the format

[address:] instruction [# comments]

Addresses and comments are optional (and for convenience). The instructions are in hexadecimal. By default the loader copies the text segment to memory beginning at address 0.

Compare the instructions of the text segment with the memory content from address 0x00 to 0x20.

#DATA <alignment>

is the header for the data segment. This section contains the data for the “hello, world” program and has the same format as the text segment (i.e. optional address, actual data, optional comments).

The loader copies the data segment to memory such that it follows the text segment. Like the text segment, the data segment can have alignment restrictions. That means in general there can be a gap in memory between the text and data segment. However, in the “hello, world” program the alignment of the data segment is 1. Hence, the loader begins the data segment at address 0x20 (where the text segment ended).
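The placement rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the loader's actual code; the function names align_up and layout are made up for this sketch.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (alignment >= 1)."""
    return (address + alignment - 1) // alignment * alignment


def layout(text_size, data_alignment, text_start=0):
    """Return (end of text segment, start of data segment).

    Assumption for this sketch: the data segment is placed at the first
    suitably aligned address at or after the end of the text segment.
    """
    text_end = text_start + text_size
    data_start = align_up(text_end, data_alignment)
    return text_end, data_start


# "hello, world": 8 instructions of 4 bytes each, data alignment 1
print([hex(a) for a in layout(8 * 4, 1)])
```

With a text segment of 0x1C bytes and a data alignment of 8 the same function would report a gap of 4 bytes between the two segments.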

#SYMTAB

is the header for the symbol table.

Labels and .equ directives generate symbols that have a value and a type:

  • Labels in the text segment have type text and the value is an address within the text segment. For example, the text symbol halt has the value 0x1C.

  • Accordingly, labels in the data segment have type data and the value is the address relative to the beginning of the data segment. For example, the data symbol msg has value 0x00 (and not 0x20).

  • The .equ directive defines a symbol of type absolute with a given value. Here for example, the symbol p has value 1.

For loading (and running) the program the symbol table is not relevant. But it is relevant for linking (which will be covered in upcoming sessions).

If an instruction or directive contains an undefined symbol, an entry of type undefined and value 0 is added to the symbol table.

#FIXUPS

is the header for the relocation table.

Like the symbol table this will be relevant for linking and will be covered in upcoming sessions.

Memory layout of the “hello, world” program

The information in the text and data segments, together with the text and data labels from the symbol table, can be visualized as a memory layout.

Disassembling the instructions in the text segment and interpreting the bytes in the data segment as a zero-terminated string makes it possible to see almost the original source code in the memory layout.

From the symbol table we also know that the symbols p and ch were defined with absolute values 1 and 2 respectively. However, we cannot determine where the symbols were used. So for example, we don't know that the first instruction was written as ldzwq msg, %p in the source file. In practice that makes it hard to understand disassembled programs where the original source is not available (and there are actually legal cases where you have to deal with such problems).

Pointers! Start learning about them here!

Have a look at what the first two instructions are doing and how this can be represented descriptively.

ldzwq msg, %p

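The assembler replaces the label msg with the address of the h in the “hello, world!” string. In the generated instruction (56 00 20 01) this address is 0x20 (or 32 in decimal): the symbol value 0x00 is relative to the beginning of the data segment, and the data segment starts at address 0x20. Compare that with the assembly program in Session 6.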

Obviously the address of the string depends on where the loader will copy the data segment when we run the program. And this in turn depends on the size of the text segment. It requires some kind of bookkeeping to figure out the actual address of the string by just looking at the assembly source code. But it is possible, we can do it and the assembler can do it. But it is less error prone if the assembler is doing it, and using labels delegates this job to the assembler.

So after this instruction the value in %p has the meaning “address of the first character in the string”. So we think of %p as being a “pointer to the first character in the string”.

movzbq (%p), %ch

This instruction copies the value at address %p into %ch.

In this instruction the pointer gets dereferenced. It is impossible to overestimate how important it will be to understand what dereferencing a pointer means, so I will explain and talk about it more than once.

Dereferencing means that we refer to a value at the end of a pointer. And this requires two pieces of information:

  • Location: “Where does the pointer point to?”

    This information is the address stored as value in the pointer. So in this case the value in %p. Note the difference between “value of %p” and “value at address %p”.

  • Type information: “What is the value at the end of the pointer?”

    In the “hello, world” example the value is the byte at the end of the pointer, and this byte has here the meaning of being the ASCII value of a character. This information is only given by the context and not stored in any register or anywhere else. We know it because the instruction movzbq is used to copy a single byte from the end of the pointer, zero extend it and store it in the destination register %ch.

    In general you have to do the bookkeeping: You have to know how many bytes the dereferenced value consists of, and whether the dereferenced value is a character, a signed or unsigned integer, or something else. You have zero support from the assembly language for bookkeeping this kind of type information.
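To make the two pieces of information concrete, here is a minimal Python sketch that models the memory as a byte array. The memory image, the function name print_string and the big-endian interpretation of two bytes are assumptions made only for this illustration; the comments map each step to the corresponding instruction of the “hello, world!” program.

```python
# Hypothetical memory image: 0x20 bytes standing in for the text segment
# (contents irrelevant here), followed by the zero-terminated string.
TEXT_SIZE = 0x20
memory = bytearray(TEXT_SIZE) + b"hello, world!\n\x00"


def print_string(memory, p):
    """Walk the string like the loop in the assembly program does."""
    out = []
    while True:
        ch = memory[p]        # movzbq (%p), %ch   value AT address p
        if ch == 0:           # subq 0, %ch, %0 and jz halt
            break
        out.append(chr(ch))   # putc %ch
        p += 1                # addq 1, %p, %p     value OF p changes
    return "".join(out)


p = 0x20                      # ldzwq msg, %p      location
ch = memory[p]                # dereferenced as ONE byte: 0x68, i.e. 'h'

# Same location, different "type": two bytes read as one integer
# (byte order chosen as big endian just for this illustration).
as_two_bytes = int.from_bytes(memory[p:p + 2], "big")

print(hex(p), hex(ch), hex(as_two_bytes))
print(print_string(memory, p), end="")
```

The location is the same in both reads; only the bookkeeping about how many bytes belong to the value differs, and nothing in memory records which interpretation is the intended one.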


How to know if a register is a pointer?

The assembler does not know whether a register is used as a pointer. This is also up to you. You give the register its meaning. And this meaning can change!

Consider this modification of the first two instructions:

    ldzwq   msg,    %3
    movzbq  (%3),   %3

After the first instruction, register %3 has the meaning “pointer to the string”. After the second instruction it has the meaning “first character of the string”.

Some personal opinion/experience

When learning C/C++, understanding pointers and how to use them is the hardest part. My rule of thumb is that every non-trivial bug in a C/C++ program is related to pointers. The dangerous thing is this combination:

  • In the C/C++ programming languages the compilers do a lot of the necessary bookkeeping. So compared to programming in assembler the C/C++ compilers can detect many bugs related to pointers. Some of the bugs that slip through this line of defence can be detected by additional tools.

  • Slightly exaggerated but true in essence: If a bug slipped through it is impossible to find it. You don't even know if a bug slipped through! Because such a bug might only once in a while cause the program to crash, or worse, the bug does not crash the program and just causes wrong results.

Because some bugs are detected in C/C++ it is tempting to use these languages and underestimate the danger.

The advantage of programming in assembly is: You will never underestimate what can go wrong! And be aware that you are learning some non-trivial concepts. Be patient now if things don't work out the first time and try to understand the underlying reason. This will allow you to do some solid programming in C/C++ later.

More about the assembly language: Tokens

As I said in the video, it seems odd at first that, for example, halt can be used as a label. How can the assembler distinguish the meanings?

Field format of source lines

This is handled by the scanner during the lexical analysis. The format of the source lines consists of fields:

[label] [operator] [operands]

Tokens for mnemonics (e.g. “addq”, “halt”, etc.) and pseudo operators (e.g. “.string”, “.byte”, etc.) are only generated from the operator field. So in other fields they can be used as identifiers. For example, from this code

addq                                        # some label
        addq    %0,     %12,    %0x3        // an instruction
        .quad   4                           /* some data
                                            */

the scanner (you can call it with ulmas-test-lexer) generates the following tokens:

theon$ ulmas-test-lexer lex_example.s
IDENT "addq" at lex_example.s:1.1-4
EOL at lex_example.s:1.45-57
SPACE at lex_example.s:2.1-8
ADDQ "addq" at lex_example.s:2.9-12
PERCENT at lex_example.s:2.17
OCTAL_CONSTANT "0" at lex_example.s:2.18
COMMA at lex_example.s:2.19
PERCENT at lex_example.s:2.25
DECIMAL_CONSTANT "12" at lex_example.s:2.26-27
COMMA at lex_example.s:2.28
PERCENT at lex_example.s:2.33
HEXADECIMAL_CONSTANT "0x3" at lex_example.s:2.34-36
EOL at lex_example.s:2.45-62
SPACE at lex_example.s:3.1-8
DOT_QUAD ".quad" at lex_example.s:3.9-13
DECIMAL_CONSTANT "4" at lex_example.s:3.17
EOL at lex_example.s:4.47
theon$ 

So note that the character sequence “addq” was first detected as an identifier (IDENT) and in the second case as mnemonic (ADDQ).
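The idea of deciding by field can be sketched in Python. This is not the scanner of ulmas: the function scan_fields, the tiny sets of known words and the UNKNOWN token are made up for this sketch, comments and colons after labels are not handled, and the operands are ignored. The token names IDENT, SPACE, ADDQ and DOT_QUAD are taken from the output above; HALT is assumed by analogy.

```python
MNEMONICS = {"addq", "halt", "jmp", "nop"}      # small subset only
PSEUDO_OPS = {".quad", ".string", ".equ"}       # small subset only


def scan_fields(line):
    """Classify the label field and the operator field of one source line."""
    if not line.strip():
        return []
    tokens = []
    if not line[0].isspace():
        # Something in column 1: this is the label field, always an identifier.
        parts = line.split(None, 1)
        tokens.append(("IDENT", parts[0]))
        rest = parts[1] if len(parts) > 1 else ""
    else:
        # Empty label field.
        tokens.append(("SPACE", ""))
        rest = line
    words = rest.split(None, 1)
    if words:
        op = words[0]
        # Only here, in the operator field, keywords are recognized.
        if op in MNEMONICS:
            tokens.append((op.upper(), op))
        elif op in PSEUDO_OPS:
            tokens.append(("DOT_" + op[1:].upper(), op))
        else:
            tokens.append(("UNKNOWN", op))
    return tokens


print(scan_fields("addq"))
print(scan_fields("        addq    %0,     %12,    %0x3"))
print(scan_fields("halt    halt       0"))
```

The last call shows why the line “halt halt 0” of the “hello, world!” program is unambiguous: the first word sits in the label field, the second in the operator field.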

Comments

You also might notice that comments are removed. And comments can be used in different flavors:

  • Single-line comments start with “#” or “//”

  • “/*” begins a multi-line comment and “*/” ends a multi-line comment

Tokens recognized only in the operator field

As specified in the grammar, a mnemonic is part of an instruction and a pseudo operator part of a directive.

Mnemonics

The mnemonics specified by the Instruction Set of the ULM:

addq

andq

divq

getc

halt

idivq

imulq

ja

jae

jb

jbe

je

jg

jge

jl

jle

jmp

jna

jnae

jnb

jnbe

jne

jng

jnge

jnl

jnle

jnz

jz

ldswq

ldzwq

movb

movl

movq

movsbq

movslq

movswq

movw

movzbq

movzwq

mulq

nop

notq

orq

putc

salq

sarq

shldwq

shlq

shrq

subq

trap

Pseudo operators

.align

.byte

.comm

.data

.equ

.equiv

.globl

.global

.lcomm

.long

.quad

.set

.string

.text

.word

Tokens recognized in the label or operands fields

Identifiers

Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z'), an underscore '_' or a dot '.' and are optionally continued with a sequence of more letters, underscores, dots or decimal digits 0 to 9.

Hence “foo”, “.fOo”, “.fOo1”, “_”, “.” are allowed but not “2foo”.
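The rule for identifiers translates directly into a regular expression. The following Python sketch (the names IDENT_RE and is_identifier are made up here) checks the examples from above:

```python
import re

# First character: letter, underscore or dot.
# Following characters: letters, digits, underscores or dots.
IDENT_RE = re.compile(r"[A-Za-z_.][A-Za-z0-9_.]*")


def is_identifier(text):
    """True if the whole text matches the identifier rule."""
    return IDENT_RE.fullmatch(text) is not None


for candidate in ["foo", ".fOo", ".fOo1", "_", ".", "2foo"]:
    print(repr(candidate), is_identifier(candidate))
```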

Empty label

You also see some tokens called SPACE in this example. This token gets generated when the label field is empty. For the parser (and describing the grammar) it is important that every line has in general a label (which can be empty, but it exists). Otherwise white space characters get consumed by the scanner.

Literals

  • Decimal, octal and hexadecimal constants (e.g. 12, 017, 0x2a)

  • Character constants (e.g. ‘h’, ‘\n’)

  • String literals (e.g. “hello, world!\n”)

  • End of line (newline character ‘\n’)

Punctuators/Delimiters

+

-

*

/

%

(

)

:

,

$

Immediate operator

@w0

@w1

@w2

@w3

Not important now but have a look at Session 10, Loading a function address for an application.

More about the assembly language: Grammar

Reading a grammar and understanding the essentials is easier when you can look at actual syntax trees. In the terminal you can print the syntax tree of an assembly program with the command ulmas-test-parser:

theon$ ulmas-test-parser hello.s
("compilation_unit"
   ("directive"
      ("pseudo_op_def"
         '.equ'
      ), 
      ("identifier"
         'p'
      ), 
      ("expression"
         ("integer"
            ("decimal_constant"
               '1'
            )
         )
      )
   ), 
   ("directive"
      ("pseudo_op_def"
         '.equ'
      ), 
      ("identifier"
         'ch'
      ), 
      ("expression"
         ("integer"
            ("decimal_constant"
               '2'
            )
         )
      )
   ), 
   ("data_header"
      '.data'
   ), 
   ("labelled_op"
      ("label"
         ("identifier"
            'msg'
         )
      ), 
      ("directive"
         ("pseudo_op_string"
            '.string'
         ), 
         'hello, world!
'
      )
   ), 
   ("text_header"
      '.text'
   ), 
   ("instruction"
      ("mnemonic"
         'ldzwq'
      ), 
      ("immediate_operand"
         ("expression"
            ("identifier"
               'msg'
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("identifier"
                  'p'
               )
            )
         )
      )
   ), 
   ("labelled_op"
      ("label"
         ("identifier"
            'load'
         )
      ), 
      ("instruction"
         ("mnemonic"
            'movzbq'
         ), 
         ("memory_operand"
            ("reg"
               ("expression"
                  ("identifier"
                     'p'
                  )
               )
            )
         ), 
         ("register_operand"
            ("reg"
               ("expression"
                  ("identifier"
                     'ch'
                  )
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'subq'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("octal_constant"
                  '0'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("identifier"
                  'ch'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("octal_constant"
                     '0'
                  )
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'jz'
      ), 
      ("immediate_operand"
         ("expression"
            ("identifier"
               'halt'
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'putc'
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("identifier"
                  'ch'
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'addq'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("decimal_constant"
                  '1'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("identifier"
                  'p'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("identifier"
                  'p'
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'jmp'
      ), 
      ("immediate_operand"
         ("expression"
            ("identifier"
               'load'
            )
         )
      )
   ), 
   ("labelled_op"
      ("label"
         ("identifier"
            'halt'
         )
      ), 
      ("instruction"
         ("mnemonic"
            'halt'
         ), 
         ("immediate_operand"
            ("expression"
               ("integer"
                  ("octal_constant"
                     '0'
                  )
               )
            )
         )
      )
   )
)
theon$ 

For an interactive visualization of the tree for the “hello, world!” program click here: syntax tree for hello.s

Expressions

In the grammar expressions are defined quite similar to the expressions in the ULM calculator with variables:

\[\begin{array}{lcl}\langle\text{expression}\rangle & \to & \langle\text{simple-expression}\rangle \\ \langle\text{simple-expression}\rangle & \to & \langle\text{term}\rangle \\ & \to & \langle\text{simple-expression}\rangle \quad \textbf{+} \quad \langle\text{term}\rangle \\ & \to & \langle\text{simple-expression}\rangle \quad \textbf{-} \quad \langle\text{term}\rangle \\ \langle\text{term}\rangle & \to & \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{*}\quad \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{/}\quad \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{\%}\quad \langle\text{factor}\rangle \\ \langle\text{factor}\rangle & \to & \langle\text{primary}\rangle \\ & \to & \langle\text{unary-minus}\rangle \\ \langle\text{unary-minus}\rangle & \to & \textbf{-} \quad \langle\text{primary}\rangle \\ \langle\text{primary}\rangle & \to & \langle\text{integer}\rangle \\ & \to & \langle\text{identifier}\rangle \\ & \to & \textbf{(} \quad \langle\text{simple-expression}\rangle \quad \textbf{)}\\ \langle\text{integer}\rangle & \to & \text{decimal-constant} \\ & \to & \text{hexadecimal-constant} \\ & \to & \text{octal-constant} \\ & \to & \text{char-constant} \\ \langle\text{identifier}\rangle & \to & \text{ident} \\\end{array}\]

In the simplest cases an expression is just an identifier or a constant. Compared to the ULM calculator the constants can be decimal, hexadecimal, octal and character constants. Here is an example of halt instructions that all have the same exit code, each time given by a different expression:

    halt    65              // exit code as decimal constant
    halt    0x41            // exit code as hexadecimal constant
    halt    0101            // exit code as octal constant
    halt    'A'             // exit code as character constant

    .equ    exit,   'A'     /* here instead of 'A' you also could write
                               65, 0x41, 0101.
                            */
    halt    exit

Check the parse tree

theon$ ulmas-test-parser ex_expr.s
("compilation_unit"
   ("instruction"
      ("mnemonic"
         'halt'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("decimal_constant"
                  '65'
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'halt'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("hexadecimal_constant"
                  '0x41'
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'halt'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("octal_constant"
                  '0101'
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'halt'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("char_constant"
                  'A'
               )
            )
         )
      )
   ), 
   ("directive"
      ("pseudo_op_def"
         '.equ'
      ), 
      ("identifier"
         'exit'
      ), 
      ("expression"
         ("integer"
            ("char_constant"
               'A'
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'halt'
      ), 
      ("immediate_operand"
         ("expression"
            ("identifier"
               'exit'
            )
         )
      )
   )
)
theon$ 

and that you get the same instruction 09 41 00 00 five times after translating:

#TEXT 4
0x0000000000000000: 09 41 00 00 #       halt 65
0x0000000000000004: 09 41 00 00 #       halt 0x41
0x0000000000000008: 09 41 00 00 #       halt 0101
0x000000000000000C: 09 41 00 00 #       halt 'A'
0x0000000000000010: 09 41 00 00 #       halt exit
#SYMTAB
a exit                        0x0000000000000041
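Why all four spellings end up as 0x41 can be checked with a small Python sketch that converts the text of a constant into its value. The function parse_constant is made up for this illustration, and the handful of escape sequences it knows is an assumption, not the assembler's complete list.

```python
# Escape sequences supported by this sketch (an assumption for illustration).
ESCAPES = {"\\n": 10, "\\t": 9, "\\0": 0, "\\\\": 92, "\\'": 39}


def parse_constant(text):
    """Return the integer value of a decimal, hexadecimal, octal or
    character constant written in C-like notation."""
    if len(text) >= 3 and text[0] == "'" and text[-1] == "'":
        body = text[1:-1]
        return ord(body) if len(body) == 1 else ESCAPES[body]
    if text.lower().startswith("0x"):
        return int(text[2:], 16)
    if text.startswith("0") and len(text) > 1:
        return int(text[1:], 8)      # leading zero: octal
    return int(text, 10)


for text in ["65", "0x41", "0101", "'A'"]:
    print(text, "->", hex(parse_constant(text)))
```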

Structure of an assembly program

The grammar describes an assembly program as a sequence of instructions, directives and labels:

\[\begin{array}{lcl}\langle\text{compilation-unit}\rangle & \to & \langle\text{}\rangle \\ & \to & \langle\text{sequence}\rangle \\\langle\text{sequence}\rangle & \to & \langle\text{labelled-op}\rangle \\ & \to & \langle\text{regular-op}\rangle \\ & \to & \langle\text{sequence}\rangle \quad \langle\text{labelled-op}\rangle \\ & \to & \langle\text{sequence}\rangle \quad \langle\text{regular-op}\rangle \\\langle\text{labelled-op}\rangle & \to & \langle\text{label}\rangle \quad \langle\text{op}\rangle\\\langle\text{regular-op}\rangle & \to & \langle\text{empty-label}\rangle \quad \langle\text{op}\rangle\\\langle\text{op}\rangle & \to & \textbf{eol} \\ & \to & \langle\text{instruction}\rangle \quad \textbf{eol}\\ & \to & \langle\text{directive}\rangle \quad \textbf{eol}\\\end{array}\]

For example, the following program

label   nop
label
        nop

is a sequence of three ops (a labelled instruction, a label on an otherwise empty line, and an unlabelled instruction):

theon$ ulmas-test-parser ex1.s
("compilation_unit"
   ("labelled_op"
      ("label"
         ("identifier"
            'label'
         )
      ), 
      ("instruction"
         ("mnemonic"
            'nop'
         )
      )
   ), 
   ("labelled_op"
      ("label"
         ("identifier"
            'label'
         )
      ), 
      ("empty_line")
   ), 
   ("instruction"
      ("mnemonic"
         'nop'
      )
   )
)
theon$ 

Labels

Labels are identifiers followed by an optional colon:

\[\begin{array}{lcl}\langle\text{label}\rangle & \to & \langle\text{identifier}\rangle \\ & \to & \langle\text{identifier}\rangle \quad \textbf{:}\\\end{array}\]

For example, a program that consists of just two labels, i.e. labelled ops where \(\langle \text{op} \rangle\) is empty, gives the tree below. In the printed tree we don't even show the node for the colon token:

theon$ ulmas-test-parser ex2.s
("compilation_unit"
   ("labelled_op"
      ("label"
         ("identifier"
            'label'
         )
      ), 
      ("empty_line")
   ), 
   ("labelled_op"
      ("label"
         ("identifier"
            'label'
         )
      ), 
      ("empty_line")
   )
)
theon$ 

Instructions

An instruction consists of a mnemonic and up to 3 operands:

\[\begin{array}{lcl}\langle\text{instruction}\rangle & \to & \langle\text{mnemonic}\rangle \\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle\\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle \\ \langle\text{operand}\rangle & \to & \langle\text{immediate-operand}\rangle \\ & \to & \langle\text{register-operand}\rangle \\ & \to & \langle\text{memory-operand}\rangle \\\end{array}\]

Immediate operand

Immediate operands can be expressions with an optional dollar as prefix. We don't care for now that they also can be an immediate operator:

\[\begin{array}{lcl}\langle\text{immediate-operand}\rangle & \to & \langle\text{expression}\rangle \\ & \to & \textbf{\$} \quad \langle\text{expression}\rangle \\ & \to & \langle\text{immediate-operator}\rangle \\\end{array}\]

Hence for the assembler the following two instructions are the same:

    addq    1,  %2, %3
    addq    $1, %2, %3

theon$ ulmas-test-parser ex4.s
("compilation_unit"
   ("instruction"
      ("mnemonic"
         'addq'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("decimal_constant"
                  '1'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("decimal_constant"
                     '2'
                  )
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("decimal_constant"
                     '3'
                  )
               )
            )
         )
      )
   ), 
   ("instruction"
      ("mnemonic"
         'addq'
      ), 
      ("immediate_operand"
         ("expression"
            ("integer"
               ("decimal_constant"
                  '1'
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("decimal_constant"
                     '2'
                  )
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("decimal_constant"
                     '3'
                  )
               )
            )
         )
      )
   )
)
theon$ 

Register operand

Register operands are expressions with a percent sign as prefix:

\[\begin{array}{lcl}\langle\text{register-operand}\rangle & \to & \langle\text{reg}\rangle \\ \langle\text{reg}\rangle & \to & \textbf{\%} \quad \langle\text{expression}\rangle \\\end{array}\]

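Hence you can write %20, %10 + 10 or %5+2*2:

    addq    %20,    %10 + 10,   %5+2*2

Once the code generator has evaluated the expressions, the first two register operands refer to the same register %20. The third operand evaluates to %9, because * binds more tightly than + (see the nesting of the "+" and "*" nodes in the tree):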

theon$ ulmas-test-parser ex3.s
("compilation_unit"
   ("instruction"
      ("mnemonic"
         'addq'
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("integer"
                  ("decimal_constant"
                     '20'
                  )
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("+"
                  ("integer"
                     ("decimal_constant"
                        '10'
                     )
                  ), 
                  ("integer"
                     ("decimal_constant"
                        '10'
                     )
                  )
               )
            )
         )
      ), 
      ("register_operand"
         ("reg"
            ("expression"
               ("+"
                  ("integer"
                     ("decimal_constant"
                        '5'
                     )
                  ), 
                  ("*"
                     ("integer"
                        ("decimal_constant"
                           '2'
                        )
                     ), 
                     ("integer"
                        ("decimal_constant"
                           '2'
                        )
                     )
                  )
               )
            )
         )
      )
   )
)
theon$ 
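How the expression grammar leads to these values can be tried out with a small recursive descent evaluator in Python. It follows the rules for simple-expression, term, factor and primary from the section on expressions; everything else (the names tokenize and evaluate, the symbols dictionary, using floor division for “/”, leaving out character constants) is an assumption of this sketch and not taken from ulmas.

```python
import re

TOKEN_RE = re.compile(
    r"\s*(0[xX][0-9a-fA-F]+|\d+|[A-Za-z_.][A-Za-z0-9_.]*|[-+*/%()])")


def tokenize(text):
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text, symbols=None):
    """Evaluate an expression; identifiers are looked up in symbols."""
    symbols = symbols or {}
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def primary():
        tok = take()
        if tok == "(":
            value = simple_expression()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok[0].isdigit():
            if tok.lower().startswith("0x"):
                return int(tok, 16)
            return int(tok, 8) if tok[0] == "0" and len(tok) > 1 else int(tok)
        return symbols[tok]

    def factor():
        if peek() == "-":                 # unary minus
            take()
            return -primary()
        return primary()

    def term():
        value = factor()
        while peek() in ("*", "/", "%"):
            op, rhs = take(), factor()
            value = value * rhs if op == "*" else (
                value // rhs if op == "/" else value % rhs)
        return value

    def simple_expression():
        value = term()
        while peek() in ("+", "-"):
            op, rhs = take(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = simple_expression()
    if pos != len(tokens):
        raise ValueError("unexpected token " + tokens[pos])
    return result


for operand in ["20", "10 + 10", "5+2*2", "(5+2)*2"]:
    print("%" + operand, "-> register", evaluate(operand))
```

Because a term is completed before the surrounding simple-expression continues, 2*2 is evaluated first in 5+2*2. With symbols such as {"p": 1} the same function also handles operands like %p.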

Memory operand

\[\begin{array}{lcl}\langle\text{memory-operand}\rangle & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{)} \\ & \to & \langle\text{displacement}\rangle \quad \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{)} \\ & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{reg} \rangle \quad \quad \textbf{)} \\ & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{scale} \rangle \quad \quad \textbf{)} \\\langle\text{displacement}\rangle & \to & \langle\text{expression}\rangle \\\langle\text{scale}\rangle & \to & \langle\text{expression}\rangle \\\end{array}\]

Directives

\[\begin{array}{lcl}\langle\text{directive}\rangle & \to & \langle\text{text-header}\rangle \\ & \to & \langle\text{data-header}\rangle \\ & \to & \langle\text{bss-header}\rangle \\ & \to & \langle\text{pseudo-op-data}\rangle \quad \langle\text{expression}\rangle \\ & \to & \langle\text{pseudo-op-string}\rangle \quad \textbf{string-literal} \\ & \to & \langle\text{pseudo-op-flag}\rangle \quad \langle\text{identifier}\rangle \\ & \to & \langle\text{pseudo-op-def}\rangle \quad \langle\text{identifier}\rangle \quad \textbf{,} \quad \langle\text{expression}\rangle \\ \langle\text{text-header}\rangle & \to & \textbf{.text} \\ \langle\text{data-header}\rangle & \to & \textbf{.data} \\ \langle\text{bss-header}\rangle & \to & \textbf{.bss} \\ \langle\text{pseudo-op-string}\rangle & \to & \textbf{.string} \\ \langle\text{pseudo-op-def}\rangle & \to & \textbf{.equiv} \\ & \to & \textbf{.equ} \\ \langle\text{pseudo-op-data}\rangle & \to & \textbf{.align} \\ & \to & \textbf{.space} \\ & \to & \textbf{.byte} \\ & \to & \textbf{.long} \\ & \to & \textbf{.quad} \\ & \to & \textbf{.word} \\\end{array}\]

More about the assembly language: Code generator

The code generator generates code for the text, data and BSS segment. In this process it also generates the symbol table and the relocation table.

Text, data and BSS segments

If a section is empty, no header or code for that section gets generated. For example, the “hello, world” program does not have a BSS segment and you saw it was not mentioned in the output. It does not matter in which order you specify the sections in the assembly program: in the generated output the order will always be text, data and BSS segment.

With the header directives .text, .data and .bss the programmer can switch between sections, i.e. specify to which section the subsequent lines in the program belong. If no section is specified (i.e. for lines with no preceding header directive) the text segment gets assembled by default. This was the case in the assembly program in Session 6.

Semantic error: Instruction can not be translated

This instruction perfectly matches the grammar rules:

    addq    (%1),   %2,    %3

However, the Instruction Set of the ULM does not allow adding a memory operand to a register. So obviously the code generator cannot create code for that:

theon$ ulmas error_ex1.s
ulmas: error_ex1.s:1.5-29: invalid operands:    addq (%1), %2, %3
theon$ 

No semantic error: Undefined labels are fine

Using undefined labels is allowed:

    addq    foo,    %2,    %3
    ldzwq   func,   %4
    jmp     %4,     %1

This generates an “undefined” symbol entry in the symtab and an entry in the fixups, so that a linker can fix it:

#TEXT 4
0x0000000000000000: 38 00 02 03 #       addq foo, %2, %3
0x0000000000000004: 56 00 00 04 #       ldzwq func, %4
0x0000000000000008: 40 04 01 00 #       jmp %4, %1
#SYMTAB
U foo                         0x0000000000000000
U func                        0x0000000000000000
#FIXUPS
text 0x0000000000000000 1 1 absolute foo
text 0x0000000000000004 1 2 absolute func

No semantic error: Using symbols and defining them afterwards

The ULM assembler is a multi-pass assembler. Symbols can be defined through labels or an .equ directive. If needed, the code generator traverses the syntax tree several times to resolve symbols. Hence, you can jump forward to a label without pre-declaring it, as this assembler output shows:

#TEXT 4
0x0000000000000000: 41 00 00 01 #       jmp foo
#SYMTAB
t foo                         0x0000000000000004
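The encoding 41 00 00 01 for jmp foo can be related to the symbol value 0x04 with a small Python sketch. It rests on an assumption that is only inferred from the listings on this page and not stated here: the first byte is the opcode and the remaining three bytes are a signed offset, counted in instructions of 4 bytes, relative to the address of the jump itself. The function name decode_jump_target is made up for this sketch.

```python
def decode_jump_target(address, word):
    """Return the target address of a relative jump.

    address: address of the jump instruction
    word:    the 4 bytes of the instruction
    Assumption: bytes 1..3 hold a signed 24-bit offset in units of 4 bytes.
    """
    offset = int.from_bytes(word[1:4], "big", signed=True)
    return address + 4 * offset


# jmp foo from the listing above
print(hex(decode_jump_target(0x00, bytes.fromhex("41000001"))))
# jz halt and jmp load from the "hello, world!" listing
print(hex(decode_jump_target(0x0C, bytes.fromhex("42000004"))))
print(hex(decode_jump_target(0x18, bytes.fromhex("41FFFFFB"))))
```

Under this assumption all three jumps decode to the values the symbol tables list for foo, halt and load.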

You can also use symbols and define them later in the program through a .equ directive:

    addq    foo,    %2,         %3
    .equ    foo,    42

The assembler traverses the syntax tree until no more symbols can be resolved. Any symbol that could not be resolved will be entered in the symbol table as undefined (and corresponding entries in the fixup table will be generated).

#TEXT 4
0x0000000000000000: 38 2A 02 03 #       addq foo, %2, %3
#SYMTAB
a foo                         0x000000000000002A
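The idea of repeating passes until nothing changes any more can be sketched in Python. The real assembler works on the syntax tree; this sketch is a strongly simplified model in which every symbol is defined either by a number or by the name of another symbol. The function resolve and the example symbols are made up for this illustration.

```python
def resolve(definitions):
    """definitions maps a symbol name to an int or to another symbol's name.

    Returns (resolved, unresolved): a dict with the values that could be
    determined and the set of names that stayed without a value.
    """
    resolved = {n: v for n, v in definitions.items() if isinstance(v, int)}
    pending = {n: v for n, v in definitions.items() if not isinstance(v, int)}
    progress = True
    while pending and progress:          # one iteration = one pass
        progress = False
        for name, ref in list(pending.items()):
            if ref in resolved:
                resolved[name] = resolved[ref]
                del pending[name]
                progress = True
    return resolved, set(pending)


# "a" is used before "b", and "b" before "c" is defined: two passes are needed.
print(resolve({"a": "b", "b": "c", "c": 42, "d": "missing"}))
```

Here d stays unresolved because nothing defines the symbol it refers to. In the assembler this is the situation in which an undefined symbol and a fixup entry are generated.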

Assignment

Write a program factorial.s that computes \(n!\) and has the following properties:

  • The value \(n\) is stored as a hard coded constant in the data segment of the program.

  • Your program computes \(n!\) by the following algorithm:

    • \(\text{res} \leftarrow 1\)

    • While \(n \neq 0\)

      • \(\text{res} \leftarrow \text{res} \cdot n\)

      • \(n \leftarrow n - 1\)

    • Halt with exit code \(\text{res}\)
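Before writing the assembly program it can help to check the algorithm in a high-level language. The following Python sketch follows the steps above one to one; it is only meant for checking your understanding and is not a substitute for factorial.s.

```python
def factorial(n):
    res = 1                 # res <- 1
    while n != 0:           # While n != 0
        res = res * n       #   res <- res * n
        n = n - 1           #   n <- n - 1
    return res              # Halt with exit code res


print(factorial(5))
```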

Submit the assignment on theon as follows:

submit hpc quiz02 factorial.s

You can use the following skeleton:

        .equ    N,      1
        .equ    RES,    2

        .data
n       .byte   5

        .text

/*
         Your code here!
*/

Assembly instruction for the multiplication

You can use the imulq instruction. Like the addq or subq instruction, it can be used for both signed and unsigned 64-bit integers. The result is a 64-bit integer. The instruction sets the CF and OF bits in the status register to indicate overflows. Other flags in the status register are not changed by the instruction.
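What a 64-bit result means for the factorial can be explored with a Python sketch. Python integers are unbounded, so the sketch cuts the product down to 64 bits by hand. The function mul64 is only a model of an unsigned 64-bit multiplication; its overflow indicator is an assumption for illustration and not the exact definition of the CF and OF bits of the ULM.

```python
MASK64 = (1 << 64) - 1


def mul64(a, b):
    """Return (low 64 bits of a * b, True if the full product did not fit)."""
    full = a * b
    return full & MASK64, full > MASK64


def first_overflowing_factorial():
    """Smallest n for which n! no longer fits into 64 bits (unsigned)."""
    res, n = 1, 1
    while True:
        res, overflow = mul64(res, n)
        if overflow:
            return n
        n += 1


print(first_overflowing_factorial())
```

So with 64-bit registers the loop from the assignment produces correct values of \(n!\) only up to a certain \(n\); beyond that the register holds just the low 64 bits of the product.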