More about the assembly language
For our next project you have to write an assembly program for the ULM that computes the factorial of an unsigned integer. This project consists of three subprojects: implementing an algorithm for computing the factorial, reading in an integer and printing an integer.
Video tutorial
In the video some features of the assembly language (e.g. labels and directives) are introduced.
Here is the “hello, world!” program with labels and .equ directives as shown in the video:
        .equ    p, 1
        .equ    ch, 2
        .data
msg     .string "hello, world!\n"
        .text
        ldzwq   msg, %p
load    movzbq  (%p), %ch
        subq    0, %ch, %0
        jz      halt
        putc    %ch
        addq    1, %p, %p
        jmp     load
halt    halt    0
This program is equivalent (not just similar) to the “hello, world!” program from Session 6. However, when you translate the program with
theon$ ulmas -o hello hello.s
you see that the assembler output contains more information than the code in Session 6:
#TEXT 4
0x0000000000000000: 56 00 20 01 # ldzwq msg, %p
0x0000000000000004: 1B 00 01 02 # movzbq (%p), %ch
0x0000000000000008: 39 00 02 00 # subq 0, %ch, %0
0x000000000000000C: 42 00 00 04 # jz halt
0x0000000000000010: 61 02 00 00 # putc %ch
0x0000000000000014: 38 01 01 01 # addq 1, %p, %p
0x0000000000000018: 41 FF FF FB # jmp load
0x000000000000001C: 09 00 00 00 # halt 0
#DATA 1
0x0000000000000020: 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A 00 # .string "hello, world!\n"
#SYMTAB
a ch 0x0000000000000002
t halt 0x000000000000001C
t load 0x0000000000000004
d msg 0x0000000000000000
a p 0x0000000000000001
#FIXUPS
text 0x0000000000000000 1 2 absolute [data]
Recall from Session 5.2 how the sequential execution of a program was explained for the ULM: The program gets loaded into memory, the instruction pointer gets initialized with zero, and then the ULM performs von Neumann cycles until a halt instruction gets executed. Loading the program and initializing the instruction pointer is done internally by the loader of the ULM (in the Mini-ULM example you had to press the “load” button to trigger the loader explicitly). The loader recognizes the format of the assembler output and initializes the memory with the following layout:
More details about the assembler format and how the assembler works are explained in the following sections.
Format of the assembler output
The generated output consists of different sections. Each of these sections starts with a header (which also separates the section from the previous section). In this case the assembler output has the following four sections:
#TEXT <alignment>
is the header for the text segment. This section contains the instructions for the “hello, world” program.
The alignment parameter is (for convenience) a decimal numeral and in this case 4. It specifies that the loader has to copy this section to a memory block with a start address that is a multiple of 4. For the moment this is not relevant as in this case the start address is 0 (and hence a multiple of any integer that is not zero).
Lines of the text segment have the format
[address:] instruction [# comments]
Addresses and comments are optional (and for convenience). The instructions are in hexadecimal. By default the loader copies the text segment to memory beginning at address 0.
Compare the instructions of the text segment with the memory content from address 0x00 to 0x20.
#DATA <alignment>
is the header for the data segment. This section contains the data for the “hello, world” program and has the same format as the text segment (i.e. optional address, actual data, optional comments).
The loader copies the data segment to memory such that it follows the text segment. Like the text segment, the data segment can have alignment restrictions. That means in general there can be a gap in memory between the text and data segment. However, in the “hello, world” program the alignment of the data segment is 1. Hence, the loader begins the data segment at address 0x20 (where the text segment ended).
#SYMTAB
is the header for the symbol table.
Labels and .equ directives generate symbols that have a value and a type:
- Labels in the text segment have type text and the value is an address within the text segment. For example, the text symbol halt has the value 0x1C.
- Accordingly, labels in the data segment have type data and the value is the address relative to the beginning of the data segment. For example, the data symbol msg has value 0x00 (and not 0x20).
- The .equ directive defines a symbol of type absolute with a given value. Here, for example, the symbol p has value 1.
For loading (and running) the program the symbol table is not relevant. But it is relevant for linking (which will be covered in upcoming sessions).
If an instruction or directive contains an undefined symbol an entry of type undefined and value 0 is added to the symbol table.
#FIXUPS
is the header for the relocation table.
Like the symbol table this will be relevant for linking and will be covered in upcoming sessions.
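The placement rule for the text and data segment described above can be sketched in a few lines of Python. This is only an illustration of the rounding, not the actual implementation of the loader; the helper names align_up and layout are made up for this example.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


def layout(text_size, text_align, data_align):
    """Return (text_start, data_start) for a program loaded at address 0."""
    text_start = align_up(0, text_align)
    data_start = align_up(text_start + text_size, data_align)
    return text_start, data_start


# "hello, world": 8 instructions of 4 bytes each, headers "#TEXT 4" and "#DATA 1"
text_start, data_start = layout(text_size=8 * 4, text_align=4, data_align=1)
print(hex(text_start), hex(data_start))   # 0x0 0x20

# A data label is stored relative to the data segment (msg has value 0x00),
# so its address at run time is data_start + 0x00.
msg_address = data_start + 0x00
print(hex(msg_address))                   # 0x20
```

With a hypothetical data alignment of 8 and a text segment of 36 bytes, layout(36, 4, 8) would give a data start of 0x28, i.e. a gap of 4 bytes between the two segments.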
Memory layout of the “hello, world” program
The information of the text and data segment together with the text and data labels from the symbol table can be visualized as follows:
Disassembling the instructions in the text segment and interpreting the bytes in the data segment as a zero-terminated string lets you almost see the original source code in the memory layout:
From the symbol table we also know that the symbols p and ch were defined with absolute values 1 and 2 respectively. However, we cannot determine where the symbols were used. So for example, we don't know that the first instruction was written as ldzwq msg, %p in the source file. In practice that makes it hard to understand disassembled programs where the original source is not available (and there are actually legal cases where you have to deal with such problems).
Pointers! Start learning about them here!
Have a look at what the first two instructions are doing and how this can be represented descriptively.
ldzwq msg, %p
The assembler replaces the label msg with the address of the h in the “hello, world!” string. That means msg has value 0x20 (or 32 in decimal). Compare that with the assembly program in Session 6.
Obviously the address of the string depends on where the loader will copy the data segment when we run the program. And this in turn depends on the size of the text segment. It requires some kind of bookkeeping to figure out the actual address of the string by just looking at the assembly source code. But it is possible, we can do it and the assembler can do it. But it is less error prone if the assembler is doing it, and using labels delegates this job to the assembler.
So after this instruction the value in %p has the meaning “address of the first character in the string”. So we think of %p as being a “pointer to the first character in the string”:
movzbq (%p), %ch
This instruction copies the value at address %p into %ch.
In this instruction the pointer gets dereferenced. It is impossible to overestimate how important it will be to understand what dereferencing a pointer means, so I will explain and talk about it more than once.
Dereferencing means that we refer to a value at the end of a pointer. And this requires two pieces of information:
- Location: “Where does the pointer point to?”
  This information is the address stored as value in the pointer. So in this case the value in %p. Note the difference between “value of %p” and “value at address %p”.
- Type information: “What is the value at the end of the pointer?”
  In the “hello, world” example the value is the byte at the end of the pointer, and this byte has here the meaning of being the ASCII value of a character. This information is only given by the context and not stored in any register or anywhere else. We know it only because the instruction movzbq is used, which copies a single byte from the end of the pointer, zero-extends it and stores it in the destination register %ch.
In general you have to do the bookkeeping: You have to know how many bytes the dereferenced value consists of, and whether the dereferenced value is a character, a signed or unsigned integer, or something else. You have zero support from the assembly language for bookkeeping this kind of type information.
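To make the difference between “value of %p” and “value at address %p” concrete, here is a small Python model of the “hello, world” loop. It is a sketch under assumptions: the memory is modelled as a byte array with the string at address 0x20 (as in the assembler output above), and the Python variables p and ch merely stand in for the registers.

```python
# Memory image: 32 bytes of text segment (contents irrelevant here),
# followed by the zero-terminated string at address 0x20.
memory = bytearray(32) + b"hello, world!\n\x00"

p = 0x20                 # value of %p: an address (the location)
ch = memory[p]           # value at address %p: one byte, like movzbq (%p), %ch
print(p, ch, chr(ch))    # 32 104 h

output = []
while True:
    ch = memory[p]       # dereference: read the byte at the address stored in p
    if ch == 0:          # the zero byte terminates the string
        break
    output.append(chr(ch))
    p += 1               # let the pointer point to the next character
print("".join(output), end="")
```

Note that nothing in the byte array says “this is a character”. That the byte is printed as a character is a decision made by the code that dereferences the pointer.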
In this context we can illustrate the meaning of (%p) as follows:
How to know if a register is a pointer?
The assembler does not know whether a register is used as pointer. This is also up to you. You give the register its meaning. And this meaning can change!
Consider this modification of the first two instructions:
        ldzwq   msg, %3
        movzbq  (%3), %3
After the first instruction, register %3 could have the meaning “pointer to the string”. After the second instruction it has the meaning “first character of the string”.
Some personal opinion/experience
When learning C/C++ understanding pointers and how to use them is the hardest part. My rule of thumb is that every non-trivial bug in a C/C++ program is related to pointers. The dangerous thing is this combination:
- In the C/C++ programming languages the compilers do a lot of the necessary bookkeeping. So compared to programming in assembler the C/C++ compilers can detect many bugs related to pointers. Some of the bugs that slip through this line of defence can be detected by additional tools.
- Slightly exaggerated but true in essence: If a bug slipped through, it is impossible to find it. You don't even know if a bug slipped through! Because such a bug might only once in a while cause the program to crash, or worse, the bug does not crash the program and just causes wrong results.
Because some bugs are detected in C/C++ it is tempting to use these languages and underestimate the danger.
The advantage of programming in assembly is: You will never underestimate what can go wrong! And be aware that you are learning some non-trivial concepts. Be patient now if things don't work out the first time and try to understand the underlying reason. This will allow you to do some solid programming in C/C++ later.
More about the assembly language: Tokens
As I said in the video, it seems odd at first that, for example, halt can be used as a label. How can the assembler distinguish the meanings?
Field format of source lines
This is handled by the scanner during the lexical analysis. The format of the source lines consists of fields:
[label] [operators] [operands]
Tokens for mnemonics (e.g. “addq”, “halt”, etc.) and pseudo operators (e.g. “.string”, “.byte”, etc.) are only generated from the operator field. So in other fields they can be used as identifiers. For example, from this code
addq                                        # some label
        addq    %0,     %12,    %0x3        // an instruction
        .quad   4                           /* some data
                                            */
the scanner (you can call it with ulmas-test-lexer) generates the following tokens:
theon$ ulmas-test-lexer lex_example.s
IDENT "addq" at lex_example.s:1.1-4
EOL at lex_example.s:1.45-57
SPACE at lex_example.s:2.1-8
ADDQ "addq" at lex_example.s:2.9-12
PERCENT at lex_example.s:2.17
OCTAL_CONSTANT "0" at lex_example.s:2.18
COMMA at lex_example.s:2.19
PERCENT at lex_example.s:2.25
DECIMAL_CONSTANT "12" at lex_example.s:2.26-27
COMMA at lex_example.s:2.28
PERCENT at lex_example.s:2.33
HEXADECIMAL_CONSTANT "0x3" at lex_example.s:2.34-36
EOL at lex_example.s:2.45-62
SPACE at lex_example.s:3.1-8
DOT_QUAD ".quad" at lex_example.s:3.9-13
DECIMAL_CONSTANT "4" at lex_example.s:3.17
EOL at lex_example.s:4.47
So note that the character sequence “addq” was first detected as an identifier (IDENT) and in the second case as mnemonic (ADDQ).
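The idea that the field decides which token a word becomes can be sketched in Python. This is a strongly simplified model and not the scanner of ulmas: the function classify_fields and the two small word sets are made up for this example, comments are assumed to be removed already, and a trailing colon of a label is simply dropped.

```python
MNEMONICS = {"addq", "halt", "nop", "jmp"}      # small subset for the example
PSEUDO_OPS = {".quad", ".string", ".equ"}       # small subset for the example


def classify_fields(line):
    """Return (label_token, operator_token) for one source line."""
    if not line or line[:1].isspace():
        # A line that starts with white space has an empty label field.
        label, rest = None, line.split()
    else:
        words = line.split()
        label, rest = words[0].rstrip(":"), words[1:]
    label_token = ("IDENT", label) if label is not None else ("SPACE",)

    operator_token = None
    if rest:
        word = rest[0]
        if word in MNEMONICS:                   # only checked in the operator field
            operator_token = (word.upper(), word)
        elif word in PSEUDO_OPS:
            operator_token = ("DOT_" + word[1:].upper(), word)
    return label_token, operator_token


print(classify_fields("addq"))
print(classify_fields("        addq    %0, %12, %0x3"))
print(classify_fields("halt    halt 0"))
```

The word “addq” in the first column becomes an IDENT, while the same word after leading white space becomes the mnemonic token ADDQ.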
Comments
You also might notice that comments are removed. And comments can be used in different flavors:
- Single-line comments start with “#” or “//”
- “/*” begins a multi-line comment and “*/” ends a multi-line comment
Tokens recognized only in the operator field
As specified in the grammar, a mnemonic is part of an instruction and a pseudo operator part of a directive.
Mnemonics
The mnemonics specified by the Instruction Set of the ULM:
addq andq divq getc halt idivq imulq
ja jae jb jbe je jg jge jl jle jmp
jna jnae jnb jnbe jne jng jnge jnl jnle jnz jz
ldswq ldzwq
movb movl movq movsbq movslq movswq movw movzbq movzwq
mulq nop notq orq putc
salq sarq shldwq shlq shrq subq trap
Pseudo operators
.align .byte .comm .data .equ .equiv .globl .global
.lcomm .long .quad .set .string .text .word
Tokens recognized in the label or operands fields
Identifiers
Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z'), an underscore '_', or a dot '.' and are optionally continued with a sequence of more letters, underscores, dots, or decimal digits 0 to 9.
Hence “foo”, “.fOo”, “.fOo1”, “_”, “.” are allowed but not “2foo”.
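The rule for identifiers translates directly into a regular expression. The following Python check is just a sketch of that rule (the name is_identifier is made up here); it does not model that mnemonics and pseudo operators are recognized first in the operator field.

```python
import re

# first character: letter, underscore or dot; then letters, underscores, dots, digits
IDENT = re.compile(r"[A-Za-z_.][A-Za-z_.0-9]*")


def is_identifier(word):
    return IDENT.fullmatch(word) is not None


for word in ["foo", ".fOo", ".fOo1", "_", ".", "2foo"]:
    print(repr(word), is_identifier(word))
```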
Empty label
You also see some tokens called SPACE in this example. This token gets generated when the label field is empty. For the parser (and describing the grammar) it is important that every line has in general a label (which can be empty, but it exists). Otherwise white space characters get consumed by the scanner.
Literals
- Decimal, hexadecimal and octal constants (e.g. 12, 0x2a, 017)
- Character constants (e.g. ‘h’, ‘\n’)
- String literals (e.g. “hello, world!\n”)
- End of line (newline character ‘\n’)
Punctuators/Delimiters
+ - * / % ( ) : , $
Immediate operator
@w0 @w1 @w2 @w3
Not important now but have a look at Session 10, Loading a function address for an application.
More about the assembly language: Grammar
Reading a grammar and understanding the essentials is easier when you can look at actual syntax trees. In the terminal you can print the syntax tree of an assembly program with the command ulmas-test-parser:
theon$ ulmas-test-parser hello.s
("compilation_unit"
    ("directive" ("pseudo_op_def" '.equ' ), ("identifier" 'p' ), ("expression" ("integer" ("decimal_constant" '1' ) ) ) ),
    ("directive" ("pseudo_op_def" '.equ' ), ("identifier" 'ch' ), ("expression" ("integer" ("decimal_constant" '2' ) ) ) ),
    ("data_header" '.data' ),
    ("labelled_op" ("label" ("identifier" 'msg' ) ), ("directive" ("pseudo_op_string" '.string' ), 'hello, world! ' ) ),
    ("text_header" '.text' ),
    ("instruction" ("mnemonic" 'ldzwq' ), ("immediate_operand" ("expression" ("identifier" 'msg' ) ) ), ("register_operand" ("reg" ("expression" ("identifier" 'p' ) ) ) ) ),
    ("labelled_op" ("label" ("identifier" 'load' ) ), ("instruction" ("mnemonic" 'movzbq' ), ("memory_operand" ("reg" ("expression" ("identifier" 'p' ) ) ) ), ("register_operand" ("reg" ("expression" ("identifier" 'ch' ) ) ) ) ) ),
    ("instruction" ("mnemonic" 'subq' ), ("immediate_operand" ("expression" ("integer" ("octal_constant" '0' ) ) ) ), ("register_operand" ("reg" ("expression" ("identifier" 'ch' ) ) ) ), ("register_operand" ("reg" ("expression" ("integer" ("octal_constant" '0' ) ) ) ) ) ),
    ("instruction" ("mnemonic" 'jz' ), ("immediate_operand" ("expression" ("identifier" 'halt' ) ) ) ),
    ("instruction" ("mnemonic" 'putc' ), ("register_operand" ("reg" ("expression" ("identifier" 'ch' ) ) ) ) ),
    ("instruction" ("mnemonic" 'addq' ), ("immediate_operand" ("expression" ("integer" ("decimal_constant" '1' ) ) ) ), ("register_operand" ("reg" ("expression" ("identifier" 'p' ) ) ) ), ("register_operand" ("reg" ("expression" ("identifier" 'p' ) ) ) ) ),
    ("instruction" ("mnemonic" 'jmp' ), ("immediate_operand" ("expression" ("identifier" 'load' ) ) ) ),
    ("labelled_op" ("label" ("identifier" 'halt' ) ), ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("integer" ("octal_constant" '0' ) ) ) ) ) )
)
For an interactive visualization of the tree for the “hello, world!” program click here: syntax tree for hello.s
Expressions
In the grammar expressions are defined quite similar to the expressions in the ULM calculator with variables:
\[\begin{array}{lcl}\langle\text{expression}\rangle & \to & \langle\text{simple-expression}\rangle \\\langle\text{simple-expression}\rangle & \to & \langle\text{term}\rangle \\ & \to & \langle\text{simple-expression}\rangle \quad \textbf{+} \quad \langle\text{term}\rangle \\ & \to & \langle\text{simple-expression}\rangle \quad \textbf{-} \quad \langle\text{term}\rangle \\\langle\text{term}\rangle & \to & \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{*}\quad \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{/}\quad \langle\text{factor}\rangle \\ & \to & \langle\text{term}\rangle \quad\textbf{\%}\quad \langle\text{factor}\rangle \\\langle\text{factor}\rangle & \to & \langle\text{primary}\rangle \\ & \to & \langle\text{unary-minus}\rangle \\\langle\text{unary-minus}\rangle & \to & \textbf{-} \quad \langle\text{primary}\rangle \\\langle\text{primary}\rangle & \to & \langle\text{integer}\rangle \\ & \to & \langle\text{identifier}\rangle \\ & \to & \textbf{(} \quad \langle\text{simple-expression}\rangle \quad \textbf{)}\\\langle\text{integer}\rangle & \to & \text{decimal-constant} \\ & \to & \text{hexadecimal-constant} \\ & \to & \text{octal-constant} \\ & \to & \text{char-constant} \\\langle\text{identifier}\rangle & \to & \text{ident} \\\end{array}\]
In the simplest cases an expression is just an identifier or a constant. Compared to the ULM calculator the constants can be decimal, hexadecimal, octal and character constants. Here is an example of halt instructions that all have the same exit code, each time given by a different expression:
        halt    65          // exit code as decimal constant
        halt    0x41        // exit code as hexadecimal constant
        halt    0101        // exit code as octal constant
        halt    'A'         // exit code as character constant
        .equ    exit, 'A'   /* here instead of 'A' you also could write
                               65, 0x41, 0101.
                            */
        halt    exit
Check the parse tree
theon$ ulmas-test-parser ex_expr.s
("compilation_unit"
    ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("integer" ("decimal_constant" '65' ) ) ) ) ),
    ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("integer" ("hexadecimal_constant" '0x41' ) ) ) ) ),
    ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("integer" ("octal_constant" '0101' ) ) ) ) ),
    ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("integer" ("char_constant" 'A' ) ) ) ) ),
    ("directive" ("pseudo_op_def" '.equ' ), ("identifier" 'exit' ), ("expression" ("integer" ("char_constant" 'A' ) ) ) ),
    ("instruction" ("mnemonic" 'halt' ), ("immediate_operand" ("expression" ("identifier" 'exit' ) ) ) )
)
and that you get the same instruction 09 41 00 00 five times after translating:
#TEXT 4
0x0000000000000000: 09 41 00 00 # halt 65
0x0000000000000004: 09 41 00 00 # halt 0x41
0x0000000000000008: 09 41 00 00 # halt 0101
0x000000000000000C: 09 41 00 00 # halt 'A'
0x0000000000000010: 09 41 00 00 # halt exit
#SYMTAB
a exit 0x0000000000000041
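Why all five lines produce the same bytes can be checked with a few lines of Python. The function constant_value is a made-up helper that only mimics the four notations (no escape sequences like '\n'); the printed byte pattern just repeats what the listing above shows for halt and is not derived from the instruction set reference.

```python
def constant_value(token):
    """Value of a decimal, hexadecimal, octal or simple character constant."""
    if len(token) == 3 and token.startswith("'") and token.endswith("'"):
        return ord(token[1])              # character constant such as 'A'
    if token.lower().startswith("0x"):
        return int(token, 16)             # hexadecimal constant
    if token.startswith("0"):
        return int(token, 8)              # octal constant (this includes "0")
    return int(token, 10)                 # decimal constant


for token in ["65", "0x41", "0101", "'A'"]:
    value = constant_value(token)
    # byte pattern as observed in the listing: 09, exit code, 00, 00
    print(token, value, "09 %02X 00 00" % value)
```

Note that a leading 0 selects octal, which is why the scanner output shown earlier reports a plain 0 as OCTAL_CONSTANT.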
Structure of an assembly program
The grammar describes an assembly program as a sequence of instructions, directives and labels:
\[\begin{array}{lcl}\langle\text{compilation-unit}\rangle & \to & \langle\text{}\rangle \\ & \to & \langle\text{sequence}\rangle \\\langle\text{sequence}\rangle & \to & \langle\text{labelled-op}\rangle \\ & \to & \langle\text{regular-op}\rangle \\ & \to & \langle\text{sequence}\rangle \quad \langle\text{labelled-op}\rangle \\ & \to & \langle\text{sequence}\rangle \quad \langle\text{regular-op}\rangle \\\langle\text{labelled-op}\rangle & \to & \langle\text{label}\rangle \quad \langle\text{op}\rangle\\\langle\text{regular-op}\rangle & \to & \langle\text{empty-label}\rangle \quad \langle\text{op}\rangle\\\langle\text{op}\rangle & \to & \textbf{eol} \\ & \to & \langle\text{instruction}\rangle \quad \textbf{eol}\\ & \to & \langle\text{directive}\rangle \quad \textbf{eol}\\\end{array}\]
For example, the following program
label   nop
label
        nop
is a sequence of three operations (a labelled instruction, a label on an otherwise empty line, and an instruction without a label):
theon$ ulmas-test-parser ex1.s
("compilation_unit"
    ("labelled_op" ("label" ("identifier" 'label' ) ), ("instruction" ("mnemonic" 'nop' ) ) ),
    ("labelled_op" ("label" ("identifier" 'label' ) ), ("empty_line") ),
    ("instruction" ("mnemonic" 'nop' ) )
)
Labels
Labels are identifiers followed by an optional colon:
\[\begin{array}{lcl}\langle\text{label}\rangle & \to & \langle\text{identifier}\rangle \\ & \to & \langle\text{identifier}\rangle \quad \textbf{:}\\\end{array}\]
For example, the following program
label
label:
consists of two labels, i.e. labelled instructions where \(\langle \text{op} \rangle\) is empty. In the printed tree we don't even show the node for the colon token:
theon$ ulmas-test-parser ex2.s
("compilation_unit"
    ("labelled_op" ("label" ("identifier" 'label' ) ), ("empty_line") ),
    ("labelled_op" ("label" ("identifier" 'label' ) ), ("empty_line") )
)
Instructions
An instruction consists of a mnemonic and up to 3 operands:
\[\begin{array}{lcl}\langle\text{instruction}\rangle & \to & \langle\text{mnemonic}\rangle \\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle\\ & \to & \langle\text{mnemonic}\rangle \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle \quad \textbf{,} \quad \langle\text{operand}\rangle \\\langle\text{operand}\rangle & \to & \langle\text{immediate-operand}\rangle \\ & \to & \langle\text{register-operand}\rangle \\ & \to & \langle\text{memory-operand}\rangle \\\end{array}\]
Immediate operand
Immediate operands can be expressions with an optional dollar as prefix. We don't care for now that they also can be an immediate operator:
\[\begin{array}{lcl}\langle\text{immediate-operand}\rangle & \to & \langle\text{expression}\rangle \\ & \to & \textbf{\$} \quad \langle\text{expression}\rangle \\ & \to & \langle\text{immediate-operator}\rangle \\\end{array}\]
Hence for the assembler the following two instructions are the same:
        addq    1, %2, %3
        addq    $1, %2, %3
theon$ ulmas-test-parser ex4.s
("compilation_unit"
    ("instruction" ("mnemonic" 'addq' ), ("immediate_operand" ("expression" ("integer" ("decimal_constant" '1' ) ) ) ), ("register_operand" ("reg" ("expression" ("integer" ("decimal_constant" '2' ) ) ) ) ), ("register_operand" ("reg" ("expression" ("integer" ("decimal_constant" '3' ) ) ) ) ) ),
    ("instruction" ("mnemonic" 'addq' ), ("immediate_operand" ("expression" ("integer" ("decimal_constant" '1' ) ) ) ), ("register_operand" ("reg" ("expression" ("integer" ("decimal_constant" '2' ) ) ) ) ), ("register_operand" ("reg" ("expression" ("integer" ("decimal_constant" '3' ) ) ) ) ) )
)
Register operand
Register operands are expressions with a percentage prefix:
\[\begin{array}{lcl}\langle\text{register-operand}\rangle & \to & \langle\text{reg}\rangle \\\langle\text{reg}\rangle & \to & \textbf{\%} \quad \langle\text{expression}\rangle \\\end{array}\]
Hence you can write %20, %10 + 10 or %5+2*2:
        addq    %20, %10 + 10, %5+2*2
Once the code generator has evaluated the expressions, the first two operands refer to the same register (%20). Note that %5+2*2 refers to register %9, because * binds tighter than + (as you can also see in the tree):
theon$ ulmas-test-parser ex3.s
("compilation_unit"
    ("instruction" ("mnemonic" 'addq' ), ("register_operand" ("reg" ("expression" ("integer" ("decimal_constant" '20' ) ) ) ) ), ("register_operand" ("reg" ("expression" ("+" ("integer" ("decimal_constant" '10' ) ), ("integer" ("decimal_constant" '10' ) ) ) ) ) ), ("register_operand" ("reg" ("expression" ("+" ("integer" ("decimal_constant" '5' ) ), ("*" ("integer" ("decimal_constant" '2' ) ), ("integer" ("decimal_constant" '2' ) ) ) ) ) ) ) )
)
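How such an expression gets its value can be sketched with a small recursive descent evaluator in Python that follows the expression grammar from above rule by rule. It is an illustration only and not the code of the ULM assembler: character constants are left out, the names tokenize and evaluate are made up, and using Python's floor division for / is an assumption.

```python
import re

TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+|[A-Za-z_.][A-Za-z_.0-9]*|[-+*/%()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text, symbols=None):
    tokens = tokenize(text)
    symbols = symbols or {}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def primary():                      # integer, identifier or ( simple-expression )
        token = advance()
        if token == "(":
            value = simple_expression()
            if advance() != ")":
                raise ValueError("missing )")
            return value
        if token[0].isdigit():
            if token.lower().startswith("0x"):
                return int(token, 16)
            return int(token, 8) if token.startswith("0") else int(token)
        return symbols[token]           # identifier: look up its value

    def factor():                       # primary or unary minus
        if peek() == "-":
            advance()
            return -primary()
        return primary()

    def term():                         # factors joined by * / %
        value = factor()
        while peek() in ("*", "/", "%"):
            op, rhs = advance(), factor()
            value = value * rhs if op == "*" else value // rhs if op == "/" else value % rhs
        return value

    def simple_expression():            # terms joined by + -
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if advance() == "+" else value - term()
        return value

    result = simple_expression()
    if pos != len(tokens):
        raise ValueError("unexpected token " + tokens[pos])
    return result


for expr in ["20", "10 + 10", "5+2*2"]:
    print(expr, "->", evaluate(expr))   # 20, 20 and 9
```

Because term handles * before simple_expression adds the results, 5+2*2 evaluates to 9 and not to 14. With symbols the same function also covers identifiers, e.g. evaluate("exit - 1", {"exit": 65}) gives 64.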
Memory operand
\[\begin{array}{lcl}\langle\text{memory-operand}\rangle & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{)} \\ & \to & \langle\text{displacement}\rangle \quad \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{)} \\ & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{reg} \rangle \quad \quad \textbf{)} \\ & \to & \textbf{(} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{reg} \rangle \quad \textbf{,} \quad \langle\text{scale} \rangle \quad \quad \textbf{)} \\\langle\text{displacement}\rangle & \to & \langle\text{expression}\rangle \\\langle\text{scale}\rangle & \to & \langle\text{expression}\rangle \\\end{array}\]
Directives
\[\begin{array}{lcl}\langle\text{directive}\rangle & \to & \langle\text{text-header}\rangle \\ & \to & \langle\text{data-header}\rangle \\ & \to & \langle\text{bss-header}\rangle \\ & \to & \langle\text{pseudo-op-data}\rangle \quad \langle\text{expression}\rangle \\ & \to & \langle\text{pseudo-op-string}\rangle \quad \textbf{string-literal} \\ & \to & \langle\text{pseudo-op-flag}\rangle \quad \langle\text{identifier}\rangle \\ & \to & \langle\text{pseudo-op-def}\rangle \quad \langle\text{identifier}\rangle \quad \textbf{,} \quad \langle\text{expression}\rangle \\\langle\text{text-header}\rangle & \to & \textbf{.text} \\\langle\text{data-header}\rangle & \to & \textbf{.data} \\\langle\text{bss-header}\rangle & \to & \textbf{.bss} \\\langle\text{pseudo-op-string}\rangle & \to & \textbf{.string} \\\langle\text{pseudo-op-def}\rangle & \to & \textbf{.equiv} \\ & \to & \textbf{.equ} \\\langle\text{pseudo-op-data}\rangle & \to & \textbf{.align} \\ & \to & \textbf{.space} \\ & \to & \textbf{.byte} \\ & \to & \textbf{.long} \\ & \to & \textbf{.quad} \\ & \to & \textbf{.word} \\\end{array}\]
More about the assembly language: Code generator
The code generator generates code for the text, data and BSS segment. In this process it also generates the symbol table and the relocation table.
Text, data and BSS segments
If a section is empty, no header or code for that section gets generated. For example, the “hello, world” program does not have a BSS segment, and you saw that it was not mentioned in the output. It does not matter in which order you specify the sections in the assembly program; in the generated output the order will always be text, data and then the BSS segment.
With the header directives .text, .data and .bss the programmer can switch between sections, i.e. specify to which section the subsequent lines of the program belong. If no section is specified (i.e. for lines with no preceding header directive), the text segment gets assembled by default. This was the case in the assembly program in Session 6.
Semantic error: Instruction cannot be translated
This instruction perfectly matches the grammar rules:
    addq (%1), %2, %3
However, the Instruction Set of the ULM does not allow adding a memory operand to a register. So obviously the code generator cannot create code for that:
theon$ ulmas error_ex1.s
ulmas: error_ex1.s:1.5-29: invalid operands: addq (%1), %2, %3
No semantic error: Undefined labels are fine
Using undefined labels is allowed:
        addq    foo, %2, %3
        ldzwq   func, %4
        jmp     %4, %1
This generates an “undefined” symbol entry in the symtab and an entry in the fixups, so that a linker can fix it:
#TEXT 4
0x0000000000000000: 38 00 02 03 # addq foo, %2, %3
0x0000000000000004: 56 00 00 04 # ldzwq func, %4
0x0000000000000008: 40 04 01 00 # jmp %4, %1
#SYMTAB
U foo 0x0000000000000000
U func 0x0000000000000000
#FIXUPS
text 0x0000000000000000 1 1 absolute foo
text 0x0000000000000004 1 2 absolute func
No semantic error: Using symbols and defining them afterwards
The ULM assembler is a multi-pass assembler. Symbols can be defined through labels or an .equ directive. If needed, the code generator traverses the syntax tree several times to resolve symbols. Hence, you can jump forward to a label without pre-declaring it:
        jmp     foo
foo:
#TEXT 4
0x0000000000000000: 41 00 00 01 # jmp foo
#SYMTAB
t foo 0x0000000000000004
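Looking at the three relative jumps in the listings on this page (jmp foo above, jz halt and jmp load in the “hello, world” program), the last three bytes appear to hold the distance to the target, counted in 4-byte instructions and stored as a signed 24-bit number. The following Python sketch reproduces these three encodings. Keep in mind that this rule is inferred from the listings and not taken from the instruction set reference; the function name is made up.

```python
def encode_relative_jump(opcode, instruction_address, target_address):
    """Opcode byte followed by a signed 24-bit offset counted in instructions."""
    offset = (target_address - instruction_address) // 4
    offset &= 0xFFFFFF                    # two's complement in 24 bits
    word = (opcode << 24) | offset
    return " ".join("%02X" % byte for byte in word.to_bytes(4, "big"))


print(encode_relative_jump(0x41, 0x00, 0x04))   # jmp foo  -> 41 00 00 01
print(encode_relative_jump(0x42, 0x0C, 0x1C))   # jz halt  -> 42 00 00 04
print(encode_relative_jump(0x41, 0x18, 0x04))   # jmp load -> 41 FF FF FB
```

A backward jump like jmp load has a negative offset (here -5), which shows up as FF FF FB.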
You can also use symbols and define them later in the program through a .equ directive:
        addq    foo, %2, %3
        .equ    foo, 42
The assembler traverses the syntax tree until no more symbols can be resolved. Any symbol that could not be resolved will be entered in the symbol table as undefined (and corresponding entries in the fixup table will be generated).
#TEXT 4
0x0000000000000000: 38 2A 02 03 # addq foo, %2, %3
#SYMTAB
a foo 0x000000000000002A
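The idea of “traverse until no more symbols can be resolved” can be sketched in Python as a loop that repeats as long as a pass makes progress. This is a toy model with made-up names, not the data structures of ulmas: every definition is either a constant, written as (None, value), or another symbol plus an addend, written as (symbol, addend). That one symbol may be defined in terms of a symbol defined later is an assumption of this sketch.

```python
def resolve(definitions, used):
    """Return (resolved symbol values, sorted list of undefined symbols)."""
    resolved = {}
    progress = True
    while progress:                       # one loop iteration models one pass
        progress = False
        for name, (base, addend) in definitions.items():
            if name in resolved:
                continue
            if base is None:              # plain constant
                resolved[name] = addend
                progress = True
            elif base in resolved:        # depends on an already resolved symbol
                resolved[name] = resolved[base] + addend
                progress = True
    undefined = sorted(set(used) - set(resolved))
    return resolved, undefined


definitions = {
    "a": ("b", 1),          # a = b + 1, but b is not known yet
    "b": ("foo", 0),        # b = foo
    "foo": (None, 42),      # like: .equ foo, 42
}
resolved, undefined = resolve(definitions, used=["a", "foo", "func"])
print(resolved)     # {'foo': 42, 'b': 42, 'a': 43}
print(undefined)    # ['func']
```

In this model func ends up as undefined, which corresponds to the U entries in the symbol table shown earlier.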
Assignment
Write a program factorial.s that computes \(n!\) and has the following properties:
- The value \(n\) is stored as a hard coded constant in the data segment of the program.
- Your program computes \(n!\) by the following algorithm:
  - \(\text{res} \leftarrow 1\)
  - While \(n \neq 0\)
    - \(\text{res} \leftarrow \text{res} \cdot n\)
    - \(n \leftarrow n - 1\)
  - Halt with exit code \(\text{res}\)
- Submit the assignment on theon as follows:
  submit hpc quiz02 factorial.s
You can use the following skeleton:
        .equ    N, 1
        .equ    RES, 2
        .data
n       .byte   5
        .text
/*
        Your code here!
*/
Assembly instruction for the multiplication
You can use the imulq instruction. Like the addq or subq instruction, it can be used for both signed and unsigned 64-bit integers. The result is a 64-bit integer. The instruction sets the CF and OF bits in the status register to indicate overflows. Other flags in the status register are not changed by the instruction.
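If you want to check the results of your assembly program, a reference model of the algorithm in Python can help. It is not a solution of the assignment, only the algorithm from above written down once more. The one assumption added here is that only the lower 64 bits of each product are kept, since the result of the multiplication is a 64-bit integer.

```python
MASK64 = (1 << 64) - 1


def factorial_64(n):
    res = 1
    while n != 0:
        res = (res * n) & MASK64    # keep only the lower 64 bits of the product
        n = n - 1
    return res


print(factorial_64(5))     # 120
print(factorial_64(20))    # 2432902008176640000
print(factorial_64(21))    # no longer equal to 21!, the product overflowed
```

So 20! is the largest factorial that still fits into an unsigned 64-bit integer; for larger n the model only returns n! modulo 2^64.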