ULM Assembler (Part 1): First Steps
Syntax Highlighting in Vim
You can enable syntax highlighting by:
-
Adding the following line to your ~/.vimrc:
autocmd FileType asm set syntax=ulmasm
-
And saving the following file ulmasm.vim in the directory ~/.vim/syntax/ (if this directory does not exist create it):
" Vim syntax file
" Language: ULM assembler
" Maintainer: Michael Christian Lehn <michael.lehn@uni-ulm.de>
" Last Change: 2020-2-15
" License: Vim (see :h license)

" For version 5.x: Clear all syntax items
" For version 6.x: Quit when a syntax file was already loaded
if version < 600
    syntax clear
elseif exists("b:current_syntax")
    finish
endif

syntax case match

syntax match asmKeyword /[a-zA-Z_]\+\>/ contained skipwhite
syntax match asmKeyword /\.align\>/ contained skipwhite
syntax match asmKeyword /\.bss\>/ contained skipwhite
syntax match asmKeyword /\.byte\>/ contained skipwhite
syntax match asmKeyword /\.data\>/ contained skipwhite
syntax match asmKeyword /\.equ\>/ contained skipwhite
syntax match asmKeyword /\.equiv\>/ contained skipwhite
syntax match asmKeyword /\.global\>/ contained skipwhite
syntax match asmKeyword /\.globl\>/ contained skipwhite
syntax match asmKeyword /\.long\>/ contained skipwhite
syntax match asmKeyword /\.quad\>/ contained skipwhite
syntax match asmKeyword /\.set\>/ contained skipwhite
syntax match asmKeyword /\.space\>/ contained skipwhite
syntax match asmKeyword /\.string\>/ contained skipwhite
syntax match asmKeyword /\.text\>/ contained skipwhite
syntax match asmKeyword /\.word\>/ contained skipwhite

syntax match asmColon ":" nextgroup=asmKeyword skipwhite
syntax match asmDelimiter /[.$%,()]\|@w[0-3]/

syntax match asmLiteral /[1-9][0-9]*/
syntax match asmLiteral /[0-7][0-7]*/
syntax match asmLiteral /0x[0-9a-zA-Z][0-9a-zA-Z]*/
syntax region asmLiteral start=/"/ skip=/\\"/ end=/"/

syntax match asmIdentifier /[A-Za-z_.][A-Za-z0-9_.]*/

syntax match asmLabel /^[A-Za-z_.][A-Za-z0-9_.]*/ nextgroup=asmKeyword,asmColon,asmComment skipwhite
syntax match asmLabel /^[ \t][ \t]*/ nextgroup=asmKeyword skipwhite
syntax match asmStillLabel /[A-Za-z_.][A-Za-z0-9_.]*/ contained nextgroup=asmKeyword,asmColon,asmComment skipwhite
syntax match asmStillLabel /[ \t][ \t]*/ contained nextgroup=asmKeyword skipwhite

syntax region asmComment start="^//" end="$"
syntax region asmComment start="//" end="$"
syntax region asmComment start="/\*" end="\*/"
syntax region asmComment start="^/\*" end="\*/" nextgroup=asmStillLabel
syntax region asmComment start="^#" end="$"
syntax region asmComment start="#" end="$"

highlight link asmComment Comment
highlight link asmLabel Label
highlight link asmStillLabel Label
highlight link asmLiteral Number
highlight link asmIdentifier Identifier
highlight link asmKeyword Type
highlight link asmString String
Note that this syntax description for Vim is far from perfect. Every identifier in the mnemonic/pseudo-op field that does not begin with a dot will be highlighted as a keyword. This is due to this rule:
syntax match asmKeyword /[a-zA-Z_]\+\>/ contained skipwhite
A better solution would be a list (generated from the isa.txt) that contains all mnemonics that are actually defined. This, however, means that every isa.txt variant would need its own Vim syntax description.
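The generator idea can be sketched in a few lines of Python. This is only an illustration, not part of the ULM tools: it assumes that mnemonics appear in isa.txt on lines of the form ": mnemonic operands" (as in the instruction set listed on this page), and the function names extract_mnemonics and vim_keyword_rules are made up for this example.

```python
import re

def extract_mnemonics(isa_text):
    """Collect the mnemonics from lines of the form ': mnemonic operands'."""
    mnemonics = set()
    for line in isa_text.splitlines():
        m = re.match(r"\s*:\s*([A-Za-z_][A-Za-z0-9_]*)", line)
        if m:
            mnemonics.add(m.group(1))
    return sorted(mnemonics)

def vim_keyword_rules(mnemonics):
    """Emit one 'syntax match asmKeyword' rule per mnemonic."""
    return [
        f"syntax match asmKeyword /{name}\\>/ contained skipwhite"
        for name in mnemonics
    ]

# A small excerpt in the style of the isa.txt shown on this page.
SAMPLE_ISA = """\
0x01 RRR
: halt %X
    ulm_halt(ulm_regVal(X));
0x06 J26
: jnz XYZ
: jne XYZ
    ulm_conditionalRelJump(ulm_statusReg[ULM_ZF] == 0, XYZ);
"""

if __name__ == "__main__":
    for rule in vim_keyword_rules(extract_mnemonics(SAMPLE_ISA)):
        print(rule)
```

The generated rules would replace the catch-all asmKeyword rule quoted above, at the price of one syntax file per isa.txt variant.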
Instruction Set Used in the Video
RRR (OP u 8) (X u 8) (Y u 8) (Z u 8)
J26 (OP u 8) (XYZ j 24)
U16R (OP u 8) (XY u 16) (Z u 8)
0x01 RRR
: halt %X
ulm_halt(ulm_regVal(X));
0x02 RRR
: getc %X
ulm_setReg(ulm_readChar() & 0xFF, X);
0x03 RRR
: putc %X
ulm_printChar(ulm_regVal(X) & 0xFF);
0x04 J26
: jmp XYZ
ulm_unconditionalRelJump(XYZ);
0x05 RRR
: subq X, %Y, %Z
ulm_sub64(X, ulm_regVal(Y), Z);
0x06 J26
: jnz XYZ
: jne XYZ
ulm_conditionalRelJump(ulm_statusReg[ULM_ZF] == 0, XYZ);
0x07 J26
: jz XYZ
: je XYZ
ulm_conditionalRelJump(ulm_statusReg[ULM_ZF] == 1, XYZ);
0x08 U16R
: ldzwq XY, %Z
ulm_setReg(XY, Z);
0x09 RRR
: movzbq (%X), %Z
ulm_fetch64(0, X, 0, 0, ULM_ZERO_EXT, 1, Z);
0x0A RRR
: addq X, %Y, %Z
ulm_add64(X, ulm_regVal(Y), Z);
0x0B RRR
: imulq %X, %Y, %Z
ulm_mul64(ulm_regVal(X), ulm_regVal(Y), Z);
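To make the three instruction formats more concrete, here is a small Python sketch that splits a 32-bit instruction word into its fields. It assumes that the fields are stored in the listed order starting with the most significant byte, and that the field type j denotes a signed (two's complement) value; both assumptions agree with the bytes in the assembler output shown further down, but the function names are made up for this illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def decode(word, fmt):
    """Split a 32-bit instruction word into the fields of the given format."""
    op = (word >> 24) & 0xFF
    if fmt == "RRR":    # (OP u 8) (X u 8) (Y u 8) (Z u 8)
        return {"OP": op, "X": (word >> 16) & 0xFF,
                "Y": (word >> 8) & 0xFF, "Z": word & 0xFF}
    if fmt == "J26":    # (OP u 8) (XYZ j 24)
        return {"OP": op, "XYZ": sign_extend(word & 0xFFFFFF, 24)}
    if fmt == "U16R":   # (OP u 8) (XY u 16) (Z u 8)
        return {"OP": op, "XY": (word >> 8) & 0xFFFF, "Z": word & 0xFF}
    raise ValueError(f"unknown format {fmt!r}")

if __name__ == "__main__":
    print(decode(0x08002001, "U16R"))   # the encoding of ldzwq msg, %addr
    print(decode(0x04FFFFFB, "J26"))    # the encoding of jmp fetch
```

For the word 04 FF FF FB the decoded jump offset is negative, which fits a backward jump to the label fetch.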
The Hello World Assembly Program Shown in the Video
        .equ    addr,   1
        .data
msg     .string "hello, world!\n"
        .equ    ch,     2
        .text
        ldzwq   msg,    %addr
fetch:
        movzbq  (%addr),%ch
        subq    0,      %ch,    %0
        je      halt
        putc    %ch
        addq    1,      %addr,  %addr
        jmp     fetch
halt:
        halt    %0
Translating the Assembly Program into an Executable
With
theon$ 1_ulm_build/hello/ulmas 0_ulm_variants/hello/hello.s
the following executable a.out (the default name for the assembler output) gets created:
#TEXT 4
0x0000000000000000: 08 00 20 01 # ldzwq msg, %addr
0x0000000000000004: 09 01 00 02 # movzbq (%addr), %ch
0x0000000000000008: 05 00 02 00 # subq 0, %ch, %0
0x000000000000000C: 07 00 00 04 # je halt
0x0000000000000010: 03 02 00 00 # putc %ch
0x0000000000000014: 0A 01 01 01 # addq 1, %addr, %addr
0x0000000000000018: 04 FF FF FB # jmp fetch
0x000000000000001C: 01 00 00 00 # halt %0
#DATA 1
0x0000000000000020: 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A 00 # .string "hello, world!\n"
#BSS 1 0
#SYMTAB
a addr 0x0000000000000001
d msg 0x0000000000000000
a ch 0x0000000000000002
t fetch 0x0000000000000004
t halt 0x000000000000001C
#FIXUPS
text 0x0000000000000000 8 16 absolute [data]+0
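The bytes in this listing can be checked with a tiny simulator. The following Python sketch is not the ULM virtual machine; it only covers the opcodes used by the hello program and rests on several assumptions: the data segment is placed directly behind the text segment, register %0 always reads as zero, subq X, %Y, %Z computes Y - X, and a relative jump adds four times the signed offset to the address of the jump instruction itself (which is consistent with the encodings of je halt and jmp fetch above).

```python
TEXT = bytes.fromhex(
    "08002001 09010002 05000200 07000004"
    "03020000 0A010101 04FFFFFB 01000000"
)
DATA = bytes.fromhex("68656C6C6F2C20776F726C64210A00")

def run(text, data, max_steps=10_000):
    """Run the 'hello, world' instruction subset; return (output, exit code)."""
    mem = bytearray(text) + bytearray(data)  # data follows text (alignment 1)
    reg = [0] * 256
    zf, ip, out = 0, 0, []

    def set_reg(r, value):
        if r != 0:                           # assumption: %0 always reads as zero
            reg[r] = value % 2**64

    for _ in range(max_steps):
        op, x, y, z = mem[ip:ip + 4]
        if op == 0x01:                       # halt %X
            return "".join(out), reg[x]
        if op in (0x04, 0x06, 0x07):         # jmp, jnz/jne, jz/je
            taken = op == 0x04 or (op == 0x06 and zf == 0) or (op == 0x07 and zf == 1)
            offset = x << 16 | y << 8 | z    # 24-bit field XYZ
            if offset & 0x800000:            # negative in two's complement
                offset -= 1 << 24
            ip += 4 * offset if taken else 4
            continue
        if op == 0x03:                       # putc %X
            out.append(chr(reg[x] & 0xFF))
        elif op == 0x05:                     # subq X, %Y, %Z
            result = (reg[y] - x) % 2**64
            zf = int(result == 0)
            set_reg(z, result)
        elif op == 0x0A:                     # addq X, %Y, %Z
            result = (reg[y] + x) % 2**64
            zf = int(result == 0)
            set_reg(z, result)
        elif op == 0x08:                     # ldzwq XY, %Z
            set_reg(z, x << 8 | y)
        elif op == 0x09:                     # movzbq (%X), %Z
            set_reg(z, mem[reg[x]])
        else:
            raise ValueError(f"opcode {op:#04x} not covered by this sketch")
        ip += 4
    raise RuntimeError("step limit reached")

if __name__ == "__main__":
    output, exit_code = run(TEXT, DATA)
    print(repr(output), exit_code)
```

Stepping through the loop by hand is a good exercise: the jump at 0x18 leads back to 0x04, and the jump at 0x0C leads to 0x1C once the zero byte has been fetched.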
Format of the Assembler Output
The generated output consists of different sections. Each of these sections starts with a header (which also separates the section from the previous one). The following four sections of the output are described here:
#TEXT <alignment>
is the header for the text segment. This section contains the instructions for the “hello, world” program.
The alignment parameter is (for convenience) a decimal numeral and in this case 4. It specifies that the loader has to copy this section to a memory block with a start address that is a multiple of 4. For the moment this is not relevant as in this case the start address is 0 (and hence a multiple of any integer that is not zero).
Lines of the text segment have the format
[address:] instruction [# comments]
Addresses and comments are optional (and for convenience). The instructions are in hexadecimal. By default the loader copies the text segment to memory beginning at address 0.
Compare the instructions of the text segment with the memory content from address 0x00 to 0x20.
#DATA <alignment>
is the header for the data segment. This section contains the data for the “hello, world” program and has the same format as the text segment (i.e. optional address, actual data, optional comments).
The loader copies the data segment to memory such that it follows the text segment. Like the text segment, the data segment can have alignment restrictions. That means in general there can be a gap in memory between the text and data segment. However, in the “hello, world” program the alignment of the data segment is 1. Hence, the loader begins the data segment at address 0x20 (where the text segment ended).
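How the alignment determines the start of the data segment can be sketched as follows. This is only an illustration of the rule described above, not the code of the actual loader; the names align_up and place_segments are made up.

```python
def align_up(address, alignment):
    """Smallest multiple of alignment that is not below address."""
    return (address + alignment - 1) // alignment * alignment

def place_segments(text_size, text_align, data_align):
    """Return the start addresses of the text and the data segment."""
    text_start = align_up(0, text_align)
    data_start = align_up(text_start + text_size, data_align)
    return text_start, data_start

if __name__ == "__main__":
    # hello program: 0x20 bytes of text (alignment 4), data alignment 1
    text_start, data_start = place_segments(0x20, 4, 1)
    print(hex(text_start), hex(data_start))
    # a data label is stored relative to the data segment (msg has value 0)
    print(hex(data_start + 0))
```

With a text segment of 0x21 bytes and a data alignment of 8, the same function would leave a gap and start the data segment at 0x28.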
#SYMTAB
is the header for the symbol table.
Labels and .equ directives generate symbols that have a value and a type:
-
Labels in the text segment have type text and the value is an address within the text segment. For example, the text symbol halt has the value 0x1C.
-
Accordingly, labels in the data segment have type data and the value is the address relative to the beginning of the data segment. For example, the data symbol msg has value 0x00 (and not 0x20).
-
The .equ directive defines a symbol of type absolute with a given value. Here for example, the symbol addr has value 1.
For loading (and running) the program the symbol table is not relevant. But it is relevant for linking (which will be covered in upcoming sessions).
If an instruction or directive contains an undefined symbol an entry of type undefined and value 0 is added to the symbol table.
#FIXUPS
is the header for the relocation table.
Like the symbol table this will be relevant for linking and will be covered in upcoming sessions.
Memory layout of the “hello, world” program
The information of the text and data segment together with the text and data labels from symbol table can be visualized by
Disassembling the instructions in the text segment and interpreting the bytes in the data segment as a zero-terminated string lets us almost see the original source code in the memory layout (due to spacing issues the symbols addr and ch for the literals \(1\) and \(2\) respectively are not used):
From the symbol table we also know that the symbols addr and ch were defined with absolute values 1 and 2 respectively. However, we cannot determine where the symbols were used. So for example, we don't know that the first instruction was written as ldzwq msg, %addr in the source file. In practice that makes it hard to understand disassembled programs where the original source is not available (and there are actually legal cases where you have to deal with such problems).
Pointers! Start learning about them here!
Have a look at what the first two instructions are doing and how this can be represented descriptively.
ldzwq msg, %addr
The assembler replaces the label msg with the address of the h in the “hello, world!” string. That means msg has value 0x20 (or 32 in decimal).
Obviously the address of the string depends on where the loader will copy the data segment when we run the program. And this in turn depends on the size of the text segment. It requires some kind of bookkeeping to figure out the actual address of the string by just looking at the assembly source code. But it is possible, we can do it and the assembler can do it. But it is less error prone if the assembler is doing it, and using labels delegates this job to the assembler.
So after this instruction the value in %addr has the meaning “address of the first character in the string”. So we think of %addr as being a “pointer to the first character in the string”:
movzbq (%addr), %ch
This instruction copies the value at address %addr into %ch.
In this instruction the pointer gets dereferenced. And it is impossible to overestimate how important it will be to understand what dereferencing a pointer means. So I will explain and talk about it more than once.
Dereferencing means that we refer to a value at the end of a pointer. And this requires two pieces of information:
-
Location: “Where does the pointer point to?”
This information is the address stored as value in the pointer. So in this case the value in %addr. Note the difference between “value of %addr” and “value at address %addr”.
-
Type information: “What is the value at the end of the pointer?”
In the “hello, world” example the value is the byte at the end of the pointer, and this byte has here the meaning of being the ASCII value of a character. This information is only given by the context and is not stored in any register or anywhere else. We only know it because the instruction movzbq is used, which copies a single byte from the end of the pointer, zero extends it and stores it in the destination register %ch.
In general you have to do the bookkeeping: You have to know how many bytes the dereferenced value consists of, and whether it is a character, a signed or unsigned integer, or something else. You have zero support from the assembly language for bookkeeping this kind of type information.
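The two pieces of information can be made explicit in a small Python model. This is just a sketch: memory is modelled as a byte array with the string at address 0x20, multi-byte values are assumed to be stored big endian, and the helper fetch is made up for this example (it is not the ulm_fetch64 function from the instruction set description).

```python
# Memory model: 0x20 bytes standing in for the text segment, then the string.
MEMORY = bytearray(0x20) + b"hello, world!\n\x00"

def fetch(memory, address, num_bytes, signed=False):
    """Read num_bytes starting at address and extend them to an integer."""
    raw = bytes(memory[address:address + num_bytes])
    return int.from_bytes(raw, "big", signed=signed)

addr = 0x20                     # value OF the pointer: a location
ch = fetch(MEMORY, addr, 1)     # value AT the pointer, read as one unsigned byte

if __name__ == "__main__":
    print(hex(addr), hex(ch), chr(ch))
    # Same location, different "type": two bytes instead of one.
    print(hex(fetch(MEMORY, addr, 2)))
```

The address alone does not say whether one, two or eight bytes should be read, or whether they are signed. In the sketch this information is passed as arguments; in the assembly program it is hidden in the choice of the instruction.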
In this context we can illustrate the meaning of (%addr) as follows:
How to know if a register is storing a pointer?
The assembler does not know whether a register is used as pointer. This is also up to you. You give the register its meaning. And this meaning can change! You have to do the bookkeeping. And you have to keep your bookkeeping up to date.
Consider this modification of the first two instructions:
        ldzwq   msg,    %3
        movzbq  (%3),   %3
After the first instruction, register %3 can have the meaning “pointer to the string”. After the second instruction it has the meaning “first character of the string”.
Some personal opinion/experience
When learning C/C++ understanding pointers and how to use them is the hardest part. My rule of thumb is that every non-trivial bug in a C/C++ program is related to pointers. The dangerous thing is this combination:
-
In the C/C++ programming languages the compilers do a lot of the necessary bookkeeping. So compared to programming in assembler the C/C++ compilers can detect many bugs related to pointers. Some of the bugs that slip through this line of defence can be detected by additional tools.
-
Slightly exaggerated but true in essence: If a bug slipped through it is impossible to find it. You don't even know if a bug slipped through! Because such a bug might only once in a while cause the program to crash, or worse, the bug does not crash the program and just causes wrong results.
Because some bugs are detected in C/C++ it is tempting to use these languages and underestimate the danger.
The advantage of programming in assembly is: You will never underestimate what can go wrong! And be aware that you are learning some non-trivial concepts. Be patient now if things don't work out the first time and try to understand the underlying reason. This will allow you to do some solid programming in C/C++ later.
More about the assembly language: Tokens
As I said in the video, it seems odd at first that, for example, halt can be used as a label. How can the assembler distinguish the two meanings?
Field format of source lines
This is handled by the scanner during the lexical analysis. The format of the source lines consists of fields:
[label] [operators] [operands]
Tokens for mnemonics (e.g. “addq”, “halt”, etc.) and pseudo operators (e.g. “.string”, “.byte”, etc.) are only generated from the operator field. So in other fields they can be used as identifiers. For example, from this code
addq                                        # some label
        addq    %0,     %12,    %0x3        // an instruction
        .quad   4                           /* some data
                                            */
the scanner (you find the lexer test program in 1_ulm_build/hello/.build/ulmas1/) generates the following tokens:
theon$ 1_ulm_build/hello/.build/ulmas1/xtest_lexer < lex_example.s
1.1-1.5: IDENT 'addq' 'addq'
1.57-2.1: EOL '' ''
2.1-2.1: EMPTY_LABEL '' ''
2.9-2.13: ADDQ 'addq' ''
2.17-2.18: PERCENT '%' ''
2.18-2.19: OCTAL_LITERAL '0' ''
2.19-2.20: COMMA ',' ''
2.25-2.26: PERCENT '%' ''
2.26-2.28: DECIMAL_LITERAL '12' ''
2.28-2.29: COMMA ',' ''
2.33-2.34: PERCENT '%' ''
2.34-2.37: HEXADECIMAL_LITERAL '0x3' ''
3.9-3.14: IDENT '.quad' '.quad'
3.17-3.18: DECIMAL_LITERAL '4' ''
4.47-5.1: EOL '' ''
So note that the character sequence “addq” was first detected as an identifier (IDENT) and in the second case as mnemonic (ADDQ).
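The idea that the field decides the token kind can be sketched in Python. This is not the scanner of ulmas: it assumes that comments are already removed, that a label starts in the first column and may be followed by a colon, and it only classifies the label and the operator field. The token names follow the lexer output above; the function names are made up.

```python
import re

MNEMONICS = {"addq", "getc", "halt", "imulq", "je", "jmp", "jne",
             "jz", "ldzwq", "movzbq", "putc", "subq"}

# label field (optional, first column), operator field, operands field
LINE = re.compile(r"^([A-Za-z_.][A-Za-z0-9_.]*)?:?\s*(\S+)?\s*(.*)$")

def scan_fields(line):
    """Split a comment-free source line into (label, operator, operands)."""
    label, operator, operands = LINE.match(line).groups()
    return label or "", operator or "", operands

def classify(line):
    """Return the tokens for the label and the operator field of a line."""
    label, operator, _ = scan_fields(line)
    tokens = [("IDENT", label) if label else ("EMPTY_LABEL", "")]
    if operator:
        # Only here, in the operator field, a mnemonic becomes its own token.
        kind = operator.upper() if operator in MNEMONICS else "IDENT"
        tokens.append((kind, operator))
    return tokens

if __name__ == "__main__":
    print(classify("addq"))
    print(classify("        addq    %0, %12, %0x3"))
    print(classify("halt    halt    %0"))
```

In the last call the first halt ends up as an identifier and the second one as a mnemonic, although both are the same character sequence.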
Comments
You also might notice that comments are removed. And comments can be used in different flavors:
-
Single-line comments start with “#” or “//”
-
“/*” begins a multi-line comment and “*/” ends a multi-line comment
Tokens recognized only in the operator field
As specified in the grammar, a mnemonic is part of an instruction and a pseudo operator part of a directive.
Mnemonics
The mnemonics are specified by the instruction set. In the isa.txt for this video these were
addq
getc
halt
imulq
je
jmp
jne
jz
ldzwq
movzbq
putc
subq
Pseudo operators
.align
.byte
.comm
.data
.equ
.equiv
.globl
.global
.lcomm
.long
.quad
.set
.string
.text
.word
Tokens recognized in the label or operands fields
Identifiers
Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z'), an underscore '_', or a dot '.' and are optionally continued with a sequence of more letters, underscores, dots, or decimal digits 0 to 9.
Hence “foo”, “.fOo”, “.fOo1”, “_”, “.” are allowed but not “2foo”.
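The rule for identifiers translates directly into a regular expression. A minimal Python sketch for experimenting with it (the name is_identifier is made up):

```python
import re

# first character: letter, underscore or dot; then also decimal digits
IDENTIFIER = re.compile(r"[A-Za-z_.][A-Za-z0-9_.]*")

def is_identifier(text):
    """Check whether the complete text is an identifier."""
    return IDENTIFIER.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["foo", ".fOo", ".fOo1", "_", ".", "2foo"]:
        print(candidate, is_identifier(candidate))
```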
Empty label
You also see a token called EMPTY_LABEL in this example. This token gets generated when the label field is empty. For the parser (and for describing the grammar) it is important that every line has a label (which can be empty, but it exists). In all other cases white space characters get consumed by the scanner.
Literals
-
Decimal, hexadecimal and octal constants (e.g. 12, 0x2a, 017). These constants are all unsigned and encoded with 64 bits.
-
Decimal literals begin with a digit '1' to '9', optionally followed by more decimal digits '0' to '9'.
-
Octal literals begin with the digit '0', optionally followed by more octal digits '0' to '7'.
-
Hexadecimal literals begin with the prefix '0x' or '0X' followed by one or more hexadecimal digits '0' to '9', 'a' to 'f', 'A' to 'F'.
-
Character constants (e.g. 'a' or '😎') can be used as integers. The value is determined by the ASCII code or, more generally, the UTF-8 code.
-
String literals (e.g. “hello, world!”)
-
End of line (newline character ASCII code 10)
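The rules for integer literals can be tried out with the following Python sketch. It is not the scanner of ulmas; the token names are taken from the lexer output shown earlier, and the function classify_literal is made up. Note that a single 0 counts as an octal literal, just as in that output.

```python
import re

# (token name, pattern for the complete literal, base)
LITERAL_RULES = [
    ("HEXADECIMAL_LITERAL", re.compile(r"0[xX][0-9a-fA-F]+"), 16),
    ("DECIMAL_LITERAL", re.compile(r"[1-9][0-9]*"), 10),
    ("OCTAL_LITERAL", re.compile(r"0[0-7]*"), 8),
]

def classify_literal(text):
    """Return (token name, value) with the value reduced to 64 bits."""
    for kind, pattern, base in LITERAL_RULES:
        if pattern.fullmatch(text):
            return kind, int(text, base) % 2**64
    raise ValueError(f"not an integer literal: {text!r}")

if __name__ == "__main__":
    for text in ["12", "0x2a", "017", "0"]:
        print(text, classify_literal(text))
```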
Punctuators/Delimiters
+
-
*
/
%
(
)
:
,
$
{
}
<
>
Some of these punctuators are used for expressions (+, -, *, /, %, ( and )). You also can use them for your assembly notation. But there is a restriction:
If the '(' punctuator is used in the assembly notation then the next punctuator has to be the '%' punctuator. Hence movq (X), %Y would not be allowed.
As the parentheses are also used in expressions this requirement makes it easier (or in my humble opinion possible at all) to use a recursive descent parser.
If you dislike this restriction use the braces '{' and '}' in your assembly notation instead.
Immediate operator
@w0
@w1
@w2
@w3
These operators can be used to extract a particular word from a 64-bit literal or symbol (e.g. a label):
-
@w0(label) gives the least significant word,
-
...,
-
@w3(label) the most significant word.
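In terms of bit operations the four operators can be sketched like this in Python. The sketch assumes that a word has 16 bits (a quarter of a 64-bit value); the function name word is made up.

```python
def word(value, k):
    """Return 16-bit word k (0 = least significant) of a 64-bit value."""
    if not 0 <= k <= 3:
        raise ValueError("word index has to be 0, 1, 2 or 3")
    return (value >> (16 * k)) & 0xFFFF

if __name__ == "__main__":
    value = 0x1122334455667788
    for k in range(4):
        print(f"@w{k}:", hex(word(value, k)))
```

Putting the four words back together with shifts gives the original 64-bit value again, which is what makes these operators useful for loading large constants piece by piece.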
More about the assembly language: Grammar
With your definition of the assembly notation you define some part of the grammar. Basically you define it by example and the generator extracts the formal grammar for instructions. Your grammar rules are then embedded into the grammar for the assembler.
Structure of an assembly program
The grammar describes an assembly program as a sequence of instructions and directives (pseudo instructions):
\[
\begin{array}{lcl}
\langle\text{compilation-unit}\rangle & \to & \langle\text{}\rangle \\
 & \to & \langle\text{sequence}\rangle \\
\langle\text{sequence}\rangle & \to & \langle\text{labelled-op}\rangle \\
 & \to & \langle\text{regular-op}\rangle \\
 & \to & \langle\text{sequence}\rangle \quad \langle\text{labelled-op}\rangle \\
 & \to & \langle\text{sequence}\rangle \quad \langle\text{regular-op}\rangle \\
\langle\text{labelled-op}\rangle & \to & \langle\text{label}\rangle \quad \langle\text{op}\rangle \\
\langle\text{regular-op}\rangle & \to & \langle\text{empty-label}\rangle \quad \langle\text{op}\rangle \\
\langle\text{op}\rangle & \to & \textbf{eol} \\
 & \to & \langle\text{instruction}\rangle \quad \textbf{eol} \\
 & \to & \langle\text{directive}\rangle \quad \textbf{eol} \\
\end{array}
\]
Instructions
This is the part of the grammar that you define. The fields (e.g. X, Y, Z) from the instruction format can be used in the notation. For these fields the parser then accepts an expression.
Expressions
\[
\begin{array}{lcl}
\langle\text{expression}\rangle & \to & \langle\text{simple-expression}\rangle \\
\langle\text{simple-expression}\rangle & \to & \langle\text{term}\rangle \\
 & \to & \langle\text{simple-expression}\rangle \quad \textbf{+} \quad \langle\text{term}\rangle \\
 & \to & \langle\text{simple-expression}\rangle \quad \textbf{-} \quad \langle\text{term}\rangle \\
\langle\text{term}\rangle & \to & \langle\text{factor}\rangle \\
 & \to & \langle\text{term}\rangle \quad \textbf{*} \quad \langle\text{factor}\rangle \\
 & \to & \langle\text{term}\rangle \quad \textbf{/} \quad \langle\text{factor}\rangle \\
 & \to & \langle\text{term}\rangle \quad \textbf{\%} \quad \langle\text{factor}\rangle \\
\langle\text{factor}\rangle & \to & \langle\text{primary}\rangle \\
 & \to & \langle\text{unary-minus}\rangle \\
\langle\text{unary-minus}\rangle & \to & \textbf{-} \quad \langle\text{primary}\rangle \\
\langle\text{primary}\rangle & \to & \langle\text{integer}\rangle \\
 & \to & \langle\text{identifier}\rangle \\
 & \to & \textbf{(} \quad \langle\text{simple-expression}\rangle \quad \textbf{)} \\
\langle\text{integer}\rangle & \to & \text{decimal-constant} \\
 & \to & \text{hexadecimal-constant} \\
 & \to & \text{octal-constant} \\
 & \to & \text{char-constant} \\
\langle\text{identifier}\rangle & \to & \text{ident} \\
\end{array}
\]
In the simplest cases an expression is just an identifier or a constant. The constants can be decimal, hexadecimal, octal and character constants. Here is an example of halt instructions that all have the same exit code, given by an expression:
        halt    65      // exit code as decimal constant
        halt    0x41    // exit code as hexadecimal constant
        halt    0101    // exit code as octal constant
        halt    'A'     // exit code as character constant
        .equ    exit,   'A' /* here instead of 'A' you also could write
                               65, 0x41, 0101.
                            */
        halt    exit
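The expression grammar maps almost one to one onto a recursive descent evaluator. The following Python sketch is not the parser of ulmas: the left-recursive rules are turned into loops, symbols come from a plain dictionary, character constants are limited to a single character, and values are ordinary Python integers (no 64-bit wrap-around, division rounds down). All names are made up for this illustration.

```python
import re

TOKEN = re.compile(
    r"\s*(0[xX][0-9a-fA-F]+|[0-9]+|'[^']'|[A-Za-z_.][A-Za-z0-9_.]*|[-+*/%()])")

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input: {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(text, symbols=None):
    """Evaluate an expression; symbols maps identifiers to values."""
    symbols = symbols or {}
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def primary():                      # integer, identifier or ( ... )
        tok = advance()
        if tok == "(":
            value = simple_expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok[0] == "'":               # char-constant
            return ord(tok[1])
        if tok[0].isdigit():            # hexadecimal, octal or decimal
            base = 16 if tok[:2].lower() == "0x" else 8 if tok[0] == "0" else 10
            return int(tok, base)
        if tok in symbols:
            return symbols[tok]
        raise SyntaxError(f"unexpected token or undefined symbol: {tok!r}")

    def factor():                       # primary or unary-minus
        if peek() == "-":
            advance()
            return -primary()
        return primary()

    def term():                         # factor { (*|/|%) factor }
        value = factor()
        while peek() in ("*", "/", "%"):
            op, rhs = advance(), factor()
            value = value * rhs if op == "*" else value // rhs if op == "/" else value % rhs
        return value

    def simple_expression():            # term { (+|-) term }
        value = term()
        while peek() in ("+", "-"):
            op, rhs = advance(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = simple_expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return result

if __name__ == "__main__":
    for text in ["65", "0x41", "0101", "'A'", "exit", "(1 + 2) * 3"]:
        print(text, "->", evaluate(text, {"exit": ord("A")}))
```

All four constants from the halt example, and the symbol exit, evaluate to the same exit code.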