===================================
ULM Assembler (Part 1): First Steps                                        [TOC]
===================================

---- VIDEO ------------------------------
https://www.youtube.com/embed/chnIUD6551g
-----------------------------------------

Syntax Highlighting in Vim
==========================

You can enable syntax highlighting by:

- Adding the following line to your `~/.vimrc`

  ---- CODE (type=vim) ---------------------------------------------------------
  autocmd FileType asm set syntax=ulmasm
  ------------------------------------------------------------------------------

- And saving the following file `ulmasm.vim` in the directory `~/.vim/syntax/`
  (if this directory does not exist, create it):

  ---- CODE (file=session09/ulmasm.vim, fold) ----------------------------------
  " Vim syntax file
  " Language:     ULM assembler
  " Maintainer:   Michael Christian Lehn
  " Last Change:  2020-2-15
  " License:      Vim (see :h license)

  " For version 5.x: Clear all syntax items
  " For version 6.x: Quit when a syntax file was already loaded
  if version < 600
      syntax clear
  elseif exists("b:current_syntax")
      finish
  endif

  syntax case match

  syntax match asmKeyword /[a-zA-Z_]\+\>/ contained skipwhite
  syntax match asmKeyword /\.align\>/ contained skipwhite
  syntax match asmKeyword /\.bss\>/ contained skipwhite
  syntax match asmKeyword /\.byte\>/ contained skipwhite
  syntax match asmKeyword /\.data\>/ contained skipwhite
  syntax match asmKeyword /\.equ\>/ contained skipwhite
  syntax match asmKeyword /\.equiv\>/ contained skipwhite
  syntax match asmKeyword /\.global\>/ contained skipwhite
  syntax match asmKeyword /\.globl\>/ contained skipwhite
  syntax match asmKeyword /\.long\>/ contained skipwhite
  syntax match asmKeyword /\.quad\>/ contained skipwhite
  syntax match asmKeyword /\.set\>/ contained skipwhite
  syntax match asmKeyword /\.space\>/ contained skipwhite
  syntax match asmKeyword /\.string\>/ contained skipwhite
  syntax match asmKeyword /\.text\>/ contained skipwhite
  syntax match asmKeyword /\.word\>/ contained skipwhite

  syntax match asmColon ":" nextgroup=asmKeyword skipwhite
  syntax match asmDelimiter /[.$%,()]\|@w[0-3]/

  syntax match asmLiteral /[1-9][0-9]*/
  syntax match asmLiteral /[0-7][0-7]*/
  syntax match asmLiteral /0x[0-9a-zA-Z][0-9a-zA-Z]*/
  syntax region asmLiteral start=/"/ skip=/\\"/ end=/"/

  syntax match asmIdentifier /[A-Za-z_.][A-Za-z0-9_.]*/

  syntax match asmLabel /^[A-Za-z_.][A-Za-z0-9_.]*/ nextgroup=asmKeyword,asmColon,asmComment skipwhite
  syntax match asmLabel /^[ \t][ \t]*/ nextgroup=asmKeyword skipwhite
  syntax match asmStillLabel /[A-Za-z_.][A-Za-z0-9_.]*/ contained nextgroup=asmKeyword,asmColon,asmComment skipwhite
  syntax match asmStillLabel /[ \t][ \t]*/ contained nextgroup=asmKeyword skipwhite

  syntax region asmComment start="^//" end="$"
  syntax region asmComment start="//" end="$"
  syntax region asmComment start="/\*" end="\*/"
  syntax region asmComment start="^/\*" end="\*/" nextgroup=asmStillLabel
  syntax region asmComment start="^#" end="$"
  syntax region asmComment start="#" end="$"

  highlight link asmComment Comment
  highlight link asmLabel Label
  highlight link asmStillLabel Label
  highlight link asmLiteral Number
  highlight link asmIdentifier Identifier
  highlight link asmKeyword Type
  highlight link asmString String
  ------------------------------------------------------------------------------

Note that this syntax description for Vim is far from perfect. Every identifier
in the mnemonic/pseudo-op field that does not begin with a dot will be
highlighted as a keyword. This is due to this rule:

---- CODE (type=vim) -----------------------------------------------------------
syntax match asmKeyword /[a-zA-Z_]\+\>/ contained skipwhite
--------------------------------------------------------------------------------

A better solution would be to use a list (generated from the `isa.txt`) that
contains all mnemonics that are actually defined. This, however, means that
every `isa.txt` variant would need its own Vim syntax description.
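As a rough illustration of that idea, the following Python sketch prints one
Vim rule per mnemonic, in the same shape as the pseudo-op rules in
`ulmasm.vim`. The function name `keyword_rules` is made up for this example,
and the mnemonic list is typed in by hand: how to extract it from an `isa.txt`
is deliberately left out, because that depends on the file format.

---- CODE (type=python) --------------------------------------------------------
def keyword_rules(mnemonics):
    """Return one Vim `syntax match` line per mnemonic (sorted by name)."""
    return [
        f"syntax match asmKeyword /{name}\\>/ contained skipwhite"
        for name in sorted(mnemonics)
    ]


# Hand-written example list; a real generator would read it from the ISA file.
for rule in keyword_rules(["halt", "addq", "putc"]):
    print(rule)
--------------------------------------------------------------------------------

The printed lines would replace the catch-all rule, so that only defined
mnemonics get highlighted as keywords.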
Instruction Set Used in the Video
=================================

:import: session09/hello/0_ulm_variants/hello/isa.txt [fold]

The Hello World Assembly Program Shown in the Video
===================================================

:import: session09/hello/0_ulm_variants/hello/hello.s [fold]

Translating the Assembly Program into an Executable
===================================================

---- SHELL (path=session09/hello, hide) ----------------------------------------
make
--------------------------------------------------------------------------------

With

---- SHELL (path=session09/hello) ----------------------------------------------
1_ulm_build/hello/ulmas 0_ulm_variants/hello/hello.s
--------------------------------------------------------------------------------

the following executable `a.out` (the default name for the assembler output)
gets created:

:import: session09/hello/a.out

Format of the Assembler Output
==============================

The generated output consists of different sections. Each of these sections
starts with a header (which also separates the section from a previous
section). In this case the assembler output has the following 4 sections:

`#TEXT <alignment>`
~~~~~~~~~~~~~~~~~~~

is the header for the _text segment_. This section contains the instructions
for the "hello, world" program. The alignment parameter is (for convenience) a
decimal numeral and in this case 4. It specifies that the loader has to copy
this section to a memory block whose start address is a multiple of 4. For the
moment this is not relevant, as in this case the start address is 0 (and hence
a multiple of any nonzero integer).

Lines of the text segment have the format

`[address:] instruction [# comments]`

Addresses and comments are optional (and for convenience). The instructions
are in hexadecimal. By default the loader copies the text segment to memory
beginning at address 0.
---- BOX -----------------------------------------------------------------------
Compare the instructions of the text segment with the memory content from
address 0x00 to 0x20.
--------------------------------------------------------------------------------

`#DATA <alignment>`
~~~~~~~~~~~~~~~~~~~

is the header for the _data segment_. This section contains the data for the
"hello, world" program and has the same format as the text segment (i.e.
optional address, actual data, optional comments). The loader copies the data
segment to memory such that it follows the text segment. Like the text
segment, the data segment can have alignment restrictions. That means that in
general there can be a gap in memory between the text and data segment.
However, in the "hello, world" program the alignment of the data segment is 1.
Hence, the loader begins the data segment at address 0x20 (where the text
segment ended).

`#SYMTAB`
~~~~~~~~~

is the header for the _symbol table_. Labels and `.equ` directives generate
symbols that have a value and a type:

- Labels in the text segment have type _text_ and the value is an address
  within the text segment. For example, the text symbol `halt` has the value
  0x1C.

- Accordingly, labels in the data segment have type _data_ and the value is the
  address relative to the beginning of the data segment. For example, the data
  symbol `msg` has value 0x00 (and not 0x20).

- The `.equ` directive defines a symbol of type _absolute_ with a given value.
  Here, for example, the symbol `p` has value 1.

For loading (and running) the program the symbol table is not relevant. But it
is relevant for linking (which will be covered in upcoming sessions). If an
instruction or directive contains an undefined symbol, an entry of type
_undefined_ and value 0 is added to the symbol table.

`#FIXUPS`
~~~~~~~~~

is the header for the _relocation table_. Like the symbol table, this will be
relevant for linking and will be covered in upcoming sessions.
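The placement of the data segment described above comes down to rounding an
address up to a multiple of the alignment. The following Python sketch models
only that arithmetic; the function names `align_up` and `place_segments` are
made up for this illustration and are not part of the ULM tools.

---- CODE (type=python) --------------------------------------------------------
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (alignment >= 1)."""
    return (address + alignment - 1) // alignment * alignment


def place_segments(text_size, data_alignment, text_start=0):
    """Return (text_start, data_start): data follows text, suitably aligned."""
    text_end = text_start + text_size
    return text_start, align_up(text_end, data_alignment)


# "hello, world": the text segment occupies 0x00..0x1F, data alignment is 1.
text_start, data_start = place_segments(text_size=0x20, data_alignment=1)

# In the symbol table `msg` has the value 0x00, relative to the data segment.
msg_absolute = data_start + 0x00
print(hex(data_start), hex(msg_absolute))      # 0x20 0x20

# Hypothetical case with a gap: 0x1C bytes of text, data alignment 8.
print(hex(place_segments(0x1C, 8)[1]))         # 0x20
--------------------------------------------------------------------------------

In the second call the text segment ends at 0x1C, so four unused bytes lie
between the two segments.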
Memory layout of the "hello, world" program
-------------------------------------------

The information of the text and data segment together with the text and data
labels from the symbol table can be visualized by

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 0.6 }
    \DrawMemArrayOpenRight{0}{31}

    \DrawMemAddress{0}{0x00}
    \DrawMemAddress{4}{0x04}
    \DrawMemAddress{8}{0x08}
    \DrawMemAddress{12}{0x0C}
    \DrawMemAddress{16}{0x10}
    \DrawMemAddress{20}{0x14}
    \DrawMemAddress{24}{0x18}
    \DrawMemAddress{28}{0x1C}
    \DrawMemAddress{32}{0x20}

    \DrawMemLabel{28}{halt}
    \DrawMemLabel{4}{load}

    \DrawMemVariable[gray!50]{0}{32}{}

    \DrawMemCellContent{0}{08} \DrawMemCellContent{1}{00} \DrawMemCellContent{2}{20} \DrawMemCellContent{3}{01}
    \DrawMemCellContent{4}{09} \DrawMemCellContent{5}{01} \DrawMemCellContent{6}{00} \DrawMemCellContent{7}{02}
    \DrawMemCellContent{8}{05} \DrawMemCellContent{9}{00} \DrawMemCellContent{10}{02} \DrawMemCellContent{11}{00}
    \DrawMemCellContent{12}{07} \DrawMemCellContent{13}{00} \DrawMemCellContent{14}{00} \DrawMemCellContent{15}{04}
    \DrawMemCellContent{16}{03} \DrawMemCellContent{17}{02} \DrawMemCellContent{18}{00} \DrawMemCellContent{19}{00}
    \DrawMemCellContent{20}{0A} \DrawMemCellContent{21}{01} \DrawMemCellContent{22}{01} \DrawMemCellContent{23}{01}
    \DrawMemCellContent{24}{04} \DrawMemCellContent{25}{FF} \DrawMemCellContent{26}{FF} \DrawMemCellContent{27}{FB}
    \DrawMemCellContent{28}{01} \DrawMemCellContent{29}{00} \DrawMemCellContent{30}{00} \DrawMemCellContent{31}{00}

    \DrawAnnotateMemCellAbove[2]{15}{Text segment}
\end{tikzpicture}
--------------------------------------------------------------------------------

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 0.6 }
    \DrawMemArrayOpen{0}{31}

    \DrawMemAddress{0}{0x20}
    \DrawMemAddress{4}{0x24}
    \DrawMemAddress{8}{0x28}
    \DrawMemAddress{12}{0x2C}
    \DrawMemAddress{16}{0x30}
    \DrawMemAddress{20}{0x34}
    \DrawMemAddress{24}{0x38}
    \DrawMemAddress{28}{0x3C}
    \DrawMemAddress{32}{0x40}

    \DrawMemLabel{0}{msg}

    \DrawMemVariable[orange!50]{0}{15}{}

    \DrawMemCellContent{0}{68} \DrawMemCellContent{1}{65} \DrawMemCellContent{2}{6C} \DrawMemCellContent{3}{6C}
    \DrawMemCellContent{4}{6F} \DrawMemCellContent{5}{2C} \DrawMemCellContent{6}{20} \DrawMemCellContent{7}{77}
    \DrawMemCellContent{8}{6F} \DrawMemCellContent{9}{72} \DrawMemCellContent{10}{6C} \DrawMemCellContent{11}{64}
    \DrawMemCellContent{12}{21} \DrawMemCellContent{13}{0A} \DrawMemCellContent{14}{00}

    \DrawAnnotateMemCell[2]{5}{Data segment}
\end{tikzpicture}
--------------------------------------------------------------------------------

Disassembling the instructions in the text segment and interpreting the bytes
in the data segment as a zero-terminated string lets us almost see the
original source code in the memory layout (due to spacing issues the symbols
`addr` and `ch` for the literals $1$ and $2$, respectively, are not used):

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 0.75 }
    \DrawMemArrayOpenRight{0}{31}

    \DrawMemAddress{0}{0x00}
    \DrawMemAddress{4}{0x04}
    \DrawMemAddress{8}{0x08}
    \DrawMemAddress{12}{0x0C}
    \DrawMemAddress{16}{0x10}
    \DrawMemAddress{20}{0x14}
    \DrawMemAddress{24}{0x18}
    \DrawMemAddress{28}{0x1C}
    \DrawMemAddress{32}{0x20}

    \DrawMemLabel{28}{halt}
    \DrawMemLabel{4}{load}

    \begingroup
    \renewcommand\PaddingMemVariable {0.05}
    \DrawMemVariable[gray!90]{0}{32}{}
    \par\endgroup

    \DrawLongVariable[gray!50]{0}{ldzwq msg, \%1}
    \DrawLongVariable[gray!50]{4}{\small movzbq (\%1), \%2}
    \DrawLongVariable[gray!50]{8}{\small subq 0, \%2, \%0}
    \DrawLongVariable[gray!50]{12}{je halt}
    \DrawLongVariable[gray!50]{16}{putc \%2}
    \DrawLongVariable[gray!50]{20}{addq 1, \%1, \%1}
    \DrawLongVariable[gray!50]{24}{jmp fetch}
    \DrawLongVariable[gray!50]{28}{halt 0}
\end{tikzpicture}
--------------------------------------------------------------------------------

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 0.75 }
    \DrawMemArrayOpen{0}{31}

    \DrawMemAddress{0}{0x20}
    \DrawMemAddress{4}{0x24}
    \DrawMemAddress{8}{0x28}
    \DrawMemAddress{12}{0x2C}
    \DrawMemAddress{16}{0x30}
    \DrawMemAddress{20}{0x34}
    \DrawMemAddress{24}{0x38}
    \DrawMemAddress{28}{0x3C}
    \DrawMemAddress{32}{0x40}

    \DrawMemLabel{0}{msg}

    \begingroup
    \renewcommand\PaddingMemVariable {0.05}
    \DrawMemVariable[orange!90]{0}{15}{}
    \par\endgroup

    \DrawByteVariable[orange!50]{0}{'h'}
    \DrawByteVariable[orange!50]{1}{'e'}
    \DrawByteVariable[orange!50]{2}{'l'}
    \DrawByteVariable[orange!50]{3}{'l'}
    \DrawByteVariable[orange!50]{4}{'o'}
    \DrawByteVariable[orange!50]{5}{','}
    \DrawByteVariable[orange!50]{6}{' '}
    \DrawByteVariable[orange!50]{7}{'w'}
    \DrawByteVariable[orange!50]{8}{'o'}
    \DrawByteVariable[orange!50]{9}{'r'}
    \DrawByteVariable[orange!50]{10}{'l'}
    \DrawByteVariable[orange!50]{11}{'d'}
    \DrawByteVariable[orange!50]{12}{'!'}
    \DrawByteVariable[orange!50]{13}{'\textbackslash n'}
    \DrawByteVariable[orange!50]{14}{0}
\end{tikzpicture}
--------------------------------------------------------------------------------

From the symbol table we also know that symbols `p` and `ch` were defined with
absolute value 1 and 2 respectively. However, we cannot determine where the
symbols were used. So, for example, we don't know that the first instruction
was written as `ldzwq msg, %addr` in the source file. In practice that makes
it hard to understand disassembled programs where the original source is not
available (and there are actually legal cases where you have to deal with such
problems).

Pointers! Start learning about them here!
=========================================

Have a look at what the first two instructions are doing and how this can be
represented descriptively.
`ldzwq msg, %addr`
~~~~~~~~~~~~~~~~~~

The assembler replaces the label `msg` with the address of the `h` in the
"hello, world!" string. That means `msg` has value 0x20 (or 32 in decimal).

Obviously the address of the string depends on where the loader will copy the
data segment when we run the program. And this in turn depends on the size of
the text segment. It requires some kind of bookkeeping to figure out the
actual address of the string by just looking at the assembly source code. But
it is possible: we can do it, and the assembler can do it. It is less
error-prone if the assembler does it, and using labels delegates this job to
the assembler.

So after this instruction the value in `%addr` has the meaning "address of the
first character in the string". So we think of `%addr` as being a "pointer to
the first character in the string":

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 1.2 }
    \DrawMemArrayOpen{0}{15}

    \DrawMemLabel{0}{msg}

    \begingroup
    \renewcommand\PaddingMemVariable {0.05}
    \DrawMemVariable[orange!90]{0}{15}{}
    \par\endgroup

    \DrawPointer{0}{\%addr}

    \DrawByteVariable[orange!50]{0}{'h'}
    \DrawByteVariable[orange!50]{1}{'e'}
    \DrawByteVariable[orange!50]{2}{'l'}
    \DrawByteVariable[orange!50]{3}{'l'}
    \DrawByteVariable[orange!50]{4}{'o'}
    \DrawByteVariable[orange!50]{5}{','}
    \DrawByteVariable[orange!50]{6}{' '}
    \DrawByteVariable[orange!50]{7}{'w'}
    \DrawByteVariable[orange!50]{8}{'o'}
    \DrawByteVariable[orange!50]{9}{'r'}
    \DrawByteVariable[orange!50]{10}{'l'}
    \DrawByteVariable[orange!50]{11}{'d'}
    \DrawByteVariable[orange!50]{12}{'!'}
    \DrawByteVariable[orange!50]{13}{'\textbackslash n'}
    \DrawByteVariable[orange!50]{14}{0}
\end{tikzpicture}
--------------------------------------------------------------------------------

`movzbq (%addr), %ch`
~~~~~~~~~~~~~~~~~~~~~

This instruction copies the _value at address `%addr`_ into `%ch`.
In this instruction the pointer gets _dereferenced_. And it is impossible to
overestimate how important it will be to understand what _dereferencing a
pointer_ means. So I will explain and talk about it more than once.

Dereferencing means that we refer to a value at the end of a pointer. And this
requires two pieces of information:

- _Location: "Where does the pointer point to?"_

  This information is the address stored as value in the pointer. So in this
  case the value in `%addr`. Note the difference between "value of `%addr`"
  and "value at address `%addr`".

- _Type information: "What is the value at the end of the pointer?"_

  In the "hello, world" example the value is the byte at the end of the
  pointer, and this byte has here the meaning of being the ASCII value of a
  character.

  This information is only given by the context and is not stored in any
  register or anywhere else. We only know it because the instruction `movzbq`
  is used: it copies a single byte from the end of the pointer, zero extends
  it and stores it in the destination register `%ch`.

  In general you have to do the bookkeeping: You have to know how many bytes
  the dereferenced value consists of. And you have to know whether the
  dereferenced value is a character, a signed or unsigned integer, or
  something else. You have zero support from the assembly language for
  bookkeeping this kind of _type information_.
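The difference between "value of `%addr`" and "value at address `%addr`" can
also be played through in a few lines of Python. This is only a toy model
under stated assumptions, not how the ULM is implemented: memory is a byte
array, registers are a dictionary, and the helper functions `ldzwq` and
`movzbq` merely imitate the effect of the two instructions (that `ldzwq`
stores a zero-extended 16-bit value is an assumption of this model).

---- CODE (type=python) --------------------------------------------------------
# Hypothetical model: memory is a byte array, registers are a dict.
DATA_START = 0x20                       # where the loader put the data segment
memory = bytearray(0x40)
memory[DATA_START:DATA_START + 15] = b"hello, world!\n\x00"

regs = {}


def ldzwq(value, dest):
    """Model of `ldzwq value, %dest`: store a 16-bit value, zero extended."""
    regs[dest] = value & 0xFFFF


def movzbq(pointer, dest):
    """Model of `movzbq (%pointer), %dest`: fetch ONE byte, zero extended."""
    regs[dest] = memory[regs[pointer]]


ldzwq(DATA_START, "addr")   # ldzwq  msg, %addr
movzbq("addr", "ch")        # movzbq (%addr), %ch

print(hex(regs["addr"]))    # value of %addr: the address 0x20
print(chr(regs["ch"]))      # value at address %addr: the character 'h'

# Same location, different type information: reading two bytes instead of one
# (big-endian byte order is an assumption here) yields a different value.
two_bytes = int.from_bytes(memory[regs["addr"]:regs["addr"] + 2], "big")
print(hex(two_bytes))
--------------------------------------------------------------------------------

Nothing in `memory` or `regs` records that the byte at 0x20 is a character.
That knowledge lives only in the choice of `movzbq` and in the programmer's
head.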
In this context we can illustrate the meaning of `(%addr)` as follows:

---- TIKZ ----------------------------------------------------------------------
\begin{tikzpicture}
    \input{memory.tex}

    \renewcommand\MemCellWidth { 1.2 }
    \DrawMemArrayOpen{0}{15}

    \DrawMemLabel{0}{msg}

    \begingroup
    \renewcommand\PaddingMemVariable {0.05}
    \DrawMemVariable[orange!90]{0}{15}{}
    \par\endgroup

    \DrawPointer{0}{\%addr}

    \DrawByteVariable[orange!50]{0}{(\%addr)}
    \DrawByteVariable[orange!50]{1}{'e'}
    \DrawByteVariable[orange!50]{2}{'l'}
    \DrawByteVariable[orange!50]{3}{'l'}
    \DrawByteVariable[orange!50]{4}{'o'}
    \DrawByteVariable[orange!50]{5}{','}
    \DrawByteVariable[orange!50]{6}{' '}
    \DrawByteVariable[orange!50]{7}{'w'}
    \DrawByteVariable[orange!50]{8}{'o'}
    \DrawByteVariable[orange!50]{9}{'r'}
    \DrawByteVariable[orange!50]{10}{'l'}
    \DrawByteVariable[orange!50]{11}{'d'}
    \DrawByteVariable[orange!50]{12}{'!'}
    \DrawByteVariable[orange!50]{13}{'\textbackslash n'}
    \DrawByteVariable[orange!50]{14}{0}
\end{tikzpicture}
--------------------------------------------------------------------------------

How to know if a register is storing a pointer?
-----------------------------------------------

The assembler does not know whether a register is used as a pointer. This is
also up to you. You give the register its meaning. And this meaning can
change! You have to do the bookkeeping. And you have to keep your bookkeeping
up to date. Consider this modification of the first two instructions:

---- CODE (type=s) -------------------------------------------------------------
    ldzwq   msg,    %3
    movzbq  (%3),   %3
--------------------------------------------------------------------------------

After the first instruction, register `%3` has the meaning "pointer to the
string". After the second instruction, it has the meaning "first character of
the string".

Some personal opinion/experience
--------------------------------

When learning C/C++, understanding pointers and how to use them is the hardest
part.
My rule of thumb is that every non-trivial bug in a C/C++ program is related
to pointers. The dangerous thing is this combination:

- In the C/C++ programming languages the compilers do a lot of the necessary
  bookkeeping. So compared to programming in assembler, the C/C++ compilers
  can detect many bugs related to pointers. Some of the bugs that slip through
  this line of defence can be detected by additional tools.

- Slightly exaggerated but true in essence: If a bug slipped through, it is
  impossible to find it. You don't even know if a bug slipped through! Because
  such a bug might only once in a while cause the program to crash, or worse,
  the bug does not crash the program and just causes wrong results.

Because some bugs are detected in C/C++ it is tempting to use these languages
and underestimate the danger. The advantage of programming in assembly is: You
will never underestimate what can go wrong!

And be aware that you are learning some non-trivial concepts. Be patient if
things don't work out the first time and try to understand the underlying
reason. This will allow you to do some solid programming in C/C++ later.

More about the assembly language: Tokens
========================================

As I said in the video, it seems odd at first that, for example, `halt` can be
used as a label. How can the assembler distinguish the meanings?

Field format of source lines
----------------------------

This is handled by the scanner during the lexical analysis. The format of the
source lines consists of fields:

`[label] [operators] [operands]`

Tokens for mnemonics (e.g. "addq", "halt", etc.) and pseudo operators (e.g.
".string", ".byte", etc.) are only generated from the operator field. So in
other fields they can be used as identifiers.
For example, from this code

---- CODE (file=session09/hello/lex_example.s) ---------------------------------
addq                            # some label
        addq    %0, %12, %0x3   // an instruction
        .quad   4               /* some data */
--------------------------------------------------------------------------------

the scanner (you find the lexer test program in
`1_ulm_build/hello/.build/ulmas1/`) generates the following tokens:

---- SHELL(path=session09/hello/) ----------------------------------------------
1_ulm_build/hello/.build/ulmas1/xtest_lexer < lex_example.s
---------------------------------------------------------------------------------

So note that the character sequence "addq" was detected as an identifier
(`IDENT`) in the first case and as a mnemonic (`ADDQ`) in the second case.

Comments
--------

You also might notice that comments are removed. Comments come in different
flavors:

- Single-line comments start with "#" or "//"
- "/*" begins a multi-line comment and "*/" ends a multi-line comment

Tokens recognized only in the operator field
--------------------------------------------

As specified in the grammar, a mnemonic is part of an instruction and a pseudo
operator is part of a directive.

Mnemonics
~~~~~~~~~

The mnemonics are specified by the instruction set. In the `isa.txt` for this
video these were

+ `addq`
+ `getc`
+ `halt`
+ `imulq`
+ `je`
* `jmp`
* `jne`
* `jz`
* `ldzwq`
* `movzbq`
+ `putc`
+ `subq`

Pseudo operators
~~~~~~~~~~~~~~~~

+ `.align`
+ `.byte`
+ `.comm`
+ `.data`
+ `.equ`
+ `.equiv`
* `.globl`
* `.global`
* `.lcomm`
* `.long`
* `.quad`
* `.set`
+ `.string`
+ `.text`
+ `.word`

Tokens recognized in the label or operands fields
-------------------------------------------------

Identifiers
~~~~~~~~~~~

Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z'), an
underscore '_', or a dot '.' and are optionally continued with a sequence of
more letters, underscores, dots, or decimal digits 0 to 9. Hence "foo",
".fOo", ".fOo1", "_", "." are allowed but not "2foo".
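The identifier rule can be written down as a regular expression. The following
Python sketch restates the rule from the paragraph above for illustration; it
is not taken from the scanner's source code, and the name `is_identifier` is
made up.

---- CODE (type=python) --------------------------------------------------------
import re

# First character: letter, underscore or dot; then the same plus digits.
IDENT = re.compile(r"[A-Za-z_.][A-Za-z0-9_.]*")


def is_identifier(text):
    """True if the whole string has the shape of an identifier."""
    return IDENT.fullmatch(text) is not None


for candidate in ["foo", ".fOo", ".fOo1", "_", ".", "2foo"]:
    print(candidate, is_identifier(candidate))
--------------------------------------------------------------------------------

Only "2foo" is rejected, because it begins with a digit.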
Empty label
~~~~~~~~~~~

You also see some tokens called `SPACE` in this example. This token gets
generated when the label field is empty. For the parser (and for describing
the grammar) it is important that every line has a label (which can be empty,
but it exists). Otherwise white space characters get consumed by the scanner.

Literals
~~~~~~~~

- Decimal, hexadecimal and octal constants (e.g. 12, 0x2a, 017). These
  constants are all unsigned and encoded with 64 bits.

  - Decimal literals begin with a digit '1' to '9', optionally followed by
    more decimal digits '0' to '9'.

  - Octal literals begin with the digit '0', optionally followed by more octal
    digits '0' to '7'.

  - Hexadecimal literals begin with the prefix '0x' or '0X' followed by one or
    more hexadecimal digits '0' to '9', 'a' to 'f', 'A' to 'F'.

- Character constants (e.g. 'a' or '😎') can be used as integers. The value is
  determined by the ASCII code or, more generally, the UTF-8 code.

- String literals (e.g. "hello, world!")

- End of line (newline character, ASCII code 10)

Punctuators/Delimiters
~~~~~~~~~~~~~~~~~~~~~~

+ `+`
+ `-`
+ `*`
+ `/`
+ `%`
+ `(`
* `)`
* `:`
* `,`
* `$`
* `{`
* `}`
+ `<`
+ `>`

Some of these punctuators are used for expressions (`+`, `-`, `*`, `/`, `%`,
`(` and `)`). You can also use them for your assembly notation. But there is a
restriction:

---- BOX -----------------------------------------------------------------------
If the `'('` punctuator is used in the assembly notation then the next
punctuator has to be the `'%'` punctuator. Hence _movq (X), %Y_ would *not* be
allowed.

As the parentheses are also used in expressions, this requirement makes it
easier (or in my humble opinion possible at all) to use a recursive descent
parser.

If you dislike this restriction, use the brackets `'{'` and `'}'` in your
assembly notation instead.
--------------------------------------------------------------------------------

Immediate operator
~~~~~~~~~~~~~~~~~~

+ `@w0`
+ `@w1`
+ `@w2`
+ `@w3`

These operators can be used to extract a particular word from a 64-bit literal
or symbol (e.g. a label):

- `@w0(label)` gives the least significant word,
- ...,
- `@w3(label)` the most significant word.

More about the assembly language: Grammar
=========================================

With your definition of the assembly notation you define some part of the
grammar. Basically you define it by example and the generator extracts the
formal grammar for instructions. Your grammar rules are then embedded into the
grammar for the assembler.

Structure of an assembly program
--------------------------------

The grammar describes an assembly program as a sequence of instructions and
directives (pseudo instructions):

---- LATEX -------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{compilation-unit}\rangle
    & \to & \langle\text{}\rangle \\
    & \to & \langle\text{sequence}\rangle \\
\langle\text{sequence}\rangle
    & \to & \langle\text{labelled-op}\rangle \\
    & \to & \langle\text{regular-op}\rangle \\
    & \to & \langle\text{sequence}\rangle \quad
            \langle\text{labelled-op}\rangle \\
    & \to & \langle\text{sequence}\rangle \quad
            \langle\text{regular-op}\rangle \\
\langle\text{labelled-op}\rangle
    & \to & \langle\text{label}\rangle \quad \langle\text{op}\rangle\\
\langle\text{regular-op}\rangle
    & \to & \langle\text{empty-label}\rangle \quad \langle\text{op}\rangle\\
\langle\text{op}\rangle
    & \to & \textbf{eol} \\
    & \to & \langle\text{instruction}\rangle \quad \textbf{eol}\\
    & \to & \langle\text{directive}\rangle \quad \textbf{eol}\\
\end{array}
--------------------------------------------------------------------------------

Instructions
------------

This is the part of the grammar that you define. The fields (e.g. `X`, `Y`,
`Z`) from the instruction format can be used in the notation.
The parser then accepts an expression for each of these fields.

Expressions
-----------

---- LATEX -------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{expression}\rangle
    & \to & \langle\text{simple-expression}\rangle \\
\langle\text{simple-expression}\rangle
    & \to & \langle\text{term}\rangle \\
    & \to & \langle\text{simple-expression}\rangle \quad \textbf{+} \quad
            \langle\text{term}\rangle \\
    & \to & \langle\text{simple-expression}\rangle \quad \textbf{-} \quad
            \langle\text{term}\rangle \\
\langle\text{term}\rangle
    & \to & \langle\text{factor}\rangle \\
    & \to & \langle\text{term}\rangle \quad\textbf{*}\quad
            \langle\text{factor}\rangle \\
    & \to & \langle\text{term}\rangle \quad\textbf{/}\quad
            \langle\text{factor}\rangle \\
    & \to & \langle\text{term}\rangle \quad\textbf{\%}\quad
            \langle\text{factor}\rangle \\
\langle\text{factor}\rangle
    & \to & \langle\text{primary}\rangle \\
    & \to & \langle\text{unary-minus}\rangle \\
\langle\text{unary-minus}\rangle
    & \to & \textbf{-} \quad \langle\text{primary}\rangle \\
\langle\text{primary}\rangle
    & \to & \langle\text{integer}\rangle \\
    & \to & \langle\text{identifier}\rangle \\
    & \to & \textbf{(} \quad \langle\text{simple-expression}\rangle \quad
            \textbf{)}\\
\langle\text{integer}\rangle
    & \to & \text{decimal-constant} \\
    & \to & \text{hexadecimal-constant} \\
    & \to & \text{octal-constant} \\
    & \to & \text{char-constant} \\
\langle\text{identifier}\rangle
    & \to & \text{ident} \\
\end{array}
------------------------------------------------------------------------------

In the simplest cases an expression is just an identifier or a constant. The
constants can be decimal, hexadecimal, octal and character constants.
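To get a feeling for how such a grammar is used, here is a small recursive
descent evaluator for the expression grammar, written in Python. It is a
sketch for illustration only and not the code of the ULM assembler. Several
details are assumptions: values are unbounded Python integers instead of
64-bit values, `/` is floor division, character constants have no escape
sequences, and identifiers are looked up in a plain dictionary. The left
recursive rules are implemented as loops.

---- CODE (type=python) --------------------------------------------------------
import re

TOKEN = re.compile(r"""\s*(?:
    (?P<hex>0[xX][0-9a-fA-F]+) |
    (?P<dec>[1-9][0-9]*)       |
    (?P<oct>0[0-7]*)           |
    '(?P<chr>[^'\\])'          |
    (?P<ident>[A-Za-z_.][A-Za-z0-9_.]*) |
    (?P<op>[-+*/%()])
)""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, text) pairs; kind is a group name of TOKEN."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at column {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens, symbols):
        self.tokens, self.pos, self.symbols = tokens, 0, symbols

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)                     # end of input

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def simple_expression(self):                # term { (+|-) term }
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                             # factor { (*|/|%) factor }
        value = self.factor()
        while self.peek()[0] == "op" and self.peek()[1] in "*/%":
            _, op = self.take()
            rhs = self.factor()
            if op == "*":
                value *= rhs
            elif op == "/":
                value //= rhs                   # assumption: floor division
            else:
                value %= rhs
        return value

    def factor(self):                           # primary | - primary
        if self.peek() == ("op", "-"):
            self.take()
            return -self.primary()
        return self.primary()

    def primary(self):
        kind, text = self.take()
        if kind == "hex":
            return int(text, 16)
        if kind == "dec":
            return int(text, 10)
        if kind == "oct":
            return int(text, 8)
        if kind == "chr":
            return ord(text)
        if kind == "ident":
            return self.symbols[text]
        if (kind, text) == ("op", "("):
            value = self.simple_expression()
            if self.take() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token {text!r}")


def evaluate(text, symbols=None):
    """Evaluate one expression; symbols maps identifiers to integer values."""
    parser = Parser(tokenize(text), symbols or {})
    value = parser.simple_expression()
    if parser.peek() != (None, None):
        raise SyntaxError("unexpected input after expression")
    return value


print([evaluate(s) for s in ("65", "0x41", "0101", "'A'")])  # [65, 65, 65, 65]
print(evaluate("exit", {"exit": 65}))                        # 65
print(evaluate("(1 + 2) * 3 - -4"))                          # 13
--------------------------------------------------------------------------------

Each nonterminal of the grammar became one method, which is the defining
property of a recursive descent parser.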
Here is an example of halt instructions that all have the same exit code,
each given by a different expression:

---- CODE (file=session09/grammar/ex_expr.s) -----------------------------------
    halt    65      // exit code as decimal constant
    halt    0x41    // exit code as hexadecimal constant
    halt    0101    // exit code as octal constant
    halt    'A'     // exit code as character constant

    .equ    exit,   'A' /* here instead of 'A' you also could write
                           65, 0x41, 0101. */
    halt    exit
--------------------------------------------------------------------------------

Directives
----------

---- LATEX -------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{directive}\rangle
    & \to & \langle\text{text-header}\rangle \\
    & \to & \langle\text{data-header}\rangle \\
    & \to & \langle\text{bss-header}\rangle \\
    & \to & \langle\text{pseudo-op-data}\rangle \quad
            \langle\text{expression}\rangle \\
    & \to & \langle\text{pseudo-op-string}\rangle \quad
            \textbf{string-literal} \\
    & \to & \langle\text{pseudo-op-flag}\rangle \quad
            \langle\text{identifier}\rangle \\
    & \to & \langle\text{pseudo-op-def}\rangle \quad
            \langle\text{identifier}\rangle \quad \textbf{,} \quad
            \langle\text{expression}\rangle \\
\langle\text{text-header}\rangle
    & \to & \textbf{.text} \\
\langle\text{data-header}\rangle
    & \to & \textbf{.data} \\
\langle\text{bss-header}\rangle
    & \to & \textbf{.bss} \\
\langle\text{pseudo-op-string}\rangle
    & \to & \textbf{.string} \\
\langle\text{pseudo-op-def}\rangle
    & \to & \textbf{.equiv} \\
    & \to & \textbf{.equ} \\
\langle\text{pseudo-op-flag}\rangle
    & \to & \textbf{.global} \\
    & \to & \textbf{.globl} \\
\langle\text{pseudo-op-data}\rangle
    & \to & \textbf{.align} \\
    & \to & \textbf{.space} \\
    & \to & \textbf{.byte} \\
    & \to & \textbf{.long} \\
    & \to & \textbf{.quad} \\
    & \to & \textbf{.word} \\
\end{array}
--------------------------------------------------------------------------------