=========================================
Formal and technical description of ULM C
=========================================

In the following the ULM C programming language is described in more detail.
The description is supposed to be both formal and technical. By "formal" I mean
that the lexical elements and the production rules for the syntax (also see
__grammar for ULM C__) are described. By "technical" I mean that you should get
an idea of how the compiler generates assembly code.

Lexical elements
================

Each program gets converted into a sequence of tokens during the lexical
analysis. These tokens serve as terminal symbols in the description of the
grammar.

Program sources for the ULM C compiler are to be encoded in UTF-8. While tokens
including identifiers consist of ASCII characters only, arbitrary non-ASCII
characters are permitted in string literals and comments.

Example
~~~~~~~

The C source code

---- CODE (file=session12/ex02/hello.c) ----------------------------------------
extern int puts(char *str);

int
main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

consists of the following tokens:

---- SHELL (path=session12/ex02) -----------------------------------------------
ulmcc-test-lexer hello.c
--------------------------------------------------------------------------------

C preprocessor leftovers
------------------------

In general C code first runs through the C preprocessor, and the C compiler
only sees what the preprocessor produced. Even if you have no preprocessor
directives in your code there will be some leftovers. Try this:

- Write some text into a file, e.g.

  ---- CODE (file=session12/ex02/some_text.c) ----------------------------------
  This is some text. And to make clear that the C preprocessor does not care
  if it is actually processing a C source file it is really just some text.
  Bla bla bla.
  ------------------------------------------------------------------------------

- Then run it through the preprocessor (you can call the preprocessor directly
  with `cpp`):

  ---- SHELL (path=session12/ex02) ---------------------------------------------
  cpp some_text.c
  ------------------------------------------------------------------------------

  For completeness: alternatively you can use `gcc` with option `-E` to call
  the preprocessor:

  ---- SHELL (path=session12/ex02, fold) ---------------------------------------
  gcc -E some_text.c
  ------------------------------------------------------------------------------

The C compiler ignores lines produced by the preprocessor that begin with a
'`#`' character. So in this case these lines are removed:

---- SHELL (path=session12/ex02) -----------------------------------------------
cpp some_text.c | grep '#'
--------------------------------------------------------------------------------

Note, however, that these "leftovers" produced by the preprocessor have a
special form (e.g. a space after the '`#`'). Hence in general you cannot use
'`#`' for single-line comments.

Comments
--------

The lexer consumes comments, which can be single-line or multi-line:

- Single-line comments start with "//" and extend to the end of the line,
- "/*" begins a multi-line comment and "*/" ends a multi-line comment.
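As a rough illustration of what "consuming" a comment means, the following
sketch measures how many characters a comment occupies, so that a lexer could
skip them. It is written in standard C and is not taken from the ULM C
compiler. The function name `comment_length` and the decision to treat an
unterminated multi-line comment as running to the end of the input are
assumptions made for this example only.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the ULM C compiler): if `s` points at the
 * start of a comment, return the number of characters the comment occupies;
 * otherwise return 0.
 */
size_t
comment_length(const char *s)
{
    size_t n = 0;

    if (s[0] == '/' && s[1] == '/') {
        /* single-line comment: runs up to, but not including, the newline */
        n = 2;
        while (s[n] != '\0' && s[n] != '\n') {
            ++n;
        }
    } else if (s[0] == '/' && s[1] == '*') {
        /* multi-line comment: runs up to and including the closing delimiter */
        n = 2;
        while (s[n] != '\0' && !(s[n] == '*' && s[n + 1] == '/')) {
            ++n;
        }
        if (s[n] != '\0') {
            n += 2;     /* consume the closing delimiter */
        }
    }
    return n;
}
```

For the input `/* a */ x` the function reports 7 characters. For `int x;` it
reports 0, because the text does not start with a comment.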
Keywords
--------

The following tokens are recognized by the lexer as keywords and therefore
cannot be used as identifiers:

+ `bool`
+ `break`
+ `char`
+ `continue`
+ `else`
+ `extern`

* `false`
* `for`
* `if`
* `int`
* `int16_t`
* `int32_t`

+ `int64_t`
+ `int8_t`
+ `ptrdiff_t`
+ `return`
+ `size_t`
+ `sizeof`

* `static`
* `struct`
* `true`
* `uint16_t`
* `uint32_t`
* `uint64_t`

+ `uint8_t`
+ `void`
+ `while`

To simplify the grammar, some keywords are substituted by the lexer:

+-----------------------+-------------------------------+
| keyword found         | replaced with                 |
+-----------------------+-------------------------------+
| `char`                | `uint8_t`                     |
+-----------------------+-------------------------------+
| `bool`                | `uint8_t`                     |
+-----------------------+-------------------------------+
| `int`                 | `int32_t`                     |
+-----------------------+-------------------------------+
| `ptrdiff_t`           | `int64_t`                     |
+-----------------------+-------------------------------+
| `size_t`              | `uint64_t`                    |
+-----------------------+-------------------------------+

Note that in ULM C quite a few keywords are missing compared to C17:

- The keyword `const` is missing deliberately. Many beginners in C think that
  it can be used to define a constant. But actually it is used to declare a
  variable as "supposed to be read-only but if I change my mind I can write to
  it anyway". So it's more about getting help from the compiler so that you
  don't write to it by accident. In ULM C you have to remember yourself that
  some variables are supposed to be read-only. Also, the part of the C grammar
  related to `const` is unnecessarily complicated. But to be clear: When you
  program in C17 you should use `const` where appropriate. But you should also
  know that it usually does not affect the generated code.

- The keyword `do` (and some others) is missing but on the to-do list.
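The remark about `const` can be made concrete with a few lines of standard C17
(not ULM C, which has no `const`). The names `set_anyway` and `const_demo` are
made up for this sketch. It relies on the object that gets written to not being
defined with `const` itself. Writing to an object that was defined with `const`
would be undefined behavior.

```c
/*
 * Hypothetical helper: receives a read-only view of an int, but "changes its
 * mind" and writes to it anyway.
 */
void
set_anyway(const int *read_only_view, int new_value)
{
    /* The cast discards the qualifier. This is only well defined because the
       object pointed to was not itself defined with const. */
    *(int *)read_only_view = new_value;
}

/* Returns the value of a local variable after it was modified through a
   pointer that promised not to modify it. */
int
const_demo(void)
{
    int counter = 1;            /* an ordinary, writable variable */

    set_anyway(&counter, 42);
    return counter;
}
```

Calling `const_demo()` returns 42. Without the cast the compiler would reject
the assignment, which is exactly the kind of help against accidental writes
described above.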
Punctuators
-----------

+ `[`
+ `]`
+ `(`
+ `)`
+ `{`
+ `}`

* `,`
* `.`
* `->`
* `&`
* `&&`
* `||`

+ `*`
+ `+`
+ `-`
+ `!`
+ `/`
+ `%`

* `<`
* `>`
* `<=`
* `>=`
* `=`
* `==`

+ `!=`
+ `+=`
+ `-=`
+ `++`
+ `--`
+ `;`

Literals
--------

- Decimal, hexadecimal and octal constants (e.g. 12, 0x2a, 017)
- Character constants (e.g. 'h', '\n')
- String literals (e.g. "hello, world!\n")

Identifiers
-----------

Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z') or an
underscore '_' and are optionally continued with a sequence of more letters,
underscores, dots, or decimal digits 0 to 9. Hence "foo", "_fOo", "_fOo1" and
"_" are allowed but not "2foo".

Language grammar: Bookkeeping and guidance for code generation
==============================================================

The source code is a sequence of definitions and declarations for global
variables and functions. The terms _definition_ and _declaration_ can be
described as follows:

- From a definition code gets generated.

The C code describes how to generate code for global variables and functions.

Definitions:
    Code for functions and global variables.

Declarations:
    Global variables defined in other translation units.

The compiler translates each source file into assembly code.
The structure of a source file therefore has to describe the text, data and
bss segments of the generated assembly program:

- Implementation of functions (text segment),
- initialized global variables (data segment),
- uninitialized global variables (bss segment).

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{translation-unit}\rangle\;
    &\to& \\
    &\to& \langle\text{external-declaration-list}\rangle\; \\
\langle\text{external-declaration-list}\rangle\;
    &\to& \langle\text{external-declaration}\rangle\; \\
    &\to& \langle\text{external-declaration-list}\rangle\;
          \langle\text{external-declaration}\rangle\; \\
\langle\text{external-declaration}\rangle\;
    &\to& \langle\text{declaration}\rangle\; \\
    &\to& \langle\text{function-definition}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------

Of course, going through the details of the complete grammar first and only
then describing its meaning would not take advantage of the fact that we
already have a good technical understanding of how to write programs in
assembly.
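To connect the productions for a translation unit with actual source text,
here is a small sketch in which each top-level item is annotated with the
production it is meant to match. All names (`defined_elsewhere`, `answer`,
`counter`, `next_ticket`) are made up for this illustration. The code is plain
C, and I assume, without having checked it against the compiler, that it stays
within the subset accepted by ULM C.

```c
/* external-declaration -> declaration
   (storage class extern: refers to a variable defined in another
   translation unit; it is not used below, so nothing has to be linked) */
extern int defined_elsewhere;

/* external-declaration -> declaration
   (initialized global variable) */
int answer = 42;

/* external-declaration -> declaration
   (uninitialized global variable) */
int counter;

/* external-declaration -> function-definition
   (declaration-specifiers, declarator, compound-statement) */
int
next_ticket()
{
    counter += 1;
    return answer + counter;
}
```

The whole file is one external-declaration-list with four entries. Three of
them are declarations and one is a function definition.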
Declarations
------------

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{declaration}\rangle\;
    &\to& \langle\text{declaration-specifiers}\rangle\;
          \langle\text{init-declarator-list}\rangle\;
          \textbf{;} \\
    &\to& \langle\text{declaration-specifiers}\rangle\;
          \textbf{;} \\
\langle\text{declaration-specifiers}\rangle\;
    &\to& \langle\text{type-specifier}\rangle\; \\
    &\to& \langle\text{storage-class-specifier}\rangle\;
          \langle\text{type-specifier}\rangle \\
\end{array}
--------------------------------------------------------------------------------

Type specifiers
~~~~~~~~~~~~~~~

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{type-specifier}\rangle\;
    &\to& \langle\text{integer-type}\rangle\; \\
    &\to& \langle\text{void-type}\rangle\; \\
    &\to& \langle\text{struct-specifier}\rangle\; \\
\langle\text{integer-type}\rangle\;
    &\to& \textbf{int8_t} \\
    &\to& \textbf{int16_t} \\
    &\to& \textbf{int32_t} \\
    &\to& \textbf{int64_t} \\
    &\to& \textbf{uint8_t} \\
    &\to& \textbf{uint16_t} \\
    &\to& \textbf{uint32_t} \\
    &\to& \textbf{uint64_t} \\
\langle\text{void-type}\rangle\;
    &\to& \textbf{void} \\
\end{array}
--------------------------------------------------------------------------------

Storage class specifiers
~~~~~~~~~~~~~~~~~~~~~~~~

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{storage-class-specifier}\rangle\;
    &\to& \textbf{static} \\
    &\to& \textbf{extern} \\
\end{array}
--------------------------------------------------------------------------------

Function definition
-------------------

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
\langle\text{function-definition}\rangle\;
    &\to& \langle\text{declaration-specifiers}\rangle\;
          \langle\text{declarator}\rangle\;
          \langle\text{compound-statement}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------

Every C program is a description of the text segment, data segment and bss
segment. For example this code fragment

---- CODE (file=session12/ex02/main.c) -----------------------------------------
int a = 42;
int b;

int
main()
{
    /* ... */
}
--------------------------------------------------------------------------------

describes an assembly program that has the following form:

---- CODE (type=s) -------------------------------------------------------------
        .data
        .globl a
        .align 4
a:      .long 42

        .bss
        .globl b
        .align 4
b:      .space 4

        .text
        .globl main
main:
        /* ... */
--------------------------------------------------------------------------------

---- SHELL (path=session12/ex02/, fold) ----------------------------------------
ulmcc-test-parser main.c
ulmcc_mk_jstree -o main03 main.c
--------------------------------------------------------------------------------

---- RAW ------------------------------------
session12/ex02/main03
---------------------------------------------

:links: grammar for ULM C -> http://www.mathematik.uni-ulm.de/numerik/hpc/ss20/hpc0/ulmc-grammar.pdf