=========================================
Formal and technical description of ULM C
=========================================

In the following, the ULM C programming language is described in more detail.
The description is supposed to be both formal and technical. By "formal" I mean
that the lexical elements and the production rules for the syntax (also see
__grammar for ULM C__) are described. By "technical" I mean that you should get
an idea of how the compiler generates assembly code.

Lexical elements
================
Each program gets converted into a sequence of tokens during the lexical
analysis. These tokens serve as terminal symbols in the description of the
grammar. Program sources for the ULM C compiler are to be encoded in UTF-8.
While tokens including identifiers consist of ASCII characters only, arbitrary
non-ASCII characters are permitted in string literals and comments.

Example
~~~~~~~
The C source code

---- CODE (file=session12/ex02/hello.c) ----------------------------------------
extern int
puts(char *str);

int
main()
{
    puts("hello, world!");
}
--------------------------------------------------------------------------------

consists of the following tokens:

---- SHELL (path=session12/ex02) -----------------------------------------------
ulmcc-test-lexer hello.c
--------------------------------------------------------------------------------

C preprocessor leftovers
------------------------
In general, C code first runs through the C preprocessor, and the C compiler
only sees what the preprocessor produced. Even if you have no preprocessor
directives in your code there will be some leftovers. Try this:

- Write some text into a file, e.g.

  ---- CODE (file=session12/ex02/some_text.c) ----------------------------------
  This is some text.
  And to make clear that the
  C preprocessor does not
  care if it is actually processing
  a C source file it is
  really just some text.
  Bla bla bla.
  ------------------------------------------------------------------------------

- Then run it through the preprocessor (you can directly call the preprocessor
  with `cpp`):

  ---- SHELL (path=session12/ex02) ---------------------------------------------
  cpp some_text.c
  ------------------------------------------------------------------------------

  For completeness, alternatively you can use `gcc` with option `-E` to call
  the preprocessor:

  ---- SHELL (path=session12/ex02, fold) ---------------------------------------
  gcc -E some_text.c
  ------------------------------------------------------------------------------

The C compiler ignores lines produced by the preprocessor that begin with a
'`#`' character. So in this case these lines are removed:

---- SHELL (path=session12/ex02) -----------------------------------------------
cpp some_text.c | grep '#'
--------------------------------------------------------------------------------

Note, however, that these "leftovers" produced by the preprocessor have a
special form (e.g. a space after the '`#`'). Hence, in general, you cannot use
'`#`' for single-line comments.
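
The special form can be checked mechanically. The following sketch is written
in standard C (not ULM C) and assumes that a leftover line has the shape
`# <line number> "<file name>"`, optionally followed by flags, which is what
the GNU preprocessor emits. The function name `is_linemarker` is made up for
this illustration and is not taken from the ULM C compiler.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if `line` looks like a preprocessor
   leftover, i.e. '#', at least one space, a decimal line number, at least
   one space and the opening quote of a file name. */
bool
is_linemarker(const char *line)
{
    if (*line++ != '#') {
        return false;
    }
    if (*line != ' ') {
        return false;           /* e.g. "#comment" has no space after '#' */
    }
    while (*line == ' ') {
        ++line;
    }
    if (!isdigit((unsigned char)*line)) {
        return false;           /* e.g. "# comment" has no line number */
    }
    while (isdigit((unsigned char)*line)) {
        ++line;
    }
    if (*line != ' ') {
        return false;
    }
    while (*line == ' ') {
        ++line;
    }
    return *line == '"';
}
```

With this check, a line like `# 1 "some_text.c"` is accepted, whereas
`# just a comment` is rejected, which is why '`#`' does not work as a general
comment character.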

Comments
--------
The lexer consumes comments, which can be single-line or multi-line:

- Single-line comments start with "//" and extend to the end of the line,
- "/*" begins a multi-line comment and "*/" ends it.
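
How a lexer can consume both kinds of comments is sketched below in standard C
(not ULM C). The function name `skip_comment` and the decision to let an
unterminated multi-line comment swallow the rest of the input are assumptions
made for this illustration. The actual ULM C lexer may behave differently.

```c
#include <string.h>

/* Hypothetical helper: if `s` starts with a comment, return a pointer to the
   first character after it; otherwise return `s` unchanged. */
const char *
skip_comment(const char *s)
{
    if (s[0] == '/' && s[1] == '/') {
        /* single-line comment: ends right before the newline */
        s += 2;
        while (*s != '\0' && *s != '\n') {
            ++s;
        }
    } else if (s[0] == '/' && s[1] == '*') {
        /* multi-line comment: ends right after the closing delimiter */
        const char *end = strstr(s + 2, "*/");

        /* an unterminated comment swallows the rest of the input */
        s = end ? end + 2 : s + strlen(s);
    }
    return s;
}
```

A lexer would call such a function whenever it encounters a '/' and continue
with the returned position, so no token is ever produced for a comment.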


Keywords
--------
The following tokens are recognized by the lexer as keywords and therefore
cannot be used as identifiers:

+ `bool`
+ `break`
+ `char`
+ `continue`
+ `else`
+ `extern`


* `false`
* `for`
* `if`
* `int`
* `int16_t`
* `int32_t`

+ `int64_t`
+ `int8_t`
+ `ptrdiff_t`
+ `return`
+ `size_t`
+ `sizeof`

* `static`
* `struct`
* `true`
* `uint16_t`
* `uint32_t`
* `uint64_t`

+ `uint8_t`
+ `void`
+ `while`

To simplify the grammar, some keywords are substituted by the lexer:

+-----------------------+-------------------------------+
| keyword found		| replaced with			|
+-----------------------+-------------------------------+
| `char`		| `uint8_t`			|
+-----------------------+-------------------------------+
| `bool`		| `uint8_t`			|
+-----------------------+-------------------------------+
| `int`			| `int32_t`			|
+-----------------------+-------------------------------+
| `ptrdiff_t`		| `int64_t`			|
+-----------------------+-------------------------------+
| `size_t`		| `uint64_t`			|
+-----------------------+-------------------------------+
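
The substitution amounts to a simple table lookup. The following sketch in
standard C (not ULM C) copies the table from above. The names `substitution`
and `substitute_keyword` are hypothetical and only serve the illustration.

```c
#include <string.h>

/* Substitution table as listed in the text: {found, replaced with}. */
static const char *const substitution[][2] = {
    {"char", "uint8_t"},
    {"bool", "uint8_t"},
    {"int", "int32_t"},
    {"ptrdiff_t", "int64_t"},
    {"size_t", "uint64_t"},
};

/* Hypothetical helper: returns the keyword that the grammar gets to see. */
const char *
substitute_keyword(const char *keyword)
{
    size_t n = sizeof(substitution) / sizeof(substitution[0]);

    for (size_t i = 0; i < n; ++i) {
        if (strcmp(keyword, substitution[i][0]) == 0) {
            return substitution[i][1];
        }
    }
    return keyword;             /* all other keywords are passed through */
}
```

As a consequence the grammar only has to deal with the fixed-width integer
types, e.g. `int` never shows up as a terminal symbol.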

Note that in ULM C quite a few keywords are missing compared to C17:

- The keyword `const` is missing deliberately. Many beginners in C think that
  it can be used to define a constant. But actually it is used to declare a
  variable as "supposed to be read-only, but if I change my mind I can write to
  it anyway". So it is more about getting help from the compiler so that you
  don't write to it by accident. In ULM C you have to remember yourself which
  variables are supposed to be read-only. Also, the part of the C grammar
  related to `const` is unnecessarily complicated.

  But to be clear: When you program in C17 you should use `const` where
  appropriate. But you should also know that it does not affect the generated
  code.

- The keyword `do` (and some others) is missing but on the to-do list.
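
The remark that `const` merely guards against accidental writes can be
illustrated with a few lines of C17 (not ULM C, which lacks the keyword). The
function name `write_through_const` is made up for this example. Note that the
write is only well defined because the object `value` itself is not declared
`const`.

```c
/* C17 example: a const-qualified pointer is only a read-only view. */
int
write_through_const(void)
{
    int value = 42;             /* the object itself is not const */
    const int *ro = &value;     /* the compiler would reject `*ro = 43;` */
    int *rw = (int *)ro;        /* ... but a cast removes the protection */

    *rw = 43;                   /* well defined, as `value` is not const */
    return *ro;                 /* the read-only view sees the new value */
}
```

Calling `write_through_const()` returns 43. The compiler only refuses the
direct assignment through `ro`.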

Punctuators
-----------

+ `[`
+ `]`
+ `(`
+ `)`
+ `{`
+ `}`

* `,`
* `.`
* `->`
* `&`
* `&&`
* `||`

+ `*`
+ `+`
+ `-`
+ `!`
+ `/`
+ `%`

* `<`
* `>`
* `<=`
* `>=`
* `=`
* `==`

+ `!=`
+ `+=`
+ `-=`
+ `++`
+ `--`
+ `;`

Literals
--------
- Decimal, hexadecimal and octal constants (e.g. 12, 0x2a, 017)
- Character constants (e.g. 'h', '\n')
- String literals (e.g. "hello, world!\n")
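
The three notations for integer constants differ only in their prefix. As a
small sketch in standard C (not ULM C), the library function `strtol` with
base 0 applies the same prefix rules as C itself. The wrapper name
`integer_constant_value` is hypothetical, and whether the ULM C lexer works
this way internally is not claimed here.

```c
#include <stdlib.h>

/* Hypothetical helper: value of an integer constant given as text.
   With base 0, strtol() picks the base from the prefix like C does:
   "0x" means hexadecimal, a leading "0" means octal, otherwise decimal. */
long
integer_constant_value(const char *text)
{
    return strtol(text, NULL, 0);
}
```

For example, the texts `42`, `0x2a` and `052` all denote the same value.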


Identifiers
-----------
Identifiers begin with a letter (i.e. 'A' to 'Z' or 'a' to 'z') or an
underscore '_', optionally followed by a sequence of further letters,
underscores, dots, or decimal digits '0' to '9'.

Hence "foo", "_fOo", "_fOo1" and "_" are allowed, but "2foo" is not.
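
The rule translates directly into code. The following sketch in standard C
(not ULM C) follows the description literally, including the dots in the
continuation. The function names are hypothetical, and the check does not
exclude keywords, which a real lexer would have to handle in addition.

```c
#include <stdbool.h>

/* Characters that may start an identifier: ASCII letters and '_'. */
static bool
is_letter_or_underscore(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* Hypothetical helper: does the string `s` have the form of an identifier? */
bool
is_identifier(const char *s)
{
    if (!is_letter_or_underscore(*s)) {
        return false;           /* also rejects the empty string */
    }
    for (++s; *s != '\0'; ++s) {
        bool is_digit = *s >= '0' && *s <= '9';

        if (!is_letter_or_underscore(*s) && !is_digit && *s != '.') {
            return false;
        }
    }
    return true;
}
```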


Language grammar: Bookkeeping and guidance for code generation
==============================================================

The source code is a sequence of definitions and declarations for global
variables and functions. The terms _definition_ and _declaration_ can be
described as follows:

- From a declaration no code gets generated. It merely announces a name and its
  type, e.g. for a global variable or function that is defined in another
  translation unit.
- From a definition code gets generated, i.e. the code for a function or for a
  global variable.

The compiler translates each source file into assembly code. Therefore the
structure of a source file has to describe the text, data and bss segments of
the generated assembly program:

- implementations of functions (text segment),
- initialized global variables (data segment),
- uninitialized global variables (bss segment).

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
    \langle\text{translation-unit}\rangle\;
	&\to& \\
	&\to& \langle\text{external-declaration-list}\rangle\; \\
    \langle\text{external-declaration-list}\rangle\;
	&\to& \langle\text{external-declaration}\rangle\; \\
	&\to& \langle\text{external-declaration-list}\rangle\;
	    \langle\text{external-declaration}\rangle\; \\
    \langle\text{external-declaration}\rangle\;
	&\to& \langle\text{declaration}\rangle\; \\
	&\to& \langle\text{function-definition}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------
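
To connect these productions with source text, here is a small translation
unit in which every external declaration is annotated with the production it
matches. It is meant as a sketch only: the names `offset`, `calls` and
`distance` are made up, and `abs` is the function from the C standard library,
which here plays the role of something defined in another translation unit.

```c
/* <external-declaration> -> <declaration>
   Storage class specifier `extern`, type specifier `int`. The function is
   defined elsewhere, so no code is generated from this line. */
extern int abs(int value);

/* <external-declaration> -> <declaration>
   Type specifier `int` with an initializer: an initialized global variable. */
int offset = -42;

/* <external-declaration> -> <declaration>
   Type specifier `int` without an initializer: an uninitialized global
   variable. */
int calls;

/* <external-declaration> -> <function-definition>
   Declaration specifiers `int`, declarator `distance()`, followed by a
   compound statement. */
int
distance()
{
    calls = calls + 1;
    return abs(offset);
}
```

The translation unit as a whole is then an external declaration list with four
elements, built up from left to right by the recursive second production.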


Of course, going through the details of the complete grammar first and only
then describing its meaning would not take advantage of the fact that we
already have a good technical understanding of how to write programs in
assembly.


Declarations
------------


---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
    \langle\text{declaration}\rangle\;
	&\to&
	    \langle\text{declaration-specifiers}\rangle\;
	    \langle\text{init-declarator-list}\rangle\;
	    \textbf{;} \\
	&\to&
	    \langle\text{declaration-specifiers}\rangle\;
	    \textbf{;} \\
    \langle\text{declaration-specifiers}\rangle\;
	&\to&
	    \langle\text{type-specifier}\rangle\; \\
	&\to&
	    \langle\text{storage-class-specifier}\rangle\;
	    \langle\text{type-specifier}\rangle \\
\end{array}
--------------------------------------------------------------------------------

Type specifiers
~~~~~~~~~~~~~~~

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
    \langle\text{type-specifier}\rangle\;
	&\to& \langle\text{integer-type}\rangle\; \\
	&\to& \langle\text{void-type}\rangle\; \\
	&\to& \langle\text{struct-specifier}\rangle\; \\
    \langle\text{integer-type}\rangle\;
	&\to& \textbf{int8_t} \\
	&\to& \textbf{int16_t} \\
	&\to& \textbf{int32_t} \\
	&\to& \textbf{int64_t} \\
	&\to& \textbf{uint8_t} \\
	&\to& \textbf{uint16_t} \\
	&\to& \textbf{uint32_t} \\
	&\to& \textbf{uint64_t} \\
    \langle\text{void-type}\rangle\;
	&\to& \textbf{void} \\
\end{array}
--------------------------------------------------------------------------------

Storage class specifiers
~~~~~~~~~~~~~~~~~~~~~~~~

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
    \langle\text{storage-class-specifier}\rangle\;
	&\to& \textbf{static} \\
	&\to& \textbf{extern} \\
\end{array}
--------------------------------------------------------------------------------


Function definition
-------------------

---- LATEX ---------------------------------------------------------------------
\begin{array}{lcl}
    \langle\text{function-definition}\rangle\;
	&\to&
	    \langle\text{declaration-specifiers}\rangle\;
	    \langle\text{declarator}\rangle\;
	    \langle\text{compound-statement}\rangle\; \\
\end{array}
--------------------------------------------------------------------------------


Every C program is a description of the text segment, data segment and bss
segment. For example this code fragment

---- CODE (file=session12/ex02/main.c) -----------------------------------------
int a = 42;
int b;

int
main()
{
    /* ... */
}
--------------------------------------------------------------------------------

describes an assembly program that has the following form:

---- CODE (type=s) -------------------------------------------------------------
    .data
    .globl a
    .align 4
a:
    .long 42

    .bss
    .globl b
    .align 4
b:
    .space 4

    .text
    .globl main
main:
    /* ... */
--------------------------------------------------------------------------------

---- SHELL (path=session12/ex02/, fold) ----------------------------------------
ulmcc-test-parser main.c
ulmcc_mk_jstree -o main03 main.c
--------------------------------------------------------------------------------

---- RAW ------------------------------------
session12/ex02/main03
---------------------------------------------


:links: grammar for ULM C -> http://www.mathematik.uni-ulm.de/numerik/hpc/ss20/hpc0/ulmc-grammar.pdf