Formal and technical description of ULM C

In the following the ULM C programming language is described in more detail. The description is supposed to be both formal and technical. By “formal” I mean that the lexical elements and the production rules for the syntax (also see grammar for ULM C) are described. By “technical” I mean that you should get an idea of how the compiler generates assembly code.

Lexical elements

Each program gets converted into a sequence of tokens during the lexical analysis. These tokens serve as terminal symbols in the description of the grammar. Program sources for the ULM C compiler are to be encoded in UTF-8. While tokens including identifiers consist of ASCII characters only, arbitrary non-ASCII characters are permitted in string literals and comments.

Example

The C source code

extern int
puts(char *str);

int
main()
{
    puts("hello, world!");
}

consists of the following tokens:

theon$ ulmcc-test-lexer hello.c
EXTERN "extern" at hello.c:1.1-6
INT32 "int" at hello.c:1.8-10
IDENT "puts" at hello.c:2.1-4
LPAREN at hello.c:2.5
UINT8 "char" at hello.c:2.6-9
ASTERISK at hello.c:2.11
IDENT "str" at hello.c:2.12-14
RPAREN at hello.c:2.15
SEMICOLON at hello.c:2.16
INT32 "int" at hello.c:4.1-3
IDENT "main" at hello.c:5.1-4
LPAREN at hello.c:5.5
RPAREN at hello.c:5.6
LBRACE at hello.c:6.1
IDENT "puts" at hello.c:7.5-8
LPAREN at hello.c:7.9
STRING_LITERAL ""hello, world!"" at hello.c:7.10-24
RPAREN at hello.c:7.25
SEMICOLON at hello.c:7.26
RBRACE at hello.c:8.1
theon$ 

C preprocessor leftovers

In general, C code first runs through the C preprocessor, and the C compiler only sees what the preprocessor produced. Even if your code contains no preprocessor directives there will be some leftovers. Try this:

  • Write some text into a file, e.g.

    This is some text.
    And to make clear that the
    C preprocessor does not
    care if it is actually processing
    a C source file it is
    really just some text.
    Bla bla bla.
    
  • Then run it through the preprocessor (you can call the preprocessor directly with cpp):

    theon$ cpp some_text.c
    # 1 "some_text.c"
    # 1 "<built-in>"
    # 1 "<command-line>"
    # 1 "some_text.c"
    This is some text.
    And to make clear that the
    C preprocessor does not
    care if it is actually processing
    a C source file it is
    really just some text.
    Bla bla bla.
    theon$ 

    For completeness, alternatively you can use gcc with option -E to call the preprocessor:

    theon$ gcc -E some_text.c
    # 1 "some_text.c"
    # 1 "<built-in>"
    # 1 "<command-line>"
    # 1 "some_text.c"
    This is some text.
    And to make clear that the
    C preprocessor does not
    care if it is actually processing
    a C source file it is
    really just some text.
    Bla bla bla.
    theon$ 

The C compiler ignores lines produced by the preprocessor that begin with a '#' character. So in this case these lines are removed:

theon$ cpp some_text.c | grep '#'
# 1 "some_text.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "some_text.c"
theon$ 

Note, however, that these “leftovers” produced by the preprocessor have a special form (e.g. a space after the '#'). Hence, in general, you cannot use '#' for single-line comments.

Comments

The lexer consumes comments, which can be single-line or multi-line:

  • Single-line comments start with “//”,

  • “/*” begins a multi-line comment and “*/” ends a multi-line comment.

Keywords

The following tokens are recognized by the lexer as keywords and therefore cannot be used as identifiers:

bool      break     char      continue  else      extern    false
for       if        int       int16_t   int32_t   int64_t   int8_t
ptrdiff_t return    size_t    sizeof    static    struct    true
uint16_t  uint32_t  uint64_t  uint8_t   void      while

To simplify the grammar, some keywords are substituted by the lexer:

keyword found    replaced with
char             uint8_t
bool             uint8_t
int              int32_t
ptrdiff_t        int64_t
size_t           uint64_t

Note that in ULM C quite a few keywords are missing compared to C17:

  • The keyword const is missing deliberately. Many beginners in C think that it can be used to define a constant. Actually it declares a variable as “supposed to be read-only, but if I change my mind I can write to it anyway”. So it is more about getting help from the compiler so that you don't write to the variable by accident. In ULM C you have to remember yourself that some variables are supposed to be read-only. Also, the part of the C grammar related to const is unnecessarily complicated.

    But to be clear: When you program in C17 you should use const where appropriate. But you also should know that it does not affect the generated code.

  • The keyword do (and some others) is missing but on the to-do list.

Punctuators

[    ]    (    )    {    }    ,    .    ->   &
&&   ||   *    +    -    !    /    %    <    >
<=   >=   =    ==   !=   +=   -=   ++   --   ;

Literals

  • Decimal, octal and hexadecimal constants (e.g. 12, 017, 0x2a)

  • Character constants (e.g. 'h', '\n')

  • String literals (e.g. "hello, world!\n")

Identifiers

Identifiers begin with a letter (i.e. 'A' to 'Z' and 'a' to 'z') or an underscore '_' and are optionally continued with a sequence of more letters, underscores, dots, or decimal digits '0' to '9'.

Hence “foo”, “_fOo”, “_fOo1”, “_” are allowed but not “2foo”.

Language grammar: Bookkeeping and guidance for code generation

The source code is a sequence of definitions and declarations for global variables and functions. The two terms can be described as follows:

  • From a definition code gets generated. Definitions provide the code for functions and global variables.

  • From a declaration no code gets generated. Declarations describe, for example, global variables defined in other translation units.

The compiler translates each source file into assembly code. The structure of a source file therefore has to describe the text, data and bss segments of the generated assembly program:

  • implementations of functions (text segment),

  • initialized global variables (data segment),

  • uninitialized global variables (bss segment).

\[\begin{array}{lcl} \langle\text{translation-unit}\rangle\; &\to& \\ &\to& \langle\text{external-declaration-list}\rangle\; \\ \langle\text{external-declaration-list}\rangle\; &\to& \langle\text{external-declaration}\rangle\; \\ &\to& \langle\text{external-declaration-list}\rangle\; \langle\text{external-declaration}\rangle\; \\ \langle\text{external-declaration}\rangle\; &\to& \langle\text{declaration}\rangle\; \\ &\to& \langle\text{function-definition}\rangle\; \\\end{array}\]

Of course, going through the details of the complete grammar first and only then describing its meaning would not take advantage of the fact that we already have a good technical understanding of how to write programs in assembly.

Declarations

\[\begin{array}{lcl} \langle\text{declaration}\rangle\; &\to& \langle\text{declaration-specifiers}\rangle\; \langle\text{init-declarator-list}\rangle\; \textbf{;} \\ &\to& \langle\text{declaration-specifiers}\rangle\; \textbf{;} \\ \langle\text{declaration-specifiers}\rangle\; &\to& \langle\text{type-specifier}\rangle\; \\ &\to& \langle\text{storage-class-specifier}\rangle\; \langle\text{type-specifier}\rangle \\\end{array}\]

Type specifiers

\[\begin{array}{lcl} \langle\text{type-specifier}\rangle\; &\to& \langle\text{integer-type}\rangle\; \\ &\to& \langle\text{void-type}\rangle\; \\ &\to& \langle\text{struct-specifier}\rangle\; \\ \langle\text{integer-type}\rangle\; &\to& \textbf{int8_t} \\ &\to& \textbf{int16_t} \\ &\to& \textbf{int32_t} \\ &\to& \textbf{int64_t} \\ &\to& \textbf{uint8_t} \\ &\to& \textbf{uint16_t} \\ &\to& \textbf{uint32_t} \\ &\to& \textbf{uint64_t} \\ \langle\text{void-type}\rangle\; &\to& \textbf{void} \\\end{array}\]

Storage class specifiers

\[\begin{array}{lcl} \langle\text{storage-class-specifier}\rangle\; &\to& \textbf{static} \\ &\to& \textbf{extern} \\\end{array}\]

Function definition

\[\begin{array}{lcl} \langle\text{function-definition}\rangle\; &\to& \langle\text{declaration-specifiers}\rangle\; \langle\text{declarator}\rangle\; \langle\text{compound-statement}\rangle\; \\\end{array}\]

Every C program is a description of the text segment, data segment and bss segment. For example this code fragment

int a = 42;
int b;

int
main()
{
    /* ... */
}

describes an assembly program that has the following form:

    .data
    .globl a
    .align 4
a:
    .long 42

    .bss
    .globl b
    .align 4
b:
    .space 4

    .text
    .globl main
main:
    /* ... */

For this code fragment the parser builds the following syntax tree:

theon$ ulmcc-test-parser main.c
("translation_unit"
   ("external_declaration"
      ("declaration"
         ("declaration_specifiers"
            ("type_specifier"
               ("integer_type"
                  ("int32_t")
               )
            )
         ), 
         ("init_declarator_list"
            ("init_declarator"
               ("declarator"
                  ("direct_declarator"
                     ("identifier"
                        'a'
                     )
                  )
               ), 
               ("initializer"
                  ("expression"
                     ("integer_constant"
                        ("decimal_constant"
                           '42'
                        )
                     )
                  )
               )
            )
         )
      )
   ), 
   ("external_declaration"
      ("declaration"
         ("declaration_specifiers"
            ("type_specifier"
               ("integer_type"
                  ("int32_t")
               )
            )
         ), 
         ("init_declarator_list"
            ("init_declarator"
               ("declarator"
                  ("direct_declarator"
                     ("identifier"
                        'b'
                     )
                  )
               )
            )
         )
      )
   ), 
   ("external_declaration"
      ("function_definition"
         ("declaration_specifiers"
            ("type_specifier"
               ("integer_type"
                  ("int32_t")
               )
            )
         ), 
         ("declarator"
            ("direct_declarator"
               ("direct_declarator"
                  ("identifier"
                     'main'
                  )
               ), 
               ("parameter_list")
            )
         ), 
         ("compound_statement")
      )
   )
)
theon$ ulmcc_mk_jstree -o main03 main.c
theon$