CPW Part 2: Initial Design for a Lexer
In this C programming workout we will begin with our compiler project. The compiler will be called ABC, short for "A Bloody Compiler".
For our compiler project we will need a lexer, and we can make use of the experience gathered in Session 12, where we implemented the function getUint() based on a finite state machine.
In its initial version the lexer will merely recognize certain patterns in an input stream.
Initial Lexer for the Compiler
Tokens can be punctuators consisting of one special character, identifiers, or literals. A token ends when the next character cannot be added to it. For example, the input 0x12+3 is split into the tokens 0x12 (a hexadecimal literal), + (a punctuator), and 3 (a decimal literal).
Whitespace and Newline
Space (ASCII 32), tab (ASCII 9), and newline (ASCII 10) characters are treated as whitespace. Whitespace characters are not part of any token; they can separate tokens but are otherwise ignored.
Literals
Literals can be decimal literals, octal literals, or hexadecimal literals.
Decimal literals begin with a digit '1' to '9' and are optionally continued with more decimal digits '0' to '9'. Decimal constants are unsigned and are encoded with 64 bits.
Octal literals begin with the digit '0' and are optionally continued with more octal digits '0' to '7'. Octal constants are unsigned and are encoded with 64 bits.
Hexadecimal literals begin with the prefix '0x' or '0X' followed by one or more hexadecimal digits '0' to '9', 'a' to 'f', or 'A' to 'F'. For example, 123 is a decimal literal, 0123 an octal literal, and 0x123 a hexadecimal literal.
Punctuators/Delimiters
The following punctuators consist of one character:
'+', '-', '*', '/', '%', '=', '(', ')', ';'
Identifiers
Identifiers begin with a letter, i.e. 'A' to 'Z' and 'a' to 'z', or an underscore '_', and are optionally continued with a sequence of more letters, underscores, or decimal digits '0' to '9'.
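The implementation below only provides predicates for digits; for identifiers, similar character-class helpers will be needed. Here is a minimal sketch in the same style (the names isLetter and isIdentifierCh are our own choice, nothing in the specification prescribes them):

#include <stdbool.h>

// a letter in the sense of the specification: 'A' to 'Z' or 'a' to 'z'
bool
isLetter(int ch)
{
	return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

// a character that may continue an identifier:
// a letter, an underscore, or a decimal digit
bool
isIdentifierCh(int ch)
{
	return isLetter(ch) || ch == '_' || (ch >= '0' && ch <= '9');
}

An identifier then begins with a character for which isLetter(ch) || ch == '_' holds, and is consumed as long as isIdentifierCh(ch) remains true.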
Interface for the Lexer
The lexer reads input from stdin and provides the function getToken(). When getToken() is called the lexer consumes characters from stdin until a token is found, and returns the token kind.
In the initial version the token kind is an integer value that is documented within a comment in the header.
In addition to the token kinds defined above the lexer defines the kind EOI (end of input), returned when no further character is available, and BAD_TOKEN, returned when non-whitespace characters cannot be grouped into a token.
Note that the macro in the include guard will always begin with the prefix ABC, the name of the compiler project.
#ifndef ABC_LEXER_H
#define ABC_LEXER_H
/*
Returns token kind:
0 = EOI (end of input)
1 = BAD_TOKEN
2 = HEX_LITERAL
3 = OCT_LITERAL
4 = DEC_LITERAL
5 = PLUS ('+')
6 = MINUS ('-')
7 = ASTERISK ('*')
8 = SLASH ('/')
9 = PERCENT ('%')
10 = EQUAL ('=')
11 = LPAREN (left parenthesis '(')
12 = RPAREN (right parenthesis ')')
13 = SEMICOLON (';')
14 = IDENTIFIER
*/
int getToken(void);
#endif // ABC_LEXER_H
Test Program for the Lexer
For testing the lexer we simply call getToken() until it returns the EOI token kind. For each token the test program prints its kind in textual form:
#include <stdio.h>
#include "lexer.h"

int
main(void)
{
	int token;

	while ((token = getToken()) != 0) { // 0 = EOI
		if (token == 1) {
			printf("BAD_TOKEN\n");
		} else if (token == 2) {
			printf("HEX_LITERAL\n");
		} else if (token == 3) {
			printf("OCT_LITERAL\n");
		} else if (token == 4) {
			printf("DEC_LITERAL\n");
		} else if (token == 5) {
			printf("PLUS\n");
		}
	}
}
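As more token kinds are implemented, this else-if chain grows quickly. A possible alternative (a sketch, not part of the session's code) is a lookup table indexed by the token kind, relying on the kinds being the small consecutive integers documented in lexer.h:

#include <stdio.h>

#include "lexer.h"

// token names indexed by kind, in the order documented in lexer.h
static const char *tokenName[] = {
	"EOI", "BAD_TOKEN", "HEX_LITERAL", "OCT_LITERAL", "DEC_LITERAL",
	"PLUS", "MINUS", "ASTERISK", "SLASH", "PERCENT", "EQUAL",
	"LPAREN", "RPAREN", "SEMICOLON", "IDENTIFIER",
};

int
main(void)
{
	int token;

	while ((token = getToken()) != 0) {
		printf("%s\n", tokenName[token]);
	}
}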
Implementation of the Lexer
The current implementation is based on the code that we developed for the getUint() function in Session 12. In its current form it just recognizes the literal tokens (DEC_LITERAL, OCT_LITERAL, HEX_LITERAL) and the plus punctuator (PLUS).
#include <stdbool.h>
#include <stdio.h>
#include "lexer.h"

int ch; // current input character; 0 before the first call of nextCh()

int
nextCh(void)
{
	ch = getchar();
	return ch;
}

bool
isWhiteSpace(int ch)
{
	return ch == ' ' || ch == '\t';
}

bool
isDecDigit(int ch)
{
	return ch >= '0' && ch <= '9';
}

bool
isOctDigit(int ch)
{
	return ch >= '0' && ch <= '7';
}

bool
isHexDigit(int ch)
{
	return isDecDigit(ch) || (ch >= 'a' && ch <= 'f') ||
	       (ch >= 'A' && ch <= 'F');
}

int
getToken(void)
{
	// init ch, skip white spaces and newlines
	while (ch == 0 || isWhiteSpace(ch) || ch == '\n') {
		nextCh();
	}

	if (ch == EOF) {
		return 0; // EOI
	} else if (isDecDigit(ch)) {
		// parse literal
		if (ch == '0') {
			nextCh();
			if (ch == 'x' || ch == 'X') { // hex prefix '0x' or '0X'
				nextCh();
				if (isHexDigit(ch)) {
					while (isHexDigit(ch)) {
						nextCh();
					}
					return 2; // HEX_LITERAL
				}
				return 1; // BAD_TOKEN
			}
			// just consume the digits, the value is not computed
			while (isOctDigit(ch)) {
				nextCh();
			}
			return 3; // OCT_LITERAL
		} else if (isDecDigit(ch)) {
			while (isDecDigit(ch)) {
				nextCh();
			}
			return 4; // DEC_LITERAL
		}
	}
	if (ch == '+') {
		nextCh();
		return 5; // PLUS
	}
	nextCh();
	return 1; // BAD_TOKEN
}
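For the exercise below, the remaining punctuators can be handled uniformly instead of adding one if branch per character. A sketch of such a helper (the name punctuatorKind is our own; the kind numbers are the ones documented in lexer.h):

// sketch: map a one-character punctuator to its token kind,
// or return -1 if ch is not a punctuator
static int
punctuatorKind(int ch)
{
	switch (ch) {
	case '+': return 5;	// PLUS
	case '-': return 6;	// MINUS
	case '*': return 7;	// ASTERISK
	case '/': return 8;	// SLASH
	case '%': return 9;	// PERCENT
	case '=': return 10;	// EQUAL
	case '(': return 11;	// LPAREN
	case ')': return 12;	// RPAREN
	case ';': return 13;	// SEMICOLON
	default: return -1;
	}
}

getToken() could then replace its special case for '+' with a call to punctuatorKind(ch), consume the character with nextCh() if the result is non-negative, and fall back to BAD_TOKEN otherwise. The identifier case follows the same shape as the decimal literal case, using predicates like those sketched in the Identifiers section.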
Initial Makefile
We will begin with the makefile developed on the previous page:
#
# patch: If user has not defined CC and default value does not exist use gcc
#
ifeq ($(origin CC),default)
cc_check := $(shell $(CC) -v > /dev/null 2>&1 && echo "sane")
ifneq ($(strip $(cc_check)),sane)
CC := gcc
endif
endif
#
# Define list of source files, object files, targets, etc
#
# all source files
src :=\
$(wildcard *.c)
# all object files
obj :=\
$(patsubst %.c,%.o,\
$(src))
# all targets
target :=\
$(filter xtest%,\
$(patsubst %.c,%,\
$(src)))
# objects that are required by the targets
lib.o :=\
$(filter-out xtest%,\
$(obj))
# dependency file that will be generated by compiler
deps :=\
$(patsubst %,%.d,\
$(src))
# dependency file leftovers of gone source files
obsolete.deps:=\
$(filter-out $(deps),\
$(wildcard *.c.d))
#
# Build rules
#
.PHONY: all
all: $(target) $(obj)
# rule for removing obsolete dependency files
.PHONY: $(obsolete.deps)
$(obsolete.deps) :
$(RM) $(obsolete.deps)
# delete implicit rule for building an executable directly from its source file
% : %.c
# our rule: to build target link its object file against library object files
%: %.o $(lib.o) | $(obsolete.deps)
$(CC) -o $@ $^ $(LDFLAGS)
# our rule to build objects: also generate a dependency file
%.o: %.c | $(obsolete.deps)
$(CC) -c $(CPPFLAGS) $(CFLAGS) -MT $@ -MMD -MP -MF $<.d $<
.PHONY: clean
clean:
$(RM) $(target) $(obj) $(deps) $(obsolete.deps)
#
# Include dependencies (if already generated)
#
-include $(deps)
Here is a run of make in a clean directory:
theon$ make
gcc -c -MT xtest_lexer.o -MMD -MP -MF xtest_lexer.c.d xtest_lexer.c
gcc -c -MT lexer.o -MMD -MP -MF lexer.c.d lexer.c
gcc -o xtest_lexer xtest_lexer.o lexer.o
theon$
Testing the Lexer
The lexer reads input from stdin. So you can use it interactively and terminate the program with CONTROL-D.
When you test your lexer you will soon get tired of typing in test cases. You can prepare some test cases in a single-line command that pipes the output of echo to the lexer like this (for running the test again just use the "up" arrow key):
theon$ echo "123 + 0x123x12" | ./xtest_lexer
DEC_LITERAL
PLUS
HEX_LITERAL
BAD_TOKEN
DEC_LITERAL
theon$
In the long run it is more convenient to have a test case like this in some file:
123 +0x123x12
0123
x = 3 + 4;
and run the test by redirecting the file to stdin:
theon$ ./xtest_lexer < test.in
DEC_LITERAL
PLUS
HEX_LITERAL
BAD_TOKEN
DEC_LITERAL
OCT_LITERAL
BAD_TOKEN
BAD_TOKEN
DEC_LITERAL
PLUS
DEC_LITERAL
BAD_TOKEN
theon$
You might already guess how diff and a check target in the makefile can later be used for automated tests.
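For instance, a check target could pipe the lexer output into diff and compare it against a recorded reference (a sketch; the file name test.expected is our assumption, not part of the makefile above):

# sketch: run the lexer on test.in and compare with the expected output;
# diff exits with a non-zero status on any deviation
.PHONY: check
check: xtest_lexer
	./xtest_lexer < test.in | diff test.expected -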
Exercise
Complete the implementation of the lexer so that it recognizes all tokens that are specified. In this initial form the lexer is supposed to just recognize tokens in the input. Unlike in Session 12 we do not compute the numerical value of literals. We can add the code for that later (or we might see reasons why this should be done in some other part of the compiler).
The exercise is not a quiz. Do it for fun or don't do it at all.
