====================================== CPW Part 2: Initial Design for a Lexer [TOC] ====================================== In this C programming workout we will begin with our compiler project. The compiler will be called _ABC_ for _ABC is A Bloody Compiler_. For our compiler project we will need a lexer, and we can make use of the experience gathered from __Session 12__ when we implemented the function _getUint()_ base on a finite state machine. In its initial version the lexer will merely recognize certain pattern in a input stream. Initial Lexer for the Compiler ============================== Tokens can be punctuators consisting of one special characters, identifiers, or literals. Tokens end if the next character if the next character cannot be added to it. Whitespace and newline ~~~~~~~~~~~~~~~~~~~~~~ Space (ASCII 32), tab (ASCII 9) characters, and newline (ASCII 10) characters are treated as whitespace characters. White space characters are not part of any token and therefore can separate tokens but are otherwise ignored. Literals ~~~~~~~~ Literals can be decimal literals octal literals, or hexadecimal literals. Decimal literals begin with a digit "1' to '9' and optionally more decimal digits '0' to '9'. Decimal constants are unsigned and are encoded with 64 bits. Octal literals begin with the digit '0' and optionally more octal digit '0' to '7'. more digits. Decimal constants are unsigned and are encoded with 64 bits. Hexadecimal digit begin with the prefix '0x' or '0X' followed by one or more hexadecimal digit '0' to '9', 'a' to 'f', 'A' to 'F'. Punctuators/Delimiters ~~~~~~~~~~~~~~~~~~~~~ Following punctuators consist of one character: + `+` + `-` + `*` + `/` + `%` * `=` * `(` * `)` * `;` Identifiers ~~~~~~~~~~~ Identifiers begin with a letter, i.e. 'A' to 'Z' and 'a' to 'z', or an underscore “_”, and are optionally continued with a sequence of more letters, underscores, or decimal digits “0” to “9”. Interface for the Lexer ----------------------- The lexer reads input from _stdin_ and provides the function _getToken()_. When _getToken()_ is called the lexer consumes characters from _stdin_ until a token was found and returns the token kind. In the initial version the token kind is a integer literal documented in the header within a comment. In addition to the token kinds defined above the lexer defines the kind _EOI (end of input)_ if no further character is available, and ~BAD_TOKEN~ if non-whitespace characters can not be grouped to a token. Note, the macro in the include guards will always begin with the prefix ~ABC~ the name of the compiler project. ---- CODE (file=session13/abc/lexer.h,fold) ------------------------------------ #ifndef ABC_LEXER_H #define ABC_LEXER_H /* Returns token kind: 0 = EOI (end of input) 1 = BAD_TOKEN 2 = HEX_LITERAL 3 = OCT_LITERAL 4 = DEC_LITERAL 5 = PLUS ('+') 6 = MINUS ('-') 7 = ASTERISK ('*') 8 = SLASH ('/') 9 = PERCENT ('%') 10 = EQUAL ('=') 11 = LPAREN (left paranthesis '(') 12 = RPAREN (right paranthesis ')') 13 = SEMICOLON 14 = IDENTIFIER */ int getToken(void); #endif // ABC_LEXER_H -------------------------------------------------------------------------------- Test Program for the Lexer -------------------------- For testing the lexer we simply call _getToken()_ until it returns the _EOI_ token kind. For each token it prints its kind in textual form: ---- CODE (file=session13/abc/xtest_lexer.c,fold) ------------------------------ #include #include "lexer.h" int main(void) { int token; while ((token = getToken()) != 0) { if (token == 1) { printf("BAD_TOKEN\n"); } else if (token == 2) { printf("HEX_LITERAL\n"); } else if (token == 3) { printf("OCT_LITERAL\n"); } else if (token == 4) { printf("DEC_LITERAL\n"); } else if (token == 5) { printf("PLUS\n"); } } } -------------------------------------------------------------------------------- Implementation of the Lexer --------------------------- The current implementation is based on the code that we developed for the _getUint()_ function in __Session 12__. In its current form it just recognizes the literal tokens (~DEC_LITERAL~, ~OCT_LITERAL~, ~HEX_LITERAL~) and the plus punctuator (~PLUS~). ---- CODE (file=session13/abc/lexer.c, fold) ----------------------------------- #include #include #include "lexer.h" int ch; int nextCh(void) { ch = getchar(); return ch; } bool isWhiteSpace(int ch) { return ch == ' ' || ch == '\t'; } bool isDecDigit(int ch) { return ch >= '0' && ch <= '9'; } bool isOctDigit(int ch) { return ch >= '0' && ch <= '7'; } bool isHexDigit(int ch) { return isDecDigit(ch) || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F'); } int getToken(void) { // init ch, skip white spaces and newlines while (ch == 0 || isWhiteSpace(ch) || ch == '\n') { nextCh(); } if (ch == EOF) { return 0; // EOI } else if (isDecDigit(ch)) { // parse literal if (ch == '0') { nextCh(); if (ch == 'x') { nextCh(); if (isHexDigit(ch)) { while (isHexDigit(ch)) { nextCh(); } return 2; // HEX_LITERAL } return 1; // BAD_TOKEN } while (isOctDigit(ch)) { ch -= '0'; nextCh(); } return 3; // OCT_LITERAL } else if (isDecDigit(ch)) { while (isDecDigit(ch)) { nextCh(); } return 4; // DEC_LITERAL } } if (ch == '+') { nextCh(); return 5; // PLUS } nextCh(); return 1; // BAD_TOKEN } -------------------------------------------------------------------------------- Initial Makefile ---------------- ---- SHELL (path=session13/abc,hide) ------------------------------------------- cp ../using_make10/Makefile . make clean rm -f test.in -------------------------------------------------------------------------------- We will begin with the makefile developed on the previous page: :import: session13/abc/Makefile [fold] Here a _make_ on a clean directory: ---- SHELL (path=session13/abc) ------------------------------------------------ make -------------------------------------------------------------------------------- Testing the Lexer ----------------- The lexer reads input from _stdin_. So you can use it interactive and terminate the program with _CONTROL-d_. When you test your lexer you soon will get tied of typing in test cases. You can prepare some test cases in a single line command that pipes the output of _echo_ to the lexer like this (for running the test again just use the "up" arrow key): ---- SHELL (path=session13/abc) ------------------------------------------------ echo "123 + 0x123x12" | ./xtest_lexer -------------------------------------------------------------------------------- In the long run it is more conventient to have a test case like this in some file: ---- CODE (file=session13/abc/test.in,unet) ------------------------------------ 123 +0x123x12 0123 x = 3 + 4; -------------------------------------------------------------------------------- and run the test be redirecting the file to stdin: ---- SHELL (path=session13/abc) ------------------------------------------------ ./xtest_lexer < test.in -------------------------------------------------------------------------------- You might guess how _diff_ and a _check_ target in the makefile later can be used for automated tests. Exercise ======== Complete the implementation of the lexer so that it recognizes all tokens that are specified. In this initial form the lexer is supposed to just recognize tokens in the input. Unlike in __Session 12__ we do not compute the numerical value of literals. We can add the code for that later (or we might see reasons why this should be done in some other part of the compiler). The exercise is not a quiz. Do it for fun or don't do it at all. --- IMAGE ----------------- session13/do_or_do_not.jpeg --------------------------- :links: Session 12 -> doc:session12/page04