Lexer: Recognizing Keywords
To support control structures, the lexer has to recognize keywords. For now, the necessary lexer extension will be implemented in a quick and dirty way:
- Some new enum constants will be added.
- After the lexer has found an identifier, it checks whether it is actually a keyword.
For compound statements, the curly braces “{” and “}” also need to be recognized as tokens. Of course, keywords like begin and end could be used as an alternative.
New Token Kinds
If you use tokenkind.txt to generate the code for the enum constants used by the lexer and for the function strTokenKind(), simply add a corresponding entry like WHILE or TK_WHILE. Otherwise, add a new enum constant manually and patch your implementation of strTokenKind().
Analogously, add enum constants for the punctuators “{” and “}”, for example LBRACE and RBRACE, or TK_LBRACE and TK_RBRACE respectively. Recognizing punctuators is nothing new and will not be discussed here any further ;-)
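If you maintain the enum and strTokenKind() by hand, the additions could look roughly like the following sketch. The constant EOI and the exact signature of strTokenKind() are assumptions made for illustration; keep whatever your lexer already uses and only add the new entries.

```c
#include <assert.h>

/* Token kinds. EOI is an assumed, already existing constant;
   WHILE, LBRACE and RBRACE are the new entries. */
enum TokenKind
{
    EOI,
    IDENTIFIER,
    WHILE,
    LBRACE,
    RBRACE,
};

/* Return a printable name for a token kind (assumed signature). */
const char *
strTokenKind(enum TokenKind kind)
{
    switch (kind) {
        case EOI:
            return "EOI";
        case IDENTIFIER:
            return "IDENTIFIER";
        case WHILE:
            return "WHILE";
        case LBRACE:
            return "LBRACE";
        case RBRACE:
            return "RBRACE";
        default:
            /* a new enum constant was added without patching this function */
            assert(0 && "unknown token kind");
            return "?";
    }
}
```

The default branch makes it obvious during testing when an enum constant was added but strTokenKind() was not patched.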
Recognizing Keywords
In lexer.c, add a static function that checks whether a string represents a keyword. The quick and dirty approach is to compare the string against all keywords. The function then returns the proper token kind: either that of the matched keyword or IDENTIFIER. For example:
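// Requires <stdbool.h> for bool and the declarations of struct UStr and UStrAdd().
static enum TokenKind
checkForKeyword(const char *s)
{
    static bool first = true;
    static const struct UStr *kwWhile;

    // on the first call, register the keyword once
    if (first) {
        first = false;
        kwWhile = UStrAdd("while");
    }

    // the pointer comparison relies on UStrAdd() returning the same
    // pointer for equal strings
    const struct UStr *id = UStrAdd(s);
    if (id == kwWhile) {
        return WHILE;
    } else {
        return IDENTIFIER;
    }
}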
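The same idea of comparing the string against all keywords can also be sketched without UStrAdd(), using a small table and strcmp(). This is not the approach used above, just a self-contained illustration; the reduced enum TokenKind and the table name keyword are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced set of token kinds, just enough for this sketch. */
enum TokenKind
{
    IDENTIFIER,
    WHILE,
};

/* Compare s against every table entry; fall back to IDENTIFIER. */
static enum TokenKind
checkForKeyword(const char *s)
{
    static const struct
    {
        const char *name;
        enum TokenKind kind;
    } keyword[] = {
        {"while", WHILE},
    };

    assert(s != NULL);
    for (size_t i = 0; i < sizeof(keyword) / sizeof(keyword[0]); ++i) {
        if (strcmp(s, keyword[i].name) == 0) {
            return keyword[i].kind;
        }
    }
    return IDENTIFIER;
}
```

With this layout a further keyword needs only one additional table row, at the price of a string comparison per entry instead of a single pointer comparison.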
When getToken() has found an identifier, it now checks whether it is actually a keyword before it sets and returns the token kind. For example:
static enum TokenKind
getToken(void)
{
    // ...
    } else if (isLetter(ch)) {
        do {
            appendCharToStr(&token.val, ch);
            nextCh();
        } while (isLetter(ch) || isDecDigit(ch));
        return token.kind = checkForKeyword(token.val.cstr);
    }
    // ...
}
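Note the order of the two steps: the complete word is collected first, and only then is it checked against the keywords. The following self-contained sketch illustrates this on a plain C string instead of the lexer's input stream. The helper scanWord() and its parameters are hypothetical and not part of the lexer; it assumes the caller has already seen a letter at src[*pos].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum TokenKind
{
    IDENTIFIER,
    WHILE,
};

/* Hypothetical helper: copy the word starting at src[*pos] into buf,
   advance *pos past it and classify the word. */
static enum TokenKind
scanWord(const char *src, size_t *pos, char *buf, size_t bufSize)
{
    size_t len = 0;

    assert(bufSize > 0);

    /* First consume the longest run of letters and digits ... */
    while (isalnum((unsigned char)src[*pos])) {
        if (len + 1 < bufSize) {
            buf[len++] = src[*pos];
        }
        ++*pos;
    }
    buf[len] = '\0';

    /* ... and only then decide whether the whole word is a keyword. */
    return strcmp(buf, "while") == 0 ? WHILE : IDENTIFIER;
}
```

Because the check happens after the whole word was read, an input like while1 is classified as an identifier and not as the keyword while followed by a digit.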
Exercise
Extend the lexer such that it also recognizes the keywords for, do, if and else. Here is a simple test case:
while {} for do if else
ffor doo ddo
Output for this test case with keywords and identifiers:
theon$ ./xtest_lexer < test_lexer_kw.in
1.1: WHILE 'while'
1.7: LBRACE '{'
1.8: RBRACE '}'
1.10: FOR 'for'
1.14: DO 'do'
1.17: IF 'if'
1.20: ELSE 'else'
2.1: IDENTIFIER 'ffor'
2.6: IDENTIFIER 'doo'
2.10: IDENTIFIER 'ddo'
theon$