Recursive Descent Parser for a Calculator
In this video a predictive recursive descent parser was implemented. This is only possible for so called LL(k) grammars. So it is not possible to parse all context free grammars with this type of a parser. Furthermore, the methodic shown for translating an EBNF grammar into a parser implementation even requires a LL(1) grammar. So with that we are limited to an even smaller subset. Fortunately most programming languages of practical interest can be parsed with a recursive descent parser.
Current Grammar for the Calculator
Below you find the grammar of the calculator developed in the video. Note that the production rule for \(\langle\text{translation-unit}\rangle\) gets kind of canceled out when it gets rewritten into EBNF.
\[\begin{array}{rcl}\langle\text{translation-unit}\rangle & \to & \texttt{EOI} \\ & \to & \langle\text{input-sequence}\rangle \; \texttt{EOI} \\\langle\text{input-sequence}\rangle & \to & \langle\text{expr-statement}\rangle\; \\ & \to & \langle\text{input-sequence}\rangle\; \langle\text{expr-statement}\rangle \\\langle\text{expr-statement}\rangle & \to & \langle\text{expr}\rangle\; \texttt{';'}\; \\\langle\text{expr}\rangle & \to & \langle\text{term}\rangle\; \\ & \to & \langle\text{expr}\rangle\; \texttt{'+'}\; \langle\text{term}\rangle\; \\ & \to & \langle\text{expr}\rangle\; \texttt{'-'}\; \langle\text{term}\rangle\; \\\langle\text{term}\rangle & \to & \langle\text{factor}\rangle\; \\ & \to & \langle\text{term}\rangle\; \texttt{'*'}\; \langle\text{factor}\rangle\; \\ & \to & \langle\text{term}\rangle\; \texttt{'/'}\; \langle\text{factor}\rangle\; \\\langle\text{factor}\rangle & \to & \langle\text{dec-literal}\rangle\; \\ & \to & \langle\text{hex-literal}\rangle\; \\ & \to & \langle\text{oct-literal}\rangle\; \\ & \to & \texttt{'('}\; \langle\text{expr}\rangle\; \texttt{')'}\; \\\end{array}\]The grammar can be rewritten into EBNF notation as follows:
\[\begin{array}{rcl}\text{translation-unit} & = & [\; \text{input-sequence} ] \; \texttt{EOI} \\\text{input-sequence} & = & \{\; \text{expr-statement}\; \} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{factor}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{factor} \} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]In EBNF you see that the rule for \(\text{translation-unit}\) can be fused with the rule for \(\text{input-sequence}\) to
\[\text{translation-unit} = [\;\{\; \text{expr-statement}\;\} ] \; \texttt{EOI}\]Obviously an optional repetition is just a repetition. So this can be simplified to
\[\text{translation-unit} = \{\; \text{expr-statement}\;\} \texttt{EOI}\]And with renaming it would be
\[\text{input-sequence} = [\;\{\; \text{expr-statement}\;\} ] \; \texttt{EOI}\]So we have:
\[\begin{array}{rcl}\text{input-sequence} & = & \{\; \text{expr-statement}\; \}\; \texttt{EOI} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{factor}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{factor} \} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]The parse function parseInputSequence() can be implemented with a simple while loop. It checks in its condition whether the current token is not EOI and parses in its loop body an expression statement.
Provided Material
Here the source file xtest_calc.c developed in the video:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 | #include <assert.h>
#include <stddef.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include "lexer.h"
#include "tokenkind.h"
// error handling
void
expectedError(const char *expectedStr)
{
fprintf(stderr, "%zu.%zu: error expected '%s' got '%s'\n", token.pos.line,
token.pos.col, expectedStr, strTokenKind(token.kind));
exit(1);
}
void
expected(enum TokenKind tokenKind)
{
if (tokenKind != token.kind) {
expectedError(strTokenKind(tokenKind));
}
}
// for debugging
void
printWithIndent(size_t indent, const char *str)
{
#ifndef NDEBUG
for (size_t i = 0; i < indent * 4; ++i) {
printf(" ");
}
printf("%s\n", str);
#endif // NDEBUG
}
// parse functions
void parseInputSequence(size_t);
void parseExprStatement(size_t);
uint64_t parseExpr(size_t);
uint64_t parseTerm(size_t);
uint64_t parseFactor(size_t);
void
parseInputSequence(size_t indent)
{
while (token.kind != EOI) {
parseExprStatement(indent + 1);
}
}
void
parseExprStatement(size_t indent)
{
printWithIndent(indent, "parseExprStatement");
printf("> %" PRIu64 "\n", parseExpr(indent + 1));
expected(SEMICOLON);
getToken();
}
uint64_t
parseExpr(size_t indent)
{
printWithIndent(indent, "parseExpr");
uint64_t val = parseTerm(indent + 1);
while (token.kind == PLUS || token.kind == MINUS) {
printWithIndent(indent, strTokenKind(token.kind));
enum TokenKind op = token.kind;
getToken();
uint64_t valRight = parseTerm(indent + 1);
if (op == PLUS) {
val += valRight;
} else {
val -= valRight;
}
}
return val;
}
uint64_t
parseTerm(size_t indent)
{
printWithIndent(indent, "parseTerm");
uint64_t val = parseFactor(indent + 1);
while (token.kind == ASTERISK || token.kind == SLASH) {
printWithIndent(indent, strTokenKind(token.kind));
enum TokenKind op = token.kind;
getToken();
uint64_t valRight = parseFactor(indent + 1);
if (op == ASTERISK) {
val *= valRight;
} else {
val /= valRight;
}
}
return val;
}
uint64_t
parseFactor(size_t indent)
{
printWithIndent(indent, "parseFactor");
uint64_t val = 0;
if (token.kind == DEC_LITERAL) {
printWithIndent(indent, token.val.cstr);
val = strtoull(token.val.cstr, 0, 10);
getToken();
return val;
} else if (token.kind == HEX_LITERAL) {
printWithIndent(indent, token.val.cstr);
val = strtoull(token.val.cstr, 0, 16);
getToken();
return val;
} else if (token.kind == OCT_LITERAL) {
printWithIndent(indent, token.val.cstr);
val = strtoull(token.val.cstr, 0, 8);
getToken();
return val;
} else if (token.kind == LPAREN) {
printWithIndent(indent, strTokenKind(token.kind));
getToken();
val = parseExpr(indent + 1);
expected(RPAREN);
printWithIndent(indent, strTokenKind(token.kind));
getToken();
return val;
} else {
expectedError("factor");
return val; // never reached
}
}
// test for the parse
int
main(void)
{
// we need a current token before we begin parsing
getToken();
parseInputSequence(0);
}
|
If you have fixed the makefile it should compile with a simple make like that:
theon$ make gcc -c -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c gcc -o xtest_calc xtest_calc.o lexer.o str.o tokenkind.o theon$ echo "12 * (3 + 4);" | ./xtest_calc parseExprStatement parseExpr parseTerm parseFactor 12 ASTERISK parseFactor LPAREN parseExpr parseTerm parseFactor 3 PLUS parseTerm parseFactor 4 RPAREN > 84 theon$
To get rid of the debug messages compile with CPPFLAGS=-DNDEBUG:
theon$ touch xtest_calc.c theon$ make CPPFLAGS=-DNDEBUG gcc -c -DNDEBUG -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c gcc -o xtest_calc xtest_calc.o lexer.o str.o tokenkind.o theon$ echo "12 * (3 + 4);" | ./xtest_calc > 84 theon$
The calculator is part of ABC compiler project. Here the headers and source files that we currently have:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | #ifndef ABC_LEXER_H
#define ABC_LEXER_H
#include <stddef.h>
#include "str.h"
#include "tokenkind.h"
enum TokenKind getToken(void);
struct Token
{
enum TokenKind kind;
struct TokenPos
{
size_t line, col;
} pos;
struct Str val;
};
extern struct Token token;
#endif // ABC_LEXER_H
|
#include <stdbool.h>
#include <stdio.h>
#include "lexer.h"
struct Token token;
//------------------------------------------------------------------------------
// position of current character ch
static struct TokenPos curr = {
1,
0,
};
static int ch;
static int
nextCh(void)
{
++curr.col;
ch = getchar();
if (ch == '\n') {
++curr.line;
curr.col = 0;
}
return ch;
}
static bool
isWhiteSpace(int ch)
{
return ch == ' ' || ch == '\t';
}
static bool
isDecDigit(int ch)
{
return ch >= '0' && ch <= '9';
}
static bool
isOctDigit(int ch)
{
return ch >= '0' && ch <= '7';
}
static bool
isHexDigit(int ch)
{
return isDecDigit(ch) || (ch >= 'a' && ch <= 'f') ||
(ch >= 'A' && ch <= 'F');
}
static bool
isLetter(int ch)
{
return ((ch >= 'a') && (ch <= 'z')) || ((ch >= 'A' && ch <= 'Z')) ||
ch == '_';
}
enum TokenKind
getToken(void)
{
unsigned long long val = 0;
// init ch, skip white spaces and newlines
while (ch == 0 || isWhiteSpace(ch) || ch == '\n') {
nextCh();
}
token.pos.line = curr.line;
token.pos.col = curr.col;
clearStr(&token.val);
if (ch == EOF) {
return token.kind = EOI;
} else if (isDecDigit(ch)) {
// parse literal
if (ch == '0') {
appendCharToStr(&token.val, ch);
nextCh();
if (ch == 'x') {
appendCharToStr(&token.val, ch);
nextCh();
if (isHexDigit(ch)) {
while (isHexDigit(ch)) {
appendCharToStr(&token.val, ch);
nextCh();
}
return token.kind = HEX_LITERAL;
}
return token.kind = BAD_TOKEN;
}
while (isOctDigit(ch)) {
appendCharToStr(&token.val, ch);
nextCh();
}
return token.kind = OCT_LITERAL;
} else if (isDecDigit(ch)) {
while (isDecDigit(ch)) {
appendCharToStr(&token.val, ch);
nextCh();
}
return token.kind = DEC_LITERAL;
}
} else if (ch == '+') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = PLUS;
} else if (ch == '-') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = MINUS;
} else if (ch == '*') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = ASTERISK;
} else if (ch == '/') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = SLASH;
} else if (ch == '%') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = PERCENT;
} else if (ch == '=') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = EQUAL;
} else if (ch == '(') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = LPAREN;
} else if (ch == ')') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = RPAREN;
} else if (ch == ';') {
appendCharToStr(&token.val, ch);
nextCh();
return token.kind = SEMICOLON;
} else if (isLetter(ch)) {
do {
appendCharToStr(&token.val, ch);
nextCh();
} while (isLetter(ch) || isDecDigit(ch));
return token.kind = IDENTIFIER;
}
nextCh();
return token.kind = BAD_TOKEN;
}
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | #include <stdio.h>
#include <stdlib.h>
#include "tokenkind.h"
const char *
strTokenKind(enum TokenKind tokenKind)
{
switch (tokenKind) {
case EOI:
return "EOI";
case BAD_TOKEN:
return "BAD_TOKEN";
case HEX_LITERAL:
return "HEX_LITERAL";
case OCT_LITERAL:
return "OCT_LITERAL";
case DEC_LITERAL:
return "DEC_LITERAL";
case PLUS:
return "PLUS";
case MINUS:
return "MINUS";
case ASTERISK:
return "ASTERISK";
case SLASH:
return "SLASH";
case PERCENT:
return "PERCENT";
case EQUAL:
return "EQUAL";
case LPAREN:
return "LPAREN";
case RPAREN:
return "RPAREN";
case SEMICOLON:
return "SEMICOLON";
case IDENTIFIER:
return "IDENTIFIER";
default:
fprintf(stderr, "internal error in strTokenKind: tokenKind = %d\n",
tokenKind);
exit(1);
return "";
}
}
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #ifndef ABC_TOKENKIND_H
#define ABC_TOKENKIND_H
enum TokenKind
{
EOI, // end of input
BAD_TOKEN,
HEX_LITERAL,
OCT_LITERAL,
DEC_LITERAL,
PLUS, // '+'
MINUS, // '-'
ASTERISK, // '*'
SLASH, // '/'
PERCENT, // '%'
EQUAL, // '='
LPAREN, // '('
RPAREN, // ')'
SEMICOLON, // ';'
IDENTIFIER,
};
const char *strTokenKind(enum TokenKind tokenKind);
#endif // ABC_TOKENKIND_H
|
The string class is not the solution for the previous quiz. It still uses a buffer with fixed size. If you have already solved the quiz then you of course can and should use your solution instead:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | #ifndef ABC_STR_H
#define ABC_STR_H
struct Str
{
char cstr[7];
char *end;
};
// set str->cstr to empty string
void clearStr(struct Str *str);
// append character to str->cstr
void appendCharToStr(struct Str *str, char c);
#endif // ABC_STR_H
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | #include <stdlib.h>
#include <stdio.h>
#include "str.h"
void
clearStr(struct Str *str)
{
*(str->end = str->cstr) = 0;
}
void
appendCharToStr(struct Str *str, char c)
{
// check if another character and 0 byte fits into string
if (str->end - str->cstr + 2 > sizeof(str->cstr)) {
fprintf(stderr, "error in appendCharToStr: string too long\n");
exit(1);
}
*str->end++ = c;
*str->end = 0;
}
|
Quiz 18: Extend the Calculator
Extend the calculator so that it supports:
-
Unary plus and minus expressions. An unary minus negates a value, and a unary plus has no effect.
For example:
-
---5 should be equal to -5,
-
+-+-5 should equal 5,
-
6 + -5 should equal 6 - 5.
-
-
An operator '^' for computing the power.
For example:
-
5^3 should compute 5 * 5 * 5,
-
5^-3 should compute 1 / (5 * 5 * 5).
-
The calculator currently uses unsigned integer arithmetic. Change it to double precision, i.e. use the type double for the value of an expression node. For computing the power you can use the function pow() defined in <math.h>. This will on some platform you explicitly have to link against the C math library. You either can use the option LDFLAGS=-lm when you run make or add
1 | LDFLAGS += -lm # extend value of LDFLAGS for '-lm'
|
as first line to your makefile.
Adapt the lexer so that it can detect '^' (caret) as token.
Grammar in EBNF
Use the following grammar in EBNF for the extension of the calculator:
\[\begin{array}{rcl}\text{input-sequence} & = & \{\; \text{expr-statement}\; \}\; \texttt{EOI} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{power-expr}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{power-expr} \} \\\text{power-expr} & = & \text{unary-expr}\; [\; \texttt{"^"}\; \text{power-expr} ]\; \\\text{unary-expr} & = & \text{factor}\; |\; (\; \texttt{"+"}\; |\; \texttt{"-"}\;)\; \text{unary-expr} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]Example Output
Here a file with some expression statements for testing the calculator:
2 + 5 / 3 * 2;
(2 + 5) / 3;
((10 - 213) * 25) + 27;
(7 - 3) * 4^3;
(1 + 1) * 1 - 2;
--1;
1--1^2;
1--1^3;
1++2^2;
4 + 1 * 3;
-1^2;
1-1^2;
1+-1^2;
And here what the calculator computes from that:
theon$ make CPPFLAGS=-DNDEBUG LDFLAGS=-lm xtest_calc gcc -c -DNDEBUG -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c gcc -o xtest_calc -lm xtest_calc.o lexer.o str.o tokenkind.o theon$ ./xtest_calc < test_calc.in > 5.333333 > 2.333333 > -5048.000000 > 256.000000 > 0.000000 > 1.000000 > 0.000000 > 2.000000 > 5.000000 > 7.000000 > 1.000000 > 0.000000 > 2.000000 theon$
Note that the grammar specifies that for example:
-
the expression “-5^2” should be evaluated like “(-5)^2”,
-
the expression “3-5^2” should be evaluated like “3-5^2”,
-
the expression “3--5^2” should be evaluated like “3-(-5)^2”.
Test such cases.
Submission
Submit your improved calculator implementation as follows:
1 | submit hpc quiz18 xtest_calc.c str.h str.c lexer.h lexer.c tokenkind.h tokenkind.c
|