Overview Home Back Next 1 2 3 Content of this session

CPW Pt.4 Recursive Descent Parser, CPW Pt.5 Unique Strings
Recursive Descent Parser for a Calculator
Data Structure for Unique Strings

Recursive Descent Parser for a Calculator

In this video a predictive recursive descent parser was implemented. This is only possible for so called LL(k) grammars. So it is not possible to parse all context free grammars with this type of a parser. Furthermore, the methodic shown for translating an EBNF grammar into a parser implementation even requires a LL(1) grammar. So with that we are limited to an even smaller subset. Fortunately most programming languages of practical interest can be parsed with a recursive descent parser.

Current Grammar for the Calculator

Below you find the grammar of the calculator developed in the video. Note that the production rule for \(\langle\text{translation-unit}\rangle\) gets kind of canceled out when it gets rewritten into EBNF.

\[\begin{array}{rcl}\langle\text{translation-unit}\rangle & \to & \texttt{EOI} \\ & \to & \langle\text{input-sequence}\rangle \; \texttt{EOI} \\\langle\text{input-sequence}\rangle & \to & \langle\text{expr-statement}\rangle\; \\ & \to & \langle\text{input-sequence}\rangle\; \langle\text{expr-statement}\rangle \\\langle\text{expr-statement}\rangle & \to & \langle\text{expr}\rangle\; \texttt{';'}\; \\\langle\text{expr}\rangle & \to & \langle\text{term}\rangle\; \\ & \to & \langle\text{expr}\rangle\; \texttt{'+'}\; \langle\text{term}\rangle\; \\ & \to & \langle\text{expr}\rangle\; \texttt{'-'}\; \langle\text{term}\rangle\; \\\langle\text{term}\rangle & \to & \langle\text{factor}\rangle\; \\ & \to & \langle\text{term}\rangle\; \texttt{'*'}\; \langle\text{factor}\rangle\; \\ & \to & \langle\text{term}\rangle\; \texttt{'/'}\; \langle\text{factor}\rangle\; \\\langle\text{factor}\rangle & \to & \langle\text{dec-literal}\rangle\; \\ & \to & \langle\text{hex-literal}\rangle\; \\ & \to & \langle\text{oct-literal}\rangle\; \\ & \to & \texttt{'('}\; \langle\text{expr}\rangle\; \texttt{')'}\; \\\end{array}\]

The grammar can be rewritten into EBNF notation as follows:

\[\begin{array}{rcl}\text{translation-unit} & = & [\; \text{input-sequence} ] \; \texttt{EOI} \\\text{input-sequence} & = & \{\; \text{expr-statement}\; \} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{factor}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{factor} \} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]

In EBNF you see that the rule for \(\text{translation-unit}\) can be fused with the rule for \(\text{input-sequence}\) to

\[\text{translation-unit} = [\;\{\; \text{expr-statement}\;\} ] \; \texttt{EOI}\]

Obviously an optional repetition is just a repetition. So this can be simplified to

\[\text{translation-unit} = \{\; \text{expr-statement}\;\} \texttt{EOI}\]

And with renaming it would be

\[\text{input-sequence} = [\;\{\; \text{expr-statement}\;\} ] \; \texttt{EOI}\]

So we have:

\[\begin{array}{rcl}\text{input-sequence} & = & \{\; \text{expr-statement}\; \}\; \texttt{EOI} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{factor}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{factor} \} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]

The parse function parseInputSequence() can be implemented with a simple while loop. It checks in its condition whether the current token is not EOI and parses in its loop body an expression statement.

Provided Material

Here the source file xtest_calc.c developed in the video:

session17/git/abc/xtest_calc.c

#include <assert.h>
#include <stddef.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#include "lexer.h"
#include "tokenkind.h"

// error handling

void
expectedError(const char *expectedStr)
{
    fprintf(stderr, "%zu.%zu: error expected '%s' got '%s'\n", token.pos.line,
            token.pos.col, expectedStr, strTokenKind(token.kind));
    exit(1);
}

void
expected(enum TokenKind tokenKind)
{
    if (tokenKind != token.kind) {
        expectedError(strTokenKind(tokenKind));
    }
}

// for debugging
void
printWithIndent(size_t indent, const char *str)
{
#ifndef NDEBUG
    for (size_t i = 0; i < indent * 4; ++i) {
        printf(" ");
    }
    printf("%s\n", str);
#endif // NDEBUG
}

// parse functions
void parseInputSequence(size_t);
void parseExprStatement(size_t);
uint64_t parseExpr(size_t);
uint64_t parseTerm(size_t);
uint64_t parseFactor(size_t);

void
parseInputSequence(size_t indent)
{
    while (token.kind != EOI) {
        parseExprStatement(indent + 1);
    }
}

void
parseExprStatement(size_t indent)
{
    printWithIndent(indent, "parseExprStatement");
    printf("> %" PRIu64 "\n", parseExpr(indent + 1));
    expected(SEMICOLON);
    getToken();
}

uint64_t
parseExpr(size_t indent)
{
    printWithIndent(indent, "parseExpr");
    uint64_t val = parseTerm(indent + 1);
    while (token.kind == PLUS || token.kind == MINUS) {
        printWithIndent(indent, strTokenKind(token.kind));
        enum TokenKind op = token.kind;
        getToken();
        uint64_t valRight = parseTerm(indent + 1);
        if (op == PLUS) {
            val += valRight;
        } else {
            val -= valRight;
        }
    }
    return val;
}

uint64_t
parseTerm(size_t indent)
{
    printWithIndent(indent, "parseTerm");
    uint64_t val = parseFactor(indent + 1);
    while (token.kind == ASTERISK || token.kind == SLASH) {
        printWithIndent(indent, strTokenKind(token.kind));
        enum TokenKind op = token.kind;
        getToken();
        uint64_t valRight = parseFactor(indent + 1);
        if (op == ASTERISK) {
            val *= valRight;
        } else {
            val /= valRight;
        }
    }
    return val;
}

uint64_t
parseFactor(size_t indent)
{
    printWithIndent(indent, "parseFactor");
    uint64_t val = 0;
    if (token.kind == DEC_LITERAL) {
        printWithIndent(indent, token.val.cstr);
        val = strtoull(token.val.cstr, 0, 10);
        getToken();
        return val;
    } else if (token.kind == HEX_LITERAL) {
        printWithIndent(indent, token.val.cstr);
        val = strtoull(token.val.cstr, 0, 16);
        getToken();
        return val;
    } else if (token.kind == OCT_LITERAL) {
        printWithIndent(indent, token.val.cstr);
        val = strtoull(token.val.cstr, 0, 8);
        getToken();
        return val;
    } else if (token.kind == LPAREN) {
        printWithIndent(indent, strTokenKind(token.kind));
        getToken();
        val = parseExpr(indent + 1);
        expected(RPAREN);
        printWithIndent(indent, strTokenKind(token.kind));
        getToken();
        return val;
    } else {
        expectedError("factor");
        return val; // never reached
    }
}


// test for the parse

int
main(void)
{
    // we need a current token before we begin parsing
    getToken();
    parseInputSequence(0);
}

If you have fixed the makefile it should compile with a simple make like that:

theon$ make
gcc -c   -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c
gcc -o xtest_calc  xtest_calc.o lexer.o str.o tokenkind.o
theon$ echo "12 * (3 + 4);" | ./xtest_calc
    parseExprStatement
        parseExpr
            parseTerm
                parseFactor
                12
            ASTERISK
                parseFactor
                LPAREN
                    parseExpr
                        parseTerm
                            parseFactor
                            3
                    PLUS
                        parseTerm
                            parseFactor
                            4
                RPAREN
> 84
theon$

To get rid of the debug messages compile with CPPFLAGS=-DNDEBUG:

theon$ touch xtest_calc.c
theon$ make CPPFLAGS=-DNDEBUG
gcc -c -DNDEBUG  -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c
gcc -o xtest_calc  xtest_calc.o lexer.o str.o tokenkind.o
theon$ echo "12 * (3 + 4);" | ./xtest_calc
> 84
theon$

The calculator is part of ABC compiler project. Here the headers and source files that we currently have:

session17/git/abc/lexer.h

#ifndef ABC_LEXER_H
#define ABC_LEXER_H

#include <stddef.h>

#include "str.h"
#include "tokenkind.h"

enum TokenKind getToken(void);

struct Token
{
    enum TokenKind kind;
    struct TokenPos
    {
        size_t line, col;
    } pos;
    struct Str val;
};

extern struct Token token;

#endif // ABC_LEXER_H

session17/git/abc/lexer.c

#include <stdbool.h>
#include <stdio.h>

#include "lexer.h"

struct Token token;

//------------------------------------------------------------------------------

// position of current character ch
static struct TokenPos curr = {
    1,
    0,
};

static int ch;

static int
nextCh(void)
{
    ++curr.col;
    ch = getchar();
    if (ch == '\n') {
        ++curr.line;
        curr.col = 0;
    }
    return ch;
}

static bool
isWhiteSpace(int ch)
{
    return ch == ' ' || ch == '\t';
}

static bool
isDecDigit(int ch)
{
    return ch >= '0' && ch <= '9';
}

static bool
isOctDigit(int ch)
{
    return ch >= '0' && ch <= '7';
}

static bool
isHexDigit(int ch)
{
    return isDecDigit(ch) || (ch >= 'a' && ch <= 'f') ||
           (ch >= 'A' && ch <= 'F');
}

static bool
isLetter(int ch)
{
    return ((ch >= 'a') && (ch <= 'z')) || ((ch >= 'A' && ch <= 'Z')) ||
           ch == '_';
}

enum TokenKind
getToken(void)
{
    unsigned long long val = 0;

    // init ch, skip white spaces and newlines
    while (ch == 0 || isWhiteSpace(ch) || ch == '\n') {
        nextCh();
    }

    token.pos.line = curr.line;
    token.pos.col = curr.col;

    clearStr(&token.val);

    if (ch == EOF) {
        return token.kind = EOI;
    } else if (isDecDigit(ch)) {
        // parse literal
        if (ch == '0') {
            appendCharToStr(&token.val, ch);
            nextCh();
            if (ch == 'x') {
                appendCharToStr(&token.val, ch);
                nextCh();
                if (isHexDigit(ch)) {
                    while (isHexDigit(ch)) {
                        appendCharToStr(&token.val, ch);
                        nextCh();
                    }
                    return token.kind = HEX_LITERAL;
                }
                return token.kind = BAD_TOKEN;
            }
            while (isOctDigit(ch)) {
                appendCharToStr(&token.val, ch);
                nextCh();
            }
            return token.kind = OCT_LITERAL;
        } else if (isDecDigit(ch)) {
            while (isDecDigit(ch)) {
                appendCharToStr(&token.val, ch);
                nextCh();
            }
            return token.kind = DEC_LITERAL;
        }
    } else if (ch == '+') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = PLUS;
    } else if (ch == '-') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = MINUS;
    } else if (ch == '*') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = ASTERISK;
    } else if (ch == '/') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = SLASH;
    } else if (ch == '%') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = PERCENT;
    } else if (ch == '=') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = EQUAL;
    } else if (ch == '(') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = LPAREN;
    } else if (ch == ')') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = RPAREN;
    } else if (ch == ';') {
        appendCharToStr(&token.val, ch);
        nextCh();
        return token.kind = SEMICOLON;
    } else if (isLetter(ch)) {
        do {
            appendCharToStr(&token.val, ch);
            nextCh();
        } while (isLetter(ch) || isDecDigit(ch));
        return token.kind = IDENTIFIER;
    }

    nextCh();
    return token.kind = BAD_TOKEN;
}

session17/git/abc/tokenkind.c

#include <stdio.h>
#include <stdlib.h>

#include "tokenkind.h"

const char *
strTokenKind(enum TokenKind tokenKind)
{
    switch (tokenKind) {
        case EOI:
            return "EOI";
        case BAD_TOKEN:
            return "BAD_TOKEN";
        case HEX_LITERAL:
            return "HEX_LITERAL";
        case OCT_LITERAL:
            return "OCT_LITERAL";
        case DEC_LITERAL:
            return "DEC_LITERAL";
        case PLUS:
            return "PLUS";
        case MINUS:
            return "MINUS";
        case ASTERISK:
            return "ASTERISK";
        case SLASH:
            return "SLASH";
        case PERCENT:
            return "PERCENT";
        case EQUAL:
            return "EQUAL";
        case LPAREN:
            return "LPAREN";
        case RPAREN:
            return "RPAREN";
        case SEMICOLON:
            return "SEMICOLON";
        case IDENTIFIER:
            return "IDENTIFIER";
        default:
            fprintf(stderr, "internal error in strTokenKind: tokenKind = %d\n",
                    tokenKind);
            exit(1);
            return "";
    }
}

session17/git/abc/tokenkind.h

#ifndef ABC_TOKENKIND_H
#define ABC_TOKENKIND_H

enum TokenKind
{
    EOI,            // end of input
    BAD_TOKEN,
    HEX_LITERAL,
    OCT_LITERAL,
    DEC_LITERAL,
    PLUS,           // '+'
    MINUS,          // '-'
    ASTERISK,       // '*'
    SLASH,          // '/'
    PERCENT,        // '%'
    EQUAL,          // '='
    LPAREN,         // '('
    RPAREN,         // ')'
    SEMICOLON,      // ';'
    IDENTIFIER,
};

const char *strTokenKind(enum TokenKind tokenKind);

#endif // ABC_TOKENKIND_H

The string class is not the solution for the previous quiz. It still uses a buffer with fixed size. If you have already solved the quiz then you of course can and should use your solution instead:

session17/git/abc/str.h

#ifndef ABC_STR_H
#define ABC_STR_H

struct Str
{
    char cstr[7];
    char *end;
};

// set str->cstr to empty string
void clearStr(struct Str *str);

// append character to str->cstr
void appendCharToStr(struct Str *str, char c);

#endif // ABC_STR_H

session17/git/abc/str.c

#include <stdlib.h>
#include <stdio.h>

#include "str.h"

void
clearStr(struct Str *str)
{
    *(str->end = str->cstr) = 0;
}

void
appendCharToStr(struct Str *str, char c)
{
    // check if another character and 0 byte fits into string
    if (str->end - str->cstr + 2 > sizeof(str->cstr)) {
        fprintf(stderr, "error in appendCharToStr: string too long\n");
        exit(1);
    }

    *str->end++ = c;
    *str->end = 0;
}

Quiz 18: Extend the Calculator

Extend the calculator so that it supports:

Unary plus and minus expressions. An unary minus negates a value, and a unary plus has no effect.

For example:
- ---5 should be equal to -5,
- +-+-5 should equal 5,
- 6 + -5 should equal 6 - 5.
An operator '^' for computing the power.

For example:
- 5^3 should compute 5 * 5 * 5,
- 5^-3 should compute 1 / (5 * 5 * 5).

The calculator currently uses unsigned integer arithmetic. Change it to double precision, i.e. use the type double for the value of an expression node. For computing the power you can use the function pow() defined in <math.h>. This will on some platform you explicitly have to link against the C math library. You either can use the option LDFLAGS=-lm when you run make or add

1	LDFLAGS += -lm # extend value of LDFLAGS for '-lm'

as first line to your makefile.

Adapt the lexer so that it can detect '^' (caret) as token.

Grammar in EBNF

Use the following grammar in EBNF for the extension of the calculator:

\[\begin{array}{rcl}\text{input-sequence} & = & \{\; \text{expr-statement}\; \}\; \texttt{EOI} \\\text{expr-statement} & = & \text{expr}\; \texttt{;}\; \\\text{expr} & = & \text{term}\; \{\; (\; \texttt{"+"}\; |\; \texttt{"-"}\; )\; \text{term} \} \\\text{term} & = & \text{power-expr}\; \{\; (\; \texttt{"*"}\; |\; \texttt{"/"}\; )\; \text{power-expr} \} \\\text{power-expr} & = & \text{unary-expr}\; [\; \texttt{"^"}\; \text{power-expr} ]\; \\\text{unary-expr} & = & \text{factor}\; |\; (\; \texttt{"+"}\; |\; \texttt{"-"}\;)\; \text{unary-expr} \\\text{factor} & = & \text{dec-literal}\; \\ & | & \text{hex-literal}\; \\ & | & \text{oct-literal}\; \\ & | & \texttt{"("}\; \text{expr}\; \texttt{")"}\; \\\end{array}\]

Example Output

Here a file with some expression statements for testing the calculator:

session17/sol/abc/test_calc.in

2 + 5 / 3 * 2;
(2 + 5) / 3;
((10 - 213) * 25) + 27;
(7 - 3) * 4^3;
(1 + 1) * 1 - 2;
--1;
1--1^2;
1--1^3;
1++2^2;
4 + 1 * 3;
-1^2;
1-1^2;
1+-1^2;

And here what the calculator computes from that:

theon$ make CPPFLAGS=-DNDEBUG LDFLAGS=-lm xtest_calc
gcc -c -DNDEBUG  -MT xtest_calc.o -MMD -MP -MF xtest_calc.c.d xtest_calc.c
gcc -o xtest_calc -lm xtest_calc.o lexer.o str.o tokenkind.o
theon$ ./xtest_calc < test_calc.in
> 5.333333
> 2.333333
> -5048.000000
> 256.000000
> 0.000000
> 1.000000
> 0.000000
> 2.000000
> 5.000000
> 7.000000
> 1.000000
> 0.000000
> 2.000000
theon$

Note that the grammar specifies that for example:

the expression “-5^2” should be evaluated like “(-5)^2”,
the expression “3-5^2” should be evaluated like “3-5^2”,
the expression “3--5^2” should be evaluated like “3-(-5)^2”.

Test such cases.

Submission

Submit your improved calculator implementation as follows:

submit hpc quiz18 xtest_calc.c str.h str.c lexer.h lexer.c tokenkind.h tokenkind.c