Our textbook does not cover parsing. I’ve found the following resources helpful:
Here’s a grammar that describes RPN strings with no left recursion or common prefixes:
\[
\begin{align*}
S & \rightarrow \_ S \mid PS \mid DNT \mid \epsilon \\
T & \rightarrow \_ S \mid PS \mid \epsilon \\
N & \rightarrow DN \mid \epsilon \\
D & \rightarrow 0 \mid 1 \mid 2 \mid \cdots \mid 9 \\
P & \rightarrow + \mid - \mid * \mid /
\end{align*}
\]
Here, the terminals are \(\Sigma = \{0, 1, \ldots, 9, +, -, *, /, \_\}\), where the underscore \(\_\) is used as a visible representation of a space character. Let’s start to understand this by attaching some intuitive meaning to the nonterminals:
Let’s think about parsing an RPN expression using this grammar.
12 3 +
(see whiteboard for how this is parsed)
Notice that, by looking at the next (unprocessed) terminal in the input, it was always clear which rule to use. This is a good sign!
To write a parser, we need to formalize the process we went through above and work out which rule to apply when expanding each nonterminal under each possible lookahead symbol. In some (perhaps many) of these cases, no rule applies because that symbol can’t follow that nonterminal; in that case, we conclude that the string is not in the language and report a syntax error.
So our goal is a table with one row per nonterminal and one column per lookahead terminal:
| | \(P\) (any operator) | \(\_\) (a space) | \(D\) (any digit) | \(\$\) (end of input) |
|---|---|---|---|---|
| \(S\) | | | | |
| \(T\) | | | | |
| \(N\) | | | | |
| \(D\) | | | | |
| \(P\) | | | | |
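We’ve made a couple of adjustments here: the individual operator and digit terminals are grouped into single columns, labeled \(P\) and \(D\), and there is an extra column, \(\$\), for the end of the input.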
Intuitively, we knew which rule to apply when looking at the symbol because the right-hand side of whichever rule we apply must eventually begin with that symbol when all is said and done. For example:
Parsing the string `1_3_+` starting from \(S\), we know that \(S\) has to produce something beginning with a digit. The only production from \(S\) that can produce something that starts with a digit is \(S \rightarrow DNT\), so we know this must be the one to use.
We can formalize this by defining a set called \(FIRST\) of terminals that a derived string can start with:
Definition: A terminal \(x\) is a member of \(FIRST(A)\) if \(A \Rightarrow^* x\alpha\) for some string \(\alpha\) of terminals and nonterminals.
To calculate the FIRST set, we repeat the following procedure for each symbol \(X\) until nothing changes:
This works as long as there are no rules that produce \(\epsilon\). However, if \(Y_1 \Rightarrow^* \epsilon\), then whatever is in \(FIRST(Y_2)\) could also appear at the beginning of a string derived from \(X\). So, we need to add the following rule:
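Here is a minimal Python sketch of one common textbook formulation of this procedure (it is not the course's Racket code). A terminal's FIRST set is just itself. For each rule \(X \rightarrow Y_1 Y_2 \cdots Y_k\), we add \(FIRST(Y_1)\) to \(FIRST(X)\), and we continue on to \(Y_2\), \(Y_3\), and so on only while everything before can derive \(\epsilon\). The names `GRAMMAR`, `EPS`, and `first_sets` are made up for this sketch, and storing an \(\epsilon\) marker inside the FIRST set is a convention assumed here, not something required by the definition above.

```python
# The RPN grammar from these notes. "_" is the visible space, and an empty
# right-hand side [] stands for epsilon. All names here are illustrative.
EPS = ""  # assumed marker meaning "can derive the empty string"

GRAMMAR = {
    "S": [["_", "S"], ["P", "S"], ["D", "N", "T"], []],
    "T": [["_", "S"], ["P", "S"], []],
    "N": [["D", "N"], []],
    "D": [[d] for d in "0123456789"],
    "P": [[op] for op in "+-*/"],
}


def first_sets(grammar):
    """Repeat the FIRST rules for every production until nothing changes."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                before = len(first[a])
                all_nullable = True
                for y in rhs:
                    if y not in grammar:          # y is a terminal
                        first[a].add(y)
                        all_nullable = False
                        break
                    first[a] |= first[y] - {EPS}  # copy FIRST(y), minus epsilon
                    if EPS not in first[y]:       # y cannot vanish: stop here
                        all_nullable = False
                        break
                if all_nullable:                  # every symbol can vanish
                    first[a].add(EPS)
                if len(first[a]) != before:
                    changed = True
    return first


FIRST = first_sets(GRAMMAR)
for name in "STNDP":
    print(name, sorted(FIRST[name]))
```

With this grammar, `FIRST["S"]` ends up containing every terminal plus the \(\epsilon\) marker, while `FIRST["T"]` contains only the space, the operators, and the \(\epsilon\) marker.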
The above is sufficient for some grammars, but in general there are cases where the grammar is LL(1) but cannot be parsed using FIRST alone. This is because if you are trying to apply a rule to a nonterminal \(A\) and \(A \Rightarrow^* \epsilon\), then the next symbol may be one that follows \(A\) without being in \(FIRST(Y_i)\) for any \(Y_i\) on the right-hand side of any production from \(A\). To handle this, we also need \(FOLLOW(A)\), the set of terminals that can appear immediately after \(A\) in some derivation.
We won’t go into the details of how to construct \(FOLLOW\), but the process is similar to that for \(FIRST\), if slightly more intricate. You can find details in the resources linked at the top of the notes.
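For readers who want to see the shape of it anyway, here is a rough Python sketch of the usual fixed-point computation of FOLLOW (again, not the course's Racket code). It assumes the standard textbook rules: the end marker `$` follows the start symbol; for a rule \(A \rightarrow \alpha B \beta\), everything in \(FIRST(\beta)\) except \(\epsilon\) follows \(B\); and if \(\beta\) can derive \(\epsilon\), everything that follows \(A\) also follows \(B\). The helper names `first_of`, `first_sets`, and `follow_sets` are hypothetical.

```python
EPS, END = "", "$"  # assumed markers for epsilon and end of input

GRAMMAR = {
    "S": [["_", "S"], ["P", "S"], ["D", "N", "T"], []],
    "T": [["_", "S"], ["P", "S"], []],
    "N": [["D", "N"], []],
    "D": [[d] for d in "0123456789"],
    "P": [[op] for op in "+-*/"],
}


def first_of(seq, first, grammar):
    """FIRST of a sequence of symbols; includes EPS if the whole sequence can vanish."""
    out = set()
    for y in seq:
        if y not in grammar:        # terminal: the sequence must start with it
            out.add(y)
            return out
        out |= first[y] - {EPS}
        if EPS not in first[y]:
            return out
    out.add(EPS)
    return out


def first_sets(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                new = first_of(rhs, first, grammar)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def follow_sets(grammar, first, start="S"):
    follow = {a: set() for a in grammar}
    follow[start].add(END)          # end of input can follow the start symbol
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue    # only nonterminals have FOLLOW sets
                    tail = first_of(rhs[i + 1:], first, grammar)
                    new = tail - {EPS}
                    if EPS in tail:  # the rest can vanish: inherit FOLLOW(a)
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


FIRST = first_sets(GRAMMAR)
FOLLOW = follow_sets(GRAMMAR, FIRST)
for name in "STNDP":
    print(name, sorted(FOLLOW[name]))
```

For this grammar, only the end marker can follow \(S\) or \(T\), while \(N\) can be followed by a space, an operator, or the end of the input.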
Again abridging the details, you can use the contents of FIRST and FOLLOW to derive the table we were looking for above. Here is the result for the RPN grammar:
| | (any operator) | \(\_\) (a space) | (any digit) | \(\$\) (end of input) |
|---|---|---|---|---|
| \(S \rightarrow\) | \(PS\) | \(\_S\) | \(DNT\) | \(\epsilon\) |
| \(T \rightarrow\) | \(PS\) | \(\_S\) | error | \(\epsilon\) |
| \(N \rightarrow\) | \(\epsilon\) | \(\epsilon\) | \(DN\) | \(\epsilon\) |
| \(D \rightarrow\) | error | error | \(D\) | error |
| \(P \rightarrow\) | \(P\) | error | error | error |
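
To see how a table like this drives a parser, here is a small Python sketch of a stack-based recognizer that looks up (nonterminal, lookahead) pairs in the table. This is an illustration only: it is not how the course's Racket code is organized, the names `TABLE`, `column`, and `accepts` are invented for the example, and \(D\) and \(P\) are treated as "match one digit" and "match one operator" rather than getting rows of their own.

```python
# Rows for S, T and N, copied from the parse table. A missing entry means
# "error"; an empty list means "apply the epsilon rule".
TABLE = {
    "S": {"P": ["P", "S"], "_": ["_", "S"], "D": ["D", "N", "T"], "$": []},
    "T": {"P": ["P", "S"], "_": ["_", "S"], "$": []},
    "N": {"P": [], "_": [], "D": ["D", "N"], "$": []},
}


def column(ch):
    """Map a lookahead character to its table column, or None if it is not in the alphabet."""
    if ch in "+-*/":
        return "P"
    if ch == "_":
        return "_"
    if ch in "0123456789":
        return "D"
    if ch == "$":
        return "$"
    return None


def accepts(text):
    """Return True if text (with "_" standing for a space) is derivable from S."""
    tokens = list(text) + ["$"]      # append the end-of-input marker
    pos = 0
    stack = ["S"]                    # the top of the stack is the end of the list
    while stack:
        top = stack.pop()
        col = column(tokens[pos])
        if top in TABLE:             # nonterminal: look up the rule to apply
            rhs = TABLE[top].get(col)
            if rhs is None:
                return False         # empty table cell: syntax error
            stack.extend(reversed(rhs))
        elif top == col:             # "P", "D" or "_": consume one matching symbol
            pos += 1
        else:
            return False
    return pos == len(tokens) - 1    # everything except the marker was consumed


print(accepts("12_3_+"))
print(accepts("1a"))
```

The first call prints `True` and the second prints `False`, since `a` is not in the alphabet and so has no column in the table.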
See the code for the RPN parser. There are two versions:
- `rpn-LL.rkt` attempts to parse an input string. If it’s in the language, the parser returns `'()`; otherwise, a syntax error is thrown. It does this by having a function for each nonterminal; that function implements the logic encapsulated in the parse table above, making recursive calls for any nonterminals on the right-hand side of the rule to be applied, and returning any input that remains.
- `rpn-ast.rkt` builds a parse tree in the course of parsing. The setup is similar, except the functions return two items: the first is the value of the expression the nonterminal parsed to, and the second is (as before) any remaining input to be parsed.

To give an example, suppose we’re parsing the string `44_9_+`. At the top level, we’d call `S(44_9_+)` and this would result in the following call tree:
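| Call | Parsed | Rest |
|---|---|---|
| `S(44_9_+)` | (empty) | `44_9_+` |
| `D(44_9_+)` | `4` | `4_9_+` |
| `N(4_9_+)` | `4` | `4_9_+` |
| `D(4_9_+)` | `44` | `_9_+` |
| `N(_9_+)` | `44` | `_9_+` |
| `T(_9_+)` | `44_` | `9_+` |
| `S(9_+)` | `44_` | `9_+` |
| `D(9_+)` | `44_9` | `_+` |
| `N(_+)` | `44_9` | `_+` |
| `T(_+)` | `44_9` | `_+` |
| `S(_+)` | `44_9` | `_+` |
| `S(+)` | `44_9_` | `+` |
| `P(+)` | `44_9_+` | (empty) |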
In `rpn-LL`, the return value is just the Rest column above. So the call `D(9_+)` returns `_+`, and that’s what’s passed into the following call to `N`.
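The same idea can be sketched in Python: one function per nonterminal, each taking a list of characters and returning whatever input remains. This is only an analogue of the approach described for `rpn-LL`, not a transcription of the Racket file; the function names, the `peek` helper, and the use of `None` as the end-of-input lookahead are all assumptions made for the sketch.

```python
OPERATORS = set("+-*/")
DIGITS = set("0123456789")
END = None  # assumed lookahead value when no input remains


def peek(chars):
    """Return the next unprocessed symbol, or END if the input is used up."""
    return chars[0] if chars else END


def parse_S(chars):
    la = peek(chars)
    if la == "_":                    # S -> _ S
        return parse_S(chars[1:])
    if la in OPERATORS:              # S -> P S
        return parse_S(parse_P(chars))
    if la in DIGITS:                 # S -> D N T
        return parse_T(parse_N(parse_D(chars)))
    if la is END:                    # S -> epsilon
        return chars
    raise SyntaxError(f"unexpected symbol {la!r}")


def parse_T(chars):
    la = peek(chars)
    if la == "_":                    # T -> _ S
        return parse_S(chars[1:])
    if la in OPERATORS:              # T -> P S
        return parse_S(parse_P(chars))
    if la is END:                    # T -> epsilon
        return chars
    raise SyntaxError(f"unexpected symbol {la!r}")


def parse_N(chars):
    la = peek(chars)
    if la in DIGITS:                 # N -> D N
        return parse_N(parse_D(chars))
    if la == "_" or la in OPERATORS or la is END:
        return chars                 # N -> epsilon
    raise SyntaxError(f"unexpected symbol {la!r}")


def parse_D(chars):
    if peek(chars) in DIGITS:        # consume exactly one digit
        return chars[1:]
    raise SyntaxError("expected a digit")


def parse_P(chars):
    if peek(chars) in OPERATORS:     # consume exactly one operator
        return chars[1:]
    raise SyntaxError("expected an operator")


print(parse_D(list("9_+")))     # the input left over after one digit
print(parse_S(list("44_9_+")))  # an empty list means the whole string parsed
```

Here `parse_D(list("9_+"))` returns `['_', '+']`, matching the Rest column for that call, and a successful top-level parse returns the empty list, playing the role of `'()`.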
In `rpn-ast`, the return value is a list containing the parsed value, then the rest of the input symbols. So the call `D(9_+)` returns `(9, _+)`. To be precise about the details, the arguments and return values are themselves lists of characters, so the actual call is `D((#\9, #\_, #\+))` and the return value is `(9, #\_, #\+)`. Notice that the 9 is now an integer value, whereas the “rest of the input” part is still a list of characters.
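As a tiny illustration of that return shape, here is a hypothetical Python analogue of the digit function only. The name `parse_D_value` is invented, and the choice to put the value first in a single flat list simply mirrors the `(9, #\_, #\+)` example above; the actual Racket code may be organized differently.

```python
def parse_D_value(chars):
    """Consume one digit; return a list of its integer value followed by the remaining characters."""
    if chars and chars[0] in "0123456789":
        return [int(chars[0])] + chars[1:]
    raise SyntaxError("expected a digit")


result = parse_D_value(list("9_+"))
value, *rest = result           # split the parsed value from the remaining input
print(result)
print(type(value).__name__, rest)
```

The first element of `result` is the integer `9`, while the remaining elements are still the one-character strings `'_'` and `'+'`.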