CSCI 301 L30 Notes

Lecture 30 - Notes

Goals

Know how to compute the NULLABLE and FIRST sets for basic grammars.
Be prepared to implement an LL(1) recursive descent parser given a grammar and LL(1) parse table.

Announcements

Tonight: Week 7 Survey, A7, end of extended grace period for Lab 7

Resources

Our textbook does not cover parsing. I’ve found the following resources helpful:

Handout - Context-Free Grammars
Handout - Top-Down Parsing
The book Compilers: Principles, Techniques, and Tools, by Aho, Lam, Sethi, and Ullman.

Reverse Polish Notation

A Grammar for RPN

Here’s a grammar for RPN strings that has no left recursion and no common prefixes:

\[ \begin{align*} S & \rightarrow \_ S \mid PS \mid DNT \mid \epsilon \\ T & \rightarrow \_ S \mid PS \mid \epsilon \\ N & \rightarrow DN \mid \epsilon \\ D & \rightarrow 0 \mid 1 \mid 2 \mid \cdots \mid 9 \\ P & \rightarrow + \mid - \mid * \mid / \\ \end{align*} \]

Here, the terminals are $\Sigma = \{0, 1, \ldots, 9, +, -, *, /, \_\}$, where the underscore $\_$ is used as a visible representation of a space character. Let’s start to understand this by attaching some intuitive meaning to the nonterminals:

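S is the start symbol; it derives a list of numbers and operators, possibly separated by spaces
T is whatever comes right after a number: a space or an operator followed by more of the list, or nothing at all
N is the rest of a number after its first digit; it has zero or more digits (so DN is a number with one or more digits)
D is a single digit
P is an operator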

Let’s think about parsing an RPN expression using this grammar.

12 3 +

(see whiteboard for how this is parsed)

Notice that, by looking at the next (unprocessed) terminal in the input, it was always clear which rule to use. This is a good sign!
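
As a stand-in for the whiteboard work, here is one way to write out the leftmost derivation of 12_3_+ under the grammar above. Each step expands the leftmost nonterminal using the rule picked out by the next unread input symbol:

\[ \begin{align*} S & \Rightarrow DNT \Rightarrow 1NT \Rightarrow 1DNT \Rightarrow 12NT \Rightarrow 12T \\ & \Rightarrow 12\_S \Rightarrow 12\_DNT \Rightarrow 12\_3NT \Rightarrow 12\_3T \\ & \Rightarrow 12\_3\_S \Rightarrow 12\_3\_PS \Rightarrow 12\_3\_{+}S \Rightarrow 12\_3\_{+} \end{align*} \]

For instance, at $12NT$ the next unread symbol is $\_$, which is not a digit, so $N \rightarrow \epsilon$ is the rule to use.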

LL(1) Parse Tables

To write a parser, we need to formalize the process we went through above: for every nonterminal and every possible lookahead symbol, figure out which rule to apply when expanding that nonterminal. In some (perhaps many) of these cases there is no rule to apply, because that symbol can’t appear next while that nonterminal is being expanded; in those cases, we conclude that the string is not in the language and report a syntax error.

So our goal is a table with a row for each nonterminal and a column for each terminal:

	$P$ (any operator)	$\_$ (a space)	$D$ (any digit)	$\$$ (end of input)
$S$
$T$
$N$
$D$
$P$

We’ve made a couple adjustments here:

The operator and digit terminals are just represented by $P$ and $D$, respectively, because (in the case of $D$, for example) the rule to apply is the same regardless of which digit.
We’ve added a column for a special symbol $\$$, which is not in the alphabet $\Sigma$, but simply signifies the end of the input string. This will tell us when we’ve finished parsing input.

FIRST

Intuitively, we knew which rule to apply when looking at the next symbol because the right-hand side of whichever rule we apply must eventually begin with that symbol when all is said and done. For example:

Parsing the string 1_3_+ starting from $S$, we know that $S$ has to derive something beginning with a digit. The only production from $S$ that can derive something that starts with a digit is $S \rightarrow DNT$, so we know this must be the one to use.

We can formalize this by defining a set called $FIRST$ of terminals that a derived string can start with:

Definition: A terminal $x$ is a member of $FIRST(A)$ if $A \Rightarrow^* x\alpha$ for some string $\alpha$ of terminals and nonterminals.

To calculate the FIRST set, we repeat the following procedure for each symbol $X$ until nothing changes:

If $X$ is a terminal, then $FIRST(X) = \{X\}$
If $X$ is a nonterminal and there is a production $X \rightarrow Y_1Y_2\ldots Y_k$, add $FIRST(Y_1)$ to $FIRST(X)$.

This works as long as there are no rules that produce $\epsilon$. However, if $Y_1 \Rightarrow^* \epsilon$, then whatever is in $FIRST(Y_2)$ could also appear at the beginning of a string derived from $X$. So, we need to add the following rule:

If $Y_1, \ldots, Y_i$ are all nullable (that is, each of them can derive $\epsilon$) for some $i < k$, then also add $FIRST(Y_{i+1})$ to $FIRST(X)$.
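
The procedure above can be sketched as a pair of fixed-point loops. This is a minimal Python sketch, not course code: the dictionary encoding of the grammar and the names `GRAMMAR`, `compute_nullable`, and `compute_first` are assumptions made for illustration.

```python
# Sketch: NULLABLE and FIRST for the RPN grammar via fixed-point iteration.
# Assumed encoding (not from the course code): each production body is a tuple
# of one-character symbols; the dict keys are the nonterminals; () is epsilon.
GRAMMAR = {
    "S": [("_", "S"), ("P", "S"), ("D", "N", "T"), ()],
    "T": [("_", "S"), ("P", "S"), ()],
    "N": [("D", "N"), ()],
    "D": [(d,) for d in "0123456789"],
    "P": [(op,) for op in "+-*/"],
}


def compute_nullable(grammar):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # all() over an empty body is True, which covers X -> epsilon.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


def compute_first(grammar, nullable):
    """Return a dict mapping each nonterminal to its FIRST set."""
    first = {head: set() for head in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for sym in body:
                    # FIRST of a terminal is just the terminal itself.
                    new = first[sym] if sym in grammar else {sym}
                    if not new <= first[head]:
                        first[head] |= new
                        changed = True
                    # Only look past sym if it can derive epsilon.
                    if sym not in nullable:
                        break
    return first


NULLABLE = compute_nullable(GRAMMAR)
FIRST = compute_first(GRAMMAR, NULLABLE)
print(sorted(NULLABLE))    # ['N', 'S', 'T']
print(sorted(FIRST["T"]))  # ['*', '+', '-', '/', '_']
```

Both loops stop as soon as a full pass adds nothing new, which is the "until nothing changes" condition.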

FOLLOW

The above is sufficient for some grammars, but in general there are grammars that are LL(1) yet cannot be parsed using FIRST alone. This is because if you are trying to apply a rule to a nonterminal $A$ and $A \Rightarrow^* \epsilon$, then the next input symbol may be one that follows $A$, and such a symbol need not be in $FIRST(Y_i)$ for any $Y_i$ on the right-hand side of any production from $A$.

We won’t go into the details of how to construct $FOLLOW$, but the process is similar to, if slightly more intricate than, the one for $FIRST$. You can find details in the resources listed at the top of the notes.

Again abridging the details, you can use the contents of FIRST and FOLLOW to derive the table we were looking for above. Here is the result for the RPN grammar:

	(any operator)	$\_$ (a space)	(any digit)	$\$$ (end of input)
$S \rightarrow$	$PS$	$\_S$	$DNT$	$\epsilon$
$T \rightarrow$	$PS$	$\_S$	error	$\epsilon$
$N \rightarrow$	$\epsilon$	$\epsilon$	$DN$	$\epsilon$
$D \rightarrow$	error	error	$D$	error
$P \rightarrow$	$P$	error	error	error
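
To see how a table like this gets consulted, here is a minimal Python sketch of a generic table-driven loop with an explicit stack. It is only an illustration: the dictionary encoding, the lookahead classes `"op"` and `"digit"`, and the names `TABLE`, `classify`, and `accepts` are all assumptions, and a missing entry stands for "error".

```python
# Sketch: the parse table as a dict, driven by an explicit stack.
# Keys are (nonterminal, lookahead class); values are the right-hand side
# to push. Any (nonterminal, lookahead) pair not listed is an error entry.
TABLE = {
    ("S", "op"): ["P", "S"], ("S", "_"): ["_", "S"],
    ("S", "digit"): ["D", "N", "T"], ("S", "$"): [],
    ("T", "op"): ["P", "S"], ("T", "_"): ["_", "S"], ("T", "$"): [],
    ("N", "op"): [], ("N", "_"): [], ("N", "digit"): ["D", "N"], ("N", "$"): [],
    ("D", "digit"): ["digit"],
    ("P", "op"): ["op"],
}
NONTERMINALS = {"S", "T", "N", "D", "P"}


def classify(ch):
    """Map one input character to its column in the table."""
    if ch in "0123456789":
        return "digit"
    if ch in "+-*/":
        return "op"
    if ch == "_":
        return "_"
    return "other"  # not in the alphabet, so never matches a table entry


def accepts(text):
    """Return True if text can be derived from S according to TABLE."""
    tokens = [classify(ch) for ch in text] + ["$"]
    stack = ["$", "S"]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            body = TABLE.get((top, look))
            if body is None:
                return False          # error entry in the table
            stack.extend(reversed(body))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1                  # matched a terminal (or the final $)
        else:
            return False
    return pos == len(tokens)


print(accepts("44_9_+"))  # True
print(accepts("4a"))      # False
```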

Recursive Descent Parsing

See the code for the RPN parser. There are two versions:

rpn-LL.rkt attempts to parse an input string. If it’s in the language, the parser returns '(); otherwise, a syntax error is thrown. It does this by having a function for each nonterminal; that function implements the logic encapsulated in the parse table above, making recursive calls for any nonterminals on the right-hand side of the rule to be applied, and returning any input that remains.
rpn-ast.rkt builds a parse tree in the course of parsing. The setup is similar, except the functions return two items: the first is the value that the nonterminal’s portion of the input parsed to, and the second is (as before) any remaining input to be parsed.

To give an example, suppose we’re parsing the string 44_9_+. At the top level, we’d call S(44_9_+) and this would result in the following call tree:

call                       Parsed   Rest
S(44_9_+)       (empty)   44_9_+
  D(44_9_+)           4   4_9_+
  N(4_9_+)            4   4_9_+
    D(4_9_+)         44   _9_+
    N(_9_+)          44   _9_+   
  T(_9_+)           44_   9_+
    S(9_+)          44_   9_+
      D(9_+)       44_9   _+
      N(_+)        44_9   _+
      T(_+)        44_9   _+
        S(+)      44_9_   +
          P(+)   44_9_+   (empty)
          S()    44_9_+   (empty)

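In rpn-LL, each call returns whatever input remains once that call has finished. For a call that only consumes a single terminal, such as D, that’s exactly the Rest column above. So the call D(9_+) returns _+, and that’s what’s passed into the following call to N.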
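
Here is a minimal Python sketch of a recognizer in this style: one function per nonterminal, each returning the remaining input. It is not the Racket code from rpn-LL.rkt; the function names and the use of Python strings (rather than lists of characters) and `None` (rather than $\$$) for end of input are assumptions.

```python
# Sketch of a recursive descent recognizer: one function per nonterminal,
# each taking the remaining input and returning what is left afterwards.
DIGITS = set("0123456789")
OPS = set("+-*/")


def peek(rest):
    """Return the lookahead symbol, or None at end of input."""
    return rest[0] if rest else None


def parse_S(rest):
    look = peek(rest)
    if look in OPS:                    # S -> P S
        return parse_S(parse_P(rest))
    if look == "_":                    # S -> _ S
        return parse_S(rest[1:])
    if look in DIGITS:                 # S -> D N T
        return parse_T(parse_N(parse_D(rest)))
    if look is None:                   # S -> epsilon
        return rest
    raise SyntaxError(f"S: unexpected {look!r}")


def parse_T(rest):
    look = peek(rest)
    if look in OPS:                    # T -> P S
        return parse_S(parse_P(rest))
    if look == "_":                    # T -> _ S
        return parse_S(rest[1:])
    if look is None:                   # T -> epsilon
        return rest
    raise SyntaxError(f"T: unexpected {look!r}")


def parse_N(rest):
    look = peek(rest)
    if look in DIGITS:                 # N -> D N
        return parse_N(parse_D(rest))
    if look in OPS or look == "_" or look is None:  # N -> epsilon
        return rest
    raise SyntaxError(f"N: unexpected {look!r}")


def parse_D(rest):
    if peek(rest) in DIGITS:           # D -> 0 | ... | 9
        return rest[1:]
    raise SyntaxError(f"D: expected a digit, got {peek(rest)!r}")


def parse_P(rest):
    if peek(rest) in OPS:              # P -> + | - | * | /
        return rest[1:]
    raise SyntaxError(f"P: expected an operator, got {peek(rest)!r}")


print(repr(parse_D("9_+")))     # '_+'
print(repr(parse_S("44_9_+")))  # ''
```

Each `if` branch corresponds to one cell of the parse table, and the final `raise` covers the error cells.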

In rpn-ast, the return value is a list containing the parsed value, then the rest of the input symbols. So the call D(9_+) returns (9, _+). To be precise about the details, the arguments and return values are themselves lists of characters, so the actual call is D((#\9, #\_, #\+)) and the return value is (9, #\_, #\+). Notice that the 9 is now an integer value, whereas the “rest of the input” part is still a list of characters.
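
To make the "value plus rest" idea concrete, here is a minimal Python sketch covering only the digit-related nonterminals. It is not the code from rpn-ast.rkt: the (value, rest) tuple shape, the choice to have N return a list of digit values, and the helper `parse_number` are assumptions, and N’s lookahead checking is left out for brevity.

```python
# Sketch: functions that return (value, remaining input) instead of just the
# remaining input. Input is a list of characters, as in the description above.
DIGITS = set("0123456789")


def parse_D(rest):
    """D -> 0 | ... | 9. Return (digit as an int, remaining characters)."""
    if rest and rest[0] in DIGITS:
        return int(rest[0]), rest[1:]
    raise SyntaxError("expected a digit")


def parse_N(rest):
    """N -> D N | epsilon. Return (list of digit values, remaining characters)."""
    if rest and rest[0] in DIGITS:
        digit, rest = parse_D(rest)
        more, rest = parse_N(rest)
        return [digit] + more, rest
    return [], rest  # epsilon: no digits consumed


def parse_number(rest):
    """Hypothetical helper: the D N prefix of S -> D N T as one integer."""
    value, rest = parse_D(rest)
    digits, rest = parse_N(rest)
    for digit in digits:
        value = value * 10 + digit
    return value, rest


print(parse_D(list("9_+")))          # (9, ['_', '+'])
print(parse_number(list("44_9_+")))  # (44, ['_', '9', '_', '+'])
```

As in the description above, the first component is an integer while the second is still a list of characters.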
