CSCI 301 L30 Notes

Lecture 30 - Notes

Goals

Announcements

Resources

Our textbook does not cover parsing. I’ve found the following resources helpful:

Reverse Polish Notation

A Grammar for RPN

Here’s a grammar that describes RPN strings with no left recursion or common prefixes:

\[ \begin{align*}
S & \rightarrow \_ S \mid PS \mid DNT \mid \epsilon \\
T & \rightarrow \_ S \mid PS \mid \epsilon \\
N & \rightarrow DN \mid \epsilon \\
D & \rightarrow 0 \mid 1 \mid 2 \mid \cdots \mid 9 \\
P & \rightarrow + \mid - \mid * \mid /
\end{align*} \]

Here, the terminals are \(\Sigma = \{0, 1, \ldots, 9, +, -, *, /, \_\}\), where the underscore \(\_\) is used as a visible representation of a space character. Let’s start to understand this by attaching some intuitive meaning to the nonterminals:

Let’s think about parsing an RPN expression using this grammar.

12 3 +

(see whiteboard for how this is parsed)

Notice that, by looking at the next (unprocessed) terminal in the input, it was always clear which rule to use. This is a good sign!
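As a stand-in for the whiteboard work, here is one way to write out the leftmost derivation of 12_3_+ (the example above, with \(\_\) for each space). This is a reconstruction worked directly from the grammar, not a transcript of what was drawn in lecture; the right-hand column records the lookahead symbol and the rule it selects.

\[ \begin{align*}
S & \Rightarrow DNT && \text{lookahead } 1: \; S \rightarrow DNT \\
  & \Rightarrow 1NT && \text{lookahead } 1: \; D \rightarrow 1 \\
  & \Rightarrow 1DNT && \text{lookahead } 2: \; N \rightarrow DN \\
  & \Rightarrow 12NT && \text{lookahead } 2: \; D \rightarrow 2 \\
  & \Rightarrow 12T && \text{lookahead } \_: \; N \rightarrow \epsilon \\
  & \Rightarrow 12\_S && \text{lookahead } \_: \; T \rightarrow \_S \\
  & \Rightarrow 12\_DNT && \text{lookahead } 3: \; S \rightarrow DNT \\
  & \Rightarrow 12\_3NT && \text{lookahead } 3: \; D \rightarrow 3 \\
  & \Rightarrow 12\_3T && \text{lookahead } \_: \; N \rightarrow \epsilon \\
  & \Rightarrow 12\_3\_S && \text{lookahead } \_: \; T \rightarrow \_S \\
  & \Rightarrow 12\_3\_PS && \text{lookahead } +: \; S \rightarrow PS \\
  & \Rightarrow 12\_3\_{+}S && \text{lookahead } +: \; P \rightarrow + \\
  & \Rightarrow 12\_3\_{+} && \text{end of input}: \; S \rightarrow \epsilon
\end{align*} \]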

LL(1) Parse Tables

To write a parser, we need to formalize the process we went through above: for each nonterminal and each possible lookahead symbol, we decide which rule to apply. In some (perhaps many) of these cases no rule applies, because that symbol can’t follow that nonterminal; in that case we conclude that the string is not in the language and report a syntax error.

So our goal is a table with one row per nonterminal and one column per lookahead symbol:

        \(P\) (any operator)   \(\_\) (a space)   \(D\) (any digit)   \(\$\) (end of input)
\(S\)
\(T\)
\(N\)
\(D\)
\(P\)

We’ve made a couple adjustments here:

FIRST

Intuitively, we knew which rule to apply when looking at the symbol because the right-hand side of whichever rule we apply must eventually begin with that symbol when all is said and done. For example:

Parsing the string 1_3_+ starting from \(S\), we know that \(S\) has to produce something beginning with a digit. The only production from \(S\) that can produce something that starts with a digit is \(S \rightarrow DNT\), so we know this must be the one to use.

We can formalize this by defining a set called \(FIRST\) of terminals that a derived string can start with:

Definition: A terminal \(x\) is a member of \(FIRST(A)\) if \(A \Rightarrow^* x\alpha\) for some string \(\alpha\) of terminals and nonterminals.

To calculate the FIRST set, we repeat the following procedure for each symbol \(X\) until nothing changes:

  1. If \(X\) is a terminal, then \(FIRST(X) = \{X\}\)
  2. If \(X\) is a nonterminal and there is a production \(X \rightarrow Y_1Y_2\ldots Y_k\), add \(FIRST(Y_1)\) to \(FIRST(X)\).

This works as long as there are no rules that produce \(\epsilon\). However, if \(Y_1 \Rightarrow^* \epsilon\), then whatever is in \(FIRST(Y_2)\) could also appear at the beginning of a string derived from \(X\). So, we need to add the following rule:

  3. If \(Y_1, \ldots, Y_i\) are all nullable (each can derive \(\epsilon\)) for some \(i < k\), add \(FIRST(Y_{i+1})\) to \(FIRST(X)\).
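The procedure above can be sketched in Python as a fixed-point loop. This is an illustrative sketch rather than the course code: the dictionary encoding of the grammar, the use of `""` for an \(\epsilon\) production, and the helper names `compute_nullable` and `compute_first` are all assumptions made for this example. Following the definition above, FIRST sets contain terminals only, and nullability is tracked separately.

```python
# Grammar from the notes; "_" stands for a space and "" encodes an epsilon production.
DIGITS = "0123456789"
OPS = "+-*/"
GRAMMAR = {
    "S": ["_S", "PS", "DNT", ""],
    "T": ["_S", "PS", ""],
    "N": ["DN", ""],
    "D": list(DIGITS),
    "P": list(OPS),
}


def compute_nullable(grammar):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # A head is nullable if some body consists only of nullable
            # symbols (the empty body satisfies this trivially).
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


def compute_first(grammar):
    """Apply rules 1-3 repeatedly until no FIRST set changes."""
    nullable = compute_nullable(grammar)
    first = {head: set() for head in grammar}

    def first_of(symbol):
        # Rule 1: the FIRST set of a terminal is just that terminal.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for symbol in body:
                    before = len(first[head])
                    first[head] |= first_of(symbol)  # rules 2 and 3
                    changed |= len(first[head]) != before
                    if symbol not in nullable:
                        break  # later symbols cannot start the string
    return first, nullable


first, nullable = compute_first(GRAMMAR)
print(sorted(nullable))             # ['N', 'S', 'T']
print("".join(sorted(first["T"])))  # *+-/_
```

The inner `break` is what implements rule 3: the loop only moves on to \(Y_{i+1}\) when everything before it is nullable.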

FOLLOW

The above is sufficient for some grammars, but in general a grammar can be LL(1) and still not be parsable using FIRST alone. The problem arises when you are expanding a nonterminal \(A\) and \(A \Rightarrow^* \epsilon\): the next input symbol may then be one that follows \(A\) rather than one that \(A\) produces, so it need not be in \(FIRST(Y_i)\) for any \(Y_i\) on the right-hand side of any production from \(A\). To handle this case we need a second set, \(FOLLOW(A)\), containing the terminals that can appear immediately after \(A\) in a derivation.

We won’t go into the details of how to construct \(FOLLOW\), but the process is similar to that for \(FIRST\), if slightly more intricate. You can find details in the resources linked at the top of the notes.
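For reference, working the standard FOLLOW construction by hand on the RPN grammar gives the following sets for the three nullable nonterminals. This is a worked result added for illustration, not something derived in lecture; \(\$\) marks the end of the input.

\[ \begin{align*}
FOLLOW(S) &= \{\$\} \\
FOLLOW(T) &= \{\$\} \\
FOLLOW(N) &= \{+, -, *, /, \_, \$\}
\end{align*} \]

These sets tell us where the \(\epsilon\) productions belong: \(S \rightarrow \epsilon\) and \(T \rightarrow \epsilon\) apply only at the end of the input, while \(N \rightarrow \epsilon\) applies whenever the lookahead is an operator, a space, or the end of the input.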

Again, details abridged, you can use the contents of FIRST and FOLLOW to derive the table we were looking for above. Here is the result for the RPN grammar:

                    (any operator)   \(\_\) (a space)   (any digit)   \(\$\) (end of input)
\(S \rightarrow\)   \(PS\)           \(\_S\)            \(DNT\)       \(\epsilon\)
\(T \rightarrow\)   \(PS\)           \(\_S\)            error         \(\epsilon\)
\(N \rightarrow\)   \(\epsilon\)     \(\epsilon\)       \(DN\)        \(\epsilon\)
\(D \rightarrow\)   error            error              \(D\)         error
\(P \rightarrow\)   \(P\)            error              error         error
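To see how such a table drives a parser, here is a minimal table-driven sketch in Python. It is not the course's RPN parser: the `TABLE` encoding, the `column` helper, the `accepts` function, and the placeholder terminals `"d"` and `"o"` (meaning "whatever digit or operator is under the lookahead") are assumptions made for this example. The parser keeps a stack of symbols still to be matched, expands nonterminals by table lookup, and matches terminals against the input.

```python
DIGITS = "0123456789"
OPS = "+-*/"
END = "$"  # end-of-input marker, as in the table's last column


def column(ch):
    """Map an input character to its table column, or None if it has none."""
    if ch in OPS:
        return "op"
    if ch in DIGITS:
        return "digit"
    if ch in ("_", END):
        return ch
    return None


# TABLE[nonterminal][column] is the right-hand side to expand; a missing
# entry is an error cell. "d" and "o" are placeholder terminals that match
# any digit and any operator respectively.
TABLE = {
    "S": {"op": "PS", "_": "_S", "digit": "DNT", END: ""},
    "T": {"op": "PS", "_": "_S", END: ""},
    "N": {"op": "", "_": "", "digit": "DN", END: ""},
    "D": {"digit": "d"},
    "P": {"op": "o"},
}


def accepts(text):
    """Return True if text (with "_" for spaces) can be derived from S."""
    tokens = list(text) + [END]
    pos = 0
    stack = [END, "S"]  # the top of the stack is the end of the list
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in TABLE:
            body = TABLE[top].get(column(look))
            if body is None:
                return False  # error cell: syntax error
            stack.extend(reversed(body))  # leftmost symbol ends up on top
        else:
            # A terminal on the stack must match the lookahead.
            matches = (
                (top == "d" and look in DIGITS)
                or (top == "o" and look in OPS)
                or top == look
            )
            if not matches:
                return False
            pos += 1
    return pos == len(tokens)


print(accepts("12_3_+"))  # True
print(accepts("1a"))      # False
```

The second call fails because `a` is not in the alphabet, so it has no column in the table.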

Recursive Descent Parsing

See the code for the RPN parser. There are two versions:

To give an example, suppose we’re parsing the string 44_9_+. At the top level, we’d call S(44_9_+) and this would result in the following call tree:

Call             Parsed   Rest
S(44_9_+)       (empty)   44_9_+
  D(44_9_+)           4   4_9_+
  N(4_9_+)            4   4_9_+
    D(4_9_+)         44   _9_+
    N(_9_+)          44   _9_+   
  T(_9_+)           44_   9_+
    S(9_+)          44_   9_+
      D(9_+)       44_9   _+
      N(_+)        44_9   _+
      T(_+)       44_9_   +
        S(+)      44_9_   +
          P(+)   44_9_+   (empty)
          S()    44_9_+   (empty)

In rpn-LL, the return value is just the Rest column above. So the call D(9_+) returns _+, and that’s what’s passed into the following call to N.
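The "return the rest of the input" convention can be sketched in Python as follows. This is not the rpn-LL code itself, only an illustration of the same idea under some assumptions: the input is a plain string with `_` for spaces, there is one function per nonterminal, and the `ParseError` class is a name invented for this example. Each function inspects the first remaining character, applies the rule the parse table selects, and returns whatever input is left.

```python
DIGITS = "0123456789"
OPS = "+-*/"


class ParseError(Exception):
    """Raised where the parse table has an error cell."""


def S(rest):
    if not rest:
        return rest                # S -> epsilon at end of input
    if rest[0] == "_":
        return S(rest[1:])         # S -> _ S
    if rest[0] in OPS:
        return S(P(rest))          # S -> P S
    if rest[0] in DIGITS:
        return T(N(D(rest)))       # S -> D N T
    raise ParseError(f"unexpected {rest[0]!r}")


def T(rest):
    if not rest:
        return rest                # T -> epsilon at end of input
    if rest[0] == "_":
        return S(rest[1:])         # T -> _ S
    if rest[0] in OPS:
        return S(P(rest))          # T -> P S
    raise ParseError(f"unexpected {rest[0]!r}")


def N(rest):
    if rest and rest[0] in DIGITS:
        return N(D(rest))          # N -> D N
    return rest                    # N -> epsilon; the caller checks what follows


def D(rest):
    if rest and rest[0] in DIGITS:
        return rest[1:]            # consume one digit
    raise ParseError("expected a digit")


def P(rest):
    if rest and rest[0] in OPS:
        return rest[1:]            # consume one operator
    raise ParseError("expected an operator")


print(repr(D("9_+")))     # '_+'
print(repr(S("44_9_+")))  # ''
```

A top-level call succeeds when `S` returns the empty string, meaning the whole input was consumed.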

In rpn-ast, the return value is a list containing the parsed value followed by the rest of the input symbols. So the call D(9_+) returns (9, _+). To be precise about the details, the arguments and return values are themselves lists of characters, so the actual call is D((#\9, #\_, #\+)) and the return value is (9, #\_, #\+). Notice that the 9 is now an integer value, whereas the “rest of the input” part is still a list of characters.
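To make the shape of that return value concrete, here is a small Python analogue of the digit parser only. It is a hypothetical sketch, not the rpn-ast code: the course code appears to use a Lisp-style character notation, and the Python function below and its `ValueError` are assumptions made for illustration. The input is a list of one-character strings, and the result is a list whose first element is the parsed integer and whose remaining elements are the unconsumed characters.

```python
def D(chars):
    """Parse one digit from a list of characters.

    Returns a list: the digit's integer value, then the remaining characters.
    """
    if not chars or chars[0] not in "0123456789":
        raise ValueError("expected a digit")
    return [int(chars[0])] + chars[1:]


print(D(["9", "_", "+"]))  # [9, '_', '+']
```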