Our textbook does not cover parsing. I’ve found the following resources helpful:
The key issue in parsing is an inability to take a “global” view of the input string - you have to process it piece by piece; in the case of LL(1) parsing, we’re only allowed 1 symbol of lookahead. In the following examples, think about how much of the input string you need to see in order to choose the correct production.
Do Exercises Part A
Consider:
\(S \rightarrow aS \mid bS \mid \epsilon\)
Parse \(aabb\).
\(S \Rightarrow aS \Rightarrow aaS \Rightarrow aabS \Rightarrow aabbS \Rightarrow aabb\)
Seems easy enough! We can look at one character and correctly decide which production to apply.
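To make the one-symbol decision concrete, here is a minimal Python sketch (the function name `derive` is my own, not something from the exercises). It assumes the grammar \(S \rightarrow aS \mid bS \mid \epsilon\) and records each sentential form of the derivation, choosing a production by looking only at the next input symbol.

```python
def derive(text):
    """Return the sentential forms of the derivation of `text` under
    S -> aS | bS | epsilon, deciding each step from one lookahead symbol."""
    forms = ["S"]
    consumed = ""
    for symbol in text:
        # The lookahead symbol alone picks the production: a -> aS, b -> bS.
        if symbol not in ("a", "b"):
            raise ValueError(f"unexpected symbol {symbol!r}")
        consumed += symbol
        forms.append(consumed + "S")
    # End of input: the only production that fits is S -> epsilon.
    forms.append(consumed)
    return forms


print(" => ".join(derive("aabb")))
# S => aS => aaS => aabS => aabbS => aabb
```

At no point does the loop need to peek further than the current symbol, which is exactly what makes this grammar friendly to LL(1) parsing.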
Consider:
\(S \rightarrow Sa \mid Sb \mid \epsilon\)
Parse \(aabb\). Not so simple!
\(S \Rightarrow Sb \Rightarrow Sbb \Rightarrow Sabb \Rightarrow Saabb \Rightarrow aabb\)
You need to see the whole input to choose the \(Sb\) production in the first step of the derivation.
This problem arises because of left recursion. In simple cases such as this, it can be eliminated by converting the grammar to an equivalent one.
Consider the grammar \(S \rightarrow Sa \mid Sb \mid c \mid d\). The strings that can be generated from this are:
\((c \cup d)(a \cup b)^*\).
We can write equivalent grammar rules to generate the same strings by introducing a new variable \(S'\) to capture the \((a \cup b)^*\) part:
\(S \rightarrow cS' \mid dS'\)
\(S' \rightarrow aS' \mid bS' \mid \epsilon\)
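The same rewrite can be sketched in Python for the simple case of immediate left recursion. This is an illustration under my own assumptions rather than a general algorithm: production bodies are lists of symbols, the empty list stands for \(\epsilon\), and the new variable is named by appending a prime to the original one.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite head -> head alpha | beta as
    head -> beta head'  and  head' -> alpha head' | epsilon."""
    # alpha parts: what follows the leading occurrence of `head`
    recursive = [body[1:] for body in bodies if body[:1] == [head]]
    # beta parts: bodies that do not start with `head`
    base = [body for body in bodies if body[:1] != [head]]
    if not recursive:
        return {head: bodies}  # nothing to do
    fresh = head + "'"
    return {
        head: [beta + [fresh] for beta in base],
        fresh: [alpha + [fresh] for alpha in recursive] + [[]],  # [] is epsilon
    }


grammar = eliminate_immediate_left_recursion(
    "S", [["S", "a"], ["S", "b"], ["c"], ["d"]]
)
for variable, bodies in grammar.items():
    print(variable, "->", " | ".join("".join(b) or "epsilon" for b in bodies))
# S -> cS' | dS'
# S' -> aS' | bS' | epsilon
```

The sketch only handles recursion that is visible directly in the rules for one variable; it does nothing about left recursion that is reached through other variables.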
Do Exercises Part B
If two rules in a grammar have a common prefix, it can be very difficult to know which rule to apply. For example, in the following rule representing a conditional statement in some plausible programming language:
\(S \rightarrow \text{ if } C \text{ then } S \mid \text{ if } C \text{ then } S \text{ else } S\)
or, with more abstract names:
\(S \rightarrow a C b S \mid a C b S c S\)
the prefix \(aCbS\) appears in both rules, so we’d need to look beyond it in the input string to determine whether \(cS\) follows. But that shared \(S\) could have lots of stuff in it, so we may be looking a long way ahead.
We can eliminate common prefixes by introducing new variables; this is called left factoring.
\(S \rightarrow aCbSM\)
\(M \rightarrow cS \mid \epsilon\)
If we think in terms of the conditional statement example above, the new variable \(M\) can be thought of as meaning “Maybe an else clause”. The general approach here is to unify the suffixes under a single new variable, so that the original rules can be combined into a single one with the new variable on the right-hand side.
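Here is a small Python sketch of that idea, applied to \(S \rightarrow aCbS \mid aCbScS\). The helper names are my own, and the sketch makes a simplifying assumption: it only handles the case where every body for a variable shares one common prefix. Bodies are lists of symbols and the empty list stands for \(\epsilon\).

```python
def common_prefix(bodies):
    """Longest sequence of symbols that starts every body."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(head, bodies, fresh):
    """Pull the shared prefix out and move the differing suffixes under `fresh`."""
    prefix = common_prefix(bodies)
    if len(bodies) < 2 or not prefix:
        return {head: bodies}  # nothing to factor
    return {
        head: [prefix + [fresh]],
        fresh: [body[len(prefix):] for body in bodies],  # [] is epsilon
    }


factored = left_factor("S", [list("aCbS"), list("aCbScS")], "M")
for variable, bodies in factored.items():
    print(variable, "->", " | ".join("".join(b) or "epsilon" for b in bodies))
# S -> aCbSM
# M -> epsilon | cS
```

A grammar where only some of the bodies share a prefix would need the bodies grouped by their first symbol before factoring, which this sketch leaves out.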
Do Exercises Part C
It’s worth noting that left recursion and common prefixes can hide in grammars by way of indirection. An example of this can be seen in the following grammar: \[ \begin{align*} A &\rightarrow da \mid acB\\ B &\rightarrow abB \mid daA \mid Af \end{align*} \] Try substituting the possible expansions of \(A\) into the third production for \(B\), and you’ll see that there was a common prefix lurking even though it wasn’t obvious.
Similarly,
\[ S \rightarrow Tu \mid wx\\ T \rightarrow Sq \mid vvS \] Try substituting \(S\) into \(T\) to find the hidden left recursion.
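If you want to check your substitutions, the step can be sketched mechanically in Python. The helper `substitute_leading` is a name I made up for illustration; it expands a variable only where it appears as the first symbol of a body, which is the position that matters for both common prefixes and left recursion.

```python
def substitute_leading(target_bodies, name, name_bodies):
    """Replace `name` by each of its bodies wherever it is the first symbol."""
    expanded = []
    for body in target_bodies:
        if body[:1] == [name]:
            expanded.extend(replacement + body[1:] for replacement in name_bodies)
        else:
            expanded.append(body)
    return expanded


def show(head, bodies):
    print(head, "->", " | ".join("".join(body) for body in bodies))


# First grammar: substitute A into the rules for B.
a_bodies = [list("da"), list("acB")]
b_bodies = [list("abB"), list("daA"), list("Af")]
show("B", substitute_leading(b_bodies, "A", a_bodies))
# B -> abB | daA | daf | acBf

# Second grammar: substitute S into the rules for T.
s_bodies = [list("Tu"), list("wx")]
t_bodies = [list("Sq"), list("vvS")]
show("T", substitute_leading(t_bodies, "S", s_bodies))
# T -> Tuq | wxq | vvS
```

After substitution, \(B\) has two bodies starting with \(da\) (and two starting with \(a\)), and \(T\) has a body that starts with \(T\) itself.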
I’d like to be able to tell you that if we can left factor a grammar and eliminate all left recursion, then we can write an LL(1) parser for it, but unfortunately that’s not true. These are necessary, but not sufficient, steps for LL(1) parsing to work. The simplest way to tell if a grammar can be LL(1) parsed is to go from the factored grammar all the way to the parse table, and see if any table entry ends up with more than one production. We’ll see how to do this next time; this will build up to Lab 8, where you’ll implement an LL(1) parser for a grammar of the Racket language.
You’re used to seeing arithmetic expressions written in infix notation, such as \(1 + 2 * (4-3)\). Operands are positioned on either side of an operator, and grouping is done with parentheses. In Racket, you’re now accustomed to prefix notation, such as (+ 1 (* 2 (- 4 3))); prefix notation is sometimes also known as Polish notation. We are going to write an LL(1) parser for Reverse Polish notation (RPN), otherwise known as postfix notation.
One advantage of RPN is that it removes the need for operator precedence and parentheses. For example, to evaluate the infix expression \(1 + 2 * (4-3)\), we need to know that \(*\) has higher precedence than \(+\), and we need parentheses around \(4-3\) to avoid multiplying 2 by 4 before subtracting 3 from it. The equivalent expression in RPN is:
1 2 4 3 - * +
This would be evaluated in the following way, where the first line in each pair uses parentheses to highlight the operation to be computed and the second line shows the expression with the result of that operation substituted.
1 2 (4 3 -) * +
1 2 1 * +
1 (2 1 *) +
1 2 +
(1 2 +)
3
The nifty thing here is that we could unambiguously express an order of operations without needing parentheses or operator precedence.
Notice that a simple evaluation algorithm works here: if you see a number, push it onto a stack; if you see an operator, pop the top two operands off the stack, apply the operator to them, then push the result back onto the stack. Repeat until the input is used up; for a well-formed expression the stack then contains exactly one number, which is the result.
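The stack algorithm can be sketched in a few lines of Python. This is just an illustration with choices of my own: tokens are assumed to be separated by whitespace, numbers are non-negative integers, `/` is treated as ordinary (floating-point) division, and the function name `evaluate_rpn` is made up for this example.

```python
import operator

# Assumed operator set and meanings; "/" is true division here.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate_rpn(expression):
    """Evaluate a whitespace-separated RPN expression with a stack."""
    stack = []
    for token in expression.split():
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"not enough operands for {token!r}")
            right = stack.pop()  # the top of the stack is the right operand
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(int(token))  # raises ValueError if not a number
    if len(stack) != 1:
        raise ValueError("expression did not reduce to a single value")
    return stack[0]


print(evaluate_rpn("1 2 4 3 - * +"))  # 3
```

Note the order of the two pops: the operand popped first was pushed last, so it belongs on the right of the operator. That only matters for \(-\) and \(/\), but getting it backwards gives \(3 - 4\) instead of \(4 - 3\).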
Do Exercises Part D
Here’s a grammar that describes RPN strings: \[ \begin{align*} S & \rightarrow \_ S \mid SP \mid NS \mid \epsilon \\ N & \rightarrow ND \mid D \\ D & \rightarrow 0 \mid 1 \mid \cdots \mid 9 \\ P & \rightarrow + \mid - \mid * \mid / \\ \end{align*} \] Here, the terminals are \(\Sigma = \{0, 1, \ldots, 9, +, -, *, /, \_\}\), where the underscore \(\_\) is used as a visible representation of a space character. Let’s start to understand this by attaching some intuitive meaning to the nonterminals:
This grammar basically describes a space-separated list of integers and operators. Notice that it doesn’t require a matched number of numbers and operators; this grammar allows expressions like 1 2 3 +, which perhaps could evaluate to 6, but also expressions like 6 7 8 / or + 1. The interpretation of such expressions isn’t our concern for now: as long as the string is a list of numbers and operators, we’ll parse it.