CSCI 301 L29 Notes

Lecture 29 - Notes

Goals

Know how to identify and remove left recursion from a grammar
Know how to identify and remove common prefixes from a grammar
Know how to interpret and translate arithmetic expressions into/out of Reverse Polish notation

Announcements

Lab 7 accepted until Wednesday night without penalty
A7 due Wednesday night
Lab 8 will be Tuesday 11/26; no lab tomorrow, 11/19.
- Ryan will hold office hours during his lab (10-noon) today in CF 162

Resources

Our textbook does not cover parsing. I’ve found the following resources helpful:

Handout - Context-Free Grammars
Handout - Top-Down Parsing
The book Compilers: Principles, Techniques, and Tools, by Aho, Lam, Sethi, and Ullman.

Obstacles to Efficient Parsing

The key issue in parsing is an inability to take a “global” view of the input string - you have to process it piece by piece; in the case of LL(1) parsing, we’re only allowed 1 symbol of lookahead. In the following examples, think about how much of the input string you need to see in order to choose the correct production.

Do Exercises Part A

Consider:

\(S \rightarrow aS \mid bS \mid \epsilon\)

Parse \(aabb\).

\(S \Rightarrow aS \Rightarrow aaS \Rightarrow aabS \Rightarrow aabbS \Rightarrow aabb\)

Seems easy enough! We can look at one character and correctly decide which production to apply.
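As a minimal sketch of that idea, the Python function below picks a production for \(S \rightarrow aS \mid bS \mid \epsilon\) by looking only at the next input symbol. The function name `derivation_steps` and the choice to report productions as strings are assumptions made for illustration; they are not part of the course materials.

```python
def derivation_steps(s: str) -> list[str]:
    """List the productions used to derive s from S -> aS | bS | epsilon.

    Each choice is made from a single symbol of lookahead.
    Raises ValueError if s is not in the language.
    """
    steps = []
    pos = 0
    while True:
        # The only thing we inspect: the next unread symbol (None at end of input).
        lookahead = s[pos] if pos < len(s) else None
        if lookahead == "a":
            steps.append("S -> aS")
            pos += 1
        elif lookahead == "b":
            steps.append("S -> bS")
            pos += 1
        elif lookahead is None:
            steps.append("S -> epsilon")
            return steps
        else:
            raise ValueError(f"unexpected symbol {lookahead!r} at position {pos}")


print(derivation_steps("aabb"))
# ['S -> aS', 'S -> aS', 'S -> bS', 'S -> bS', 'S -> epsilon']
```

The five productions printed correspond to the five steps of the derivation above.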

Consider:

\(S \rightarrow Sa \mid Sb \mid \epsilon\)

Parse \(aabb\). Not so simple!

\(S \Rightarrow Sb \Rightarrow Sbb \Rightarrow Sabb \Rightarrow Saabb \Rightarrow aabb\)

You need to see the whole input to choose the \(Sb\) production in the first step of the derivation.

This problem arises because of left recursion. In simple cases such as this, it can be eliminated by converting the grammar to an equivalent one.

Consider the grammar \(S \rightarrow Sa \mid Sb \mid c \mid d\). The strings that can be generated from this are:

\((c \cup d)(a \cup b)^*\).

We can write equivalent grammar rules to generate the same strings by introducing a new variable \(S'\) to capture the \((a \cup b)^*\) part:

\(S \rightarrow cS' \mid dS'\)

\(S' \rightarrow aS' \mid bS' \mid \epsilon\)
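A quick way to convince yourself that the rewritten grammar generates \((c \cup d)(a \cup b)^*\) is to write one function per variable and compare against a regular expression. The sketch below does this in Python; the function names `parse_s` and `parse_s_prime` are illustrative assumptions, and the brute-force check only covers short strings.

```python
import re
from itertools import product


def parse_s(s: str) -> bool:
    """S -> c S' | d S' : the first symbol must be c or d."""
    if s[:1] in ("c", "d"):
        return parse_s_prime(s, 1)
    return False


def parse_s_prime(s: str, pos: int) -> bool:
    """S' -> a S' | b S' | epsilon : one symbol of lookahead picks the rule."""
    if pos < len(s) and s[pos] in ("a", "b"):
        return parse_s_prime(s, pos + 1)
    # Epsilon production: only valid if no input remains.
    return pos == len(s)


# Compare with the regular expression (c|d)(a|b)* on every string of length <= 4.
pattern = re.compile(r"[cd][ab]*")
for n in range(5):
    for chars in product("abcd", repeat=n):
        s = "".join(chars)
        assert parse_s(s) == bool(pattern.fullmatch(s))

print(parse_s("cab"), parse_s("acb"))  # True False
```

Neither function ever calls itself before consuming a symbol, which is exactly what removing the left recursion buys us.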

Do Exercises Part B

Common Prefixes (Left Factoring)

If two rules in a grammar have a common prefix, it can be very difficult to know which rule to apply. For example, in the following rule representing a conditional statement in some plausible programming language:

\(S \rightarrow \text{ if } C \text{ then } S \mid \text{ if } C \text{ then } S \text{ else } S\)

or, with more abstract names:

\(S \rightarrow a C b S \mid a C b S c S\)

the prefix \(aCb\) appears in both rules, so we’d need to look beyond that in the input string - and determine whether \(cS\) follows the last \(S\). But the first \(S\) could have lots of stuff in it, so we may be looking a long way.

We can eliminate common prefixes by introducing new variables; this is called left factoring.

\(S \rightarrow aCbSM\)

\(M \rightarrow cS \mid \epsilon\)

If we think in terms of the conditional statement example above, the new variable \(M\) can be thought of as meaning “Maybe an else clause”. The general approach here is to unify the suffixes under a single new variable, so that the original rules can be combined into a single one with the new variable on the right-hand side.
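The general approach can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a complete algorithm: it factors only a prefix shared by all alternatives of one variable, it represents each right-hand side as a list of symbols, and the names `common_prefix` and `left_factor` are made up for this example.

```python
def common_prefix(alternatives: list[list[str]]) -> list[str]:
    """Longest sequence of symbols that every alternative starts with."""
    prefix = []
    for symbols in zip(*alternatives):  # walk the alternatives in lockstep
        if all(sym == symbols[0] for sym in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix


def left_factor(head: str, alternatives: list[list[str]], new_name: str) -> dict:
    """Factor the shared prefix out of `head`, moving the suffixes to `new_name`.

    An empty list [] stands for epsilon.
    """
    prefix = common_prefix(alternatives)
    if not prefix:
        return {head: alternatives}  # nothing to factor
    suffixes = [alt[len(prefix):] for alt in alternatives]
    return {head: [prefix + [new_name]], new_name: suffixes}


rules = left_factor(
    "S",
    [["a", "C", "b", "S"], ["a", "C", "b", "S", "c", "S"]],
    "M",
)
print(rules)
# {'S': [['a', 'C', 'b', 'S', 'M']], 'M': [[], ['c', 'S']]}
```

The output reads as \(S \rightarrow aCbSM\) and \(M \rightarrow \epsilon \mid cS\), the same factored rules as above.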

Do Exercises Part C

It’s worse than that

It’s worth noting that left recursion and common prefixes can hide in grammars by way of indirection. An example of this can be seen in the following grammar: \[ \begin{align*} A &\rightarrow da \mid acB\\ B &\rightarrow abB \mid daA \mid Af \end{align*} \] Try substituting the possible expansions of \(A\) into the third production for \(B\), and you’ll see that there was a common prefix lurking even though it wasn’t obvious.

Similarly,

\[ \begin{align*} S &\rightarrow Tu \mid wx\\ T &\rightarrow Sq \mid vvS \end{align*} \] Try substituting \(S\) into \(T\) to find the hidden left recursion.
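Hidden left recursion can also be found mechanically. The Python sketch below follows leftmost symbols from each variable and reports any variable that can reach itself that way. It assumes the grammar has no \(\epsilon\)-productions (so only the first symbol of each right-hand side matters), and the dictionary representation and the name `left_recursive_nonterminals` are assumptions for illustration.

```python
def left_recursive_nonterminals(grammar: dict[str, list[list[str]]]) -> set[str]:
    """Return the variables X for which X =>+ X... (direct or indirect).

    Assumes no epsilon-productions, so only the first symbol of each
    right-hand side can end up leftmost.
    """
    # leftmost[X] = variables that can appear leftmost after one step from X
    leftmost = {
        head: {rhs[0] for rhs in bodies if rhs and rhs[0] in grammar}
        for head, bodies in grammar.items()
    }
    result = set()
    for start in grammar:
        seen = set()
        frontier = list(leftmost[start])
        while frontier:  # graph search over "can appear leftmost" edges
            x = frontier.pop()
            if x in seen:
                continue
            seen.add(x)
            frontier.extend(leftmost[x])
        if start in seen:
            result.add(start)
    return result


grammar = {
    "S": [["T", "u"], ["w", "x"]],
    "T": [["S", "q"], ["v", "v", "S"]],
}
print(sorted(left_recursive_nonterminals(grammar)))  # ['S', 'T']
```

Neither rule for \(S\) or \(T\) starts with its own variable, yet both are reported, because \(S \Rightarrow Tu \Rightarrow Squ\).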

Where does this leave us?

I’d like to be able to tell you that if we can left factor a grammar and eliminate all left recursion, then we can write an LL(1) parser for it, but unfortunately that’s not true. These are necessary, but not sufficient, steps for LL(1) parsing to work. The simplest way to tell whether a grammar can be LL(1) parsed is to go from the factored grammar all the way to the parse table, and see whether any table entry ends up with more than one production. We’ll see how to do this next time; this will build up to Lab 8, where you’ll implement an LL(1) parser for a grammar of the Racket language.

Reverse Polish Notation

You’re used to seeing arithmetic expressions written in infix notation, such as \(1 + 2 * (4-3)\). Operands are positioned on either side of an operator, and grouping is done with parentheses. In Racket, you’re now accustomed to prefix notation, such as (+ 1 (* 2 (- 4 3))); prefix notation is sometimes also known as Polish notation. We are going to write an LL(1) parser for Reverse Polish notation (RPN), otherwise known as postfix notation.
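One way to relate the notations is to view them as different ways of writing out the same expression tree: prefix writes the operator before its operands, postfix writes it after. The Python sketch below illustrates this for the example expression; the tuple encoding `(operator, left, right)` and the function names are assumptions chosen for the example.

```python
def to_prefix(tree) -> str:
    """Operator first, then operands, with Racket-style parentheses."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({op} {to_prefix(left)} {to_prefix(right)})"
    return str(tree)


def to_postfix(tree) -> str:
    """Operands first, then the operator (a postorder traversal)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"{to_postfix(left)} {to_postfix(right)} {op}"
    return str(tree)


# The tree for the infix expression 1 + 2 * (4 - 3)
tree = ("+", 1, ("*", 2, ("-", 4, 3)))

print(to_prefix(tree))   # (+ 1 (* 2 (- 4 3)))
print(to_postfix(tree))  # 1 2 4 3 - * +
```

Note that the postfix output needs no parentheses: the position of each operator already determines its operands.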

One advantage of RPN is that it removes the need for operator precedence and parentheses. For example, to evaluate the infix expression \(1 + 2 * (4-3)\), we need to know that \(*\) has higher precedence than +, and we need parentheses around \(4-3\) to avoid multiplying 2 by 4 before subtracting 3 from it. The equivalent expression in RPN is:

1 2 4 3 - * +

This would be evaluated in the following way, where the first line in each pair uses parentheses to highlight the operation to be computed and the second line shows the expression with the result of that operation substituted.

1 2 (4 3 -) * + 
1 2 1 * +

1 (2 1 *) +
1 2 +

(1 2 +)
3

The nifty thing here is that we could unambiguously express an order of operations without needing parentheses or operator precedence.

Notice that a simple evaluation algorithm works here: if you see a number, push it onto a stack; if you see an operator, pop the top two operands off the stack, apply the operator to them (the operand popped second is the left-hand one), then push the result back onto the stack. Repeat until the input is exhausted; for a well-formed expression, the stack then contains exactly one number, which is the result.
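Here is a minimal Python sketch of that stack algorithm. It makes a few assumptions beyond what is stated above: tokens are separated by whitespace, numbers are nonnegative integers, `/` is treated as real-valued division, and malformed expressions are rejected with a `ValueError`. The name `eval_rpn` is just for illustration.

```python
import operator

# Assumption: "/" means true (floating-point) division.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr: str):
    """Evaluate a whitespace-separated RPN expression using a stack."""
    stack = []
    for token in expr.split():
        if token in OPS:
            if len(stack) < 2:
                raise ValueError(f"operator {token!r} needs two operands")
            right = stack.pop()  # popped first: right-hand operand
            left = stack.pop()   # popped second: left-hand operand
            stack.append(OPS[token](left, right))
        else:
            stack.append(int(token))  # raises ValueError on a non-number
    if len(stack) != 1:
        raise ValueError(f"expected one result, found {len(stack)} values")
    return stack[0]


print(eval_rpn("1 2 4 3 - * +"))  # 3
```

The order of the two pops matters for \(-\) and \(/\): in `4 3 -`, the 3 is popped first but is the right-hand operand, giving \(4 - 3 = 1\).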

Do Exercises Part D

A Grammar for RPN

Here’s a grammar that describes RPN strings: \[ \begin{align*} S & \rightarrow \_ S \mid PS \mid NS \mid \epsilon \\ N & \rightarrow ND \mid D \\ D & \rightarrow 0 \mid 1 \mid \cdots \mid 9 \\ P & \rightarrow + \mid - \mid * \mid / \\ \end{align*} \] Here, the terminals are \(\Sigma = \{0, 1, \ldots, 9, +, -, *, /, \_\}\), where the underscore \(\_\) is used as a visible representation of a space character. Let’s start to understand this by attaching some intuitive meaning to the nonterminals:

S is the start symbol; it constitutes a list of numbers and operators, possibly separated by spaces
N is a number; it has one or more digits
D is a single digit
P is an operator

This grammar basically describes a space-separated list of integers and operators. Notice that it doesn’t require a matched number of numbers and operators; this grammar allows expressions like 1 2 3 +, which perhaps could evaluate to 6, but also expressions like 6 7 8 / or + 1. The interpretation of such expressions isn’t our concern for now: as long as the string is a list of numbers and operators, we’ll parse it.
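To make “a list of numbers and operators” concrete, here is a small Python sketch that splits such a string into tokens without judging whether the expression makes sense. It is not the LL(1) parser we will build, and it rests on two assumptions: a real space character is used where the grammar writes \(\_\), and a run of adjacent digits is read as a single number. The name `tokenize` is illustrative.

```python
DIGITS = "0123456789"
OPERATORS = "+-*/"


def tokenize(text: str) -> list[str]:
    """Split text into number and operator tokens, skipping spaces.

    Raises ValueError on any character outside the grammar's alphabet.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == " ":
            pos += 1
        elif ch in OPERATORS:
            tokens.append(ch)
            pos += 1
        elif ch in DIGITS:
            start = pos
            # Assumption: consume the whole run of digits as one number.
            while pos < len(text) and text[pos] in DIGITS:
                pos += 1
            tokens.append(text[start:pos])
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize("1 2 3 +"))  # ['1', '2', '3', '+']
print(tokenize("+ 1"))      # ['+', '1']
```

Both strings are accepted, even though neither is a sensible RPN expression under the usual stack evaluation.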