CSCI 301 L28 Notes

Lecture 28 - Notes

Goals

Announcements

Resources

Our textbook does not cover parsing. I’ve found the following resources helpful:

Motivation: Parsing

One big application area for context-free grammars and context-free languages is in parsing, where the goal is to determine whether a string is in the language of the grammar, and often also to build a parse tree for that string. This is an early stage in the compiler or interpreter for every programming language you’ve ever used!

Parsing is a pretty interesting problem, primarily because of the following fact: Given a grammar and a string in its language, there are potentially multiple ways to derive the same string. (The same thing is true of Regular Expressions, if you think of regular expressions as being able to “generate” a string).

To give an example that’s fairly close to a real-world application, consider the following grammar over the alphabet \(\Sigma = \{0, 1, \ldots, 9, +, -, *, /, (, )\}\): \[ \begin{align*} E &\rightarrow E + E \\ E &\rightarrow E - E\\ E &\rightarrow E * E\\ E &\rightarrow E / E\\ E &\rightarrow (E)\\ E &\rightarrow 0 \mid 1 \mid 2 \mid 3 \mid 4 \mid 5 \mid 6 \mid 7 \mid 8 \mid 9 \end{align*} \] This grammar can generate syntactically valid arithmetic expressions, such as \(1 + 2\), \((3 - 4) * 5\), or just \(7\).

The goal of parsing would be to look at a string and determine whether it can be generated by this grammar (and, again, potentially build a parse tree representing that derivation).

Let’s consider the string \(1 + 1 * 4\). We can derive it in many ways! Three examples:

  1. \(E \Rightarrow E + E \Rightarrow E + E * E \Rightarrow 1 + E * E \Rightarrow 1 + E * 4 \Rightarrow 1 + 1 * 4\)

  2. \(E \Rightarrow E + E \Rightarrow 1 + E \Rightarrow 1 + E * E \Rightarrow 1 + 1 * E \Rightarrow 1 + 1 * 4\)

  3. \(E \Rightarrow E * E \Rightarrow E * 4 \Rightarrow E + E * 4 \Rightarrow E + 1 * 4 \Rightarrow 1 + 1 * 4\)

We could also make a poor choice of which rule to apply and get “stuck”:

  4. \(E \Rightarrow E - E\); from this point there is no sequence of derivations that would get us to \(1 + 1 * 4\).

Derivations 1-3 all get to the same string, but by different routes. Notice: #2 always expands the left-most nonterminal (a left-most derivation), #1 and #3 expand nonterminals in other orders, and #3 even begins with a different production.

Ambiguous Grammars

If we construct the parse trees for the above derivations, we’ll notice that we can get different parse trees. Specifically, #1 and #2 yield the same parse tree, while #3 has a different one. Knowing what we know about order of operations, we’d probably hope to get the parse tree from #1 in this case (but notice that using the left-most derivation doesn’t solve this universally - the situation would be reversed for \(1 * 1 + 4\)).
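
To see concretely why the different trees matter, here’s a small Python sketch (mine, not from any course materials) that represents the two parse trees for \(1 + 1 * 4\) as nested tuples and evaluates each one - the same string ends up with two different values:

```python
# Each tree is either an int (a leaf) or a tuple (operator, left, right).
tree_12 = ('+', 1, ('*', 1, 4))   # tree from derivations #1 and #2: 1 + (1 * 4)
tree_3  = ('*', ('+', 1, 1), 4)   # tree from derivation #3: (1 + 1) * 4

def evaluate(tree):
    """Evaluate a parse tree bottom-up."""
    if isinstance(tree, int):               # leaf: a single digit
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == '+' else a * b

print(evaluate(tree_12))  # 5 -- matches the usual order of operations
print(evaluate(tree_3))   # 8 -- same string, different tree, different value
```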

Definition: A grammar \(G\) is ambiguous if there is some string \(w \in L(G)\) with more than one distinct parse tree.

Equivalent definition: A grammar \(G\) is ambiguous if there is some string \(w \in L(G)\) that has more than one left-most derivation.

Do Exercises Part A

Parsing

Given a grammar \(G = (V, \Sigma, R, S)\) and a string \(w \in \Sigma^*\), determine whether \(w \in L(G)\), or in other words, whether \(S \Rightarrow^* w\).

There are two broad categories of approaches:

  - Top-down parsing: start from the start variable \(S\) and apply productions, working toward the input string \(w\).

  - Bottom-up parsing: start from \(w\) and apply productions in reverse, working back toward \(S\).

We will focus on top-down parsing, but both are used often in practice.

The key question a parser needs to answer is, given a string, which production should I apply?

This isn’t obvious: several productions can share the same left-hand-side variable, the right choice can depend on parts of the input we haven’t read yet, and a wrong choice can lead to a dead end (as in derivation 4 above).

Backtracking

For any old grammar, there aren’t very good answers to this, and the best we can do is called “backtracking”. Recall that in derivation 4 above, we made a substitution and got “stuck”; a backtracking parser will simply “undo” one or more productions, then try out a different choice. This can get very expensive, since in general it requires an exhaustive search of a large space of possibilities.
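
As a rough illustration (my own sketch, in Python), here’s a naive backtracking recognizer: it always expands the left-most nonterminal, tries each production in order, and “undoes” a failed choice simply by returning and trying the next one. Beware that it can loop forever on a left-recursive grammar - which foreshadows a problem we’ll hit below.

```python
def derives(symbols, s, grammar):
    """Can the sentential form `symbols` derive the terminal string `s`?
    Tries each production for the left-most nonterminal and backtracks
    (falls through to the next choice) whenever a choice fails."""
    if not symbols:
        return s == ''                      # success iff all input is consumed
    head, rest = symbols[0], symbols[1:]
    if head in grammar:                     # left-most symbol is a nonterminal
        return any(derives(list(body) + rest, s, grammar)
                   for body in grammar[head])
    # left-most symbol is a terminal: it must match the next input character
    return s.startswith(head) and derives(rest, s[1:], grammar)

# S -> aS | bS | epsilon, with epsilon written as the empty string
grammar = {'S': ['aS', 'bS', '']}
print(derives(['S'], 'aabb', grammar))  # True
print(derives(['S'], 'abca', grammar))  # False
```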

As a result, most parsing techniques do better by a combination of

  1. putting grammars in parse-friendly formats, and

  2. carefully analyzing the grammar and the input string for structure or patterns that can reveal which production should be used without ambiguity.

LL(k) parsing

We’re going to dive into one example of such a parsing technique: LL(k). This stands for Left-to-right, Left-most derivation with \(k\) tokens of lookahead. In particular, we’re going to work on LL(1) parsers, which look only one symbol of the input ahead of the left-most nonterminal (the one being substituted).

Other examples of common parsing techniques include bottom-up approaches such as LR(k) (Left-to-right scan, Right-most derivation in reverse) and LALR (Look-Ahead LR).

Not just any grammar can be parsed in LL(1) fashion; we need to eliminate some pitfalls first.

Left Recursion

The key issue in parsing is an inability to take a “global” view of the input string - you have to process it piece by piece; in our case, we’re only allowing for 1-symbol lookahead. In the following examples, think about how much of the input string you need to see in order to choose the correct production.

(End of what we covered in L28)

Do Exercises Part B

Consider:

\(S \rightarrow aS \mid bS \mid \epsilon\)

Parse \(aabb\).

\(S \Rightarrow aS \Rightarrow aaS \Rightarrow aabS \Rightarrow aabbS \Rightarrow aabb\)

Seems easy enough! We can look at one character and correctly decide which production to apply.
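
Here’s how that decision looks as code - a minimal Python sketch (mine) where one symbol of lookahead picks the production every time:

```python
def parse_S(s, i=0):
    """Recognize S -> aS | bS | epsilon, deciding by one symbol of lookahead."""
    if i < len(s) and s[i] == 'a':    # lookahead 'a': the production must be S -> aS
        return parse_S(s, i + 1)
    if i < len(s) and s[i] == 'b':    # lookahead 'b': the production must be S -> bS
        return parse_S(s, i + 1)
    return i == len(s)                # anything else: S -> epsilon, so we must be done

print(parse_S('aabb'))  # True
print(parse_S('abca'))  # False: 'c' matches no production
```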

Consider:

\(S \rightarrow Sa \mid Sb \mid \epsilon\)

Parse \(aabb\). Not so simple!

\(S \Rightarrow Sb \Rightarrow Sbb \Rightarrow Sabb \Rightarrow Saabb \Rightarrow aabb\)

You need to see the whole input to choose the \(Sb\) production in the first step of the derivation.

This problem arises because of left recursion. In simple cases such as this, it can be eliminated by converting the grammar to an equivalent one.

Consider the grammar \(S \rightarrow Sa \mid Sb \mid c \mid d\). The strings that can be generated from this are:

\((c \cup d)(a \cup b)^*\).

We can write equivalent grammar rules to generate the same strings by introducing a new variable \(S'\) to capture the \((a \cup b)^*\) part:

\(S \rightarrow cS' \mid dS'\)

\(S' \rightarrow aS' \mid bS' \mid \epsilon\)
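
The payoff is that the rewritten grammar can be parsed with a single symbol of lookahead. A quick Python sketch (again mine, just to illustrate):

```python
def parse_S(s, i=0):
    """S -> cS' | dS': the first symbol alone picks the production."""
    if i < len(s) and s[i] in 'cd':
        return parse_S_prime(s, i + 1)
    return False                        # no production of S starts any other way

def parse_S_prime(s, i):
    """S' -> aS' | bS' | epsilon"""
    if i < len(s) and s[i] in 'ab':
        return parse_S_prime(s, i + 1)
    return i == len(s)                  # epsilon: succeed only at end of input

print(parse_S('cab'))  # True: matches (c u d)(a u b)*
print(parse_S('abc'))  # False: must begin with c or d
```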

Common Prefixes (Left Factoring)

If two rules in a grammar have a common prefix, it can be very difficult to know which rule to apply. For example, in the following rule representing a conditional statement in some plausible programming language:

\(S \rightarrow \text{ if } C \text{ then } S \mid \text{ if } C \text{ then } S \text{ else } S\)

or, with more abstract names:

\(S \rightarrow a C b S \mid a C b S c S\)

the prefix \(aCb\) appears in both rules, so we’d need to look beyond it in the input string and determine whether \(cS\) follows the first \(S\). But that \(S\) could have lots of stuff in it, so we may be looking a long way ahead.

We can eliminate common prefixes by introducing new variables; this is called left factoring.

\(S \rightarrow aCbSM\)

\(M \rightarrow cS \mid \epsilon\)

If we think in terms of the conditional statement example above, the new variable \(M\) can be thought of as meaning “Maybe an else clause”.
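
To make this concrete, here’s a Python sketch of a recognizer for the factored grammar. Since the notes leave \(C\) abstract and give \(S\) no terminal-only production, I’m assuming two hypothetical extra rules just so the grammar generates something: \(C \rightarrow x\) (a condition) and \(S \rightarrow s\) (a simple statement).

```python
# Factored grammar, plus two assumed rules so it generates strings:
#   S -> a C b S M | s      M -> c S | epsilon      C -> x   (both assumed)
# Each parse_* function returns the index just past what it consumed, or None.

def parse_S(s, i=0):
    if i < len(s) and s[i] == 'a':                    # S -> a C b S M
        if i + 2 < len(s) and s[i+1] == 'x' and s[i+2] == 'b':
            j = parse_S(s, i + 3)                     # the inner statement
            return parse_M(s, j) if j is not None else None
        return None
    if i < len(s) and s[i] == 's':                    # S -> s (assumed base case)
        return i + 1
    return None

def parse_M(s, i):
    if i < len(s) and s[i] == 'c':                    # M -> cS: there is an else clause
        return parse_S(s, i + 1)
    return i                                          # M -> epsilon: no else clause

# "if x then s else s" and "if x then s" in the abstract alphabet:
print(parse_S('axbscs') == len('axbscs'))  # True
print(parse_S('axbs') == len('axbs'))      # True: M chose epsilon
```

Notice that `parse_M` takes a \(c\) whenever it sees one - the usual “an else binds to the nearest if” resolution of the dangling-else problem.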

It’s worse than that

It’s worth noting that left recursion and common prefixes can hide in grammars by way of indirection. An example of this can be seen in the following grammar: \[ \begin{align*} A &\rightarrow da \mid acB\\ B &\rightarrow abB \mid daA \mid Af \end{align*} \] Try substituting the possible expansions of \(A\) into the third production for \(B\), and you’ll see that there was a common prefix lurking even though it wasn’t obvious.
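
If you want to check your work: carrying out that substitution turns \(B \rightarrow Af\) into two productions, giving \[ B \rightarrow abB \mid daA \mid daf \mid acBf \] and now \(daA\) and \(daf\) visibly share the common prefix \(da\).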

Similarly,

\[ \begin{align*} S &\rightarrow Tu \mid wx\\ T &\rightarrow Sq \mid vvS \end{align*} \] Try substituting \(S\) into \(T\) to find the hidden left recursion.
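
Again, to check your work: substituting the productions for \(S\) into both productions for \(T\) gives \[ T \rightarrow Tuq \mid wxq \mid vvTu \mid vvwx \] and \(T \rightarrow Tuq\) is left-recursive.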

Where does this leave us?

I’d like to be able to tell you that if we can left factor a grammar and eliminate all left recursion, then we can write an LL(1) parser for it, but unfortunately that’s not even true. These are necessary, but not sufficient, steps for LL(1) parsing to work. The simplest way to tell whether a grammar can be LL(1) parsed is to go from the factored grammar all the way to the parse table, and see if there’s any ambiguity. We’ll see how to do this next time; this will build up to Lab 8, where you’ll implement an LL(1) parser for a grammar of the Racket language.