Context-Free Grammars and Pushdown Automata
5.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define a context-free grammar (CFG) formally using the 4-tuple
- Derive strings from a grammar using leftmost and rightmost derivations
- Construct parse trees and explain their relationship to derivations
- Define pushdown automata (PDA) formally and trace their execution on an input string
- Explain why CFGs are strictly more expressive than regular languages
- Identify the role of CFGs in the parser stage of a compiler pipeline
- Recognize when a grammar is ambiguous and understand why ambiguity matters
5.2 Beyond Regular Languages
Chapter 3 established that finite automata recognize exactly the regular languages, and Chapter 4 showed how to compile regular expressions into efficient DFAs for lexical analysis. But the lexer is only the first stage of a compiler. After tokenization, the compiler must determine whether a sequence of tokens has valid syntactic structure — whether it forms a grammatically well-formed program.
This is a harder problem, and regular languages are provably insufficient for it. Consider the language of balanced parentheses:
No DFA can recognize this language. The reason is intuitive: a DFA has a finite, fixed number of states, so it cannot count to an arbitrary depth. Checking that 47 open parens are eventually matched by exactly 47 close parens would require at least 48 states for counting alone — and since the requirement holds for every , the count is unbounded.
The Pumping Lemma for regular languages makes this precise. It guarantees that for any regular language , there is a constant such that every sufficiently long string can be “pumped” — a middle section repeated any number of times — and the result stays in . The language of balanced parentheses fails this test immediately: pumping always creates a mismatch between opening and closing counts.
To handle the recursive, nested structure of real programming languages, we need a more expressive formalism. Context-free grammars (CFGs) are the standard answer, and they are the mathematical foundation of every parser.
5.2.1 Where We Are in the Pipeline
The full compiler front-end looks like this:
Source text → [Lexer] → Token stream → [Parser] → Parse tree / AST
(Regex/DFA) (CFG/PDA) The lexer — built from regular expressions and DFAs in Chapters 3–4 — handles the local structure of the language: what a valid identifier looks like, what the literal 42 means. The parser handles the global structure: how expressions nest, how statements compose, how a function body is delimited. This chapter establishes the theory behind the parser.
5.3 Context-Free Grammars
5.3.1 Intuition
A grammar is a system of rewriting rules. Each rule says: this symbol can be replaced by that sequence of symbols. Starting from a designated start symbol and applying rules repeatedly, we generate strings. The set of all strings we can generate this way is the language of the grammar.
The word context-free refers to the shape of the rules: every rule’s left-hand side is a single nonterminal, and the replacement does not depend on the surrounding context. This constraint keeps the theory tractable — efficient parsing algorithms exist precisely because the rules are context-free.
5.3.2 Formal Definition
Definition (CFG): A context-free grammar is a 4-tuple:
where:
- is a finite set of nonterminal symbols (also called variables)
- is a finite set of terminal symbols (the alphabet of the strings we generate), with
- is a finite set of production rules
- is the start symbol
A production rule is written , meaning “the nonterminal may be replaced by the string .”
5.3.3 Conventions and Notation
By convention:
- Uppercase letters () denote nonterminals
- Lowercase letters and quoted strings () denote terminals
- Greek letters () denote strings over
- Multiple rules sharing a left-hand side are collapsed with
|: writing is shorthand for the two rules and
5.3.4 Derivations
Starting from , we generate a string by applying production rules one at a time.
Definition (Single-step derivation): If is a production rule and , then:
We say derives in one step: we replaced the nonterminal with the string , leaving everything else unchanged.
Definition (Multi-step derivation): means derives in zero or more steps.
Language of :
The language is the set of all terminal strings reachable from the start symbol by some sequence of rule applications.
5.3.5 Leftmost and Rightmost Derivations
When a sentential form contains multiple nonterminals, we must choose which one to expand. Two canonical orderings are:
- Leftmost derivation: always expand the leftmost nonterminal. Written .
- Rightmost derivation: always expand the rightmost nonterminal. Written .
Both orderings generate the same set of strings; the choice is about discipline, not power. The distinction matters for parsing: top-down parsers (like LL parsers) construct leftmost derivations, while bottom-up parsers (like LR parsers) construct the reverse of rightmost derivations. We will return to this in the section about parsers later in the course.
5.3.6 Worked Example: Arithmetic Expressions
Task: Write a CFG for arithmetic expressions over integer literals, using +, *, and parentheses.
Productions:
Deriving int + int * int:
At each step we expanded the leftmost , so this is a leftmost derivation.
5.3.7 Parse Trees
A derivation records what rules were applied. A parse tree records how they fit together structurally.
Definition: A parse tree for a string is a rooted, ordered tree where:
- The root is labeled with the start symbol
- Each interior node is labeled with a nonterminal ; its children spell out the right-hand side of whichever production was applied
- Each leaf is labeled with a terminal or
Reading the leaves from left to right recovers the derived string .
Example: The parse tree for int + int * int under the grammar above. Addition sits at the root, meaning int is added to the subexpression int * int:
This corresponds to the evaluation . The tree makes the implicit grouping explicit in a way that a flat derivation sequence does not.
5.3.8 Ambiguity
A grammar is ambiguous if some string has more than one distinct parse tree — equivalently, more than one leftmost derivation.
The arithmetic expression grammar above is ambiguous. The string int + int * int admits two parse trees, depending on which operator is treated as the root of the expression:
Tree 1 — multiplication at the root, computing (int + int) * int:
Tree 2 — addition at the root, computing int + (int * int):
For a programming language, ambiguity is a serious problem: if the grammar allows two interpretations of the same expression, the compiler has no principled basis for choosing between them. The usual remedies are:
- Rewriting the grammar to be structurally unambiguous, using precedence levels (Section 5.3.9)
- Imposing external disambiguation rules — declaring that
*binds more tightly than+— without changing the grammar itself
Real parser generators like Yacc and Bison support the second approach and will warn when they detect ambiguity. For correctness, the first approach is cleaner.
5.3.9 Eliminating Ambiguity: Precedence and Associativity
The standard technique is to stratify the grammar by precedence level: introduce one nonterminal per level, with higher-precedence operators appearing deeper in the grammar (closer to the leaves, where they bind more tightly).
Unambiguous arithmetic grammar:
Here governs + (lowest precedence), governs * (higher), and handles atoms and parenthesized subexpressions. Because * can only appear inside , which is itself a subgoal of , multiplication binds tighter by construction. Left-recursion () encodes left-associativity: 1 + 2 + 3 parses as (1 + 2) + 3, never as 1 + (2 + 3).
Deriving int + int * int now produces exactly one parse tree:
You can use the visualizer below to experiment with grammars of your own, step through derivations symbol by symbol:
5.4 Context-Free Languages
5.4.1 Definition
A language is context-free if there exists a CFG such that, . The class of context-free languages (CFLs) strictly contains the regular languages: every regular language is context-free, but not every CFL is regular. The balanced-parentheses language is the canonical separating example.
5.4.2 Closure Properties
Context-free languages are closed under:
- Union: If and are CFLs, so is
- Concatenation: If and are CFLs, so is
- Kleene star: If is a CFL, so is
They are not closed in general under:
- Intersection: may not be context-free
- Complement: may not be context-free
The closure under union, concatenation, and Kleene star means that programming language constructs — which compose freely (loops inside conditionals inside functions) — remain context-free even after combination. The failure of closure under intersection is the reason that some cross-cutting constraints (like type consistency) fall outside the parser’s reach and must be handled separately in semantic analysis.
5.4.3 Examples of Context-Free Languages
| Language | Grammar sketch | Regular? |
|---|---|---|
| No | ||
| No | ||
| No | ||
| Yes | ||
| Any finite language | Enumerate all strings | Yes |
The palindrome language is worth pausing on: it requires matching symbols across a center point, a symmetry that a finite automaton can never track because the center arrives at an unpredictable position.
5.4.4 The Pumping Lemma for CFLs
Context-free languages have their own pumping lemma, used to prove that a given language is not context-free. If is a CFL, there exists a constant such that any with can be written as satisfying:
- (at least one of is nonempty)
- for all
As an example: the language is not context-free. Any split of a long string confines to a small window, so pumping and simultaneously can affect at most two of the three symbol groups — making it impossible to maintain equal counts of all three.
5.5 Pushdown Automata (PDA)
5.5.1 Motivation
We have established that DFAs are too weak for context-free languages. What additional capability is needed? The answer is simple: memory that grows with the input. To verify that open parens are matched by close parens, a machine needs to remember — a quantity with no fixed upper bound.
A pushdown automaton (PDA) is a finite automaton equipped with a stack: an unbounded, last-in, first-out memory structure. A stack is exactly the right amount of extra power — PDAs recognize precisely the context-free languages, no more and no less.
5.5.2 Formal Definition
Definition (PDA): A pushdown automaton is a 6-tuple:
where:
- is a finite set of states
- is the input alphabet
- is the stack alphabet, which may differ from
- is the transition relation, where
- is the start state
- is the set of accepting states
A transition means: in state , reading input symbol (or taking a free -move), with on top of the stack, the PDA may move to state and replace with the string . Setting pops without pushing anything; setting pushes on top of .
A note on nondeterminism: Like NFAs, PDAs are nondeterministic by default — the machine accepts if any sequence of choices leads to an accepting configuration. Unlike with NFAs, nondeterministic and deterministic PDAs are not equivalent in power. Deterministic PDAs (DPDAs) recognize a strict subset of the CFLs: languages like genuinely require nondeterminism. DPDAs correspond to the class of grammars that practical parsers can handle efficiently.
5.5.3 Configurations and Acceptance
A configuration of a PDA is a triple : the current state , the remaining input , and the current stack contents (top at the left). The PDA starts in — initial state, full input, empty stack.
Acceptance by final state: the PDA accepts if some sequence of transitions leads from to for some , regardless of stack contents.
Acceptance by empty stack: alternatively, the PDA accepts if it can reach for any state — input consumed and stack empty. The two modes are equivalent in expressive power, and either can be used depending on which is more convenient for a given construction.
5.5.4 Worked Example:
Task: Build a PDA that accepts exactly the strings of balanced parentheses.
States:
Alphabets: , , where $ marks the bottom of the stack and P records each unmatched open paren.
Strategy: The key insight is that the stack can serve as a counter. Every time we see (, we push P; every time we see ), we pop one. If the pops and pushes balance perfectly — meaning the stack returns to only $ exactly when the input ends — the string is accepted. Any early underflow (a ) with nothing to pop) or leftover pushes (open parens that were never matched) cause rejection.
Concretely: exists only to push the bottom-of-stack marker $ and hand off to immediately. From we process symbols as described above. The transition to the accepting state is only available as an -move when the stack top is $ — that is, when the stack is exactly as empty as it started, and all parens have been matched.
Transition relation:
| State | Input | Stack top | Next state | Stack action |
|---|---|---|---|---|
push $ | ||||
( | any | push P | ||
) | P | pop | ||
$ | pop |
Notice what is deliberately absent: there is no transition for ) when the stack top is $. Trying to close a paren that was never opened leaves the PDA stuck, with no move available — the string is rejected.
Trace on (()) — accepted:
| Step | State | Remaining input | Stack |
|---|---|---|---|
| Start | (()) | — | |
Push $ | (()) | $ | |
Read ( | ()) | P$ | |
Read ( | )) | PP$ | |
Read ) | ) | P$ | |
Read ) | — | $ | |
| -move | — | — |
Input exhausted, stack reduced to $, -move to is available. Final state . Accept ✓
Trace on ())( — rejected:
| Step | State | Remaining input | Stack |
|---|---|---|---|
| Start | ()( | — | |
Push $ | ()( | $ | |
Read ( | )( | P$ | |
Read ) | ( | $ | |
Read ( | — | P$ |
Input is exhausted with P still on the stack — there is an unmatched open paren. The -move to requires $ on top, so no accepting configuration is reachable. Reject ✗
The stack is the visual heart of this computation: it grows with each ( and shrinks with each ), and acceptance comes down to whether those changes cancel out perfectly. The simulator below uses a and b in place of ( and ) to make the stack changes easier to follow visually — step through it on inputs of your choice before moving on:
5.6 The Equivalence Theorem
5.6.1 Statement
The grammar and machine formalisms define exactly the same class of languages:
Theorem: A language is context-free if and only if there exists a PDA that recognizes .
This mirrors the DFA/NFA/regex equivalence from Chapters 3–4, and closes the same kind of loop one level up in expressiveness:
5.6.2 From CFG to PDA
Given , we construct an equivalent PDA using a three-state skeleton: a start state, a central loop state, and an accepting state. The machine works by maintaining a sentential form on the stack and consuming the input one terminal at a time, simulating a leftmost derivation step by step.
- Initialize: push a bottom marker
$followed by the start symbol . - Main loop: look at the top of the stack.
- If it is a nonterminal : for each rule in , nondeterministically pop and push (with on top), consuming no input. This is an -transition that “expands” one step of the derivation.
- If it is a terminal : read the next input symbol. If it matches , pop it and continue; if it does not match, this branch of the nondeterministic computation dies — the current choice of productions was wrong.
- Accept: when the stack holds only
$and all input has been consumed, pop$and move to the accepting state.
The nondeterminism is load-bearing here. At every expansion step, the PDA simultaneously explores all possible productions for the current nonterminal, in parallel. A string is accepted if any sequence of choices produces a complete match. This is precisely nondeterministic top-down parsing — the machine is “predicting” the derivation from the start symbol downward. The problem with doing this deterministically is that you cannot know which production to choose without looking ahead at the input, and sometimes a significant amount of lookahead is needed. Making these guesses efficiently and deterministically, via lookahead tables, is what LL and LR parsing algorithms are about — the subject of Chapter 6.
5.6.3 From PDA to CFG
The reverse direction — constructing a grammar from a PDA — is more involved, but follows a clear pattern. The idea is to characterize each PDA computation segment as a grammar rule.
First, simplify the PDA so that it has a single accepting state reached only when the stack is empty. (Any PDA can be put in this form.) Then, for each ordered pair of states , introduce a nonterminal with the intended meaning: generates exactly the strings that take the PDA from state with an empty stack to state with an empty stack.
Productions are generated from the PDA’s transitions as follows:
- If the PDA can move from to by reading and pushing symbol , and later from to by reading and popping , with any intervening computation captured by nonterminals for intermediate states , then this yields a production for .
- The base case is for every state : staying in place on an empty stack generates the empty string.
The resulting grammar can be large — it has nonterminals and productions in the worst case — but it is correct. The key insight is the correspondence between grammatical recursion and the push-pop discipline of the stack: every time the PDA pushes a symbol, it opens a “recursive call” that must eventually be resolved by popping that same symbol, mirroring the recursive structure of a CFG production.
5.6.4 The Chomsky Hierarchy
Finite automata and pushdown automata sit at adjacent levels in a four-tier hierarchy of language classes and machine models:
| Level | Language class | Machine |
|---|---|---|
| 3 | Regular | Finite automaton |
| 2 | Context-free | Pushdown automaton |
| 1 | Context-sensitive | Linear-bounded automaton |
| 0 | Recursively enumerable | Turing machine |
Each class is strictly contained in the one above it. For compiler design, levels 3 and 2 cover almost everything: regular languages for the lexer, context-free languages for the parser.
5.7 Normal Forms
Parsing algorithms often require grammars in a standardized shape. Two normal forms arise frequently in theory and practice.
5.7.1 Chomsky Normal Form (CNF)
Definition: A CFG is in Chomsky Normal Form if every production has one of the forms:
where , , and the last form is only permitted when .
Theorem: Every CFG has an equivalent grammar in CNF.
CNF matters because it is the prerequisite for the CYK algorithm (Cocke-Younger-Kasami), a dynamic programming procedure that decides whether for any CFG in CNF and any input . CYK is used in natural language processing and serves as the standard theoretical baseline for general parsing.
Converting to CNF proceeds in four steps:
- Eliminate -productions — remove rules of the form (except possibly ) by propagating their effect into other rules.
- Eliminate unit productions — remove rules of the form by substituting the productions of directly.
- Remove useless symbols — any nonterminal that cannot derive a terminal string, or that is unreachable from , can be discarded.
- Binarize — replace any right-hand side with three or more symbols by a chain of binary productions using fresh nonterminals.
5.7.2 Greibach Normal Form (GNF)
Definition: A CFG is in Greibach Normal Form if every production has the form:
where and .
In GNF, every rule begins with exactly one terminal, so every derivation step consumes exactly one input symbol. This gives a direct correspondence with PDA transitions and is used in theoretical arguments about parsing complexity and PDA construction.
5.8 CFGs and Programming Languages
5.8.1 The Role of CFGs in a Compiler
The parser takes the token stream from the lexer and determines whether it constitutes a syntactically valid program. This determination is driven by a CFG called the language grammar or syntax specification. A simplified grammar for Python-like statement syntax illustrates the structure:
Bold tokens (if, while, return, id, int) are terminals supplied by the lexer; the nonterminals (, , etc.) are syntactic categories the parser reasons about.
5.8.2 From Parse Tree to AST
The parse tree produced by the parser captures every syntactic detail: all parentheses, commas, and keywords appear as nodes. Most of this detail is structurally useful during parsing but semantically irrelevant afterwards. The compiler therefore derives a more compact Abstract Syntax Tree (AST) by:
- Dropping punctuation and grouping tokens that carry no meaning beyond their role in parsing
- Collapsing chains of unit productions that add no branching structure
- Annotating nodes with semantic information, such as resolved types and scope bindings
The AST is the data structure that flows into subsequent compilation phases: semantic analysis, type checking, and code generation.
5.8.3 What CFGs Cannot Express
Not every constraint in a real programming language can be captured by a CFG. Two important examples:
- Declaration before use: “Every variable must be declared before it is referenced.” Enforcing this requires tracking which names are currently in scope — a context-sensitive property that depends on the global structure of the program, not just local syntactic shape.
- Type consistency: “The left and right sides of an assignment must have compatible types.” This depends on type information accumulated from earlier declarations, which is again beyond what a context-free rule can see.
These constraints are checked in the semantic analysis phase, after parsing. The parser verifies structural well-formedness only; deeper consistency is verified separately. This is a deliberate division of labor: CFG-based parsers are fast, well-understood, and backed by efficient algorithms — use them for what they do well, and defer the rest.
5.9 Chapter Summary
This chapter introduced the formal machinery that underlies every parser.
- Context-free grammars generate languages through recursive rewriting rules. Every context-free language is recognized by a pushdown automaton, and every PDA-recognizable language is context-free.
- Parse trees record the derivation structure of a string and are the primary output of a parser. Ambiguity — multiple parse trees for the same string — is a correctness hazard that must be resolved before a grammar can safely drive a compiler.
- Pushdown automata extend finite automata with a stack, giving them the unbounded counting ability that regular machines lack. Nondeterministic PDAs are strictly more expressive than deterministic ones.
- Normal forms (CNF and GNF) standardize grammars for use in algorithms. CNF enables the general CYK parsing algorithm; GNF aligns grammar rules with PDA transitions.
- In the compiler pipeline, the parser consumes the token stream and produces an AST, which propagates forward through semantic analysis and code generation.
5.10 Exercises
CFG Fundamentals
1. For each of the following languages, write a CFG. Give the full 4-tuple and at least one sample derivation.
a.
b.
c.
d. The set of all properly nested strings of parentheses and square brackets, e.g. ([]), ([()[]]).
2. Consider the grammar:
S → aSb | SS | ε a. What language does this grammar generate? Give an English description and a set-builder definition.
b. Show two distinct parse trees for the string aabb. What does this tell you about the grammar?
c. Is there an unambiguous grammar for the same language? If so, give one.
3. Write a grammar for arithmetic expressions that supports +, -, *, /, is unambiguous, treats * and / as higher-precedence than + and -, makes all operators left-associative, and supports parentheses and integer literals. Give a leftmost derivation of int - int * int + int under your grammar.
Parse Trees and Derivations
4. Given the grammar:
S → AB
A → aA | ε
B → bB | ε a. Give a leftmost derivation of aabb.
b. Give a rightmost derivation of aabb.
c. Draw the parse tree. Is it the same for both derivations?
5. Show that is ambiguous by exhibiting two distinct leftmost derivations for int * int + int.
Pushdown Automata
6. For each of the following languages, construct a PDA. Give the stack alphabet and the transition relation.
a.
b.
c.
7. Trace the PDA from Exercise 6a on the input aaabbb. Show the full configuration sequence at each step.
8. Explain why cannot be recognized by any PDA. Use the pumping lemma for context-free languages.
Normal Forms and Closure
9. Convert the following grammar to Chomsky Normal Form, showing each step.
S → ASB | ε
A → aAS | a
B → SbS | A | bb 10. Prove that the class of context-free languages is closed under union: given CFGs for and for , construct a CFG for . (Hint: introduce a new start symbol.)
11. Show that context-free languages are not closed under intersection by exhibiting two CFLs whose intersection is not context-free.
Programming Language Grammars
12. Consider this grammar for if-else statements:
stmt → if ( expr ) stmt
| if ( expr ) stmt else stmt
| other a. Show that this grammar is ambiguous by drawing two parse trees for if (e1) if (e2) s1 else s2.
b. This is the classic dangling else problem. Describe the disambiguation convention used by C and Java, then rewrite the grammar to enforce it structurally.
c. Why does this ambiguity matter for a compiler?
Advanced
13. Theoretical: Prove that every regular language is context-free. Given a DFA with states , construct an equivalent CFG with one nonterminal per state and one production per transition.
14. Theoretical: Let and . Prove that is not context-free, and conclude that the intersection of two CFLs need not be a CFL.