Regular Languages and Regular Expressions
Note to readers: Chapters 2–4 cover the theoretical foundation behind lexer generators like lex/flex. You can skip ahead and return to this section later for deeper understanding.
2.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define alphabets, strings, and languages using precise mathematical notation
- Apply the three fundamental regular operations: union (), concatenation (), and Kleene star ()
- Read and write regular expressions for practical token patterns
- Prove that specific languages are not regular using the Pumping Lemma
- Apply closure properties to compose and reason about regular languages
- Articulate why regular languages are the natural model for lexical analysis
2.2 Motivation: Why Study Regular Languages?
2.2.1 The Lexical Analysis Problem
Recall from Section 1.1 that the lexer is the first phase of compilation. It receives raw source code as a stream of characters and must identify token boundaries. Given the input:
int count = 42; the lexer must recognize:
intas a keyword tokencountas an identifier token=as an assignment operator token42as an integer literal token;as a delimiter token
Each of these token categories is defined by a pattern. For example:
- Identifiers: start with a letter or underscore, followed by any number of letters, digits, or underscores
- Integer literals: one or more decimal digits
- Whitespace: one or more spaces, tabs, or newlines
These patterns share a property that matters deeply: they can be recognized by scanning the input once, left to right, using only a fixed amount of memory. You never need to count arbitrarily high numbers or remember arbitrarily long prefixes.
Regular languages are precisely the class of patterns that share this property. Studying them gives us:
- A theoretical foundation — a precise mathematical framework for defining token patterns
- Practical tools — regular expressions for specification and finite automata for implementation
- Efficiency guarantees — linear-time recognition with predictable performance
- Clear limits — a rigorous account of what lexers can and cannot do
2.2.2 The Two-Level Architecture
Modern compilers maintain a clean separation between two phases. The following table summarizes this division:
| Phase | Input | Model | Handles |
|---|---|---|---|
| Lexer | Characters | Regular languages | Token boundaries, pattern matching |
| Parser | Tokens | Context-free grammars | Nested structure, abstract syntax trees |
This separation isn’t arbitrary—it reflects a deep mathematical distinction between two classes of languages. Regular languages handle flat, bounded patterns. Context-free languages handle recursive, nested structures. Understanding regular languages precisely tells us exactly where to draw this line.
This theoretical boundary sets the stage for the formal mathematical definitions that follow.
2.3 Fundamental Definitions
To build a robust lexer, we first need a formal way to describe the characters it reads and the words it forms. This section defines the foundational mathematical objects—alphabets, strings, and languages—that allow us to specify token patterns unambiguously. Every concept in this section will be used throughout the rest of the course.
2.3.1 Alphabets
Definition (Alphabet): An alphabet is a finite, nonempty set of symbols. We typically denote alphabets with the Greek letter .
Examples:
- Binary:
- Lowercase Latin:
- ASCII:
2.3.2 Strings
Definition (String): A string over alphabet is a finite sequence of symbols from .
We write strings by juxtaposing their symbols: abc, 01101, count.
Notation:
- denotes the length of string
- The empty string satisfies
- denotes all strings over (including )
- denotes all nonempty strings over
Critical distinction: is the empty string (zero symbols). The empty language contains no strings at all. These are completely different objects that are easy to conflate.
Example: Over :
This set is infinite, even though itself has only two elements.
2.3.3 String Operations
Before defining operations on collections of strings, we need operations on individual strings. These operations model how a lexer processes sequences of characters.
Concatenation: For strings and , the concatenation appends to .
- If
abcandde, thenabcde - for all (identity element)
- (associative), but in general (not commutative)
Exponentiation: For string and integer :
So a aaa and ab abab.
Reversal: For , the reversal is .
So abc cba and .
2.3.4 Languages
A lexer needs to recognize sets of valid tokens, not just single strings. We formalize these sets as languages.
Definition (Language): A language over alphabet is any subset of .
Since is infinite, languages can be finite, infinite, or empty. The following table provides examples:
| Language | Description |
|---|---|
| Finite language | |
| Infinite language: | |
| Empty language — contains no strings | |
| Singleton containing only the empty string |
Key insight: A language is just a set of strings. We can apply all standard set operations — , , complement — to languages directly.
With alphabets, strings, and languages formally defined, we can now explore how to combine these sets to build the complex patterns required for lexical analysis.
2.4 Operations on Languages
To describe practical token patterns like identifiers or numbers, we need ways to combine simple languages into more complex ones. The three operations in this section—union, concatenation, and Kleene star—are the fundamental building blocks for all lexical specifications.
2.4.1 Union
Definition (Union):
Example:
- ,
Union is commutative and associative. The identity is .
2.4.2 Concatenation
Definition (Concatenation):
Example:
- ,
Two special cases are worth memorizing:
- (the language , not the empty language, is the identity)
- (concatenating with the empty language destroys everything)
Concatenation is associative but not commutative.
2.4.3 Kleene Star
Definition (Kleene Star): The Kleene star is the most powerful and distinctive operation on languages.
where , , , and so on.
In plain English: contains every string you can form by concatenating zero or more strings from .
Example — character class:
- — all strings over
Example — single word:
Three facts to memorize:
- for any language , even if
- If , then — not
- is exactly
Positive closure is the variant that requires at least one concatenation:
2.4.4 Worked Examples
Combining the three operations takes practice. Here are four examples that build intuition.
Example A:
The set and . Their concatenation is all strings with zero or more ’s followed by zero or more ’s:
This includes , , , , and — but not or .
Example B:
First, . Then is all strings over . This includes and — contrasting with Example A.
Takeaway: . The placement of matters.
Example C:
The second factor . Concatenating with gives strings starting with one or two ’s, followed by zero or more ‘s.
Example D:
These three operations form the mathematical core of regular languages, which we will now define formally.
2.5 Regular Languages and Regular Expressions
With our language operations defined, we can now identify the specific class of languages that a lexer can recognize. We call these regular languages, and we use regular expressions as a concise notation to write them down in our compiler’s source code.
2.5.1 Defining Regular Languages
We can now state the central definition precisely.
Definition (Regular Language): The class of regular languages over is defined inductively.
Base cases:
- is a regular language
- is a regular language
- is a regular language for each symbol
Inductive cases — if and are regular, then so are: 4. 5. 6.
Nothing else is regular.
Reading the definition: Start with the smallest possible languages — empty, epsilon, a single character — and build up using exactly three operations. Every regular language can be traced back to this construction. If a language cannot be built this way, it is not regular.
Example: The language of strings over that start with is regular:
Each step uses only base cases and the three inductive operations.
Non-example: The language is not regular. Membership requires matching two unbounded counts — the number of ’s and the number of ’s — and finite memory cannot enforce that equality for arbitrarily large . We will prove this rigorously in Section 2.7.
2.5.2 Regular Expressions: Notation for Regular Languages
Writing out set-theoretic constructions like is cumbersome. Regular expressions are a compact notation for the same thing.
Syntax: Regular expressions over are defined inductively.
Base cases: \emptyset, \epsilon, and a for each .
Inductive cases — if r and s are regular expressions, then so are:
(r | s)— alternation (union)(rs)— concatenation(r*)— Kleene star
Semantics: Each expression r denotes a language . The following table defines the formal semantics of regular expressions, mapping each syntactic form to the language it denotes:
| Expression | Language |
|---|---|
\emptyset | |
\epsilon | |
a | |
| `r | s` |
rs | |
r* |
Operator precedence (highest to lowest):
*(Kleene star)- Concatenation
|(alternation)
This lets us omit most parentheses. The following table illustrates how operator precedence resolves ambiguity in regular expressions:
| Regex | Parsed as | Language | In English |
|---|---|---|---|
ab*c | a(b*)c | a, any number of b’s, then c | |
| `a | b*` | `a | (b*)` |
| `(a | b)*` | — | |
a*b* | (a*)(b*) | All a’s before all b’s |
2.5.3 Extended Notation
The pure definition of regular expressions is minimal by design. In practice, we use convenient shorthands. None of these add expressive power — they are all syntactic sugar. The following table lists common extended notations:
| Notation | Meaning | Example |
|---|---|---|
r+ | rr* — one or more | a+ = |
r? | `(r | \epsilon)` — zero or one |
[abc] | `(a | b |
[a-z] | Range shorthand | [a-z] = any lowercase letter |
[^abc] | Any character except , , | [^\n] = any non-newline |
. | Any single character | .+ = any nonempty string |
2.5.4 Regular Expressions for Token Patterns
With this notation, we can write precise definitions for the token categories a lexer needs.
Identifiers — start with a letter or underscore, continue with letters, digits, or underscores:
[A-Za-z_][A-Za-z0-9_]* Accepted: x, count, _temp, value42
Rejected: 42value (starts with digit), my-var (contains hyphen)
Integer literals:
[0-9]+ Or, if you want to forbid leading zeros: 0 | [1-9][0-9]*
Floating-point literals:
[0-9]+.[0-9]* | .[0-9]+ Whitespace:
[ \t\n\r]+ C/C++ single-line comments:
//[^\n]* Starts with //, followed by zero or more non-newline characters.
A complete token specification for a simple language:
KEYWORD : if | while | return | int | void
IDENT : [A-Za-z_][A-Za-z0-9_]*
INT_LIT : [0-9]+
OPERATOR : + | - | * | / | == | != | <= | >= | < | > | =
DELIM : ; | , | ( | ) | { | }
WHITESPACE : [ \t\n\r]+
COMMENT : //[^\n]* Each of these patterns is regular. A lexer can scan the input once, in a single pass, matching these patterns to produce tokens — and we can prove this is possible in linear time.
Important: Regular expressions cannot handle:
- Nested comments like
/* /* inner */ outer */— requires counting depth (unbounded memory) - Matching parentheses in expressions — requires a stack
- Context-dependent patterns like “is
counta keyword or a variable?” — requires scope information
These require more powerful models. This is why we have two separate phases.
While regular expressions are excellent for specifying individual tokens, we also need to understand how these patterns behave when combined, which leads us to their closure properties.
2.6 Closure Properties of Regular Languages
In a lexer, we often need to combine multiple token patterns or exclude specific strings (like keywords) from a general pattern (like identifiers). Closure properties guarantee that when we combine regular languages using certain operations, the result remains a regular language that our lexer can process.
A class of languages is closed under an operation if applying that operation to languages in the class always produces a result in the class. Regular languages are remarkably well-behaved in this respect.
Theorem (Closure Properties): If and are regular, then so are:
- — union
- — concatenation
- — Kleene star
- — intersection
- — complement
- — difference
- — reversal
Proof:
Properties 1–3 follow directly from the definition of regular languages. The others require construction arguments:
Intersection uses a product automaton: given two DFAs (with formal state sets ) for and , build a new DFA whose states are pairs with one component from each machine. Accept when both components are accepting. (We will formalize this in Chapter 3.)
Complement is simpler: given a DFA for , swap its accepting and non-accepting states. Every accepted string becomes rejected and vice versa.
Difference follows from intersection and complement: .
Reversal follows from reversing all transitions in an NFA (Chapter 3).
2.6.1 Using Closure Practically
Closure properties are a proof technique. To show a language is regular, it suffices to express it as a combination of known regular languages under closure operations.
Example: The language of C identifiers that do not contain the substring goto:
- — regular by definition
- — strings containing
goto— regular (concatenation of a constant with on both sides) - — regular by closure under intersection and complement
No explicit construction needed. The closure properties do the work.
While closure properties help us prove that certain languages are regular, we also need tools to prove when a language is definitively not regular.
2.7 Limits of Regular Languages: The Pumping Lemma
To understand the boundaries of our lexer, we must know what patterns it mathematically cannot recognize. The Pumping Lemma provides a rigorous way to prove that languages requiring unbounded memory fall outside the regular class.
2.7.1 Intuition
A DFA (which we will define formally in Chapter 3 as a 5-tuple ) has a finite number of states in its set —say . If you feed it a string of length , by the pigeonhole principle it must visit some state twice. That means a segment of the input was processed in a loop. Any language that cannot tolerate repeating that segment is not regular.
2.7.2 The Pumping Lemma
Theorem (Pumping Lemma): If is a regular language, then there exists an integer (the pumping length) such that every string with can be written as where:
- — the pumped portion is nonempty
- — the split happens near the start
- For all : — pumping any number of times stays in
The key phrase in condition 3 is for all . In particular, pumping zero times () gives , which must also be in .
2.7.3 Using the Lemma to Prove Non-Regularity
The pumping lemma is used as a proof by contradiction. The structure is always the same:
- Assume is regular with pumping length
- Choose a specific string with — your choice should make pumping impossible
- Consider all valid splits satisfying conditions 1 and 2
- Show that for each such split, some gives
- Conclude: contradiction — is not regular
The choice of in step 2 is the key creative step. Choose it so that any split must land in a predictable part of the string.
2.7.4 Example:
Claim: is not regular.
Proof:
Assume is regular with pumping length . Choose . Then and .
By the pumping lemma, with and .
Since and the first characters of are all ’s, both and consist entirely of ‘s. So for some .
Now pump zero times: . This string has copies of and copies of . Since , we have , so .
This contradicts condition 3. Therefore is not regular.
2.7.5 Example: Balanced Parentheses
Claim: is not regular.
Proof:
Assume is regular with pumping length . Choose . The argument is identical to the previous example: any valid split puts all of in the opening-parenthesis prefix. Pumping produces a mismatch, causing for , which is a contradiction. Thus, is not regular.
2.7.6 Other Non-Regular Languages
The following table summarizes common non-regular languages and the intuitive reason why they fail to be regular:
| Language | Why not regular |
|---|---|
| Cannot count matching pairs | |
| Same — balanced nesting | |
| Cannot remember the first half | |
| Pumping breaks primality | |
| Cannot track difference of two unbounded counts |
The common thread: regularity fails when a language requires tracking an unbounded quantity. A DFA with states has only possible “configurations” — it cannot distinguish between and open parentheses.
2.7.7 Implications for Compilers
Regular languages can handle:
- Identifiers:
[A-Za-z_][A-Za-z0-9_]* - Numbers:
[0-9]+,[0-9]+\.[0-9]+ - Fixed operators:
+,==,!=,<= - Single-level comments:
//[^\n]*
Regular languages cannot handle:
- Arbitrarily nested comments
/* /* */ */ - Balanced parentheses in expressions
- Context-dependent token meaning (a keyword vs. a variable in scope)
This mathematical boundary is exactly why compilers have two distinct phases: the lexer (regular, flat patterns) and the parser (context-free, nested structure). The boundary is not a convention — it is a theorem.
Understanding these theoretical limits confirms why our compiler architecture requires a separate parser to handle nested constructs.
2.8 Chapter Summary
This chapter established the mathematical foundation for lexical analysis.
The progression:
Core definitions:
- An alphabet is a finite nonempty set of symbols
- A string is a finite sequence of symbols; is all strings over
- A language is any subset of
- The three fundamental operations are union, concatenation, and Kleene star
Regular languages: exactly those languages buildable from , , and single symbols using the three operations. They correspond exactly to patterns expressible as regular expressions.
Closure: Regular languages are closed under union, concatenation, star, intersection, complement, difference, and reversal. This lets us build complex patterns from simple ones and reason about them compositionally.
Limits: The Pumping Lemma proves that languages requiring unbounded counting or memory are not regular. This boundary explains the two-phase architecture of compilers.
What comes next: Regular expressions tell us what to recognize. We still need an operational model — a machine that reads characters and decides acceptance. This is the subject of Chapter 3, where we will meet deterministic and nondeterministic finite automata (formally defined with components like ) and prove that they recognize exactly the regular languages.
2.9 Exercises
Fundamental Concepts
1. Define the following in your own words:
a. The difference between and
b. The difference between and
c. Why is infinite even when is finite
2. Let and . Compute:
a.
b.
c. (describe the pattern in English)
d.
3. Evaluate these special cases and explain each answer:
a.
b.
c.
d.
Regular Expressions
4. Describe in English the language defined by each regex:
a. (a|b)*a
b. a*ba*ba*
c. (aa)*
d. (a|b)*abb(a|b)*
5. Write regular expressions for:
a. Binary strings with an even number of 0’s
b. Strings over that start and end with the same symbol
c. C-style identifiers
d. Simplified email addresses: alphanumeric@alphanumeric.alphanumeric
6. Write token patterns for:
a. Floating-point literals: 3.14, 0.5, .5, 2.
b. Hexadecimal integers: 0x1A, 0xFF, 0x0
c. C++ single-line comments: // comment text
d. Python string literals (single and double quoted, no escapes for simplicity)
Closure Properties
7. Let = the language of C identifiers and = the language of C keywords. Express using closure operations:
a. Valid user-defined names (identifiers that are not keywords)
b. Strings that are valid as either an identifier or a keyword
8. Prove that if is regular, then is also regular. (Hint: use induction on the structure of the regular expression for .)
Non-Regular Languages
9. Use the Pumping Lemma to prove that is not regular.
10. For each language, determine whether it is regular and justify your answer:
a.
b.
c.
d.
e. Strings over that contain more ’s than ’s
11. Which of the following should be handled by the lexer and which by the parser? Justify each answer.
a. Recognizing that 123 is an integer literal
b. Flagging 123abc as a lexical error
c. Verifying that parentheses balance in (a + (b * c))
d. Identifying that if is a keyword
e. Checking that a variable is declared before use