Week 4: Grep: Regular Expression Parsing And Evaluating
Chris Tralie
One of my learning goals in this class is to "use theory in nonobvious ways to create awesome practical solutions to problems." We saw the first real application of unit 1 to substring search on problem 8 in HW3 with a simplified version of the KnuthMorris Pratt (KMP) algorithm. The nonobvious thing here (before taking TOC) was to use a DFA. In fact, there are legends of some people being so confused about the source code of KMP when they had to maintain code containing it that they deleted it and replaced it with naive substring search (I'll find a citation for this story later).
One even less obvious and even more powerful application is regular expression search; rather than searching for an exact string, we can search for a regular expression inside of text. This is the culmination of the first unit of the course.
In this writeup, we'll restrict ourselves to the slightly simpler case of seeing if the entirety of a particular string s is in the language generated by a regular expression R. The nonobvious thing here is to convert the regular expression to an NFA, which can be efficiently evaluated, as we showed in code before. In fact, a more "naive" way of recursively exploring possible traversals of the regular expression without an NFA can lead to exponential time algorithms (see this link for a nice discussion about this which mirrors my writeup here). On the other hand, NFAbased evaluation takes O(len(s) len(R)) time in the worst case, and in practice, it's often closer to O(len(s)).
My goal in this document will be to explain a variant of the full pipeline for RegEx to NFA, as first outlined by Ken Thompson in his seminal 1968 paper. We'll use python to be consistent with everything else we've done so far. Before proceeding, you may want to review the notes on evaluating NFAs in python with and without λ, as I'll take that as a given, focusing mainly on the RegEx to NFA conversion in this document.
In the discussion below, we have the following operators, with the following order of operations (where 3 is the highest precedence)

3
*
: Star (repeat an arbitrary number of times, or no times) 
2
.
: Concatenation (the language on the left followed by the language on the right) 
1

: Union or "OR"; take the union of the two languages on either side
Step 1: Infix > Postfix
Suppose we have the regular expression
((AB)*C)*B(C(A*B))
From an algorithm perspective, one challenge we immediately see is how to figure out what to look at first based on the parentheses and order of operations. In fact, while this format of a regular expression, known as infix notation, is convenient for humans to read and group mentally, it's not as natural to computers. Instead, we want to put the operators *
, 
, and .
closer to the operands that they're applied to. In the process, we'll also be able to eliminate parentheses completely.
Shorthand Expansion
Before we proceed, we'll take care of some shorthand with concatenation in the above expression. We'll add a dot after every character not followed by any of *)
, as well as after every *
. Furthermore, we'll replace every single character c in the alphabet by c_i, where i is the order in which it occurs in the expression, just as we did in problem 8 in HW3. If we do this on the above, we're left with the following:
( ( A_0 . B_1 ) * . C_2 ) * . B_3 . ( C_4  ( A_5 * . B_6 ) )
Now we're in a format that will be easier to describe systematically
Algorithm 1: Infix To Postfix
Before we describe the algorithm to convert infix to postfix, let's look at what the postfix would be for the above
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
You'll notice that there are no more parentheses, and that the operators .*
are right next to the operands on which they are applied.
To convert infix to postfix, we'll a stack to keep track of the operators. The algorithm is simple enough to describe directly in python:
Try it yourself on a piece of paper, and you'll see that we get the desired result on our example
Step 2: Postfix > NFA with Evaluator Stack
Now that we have a good representation of our regular expression R, we can describe an algorithm to assemble an NFA out of it. In the discussion that follows, we'll be working with NFA fragments, which are smaller machines that we will incrementally splice into each other to make the entire automaton that recognizes the language generated by R. We'll assume that every NFA fragment has the following form:
That is, every NFA fragment has exactly one arrow going into it at a state called in
, with an unspecified character transitioning into it, as well as exactly one arrow going out of it which transitions on the arrow with character c.
Algorithm 2: Postfix RegEx To NFA
The algorithm to splice together NFA fragments incrementally uses a stack. The overall pseudocode is as follows:
Now let's describe the assembly process to create the meta machines for *
, .
, and 
Star *
Meta Machine
The image below shows the meta machine that we create to put a *
around an NFA fragment (in, out, c)
:
The result is a new NFA fragment (in_*start, out_*end, λ)
in place of the original NFA fragment. To get here, we create two new states in_*start
and out_*end
, as well as 4 arrows (highlighted in red):
 A λ arrow to go into and run the NFA fragment we're wrapping around

A λ arrow to skip the NFA fragment entirely, since we can take the empty string λ with the
*

A λ arrow to repeat the NFA fragment we're wrapping around, since we can loop through it an arbitrary number of times with the
*
 An arrow to leave this entire block through the NFA fragment's end character c
Concatenation .
Meta Machine
The image below shows how to create the meta fragment performing a concatenation of two NFA fragments (in1, out1, c1)
and (in2, out2, c2)
The result is a new NFA fragment (in1, out2, c2)
in the place of the original two fragments. To get here, we add a single arrow from out1
to in2
along the character c1 to glue the two fragments together. We don't actually need to create any new states
Union 
Meta Machine
The image below shows how to create the meta fragment performing a union of two NFA fragments (in1, out1, c1)
and (in2, out2, c2)
The result is a new NFA fragment (in1_start, out2_end, λ)
in the place of the original two fragments, which branches out nondeterministically out into one of the original NFA fragments and forces the automaton to execute at least one of them to pass onto the next phase. To accomplish this, we add the two new states in1_start
and out2_end
, as well as the following four arrows (highlighted in red):

A λ arrow from
in1_start
toin1
to run the first NFA fragment as a possibility 
A λ arrow from
in1_start
toin2
to run the second NFA fragment as a possibility 
An arrow from
out1
toout2_end
along the characterc1
to finish the first machine and continue 
An arrow from
out2
toout2_end
along the characterc2
to finish the second machine and continue
Worked Example
It's worth now doing a full example of algorithm 2 on a postfix regular expression. We'll start with the example
((AB)*C)*B(C(A*B))
which, in postfix, is
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
First, we see A_0
and B_1
, which are ordinary operands. We push primitive NFA fragments for each onto the stack in that order:
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
Next, we see a concatenation character .
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:
Next, we see a *
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:
Next, we see an ordinary operand C_2
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
so we simply push its primitive NFA fragment onto the stack:
Next, we see a concatenation operator .
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
So we pop off the top two elements and connect them together, pushing the result back on top of the stack:
Next, we see a *
operator
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:
Next, we see an ordinary operand B_3
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
so we push a corresponding primitive NFA fragment onto the stack
Next, we see a concatenation operator .
So we pop off the top two elements and glue them together, pushing the result onto the stack
Next, we see two primitive operands C_4
and A_5
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
and we push their corresponding primitive NFA fragments onto the stack in that order
Next, we see a *
operator
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:
Next, we see an ordinary operand B_6
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
and we push the corresponding primitive NFA fragment onto the stack
Next, we see a concatenation character .
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:
Next, we see a union character 
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top two elements and create a meta machine according to the union meta machine recipe, pushing the resulting meta fragment back onto the stack:
Finally, we see a concatenation character .
A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 .  .
We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:
The last step is to create a start state Start
where we start and an end state Finish
where we accept. We pop off the only element that's left on the stack, we feed Start
into the beginning of that element with a λ arrow, and then we feed the end of that machine into Finish
with the character coming out of it:
This is the final result for the regular expression ((AB)*C)*B(C(A*B))
!