Week 4: Grep: Regular Expression Parsing And Evaluating

Chris Tralie

One of my learning goals in this class is to "use theory in non-obvious ways to create awesome practical solutions to problems." We saw the first real application of unit 1 to substring search on problem 8 in HW3 with a simplified version of the Knuth-Morris Pratt (KMP) algorithm. The non-obvious thing here (before taking TOC) was to use a DFA. In fact, there are legends of some people being so confused about the source code of KMP when they had to maintain code containing it that they deleted it and replaced it with naive substring search (I'll find a citation for this story later).

One even less obvious and even more powerful application is regular expression search; rather than searching for an exact string, we can search for a regular expression inside of text. This is the culmination of the first unit of the course.

In this writeup, we'll restrict ourselves to the slightly simpler case of seeing if the entirety of a particular string s is in the language generated by a regular expression R. The non-obvious thing here is to convert the regular expression to an NFA, which can be efficiently evaluated, as we showed in code before. In fact, a more "naive" way of recursively exploring possible traversals of the regular expression without an NFA can lead to exponential time algorithms (see this link for a nice discussion about this which mirrors my writeup here). On the other hand, NFA-based evaluation takes O(len(s) len(R)) time in the worst case, and in practice, it's often closer to O(len(s)).

My goal in this document will be to explain a variant of the full pipeline for RegEx to NFA, as first outlined by Ken Thompson in his seminal 1968 paper. We'll use python to be consistent with everything else we've done so far. Before proceeding, you may want to review the notes on evaluating NFAs in python with and without λ, as I'll take that as a given, focusing mainly on the RegEx to NFA conversion in this document.

In the discussion below, we have the following operators, with the following order of operations (where 3 is the highest precedence)

  • 3 *: Star (repeat an arbitrary number of times, or no times)
  • 2 .: Concatenation (the language on the left followed by the language on the right)
  • 1 |: Union or "OR"; take the union of the two languages on either side

Step 1: Infix -> Postfix

Suppose we have the regular expression

((AB)*C)*B(C|(A*B))

From an algorithm perspective, one challenge we immediately see is how to figure out what to look at first based on the parentheses and order of operations. In fact, while this format of a regular expression, known as infix notation, is convenient for humans to read and group mentally, it's not as natural to computers. Instead, we want to put the operators *, |, and . closer to the operands that they're applied to. In the process, we'll also be able to eliminate parentheses completely.

Shorthand Expansion

Before we proceed, we'll take care of some shorthand with concatenation in the above expression. We'll add a dot after every character not followed by any of *|), as well as after every *. Furthermore, we'll replace every single character c in the alphabet by c_i, where i is the order in which it occurs in the expression, just as we did in problem 8 in HW3. If we do this on the above, we're left with the following:

( ( A_0 . B_1 ) * . C_2 ) * . B_3 . ( C_4 | ( A_5 * . B_6 ) )

Now we're in a format that will be easier to describe systematically

Algorithm 1: Infix To Postfix

Before we describe the algorithm to convert infix to postfix, let's look at what the postfix would be for the above

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

You'll notice that there are no more parentheses, and that the operators .*| are right next to the operands on which they are applied.

To convert infix to postfix, we'll a stack to keep track of the operators. The algorithm is simple enough to describe directly in python:

Try it yourself on a piece of paper, and you'll see that we get the desired result on our example

Step 2: Postfix -> NFA with Evaluator Stack

Now that we have a good representation of our regular expression R, we can describe an algorithm to assemble an NFA out of it. In the discussion that follows, we'll be working with NFA fragments, which are smaller machines that we will incrementally splice into each other to make the entire automaton that recognizes the language generated by R. We'll assume that every NFA fragment has the following form:

That is, every NFA fragment has exactly one arrow going into it at a state called in, with an unspecified character transitioning into it, as well as exactly one arrow going out of it which transitions on the arrow with character c.

Algorithm 2: Postfix RegEx To NFA

The algorithm to splice together NFA fragments incrementally uses a stack. The overall pseudocode is as follows:

Now let's describe the assembly process to create the meta machines for *, ., and |

Star * Meta Machine

The image below shows the meta machine that we create to put a * around an NFA fragment (in, out, c):

The result is a new NFA fragment (in_*start, out_*end, λ) in place of the original NFA fragment. To get here, we create two new states in_*start and out_*end, as well as 4 arrows (highlighted in red):

  • A λ arrow to go into and run the NFA fragment we're wrapping around
  • A λ arrow to skip the NFA fragment entirely, since we can take the empty string λ with the *
  • A λ arrow to repeat the NFA fragment we're wrapping around, since we can loop through it an arbitrary number of times with the *
  • An arrow to leave this entire block through the NFA fragment's end character c

Concatenation . Meta Machine

The image below shows how to create the meta fragment performing a concatenation of two NFA fragments (in1, out1, c1) and (in2, out2, c2)

The result is a new NFA fragment (in1, out2, c2) in the place of the original two fragments. To get here, we add a single arrow from out1 to in2 along the character c1 to glue the two fragments together. We don't actually need to create any new states

Union | Meta Machine

The image below shows how to create the meta fragment performing a union of two NFA fragments (in1, out1, c1) and (in2, out2, c2)

The result is a new NFA fragment (in1_|start, out2_|end, λ) in the place of the original two fragments, which branches out nondeterministically out into one of the original NFA fragments and forces the automaton to execute at least one of them to pass onto the next phase. To accomplish this, we add the two new states in1_|start and out2_|end, as well as the following four arrows (highlighted in red):

  • A λ arrow from in1_|start to in1 to run the first NFA fragment as a possibility
  • A λ arrow from in1_|start to in2 to run the second NFA fragment as a possibility
  • An arrow from out1 to out2_|end along the character c1 to finish the first machine and continue
  • An arrow from out2 to out2_|end along the character c2 to finish the second machine and continue

Worked Example

It's worth now doing a full example of algorithm 2 on a postfix regular expression. We'll start with the example

((AB)*C)*B(C|(A*B))

which, in postfix, is

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

First, we see A_0 and B_1, which are ordinary operands. We push primitive NFA fragments for each onto the stack in that order:

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

Next, we see a concatenation character .

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:

Next, we see a *

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:

Next, we see an ordinary operand C_2

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

so we simply push its primitive NFA fragment onto the stack:

Next, we see a concatenation operator .

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

So we pop off the top two elements and connect them together, pushing the result back on top of the stack:

Next, we see a * operator

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:

Next, we see an ordinary operand B_3

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

so we push a corresponding primitive NFA fragment onto the stack

Next, we see a concatenation operator .

So we pop off the top two elements and glue them together, pushing the result onto the stack

Next, we see two primitive operands C_4 and A_5

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

and we push their corresponding primitive NFA fragments onto the stack in that order

Next, we see a * operator

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top element of the stack and wrap it into the star meta machine, pushing the resulting fragment back onto the stack:

Next, we see an ordinary operand B_6

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

and we push the corresponding primitive NFA fragment onto the stack

Next, we see a concatenation character .

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:

Next, we see a union character |

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top two elements and create a meta machine according to the union meta machine recipe, pushing the resulting meta fragment back onto the stack:

Finally, we see a concatenation character .

A_0 B_1 . * C_2 . * B_3 . C_4 A_5 * B_6 . | .

We'll pop off the top two elements and create a meta machine according to the concatenation meta machine recipe, pushing the resulting meta fragment back onto the stack:

The last step is to create a start state Start where we start and an end state Finish where we accept. We pop off the only element that's left on the stack, we feed Start into the beginning of that element with a λ arrow, and then we feed the end of that machine into Finish with the character coming out of it:

This is the final result for the regular expression ((AB)*C)*B(C|(A*B))!