Glenn Stovall


Metrics-Driven Business and Software Design

My Eureka Moment With Regular Expressions

Regular expressions are a tough cookie for most programmers to figure out. Their syntax is daunting to look at, and there is nothing else quite like them in programming. Programmers may learn to hack together basic regexes, or to cut and paste ones they find online, but it can be tricky to get a real grasp of the concept. Here I’d like to explain the line of thinking that finally led to me grokking regex.

For reference, I’ll be using the Perl-style regular expression syntax used by languages like PHP. Here’s a Regular Expressions Cheat Sheet that I’ll be referencing throughout this article; it is a great resource to keep around.

Regular Expressions: A Language Unto Itself

The tricky thing about regular expressions is that they are their own language within a programming language, with their own collection of symbols and syntax. Those long, scary strings that regexes are made up of are collections of these symbols, combined to describe the pattern you want to match. Think of a regular expression as a sentence, and each one of the symbols as a word.

Writing “Words” in Regular Expressions

Most of the words of your regular expression are going to be made of a few different language constructs:

  • Ranges : define a set of characters that can match. Common examples are [A-Z], which means all capital letters, and [0-9], which means all digits. These can be combined as well. A common one is [A-Za-z0-9-_], which matches any single letter, digit, hyphen, or underscore.
  • Character Classes : similar to ranges. \s means ‘any white space character’, whereas \S means ‘any non-white space character’.
  • Metacharacters : characters that have their own special meaning, the most common being the pipe character |, which means ‘or’, similar to || in most programming languages. A subset of metacharacters is anchors, which denote the beginning of a string (^) or the end of a string ($).
  • Quantifiers : a kind of ‘modifier’ for the above patterns. These let you say how many times the previous pattern should repeat. The common ones are: * for 0 or more, + for 1 or more, and ? for 0 or 1. You can also explicitly state an amount with something like {5} for exactly five repetitions, or a range, such as {3,6}, which matches anywhere between three and six repetitions.
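Each of these constructs can be tried out in isolation. The following is a minimal sketch in Python, whose re module accepts the same syntax for everything listed above; the matches helper is a hypothetical name introduced only for this illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Range: exactly one character from the listed set.
print(matches(r"[A-Z]", "Q"))         # True
print(matches(r"[A-Z]", "q"))         # False

# Character class: \s is one whitespace character, \S one non-whitespace.
print(matches(r"\s", " "))            # True
print(matches(r"\S", " "))            # False

# Metacharacter: | means "or".
print(matches(r"cat|dog", "dog"))     # True

# Anchors: ^ pins the match to the start of the string, $ to the end.
print(re.search(r"^abc", "abcdef") is not None)   # True
print(re.search(r"abc$", "abcdef") is not None)   # False

# Quantifiers: *, +, ?, {n} and {min,max} apply to the item before them.
print(matches(r"[0-9]*", ""))         # True: zero or more digits
print(matches(r"[0-9]+", ""))         # False: needs at least one digit
print(matches(r"[0-9]{5}", "12345"))  # True: exactly five digits
print(matches(r"[0-9]{3,6}", "12"))   # False: fewer than three digits
```

Note that the comma in {3,6} matters: it separates the minimum from the maximum.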

Building Patterns with Words

Let’s look at the following example problem:

We have a system that has to match social security numbers. They can look like any of the following:

123-45-6789
123/45/6789
123456789

So, now that we have the set of patterns we want to match, we can start building up smaller words to match each part of the pattern. Writing regular expressions is similar to writing functions or classes: you start by breaking down the problem into smaller parts, solve each one, and then combine these smaller solutions into a larger one. So let’s start by writing out our pattern in plain English:

  • We have the start of the string,
  • then three digits,
  • then either a hyphen, forward slash, or neither,
  • then two digits,
  • then either a hyphen, forward slash, or neither,
  • then four digits,
  • and that is the end of the string.

So, let’s look at this step by step:

  • We have the start of the string : this is where we start with one of the anchors we mentioned earlier, ^.
  • Three Digits : We discussed the range of digits earlier ([0-9]), but that only matches a single digit. To match exactly 3, we will need a quantifier as well. So this word can be written as [0-9]{3}.
  • Either a hyphen, forward slash, or neither : this is going to take another quantifier, but which one? It helps if we rephrase this statement a bit, and think of it as this: “exactly 0 or 1 hyphen or forward slash”. Now we can see that we need the ? quantifier. We also see that we’ll need to use an ‘or’ for this statement. For statements like this, you can wrap that part of the expression in parentheses so that it’s clear what’s going on. The answer here is (-|/)?. Declaring part of a regular expression inside parentheses like this is called defining a sub-pattern.
  • Two Digits : similar to above: [0-9]{2}.
  • Either a hyphen, forward slash, or neither : same as above: (-|/)?.
  • Four Digits : similar to above: [0-9]{4}.
  • End of the String : another anchor, the $ one this time.

So, now let’s put that all together, and we have a full regular expression:

^[0-9]{3}(-|/)?[0-9]{2}(-|/)?[0-9]{4}$
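The assembly can be mirrored in code by keeping each word as its own string and concatenating them. This is a sketch in Python rather than PHP; the constant names and the looks_like_ssn helper are hypothetical, chosen only to match the steps above.

```python
import re

# Each "word" from the walkthrough, kept as its own string.
START = r"^"
THREE_DIGITS = r"[0-9]{3}"
DELIMITER = r"(-|/)?"
TWO_DIGITS = r"[0-9]{2}"
FOUR_DIGITS = r"[0-9]{4}"
END = r"$"

# Joining the words in order gives the full "sentence".
SSN_PATTERN = re.compile(
    START + THREE_DIGITS + DELIMITER + TWO_DIGITS + DELIMITER + FOUR_DIGITS + END
)


def looks_like_ssn(text: str) -> bool:
    """Return True when `text` has one of the three accepted layouts."""
    return SSN_PATTERN.search(text) is not None


for candidate in ["123-45-6789", "123/45/6789", "123456789",
                  "12-345-6789", "123-45-67890"]:
    print(candidate, looks_like_ssn(candidate))
```

The first three candidates are the layouts listed earlier and are accepted; the last two have the wrong number of digits in a group and are rejected.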

Tada! Now that you can see how a regular expression is really just a bunch of small parts that fit together, hopefully you can figure out how to both write your own regular expressions and read others that you come across.

This is just the tip of the regex iceberg though. There is a whole lot more you can do with regular expressions that is beyond the scope of this article. Play around with them, refer back to the cheat sheet, and see what you can come up with.

Edit

I realized there was a small problem with the regular expression above. While it works for most of our cases, a number formatted like 123-45/6789 would also match, even though it is not in a valid format. I asked about this on Stack Overflow and learned about using back references for situations like this. You can use the syntax \1, where 1 is the number of the sub-pattern you want to reference. By doing so, you can make sure that the second delimiter matches the first. For this to work, the ? has to move inside the sub-pattern: that way the sub-pattern always captures something, even if it is only an empty string, and \1 must repeat exactly what was captured. If the ? stayed outside, as in (-|/)? followed by \1?, a half-delimited number like 123-456789 would still match. So our regular expression would now look like this:

^[0-9]{3}([-/]?)[0-9]{2}\1[0-9]{4}$
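To see what the back reference buys, the sketch below runs three patterns over the same inputs: the first attempt without a back reference, a version with the ? outside the sub-pattern, and a version with the ? inside it. It is written in Python, and the constant names are hypothetical labels for this comparison only.

```python
import re

# First attempt: the two delimiters are matched independently.
NO_BACKREF = re.compile(r"^[0-9]{3}(-|/)?[0-9]{2}(-|/)?[0-9]{4}$")

# Back reference with the ? outside the sub-pattern: \1? may match nothing.
OPTIONAL_OUTSIDE = re.compile(r"^[0-9]{3}(-|/)?[0-9]{2}\1?[0-9]{4}$")

# Back reference with the ? inside: the group always captures something
# (possibly an empty string) and \1 must repeat it exactly.
OPTIONAL_INSIDE = re.compile(r"^[0-9]{3}([-/]?)[0-9]{2}\1[0-9]{4}$")

samples = ["123-45-6789", "123/45/6789", "123456789",
           "123-45/6789", "123-456789"]

for text in samples:
    results = [
        pattern.search(text) is not None
        for pattern in (NO_BACKREF, OPTIONAL_OUTSIDE, OPTIONAL_INSIDE)
    ]
    print(f"{text:<12} {results}")
```

All three patterns accept the three valid layouts. The mixed 123-45/6789 is accepted only by the first pattern, and the half-delimited 123-456789 is rejected only by the last one.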