What the Regex?!

A Practical Guide to Regular Expressions

My first real programming job was as an intern with the University of Minnesota - Duluth IT department (eons ago). My job was to convert all of the University department data from a SIR database to load into MySQL and to build new web interfaces for each with Perl and CGI scripts. Not only did I get to learn the ins and outs of Perl and regular expressions, but I got my first exposure to writing code to rewrite code. The Perl scripts I was writing used regular expressions to rewrite the old scripts.

The hairiness of the regular expression syntax was mind-boggling at the time (and sometimes still is), but I gained a love for it as a tool and the power it gave me. When I get to work with someone who is new to regular expressions or on a new pattern, I jump at the opportunity. That is what brought me to write this post today.

What are regular expressions?

A regular expression (regex or regexp) is a sequence of characters defining a pattern to search against. Each character in the pattern is either a metacharacter with a special meaning (e.g. match the start of a string) or a regular character with a literal meaning (e.g. match the literal letter a).

Regex is a more precise way of specifying the possible variations of a string. For example, the words:

donut

doughnut

can be specified more precisely with:

do(ugh)?nut
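As a quick check, here is a minimal Ruby sketch of that pattern. The constant name DONUT and the third sample word are my own additions for illustration.

```ruby
# Hypothetical constant name; the pattern itself is the one shown above.
DONUT = /do(ugh)?nut/

# Regexp#match? returns true or false without building a match object.
puts DONUT.match?("donut")     # true
puts DONUT.match?("doughnut")  # true
puts DONUT.match?("dognut")    # false
```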

Practical Applications

What do we use regex for? Here are a few practical applications:

  • creating a search algorithm to find text on a page
  • validating data like a URL format, telephone number, or password
  • parsing a string into separate parts like an area code and a phone number
  • replacing a generic error message with something more specific and on-brand
  • searching your hard drive for files that mention a specific string
  • searching in your IDE through files
  • validating a markdown document format
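To make two of those applications concrete, here is a small Ruby sketch of parsing and replacing. The method names, the sample phone number, and both error messages are made up for illustration, and the syntax it relies on is explained in the sections that follow.

```ruby
# Hypothetical pattern: a three-digit area code, a hyphen, then the local number.
PHONE_PARTS = /(\d{3})-(\d{3}-\d{4})/

# Splits a phone number into [area_code, local_number], or returns nil if it does not match.
def split_phone(text)
  m = PHONE_PARTS.match(text)
  m ? [m[1], m[2]] : nil
end

# Swaps a generic error message for a friendlier one (both messages are invented).
def friendlier_error(message)
  message.sub(/something went wrong/i, "We could not save your order")
end

p split_phone("218-555-0100")                    # ["218", "555-0100"]
puts friendlier_error("Something went wrong.")   # We could not save your order.
```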

Five concepts

I use these five concepts of regex regularly and over the years have (mostly) memorized them for when I need to do pattern matching. For each, I'll explain what it does, give an example or two, and then you can try it out yourself.

Boolean Or

The boolean | operator will match either of the characters or character sequences on its two sides. For instance, if you wanted to test the user agent string to make sure it was an allowable user agent or to run customized logic for a specific user agent, you could use the boolean | operator.

Example:

iPhone|iPad|Android

will match any of the following:

iPhone
iPad
Android

but not:

BlackBerry
iPod

Need to match a string case-insensitively? Each programming language that supports regex might do it in slightly different ways, so be sure to check yours. In Ruby, it is done with the i character at the end of the regex pattern (surrounded by /).

/iPhone|iPad|Android/i

will match:

iPhone
IPAD
anDROID

Try it out!
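In Ruby, that check might look like the sketch below. The constant and method names are hypothetical, and the sample strings are the ones from the lists above rather than real user agent strings.

```ruby
# Hypothetical names; the trailing i makes the whole pattern case-insensitive.
ALLOWED_AGENT = /iPhone|iPad|Android/i

# Returns true when the string mentions any of the three alternatives.
def allowed_agent?(user_agent)
  ALLOWED_AGENT.match?(user_agent)
end

puts allowed_agent?("iPhone")      # true
puts allowed_agent?("IPAD")        # true
puts allowed_agent?("anDROID")     # true
puts allowed_agent?("BlackBerry")  # false
puts allowed_agent?("iPod")        # false
```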

Wildcard

To match any character, we can use the wildcard. The wildcard is specified by . and matches any character except newlines.

Example:

.+ing

will match:

boating
sailing
kayaking
surfing

but not:

boat
sail
kayak
surf

Try it out!

What's that +? We'll cover that in the Quantifiers section.
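Here is a small Ruby sketch of the wildcard pattern filtering a list of words. The method name is hypothetical, and the bare word "ing" is my own addition to show that .+ needs at least one character in front of the suffix.

```ruby
# The pattern from the example: one or more of any character, then "ing".
ING_WORD = /.+ing/

# Array#grep keeps only the elements that match the pattern.
def ing_words(words)
  words.grep(ING_WORD)
end

p ing_words(%w[boating sail kayaking surf ing])  # ["boating", "kayaking"]
```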

Anchors

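To match the start or end of a line (the start of the string or just after a newline, and the end of the string or just before a newline), we can use the anchors ^ and $ respectively. In Ruby these anchors work line by line; \A and \z match only at the very start and very end of the whole string.

We could use anchors to validate that a URL starts with https:// (not http://) and ends with .com.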

Example:

^https:\/\/.+\.com$

will match:

https://jennapederson.com

but not:

http://jennapederson.com
https://jennapederson.dev

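Note the \ before each of the two / characters and before the . in the pattern. We use the backslash \ to escape metacharacters so that they match as literal characters. The metacharacters that need escaping include [ \ ^ $ . | ? * + ( ). The / is not a metacharacter itself, but it needs escaping here because Ruby uses / to mark the start and end of the pattern.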
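If the text to match comes from a variable, Ruby's Regexp.escape can add the backslashes for you. This is a minimal sketch; the helper name literal_pattern and the sample strings are hypothetical.

```ruby
# Builds a pattern that matches the given text literally (hypothetical helper).
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

DOT_COM = literal_pattern(".com")

puts Regexp.escape(".com")             # \.com
puts DOT_COM.match?("example.com")     # true
puts DOT_COM.match?("examplexcom")     # false

# Without escaping, the . acts as a wildcard and matches the x.
puts(/.com/.match?("examplexcom"))     # true
```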

Wondering about that +? We'll talk about that in the next section on Quantifiers!

Try it out!
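One thing worth checking in Ruby is how the anchors behave on multi-line input. The sketch below compares ^ and $ with \A and \z, which are not used elsewhere in this post; the constant names and the two-line sample string are my own.

```ruby
# ^ and $ anchor to each line; \A and \z anchor to the whole string.
LINE_ANCHORED   = /^https:\/\/.+\.com$/
STRING_ANCHORED = /\Ahttps:\/\/.+\.com\z/

# Hypothetical input: a bad first line followed by a good second line.
TWO_LINES = "http://example.org\nhttps://jennapederson.com"

puts LINE_ANCHORED.match?("https://jennapederson.com")    # true
puts LINE_ANCHORED.match?("http://jennapederson.com")     # false
puts LINE_ANCHORED.match?(TWO_LINES)                      # true, the second line matches
puts STRING_ANCHORED.match?(TWO_LINES)                    # false
```

When validating user input, the stricter \A and \z pair is usually the safer choice.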

Quantifiers

Quantifiers follow either a character or a group and specify how many times to match that character or group.

? will match zero or one occurrence

* will match zero or more occurrences

+ will match one or more occurrences

Less common quantifiers like these allow you to match exactly n times or between min and max times:

{n} will match exactly n occurrences

{min,} will match at least min occurrences

{,max} will match up to max occurrences (this works in Ruby; some other regex engines need it written as {0,max})

{min,max} will match between min and max occurrences
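A tiny Ruby sketch can show the difference between these quantifiers. The patterns and test words below are invented for illustration; each one applies a different quantifier to the letter a.

```ruby
# Each pattern starts with h and then quantifies the letter a.
ZERO_OR_ONE  = /^ha?$/      # h or ha
ZERO_OR_MORE = /^ha*$/      # h, ha, haa, ...
ONE_OR_MORE  = /^ha+$/      # ha, haa, ... but not h
TWO_TO_THREE = /^ha{2,3}$/  # haa or haaa only

puts ZERO_OR_ONE.match?("haa")     # false
puts ZERO_OR_MORE.match?("haaaa")  # true
puts ONE_OR_MORE.match?("h")       # false
puts TWO_TO_THREE.match?("haaa")   # true
puts TWO_TO_THREE.match?("haaaa")  # false
```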

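Example:

\d{2}-\d{2}-\d{4}

will match:

02-02-2020

but not:

12 May 2020

Example:

^(\d{1,3})(\d{0,3})(\d{0,4})$

will match a string of one to ten digits, captured in up to three parts. For instance, 2185551234 is captured as 218, 555, and 1234. It will not match 218-555-1234, because the hyphens are not digits.

Try it out!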
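Both quantifier patterns above can be exercised in Ruby with a sketch like this one. The constant names and the digit strings are hypothetical samples, not data from a real system.

```ruby
DATE_PATTERN = /\d{2}-\d{2}-\d{4}/
DIGIT_PARTS  = /^(\d{1,3})(\d{0,3})(\d{0,4})$/

puts DATE_PATTERN.match?("02-02-2020")   # true
puts DATE_PATTERN.match?("12 May 2020")  # false

# MatchData#captures returns the text captured by each group, in order.
p DIGIT_PARTS.match("2185551234").captures  # ["218", "555", "1234"]
p DIGIT_PARTS.match("21855").captures       # ["218", "55", ""]

puts DIGIT_PARTS.match?("218-555-1234")  # false
```

Because the second and third groups allow zero digits, a short input still matches and the trailing groups come back as empty strings.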

Grouping

Grouping allows us to match groups of characters. We use the parentheses ( and ) to open and close a group in our pattern. Capturing groups let us operate on each matched part individually. The captured groups are stored in an array and can be accessed by index.

Example:

(\d{2})-(\d{2})-(\d{4})

will match:

02-02-2020

and will capture three groups:

02
02
2020

Try it out!
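In Ruby, the captured groups are available on the match object. A minimal sketch, assuming a hypothetical helper named date_parts:

```ruby
DATE_GROUPS = /(\d{2})-(\d{2})-(\d{4})/

# Returns the three captured parts as an array, or nil when the text does not match.
def date_parts(text)
  m = DATE_GROUPS.match(text)
  m && m.captures
end

m = DATE_GROUPS.match("02-02-2020")
puts m[0]  # 02-02-2020 (index 0 is the whole match)
puts m[1]  # 02
puts m[3]  # 2020

p date_parts("02-02-2020")   # ["02", "02", "2020"]
p date_parts("12 May 2020")  # nil
```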

Note that \d is a shorthand character class that stands for a larger set of characters, in this case any digit. You can also write your own character classes, such as [0-9] to represent any digit or [A-Z] to represent any uppercase letter.

You can also use named groups with ?<group name> and access each group by its group name:

(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})
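Here is how that might look in Ruby, where a named group can be read from the match object with a symbol or string key. The helper name and the sample date are hypothetical.

```ruby
NAMED_DATE = /(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/

# Returns the year as a string, or nil when the text does not match.
def year_of(text)
  m = NAMED_DATE.match(text)
  m && m[:year]
end

m = NAMED_DATE.match("02-14-2020")
puts m[:month]   # 02
puts m[:day]     # 14
puts m["year"]   # 2020

puts year_of("02-14-2020")  # 2020
```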

Regex Tools

For a more complex pattern, I will test it using Rubular where I can shove in a bunch of strings to match and fiddle with it until it's right. This is specific to Ruby and there will be differences if you're working in other programming languages, but Regex101 can come in handy for that and it provides some pretty handy explanations.

Don't tell anyone, but I usually just use Rubular because it's so fantastic with its cheat sheet right there for me. Occasionally I have to drop out to figure out a specific variation for the language I'm writing my regex for.

Write a Unit Test

Regex pattern matching is a prime candidate for a unit test. I can't tell you how many times I've written a regex pattern, wrapped it in a test, and deployed it to production, only to find out days, weeks, or months later that there's another variation of that string we have to match. With the unit test in place, I can start by writing a failing test, fix the regex pattern, and then rerun the test to make sure I've fixed the problem.
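A test like that could be sketched with Minitest, one of several Ruby test frameworks. The class name, constant name, and the choice of Minitest are my assumptions; the pattern and URLs come from the Anchors section.

```ruby
require "minitest/autorun"

# Pattern under test (hypothetical constant name).
SECURE_DOT_COM = /^https:\/\/.+\.com$/

class SecureDotComTest < Minitest::Test
  def test_matches_https_dot_com
    assert_match SECURE_DOT_COM, "https://jennapederson.com"
  end

  def test_rejects_plain_http
    refute_match SECURE_DOT_COM, "http://jennapederson.com"
  end

  def test_rejects_other_top_level_domains
    refute_match SECURE_DOT_COM, "https://jennapederson.dev"
  end
end
```

When a new variation turns up, add a test case for it first, watch it fail, and then adjust the pattern until every case passes.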

For Practice

There are plenty of challenges and practice tools out there, but Regex Crossword and HackerRank are my favorites.

Share Your Favorite Regex Patterns & Uses

Have you had to write a particularly hairy regex pattern to match a string, to find text in a document or a codebase, or to validate some data? What other uses have you seen for regex? I'd love to see what others have experienced and are using them for. Share it with us in the comments below!
