Regular Expressions — Sweetest Poison

It’s amazing how much time you can save by using regular expressions; it’s even more amazing how much time you can spend getting them to work correctly.

Because they are so powerful and easy to use, regexps are easily misused, for instance by applying them to problems that are not “regular”, that is, where balancing is important, such as nested blocks or braces.

Parsing problems like this are not suited for a regular expression matcher, as you need to retain state information and regexps simply cannot keep track of which blocks or braces are open or closed. In cases like this, what you really need is a parser. Period.
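To make the point about state concrete, here is a minimal Python sketch (the function name `braces_balanced` is made up for illustration). It shows the one piece of information a classic regexp has no place to store: how many braces are currently open. It is deliberately naive and ignores braces inside strings and comments, which is exactly the kind of corner case a real parser handles for you.

```python
def braces_balanced(text: str) -> bool:
    """Return True if every '{' in text is closed by a later '}'."""
    depth = 0  # the "state": number of braces currently open
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace with nothing open
                return False
    return depth == 0


print(braces_balanced("void f() { if (x) { g(); } }"))  # True
print(braces_balanced("void f() { if (x) { g(); }"))    # False
```

The counter can grow without bound, which is why no fixed, finite pattern can stand in for it.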

Alas, people often can’t be bothered to write a true parser, even though lex/yacc-like tools greatly simplify the work. And I’m guilty of this myself. Years ago I wrote a profiling tool for embedded systems. The embedded C code that I wanted to profile had to be instrumented (each function required enter/exit logging calls to collect the execution timing data), so I needed to write a tool to do the job. I was not particularly interested in this job (hacking the actual performance analysis code was much more fun), so I decided, well, to go for a heuristic “parser” based on regexps.

In less than one hour I had cobbled together a little script that seemed to work fine. Over the next couple of months I had to spend endless hours fixing all the nasty corner cases; even today it doesn’t work in all circumstances! But I’ve learned my lesson: don’t use regexps when you need a true parser. Again, period.

But even if the problem is regular, people often define regexps sloppily. Look at the following example that checks if a .cfg file appears anywhere in a given string:

So let’s see what we’ve got here. We are obviously looking for a Windows-style absolute path: a single drive letter, followed by a colon and a backslash, followed by any number of directories (each followed by a backslash), followed by a mandatory filename that has a .cfg extension. Looks really neat…
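A pattern matching that description could look like the following Python sketch. It is a reconstruction from the prose, not necessarily the exact expression under discussion, and the names `NAIVE_CFG` and `mentions_cfg` are made up for illustration.

```python
import re

# Assumed reconstruction: drive letter, colon, backslash,
# zero or more "directory\" parts, then a file name ending in ".cfg".
NAIVE_CFG = re.compile(r"\w:\\(\w+\\)*\w+\.cfg")


def mentions_cfg(text: str) -> bool:
    """Return True if something that looks like a .cfg path occurs in text."""
    return NAIVE_CFG.search(text) is not None


print(mentions_cfg(r"loading C:\config\user.cfg now"))  # True
print(mentions_cfg(r"loading C:\config\user.ini now"))  # False
```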

These are the regexps people love to write and I don’t know how many times I’ve had to fix one because of this pathological “simplicity”. It might work today, but it is far from future-proof. Sooner or later the surrounding context will change and this regexp will match much more (or much less) than was intended.

Here are some of the major shortcomings:

– Using word characters \w is way too restrictive. Windows long filenames may contain almost any Unicode (UTF-16) character apart from a handful of reserved ones (\ / : * ? " < > |), but in many regexp engines \w is really only a shortcut for [a-zA-Z0-9_], and even Unicode-aware engines exclude blanks and most punctuation. If a filename contains a blank (or, in an ASCII-only engine, an umlaut), the expression won’t match anymore.

– Actually a corollary of the previous item: you cannot have relative components within an absolute path, e.g. C:\files\services\base\..\items\main.cfg would not match because the \w character class does not allow for dots.

– The regexp is not aligned on a word boundary, so it also matches inside longer names: C:\config\user.cfgbak or C:\config\user.cfg_old will match, too. An editor backup file like C:\config\user.cfg~ matches as well, although a word boundary alone does not cure that case, because a boundary exists between the g and the ~.
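These shortcomings are easy to reproduce. The Python sketch below uses an assumed reconstruction of the naive pattern (`NAIVE_CFG` and `naive_match` are illustrative names). Python’s `re` treats \w as Unicode-aware by default, so `re.ASCII` is passed to get the plain [a-zA-Z0-9_] behaviour the list assumes.

```python
import re

# Assumed reconstruction of the naive pattern; re.ASCII restricts \w
# to [a-zA-Z0-9_] (Python's default \w would also accept umlauts).
NAIVE_CFG = re.compile(r"\w:\\(\w+\\)*\w+\.cfg", re.ASCII)


def naive_match(text: str):
    """Return the matched substring, or None if nothing matches."""
    m = NAIVE_CFG.search(text)
    return m.group(0) if m else None


samples = [
    r"C:\config\user.cfg",                        # plain path: matches
    r"C:\Program Files\app\main.cfg",             # blank in a directory name
    r"C:\Bücher\main.cfg",                        # umlaut in a directory name
    r"C:\files\services\base\..\items\main.cfg",  # relative component
    r"C:\config\user.cfg~",                       # backup file: matched as a prefix
    r"C:\config\user.cfg_old",                    # longer name: matched as a prefix
]
for path in samples:
    print(path, "->", naive_match(path))
```

The second, third and fourth samples print None, while the last two report a match on a file that is not a .cfg file at all.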

Often — but not always — using regexps means striking a careful balance between accuracy and convenience. It makes little sense to implement the complete Windows filename spec in a regexp. But investing a little energy to tighten them up usually pays off in spades. How about this?

At the cost of being only slightly more difficult to read, this solution is much more resilient to change thanks to a few good practices. First, it is easy to define the set of valid drive letters, so I used [a-zA-Z] instead of \w. Second, the whole regexp is aligned on word boundaries, so it no longer matches in the middle of a longer word. Third, by stating that everything between separators (backslashes, in this case) is a series of non-separators, we won’t run into “strange character” problems.
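Put together, the three practices just described suggest a pattern along the lines of this Python sketch. Again it is a reconstruction from the prose, with `ROBUST_CFG` and `find_cfg` as illustrative names, rather than a definitive expression.

```python
import re

# Assumed reconstruction: explicit drive letters, word boundaries at both
# ends, and "anything but a backslash" between the separators.
ROBUST_CFG = re.compile(r"\b[a-zA-Z]:\\([^\\]+\\)*[^\\]+\.cfg\b")


def find_cfg(text: str):
    """Return the matched path, or None if nothing matches."""
    m = ROBUST_CFG.search(text)
    return m.group(0) if m else None


samples = [
    r"C:\Program Files\app\main.cfg",             # blank: now matches
    r"C:\files\services\base\..\items\main.cfg",  # relative component: now matches
    r"C:\config\user.cfg_old",                    # longer name: now rejected
    r"9:\tmp\x.cfg",                              # not a drive letter: rejected
]
for path in samples:
    print(path, "->", find_cfg(path))
```

Two caveats apply to this sketch. A name like C:\config\user.cfg~ still matches, because \b sees a boundary between g and ~; replacing the trailing \b with a lookahead such as (?!\S) would be one way to exclude it, at the cost of also rejecting paths that are directly followed by punctuation. And since [^\\]+ accepts blanks, two paths on the same line can be swallowed as one match, so this is a balance between accuracy and convenience, not a full implementation of the filename rules.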

Next time you write a regexp, think: “I know that by using regexps I’m saving hours of development time, so I can afford to spend another 10 minutes making them more robust.”