Have you ever heard about Cliff Young? In case you haven’t here is his story…
For those who think that running a marathon is not challenging enough, there is the Sydney to Melbourne ultra-marathon: 875 kilometers across the hot Australian desert. This race is obviously only for the best of the top athletes and it usually takes them six to seven days to complete. Having gone through the pains of preparing and running a “normal” marathon myself I can only try to imagine what a race like that means.
Despite of these apparent tortures, in 1983, a 61-year-old farmer decided to take part in this race, wearing an old army jacket and gumboots. Spectators were worried that this poor guy wouldn’t survive the first hours of the race but when he left the stadium they gave him a big cheer anyway.
The race started and as expected, Cliff lagged way behind. After 20 hours, the first runners arrived at the camp, where they took a bite, got a massage and slept for about two hours (they typically stopped for a four hour break for every 20 hours of running). When Cliff arrived at the camp three hours later, he didn’t stop — he just continued to run. When a reporter asked him whether he didn’t want to have a break like the other athletes, Cliff responded that he was here to run and not to sleep.
He ran and ran. On the third day he decided to “sleep” for 25 minutes. Believe it or not: he arrived at the finishing line in Melbourne almost two days ahead of number two.
Amazing story, isn’t it? There a various slightly different versions of this story out there, but what they all have in common is that they stop here. What usually isn’t told is that the next year, 17 out of the 18 runners even exceeded Cliff’s record — all of a sudden, they had realized that it is possible to run the race without sleeping at all.
Cliff Young won the race because he challenged existing limits. Not really hard limits, like laws of nature, but arbitrary limits.
So the question is this: is something truly impossible or do we just think it is impossible because we haven’t tried before?
It’s amazing how much time you can save by using regular expressions; it’s even more amazing how much time you can spend getting them to work correctly.
Because they are so powerful and easy to use, regexps can easily be misused, for instance by applying them to problems that are not “regular”, that is, where balancing is important:
Parsing problems like this are not suited for a regular expression matcher, as you need to retain state information and regexps simply cannot keep track of which blocks or braces are open or closed. In cases like this, what you really need is a parser. Period.
Alas, often people can’t be bothered writing a true parser, even if lex/yacc-like tools greatly simplify the work. And I’m guilty of this myself. Years ago I wrote a profiling tool for embedded systems. Since the embedded C code that I wanted to profile had to be instrumented (each function required enter/exit logging calls to get out the execution timing data) I needed to write a tool to do the job. I was not particularly interested in this job — hacking the actual performance analysis code was much more fun — so I decided, well, to go for a heuristic “parser” based on regexps.
In less than one hour I had cobbled together a little script that seemed to work fine. Over the next couple of months I had to spend endless hours fixing all the nasty corner cases; even today it doesn’t work in all circumstances! But I’ve learned my lessons: don’t use regexps when you need a true parser. Again, period.
But even if the problem is regular, people often define regexps sloppily. Look at the following example that checks if a .cfg file appears anywhere in a given string:
So let’s see what we’ve got here. We are obviously looking for a Windows-style absolute path: a single drive letter, followed by a colon and a backslash, followed by n optional directories (each of which followed by a backslash), followed by a mandatory filename that has a .cfg extension. Looks really neat…
These are the regexps people love to write and I don’t know how many times I’ve had to fix one because of this pathological “simplicity”. It might work today, but it is far from future-proof. Sooner or later the surrounding context will change and this regexp will match much more (or much less) than was intended.
Here a some of the major shortcomings:
– Using word characters \w is way too restrictive. According to the Windows long filename specification, a filename may contain any UTF-16 character, but for all practical purposes \w is really only a shortcut for [a-zA-Z0-9_]. If a filename contains a blank or umlaut, the expression won’t match anymore.
– Actually a corollary of the previous item: you cannot have partial relativity within an absolute path, e. g. C:\files\services\base\..\items\main.cfg would not match because the \w character class does not allow for dots.
– The regexp is not aligned on a word boundary, which means that if your editor happens to create backup files like C:\config\user.cfg~ they’ll match, too.
Often — but not always — using regexps means striking a careful balance between accuracy and convenience. It makes little sense to implement the complete Windows filename spec in a regexp. But investing a little energy to tighten them up usually pays off in spades. How about this?
At the cost of being only slightly more difficult to read, this solution is much more resilient to change due to the use of some good practices. First of all, it is easy do define a set of valid drive letters, so I used [a-zA-Z] instead of \w; second, the whole regexp is aligned on word boundaries, which means no more regexp over-/underruns and third, by stating that everything between separators (backslashes, in this case) is a series of non-separators we won’t run into “strange character” problems.
Next time you write a regexp think this: “I know that by using regexps I’m saving hours of development time, so I can afford to spend another 10 minutes to make them more robust”.