Holistic Bug Fixing

A quick fix…The other day, while doing my morning jog, I was thinking about a particularly nasty bug I had been chasing for quite some time. I was wearing a portable radio and the guy on the radio was talking about Paul Simon’s all-time classic “Fifty ways to leave your lover”. He wasn’t so much talking about the song — rather about the amazing drum beat created by drummer Steve Gadd. Steve’s performance is incredible — about 100 beats per minute, it just sounds like “prrrrrrrrrr…”. Anyway, the broadcast led my thoughts astray and I suddenly thought: “How many ways are there to fix a bug?” Which brings us back to the topic of this post…

After having chased down a nasty bug, all you have to do is fix it. But a fix is not a fix, right?

Imagine this setting. There is a test execution framework that you and your team mates are using to run automated regression testing. Since your software is configurable (i. e. certain features can be turned on/off either at build time or at run-time), your tests need to be configurable, too. The framework sports a preprocessor (implemented in Perl) that gives you just that — very much like what the C preprocessor gives to C developers. Here is what a typical test script looks like:

During execution of this test script, a log file is updated:

This is obviously just an artificial (or at least, simplified) example, but I guess it is sufficient to explain the idea.

Now, your test execution framework probably has code like this:

A loop iterates over all test scripts in the current directory. In a first step, the test script is fed through a preprocessor Perl script where the configuration business is done. Next, the preprocessed test script is executed and finally, it is checked whether the execution of the test script resulted in any failing test steps.

So far so good. But on a dark and rainy day, you find out by coincidence that some of your test scripts have not been executed at all; worse yet, testexec.sh reported success on them! The logs clearly show that nothing has been executed due to a fatal error:

You immediately know what the problem is. A couple of weeks back you added another include directive to some of your test scripts since they depend on additional configuration switches:

Unfortunately, config_ex.h is located in a different directory than config.h, so the preprocessor — who is given only a single include base path (dev/project/build/config) — cannot find it.

What possibilities do we have to get rid of this problem?

Level 1: Fix the symptom.

A very simple fix would be to change the failing test script by changing the #include statement to include config_ex.h based on an explicit path:

This would do the job. Yet, this approach is ugly: other developers (including yourself) can easily step into the same pitfall again (most likely they have already).

You would never only fix the symptom, would you? EVERYONE knows that it is bad to only fix the symptom!

But, hey, such a hack is not bad per se. Sometimes, you need a quick fix to be able to carry on. Maybe you cannot change the test execution script yourself because it is located on a remote testbot. Or your company doesn’t like the idea of collective code ownership and only Sam is allowed to change the test execution script and — unfortunately — Sam has already left for the weekend. Anyway, fixing the symptom is sometimes appropriate, but at the very least you should ensure that it will be cleaned up later, by (for instance) using TODOs that can be tracked automatically:

Level 2: Fix immediate problem.

If you can, you’d better fix the bug in the test execution script directly by adding the missing include base path:

Now this looks good and you can claim that you “fixed the problem, not the symptom”, right? While this level 2 fix is certainly more pleasing than the level 1 fix above, I think we can do much better.

Level 3: Prevent bug from happening again.

Our level 2 fix still has shortcomings. The same mistake (i. e. forgetting to add or update an include path in the execution script) can and will lead to the same misery. So we need to safeguard against future errors.

Looking at the invocation of the preprocessor, we can clearly see that there is no error handling at all: either preprocess_test.pl doesn’t produce an exit code in case of fatal errors or our test execution script doesn’t evaluate it. So here is a potential level 3 fix:

Now this is slick, isn’t it? Never will a wrong or missing include path trouble you again. But the best is yet to come, please bear with me.

Level 4: Prevent a whole class of bugs.

Our previous fix makes sure that a particular kind of error will not occur again. In a highly automated environment (where only machines look at output) this is not enough. Consider what happens if somebody makes a modification to execute_test.pl that leads to a crash. In this case, no output would be produced, and hence no FAIL messages would be generated and as a result, grep wouldn’t find any FAILs.

Of course, this can only happen because of the “textual” interface of execute_test.pl. A better design would use exit codes instead of grep — 0 for no errors, 1 for “normal” test errors and anything else for fatal errors:

You probably think that this only solves yet another kind of bug, but not really a whole class of bugs. What if somebody someday adds another step and forgets to check an exit code again? Wouldn’t we get the same problem again? We would, of course, at least until we pull out our level 4 laser gun: post-condition checking.

What our test execution script actually promises to do is this: “If you give me a set of N test scripts I will give you back a set of P passed test scripts and a set of F failed test scripts. Either P or F maybe zero but P + F is always N”. This is the post-condition and it holds as long as the pre-conditions (e. g. well-formed test scripts) are respected. So here we have our (hopefully) bullet-proof level 4 version:

I used a testing example to show you the many facets of bug-fixing but these principles equally apply to “real” source code. Here is a summary of what I wanted to show:

– it is fine to do a quick and dirty “symptom-level” bug-fix every now and then — as long as you are explicit about it.
– Repeatedly zoom out in the search of the root cause, zoom out as much out as possible, but stay within your circle of influence (“We wouldn’t have all of these bugs if these spec coders were fired” is clearly beyond your circle of influence).
– Fortify your fix by making sure that the same or similar bugs will not creep in again.