Holistic Bug Fixing

The other day, while doing my morning jog, I was thinking about a particularly nasty bug I had been chasing for quite some time. I was wearing a portable radio and the guy on the radio was talking about Paul Simon’s all-time classic “Fifty ways to leave your lover”. He wasn’t so much talking about the song — rather about the amazing drum beat created by drummer Steve Gadd. Steve’s performance is incredible — about 100 beats per minute, it just sounds like “prrrrrrrrrr…”. Anyway, the broadcast led my thoughts astray and I suddenly thought: “How many ways are there to fix a bug?” Which brings us back to the topic of this post…

After having chased down a nasty bug, all you have to do is fix it. But a fix is not a fix, right?

Imagine this setting. There is a test execution framework that you and your team mates are using to run automated regression testing. Since your software is configurable (i. e. certain features can be turned on/off either at build time or at run-time), your tests need to be configurable, too. The framework sports a preprocessor (implemented in Perl) that gives you just that — very much like what the C preprocessor gives to C developers. Here is what a typical test script looks like:


    |File: sample.test|
    ...
    #include "config.h"

    Execute test step 1
    Execute test step 2
    #if CONFIGURED_FEATURE_A
    Execute test step 3
    #endif
    Execute test step 4
    #ifndef CONFIGURED_FEATURE_Z
    Execute test step 5
    #endif
    ...

|File: sample.test|

...

#include "config.h"

Execute test step 1

Execute test step 2

#if CONFIGURED_FEATURE_A

Execute test step 3

#endif

Execute test step 4

#ifndef CONFIGURED_FEATURE_Z

Execute test step 5

#endif

...

During execution of this test script, a log file is updated:


    |File: sample.test.log|

    # Log file for test script 'sample.test'
    PASS test step1
    PASS test step2
    PASS test step4
    ...
    FAIL test step25
    ...
    PASS test step 42

|File: sample.test.log|

# Log file for test script 'sample.test'

PASS test step1

PASS test step2

PASS test step4

...

FAIL test step25

...

PASS test step 42

This is obviously just an artificial (or at least, simplified) example, but I guess it is sufficient to explain the idea.

Now, your test execution framework probably has code like this:


    |File: testexec.sh|

    ...
    for $test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config $test

        # Execute.
        execute_test.pl $test

        # Now check if there are any FAILs in the log.
        grep ^FAIL $test.log
        if (($?)) ; then
            echo FAIL $test
        else
            echo PASS $test
        fi
    done
    ...

|File: testexec.sh|

...

for $test in *.test ; do

# Preprocess.

preprocess_test.pl -I dev/project/build/config $test

# Execute.

execute_test.pl $test

# Now check if there are any FAILs in the log.

grep ^FAIL $test.log

if (($?)) ; then

echo FAIL $test

else

echo PASS $test

done

...

A loop iterates over all test scripts in the current directory. In a first step, the test script is fed through a preprocessor Perl script where the configuration business is done. Next, the preprocessed test script is executed and finally, it is checked whether the execution of the test script resulted in any failing test steps.

So far so good. But on a dark and rainy day, you find out by coincidence that some of your test scripts have not been executed at all; worse yet, testexec.sh reported success on them! The logs clearly show that nothing has been executed due to a fatal error:


    |File: mytest.test.log|

    preprocess_test.pl: cannot find include file 'config_ex.h'.

|File: mytest.test.log|

preprocess_test.pl: cannot find include file 'config_ex.h'.

You immediately know what the problem is. A couple of weeks back you added another include directive to some of your test scripts since they depend on additional configuration switches:


    |File: mytest.test|

    #include "config.h"
    #include "config_ex.h"
    ...

|File: mytest.test|

#include "config.h"

#include "config_ex.h"

...

Unfortunately, config_ex.h is located in a different directory than config.h, so the preprocessor — who is given only a single include base path (dev/project/build/config) — cannot find it.

What possibilities do we have to get rid of this problem?

Level 1: Fix the symptom.

A very simple fix would be to change the failing test script by changing the #include statement to include config_ex.h based on an explicit path:


    |File: mytest.test|

    #include "config.h"
    #include "dev/project/build/other/config_ex.h"
    ...

|File: mytest.test|

#include "config.h"

#include "dev/project/build/other/config_ex.h"

...

This would do the job. Yet, this approach is ugly: other developers (including yourself) can easily step into the same pitfall again (most likely they have already).

You would never only fix the symptom, would you? EVERYONE knows that it is bad to only fix the symptom!

But, hey, such a hack is not bad per se. Sometimes, you need a quick fix to be able to carry on. Maybe you cannot change the test execution script yourself because it is located on a remote testbot. Or your company doesn’t like the idea of collective code ownership and only Sam is allowed to change the test execution script and — unfortunately — Sam has already left for the weekend. Anyway, fixing the symptom is sometimes appropriate, but at the very least you should ensure that it will be cleaned up later, by (for instance) using TODOs that can be tracked automatically:


    |File: mytest.test|

    #include "config.h"
    #include "dev/project/build/other/config_ex.h" // TODO:2009-07-29:ralf:Quick fix to get my test running again
                                                   // (missing include base path in test execution script)
    ...

|File: mytest.test|

#include "config.h"

#include "dev/project/build/other/config_ex.h" // TODO:2009-07-29:ralf:Quick fix to get my test running again

// (missing include base path in test execution script)

...

Level 2: Fix immediate problem.

If you can, you’d better fix the bug in the test execution script directly by adding the missing include base path:


    |File: testexec.sh|

    ...
    for $test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test

        # Execute.
        execute_test.pl $test
        ...

|File: testexec.sh|

...

for $test in *.test ; do

# Preprocess.

preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test

# Execute.

execute_test.pl $test

...

Now this looks good and you can claim that you “fixed the problem, not the symptom”, right? While this level 2 fix is certainly more pleasing than the level 1 fix above, I think we can do much better.

Level 3: Prevent bug from happening again.

Our level 2 fix still has shortcomings. The same mistake (i. e. forgetting to add or update an include path in the execution script) can and will lead to the same misery. So we need to safeguard against future errors.

Looking at the invocation of the preprocessor, we can clearly see that there is no error handling at all: either preprocess_test.pl doesn’t produce an exit code in case of fatal errors or our test execution script doesn’t evaluate it. So here is a potential level 3 fix:


    ...
    for $test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2;
        fi

        # Execute.
        execute_test.pl $test
        ...

...

for $test in *.test ; do

# Preprocess.

preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test

if (($?)) ; then

echo "Fatal: Preprocessing of file $test failed."

exit 2;

# Execute.

execute_test.pl $test

...

Now this is slick, isn’t it? Never will a wrong or missing include path trouble you again. But the best is yet to come, please bear with me.

Level 4: Prevent a whole class of bugs.

Our previous fix makes sure that a particular kind of error will not occur again. In a highly automated environment (where only machines look at output) this is not enough. Consider what happens if somebody makes a modification to execute_test.pl that leads to a crash. In this case, no output would be produced, and hence no FAIL messages would be generated and as a result, grep wouldn’t find any FAILs.

Of course, this can only happen because of the “textual” interface of execute_test.pl. A better design would use exit codes instead of grep — 0 for no errors, 1 for “normal” test errors and anything else for fatal errors:


    for $test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2;
        fi

        # Execute.
        execute_test.pl $test
        # Save exit code.
        my_error=$?
        if (($error == 0)) ; then
            echo PASS $test
        elif (($error == 1)) ; then
            echo FAIL $test
        else
            echo "Fatal: Test execution of file $test failed."
            exit 2;
        fi

        ...

for $test in *.test ; do

# Preprocess.

preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test

if (($?)) ; then

echo "Fatal: Preprocessing of file $test failed."

exit 2;

# Execute.

execute_test.pl $test

# Save exit code.

my_error=$?

if (($error == 0)) ; then

echo PASS $test

elif (($error == 1)) ; then

echo FAIL $test

else

echo "Fatal: Test execution of file $test failed."

exit 2;

...

You probably think that this only solves yet another kind of bug, but not really a whole class of bugs. What if somebody someday adds another step and forgets to check an exit code again? Wouldn’t we get the same problem again? We would, of course, at least until we pull out our level 4 laser gun: post-condition checking.

What our test execution script actually promises to do is this: “If you give me a set of N test scripts I will give you back a set of P passed test scripts and a set of F failed test scripts. Either P or F maybe zero but P + F is always N”. This is the post-condition and it holds as long as the pre-conditions (e. g. well-formed test scripts) are respected. So here we have our (hopefully) bullet-proof level 4 version:


    test_glob=*.test

    # Get total number of test scripts.
    tests_total=`ls -1 $test_glob | wc -l`

    for $test in $test_glob ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2;
        fi

        # Execute.
        execute_test.pl $test
        my_error=$?
        if (($error == 0)) ; then
            $(($tests_ok++))
            echo PASS $test
        elif (($error == 1)) ; then
            $(($tests_bad++))
            echo FAIL $test
        else
            echo "Fatal: Test execution of file $test failed."
            exit 2;
        fi

        # Check post-condition.
        if (( $((tests_ok + tests_bad)) != tests_total )) ; then
            echo "Fatal: Not all tests were executed."
            exit 2;
        fi

        ...

test_glob=*.test

# Get total number of test scripts.

tests_total=`ls -1 $test_glob | wc -l`

for $test in $test_glob ; do

# Preprocess.

preprocess_test.pl -I dev/project/build/config -I dev/project/build/other $test

if (($?)) ; then

echo "Fatal: Preprocessing of file $test failed."

exit 2;

# Execute.

execute_test.pl $test

my_error=$?

if (($error == 0)) ; then

$(($tests_ok++))

echo PASS $test

elif (($error == 1)) ; then

$(($tests_bad++))

echo FAIL $test

else

echo "Fatal: Test execution of file $test failed."

exit 2;

# Check post-condition.

if (( $((tests_ok + tests_bad)) != tests_total )) ; then

echo "Fatal: Not all tests were executed."

exit 2;

...

I used a testing example to show you the many facets of bug-fixing but these principles equally apply to “real” source code. Here is a summary of what I wanted to show:

– it is fine to do a quick and dirty “symptom-level” bug-fix every now and then — as long as you are explicit about it.
– Repeatedly zoom out in the search of the root cause, zoom out as much out as possible, but stay within your circle of influence (“We wouldn’t have all of these bugs if these spec coders were fired” is clearly beyond your circle of influence).
– Fortify your fix by making sure that the same or similar bugs will not creep in again.

Approxion

Code – People – Everything

Holistic Bug Fixing