« Posts under Bug Hunting Adventures

Bug Hunting Adventures #6: Logarithmic Errors

Sometimes, it is necessary to find out how an application uses dynamically memory. Especially on constrained (think embedded) systems you should know, for instance, what block sizes are typically requested, how much memory is allocated at any one time, and how much memory has been requested at most so far (allocation high-water mark). Data like this allows you to fine-tune your memory allocator and/or roll out your own highly-efficient, special-purpose allocator for popular block sizes.

I recently implemented such a statistics-enhanced version of operator new/delete and one of its tasks was to find out how frequently blocks of certain sizes were requested. Instead of keeping track of exact request sizes (which would have been prohibitively expensive on the target system) I came up with a simpler, but sufficient scheme where I counted allocations to block sizes, rounded to the next base-2 boundary:

log2ceil is the ‘ceiled’ (rounded-up) binary logarithm of a given value. As an example, log2ceil(100) and log2ceil(128) both yield the same value of 7. Hence, the value g_allocations[7] would tell me how many allocations of block sizes in the range of 65 – 128 bytes there have been.

Below is the code of a log2ceil function; it works by counting the number of leading (left-most) zero bits in a word and subtracting this value from 31. In order to get the desired rounding behavior, a well-known trick is applied: first, decrement the argument by one and then increment the result by one.

This implementation is straightforward and all the tests pass. Still, it has a subtle portability issue. Can you see it?

Code
Solution

Bug Hunting Adventures #5: False Diagnosis

Today’s cars host dozens of ECUs (electronic control units) that exchange tens of thousands of messages and signals on various buses (eg. CAN, MOST, Flexray, Ethernet). On top of that, ECUs provide hundredths of diagnostic services, that allow factories, maintenance personnel and even devices on the car to obtain (and change) various parameters, including vehicle speed, diagnostic trouble codes, and sensor/actuator data. By using diagnostic services, you can literally remote control your car, provided you get passed the security mechanisms, of course.

Diagnostic commands are exchanged based on the UDS protocol, as specified by ISO 14229-1:2013, via request/response pairs. Requests and responses are coded like this:

Request: SA, DA, SID1, ..., SIDn
Response (OK): SA, DA, SID1|0x40, SID2, ..., SIDn, RESP-PAYLOAD
Response (error): SA, DA, 0x7F, SID1, ERROR-CODE

Every ECU in a car has a unique 1-byte diagnostic address that uniquely identifies it (signified as source address SA and destination address DA, depending on whether a device acts as a sender or receiver) and each diagnostic service of an ECU is addressed by a 1 to n byte long service ID (SID). (Note that in reality, the n-byte “SID” consists of the real SID and a sub-level function ID, but this detail is of no importance for our considerations.)

If, for example, a device with SA = 0x20 wants to read all diagnostic trouble codes from an ECU with DA = 0x60, it would issue this request:

0x20 0x60 0x19 0x0A

where 0x19 0x0A is the two-byte service ID for the “READ DIAGNOSTIC TROUBLE CODES” service.

The positive response of the targeted ECU might look like this:

0x60 0x20 0x59 0x0A 0xFF 0x02 0x40 0x00 0x50 0x02 ...

Notice that the values of SA and DA are swapped; the replying ECU is now the sender and the requesting ECU the receiver. 0x59 is the first byte of the service ID (SID1) or’ed with the “response OK” flag 0x40. The response payload (the requested information, that is, the actual trouble codes) starts with 0xFF 0x02… .

If an ECU cannot fulfill a request, it sends a negative response indicator, followed by an error code that gives details on what went wrong:

0x60 0x20 0x7F 0x019 0x11

Here, 0x7F denotes ‘error’, 0x19 is the first SID byte of the original request (this time without the “response OK” flag) and 0x11 is the error code for “SERVICE NOT SUPPORTED”.

There are two error code values that usually require special treatment: If an ECU responds with error code 0x21 it means that it is currently busy and unable to process the request; if it responds with 0x78 it denotes that it is able to process the request but needs more time to complete it.

This description of the UDS protocol is of course an oversimplification, but it is enough for our bug-hunting purposes. Attached is an extract of an application that incorrectly processes the response to a diagnostic service request. Can you diagnose the problem?

Code
Solution

Bug Hunting Adventures #4: Casino Royale

On August 18, 1913, something very strange happened at Monaco’s Monte Carlo casino: for the twentieth time in a row, the ball had fallen into a black pocket. More and more people gathered at the table and started betting like crazy for red — they were convinced that the time for red was more than overdue. Well, most of them lost a lot of money because it took another seven spins of the wheel until red finally arrived.

Today, I present a little Groovy program that I wrote in order to get insight into Roulette probability, to save us from what is now known as the Monte Carlo Fallacy.

When you call ‘runSimulation’ you can specify how many number of times you want to spin the wheel. With every spin, the method ‘spinWheel’ returns either 0 (for black), 1 (for red), or 2 (for Zero).

‘runSimulation’ keeps track of the length of a certain color series in three associative arrays (i. e. maps), one for every color (including the color Zero). The key into these maps is the length of the series and the associated value is the count of how often this series length was encountered during the experiment.

To make accessing these color length maps generic, their references are stored in a plain list (‘seriesMaps’). By using the color value returned from ‘spinWheel’ as an index, one can easily obtain the length map for a particular color. This list of length maps is returned at the end of the simulation. As an example, after 100 spins, the list might look like this:

In this simulation, for black, a series of length 2 was encountered 6 times, and a length of 5 one time. For red, there was one series length of 7, and our virtual ball landed three times on Zero, but there was never a series of Zeros (i. e. longer than 1). (Aside: in one experiment I did with the bug-fixed version, I spun the wheel more than 100 million times; I got a maximum series of 30 for black and — believe it or not — I once got a series of four times Zero in a row.)

But the program, as it is presented to you, contains a bug. It just doesn’t work as it is supposed to. Can you spot it? (Note: the bug has nothing to do with any Groovy idiosyncrasies.)

Code
Solution

Bug Hunting Adventures #3: Silent Threads

‘select’ isn’t broken
— The Pragmatic Programmers

Many moons ago, when I tried to familiarize myself with POSIX threads, I wrote a simple test program that was based on a textbook example.

My program sported two threads, one printing ‘+’ characters, the other one printing ‘-‘ characters. Everything worked as expected: a mixed stream of ‘+’ and ‘-‘ characters was emitted to stdout.

But everything happened so fast! Literally thousands of characters were outputted at the blink of an eye, so I added a little extra code that made the threads sleep for a specified amount of time before printing the next character.

Alas, when I set the delay (SLEEP_SECS) to 1 second (or in fact any value different to zero) nothing was printed at all! It looked like the threads got locked up completely. I came up with the weirdest theories about what had happened, including a bug in the pthreads library and the implementation of ‘sleep’.

It wasn’t until the next morning that I realized my mistake. Once I again, I had blamed it on the good ones, when the real problem was blind stupidity.

What was my mistake?

Code
Solution

Bug Hunting Adventures #2: Monitoring Temperature Sensors

Imagine a distributed control system that relies on correct and timely temperature measurements. Various temperature controllers distribute their temperature readings periodically over a bus for other controllers to consume.

To ensure that the temperature sensors work as expected, a ‘TemperatureMonitor’ was implemented. It listens for messages from the temperature controllers and checks whether they arrive in time (at the very latest every 20 ms) and whether their values are within the valid range.

This checking of temperature messages upon arrival obviously doesn’t catch cases where temperature controllers don’t send messages at all (or with a delay that would cause the timer to overflow multiple times) so there is an additional cyclic task that caters for situations like these. This cyclic task is executed every 100 ms and it additionally takes care of the so-called ‘idle period’.

During the ‘idle period’, which is 500 ms, ‘TemperatureMonitor’ is lenient and doesn’t report problems (if any) to allow for an undisturbed start-up of the whole system.

Once the idle period is over and ‘TemperatureMonitor’ detects abnormal conditions it notifies the global ‘ErrorManager’, which will decide how to handle problems based on its current error handling policy (eg. just log the error, reset or disable a temperature controller).

Since the cyclic task and the message handler (the one that is invoked upon the reception of a temperature sensor message) may run in parallel, access to shared data is protected by a simple (but sufficient) synchronization scheme based on enabling/disabling interrupts.

Time measurement is implemented based on a wrap-around 32-bit tick counter that counts raw clock cycles. These clock cycles are converted to a more convenient unit (ie. milliseconds).

Code
Solution

Bug Hunting Adventures #1: Logging Binary Data

Normally, loggers are used to track down bugs, but today, I present a Logger class method that contains a bug itself — isn’t it ironic?

Logger::logbuf() takes ‘len’ bytes of binary data from memory pointed to by ‘buf’ and converts it into printable, zero-terminated hex strings ala “AA 01 B3 C4…”.

For efficiency and readability, the hex string is broken up into smaller parts. Every hex string part is fed to the existing Logger::log() method which is capable of outputting arbitrary, zero-terminated strings (a new-line character is appended automatically).

Happy bug hunting! I will post the solution in two weeks time.

Code
Solution