Distrusting Experts

“You don’t know what you don’t know until you know it.”
— anon

I recently wrote a post about Scott Meyers, one of my programming heroes. The one I’m writing about today ranks even higher: Michael Abrash. For decades, I have admired his ability to craft awesome, non-obvious, super-tight code. But he’s not just an excellent performance programmer; he is also an execellent writer and storyteller.

It’s no coincidence that he’s also the author of one of my favorite programming books: “ZEN of Code Optimization” [1]. Released in 1994, it predates the Internet age (lo and behold, it comes with a 3.5″ floppy disk), but most of the content is timeless advice for every developer who cares about performance. (We all should. Michael’s rule #1: “From the user’s perspective, performance is fundamental”.)

In chapter 10, “Patient Coding, Faster Code”, he explains that just doing micro optimizations (or recoding in assembly language) usually doesn’t cut it; all you end up with is “fast slow code”. For best results, you need key insights; that is, view the problem from a different angle and use an offbeat approach to solve it.

He presents the venerable “greatest common divisor of two given numbers” problem as a case in point. After toying with a brute-force approach (incrementing a variable and checking if it divides both numbers) he employs Euclid’s key insight, which improves the run-time by orders of magnitude:

gcd(int1, int2) ≡ gcd(int2, int1 % int2)

Or in plain English: the GCD of a larger positive integer ‘int1’ and a smaller positive integer ‘int2’ is the same as the GCD of ‘int2’ and the remainder of ‘int1’ divided by ‘int2’.

This is his original code:

Needless to say, being that assembly wizard that he is, he goes on and reimplements this routine in assembly language to make it even faster. But as for C, this is probably as fast as it can get. Or is it?

I’ve used this code blindly and almost literally for many, many years. After all, it’s from Michael Abrash, and to me, his code has always been beyond any doubt. Definitely not the right attitude, but let me explain.

I’m a big fan of code katas, little exercises that you do (and redo time and again, sometimes in different programming languages) to flex your programming muscle. I don’t know why, but when I did the GCD code kata this time, I stopped and scratched my head. What the heck is this int swapping business about? Why is this code necessary, as the comment says?

Just for the fun of it, I took it out and reran the unit tests — they all passed.

Actually, it’s not that hard to see why it’s not necessary. Let’s assume ‘int1’ is 366 and ‘int2’ is 60. Since ‘int1’ is greater than ‘int2’, no swapping is necessary. Within the ‘for’ loop, the variables take on the following values:

iteration temp int1 int2 return
1 366 % 60 60 6
2 60 % 6 60 6 6

Now if ‘int1’ is 60 and ‘int2’ is 366, the code as it is would swap their values, but let’s pretend there was no swapping code:

iteration temp int1 int2 return
1 60 % 366 366 60
2 366 % 60 60 6
3 60 % 6 60 6 6

Aha! There is one more iteration and this additional iteration (actually the very first iteration) swaps ‘int1’ and ‘int2’. The ‘for’ loop by itself takes care of rearranging the arguments, there is no need for the extra swap test at the beginning of the ‘gcd’ routine. I was puzzled. Is it really possible that I have discovered redundant code in Michael Abrash’s sacred code? I felt quite uneasy because I wanted my hero to be infallible.

A couple of days ago, I send him an email about my findings. His reply was short and to the point:

“Believe me, I never claimed to be immune to cycle-wasting! And anyway, There Ain’t No Such Thing As The Fastest Code :)”

There you have it. It’s that simple. I found unnecessary code in a 20+ years old routine. That’s what I call delayed gratification!

The basic problem was that I blindly trusted what I saw, probably because it came from an expert. However, nobody is perfect, every once in a while even the best make blunders. Have a flexible mind, assume nothing and question everything — this is the right attitude for every software developer and proclaimed in many of Michael’s writings [2].

[1] “ZEN of Code Optimization” is available online as part I of “Michael Abrash’s Graphics Programming Black Book, Special Edition”.

[2] Some years ago, I gave a presentation titled “The Art of Writing Efficient Software”. It includes a lot of Michael Abrash’s advice plus more. Check it out here. I truly believe that efficiency is still (and forever will be) of utmost importance for end users.

A Micro Logger for Constrained Systems

Drei Notizzettel vor HolzhintergrundLogging is one of the oldest and most widely used software debugging tools: for decades, developers have put printf-like statements in their code to find out what their code is really doing, in situ. Even today, logging is still crucial for finding bugs in software, especially sporadic problems in concurrent and event-driven systems.

Fortunately, most programming environments come with powerful logging facilities (like Android’s Log) — but what if you are working on a highly-constrained system that simply can’t afford them?

Let’s have a look at an example that uses printf-like logging:

Traditional (printf-like) logging is expensive because:

  1. Log strings consume memory (ROM), which can easily increase the overall size of the executable by 10% to 30%.
  2. printf-based logging is extremely flexible, but the parsing of format strings can incur a significant run-time overhead.
  3. The amount of log data produced at run-time is quite large because all the log strings need to be transmitted (or stored locally, in case you keep logging data in a file), usually accompanied with timestamps and other bookkeeping information.

I recently had a crazy idea how to compress a log string of (almost) arbitrary size to a single byte: Instead of using real (read: expensive) C strings, why not define a constant whose symbol name serves as the “log string” and only log it’s address?

On the receiving side, you can reconstruct the original log strings from the map file produced by the linker.

Given an appropriate set of magic macros, the example above looks like this:

It may be hard to believe, but these three “log strings” really only consume three bytes of memory. To send out their addresses, four bytes are needed (assuming 32-bit addresses are used) and ‘sampledValue’ requires the transmission of another four bytes. Let’s compare these two approaches assuming that ‘sampledValue’ is 42 (two digits) and taking into account that the size of a C string is the number of characters plus one for a trailing ‘\0’:

footprint (bytes) transmission (bytes)
traditional logging
16 + 37 + 15
16 + 37 + 15
MLOG logging
1 + 1 + 1
4 + 4 + 4 + 4
savings %

That’s what I call an improvement!

I’ve implemented this idea for the GCC toolchain. The log server emits MLOG traces via UDP, which are received by a little tool that converts them back into human-readable log messages by exchanging MLOG symbol addresses with (beautified) MLOG symbol names from the linker map file. Extracting the MLOG symbols from the map file is the most GCC-dependent part. I believe it is a straightforward task to port MLOG to other platforms and toolchains.

Check it out at GitHub.

This is my 100th blog post (actually 101st, damn)! Thanks, faithful reader, for bearing with me through all these years.
— Yours Truly

Bug Hunting Adventures #10: For Whom The Bell Tolls

“Then later that night when the ship’s bell rang
Could it be the north wind they’d been feelin’?”

“The Wreck Of The Edmund Fitzgerald”
— Gordon Lightfoot

At my home, I’m using a Raspberry Pi as a watchdog (aptly named “Brutus”) for all kinds of tasks: burglar detection, network intrusion detection, and server monitoring, just to name a few. Still, most of the time, my watchdog hangs around, idling away. That’s not the way I like it, so I’m constantly on the lookout for new jobs that I can assign to Brutus, small or big.

My current plan is to create a little ship’s bell app that emits pleasing bell sounds every 30 minutes, just like it has been done traditionally on all ships since the 16th century: double-strikes for full hours and an additional single-strike for half an hour. But unlike civil clocks, ship’s bells don’t have dedicated indications for every one of the 12 (or rather 24) hours in a day; instead, bell patterns repeat every four hours:

Bell pattern Time (a.m. and p.m.)
1 12:30 4:00 8:00
2 1:00 5:00 9:00
2 1 1:30 5:30 9:30
2 2 2:00 6:00 10:00
2 2 1 2:30 6:30 10:30
2 2 2 3:00 7:00 11:00
2 2 2 1 3:30 7:30 11:30
2 2 2 2 4:00 8:00 12:00

In this table, a “2” denotes a double-strike whereas a “1” signifies a single-strike of the bell.

The code below is a first draft of my ship’s bell app. It is running as a thread, sleeping most of the time (so you can still call Brutus a lazy dog). When it wakes up, it checks the current local time and determines how many strikes are to be done (‘compute_strikes’). Afterwards, the thread puts itself to rest again. However, I didn’t want to wake it up every second to check the wall time — that would be too inefficient. Instead, I base the sleep time on the temporal distance between now and the next half hour (‘compute_sleep_time’) and sleep for half of this time before checking again.

Alas, my initial implementation comes with a bug and the bell doesn’t work as it is supposed to. Can you spot it? (The bug is in the algorithm — it has nothing to do with any Python language quirks, of course.)

Ship’s Bell app code at GitHub.

Code Kata 4: Struct Member Alignment


“The whole is greater than the sum of its parts”

How time flies! More than two years have passed since I posted my last kata, so it’s high time for a new one. Today’s kata is based on a real problem that I had to solve last week when I worked on a code generator that produced C/C++ code. I hope you enjoy this task as much as I did!

When you define a struct in C, the overall size of the struct is not necessarily the sum of its members. Much like in real life, the whole is greater than the sum of the parts. Consider this example:

Only novice programmers would assume that sizeof(Foo) equals 5; experienced programmers know that 8 is a much safer bet. How come?

Most computer architectures can access data only if it is properly aligned in memory, e. g. a 32-bit (4-byte) integer can only be accessed if it is stored at an address that is evenly divisible by 4. The compiler usually achieves this by inserting invisible padding bytes between the struct members. Thus, internally, struct Foo is likely to look like this:

As a first step, familiarize yourself with struct padding. Check out this must-read by legendary Eric S. Raymond, especially if you are a systems programmer.

Now that you have read it, you should understand why sometimes there is also trailing padding (but never leading padding) and hence why the size of the following struct is most likely 12:

Equipped with this knowledge we are ready to tackle our first programming task: assume that every primitive struct member of base-2 size is to be aligned on its base-2 boundary (a 2-byte integer on an address that is evenly divisible by 2, a 4-byte integer on an address that is evenly divisible by 4 and so on). Implement an algorithm that computes the overall size of a struct given an ordered list of its members. Instead of real types, provide a list of integer values where the values represent the sizes of the members. Here are examples for Foo and Bar (in Python):

One weakness of this approach is that you cannot differentiate between a unit32_t (size is 4, alignment is 4) and an array of 4 uint8_ts (size is 4, alignment is 1):

Extend your algorithm to accept a list of pairs, where the first pair member specifies the size and the second pair member specifies the alignment:

But there is one more feature we need to add before we’re done: support for arbitrarily nested structs:

How does a member that itself is a struct impact the alignment of the outer struct? Devise a suitable data structure for passing in nested struct specifications.

[Update 2016-03-12: I’ve uploaded a sample solution to GitHub]

So long Scott Meyers, So long C++?

Last month, Scott Meyers wrote a blog post where he announced that he will withdraw from active involvement in C++.

Scott is famous for at least two things: first, his excellent way of explaining dry technical stuff in an entertaining way (I read all his books on C++, except for “Effective Modern C++” which is still on my to-be-read pile) and second, his He-man like hairdo.

After 25 years of dedicating his life exclusively to C++ he has become tired — tired of a language that gets more and more complex at an seemingly ever-increasing rate, all in the name of backwards compatibility and efficiency. (He didn’t say that, but this is my view.)

Ah, speaking of efficiency, the “e” word.

Just because C++ gives you control over efficiency doesn’t mean that you will get it in the end. Due to a lack of compiler and hardware knowledge, many C++ developers have a wrong (insufficient, at least) notion about efficiency. There are many misconceptions, because they don’t know how compilers, CPUs, or memories work.

One example is not understanding the effects of caching. Many C++ developers blindly trust std::map’s or std::unordered_map’s O(log n) and O(1) promises but there are situations where an O(n) std::vector (or plain C-style array) can be orders of magnitude faster because it accesses memory in a cache-friendly way. There is a nice talk by Scott on YouTube where he gives a good overview about caching and its consequences.

Another common efficiency fallacy is illustrated by this little for loop:

Many developers I’ve met believe that using a ‘uint8_t’ for the loop counter is more efficient than using a plain ‘int’. But what most likely will happen is that by using ‘uint8_t’ the code becomes both, bigger and slower, especially on modern RISC-style processor architectures like ARM and PowerPC. Why? If the value of ‘arrayLength’ is not known at compile-time, the compiler has to create additional code that ensures that ‘i’ wraps around for values greater or equal to 256. Internally, the compiler assigns a 32-bit register to ‘i’ (provided you are targeting a 32-bit platform) and adding 1 to 255 in a 32-bit register is different to adding 1 to 255 in an 8-bit register. Behind the scenes, your compiler rewrites your loop to look like this:

Granted, in most situations this additional code will not amount to much, but maybe in a low-level driver or some communications stack, situations which systems languages like C++ were made for. But this example shows a problem that many would-be efficiency experts share: for the sake of (false) efficiency, they increase complexity and risk correctness and security. What happens if some day ‘arrayLength’ can be larger than 255? The for loop will loop forever, of course.

So while C++ is a language that has the potential to yield extremely efficient systems, you neither get efficiency automatically nor for free. C++ has a steep learning curve and there are many pitfalls. I truly belief that much of C++’s efficiency is wasted on too many developers. If you don’t need utmost efficiency or don’t know how to put the corresponding language features to best use, better keep away from C++ and use a managed language. You will be much more productive and create programs that are also (probably) more secure by default.

Getting back to Scott Meyers, I must admit that I’m somewhat happy about his decision. Not because he left C++ per se but because he now has time to focus on other important topics — topics that he will explain with the same quality he is renowned for. Like some programmers say: when one curly brace closes, another one opens.

Continuous “Commit”ment

“Strive for continuous improvement, instead of perfection.”
— Kim Collins

Checking in early and often is a well-accepted practice these days. Instead of keeping changes local for an extended period of time, software is continuously integrated and thus the overall integration risk is significantly reduced because every time you check into the central repository, your changes become immediately visible and available to others. Problems show up early, which is for sure a good thing.

On the other hand, your code must have a certain level of quality before you can foist it upon others. At the very least, it must compile without errors. Most likely, it is also required to be in line with other project or coding standards, for instance, code must be free of compiler (or MISRA) warnings. Some projects even demand that every code that is checked-in has been 100% code-coverage tested.

Having such “pre-commit quality gates” is a blessing but it stands in the way of the “commit early, commit often” paradigm: Depending on the size of the task it may take hours (if not days) to meet all check-in criteria. Deferring commits for such a long time would certainly be foolish because check-ins are important for another reason: With every commit you drive in a piton that not only saves you from data loss but also allows you to go back and forth in time.

Fortunately, with distributed version control systems like Git, you can check in locally at your heart’s content without affecting others. When your code is nice and shiny (and is in line with your project standards), you integrate it by “pushing” to the central repository. I love to work like this!

Often, I check in every couple of minutes, sometimes even though my code doesn’t compile yet. Maybe after some minutes of tedious editing, only to ensure that my changes are not lost. Just like Hansel and Gretel, I like to leave bread crumbs behind, but unlike theirs, mine won’t be eaten by the birds.

The hard part is deciding on a commit message, though. I don’t want to break my flow by thinking about something suitable. Many times, the changes are not even coherent, so the commit messages cannot be meaningful. What’s my solution? I simply run

over and over again.

As you can imagine, there will be dozens of “checkpoints” before I’m finished with my high-level task. Consequently, the commit history (what git log shows) is an utter mess. Even though Git supports various commands to alter the commit history (git rebase -i, for example), doing it manually is both, tedious and error-prone. I therefore decided to automate this process through a little tool named git-autocommit.

git-autocommit is a short Bash script that you invoke in your working directory. It runs indefinitely (at least until you hit Ctrl-C) and periodically executes git add -A && git commit -m "<git-autocommit>". When you’re done with your changes, you hit Ctrl-C and run git-autocommit again. Then, the script checks if there is a series of <git-autocommit> commit messages already at the top of your commit history and if so, performs a soft reset to the predecessor of the first autocommit; otherwise, it just waits for new changes in your working directory and autocommits them as before.

The upshot of this is that all the changes that you’ve done (those changes that have been autocommitted) are now staged and ready to be committed again, but this time en bloc and with a descriptive check-in comment. Once checked-in, all the intermediate autocommits are gone from the log and it looks as if you’ve made a perfect sausage.

Circular Adventures VI: When the Winner is Known Beforehand

    “Girls have an unfair advantage over men: if they can’t get what they want by being smart, they can get it by being dumb.”
    — Yul Brynner

    In part III and IV I discussed solutions for a circular problem where two indices performed a race within a circular buffer; that is, either index could be ahead of the other.

    Sometimes, however, life is simpler and it is always known which index is the leader, and thus the winner — from the outset:

    In this example, provided we know that b must always be ahead of a, we can deduce that b has wrapped around and the distance between a and b is 4.

    Either implementation (the one given in part III and the one given in part IV) of circular_distance will return the same result:

    However, both will fail for this case:

    Why? Under the premise that b is always ahead of a, the distance is +6 not -4. circular_distance computes the wrong result because it assumes that the leading index is less than half the circular buffer size ahead of the other index. This assumption was made (I called it ‘invariant’ at the time) to be able to compute the circular distance even if it is not known which index is ahead of which.

    Based on the prerequisite that b is always ahead of a we can give a simplified version of the circular_distance function from part III:

    I call this function circular_lead instead of circular_distance to emphasize that it returns how much b is ahead of a, which is always a positive number. As usual, all the restrictions (and optimizations) regarding the mod operator apply.

    In situations where one index is known to be ahead of the other, circular_lead has an edge over circular_distance because it supports distances in the range 0 to N-1, whereas circular_distances only supports ranges from 0 to (N-1)/2. This is always the case in “monotonically increasing” scenarios, like run-time measurement, at least until the flux capacitor is invented. Hence, the last example of part IV can be rewritten like this:

    If we replace the mod operator with a cast to uint32_t, circular_lead_32bit boils down to this:

    [For the mathematically inclined: What we are really talking about here is residue class rings, a concept used by all contemporary computers. I might explore the number theory behind this and other circular topics in a future post.]

    More circular adventures…

Bug Hunting Adventures #9: A Random Piece of PI

According to an old saying, there’s more than one way to skin a cat. There are at least as many ways to compute the value of π. One of them uses the Monte Carlo method to approximate π’s value and it is the subject of today’s Bug Hunting epsisode.

We start with a so-called unit circle, a circle with radius 1 whose center is positioned at the origin in the Cartesian coordinate system. Next, we put a square around the unit circle whose sides have length 2 (the diameter of the unit circle):

drawingThere are two areas (literally!) of interest in this picture: the circle area Ac and the square area As:

Ac = πr² = π
As = (2r)² = 4

The ratio Ac/As is π/4

Why is this ratio important? Because we can use it to calculate the value of π:

π = 4 Ac/As

Now let’s do some random sampling. We take N random points whose x and y values are both in range [-1; +1] and tally the number of points that fall within the square (Ns) and the number of points that fall within the circle (Nc). Given enough points, the ratio Nc/Ns is a very good approximation for Ac/As and we hence can compute:

π ≈ 4 Nc/Ns

The C code below attempts to calculate π in this manner, but sports a blunder. What is the bug? Bonus question for the mathematically inclined: without executing the code, what value does it really compute (instead of π)?


The Need for an Hippocratic Oath for Software Engineers

Two months ago, I wrote about an incident where a fault in an airbag system was responsible for the death of a child. (Actually, as a court ruled later, it was not the airbag system that was responsible, but rather an engineer who kept quiet about a bug that he had discovered earlier.)

Now, it seems like we have another case of a software-based disaster: Volkswagen recently admitted that they used software in their cars that detects official test situations and then reconfigures the engine to reduce pollution — just to get the figures right. When not in this “cheat mode”, that is, under real driving conditions, Volkswagen’s NOx emission rates are up to 40 times higher.

I think this cunning piece of German engineering has the potential to smash Volkswagen. Not extinguish entirely, of course, but Volkswagen’s stock value might decrease so much that it can be taken over easily by the competition. In any case, it will cost this company, whose bosses probably thought it was too big to fail, big: billions of dollars, thousands of jobs, an immeasurable amount of reputation and credibility.

What makes this story so exceptionally shocking is that we are not talking about a bug (like in the failing airbag case) but a feature. This software was put in deliberately, so it is rather a sin of commission than a sin of omission.

While it would be interesting to know what made people act in such a criminal way, what really matters is this: the whole mess could have probably been avoided if someone along the chain of command would have stood up and said “No!”.

I will use this sad story as an opportunity to introduce the “Software Engineering Code of Ethics and Professional Practice” to you, a joint effort by IEEE and ACM. As for the Volkswagen scandal, it looks as if at least the following principles have been grossly violated:

Software Engineers shall:

1.03. Approve software only if they have a well-founded belief that it is safe, meets specifications, passes appropriate tests, and does not diminish quality of life, diminish privacy or harm the environment. The ultimate effect of the work should be to the public good.

1.04. Disclose to appropriate persons or authorities any actual or potential danger to the user, the public, or the environment, that they reasonably believe to be associated with software or related documents.

1.06. Be fair and avoid deception in all statements, particularly public ones, concerning software or related documents, methods and tools.

2.06. Identify, document, collect evidence and report to the client or the employer promptly if, in their opinion, a project is likely to fail, to prove too expensive, to violate intellectual property law, or otherwise to be problematic.

3.03. Identify, define and address ethical, economic, cultural, legal and environmental issues related to work projects.

5.11. Not ask a software engineer to do anything inconsistent with this Code.

6.13. Report significant violations of this Code to appropriate authorities when it is clear that consultation with people involved in these significant violations is impossible, counter-productive or dangerous.

In a world in which an ever-increasing part is driven by software, “The Code” should be considered the “Hippocratic Oath” of software engineers, something we are obliged to swear before we can call ourselves software professionals, before we are unleashed to an unsuspecting world. I wonder how many more software catastrophes it will take until we finally get there.

Bug Hunting Adventures #8: Just Like a Rubber Ball…

Two incidents gave rise to this Bug Hunting episode: first, I’m currently beefing up my burglar alarm system with a Raspberry Pi and second, I’ve just come across the old song from the ’60s “Rubber Ball” by Bobby Vee while randomly surfing YouTube (“Bouncy, bouncy, bouncy, bouncy”).

The alarm system extension I have in mind is this: I want to be able to configure temporary inhibition of motion detection via a simple push-button: if you press it once, you suppress motion detection by one hour, if you press it n times, you suppress it by n hours, respectively. If you press the button after you haven’t pressed it for five seconds, any suppression is undone. There are status LEDs that give visual feedback such that you know what you’ve configured.

Thanks to the fine RPi.GPIO library, accessing GPIO ports is usually a piece of Pi. However, there is a fundamental problem when reading digital signals from switches and buttons called “switch bounce”, a phenomenon that has driven many brave engineers nuts.

If you flip a switch, there never is a nice transition from zero to one (or one to zero, depending on how you flipped the switch). Instead, the signal bounces between zero and one, in a rather indeterminate fashion for many milliseconds. The bouncing characteristics depend on various factors and can even vary for the same type of switch.

The takeaway is this: with switches (and buttons, of course), expect the unexpected! You must use some sort of debouncing mechanism, implemented in either hardware or software. For some background information check out this excellent article by embedded systems luminary and die-hard veteran Jack Ganssle.

But back to my alarm system. I skimmed the RPi.GPIO documentation and was happy to read that since version 0.5.7, RPi.GPIO has support for debouncing of digital input signals, so there was no need to debounce the push-button myself. Exuberantly, I sat down and wrote this test code:

To avoid arbitrary, floating input voltages, I setup the port such that it is pulled down (to ground) by default. As a consequence, reading that port when the button is not pressed yields a clean zero.

I didn’t want to poll the button port, as this would burn valuable CPU cycles, so I registered an event handler with the RPi.GPIO library that would be called back once a rising edge had been detected. I added a ‘bouncetime’ parameter of 200 milliseconds, to avoid multiple calls to my ‘button_pressed’ handler, in full accordance with the RPi.GPIO documentation:

To debounce using software, add the ‘bouncetime=’ parameter to a function where you specify a callback function. Bouncetime should be specified in milliseconds. For example:

In other words, after a rising edge has been detected due to a button press, I won’t be bothered with spurious rising edges for 200 ms. (200 ms is plenty of time — no switch I’ve ever seen bounced for more than 100 ms. Yet, 200 ms is short enough to handle cases where the user presses the button in rapid succession.)

I tested my button for a minute and it worked like a charm; then, I varied my press/release scheme and suddenly some spurious “Button pressed!” messages appeared. After some head scratching, I discovered the problem — can you spot it, too? (Hint: it’s a logic error, not something related to misuse of Python or the RPi.GPIO library.)

Epilogue: Once the problem was understood, the bug was fixed easily; getting rid of this darn, “bouncy, bouncy” earworm was much harder, though.