Dangerously Confusing Interfaces III

Just like the other “Dangerously Confusing Interfaces” posts, this one was also inspired by a real-world blunder that I made.

Here’s the background: usually, routines that accept data via a pointer from the caller either execute synchronously or copy the data into their own internal data structures for later processing. Take the venerable ‘fwrite’ from the C standard library as an example:

‘fwrite’ blocks until the data has been written, either to disk or to an internal buffer. In either case, once ‘fwrite’ returns, it doesn’t care about the original data anymore. That’s why it’s safe (and common practice) to pass a pointer to a local buffer on the stack:
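As a minimal sketch of this pattern (the helper ‘write_record’ and its record format are made up for illustration), a function can format data into a local buffer and hand it to ‘fwrite’ without worrying about the buffer’s lifetime:

```c
#include <stdio.h>

/* For reference, the standard prototype from <stdio.h>:
 *   size_t fwrite(const void *restrict ptr, size_t size,
 *                 size_t nmemb, FILE *restrict stream);
 */

/* Hypothetical helper: formats a record into a stack buffer and writes it.
 * Returns the number of bytes written, or 0 if formatting failed. */
size_t write_record(FILE *out, int sensor_id, int value)
{
    char buffer[64];  /* automatic storage, gone once this function returns */
    int len = snprintf(buffer, sizeof buffer, "%d=%d\n", sensor_id, value);
    if (len < 0 || (size_t)len >= sizeof buffer) {
        return 0;  /* formatting error or truncated output */
    }
    /* By the time fwrite returns, the bytes have been written or copied
     * into the stream's own buffer, so the local buffer may disappear. */
    return fwrite(buffer, 1, (size_t)len, out);
}
```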

Almost all standard library and POSIX APIs behave like ‘fwrite’, which is both safe and convenient. However, with embedded systems, the story is different: in some cases, memory is so tight that additional buffers or internal storage can’t be afforded. Such functions don’t copy the provided data; they only store a pointer to it and expect the pointed-to memory to remain valid long after the function call has returned. Here is an example from the AUTOSAR standard, which is widely used in embedded automotive products:
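The declaration looks roughly like this in AUTOSAR 4.x (older releases declare the source pointer as a byte pointer instead). The two typedefs are stand-ins so that the snippet compiles on its own; a real project takes them from the AUTOSAR headers:

```c
#include <stdint.h>

/* Stand-in typedefs, assumed here for illustration only; normally they
 * come from the AUTOSAR standard types and NvM headers. */
typedef uint8_t  Std_ReturnType;
typedef uint16_t NvM_BlockIdType;

/* Requests that the data at NvM_SrcPtr be written to the given block.
 * Note the plain pointer parameter: nothing in the signature hints
 * that the memory must stay valid after the call returns. */
Std_ReturnType NvM_WriteBlock(NvM_BlockIdType BlockId, const void *NvM_SrcPtr);
```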

‘NvM_WriteBlock’ is used to store data to a given non-volatile memory block. However, what this function does is only enqueue a request for the given block ID together with the data pointer (not a copy of your data). This is done for the sake of efficiency, because there can be multiple write requests in parallel. The queue is later processed in another task, long after any local buffer would have been removed from the stack.
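The mechanism can be modeled with a toy queue. This is only a sketch under simplifying assumptions, not the AUTOSAR implementation: all names (‘enqueue_write’, ‘process_write_queue’, ‘nv_storage’) are hypothetical, and every block is assumed to have the same fixed size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_CAPACITY 4u
#define BLOCK_COUNT    4u
#define BLOCK_SIZE     8u

typedef struct {
    uint16_t    block_id;
    const void *src;   /* pointer only, no copy of the caller's data */
} write_request;

static write_request queue[QUEUE_CAPACITY];
static size_t        queue_len;
static uint8_t       nv_storage[BLOCK_COUNT][BLOCK_SIZE];  /* fake NVRAM */

/* Called from the application task: merely remembers the pointer. */
bool enqueue_write(uint16_t block_id, const void *src)
{
    if (queue_len == QUEUE_CAPACITY || block_id >= BLOCK_COUNT) {
        return false;
    }
    queue[queue_len].block_id = block_id;
    queue[queue_len].src = src;
    queue_len++;
    return true;
}

/* Called later from another task: only now is the caller's memory read. */
void process_write_queue(void)
{
    for (size_t i = 0; i < queue_len; i++) {
        memcpy(nv_storage[queue[i].block_id], queue[i].src, BLOCK_SIZE);
    }
    queue_len = 0;
}

/* Safe caller: the buffer has static storage duration, so it is still
 * valid when process_write_queue() eventually runs. */
bool save_calibration(void)
{
    static uint8_t calibration[BLOCK_SIZE] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    return enqueue_write(2, calibration);
}
```

Had ‘calibration’ been an ordinary local array, ‘process_write_queue’ would read from a stack frame that no longer exists.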

Passing a pointer to a buffer with automatic storage is an easy mistake to make, especially since such “non-copy” interfaces are so rarely encountered. How can “write-like” interfaces that don’t make a copy of the provided data be made safer, such that misuse is less likely? Obviously, just adding documentation is not enough — nobody reads documentation, especially in the heat of the moment.

In my view, the root of the problem is that such functions accept just about any pointer. What if the caller was forced to explicitly cast the pointer to another type? A type with a cunningly chosen typename, one that reminded the caller of the potential pitfall? Here is my approach:

Whenever a pointer is passed to this function, developers have to write something like this to make the compiler happy:
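A minimal sketch of the type, the function, and a call site follows. The signature of ‘SomeWritelikeFunction’, the ‘pending_*’ variables, and the caller ‘store_settings’ are assumptions for illustration. One deliberate deviation: the text below describes a ‘void*’ dummy member, whereas this sketch uses ‘unsigned char’ so that the cast carries no alignment requirement for byte buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* Marker type whose only purpose is to force a conspicuous cast at the
 * call site. The member must never be accessed. */
typedef struct {
    unsigned char dummy;
} uncopied_memory;

/* State of the hypothetical write-like function: it keeps the pointer,
 * not a copy of the data. */
static const uint8_t *pending_data;
static size_t         pending_size;

void SomeWritelikeFunction(const uncopied_memory *data, size_t size)
{
    /* Cast back to a byte pointer for internal use. */
    pending_data = (const uint8_t *)data;
    pending_size = size;
}

/* Caller: the cast spells out that the memory is not copied. The buffer
 * is static, so it outlives the call. */
void store_settings(void)
{
    static uint8_t settings[4] = { 0x10, 0x20, 0x30, 0x40 };
    SomeWritelikeFunction((const uncopied_memory *)settings, sizeof settings);
}
```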

Typing ‘uncopied_memory’ should shake up even the most focused developers and remind them to double-check what they are passing into this function.

Of course, within ‘SomeWritelikeFunction’, the provided pointer needs to be cast back into something more useful, like a ‘const uint8_t*’. Further, note that the ‘dummy’ member within ‘uncopied_memory’ must not be used; it only exists to give the struct a definite alignment, which is that of its most-aligned member, ‘void*’. Strictly speaking, the C standard only guarantees that the cast to ‘uncopied_memory*’ in the calling function is well-defined if the caller’s pointer is suitably aligned for that struct, so when arbitrary byte buffers are passed, a ‘char’ member (alignment 1) is the more conservative choice.

Dangerously Confusing Interfaces II

Last week was a sad week for me. A bug in my code made it into the final version that was shipped to an important customer.

When something like this happens, it is almost always due to the fact that there is a higher-level “bug” in the software development process, but I don’t want to go there. Instead, I want to focus on the technicalities. Once more, I got bitten by another instance of a dangerously confusing interface. Let me explain.

I’ve always had a liking for interfaces that are self-evident; that is, one knows immediately what’s going on by just looking at how the interface is used, without having to consult the documentation. Let me give you a counterexample, an interface that is far from what I desire:

The interface to ‘sort_temperatures’ is not at all self-explanatory. What is clear is that it sorts ‘values_count’ values, which are obviously temperature values, but what the heck does the ‘true’ argument stand for? In order to find out, you have to look at the declaration of ‘sort_temperatures’ and/or its API documentation:
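A sketch of what such a declaration and its call site might look like. The parameter name ‘ascending’, the ‘double’ element type, and the caller ‘report’ are assumptions; the original declaration may have differed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static int compare_ascending(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

static int compare_descending(const void *a, const void *b)
{
    return compare_ascending(b, a);
}

/* Only the declaration reveals what the flag means. */
void sort_temperatures(double *values, size_t values_count, bool ascending)
{
    qsort(values, values_count, sizeof values[0],
          ascending ? compare_ascending : compare_descending);
}

/* At the call site, 'true' carries no meaning on its own. */
void report(double *values, size_t values_count)
{
    sort_temperatures(values, values_count, true);
}
```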

Now it is clear what the boolean parameter is for, but at the cost of having to take a detour. Maybe you got so distracted by this detour that you forgot what you originally were about to do.

Programmers often make this mistake when designing interfaces. They use boolean parameters when they have two mutually exclusive modes of operation. This is easy for the implementer of the routine, but confusing to the ones who have to use it. Contrast this with this alternative:
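One possible shape of that alternative, as a sketch: the enum ‘sort_order’ and its constant names are hypothetical, chosen only to show how the call site becomes self-describing.

```c
#include <stddef.h>
#include <stdlib.h>

/* Symbolic constants instead of a bare boolean; further modes can be
 * added later without changing existing call sites. */
typedef enum {
    SORT_ASCENDING,
    SORT_DESCENDING
} sort_order;

static int compare_ascending(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

static int compare_descending(const void *a, const void *b)
{
    return compare_ascending(b, a);
}

void sort_temperatures(double *values, size_t values_count, sort_order order)
{
    qsort(values, values_count, sizeof values[0],
          order == SORT_ASCENDING ? compare_ascending : compare_descending);
}

/* The call site now documents itself. */
void report(double *values, size_t values_count)
{
    sort_temperatures(values, values_count, SORT_DESCENDING);
}
```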

By replacing the boolean parameter with symbolic constants you not only make the code more readable, you also open it up for future extension: adding more modes becomes straightforward.

Now have a look at this code and try to guess what it does:

That’s fairly simple: ‘calc_hash’ calculates a SHA-256 checksum over ‘mydata’ (which is ‘mydata_len’ bytes in size) and stores it into the provided buffer ‘hash’. The length of the hash is stored to ‘hash_len’ via call-by-reference.

But is this code correct? The answer is, you guessed it, no. If you can’t spot the bug, you are in good company. You can’t see it with just the information given. The ‘calc_hash’ interface is not self-evident.

I wrote this code more or less one year ago. It contains a bug that remained dormant until the product was in the hands of our customer. And it is there because of a silly interface.

The last (pointer) parameter ‘hash_len’ actually serves as both an input AND an output parameter. When you call ‘calc_hash’, it is expected that ‘*hash_len’ contains the size of the provided ‘hash’ buffer; on return, ‘*hash_len’ will contain the actual number of bytes used by the hash algorithm, that is, the size (or length) of the SHA-256 checksum stored in ‘hash’. The whole idea behind this is that ‘calc_hash’ (or rather its author) wants to offer protection against buffer overruns for cases where the provided ‘hash’ buffer is not large enough to accommodate the checksum.

So the problem here is that ‘hash_len’ (being a stack variable) is not properly initialized to ‘HASH_SHA256_LEN’; its value is more or less arbitrary. If it is by chance greater than or equal to 32 (the value of the ‘HASH_SHA256_LEN’ symbolic constant), everything is fine and the checksum is correctly calculated. If it is not, ‘calc_hash’ returns ‘false’ and an error is reported.
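The following sketch shows the IN/OUT contract and a call with the initialization in place. The ‘calc_hash’ here is a hypothetical stand-in with an assumed signature: it performs the buffer size check described above but fills in a placeholder digest, not a real SHA-256. The caller ‘checksum_message’ is likewise made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_SHA256_LEN 32u

/* Stand-in for the real routine (placeholder digest, NOT SHA-256).
 * hash_len is IN/OUT: buffer capacity on entry, digest length on return. */
bool calc_hash(const uint8_t *data, size_t data_len,
               uint8_t *hash, size_t *hash_len)
{
    if (*hash_len < HASH_SHA256_LEN) {
        return false;  /* provided buffer is too small */
    }
    memset(hash, 0, HASH_SHA256_LEN);
    for (size_t i = 0; i < data_len; i++) {
        hash[i % HASH_SHA256_LEN] ^= data[i];  /* placeholder mixing */
    }
    *hash_len = HASH_SHA256_LEN;
    return true;
}

bool checksum_message(const uint8_t *mydata, size_t mydata_len,
                      uint8_t hash[HASH_SHA256_LEN])
{
    /* This initializer is what the buggy code lacked. Without it,
     * calc_hash compares an arbitrary stack value against 32. */
    size_t hash_len = HASH_SHA256_LEN;
    return calc_hash(mydata, mydata_len, hash, &hash_len);
}
```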

For as long as a year, by sheer coincidence, ‘hash_len’ was never below 32 (which is not that unlikely, given that a 32-bit ‘size_t’ can accommodate values ranging from 0 to 4,294,967,295); but in the hands of the customer, and very much in line with Murphy’s Law, it happened.

OK, accuse me of not having initialized ‘hash_len’ properly, accuse me of not having read the API documentation carefully. Maybe I did read the API documentation and then I was interrupted. I don’t know. But what I know for sure is that this bug would have never happened if the interface had been clearer.

The first problem with ‘calc_hash’ is that ‘hash_len’ is an IN/OUT parameter, which is unusual. I’m not aware of any function in the C/C++ standard library (or the POSIX libraries) which makes use of IN/OUT parameters. Since the input value (not just the output value) is passed by reference, neither the compiler nor static analysis tools like PC-Lint are able to detect its uninitialized state. One obvious improvement is to pass the length of the buffer by value:
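A sketch of that variant, again with a hypothetical stand-in for ‘calc_hash’ (assumed signature, placeholder digest instead of a real SHA-256) and a made-up caller:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_SHA256_LEN 32u

/* Buffer capacity is passed by value; hash_len is a pure output. */
bool calc_hash(const uint8_t *data, size_t data_len,
               uint8_t *hash, size_t hash_buf_len, size_t *hash_len)
{
    if (hash_buf_len < HASH_SHA256_LEN) {
        return false;  /* provided buffer is too small */
    }
    memset(hash, 0, HASH_SHA256_LEN);
    for (size_t i = 0; i < data_len; i++) {
        hash[i % HASH_SHA256_LEN] ^= data[i];  /* placeholder, not SHA-256 */
    }
    *hash_len = HASH_SHA256_LEN;
    return true;
}

bool checksum_message(const uint8_t *mydata, size_t mydata_len,
                      uint8_t hash[HASH_SHA256_LEN])
{
    size_t hash_len;  /* output only, so no initial value is required */
    return calc_hash(mydata, mydata_len, hash, HASH_SHA256_LEN, &hash_len);
}
```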

Granted, there is now one more argument to pass (‘hash_buf_len’), but if the unlucky programmer ever forgets to initialize it, the compiler will issue a warning.

But let’s not stop here. I’d like to pose the following question: what good is the hash buffer length check, anyway?

In my view, it is not at all necessary. The length of a particular checksum is constant and known a priori — that’s why the hash buffer is allocated statically like this:

What additional benefit does a developer get by providing the same information again to ‘calc_hash’? Isn’t this redundancy that just begs for consistency errors?

And what use is the output value that tells how long the checksum is? Again, this should, no, it MUST be known a priori; there should never be a mismatch between what the caller of ‘calc_hash’ expects and what is returned. Of course, there can be error conditions, but if ‘calc_hash’ fails, it should return ‘false’ and not a length different from what the caller expects.

Note that ‘calc_hash’ is not at all comparable to functions in the C API that add an additional output buffer length parameter like ‘strncpy’ or ‘snprintf’. These functions carry the length parameter for completely legitimate safety reasons because the total length of the output is usually not known a priori (for instance, as some input may stem from a human user and one has little control over how many characters (s)he will enter).

Based on these arguments, I dismiss the ‘hash_len’ parameter altogether and propose the following simplified interface:

and would use it like this:
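A sketch of how the simplified interface and its use might look. The three-parameter signature is an assumption derived from the argument above, the digest is a placeholder rather than a real SHA-256, and ‘verify_message’ is a made-up caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_SHA256_LEN 32u

/* No length parameters: the caller provides HASH_SHA256_LEN bytes. */
bool calc_hash(const uint8_t *data, size_t data_len, uint8_t *hash)
{
    if (data == NULL || hash == NULL) {
        return false;
    }
    memset(hash, 0, HASH_SHA256_LEN);
    for (size_t i = 0; i < data_len; i++) {
        hash[i % HASH_SHA256_LEN] ^= data[i];  /* placeholder, not SHA-256 */
    }
    return true;
}

/* Usage: there is no length variable that could be left uninitialized. */
bool verify_message(const uint8_t *mydata, size_t mydata_len,
                    const uint8_t expected[HASH_SHA256_LEN])
{
    uint8_t hash[HASH_SHA256_LEN];
    if (!calc_hash(mydata, mydata_len, hash)) {
        return false;
    }
    return memcmp(hash, expected, HASH_SHA256_LEN) == 0;
}
```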

Easy to read, hard to get wrong. It is impossible to forget to initialize a variable that just isn’t there.

Dangerously Confusing Interfaces

Designing intuitive interfaces that are easy to use and easy to learn is hard, often very hard; and for economic reasons it might not always be possible to strive for perfection. Nevertheless, in my view, at the very least, interfaces should be designed such that obvious, day-to-day usage doesn’t lead to damage.

In his classic book “Writing Solid Code”, Steve Maguire calls confusing interfaces that lead to unexpected bugs “Candy Machine Interfaces”. He tells a story from a vending machine at Microsoft that used to cause him grief: The machine displayed “45 cent” for “number 21”, but after he had finally inserted the last coin he would sometimes enter “45” instead of “21” (and would get a jalapeño flavored bubble-gum instead of the peanut butter cookie that he wanted so much — Ha Ha Ha!). He suggests an easy fix: replace the numeric keypad with a letter keypad and no confusion between money and items would be possible anymore.

The other day I did something like this:

My goal was to recursively copy the ‘gamma’ folder to my home folder. What I expected was a ‘gamma’ folder within my home directory, but instead I ended up with hundreds of files from the ‘gamma’ directory right at the top-level of my home directory — the ‘gamma’ directory simply wasn’t created!

I have to confess that similar things sometimes happen to me with other recursive-copy-like tools, too — this seems to be my candy machine problem. Now you know it.

As for ‘rsync’, there is a feature that allows you to copy just the contents of a directory, without creating the directory, flat into a target directory. Granted, this is sometimes useful, but do you know how to activate this mode? By appending a trailing slash to the source directory! That’s what happened in my case. But I didn’t even add the slash myself: if you use Bash’s TAB completion (like I did) a trailing slash is automatically appended for directories…

But good old ‘cp’ puzzles me even more. If you use it like this

it will copy ‘from3’ to a folder named ‘to2’ under ‘to1’ such that both directories (‘from3’ and ‘to2’) will have the same contents, which is more or less a copy-and-rename-at-the-same-time operation. Unless ‘to2’ already exists, in which case ‘from3’ will be copied into ‘to2’, resulting in ‘to1/to2/from3’. Unless, as an exception within an exception, there is already a ‘from3’ directory under ‘to2’; in this case ‘cp’ will copy the contents of ‘from3’ flat into the existing ‘to2/from3’, which might overwrite existing files in that folder.

Both ‘cp’ and ‘rsync’ suffer from fancy interfaces that try to add smart features (which is normally good), but they do it in an implicit, hard-to-guess, hard-to-remember way (which is always bad). Flat copies are sometimes useful, but they can be dangerous because they might inadvertently overwrite existing files or at least deluge a target directory. A potential cure could be an explicit ‘--flat’ command-line option.

To me, a wonderfully simple approach is the one taken by Subversion: checkouts are always flat and I’ve never had any problems with it:

This copies (actually, checks out) the contents of the ‘trunk’ flat into the specified destination directory: always, without any exceptions. That’s the only thing you have to learn and remember. There are no trailing slashes or any other implicit rules. It will also create the target parent directories up to any level, if needed.

Naturally, dangerously confusing interfaces exist in programming interfaces, too. Sometimes the behavior of a method depends on some global state, sometimes it is easy to confuse parameters. The ‘memset’ function from the C standard library is a classic example:
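A call of the kind in question might look like this; the wrapper function ‘clear_line’ is made up for illustration, and ‘buffer’ is assumed to be large enough for the call:

```c
#include <string.h>

/* Two adjacent integer arguments: nothing at the call site says which
 * one is the fill value and which one is the byte count. */
void clear_line(char *buffer)
{
    memset(buffer, 32, 40);
}
```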

Does this put 40 times the value of 32 in ‘buffer’ or is it the other way around?

I have no idea how many programmers don’t know the answer to this question or how many bugs can be attributed to this bad interface, but I suspect that in both cases the answer must be “way too many”. I don’t want to guess or look up the specification in a manual — I want the compiler to tell me if I’m wrong. Here is an alternative implementation:

Now you write
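One way to get this effect in plain C is to wrap each argument in its own struct type. The following is a sketch of the idea rather than the original implementation; the names ‘fill_byte’, ‘byte_count’, ‘safe_memset’, and ‘clear_line’ are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Distinct wrapper types: a plain integer converts to neither of them,
 * and they do not convert to each other. */
typedef struct { unsigned char value; } fill_byte;
typedef struct { size_t count; } byte_count;

static inline void *safe_memset(void *dest, fill_byte fill, byte_count len)
{
    return memset(dest, fill.value, len.count);
}

void clear_line(char *buffer)
{
    /* Each argument names its role. Passing the byte_count where the
     * fill_byte is expected (or vice versa) does not compile, because
     * the two struct types are incompatible. */
    safe_memset(buffer, (fill_byte){ 32 }, (byte_count){ 40 });
}
```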

If you confuse the fill character with the length parameter, the compiler will bark at you; a parameter mix-up is impossible. Even though this is more to type than the original (dangerous) interface, it is usually worth the while if there are two parameters of the same (or convertible) type next to each other.

Like I said in the beginning, designing intuitive interfaces is hard, but spending extra effort to avoid errors for the most typical cases is usually a worthwhile investment: don’t make people think, and make it difficult for them to do wrong things, even if it sometimes means a little bit more typing.