Category Archives: Dangerously Confusing Interfaces

Dangerously Confusing Interfaces V: The Erroneous ERRORLEVEL

“Design interfaces that are easy to use correctly and hard to use incorrectly.”
— Scott Meyers

Dangerously confusing interfaces can lurk anywhere, even in the venerable (yuck!) DOS batch scripting language. Some time ago, I burnt my fingers when I made a tiny tweak to an existing batch file, deploy.bat, which was part of a larger build script:

Because we had seen the ‘copy’ command fail in the past, I tried to improve things a little by adding an ‘if’ statement to ensure that we would get a clear error message in such events:

Alas, it didn’t work. There still was no error message produced in case the copy command failed. Worse yet, the outer build script happily continued to run. Puzzled, I opened a DOS box and did some experiments:

Hmm. Everything worked as expected. Why didn’t it work in deploy.bat? Next, I changed deploy.bat to output the exit code:

And tried again:

What? The copy command failed and yet the exit code was zero? How can this be? After some head scratching, I vaguely remembered that there was another (arcane) way of checking the exit code, namely ERRORLEVEL (without the percentage signs), so I tried it out:

I never really liked this style of checking the exit code, because ‘ERRORLEVEL n’ actually doesn’t test whether the last exit code was n; it rather checks if the last exit code was at least n. Thus, this statement

doesn’t check if the exit code is zero (ie. no error occurred). What it really does is check if the exit code is greater to or equal to zero, which is more or less always true, no matter the value of the exit code. That’s pretty confusing, if you’d ask me.

Anyway, for some reason, it seemed to work nicely in deploy.bat:

I hardly couldn’t believe my eyes. The copy command obviously failed, %ERRORLEVEL% was obviously zero, still the if statement detected a non-zero exit code. What was going on? I delved deeply into the documentation of the DOS batch language. After some searching I found this paragraph:

%ERRORLEVEL% will expand into a string representation of the current value of ERRORLEVEL, provided that there is not already an environment variable with the name ERRORLEVEL, in which case you will get its value instead.

Whoa! There are two kinds of ERRORLEVEL, who knew? One (the one whose value you can query with %ERRORLEVEL%) will be set to the value of the former, provided there is no variable named ERRORLEVEL already. Now I had a suspision what was going on. I opened the parent batch file and came across the following:

In an attempt to clear the error level, some unlucky developer introduced a variable named ERRORLEVEL which shadowed the value %ERRORLEVEL% from this point on. This can be easily verified:

Once the problem was understood, it was easy to fix: clear the error level in an “accepted way” (yuck, again!) instead of wrongly tying it to zero:

Even though the interface to DOS exit codes is dangerously confusing (and disgusting as well), it facilitates a nice practical joke: next time a colleague leaves the room without locking the screen, open Windows control panel, create a new global environment variable called ERRORLEVEL and set it to 0.

Dangerously Confusing Interfaces IV: The Perils of C’s “safe” String Functions

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
–Mark Twain

Buffer overflows are among the most frequent causes of security flaws in software. They typically arise in situations such as when a programmer is 100% certain that the buffer to hold a user’s name is big enough — until a guy from India logs in. Thus, well-behaved developers always use the bounded-length versions of string functions. Alas, they come with differing, dangerously confusing interfaces.

THE GOOD

Let’s start with ‘fgets‘:

No matter what users type into their terminals, ‘fgets’ will ensure that ‘user_name’ is a well-formed, zero-terminated string of at most 29 characters (one character is needed for the ‘\0’ terminator). The same goes for the ‘snprintf‘ function. After executing the following code

‘buffer’ will contain the string “The”, again, properly zero-terminated.

Both functions follow the same, easy-to-grasp pattern: you pass a pointer to a target buffer as well as the buffer’s total size and get back a terminated string that doesn’t overflow the buffer. Awesome!

THE BAD

In order to copy strings safely, developers often reach for ‘strncpy‘ to guard themselves against dreaded buffer overruns:

Unfortunately, this is not how ‘strncpy’ works! We assumed that it followed the pattern established by ‘fgets’ and ‘snprintf’ but that’s not the case. Even if ‘strncpy’ promises that it never overflows the target buffer, it doesn’t necessarily zero-terminate it. What it does is copy up to ‘sizeof(buffer)’ bytes from ‘user_name’ to ‘buffer’ but if the last byte that is copied is not ‘\0’ (i. e. ‘user_name’ comprises more than ‘sizeof(buffer)’ characters), ‘strncpy’ leaves you with an untermiated string! A traditional approach to solve this shortcoming is to enforce zero-termination by putting a ‘\0’ character as the last element of the target buffer after the call to ‘strncpy’:

Using ‘strncpy’ without such explicit string termination is almost always an error — a rather insidious one, as your code will work most of the time but not when the buffer is completely filled (i. e. your Indian colleague “Villupuram Chinnaih Pillai Ganesan” logs on).

Boy, oh boy is this inconsistent! ‘fgets’ and ‘snprintf’ give you guaranteed zero-termination but ‘strncpy’ doesn’t. A clear violation of the principle of least surprise. Apparently, ‘strncpy’ fixes one safety problem and at the same time lays the foundation for another one.

THE UGLY

Can it get worse? You bet! How do you think ‘strncat‘, the bounded-length string concatenation function, behaves? Ponder this code:

But this is wrong, of course: the third argument to ‘strncat’ (let’s call this argument ‘n’) is not the size of the target buffer. It is the maximum number of characters to copy from the source string (‘string2’) to the destination buffer (‘buffer’). If the length of the source string is greater or equal to ‘n’, ‘strncat’ copies ‘n’ characters plus a ‘\0’ to terminate the target string. Confused? Don’t worry, here’s how you would use it to avoid concatenation buffer overruns:

Yuck! What’s the likelihood that people remember this correctly?

THE REMEDY

Even if the different interfaces and behaviors of the bounded-length string functions in the C API make sense for certain use cases (or made sense at some point in time), the upshot is that they confuse programmers and potentially lead to new security holes when in fact they were supposed to plug them. What’s a poor C coder supposed to do?

As always, you can roll your own versions of bounded/safe string functions or use my safe version of ‘strcpy’. If you rather prefer something from the standard library, I’d suggest that you use ‘snprintf’ as a replacement for both, ‘strncpy’ and ‘strncat’:

Looks like ‘snprintf’ is the swiss army knife of safe string processing, doesn’t it? The moral is this: use whatever you’re comfortable with, but refrain from using ‘strncpy’ or ‘strncat’ directly.

More dangerously confusing interfaces…

Dangerously Confusing Interfaces III

confused.jpgJust like the other “Dangerously Confusing Interfaces” posts, this one was also inspired by a real-world blunder that I made.

Here’s the background: usually, routines that accept data via a pointer from the caller either execute synchronously or copy the data into their own internal data structures for later processing. Take the venerable ‘fwrite’ from the C standard library as an example:

‘fwrite’ blocks until the data has been written, either to disk or to an internal buffer. In either case, once ‘fwrite’ returns, it doesn’t care about the original data anymore. That’s why it’s safe (and common practice) to pass a pointer to a local buffer on the stack:

All standard library and POSIX APIs behave like ‘fwrite’, which is both, safe and convenient. However, with embedded systems, the story is different: in some cases, memory is so tight that additional buffers/internal storage can’t be afforded. Such functions don’t copy the provided data but only store a pointer to your data and expect the memory pointed-to by this pointer to be still valid long after the function call has returned. Here is an example from the AUTOSAR standard, which is used by almost all embedded automotive products:

‘NvM_WriteBlock’ is used to store data to a given non-volatile memory block. However, what this function does is only enqueue a request for the given block ID together with the data pointer (not a copy of your data). This is done for the sake of efficiency, because there can be multiple write requests in parallel. The queue is later processed in another task, long after any local buffer would have been removed from the stack.

Passing a pointer to a buffer with automatic storage is an easy mistake to make, especially since such “non-copy” interfaces are so rarely encountered. How can “write-like” interfaces that don’t make a copy of the provided data be made safer, such that misuse is less likely? Obviously, just adding documentation is not enough — nobody reads documentation, especially in the heat of the moment.

In my view, the root of the problem is that such functions accept just about any pointer. What if the caller was forced to explicitly cast the pointer to another type? A type with a cunningly chosen typename, one that reminded the caller of the potential pitfall? Here is my approach:

Whenever a pointer is passed to this function, developers have to write something like this to make the compiler happy:

Typing ‘uncopied_memory’ should shake up even the most focused developers and remind them to double-check what they are passing into this function.

Of course, within ‘SomeWritelikeFunction’, the provided pointer needs to be cast back into something more useful, like a ‘const uint8_t*’. Further, note that the ‘dummy’ member within ‘uncopied_memory’ must not be used; it only exists to make sure that the cast to ‘uncopied_memory*’ in the calling function is safe: a pointer to a struct is aligned such that it is aligned with the struct’s first member, which is ‘char’ and ‘char’ has the weakest alignment requirements.

Dangerously Confusing Interfaces II

confused.jpgLast week was a sad week for me. A bug in my code made it into the final version that was shipped to an important customer.

When something like this happens, it is almost always due to fact that there is a higher-level “bug” in the software development process, but I don’t want to go there. Instead, I want to focus on the technicalities. Once more, I got bitten by another instance of a dangerously confusing interface. Let me explain.

I’ve always had a liking for interfaces that are self-evident; that is, one knows immediately what’s going on by just looking at how the interface is used — without having to consult the documentation. Let me give you a counter example, an interface that is far from what I desire:

The interface to ‘sort_temperatures’ is not at all self-explanatory. What is clear is that it sorts ‘values_count’ values, which are obviously temperature values, but what the heck does the ‘true’ argument stand for? In order to find out, you have to look at the declaration of ‘sort_temperatures’ and/or its API documentation:

Now it is clear what the boolean parameter is for, but at the cost of having to take a detour. Maybe you got so distracted by this detour that you forgot what you originally were about to do.

Programmers often make this mistake when designing interfaces. They use boolean parameters when they have two mutual exclusive modes of operation. This is easy for the implementer of the routine, but confusing to the ones who have to use it. Contrast this with this alternative:

By replacing the boolean parameter with symbolic constants you not only make the code more readable, you also open it up for future extension: adding more modes becomes straightforward.

Now have a look at this code and try to guess what it does:

That’s fairly simple: ‘calc_hash’ calculates a SHA-256 checksum over ‘mydata’ (which is ‘mydata_len’ bytes in size) and stores it into the provided buffer ‘hash’. The length of the hash is stored to ‘hash_len’ via call-by-reference.

But is this code correct? The answer is — you guessed it — no. If you can’t spot the bug, you are in good company. You can’t see it with just the information given. ‘calc_hash’ interface is not self-evident.

I wrote this code more or less one year ago. It contains a bug that remained dormant until the product was in the hands of our customer. And it is there because of a silly interface.

The last (pointer) parameter ‘hash_len’ actually serves as both, an input AND output parameter. When you call ‘calc_hash’ it is expected that ‘*hash_len’ contains the size of the provided ‘hash’ buffer; on return ‘*hash_len’ will contain the actual number of bytes used by the hash algorithm; that is,the size (or length) of the SHA-256 checksum stored in ‘hash’. The whole idea behind this is that ‘calc_hash’ (or rather its author) wants to offer protection against buffer overruns — for cases where the provided ‘hash’buffer is not large enough to accommodate the checksum.

So the problem here is that ‘hash_len’ (being a stack variable) is not properly initialized to ‘HASH_SHA256_LEN’; it’s value is more or less arbitrary. If it is by chance greater or equal to 32 (the value of the ‘HASH_SHA256_LEN’ symbolic constant) everything is fine and the checksum is correctly calculated. If it is not, ‘calc_hash’ returns ‘false’ and an error is reported.

For as long as a year — by sheer coincidence — ‘*hashLen’ was never below 32 (which is not that unlikely, given that ‘size_t’ can accommodate values ranging from 0 to 4,294,967,296); but in the hands of the customer — and very much in line with Murphy’s Law — it happened.

OK, accuse me of not having initialized ‘*hashLen’ properly, accuse me of not having read the API documentation carefully. Maybe I did read the API documentation and then I was interrupted. I don’t know. But what I know for sure is that this bug would have never happened if the interface had been clearer.

The first problem with ‘calc_hash’ is that ‘hash_len’ is an IN/OUT parameter, which is unusual. I’m not aware of any function in the C/C++ standard library (or the POSIX libraries) which makes use of IN/OUT parameters. Since the input value (not just the output value) is passed by reference, neither the compiler nor static analysis tools like PC-Lint are able to detect its uninitialized state. One obvious improvement is to pass the length of the buffer by value:

Granted, there is now one more argument to pass (‘hash_buf_len’), but if the unlucky programmer ever forgets to initialize it, the compiler will issue a warning.

But let’s not stop here. I’d like to pose the following question: what good is the hash buffer length check, anyway?

In my view, it is not at all necessary. The length of a particular checksum is constant and known a priori — that’s why the hash buffer is allocated statically like this:

What additional benefit does a developer get by providing the same information again to the ‘calc_hash’? Isn’t this redundancy that just begs for consistency errors?

And what use is the output value that tells how long the checksum is? Again, this should, no it MUST be known a priori, there should never be a mismatch between what the caller of ‘calc_hash’ expects and what is returned. Of course, there can be error conditions but if ‘calc_hash’ fails, it should return ‘false’ and not a length different to what the caller expects.

Note that ‘calc_hash’ is not at all comparable to functions in the C API that add an additional output buffer length parameter like ‘strncpy’ or ‘snprintf’. These functions carry the length parameter for completely legitimate safety reasons because the total length of the output is usually not known a priori (for instance, as some input may stem from a human user and one has little control over how many characters (s)he will enter).

Based on these arguments, I dismiss the ‘hash_len’ parameter altogether and propose the following simplified interface:

and would use it like this:

Easy to read, hard to get wrong. It is impossible to forget to initialize a variable that just isn’t there.

Dangerously Confusing Interfaces

confused.jpgDesigning intuitive interfaces that are easy to use and easy to learn is hard, often very hard; and for economic reasons it might not always be possible to strive for perfection. Nevertheless, in my view, at the very least, interfaces should be designed such that obvious, day-to-day usage doesn’t lead to damage.

In his classic book “Writing Solid Code”, Steve Maguire calls confusing interfaces that lead to unexpected bugs “Candy Machine Interfaces”. He tells a story from a vending machine at Microsoft that used to cause him grief: The machine displayed “45 cent” for “number 21”, but after he had finally inserted the last coin he would sometimes enter “45” instead of “21” (and would get a jalapeƱo flavored bubble-gum instead of the peanut butter cookie that he wanted so much — Ha Ha Ha!). He suggests an easy fix: replace the numeric keypad with a letter keypad and no confusion between money and items would be possible anymore.

The other day I did something like this:

My goal was to recursively copy the ‘gamma’ folder to my home folder. What I expected was a ‘gamma’ folder within my home directory, but instead I ended up with hundreds of files from the ‘gamma’ directory right at the top-level of my home directory — the ‘gamma’ directory simply wasn’t created!

I have to confess that similar things sometimes happen to me with other recursive-copy-like tools, too — this seems to be my candy machine problem. Now you know it.

As for ‘rsync’, there is a feature that allows you to copy just the contents of a directory, without creating the directory, flat into a target directory. Granted, this is sometimes useful, but do you know how to activate this mode? By appending a trailing slash to the source directory! That’s what happened in my case. But I didn’t even add the slash myself: if you use Bash’s TAB completion (like I did) a trailing slash is automatically appended for directories…

But good old ‘cp’ puzzles me even more. If you use it like this

it will copy ‘from3’ to a folder named ‘to2’ under ‘to1’ such that both directories (‘from3’ and ‘to2’) will have the same contents, which is more or less a copy-and-rename-at-the-same-time operation. Unless ‘to2’ already exists, in which case ‘from3’ will be copied in ‘to2’ resulting in ‘to1/to2/from3’. Unless, as an exception within an exception, there is already a ‘from3’ directory under ‘to2’; in this case ‘cp’ will copy ‘from3’ flat into the existing ‘to2/from3’ which might overwrite existing files in that folder.

Both, ‘cp’ and ‘rsync’ suffer from fancy interfaces that try to add smart features — which is normally good — but they do it in an implicit, hard-to-guess, hard-to-remember way — which is always bad. Flat copies are sometimes useful but they might be dangerous as they could inadvertently overwrite existing files or at least deluge a target directory. A potential cure could be an explicit ‘–flat’ command-line option.

To me, a wonderfully simple approach is the one taken by Subversion: checkouts are always flat and I’ve never had any problems with it:

This copies (actually checks-out) the contents of the ‘trunk’ flat into the specified destination directory — always, without any exceptions. That’s the only thing you have to learn and remember. There are no trailing backslashes or any other implicit rules. It will also create the target parent directories up to any level, if needed.

Naturally, dangerously confusing interfaces exist in programming interfaces, too. Sometimes the behavior of a method depends on some global state, sometimes it is easy to confuse parameters. The ‘memset’ function from the C standard library is a classic example:

Does this put 40 times the value of 32 in ‘buffer’ or is it the other way around?

I have no idea how many programmers don’t know the answer to this question or how many bugs can be attributed to this bad interface, but I suspect that in both cases the answer must be “way too many”. I don’t want to guess or look up the specification in a manual — I want the compiler to tell me if I’m wrong. Here is an alternative implementation:

Now you write

If you confuse the fill character with the length parameter the compiler will bark at you — a parameter mix-up is impossible. Even though this is more to type than the original (dangerous) interface: it is usually worth the while if there are two parameters of the same (or convertible) type next to each other.

Like I said in the beginning: designing intuitive interfaces is hard but spending extra effort to avoid errors for the most typical cases is usually a worthwhile investment: don’t make people think, make it difficult for them to do wrong things — even if it sometimes means a little bit more typing.