So long Scott Meyers, So long C++?

Last month, Scott Meyers wrote a blog post where he announced that he will withdraw from active involvement in C++.

Scott is famous for at least two things: first, his excellent way of explaining dry technical stuff in an entertaining way (I read all his books on C++, except for “Effective Modern C++” which is still on my to-be-read pile) and second, his He-man like hairdo.

After 25 years of dedicating his life exclusively to C++ he has become tired — tired of a language that gets more and more complex at an seemingly ever-increasing rate, all in the name of backwards compatibility and efficiency. (He didn’t say that, but this is my view.)

Ah, speaking of efficiency, the “e” word.

Just because C++ gives you control over efficiency doesn’t mean that you will get it in the end. Due to a lack of compiler and hardware knowledge, many C++ developers have a wrong (insufficient, at least) notion about efficiency. There are many misconceptions, because they don’t know how compilers, CPUs, or memories work.

One example is not understanding the effects of caching. Many C++ developers blindly trust std::map’s or std::unordered_map’s O(log n) and O(1) promises but there are situations where an O(n) std::vector (or plain C-style array) can be orders of magnitude faster because it accesses memory in a cache-friendly way. There is a nice talk by Scott on YouTube where he gives a good overview about caching and its consequences.

Another common efficiency fallacy is illustrated by this little for loop:

for (uint8_t i = 0; i < arrayLength; ++i) {
    if (array[i] == 42) {
        ...
    }
}

Many developers I've met believe that using a 'uint8_t' for the loop counter is more efficient than using a plain 'int'. But what most likely will happen is that by using 'uint8_t' the code becomes both, bigger and slower, especially on modern RISC-style processor architectures like ARM and PowerPC. Why? If the value of 'arrayLength' is not known at compile-time, the compiler has to create additional code that ensures that 'i' wraps around for values greater or equal to 256. Internally, the compiler assigns a 32-bit register to 'i' (provided you are targeting a 32-bit platform) and adding 1 to 255 in a 32-bit register is different to adding 1 to 255 in an 8-bit register. Behind the scenes, your compiler rewrites your loop to look like this:

for (int i = 0; i < arrayLength; i = ((i + 1) & 0xFF) {
    if (array[i] == 42) {
        ...
    }
}

Granted, in most situations this additional code will not amount to much, but maybe in a low-level driver or some communications stack, situations which systems languages like C++ were made for. But this example shows a problem that many would-be efficiency experts share: for the sake of (false) efficiency, they increase complexity and risk correctness and security. What happens if some day 'arrayLength' can be larger than 255? The for loop will loop forever, of course.

So while C++ is a language that has the potential to yield extremely efficient systems, you neither get efficiency automatically nor for free. C++ has a steep learning curve and there are many pitfalls. I truly belief that much of C++'s efficiency is wasted on too many developers. If you don't need utmost efficiency or don't know how to put the corresponding language features to best use, better keep away from C++ and use a managed language. You will be much more productive and create programs that are also (probably) more secure by default.

Getting back to Scott Meyers, I must admit that I'm somewhat happy about his decision. Not because he left C++ per se but because he now has time to focus on other important topics -- topics that he will explain with the same quality he is renowned for. Like some programmers say: when one curly brace closes, another one opens.

Continuous “Commit”ment

“Strive for continuous improvement, instead of perfection.”
— Kim Collins

Checking in early and often is a well-accepted practice these days. Instead of keeping changes local for an extended period of time, software is continuously integrated and thus the overall integration risk is significantly reduced because every time you check into the central repository, your changes become immediately visible and available to others. Problems show up early, which is for sure a good thing.

On the other hand, your code must have a certain level of quality before you can foist it upon others. At the very least, it must compile without errors. Most likely, it is also required to be in line with other project or coding standards, for instance, code must be free of compiler (or MISRA) warnings. Some projects even demand that every code that is checked-in has been 100% code-coverage tested.

Having such “pre-commit quality gates” is a blessing but it stands in the way of the “commit early, commit often” paradigm: Depending on the size of the task it may take hours (if not days) to meet all check-in criteria. Deferring commits for such a long time would certainly be foolish because check-ins are important for another reason: With every commit you drive in a piton that not only saves you from data loss but also allows you to go back and forth in time.

Fortunately, with distributed version control systems like Git, you can check in locally at your heart’s content without affecting others. When your code is nice and shiny (and is in line with your project standards), you integrate it by “pushing” to the central repository. I love to work like this!

Often, I check in every couple of minutes, sometimes even though my code doesn’t compile yet. Maybe after some minutes of tedious editing, only to ensure that my changes are not lost. Just like Hansel and Gretel, I like to leave bread crumbs behind, but unlike theirs, mine won’t be eaten by the birds.

The hard part is deciding on a commit message, though. I don’t want to break my flow by thinking about something suitable. Many times, the changes are not even coherent, so the commit messages cannot be meaningful. What’s my solution? I simply run

git add -A && git commit -m "Checkpoint"

over and over again.

As you can imagine, there will be dozens of “checkpoints” before I’m finished with my high-level task. Consequently, the commit history (what git log shows) is an utter mess. Even though Git supports various commands to alter the commit history (git rebase -i, for example), doing it manually is both, tedious and error-prone. I therefore decided to automate this process through a little tool named git-autocommit.

git-autocommit is a short Bash script that you invoke in your working directory. It runs indefinitely (at least until you hit Ctrl-C) and periodically executes git add -A && git commit -m "<git-autocommit>". When you’re done with your changes, you hit Ctrl-C and run git-autocommit again. Then, the script checks if there is a series of <git-autocommit> commit messages already at the top of your commit history and if so, performs a soft reset to the predecessor of the first autocommit; otherwise, it just waits for new changes in your working directory and autocommits them as before.

The upshot of this is that all the changes that you’ve done (those changes that have been autocommitted) are now staged and ready to be committed again, but this time en bloc and with a descriptive check-in comment. Once checked-in, all the intermediate autocommits are gone from the log and it looks as if you’ve made a perfect sausage.

Circular Adventures VI: When the Winner is Known Beforehand

    “Girls have an unfair advantage over men: if they can’t get what they want by being smart, they can get it by being dumb.”
    — Yul Brynner

    In part III and IV I discussed solutions for a circular problem where two indices performed a race within a circular buffer; that is, either index could be ahead of the other.

    Sometimes, however, life is simpler and it is always known which index is the leader, and thus the winner — from the outset:

    0 1 2 3 4 5 6 7 8 9
        ^           ^ 
        b           a
    

    In this example, provided we know that b must always be ahead of a, we can deduce that b has wrapped around and the distance between a and b is 4.

    Either implementation (the one given in part III and the one given in part IV) of circular_distance will return the same result:

    circular_distance(8, 2, 10) == 4
    

    However, both will fail for this case:

    0 1 2 3 4 5 6 7 8 9
            ^       ^ 
            b       a
    
    circular_distance(8, 4, 10) == -4
    

    Why? Under the premise that b is always ahead of a, the distance is +6 not -4. circular_distance computes the wrong result because it assumes that the leading index is less than half the circular buffer size ahead of the other index. This assumption was made (I called it ‘invariant’ at the time) to be able to compute the circular distance even if it is not known which index is ahead of which.

    Based on the prerequisite that b is always ahead of a we can give a simplified version of the circular_distance function from part III:

    circular_lead(a, b, N):
        return (b - a) mod N
    

    I call this function circular_lead instead of circular_distance to emphasize that it returns how much b is ahead of a, which is always a positive number. As usual, all the restrictions (and optimizations) regarding the mod operator apply.

    In situations where one index is known to be ahead of the other, circular_lead has an edge over circular_distance because it supports distances in the range 0 to N-1, whereas circular_distances only supports ranges from 0 to (N-1)/2. This is always the case in “monotonically increasing” scenarios, like run-time measurement, at least until the flux capacitor is invented. Hence, the last example of part IV can be rewritten like this:

    volatile uint32_t g_systemTime; // Wraps around.
    ...
    uint32_t start = g_systemTime;
    
    // Do some lengthy operation. After that, we know
    // that 'g_systemTime' will be ahead of 'start'.
    
    unt32_t delta = circular_lead_32bit(start, g_systemTime);
    

    If we replace the mod operator with a cast to uint32_t, circular_lead_32bit boils down to this:

    inline uint32_t circular_lead_32bit(uint32_t a, uint32_t b) {
        return (uint32_t) (b - a);
    }
    

    [For the mathematically inclined: What we are really talking about here is residue class rings, a concept used by all contemporary computers. I might explore the number theory behind this and other circular topics in a future post.]

    More circular adventures…

Bug Hunting Adventures #9: A Random Piece of PI

According to an old saying, there’s more than one way to skin a cat. There are at least as many ways to compute the value of π. One of them uses the Monte Carlo method to approximate π’s value and it is the subject of today’s Bug Hunting epsisode.

We start with a so-called unit circle, a circle with radius 1 whose center is positioned at the origin in the Cartesian coordinate system. Next, we put a square around the unit circle whose sides have length 2 (the diameter of the unit circle):

drawingThere are two areas (literally!) of interest in this picture: the circle area Ac and the square area As:

Ac = πr² = π
As = (2r)² = 4

The ratio Ac/As is π/4

Why is this ratio important? Because we can use it to calculate the value of π:

π = 4 Ac/As

Now let’s do some random sampling. We take N random points whose x and y values are both in range [-1; +1] and tally the number of points that fall within the square (Ns) and the number of points that fall within the circle (Nc). Given enough points, the ratio Nc/Ns is a very good approximation for Ac/As and we hence can compute:

π ≈ 4 Nc/Ns

The C code below attempts to calculate π in this manner, but sports a blunder. What is the bug? Bonus question for the mathematically inclined: without executing the code, what value does it really compute (instead of π)?

Code
Solution

The Need for an Hippocratic Oath for Software Engineers

Two months ago, I wrote about an incident where a fault in an airbag system was responsible for the death of a child. (Actually, as a court ruled later, it was not the airbag system that was responsible, but rather an engineer who kept quiet about a bug that he had discovered earlier.)

Now, it seems like we have another case of a software-based disaster: Volkswagen recently admitted that they used software in their cars that detects official test situations and then reconfigures the engine to reduce pollution — just to get the figures right. When not in this “cheat mode”, that is, under real driving conditions, Volkswagen’s NOx emission rates are up to 40 times higher.

I think this cunning piece of German engineering has the potential to smash Volkswagen. Not extinguish entirely, of course, but Volkswagen’s stock value might decrease so much that it can be taken over easily by the competition. In any case, it will cost this company, whose bosses probably thought it was too big to fail, big: billions of dollars, thousands of jobs, an immeasurable amount of reputation and credibility.

What makes this story so exceptionally shocking is that we are not talking about a bug (like in the failing airbag case) but a feature. This software was put in deliberately, so it is rather a sin of commission than a sin of omission.

While it would be interesting to know what made people act in such a criminal way, what really matters is this: the whole mess could have probably been avoided if someone along the chain of command would have stood up and said “No!”.

I will use this sad story as an opportunity to introduce the “Software Engineering Code of Ethics and Professional Practice” to you, a joint effort by IEEE and ACM. As for the Volkswagen scandal, it looks as if at least the following principles have been grossly violated:

Software Engineers shall:

1.03. Approve software only if they have a well-founded belief that it is safe, meets specifications, passes appropriate tests, and does not diminish quality of life, diminish privacy or harm the environment. The ultimate effect of the work should be to the public good.

1.04. Disclose to appropriate persons or authorities any actual or potential danger to the user, the public, or the environment, that they reasonably believe to be associated with software or related documents.

1.06. Be fair and avoid deception in all statements, particularly public ones, concerning software or related documents, methods and tools.

2.06. Identify, document, collect evidence and report to the client or the employer promptly if, in their opinion, a project is likely to fail, to prove too expensive, to violate intellectual property law, or otherwise to be problematic.

3.03. Identify, define and address ethical, economic, cultural, legal and environmental issues related to work projects.

5.11. Not ask a software engineer to do anything inconsistent with this Code.

6.13. Report significant violations of this Code to appropriate authorities when it is clear that consultation with people involved in these significant violations is impossible, counter-productive or dangerous.

In a world in which an ever-increasing part is driven by software, “The Code” should be considered the “Hippocratic Oath” of software engineers, something we are obliged to swear before we can call ourselves software professionals, before we are unleashed to an unsuspecting world. I wonder how many more software catastrophes it will take until we finally get there.

Bug Hunting Adventures #8: Just Like a Rubber Ball…

Two incidents gave rise to this Bug Hunting episode: first, I’m currently beefing up my burglar alarm system with a Raspberry Pi and second, I’ve just come across the old song from the ’60s “Rubber Ball” by Bobby Vee while randomly surfing YouTube (“Bouncy, bouncy, bouncy, bouncy”).

The alarm system extension I have in mind is this: I want to be able to configure temporary inhibition of motion detection via a simple push-button: if you press it once, you suppress motion detection by one hour, if you press it n times, you suppress it by n hours, respectively. If you press the button after you haven’t pressed it for five seconds, any suppression is undone. There are status LEDs that give visual feedback such that you know what you’ve configured.

Thanks to the fine RPi.GPIO library, accessing GPIO ports is usually a piece of Pi. However, there is a fundamental problem when reading digital signals from switches and buttons called “switch bounce”, a phenomenon that has driven many brave engineers nuts.

If you flip a switch, there never is a nice transition from zero to one (or one to zero, depending on how you flipped the switch). Instead, the signal bounces between zero and one, in a rather indeterminate fashion for many milliseconds. The bouncing characteristics depend on various factors and can even vary for the same type of switch.

The takeaway is this: with switches (and buttons, of course), expect the unexpected! You must use some sort of debouncing mechanism, implemented in either hardware or software. For some background information check out this excellent article by embedded systems luminary and die-hard veteran Jack Ganssle.

But back to my alarm system. I skimmed the RPi.GPIO documentation and was happy to read that since version 0.5.7, RPi.GPIO has support for debouncing of digital input signals, so there was no need to debounce the push-button myself. Exuberantly, I sat down and wrote this test code:

#!/usr/bin/env python

import RPi.GPIO as GPIO
import time

# My push-button is connected to GPIO port 17.
GPIO_PORT = 17

# Use official Raspi GPIO numbering scheme (port 17 is pin 11).
GPIO.setmode(GPIO.BCM)

# Configure GPIO port for input, use a pull-down resistor.
GPIO.setup(GPIO_PORT, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

# Handler that handles rising edge events.
def button_pressed():
    print "Button pressed!"
        
# Register rising edge handler, ignore further rising edges for 200 ms.
GPIO.add_event_detect(GPIO_PORT, GPIO.RISING, 
    callback=button_pressed, bouncetime=200)

# Keep looping forever.
while True:
    time.sleep(1.0)

To avoid arbitrary, floating input voltages, I setup the port such that it is pulled down (to ground) by default. As a consequence, reading that port when the button is not pressed yields a clean zero.

I didn’t want to poll the button port, as this would burn valuable CPU cycles, so I registered an event handler with the RPi.GPIO library that would be called back once a rising edge had been detected. I added a ‘bouncetime’ parameter of 200 milliseconds, to avoid multiple calls to my ‘button_pressed’ handler, in full accordance with the RPi.GPIO documentation:

To debounce using software, add the ‘bouncetime=’ parameter to a function where you specify a callback function. Bouncetime should be specified in milliseconds. For example:

# add rising edge detection on a channel, ignoring further 
# edges for 200ms for switch bounce handling
GPIO.add_event_detect(channel, GPIO.RISING, callback=my_callback,
    bouncetime=200)

In other words, after a rising edge has been detected due to a button press, I won’t be bothered with spurious rising edges for 200 ms. (200 ms is plenty of time — no switch I’ve ever seen bounced for more than 100 ms. Yet, 200 ms is short enough to handle cases where the user presses the button in rapid succession.)

I tested my button for a minute and it worked like a charm; then, I varied my press/release scheme and suddenly some spurious “Button pressed!” messages appeared. After some head scratching, I discovered the problem — can you spot it, too? (Hint: it’s a logic error, not something related to misuse of Python or the RPi.GPIO library.)

Epilogue: Once the problem was understood, the bug was fixed easily; getting rid of this darn, “bouncy, bouncy” earworm was much harder, though.

Solution

Two German Maxims That Will Save Your Neck (and Others’ Necks as Well)

I quite remember the uneasy sensation that I had when a former coworker told me a story — a story about a senior engineer who went to jail because of a bug, a fatal one, as it turned out.

The bug lurked in an electronic control unit (ECU) which was, among other things, controlling the manual deactivation of the front passenger seat’s airbag. Under normal circumstances, you wouldn’t want to disable an airbag, a feature that saves lives every day around the world. However, if you intend to put a rearward-facing baby seat in the front, you have to do it, or you risk severe injury of your child in case the airbag deploys during an accident.

Now, this unfortunate engineer discovered that under extremely rare conditions there was a tiny window of opportunity for the airbag deactivation mechanism to fail silently; that is, it would appear to be deactivated when in fact it wasn’t. I don’t remember the necessary prerequisites, but what I do remember is that the combination of inputs and actions sounded so silly, so unusual, so improbable that he — like probably most of us would have — expected that the fault would never ever show up in practice. But what a terrible mistake this was, as this is exactly what happened and a child lost its life.

How unlikely or likely is the higly improbable? The chances of winning a 6-number lottery game are typically 1 against many tens of millions; yet, the likelihood that some player (not a particular player, of course) wins is quite high. Why? Because there are millions of players who take part in such lotteries. The same is true for ECUs which frequently find their way into millions of cars.

The developer was punished not for creating the bug but for not telling his managers about his discovery, for keeping it secret. But why didn’t he report the problem to his superiors? I can only guess. Maybe there was a lot of schedule pressure, perhaps he didn’t want to upset his boss. Or, the product was already released and a recall would have cost a lot of money, let alone reputation. If you ask me, it was a deadly cocktail of fear and pride.

When I did my military training at the German Armed Forces, one of the first rules I learned was “Melden macht frei”, which more or less translates to “reporting is liberating”. It is your duty to report an incident and it has a liberating effect on you, both emotionally and legally. After reporting, it is your superior’s problem. He has to decide what to do next. That’s not dodging responsibility — it’s passing on an issue that is outside your area of responsibility to the right person.

In the same spirit, as professionals we also have to report any issue that is harmful to customers or the company, regardless of how unlikely it appears to us. Even if management makes a (hopefully prudent) decision to ignore the problem (like it was the case in the Space Shuttle Challenger disaster, where engineers clearly raised their concerns that the O-rings on the rocket boosters would not seal at low temperatures), at least you have behaved professionally and are saved from prosecution and guilt feeling.

There is, however, a strange phenomenon: People sometimes forget that you informed them, especially when they have to testify in court. That’s why I want to share another important German wisdom with you: “Wer schreibt, der bleibt”, which can be translated as “you write, you stay”. It means that (only) if you write something down, you will be remembered. In other words: always keep a paper trail; email usually suffices.

Bug Hunting Adventures #7: Random Programming

If you want to draw m random samples out of a set of n elements, Knuth’s algorithm ‘S’ (ref. The Art of Computer Programming, Vol 2, 3.4.2) comes in handy. It starts out by attempting to select the first element with a probability of m/n; subsequent elements are selected with a probability that depends on whether a previous element was drawn or not.

Let’s do an example. To select 2 random samples out of a set of 5 (say, the digits ‘0’, ‘1’, ‘2’, ‘3’, and ‘4’), we select ‘0’ with a probability of 2/5. If ‘0’ is not chosen, we attempt to select ‘1’ with a probability of 2/4 (since there are only 4 candidates left); otherwise (if ‘0’ has been selected) with a probability of 1/4, since one attempt to draw has already been used up. It’s easy to show the corresponding element selection probabilities in a binary tree:

                _____  2/5 _____              '0'
               /                \
          __ 2/4 __          __ 1/4 __        '1'
         /         \        /         \
       2/3         1/3    0/3         1/3     '2'
        :           :      :           :       :

In this tree, the nodes represent the likelihood that the element in the column on the right will be picked. Transitions from a node to the right mean that the element was picked, while transistions to the left denote that the corresponding element was not picked. In the ‘selected’ case, the numerator is decreased by one; the denominator is decreased in either case.

Here is an implementation of Knuth’s algorithm in C:

// genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive).
void genknuth(int m, int n) {
    int i;
    for (i = 0; i < n; ++i) {
        if ((bigrand() % (n - i)) < m) {
            printf("%d ", i);
            // Selected, so decrement probability numerator.
            --m;
        }
    }
}

For the sake of demonstration, assume that bigrand() is a function that returns a random number that is "big enough", meaning a value that is greater or equal to 0 and has an upper bound of at least n - 1 (a larger upper bound does not harm). Spend some time to understand how the 'if' statement implements the selection with the correct probablity.

So far, so good. But how would this clever algorithm look in another language? Let's port it to Ruby!

Not being a Ruby programmer (but being an experienced Online Programmer), I had to ask Auntie Google many questions in order to cobble together this code:

# genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive).
def genknuth(m, n)
    for i in 0 ... n
        if (bigrand() % (n - i)) < m
            print i, " "
            # Selected, so decrement probability numerator.
            --m
        end
    end
end

This Ruby code looks remarkably similar to the C version. It doesn't contain any syntax errors, but nevertheless doesn't output what it is supposed to. Can you see why?

For those folks who don't want to waste their time on Ruby, here is the Python version, sporting the same behavior and thus the same blunder:

# genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive).
def genknuth(m, n):
    for i in range(0, n):
        if (bigrand() % (n - i)) < m:
            print i,
            # Selected, so decrement probability numerator.
            --m

Solution

The Game of Life

game_of_life

“Imagine there’s no heaven
It’s easy if you try
No hell below us
Above us only sky”

— John Lennon “Imagine”

Once again, like every year, time has come to celebrate Towel Day, a great occasion to ponder Life, the Universe and Everything.

Speaking of Life — a surprising number of people, including software developers, don’t know about LIFE, also called Conway’s Game of Life; an even smaller number is aware of the corollaries, let alone accept them as a fact of life (pun intended!). So what is LIFE?

In LIFE, which isn’t really a game, but rather a simulation, there is an infinite field of cells; cells can be in either of two states: dead or alive. Conway defined four simple rules:

  1. A live cell with fewer than two live neighbors dies (think: dies of loneliness).
  2. A live cell with two or three live neighbors continues to live.
  3. A live cell with more than three live neighbors dies (think: dies of overpopulation).
  4. A dead cell with exactly three live neighbors becomes a live cell (think: birth of a cell).

You start with an initial (e.g. random) set of live and dead cells, let time increase in discrete steps and after every step you apply these four rules. That’s all. After every step the board contains a new set of live and dead cells.

It is quite fascinating to see how structures, patterns (or objects) emerge, move, disappear and reappear. Some of these objects have been given names that aptly describe their nature, like “pulsars”, “gliders”, “glider guns”, just to name a few.

What is even more fascinating is that these objects are governed by higher-order “laws” that are not obvious from the four simple rules. For instance, you can observe that “blocks” never move, “gliders” always move diagonally and “lightweight spaceships” always move from right to left. (Here is a great site for trying out various patterns yourself.)

Isn’t this very much like our own universe? In our universe, we have some fundamental laws, which give rise to higher-level structures and laws, up to highest-level laws of physics or principles of human behavior.

What Conway proved was that complex structures can emerge from simple rules; he proved that you don’t need a Creator to obtain a complex universe, just simple rules, time and favorable circumstances.

Religious people often have a hard time accepting that. One the one hand, they argue, Conway didn’t prove the absence of a Creator, and second, Conway himself acted as a Creator himself since he — after all — created the rules of the game. Isn’t there, in our world, at least room for such a “Creator of Rules”?

Nobody knows, but I personally don’t think so. What Conway did was not create the rules — he rather zoomed in on a particular universe with a particular rule set, chosen from an infinite set of rules: He just shed light on one particular universe that is favorable of life in an infinite multiverse.

Let me close this post with the words of Stephen Hawking. Like all human beings, he doesn’t know everything, but he is probably one of the persons who has the best grasp of our universe:

“We are each free to believe what we want and it is my view that the simplest explanation is there is no God. No one created the universe and no one directs our fate. This leads me to a profound realization. There is probably no heaven, and no afterlife either. We have this one life to appreciate the grand design of the universe, and for that, I am extremely grateful.”

Oh No! The ‘in’ Operator Strikes Again!

depressed

“Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.”
— Scott Adams

When I discovered the eerie truth about JavaScript’s ‘in’ operator, I thought that it couldn’t get any weirder. But boy-oh-boy, was I wrong!

Let’s first recap what’s so bad about JavaScript’s implementation of ‘in’. If you have a list (aka. array) in JavaScript and use the ‘in’ operator to test for membership, you are asking for trouble:

> array = [4, 2, 1]
> 0 in array
true
> 4 in array
false

Contrary to what you might expect, the ‘in’ operator doesn’t test whether the left-hand side value is a member of the array, but rather whether it is a valid index into the array. This makes absolutely no sense for arrays, but it does for dictionaries, since it tests whether a given key is present or not:

> dict = {4 : 'four', 2 : 'two', 1 : 'one'}
> 0 in dict
false
> 4 in dict
true

The moral of the story? Don’t use the ‘in’ operator in JavaScript on arrays to test for membership; with dictionaries everything is fine.

The other day, I found another shocking behavior of the ‘in’ operator, but this time in Groovy; not when used on arrays — when used on dictionaries! Consider this routine. It has two parameters, a list of words (the input text) and a dictionary which contains an entry for every word whose number of occurrences in the input text is to be counted:

def countWords(wordsList, wordsToCount) {
    // For every word in list.
    wordsList.each { word ->
        // If word is a word that is to be counted.
        if (word in wordsToCount) {
            // Increase count.
            ++wordsToCount[word]
        }
    }
}

The initial word counts in ‘wordsToCount’ are set to 0; the word counts are increased for every corresponding word found in ‘wordsList’. For instance:

> def myText = ['spam', 'blah', 'eggs', 'foo', 'eggs', 'buzz']
> def myWordsToCount = ['spam' : 0, 'eggs' : 0, 'ham' : 0]
> countWords(myText, myWordsToCount)
> print myWordsToCount

should print:

[spam:1, eggs:2, ham:0]

At least, this is what most people would expect and what one would get in Python or JavaScript. Alas, what you will get in Groovy is this:

[spam:0, eggs:0, ham:0]

When used on dictionaries, Groovy’s ‘in’ operator not only checks whether the given key on its left-hand side exists in the dictionary, it also overeagerly checks if the corresponding value is ‘true’ according to the Groovy Truth Rules; that is, 0, null, empty strings and empty lists will make the ‘in’ operator return false.

The ‘in’ operator should only test for membership; that is, it should test if the left-hand side is in the collection on the right-hand side. On no account should it interpret an element’s value.

Not following a well-established convention is bad enought, but being inconsistent within the same language is a lot worse: when used with lists/arrays, the ‘in’ operator just checks for membership and doesn’t care a bag of beans about the value of a member:

> def mylist = [ 0, 1, null, "", [] ]
> 0 in mylist
true
> null in mylist
true
> "" in mylist
true
> [] in mylist
true

It won’t get any weirder — it simply can’t!