No Bills for Nancy

Friday, May 29th, 2009 by hartmans

A while ago I was working with a client in the financial services sector. They had an online banking product. After a software upgrade, they got a confusing customer service call. A customer, we’ll call her Nancy, called up and reported that she could no longer access the online bill pay portion of the site. However, she was sure that she had access because her bills kept getting paid. After some investigation it turned out that what she meant was that when she tried to click on the bill payments section of the site, she received a page indicating that she didn’t have the online bill pay feature. However, her periodic payments were still being deducted from her account. According to the provisioning system, she did have bill pay access, and audit records confirmed her claim that her payments were being made.

The client was concerned. Taking money out of end-user accounts without giving the user visibility into what was going on or a way to change it was problematic. Besides, she was paying for a service, and the client wanted to provide it. I got involved when the other developers were unable to reproduce the problem in the development environment. When the data set including Nancy’s information was loaded, they could see that she did in fact have access. If someone logged in as her in the development environment, then the bill payment system was available.

When the second user called with the same problem, the issue was escalated. The problem was not universal, but a small number of users reported trying to use online payments and getting the page indicating they did not have the service even when they did.

We still were not able to reproduce the problem in the development environment. We were focusing on whether there was some sort of release engineering error that had caused a problem to creep into the production environment. We could easily create a fresh instance of the development code (although not the supporting database) and that didn’t cause the problem to appear. Our system was interpreted and the access control logic was isolated from the operating system and hardware. So, while we acknowledged the possibility of a problem in that area, we didn’t think that development running on Alpha while production was running on newer Sparc hardware would be the issue.

We looked very closely at the code that decided whether to display the online payments page. We did find a few problems, but none should have caused this bug. For example, the system had a concept of an individual account and a business account. An individual account could be given access to a business account for auditing and dual control reasons. However, we did not support using an individual account to access the payments section of a business account. So, there was logic to make sure that the login ID of the user matched the login ID of the resource being accessed. This check would always return true because it used a numeric comparison rather than a string comparison.
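To see why a numeric comparison on login IDs passes for almost everyone, here is a minimal Python sketch. It assumes an interpreter that converts a string to a number by reading its leading numeric prefix and falling back to zero, which is common in loosely typed scripting languages but is an assumption about the system in this story. The names `loose_number`, `ids_match_numeric` and `ids_match_string` are hypothetical.

```python
import re

# Hypothetical coercion rule: read the leading numeric prefix of a string,
# and fall back to 0 when the string does not start with a number.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def loose_number(text: str) -> float:
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0


def ids_match_numeric(user_id: str, resource_id: str) -> bool:
    # The buggy check: compares the coerced numbers, not the strings.
    return loose_number(user_id) == loose_number(resource_id)


def ids_match_string(user_id: str, resource_id: str) -> bool:
    # The intended check: compares the login IDs as text.
    return user_id == resource_id


# Two different, non-numeric login IDs both coerce to 0 and "match".
print(ids_match_numeric("alice", "acme_corp"))
print(ids_match_string("alice", "acme_corp"))
```

Under these assumptions the numeric check accepts any pair of non-numeric login IDs, so it never actually restricts anything.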

While going over the situation someone joked that we could just ask people to change their name: the problem only seemed to happen for users named Nancy. I heard this description and looked at the users who had reported the problem. Sure enough, all were named Nancy. How do you get a name-dependent bug in access control? Surely this was just sampling error.

What’s special about Nancy? Well, it starts with “nan” as in not-a-number. Surely, though, that shouldn’t affect anything. Unless . . . I logged into a development server and a production server. Sure enough, on production, “nan+0” evaluated to “nan”. On development, “nan+0” evaluated to “0”. Apparently whatever the interpreter was using to read numbers respected nan on Sparc but not on Alpha. Still, how could this be our issue?
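The difference between the two servers can be mimicked in a few lines of Python. This is only a sketch: the `recognizes_nan` flag is a hypothetical stand-in for whatever platform number-reading routine the interpreter relied on, and `parse_leading_number` is an invented name.

```python
import math


def parse_leading_number(text: str, recognizes_nan: bool) -> float:
    """Hypothetical string-to-number conversion for non-numeric text."""
    if recognizes_nan and text.lower().startswith("nan"):
        return math.nan  # platform routine treats "nan..." as not-a-number
    return 0.0  # otherwise non-numeric text falls back to zero


# "nan+0" on a platform that recognizes nan, and on one that does not.
production = parse_leading_number("nan", recognizes_nan=True) + 0
development = parse_leading_number("nan", recognizes_nan=False) + 0

print(production)
print(development)
```

The same input gives not-a-number on one platform and zero on the other, which is the kind of split seen between production and development.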

Then I remembered the numeric comparison that should have been a string comparison. You see, there are two cases where using a numeric comparison to compare strings is not true. The first is when the string represents the number zero. The second is when it represents nan. Sure enough, with a two character change and a lot of paperwork, the Nancies gained access to their payments.
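The nan case comes down to a property of IEEE 754 floating point: not-a-number never compares equal to anything, including itself. The Python sketch below illustrates that and the shape of the fix. The `coerce` helper is a hypothetical model of the production behavior, not the client’s actual code.

```python
import math


def coerce(login_id: str) -> float:
    # Hypothetical model of production: text starting with "nan" becomes
    # NaN, and other non-numeric text becomes 0.
    try:
        return float(login_id)
    except ValueError:
        return math.nan if login_id.lower().startswith("nan") else 0.0


def buggy_check(user_id: str, resource_id: str) -> bool:
    # Numeric comparison: NaN == NaN is False, so a user whose login
    # coerces to NaN does not even match herself.
    return coerce(user_id) == coerce(resource_id)


def fixed_check(user_id: str, resource_id: str) -> bool:
    # String comparison: the login IDs are compared as text.
    return user_id == resource_id


print(buggy_check("nancy", "nancy"))
print(fixed_check("nancy", "nancy"))
```

In this model the buggy check turns Nancy away from her own account, while the string comparison lets her in and still rejects a mismatched ID.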

I really love the story of this bug, so I’m sharing it here. People have tried to find a moral in “No Bills for Nancy” over the years. “This shows the value of static type checking!” some have said. “This shows the critical importance of having identical test environments,” others have argued. “If you had the right development methodology, this wouldn’t happen!” still others have insisted.

In some sense, that’s all true. However, I’ve found that no matter how good your testing, no matter how good your practices, Nancy is out there lurking, ready to demonstrate that there is some facet of the system that we do not understand. Really, though, Mark Twain had the right answer. It’s a good story; enjoy it for itself.