Defending the indefensible

Whenever there’s an application problem, project managers and analysts immediately turn to two groups of people to find out what’s wrong: sysadmins and developers.

So once more, let’s play… IT’S NOT MY PROBLEM!!

A good sysadmin will quickly view the logs and perform a quick sanity check on the system config to confirm that all is well.

A developer, often, in my experience, subject to sampling error and other qualifications to avoid too much offence, will simply say, “It’s a configuration problem.” They may follow that with one of two qualifiers:

“It works on my box.”
“The code doesn’t do anything to cause that.”

…And guess who now has to back up their assertion?

Every.
Single.
Time.

It doesn’t matter that the config looks good, or that the logs suggest it’s not a problem with the system (e.g. they contain huge APPLICATION errors). It doesn’t matter how many times you’ve been right in the past. It doesn’t even matter that the developer hasn’t produced any firm evidence to the contrary. It’s “your” system, therefore it must be “your” issue.

Programming is a complex, intellectual activity and, faced with several thousand lines of code spread across a hundred files that interact in subtle ways, without much of a clue where to dive in, it’s hardly surprising that many developers will forgo the opportunity to examine it in detail. Besides, it “works on their box”, therefore it must be something on the other system, right? This has become the cop-out du jour since the advent of programming models like J2EE, where applications are developed and tested on individual workstations then deployed, unaltered, on distributed multi-host production environments.

Yes, configurations do tend to be very different between the two. For example, the production environment probably has performance tweaks, extra features enabled, live backend data feeds and vastly increased load and demands. Small wonder that configuration errors are more likely. But it’s also quite likely that the application code will behave differently in the wider world, and that subtle bugs or incompatibilities can materialise. (The usual way to catch these before deployment is to use a load-testing simulation tool, but such packages are often difficult to use effectively and anyway, even the best simulation falls short of the random twists of fate that can befall the average Internet application. But in this case, we’re dealing with a problem that has appeared subsequent to testing.)

At this point, the sysadmin has to go out of their way to prove that the fault lies in the application, without necessarily possessing the actual code or the skills to understand it. Assuming you’ve triple-checked the configuration again, forensic examination of the logs is often the best avenue here, tracing sequences of events, following error trails and searching for a sign of the app saying “oops!”. Sometimes it’s possible to enable extra levels of debugging info (gird loins, hassle developer to release the magic incantation). Failing that, it might come down to process tracing (as we’ve said before, “when your only tool is a hammer…”). A hefty dose of logic applied in root-cause analysis may pinpoint the fault, but translating this into management-speak is a challenge in itself. In the worst case, you may have to set up a complete facsimile test environment and try to recreate the problem (which often fails for similar reasons to the development/production dichotomy). And all the while, you have to fend off distracting conversations like this:

“Are we sure a reboot won’t fix it?”
“Positive. It’s an application issue.”
“Well, can we try it anyway?”
(“What, and break my uptime world record attempt??!”)
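
For what it’s worth, the log forensics described above don’t have to be done by eyeball alone. Here is a minimal sketch of the idea in Python: count the ERROR lines per component and note when each one first complained. The log format, the component names and the tally_errors function are all made up for illustration; real logs will need their own pattern.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for this sketch only:
#   "<timestamp> <LEVEL> <component>: <message>"
LINE_RE = re.compile(r"^(\S+)\s+(ERROR|WARN|INFO)\s+([\w.]+):\s+(.*)$")


def tally_errors(lines):
    """Count ERROR lines per component and keep the first one seen for each."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(2) != "ERROR":
            continue  # skip unparseable lines and anything below ERROR
        timestamp, _level, component, message = match.groups()
        counts[component] += 1
        # setdefault only stores the earliest occurrence per component
        first_seen.setdefault(component, (timestamp, message))
    return counts, first_seen


# Invented sample data standing in for a real log file
sample = [
    "2024-01-01T10:00:00 INFO httpd: request served",
    "2024-01-01T10:00:01 ERROR app.orders: NullPointerException in checkout",
    "2024-01-01T10:00:02 ERROR app.orders: NullPointerException in checkout",
    "2024-01-01T10:00:03 WARN kernel: low memory",
]

counts, first_seen = tally_errors(sample)
for component, n in counts.most_common():
    ts, msg = first_seen[component]
    print(f"{component}: {n} errors, first at {ts}: {msg}")
```

A tally like this won’t prove anything by itself, but “every error in the last hour came from the application, starting at this timestamp” is a lot easier to put in front of a manager than a raw log dump.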

It can be absorbing fun, but more frequently it’s frustrating because you find yourself going through this again and again. “Why is it They get to disclaim all responsibility and no one calls Them on it, but I have to move heaven and earth to defend My position? I’ve heard of being a system advocate, but I didn’t realise it meant in the legal sense.”

So if you’re a manager or a developer, please consider the following:

If your sysadmin has had a good hit rate in the past, take a little more on trust next time.
If you have less or poorer evidence for your position than the other guy, yours is the weaker case. Work on it.
Many bugs can be found and fixed in the time it takes a sysadmin to demonstrate them via indirect means.

Because there’s one thing we really enjoy, that spurs us on, that makes it all worthwhile, that nearly compensates for the time, effort and trouble we’ve been forced to go to:

Wiping the smug grin off your face.

PostScript: Of course, there are rules for sysadmins too: Don’t Gloat; and When It’s Your Fault, Admit Blame And Say Sorry. This sometimes makes the developers gloat, but take comfort in your moral superiority. Besides, they’re bound to fuck up again sooner or later. We’re all human.
