(Too) Close to the Edge

Some time ago I posted about the kind strangers and unkind systems I encountered when our car broke down rather dramatically in the middle of a fairly busy provincial road. Kind strangers got my wife and me safely home; unkind systems made it very hard for us to regain our mobility. For those wondering where this adventure has left us: today we were able to pick up a rental car the dealer is lending us – at no cost – for the duration of the repairs. So, that part of the mishap has come to a satisfactory conclusion.

For me, however, that is not the end of this incident but rather the beginning of a learning journey. I am always trying to learn from the disruptions and upheavals in my life. Partially to find out what I can do differently in the future to prevent the same things from happening again. But also because retrospective learning is what gives such incidents a meaning beyond merely being annoying, painful or worse. It helps me put things in perspective.

So, talking to the mechanic who is working on my car today, the first thing I asked was whether there was anything I could have done differently, either to prevent the car from malfunctioning or to get it safely off the road when it did. He assured me I had done nothing wrong. In fact, when he took the car for a test drive after resetting the onboard computer and running some diagnostics, the car malfunctioned in exactly the same way, leaving him stranded in the middle of a roundabout. He had to be rescued by his colleagues from angrily honking cars driven by frantically gesticulating drivers. Clearly the car was at fault, not the driver.

So then the question became: what exactly is causing the car to play up in this way? Why is it doing this? What part is at fault?

Interestingly enough, the mechanic explained that none of the parts were faulty as such – they all did what they needed to do, pretty much performing according to their design specifications. However, when put together, a few of the components, under very specific circumstances, managed to create a combined spike in electricity. That spike triggered the onboard computer’s safety system, which shut everything down to prevent the electronics from being fried. In other words: nominally correctly functioning parts could, without any of them actually malfunctioning, cause a sudden collapse of the whole system’s functionality – bringing it, in this case, to a dramatic mid-journey emergency stop.
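To see how that can happen without any single part misbehaving, here is a minimal sketch in Python – emphatically not the car’s actual control software; every name and number below is invented – of components that each stay within their own specification while their combined draw trips a system-wide safety limit.

```python
# Toy illustration: parts that are each within spec can still push the
# combined system past a safety threshold. All values are made up.

COMPONENT_SPEC_LIMIT = 10.0   # max current (amps) any one component may draw
SYSTEM_SAFETY_LIMIT = 25.0    # combined draw at which the safety system trips

def component_draw(base: float, transient: float) -> float:
    """Current drawn by one component; it never exceeds its own spec."""
    draw = base + transient
    assert draw <= COMPONENT_SPEC_LIMIT, "component itself stays within spec"
    return draw

def onboard_safety_check(draws: list[float]) -> str:
    """Crude fail-safe: shut everything down if the combined draw spikes."""
    total = sum(draws)
    if total > SYSTEM_SAFETY_LIMIT:
        return "EMERGENCY SHUTDOWN"   # everything off, mid-journey
    return "OK"

# Normal driving: three components idling well inside their specs.
print(onboard_safety_check([component_draw(6.0, 0.5) for _ in range(3)]))  # OK

# The edge case: the same three components each spike towards (but not past)
# their individual limit at the same moment - and the whole system trips.
print(onboard_safety_check([component_draw(6.0, 3.5) for _ in range(3)]))  # EMERGENCY SHUTDOWN
```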

To make a long story short (the mechanic and I talked for several hours while waiting for the rental car to arrive): what was wrong with the car had less to do with its components than with the way they operated together. When more than a few parts came close to their edge conditions at the same time, the result could push the combined system over the edge.

So, what does that teach us?

First: none of the parts are to blame. They all did what was expected of them. So, no blame there. Second: the system as a whole had been designed with enough fail-safes to prevent major damage. So, no blame there either.

However, fail-safe isn’t the same as failing safely. I think there is a lot of room for improvement in the way the car stopped functioning. For instance, a manual override of the steering and brake locks would at least allow the driver to push the car off the road once it had stopped. Perhaps the system could have degraded more gradually and gracefully, giving the driver more time to reach a safe place to stop. And I would argue that the warning signals on the dashboard could do with a driver-centric redesign as well. Most of the warnings may have been useful to a mechanic trying to diagnose what was wrong, but they didn’t help me – driving the car as it was breaking down – understand what was going on or how best to respond. Just a small example: “inspect braking system” is not a useful instruction when you are doing 80 km/h and your car is suddenly and erratically applying the brakes.
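To make that distinction concrete, here is a hedged little sketch – thresholds, stages and messages all invented, not drawn from any real vehicle – contrasting a single hard cut-off with a staged, driver-centric degradation.

```python
# Fail-safe vs failing safely: one abrupt cut-off versus staged responses
# that tell the driver what to do, not what a mechanic would want to know.

def hard_fail_safe(fault_level: float) -> str:
    # Roughly what happened to me: one threshold, one drastic response.
    if fault_level > 0.8:
        return "EMERGENCY STOP - inspect braking system"
    return "no warning"

def graceful_degradation(fault_level: float) -> str:
    # What failing safely could look like: staged, driver-centric messages.
    if fault_level > 0.9:
        return "Stopping the car. Steering stays available - pull over now."
    if fault_level > 0.6:
        return "Power reduced. Please leave the road at the next safe exit."
    if fault_level > 0.3:
        return "A fault is developing. Plan a safe place to stop soon."
    return "no warning"

for level in (0.2, 0.5, 0.7, 0.95):
    print(f"{level:.2f}  hard: {hard_fail_safe(level):<45}  graceful: {graceful_degradation(level)}")
```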

Third: tolerances and redundancies can look wasteful on paper, but they can make or break a system under stress. The mechanic and I suspect that some of the components had been under-dimensioned to save costs. While technically within specification, they didn’t have the extra ‘wriggle room’ to handle various edge cases gracefully.

Finally: complex systems, especially when tightly integrated and full of dependencies, become sources of unpredictable exceptions. Our car’s onboard computer is full of rule-based software telling it how to respond to all the predictable exceptions. But rule-based systems are helpless in the face of unpredictable, edge-exceeding cases. It should be the designers’ responsibility to a) reduce dependencies between components; b) build in more tolerances and redundancies to improve the system’s ability to recover (or degrade) gracefully from the faults that do occur; and c) provide an interface that is driver-centric, not car-centric, to assist the driver in safely getting the car out of harm’s way when its systems are failing.
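As one last toy sketch – the fault codes and responses below are invented for illustration only – the interesting part of any rule-based fault handler is not the rules it knows, but what its default branch does when an unpredicted case turns up.

```python
# Rule-based fault handling: the known cases are easy; the default for
# unknown, edge-exceeding cases is where "failing safely" is decided.

KNOWN_FAULTS = {
    "SENSOR_TIMEOUT": "switch to backup sensor",
    "LOW_BATTERY": "reduce non-essential loads",
    "OVERHEAT": "limit engine power",
}

def respond_hard(fault: str) -> str:
    # Drastic default: anything the designers did not predict means full stop.
    return KNOWN_FAULTS.get(fault, "EMERGENCY SHUTDOWN")

def respond_gracefully(fault: str) -> str:
    # Same rules, but the default keeps the driver in control: degrade,
    # keep steering and brakes overridable, and say so in plain language.
    return KNOWN_FAULTS.get(
        fault,
        "Unknown fault: limping on at reduced power - steer to a safe stop",
    )

print(respond_hard("COMBINED_CURRENT_SPIKE"))        # EMERGENCY SHUTDOWN
print(respond_gracefully("COMBINED_CURRENT_SPIKE"))  # degrade, driver stays in charge
```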

I am not blaming the designers, the manufacturer or the mechanics for what happened. Blame doesn’t teach us anything useful. But I do hope somebody somewhere learns from this and applies those lessons to their own situation to get a better outcome than the one I got the other day.
