FAILURE MODE AND EFFECTS

MAY 16/MICHAEL BAYLOR

Failure is something that nobody likes to contemplate, but it is inevitable, especially when it comes to technology. However, the mode in which something fails and the effects of that failure can be radically different.

For example, a light bulb is designed to illuminate and it has only one failure mode: off. The effect is darkness. Not a big deal, other than possibly a stubbed toe while fumbling around in the dark looking for a candle.

However, a carbon monoxide detector is designed to warn you of dangerous CO levels. It too has only one failure mode but the potential effect is death.

The recent failure at Amazon got me thinking about Cloud computing failure modes and effects. The Cloud is significantly more complex than a light bulb or even a CO detector. It has tens of thousands of potential failure modes, and the effects can range from performance degradation to catastrophic failure that results in complete outage and data loss.

Side note: If data integrity is lost as a result of catastrophic hardware or software failure (or for any other reason), then that data may as well be floating around space as a bunch of random electrons. It is useless junk, so don’t bother recovering it and charging me for the disk space to store the garbage created by your failure. Sorry, back to the point.

It is beneficial, then, to understand the “failure mode and effects” of everything we use in our personal and professional lives so that we can identify the critical ones and develop mitigation strategies for their inevitable failure. A Failure Mode and Effects Analysis (FMEA) is a very useful tool for this. It’s not a complex exercise, nor does it require a Ph.D. in statistics to perform.

Simply get out a pad of paper and start listing all the potential failure scenarios and their effects. This is the easy part, so don’t get too comfortable yet. To be a truly meaningful exercise, the FMEA should be supplemented by a basic probability analysis that identifies the likelihood (high/moderate/low) of each potential failure, along with some mitigation or contingency plans.
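
If the pad of paper is not handy, the same worksheet can be sketched in a few lines of Python. This is only an illustration, not a standard FMEA tool: the `FailureMode` class, the `rank` helper, the severity rating, and the simple severity-times-likelihood score are all assumptions made for the example, and the likelihood values are invented, not measured.

```python
from dataclasses import dataclass

# Map the high/moderate/low ratings to numbers so they can be compared.
LEVELS = {"low": 1, "moderate": 2, "high": 3}


@dataclass
class FailureMode:
    """One row of the worksheet (hypothetical structure for illustration)."""
    item: str
    failure: str
    effect: str
    severity: str     # how bad the effect is: low / moderate / high (assumed field)
    likelihood: str   # how likely the failure is: low / moderate / high
    mitigation: str   # mitigation or contingency plan

    def priority(self) -> int:
        # Assumed scoring: severity multiplied by likelihood, from 1 to 9.
        return LEVELS[self.severity] * LEVELS[self.likelihood]


def rank(modes):
    """Return the failure modes with the highest priority first."""
    return sorted(modes, key=lambda m: m.priority(), reverse=True)


# Illustrative entries based on the examples in this post; ratings are guesses.
worksheet = [
    FailureMode("light bulb", "goes dark", "darkness, stubbed toe",
                "low", "high", "keep spare bulbs and a flashlight"),
    FailureMode("CO detector", "fails to alarm", "death",
                "high", "low", "test regularly, install a second unit"),
    FailureMode("cloud provider", "regional outage", "complete outage and data loss",
                "high", "moderate", "off-site backups, failover to a second region"),
]

for m in rank(worksheet):
    print(f"{m.priority()}  {m.item}: {m.failure} -> {m.effect} | plan: {m.mitigation}")
```

The scoring is deliberately crude. The point is only that once every failure mode has an effect, a likelihood, and a plan written next to it, the critical ones sort themselves to the top of the list.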

“Oh wait”, you are probably saying to yourself, “this all sounds familiar”. That’s because it’s from Business Continuity Planning 101. That’s right, I have not told you a thing that you didn’t already know but your two minutes have not been wasted because I have reminded you of something that you had forgotten.

That seems to be quite common when it comes to Cloud computing. We forget all the basics. Just because we can’t see it doesn’t mean we get to forget it. Apply the same level of due diligence to the Cloud that you do in the physical world and all will be fine.

That is not to say that there won’t ever be a failure, but at least you will have a plan in place to mitigate the effects.