Sorting through the nines to compare reliability or availability in designs

When comparing architectures for availability and reliability, it is often difficult to comprehend the small differences in the estimates.

It is tempting to compare in terms of downtime, but that is a slippery slope: people like to think of minutes per year as an actual outcome, almost too real for reality. Remember that there is no such thing as an average outcome, and no prediction is as accurate as people would like it to be. But that doesn’t stop decision makers from treating the estimate as though it were a tangible number.

A couple of tricks I like to use are Data Bars in Excel and a log conversion that compares the “number of nines.”

Under Conditional Formatting in Excel, the Data Bars capability is a very simple way of presenting a grid of numbers, such as when plotting the effect of two different variables in a system design, or combinations of connections in a network design.  Figure 1 is an example of the availability of node to node connections in a network, in pairs. Each row is a particular node pair, and each column is as well, so the diagonal is the availability of a node to node connection, and the off diagonal is the availability of the pair of node to node connections.  Notice that clusters and patterns are easier to see, such as the higher availability of the diagonal as expected, or the relatively high availability cluster at the bottom right for nodes 3, 4, 5.

Figure 1. Node to node availability grid for an arbitrary network.
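For readers without Excel at hand, the idea of such a grid can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the five availabilities are made-up values, not the data behind Figure 1; the off-diagonal cells assume the two connections fail independently, so the pair availability is simply the product; and the text bars are scaled linearly between the smallest and largest value in the grid, which may not match how Excel's Data Bars are configured.

```python
# Hypothetical availabilities for five node-to-node connections
# (illustrative values only, not taken from the article's figure).
link_availability = [0.9990, 0.9992, 0.99990, 0.99995, 0.99993]


def pair_grid(avail):
    """Diagonal: availability of a single connection.
    Off-diagonal: both connections up at once, assuming independent
    failures (an assumption of this sketch)."""
    n = len(avail)
    return [[avail[i] if i == j else avail[i] * avail[j] for j in range(n)]
            for i in range(n)]


def data_bar(value, lo, hi, width=10):
    """Text stand-in for a data bar: length scales linearly from lo to hi."""
    if hi == lo:
        filled = width
    else:
        filled = round((value - lo) / (hi - lo) * width)
    return "#" * filled + "." * (width - filled)


grid = pair_grid(link_availability)
lo = min(min(row) for row in grid)
hi = max(max(row) for row in grid)

for row in grid:
    print("  ".join(f"{v:.5f} {data_bar(v, lo, hi)}" for v in row))
```

Even in this crude form, the longer bars on the diagonal and the cluster of long bars among the last three connections stand out more readily than they would in the raw numbers.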

When decision makers are comfortable thinking in terms of the number of nines, a log scale can be helpful, as it makes the smaller differences easier to see against the larger ones. But that can lead to unfair comparisons on a graph, where a small difference is made to look as large as a much larger one. So I like to use this type of comparison against a standard or target, say “five nines” or 0.99999 availability.  Table I below gives a few of the common conversions. Note that “four and a half nines” is not 0.99995, but rather about 0.9999684. Precisely, the “number of nines” in the table is calculated as –log10(unavailability), where unavailability is 1 – availability.
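The conversion is easy to check in code. The sketch below is a minimal illustration, not the spreadsheet behind Table I: the function names are my own, the downtime figure assumes a 365-day year, and the list of availabilities is just a sample chosen to include the “four and a half nines” case.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def nines(availability):
    """Number of nines = -log10(unavailability)."""
    return -math.log10(1.0 - availability)


def availability_from_nines(n):
    """Inverse conversion: availability = 1 - 10^(-n)."""
    return 1.0 - 10.0 ** (-n)


def downtime_minutes_per_year(availability):
    """Expected downtime implied by the estimate, in minutes per year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


TARGET = 0.99999  # the "five nines" target used as the point of comparison

for a in (0.999, 0.9999, 0.99995, availability_from_nines(4.5), TARGET):
    gap = nines(TARGET) - nines(a)
    print(f"A={a:.7f}  nines={nines(a):.3f}  "
          f"downtime={downtime_minutes_per_year(a):7.2f} min/yr  "
          f"short of target by {gap:.3f} nines")
```

Run this way, 0.99995 comes out at roughly 4.3 nines rather than 4.5, which is the point of the note above: halfway on the nines scale is not halfway on the availability scale.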

Table I. Conversions of availability, downtime, and fair weighting of nines.

These methods have different uses, and neither suits every situation. The first method, Data Bars, is broadly useful for visually comparing large sets of numbers. As long as the numbers are fair, the weighting the bars provide will be too, so the comparison should be useful and fair. Edward Tufte might approve (http://www.edwardtufte.com/tufte/). But the second method, log conversion to “the number of nines,” can be misleading when comparing a large number of results, because it is nonlinear. So I reserve its use for comparisons against targets and goals for a design.

 

About Rupe

Dr. Jason Rupe wants to make the world more reliable, even though he likes to break things. He received his BS (1989), and MS (1991) degrees in Industrial Engineering from Iowa State University; and his Ph.D. (1995) from Texas A&M University. He worked on research contracts at Iowa State University for CECOM on the Command & Control Communication and Information Network Analysis Tool, and conducted research on large scale systems and network modeling for Reliability, Availability, Maintainability, and Survivability (RAMS) at Texas A&M University. He has taught quality and reliability at these universities, published several papers in respected technical journals, reviewed books, and refereed publications and conference proceedings. He is a Senior Member of IEEE and of IIE. He has served as Associate Editor for IEEE Transactions on Reliability, and currently works as its Managing Editor. He has served as Vice-Chair'n for RAMS, on the program committee for DRCN, and on the committees of several other reliability conferences because free labor is always welcome. He has also served on the advisory board for IIE Solutions magazine, as an officer for IIE Quality and Reliability division, and various local chapter positions for IEEE and IIE. Jason has worked at USWEST Advanced Technologies, and has held various titles at Qwest Communications Intl., Inc, most recently as Director of the Technology Modeling Team, Qwest's Network Modeling and Operations Research group for the CTO. He has always been those companies' reliability lead. Occasionally, he can be found teaching as an Adjunct Professor at Metro State College of Denver. Jason is the Director of Operational Modeling (DOM) at Polar Star Consulting where he helps government and private industry to plan and build highly performing and reliable networks and services. He holds two patents. If you read this far, congratulations for making it to the end!