How many 9s are enough?

By Scott Bradner

If there is one thing that phone people insist on, it is reliability. At least they do when it comes to their demands on equipment. The common belief among phone people who are trying to build data networks is that equipment needs to be "five nines" (99.999%) reliable to be useful in any network they might want to build. I think they are wrong to want this level of reliability in data networking equipment, and I fear that their insistence on it is inhibiting their deployment of useful data networks.
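To put a number on that demand, here is a minimal sketch of the arithmetic. It assumes "five nines" is read as availability (the fraction of time the equipment is in service) and uses a 365-day year; the function name is made up for illustration. Under those assumptions, five nines leaves roughly 5.26 minutes of downtime per year.

```python
# Sketch only: converts "N nines" of availability into allowed downtime.
# Assumption: "reliable" means available, measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the downtime budget, in minutes per year, for `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the last nine or two tend to be the expensive ones.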

It will be interesting to see whether they maintain this belief in a future where they will have to compete for customers against other providers. It is currently easy for the traditional phone company to insist on reliability at great cost, since it lives in a world where increased costs mean increased revenues authorized by the local utility commissions. But utility-commission-distorted economics aside, I think the problem here is that the people who insist on "five nines" do not understand data networking.

Back in 1964 Paul Baran, then at the RAND Corporation, produced a series of papers proposing the idea of packet-switching networks. (These papers were recently put on-line at http://www.rand.org/publications/RM/baran.list.html) He was working at a time when there was considerable worry about destruction of the US communications infrastructure by enemy action. He proposed a network design that would survive large-scale node or link destruction. His design was for a distributed network with many small, cheap packet switches and many redundant links between them, instead of the then-common network design, which had a few large phone circuit switches. He showed that when reliability was measured end to end, a distributed network would exhibit very high reliability even in the face of the failure of a number of the switches or links in the network. He concluded "From the user's viewpoint, the system appears to be virtually noise- and error-free when handling data." He was describing the current Internet architecture quite a bit ahead of its time.

A key reason to use a distributed network is to minimize the reliance on any single network component. The network will route around link or switch failures. In this type of environment "five nines" reliability is way overkill. But it is not a surprise that the phone types think in terms of the need for extreme reliability - they generally do not have distributed networks with redundant paths.
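A back-of-the-envelope sketch shows why. This is a textbook series/parallel availability model, not Baran's own analysis, and it leans on loud assumptions: switches fail independently, the redundant paths share no switches, and rerouting is instantaneous. The function names and the example figures (three-nines switches, five hops, three paths) are hypothetical.

```python
# Sketch only: end-to-end availability with redundant paths.
# Assumptions: independent switch failures, fully disjoint paths,
# instant rerouting. Real networks violate all three to some degree.


def path_availability(switch_availability: float, hops: int) -> float:
    """A path is up only if every switch along it is up (series)."""
    if not 0.0 <= switch_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return switch_availability ** hops


def network_availability(switch_availability: float, hops: int, paths: int) -> float:
    """The network is down only if every disjoint path is down (parallel)."""
    if paths < 1:
        raise ValueError("paths must be at least 1")
    path_down = 1.0 - path_availability(switch_availability, hops)
    return 1.0 - path_down ** paths


if __name__ == "__main__":
    # Hypothetical figures: "three nines" switches, five hops per path.
    for n in (1, 2, 3):
        print(f"{n} path(s): {network_availability(0.999, 5, n):.8f}")
```

With these made-up figures a single five-hop path falls well short of five nines, while three disjoint paths built from the same three-nines switches comfortably exceed it end to end.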

There are places in many Internet service provider (ISP) networks where redundancy is not as rich as it might be (the link to the customer site, for example), and in many ISPs the level of traffic is such that routing around a failure will cause congestion and data loss. But Internet-style networks are not the same as telephone-style ones, and the reliability demanded of each component should not have to be as high, because the net will cover for failures in most cases. Less expensive, reasonably reliable switches need not result in less reliable service to the customer.

disclaimer: Harvard spends more time understanding the reliability of people than of electronic components, so the above postulation is mine.