The following text is copyright 1998 by
Network World. Permission is hereby given for reproduction, as long as
attribution is given and this notice is included.
How many 9s are enough?
By Scott Bradner
Network World, 08/24/98
If there is one
thing telephone people are insistent on, it is reliability -
at least in their
demands on equipment. The common belief among
phone people who are
trying to build data networks is that equipment
needs to be
"five nines" (99.999%) reliable in order to be useful in a
network they want to
build. I think they are wrong to want this level
of reliability in
data networking equipment, and I fear their insistence
on this level is
inhibiting their deployment of useful data networks.
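To put the "five nines" figure in concrete terms, here is a minimal sketch that converts an availability percentage into expected downtime per year. The function name and the choice of a 365-day year are assumptions made for illustration. Under those assumptions, five nines works out to a little over five minutes of downtime a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for an availability fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare a few common availability targets.
for label, availability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label} ({availability:.3%}): {minutes:.1f} minutes/year")
```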
It will be
interesting to see if they maintain this belief in the future, when
they will
have to compete against other providers for
customers. It is
currently easy for the traditional phone company to
insist on reliability
at great cost because it exists in a world where
increased cost means
increased revenue authorized by the local
utility commissions.
But
utility-commission-distorted economics aside, I think the problem
is that the people
who are insisting on five nines do not understand
data networking.
Back in 1964, Paul
Baran, then at Rand Corp., produced a series of
articles proposing
the idea of packet-switching nets. (The papers were
recently posted
online.) Baran was working at a time when there was
considerable worry
about the destruction of the U.S. communications
infrastructure by
enemy action. He proposed a network design that
would survive
large-scale node or link destruction. His design was
for a distributed
network with many small cheap packet switches and
many redundant links
between them instead of the then-common
network design that
had a few large phone circuit switches. He
showed that when
reliability was measured end to end, a distributed
net would exhibit
very high reliability even in the face of the failure of
a number of the
switches or links in the network. He concluded,
"From the
user's viewpoint, the system appears to be virtually noise-
and error-free when
handling data." He was describing the current
Internet
architecture long before its time.
A key reason to use
a distributed network is to minimize the reliance
on any single
network component. The network will route around
link or switch
failures. In this type of environment, five nines
reliability is
overkill. But it's not a surprise phone types think in terms
of the need for
extreme reliability - they generally don't have
distributed networks
with redundant paths.
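The arithmetic behind this claim can be sketched as follows. This is a rough illustration, not a model of any real network: the 99.9% switch figure and the three-switch path are hypothetical numbers, and the calculation assumes that paths fail independently, which shared conduits or power feeds can violate. It also ignores the time needed to reroute.

```python
def path_availability(component_availabilities):
    """Availability of one path: every component in series must be up."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result


def redundant_availability(path_availabilities):
    """Availability of parallel paths: service is lost only if every path is down.

    Assumes the paths fail independently of one another.
    """
    all_down = 1.0
    for a in path_availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical example: each path crosses three switches, each 99.9% available.
one_path = path_availability([0.999] * 3)
two_paths = redundant_availability([one_path] * 2)
three_paths = redundant_availability([one_path] * 3)

print(f"one path:    {one_path:.7f}")
print(f"two paths:   {two_paths:.7f}")
print(f"three paths: {three_paths:.9f}")
```

Under these assumptions a single path falls short of three nines, while two such paths together already exceed five nines end to end, even though no individual switch comes close to that figure.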
There are places in
many ISP networks where redundancy is not as
rich as it might be
- the link to the customer site for example. And in
many ISPs, the level
of traffic is such that routing around a failure
will cause
congestion and data loss. But Internet-style networks are
not the same as
telephone-style ones, and the reliability demanded
from each component
should not have to be as high because, in most cases, the net's
redundancy will cover for a
component's lack of reliability.
Less expensive,
reasonably reliable switches may not result in less
reliable service to
the customer.
Disclaimer: Harvard
spends more time understanding the reliability of
people than
electronic components, so the above postulation is mine.