Digital Tweed June 2002
Fault-Tolerant vs. Tolerating Faults
In the Internet era as campuses are moving toward e-learning,
the expectation -- indeed the requirement -- is for services to be
available 24-7. No matter what.
By Kenneth C. Green
A long, long time ago, years before the dot-coms
injected the notion of "24-7" into our everyday vocabulary,
the computer industry's term for reliability was fault
tolerant. Fault-tolerant computers were the elite computers
of the hardware industry, designed and marketed as the
most reliable computers available for "mission critical"
tasks. Ordinary mainframes and minicomputers handled the
mundane tasks of personnel and payroll, tracking sales or
preparing utility bills. Fault-tolerant computers did the
really important stuff for NSA (the National Security
Agency), the Pentagon and industries that really needed
dependable equipment that could run 24 hours a day, seven
days a week.
Case in point: Several months ago, I experienced some
major service problems with Adelphia, the company that
has the monopoly for cable service in my community. As
you may know, cable companies across the country are
upgrading their systems from analog to digital. The promise
of digital, in cable as elsewhere, is more and better stuff:
better sound, better image quality, plus more content -- not
just one HBO or Showtime channel, but six or eight.
Like other cable companies, Adelphia is investing
millions to upgrade infrastructure -- hardware, software
and wiring -- while promising to deliver more and better
stuff to consumers. (Hardware, software and network
wiring? Does this sound familiar?) And because the cable
companies typically operate as protected monopolies,
consumers generally have little recourse when things go
wrong. The marketplace offers multiple suppliers for books,
groceries, laundry soap and long-distance service. Not so
with HBO: if I want HBO on cable in my community, there is
only one source, Adelphia.
In other words, fault-tolerant computers did not crash.
Sure, these systems required care and attention and might
be taken off-line for routine maintenance. But unlike other
computers, fault-tolerant computers were designed
specifically so that they would not experience service
Of course, the emergence of the "24-7" standard as part
of the Internet era has raised all our expectations about
both computer reliability and customer service.
Amazon.com has to run "24-7" because Amazon.com's
customers shop all hours of the day. So too with eBay
(auctions), or Orbitz (travel), CNN.com (news), Schwab
(stocks), the Converge magazine Web site, the Campus
Computing Project and almost all
destinations on the Web: the unspoken, implied covenant is
"build it, and it damn well better work 24-7."
So earlier this year, I was a very unhappy Adelphia
customer: My newly installed digital cable service "crashed"
and failed to work for more than a week. The digital images
on my HBO channels were digitally fragmented. My
willingness to tolerate faults was, simply stated,
nonexistent. I wanted my HBO to work; I felt I was entitled
to have Adelphia deliver reliable HBO to my home.
Sadly, my repeated calls to Adelphia -- to the local
office, to a call center in Denver, and to the corporate
headquarters in Coudersport, Pa. -- failed to resolve the
problem. The pleasant-sounding people taking my twice-
daily phone calls told me in empathic voices that the source
of the problem was probably Adelphia's efforts to upgrade
the system. But beyond trying to sound pleasant and take
note of the problems with Adelphia's service, they could not
(or would not) arrange for a fix. Even an angry, five-page fax
to John Rigas, Adelphia's chairman and CEO, failed to
remedy the situation. As best I can tell, it was only because I
Not surprisingly, the Internet mantra of "24-7 service,
24-7 reliability" now casts a shadow over other aspects of
our lives. And because of our Internet experiences and
expectations, everything is now expected to be fault
tolerant; moreover, our willingness to "tolerate faults" has