Fault-tolerant computer systems


Fault-tolerant computer systems are systems designed around the concept of fault tolerance. In essence, they must keep operating at a satisfactory level in the presence of faults.

Types of fault tolerance

Most fault-tolerant computer systems are designed to handle several kinds of failure: hardware-related faults, such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software; and physical damage or other flaws introduced to the system from an outside source [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996, pp. 135-138, ISBN 0-13-057887-8].

Hardware fault-tolerance is the most common application of these systems, designed to prevent failures due to hardware components. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc. [Jan Vytopil, Formal Techniques in Real-Time and Fault-Tolerant Systems: Second International Symposium, Nijmegen, the Netherlands, January 8-10, 1992, Proceedings, Springer, 1991, ISBN 3540550925, 9783540550921]. Special software and instrumentation packages are used to detect failures. A related technique is fault masking, which hides a fault instead of merely reporting it: a backup component is prepared to execute each instruction as soon as it is sent, and a voting protocol compares the results, so that if the main component and the backups disagree, the flawed output is ignored.

Software fault-tolerance is based more on nullifying programming errors, using either real-time redundancy or static "emergency" subprograms that fill in for programs that crash. There are many ways to conduct such fault-regulation, depending on the application and the available hardware [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996, pp. 221-235, ISBN 0-13-057887-8].
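The idea of a static "emergency" subprogram can be illustrated with a minimal Python sketch. It is an illustration only, not a method taken from the cited book: the names `run_with_fallback`, `primary_controller` and `emergency_controller` are hypothetical, and a crash is modeled simply as a raised exception.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_fallback(primary: Callable[[], T], emergency: Callable[[], T]) -> T:
    """Run the primary routine; if it crashes, run the emergency routine instead."""
    try:
        return primary()
    except Exception:
        # The primary routine failed, so the static fallback fills in for it.
        return emergency()


def primary_controller() -> float:
    # Hypothetical primary routine; the crash is simulated for the example.
    raise RuntimeError("simulated crash")


def emergency_controller() -> float:
    # Hypothetical emergency routine: returns a fixed, known-safe value.
    return 0.0


result = run_with_fallback(primary_controller, emergency_controller)
print(result)
```

The emergency routine is kept deliberately simple, on the assumption that a simpler routine is less likely to share the programming error that brought down the primary one.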

History

The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonin Svoboda [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 155, McGraw-Hill, 1982, ISBN 0070573026, 9780070573024]. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection. Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites; computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and finally, computers that run for long periods under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.

Most of the development in so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 189, McGraw-Hill, 1982, ISBN 0070573026, 9780070573024], in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This computer had backup memory arrays to support memory recovery methods, and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer is still working today.

Hyper-dependable computers were pioneered mostly by aircraft manufacturers [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 210, McGraw-Hill, 1982, ISBN 0070573026, 9780070573024], nuclear power companies, and the railroad industry in the USA. These industries needed computers with massive amounts of uptime that would fail gracefully enough to allow continued operation, while relying on constant human monitoring of the computer output to detect faults. Again, IBM developed the first computer of this kind for NASA, for guidance of Saturn V rockets, but later on BNSF, Unisys, and General Electric built their own [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, p. 223, McGraw-Hill, 1982, ISBN 0070573026, 9780070573024].

In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure [M. R. Neilforoshan, "Fault tolerant computing in computer design", Journal of Computing Sciences in Colleges, Volume 18, Issue 4 (April 2003), pp. 213-220, ISSN 1937-4771]. Later efforts showed that, to be fully effective, the system had to be self-repairing and self-diagnosing: it had to isolate a fault, switch in a redundant backup, and raise an alert that repair was needed. This is known as N-model redundancy, where faults trigger automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
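The isolate, switch and alert sequence can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of any historical machine: the `Module` and `StandbySystem` classes, the `self_test` hook and the alert list are all hypothetical, and a real system would run diagnosis in hardware or in a supervisor process.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Module:
    name: str
    compute: Callable[[int], int]   # the work the module performs
    self_test: Callable[[], bool]   # hypothetical built-in diagnostic


@dataclass
class StandbySystem:
    modules: List[Module]           # primary first, then the spares
    alerts: List[str] = field(default_factory=list)
    active: int = 0                 # index of the module currently in service

    def run(self, x: int) -> int:
        while self.active < len(self.modules):
            module = self.modules[self.active]
            if module.self_test():
                return module.compute(x)
            # Isolate the faulty module, warn the operator, switch to the next spare.
            self.alerts.append(f"{module.name} failed self-test; repair needed")
            self.active += 1
        raise RuntimeError("no healthy modules left")


system = StandbySystem([
    Module("primary", lambda x: x * 2, lambda: False),  # simulated faulty unit
    Module("spare", lambda x: x * 2, lambda: True),
])
output = system.run(21)
print(output, system.alerts)
```

Operation continues on the spare while the alert list records that the primary needs repair, which mirrors the combination of automatic fail-safe and operator warning described above.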

Voting was another early method, as discussed above: multiple redundant components operate constantly and check each other's results. If, for example, four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.
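The voting example can be written as a short Python sketch. The function name `majority_vote` and the threshold parameter `m` are assumptions made for illustration; real voters are usually implemented in hardware and must also cope with faults in the voter itself, which this sketch ignores.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def majority_vote(outputs: Sequence[Hashable], m: int) -> Tuple[Hashable, List[int]]:
    """Return the value reported by at least m components, plus the indices
    of the components that disagreed with it."""
    if not outputs:
        raise ValueError("no component outputs to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < m:
        raise ValueError("no value reached the required number of votes")
    # Components that disagree with the agreed value are flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Four components report 5 and one reports 6, as in the example above.
value, faulty = majority_vote([5, 5, 5, 6, 5], m=3)
print(value, faulty)
```

Here the component at index 3 is identified as the one to take out of service, while the system keeps using the agreed value.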

Historically, the trend has been to move away from N-model redundancy and toward M out of N voting, because systems have grown more complex and it has become harder to ensure that the transition from the fault-negative to the fault-positive state does not disrupt operations.

Fault tolerance verification and validation

The most important design requirement for a fault-tolerant computer system is verifying that it actually meets its reliability requirements. This is done by using various failure models to simulate failures and analyzing how well the system reacts. These statistical models are very complex, involving probability curves and specific fault rates, latency curves, error rates, and the like. The most commonly used models are HARP, SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
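A far simpler calculation than the models named above gives a feel for this kind of analysis. The Python sketch below computes the probability that at least M of N redundant components are working, using the binomial formula. It assumes that components fail independently, that each works with the same probability `r`, and that the voter is perfect. It is not how HARP, SAVE, SHARPE, SURF or LASS operate, and the function name `m_of_n_reliability` is hypothetical.

```python
from math import comb


def m_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n independent components work,
    where each component works with probability r."""
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))


# Triple redundancy with 2-out-of-3 voting, each component 90% reliable.
tmr = m_of_n_reliability(2, 3, 0.9)
print(round(tmr, 3))  # 0.972
```

Under these assumptions the redundant arrangement (0.972) is more reliable than a single component (0.9). The gain disappears when component reliability falls below 0.5, which is one reason such systems are analyzed with explicit fault rates.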

Fault tolerance research

Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport, utilities and the military, the range of topics that the research touches on is very wide: it runs from obvious subjects such as software modeling and reliability, or hardware design, to more arcane ones such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more [Shunji Osaki, Toshihiko Nishio, Reliability Evaluation of Some Fault-Tolerant Computer Architectures, Springer, 1980, ISBN 3540102744, 9783540102748].

See also

* Fault Tolerant System

References

External links

* [http://64.233.169.104/search?q=cache:uBL7iMOpV9UJ:www.cs.ucla.edu/~rennels/article98.pdf+Fault-tolerant+computer+systems&hl=en&ct=clnk&cd=13&gl=us&client=firefox-a Primer on Fault-Tolerant Computer Systems from UCLA]
* [http://www.freepatentsonline.com/5099485.html A fault-tolerant patent with a lot of basic information on specific ways to detect faults]


Wikimedia Foundation. 2010.
