High Performance and High Availability Services

Digitask specializes in clustering for speed and availability on VMS. Clustering maintains uninterrupted service even when individual servers fail or are taken down for maintenance. Adding servers to a cluster also increases the system's processing power at close to a one-for-one rate per server added, whereas adding processors or memory to a single server does not necessarily scale that well. The following sections briefly discuss some of these clustering benefits.

Availability

Availability is the proportion of time that a system can be used for productive work. It is typically expressed as a percentage, with 100% the best possible rating. Availability is a better term than reliability because it stresses the real goal -- keeping resources, services, and applications running and available to users. Reliability, in contrast, focuses more on the attributes of individual components.

Typical stand-alone systems can achieve about 99% availability. This may sound great, but once you realize that the missing 1% represents roughly 90 hours a year -- over three and a half days -- it loses some of its luster. 99% availability is sufficient only for forgiving organizations and casual applications; systems upon which a business depends must do much better.

On the other end of the spectrum, critical applications such as emergency call centers, telecommunications hubs, air traffic control, and medical equipment must be up and running 24 hours a day, every day of the year. Any downtime at all may risk lives, money, and reputations. These situations are the province of "fault-tolerant" or "continuous processing" systems, which use extensive redundancy and specialized construction in a heroic attempt to prevent service interruptions. Such systems can achieve 99.999% availability or better -- that is, about five minutes of downtime in an average year.

So why aren't fault-tolerant systems used everywhere? Doesn't everyone want zero downtime? Sure - but there's a catch. While everyone wants to eliminate downtime completely, few applications can justify the expense. Fault-tolerant systems' specialized construction and extensive redundancy make them cost several times as much as conventional systems. Even more important, once a system has been made 99.95% or 99.99% available, most of the likely failures will be software or environmental failures, not hardware breakdowns. Spending money to make the hardware even more reliable is not very cost-effective, and should be considered only in the most availability-sensitive situations.

The systems of greatest interest to most users experience between three days and three minutes of downtime per year. If attentively managed with a supportive set of system management tools, conventional stand-alone systems can achieve between 99.5% and 99.8% availability -- that is, 18 to 44 hours of downtime per year. Going beyond that, into "high availability" or "fault resilience," requires clustering. An availability-oriented cluster in a master/slave configuration can eliminate all but a few hours of downtime per year. What downtime it cannot eliminate, it ameliorates: unplanned downtime is converted from a serious problem into a brief service hiccup, and most of whatever downtime remains becomes planned downtime.
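The relationship between an availability percentage and annual downtime is simple arithmetic. As an illustrative sketch (the function name is ours, not part of any product), the figures quoted above can be reproduced like this:

```python
def annual_downtime_hours(availability_percent):
    """Convert an availability percentage into hours of downtime per year."""
    hours_per_year = 24 * 365  # 8760 hours in a non-leap year
    return hours_per_year * (1 - availability_percent / 100)

# 99% availability: roughly 90 hours (over three and a half days) per year
print(round(annual_downtime_hours(99.0), 1))         # 87.6 hours
# 99.5% to 99.8%: the 18-to-44-hour band of well-managed stand-alone systems
print(round(annual_downtime_hours(99.5), 1))         # 43.8 hours
print(round(annual_downtime_hours(99.8), 1))         # 17.5 hours
# 99.999% ("five nines"): about five minutes per year
print(round(annual_downtime_hours(99.999) * 60, 1))  # 5.3 minutes
```

Each additional "nine" of availability cuts the downtime budget by a factor of ten, which is why the step from 99.9% to 99.999% is so expensive.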

Highly available environments suit customers for whom money is an object and who can tolerate a brief delay while service is restored. So while an air traffic control system may require fault tolerance, a reservations system based on a highly available or fault-resilient cluster is more than adequate to keep agents selling tickets to satisfied customers.

Clusters Deliver Scalable Performance

In addition to high availability, clustering helps achieve high performance. The term scalability (really, an abbreviation of performance scalability) is often used to stress the goal of high overall performance. It also hints at the incremental performance growth clusters enable.

Working in parallel is one of the most direct paths to higher performance. Uniprocessors, multiprocessors, clusters, parallel processors, and distributed computing all use parallelism at a variety of levels. Emphatically, they are not all the same: each flavor of parallelism has advantages and limitations. Digitask has no bias regarding these approaches - just a simple, practical goal of getting maximum value from each technique.

The real issue is how well application programs can make use of the parallelism each approach offers. Some programs can easily be partitioned into pieces, each of which can run on a separate processor. Such multi-threaded programs can achieve significant performance gains. The more pieces, the better the utilization of the parallel components and the bigger the performance gains. The perfect case is "linear scalability": on N processors, the application runs N times faster than on one. True linearity is rare, particularly as the number of processors grows.

Many important programs cannot be extensively multi-threaded. They are somewhat partitionable, but only into a limited number of pieces, and decomposition requires significant effort - perhaps even a rearchitecting. This inherent difficulty is compounded by the need for the pieces to coordinate among themselves. Even if a program can be decomposed, communications can become a limiting factor: the overhead it imposes can quickly overwhelm whatever inter-processor communications facility the system designers offer.

Symmetric multiprocessing (SMP) has become popular because it provides a particularly effective way of scaling performance. It offers a few parallel processors, connected by a high-speed system bus and coordinated by an attentive operating system. When well implemented, as it is in many UNIX systems, this modest level of parallelism can be handled fairly easily by applications.
Often, developers do not bother to parallelize individual applications; users simply run many jobs on an SMP system, and the operating system takes responsibility for distributing the total workload, one job to a processor. Programs with a high payoff for optimization, such as database managers, are explicitly parallelized by sophisticated developers.

Though SMP is effective with a few processors, as more are added, demand for the intimately shared resources grows. These demands are satisfiable at first, but they become less so. How fast the demand grows -- and thus the number of CPUs that can be used effectively -- depends on the level of inter-thread communications required by the application workload. Most workloads benefit from between two and six processors. At some point -- generally around 12 CPUs -- the law of diminishing returns takes full effect. Contention for shared resources becomes a bottleneck, and as more processors are added, total performance rises only slowly, if at all. Extending the capabilities of the system bus is not an answer, because it can no longer be done cost-effectively.

Another limitation is that while multiprocessors can be quite reliable, they are not highly available. Should a processor fail, the system must be rebooted. Should some other component fail -- say, a SCSI disk controller or network adaptor -- the system cannot reboot its problems away. Multiprocessors must also be taken off-line for maintenance and upgrades.

Massively parallel processing (MPP), in contrast to SMP, has not become broadly popular. It uses vast multitudes of processors (often with relatively weak individual capabilities), linked by intricate proprietary interconnects and coordinated explicitly by application structuring. This architecture suits problems that can be decomposed into hundreds - or better yet, thousands - of pieces. Though some important tasks qualify -- forecasting the weather and large-scale text retrieval, for instance -- these are not the workaday tasks of most organizations.
Parallelization for large-scale parallelism is simply too hard, and the performance gains too haphazard, for general use. MPP also does little more than SMP to ensure system availability.

In terms of performance, clustering fits between SMP and MPP. Depending on the configuration, a cluster can appear more like an SMP or more like an MPP system. The number of processors, the inter-processor interconnect, and the software used to coordinate operation are the primary differentiators. "Loosely" coupled clusters offer a large number of processors - potentially hundreds - linked by networking technology. "Firmly" coupled clusters use a smaller number of processors -- a few to some small multiple of ten -- linked by network, storage channel, or specialized cluster interconnect. ("Tight" coupling is often used to describe shared-memory SMP systems.)

Because clusters have looser processor-to-processor communications than SMP systems, more care must be taken to structure their workloads for scalable performance. On the other hand, the logical "distance" between processors has advantages. Because processors are isolated from one another, there is less contention for shared resources, and should one system fail, other nodes are generally unaffected; an alternate elsewhere in the cluster can generally stand in. A well-implemented cluster (UNIX, Windows, or OpenVMS) yields substantially greater availability than either SMP or MPP technology.
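The diminishing returns described above are often modeled with Amdahl's law: if a fraction p of a program's work can run in parallel, the speedup on N processors is bounded by 1 / ((1 - p) + p / N). This is a first-order model only - it captures the serial-fraction limit, while bus and lock contention in real SMP systems make actual scaling worse still. A minimal sketch with illustrative figures (not measurements of any particular system):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of a program can run in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# Even a 90%-parallel program gains little past a modest processor count:
for n in (2, 6, 12, 100):
    print(n, round(amdahl_speedup(0.90, n), 2))
# 2 processors -> 1.82x, 6 -> 4.0x, 12 -> 5.71x, 100 -> 9.17x
```

Only a perfectly parallel program (p = 1) scales linearly; for everything else, the serial fraction sets a hard ceiling no matter how many processors are added, which is why most workloads benefit from only a handful of CPUs.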

While it is instructive to compare SMP and clustering, the approaches are not exclusive. Clusters often incorporate SMP systems, and may include MPP or fault-tolerant systems for specialized requirements.