Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond


Overview

Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond is a collaborative computer science research effort of Oak Ridge National Laboratory (ORNL), Louisiana Tech University, and North Carolina State University in advanced software solutions for parallel and distributed computing systems, with an emphasis on extreme-scale scientific high-performance computing (HPC). Specifically, this project aims at providing high-level RAS for next-generation supercomputers to improve their resiliency (and ultimately their efficiency) through research and development in novel high-availability and fault tolerance system software solutions. Since the overall component count (processors, memory, and network) of these supercomputers is in the several hundreds of thousands to millions, failures will occur at a much higher rate than in a personal computer. For example, an annual failure rate of less than one percent (0.73%) for a single processor in a 50,000-processor system results in approximately one processor failure per day. Multiplied by the number of component types, it is easy to see that these systems need to deal with unprecedentedly high failure rates in such a manner that they are not rendered useless by continuous failure-recovery cycles.
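As a quick sanity check of this arithmetic, the short Python sketch below (using the hypothetical figures from the example above) reproduces the one-failure-per-day estimate:

```python
# Back-of-the-envelope illustration of the failure-rate arithmetic above,
# using the hypothetical values from the example in the text.

annual_failure_rate = 0.0073   # 0.73% chance a single processor fails per year
num_processors = 50_000        # system size from the example

expected_failures_per_year = annual_failure_rate * num_processors
expected_failures_per_day = expected_failures_per_year / 365

print(f"Expected processor failures per year: {expected_failures_per_year:.0f}")
print(f"Expected processor failures per day:  {expected_failures_per_day:.2f}")
# ~365 failures/year, i.e., roughly one processor failure every day,
# before even counting memory, network, and other component types.
```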



This project aims at scalable technologies for providing high-level RAS for next-generation petascale scientific high-end computing (HEC) resources and beyond, as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide non-stop scientific computing on a 24x7 basis. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability; (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail; (3) reactive fault tolerance enhancements, such as adapting checkpoint intervals and placement to actual and predicted system health threats (see the sketch below); and (4) holistic fault tolerance through the combination of adaptive proactive and reactive fault tolerance.
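To illustrate item (3), checkpoint interval adaptation is often reasoned about with Young's first-order approximation, which estimates the optimal checkpoint interval as the square root of twice the checkpoint cost times the mean time between failures (MTBF). The Python sketch below is a minimal illustration under that assumption, with made-up MTBF and checkpoint-cost values; it is not the project's actual implementation:

```python
import math

def optimal_checkpoint_interval(mtbf_seconds: float, checkpoint_cost_seconds: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    tau_opt ~= sqrt(2 * C * MTBF), where C is the time to write one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost_seconds * mtbf_seconds)

# Hypothetical numbers: as monitored system health degrades, the effective
# MTBF estimate drops and the interval shortens (checkpoints become more frequent).
for mtbf_hours in (24.0, 12.0, 6.0):
    tau = optimal_checkpoint_interval(mtbf_hours * 3600, checkpoint_cost_seconds=300)
    print(f"MTBF {mtbf_hours:4.1f} h -> checkpoint every {tau / 60:.1f} min")
```

This shows the intended direction of adaptation: a reliability model that predicts a shorter MTBF drives the runtime toward more frequent checkpoints, and vice versa.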



Approach

Since no single solution is the perfect answer to providing high-level RAS for all HPC environments, the approach taken covers the mechanisms needed to provide a broad spectrum of advanced fault handling solutions centered on a standardized core of scalable RAS technologies. Each of these technologies may be used individually or in combination with others to provide a comprehensive solution that fits the needs of individual HPC centers and/or scientific applications. The targeted core of RAS technologies offers innovative proactive fault handling techniques for extreme-scale HPC systems by providing advanced scalable approaches for fault prediction, detection, recovery, and avoidance, while enhancing existing reactive fault handling and recovery schemes.



Generic RAS Framework Concept

The core vision of this effort is based on a generic RAS framework concept that coordinates individual solutions with regard to their respective fields and offers a modular approach that allows for adaptation to system properties and application needs. Each of the depicted RAS framework modules is interchangeable and may be an existing or future open-source solution or commercial product. At its core, a highly available RAS engine running on a set of redundant service nodes processes historical and monitoring data from hardware and software health probes collected and preprocessed on compute nodes. A local policy analysis on compute nodes provides optimal local filtering of events, e.g., a trend analysis, to the global policy analysis of the RAS engine, which in turn provides optimal global filtering of events for coordinated decision making. Based on this coordinated decision making, fault tolerance mechanisms, such as preemptive migration, are triggered and executed. The reallocation of application parts performed by the fault tolerance mechanisms, together with the subsequent monitoring data from the hardware and software health probes, closes a full control feedback loop (sketched below). The local and global policy analyses perform online reliability modeling for individual system components and for the overall system, using historical and monitoring data to constantly optimize this control feedback loop based on historical and actual system health threats.
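A minimal Python sketch of this control feedback loop is given below. It assumes a single hypothetical health metric (node temperature), a crude slope-based trend analysis as the local policy, and a global policy that merely logs its migration decision; all names, thresholds, and readings are illustrative, not part of the actual framework:

```python
from collections import deque

class LocalPolicy:
    """Per-compute-node filter: forwards only significant trends, not raw samples.
    Window size and threshold are illustrative assumptions."""
    def __init__(self, window: int = 10, slope_threshold: float = 1.0):
        self.samples = deque(maxlen=window)
        self.slope_threshold = slope_threshold

    def observe(self, temperature_c: float):
        self.samples.append(temperature_c)
        if len(self.samples) == self.samples.maxlen:
            # Crude trend estimate: average change per sample over the window.
            slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
            if slope > self.slope_threshold:
                return {"event": "temperature_trend", "slope_c_per_sample": slope}
        return None  # nothing worth forwarding to the RAS engine

class RASEngine:
    """Global policy: correlates filtered events and triggers a fault tolerance
    mechanism. Here the decision is only logged."""
    def handle(self, node_id: str, event: dict):
        if event["event"] == "temperature_trend":
            print(f"node {node_id}: rising temperature -> schedule preemptive migration")

# Feedback loop sketch: health probe -> local filter -> global decision.
engine, local = RASEngine(), LocalPolicy()
for t in [55, 56, 58, 61, 63, 66, 70, 73, 77, 82]:  # synthetic probe readings
    event = local.observe(float(t))
    if event:
        engine.handle("node042", event)
```

The division of labor mirrors the framework description: the compute node reduces raw probe data to meaningful events, and the RAS engine makes the coordinated, system-wide decision.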


RAS Framework: Proactive Fault Tolerance Example

RAS Framework: Reactive Fault Tolerance Example

Within this RAS framework concept, this project focuses on a reliability-aware HPC runtime framework with near-real-time fault prediction based on system health monitoring and statistical modeling, to enable advanced proactive fault avoidance mechanisms and to improve the efficiency of existing reactive fault recovery solutions. In contrast to traditional reactive fault handling, the targeted reliability-aware runtime offers a new approach to fault tolerance by performing preemptive measures that utilize reliability models, based on historical events and current system health status, to avoid application faults. For example, a process, task, or virtual machine may be temporarily migrated away from a compute node whose behavior resembles that of one about to fail. Pre-fault indicators, such as a significant increase in heat, an unusual number of network communication errors, or a fan fault, can be used to avoid an imminent application fault through anticipation and reconfiguration. The targeted technology is further able to tune existing reactive fault recovery solutions to actual and predicted system health threats. For example, the checkpoint frequency may be adjusted based on the current reliability of the HPC system, or checkpoints may be initiated based on predicted faults. Additionally, the overall cost of fault tolerance measures may be significantly reduced by combining reactive and proactive solutions, such that an HPC system preemptively migrates scientific applications, checkpoints them only very infrequently, and restarts them only in the event of an unpredicted fault (see the sketch below). Further improvement may be achieved by exploiting multicore processor technology, e.g., by off-loading core RAS mechanisms to specifically dedicated or underutilized processor cores.
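The following toy Python policy hints at how such a combined proactive/reactive decision might look. The thresholds and the mapping from predicted failure probability to actions are illustrative assumptions only:

```python
def choose_action(failure_probability: float) -> str:
    """Toy combined proactive/reactive policy (thresholds are illustrative):
    - high predicted risk  -> migrate the affected process/task/VM preemptively
    - moderate risk        -> take a precautionary checkpoint
    - low risk             -> rely on the infrequent periodic checkpoints
    Unpredicted faults still fall back to reactive restart from a checkpoint."""
    if failure_probability > 0.5:
        return "preemptive-migration"
    if failure_probability > 0.1:
        return "precautionary-checkpoint"
    return "periodic-checkpoint-only"

for p in (0.02, 0.2, 0.8):
    print(f"predicted failure probability {p:.2f} -> {choose_action(p)}")
```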


Poster at High Performance Computer Science Week (HPCSW) 2008

This research is sponsored by the Office of Advanced Scientific Computing Research; Office of Science; U.S. Department of Energy. The work is performed jointly at Oak Ridge National Laboratory (managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725), Louisiana Tech University, and North Carolina State University. Please contact engelmannc@ornl.gov with questions or comments regarding this page.