
MOLAR: Modular Linux and Adaptive Runtime Support for
High-end Computing Operating and Runtime Systems*



HA-OSCAR/MOLAR for HEC

HA-OSCAR/MOLAR for HEC is the Federated System Management (fSM) framework, a RAS-aware (reliability, availability, and serviceability) resource management system made up of the following elements: 1) the HEC environment is divided into m partitions (processor groups); 2) each partition consists of service and management nodes and a large number of processors (compute nodes); 3) partition-centric service and management nodes provide critical services for local and intra-partition requests (e.g., local/global scheduling, monitoring of the partition and its current state, and maintenance of important time-series data sets for reliability and QoS improvement). A list of related publications can be found in the publications section. For more information, please visit http://xcr.cenit.latech.edu/ha-oscar.
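The partitioned layout described above can be pictured as a small data model. The following is a minimal sketch, not the fSM implementation: the class names, the node names (`svc0`, `c0-0`, ...), the per-partition time-series store, and the routing rule (a local request goes only to its own partition's service node, while a global request is fanned out to every partition's service node) are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Partition:
    """One of the m partitions (processor groups) in the fSM model."""
    pid: int
    service_node: str            # partition-centric service/management node
    compute_nodes: List[str]     # the bulk of the partition's processors
    # Hypothetical time-series store kept by the service node,
    # e.g. health samples used for reliability/QoS decisions.
    series: Dict[str, List[float]] = field(default_factory=dict)

    def record(self, metric: str, value: float) -> None:
        self.series.setdefault(metric, []).append(value)


@dataclass
class Federation:
    partitions: List[Partition]

    def handlers(self, pid: Optional[int] = None) -> List[str]:
        """Service nodes that must handle a request.

        Assumption: a local request (pid given) is served only by its own
        partition's service node; a global request (pid=None) is fanned
        out to every partition's service node.
        """
        if pid is not None:
            return [p.service_node for p in self.partitions if p.pid == pid]
        return [p.service_node for p in self.partitions]


def build(m: int, compute_per_partition: int) -> Federation:
    """Build a federation of m partitions with hypothetical node names."""
    return Federation([
        Partition(i, f"svc{i}", [f"c{i}-{j}" for j in range(compute_per_partition)])
        for i in range(m)
    ])


fed = build(m=3, compute_per_partition=4)
print(fed.handlers(1))   # ['svc1']
print(fed.handlers())    # ['svc0', 'svc1', 'svc2']
```

Keeping the time-series data on each partition's own service node, as sketched here, means monitoring history grows with partition size rather than with the whole machine; whether fSM distributes it exactly this way is not stated above.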

OS-level Data Replication and Distributed Control

Based on our experience with HA-OSCAR and Harness, we are developing an OS-level data replication and distributed control framework that can provide both active/hot-standby and active/active high availability for system management services, such as job schedulers, system performance and health monitors, and software installation and maintenance tools. Our main objective is to move existing proprietary group communication middleware solutions, which are based on different communication, distributed locking, and control models, out of the middleware layer and into the OS in the form of pluggable and interchangeable modules. A list of related publications can be found in the publications section.
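To make the difference between the two high-availability models concrete, here is a minimal sketch of a pluggable replication interface with two interchangeable modules. Everything in it (`ReplicationModule`, `HotStandby`, `ActiveActive`, the key/value "command" format) is a hypothetical illustration, not the framework's actual API. In the hot-standby module one replica applies updates and copies its state to the standbys; in the active/active module every live replica applies every update. Ordering here comes only from calling `submit()` sequentially; a real distributed deployment would need a total-order (atomic) broadcast from the group communication layer.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Command = Tuple[str, str]  # hypothetical (key, value) update to service state


class Replica:
    """Deterministic state of one copy of a replicated service (e.g. a job queue)."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, str] = {}
        self.alive = True

    def apply(self, cmd: Command) -> None:
        key, value = cmd
        self.state[key] = value


class ReplicationModule(ABC):
    """Hypothetical pluggable interface; implementations are interchangeable."""
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas

    def live(self) -> List[Replica]:
        return [r for r in self.replicas if r.alive]

    @abstractmethod
    def submit(self, cmd: Command) -> None: ...

    @abstractmethod
    def serving(self) -> List[Replica]: ...


class HotStandby(ReplicationModule):
    """Active/hot-standby: one active replica, standbys receive state copies."""
    def submit(self, cmd: Command) -> None:
        active, *standbys = self.live()   # first live replica acts as active
        active.apply(cmd)
        for s in standbys:                # checkpoint active state to standbys
            s.state = dict(active.state)

    def serving(self) -> List[Replica]:
        return self.live()[:1]


class ActiveActive(ReplicationModule):
    """Active/active: every live replica applies every command in the same order."""
    def submit(self, cmd: Command) -> None:
        for r in self.live():
            r.apply(cmd)

    def serving(self) -> List[Replica]:
        return self.live()


for module_cls in (HotStandby, ActiveActive):
    nodes = [Replica("a"), Replica("b")]
    svc = module_cls(nodes)
    svc.submit(("job1", "queued"))
    nodes[0].alive = False                # simulate failure of the first node
    svc.submit(("job2", "queued"))
    print(module_cls.__name__, [r.name for r in svc.serving()], svc.serving()[0].state)
```

In both cases the surviving replica `b` ends up holding both jobs, but the modules differ in cost: hot-standby pays a state transfer per update and serves from one node, while active/active serves from all live nodes at the price of requiring ordered delivery to each of them.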

Scalable Algorithms for High Availability

The overall goal of this research is to develop scalable algorithms for high availability without single points of failure and without single points of control. A list of related publications can be found in the publications section. For more information, please visit http://moss.csc.ncsu.edu/~mueller/molar.html.

Communications and I/O Performance Monitoring to Support Adaptation and Tuning of Operating Systems, Runtimes, and Applications

The availability of hardware-based counters for CPU and memory, and of standardized interfaces to them such as PAPI, has provided tremendous benefits to software developers at all levels of the software stack (operating systems, runtimes, and applications). The goal of this effort is to extend these benefits to the monitoring of communications and I/O by establishing suitable interfaces and conventions. Communications will be our initial focus, and we intend to support programming models beyond MPI, including various one-sided messaging and global address space approaches. We have established a collaboration with the PAPI team, who are generalizing their framework to support a broader range of data sources. We are in the process of brainstorming with potential users of the performance data at all levels of the software stack to establish what they would like to have and how it might be provided. Your input is welcome! A list of related publications can be found in the publications section. Please contact David Bernholdt if you have any comments or need more information.
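As a rough illustration of what "counters for communications" could look like, the sketch below wraps a send routine so that each call updates software counters that a tool can start, read, and stop, loosely echoing the start/read/stop pattern familiar from hardware-counter libraries. This is only a hypothetical sketch: `CommCounters`, `instrument`, the `send(dest, payload)` signature, and the event names `msgs_sent`/`bytes_sent` are all invented for illustration and are neither PAPI calls nor PAPI event names. One-sided operations (puts/gets) could be counted the same way under their own event names.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


class CommCounters:
    """Hypothetical software counters for communication events."""
    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._running = False

    def start(self) -> None:
        """Reset the counters and begin counting."""
        self._counts.clear()
        self._running = True

    def read(self) -> Dict[str, int]:
        """Return a snapshot of the current counts without stopping."""
        return dict(self._counts)

    def stop(self) -> Dict[str, int]:
        """Stop counting and return the final counts (kept for later reads)."""
        self._running = False
        return dict(self._counts)

    def record(self, event: str, amount: int = 1) -> None:
        if self._running:
            self._counts[event] += amount


def instrument(send: Callable[[int, bytes], None],
               ctr: CommCounters) -> Callable[[int, bytes], None]:
    """Wrap a hypothetical send(dest, payload) so each call updates counters."""
    def wrapped(dest: int, payload: bytes) -> None:
        ctr.record("msgs_sent")
        ctr.record("bytes_sent", len(payload))
        send(dest, payload)
    return wrapped


# Usage: a list stands in for the real transport.
outbox: List[Tuple[int, bytes]] = []
ctr = CommCounters()
send = instrument(lambda d, p: outbox.append((d, p)), ctr)
ctr.start()
send(1, b"hello")
send(2, b"abc")
print(ctr.stop())   # {'msgs_sent': 2, 'bytes_sent': 8}
```

Wrapping the communication call rather than modifying the application keeps instrumentation separate from the programming model, which is one plausible way such counters could span MPI and non-MPI libraries alike.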

*This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work is performed jointly at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725, Louisiana Tech University, Ohio State University, and North Carolina State University, in collaboration with the University of Reading and Cray Inc.

Please contact engelmannc@ornl.gov with questions or comments regarding this page.
Copyright © 2004-2007. All Rights Reserved.