HA-OSCAR/Molar for HEC is the Federated System Management (fSM), a RAS-aware resource management approach comprising the following elements:

1) There are m partitions (processor groups) within HEC environments.
2) Each partition consists of service and management nodes and a significant number of processors (compute nodes).
3) Partition-centric service and management nodes provide critical services for local and intra-partition requests (e.g., local/global scheduling, and monitoring the partition, its current state, and some important time-series data sets for reliability and QoS improvement).

A list of related publications can be found in the publications section. For more information, please visit http://xcr.cenit.latech.edu/ha-oscar.

OS-level Data Replication and Distributed Control

Based on our experience with HA-OSCAR and Harness, we are developing an OS-level data replication and distributed control framework that is capable of providing both active/hot-standby and active/active high availability to system management services, such as job schedulers, system performance and health monitors, and software installation and maintenance tools. Our main objective is to enable existing proprietary group communication middleware solutions, which are based on different communication, distributed locking, and control models, to be moved out of the middleware layer into the OS in the form of pluggable and interchangeable modules. A list of related publications can be found in the publications section.

Scalable Algorithms for High Availability

The overall goal of the research is to develop scalable algorithms for high availability without single points of failure and without single points of control. A list of related publications can be found in the publications section. For more information, please visit http://moss.csc.ncsu.edu/~mueller/molar.html.
Communications and I/O Performance Monitoring to Support Adaptation and Tuning of Operating Systems, Runtimes, and Applications

The availability of hardware-based counters for CPU and memory, and of standardized interfaces to them, such as PAPI, has provided tremendous benefits to software developers at all levels of the software stack (operating systems, runtimes, and applications). The goal of this effort is to extend these benefits to the monitoring of communications and I/O by establishing suitable interfaces and conventions. Communications will be our initial focus, and we intend to support programming models beyond MPI, including various one-sided messaging and global address space approaches. We have established a collaboration with the PAPI team, who are generalizing their framework to support a broader range of data sources. We are in the process of brainstorming with potential users of the performance data at all levels of the software stack to establish what they would like to have and how it might be provided. Your input is welcome! A list of related publications can be found in the publications section. Please contact David Bernholdt if you have any comments or need more information.
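
To make the idea of counters for communications more concrete, here is a minimal Python sketch of how the start, read, and stop usage pattern common to hardware counter interfaces might carry over to communication events. It is purely illustrative: the class `CommCounters`, its methods, and the event names `send` and `put` are hypothetical, and they are not part of PAPI or of any interface this project has defined.

```python
from collections import defaultdict


class CommCounters:
    """Hypothetical software counters for communication events.

    Illustrative only: this is not the PAPI API or a project interface.
    """

    def __init__(self):
        self._counts = defaultdict(int)
        self._running = False

    def start(self):
        """Reset all counters and begin counting."""
        self._counts.clear()
        self._running = True

    def record(self, event, nbytes):
        """Count one communication event; ignored unless counting is active."""
        if self._running:
            self._counts[event + "_calls"] += 1
            self._counts[event + "_bytes"] += nbytes

    def read(self):
        """Return a snapshot of the current counter values."""
        return dict(self._counts)

    def stop(self):
        """Stop counting and return the final counter values."""
        self._running = False
        return self.read()


# Example: a runtime would call record() from its messaging layer.
counters = CommCounters()
counters.start()
counters.record("send", 1024)   # two-sided message (assumed event name)
counters.record("put", 4096)    # one-sided transfer (assumed event name)
counters.record("send", 512)
snapshot = counters.stop()
counters.record("send", 64)     # not counted: counting has stopped
print(snapshot)
```

Keeping per-event call and byte counts in one flat namespace is only one possible convention; which events and granularity are useful is exactly the kind of input requested above.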
Please contact engelmannc@ornl.gov with questions or comments regarding this page.