
MOLAR: Modular Linux and Adaptive Runtime Support for
High-end Computing Operating and Runtime Systems*



MOLAR

    2006
  1. C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller, A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems. ACM SIGOPS Operating Systems Review (OSR), 40(2), pages 63-72, 2006.

Proactive Fault Tolerance

    2008
  1. G. Vallée, K. Charoenpornwattana, C. Engelmann, A. Tikotekar, C. Leangsuksun, T. Naughton, and S. L. Scott. A framework for proactive fault tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008.
    2007
  1. A. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007.

HA-OSCAR/MOLAR for HEC

    2006
  1. S. Rani, C. Leangsuksun, A. Tikotekar, V. Rampure, and S. L. Scott. Toward efficient failure detection and recovery in HPC. In Proceedings of High Availability and Performance Workshop (HAPCW) 2006, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
  2. N. R. Gottumukkala, C. Leangsuksun, Y. Liu, R. Nassar, and S. L. Scott. Reliability analysis in HPC clusters. In Proceedings of High Availability and Performance Workshop (HAPCW) 2006, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
  3. C. Leangsuksun, T. Rao, A. Tikotekar, S. L. Scott, R. Libby, J. Vetter, Y. C. Fang, and H. Ong. IPMI-based efficient notification framework for large scale cluster computing. In Proceedings of the 2nd International Workshop on Cluster Security (Cluster-Sec) 2006, held in conjunction with the 6th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid) 2006, Singapore, May 16-19, 2006.
  4. A. Tikotekar, C. Leangsuksun, and S. L. Scott. On the survivability of standard MPI applications. In Proceedings of 7th LCI International Conference on Linux Clusters: The HPC Revolution 2006, Norman, OK, USA, May 1-4, 2006.
  5. H. Song, C. Leangsuksun, R. Nassar, N. R. Gottumukkala, S. L. Scott, and A. Yoo. Availability modeling and analysis on high performance cluster computing systems. In Proceedings of The First International Conference on Availability, Reliability and Security (ARES) 2006, Vienna, Austria, April 20-22, 2006.
  6. N. R. Gottumukkala, C. Leangsuksun, and S. L. Scott. Reliability-aware approach to improve job completion time for large-scale parallel applications. In Proceedings of 2nd Workshop on High Performance Computing Reliability Issues (HPCRI) 2006, Austin, TX, USA, February 11-15, 2006.
    2005
  1. H. Song, C. Leangsuksun, N. R. Gottumukkala, S. L. Scott, and A. Yoo. Near-real-time availability monitoring and modeling for HPC/HEC runtime system. In Proceedings of Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11-13, 2005.
  2. K. Limaye, C. Leangsuksun, and A. Tikotekar. Fault tolerance-enabled HPC scheduling with HA-OSCAR framework. In Proceedings of High Availability and Performance Workshop (HAPCW) 2005, Santa Fe, NM, USA, October 11, 2005.
  3. K. Limaye, C. Leangsuksun, V. K. Munganuru, Z. Greenwood, S. L. Scott, and K. Chanchio. Reliability-aware resource management for computational grid/cluster environments. In Proceedings of 6th IEEE/ACM International Workshop on Grid Computing (Grid) 2005, Seattle, WA, USA, November 13-14, 2005.
  4. K. Limaye, C. Leangsuksun, Z. Greenwood, S. L. Scott, C. Engelmann, R. Libby, and K. Chanchio. Jobsite level fault tolerance for cluster and grid environments. In Proceedings of IEEE International Conference on Cluster Computing (Cluster) 2005, Boston, MA, USA, September 26-30, 2005.
  5. Y. Liu and C. Leangsuksun. Reliability-aware checkpoint/restart scheme: a performability trade-off. In Proceedings of IEEE International Conference on Cluster Computing (Cluster) 2005, Boston, MA, USA, September 26-30, 2005.
  6. H. Song and C. Leangsuksun. Availability specification and evaluation of HA-OSCAR cluster servers - an object-oriented approach. In Proceedings of 3rd International Conference on Computing, Communications and Control Technologies (CCCT) 2005, Austin, TX, USA, July 24-27, 2005.
  7. H. Song, C. Leangsuksun, and R. Nassar. OOMSE - An object oriented Markov chain specification and evaluation framework. In Proceedings of 17th International Conference on Software Engineering and Knowledge Engineering (SEKE) 2005, Taipei, Taiwan, July 14-16, 2005.
  8. H. Song, C. Leangsuksun, R. Nassar, Y. Liu, C. Engelmann, and S. L. Scott. UML-based Beowulf cluster availability modeling. In Proceedings of International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, Nevada, USA, June 27-30, 2005.
  9. C. Leangsuksun, V. K. Munganuru, T. Liu, S. L. Scott, and C. Engelmann. Asymmetric active-active high availability for high-end computing. In Proceedings of 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
  10. K. Limaye, C. Leangsuksun, V. K. Munganuru, Z. Greenwood, S. L. Scott, and K. Chanchio. Grid-enabled HA-OSCAR. In Proceedings of OSCAR Symposium (OSCAR) 2005, Ontario, Canada, May 15-18, 2005.
  11. C. Leangsuksun, A. Tikotekar, S. L. Scott, M. Pourzandi, and I. Haddad. Towards cluster survivability. In Proceedings of 6th LCI International Conference on Linux Clusters: The HPC Revolution 2005, Chapel Hill, NC, USA, May 1-4, 2005.
  12. C. Leangsuksun and H. Song. A light-weight model of solution for Markov processes. In Proceedings of 43rd annual ACM Southeast Conference (ACMSE) 2005, Kennesaw, GA, USA, March 18-20, 2005.
  13. K. Limaye, C. Leangsuksun, V. K. Munganuru, and Z. Greenwood. HA-OSCAR: Grid-enabled high availability framework. In Proceedings of 13th Annual Mardi Gras Conference 2005, Baton Rouge, LA, USA, February 3-5, 2005.

OS-level Data Replication and Distributed Control

    2008
  1. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active high availability for high-performance computing system services: Accomplishments and limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008, Lyon, France, May 19-22, 2008. To appear.
  2. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active replication for dependent services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008.
    2007
  1. L. Ou, X. He, C. Engelmann, and S. L. Scott. A fast delivery protocol for total order broadcasting. In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks (ICCCN) 2007, Honolulu, HI, USA, August 13-16, 2007.
  2. C. Engelmann, H. Ong, and S. L. Scott. Middleware in modern high performance computing system architectures. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on Computational Science (ICCS) 2007, Part II, volume 4488, pages 784-791, Beijing, China, May 27-30, 2007.
  3. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Transparent symmetric active/active replication for service-level high availability. In Proceedings of 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007.
  4. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. On programming models for service-level high availability. In Proceedings of 2nd International Conference on Availability, Reliability and Security (ARES) 2007, Vienna, Austria, April 10-13, 2007.
    2006
  1. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active high availability for high-performance computing system services. Journal of Computers (JCP), 1(8), pages 43-54, 2006.
  2. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Towards high availability for high-performance computing system services: Accomplishments and limitations. In Proceedings of High Availability and Performance Workshop (HAPCW) 2006, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006.
  3. K. Uhlemann, C. Engelmann, and S. L. Scott. JOSHUA: symmetric active/active replication for highly available HPC job and resource management. In Proceedings of IEEE International Conference on Cluster Computing (Cluster) 2006, Barcelona, Spain, September 25-28, 2006.
  4. D. Okunbor, C. Engelmann, and S. L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006.
  5. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Active/active replication for highly available HPC system services. In Proceedings of The First International Conference on Availability, Reliability and Security (ARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006.
    2005
  1. C. Engelmann and S. L. Scott. Concepts for high availability in scientific high-end computing. In Proceedings of High Availability and Performance Workshop (HAPCW) 2005, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005.
  2. G. Sabin and P. Sadayappan. On enhancing the reliability of job schedulers. In Proceedings of High Availability and Performance Workshop (HAPCW) 2005, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005.
  3. C. Engelmann and S. L. Scott. High availability for ultra-scale high-end scientific computing. In Proceedings of 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005.
    2004
  1. C. Engelmann, S. L. Scott, and G. A. Geist. High availability through distributed control. In Proceedings of High Availability and Performance Workshop (HAPCW) 2004, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004.

Scalable Algorithms for High Availability

    2007
  1. C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. A job pause service under LAM/MPI+BLCR for transparent fault tolerance. In Proceedings of 21st International Parallel and Distributed Processing Symposium (IPDPS) 2007, Long Beach, CA, USA, March 26-30, 2007.
    2006
  1. J. Varma, C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Scalable, fault-tolerant membership for MPI tasks on HPC systems. In Proceedings of 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006.

Communications and I/O Performance Monitoring to Support Adaptation and Tuning of Operating Systems, Runtimes, and Applications

    2008
  1. A. G. Shet, P. Sadayappan, D. E. Bernholdt, J. Nieplocha, and V. Tipparaju. A framework for characterizing overlap of communication and computation in parallel applications. Cluster Computing Journal, 11(1), pages 75-90, 2008.
    2006
  1. A. G. Shet, D. E. Bernholdt, J. Nieplocha, V. Tipparaju, and P. Sadayappan. A performance instrumentation framework to characterize computation-communication overlap in message-passing systems. In Proceedings of IEEE International Conference on Cluster Computing (Cluster) 2006, Barcelona, Spain, September 25-28, 2006.

Data Storage Scalability and Availability

    2007
  1. L. Ou, C. Engelmann, X. He, X. Chen, and S. L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Lecture Notes in Computer Science: Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007.
  2. X. He, L. Ou, M. Kosa, S. L. Scott, and C. Engelmann. A unified cache for high performance cluster storage systems. International Journal of High Performance Computing and Networking (IJHPCN), 5(1), pages 97-109, 2007.
    2005
  1. X. He, M. Zhang, and Q. Yang. SPEK: A storage performance evaluation kernel module in consideration of availability for block level storage systems. IEEE Transactions on Dependable and Secure Computing, 2(2), pages 138-149, 2005.
  2. L. Ou, X. He, S. L. Scott, Z. Xu, and Y. Fang. Design and evaluation of a high performance parallel file system. In Proceedings of the 30th Annual IEEE Conference on Local Computer Networks (LCN) 2005, Sydney, Australia, November 15-17, 2005.
    2004
  1. X. He, L. Ou, S. L. Scott, and C. Engelmann. A highly available cluster storage system using scavenging. In Proceedings of High Availability and Performance Workshop (HAPCW) 2004, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004.

*This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work is performed jointly at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725, Louisiana Tech University, Ohio State University, and North Carolina State University, in collaboration with the University of Reading and Cray Inc.

Please contact engelmannc@ornl.gov with questions or comments regarding this page.