Environment Aware Performance Diagnosis
We propose a new approach to automated performance debugging, Environment Aware Performance Diagnosis (EAPD), that can detect and correct interference between the runtime environment and high-end applications and thereby substantially improve performance. Our approach allows us to remove bottlenecks that current tools cannot identify or, worse, that they misdiagnose. For example, existing tools will often diagnose a high interrupt rate as a load imbalance in the application rather than correctly detecting that certain processors are being used for scheduled system activities or that failing hardware is interrupting excessively. EAPD will allow developers to optimize application use of runtime system resources. Our objective is a fully automated runtime measurement tool that accurately diagnoses performance problems, attributes their root cause to the correct layer of the runtime system, and communicates this through a well-defined interface. By combining the appropriate data from the environment, the application, and the runtime, EAPD will enable more effective and scalable performance diagnosis and improvement. Our techniques target system software for Petascale platforms of the near future, including the Cray XT systems, Linux clusters, and IBM Blue Gene platforms, as well as future Exascale systems. Sources of interference we study include the OS kernel, hypervisor, hardware, and the physical room environment, including power, cooling, and physical configuration.
Past Collaborators
- Dr. Douglas M. Pase, IBM Corporation
- Dr. Andres Marquez, Pacific Northwest National Laboratory
Acknowledgments
- This work was sponsored in part by a grant from the PSU Center for Sustainable Processes and Practices, and in part by PNNL contract.
Publications
- Jeffrey J. Evans, Sandeep Gupta, Karen L. Karavanic, Andres Marquez, and Georgios Varsamopoulos, "Evaluating Performance, Power, and Cooling in High Performance Computing (HPC) Data Centers," chapter in Ahmad and Ranka, eds., Handbook of Energy-Aware and Green Computing, Chapman and Hall/CRC, 2012.
- Rashawn L. Knapp, Karen L. Karavanic, Sriram Krishnamoorthy, and Andres Marquez, "Power- and Cooling- Aware Parallel Performance Diagnosis," Parallel and Distributed Computing and Systems (PDCS 2011), Dallas, Texas, December 2011.
- Rashawn L. Knapp, Karen L. Karavanic, and Andres Marquez, "Integrating Power and Cooling Data into Parallel Performance Analysis," The Second International Workshop on Green Computing (GreenCom 2010), San Diego, CA, September 13, 2010.
- Agniv Adhikari, Rashawn L. Knapp, Karen L. Karavanic and Andres Marquez, "Integrating System and Application Performance," Research Poster, 11th LCI International Conference on High-Performance Clustered Computing, March 2010, Pittsburgh, PA, USA.
- Agniv Adhikari, "Measuring system level performance to identify bottlenecks in high end systems," PSU CS Department Masters Thesis, 2010.
- Rashawn L. Knapp, Agniv Adhikari, Karen L. Karavanic and Andres Marquez, "Sustainable High End Computing: Integrating Power and Cooling Data with Application and System Performance for Linux Clusters," Research Poster, 10th LCI International Conference on High-Performance Clustered Computing, March 2009, Boulder, Colorado, USA.
- Rashawn L. Knapp, Douglas M. Pase, and Karen L. Karavanic, "ARUM: Application Resource Usage Monitor," The 9th LCI International Conference on High-Performance Clustered Computing, April 29 - May 1, 2008, National Center for Supercomputing Applications, Urbana IL.
- Rashawn L. Knapp, Karen L. Karavanic and Douglas M. Pase, "A Model for Environment Aware Performance Diagnosis," Research Poster, 8th LCI International Conference on High-Performance Clustered Computing, May 15-17, 2007, South Lake Tahoe, California, USA.
- Rashawn L. Knapp, Karen L. Karavanic, and Douglas M. Pase, "Detecting Runtime Environment Interference with Parallel Application Behavior," 3rd Workshop on System Management Techniques, Processes and Services (SMTPS'07).
- Rashawn L. Knapp and Karen L. Karavanic, "Correct Diagnosis of Parallel Performance Problems Caused by the Runtime Environment," OSIHPA 2006, September 17, 2006, Seattle, WA.
- Rashawn Knapp, "Environment Aware Performance Analysis," PSU CS Department Masters Thesis, 2006.