Environment Aware Performance Diagnosis

We propose a new approach to automated performance debugging, Environment Aware Performance Diagnosis (EAPD), that can detect and correct interference between the runtime environment and high-end applications and thereby substantially improve performance. Our approach allows us to remove bottlenecks that current tools cannot identify or, worse, misdiagnose. For example, existing tools often diagnose a high interrupt rate as a load imbalance in the application, rather than detecting that certain processors are being used for scheduled system activities or that failing hardware is generating excessive interrupts. EAPD will allow developers to optimize an application's use of runtime system resources. Our objective is a fully automated runtime measurement tool that accurately diagnoses performance problems, attributes their root cause to the correct layer of the runtime system, and communicates this diagnosis through a well-defined interface. By combining the appropriate data from the environment, the application, and the runtime, EAPD will enable a more effective and scalable tool for performance diagnosis and improvement. Our techniques target system software for Petascale platforms of the near future, including Cray XT systems, Linux clusters, and IBM Blue Gene platforms, as well as future Exascale systems. The sources of interference we study include the OS kernel, the hypervisor, the hardware, and the physical room environment, including power, cooling, and physical configuration.