Effective performance tuning of large-scale parallel and distributed applications on current and future clusters, supercomputers, and grids requires the ability to integrate performance data gathered with a variety of monitoring tools, stored in different formats, and possibly residing in geographically separate data stores. Performance data sharing between different performance studies or scientists is currently done manually or not at all. Manually transferring disorganized files in unique and varying storage formats is time-consuming and error-prone, and the granularity of exchange is often an entire data set, even if only a small subset of the transferred data is actually needed. All of these problems worsen as high-end systems continue to scale up, greatly increasing the size of the data sets generated by performance studies. The overhead of sharing data discourages collaboration and data reuse.

There are several key challenges associated with performance data sharing. First, performance tool output is not uniform, so comparing data sets requires the ability to translate between different metrics and tool output formats. Second, the data sets resulting from performance studies of tera- and peta-scale applications are potentially large and raise challenging scalability issues. For some types of summary data, database storage is a well-understood task; for others, such as trace data and data generated with tools based on dynamic instrumentation, further research is needed to develop an efficient and flexible representation. Beyond easing collaboration between scientists using or studying a common application, the potential benefits of a tool that collects and integrates performance data include the ability to analyze an application's performance across architectures and to automatically compare different versions of an application.
To address these problems we have designed and implemented PerfTrack, an experiment management tool for collecting and analyzing parallel performance data. PerfTrack comprises a data store and an interface for storing and retrieving performance data that describes the runtime behavior of large-scale parallel applications. It uses an extensible resource type system that allows performance data stored in different formats to be integrated, stored, and used in a single performance analysis session. To address scalability, robustness, and fault tolerance, PerfTrack builds on a database management system for the underlying data store. It includes interfaces to the data store and a set of modules for automatically collecting descriptive data for each performance experiment, including descriptions of the build and runtime platforms. We present a model for representing parallel performance data and an API for loading it into the data store.
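As a minimal sketch of the kind of loading interface this suggests, the Python fragment below shows how a measurement tagged with descriptive resources might be placed into a relational data store. The class names (PerformanceResult, DataStore) and the SQLite backing store are illustrative assumptions for this sketch, not PerfTrack's actual API.

    # Hypothetical sketch only: these names and the SQLite backing store
    # are illustrative assumptions, not PerfTrack's actual interface.
    import sqlite3

    class PerformanceResult:
        """One measured value, tagged with a metric name and descriptive resources."""
        def __init__(self, metric, value, resources):
            self.metric = metric          # e.g. "wall_clock_seconds"
            self.value = value            # the measured value
            self.resources = resources    # e.g. {"application": "app", "platform": "clusterA"}

    class DataStore:
        """Thin wrapper over a relational database used as the persistent store."""
        def __init__(self, path=":memory:"):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS result (metric TEXT, value REAL, resources TEXT)"
            )

        def load(self, result):
            # Store the resource context with the measurement so that later
            # queries can select results by platform, build, input size, etc.
            self.db.execute(
                "INSERT INTO result VALUES (?, ?, ?)",
                (result.metric, result.value, repr(sorted(result.resources.items()))),
            )
            self.db.commit()

    # Example: load one measurement from a hypothetical experiment.
    store = DataStore()
    store.load(PerformanceResult(
        metric="wall_clock_seconds",
        value=123.4,
        resources={"application": "exampleApp", "platform": "exampleCluster"},
    ))

The point of the sketch is the resource tagging: because each result carries a description of its execution context, data gathered by different tools and studies can be stored side by side and later selected or compared along any resource dimension.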
This project was initiated in collaboration with researchers at Lawrence Livermore National Laboratory.