Supercomputing 2002 Poster Submission

Subset selection of performance metrics describing system-software interactions.

Nayda G. Santiago, Michigan State University

Diane T. Rover, Iowa State University

Domingo Rodriguez, University of Puerto Rico at Mayaguez

Problem Statement:

Performance data analysis is integral to tuning parallel applications for advanced architectures. Large volumes of data with complex relationships must be analyzed, and the task grows more difficult as systems scale up and the number of observable metrics increases. As a first step toward understanding the relationships in observed performance, we are interested in automatically identifying the relationships that have the most bearing on system performance. One challenge is determining which metrics carry the most relevant information about those relationships and about the behavior of the system under study.

Feature subset selection methods are well suited to establishing which metrics are important for understanding the behavior of the system. Feature selection involves two main steps: identifying the dimensionality of the data and applying a feature subset selection method. Dimensionality identification estimates how many metrics are necessary to explain the behavior of the system-software interactions. Subset selection indicates which metrics best describe the system. Subset selection methods require a cost function, so an appropriate cost function for the high-performance computing (HPC) environment must also be defined.
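As a rough illustration of the dimensionality identification step, one widely used rule is the cumulative percentage of total variation: retain the smallest number of principal components whose variances together account for a chosen share of the total. The sketch below assumes the principal component variances (for example, the eigenvalues of the covariance matrix of the metrics) have already been computed. The function name components_to_retain and the 90% default threshold are illustrative assumptions, not values taken from this work.

```python
def components_to_retain(variances, threshold=0.90):
    """Return the smallest number of principal components whose variances
    sum to at least `threshold` of the total variation.

    `variances` holds the variance of each principal component.
    The 0.90 default is an assumed, illustrative cutoff.
    """
    ordered = sorted(variances, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variation must be positive")
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical variances for five principal components.
example_variances = [4.0, 2.0, 1.0, 0.5, 0.5]
print(components_to_retain(example_variances))
```

The returned count would then serve as the size of the subset requested from a subset selection method.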

Statement of Approach:

We present a detailed comparison of subset selection methods for choosing metrics from HPC instrumentation data. A case study using a finite element method application for conformal antenna analysis is shown. The number of features to retain so as to preserve the variability of the data is determined following Velez-Reyes and Jimenez [1]. Three methods were used to determine the number of metrics to retain: cumulative percentage of total variation, size of the variances of the principal components, and the scree graph. The subset selection methods presented are sequential forward selection, sequential backward selection, plus l-take away r, and SVD subset selection [1, 2]. An entropy measure was used as the cost function. The subsets of metrics selected by each method are shown and analyzed. The results indicate that we can effectively automate early steps in exploratory data analysis, thus pointing users to the most important data as well as enabling more advanced analysis and automatic decision-making.

[1] M. Velez-Reyes and L.O. Jimenez. Subset selection analysis for the reduction of hyperspectral imagery. Proc. IGARSS’98, pp. 1577-1581, 1998.

[2] D. Zongker and A.K. Jain. Algorithms for Feature Selection: An Evaluation. Proc. Intl. Conf. Pattern Recognition, pp. 18-22, 1996.
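Sequential forward selection, one of the methods compared in this approach, can be sketched as a greedy loop: start with an empty subset and repeatedly add the metric that most improves the cost function. The sketch below is a minimal illustration, not the implementation used in this work. It assumes the cost is the Shannon entropy of the jointly discretized metrics (the exact entropy measure is not specified here), and the metric names, the bin count, and all function names are hypothetical.

```python
import math
from collections import Counter


def discretize(column, bins=4):
    """Map each sample of one metric to an equal-width bin index."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)  # constant metric: everything in one bin
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in column]


def joint_entropy(columns):
    """Shannon entropy, in bits, of the joint distribution of binned columns."""
    rows = list(zip(*columns))
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sequential_forward_selection(data, k, bins=4):
    """Greedily select k metric names from `data` (name -> list of samples).

    Assumed cost: joint entropy of the selected, discretized metrics.
    Ties go to the metric that appears first in `data`.
    """
    binned = {name: discretize(col, bins) for name, col in data.items()}
    k = min(k, len(binned))
    selected = []
    while len(selected) < k:
        best_name, best_score = None, -1.0
        for name in binned:
            if name in selected:
                continue
            score = joint_entropy([binned[m] for m in selected + [name]])
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical metrics; "cache_misses_copy" is deliberately redundant.
metrics = {
    "cache_misses": [1, 2, 3, 4, 5, 6, 7, 8],
    "cache_misses_copy": [1, 2, 3, 4, 5, 6, 7, 8],
    "constant": [5, 5, 5, 5, 5, 5, 5, 5],
    "io_wait": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(sequential_forward_selection(metrics, 2))
```

With a joint-entropy cost, a redundant or constant metric adds no information to a subset that already contains its counterpart, so the greedy step skips it. Sequential backward selection runs the same idea in reverse, starting from all metrics and removing one at a time.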

Description of Visual Presentation:

The poster will contain an abstract, an introduction, a description of the importance of dimensionality estimation and subset selection methods, results from a case study showing the validity of the method, an analysis of the results, and conclusions.

A laptop demonstration will illustrate every step of the methodology. Audience members will be encouraged to submit their own data sets for analysis and to predict which metrics are most important for their specific software and hardware. We will then compare their predictions with the actual outcome of the selection methods.