Statistical Performance Comparisons of Computers
IEEE Transactions on Computers, 2014, 65(5): 1442-1455 | April 4, 2014 | DOI: 10.1109/TC.2014.2315614
As a fundamental task in computer architecture research, performance comparison has long been hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance observations are compared regardless of their variability), or, in the few cases where it is addressed directly, handled with t-statistics without checking the number and normality of the performance observations. In this paper, we formulate performance comparison as a statistical task, and empirically illustrate why and how common practices can lead to incorrect comparisons. We propose a non-parametric hierarchical performance testing (HPT) framework for performance comparison, which is significantly more practical than standard t-statistics because it does not require collecting a large number of performance observations to make the sample mean approximately normally distributed. In particular, the proposed HPT facilitates quantitative performance comparison, in which the performance speedup of one computer over another is evaluated statistically. Compared with the HPT, the common practice of estimating the performance speedup from geometric-mean performance scores yields errors of 8.0 to 56.3 percent on SPEC CPU2006 and SPEC MPI2007, which demonstrates the necessity of appropriate statistical techniques. The HPT framework has been implemented as open-source software and integrated into the PARSEC 3.0 benchmark suite.
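For concreteness, the sketch below illustrates one plausible two-level non-parametric design in the spirit of the abstract: per-benchmark rank-sum tests, followed by a suite-level signed-rank test over per-benchmark medians, contrasted with a single geometric-mean speedup number that carries no confidence information. This is a minimal sketch, not the authors' released HPT tool; the benchmark names and scores are invented, and the choice of Mann-Whitney and Wilcoxon tests is an assumption about the hierarchical structure, not a statement of the paper's exact procedure.

```python
# Hedged sketch of a two-level non-parametric comparison (NOT the
# authors' released HPT implementation; data below are synthetic).
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon, gmean

rng = np.random.default_rng(0)

# Hypothetical repeated performance scores (higher is better) for two
# computers A and B on eight benchmarks, ten runs each.
scores_a = {f"bench{i}": rng.normal(10 + i, 1.5, size=10) for i in range(8)}
scores_b = {f"bench{i}": rng.normal(9 + i, 1.5, size=10) for i in range(8)}

# Level 1: a one-sided rank-sum (Mann-Whitney U) test per benchmark asks
# whether A's observations tend to exceed B's on that benchmark.
medians_a, medians_b = [], []
for name in scores_a:
    a, b = scores_a[name], scores_b[name]
    _, p = mannwhitneyu(a, b, alternative="greater")
    medians_a.append(np.median(a))
    medians_b.append(np.median(b))
    print(f"{name}: one-sided rank-sum p = {p:.3f}")

# Level 2: a Wilcoxon signed-rank test over the per-benchmark medians
# asks whether A outperforms B across the suite as a whole.
_, p_overall = wilcoxon(medians_a, medians_b, alternative="greater")
print(f"suite-level signed-rank p = {p_overall:.3f}")

# For contrast, the common practice criticized in the paper: a single
# geometric-mean speedup computed from mean scores, reported with no
# attached statistical confidence.
speedup_gm = gmean([np.mean(scores_a[n]) / np.mean(scores_b[n])
                    for n in scores_a])
print(f"geometric-mean speedup estimate = {speedup_gm:.3f}")
```

The point of the contrast is that the hierarchical tests return p-values quantifying how confidently A can be declared faster, whereas the geometric-mean figure is a point estimate whose error the paper measures at 8.0 to 56.3 percent against the statistically evaluated speedup.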