Abstract
Researchers often use a two-step process to analyze multivariate data. First, dimensionality is reduced using a technique such as principal component analysis, followed by a group comparison using a (Formula presented.) -test or analysis of variance. Although this practice is often discouraged, the statistical properties of this procedure are not well understood, starting with the hypothesis being tested. We suggest that this approach might be considering two distinct hypotheses, one of which is a global test of no differences in the mean vectors, and the other being a focused test of a specific linear combination where the coefficients have been estimated from the data. We study the asymptotic properties of the two-sample (Formula presented.) -statistic for these two scenarios, assuming a nonsparse setting. We show that the size of the global test agrees with the presumed level but that the test has poor power. In contrast, the size of the focused test can be arbitrarily distorted with certain mean and covariance structures. A simple method is provided to correct the size of the focused test. Data analyses and simulations are used to illustrate the results. Recommendations on the use of this two-step method and the related use of principal components for prediction are provided.
Original language | English (US) |
---|---|
Pages (from-to) | 508-517 |
Number of pages | 10 |
Journal | Biometrics |
Volume | 76 |
Issue number | 2 |
DOIs | |
State | Published - Jun 1 2020 |
Keywords
- data reduction
- pooled covariance
- principal component analysis
ASJC Scopus subject areas
- Statistics and Probability
- General Biochemistry, Genetics and Molecular Biology
- General Immunology and Microbiology
- General Agricultural and Biological Sciences
- Applied Mathematics