--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
help paran
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Title
paran -- Horn's Test of Principal Components/Factors
Syntax
Parallel analysis of data
paran [varlist] [weight] [if exp] [in range] [, options]
options                description
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Model
  iterations(#)        specify the number of iterations
  centile(#)           use the specified centile of the random eigenvalues instead of the mean
  factor(factor_type)  use factor instead of pca (defaults to pca if left blank)
  citerate(#)          communality re-estimation iterations (ipf only)
Reporting
  quietly              suppress pca or factor output
  nostatus             suppress the status indicator
  all                  report all eigenvalues (default reports only those retained)
Graphing
  graph                graph unadjusted, adjusted, and random eigenvalues
  color                render the graph in color (default is black and white)
  lcolors(3 x rgb)     specify colors using three rgb triples for observed, random, and adjusted eigenvalues (overrides the color option)
  saving(filename)     save the graph as a .gph file
  replace              replace an existing file when using saving()
Miscellaneous
  protect(#)           perform # optimizations and report the best solution (factor(ml) only)
  seed(#)              seed the random number generator with the supplied integer
  mat(matrix name)     provide a correlation matrix instead of the varlist
  n(#)                 specify the sample size (required with the mat() option)
  copyleft             display the GPL license for paran
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fweights and aweights are allowed when using varlist; see help weights.
Description
paran is an implementation of Horn's technique for evaluating the components or common factors retained in a principal component analysis (PCA) or a common factor
analysis (FA). According to Horn, a common interpretation of uncorrelated data is that they are perfectly non-collinear, so one would expect to see eigenvalues
equal to 1 in a PCA of such data (or equal to 0 in a common factor analysis, as with pf). However, Horn notes that multicollinearity occurs due to sampling error
and least-squares "bias," even in uncorrelated data, and therefore actual PCAs of such data will reveal eigenvalues of components greater than and less than 1.
His strategy is to run a PCA on a random dataset (uncorrelated variables) with the same number of variables and observations as the experimental or observational
dataset, and to use the resulting eigenvalues to adjust the observed eigenvalues for the inflation induced by sampling error. Adjusted eigenvalues greater than 1
(for PCA) or greater than 0 (for common factor analysis) are retained, where the adjustment is given by:
For principal component analysis:
Observed Data Eigenvalue_p - (Random Data Eigenvalue_p - 1)
For common factor analysis:
Observed Data Eigenvalue_p - Random Data Eigenvalue_p
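The procedure described above can be sketched as a short do-file for the PCA case. This is a minimal illustration, not paran's own implementation: the auto
dataset, the variable list, the 200 iterations, the seed, and the matrix names Obs, Sum, Ev, MeanRand, and Adj are all assumptions chosen for the example, and
runiform() is used in place of the older uniform() function.

```stata
* Illustrative sketch only; not paran's own implementation.
sysuse auto, clear
local vars price mpg headroom trunk weight length turn
local p : word count `vars'

* Observed eigenvalues from an ordinary PCA of the data
quietly pca `vars'
matrix Obs = e(Ev)
local n = e(N)

* Sum the eigenvalues from K PCAs of uncorrelated random data
* with the same number of observations and variables
local K 200
set seed 12345
matrix Sum = J(1, `p', 0)
forvalues k = 1/`K' {
    preserve
    quietly {
        drop _all
        set obs `n'
        forvalues j = 1/`p' {
            generate x`j' = runiform()
        }
        pca x1-x`p'
        matrix Ev = e(Ev)
    }
    restore
    matrix Sum = Sum + Ev
}
matrix MeanRand = Sum / `K'

* PCA adjustment: observed eigenvalue - (random eigenvalue - 1)
matrix Adj = Obs - (MeanRand - J(1, `p', 1))
matrix list Obs
matrix list MeanRand
matrix list Adj
```

The mean random eigenvalues listed by this sketch should lie above 1 for the first components and below 1 for the last ones, which is the sampling-error
inflation that the adjustment removes.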
paran is used in place of a pca varlist command (or factor). The user may also specify how many times to make the contrast with a random dataset (default is 30 per
variable). Values less than 1 (or less than 0 for factor) are ignored, and the default value is assumed. Random datasets are generated using the uniform() function.
The program returns the estimated mean eigenvalues of random data if the centile option is unspecified; otherwise it returns the specified centile. The estimated
bias for each eigenvalue is also returned. paran may thus be used to conduct parallel analysis following Glorfeld's suggestions to reduce the likelihood of
over-retention (Glorfeld, 1995). Weights apply to the supplied variables, but are not applied to the pca or factor of the random data.
When the all option is not used, unadjusted eigenvalues greater than 1 (for principal components) or 0 (for factors) are reported, with retained adjusted eigenvalues
printed in yellow, and unretained adjusted eigenvalues printed in red.
Options
iterations(#) sets the number of contrast datasets to evaluate. The default is 30 times the number of variables, and values less than 1 are ignored. For large
datasets with many variables, a large number of iterations may be time consuming. The greater the number of iterations, the more precise the estimates of
sampling bias will be.
centile(#) specifies that the supplied centile is to be used instead of the mean (assumed to equal the median, since the distribution is symmetrical) in estimating
bias. Values above the mean/median, such as the 95th percentile, are more conservative estimates of chance bias in the eigenvalues from a PCA of sample data. This
option supersedes the older pnf, which was equivalent to centile(95). Values of centile() must be greater than 0 and less than 100. Non-integer values are rounded
to the nearest integer. Running paran without this option uses the mean value (very close to centile(50)). See Glorfeld (1995).
quietly suppresses output for the PCA or factor analysis. This option is only used if a varlist is specified in the paran command.
nostatus suppresses the status indicator. By default, paran reports each time another 10 percent of the computation has been completed.
factor(factor_type) selects one of the factor estimation types: pf, pcf, ipf, or ml (for principal factors, principal component factors, iterated principal factors, or
maximum likelihood factors, respectively). If you specify anything but one of these four abbreviations, you will be warned and the program will halt. CAVEAT:
Conducting parallel analysis using factor methods other than pf is unorthodox. Interpret such results at your own risk. If factor() is not specified, or is
specified without a factor type, paran performs parallel analysis using pca by default.
citerate(#) sets how many iterations will be used to re-estimate communalities for the iterated principal factor type. (see factor)
protect(#) sets the number of optimizations to perform for the maximum likelihood factor type, with the best solution reported. (see factor)
all reports all components or factors, not just those with unadjusted eigenvalues greater than one (or greater than zero for factor). The default is not to report all
components or factors.
graph draws a graph of the observed eigenvalues, the random eigenvalues, and the adjusted eigenvalues much like the graphs presented by Horn in his 1965 paper.
color renders the graph in color (only with graph) with unadjusted eigenvalues drawn in red, adjusted eigenvalues drawn in black, and random eigenvalues drawn in blue,
and all lines drawn solid. Without the color option, the graph is rendered in black and white, and the line connecting the unadjusted eigenvalues is dashed, the line
connecting the random eigenvalues is dotted, and the line connecting the adjusted eigenvalues is solid.
lcolors(# # # # # # # # #) specifies the colors of each line on the graph using three rgb triples (only with graph). The first triple indicates the R, G and B components
of the observed eigenvalues, the second triple sets the values for the mean or centile random eigenvalues, and the third triple sets the values for the adjusted
eigenvalues. These settings override the default (red, blue, and black) colors of the color option.
saving(filename) outputs the graph to the specified filename as a .gph file (only with graph).
replace overwrites an existing filename when the saving() option is used with graph.
seed(#) specifies an integer seed for the random number generator (see set seed) so that results of paran upon a specific data set can be exactly reproduced. The default
behavior of paran is not to specify a seed.
mat(matrix name) specifies an optional correlation matrix to be used instead of the varlist; requires the n(#) option also be specified. This option is not compatible
with aweights or fweights.
n(#) specifies the sample size when using the mat(matrix name) option.
copyleft displays the copying permission statement for paran. paran is free software, licensed under the GPL. The full license can be obtained by typing:
. net describe paran, from(http://www.alexisdinno.com/stata)
and clicking on the "click here to get" link for the ancillary file.
Remarks
Hayton, et al. (2004) urge a parameterization of the random data to approximate the distribution of the observed data with respect to the middle ("mid-point") and the
observed min and max. However, the PCA as I understand it is insensitive to standardizing transformations of each variable, and any linear transformation of all
variables, and produces the same eigenvalues used in component or factor retention decisions. This is borne out by the notable lack of difference between analyses conducted
using a variety of simulated distributional assumptions (Dinno, 2009). The central limit theorem would seem to make the selection of a distributional form for the random
data moot with any sizable number of iterations. Former functionality implementing the recommendation by Hayton et al. (2004) has been removed, since parallel analysis
is insensitive to it, and it only adds to the computation time required to conduct parallel analysis.
As of paran version 1.4.0, the application of parallel analysis to common factor analysis has been revised. See the accompanying document "Gently Clarifying the
Application of Horn's Parallel Analysis to Principal Component Analysis Versus Factor Analysis."
Examples
. paran var1-var16
. paran var1-var26, iter(5000) q centile(95)
. paran var1-var10, iter(1) factor(ipf) cit(50)
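The following sketch combines several of the options documented above, written as do-file lines (without the leading dot). The auto dataset, the variable list,
the seed value, the graph filename paran_auto, and the matrix name C are assumptions made for illustration; only the option names come from this help file, and
the exact behavior depends on the installed version of paran.

```stata
* Illustrative sketch; dataset, variables, seed, and filenames are assumptions.
sysuse auto, clear

* Reproducible run that compares against the 95th centile of the random eigenvalues
paran price mpg headroom trunk weight length turn, iterations(500) centile(95) seed(12345) quietly

* Same variables, with a color graph saved as paran_auto.gph
paran price mpg headroom trunk weight length turn, iterations(500) seed(12345) quietly graph color saving(paran_auto) replace

* Parallel analysis from a correlation matrix; n() supplies the sample size
quietly correlate price mpg headroom trunk weight length turn
matrix C = r(C)
local n = r(N)
paran, mat(C) n(`n') iterations(500) seed(12345)
```

Supplying seed() in each call is what makes the runs repeatable, since paran does not set a seed by default.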
Author
Alexis Dinno
alexis dot dinno at pdx dot edu
I am receptive to comments and requests.
References
Dinno A. 2009. "Exploring the Sensitivity of Horn's Parallel Analysis to the Distributional Form of Simulated Data." Multivariate Behavioral Research. 44: 362-388
Glorfeld LW. 1995. "An Improvement on Horn's Parallel Analysis Methodology for Selecting the Correct Number of Factors to Retain." Educational and Psychological Measurement. 55: 377-393
Hayton JC, Allen DG, and Scarpello V. 2004. "Factor Retention Decisions in Exploratory Factor Analysis: A Tutorial on Parallel Analysis." Organizational Research Methods. 7: 191-205
Horn JL. 1965. "A rationale and a test for the number of factors in factor analysis." Psychometrika. 30: 179-185
Zwick WR, Velicer WF. 1986. "Comparison of Five Rules for Determining the Number of Components to Retain." Psychological Bulletin. 99: 432-442
Saved results
paran saves the following 1 by P matrices in e():
Matrices
  e(UnadjustedEv)      Unadjusted eigenvalues from the pca or factor command
  e(AdjustedEv)        Eigenvalues from the analysis adjusted by subtracting the estimated bias
  e(MeanRandomEv)      The mean of the eigenvalues of random data sets of size N by P (only if the centile() option is unspecified)
  e(CentRandomEv)      The centile of the eigenvalues of random data sets of size N by P as given by the centile() option (and only if that option is specified)
  e(Bias)              The estimated bias (which is the mean of the eigenvalues of random data sets of size N by P when using the factor option)
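As a hedged illustration of how these matrices might be used after a run, the sketch below lists them and counts the retained components. The auto dataset,
the variable list, the seed, the matrix name A, and the local macro name retained are assumptions for the example; the count assumes the PCA case, in which an
adjusted eigenvalue greater than 1 indicates retention.

```stata
* Illustrative sketch; dataset, variables, and names are assumptions.
sysuse auto, clear
paran price mpg headroom trunk weight length turn, iterations(500) seed(12345) quietly nostatus

* Inspect the saved 1 by P matrices
matrix list e(UnadjustedEv)
matrix list e(AdjustedEv)
matrix list e(Bias)

* Count adjusted eigenvalues above the PCA threshold of 1
matrix A = e(AdjustedEv)
local retained 0
forvalues j = 1/`=colsof(A)' {
    if A[1, `j'] > 1 local ++retained
}
display "components retained: `retained'"
```

For a run with the factor() option, the comparison value in the loop would be 0 rather than 1.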
Also See
Online: help for pca, factor