Principal Components Analysis
Hong Zheng / 2017-12-11
Denote the data matrix as $X$
. It is a $n*p$
matrix with $n$
individuals/observations as rows and $p$
features/variables as columns.
Firstly, center (necessary) and scale (not required, depending on the data structure) X, so that the column means are 0s, and the column variances are 1s.
X <- scale(X,center = T, scale = T)
Use any of the three functions in R to perform PCA.
# The centering and scaling options are still specified, although not necessary here since X has already been centered and scaled.
X.princomp = princomp(X, cor = T, scores = T)
X.prcomp = prcomp(X,scale. = T)
X.svd = svd(scale(X,center=TRUE,scale=TRUE))
princomp() | prcomp() | svd() | |
---|---|---|---|
standard deviations of principal components | sdev | sdev | $\sqrt{(D^2/(n-1))}$ |
matrix of variable loadings | loadings | rotation | V |
principal components | scores | x | U%*%D |
Note:
- In svd approach,
- if p>n,
$X_{n*p}=U_{n*n}D_{n*n}V_{p*n}'$
- if p<n,
$X_{n*p}=U_{n*p}D_{p*p}V_{p*p}'$
.
- if p>n,
$U$
and$V$
has the following properties:$V'V=I$
,$U'U=I$
; column sum of squares are ones.$U$
has equal column variances, which is equal to$\frac {1}{n-1} $
- Loadings in princomp output, rotation in prcomp output, and v in svd outout are the matrix of variable loadings (columns are eigenvectors). Their row and column sum of squares are ones.
How to get new principle components?
These are the coordinates of the individuals/observations on the principal components.
- Using princomp
X.princomp$scores
# or
X %*% X.princomp$loadings
- Using prcomp
X.prcomp$x
# or
X %*% X.prcomp$rotation
- Using svd
X.svd$u %*% diag(X.svd$d) # UD
# or
X %*% X.svd$v # XV
princomp outout has different signs with the other two.
Relevant plots
# scatter plot of individuals on PC1 vs. PC2 plot(X.prcomp$x[,1],X.prcomp$x[,2],pch=16) # use factoextra package. color individuals by their groups. fviz_pca_ind(X.prcomp, geom.ind = "point", col.ind = as.factor(sample.group), palette = "Dark2", repel = F, addEllipses = TRUE ) # use factoextra package. color individuals by their cos2 values. fviz_pca_ind(X.prcomp, geom.ind = "point", col.ind = cos2, gradient.cols = "Dark2", repel = F, addEllipses = TRUE )
How to get eigenvalue/variance explained by each PC?
An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This holds true only when the data are standardized.
Using princomp
apply(X.princomp$scores,2,function(x){var(x)}) # a bit inaccurate # or X.princomp$sdev^2
Using prcomp
apply(X.prcomp$x,2,function(x){var(x)}) # or X.prcomp$sdev^2
Using svd
apply(X.svd$u %*% diag(X.svd$d),2,function(x){var(x)}) # or X.svd$d^2/(nrow(X)-1)
Relevant plots
# get variance explained by each PC and cumulative variance
get_eigenvalue(X.prcomp)
X.prcomp.varex = 100*X.prcomp$sdev^2/sum(X.prcomp$sdev^2) # variance explained by each PC
X.prcomp.cvarex = NULL; for(i in 1:ncol(cdat)){X.prcomp.cvarex[i] = sum(X.prcomp.varex[1:i])} # cumulative variance
# plots
screeplot(X.prcomp)
fviz_eig(X.prcomp,addlabels = TRUE)
fviz_screeplot(X.prcomp)
How to get eigenvectors?
- Using princomp
X.princomp$loadings # Strictly speaking, they are not the real "loadings". See below.
- Using prcomp
X.prcomp$rotation
# look at variables that contribute to principal component 1
barplot(X.prcomp$rotation[,1]
- Using svd
X.svd$v
How to get loadings?
These are the coordinates of the features/variables on the principal components.
Loadings are unstandardized eigenvectors’ elements, i.e. eigenvectors endowed by corresponding component variances, or eigenvalues. They are also covariances/correlations between the original variables and the unit-scaled components.
- Using princomp
# use get_pca_var from factoextra package
X.princomp.var <-get_pca_var(X.princomp)
X.princomp.var$coord
# or
X.princomp$loadings %*% diag(X.princomp$sdev)
- Using prcomp
# use get_pca_var from factoextra package
X.prcomp.var <-get_pca_var(X.prcomp)
X.prcomp.var$coord
# or
X.prcomp$rotation %*% diag(X.prcomp$sdev)
The column sum of squares of the loadings are the variances of PCs.
The row sum of squares of the loadings are 1s.
If we square the loading matrix, we get the quality of representation for variables on the factor map (the cos2 output from get_pca_var
function).
all.equal(X.prcomp.var$cos2,X.prcomp.var$coord^2)
The column sums of the cos2 matrix are the variances of PCs.
The row sums of the cos2 matrix are 1s.
X.prcomp.var$contrib
contains the contributions (in percentage) of the variables to the principal components.
# The contribution of a variable to the first principal component is (in percentage)
X.prcomp.var$cos2[,1] * 100 / sum(X.prcomp.var$cos2[,1])
- Relevant plots
# look at variables that contribute to principal component 1
# use eigenvalues
barplot(X.prcomp$rotation[,1])
# use loadings.Same as above, only different axis scales (scaled by X.prcomp$sdev[1])
barplot(X.prcomp.var$coord[,1])
# use cos2/contrib
barplot(X.prcomp.var$cos2[,1])
eigenvectors and eigenvalues
http://setosa.io/ev/eigenvectors-and-eigenvalues/
Nice illustrations by ttnphns et.al.
- Variable space and subject space
https://stats.stackexchange.com/a/192637 - Visual explanation of Canonical correlation analysis, PCA and Linear regression
https://stats.stackexchange.com/a/65817 - Association measure of a variable with a PCA component (on a biplot / loading plot)
https://stats.stackexchange.com/a/119758 - Principal component analysis and Factor analysis
https://stats.stackexchange.com/a/95106 - Regression
https://stats.stackexchange.com/a/124892 - Theoretical and pratical explanations of biplot
https://stats.stackexchange.com/a/141755
https://stats.stackexchange.com/a/276746 - Loadings and eigenvectors
https://stats.stackexchange.com/a/143949
More vivid explanation
PCA pratical tutorials using R
http://www.statpower.net/Content/312/R%20Stuff/PCA.html
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
https://rpubs.com/crazyhottommy/PCA_MDS