Principal Components Analysis

Hong Zheng / 2017-12-11

Denote the data matrix as $X$ . It is a $n*p$ matrix with $n$ individuals/observations as rows and $p$ features/variables as columns.

Firstly, center (necessary) and scale (not required, depending on the data structure) X, so that the column means are 0s, and the column variances are 1s.

X <- scale(X,center = T, scale = T)

Use any of the three functions in R to perform PCA.

# The centering and scaling options are still specified, although not necessary here since X has already been centered and scaled. 
X.princomp = princomp(X, cor = T, scores = T) 
X.prcomp = prcomp(X,scale. = T) 
X.svd = svd(scale(X,center=TRUE,scale=TRUE))

	princomp()	prcomp()	svd()
standard deviations of principal components	sdev	sdev	$\sqrt{(D^2/(n-1))}$
matrix of variable loadings	loadings	rotation	V
principal components	scores	x	U%*%D

Note:

In svd approach,
- if p>n, $X_{n*p}=U_{n*n}D_{n*n}V_{p*n}'$
- if p<n, $X_{n*p}=U_{n*p}D_{p*p}V_{p*p}'$ .
$U$ and $V$ has the following properties: $V'V=I$ , $U'U=I$ ; column sum of squares are ones.
$U$ has equal column variances, which is equal to $\frac {1}{n-1} $
Loadings in princomp output, rotation in prcomp output, and v in svd outout are the matrix of variable loadings (columns are eigenvectors). Their row and column sum of squares are ones.

How to get new principle components?

These are the coordinates of the individuals/observations on the principal components.

Using princomp

  X.princomp$scores
# or
  X %*% X.princomp$loadings

Using prcomp

  X.prcomp$x
# or  
  X %*% X.prcomp$rotation

Using svd

   X.svd$u %*% diag(X.svd$d) # UD
# or
   X %*% X.svd$v # XV

princomp outout has different signs with the other two.

Relevant plots

# scatter plot of individuals on PC1 vs. PC2
plot(X.prcomp$x[,1],X.prcomp$x[,2],pch=16) 
# use factoextra package. color individuals by their groups.
fviz_pca_ind(X.prcomp,
         geom.ind = "point",
         col.ind = as.factor(sample.group), 
         palette = "Dark2",
         repel = F,
         addEllipses = TRUE
         )
# use factoextra package. color individuals by their cos2 values.          
fviz_pca_ind(X.prcomp,
         geom.ind = "point",
         col.ind = cos2,
         gradient.cols = "Dark2",
         repel = F,
         addEllipses = TRUE
         )

How to get eigenvalue/variance explained by each PC?

An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This holds true only when the data are standardized.

Using princomp

apply(X.princomp$scores,2,function(x){var(x)}) # a bit inaccurate
# or
X.princomp$sdev^2

Using prcomp

apply(X.prcomp$x,2,function(x){var(x)})
# or
X.prcomp$sdev^2

Using svd

apply(X.svd$u %*% diag(X.svd$d),2,function(x){var(x)})
# or  
X.svd$d^2/(nrow(X)-1)

Relevant plots

# get variance explained by each PC and cumulative variance
  get_eigenvalue(X.prcomp)  
  X.prcomp.varex = 100*X.prcomp$sdev^2/sum(X.prcomp$sdev^2) # variance explained by each PC
  X.prcomp.cvarex = NULL; for(i in 1:ncol(cdat)){X.prcomp.cvarex[i] = sum(X.prcomp.varex[1:i])} # cumulative variance
# plots  
  screeplot(X.prcomp)
  fviz_eig(X.prcomp,addlabels = TRUE)
  fviz_screeplot(X.prcomp)

How to get eigenvectors?

Using princomp

  X.princomp$loadings # Strictly speaking, they are not the real "loadings". See below.

Using prcomp

  X.prcomp$rotation
# look at variables that contribute to principal component 1  
  barplot(X.prcomp$rotation[,1]

Using svd

  X.svd$v

How to get loadings?

These are the coordinates of the features/variables on the principal components.

Loadings are unstandardized eigenvectors’ elements, i.e. eigenvectors endowed by corresponding component variances, or eigenvalues. They are also covariances/correlations between the original variables and the unit-scaled components.

Using princomp

# use get_pca_var from factoextra package
  X.princomp.var <-get_pca_var(X.princomp) 
  X.princomp.var$coord
# or  
  X.princomp$loadings %*% diag(X.princomp$sdev)

Using prcomp

# use get_pca_var from factoextra package
  X.prcomp.var <-get_pca_var(X.prcomp) 
  X.prcomp.var$coord
# or  
  X.prcomp$rotation %*% diag(X.prcomp$sdev)

The column sum of squares of the loadings are the variances of PCs.
The row sum of squares of the loadings are 1s.

If we square the loading matrix, we get the quality of representation for variables on the factor map (the cos2 output from get_pca_var function).

all.equal(X.prcomp.var$cos2,X.prcomp.var$coord^2)

The column sums of the cos2 matrix are the variances of PCs.
The row sums of the cos2 matrix are 1s.

X.prcomp.var$contrib contains the contributions (in percentage) of the variables to the principal components.

# The contribution of a variable to the first principal component is (in percentage) 
X.prcomp.var$cos2[,1] * 100 / sum(X.prcomp.var$cos2[,1])

Relevant plots

# look at variables that contribute to principal component 1  
# use eigenvalues
  barplot(X.prcomp$rotation[,1])
# use loadings.Same as above, only different axis scales (scaled by X.prcomp$sdev[1])
  barplot(X.prcomp.var$coord[,1]) 
# use cos2/contrib
  barplot(X.prcomp.var$cos2[,1])

eigenvectors and eigenvalues

http://setosa.io/ev/eigenvectors-and-eigenvalues/

Nice illustrations by ttnphns et.al.

Variable space and subject space
https://stats.stackexchange.com/a/192637
Visual explanation of Canonical correlation analysis, PCA and Linear regression
https://stats.stackexchange.com/a/65817
Association measure of a variable with a PCA component (on a biplot / loading plot)
https://stats.stackexchange.com/a/119758
Principal component analysis and Factor analysis
https://stats.stackexchange.com/a/95106
Regression
https://stats.stackexchange.com/a/124892
Theoretical and pratical explanations of biplot
https://stats.stackexchange.com/a/141755
https://stats.stackexchange.com/a/276746
Loadings and eigenvectors
https://stats.stackexchange.com/a/143949

More vivid explanation

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/

PCA pratical tutorials using R

http://www.statpower.net/Content/312/R%20Stuff/PCA.html
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
https://rpubs.com/crazyhottommy/PCA_MDS