Principal Components Analysis

Hong Zheng / 2017-12-11

Denote the data matrix as $X$, an $n \times p$ matrix with $n$ individuals/observations as rows and $p$ features/variables as columns.

First, center $X$ (required) and scale it (optional, depending on the data), so that the column means are 0 and the column variances are 1.

X <- scale(X, center = TRUE, scale = TRUE)

Use any of the following three functions in R to perform PCA.

# The centering and scaling options are still specified, although they are not necessary here since X has already been centered and scaled.
X.princomp <- princomp(X, cor = TRUE, scores = TRUE)
X.prcomp <- prcomp(X, scale. = TRUE)
X.svd <- svd(scale(X, center = TRUE, scale = TRUE))
|  | princomp() | prcomp() | svd() |
|---|---|---|---|
| standard deviations of principal components | sdev | sdev | $\sqrt{D^2/(n-1)}$ |
| matrix of variable loadings | loadings | rotation | V |
| principal component scores | scores | x | U %*% D |
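As a quick sanity check of this correspondence, the three routes can be run side by side. This is a minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
# Compare the three PCA routes on the built-in USArrests data.
X <- scale(USArrests, center = TRUE, scale = TRUE)
n <- nrow(X)

X.princomp <- princomp(X, cor = TRUE, scores = TRUE)
X.prcomp <- prcomp(X, scale. = TRUE)
X.svd <- svd(X)

# prcomp's sdev equals the singular values divided by sqrt(n - 1).
all.equal(X.prcomp$sdev, X.svd$d / sqrt(n - 1), check.attributes = FALSE)
# The eigenvector matrices agree up to sign (signs of PCs are arbitrary).
all.equal(abs(unclass(X.princomp$loadings)), abs(X.prcomp$rotation),
          check.attributes = FALSE)
all.equal(abs(X.prcomp$rotation), abs(X.svd$v), check.attributes = FALSE)
```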


How to get new principal components?

These are the coordinates of the individuals/observations on the principal components.

  X.princomp$scores
# or
  X %*% X.princomp$loadings
# or
  X.prcomp$x
# or
  X %*% X.prcomp$rotation
# or
  X.svd$u %*% diag(X.svd$d) # UD
# or
  X %*% X.svd$v # XV

Note that the princomp() output differs in sign from the other two: the sign of each principal component is arbitrary.
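With that sign caveat in mind, the prcomp and svd routes can be checked against each other. A minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
X <- scale(USArrests, center = TRUE, scale = TRUE)
X.prcomp <- prcomp(X, scale. = TRUE)
X.svd <- svd(X)

# Compare up to sign, since the sign of each PC is arbitrary.
all.equal(abs(X.prcomp$x), abs(X %*% X.prcomp$rotation),
          check.attributes = FALSE)
all.equal(abs(X.prcomp$x), abs(X.svd$u %*% diag(X.svd$d)),
          check.attributes = FALSE)
```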

How to get eigenvalue/variance explained by each PC?

An eigenvalue > 1 indicates that a PC accounts for more variance than any single one of the original variables in standardized data. This is commonly used as a cutoff (the Kaiser criterion) for deciding which PCs to retain, and it holds only when the data are standardized.
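For standardized data, the eigenvalues are simply the squared sdev values, so the cutoff can be applied directly. A minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
X.prcomp <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix = squared standard deviations of the PCs.
# For standardized data they sum to p, the number of variables.
eigenvalues <- X.prcomp$sdev^2
eigenvalues
# PCs retained under the "eigenvalue > 1" cutoff.
which(eigenvalues > 1)
```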

# get variance explained by each PC and cumulative variance
  X.prcomp.varex <- 100 * X.prcomp$sdev^2 / sum(X.prcomp$sdev^2) # variance explained by each PC
  X.prcomp.cvarex <- cumsum(X.prcomp.varex) # cumulative variance
# scree plot (fviz_eig comes from the factoextra package)
  library(factoextra)
  fviz_eig(X.prcomp, addlabels = TRUE)

How to get eigenvectors?

  X.princomp$loadings # Strictly speaking, these are not the real "loadings". See below.
# or
  X.prcomp$rotation
# or
  X.svd$v

How to get loadings?

These are the coordinates of the features/variables on the principal components.

Loadings are eigenvectors scaled by the square roots of the corresponding eigenvalues, i.e. by the component standard deviations. They are also the covariances/correlations between the original variables and the unit-scaled components.

# use get_pca_var from the factoextra package
  X.princomp.var <- get_pca_var(X.princomp)
# or
  X.princomp$loadings %*% diag(X.princomp$sdev)
# use get_pca_var from the factoextra package
  X.prcomp.var <- get_pca_var(X.prcomp)
# or
  X.prcomp$rotation %*% diag(X.prcomp$sdev)
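The correlation interpretation can be verified in base R, without factoextra. A minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
X <- scale(USArrests, center = TRUE, scale = TRUE)
X.prcomp <- prcomp(X, scale. = TRUE)

# Loadings: eigenvectors (rotation) scaled by the PC standard deviations.
loadings <- X.prcomp$rotation %*% diag(X.prcomp$sdev)

# For standardized data they equal the correlations between the
# original variables and the PC scores.
all.equal(loadings, cor(X, X.prcomp$x), check.attributes = FALSE)
```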

The column sums of squares of the loadings are the variances of the PCs.
The row sums of squares of the loadings are 1s.

If we square the loading matrix elementwise, we get the quality of representation of the variables on the factor map (the cos2 output from the get_pca_var function).


The column sums of the cos2 matrix are the variances of PCs.
The row sums of the cos2 matrix are 1s.
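Both identities can be checked directly in base R. A minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
X.prcomp <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# cos2 is the elementwise square of the loading matrix.
cos2 <- (X.prcomp$rotation %*% diag(X.prcomp$sdev))^2

# Column sums are the variances (eigenvalues) of the PCs.
all.equal(colSums(cos2), X.prcomp$sdev^2, check.attributes = FALSE)
# Row sums are 1 for standardized data.
all.equal(rowSums(cos2), rep(1, ncol(USArrests)), check.attributes = FALSE)
```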

X.prcomp.var$contrib contains the contributions (in percentage) of the variables to the principal components.

# The contribution of a variable to the first principal component is (in percentage) 
X.prcomp.var$cos2[,1] * 100 / sum(X.prcomp.var$cos2[,1]) 
There are several ways to look at the variables that contribute to principal component 1: using the eigenvectors, using the loadings (the same picture, only with a different axis scale, scaled by X.prcomp$sdev[1]), or using the cos2/contrib values.
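The same contributions can be computed for all PCs at once in base R; each column of the result sums to 100. A minimal sketch using the built-in USArrests dataset (an illustration, not part of the original post):

```r
X.prcomp <- prcomp(USArrests, center = TRUE, scale. = TRUE)
cos2 <- (X.prcomp$rotation %*% diag(X.prcomp$sdev))^2

# Contribution (in percent) of each variable to each PC:
# divide each column of cos2 by its column sum.
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100
colSums(contrib) # each column sums to 100
```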

eigenvectors and eigenvalues


Nice illustrations by ttnphns et al.

More vivid explanation


PCA practical tutorials using R