# Principal Components Analysis

### Hong Zheng / 2017-12-11

Denote the data matrix by $X$. It is an $n \times p$ matrix with the $n$ individuals/observations as rows and the $p$ features/variables as columns.

First, center $X$ (necessary) and optionally scale it (not required; it depends on the data structure), so that the column means are 0 and, if scaled, the column variances are 1.

X <- scale(X, center = TRUE, scale = TRUE)
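As a quick sanity check of the centering and scaling step, the sketch below uses the built-in `USArrests` data set as a stand-in for $X$ (an assumption made only for illustration; the original text does not name a data set) and confirms that the column means and variances come out as described.

```r
# Assumption: USArrests (shipped with R) stands in for the data matrix X.
X <- scale(as.matrix(USArrests), center = TRUE, scale = TRUE)

# After scaling, column means are numerically zero and column variances are one.
col_means <- colMeans(X)
col_vars  <- apply(X, 2, var)

print(round(col_means, 10))
print(col_vars)
```

Note that `scale()` standardizes with the sample standard deviation (divisor $n-1$), which is why `var()` returns exactly one here.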


Use any of the following three functions in R to perform PCA.

# The centering and scaling options are still specified, although they are not needed here because X has already been centered and scaled.
X.princomp = princomp(X, cor = TRUE, scores = TRUE)
X.prcomp = prcomp(X, scale. = TRUE)
X.svd = svd(scale(X, center = TRUE, scale = TRUE))

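The relevant components of each function's output are:

| | princomp() | prcomp() | svd() |
|---|---|---|---|
| standard deviations of principal components | sdev | sdev | $\sqrt{D^2/(n-1)}$ |
| principal components | scores | x | `U %*% D` |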
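The correspondence between the three outputs can be checked numerically. The following is a minimal sketch, again assuming `USArrests` as a stand-in for $X$; it compares the standard deviations from the three approaches and the prcomp scores with $UD$. Scores are compared in absolute value because the sign of each component is arbitrary.

```r
# Assumption: USArrests stands in for the data matrix X.
X <- scale(as.matrix(USArrests))
n <- nrow(X)

X.princomp <- princomp(X, cor = TRUE, scores = TRUE)
X.prcomp   <- prcomp(X, scale. = TRUE)
X.svd      <- svd(X)

# Standard deviations of the principal components, one row per method.
sdev_table <- rbind(
  princomp = unname(X.princomp$sdev),
  prcomp   = X.prcomp$sdev,
  svd      = sqrt(X.svd$d^2 / (n - 1))
)
print(sdev_table)

# Scores from prcomp versus U %*% D, compared in absolute value.
scores_svd     <- X.svd$u %*% diag(X.svd$d)
max_score_diff <- max(abs(abs(unname(X.prcomp$x)) - abs(scores_svd)))
print(max_score_diff)
```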

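Note:

• In the svd approach,
  • if p>n, $X_{n \times p}=U_{n \times n}D_{n \times n}V_{p \times n}'$;
  • if p<n, $X_{n \times p}=U_{n \times p}D_{p \times p}V_{p \times p}'$.
• $U$ and $V$ have the following properties: $V'V=I$, $U'U=I$; that is, their column sums of squares are ones.
• Because $X$ is centered, the columns of $U$ that correspond to nonzero singular values have mean zero and therefore all have the same variance, $\frac{1}{n-1}$.
• loadings in the princomp output, rotation in the prcomp output, and v in the svd output are the matrix of variable loadings (columns are eigenvectors). Their column sums of squares are ones; the row sums of squares are also ones when the matrix is square (all $p$ components kept, p<n).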
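The properties listed in these notes can be verified directly for the p<n case. This is a small sketch under the assumption that `USArrests` (n = 50, p = 4) plays the role of $X$; the object names are illustrative.

```r
# Assumption: USArrests stands in for the data matrix X (here p < n).
X <- scale(as.matrix(USArrests))
n <- nrow(X)
p <- ncol(X)
s <- svd(X)

# Dimensions: U is n x p, D has p singular values, V is p x p.
print(dim(s$u)); print(length(s$d)); print(dim(s$v))

# Orthonormal columns: U'U = I and V'V = I.
UtU <- crossprod(s$u)
VtV <- crossprod(s$v)

# Every column of U has variance 1 / (n - 1) because X is centered.
u_vars <- apply(s$u, 2, var)
print(u_vars)

# X is recovered as U D V'.
reconstruction_error <- max(abs(s$u %*% diag(s$d) %*% t(s$v) - X))
print(reconstruction_error)
```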

### How to get new principal components?

These are the coordinates of the individuals/observations on the principal components.

• Using princomp
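  X.princomp$scores
  # or X %*% X.princomp$loadings, which matches up to a factor of sqrt(n/(n-1)),
  # because princomp standardizes with divisor n while scale() uses n-1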

• Using prcomp
  X.prcomp$x # or X %*% X.prcomp$rotation

• Using svd
  X.svd$u %*% diag(X.svd$d) # UD
  # or
  X %*% X.svd$v # XV

The princomp output may differ in sign from the other two.

• Relevant plots

# scatter plot of individuals on PC1 vs. PC2
plot(X.prcomp$x[,1], X.prcomp$x[,2], pch = 16)
# use the factoextra package. color individuals by their groups
# (sample.group holds one group label per row of X).
fviz_pca_ind(X.prcomp, geom.ind = "point", col.ind = as.factor(sample.group), palette = "Dark2", repel = FALSE, addEllipses = TRUE)
# use the factoextra package. color individuals by their cos2 values.
# "cos2" is passed as a string; ellipses are drawn per group, so addEllipses is left out here.
fviz_pca_ind(X.prcomp, geom.ind = "point", col.ind = "cos2", gradient.cols = "Dark2", repel = FALSE)

### How to get eigenvalue/variance explained by each PC?

An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff for deciding which PCs to retain. It holds only when the data are standardized.

• Using princomp
  apply(X.princomp$scores, 2, var) # a bit inaccurate: off by a factor of n/(n-1), since princomp uses divisor n
  # or
  X.princomp$sdev^2

• Using prcomp
  apply(X.prcomp$x, 2, var)
  # or
  X.prcomp$sdev^2

• Using svd
  apply(X.svd$u %*% diag(X.svd$d), 2, var)
  # or
  X.svd$d^2/(nrow(X)-1)
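To make the eigenvalue relations concrete, here is a hedged sketch using `USArrests` as an assumed stand-in for $X$. It shows that the eigenvalues from prcomp, svd, and the correlation matrix agree, that `var()` of the princomp scores differs from `sdev^2` by the factor $n/(n-1)$, and how the eigenvalue > 1 cutoff can be applied.

```r
# Assumption: USArrests stands in for the data matrix X.
X <- scale(as.matrix(USArrests))
n <- nrow(X)

X.princomp <- princomp(X, cor = TRUE)
X.prcomp   <- prcomp(X, scale. = TRUE)
X.svd      <- svd(X)

# Eigenvalues (variances of the PCs) obtained three ways.
eig_prcomp <- X.prcomp$sdev^2
eig_svd    <- X.svd$d^2 / (n - 1)
eig_cor    <- eigen(cor(X))$values
print(rbind(prcomp = eig_prcomp, svd = eig_svd, cor = eig_cor))

# var() divides by n - 1, princomp standardizes with divisor n,
# so the variance of its scores is inflated by n / (n - 1).
var_scores_princomp <- unname(apply(X.princomp$scores, 2, var))
ratio <- var_scores_princomp / unname(X.princomp$sdev^2)
print(ratio)

# Components whose eigenvalue exceeds 1 (valid for standardized data only).
keep <- which(eig_prcomp > 1)
print(keep)
```

For standardized data the eigenvalues sum to $p$, so an eigenvalue of 1 corresponds to the variance of a single original variable.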

• Relevant plots

# get variance explained by each PC and cumulative variance
get_eigenvalue(X.prcomp) # from the factoextra package
X.prcomp.varex = 100*X.prcomp$sdev^2/sum(X.prcomp$sdev^2) # variance explained by each PC (%)
X.prcomp.cvarex = cumsum(X.prcomp.varex) # cumulative variance (%)
# plots
screeplot(X.prcomp)
fviz_screeplot(X.prcomp) # from the factoextra package
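The percentage and cumulative percentage of variance can also be computed in base R alone. The sketch below assumes `USArrests` as the data and uses a hypothetical 80% threshold (not from the original text) to illustrate choosing the number of components.

```r
# Assumption: USArrests stands in for the data matrix X.
X.prcomp <- prcomp(scale(as.matrix(USArrests)))

eig    <- X.prcomp$sdev^2
varex  <- 100 * eig / sum(eig)   # variance explained by each PC (%)
cvarex <- cumsum(varex)          # cumulative variance (%)

# The explicit loop gives the same result as cumsum().
cvarex_loop <- sapply(seq_along(varex), function(i) sum(varex[1:i]))

print(round(cbind(eigenvalue = eig, percent = varex, cumulative = cvarex), 2))

# Hypothetical rule: smallest number of PCs reaching at least 80% of the variance.
n_keep <- which(cvarex >= 80)[1]
print(n_keep)
```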


### How to get eigenvectors?

• Using princomp
  X.princomp$loadings # Strictly speaking, they are not the real "loadings". See below.

• Using prcomp
  X.prcomp$rotation
  # look at variables that contribute to principal component 1
  barplot(X.prcomp$rotation[,1])

• Using svd
  X.svd$v
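As an illustration, the eigenvectors returned by the three functions can be compared with those of the correlation matrix obtained from `eigen()`. This sketch assumes `USArrests` as the data; the helper `max_abs_diff` is a name introduced here, and the comparison is done in absolute value because eigenvectors are only defined up to sign.

```r
# Assumption: USArrests stands in for the data matrix X.
X <- scale(as.matrix(USArrests))

V_princomp <- unclass(princomp(X, cor = TRUE)$loadings)
V_prcomp   <- prcomp(X, scale. = TRUE)$rotation
V_svd      <- svd(X)$v
V_eigen    <- eigen(cor(X))$vectors

# Hypothetical helper: largest entrywise difference, ignoring signs.
max_abs_diff <- function(A, B) max(abs(abs(unname(A)) - abs(unname(B))))

diffs <- c(
  princomp = max_abs_diff(V_princomp, V_eigen),
  prcomp   = max_abs_diff(V_prcomp, V_eigen),
  svd      = max_abs_diff(V_svd, V_eigen)
)
print(diffs)

# Column and row sums of squares are ones (V is square and orthogonal here).
col_ss <- colSums(V_prcomp^2)
row_ss <- rowSums(V_prcomp^2)
print(col_ss); print(row_ss)
```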


### How to get loadings?

These are the coordinates of the features/variables on the principal components.

Loadings are the elements of the unstandardized eigenvectors, i.e. eigenvectors scaled by the standard deviations of the corresponding components (the square roots of the eigenvalues). They are also the covariances/correlations between the original variables and the unit-scaled components.

• Using princomp
  # use get_pca_var from the factoextra package
  X.princomp.var <- get_pca_var(X.princomp)
  X.princomp.var$coord
  # or
  X.princomp$loadings %*% diag(X.princomp$sdev)

• Using prcomp
  # use get_pca_var from the factoextra package
  X.prcomp.var <- get_pca_var(X.prcomp)
  X.prcomp.var$coord
  # or
  X.prcomp$rotation %*% diag(X.prcomp$sdev)


The column sums of squares of the loadings are the variances of the PCs.
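Both statements about the loadings can be checked without factoextra. The following base-R sketch, which assumes `USArrests` as the data, builds the loadings as eigenvectors times component standard deviations, compares them with the correlations between the original variables and the scores, and confirms the column sums of squares.

```r
# Assumption: USArrests stands in for the data matrix X.
X <- scale(as.matrix(USArrests))
X.prcomp <- prcomp(X)

# Loadings: each eigenvector scaled by the standard deviation of its component.
loadings <- X.prcomp$rotation %*% diag(X.prcomp$sdev)
dimnames(loadings) <- dimnames(X.prcomp$rotation)
print(round(loadings, 3))

# For standardized data these equal the correlations between variables and PCs.
cor_var_pc <- cor(X, X.prcomp$x)

# Column sums of squares of the loadings are the variances of the PCs.
col_ss <- colSums(loadings^2)
print(rbind(col_ss = col_ss, eigenvalue = X.prcomp$sdev^2))
```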

If we square the loading matrix element-wise, we get the quality of representation of the variables on the factor map (the cos2 output of the get_pca_var function).

all.equal(X.prcomp.var$cos2,X.prcomp.var$coord^2)


The column sums of the cos2 matrix are the variances of the PCs.
The row sums of the cos2 matrix are 1s when the data are standardized and all PCs are kept.
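The row and column sums of the cos2 matrix can be reproduced in base R by squaring the loadings. This is a sketch that assumes `USArrests` as the data and that cos2 equals the squared loadings, as stated above; it does not call factoextra.

```r
# Assumption: USArrests stands in for the data matrix X.
X.prcomp <- prcomp(scale(as.matrix(USArrests)))

# cos2 as the element-wise square of the loadings (rotation scaled by sdev).
cos2 <- (X.prcomp$rotation %*% diag(X.prcomp$sdev))^2
dimnames(cos2) <- dimnames(X.prcomp$rotation)
print(round(cos2, 3))

# Row sums: one per variable (standardized data, all PCs kept).
row_sums <- rowSums(cos2)
# Column sums: the variance (eigenvalue) of each PC.
col_sums <- colSums(cos2)
print(row_sums); print(col_sums)
```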

X.prcomp.var$contrib contains the contributions (in percentage) of the variables to the principal components.

# The contribution of a variable to the first principal component is (in percentage)
X.prcomp.var$cos2[,1] * 100 / sum(X.prcomp.var$cos2[,1])

• Relevant plots

# look at variables that contribute to principal component 1
# use eigenvectors
barplot(X.prcomp$rotation[,1])
# use loadings. Same as above, only with a different axis scale (scaled by X.prcomp$sdev[1])
barplot(X.prcomp.var$coord[,1])
# use cos2/contrib
barplot(X.prcomp.var$cos2[,1])
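The contribution formula above can be applied to every component at once. The sketch below is a base-R version under the assumption that `USArrests` is the data and that cos2 is the squared loading matrix; it also shows that, with this definition, the contributions reduce to 100 times the squared eigenvector entries.

```r
# Assumption: USArrests stands in for the data matrix X.
X.prcomp <- prcomp(scale(as.matrix(USArrests)))

cos2 <- (X.prcomp$rotation %*% diag(X.prcomp$sdev))^2

# Contribution (%) of each variable to each PC: cos2 divided by its column total.
contrib <- 100 * sweep(cos2, 2, colSums(cos2), "/")
dimnames(contrib) <- dimnames(X.prcomp$rotation)
print(round(contrib, 2))

# Dividing by the column total cancels the eigenvalue, leaving the squared eigenvectors.
contrib_from_rotation <- 100 * X.prcomp$rotation^2
```

Each column of `contrib` sums to 100, so the values can be read as shares of a component attributable to each variable.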


### Eigenvectors and eigenvalues

http://setosa.io/ev/eigenvectors-and-eigenvalues/

### More vivid explanation

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/