# Variance, Covariance, and Correlation

### Hong Zheng / 2017-12-08

Variance

Variance is the mean squared deviation of a variable from its mean. Equivalently, it is the difference between the expectation of the squared variable and the square of its expectation.

$Var(x) = \frac {1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2$ =
$E[(x-\bar{x})^2]$ =
$E[x^2 - 2 \times x \times \bar{x} + (\bar{x})^2]$ =
$E[x^2] - 2 \times E[x] \times E[\bar{x}] + E[(\bar{x})^2]$ =
$E[x^2] - (E[x])^2$

Note:

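• $x$ is a vector (an $n \times 1$ matrix).
• $E[\bar{x}] = E[x] = \bar{x}$, because $\bar{x}$ is a constant. This is why the last step collapses: $-2(E[x])^2 + (E[x])^2 = -(E[x])^2$.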
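The identity above can be checked numerically. The following is a minimal sketch in R, assuming a small made-up sample `x` (the values are purely illustrative). Note that R's `var()` divides by $n-1$, so it is rescaled here before being compared with the $\frac{1}{n}$ form used in the derivation.

```r
# Hypothetical sample, chosen only for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Variance from the definition: mean squared deviation from the mean.
var_def <- mean((x - mean(x))^2)

# The same quantity via the shortcut E[x^2] - (E[x])^2.
var_short <- mean(x^2) - mean(x)^2

# var() uses the n - 1 divisor, so rescale it to the 1/n form.
var_r <- var(x) * (n - 1) / n

all.equal(var_def, var_short)  # TRUE
all.equal(var_def, var_r)      # TRUE
```

For this sample all three values equal 4.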

Covariance

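Covariance measures how two variables vary together: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.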

We can rewrite the variance equation as:

$Var(x) =E[xx]−E[x]E[x]$

What if we replace one of the two $x$ with another random variable $y$? Then we would have:

$E[xy]−E[x]E[y]$

which is the definition of covariance between $x$ and $y$: $Cov(x,y)$

It can also be written as $\frac {1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})$

Note: $x$ and $y$ are both vectors ($n \times 1$ matrices) of the same length.
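As a quick check that the two forms of covariance agree, here is a small R sketch. The vectors `x` and `y` are made-up illustrative data, and `cov()` is rescaled because it divides by $n-1$ rather than $n$.

```r
# Hypothetical paired observations, chosen only for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Covariance as the mean product of deviations.
cov_def <- mean((x - mean(x)) * (y - mean(y)))

# Covariance as E[xy] - E[x]E[y].
cov_short <- mean(x * y) - mean(x) * mean(y)

# cov() uses the n - 1 divisor, so rescale it to the 1/n form.
cov_r <- cov(x, y) * (n - 1) / n

all.equal(cov_def, cov_short)  # TRUE
all.equal(cov_def, cov_r)      # TRUE

# Replacing y with x recovers the variance.
all.equal(cov(x, x), var(x))   # TRUE
```

For these vectors the $\frac{1}{n}$ covariance is 1.2.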

Correlation

$Cor(x,y) = \frac {Cov(x,y)} {\sqrt{(Var(x)Var(y))}}$ = $\frac {\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})} {\sqrt{ \sum_{i=1}^{n} (x_{i} - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_{i} - \bar{y})^2}}$

It is the Pearson correlation coefficient between variables $x$ and $y$.

Covariance is just an unstandardized version of correlation. To compute a correlation, we divide the covariance by the product of the standard deviations of the two variables, which removes the units of measurement and bounds the result between $-1$ and $1$. Conversely, a covariance is a correlation expressed in the units of the original variables.

Note: $x$ and $y$ are both vectors ($n \times 1$ matrices) of the same length.
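The standardization described above can be illustrated in R. This is a sketch with made-up data; `x_scaled` is a hypothetical change of units (multiplying by 100), used only to show that covariance depends on units while correlation does not.

```r
# Hypothetical paired observations, chosen only for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation as covariance divided by the product of standard deviations.
# cov() and sd() both use the n - 1 divisor, so the divisors cancel.
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)  # TRUE

# A change of units rescales the covariance but leaves the correlation alone.
x_scaled <- x * 100
all.equal(cov(x_scaled, y), 100 * cov(x, y))  # TRUE
all.equal(cor(x_scaled, y), cor(x, y))        # TRUE
```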

Covariance matrix

$$\left(\begin{array}{cccc} s_{1}^2 & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{2}^2 & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{p}^2 \end{array}\right)$$

• $X$ is an $n \times p$ matrix.
• $p$: number of features
• $n$: number of observations
• $s_{j}^2$ is the variance of the j-th variable: $\frac {1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x_{j}})^2$
• $s_{jk}$ is the covariance between the j-th and k-th variables: $\frac {1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x_{j}})(x_{ik} - \bar{x_{k}})$

In matrix form:
$$S = \frac {1} {n} Xc^TXc$$

or

$$S = \frac {1} {n-1} Xc^TXc$$

• $Xc$ is the centered matrix of $X$: $Xc = X - 1_{n} \bar{X}'$, where $\bar{X}'$ holds the column means of $X$ as a $1 \times p$ matrix and $1_{n}$ is an $n \times 1$ matrix of ones.

$$\left(\begin{array}{cccc} x_{11}-\bar{x_{1}} & x_{12}-\bar{x_{2}} & \cdots & x_{1p}-\bar{x_{p}} \\ x_{21}-\bar{x_{1}} & x_{22}-\bar{x_{2}} & \cdots & x_{2p}-\bar{x_{p}} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}-\bar{x_{1}} & x_{n2}-\bar{x_{2}} & \cdots & x_{np}-\bar{x_{p}} \end{array}\right)$$

• The sum is often divided by $n-1$ instead of $n$, and this is what R does. Dividing by $n-1$ is the usual way to correct for the bias introduced by using the sample mean instead of the true population mean.

Calculate covariance matrix in R:

S <- cov(X)
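To connect the matrix form to `cov()`, the sketch below builds the centered matrix by hand and compares the result with R's built-in. The data matrix `X` and its column names are made up for illustration; the $n-1$ divisor is used because that is what `cov()` uses.

```r
# Hypothetical data: n = 5 observations of p = 3 features.
X <- cbind(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 5, 4, 5),
  c = c(9, 7, 6, 3, 1)
)
n <- nrow(X)

# Centre each column: Xc = X - 1_n %*% (row vector of column means).
ones <- matrix(1, nrow = n, ncol = 1)
Xc <- X - ones %*% t(colMeans(X))

# Covariance matrix in matrix form, with the n - 1 divisor.
S_manual <- t(Xc) %*% Xc / (n - 1)
S_builtin <- cov(X)

all.equal(S_manual, S_builtin, check.attributes = FALSE)  # TRUE
```

In practice `scale(X, center = TRUE, scale = FALSE)` produces the same centered matrix; the explicit product with `ones` is used here only to mirror the formula.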


Correlation matrix

$$\left(\begin{array}{cc} 1 & r_{12} & ... & r_{1p} \\ r_{21} & 1 & ... & r_{2p} \\ ... & ... & ... & ... \\ r_{p1} & r_{p2} & ... & 1 \end{array}\right)$$

where

$r_{jk} = \frac {s_{jk}}{s_{j}s_{k}}$ = $\frac {\sum_{i=1}^{n} (x_{ij} - \bar{x_{j}})(x_{ik} - \bar{x_{k}})} {\sqrt{ \sum_{i=1}^{n} (x_{ij} - \bar{x_{j}})^2} \sqrt{\sum_{i=1}^{n}(x_{ik} - \bar{x_{k}})^2}}$ is the Pearson correlation coefficient between variables $x_{j}$ and $x_{k}$.

In matrix form:
$$R = \frac {1} {n} Xs^TXs$$

or

$$R = \frac {1} {n-1} Xs^TXs$$

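• $Xs = XcD^{-1}$, where $D = diag(s_{1}, \ldots, s_{p})$ is the diagonal scaling matrix of standard deviations. The $s_{j}$ must be computed with the same divisor ($n$ or $n-1$) as the one used in the formula for $R$, so that the diagonal of $R$ equals 1.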

$$\left(\begin{array}{cccc} \frac {x_{11}-\bar{x_{1}}}{s_{1}} & \frac {x_{12}-\bar{x_{2}}}{s_{2}} & \cdots & \frac {x_{1p}-\bar{x_{p}}}{s_{p}} \\ \frac {x_{21}-\bar{x_{1}}}{s_{1}} & \frac {x_{22}-\bar{x_{2}}}{s_{2}} & \cdots & \frac {x_{2p}-\bar{x_{p}}}{s_{p}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac {x_{n1}-\bar{x_{1}}}{s_{1}} & \frac {x_{n2}-\bar{x_{2}}}{s_{2}} & \cdots & \frac {x_{np}-\bar{x_{p}}}{s_{p}} \end{array}\right)$$

Calculate correlation matrix in R:

R <- cor(X)
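The matrix form of the correlation matrix can be checked the same way. This is a sketch with a made-up data matrix `X`; it pairs standard deviations from `sd()` (which uses the $n-1$ divisor) with the $\frac{1}{n-1}$ version of the formula, so that the divisors match.

```r
# Hypothetical data: n = 5 observations of p = 3 features.
X <- cbind(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 5, 4, 5),
  c = c(9, 7, 6, 3, 1)
)
n <- nrow(X)

# Centre the columns, then scale each by its standard deviation: Xs = Xc D^{-1}.
Xc <- scale(X, center = TRUE, scale = FALSE)
s <- apply(X, 2, sd)
Xs <- Xc %*% diag(1 / s)

# Correlation matrix in matrix form, with the n - 1 divisor to match sd().
R_manual <- t(Xs) %*% Xs / (n - 1)
R_builtin <- cor(X)

all.equal(R_manual, R_builtin, check.attributes = FALSE)  # TRUE

# The same result, obtained by standardizing the covariance matrix.
all.equal(cov2cor(cov(X)), R_builtin)                     # TRUE
```

The diagonal of the result is 1 because each standardized column has unit variance.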