Principal Component Analysis (PCA) is a method for obtaining a low-dimensional representation of a high-dimensional dataset without losing its important features. PCA can be useful for visualizing a dataset or for extracting features from data. More formally, given a dataset of $n$ original $d$-dimensional points $X = \{x_1,\dots,x_n\} \in \mathbb{R}^{n\times d}$, we want to obtain $n$ new $k$-dimensional points $Z = \{z_1, \dots, z_n\} \in \mathbb{R}^{n\times k}$ with $k\ll d$, such that the projection back onto the original dimension $d$ ($\bar X = \{\bar x_1,\dots,\bar x_n\} \in \mathbb{R}^{n\times d}$) is as close as possible to the original data $X$.
The SVD of a matrix $A$ is a factorization $$A = USV^\top$$ where $U,V$ are orthogonal matrices and $S$ is a diagonal matrix with the singular values on its diagonal. We can reorder the singular values (together with the corresponding singular vectors) such that $s_1 \geq s_2 \geq \dots \geq s_d \geq 0$; the eigenvalues $\lambda_i = s_i^2$ of $A^\top A$ then inherit the same ordering $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d \geq 0$. That will be important for the next step.
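Before moving on, here is a minimal numerical sketch of these properties using np.linalg.svd (the matrix A below is just a small random example, not part of the derivation):

import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))   # small random matrix, purely for illustration
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))   # A = U S V^T
print(np.allclose(U.T @ U, np.eye(3)))       # the columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # V is orthogonal
print(np.all(np.diff(s) <= 0))               # singular values come back in decreasing order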
Hence, plugging the SVD $X = USV^\top$ into the (scaled) covariance matrix $n\Sigma = X^\top X$ gives the following decomposition.
$$
n \Sigma={X}^T {X}={V} {S}^T \underbrace{{U}^T {U}}_{=I\ (\text{orthogonal})} {S} {V}^T={V} {S}^T {S} {V}^T={V} {D} {V}^T
$$
Here $D = S^\top S$ is diagonal with entries $\lambda_i = s_i^2$, so the columns of $V$ are exactly the eigenvectors of $X^\top X$. And so (assuming the decreasing order of the eigenvalues) the solution for $k=1$ is given by $v_1$, the first column of $V = \begin{pmatrix} v_1 | \dots |v_d \end{pmatrix}$ from the SVD, i.e. the eigenvector corresponding to the largest eigenvalue. This is the direction that captures the most variance of, and hence the most information about, the dataset.
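To make this correspondence concrete, here is a small numerical sketch (X_demo below is just an illustrative stand-in, not the dataset generated later): the right singular vectors of a data matrix are, up to sign, the eigenvectors of $X^\top X$, and the eigenvalues are the squared singular values.

import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))                       # 100 points in 4 dimensions

_, s, Vt = np.linalg.svd(X_demo, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X_demo.T @ X_demo)     # eigh returns eigenvalues in increasing order

print(np.allclose(s**2, eigvals[::-1]))                     # squared singular values = eigenvalues of X^T X
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, ::-1].T)))  # right singular vectors = eigenvectors (up to sign)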
In the case of PCA for $k\geq1$, the problem generalizes to
$$
(W^*,Z) = \arg\min_{W^\top W = I, Z} \sum_{i=1}^n \|Wz_i - x_i \|^2_2
$$
We can solve it analogously to $k=1$.
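For completeness (this is the standard closed-form result, stated in the notation of the derivation above): assuming the eigenvalues are sorted in decreasing order, the solution is

$$
W^* = \begin{pmatrix} v_1 | \dots | v_k \end{pmatrix}, \qquad z_i = {W^*}^\top x_i, \qquad \bar x_i = W^* z_i ,
$$

i.e. $W^*$ collects the eigenvectors belonging to the $k$ largest eigenvalues of $X^\top X$, and $\bar x_i$ is the orthogonal projection of $x_i$ onto their span. This is exactly what the code below implements.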
generate_points_line() takes 4 arguments: min_x and max_x determine the range of the x-axis, deviation allows a point to lie a little further away from the $y=x$ line, and each 2-dimensional point is added to the dataset with probability p.
import random

def generate_points_line(min_x, max_x, deviation, p):
    points = list()
    for x in range(min_x, max_x + 1):
        for y in range(min_x, max_x + 1):
            if x - deviation < y < x + deviation:   # keep only points close to the y = x line
                if random.random() < p:             # accept each candidate point with probability p
                    points.append((x, y))
    return points
X = generate_points_line(0, 120, 50, 0.023)
Using the parameters min_x = 0, max_x = 120, deviation = 50 and p = 0.023, I generated the following dataset $X$.
import numpy as np

X = np.array(X, dtype=float)
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)  # standardize the data: zero mean and unit variance per coordinate
If we plot the data now, we notice that the plot looks exactly the same; only the axes have shifted and the distances between points have shrunk.
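A quick sanity check that the standardization did what we expect, namely zero mean and unit variance in each coordinate:

print(np.allclose(np.mean(X, axis=0), 0))   # True
print(np.allclose(np.std(X, axis=0), 1))    # True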
def cov_of_standardized(X):
    n = X.shape[0]               # number of data points
    return np.dot(X.T, X) / n    # covariance of standardized (zero-mean) data
cov_X = cov_of_standardized(X)
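Since the data has been standardized, cov_X should coincide with NumPy's built-in (biased) covariance estimate and with the correlation matrix; a quick cross-check:

print(np.allclose(cov_X, np.cov(X, rowvar=False, bias=True)))   # same as dividing X^T X by the number of points
print(np.allclose(cov_X, np.corrcoef(X, rowvar=False)))         # for standardized data, covariance = correlation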
For this particular dataset, the covariance matrix looks like this:
$$
\mathrm{Cov}_X = \begin{bmatrix}
1 & 0.62202961 \\
0.62202961 & 1
\end{bmatrix}
$$
U, s, Vt = np.linalg.svd(cov_X)  # for a symmetric PSD matrix the SVD coincides with its eigendecomposition; see https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
print(s)
print(U)
returns
[1.62202961 0.37797039] # eigenvalues, in decreasing order
[[-0.70710678 -0.70710678] # the eigenvectors are the *columns* of U;
 [-0.70710678  0.70710678]] # the first column belongs to the largest eigenvalue
What we just did might seem a little abstract, so let's look at the eigenvectors (these are the principal components) in the plane, each scaled by its eigenvalue; a small plotting sketch follows below. The blue vector is the [-0.70710678 0.70710678] component and the red one corresponds to the [-0.70710678 -0.70710678] component.
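One way to draw them, assuming matplotlib (the styling of the original figure may differ):

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=5, alpha=0.5)
# draw each principal component (column of U) scaled by its eigenvalue
plt.quiver(0, 0, *(s[1] * U[:, 1]), color="blue", angles="xy", scale_units="xy", scale=1)
plt.quiver(0, 0, *(s[0] * U[:, 0]), color="red", angles="xy", scale_units="xy", scale=1)
plt.gca().set_aspect("equal")
plt.show()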
Since the red vector corresponds to the largest eigenvalue, it is the principal component we are after (i.e., we want [-0.70710678, -0.70710678] to be extracted).
In the following snippet I implemented exactly that:
def k_most_eigenvectors(mat, k):
    # https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
    u, s, vt = np.linalg.svd(mat)
    indices = np.argsort(s)            # indices sorted from *smallest* to largest eigenvalue
    indices = np.flip(indices, 0)[:k]  # reverse the order and keep the k largest
    W = u[:, indices]                  # the corresponding eigenvectors are the columns of u
    return W
W = k_most_eigenvectors(cov_X, 1) # extract the single most significant eigenvector of the covariance of X
W
And indeed, what is returned is the expected eigenvector.
[-0.70710678, -0.70710678]
def fit_pca(X, W):
    return np.dot(X, W)    # project the data onto the principal components: Z = XW

Z = fit_pca(X, W)
X_bar = np.dot(Z, W.T)     # map back into the original d-dimensional space; each row is the reconstruction of the corresponding x_i
And now we're done! We can plot the reconstructed points $(\bar x_1, \dots, \bar x_n)$ in orange against the original points in blue to visualize what the one-dimensional representation $Z$ preserves.
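One way to produce such a plot, again assuming matplotlib (the original figure may have been generated differently):

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=5, color="blue", label="original (standardized) data")
plt.scatter(X_bar[:, 0], X_bar[:, 1], s=5, color="orange", label="reconstruction from the first principal component")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()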