Principal Component Analysis (PCA) is a method for reducing the dimensionality of data. It finds the principal components of a dataset with n columns (features) and projects the data onto a subspace with fewer columns, while retaining the essence of the original data.
The first principal component of a dataset is the direction along which the data varies the most. In the accompanying figure, the orange arrow points in the direction with the largest variance.
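To make this concrete, here is a minimal sketch (plain NumPy, reusing the 3×2 matrix from the worked example below) that measures the variance of the data after projecting it onto a few candidate unit directions; the direction that maximizes the projected variance is the first principal component.
import numpy as np
# Same 3x2 matrix as the worked example below
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Xc = X - X.mean(axis=0)  # center each column
# Variance of the data projected onto candidate unit directions
for angle in (0, 30, 45, 60, 90):
    theta = np.radians(angle)
    d = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    print(angle, Xc.dot(d).var(ddof=1))
# The 45-degree direction yields the largest variance (8.0),
# matching the first eigenvalue computed in the example below.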
Steps to compute PCA
Collect the data
Normalize the data (centering at minimum; see the sketch after this list)
Calculate the covariance matrix
Find the eigenvalues and eigenvectors of the covariance matrix
Use the principal components to transform the data, reducing its dimensionality
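On the normalization step, a minimal sketch (plain NumPy, same matrix as the example below): the worked example normalizes by centering only, i.e. subtracting the column means; fully standardizing would also divide by each column's standard deviation, which matters when features are on very different scales.
import numpy as np
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Centering only: subtract each column's mean (what the example below does)
centered = A - A.mean(axis=0)
print(centered)
# Full standardization: also divide by each column's standard deviation;
# useful when the features live on very different scales
standardized = centered / A.std(axis=0, ddof=1)
print(standardized)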
Manually Calculate Principal Component Analysis
Steps:
Get the original matrix
Note: Transpose the original matrix (row × column) to group the columns together for calculation
Calculate the mean of each grouped column
Subtract the column means from the original matrix to get a centered matrix
Calculate the covariance of the centered matrix (see the covariance identity sketched after this list)
Find eigenvalues and eigenvectors using eigendecomposition
Select k eigenvectors, called principal components, that have the k largest eigenvalues
Find the projection P = B^T . C
Where C is the normalized/centered matrix that we wish to project, B^T is the transpose of the chosen principal components, and P is the projection of the original matrix A.
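As a quick sanity check for the covariance step (a sketch using the same centered, transposed matrix as the example below): when each row of C holds one centered feature, the covariance matrix equals C . C^T divided by n - 1, where n is the number of observations.
import numpy as np
# Centered matrix, transposed so that each row is one feature
C = np.array([[-2.0, 0.0, 2.0],
              [-2.0, 0.0, 2.0]])
n = C.shape[1]  # number of observations
# np.cov treats rows as variables, so both lines print [[4. 4.] [4. 4.]]
print(np.cov(C))
print(C.dot(C.T) / (n - 1))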
Note: In the following example, the second eigenvalue is zero, so only the first eigenvector is required; we could therefore project our 3×2 matrix down to a 3×1 matrix with no loss (see the sketch after the output below).
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig

A = array([[1, 2], [3, 4], [5, 6]])
print('Printing original A')
print(A)

# Regroup column/feature i and column/feature ii together by transposing
# feature i: 1, 3, 5
# feature ii: 2, 4, 6
print('------------------')
print('Printing A.T which regroups column data together')
print(A.T)

# calculate the mean of each row (axis=1), i.e. the transposed/grouped columns
M = mean(A.T, axis=1)
print('------------------')
print('Printing mean')
print(M)

# center the columns by subtracting the column means, then transpose the
# centered matrix to regroup column/feature i and column/feature ii together
C = (A - M).T
print('------------------')
print('Printing normalized/centered matrix that was transposed to regroup column data together')
print(C)

# calculate the covariance of the normalized/centered matrix
print('------------------')
print('Printing covariance of matrix')
V = cov(C)
print(V)

# eigendecomposition of the covariance matrix
values, vectors = eig(V)
print('------------------')
print('Printing eigenvalues')
print(values)
print('------------------')
print('Printing eigenvectors')
print(vectors)

# project the data on the centered matrix; the projection, like all the
# calculations above, is done with the columns grouped together
# Note: the final matrix has to be transposed back to its row x column form
P = vectors.T.dot(C)
print('------------------')
print('Printing projection')
print(P.T)
Printing original A
[[1 2]
[3 4]
[5 6]]
------------------
Printing A.T which regroups column data together
[[1 3 5]
[2 4 6]]
------------------
Printing mean
[3. 4.]
------------------
Printing normalized/centered matrix that was transposed to regroup column data together
[[-2. 0. 2.]
[-2. 0. 2.]]
------------------
Printing covariance of matrix
[[4. 4.]
[4. 4.]]
------------------
Printing eigenvalues
[8. 0.]
------------------
Printing eigenvectors
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
------------------
Printing projection
[[-2.82842712 0. ]
[ 0. 0. ]
[ 2.82842712 0. ]]
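To illustrate the note above, here is a minimal sketch (reusing the eigenvectors and centered matrix from the example) that keeps only the first eigenvector as B, so the 3×2 matrix is projected down to a single column; because the second eigenvalue is zero, nothing is lost in this case.
import numpy as np
vectors = np.array([[0.70710678, -0.70710678],
                    [0.70710678,  0.70710678]])
C = np.array([[-2.0, 0.0, 2.0],
              [-2.0, 0.0, 2.0]])
# Keep only the first eigenvector (largest eigenvalue) as B
B = vectors[:, :1]  # shape (2, 1)
P = B.T.dot(C)      # P = B^T . C, shape (1, 3)
print(P.T)          # [[-2.82842712] [ 0. ] [ 2.82842712]]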
Principal Component Analysis using the scikit-learn library
PCA can also be computed with the scikit-learn library.
Steps:
Define a matrix
Create the PCA instance with the number of components as a parameter
Fit it on the data
Access the eigenvalues and eigenvectors (explained_variance_ and components_)
Project data
Note: In the projection, the value 2.22044605e-16 is floating-point round-off and is effectively 0.
from numpy import array
from sklearn.decomposition import PCA

A = array([[1, 2], [3, 4], [5, 6]])
print('------------------')
print('Printing A')
print(A)

# create the PCA instance with the number of components as parameter
pca = PCA(2)
# fit on the data
pca.fit(A)

# access the eigenvalues and eigenvectors
print('------------------')
print('Printing eigenvectors')
print(pca.components_)
print('------------------')
print('Printing eigenvalues')
print(pca.explained_variance_)

# note: the eigenvector matrix printed in the previous manual calculation
# was the transpose of this one
print('------------------')
print('Printing eigenvectors.transpose')
print(pca.components_.T)

# project/transform the data
P = pca.transform(A)
print('------------------')
print('Printing projection')
print(P)
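As a sketch of the same reduction in scikit-learn, asking for a single component projects the 3×2 matrix down to one column directly; the sign of a principal component is arbitrary, so the output may be flipped relative to the manual calculation.
from numpy import array
from sklearn.decomposition import PCA

A = array([[1, 2], [3, 4], [5, 6]])

# Keep only one component: the 3x2 matrix becomes 3x1
pca = PCA(n_components=1)
P = pca.fit_transform(A)
print(P)
# One component captures all of the variance in this dataset
print(pca.explained_variance_ratio_)  # [1.]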