Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a method for reducing the dimensionality of data. It finds the principal components of a dataset with n columns (features) and projects the data onto a subspace with fewer columns, while retaining the essence of the original data.
The first principal component of a dataset is the direction along which the data varies most. In the example shown below, the orange arrow points in the direction with the largest variance.
Steps to find PCA
Collect the data
Normalize the data
Calculate the covariance matrix
Find the eigenvalues and eigenvectors of the covariance matrix
Use the principal components to transform the data, reducing its dimensionality
Manually Calculate Principal Component Analysis
Steps:
Get the original matrix
Note: Transpose the original matrix (rows × columns) to group each column's values together for calculation
Calculate the mean of each column
Subtract each column's mean from the corresponding column of the original matrix to get a centered matrix
Calculate the covariance matrix of the centered matrix
Find eigenvalues and eigenvectors using eigendecomposition
Select k eigenvectors, called principal components, that have the k largest eigenvalues
Find projection
P = B^T · C
where C is the normalized/centered matrix that we wish to project, B^T is the transpose of the matrix whose columns are the chosen principal components, and P is the projection of the original data.
Note: In the following example, we can see that only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.
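The manual steps above can be sketched with NumPy. This is a minimal illustration; the 3×2 matrix `A` below is an assumed example, not data from the original text:

```python
import numpy as np

# Assumed example: a 3x2 data matrix (3 samples, 2 features)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Center each column by subtracting its mean
M = A.mean(axis=0)
C = A - M

# Covariance matrix of the centered data (features x features)
V = np.cov(C.T)

# Eigendecomposition of the covariance matrix
values, vectors = np.linalg.eig(V)

# Sort eigenvectors by descending eigenvalue and keep the top k
k = 1
order = np.argsort(values)[::-1]
B = vectors[:, order[:k]]  # 2x1 matrix of chosen principal components

# Project the centered data: P = B^T . C^T; each column of P is one projected sample
P = B.T.dot(C.T)
print(values[order])  # the first eigenvalue dominates
print(P.T)            # the 3x2 data projected onto a single component
```

Because the second eigenvalue is (numerically) zero for this data, keeping only the first eigenvector loses essentially nothing, matching the note above.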
Principal Component Analysis using the scikit-learn library
PCA can be calculated using the scikit-learn library. Steps:
Define a matrix
Create the PCA instance with number of components as parameter
Fit on data
Access the eigenvectors and eigenvalues via the components_ and explained_variance_ attributes
Project the data with the transform method
Note: In the projection, the value 2.22044605e-16 is very close to 0; it is floating-point rounding error.
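The scikit-learn steps can be sketched as follows. The 3×2 matrix `A` is the same assumed example as in the manual calculation above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed example: a 3x2 data matrix (3 samples, 2 features)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Create the PCA instance with the number of components as parameter
pca = PCA(n_components=2)

# Fit on the data
pca.fit(A)

# Eigenvectors (principal components) and eigenvalues (explained variance)
print(pca.components_)
print(pca.explained_variance_)

# Project the data onto the principal components
P = pca.transform(A)
print(P)
```

The second column of the projection contains values on the order of 1e-16, i.e. effectively zero, which is why a single component suffices here.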