Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method for reducing the dimensionality of data. It is the process of finding the principal components of a dataset with n columns (features) and projecting it into a subspace with fewer columns, while retaining the essence of the original data.

The first principal component of a dataset is the direction of highest variance in the data. In the example shown below, the orange arrow points in the direction with the largest variance.

Steps to find PCA

  1. Collect the data

  2. Normalize the data

  3. Calculate the covariance matrix

  4. Find the eigenvalues and eigenvectors of the covariance matrix

  5. Use the principal components to transform the data - Reduce the dimensionality of the data
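The five steps above can be sketched in NumPy. The dataset here is a small hypothetical example (not from the original text); any numeric matrix works the same way.

```python
import numpy as np

# Step 1: collect the data -- a hypothetical 4-sample, 3-feature dataset
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.1],
              [1.9, 2.2, 0.4]])

# Step 2: normalize the data (here: mean-center each column)
Xc = X - X.mean(axis=0)

# Step 3: covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# Step 4: eigenvalues and eigenvectors (eigh suits symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort by descending eigenvalue so the first component carries the most variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the top k components and project the data
k = 2
X_reduced = Xc @ eigvecs[:, :k]
print(X_reduced.shape)  # (4, 2)
```

The variance of the data along each kept component equals the corresponding eigenvalue, which is why the components are sorted before truncating to k.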

Manually Calculate Principal Component Analysis

Steps:

  • Get the original matrix

  • Note: Transpose the original matrix (rows × columns) to group each column's values together for calculation

  • Calculate the mean of each column

  • Subtract each column's mean from that column to get a centered matrix

  • Calculate the covariance matrix of the centered matrix

  • Find eigenvalues and eigenvectors using eigendecomposition

    • Select k eigenvectors, called principal components, that have the k largest eigenvalues

  • Find the projection P = B^T . C

    Where C is the normalized/centered matrix that we wish to project, B^T is the transpose of the chosen principal components, and P is the projection of A.

Note: In the following example, we can see that only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.
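These manual steps can be reproduced with NumPy. The 3×2 matrix below is a hypothetical stand-in for the example data; with it, one eigenvalue dominates and the other is essentially zero, matching the note above.

```python
import numpy as np

# Hypothetical original 3x2 matrix A (3 samples, 2 features)
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# Mean of each column, then subtract it to get the centered matrix C
M = A.mean(axis=0)
C = A - M

# Covariance of the centered matrix (features as rows, hence the transpose)
V = np.cov(C.T)

# Eigendecomposition of the covariance matrix
values, vectors = np.linalg.eig(V)
print(values)  # one eigenvalue dominates; the other is ~0

# Projection P = B^T . C, with B holding the eigenvectors as columns
P = vectors.T.dot(C.T)
print(P.T)
```

Because the second eigenvalue is ~0, keeping only the first eigenvector projects the data onto a single column with little loss, as the note describes.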

Principal Component Analysis using the scikit-learn library

PCA can be calculated using the scikit-learn library. Steps:

  • Define a matrix

  • Create the PCA instance with number of components as parameter

  • Fit on data

  • Access the eigenvectors and eigenvalues via the fitted model's components_ and explained_variance_ attributes

  • Project the data

    Note: In the projection, the value 2.22044605e-16 is floating-point rounding error and is effectively 0.
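The steps above can be sketched as follows. The 3×2 matrix is a hypothetical example chosen so the second projected column collapses to ~0 (values on the order of 2.22e-16), illustrating the note above.

```python
from numpy import array
from sklearn.decomposition import PCA

# Define a hypothetical 3x2 matrix
A = array([[1, 2], [3, 4], [5, 6]])

# Create the PCA instance, keeping 2 components
pca = PCA(2)

# Fit on the data
pca.fit(A)

# Eigenvectors (principal components) and eigenvalues (explained variance)
print(pca.components_)
print(pca.explained_variance_)

# Project the data onto the principal components
B = pca.transform(A)
print(B)
```

Because this data is perfectly collinear, the second component explains no variance, so the data could be kept as a single column with no loss.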
