# Principal Component Analysis - PCA

Principal component analysis (PCA) is by far the most popular dimensionality reduction algorithm.

First, it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
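That hyperplane-and-projection idea can be sketched with plain NumPy via SVD; the synthetic data and variable names below are illustrative, not from the text:

```python
import numpy as np

# Minimal PCA-by-SVD sketch on a small synthetic dataset whose 3rd
# feature is correlated with the 1st, so it lies near a 2D hyperplane.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.5 * X[:, 0] + 0.1 * rng.normal(size=100)

X_centered = X - X.mean(axis=0)       # PCA assumes centered data
U, s, Vt = np.linalg.svd(X_centered)  # rows of Vt are the principal axes
W2 = Vt[:2].T                         # matrix of the first two components
X2D = X_centered @ W2                 # project the data onto the hyperplane
print(X2D.shape)  # (100, 2)
```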

##### Using **PCA in sklearn**

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)   # project onto the first two principal components
X2D = pca.fit_transform(X)
```

PCA identifies the axis that accounts for the largest amount of variance in the training set. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of the remaining variance. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on — as many axes as the number of dimensions in the dataset.
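You can inspect how much variance each of these axes accounts for through the fitted estimator's `explained_variance_ratio_` attribute. A short sketch using the Iris dataset (chosen here only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data            # 150 samples, 4 features
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

# Fraction of the dataset's variance lying along each principal axis.
print(pca.explained_variance_ratio_)  # e.g. roughly [0.92, 0.05] for Iris
```

The ratios are sorted in decreasing order, matching the description above: the first axis captures the most variance, the second the most of what remains.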

**PCA For Compression**

After dimensionality reduction, the training set takes up much less space. For example, after applying PCA to the MNIST dataset while preserving 95% of its variance, we are left with 154 features instead of the original 784. So the dataset is now less than 20% of its original size, and we only lost 5% of its variance! This is a reasonable compression ratio, and it's easy to see how such a size reduction would speed up a classification algorithm tremendously.
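One way to sketch this is to pass a float to `n_components`, which tells Scikit-Learn to keep enough components to preserve that fraction of the variance, and to use `inverse_transform` to decompress. The small `digits` dataset (8×8 images, 64 features) stands in here for the full MNIST set:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                 # 1797 samples, 64 features
pca = PCA(n_components=0.95)           # keep 95% of the variance
X_reduced = pca.fit_transform(X)       # compressed representation
X_recovered = pca.inverse_transform(X_reduced)  # back to 64 features (lossy)
print(X.shape, "->", X_reduced.shape)
```

The recovered data has the original shape but is not identical to the original: the 5% of discarded variance is lost for good, which is why this is a lossy compression scheme.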

#### Main Types of PCA

**Randomized PCA**

If you set the `svd_solver` hyperparameter to `"randomized"`, Scikit-Learn uses a stochastic algorithm called randomized PCA that quickly finds an approximation of the first d principal components.

```python
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
```

**Incremental PCA**

Incremental PCA algorithms allow you to split the training set into mini-batches and feed these in one mini-batch at a time.

This is useful for large training sets and applying PCA online.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)  # fit one mini-batch at a time
X_reduced = inc_pca.transform(X_train)
```

For very high-dimensional datasets, PCA can be too slow. As you saw earlier, even if you use randomized PCA its computational complexity is still O(m × d²) + O(d³), so the target number of dimensions d must not be too large. If you are dealing with a dataset with tens of thousands of features or more, then training may become much too slow: in this case, you should consider random projection instead.
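A minimal sketch of that alternative, using Scikit-Learn's `GaussianRandomProjection`; the synthetic data shape (tens of thousands of features) is illustrative only:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10_000))   # a very high-dimensional dataset

# Project onto 154 dimensions using a random matrix: no SVD required,
# just a single matrix multiplication, so it stays fast as d grows.
rp = GaussianRandomProjection(n_components=154, random_state=42)
X_reduced = rp.fit_transform(X)
print(X_reduced.shape)  # (200, 154)
```

Unlike PCA, the projection matrix is generated randomly rather than learned from the data, which is what makes it cheap: there is no dependence on computing principal components at all.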