# Principal Component Analysis - PCA

Principal component analysis (PCA) is by far the most popular dimensionality reduction algorithm.

First, it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
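That hyperplane-and-projection idea can be sketched with plain NumPy via SVD; the synthetic data and variable names below are illustrative, not from the text:

```python
import numpy as np

# Minimal PCA-by-SVD sketch on a small synthetic dataset whose 3rd
# feature is correlated with the 1st, so it lies near a 2D hyperplane.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.5 * X[:, 0] + 0.1 * rng.normal(size=100)

X_centered = X - X.mean(axis=0)       # PCA assumes centered data
U, s, Vt = np.linalg.svd(X_centered)  # rows of Vt are the principal axes
W2 = Vt[:2].T                         # matrix of the first two components
X2D = X_centered @ W2                 # project the data onto the hyperplane
print(X2D.shape)  # (100, 2)
```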

##### Using **PCA in sklearn**

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)   # project onto the first two principal components
X2D = pca.fit_transform(X)
```

PCA identifies the axis that accounts for the largest amount of variance in the training set. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of the remaining variance. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on — as many axes as the number of dimensions in the dataset.
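You can inspect how much variance each of these axes accounts for through the fitted estimator's `explained_variance_ratio_` attribute. A short sketch using the Iris dataset (chosen here only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data            # 150 samples, 4 features
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

# Fraction of the dataset's variance lying along each principal axis.
print(pca.explained_variance_ratio_)  # e.g. roughly [0.92, 0.05] for Iris
```

The ratios are sorted in decreasing order, matching the description above: the first axis captures the most variance, the second the most of what remains.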

**PCA For Compression**

After dimensionality reduction, the training set takes up much less space. For example, after applying PCA to the MNIST dataset while preserving 95% of its variance, we are left with 154 features instead of the original 784. So the dataset is now less than 20% of its original size, and we only lost 5% of its variance! This is a reasonable compression ratio, and it's easy to see how such a size reduction would speed up a classification algorithm tremendously.
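One way to sketch this is to pass a float to `n_components`, which tells Scikit-Learn to keep enough components to preserve that fraction of the variance, and to use `inverse_transform` to decompress. The small `digits` dataset (8×8 images, 64 features) stands in here for the full MNIST set:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                 # 1797 samples, 64 features
pca = PCA(n_components=0.95)           # keep 95% of the variance
X_reduced = pca.fit_transform(X)       # compressed representation
X_recovered = pca.inverse_transform(X_reduced)  # back to 64 features (lossy)
print(X.shape, "->", X_reduced.shape)
```

The recovered data has the original shape but is not identical to the original: the 5% of discarded variance is lost for good, which is why this is a lossy compression scheme.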

#### Main Types of PCA

**Randomized PCA**

If you set the `svd_solver` hyperparameter to `"randomized"`, Scikit-Learn uses a stochastic algorithm called randomized PCA that quickly finds an approximation of the first d principal components.

```python
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
```

**Incremental PCA**

Incremental PCA algorithms allow you to split the training set into mini-batches and feed these in one mini-batch at a time.

This is useful for large training sets and applying PCA online.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)  # fit one mini-batch at a time
X_reduced = inc_pca.transform(X_train)
```

For very high-dimensional datasets, PCA can be too slow. As you saw earlier, even if you use randomized PCA its computational complexity is still O(m × d²) + O(d³), so the target number of dimensions d must not be too large. If you are dealing with a dataset with tens of thousands of features or more, then training may become much too slow: in this case, you should consider random projection instead.
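A minimal sketch of that alternative, using Scikit-Learn's `GaussianRandomProjection`; the synthetic data shape (tens of thousands of features) is illustrative only:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10_000))   # a very high-dimensional dataset

# Project onto 154 dimensions using a random matrix: no SVD required,
# just a single matrix multiplication, so it stays fast as d grows.
rp = GaussianRandomProjection(n_components=154, random_state=42)
X_reduced = rp.fit_transform(X)
print(X_reduced.shape)  # (200, 154)
```

Unlike PCA, the projection matrix is generated randomly rather than learned from the data, which is what makes it cheap: there is no dependence on computing principal components at all.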