Reducers

Reduce dimension for data visualisation.

Reducing the number of dimensions down to two (or three), for instance, makes it possible to plot a high-dimensional training set on a graph and often to gain important insights by visually detecting patterns, such as clusters.
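
As a minimal sketch of this idea, assuming synthetic clustered data (not a watex dataset) and plain NumPy rather than the watex API, a high-dimensional set can be projected onto its first two principal directions before plotting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated clusters in 10 dimensions.
centers = rng.normal(scale=10.0, size=(3, 10))
labels = np.repeat(np.arange(3), 50)
X = centers[labels] + rng.normal(size=(150, 10))

# Center the data, then project onto the first two principal directions
# (the leading right singular vectors of the centered matrix).
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X2d = X_centered @ Vt[:2].T

print(X2d.shape)  # (150, 2)
```

The two columns of `X2d` can then be passed to any scatter plot, colored by `labels`, to check visually whether the clusters separate.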

watex.analysis.dimensionality.LLE(X, n_components=None, *, return_X=True, n_neighbors=5, **lle_kws)

Locally Linear Embedding (LLE)

LLE is a nonlinear dimensionality reduction technique based on closest neighbors (c.n.).

LLE is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a manifold learning technique that, unlike PCA, does not rely on projections. In a nutshell, it works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (int or float, optional) – Number of dimensions to preserve. If `n_components` is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), 95% of the variance is preserved.

  • n_neighbors (int, default=5) – Number of neighbors to consider for each point.

  • return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.

  • lle_kws (dict) – Additional keyword arguments passed to sklearn.manifold.LocallyLinearEmbedding.

Returns:

The transformed training set or the LLE container attributes for plotting purposes.

Return type:

X (NDArray) or LLE object

References

Gokhan H. Bakir, Jason Weston, and Bernhard Schölkopf, 2004. “Learning to Find Pre-images”. Tübingen, Germany: Max Planck Institute for Biological Cybernetics.

S. Roweis and L. Saul, 2000. “Nonlinear Dimensionality Reduction by Locally Linear Embedding”.

Notes

Scikit-Learn uses an algorithm based on kernel ridge regression (Bakir et al., 2004) to reconstruct pre-images in kernel PCA.

Example

>>> from watex.analysis.dimensionality import LLE
>>> from watex.datasets import fetch_data
>>> X, _ = fetch_data('Bagoue analysed data')
>>> lle_kws = {
...     'n_components': 4,
...     'n_neighbors': 5}
>>> Xtransf = LLE(X, **lle_kws)
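
The neighbor-based embedding described for LLE can also be tried directly with scikit-learn. The following is only a sketch under stated assumptions: it uses a synthetic Swiss roll instead of the Bagoue data, calls `sklearn.manifold.LocallyLinearEmbedding` rather than the watex wrapper, and picks `n_neighbors=10` purely for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic 3-D "Swiss roll": a 2-D sheet rolled up in 3-D space,
# with a little noise added.
X, t = make_swiss_roll(n_samples=500, noise=0.1, random_state=42)

# n_neighbors=10 is an illustrative choice, not a watex default.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)

print(X_unrolled.shape)  # (500, 2)
```

Plotting `X_unrolled` colored by `t` (the position along the roll) is a quick way to judge how well the manifold was unrolled.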

watex.analysis.dimensionality.find_features_importances(fnames, components, n_axes=2)

Retrieve the feature importances with the variance ratio.

Parameters:
  • fnames (array_like) – Feature names.

  • components – PCA components on the different axes.
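
The return format of `find_features_importances` is not described here. As a rough illustration of the general idea only, the sketch below ranks features by the absolute value of their loading on each principal axis. The helper `rank_features_by_loading`, the feature names, and the component values are all hypothetical and are not part of watex.

```python
import numpy as np

def rank_features_by_loading(fnames, components, n_axes=2):
    """Hypothetical helper: for each of the first ``n_axes`` principal
    axes, return (axis label, feature names, loadings) with features
    sorted by decreasing absolute loading."""
    fnames = np.asarray(fnames)
    components = np.asarray(components)
    ranked = []
    for i in range(min(n_axes, components.shape[0])):
        order = np.argsort(-np.abs(components[i]))
        ranked.append((f"pc{i + 1}",
                       fnames[order].tolist(),
                       components[i][order].tolist()))
    return ranked

# Made-up feature names and PCA components (rows = axes).
fnames = ["power", "magnitude", "sfi"]
components = np.array([[0.1, -0.9, 0.4],
                       [0.8, 0.2, -0.5]])

for axis, names, loadings in rank_features_by_loading(fnames, components):
    print(axis, names, loadings)
```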

watex.analysis.dimensionality.get_component_with_most_variance(X, **pca_kws)

Get the number of components that preserves 95% of the variance.

Parameters:
  • X – Training set.

  • pca_kws – Additional PCA keyword arguments.
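
One common way to find such a number is to accumulate the explained-variance ratios and take the first index at which the running total reaches the threshold. The sketch below assumes this approach and uses plain NumPy; `n_components_for_variance` is a hypothetical helper, not the watex implementation.

```python
import numpy as np

def n_components_for_variance(X, ratio=0.95):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained-variance ratio reaches ``ratio``."""
    X_centered = X - X.mean(axis=0)
    # The explained variance of each component is proportional to the
    # square of the corresponding singular value.
    s = np.linalg.svd(X_centered, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.argmax(cumulative >= ratio)) + 1

# Deterministic toy data: four orthogonal directions whose variances are
# proportional to 100, 25, 1 and 0.01.
D = np.diag([10.0, 5.0, 1.0, 0.1])
X = np.vstack([D, -D])

print(n_components_for_variance(X))  # 2
```

Here the first component alone explains about 79% of the variance and the first two about 99%, so two components are enough to pass the 95% threshold.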

watex.analysis.dimensionality.iPCA(X, n_components=None, *, view=False, n_batches=None, return_X=True, store_in_binary_file=False, filename=None, **ipca_kws)

Incremental PCA

iPCA splits the training set into mini-batches and feeds the algorithm one mini-batch at a time.

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Incremental PCA avoids this, which is useful for large training sets and also for applying PCA online (i.e., on the fly, as new instances arrive).

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (int or float, optional) – Number of dimensions to preserve. If `n_components` is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), 95% of the variance is preserved.

  • n_batches (int, optional) – Number of batches into which to split the training set.

  • store_in_binary_file (bool, default=False) – If True, use the numpy `memmap` class to manipulate a large array stored in a binary file on disk as if it were entirely in memory. The class loads only the data it needs into memory, when it needs it.

  • filename (str, optional) – Name of the binary file used to store the data on disk.

  • return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.

  • view (bool, default=False) – Plot the explained variance as a function of the number of dimensions.

  • ipca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.IncrementalPCA.

Returns:

The transformed training set or the iPCA container attributes for plotting purposes.

Return type:

X (NDArray) or iPCA object

Examples

>>> from watex.analysis.dimensionality import iPCA
>>> from watex.datasets import fetch_data
>>> X, _ = fetch_data('Bagoue analysed data')
>>> Xtransf = iPCA(X, n_components=None, n_batches=100, view=True)
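
The mini-batch and memory-mapping ideas described for iPCA can be sketched with scikit-learn and NumPy directly. This is an illustration under assumptions, not the watex implementation: the data are random, the file name `X.dat`, the batch count, and `n_components=5` are arbitrary, and the file lives in a temporary directory.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
m, n = 1000, 20
X = rng.normal(size=(m, n))  # stand-in for a large training set

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "X.dat")

    # Write the training set to a raw binary file on disk.
    X_disk = np.memmap(path, dtype="float64", mode="w+", shape=(m, n))
    X_disk[:] = X
    X_disk.flush()
    del X_disk

    # Re-open read-only: slices are read from disk only when accessed.
    X_mm = np.memmap(path, dtype="float64", mode="r", shape=(m, n))

    # Feed the algorithm one mini-batch at a time.
    n_batches = 10
    batch_size = m // n_batches
    ipca = IncrementalPCA(n_components=5)
    for start in range(0, m, batch_size):
        ipca.partial_fit(X_mm[start:start + batch_size])

    X_reduced = ipca.transform(X_mm)
    del X_mm  # release the file before the directory is removed

print(X_reduced.shape)  # (1000, 5)
```

Each call to `partial_fit` only needs one batch in memory, which is what makes the approach usable when the full array does not fit in RAM.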

watex.analysis.dimensionality.kPCA(X, n_components=None, *, return_X=True, kernel='rbf', reconstruct_pre_image=False, **kpca_kws)

Kernel PCA

kPCA performs complex nonlinear projections for dimensionality reduction.

The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification or regression with SVMs. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (int or float, optional) – Number of dimensions to preserve. If `n_components` is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), 95% of the variance is preserved.

  • return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.

  • kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'cosine', 'precomputed'}, default='rbf') – Kernel used for PCA.

  • kpca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.KernelPCA.

Returns:

The transformed training set or the kPCA container attributes for plotting purposes.

Return type:

X (NDArray) or kPCA object

Examples

>>> from watex.analysis.dimensionality import kPCA
>>> from watex.datasets import fetch_data
>>> X, _ = fetch_data('Bagoue analysis data')
>>> Xtransf = kPCA(X, n_components=None, kernel='rbf', gamma=0.04)
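
For comparison, the same kind of nonlinear projection can be run with `sklearn.decomposition.KernelPCA` directly. The sketch below assumes a synthetic Swiss roll and reuses `gamma=0.04` from the example above. It also sets `fit_inverse_transform=True` so that scikit-learn learns a pre-image mapping. How the watex `reconstruct_pre_image` flag relates to this option is not documented here, so treat the link as an assumption.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Synthetic 3-D data; not a watex dataset.
X, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=0)

# RBF kernel PCA down to two dimensions. fit_inverse_transform=True asks
# scikit-learn to also learn a mapping back to the original space.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                 fit_inverse_transform=True)
X_reduced = kpca.fit_transform(X)

# Approximate pre-images of the reduced points in the original space.
X_preimage = kpca.inverse_transform(X_reduced)
mse = np.mean((X - X_preimage) ** 2)

print(X_reduced.shape, X_preimage.shape)  # (300, 2) (300, 3)
print(mse)  # reconstruction pre-image error
```

The pre-image error gives one way to compare kernels and `gamma` values when no labels are available.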

watex.analysis.dimensionality.nPCA(X, n_components=None, *, view=False, return_X=True, plot_kws=None, n_axes=None, **pca_kws)

Normal Principal Component Analysis (PCA)

PCA is by far the most popular dimensionality reduction algorithm. It first identifies the hyperplane that lies closest to the data, and then projects the data onto it.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (int or float, optional) – Number of dimensions to preserve. If `n_components` is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), 95% of the variance is preserved.

  • return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.

  • view (bool, default=False) – Plot the explained variance as a function of the number of dimensions.

  • n_axes (int, optional) – Number of components for which to retrieve the variance ratio. If None, the feature importances are computed using the cumulative variance ratio of 95%.

  • pca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.PCA.

Returns:

The transformed training set or the PCA container attributes for plotting purposes.

Return type:

X (NDArray) or nPCA object

Examples

>>> from watex.analysis.dimensionality import nPCA
>>> from watex.datasets import fetch_data
>>> X, _ = fetch_data('Bagoue analysed dataset')
>>> pca = nPCA(X, 0.95, n_axes=3, return_X=False)
>>> pca.components_
>>> pca.feature_importances_
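
The "closest hyperplane" description of PCA can be checked numerically with `sklearn.decomposition.PCA`. This sketch assumes random data generated from two latent factors (not the Bagoue data) and does not go through the watex wrapper. It keeps 95% of the variance, then maps the reduced points back to the original space to measure what the projection lost.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 5 features driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# A float n_components asks PCA to keep enough components to preserve
# that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Map the projections back onto the hyperplane in the original space.
X_back = pca.inverse_transform(X_reduced)
lost = np.sum((X - X_back) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)

print(pca.n_components_, pca.components_.shape)
print(pca.explained_variance_ratio_.sum(), lost)
```

The fraction of variance lost by the projection and the sum of the explained-variance ratios add up to one, so `lost` stays at or below 5% here.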

watex.analysis.dimensionality.plot_projection(self, n_components=None, **plot_kws)

Quickly plot the number of dimensions versus the explained variance ratio.

Parameters:
  • n_components – PCA components on the different axes.