watex.analysis package#

The analysis sub-package provides basic feature extraction, feature transformation and covariance-matrix computations (decomposition). It also includes dimensionality reduction (dimensionality) and factor analysis (factor).

watex.analysis.LLE(X, n_components=None, *, return_X=True, n_neighbors=5, **lle_kws)[source]#

Locally Linear Embedding (LLE)

LLE is a nonlinear dimensionality reduction technique based on closest neighbors (c.n.).

LLE is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a manifold learning technique that does not rely on projections, unlike PCA. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), 95% of the variance is preserved.
n_neighbors (int, default=5) – Number of neighbors to consider for each point.
return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.
lle_kws (dict) – Additional keyword arguments passed to sklearn.manifold.LocallyLinearEmbedding.

Returns

The transformed training set or the LLE container attributes for plotting purposes.

Return type

X (NDArray) or LLE object

References

Gökhan H. Bakır, Jason Weston and Bernhard Schölkopf, 2004; “Learning to Find Pre-images”; Tübingen, Germany: Max Planck Institute for Biological Cybernetics.

S. Roweis, L. Saul, 2000; “Nonlinear Dimensionality Reduction by Locally Linear Embedding”.

Notes

Scikit-Learn uses an algorithm based on kernel ridge regression.

Example

>>> from watex.analysis.dimensionality import LLE
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysed data')
>>> lle_kws ={
...    'n_components': 4,
...    "n_neighbors": 5}
>>> Xtransf=LLE(X,**lle_kws)
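
The neighbor-preserving behavior described above can be tried directly with scikit-learn. The following is a minimal sketch, assuming LLE wraps sklearn.manifold.LocallyLinearEmbedding; the swiss-roll data and the parameter values are illustrative choices, not taken from watex.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic twisted manifold: 3-D points lying on a rolled-up 2-D sheet,
# with only a little noise (the regime where LLE tends to do well).
X, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=42)

# Each point is reconstructed from its 10 closest neighbors; the 2-D
# embedding tries to preserve those local linear relationships.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)

print(X.shape, "->", X_unrolled.shape)
print("reconstruction error:", lle.reconstruction_error_)
```

Increasing n_neighbors makes the embedding smoother but less local; very small values can fragment the manifold.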

watex.analysis.LW_score(X, store_precision=True, assume_centered=False, **kws)[source]#

Score the Ledoit-Wolf covariance model fitted on X.

Parameters

store_precision (bool, default=True) – Specify if the estimated precision is stored.
assume_centered (bool, default=False) – If True, data will not be centered before computation. Useful when working with data whose mean is almost, but not exactly zero. If False (default), data will be centered before computation.
block_size (int, default=1000) – Size of blocks into which the covariance matrix will be split during its Ledoit-Wolf estimation. This is purely a memory optimization and does not affect results.

Notes

The regularised covariance is:

\[(1 - \text{shrinkage}) * \text{cov} + \text{shrinkage} * \mu * \text{np.identity(n_features)}\]

where \(\mu = \text{trace(cov)} / n_{features}\) and shrinkage is given by the Ledoit and Wolf formula.
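
As a quick check of the regularised covariance formula, the sketch below rebuilds the shrunk covariance by hand and compares it with what sklearn.covariance.LedoitWolf stores. The random data are an assumption made for illustration only.

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))  # illustrative data: 60 samples, 5 features

lw = LedoitWolf(store_precision=True, assume_centered=False).fit(X)

# Empirical (maximum-likelihood) covariance of the centered data.
cov = empirical_covariance(X, assume_centered=False)
n_features = X.shape[1]
mu = np.trace(cov) / n_features

# (1 - shrinkage) * cov + shrinkage * mu * identity
regularised = (1 - lw.shrinkage_) * cov + lw.shrinkage_ * mu * np.identity(n_features)

print("shrinkage:", lw.shrinkage_)
print("matches LedoitWolf.covariance_:", np.allclose(regularised, lw.covariance_))
```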

References

“A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.

watex.analysis.compute_scores(X, n_features, n_components=5)[source]#

Compute PCA score and Factor Analysis scores from training X.

Parameters

X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_features (int) – Number of features that compose X.
n_components (int, default {5}) – Number of components to retrieve.

Returns

Scores from PCA and FA from transformed X

Return type

Tuple (pca_scores, fa_scores)
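
One common way to obtain such scores is the cross-validated log-likelihood of each model. The sketch below assumes that approach; the helper cv_scores and the random data are hypothetical and do not reproduce the watex implementation.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # illustrative training set


def cv_scores(X, candidate_components):
    """Hypothetical helper: mean cross-validated log-likelihood per model."""
    pca_scores, fa_scores = [], []
    for n in candidate_components:
        pca = PCA(n_components=n, svd_solver="full")
        fa = FactorAnalysis(n_components=n)
        # With no explicit scoring, cross_val_score uses estimator.score,
        # which is the average log-likelihood for both PCA and FA.
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))
    return pca_scores, fa_scores


pca_scores, fa_scores = cv_scores(X, [1, 3, 5])
print(pca_scores)
print(fa_scores)
```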

watex.analysis.decision_region(X, y, clf, Xt=None, yt=None, random_state=42, test_size=0.3, scaling=True, split=False, n_components=2, view='X', resolution=0.02, return_expl_variance_ratio=False, return_axe=False, axe=None, **kws)[source]#

View decision regions for the training data reduced to two principal component axes.

Parameters

X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M, ), \(M=m-samples\)) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
yt (array-like, shape (M, ) M=m-samples,) – test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
clf (callable, classifier estimator) – A supervised (or semi-supervised) predictor with a finite set of discrete possible output values. A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the binary classification problem. Classifiers must store a classes_ attribute after fitting, and usually inherit from base.ClassifierMixin, which sets their _estimator_type attribute. A classifier can be distinguished from other estimators with is_classifier. It must implement:
* fit
* predict
* score
It may also be appropriate to implement decision_function, predict_proba and predict_log_proba.
random_state (int, default {42}) – Seed used when shuffling the data.
test_size (float < 1, default {.3}) – Fraction of the data held out for the test set.
split (bool, default {False}) – Split the (X, y) data into training and test sets (Xt, yt). When set to True, (X, y) is assumed to be the whole dataset with target y.
n_components (int or float, default {2}) – Number of principal components to retrieve. If the value is given as a ratio, for instance .95, then 95% of the variance is kept, and the resulting number of components can be read from the scikit-learn attribute <estimator>.n_components_.
view (str, ['X', 'Xt', None]) – The kind of visualization. ‘X’ and ‘Xt’ show the decision regions of the training and test sets respectively. If set to None, no plot is shown.
resolution (float, default {.02}) – Step of the NumPy meshgrid used to draw the decision regions.
return_expl_variance_ratio (bool, default {False}) – Return the explained variance ratio of all principal components.
return_axe (bool, default=False) – Return the matplotlib axes object.
axe (Matplotlib.Axes object, optional) – If not supplied, it is created.
kws (dict) – Additional keyword arguments passed to the scikit-learn function sklearn.model_selection.train_test_split().

Returns

X PCA training set transformed or PCA explained variance ratio.

Return type

nd-array | arraylike (return_expl_variance_ratio=True)

Examples

>>> from watex.datasets import fetch_data
>>> from watex.utils import selectfeatures
>>> from watex.exlib.sklearn import SimpleImputer, LogisticRegression
>>> from watex.analysis.decomposition import decision_region
>>> data = fetch_data("bagoue original").get('data=dfy1')  # encoded flow categories
>>> y = data.flow; X = data.drop(columns='flow')
>>> # select the numerical features
>>> X = selectfeatures(X, include='number')
>>> # impute the missing data
>>> X = SimpleImputer().fit_transform(X)
>>> lr_clf = LogisticRegression(multi_class='ovr', random_state=1, solver='lbfgs')
>>> Xpca = decision_region(X, y, clf=lr_clf, split=True, view='Xt')  # test set view
>>> Xpca[0]
array([-1.02925449,  1.42195127])

watex.analysis.extract_pca(X)[source]#

A naive approach to extract PCA from training set X

Parameters: X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns: Eigenvalues, eigenvectors and the scaled (standardized) training set Xsc
Return type: Tuple (eigen_vals, eigen_vecs, Xsc)

Examples

>>> from watex.exlib.sklearn import SimpleImputer
>>> from watex.utils import selectfeatures
>>> from watex.datasets import fetch_data
>>> from watex.analysis import extract_pca
>>> data= fetch_data("bagoue original").get('data=dfy1') # encoded flow categories
>>> y = data.flow ; X= data.drop(columns='flow')
>>> # select the numerical features
>>> X =selectfeatures(X, include ='number')
>>> # imputed the missing data
>>> X = SimpleImputer().fit_transform(X)
>>> eigval, eigvecs, _ = extract_pca(X)
>>> eigval
array([2.09220756, 1.43940464, 0.20251943, 1.08913226, 0.97512157,
       0.85749283, 0.64907948, 0.71364687])

Notes

The first principal component has the largest possible variance, and each subsequent principal component (pc) has the largest variance given the constraint that it is uncorrelated (orthogonal) to the other pcs. Even if the input features are correlated, the resulting pcs are mutually orthogonal (uncorrelated). Note that the PCA directions are highly sensitive to data scaling, so the features need to be standardized prior to PCA if they were measured on different scales and we want to assign equal importance to all features.

The numpy.linalg.eig function was designed to operate on both symmetric and non-symmetric square matrices. However, you may find that it returns complex eigenvalues in certain cases. A related function, numpy.linalg.eigh, has been implemented to decompose Hermitian matrices, which is numerically more stable for symmetric matrices such as the covariance matrix. numpy.linalg.eigh always returns real eigenvalues.
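
The standardize-then-decompose recipe from these notes can be sketched with NumPy alone. This is an illustrative example on random data with deliberately mismatched feature scales, not the watex implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four features measured on very different scales.
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

# Standardize so every feature contributes equally to the covariance.
X_sc = (X - X.mean(axis=0)) / X.std(axis=0)

# The covariance matrix is symmetric, so eigh applies and
# returns real eigenvalues (in ascending order).
cov = np.cov(X_sc.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Re-order from largest to smallest variance.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

print(eig_vals)
# The eigenvectors (columns) are mutually orthogonal.
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(4)))
```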

watex.analysis.feature_transformation(X, y=None, n_components=2, positive_class=1, view=False)[source]#

Transform X into new principal components after decomposing the covariance matrix.

Parameters

X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M, ), \(M=m-samples\)) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
n_components (int, default=2) – Number of components with the most total variance ratio.
positive_class (int) – Class label, as an integer identifier within the class representation.
view (bool, default {False}) – Give an overview of the total explained variance.

Returns

X_transf – X PCA training set transformed.

Return type

nd-array

Examples

>>> from watex.analysis import feature_transformation
>>> # Use the X, y value in the example of `extract_pca` function
>>> Xtransf = feature_transformation(X, y=y,  positive_class = 2 , view =True)
>>> Xtransf[0]
array([-1.0168034 ,  2.56417088])

watex.analysis.find_features_importances(fnames, components, n_axes=2)[source]#

Retrieve the feature importances with variance ratio.

Parameters

fnames – Array-like of feature names.
components – PCA components on different axes.

watex.analysis.get_component_with_most_variance(X, **pca_kws)[source]#

Get the number of components that preserve a 95% variance ratio.

Parameters

X – Training set.
pca_kws – Additional PCA keyword arguments.

watex.analysis.iPCA(X, n_components=None, *, view=False, n_batches=None, return_X=True, store_in_binary_file=False, filename=None, **ipca_kws)[source]#

Incremental PCA

iPCA splits the training set into mini-batches and feeds the algorithm one mini-batch at a time.

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Incremental PCA is useful for large training sets, and also for applying PCA online (i.e., on the fly, as new instances arrive).

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), 95% of the variance is preserved.
n_batches (int, optional) – Number of batches to split the training set into.
store_in_binary_file (bool, default=False) – Alternatively, use the NumPy memmap class to manipulate a large array stored in a binary file on disk as if it were entirely in memory. The class loads only the data it needs into memory, when it needs it.
filename (str, optional) – Name of the binary file on disk in which the array is stored.
return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.
view (bool, default=False) – Plot the explained variance as a function of the number of dimensions.
ipca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.IncrementalPCA.

Returns

The transformed training set or the iPCA container attributes for plotting purposes.

Return type

X (NDArray) or iPCA object

Examples

>>> from watex.analysis.dimensionality import iPCA
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysed data')
>>> Xtransf = iPCA(X,n_components=None,n_batches=100, view=True)
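
The mini-batch feeding described above corresponds to the partial_fit method of sklearn.decomposition.IncrementalPCA. The sketch below uses random data and illustrative batch counts; how watex splits its batches internally is not shown here.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # stands in for a training set too large for memory

n_batches = 10
ipca = IncrementalPCA(n_components=5)

# Feed the estimator one mini-batch at a time; each batch must contain
# at least n_components samples.
for X_batch in np.array_split(X, n_batches):
    ipca.partial_fit(X_batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)
print(ipca.explained_variance_ratio_)
```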

watex.analysis.kPCA(X, n_components=None, *, return_X=True, kernel='rbf', reconstruct_pre_image=False, **kpca_kws)[source]#

Kernel PCA

kPCA performs complex nonlinear projections for dimensionality reduction.

The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification or regression with SVMs. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), 95% of the variance is preserved.
return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'cosine', 'precomputed'}, default='rbf') – Kernel used for PCA.
kpca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.KernelPCA.

Returns

The transformed training set or the kPCA container attributes for plotting purposes.

Return type

X (NDArray) or kPCA object

Examples

>>> from watex.analysis.dimensionality import kPCA
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysis data')
>>> Xtransf = kPCA(X, n_components=None, kernel='rbf',
...                gamma=0.04)
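
For comparison, here is a minimal sketch that calls sklearn.decomposition.KernelPCA directly on a toy nonlinear dataset. The two-moons data and gamma=15 are illustrative assumptions. Setting fit_inverse_transform=True asks scikit-learn to learn an approximate pre-image map; whether the reconstruct_pre_image flag of kPCA maps onto that option is also an assumption.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaving half-circles: not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

kpca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=15,
    fit_inverse_transform=True,  # learn an approximate pre-image map
)
X_kpca = kpca.fit_transform(X)

# Map the projected points back to (an approximation of) the input space.
X_preimage = kpca.inverse_transform(X_kpca)

print(X_kpca.shape, X_preimage.shape)
```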

watex.analysis.linear_discriminant_analysis(X, y, n_components=2, view=False, verbose=0, return_X=True)[source]#

Linear Discriminant Analysis (LDA).

LDA is used as a technique for feature extraction to increase computational efficiency and reduce the degree of overfitting due to the curse of dimensionality in non-regularized models. The general concept behind LDA is very similar to principal component analysis (PCA), but whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal in LDA is to find the feature subspace that optimizes class separability. The main steps required to perform LDA are summarized below:

1. Standardize the d-dimensional dataset (d is the number of features).
2. For each class, compute the d-dimensional mean vector. Each mean vector \(m_i\) stores the mean feature value \(\mu_m\) with respect to the examples of class \(i\):

\[m_i = \frac{1}{n_i} \sum_{x\in D_i} x_m\]

3. Construct the between-class scatter matrix, \(S_B\), and the within-class scatter matrix, \(S_W\). The individual scatter matrices \(S_i\) are scaled before we sum them up as the scatter matrix \(S_W\):

\[\Sigma_i = \frac{1}{n_i} S_i = \frac{1}{n_i} \sum_{x\in D_i} (x-m_i)(x-m_i)^T\]

The scaled within-class scatter matrix is also called the covariance matrix. We can then compute the between-class scatter matrix \(S_B\):

\[S_B = \sum_{i=1}^{c} n_i (m_i-m)(m_i-m)^T\]

where \(c\) is the number of classes, \(n_i\) is the number of examples in class \(i\), and \(m\) is the overall mean, computed over the examples of all classes.

4. Compute the eigenvectors and corresponding eigenvalues of the matrix \(S_W^{-1}S_B\).
5. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
6. Choose the \(k\) eigenvectors that correspond to the \(k\) largest eigenvalues to construct a \(d \times k\)-dimensional transformation matrix, \(W\); the eigenvectors are the columns of this matrix.
7. Project the examples onto the new feature subspace using the transformation matrix \(W\).
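
The steps listed above can be sketched in plain NumPy. This is an illustrative walk-through on synthetic three-class data (the class means and sizes are assumptions), not the watex implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 3 classes, 50 examples each, 4 features.
class_means = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 4)) for m in class_means])
y = np.repeat([0, 1, 2], 50)

# Step 1: standardize the d-dimensional dataset.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d = X_std.shape[1]
labels = np.unique(y)

# Step 2: one d-dimensional mean vector per class.
mean_vecs = {c: X_std[y == c].mean(axis=0) for c in labels}

# Step 3: scaled within-class scatter (sum of class covariance matrices,
# bias=True gives the 1/n_i scaling) and between-class scatter.
S_W = np.zeros((d, d))
for c in labels:
    S_W += np.cov(X_std[y == c].T, bias=True)

overall_mean = X_std.mean(axis=0)
S_B = np.zeros((d, d))
for c in labels:
    n_c = np.sum(y == c)
    diff = (mean_vecs[c] - overall_mean).reshape(d, 1)
    S_B += n_c * (diff @ diff.T)

# Step 4: eigen-decomposition of S_W^{-1} S_B. The product is not symmetric,
# so eig is used and only the real parts are kept.
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
eig_vals, eig_vecs = eig_vals.real, eig_vecs.real

# Step 5: rank eigenvectors by decreasing eigenvalue.
order = np.argsort(eig_vals)[::-1]

# Step 6: d x k transformation matrix from the k leading eigenvectors.
k = 2
W = eig_vecs[:, order[:k]]

# Step 7: project the examples onto the new feature subspace.
X_lda = X_std @ W
print(X_lda.shape)
```

With c classes, at most c - 1 eigenvalues are meaningfully non-zero, which is why k = 2 is the natural choice for this three-class example.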

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
n_components (int, default=2) – Number of most discriminative eigenvectors to keep.
return_X (bool, default=True) – Return the training set transformed with n_components.
view (bool, default=False) – Visualize the LDA plot. If set to True, the plot is triggered.

Returns

X or W – The transformed train set (X) or matrix (W) from the most discriminative eigenvector columns

Return type

ndarray (n_samples, 2 )

Examples

>>> from watex.datasets import fetch_data
>>> from watex.utils import selectfeatures
>>> from watex.exlib.sklearn import SimpleImputer
>>> from watex.analysis.decomposition import linear_discriminant_analysis
>>> data = fetch_data("bagoue original").get('data=dfy1')  # encoded flow
>>> y = data.flow; X = data.drop(columns='flow')
>>> # select the numerical features
>>> X = selectfeatures(X, include='number')
>>> # impute the missing data
>>> X = SimpleImputer().fit_transform(X)
>>> Xtr = linear_discriminant_analysis(X, y, view=True)

watex.analysis.make_scedastic_data(n_samples=1000, n_features=50, rank=10, sigma=1.0, random_state=42)[source]#

Generate sample data for comparing probabilistic PCA and Factor Analysis models.

By default, n_samples = 1000, n_features = 50 and rank = 10.

Returns

* X (sampling data)
* X_homo (sampling data with homoscedastic noise)
* X_hetero (sampling data with heteroscedastic noise)
* n_components (number of components for the 50 features)
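
To make the homoscedastic versus heteroscedastic distinction concrete, here is a sketch of one way to build such data: a low-rank signal plus either constant or per-feature noise. The helper make_noisy_low_rank is hypothetical and only returns the three data arrays; it is modelled on a common recipe, not copied from watex.

```python
import numpy as np


def make_noisy_low_rank(n_samples=1000, n_features=50, rank=10,
                        sigma=1.0, random_state=42):
    """Hypothetical helper: low-rank data with two kinds of additive noise."""
    rng = np.random.RandomState(random_state)

    # Low-rank signal: random coefficients on `rank` orthonormal directions.
    U, _, _ = np.linalg.svd(rng.randn(n_features, n_features))
    X = rng.randn(n_samples, rank) @ U[:, :rank].T

    # Homoscedastic noise: the same variance for every feature.
    X_homo = X + sigma * rng.randn(n_samples, n_features)

    # Heteroscedastic noise: a different standard deviation per feature.
    sigmas = sigma * rng.rand(n_features) + sigma / 2.0
    X_hetero = X + rng.randn(n_samples, n_features) * sigmas

    return X, X_homo, X_hetero


X, X_homo, X_hetero = make_noisy_low_rank()
print(X.shape, np.linalg.matrix_rank(X))
```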

watex.analysis.nPCA(X, n_components=None, *, view=False, return_X=True, plot_kws=None, n_axes=None, **pca_kws)[source]#

Normal Principal Component Analysis (PCA)

PCA is by far the most popular dimensionality reduction algorithm. It first identifies the hyperplane that lies closest to the data, and then projects the data onto it.

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), 95% of the variance is preserved.
return_X (bool, default=True) – Return the transformed training set with the most representative variance ratio.
view (bool, default=False) – Plot the explained variance as a function of the number of dimensions.
n_axes (int, optional) – Number of important components for which to retrieve the variance ratio. If None, the feature importance is computed using the components that represent 95% of the cumulative variance.
pca_kws (dict) – Additional keyword arguments passed to sklearn.decomposition.PCA.

Returns

The transformed training set or the PCA container attributes for plotting purpose.

Return type

X or nPCA object

Examples

>>> from watex.analysis.dimensionality import nPCA
>>> from watex.datasets import fetch_data
>>> X, _= fetch_data('Bagoue analysed dataset')
>>> pca = nPCA(X, 0.95, n_axes =3, return_X=False)
>>> pca.components_
>>> pca.feature_importances_

watex.analysis.pcavsfa(X, rank=10, sigma=1.0, n_components=5, random_state=42, verbose=0, view=False)[source]#

Compute PCA score and Factor Analysis scores from training X and compare probabilistic PCA and Factor Analysis models.

Parameters

X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_features (int) – Number of features that compose X.
n_components (int, default {5}) – Number of components to retrieve.
rank (int, default {10}) – Bounding for ranking.
sigma (float, default {1.}) – Perturbation ratio used when adding heteroscedastic noise.
random_state (int, default {42}) – Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.

Returns

Scores from PCA and FA from transformed X

Return type

Tuple (pca_scores, fa_scores)

watex.analysis.plot_projection(self, n_components=None, **plot_kws)[source]#

Quick plot of the number of dimensions versus the explained variance ratio.

Parameters

n_components – PCA components on different axes.

watex.analysis.shrunk_cov_score(X)[source]#

Score the shrunk covariance estimator.

Parameters: X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns: score, the score of the (best) covariance estimator with shrinkage

watex.analysis.total_variance_ratio(X, view=False)[source]#

Compute the total variance ratio.

The explained variance ratio of an eigenvalue \(\lambda_j\) is simply the fraction of that eigenvalue over the total sum of the eigenvalues:

\[\text{explained_variance_ratio}= \frac{\lambda_j}{\sum_{j=1}^{d} \lambda_j}\]

Using the NumPy cumsum function, we can then calculate the cumulative sum of explained variances, which is plotted via matplotlib's step function when view is set to True.

Parameters

X (Nd-array, shape(M, N)) – Array of training set with M examples and N-features
view (bool, default {False}) – Give an overview of the total explained variance.

Returns

cum_var_exp – Cumulative sum of the total explained variance.

Return type

array-like

Examples

>>> from watex.analysis import total_variance_ratio
>>> # Use the X value in the example of `extract_pca` function
>>> cum_var = total_variance_ratio(X, view=True)
>>> cum_var
array([0.26091916, 0.44042728, 0.57625294, 0.69786032, 0.80479823,
       0.89379712, 0.97474381, 1.        ])
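
The ratio and its cumulative sum can be reproduced by hand. The sketch below reuses the eigenvalues printed in the extract_pca example; sorting them in decreasing order first is an assumption about how the function ranks the components.

```python
import numpy as np

# Eigenvalues as printed in the extract_pca example.
eig_vals = np.array([2.09220756, 1.43940464, 0.20251943, 1.08913226,
                     0.97512157, 0.85749283, 0.64907948, 0.71364687])

# Rank components from largest to smallest eigenvalue.
sorted_vals = np.sort(eig_vals)[::-1]

# lambda_j / sum(lambda): individual explained variance ratios.
var_exp = sorted_vals / sorted_vals.sum()

# Running total of explained variance.
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)
```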
