watex.exlib package#

The ‘exlib’ sub-package gathers the third-party packages that watex calls for modeling on datasets. Currently, watex uses scikit-learn and XGBoost for prediction.

class watex.exlib.AdaBoostClassifier(estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None, base_estimator='deprecated')[source]#

Bases: ClassifierMixin, BaseWeightBoosting

An AdaBoost classifier.

An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

This class implements the algorithm known as AdaBoost-SAMME [2].

Read more in the User Guide.

New in version 0.14.

Parameters:

estimator (object, default=None) –
The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.

New in version 1.2: base_estimator was renamed to estimator.
n_estimators (int, default=50) – The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early. Values must be in the range [1, inf).
learning_rate (float, default=1.0) – Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters. Values must be in the range (0.0, inf).
algorithm ({'SAMME', 'SAMME.R'}, default='SAMME.R') – If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
random_state (int, RandomState instance or None, default=None) – Controls the random seed passed to each estimator at each boosting iteration. Thus, it is only used when estimator exposes a random_state. Pass an int for reproducible output across multiple function calls. See Glossary.
base_estimator (object, default='deprecated') –
The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.

Deprecated since version 1.2: base_estimator is deprecated and will be removed in 1.4. Use estimator instead.

estimator_#

The base estimator from which the ensemble is grown.

New in version 1.2: base_estimator_ was renamed to estimator_.

Type:: estimator

base_estimator_#

The base estimator from which the ensemble is grown.

Deprecated since version 1.2: base_estimator_ is deprecated and will be removed in 1.4. Use estimator_ instead.

Type:: estimator

estimators_#

The collection of fitted sub-estimators.

Type:: list of classifiers

classes_#

The class labels.

Type:: ndarray of shape (n_classes,)

n_classes_#

The number of classes.

Type:: int

estimator_weights_#

Weights for each estimator in the boosted ensemble.

Type:: ndarray of floats

estimator_errors_#

Classification error for each estimator in the boosted ensemble.

Type:: ndarray of floats

feature_importances_#

The impurity-based feature importances if supported by the estimator (when based on decision trees).

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

Type:: ndarray of shape (n_features,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

AdaBoostRegressor: An AdaBoost regressor that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction.
GradientBoostingClassifier: GB builds an additive model in a forward stage-wise fashion. Regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
sklearn.tree.DecisionTreeClassifier: A non-parametric supervised learning method used for classification. Creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

References

[1] Y. Freund, R. Schapire, “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting”, 1995.

[2] J. Zhu, H. Zou, S. Rosset, T. Hastie, “Multi-class AdaBoost”, 2009.

Examples

>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
>>> clf.score(X, y)
0.983...

decision_function(X)[source]#

Compute the decision function of X.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Returns:: score – The decision function of the input samples. The order of outputs is the same as that of the classes_ attribute. Binary classification is a special case with k == 1; otherwise k == n_classes. For binary classification, values closer to -1 or 1 mean more like the first or second class in classes_, respectively.
Return type:: ndarray of shape (n_samples, k)
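The relationship between the sign of the decision function and the predicted class in the binary case can be sketched as follows. This is a minimal illustration, not part of the original documentation: it imports AdaBoostClassifier from sklearn.ensemble on the assumption that watex.exlib exposes the same class, and the dataset sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary problem (sizes chosen only for illustration).
X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0)
clf = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X, y)

scores = clf.decision_function(X)

# In the binary case a positive score points to classes_[1],
# a non-positive score to classes_[0].
labels_from_scores = clf.classes_[(scores > 0).astype(int)]
labels_from_predict = clf.predict(X)
```

Under these assumptions, labels_from_scores and labels_from_predict should agree for every sample.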

predict(X)[source]#

Predict classes for X.

The predicted class of an input sample is computed as the weighted mean prediction of the classifiers in the ensemble.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Returns:: y – The predicted classes.
Return type:: ndarray of shape (n_samples,)

predict_log_proba(X)[source]#

Predict class log-probabilities for X.

The predicted class log-probabilities of an input sample are computed as the weighted mean predicted class log-probabilities of the classifiers in the ensemble.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Returns:: p – The class log-probabilities of the input samples. The order of outputs is the same as that of the classes_ attribute.
Return type:: ndarray of shape (n_samples, n_classes)

predict_proba(X)[source]#

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the weighted mean predicted class probabilities of the classifiers in the ensemble.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Returns:: p – The class probabilities of the input samples. The order of outputs is the same as that of the classes_ attribute.
Return type:: ndarray of shape (n_samples, n_classes)

staged_decision_function(X)[source]#

Compute decision function of X for each boosting iteration.

This method allows monitoring (e.g., determining the error on a test set) after each boosting iteration.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Yields:: score (generator of ndarray of shape (n_samples, k)) – The decision function of the input samples. The order of outputs is the same as that of the classes_ attribute. Binary classification is a special case with k == 1; otherwise k == n_classes. For binary classification, values closer to -1 or 1 mean more like the first or second class in classes_, respectively.

staged_predict(X)[source]#

Return staged predictions for X.

The predicted class of an input sample is computed as the weighted mean prediction of the classifiers in the ensemble.

This generator method yields the ensemble prediction after each iteration of boosting and therefore allows monitoring, such as to determine the prediction on a test set after each boost.

Parameters:: X (array-like of shape (n_samples, n_features)) – The input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Yields:: y (generator of ndarray of shape (n_samples,)) – The predicted classes.
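The monitoring use case described above can be sketched as follows: the accuracy on a held-out set is recorded after each boosting iteration. This is an illustrative sketch only; the import from sklearn.ensemble assumes watex.exlib exposes the same class, and the split and the variable names (staged_acc, best_n) are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=30, random_state=0)
clf.fit(X_train, y_train)

# One accuracy value per fitted boosting stage.
staged_acc = [accuracy_score(y_test, y_pred)
              for y_pred in clf.staged_predict(X_test)]

# Number of boosting stages giving the best held-out accuracy.
best_n = staged_acc.index(max(staged_acc)) + 1
```

The last entry of staged_acc corresponds to the full ensemble, so it should match clf.score(X_test, y_test); earlier entries show how the test accuracy evolves as estimators are added.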

staged_predict_proba(X)[source]#

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the weighted mean predicted class probabilities of the classifiers in the ensemble.

This generator method yields the ensemble predicted class probabilities after each iteration of boosting and therefore allows monitoring, such as to determine the predicted class probabilities on a test set after each boost.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted to CSR.
Yields:: p (generator of ndarray of shape (n_samples, n_classes)) – The class probabilities of the input samples. The order of outputs is the same as that of the classes_ attribute.

class watex.exlib.BaggingClassifier(estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0, base_estimator='deprecated')[source]#

Bases: ClassifierMixin, BaseBagging

A Bagging classifier.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].

Read more in the User Guide.

New in version 0.15.

Parameters:

estimator (object, default=None) –
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier.

New in version 1.2: base_estimator was renamed to estimator.
n_estimators (int, default=10) – The number of base estimators in the ensemble.
max_samples (int or float, default=1.0) –
The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details).
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples.
max_features (int or float, default=1.0) –
The number of features to draw from X to train each base estimator (without replacement by default, see bootstrap_features for more details).
- If int, then draw max_features features.
- If float, then draw max(1, int(max_features * n_features_in_)) features.
bootstrap (bool, default=True) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.
bootstrap_features (bool, default=False) – Whether features are drawn with replacement.
oob_score (bool, default=False) – Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.
warm_start (bool, default=False) –
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See the Glossary.

New in version 0.17: warm_start constructor parameter.
n_jobs (int, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
random_state (int, RandomState instance or None, default=None) – Controls the random resampling of the original dataset (sample wise and feature wise). If the base estimator accepts a random_state attribute, a different seed is generated for each instance in the ensemble. Pass an int for reproducible output across multiple function calls. See Glossary.
verbose (int, default=0) – Controls the verbosity when fitting and predicting.
base_estimator (object, default="deprecated") –
Use estimator instead.

Deprecated since version 1.2: base_estimator is deprecated and will be removed in 1.4. Use estimator instead.

estimator_#

The base estimator from which the ensemble is grown.

New in version 1.2: base_estimator_ was renamed to estimator_.

Type:: estimator

base_estimator_#

The base estimator from which the ensemble is grown.

Deprecated since version 1.2: base_estimator_ is deprecated and will be removed in 1.4. Use estimator_ instead.

Type:: estimator

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

estimators_#

The collection of fitted base estimators.

Type:: list of estimators

estimators_samples_#

The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected.

Type:: list of arrays

estimators_features_#

The subset of drawn features for each base estimator.

Type:: list of arrays

classes_#

The class labels.

Type:: ndarray of shape (n_classes,)

n_classes_#

The number of classes.

Type:: int or list

oob_score_#

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

Type:: float

oob_decision_function_#

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.

Type:: ndarray of shape (n_samples, n_classes)
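How max_samples, max_features and oob_score interact with the fitted attributes can be sketched as follows. This is a hedged illustration: it imports BaggingClassifier from sklearn.ensemble on the assumption that watex.exlib exposes the same class, and the dataset and parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, n_redundant=0,
                           random_state=0)

# Each base estimator sees half of the samples (drawn with replacement)
# and half of the features (drawn without replacement).
bag = BaggingClassifier(n_estimators=25, max_samples=0.5,
                        max_features=0.5, bootstrap=True,
                        oob_score=True, random_state=0).fit(X, y)

# Number of sample indices and feature indices drawn per base estimator.
n_drawn_samples = [len(idx) for idx in bag.estimators_samples_]
n_drawn_features = [len(cols) for cols in bag.estimators_features_]

# Out-of-bag accuracy estimate, available because oob_score=True.
oob_accuracy = bag.oob_score_
```

With float values, the counts follow the rules given for max_samples and max_features above: 0.5 * 200 samples and max(1, int(0.5 * 6)) features per estimator.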

See also

BaggingRegressor: A Bagging regressor.

References

[1] L. Breiman, “Pasting small votes for classification in large databases and on-line”, Machine Learning, 36(1), 85-103, 1999.

[2] L. Breiman, “Bagging predictors”, Machine Learning, 24(2), 123-140, 1996.

[3] T. Ho, “The random subspace method for constructing decision forests”, Pattern Analysis and Machine Intelligence, 20(8), 832-844, 1998.

[4] G. Louppe and P. Geurts, “Ensembles on Random Patches”, Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.

Examples

>>> from sklearn.svm import SVC
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = BaggingClassifier(estimator=SVC(),
...                         n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict([[0, 0, 0, 0]])
array([1])

decision_function(X)[source]#

Average of the decision functions of the base classifiers.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns:: score – The decision function of the input samples. The columns correspond to the classes in sorted order, as they appear in the attribute classes_. Regression and binary classification are special cases with k == 1, otherwise k==n_classes.
Return type:: ndarray of shape (n_samples, k)

predict(X)[source]#

Predict class for X.

The predicted class of an input sample is computed as the class with the highest mean predicted probability. If base estimators do not implement a predict_proba method, then it resorts to voting.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns:: y – The predicted classes.
Return type:: ndarray of shape (n_samples,)

predict_log_proba(X)[source]#

Predict class log-probabilities for X.

The predicted class log-probabilities of an input sample are computed as the log of the mean predicted class probabilities of the base estimators in the ensemble.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns:: p – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type:: ndarray of shape (n_samples, n_classes)

predict_proba(X)[source]#

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the base estimators in the ensemble. If base estimators do not implement a predict_proba method, then it resorts to voting and the predicted class probabilities of an input sample represent the proportion of estimators predicting each class.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns:: p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type:: ndarray of shape (n_samples, n_classes)
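The voting fallback described above can be illustrated with a base estimator that does not expose predict_proba. This sketch assumes SVC with its default settings (no probability estimates) and imports from sklearn on the assumption that watex.exlib exposes the same classes; the base estimator is passed positionally so the snippet does not depend on the estimator/base_estimator rename.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

n_estimators = 10
# SVC() has no predict_proba by default, so the ensemble counts votes.
clf = BaggingClassifier(SVC(), n_estimators=n_estimators,
                        random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])

# Each probability is the share of base estimators voting for that class,
# so multiplying by n_estimators recovers whole vote counts.
votes = proba * n_estimators
```

Each row of proba sums to 1, and every entry is a multiple of 1 / n_estimators.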

class watex.exlib.BaseEstimator[source]#

Bases: object

Base class for all estimators in scikit-learn.

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: dict

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:: **params (dict) – Estimator parameters.
Returns:: self – Estimator instance.
Return type:: estimator instance
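The <component>__<parameter> convention can be sketched with a small Pipeline. The step names "scale" and "clf" are hypothetical, and the classes are imported from sklearn on the assumption that the behavior of watex.exlib.BaseEstimator is identical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# Update parameters of nested components through the double-underscore syntax.
pipe.set_params(clf__C=10.0, scale__with_mean=False)

# deep=True also returns the parameters of the contained estimators.
deep_params = pipe.get_params(deep=True)
# deep=False returns only the parameters of the pipeline itself.
shallow_params = pipe.get_params(deep=False)
```

Here deep_params contains keys such as "clf__C", while shallow_params does not.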

class watex.exlib.ClassifierMixin[source]#

Bases: object

Mixin class for all classifiers in scikit-learn.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float
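As a quick check of what score returns, the sketch below compares it with the accuracy computed from predict. The classifier and data are arbitrary choices for illustration, imported from sklearn on the assumption that watex.exlib exposes the same classes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Mean accuracy reported by the mixin ...
mean_accuracy = clf.score(X, y)
# ... and the same quantity computed explicitly from the predictions.
explicit_accuracy = accuracy_score(y, clf.predict(X))
```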

class watex.exlib.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)[source]#

Bases: TransformerMixin, _BaseComposition

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

Read more in the User Guide.

New in version 0.20.

Parameters:

transformers (list of tuples) –
List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.

name : str
Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.

transformer : {‘drop’, ‘passthrough’} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
remainder ({'drop', 'passthrough'} or estimator, default='drop') – By default (remainder='drop'), only the columns specified in transformers are transformed and combined in the output; the non-specified columns are dropped. With remainder='passthrough', all remaining columns that were not specified in transformers but are present in the data passed to fit are passed through automatically. This subset of columns is concatenated with the output of the transformers. For dataframes, extra columns not seen during fit are excluded from the output of transform. When remainder is set to an estimator, the remaining non-specified columns are transformed by that estimator, which must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
sparse_threshold (float, default=0.3) – If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
transformer_weights (dict, default=None) – Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.
verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.
verbose_feature_names_out (bool, default=True) –
If True, get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.

New in version 1.0.

transformers_#

The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be the unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).

Type:: list

named_transformers_#

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

Type:: Bunch

sparse_output_#

Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

Type:: bool

output_indices_#

A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).

New in version 1.0.

Type:: dict

n_features_in_#

Number of features seen during fit. Only defined if the underlying transformers expose such an attribute when fit.

New in version 0.24.

Type:: int

See also

make_column_transformer: Convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.
make_column_selector: Convenience function for selecting columns based on datatype or the columns name with a regex pattern.

Notes

The order of the columns in the transformed feature matrix follows the order in which the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless they are passed through with remainder='passthrough'. Those passthrough columns are added at the right of the output of the transformers.

Examples

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

ColumnTransformer can be configured with a transformer that requires a 1d array by setting the column to a string:

>>> from sklearn.feature_extraction import FeatureHasher
>>> from sklearn.preprocessing import MinMaxScaler
>>> import pandas as pd   
>>> X = pd.DataFrame({
...     "documents": ["First item", "second one here", "Is this the last?"],
...     "width": [3, 4, 5],
... })  
>>> # "documents" is a string which configures ColumnTransformer to
>>> # pass the documents column as a 1d array to the FeatureHasher
>>> ct = ColumnTransformer(
...     [("text_preprocess", FeatureHasher(input_type="string"), "documents"),
...      ("num_preprocess", MinMaxScaler(), ["width"])])
>>> X_trans = ct.fit_transform(X)  

fit(X, y=None)[source]#

Fit all transformers using X.

Parameters:

X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.
y (array-like of shape (n_samples,...), default=None) – Targets for supervised learning.

Returns:

self – This estimator.

Return type:

ColumnTransformer

fit_transform(X, y=None)[source]#

Fit all transformers, transform the data and concatenate results.

Parameters:

X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.
y (array-like of shape (n_samples,), default=None) – Targets for supervised learning.

Returns:

X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type:

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

get_params(deep=True)[source]#

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: dict

property named_transformers_#

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

set_output(*, transform=None)[source]#

Set the output container when “transform” and “fit_transform” are called.

Calling set_output will set the output of all estimators in transformers and transformers_.

Parameters:

transform ({"default", "pandas"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
None: Transform configuration is unchanged

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**kwargs)[source]#

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Parameters:: **kwargs (dict) – Estimator parameters.
Returns:: self – This estimator.
Return type:: ColumnTransformer

transform(X)[source]#

Transform X separately by each transformer, concatenate results.

Parameters:: X ({array-like, dataframe} of shape (n_samples, n_features)) – The data to be transformed by subset.
Returns:: X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
Return type:: {array-like, sparse matrix} of shape (n_samples, sum_n_components)

class watex.exlib.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)[source]#

Bases: ClassifierMixin, BaseDecisionTree

A decision tree classifier.

Read more in the User Guide.

Parameters:

criterion ({"gini", "entropy", "log_loss"}, default="gini") – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain; see the “Mathematical formulation” section of the decision tree User Guide.
splitter ({"best", "random"}, default="best") – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split (int or float, default=2) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf (int or float, default=1) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features (int, float or {"auto", "sqrt", "log2"}, default=None) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
Deprecated since version 1.1: The “auto” option was deprecated in 1.1 and will be removed in 1.3.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
random_state (int, RandomState instance or None, default=None) – Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease (float, default=0.0) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:
```
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
```
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.
class_weight (dict, list of dict or "balanced", default=None) –
Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
ccp_alpha (non-negative float, default=0.0) –
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

New in version 0.22.
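To make the weighted impurity decrease used by min_impurity_decrease concrete, here is a small pure-Python sketch. It assumes unweighted samples and the Gini criterion; the helper names gini and weighted_impurity_decrease are hypothetical and are not part of the library:

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(n, parent, left, right):
    """Evaluate the weighted impurity decrease for unweighted samples.

    n is the total number of samples (N); parent, left and right are
    per-class counts at the current node and at its two children.
    """
    n_t, n_t_l, n_t_r = sum(parent), sum(left), sum(right)
    return n_t / n * (
        gini(parent)
        - n_t_r / n_t * gini(right)
        - n_t_l / n_t * gini(left)
    )


# Root node with 5 samples per class, split into a pure left child
# (4 samples) and a mixed right child (1 and 5 samples).
decrease = weighted_impurity_decrease(
    n=10, parent=[5, 5], left=[4, 0], right=[1, 5]
)
print(round(decrease, 4))
```

A split like this one would be accepted for any min_impurity_decrease up to the printed value.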

classes_#

The class labels (single output problem), or a list of arrays of class labels (multi-output problem).

Type:: ndarray of shape (n_classes,) or list of ndarray

feature_importances_#

The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [4].

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

Type:: ndarray of shape (n_features,)
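As an illustrative sketch, the impurity-based importances can be compared with the permutation-based alternative mentioned in the warning. The iris dataset and the n_repeats value are arbitrary choices for this example:

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances are normalized, so they sum to 1.
impurity_imp = clf.feature_importances_
print(impurity_imp, impurity_imp.sum())

# Permutation importance: mean drop in score when a feature is shuffled.
perm = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(perm.importances_mean)
```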

max_features_#

The inferred value of max_features.

Type:: int

n_classes_#

The number of classes (for single output problems), or a list containing the number of classes for each output (for multi-output problems).

Type:: int or list of int

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_outputs_#

The number of outputs when fit is performed.

Type:: int

tree_#

The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of the Tree object and to the scikit-learn example “Understanding the decision tree structure” (plot_unveil_tree_structure.py) for basic usage of these attributes.

Type:: Tree instance

See also

DecisionTreeRegressor: A decision tree regressor.

Notes

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

The predict() method operates using the numpy.argmax() function on the outputs of predict_proba(). This means that in case the highest predicted probabilities are tied, the classifier will predict the tied class with the lowest index in classes_.
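The tie-breaking rule can be sketched with a deliberately degenerate dataset: two identical samples with different labels, so the tree is a single leaf with equal class probabilities. The string labels "a" and "b" are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two identical samples with different labels cannot be separated,
# so the fitted tree is one leaf holding one sample of each class.
X = np.array([[0.0], [0.0]])
y = np.array(["b", "a"])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba([[0.0]])
pred = clf.predict([[0.0]])

print(clf.classes_)  # classes are stored in sorted order
print(proba)         # both classes are equally likely
print(pred)          # the tie resolves to the first entry of classes_
```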

References

[1]

https://en.wikipedia.org/wiki/Decision_tree_learning

[2]

L. Breiman, J. Friedman, R. Olshen, and C. Stone, “Classification and Regression Trees”, Wadsworth, Belmont, CA, 1984.

[3]

T. Hastie, R. Tibshirani and J. Friedman. “Elements of Statistical Learning”, Springer, 2009.

[4]

L. Breiman, and A. Cutler, “Random Forests”, https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

fit(X, y, sample_weight=None, check_input=True)[source]#

Build a decision tree classifier from the training set (X, y).

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values (class labels) as integers or strings.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.

Returns:

self – Fitted estimator.

Return type:

DecisionTreeClassifier

predict_log_proba(X)[source]#

Predict class log-probabilities of the input samples X.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
Returns:: proba – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type:: ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1

predict_proba(X, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
check_input (bool, default=True) – Allows bypassing several input checks. Don’t use this parameter unless you know what you’re doing.

Returns:

proba – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

Return type:

ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
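A small sketch of the leaf-fraction rule, using a hypothetical four-sample dataset in which one leaf remains impure because three samples share the same feature value:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Three samples share x = 0 (labels 0, 0, 1); one sample has x = 1 (label 1).
X = np.array([[0.0], [0.0], [0.0], [1.0]])
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The leaf for x = 0 holds two samples of class 0 and one of class 1,
# so its probabilities are the class fractions 2/3 and 1/3.
proba = clf.predict_proba([[0.0], [1.0]])
log_proba = clf.predict_log_proba([[0.0]])

print(proba)
print(log_proba)
```

predict_log_proba is queried only for the impure leaf here, since a pure leaf contains a zero probability whose logarithm is negative infinity.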

class watex.exlib.DummyClassifier(*, strategy='prior', random_state=None, constant=None)[source]#

Bases: MultiOutputMixin, ClassifierMixin, BaseEstimator

DummyClassifier makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more complex classifiers.

The specific behavior of the baseline is selected with the strategy parameter.

All strategies make predictions that ignore the input feature values passed as the X argument to fit and predict. The predictions, however, typically depend on values observed in the y parameter passed to fit.

Note that the “stratified” and “uniform” strategies lead to non-deterministic predictions that can be rendered deterministic by setting the random_state parameter if needed. The other strategies are naturally deterministic and, once fit, always return the same constant prediction for any value of X.

Read more in the User Guide.

New in version 0.13.

Parameters:

strategy ({"most_frequent", "prior", "stratified", "uniform", "constant"}, default="prior") –
Strategy to use to generate predictions.
- “most_frequent”: the predict method always returns the most frequent class label in the observed y argument passed to fit. The predict_proba method returns the matching one-hot encoded vector.
- “prior”: the predict method always returns the most frequent class label in the observed y argument passed to fit (like “most_frequent”). predict_proba always returns the empirical class distribution of y, also known as the empirical class prior distribution.
- “stratified”: the predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The predict method returns the class label which got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.
- “uniform”: generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.
- “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
  
  Changed in version 0.24: The default value of strategy has changed to “prior” in version 0.24.
random_state (int, RandomState instance or None, default=None) – Controls the randomness to generate the predictions when strategy='stratified' or strategy='uniform'. Pass an int for reproducible output across multiple function calls. See Glossary.
constant (int or str or array-like of shape (n_outputs,), default=None) – The explicit constant as predicted by the “constant” strategy. This parameter is useful only for the “constant” strategy.
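The difference between the deterministic strategies can be sketched as below. The toy labels are hypothetical; note that “most_frequent” and “prior” agree on predict but differ on predict_proba:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((4, 1))        # features are ignored by every strategy
y = np.array([0, 1, 1, 1])  # class 1 is the majority class

probas = {}
for strategy in ("most_frequent", "prior"):
    clf = DummyClassifier(strategy=strategy).fit(X, y)
    probas[strategy] = clf.predict_proba(X[:1])[0]
    print(strategy, clf.predict(X[:1]), probas[strategy])

# The "constant" strategy needs the label to predict via `constant`.
constant_clf = DummyClassifier(strategy="constant", constant=0).fit(X, y)
print(constant_clf.predict(X))
```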

classes_#

Unique class labels observed in y. For multi-output classification problems, this attribute is a list of arrays as each output has an independent set of possible classes.

Type:: ndarray of shape (n_classes,) or list of such arrays

n_classes_#

Number of labels for each output.

Type:: int or list of int

class_prior_#

Frequency of each class observed in y. For multioutput classification problems, this is computed independently for each output.

Type:: ndarray of shape (n_classes,) or list of such arrays

n_outputs_#

Number of outputs.

Type:: int

sparse_output_#

True if the array returned from predict is to be in sparse CSC format. Is automatically set to True if the input y is passed in sparse format.

Type:: bool

See also

DummyRegressor: Regressor that makes predictions using simple rules.

Examples

>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> X = np.array([-1, 1, 1, 1])
>>> y = np.array([0, 1, 1, 1])
>>> dummy_clf = DummyClassifier(strategy="most_frequent")
>>> dummy_clf.fit(X, y)
DummyClassifier(strategy='most_frequent')
>>> dummy_clf.predict(X)
array([1, 1, 1, 1])
>>> dummy_clf.score(X, y)
0.75

fit(X, y, sample_weight=None)[source]#

Fit the baseline classifier.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

self – Returns the instance itself.

Return type:

object

predict(X)[source]#

Perform classification on test vectors X.

Parameters:: X (array-like of shape (n_samples, n_features)) – Test data.
Returns:: y – Predicted target values for X.
Return type:: array-like of shape (n_samples,) or (n_samples, n_outputs)

predict_log_proba(X)[source]#

Return log probability estimates for the test vectors X.

Parameters:: X ({array-like, object with finite length or shape}) – Test data.
Returns:: P – Returns the log probability of the sample for each class in the model, where classes are ordered arithmetically for each output.
Return type:: ndarray of shape (n_samples, n_classes) or list of such arrays

predict_proba(X)[source]#

Return probability estimates for the test vectors X.

Parameters:: X (array-like of shape (n_samples, n_features)) – Test data.
Returns:: P – Returns the probability of the sample for each class in the model, where classes are ordered arithmetically, for each output.
Return type:: ndarray of shape (n_samples, n_classes) or list of such arrays

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (None or array-like of shape (n_samples, n_features)) – Test samples. Passing None as test samples gives the same result as passing real test samples, since DummyClassifier operates independently of the sampled observations.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float
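Because the predictions ignore the features, scoring with X=None should match scoring with real samples. A minimal sketch, with hypothetical toy data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((4, 1))
y = np.array([0, 1, 1, 1])

clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# Both calls evaluate the same constant prediction against y.
score_with_X = clf.score(X, y)
score_without_X = clf.score(None, y)
print(score_with_X, score_without_X)
```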

class watex.exlib.ExtraTreesClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]#

Bases: ForestClassifier

An extra-trees classifier.

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Read more in the User Guide.

Parameters:

n_estimators (int, default=100) –
The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
criterion ({"gini", "entropy", "log_loss"}, default="gini") – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see tree_mathematical_formulation. Note: This parameter is tree-specific.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split (int or float, default=2) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf (int or float, default=1) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features ({"sqrt", "log2", None}, int or float, default="sqrt") –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
Changed in version 1.1: The default of max_features changed from “auto” to “sqrt”.

Deprecated since version 1.1: The “auto” option was deprecated in 1.1 and will be removed in 1.3.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes (int, default=None) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease (float, default=0.0) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:
```
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
```
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.
bootstrap (bool, default=False) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_score (bool, default=False) – Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
random_state (int, RandomState instance or None, default=None) –
Controls 3 sources of randomness:
- the bootstrapping of the samples used when building trees (if bootstrap=True)
- the sampling of the features to consider when looking for the best split at each node (if max_features < n_features)
- the draw of the splits for each of the max_features
See Glossary for details.
verbose (int, default=0) – Controls the verbosity when fitting and predicting.
warm_start (bool, default=False) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See Glossary and gradient_boosting_warm_start for details.
class_weight ({"balanced", "balanced_subsample"}, dict or list of dicts, default=None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
ccp_alpha (non-negative float, default=0.0) –
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

New in version 0.22.
max_samples (int or float, default=None) –
If bootstrap is True, the number of samples to draw from X to train each base estimator.
- If None (default), then draw X.shape[0] samples.
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].
New in version 0.22.
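The “balanced” class_weight formula quoted in the parameter list can be evaluated by hand. This NumPy sketch assumes integer labels starting at 0 (as np.bincount requires) and uses a hypothetical imbalanced label vector:

```python
import numpy as np

# Three samples of class 0 and one sample of class 1.
y = np.array([0, 0, 0, 1])

n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]

# n_samples / (n_classes * np.bincount(y)): rarer classes get larger weights.
weights = n_samples / (n_classes * np.bincount(y))
print(weights)
```

Here the minority class receives three times the weight of the majority class, which mirrors the inverse of the class frequencies.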

estimator_#

The child estimator template used to create the collection of fitted sub-estimators.

New in version 1.2: base_estimator_ was renamed to estimator_.

Type:: ExtraTreeClassifier

base_estimator_#

The child estimator template used to create the collection of fitted sub-estimators.

Deprecated since version 1.2: base_estimator_ is deprecated and will be removed in 1.4. Use estimator_ instead.

Type:: ExtraTreeClassifier

estimators_#

The collection of fitted sub-estimators.

Type:: list of DecisionTreeClassifier

classes_#

The class labels (single output problem), or a list of arrays of class labels (multi-output problem).

Type:: ndarray of shape (n_classes,) or a list of such arrays

n_classes_#

The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).

Type:: int or list

feature_importances_#

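The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.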

Type:: ndarray of shape (n_features,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_outputs_#

The number of outputs when fit is performed.

Type:: int

oob_score_#

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

Type:: float

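oob_decision_function_#

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small, a data point may never have been left out during the bootstrap, in which case oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.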

Type:: ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)
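A sketch of how the out-of-bag attributes become available. Both bootstrap=True and oob_score=True must be set, since extra-trees do not bootstrap by default; the synthetic dataset is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = ExtraTreesClassifier(
    n_estimators=100,
    bootstrap=True,   # required for out-of-bag estimates
    oob_score=True,
    random_state=0,
).fit(X, y)

print(clf.oob_score_)                    # accuracy estimated on OOB samples
print(clf.oob_decision_function_.shape)  # one row per training sample
```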

See also

ExtraTreesRegressor: An extra-trees regressor with random splits.
RandomForestClassifier: A random forest classifier with optimal splits.
RandomForestRegressor: Ensemble regressor using trees with optimal splits.

References

[1]

P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

Examples

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
ExtraTreesClassifier(random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])

class watex.exlib.FactorAnalysis(n_components=None, *, tol=0.01, copy=True, max_iter=1000, noise_variance_init=None, svd_method='randomized', iterated_power=3, rotation=None, random_state=0)[source]#

Bases: ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator

Factor Analysis (FA).

A simple linear generative model with Gaussian latent variables.

The observations are assumed to be caused by a linear transformation of lower dimensional latent factors and added Gaussian noise. Without loss of generality the factors are distributed according to a Gaussian with zero mean and unit covariance. The noise is also zero mean and has an arbitrary diagonal covariance matrix.

Restricting the model further by assuming that the Gaussian noise is isotropic (all diagonal entries are the same) yields PCA.

FactorAnalysis performs a maximum likelihood estimate of the so-called loading matrix, the transformation of the latent variables to the observed ones, using an SVD-based approach.

Read more in the User Guide.

New in version 0.13.

Parameters:

n_components (int, default=None) – Dimensionality of latent space, the number of components of X that are obtained after transform. If None, n_components is set to the number of features.
tol (float, default=1e-2) – Stopping tolerance for log-likelihood increase.
copy (bool, default=True) – Whether to make a copy of X. If False, the input X gets overwritten during fitting.
max_iter (int, default=1000) – Maximum number of iterations.
noise_variance_init (array-like of shape (n_features,), default=None) – The initial guess of the noise variance for each feature. If None, it defaults to np.ones(n_features).
svd_method ({'lapack', 'randomized'}, default='randomized') – Which SVD method to use. If ‘lapack’ use standard SVD from scipy.linalg, if ‘randomized’ use fast randomized_svd function. Defaults to ‘randomized’. For most applications ‘randomized’ will be sufficiently precise while providing significant speed gains. Accuracy can also be improved by setting higher values for iterated_power. If this is not sufficient, for maximum precision you should choose ‘lapack’.
iterated_power (int, default=3) – Number of iterations for the power method. 3 by default. Only used if svd_method equals ‘randomized’.
rotation ({'varimax', 'quartimax'}, default=None) –
If not None, apply the indicated rotation. Currently, varimax and quartimax are implemented. See “The varimax criterion for analytic rotation in factor analysis” H. F. Kaiser, 1958.

New in version 0.24.
random_state (int or RandomState instance, default=0) – Only used when svd_method equals ‘randomized’. Pass an int for reproducible results across multiple function calls. See Glossary.

components_#

Components with maximum variance.

Type:: ndarray of shape (n_components, n_features)

loglike_#

The log likelihood at each iteration.

Type:: list of shape (n_iterations,)

noise_variance_#

The estimated noise variance for each feature.

Type:: ndarray of shape (n_features,)

n_iter_#

Number of iterations run.

Type:: int

mean_#

Per-feature empirical mean, estimated from the training set.

Type:: ndarray of shape (n_features,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

PCA: Principal component analysis is also a latent linear variable model which however assumes equal noise variance for each feature. This extra assumption makes probabilistic PCA faster as it can be computed in closed form.
FastICA: Independent component analysis, a latent variable model with non-Gaussian latent variables.

References

David Barber, Bayesian Reasoning and Machine Learning, Algorithm 21.1.
Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 12.2.4.

Examples

>>> from sklearn.datasets import load_digits
>>> from sklearn.decomposition import FactorAnalysis
>>> X, _ = load_digits(return_X_y=True)
>>> transformer = FactorAnalysis(n_components=7, random_state=0)
>>> X_transformed = transformer.fit_transform(X)
>>> X_transformed.shape
(1797, 7)

fit(X, y=None)[source]#

Fit the FactorAnalysis model to X using an SVD-based approach.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Ignored parameter.

Returns:

self – FactorAnalysis class instance.

Return type:

object

get_covariance()[source]#

Compute data covariance with the FactorAnalysis model.

cov = components_.T @ components_ + diag(noise_variance_), where @ denotes the matrix product.

Returns:: cov – Estimated covariance of data.
Return type:: ndarray of shape (n_features, n_features)

get_precision()[source]#

Compute data precision matrix with the FactorAnalysis model.

Returns:: precision – Estimated precision of data.
Return type:: ndarray of shape (n_features, n_features)
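As a sanity-check sketch, the covariance returned by get_covariance can be rebuilt from the fitted attributes, and get_precision should be its inverse. The random data and the tolerance are assumptions made for this illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

# Low-rank part from the loadings plus a diagonal noise term.
cov = fa.get_covariance()
manual_cov = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)

# The precision matrix is the inverse of the model covariance.
precision = fa.get_precision()

print(np.allclose(cov, manual_cov))
print(np.allclose(precision @ cov, np.eye(5), atol=1e-6))
```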

score(X, y=None)[source]#

Compute the average log-likelihood of the samples.

Parameters:

X (ndarray of shape (n_samples, n_features)) – The data.
y (Ignored) – Ignored parameter.

Returns:

ll – Average log-likelihood of the samples under the current model.

Return type:

float

score_samples(X)[source]#

Compute the log-likelihood of each sample.

Parameters:: X (ndarray of shape (n_samples, n_features)) – The data.
Returns:: ll – Log-likelihood of each sample under the current model.
Return type:: ndarray of shape (n_samples,)
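The relationship between score and score_samples can be sketched as follows, again on arbitrary random data: the former is the mean of the latter.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

per_sample = fa.score_samples(X)  # one log-likelihood per row of X
average = fa.score(X)             # average log-likelihood over all rows

print(per_sample.shape)
print(np.isclose(average, per_sample.mean()))
```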

transform(X)[source]#

Apply dimensionality reduction to X using the model.

Compute the expected mean of the latent variables. See Barber, 21.2.33 (or Bishop, 12.66).

Parameters:: X (array-like of shape (n_samples, n_features)) – Training data.
Returns:: X_new – The latent variables of X.
Return type:: ndarray of shape (n_samples, n_components)

class watex.exlib.FeatureUnion(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]#

Bases: TransformerMixin, _BaseComposition

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

Parameters of the transformers may be set using its name and the parameter name separated by a ‘__’. A transformer may be replaced entirely by setting the parameter with its name to another transformer, removed by setting to ‘drop’ or disabled by setting to ‘passthrough’ (features are passed without transformation).

Read more in the User Guide.

New in version 0.13.

Parameters:

transformer_list (list of (str, transformer) tuples) –
List of transformer objects to be applied to the data. The first half of each tuple is the name of the transformer. The transformer can be ‘drop’ for it to be ignored or can be ‘passthrough’ for features to be passed unchanged.

New in version 1.1: Added the option “passthrough”.

Changed in version 0.22: Deprecated None as a transformer in favor of ‘drop’.
n_jobs (int, default=None) –
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Changed in version 0.20: n_jobs default changed from 1 to None.
transformer_weights (dict, default=None) – Multiplicative weights for features per transformer. Keys are transformer names, values the weights. Raises ValueError if key not present in transformer_list.
verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.
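
The effect of transformer_weights can be sketched as follows. The transformer names ("pca", "scaled"), the helper make_steps and the random input are assumptions chosen for this example; only the columns produced by the weighted transformer are expected to change.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(8, 3)  # illustrative data


def make_steps():
    # Hypothetical helper: returns fresh transformers for each union.
    return [("pca", PCA(n_components=1)), ("scaled", StandardScaler())]


plain = FeatureUnion(make_steps()).fit_transform(X)
weighted = FeatureUnion(
    make_steps(), transformer_weights={"pca": 2.0}
).fit_transform(X)

# The first column comes from "pca" and is multiplied by its weight;
# the remaining columns come from "scaled" and are left untouched.
print(plain.shape, weighted.shape)
```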

named_transformers#

Dictionary-like object, with the following attributes. Read-only attribute to access any transformer parameter by user given name. Keys are transformer names and values are transformer parameters.

New in version 1.2.

Type:: Bunch

n_features_in_#

Number of features seen during fit. Only defined if the underlying first transformer in transformer_list exposes such an attribute when fit.

New in version 0.24.

Type:: int

See also

make_union: Convenience function for simplified feature union construction.

Examples

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA, TruncatedSVD
>>> union = FeatureUnion([("pca", PCA(n_components=1)),
...                       ("svd", TruncatedSVD(n_components=2))])
>>> X = [[0., 1., 3], [2., 2., 5]]
>>> union.fit_transform(X)
array([[ 1.5       ,  3.0...,  0.8...],
       [-1.5       ,  5.7..., -0.4...]])

fit(X, y=None, **fit_params)[source]#

Fit all transformers using X.

Parameters:

X (iterable or array-like, depending on transformers) – Input data, used to fit transformers.
y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.
**fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

self – FeatureUnion class instance.

Return type:

object

fit_transform(X, y=None, **fit_params)[source]#

Fit all transformers, transform the data and concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.
y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.
**fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:: input_features (array-like of str or None, default=None) – Input features.
Returns:: feature_names_out – Transformed feature names.
Return type:: ndarray of str objects

get_params(deep=True)[source]#

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformer_list of the FeatureUnion.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: mapping of string to any

property n_features_in_#: Number of features seen during fit.

property named_transformers#

set_output(*, transform=None)[source]#

Set the output container when “transform” and “fit_transform” are called.

set_output will set the output of all estimators in transformer_list.

Parameters:

transform ({"default", "pandas"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
None: Transform configuration is unchanged

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**kwargs)[source]#

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformer_list.

Parameters:: **kwargs (dict) – Parameters of this estimator or parameters of estimators contained in transformer_list. Parameters of the transformers may be set using its name and the parameter name separated by a ‘__’.
Returns:: self – FeatureUnion class instance.
Return type:: object
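
The two uses of set_params described above (setting a nested parameter with ‘__’ and replacing a whole transformer with ‘drop’) might look like the following sketch. The transformer names and the random input are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

rng = np.random.RandomState(0)
X = rng.rand(6, 3)  # illustrative data

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("svd", TruncatedSVD(n_components=2, random_state=0))])
full_shape = union.fit_transform(X).shape

# Nested parameter: <transformer name>__<parameter name>.
union.set_params(pca__n_components=1)
nested_shape = union.fit_transform(X).shape

# Replace a whole transformer by 'drop' so that it is ignored.
union.set_params(svd="drop")
dropped_shape = union.fit_transform(X).shape

print(full_shape, nested_shape, dropped_shape)
```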

transform(X)[source]#

Transform X separately by each transformer, concatenate results.

Parameters:: X (iterable or array-like, depending on transformers) – Input data to be transformed.
Returns:: X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.
Return type:: array-like or sparse matrix of shape (n_samples, sum_n_components)

class watex.exlib.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)[source]#

Bases: BaseSearchCV

Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Read more in the User Guide.

Parameters:

estimator (estimator object) – This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
param_grid (dict or list of dictionaries) – Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
scoring (str, callable, list, tuple or dict, default=None) –
Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use:
- a single string (see the scoring parameter section of the User Guide);
- a callable (see the scoring section of the User Guide) that returns a single value.
If scoring represents multiple scores, one can use:
- a list or tuple of unique strings;
- a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
- a dictionary with metric names as keys and callables as values.
See the multi-metric grid search section of the User Guide for an example.
n_jobs (int, default=None) –
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Changed in version 0.20: n_jobs default changed from 1 to None.
refit (bool, str, or callable, default=True) –
Refit an estimator using the best found parameters on the whole dataset.

For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.

Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer.

See scoring parameter to know more about multiple metric evaluation.

See the scikit-learn gallery example plot_grid_search_digits.py to see how to design a custom selection strategy using a callable via refit.

Changed in version 0.20: Support for callable added.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- integer, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
verbose (int) –
Controls the verbosity: the higher, the more messages.
- >1 : the computation time for each fold and parameter candidate is displayed;
- >2 : the score is also displayed;
- >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
pre_dispatch (int, or str, default='2*n_jobs') –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
error_score ('raise' or numeric, default=np.nan) – Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
return_train_score (bool, default=False) –
If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However, computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

New in version 0.19.

Changed in version 0.21: Default value was changed from True to False
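
To make the param_grid semantics concrete, the sketch below uses ParameterGrid to enumerate the candidates that a list of dictionaries spans. The grid values are illustrative assumptions; each dictionary is expanded on its own, and the results are combined.

```python
from sklearn.model_selection import ParameterGrid

# Two separate grids: gamma is only searched for the 'rbf' kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 0.2]},
]

candidates = list(ParameterGrid(param_grid))

# 2 candidates from the first grid plus 2 * 2 from the second.
print(len(candidates))
for params in candidates:
    print(params)
```

GridSearchCV evaluates each of these candidates with cross-validation when the same param_grid is passed to it.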

cv_results_#

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

For instance, the table below

param_kernel	param_gamma	param_degree	split0_test_score	…	rank_t…
‘poly’	–	2	0.80	…	2
‘poly’	–	3	0.70	…	4
‘rbf’	0.1	–	0.80	…	3
‘rbf’	0.2	–	0.93	…	1

will be represented by a cv_results_ dict of:

{
'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                             mask = [False False False False]...),
'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                            mask = [ True  True False False]...),
'param_degree': masked_array(data = [2.0 3.0 -- --],
                             mask = [False False  True  True]...),
'split0_test_score'  : [0.80, 0.70, 0.80, 0.93],
'split1_test_score'  : [0.82, 0.50, 0.70, 0.78],
'mean_test_score'    : [0.81, 0.60, 0.75, 0.85],
'std_test_score'     : [0.01, 0.10, 0.05, 0.08],
'rank_test_score'    : [2, 4, 3, 1],
'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
'mean_train_score'   : [0.81, 0.74, 0.70, 0.90],
'std_train_score'    : [0.01, 0.19, 0.00, 0.03],
'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
'mean_score_time'    : [0.01, 0.06, 0.04, 0.04],
'std_score_time'     : [0.00, 0.00, 0.00, 0.01],
'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
}

NOTE

The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer’s name ('_<scorer_name>') instead of '_score' shown above. (‘split0_test_precision’, ‘mean_train_precision’ etc.)

Type:: dict of numpy (masked) ndarrays
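
The multi-metric key naming described in the note can be seen in a small sketch. The estimator, the grid and the two metric names ("accuracy", "f1_macro") are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    {"C": [1, 10]},
    scoring=["accuracy", "f1_macro"],
    refit="accuracy",  # scorer used to select best_params_
    cv=3,
)
search.fit(X, y)

# Keys end with the scorer name instead of the generic '_score'.
metric_keys = sorted(k for k in search.cv_results_ if k.startswith("mean_test_"))
print(metric_keys)
```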

best_estimator_#

Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.

See refit parameter for more information on allowed values.

Type:: estimator

best_score_#

Mean cross-validated score of the best_estimator_.

For multi-metric evaluation, this is present only if refit is specified.

This attribute is not available if refit is a function.

Type:: float

best_params_#

Parameter setting that gave the best results on the hold out data.

For multi-metric evaluation, this is present only if refit is specified.

Type:: dict

best_index_#

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).

For multi-metric evaluation, this is present only if refit is specified.

Type:: int
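
When refit is a callable, it receives cv_results_ and returns the index to use as best_index_. The selection rule below (prefer the most stable candidate among those within 0.02 of the top mean score) and the function name are hypothetical, shown only to sketch the mechanism.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def most_stable_near_best(cv_results):
    # Hypothetical rule: among candidates close to the best mean score,
    # pick the one with the smallest standard deviation across folds.
    mean = np.asarray(cv_results["mean_test_score"])
    std = np.asarray(cv_results["std_test_score"])
    close = np.flatnonzero(mean >= mean.max() - 0.02)
    return int(close[np.argmin(std[close])])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, refit=most_stable_near_best, cv=3)
search.fit(X, y)

print(search.best_index_, search.best_params_)
print(hasattr(search, "best_score_"))  # not set when refit is a callable
```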

scorer_#

Scorer function used on the held out data to choose the best parameters for the model.

For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.

Type:: function or a dict

n_splits_#

The number of cross-validation splits (folds/iterations).

Type:: int

refit_time_#

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.

New in version 0.20.

Type:: float

multimetric_#

Whether or not the scorers compute several metrics.

Type:: bool

classes_#

The class labels. This is present only if refit is specified and the underlying estimator is a classifier.

Type:: ndarray of shape (n_classes,)

n_features_in_#

Number of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes n_features_in_ when fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes feature_names_in_ when fit.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

ParameterGrid: Generates all the combinations of a hyperparameter grid.
train_test_split: Utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
sklearn.metrics.make_scorer: Make a scorer from a performance metric or loss function.

Notes

The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.

If n_jobs was set to a value higher than one, the data is copied for each point in the grid (and not n_jobs times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch. Then, the memory is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.

Examples

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})
>>> sorted(clf.cv_results_.keys())
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
 'param_C', 'param_kernel', 'params',...
 'rank_test_score', 'split0_test_score',...
 'split2_test_score', ...
 'std_fit_time', 'std_score_time', 'std_test_score']

class watex.exlib.IncrementalPCA(n_components=None, *, whiten=False, copy=True, batch_size=None)[source]#

Bases: _BasePCA

Incremental principal components analysis (IPCA).

Linear dimensionality reduction using Singular Value Decomposition of the data, keeping only the most significant singular vectors to project the data to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA, and allows sparse input.

This algorithm has constant memory complexity, on the order of batch_size * n_features, enabling use of np.memmap files without loading the entire file into memory. For sparse matrices, the input is converted to dense in batches (in order to be able to subtract the mean) which avoids storing the entire dense matrix at any one time.

The computational overhead of each SVD is O(batch_size * n_features ** 2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features ** 2) for PCA.

Read more in the User Guide.

New in version 0.16.

Parameters:

n_components (int, default=None) – Number of components to keep. If n_components is None, then n_components is set to min(n_samples, n_features).
whiten (bool, default=False) –
When True (False by default) the components_ vectors are divided by n_samples times components_ to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making data respect some hard-wired assumptions.
copy (bool, default=True) – If False, X will be overwritten. copy=False can be used to save memory but is unsafe for general use.
batch_size (int, default=None) – The number of samples to use for each batch. Only used when calling fit. If batch_size is None, then batch_size is inferred from the data and set to 5 * n_features, to provide a balance between approximation accuracy and memory consumption.

components_#

Principal axes in feature space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by decreasing explained_variance_.

Type:: ndarray of shape (n_components, n_features)

explained_variance_#

Variance explained by each of the selected components.

Type:: ndarray of shape (n_components,)

explained_variance_ratio_#

Percentage of variance explained by each of the selected components. If all components are stored, the sum of explained variances is equal to 1.0.

Type:: ndarray of shape (n_components,)

singular_values_#

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Type:: ndarray of shape (n_components,)

mean_#

Per-feature empirical mean, aggregate over calls to partial_fit.

Type:: ndarray of shape (n_features,)

var_#

Per-feature empirical variance, aggregate over calls to partial_fit.

Type:: ndarray of shape (n_features,)

noise_variance_#

The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf.

Type:: float

n_components_#

The estimated number of components. Relevant when n_components=None.

Type:: int

n_samples_seen_#

The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

Type:: int

batch_size_#

Inferred batch size from batch_size.

Type:: int
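
A brief sketch of the inference rule stated for batch_size: when it is left as None, batch_size_ is set to 5 * n_features during fit. The data shape is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))  # 4 features, illustrative data

ipca = IncrementalPCA(n_components=2).fit(X)  # batch_size=None

print(ipca.batch_size_)      # inferred from the number of features
print(ipca.n_samples_seen_)  # all rows were processed, batch by batch
```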

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

PCA: Principal component analysis (PCA).
KernelPCA: Kernel Principal component analysis (KPCA).
SparsePCA: Sparse Principal Components Analysis (SparsePCA).
TruncatedSVD: Dimensionality reduction using truncated SVD.

Notes

Implements the incremental PCA model from: D. Ross, J. Lim, R. Lin, M. Yang, Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, Volume 77, Issue 1-3, pp. 125-141, May 2008. See https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf

This model is an extension of the Sequential Karhunen-Loeve Transform from: A. Levy and M. Lindenbaum, Sequential Karhunen-Loeve Basis Extraction and its Application to Images, IEEE Transactions on Image Processing, Volume 9, Number 8, pp. 1371-1374, August 2000.

We have specifically abstained from an optimization used by authors of both papers, a QR decomposition used in specific situations to reduce the algorithmic complexity of the SVD. The source for this technique is Matrix Computations, Third Edition, G. Golub and C. Van Loan, Chapter 5, Section 5.4.4, pp. 252-253. This technique has been omitted because it is advantageous only when decomposing a matrix with n_samples (rows) >= 5/3 * n_features (columns), and hurts the readability of the implemented algorithm. This would be a good opportunity for future optimization, if it is deemed necessary.

References

D. Ross, J. Lim, R. Lin, M. Yang. Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, Volume 77, Issue 1-3, pp. 125-141, May 2008.

G. Golub and C. Van Loan. Matrix Computations, Third Edition, Chapter 5, Section 5.4.4, pp. 252-253.

Examples

>>> from sklearn.datasets import load_digits
>>> from sklearn.decomposition import IncrementalPCA
>>> from scipy import sparse
>>> X, _ = load_digits(return_X_y=True)
>>> transformer = IncrementalPCA(n_components=7, batch_size=200)
>>> # either partially fit on smaller batches of data
>>> transformer.partial_fit(X[:100, :])
IncrementalPCA(batch_size=200, n_components=7)
>>> # or let the fit function itself divide the data into batches
>>> X_sparse = sparse.csr_matrix(X)
>>> X_transformed = transformer.fit_transform(X_sparse)
>>> X_transformed.shape
(1797, 7)

fit(X, y=None)[source]#

Fit the model with X, using minibatches of size batch_size.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.

Returns:

self – Returns the instance itself.

Return type:

object

partial_fit(X, y=None, check_input=True)[source]#

Incremental fit with X. All of X is processed as a single batch.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.
check_input (bool, default=True) – Run check_array on X.

Returns:

self – Returns the instance itself.

Return type:

object
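
A typical out-of-core pattern is to call partial_fit once per chunk. The sketch below simulates this with an in-memory array split into ten pieces; the chunking scheme and data are assumptions for illustration, and in practice the chunks could come from a file or a np.memmap.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))  # stands in for data too large to load at once

ipca = IncrementalPCA(n_components=3)
for chunk in np.array_split(X, 10):
    # Each call processes the whole chunk as a single batch.
    ipca.partial_fit(chunk)

X_reduced = ipca.transform(X)
print(ipca.n_samples_seen_, X_reduced.shape)
```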

transform(X)[source]#

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set, using minibatches of size batch_size if X is sparse.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data, where n_samples is the number of samples and n_features is the number of features.
Returns:: X_new – Projection of X in the first principal components.
Return type:: ndarray of shape (n_samples, n_components)

Examples

>>> import numpy as np
>>> from sklearn.decomposition import IncrementalPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2],
...               [1, 1], [2, 1], [3, 2]])
>>> ipca = IncrementalPCA(n_components=2, batch_size=3)
>>> ipca.fit(X)
IncrementalPCA(batch_size=3, n_components=2)
>>> ipca.transform(X) 

class watex.exlib.KMeans(n_clusters=8, *, init='k-means++', n_init='warn', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')[source]#

Bases: _BaseKMeans

K-Means clustering.

Read more in the User Guide.

Parameters:

n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.
init ({'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++') –
Method for initialization:

‘k-means++’ : selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
n_init ('auto' or int, default=10) –
Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).

When n_init=‘auto’, the number of runs depends on the value of init: 10 if using init=‘random’, 1 if using init=‘k-means++’.

New in version 1.2: Added ‘auto’ option for n_init.

Changed in version 1.4: Default value for n_init will change from 10 to ‘auto’ in version 1.4.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=None) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
copy_x (bool, default=True) – When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.
algorithm ({"lloyd", "elkan", "auto", "full"}, default="lloyd") –
K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

“auto” and “full” are deprecated and they will be removed in Scikit-Learn 1.3. They are both aliases for “lloyd”.

Changed in version 0.18: Added Elkan algorithm

Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”. Changed “auto” to use “lloyd” instead of “elkan”.

cluster_centers_#

Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

Type:: ndarray of shape (n_clusters, n_features)

labels_#

Labels of each point.

Type:: ndarray of shape (n_samples,)

inertia_#

Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

Type:: float
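
The definition of inertia_ can be reproduced by hand. This is a minimal sketch for the unweighted case, reusing the small array from the class example and assuming the run converges so that labels_ and cluster_centers_ are consistent.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Squared distance of every sample to the center of its assigned cluster.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(np.isclose(manual_inertia, kmeans.inertia_))
```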

n_iter_#

Number of iterations run.

Type:: int

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

MiniBatchKMeans: Alternative online implementation that does incremental updates of the centers positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. Refer to “How slow is the k-means method?”, D. Arthur and S. Vassilvitskii, SoCG 2006, for more details.

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can converge to a local minimum. That’s why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

Examples

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

fit(X, y=None, sample_weight=None)[source]#

Compute k-means clustering.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
y (Ignored) – Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None) –
The weights for each observation in X. If None, all observations are assigned equal weight.

New in version 0.20.

Returns:

self – Fitted estimator.

Return type:

object
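To make the effect of sample_weight concrete, here is a minimal sketch with a single cluster, where the fitted centre reduces to the (weighted) mean of the observations. The two-point dataset and the weights are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two 1-D observations; the second one counts three times as much.
X = np.array([[0.0], [10.0]])
weights = np.array([1.0, 3.0])

# With a single cluster, the centre is the mean of the data.
unweighted = KMeans(n_clusters=1, n_init=1, random_state=0).fit(X)

# Passing sample_weight turns that mean into a weighted mean.
weighted = KMeans(n_clusters=1, n_init=1, random_state=0).fit(
    X, sample_weight=weights
)

print(unweighted.cluster_centers_.ravel())  # plain mean of 0 and 10
print(weighted.cluster_centers_.ravel())    # pulled towards the heavier point
```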

class watex.exlib.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)[source]#

Bases: KNeighborsMixin, ClassifierMixin, NeighborsBase

Classifier implementing the k-nearest neighbors vote.

Read more in the User Guide.

Parameters:

n_neighbors (int, default=5) – Number of neighbors to use by default for kneighbors() queries.
weights ({'uniform', 'distance'}, callable or None, default='uniform') –
Weight function used in prediction. Possible values:
- ’uniform’ : uniform weights. All points in each neighborhood are weighted equally.
- ’distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') –
Algorithm used to compute the nearest neighbors:
- ’ball_tree’ will use BallTree
- ’kd_tree’ will use KDTree
- ’brute’ will use a brute-force search.
- ’auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size (int, default=30) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p (int, default=2) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric (str or callable, default='minkowski') –
Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2. See the documentation of scipy.spatial.distance and the metrics listed in distance_metrics for valid metric values.

If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.

If metric is a callable function, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
metric_params (dict, default=None) – Additional keyword arguments for the metric function.
n_jobs (int, default=None) – The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Doesn’t affect fit() method.

classes_#

Class labels known to the classifier

Type:: array of shape (n_classes,)

effective_metric_#

The distance metric used. It will be the same as the metric parameter or a synonym of it, e.g. ‘euclidean’ if the metric parameter is set to ‘minkowski’ and the p parameter is set to 2.

Type:: str or callable

effective_metric_params_#

Additional keyword arguments for the metric function. For most metrics this will be the same as the metric_params parameter, but it may also contain the p parameter value if the effective_metric_ attribute is set to ‘minkowski’.

Type:: dict

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_samples_fit_#

Number of samples in the fitted data.

Type:: int

outputs_2d_#

False when y’s shape is (n_samples, ) or (n_samples, 1) during fit, otherwise True.

Type:: bool

See also

RadiusNeighborsClassifier: Classifier based on neighbors within a fixed radius.
KNeighborsRegressor: Regression based on k-nearest neighbors.
RadiusNeighborsRegressor: Regression based on neighbors within a fixed radius.
NearestNeighbors: Unsupervised learner for implementing neighbor searches.

Notes

See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and leaf_size.

Warning

Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.

https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

Examples

>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.666... 0.333...]]
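The weights parameter described above can change the predicted label when one neighbor is much closer than the others. The sketch below compares ‘uniform’, ‘distance’ and a callable; the toy data and the gaussian_weights helper are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]
# The three nearest training points to 1.9 are 2 (class 1), 1 and 0 (class 0).
query = [[1.9]]

# Plain majority vote: two of the three neighbors belong to class 0.
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)

# Inverse-distance vote: the very close class-1 neighbor dominates.
by_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)


def gaussian_weights(distances):
    """Hypothetical user-defined weighting: same shape in, same shape out."""
    return np.exp(-distances ** 2)


custom = KNeighborsClassifier(n_neighbors=3, weights=gaussian_weights).fit(X, y)

print(uniform.predict(query))      # class 0
print(by_distance.predict(query))  # class 1
print(custom.predict(query))       # class 1
```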

fit(X, y)[source]#

Fit the k-nearest neighbors classifier from the training dataset.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples) if metric='precomputed') – Training data.
y ({array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_outputs)) – Target values.

Returns:

self – The fitted k-nearest neighbors classifier.

Return type:

KNeighborsClassifier

predict(X)[source]#

Predict the class labels for the provided data.

Parameters:: X ({array-like, sparse matrix} of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed') – Test samples.
Returns:: y – Class labels for each data sample.
Return type:: ndarray of shape (n_queries,) or (n_queries, n_outputs)
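When metric='precomputed', fit expects a square (n_samples, n_samples) distance matrix and predict expects an (n_queries, n_indexed) matrix of distances from each query to each training sample. The following sketch, which uses sklearn.metrics.pairwise_distances and made-up data as assumptions, checks that this route gives the same labels as fitting on the raw features.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_query = np.array([[1.1], [2.6]])

# Euclidean distances computed up front.
D_train = pairwise_distances(X_train)           # shape (4, 4), square
D_query = pairwise_distances(X_query, X_train)  # shape (2, 4)

precomputed = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
precomputed.fit(D_train, y_train)
labels_precomputed = precomputed.predict(D_query)

# Reference: let the estimator compute the distances itself.
direct = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
labels_direct = direct.predict(X_query)

print(labels_precomputed, labels_direct)
```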

predict_proba(X)[source]#

Return probability estimates for the test data X.

Parameters:: X ({array-like, sparse matrix} of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed') – Test samples.
Returns:: p – The class probabilities of the input samples. Classes are ordered by lexicographic order.
Return type:: ndarray of shape (n_queries, n_classes), or a list of n_outputs of such arrays if n_outputs > 1.

class watex.exlib.KernelPCA(n_components=None, *, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False, eigen_solver='auto', tol=0, max_iter=None, iterated_power='auto', remove_zero_eig=False, random_state=None, copy_X=True, n_jobs=None)[source]#

Bases: ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator

Kernel Principal component analysis (KPCA) [1].

Non-linear dimensionality reduction through the use of kernels (see metrics).

It uses the scipy.linalg.eigh() LAPACK implementation of the full SVD or the scipy.sparse.linalg.eigsh() ARPACK implementation of the truncated SVD, depending on the shape of the input data and the number of components to extract. It can also use a randomized truncated SVD by the method proposed in [3], see eigen_solver.

Read more in the User Guide.

Parameters:

n_components (int, default=None) – Number of components. If None, all non-zero components are kept.
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'cosine', 'precomputed'} or callable, default='linear') – Kernel used for PCA.
gamma (float, default=None) – Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features.
degree (int, default=3) – Degree for poly kernels. Ignored by other kernels.
coef0 (float, default=1) – Independent term in poly and sigmoid kernels. Ignored by other kernels.
kernel_params (dict, default=None) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.
alpha (float, default=1.0) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True).
fit_inverse_transform (bool, default=False) – Learn the inverse transform for non-precomputed kernels (i.e. learn to find the pre-image of a point). This method is based on [2].
eigen_solver ({'auto', 'dense', 'arpack', 'randomized'}, default='auto') –
Select eigensolver to use. If n_components is much less than the number of training samples, randomized (or arpack to a smaller extent) may be more efficient than the dense eigensolver. Randomized SVD is performed according to the method of Halko et al [3].

auto :
the solver is selected by a default policy based on n_samples (the number of training samples) and n_components: if the number of components to extract is less than 10 (strict) and the number of samples is more than 200 (strict), the ‘arpack’ method is enabled. Otherwise the exact full eigenvalue decomposition is computed and optionally truncated afterwards (‘dense’ method).

dense :
run exact full eigenvalue decomposition calling the standard LAPACK solver via scipy.linalg.eigh, and select the components by postprocessing

arpack :
run SVD truncated to n_components calling ARPACK solver using scipy.sparse.linalg.eigsh. It requires strictly 0 < n_components < n_samples

randomized :
run randomized SVD by the method of Halko et al. [3]. The current implementation selects eigenvalues based on their modulus (absolute value); therefore using this method can lead to unexpected results if the kernel is not positive semi-definite. See also [4].

Changed in version 1.0: ‘randomized’ was added.
tol (float, default=0) – Convergence tolerance for arpack. If 0, optimal value will be chosen by arpack.
max_iter (int, default=None) – Maximum number of iterations for arpack. If None, optimal value will be chosen by arpack.
iterated_power (int >= 0, or 'auto', default='auto') –
Number of iterations for the power method used when eigen_solver == ‘randomized’. When ‘auto’, it is set to 7 when n_components < 0.1 * min(X.shape); otherwise it is set to 4.

New in version 1.0.
remove_zero_eig (bool, default=False) – If True, then all components with zero eigenvalues are removed, so that the number of components in the output may be < n_components (and sometimes even zero due to numerical instability). When n_components is None, this parameter is ignored and components with zero eigenvalues are removed regardless.
random_state (int, RandomState instance or None, default=None) –
Used when eigen_solver == ‘arpack’ or ‘randomized’. Pass an int for reproducible results across multiple function calls. See Glossary.

New in version 0.18.
copy_X (bool, default=True) –
If True, input X is copied and stored by the model in the X_fit_ attribute. If no further changes will be done to X, setting copy_X=False saves memory by storing a reference.

New in version 0.18.
n_jobs (int, default=None) –
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

New in version 0.18.

eigenvalues_#

Eigenvalues of the centered kernel matrix in decreasing order. If n_components and remove_zero_eig are not set, then all values are stored.

Type:: ndarray of shape (n_components,)

eigenvectors_#

Eigenvectors of the centered kernel matrix. If n_components and remove_zero_eig are not set, then all components are stored.

Type:: ndarray of shape (n_samples, n_components)

dual_coef_#

Inverse transform matrix. Only available when fit_inverse_transform is True.

Type:: ndarray of shape (n_samples, n_features)

X_transformed_fit_#

Projection of the fitted data on the kernel principal components. Only available when fit_inverse_transform is True.

Type:: ndarray of shape (n_samples, n_components)

X_fit_#

The data used to fit the model. If copy_X=False, then X_fit_ is a reference. This attribute is used for the calls to transform.

Type:: ndarray of shape (n_samples, n_features)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

FastICA: A fast algorithm for Independent Component Analysis.
IncrementalPCA: Incremental Principal Component Analysis.
NMF: Non-Negative Matrix Factorization.
PCA: Principal Component Analysis.
SparsePCA: Sparse Principal Component Analysis.
TruncatedSVD: Dimensionality reduction using truncated SVD.

References

[1]

Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. “Kernel principal component analysis.” International conference on artificial neural networks. Springer, Berlin, Heidelberg, 1997.

[2]

Bakır, Gökhan H., Jason Weston, and Bernhard Schölkopf. “Learning to find pre-images.” Advances in neural information processing systems 16 (2004): 449-456.

[3]

Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.” SIAM review 53.2 (2011): 217-288.

[4]

Martinsson, Per-Gunnar, Vladimir Rokhlin, and Mark Tygert. “A randomized algorithm for the decomposition of matrices.” Applied and Computational Harmonic Analysis 30.1 (2011): 47-68.

Examples

>>> from sklearn.datasets import load_digits
>>> from sklearn.decomposition import KernelPCA
>>> X, _ = load_digits(return_X_y=True)
>>> transformer = KernelPCA(n_components=7, kernel='linear')
>>> X_transformed = transformer.fit_transform(X)
>>> X_transformed.shape
(1797, 7)

fit(X, y=None)[source]#

Fit the model from data in X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(X, y=None, **params)[source]#

Fit the model from data in X and transform X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.
**params (kwargs) – Parameters (keyword arguments) and values passed to the fit_transform instance.

Returns:

X_new – Projection of X onto the kernel principal components.

Return type:

ndarray of shape (n_samples, n_components)

inverse_transform(X)[source]#

Transform X back to original space.

inverse_transform approximates the inverse transformation using a learned pre-image. The pre-image is learned by kernel ridge regression of the original data on their low-dimensional representation vectors.

Note

When users want to compute inverse transformation for ‘linear’ kernel, it is recommended that they use PCA instead. Unlike PCA, KernelPCA’s inverse_transform does not reconstruct the mean of data when ‘linear’ kernel is used due to the use of centered kernel.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_components)) – Data in the transformed space, where n_samples is the number of samples and n_components is the number of components.
Returns:: X_new – Approximate pre-image of X in the original feature space.
Return type:: ndarray of shape (n_samples, n_features)

References

Bakır, Gökhan H., Jason Weston, and Bernhard Schölkopf. “Learning to find pre-images.” Advances in neural information processing systems 16 (2004): 449-456.
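A minimal round-trip sketch follows, assuming an RBF kernel on synthetic concentric circles; the gamma and alpha values are arbitrary choices for illustration, not recommended settings. fit_inverse_transform=True must be set at construction time, otherwise inverse_transform is unavailable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# fit_inverse_transform=True learns the pre-image map by kernel ridge
# regression; alpha is the ridge penalty of that regression.
kpca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=10,
    fit_inverse_transform=True,
    alpha=0.1,
)

Z = kpca.fit_transform(X)           # shape (n_samples, n_components)
X_back = kpca.inverse_transform(Z)  # shape (n_samples, n_features)

print(Z.shape, X_back.shape)
print("mean squared reconstruction error:", np.mean((X - X_back) ** 2))
```

The reconstruction is only approximate, so the reported error is generally non-zero.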

transform(X)[source]#

Transform X.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Data to transform, where n_samples is the number of samples and n_features is the number of features.
Returns:: X_new – Projection of X onto the kernel principal components.
Return type:: ndarray of shape (n_samples, n_components)

class watex.exlib.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False)[source]#

Bases: TransformerMixin, BaseEstimator

Binarize labels in a one-vs-all fashion.

Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.

At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels to binary labels (belong or does not belong to the class). LabelBinarizer makes this process easy with the transform method.

At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform method.

Read more in the User Guide.

Parameters:

neg_label (int, default=0) – Value with which negative labels must be encoded.
pos_label (int, default=1) – Value with which positive labels must be encoded.
sparse_output (bool, default=False) – True if the returned array from transform is desired to be in sparse CSR format.

classes_#

Holds the label for each class.

Type:: ndarray of shape (n_classes,)

y_type_#

Represents the type of the target data as evaluated by utils.multiclass.type_of_target. Possible types are ‘continuous’, ‘continuous-multioutput’, ‘binary’, ‘multiclass’, ‘multiclass-multioutput’, ‘multilabel-indicator’, and ‘unknown’.

Type:: str

sparse_input_#

True if the input data to transform is given as a sparse matrix, False otherwise.

Type:: bool

See also

label_binarize: Function to perform the transform operation of LabelBinarizer with fixed classes.
OneHotEncoder: Encode categorical features using a one-hot aka one-of-K scheme.

Examples

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer()
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

Binary targets transform to a column vector

>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit_transform(['yes', 'no', 'no', 'yes'])
array([[1],
       [0],
       [0],
       [1]])

Passing a 2D matrix for multilabel classification

>>> import numpy as np
>>> lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
LabelBinarizer()
>>> lb.classes_
array([0, 1, 2])
>>> lb.transform([0, 1, 2, 1])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0]])

fit(y)[source]#

Fit label binarizer.

Parameters:: y (ndarray of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, and represents multilabel classification.
Returns:: self – Returns the instance itself.
Return type:: object

fit_transform(y)[source]#

Fit label binarizer/transform multi-class labels to binary labels.

The output of transform is sometimes referred to as the 1-of-K coding scheme.

Parameters:: y ({ndarray, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, and represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.
Returns:: Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_classes)

inverse_transform(Y, threshold=None)[source]#

Transform binary labels back to multi-class labels.

Parameters:

Y ({ndarray, sparse matrix} of shape (n_samples, n_classes)) – Target values. All sparse matrices are converted to CSR before inverse transformation.
threshold (float, default=None) –
Threshold used in the binary and multi-label cases.

Use 0 when Y contains the output of decision_function (classifier). Use 0.5 when Y contains the output of predict_proba.

If None, the threshold is assumed to be half way between neg_label and pos_label.

Returns:

y – Target values. Sparse matrix will be of CSR format.

Return type:

{ndarray, sparse matrix} of shape (n_samples,)

Notes

In the case when the binary labels are fractional (probabilistic), inverse_transform chooses the class with the greatest value. Typically, this allows the output of a linear model’s decision_function method to be used directly as the input of inverse_transform.
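The behaviour described in this note can be sketched as follows. The score matrices are made-up values standing in for the output of a classifier; they are not produced by any real model.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Multi-class case: each row holds one score per class (illustrative values).
lb = LabelBinarizer().fit([1, 2, 4, 6])
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.2, 0.1, 0.3, 0.4]])
# The column with the greatest value decides the label of each row.
multiclass_labels = lb.inverse_transform(scores)

# Binary case: a single column is compared against the threshold.
lb_bin = LabelBinarizer().fit(["no", "yes"])
proba = np.array([[0.2], [0.9]])
binary_labels = lb_bin.inverse_transform(proba, threshold=0.5)

print(multiclass_labels)  # labels 2 and 6
print(binary_labels)      # labels 'no' and 'yes'
```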

transform(y)[source]#

Transform multi-class labels to binary labels.

The output of transform is sometimes referred to as the 1-of-K coding scheme.

Parameters:: y ({array, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, and represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.
Returns:: Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_classes)

class watex.exlib.LabelEncoder[source]#

Bases: TransformerMixin, BaseEstimator

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

Read more in the User Guide.

New in version 0.12.

classes_#

Holds the label for each class.

Type:: ndarray of shape (n_classes,)

See also

OrdinalEncoder: Encode categorical features using an ordinal encoding scheme.
OneHotEncoder: Encode categorical features as a one-hot numeric array.

Examples

LabelEncoder can be used to normalize labels.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

fit(y)[source]#

Fit label encoder.

Parameters:: y (array-like of shape (n_samples,)) – Target values.
Returns:: self – Fitted label encoder.
Return type:: returns an instance of self.

fit_transform(y)[source]#

Fit label encoder and return encoded labels.

Parameters:: y (array-like of shape (n_samples,)) – Target values.
Returns:: y – Encoded labels.
Return type:: array-like of shape (n_samples,)

inverse_transform(y)[source]#

Transform labels back to original encoding.

Parameters:: y (ndarray of shape (n_samples,)) – Target values.
Returns:: y – Original encoding.
Return type:: ndarray of shape (n_samples,)

transform(y)[source]#

Transform labels to normalized encoding.

Parameters:: y (array-like of shape (n_samples,)) – Target values.
Returns:: y – Labels as normalized encodings.
Return type:: array-like of shape (n_samples,)

class watex.exlib.LedoitWolf(*, store_precision=True, assume_centered=False, block_size=1000)[source]#

Bases: EmpiricalCovariance

LedoitWolf Estimator.

Ledoit-Wolf is a particular form of shrinkage, where the shrinkage coefficient is computed using O. Ledoit and M. Wolf’s formula as described in “A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.

Read more in the User Guide.

Parameters:

store_precision (bool, default=True) – Specify if the estimated precision is stored.
assume_centered (bool, default=False) – If True, data will not be centered before computation. Useful when working with data whose mean is almost, but not exactly zero. If False (default), data will be centered before computation.
block_size (int, default=1000) – Size of blocks into which the covariance matrix will be split during its Ledoit-Wolf estimation. This is purely a memory optimization and does not affect results.

covariance_#

Estimated covariance matrix.

Type:: ndarray of shape (n_features, n_features)

location_#

Estimated location, i.e. the estimated mean.

Type:: ndarray of shape (n_features,)

precision_#

Estimated pseudo-inverse matrix (stored only if store_precision is True).

Type:: ndarray of shape (n_features, n_features)

shrinkage_#

Coefficient in the convex combination used for the computation of the shrunk estimate. Range is [0, 1].

Type:: float

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

EllipticEnvelope: An object for detecting outliers in a Gaussian distributed dataset.
EmpiricalCovariance: Maximum likelihood covariance estimator.
GraphicalLasso: Sparse inverse covariance estimation with an l1-penalized estimator.
GraphicalLassoCV: Sparse inverse covariance with cross-validated choice of the l1 penalty.
MinCovDet: Minimum Covariance Determinant (robust estimator of covariance).
OAS: Oracle Approximating Shrinkage Estimator.
ShrunkCovariance: Covariance estimator with shrinkage.

Notes

The regularised covariance is:

(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

where mu = trace(cov) / n_features and shrinkage is given by the Ledoit and Wolf formula (see References).
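The formula above can be checked numerically. The sketch below rebuilds the regularised covariance from the fitted shrinkage_ attribute and the maximum likelihood covariance (taken here from EmpiricalCovariance, which is an assumption about what cov denotes in the formula) and compares it with covariance_.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf

# Synthetic Gaussian sample (illustrative data).
rng = np.random.RandomState(0)
real_cov = np.array([[0.4, 0.2], [0.2, 0.8]])
X = rng.multivariate_normal(mean=[0, 0], cov=real_cov, size=50)
n_features = X.shape[1]

lw = LedoitWolf().fit(X)

# Unshrunk (maximum likelihood) covariance of the same data.
cov = EmpiricalCovariance().fit(X).covariance_
mu = np.trace(cov) / n_features

# Convex combination of the empirical covariance and a scaled identity.
manual = (1 - lw.shrinkage_) * cov + lw.shrinkage_ * mu * np.identity(n_features)

print("shrinkage:", lw.shrinkage_)
print("matches covariance_:", np.allclose(manual, lw.covariance_))
```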

References

“A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.

Examples

>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf
>>> real_cov = np.array([[.4, .2],
...                      [.2, .8]])
>>> np.random.seed(0)
>>> X = np.random.multivariate_normal(mean=[0, 0],
...                                   cov=real_cov,
...                                   size=50)
>>> cov = LedoitWolf().fit(X)
>>> cov.covariance_
array([[0.4406..., 0.1616...],
       [0.1616..., 0.8022...]])
>>> cov.location_
array([ 0.0595... , -0.0075...])

fit(X, y=None)[source]#

Fit the Ledoit-Wolf shrunk covariance model to X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.

Returns:

self – Returns the instance itself.

Return type:

object

class watex.exlib.LinearSVC(penalty='l2', loss='squared_hinge', *, dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)[source]#

Bases: LinearClassifierMixin, SparseCoefMixin, BaseEstimator

Linear Support Vector Classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.

Read more in the User Guide.

Parameters:

penalty ({'l1', 'l2'}, default='l2') – Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
loss ({'hinge', 'squared_hinge'}, default='squared_hinge') – Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss. The combination of penalty='l1' and loss='hinge' is not supported.
dual (bool, default=True) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
tol (float, default=1e-4) – Tolerance for stopping criteria.
C (float, default=1.0) – Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
multi_class ({'ovr', 'crammer_singer'}, default='ovr') – Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.
fit_intercept (bool, default=True) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
intercept_scaling (float, default=1.0) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with a constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note that the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
class_weight (dict or 'balanced', default=None) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
verbose (int, default=0) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in liblinear that, if enabled, may not work properly in a multithreaded context.
random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generation for shuffling the data for the dual coordinate descent (if dual=True). When dual=False the underlying implementation of LinearSVC is not random and random_state has no effect on the results. Pass an int for reproducible output across multiple function calls. See Glossary.
max_iter (int, default=1000) – The maximum number of iterations to be run.

coef_#

Weights assigned to the features (coefficients in the primal problem).

coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.

Type:: ndarray of shape (1, n_features) if n_classes == 2 else (n_classes, n_features)

intercept_#

Constants in decision function.

Type:: ndarray of shape (1,) if n_classes == 2 else (n_classes,)

classes_#

The unique class labels.

Type:: ndarray of shape (n_classes,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_iter_#

Maximum number of iterations run across all classes.

Type:: int

See also

SVC: Implementation of Support Vector Machine classifier using libsvm: the kernel can be non-linear but its SMO algorithm does not scale to a large number of samples as LinearSVC does. Furthermore, SVC multi-class mode is implemented using a one-vs-one scheme while LinearSVC uses one-vs-the-rest. It is possible to implement one-vs-the-rest with SVC by using the OneVsRestClassifier wrapper. Finally, SVC can fit dense data without memory copy if the input is C-contiguous. Sparse data will still incur memory copy though.
sklearn.linear_model.SGDClassifier: SGDClassifier can optimize the same cost function as LinearSVC by adjusting the penalty and loss parameters. In addition it requires less memory, allows incremental (online) learning, and implements various loss functions and regularization regimes.

Notes

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

The underlying implementation, liblinear, uses a sparse internal representation for the data that will incur a memory copy.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.

References

LIBLINEAR: A Library for Large Linear Classification

Examples

>>> from sklearn.svm import LinearSVC
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = make_pipeline(StandardScaler(),
...                     LinearSVC(random_state=0, tol=1e-5))
>>> clf.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvc', LinearSVC(random_state=0, tol=1e-05))])

>>> print(clf.named_steps['linearsvc'].coef_)
[[0.141...   0.526... 0.679... 0.493...]]

>>> print(clf.named_steps['linearsvc'].intercept_)
[0.1693...]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

fit(X, y, sample_weight=None)[source]#

Fit the model according to the given training data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,), default=None) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

New in version 0.18.

Returns:

self – An instance of the estimator.

Return type:

object

class watex.exlib.LinearSVR(*, epsilon=0.0, tol=0.0001, C=1.0, loss='epsilon_insensitive', fit_intercept=True, intercept_scaling=1.0, dual=True, verbose=0, random_state=None, max_iter=1000)[source]#

Bases: RegressorMixin, LinearModel

Linear Support Vector Regression.

Similar to SVR with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

This class supports both dense and sparse input.

Read more in the User Guide.

New in version 0.16.

Parameters:

epsilon (float, default=0.0) – Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y. If unsure, set epsilon=0.
tol (float, default=1e-4) – Tolerance for stopping criteria.
C (float, default=1.0) – Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
loss ({'epsilon_insensitive', 'squared_epsilon_insensitive'}, default='epsilon_insensitive') – Specifies the loss function. The epsilon-insensitive loss (standard SVR) is the L1 loss, while the squared epsilon-insensitive loss (‘squared_epsilon_insensitive’) is the L2 loss.
fit_intercept (bool, default=True) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
intercept_scaling (float, default=1.0) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note that the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
dual (bool, default=True) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
verbose (int, default=0) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in liblinear that, if enabled, may not work properly in a multithreaded context.
random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generation for shuffling the data. Pass an int for reproducible output across multiple function calls. See Glossary.
max_iter (int, default=1000) – The maximum number of iterations to be run.
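The two loss options can be written out per sample as a small NumPy sketch. The helper names below are hypothetical and are not part of scikit-learn or liblinear; they only illustrate how epsilon defines a tube inside which residuals incur no loss, which is why a sensible epsilon depends on the scale of y.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """L1 variant: residuals smaller than epsilon cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

def squared_epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """L2 variant: the square of the epsilon-insensitive loss."""
    return epsilon_insensitive(y_true, y_pred, epsilon) ** 2

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 1.5, 5.0])  # absolute residuals: 0.05, 0.5, 2.0

l1_loss = epsilon_insensitive(y_true, y_pred, epsilon=0.1)
l2_loss = squared_epsilon_insensitive(y_true, y_pred, epsilon=0.1)
print(l1_loss)
print(l2_loss)
```

With epsilon=0.1 the first residual falls inside the tube and contributes nothing to either loss, while the squared variant penalizes the largest residual much more heavily.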

coef_#

Weights assigned to the features (coefficients in the primal problem).

coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.

Type:: ndarray of shape (n_features,)

intercept_#

Constants in decision function.

Type:: ndarray of shape (1,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_iter_#

Maximum number of iterations run.

Type:: int

See also

LinearSVC: Implementation of Support Vector Machine classifier using the same library as this class (liblinear).
SVR: Implementation of Support Vector Machine regression using libsvm: the kernel can be non-linear, but its SMO algorithm does not scale to a large number of samples as well as LinearSVR does.
sklearn.linear_model.SGDRegressor: SGDRegressor can optimize the same cost function as LinearSVR by adjusting the penalty and loss parameters. In addition it requires less memory, allows incremental (online) learning, and implements various loss functions and regularization regimes.

Examples

>>> from sklearn.svm import LinearSVR
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_features=4, random_state=0)
>>> regr = make_pipeline(StandardScaler(),
...                      LinearSVR(random_state=0, tol=1e-5))
>>> regr.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvr', LinearSVR(random_state=0, tol=1e-05))])

>>> print(regr.named_steps['linearsvr'].coef_)
[18.582... 27.023... 44.357... 64.522...]
>>> print(regr.named_steps['linearsvr'].intercept_)
[-4...]
>>> print(regr.predict([[0, 0, 0, 0]]))
[-2.384...]

fit(X, y, sample_weight=None)[source]#

Fit the model according to the given training data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,), default=None) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

New in version 0.18.

Returns:

self – An instance of the estimator.

Return type:

object

class watex.exlib.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)[source]#

Bases: LinearClassifierMixin, SparseCoefMixin, BaseEstimator

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization is only supported by the ‘saga’ solver.
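A minimal sketch of the solver/penalty pairing described above. The dataset and hyperparameter values are assumptions chosen for illustration. The second call pairs an L1 penalty with ‘lbfgs’, which is expected to be rejected with a ValueError in the scikit-learn versions this page documents.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Elastic-Net is only available with the 'saga' solver; scaling the
# features first helps 'saga' converge.
enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                       C=0.1, max_iter=5000, random_state=0),
).fit(X, y)
coef = enet.named_steps["logisticregression"].coef_
print(coef.shape)

# An unsupported combination: 'lbfgs' handles only L2 or no penalty.
try:
    LogisticRegression(penalty="l1", solver="lbfgs").fit(X, y)
except ValueError as exc:
    print("rejected:", exc)
```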

Read more in the User Guide.

Parameters:

penalty ({'l1', 'l2', 'elasticnet', None}, default='l2') –
Specify the norm of the penalty:
- None: no penalty is added;
- ’l2’: add an L2 penalty term (the default);
- ’l1’: add an L1 penalty term;
- ’elasticnet’: both L1 and L2 penalty terms are added.
Warning

Some penalties may not work with some solvers. See the parameter solver below, to know the compatibility between the penalty and solver.

New in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)

Deprecated since version 1.2: The ‘none’ option was deprecated in version 1.2, and will be removed in 1.4. Use None instead.
dual (bool, default=False) – Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.
tol (float, default=1e-4) – Tolerance for stopping criteria.
C (float, default=1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept (bool, default=True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
intercept_scaling (float, default=1) –
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
class_weight (dict or 'balanced', default=None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

New in version 0.17: class_weight=’balanced’
random_state (int, RandomState instance, default=None) – Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.
solver ({'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs') –
Algorithm to use in the optimization problem. Default is ‘lbfgs’. To choose a solver, you might want to consider the following aspects:
- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;
- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
- ’liblinear’ is limited to one-versus-rest schemes.
- ’newton-cholesky’ is a good choice for n_samples >> n_features, especially with one-hot encoded categorical features with rare categories. Note that it is limited to binary classification and the one-versus-rest reduction for multiclass classification. Be aware that the memory usage of this solver has a quadratic dependency on n_features because it explicitly computes the Hessian matrix.
Warning

The choice of the algorithm depends on the penalty chosen. Supported penalties by solver:
- ’lbfgs’ - [‘l2’, None]
- ’liblinear’ - [‘l1’, ‘l2’]
- ’newton-cg’ - [‘l2’, None]
- ’newton-cholesky’ - [‘l2’, None]
- ’sag’ - [‘l2’, None]
- ’saga’ - [‘elasticnet’, ‘l1’, ‘l2’, None]
Note

’sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

See also

Refer to the User Guide for more information regarding LogisticRegression and more specifically the Table summarizing solver/penalty supports.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.

Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in 0.22.

New in version 1.2: newton-cholesky solver.
max_iter (int, default=100) – Maximum number of iterations taken for the solvers to converge.
multi_class ({'auto', 'ovr', 'multinomial'}, default='auto') –
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.

Changed in version 0.22: Default changed from ‘ovr’ to ‘auto’ in 0.22.
verbose (int, default=0) – For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.
warm_start (bool, default=False) –
When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution. Has no effect with the liblinear solver. See the Glossary.

New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.
n_jobs (int, default=None) – Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
l1_ratio (float, default=None) – The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
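The “balanced” heuristic given for class_weight above can be checked by hand. The sketch below uses a small made-up label vector and compares the formula n_samples / (n_classes * np.bincount(y)) with scikit-learn's compute_class_weight utility.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 1, 1])  # toy labels: class 0 is twice as frequent
classes = np.unique(y)

# Formula quoted in the class_weight description.
manual = len(y) / (len(classes) * np.bincount(y))

# Same computation through the scikit-learn helper.
auto = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)
print(auto)
```

The rarer class receives the larger weight, so each class contributes roughly equally to the loss.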

classes_#

A list of class labels known to the classifier.

Type:: ndarray of shape (n_classes, )

coef_#

Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class=’multinomial’, coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

Type:: ndarray of shape (1, n_features) or (n_classes, n_features)

intercept_#

Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class=’multinomial’, intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).

Type:: ndarray of shape (1,) or (n_classes,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_iter_#

Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element. For the liblinear solver, only the maximum number of iterations across all classes is given.

Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed max_iter. n_iter_ will now report at most max_iter.

Type:: ndarray of shape (n_classes,) or (1, )

See also

SGDClassifier: Incrementally trained logistic regression (when given the parameter loss="log_loss").
LogisticRegressionCV: Logistic regression with built-in cross validation.

Notes

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.

References

L-BFGS-B – Software for Large-scale Bound-constrained Optimization: Ciyou Zhu, Richard Byrd, Jorge Nocedal and Jose Luis Morales. http://users.iems.northwestern.edu/~nocedal/lbfgsb.html
LIBLINEAR – A Library for Large Linear Classification: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
SAG – Mark Schmidt, Nicolas Le Roux, and Francis Bach: Minimizing Finite Sums with the Stochastic Average Gradient https://hal.inria.fr/hal-00860051/document
SAGA – Defazio, A., Bach F. & Lacoste-Julien S. (2014).: “SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives”
Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent: methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75. https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...

fit(X, y, sample_weight=None)[source]#

Fit the model according to the given training data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,), default=None) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

New in version 0.17: sample_weight support to LogisticRegression.

Returns:

Fitted estimator.

Return type:

self

Notes

The SAGA solver supports both float64 and float32 arrays.
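As an illustration of sample_weight, the sketch below assumes a synthetic dataset and checks that giving ten samples a weight of 2 leads to approximately the same coefficients as duplicating those ten samples. The agreement holds only up to solver tolerance, so tol is tightened here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Weight the first ten samples twice as heavily as the others.
weights = np.ones(len(y))
weights[:10] = 2.0
weighted = LogisticRegression(tol=1e-10, max_iter=10000).fit(
    X, y, sample_weight=weights
)

# Equivalent dataset in which those ten samples appear twice.
X_dup = np.vstack([X, X[:10]])
y_dup = np.concatenate([y, y[:10]])
duplicated = LogisticRegression(tol=1e-10, max_iter=10000).fit(X_dup, y_dup)

max_diff = np.abs(weighted.coef_ - duplicated.coef_).max()
print(max_diff)
```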

predict_log_proba(X)[source]#

Predict logarithm of probability estimates.

The returned estimates for all classes are ordered by the label of classes.

Parameters:: X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
Returns:: T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type:: array-like of shape (n_samples, n_classes)

predict_proba(X)[source]#

Probability estimates.

The returned estimates for all classes are ordered by the label of classes.

For a multi_class problem, if multi_class is set to “multinomial”, the softmax function is used to find the predicted probability of each class. Otherwise, a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.

Parameters:: X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
Returns:: T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type:: array-like of shape (n_samples, n_classes)
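A short check of the multinomial case described above, assuming the default settings select the multinomial formulation for the three-class iris data: applying the softmax to decision_function should reproduce predict_proba, and predict_log_proba should be its logarithm.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X[:5])   # shape (5, n_classes)
proba = clf.predict_proba(X[:5])
log_proba = clf.predict_log_proba(X[:5])

# Softmax over the class axis turns the scores into probabilities.
manual_proba = softmax(scores, axis=1)
print(np.allclose(manual_proba, proba))
print(np.allclose(np.log(proba), log_proba))
```

The columns of both outputs follow the order of clf.classes_.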

class watex.exlib.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.
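The transformation above can be reproduced directly with NumPy. The sketch below uses a feature_range of (-1, 1), chosen only for illustration, and compares the manual result with MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
lo, hi = -1.0, 1.0  # illustrative feature_range

# Formula from the description, applied column by column.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(X_manual)
```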

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters:

feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
copy (bool, default=True) – Set to False to perform in-place scaling and avoid a copy (if the input is already a numpy array).
clip (bool, default=False) –
Set to True to clip transformed values of held-out data to provided feature range.

New in version 0.24.
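A brief sketch of the effect of clip on held-out data, using a small made-up training set. A new value above the training maximum maps outside the feature range unless clip=True.

```python
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
new = [[2, 2]]  # first feature exceeds the training maximum of 1

unclipped = MinMaxScaler().fit(data).transform(new)
clipped = MinMaxScaler(clip=True).fit(data).transform(new)

print(unclipped)  # first value falls outside [0, 1]
print(clipped)    # first value is clipped to the upper bound
```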

min_#

Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

Type:: ndarray of shape (n_features,)

scale_#

Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

New in version 0.17: scale_ attribute.

Type:: ndarray of shape (n_features,)
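Taken together, min_ and scale_ express the fitted transformation as an affine map, X * scale_ + min_. The following sketch checks this on a small example with the default feature_range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
scaler = MinMaxScaler().fit(X)

# Apply the fitted attributes by hand.
X_affine = X * scaler.scale_ + scaler.min_

print(scaler.scale_)
print(scaler.min_)
print(np.allclose(X_affine, scaler.transform(X)))
```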

data_min_#

Per feature minimum seen in the data

New in version 0.17: data_min_

Type:: ndarray of shape (n_features,)

data_max_#

Per feature maximum seen in the data

New in version 0.17: data_max_

Type:: ndarray of shape (n_features,)

data_range_#

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

Type:: ndarray of shape (n_features,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

n_samples_seen_#

The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

minmax_scale: Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]

fit(X, y=None)[source]#

Compute the minimum and maximum to be used for later scaling.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

inverse_transform(X)[source]#

Undo the scaling of X according to feature_range.

Parameters:: X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
Returns:: Xt – Transformed data.
Return type:: ndarray of shape (n_samples, n_features)

partial_fit(X, y=None)[source]#

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible because n_samples is very large or because X is read from a continuous stream.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object
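A sketch of streaming use, assuming randomly generated data split into ten batches: calling partial_fit on each batch should yield the same statistics as a single fit on the full array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

full = MinMaxScaler().fit(X)

# Feed the data in ten chunks, as if it arrived from a stream.
streamed = MinMaxScaler()
for batch in np.array_split(X, 10):
    streamed.partial_fit(batch)

print(streamed.n_samples_seen_)
print(np.allclose(full.data_min_, streamed.data_min_))
print(np.allclose(full.data_max_, streamed.data_max_))
```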

transform(X)[source]#

Scale features of X according to feature_range.

Parameters:: X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.
Returns:: Xt – Transformed data.
Return type:: ndarray of shape (n_samples, n_features)

class watex.exlib.Normalizer(norm='l2', *, copy=True)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation in text classification and clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
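The cosine-similarity property can be checked on a small dense matrix. Plain numeric rows stand in for TF-IDF vectors here, which is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]], dtype=float)

# After l2 normalization every row has unit length.
X_unit = Normalizer(norm="l2").fit_transform(X)

# Pairwise dot products of unit vectors are cosine similarities.
dot_products = X_unit @ X_unit.T
print(np.allclose(dot_products, cosine_similarity(X)))
```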

Read more in the User Guide.

Parameters:

norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).
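The ‘l1’ and ‘max’ options can be written out row by row with NumPy. The sketch below uses a small made-up matrix with no all-zero rows, since the manual division would otherwise produce NaNs, whereas Normalizer leaves zero rows unchanged.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, -2.0, 2.0],
              [0.0, 3.0, -4.0]])

# l1: divide each row by the sum of its absolute values.
l1_manual = X / np.abs(X).sum(axis=1, keepdims=True)
# max: divide each row by its largest absolute value.
max_manual = X / np.abs(X).max(axis=1, keepdims=True)

l1_sklearn = Normalizer(norm="l1").fit_transform(X)
max_sklearn = Normalizer(norm="max").fit_transform(X)
print(l1_manual)
print(max_manual)
```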

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

normalize: Equivalent function without the estimator API.

Notes

This estimator is stateless and does not need to be fitted. However, we recommend calling fit_transform() instead of transform(), as parameter validation is only performed in fit().

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])

fit(X, y=None)[source]#

Only validates estimator’s parameters.

This method serves to (i) validate the estimator’s parameters and (ii) be consistent with the scikit-learn transformer API.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to estimate the normalization parameters.
y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Fitted transformer.

Return type:

object

transform(X, copy=None)[source]#

Scale each non zero row of X to unit norm.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to normalize, row by row. scipy.sparse matrices should be in CSR format to avoid an unnecessary copy.
copy (bool, default=None) – Copy the input X or not.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

class watex.exlib.OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)[source]#

Bases: _BaseEncoder

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters:

categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
- ’auto’ : Determine categories automatically from the training data.
- list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.

New in version 0.20.
drop ({'first', 'if_binary'} or an array-like of shape (n_features,), default=None) –
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
- None : retain all features (the default).
- ’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
- ’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
- array : drop[i] is the category in feature X[:, i] that should be dropped.
When max_categories or min_frequency is configured to group infrequent categories, the dropping behavior is handled after the grouping.

New in version 0.21: The parameter drop was added in 0.21.

Changed in version 0.23: The option drop=’if_binary’ was added in 0.23.

Changed in version 1.1: Support for dropping infrequent categories.
sparse (bool, default=True) –
Will return sparse matrix if set True else will return an array.

Deprecated since version 1.2: sparse is deprecated in 1.2 and will be removed in 1.4. Use sparse_output instead.
sparse_output (bool, default=True) –
Will return sparse matrix if set True else will return an array.

New in version 1.2: sparse was renamed to sparse_output
dtype (number type, default=float) – Desired dtype of output.
handle_unknown ({'error', 'ignore', 'infrequent_if_exist'}, default='error') –
Specifies the way unknown categories are handled during transform().
- ’error’ : Raise an error if an unknown category is present during transform.
- ’ignore’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
- ’infrequent_if_exist’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted ‘infrequent’ if it exists. If the ‘infrequent’ category does not exist, then transform() and inverse_transform() will handle an unknown category as with handle_unknown=’ignore’. Infrequent categories exist based on min_frequency and max_categories. Read more in the User Guide.
Changed in version 1.1: ‘infrequent_if_exist’ was added to automatically handle unknown categories and infrequent categories.
min_frequency (int or float, default=None) –
Specifies the minimum frequency below which a category will be considered infrequent.
- If int, categories with a smaller cardinality will be considered infrequent.
- If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.
New in version 1.1: Read more in the User Guide.
max_categories (int, default=None) –
Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

New in version 1.1: Read more in the User Guide.

categories_#

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

Type:: list of arrays

drop_idx_#

drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.
drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary.
drop_idx_ = None if all the transformed features will be retained.

If infrequent categories are enabled by setting min_frequency or max_categories to a non-default value and drop_idx_[i] corresponds to an infrequent category, then the entire infrequent category is dropped.

Changed in version 0.23: Added the possibility to contain None values.

Type:: array of shape (n_features,)

infrequent_categories_#

Defined only if infrequent categories are enabled by setting min_frequency or max_categories to a non-default value. infrequent_categories_[i] are the infrequent categories for feature i. If the feature i has no infrequent categories infrequent_categories_[i] is None.

New in version 1.1.

Type:: list of ndarray

n_features_in_#

Number of features seen during fit.

New in version 1.0.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

OrdinalEncoder: Performs an ordinal (integer) encoding of the categorical features.
sklearn.feature_extraction.DictVectorizer: Performs a one-hot encoding of dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher: Performs an approximate one-hot encoding of dictionary items or strings.
LabelBinarizer: Binarizes labels in a one-vs-all fashion.
MultiLabelBinarizer: Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column only for features that have exactly 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])

Infrequent categories are enabled by setting max_categories or min_frequency.

>>> import numpy as np
>>> X = np.array([["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3], dtype=object).T
>>> ohe = OneHotEncoder(max_categories=3, sparse_output=False).fit(X)
>>> ohe.infrequent_categories_
[array(['a', 'd'], dtype=object)]
>>> ohe.transform([["a"], ["b"]])
array([[0., 0., 1.],
       [1., 0., 0.]])

fit(X, y=None)[source]#

Fit OneHotEncoder to X.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns:

Fitted encoder.

Return type:

self

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

property infrequent_categories_#: Infrequent categories for each feature.

inverse_transform(X)[source]#

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.
Returns:: X_tr – Inverse transformed array.
Return type:: ndarray of shape (n_samples, n_features)

transform(X)[source]#

Transform X using one-hot encoding.

If there are infrequent categories for a feature, the infrequent categories will be grouped into a single category.

Parameters:: X (array-like of shape (n_samples, n_features)) – The data to encode.
Returns:: X_out – Transformed input. If sparse_output=True, a sparse matrix will be returned.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

class watex.exlib.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None, encoded_missing_value=nan)[source]#

Bases: OneToOneFeatureMixin, _BaseEncoder

Encode categorical features as an integer array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Read more in the User Guide.

New in version 0.20.

Parameters:

categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
- ’auto’ : Determine categories automatically from the training data.
- list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
dtype (number type, default=np.float64) – Desired dtype of output.
handle_unknown ({'error', 'use_encoded_value'}, default='error') –
When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform(), an unknown category will be denoted as None.

New in version 0.24.
unknown_value (int or np.nan, default=None) –
When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.

New in version 0.24.
encoded_missing_value (int or np.nan, default=np.nan) –
Encoded value of missing categories. If set to np.nan, then the dtype parameter must be a float dtype.

New in version 1.1.

categories_#

The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.

Type:: list of arrays

n_features_in_#

Number of features seen during fit.

New in version 1.0.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

OneHotEncoder: Performs a one-hot encoding of categorical features.
LabelEncoder: Encodes target labels with values between 0 and n_classes-1.

Notes

With a high proportion of nan values, inferring categories becomes slow with Python versions before 3.10. The handling of nan values was improved from Python 3.10 onwards (cf. bpo-43475).

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])

>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)

By default, OrdinalEncoder is lenient towards missing values by propagating them.

>>> import numpy as np
>>> X = [['Male', 1], ['Female', 3], ['Female', np.nan]]
>>> enc.fit_transform(X)
array([[ 1.,  0.],
       [ 0.,  1.],
       [ 0., nan]])

You can use the parameter encoded_missing_value to encode missing values.

>>> enc.set_params(encoded_missing_value=-1).fit_transform(X)
array([[ 1.,  0.],
       [ 0.,  1.],
       [ 0., -1.]])
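
The handle_unknown='use_encoded_value' option is not covered by the examples above. A minimal sketch, assuming -1 as the code reserved for unknown categories and 'Other' as a hypothetical category that was not seen during fit:

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# unknown_value must differ from every code assigned during fit (0, 1, 2 here).
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(X)

# 'Other' is unknown for the first feature and receives the code -1.
codes = enc.transform([['Female', 3], ['Other', 2]])

# inverse_transform maps the unknown code back to None.
restored = enc.inverse_transform(codes)

print(codes)
print(restored)
```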

fit(X, y=None)[source]#

Fit the OrdinalEncoder to X.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns:

self – Fitted encoder.

Return type:

object

inverse_transform(X)[source]#

Convert the data back to the original representation.

Parameters:: X (array-like of shape (n_samples, n_encoded_features)) – The transformed data.
Returns:: X_tr – Inverse transformed array.
Return type:: ndarray of shape (n_samples, n_features)

transform(X)[source]#

Transform X to ordinal codes.

Parameters:: X (array-like of shape (n_samples, n_features)) – The data to encode.
Returns:: X_out – Transformed input.
Return type:: ndarray of shape (n_samples, n_features)

class watex.exlib.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)[source]#

Bases: _BasePCA

Principal component analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.

Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.

Read more in the User Guide.

Parameters:

n_components (int, float or 'mle', default=None) –
Number of components to keep. If n_components is not set, all components are kept:
```
n_components == min(n_samples, n_features)
```
If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.

Hence, the None case results in:
```
n_components == min(n_samples, n_features) - 1
```
copy (bool, default=True) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten (bool, default=False) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') –

If auto :
The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

If full :
run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing

If arpack :
run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)

If randomized :
run randomized SVD by the method of Halko et al.

New in version 0.18.0.
tol (float, default=0.0) –
Tolerance for singular values computed by svd_solver == ‘arpack’. Must be in the range [0.0, infinity).

New in version 0.18.0.
iterated_power (int or 'auto', default='auto') –
Number of iterations for the power method computed by svd_solver == ‘randomized’. Must be in the range [0, infinity).

New in version 0.18.0.
n_oversamples (int, default=10) –
This parameter is only relevant when svd_solver=”randomized”. It corresponds to the additional number of random vectors to sample the range of X so as to ensure proper conditioning. See randomized_svd() for more details.

New in version 1.1.
power_iteration_normalizer ({'auto', 'QR', 'LU', 'none'}, default='auto') –
Power iteration normalizer for randomized SVD solver. Not used by ARPACK. See randomized_svd() for more details.

New in version 1.1.
random_state (int, RandomState instance or None, default=None) –
Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int for reproducible results across multiple function calls. See Glossary.

New in version 0.18.0.

components_#

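Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by decreasing explained_variance_.

Type:: ndarray of shape (n_components, n_features)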

explained_variance_#

The amount of variance explained by each of the selected components. The variance estimation uses n_samples - 1 degrees of freedom.

Equal to n_components largest eigenvalues of the covariance matrix of X.

New in version 0.18.

Type:: ndarray of shape (n_components,)

explained_variance_ratio_#

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

Type:: ndarray of shape (n_components,)

singular_values_#

The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

New in version 0.19.

Type:: ndarray of shape (n_components,)

mean_#

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

Type:: ndarray of shape (n_features,)

n_components_#

The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.

Type:: int

n_features_#

Number of features in the training data.

Type:: int

n_samples_#

Number of samples in the training data.

Type:: int

noise_variance_#

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.

Type:: float

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

KernelPCA: Kernel Principal Component Analysis.
SparsePCA: Sparse Principal Component Analysis.
TruncatedSVD: Dimensionality reduction using truncated SVD.
IncrementalPCA: Incremental Principal Component Analysis.

References

For n_components == ‘mle’, this class uses the method from: Minka, T. P. “Automatic choice of dimensionality for PCA”. In NIPS, pp. 598-604.

Implements the probabilistic PCA model from: Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic principal component analysis”. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622. via the score and score_samples methods.

For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.

For svd_solver == ‘randomized’, see: Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”. SIAM review, 53(2), 217-288. and also Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). “A randomized algorithm for the decomposition of matrices”. Applied and Computational Harmonic Analysis, 30(1), 47-68.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(n_components=2)
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.0075...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]

>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X)
PCA(n_components=2, svd_solver='full')
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.00755...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]

>>> pca = PCA(n_components=1, svd_solver='arpack')
>>> pca.fit(X)
PCA(n_components=1, svd_solver='arpack')
>>> print(pca.explained_variance_ratio_)
[0.99244...]
>>> print(pca.singular_values_)
[6.30061...]
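
The fractional form of n_components and the relation between singular_values_ and explained_variance_ can be checked with a short sketch. The synthetic data and its per-feature scales are assumptions chosen so that two directions carry most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 200 samples, 5 independent features with very different scales.
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# Keep as many components as needed to explain more than 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(X)
n_kept = pca.n_components_
ratio_sum = pca.explained_variance_ratio_.sum()

# The variance estimate uses n_samples - 1 degrees of freedom, so it can be
# recovered from the singular values of the centered data.
from_singular = pca.singular_values_ ** 2 / (X.shape[0] - 1)

print(n_kept, ratio_sum)
print(np.allclose(from_singular, pca.explained_variance_))
```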

fit(X, y=None)[source]#

Fit the model with X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Ignored.

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(X, y=None)[source]#

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Ignored.

Returns:

X_new – Transformed values.

Return type:

ndarray of shape (n_samples, n_components)

Notes

This method returns a Fortran-ordered array. To convert it to a C-ordered array, use ‘np.ascontiguousarray’.
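
As an illustration of fit_transform combined with the whiten option, the sketch below checks that the whitened outputs are uncorrelated with unit variance. The random correlated data is an assumption, and the unit variance is measured with ddof=1 because that is what this check relies on.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Hypothetical correlated data: mix 4 independent columns with a random matrix.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

pca = PCA(n_components=2, whiten=True, svd_solver="full")
X_white = pca.fit_transform(X)

# With the default copy=True, transform on the same data gives the same result.
X_again = pca.transform(X)

# Whitened components: unit sample variance and no correlation between them.
stds = X_white.std(axis=0, ddof=1)
corr = np.corrcoef(X_white, rowvar=False)

# Convert to a C-ordered array if downstream code requires it.
X_c = np.ascontiguousarray(X_white)

print(stds)
print(corr[0, 1])
```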

property n_features_#

score(X, y=None)[source]#

Return the average log-likelihood of all samples.

See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:

X (array-like of shape (n_samples, n_features)) – The data.
y (Ignored) – Ignored.

Returns:

ll – Average log-likelihood of the samples under the current model.

Return type:

float

score_samples(X)[source]#

Return the log-likelihood of each sample.

See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:: X (array-like of shape (n_samples, n_features)) – The data.
Returns:: ll – Log-likelihood of each sample under the current model.
Return type:: ndarray of shape (n_samples,)
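
The relationship between score and score_samples can be verified directly: the former is the mean of the latter. A minimal sketch on random data (the data and the choice of n_components=1 are assumptions; fewer components than features are kept so that the noise variance is non-zero):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=1).fit(X)

per_sample = pca.score_samples(X)  # one log-likelihood per row of X
average = pca.score(X)             # a single float

print(per_sample.shape)
print(np.isclose(average, per_sample.mean()))
```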

class watex.exlib.Pipeline(steps, *, memory=None, verbose=False)[source]#

Bases: _BaseComposition

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.

Read more in the User Guide.

New in version 0.5.

Parameters:

steps (list of tuple) – List of (name, transform) tuples (implementing fit/transform) that are chained in sequential order. The last transform must be an estimator.
memory (str or object with the joblib.Memory interface, default=None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
verbose (bool, default=False) – If True, the time elapsed while fitting each step will be printed as it is completed.

named_steps#

Dictionary-like object. Read-only attribute to access any step by its user-given name. Keys are step names and values are the step objects.

Type:: Bunch

classes_#

The class labels. Only exists if the last step of the pipeline is a classifier.

Type:: ndarray of shape (n_classes,)

n_features_in_#

Number of features seen during fit. Only defined if the underlying first estimator in steps exposes such an attribute when fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

make_pipeline: Convenience function for simplified pipeline construction.

Examples

>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88

property classes_#: The class labels. Only exists if the last step is a classifier.

decision_function(X)[source]#

Transform the data, and apply decision_function with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls decision_function method. Only valid if the final estimator implements decision_function.

Parameters:: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:: y_score – Result of calling decision_function on the final estimator.
Return type:: ndarray of shape (n_samples, n_classes)

property feature_names_in_#: Names of features seen during first step fit method.

fit(X, y=None, **fit_params)[source]#

Fit the model.

Fit all the transformers one after the other and transform the data. Finally, fit the transformed data using the final estimator.

Parameters:

X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

self – Pipeline with fitted steps.

Return type:

object
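
The s__p naming of fit parameters can be illustrated with a sample weight routed to the final step. This is a sketch under assumptions: the step names "scaler" and "tree" and the toy data are made up, and a decision tree is used only because its fit method accepts sample_weight.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])


def make_pipe():
    # Build a fresh, unfitted pipeline for each experiment.
    return Pipeline([
        ("scaler", StandardScaler()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ])


plain = make_pipe().fit(X, y)

# "tree__sample_weight" is forwarded as sample_weight to the "tree" step's fit.
# Zero weights on the class-1 samples make the tree ignore them.
weights = np.array([1.0, 1.0, 0.0, 0.0])
weighted = make_pipe().fit(X, y, tree__sample_weight=weights)

plain_pred = plain.predict([[3.0]])
weighted_pred = weighted.predict([[3.0]])

print(plain_pred, weighted_pred)
```

The two predictions differ, which shows that the weight reached the final estimator rather than being dropped by the pipeline.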

fit_predict(X, y=None, **fit_params)[source]#

Transform the data, and apply fit_predict with the final estimator.

Call fit_transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls fit_predict method. Only valid if the final estimator implements fit_predict.

Parameters:

X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

y_pred – Result of calling fit_predict on the final estimator.

Return type:

ndarray

fit_transform(X, y=None, **fit_params)[source]#

Fit the model and transform with the final estimator.

Fits all the transformers one after the other and transforms the data. Then uses fit_transform on the transformed data with the final estimator.

Parameters:

X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns:

Xt – Transformed samples.

Return type:

ndarray of shape (n_samples, n_transformed_features)

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Transform input features using the pipeline.

Parameters:: input_features (array-like of str or None, default=None) – Input features.
Returns:: feature_names_out – Transformed feature names.
Return type:: ndarray of str objects

get_params(deep=True)[source]#

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the steps of the Pipeline.

Parameters:: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:: params – Parameter names mapped to their values.
Return type:: mapping of string to any

inverse_transform(Xt)[source]#

Apply inverse_transform for each step in a reverse order.

All estimators in the pipeline must support inverse_transform.

Parameters:: Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
Returns:: Xt – Inverse transformed data, that is, data in the original feature space.
Return type:: ndarray of shape (n_samples, n_features)
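
A round trip through a pipeline made only of invertible transformers shows inverse_transform undoing the steps in reverse order. This is a minimal sketch; the step names and the small data matrix are assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 50.0]])

# Both steps implement inverse_transform, so the pipeline does too.
pipe = Pipeline([("standardize", StandardScaler()), ("rescale", MinMaxScaler())])

Xt = pipe.fit_transform(X)          # standardize, then rescale to [0, 1]
X_back = pipe.inverse_transform(Xt)  # undo rescale, then undo standardize

print(Xt)
print(np.allclose(X_back, X))
```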

property n_features_in_#: Number of features seen during first step fit method.

property named_steps#

Access the steps by name.

Read-only attribute to access any step by its given name. Keys are step names and values are the step objects.

predict(X, **predict_params)[source]#

Transform the data, and apply predict with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls predict method. Only valid if the final estimator implements predict.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_params (dict of string -> object) –
Parameters to the predict called at the end of all transformations in the pipeline. Note that while this may be used to return uncertainties from some models with return_std or return_cov, uncertainties that are generated by the transformations in the pipeline are not propagated to the final estimator.

New in version 0.20.

Returns:

y_pred – Result of calling predict on the final estimator.

Return type:

ndarray
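
What predict does internally can be reproduced by hand: transform with each fitted transformer, then call predict on the final estimator. The sketch below assumes a two-step pipeline with hypothetical step names "scaler" and "clf".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=80, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Prediction through the pipeline.
pipe_pred = pipe.predict(X)

# The same computation done step by step with the fitted steps.
scaler = pipe.named_steps["scaler"]
clf = pipe.named_steps["clf"]
manual_pred = clf.predict(scaler.transform(X))

print(np.array_equal(pipe_pred, manual_pred))
```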

predict_log_proba(X, **predict_log_proba_params)[source]#

Transform the data, and apply predict_log_proba with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls predict_log_proba method. Only valid if the final estimator implements predict_log_proba.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_log_proba_params (dict of string -> object) – Parameters to the predict_log_proba called at the end of all transformations in the pipeline.

Returns:

y_log_proba – Result of calling predict_log_proba on the final estimator.

Return type:

ndarray of shape (n_samples, n_classes)

predict_proba(X, **predict_proba_params)[source]#

Transform the data, and apply predict_proba with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls predict_proba method. Only valid if the final estimator implements predict_proba.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_proba_params (dict of string -> object) – Parameters to the predict_proba called at the end of all transformations in the pipeline.

Returns:

y_proba – Result of calling predict_proba on the final estimator.

Return type:

ndarray of shape (n_samples, n_classes)

score(X, y=None, sample_weight=None)[source]#

Transform the data, and apply score with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls score method. Only valid if the final estimator implements score.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
y (iterable, default=None) – Targets used for scoring. Must fulfill label requirements for all steps of the pipeline.
sample_weight (array-like, default=None) – If not None, this argument is passed as sample_weight keyword argument to the score method of the final estimator.

Returns:

score – Result of calling score on the final estimator.

Return type:

float

score_samples(X)[source]#

Transform the data, and apply score_samples with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls score_samples method. Only valid if the final estimator implements score_samples.

Parameters:: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:: y_score – Result of calling score_samples on the final estimator.
Return type:: ndarray of shape (n_samples,)

set_output(*, transform=None)[source]#

Set the output container when “transform” and “fit_transform” are called.

Calling set_output will set the output of all estimators in steps.

Parameters:

transform ({"default", "pandas"}, default=None) –

Configure output of transform and fit_transform.

“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**kwargs)[source]#

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in steps.

Parameters:: **kwargs (dict) – Parameters of this estimator or parameters of estimators contained in steps. Parameters of the steps may be set using its name and the parameter name separated by a ‘__’.
Returns:: self – Pipeline class instance.
Return type:: object
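The double-underscore convention can be sketched as follows; the step names "scale" and "clf" are assumptions made for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# <step name>__<parameter name> reaches into a step.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# Using the bare step name replaces the whole step.
pipe.set_params(scale=RobustScaler())

print(pipe.get_params()["clf__C"])
print(type(pipe.named_steps["scale"]).__name__)
```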

transform(X)[source]#

Transform the data, and apply transform with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls transform method. Only valid if the final estimator implements transform.

This also works when the final estimator is None, in which case all prior transformations are applied.

Parameters:: X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
Returns:: Xt – Transformed data.
Return type:: ndarray of shape (n_samples, n_transformed_features)
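The sketch below shows transform on a transformer-only pipeline, and on a pipeline whose final step is "passthrough" (used here in place of None). Step names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# The final step is itself a transformer.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
Xt = pipe.transform(X)
print(Xt.shape)

# With no real final estimator, only the prior transformations are applied.
scale_only = Pipeline([("scale", StandardScaler()), ("final", "passthrough")]).fit(X)
Xs = scale_only.transform(X)
```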

class watex.exlib.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')[source]#

Bases: TransformerMixin, BaseEstimator

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Read more in the User Guide.

Parameters:

degree (int or tuple (min_degree, max_degree), default=2) – If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.
interaction_only (bool, default=False) –
If True, only interaction features are produced: features that are products of at most degree distinct input features, i.e. terms with power of 2 or higher of the same input feature are excluded:
- included: x[0], x[1], x[0] * x[1], etc.
- excluded: x[0] ** 2, x[0] ** 2 * x[1], etc.
include_bias (bool, default=True) – If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
order ({'C', 'F'}, default='C') –
Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.

New in version 0.21.

powers_#

powers_[i, j] is the exponent of the jth input in the ith output.

Type:: ndarray of shape (n_output_features_, n_features_in_)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_output_features_#

The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.

Type:: int

See also

SplineTransformer: Transformer that generates univariate B-spline bases for features.

Notes

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
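To get a feel for this growth: with include_bias=True and interaction_only=False, the number of output features for n input features and degree d is the binomial coefficient C(n + d, d). The helper name below is hypothetical; the sketch checks the count against n_output_features_.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    """Count of monomials of total degree <= degree (bias term included)."""
    return comb(n_features + degree, degree)


counts = {}
for n, d in [(2, 2), (10, 2), (10, 3), (20, 3)]:
    poly = PolynomialFeatures(degree=d).fit(np.zeros((1, n)))
    counts[(n, d)] = poly.n_output_features_
    print(n, d, poly.n_output_features_, n_poly_features(n, d))
```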

See examples/linear_model/plot_polynomial_interpolation.py

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

fit(X, y=None)[source]#

Compute number of output features.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.
y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Fitted transformer.

Return type:

object

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

property powers_#: Exponent for each of the inputs in the output.

transform(X)[source]#

Transform data to polynomial features.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) –

The data to transform, row by row.

Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.

If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.

Returns:

XP – The matrix of features, where NP is the number of polynomial features generated from the combination of inputs. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Return type:

{ndarray, sparse matrix} of shape (n_samples, NP)
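As a small illustration of the sparse path, the sketch below passes a CSR matrix and compares the result with the dense computation.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

X_dense = np.arange(6, dtype=float).reshape(3, 2)
X_csr = sparse.csr_matrix(X_dense)

# Sparse in, sparse out.
Xp_sparse = PolynomialFeatures(degree=2).fit_transform(X_csr)
Xp_dense = PolynomialFeatures(degree=2).fit_transform(X_dense)

print(sparse.issparse(Xp_sparse), Xp_sparse.shape)
```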

class watex.exlib.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]#

Bases: ForestClassifier

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Read more in the User Guide.

Parameters:

n_estimators (int, default=100) –
The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
criterion ({"gini", "entropy", "log_loss"}, default="gini") – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see tree_mathematical_formulation. Note: This parameter is tree-specific.
max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split (int or float, default=2) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf (int or float, default=1) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features ({"sqrt", "log2", None}, int or float, default="sqrt") –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
Changed in version 1.1: The default of max_features changed from “auto” to “sqrt”.

Deprecated since version 1.1: The “auto” option was deprecated in 1.1 and will be removed in 1.3.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes (int, default=None) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease (float, default=0.0) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:
```
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
```
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.
bootstrap (bool, default=True) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_score (bool, default=False) – Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
random_state (int, RandomState instance or None, default=None) – Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
verbose (int, default=0) – Controls the verbosity when fitting and predicting.
warm_start (bool, default=False) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See Glossary and gradient_boosting_warm_start for details.
class_weight ({"balanced", "balanced_subsample"}, dict or list of dicts, default=None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
ccp_alpha (non-negative float, default=0.0) –
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

New in version 0.22.
max_samples (int or float, default=None) –
If bootstrap is True, the number of samples to draw from X to train each base estimator.
- If None (default), then draw X.shape[0] samples.
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].
New in version 0.22.
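The weighted impurity decrease used by min_impurity_decrease can be worked through by hand. The sketch below is plain Python, not library code; the helper names and the sample counts are assumptions chosen for illustration.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """N_t / N * (impurity - N_t_R / N_t * right - N_t_L / N_t * left)."""
    return n_t / n * (impurity
                      - n_t_r / n_t * right_impurity
                      - n_t_l / n_t * left_impurity)


# 100 samples overall; the node holds 40 (20 per class) and is split
# into a pure left child (10, 0) and a mixed right child (10, 20).
decrease = weighted_impurity_decrease(
    n=100, n_t=40, n_t_l=10, n_t_r=30,
    impurity=gini([20, 20]),
    left_impurity=gini([10, 0]),
    right_impurity=gini([10, 20]),
)
print(round(decrease, 4))  # 0.0667
```

With min_impurity_decrease set above this value, the split in the example would not be made.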

estimator_#

The child estimator template used to create the collection of fitted sub-estimators.

New in version 1.2: base_estimator_ was renamed to estimator_.

Type:: DecisionTreeClassifier

base_estimator_#

The child estimator template used to create the collection of fitted sub-estimators.

Deprecated since version 1.2: base_estimator_ is deprecated and will be removed in 1.4. Use estimator_ instead.

Type:: DecisionTreeClassifier

estimators_#

The collection of fitted sub-estimators.

Type:: list of DecisionTreeClassifier

classes_#

The class labels (single output problem), or a list of arrays of class labels (multi-output problem).

Type:: ndarray of shape (n_classes,) or a list of such arrays

n_classes_#

The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).

Type:: int or list

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_outputs_#

The number of outputs when fit is performed.

Type:: int

feature_importances_#

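The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.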

Type:: ndarray of shape (n_features,)
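A hedged sketch comparing the impurity-based importances with permutation importance on held-out data. The dataset and the train/test split are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances, computed from the training data.
print(clf.feature_importances_)

# Permutation importances, computed on held-out data.
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```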

oob_score_#

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

Type:: float

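oob_decision_function_#

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small, a data point might never be left out during the bootstrap; in that case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.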

Type:: ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)
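The out-of-bag attributes can be inspected as in the sketch below, which assumes bootstrap=True (the default) and an illustrative synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# oob_score=True requires bootstrap=True.
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X, y)

print(clf.oob_score_)                      # accuracy estimated on out-of-bag samples
print(clf.oob_decision_function_.shape)    # (n_samples, n_classes)
```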

See also

sklearn.tree.DecisionTreeClassifier: A decision tree classifier.
sklearn.ensemble.ExtraTreesClassifier: Ensemble of extremely randomized tree classifiers.

Notes

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

References

[1]

Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(...)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

class watex.exlib.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)[source]#

Bases: BaseSearchCV

Randomized search on hyper parameters.

RandomizedSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.

Read more in the User Guide.

New in version 0.14.

Parameters:

estimator (estimator object) – An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
param_distributions (dict or list of dicts) – Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.
n_iter (int, default=10) – Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.
scoring (str, callable, list, tuple or dict, default=None) –
Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use:
- a single string (see scoring_parameter);
- a callable (see scoring) that returns a single value.
If scoring represents multiple scores, one can use:
- a list or tuple of unique strings;
- a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
- a dictionary with metric names as keys and callables as values.
See multimetric_grid_search for an example.

If None, the estimator’s score method is used.
n_jobs (int, default=None) –
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Changed in version v0.20: n_jobs default changed from 1 to None
refit (bool, str, or callable, default=True) –
Refit an estimator using the best found parameters on the whole dataset.

For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this RandomizedSearchCV instance.

Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer.

See scoring parameter to know more about multiple metric evaluation.

Changed in version 0.20: Support for callable added.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- integer, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
verbose (int) –
Controls the verbosity: the higher, the more messages.
- >1 : the computation time for each fold and parameter candidate is displayed;
- >2 : the score is also displayed;
- >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
pre_dispatch (int, or str, default='2*n_jobs') –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
random_state (int, RandomState instance or None, default=None) – Pseudo random number generator state used for random uniform sampling from lists of possible values instead of scipy.stats distributions. Pass an int for reproducible output across multiple function calls. See Glossary.
error_score ('raise' or numeric, default=np.nan) – Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
return_train_score (bool, default=False) –
If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

New in version 0.19.

Changed in version 0.21: Default value was changed from True to False
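A minimal sketch of multi-metric evaluation, in which refit names the scorer used to pick the best parameters. The estimator, the distribution for C, and the metric pair are assumptions chosen for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # continuous distribution
    n_iter=5,
    scoring=["accuracy", "neg_log_loss"],
    refit="accuracy",       # scorer used for best_params_ and the final refit
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Result keys are suffixed with the scorer name.
print(search.best_params_)
print(sorted(k for k in search.cv_results_ if k.startswith("mean_test_")))
```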

cv_results_#

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

For instance, the table below

| param_kernel | param_gamma | split0_test_score | … | rank_test_score |
|---|---|---|---|---|
| 'rbf' | 0.1 | 0.80 | … | 1 |
| 'rbf' | 0.2 | 0.84 | … | 3 |
| 'rbf' | 0.3 | 0.70 | … | 2 |

will be represented by a cv_results_ dict of:

{
'param_kernel' : masked_array(data = ['rbf', 'rbf', 'rbf'],
                              mask = False),
'param_gamma'  : masked_array(data = [0.1 0.2 0.3], mask = False),
'split0_test_score'  : [0.80, 0.84, 0.70],
'split1_test_score'  : [0.82, 0.50, 0.70],
'mean_test_score'    : [0.81, 0.67, 0.70],
'std_test_score'     : [0.01, 0.24, 0.00],
'rank_test_score'    : [1, 3, 2],
'split0_train_score' : [0.80, 0.92, 0.70],
'split1_train_score' : [0.82, 0.55, 0.70],
'mean_train_score'   : [0.81, 0.74, 0.70],
'std_train_score'    : [0.01, 0.19, 0.00],
'mean_fit_time'      : [0.73, 0.63, 0.43],
'std_fit_time'       : [0.01, 0.02, 0.01],
'mean_score_time'    : [0.01, 0.06, 0.04],
'std_score_time'     : [0.00, 0.00, 0.00],
'params'             : [{'kernel' : 'rbf', 'gamma' : 0.1}, ...],
}

NOTE

The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

Type:: dict of numpy (masked) ndarrays
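Loading cv_results_ into a pandas DataFrame might look like the sketch below, assuming pandas is installed. The search setup is illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},
    n_iter=5, cv=3, random_state=0,
).fit(X, y)

# One row per sampled candidate, one column per key.
results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results.sort_values("rank_test_score")[columns]
print(ranked)
```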

best_estimator_#

Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.

For multi-metric evaluation, this attribute is present only if refit is specified.

See refit parameter for more information on allowed values.

Type:: estimator

best_score_#

Mean cross-validated score of the best_estimator.

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

This attribute is not available if refit is a function.

Type:: float

best_params_#

Parameter setting that gave the best results on the hold out data.

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

Type:: dict

best_index_#

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

Type:: int
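When refit is a callable, its return value becomes best_index_. The sketch below uses a hypothetical selection rule (smallest C within one standard deviation of the top score); the function name and the rule are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV


def smallest_c_within_one_std(cv_results):
    """Return the index of the smallest-C candidate close to the best score."""
    mean = np.asarray(cv_results["mean_test_score"])
    std = np.asarray(cv_results["std_test_score"])
    best = mean.argmax()
    candidates = np.flatnonzero(mean >= mean[best] - std[best])
    c_values = np.array([cv_results["params"][i]["C"] for i in candidates])
    return int(candidates[c_values.argmin()])  # must be an integer index


X, y = load_iris(return_X_y=True)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},
    n_iter=8, cv=3, random_state=0,
    refit=smallest_c_within_one_std,
).fit(X, y)

chosen = search.cv_results_["params"][search.best_index_]
print(search.best_index_, chosen)
```

Because refit is a function here, best_score_ is not set on the fitted search object.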

scorer_#

Scorer function used on the held out data to choose the best parameters for the model.

For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.

Type:: function or a dict

n_splits_#

The number of cross-validation splits (folds/iterations).

Type:: int

refit_time_#

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.

New in version 0.20.

Type:: float

multimetric_#

Whether or not the scorers compute several metrics.

Type:: bool

classes_#

The class labels. This is present only if refit is specified and the underlying estimator is a classifier.

Type:: ndarray of shape (n_classes,)

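n_features_in_#

Number of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes n_features_in_ when fit.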

New in version 0.24.

Type:: int

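feature_names_in_#

Names of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes feature_names_in_ when fit.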

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

GridSearchCV: Does exhaustive search over a grid of parameters.
ParameterSampler: A generator over parameter settings, constructed from param_distributions.

Notes

The parameters selected are those that maximize the score of the held-out data, according to the scoring parameter.

If n_jobs was set to a value higher than one, the data is copied for each parameter setting (and not n_jobs times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch. Then, the memory is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from scipy.stats import uniform
>>> iris = load_iris()
>>> logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
...                               random_state=0)
>>> distributions = dict(C=uniform(loc=0, scale=4),
...                      penalty=['l2', 'l1'])
>>> clf = RandomizedSearchCV(logistic, distributions, random_state=0)
>>> search = clf.fit(iris.data, iris.target)
>>> search.best_params_
{'C': 2..., 'penalty': 'l1'}

class watex.exlib.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

Parameters:

with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_scaling (bool, default=True) – If True, scale the data to interquartile range.
quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –
Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quartile and q_max is the third quartile.

New in version 0.18.
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
unit_variance (bool, default=False) –
If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.

New in version 0.24.
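The effect of unit_variance can be sketched on synthetic normal data. The sample size and distribution parameters are arbitrary assumptions; the adjustment factor is the width of the same quantile range on a standard normal distribution.

```python
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100_000, 1))

plain = RobustScaler().fit(X)
unit = RobustScaler(unit_variance=True).fit(X)

# Width of the (25, 75) quantile range for a standard normal.
adjust = norm.ppf(0.75) - norm.ppf(0.25)

print(plain.scale_, unit.scale_)
print(unit.transform(X).std())  # close to 1 for normally distributed data
```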

center_#

The median value for each feature in the training set.

Type:: array of floats

scale_#

The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

Type:: array of floats
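With default settings, center_ and scale_ can be reproduced with NumPy, as in this sketch. It reuses the data from the Examples section of this class.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0, -2.0, 2.0],
              [-2.0, 1.0, 3.0],
              [4.0, 1.0, -2.0]])

scaler = RobustScaler().fit(X)

# Per-feature median and interquartile range.
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q75 - q25)

print(scaler.center_, scaler.scale_)
print(manual)
```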

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

robust_scale: Equivalent function without the estimator API.
sklearn.decomposition.PCA: Further removes the linear correlation across features with ‘whiten=True’.

Notes

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range

Examples

>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])

fit(X, y=None)[source]#

Compute the median and quantiles to be used for scaling.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.
y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Fitted scaler.

Return type:

object

inverse_transform(X)[source]#

Scale back the data to the original representation.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.
Returns:: X_tr – Transformed array.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_features)
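
As a quick illustration of inverse_transform, the sketch below scales a small array and maps it back. The data is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

scaler = RobustScaler().fit(X)
X_scaled = scaler.transform(X)

# inverse_transform undoes the centering and scaling applied by transform.
X_back = scaler.inverse_transform(X_scaled)
```

Up to floating-point error, X_back equals the original X.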

transform(X)[source]#

Center and scale the data.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the specified axis.
Returns:: X_tr – Transformed array.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_features)

class watex.exlib.SGDClassifier(loss='hinge', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)[source]#

Bases: BaseSGDClassifier

Linear classifiers (SVM, logistic regression, etc.) with SGD training.

This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.
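
Minibatch learning with partial_fit might look like the sketch below. The synthetic data, the batch size of 50 and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, roughly standardized data: two features, binary target.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # all class labels must be given on the first call

# Feed the data in chunks of 50 samples; each call updates the same model.
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=classes)
```

Each partial_fit call performs one pass over the batch it receives, so the number of passes over the full dataset is controlled by the caller rather than by max_iter.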

This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared Euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

Read more in the User Guide.

Parameters:

loss ({'hinge', 'log_loss', 'log', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'}, default='hinge') –
The loss function to be used.
- ‘hinge’ gives a linear SVM.
- ‘log_loss’ gives logistic regression, a probabilistic classifier.
- ‘modified_huber’ is another smooth loss that brings tolerance to
  outliers as well as probability estimates.
- ‘squared_hinge’ is like hinge but is quadratically penalized.
- ‘perceptron’ is the linear loss used by the perceptron algorithm.
- The other losses, ‘squared_error’, ‘huber’, ‘epsilon_insensitive’ and ‘squared_epsilon_insensitive’ are designed for regression but can be useful in classification as well; see SGDRegressor for a description.
More details about the loss formulas can be found in the User Guide.

Deprecated since version 1.1: The loss ‘log’ was deprecated in v1.1 and will be removed in version 1.3. Use loss='log_loss' which is equivalent.
penalty ({'l2', 'l1', 'elasticnet', None}, default='l2') – The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’. No penalty is added when set to None.
alpha (float, default=0.0001) – Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when learning_rate is set to ‘optimal’. Values must be in the range [0.0, inf).
l1_ratio (float, default=0.15) – The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Only used if penalty is ‘elasticnet’. Values must be in the range [0.0, 1.0].
fit_intercept (bool, default=True) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
max_iter (int, default=1000) –
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit() method. Values must be in the range [1, inf).

New in version 0.19.
tol (float or None, default=1e-3) –
The stopping criterion. If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs. Convergence is checked against the training loss or the validation loss depending on the early_stopping parameter. Values must be in the range [0.0, inf).

New in version 0.19.
shuffle (bool, default=True) – Whether or not the training data should be shuffled after each epoch.
verbose (int, default=0) – The verbosity level. Values must be in the range [0, inf).
epsilon (float, default=0.1) – Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold. Values must be in the range [0.0, inf).
n_jobs (int, default=None) – The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
random_state (int, RandomState instance, default=None) – Used for shuffling the data, when shuffle is set to True. Pass an int for reproducible output across multiple function calls. See Glossary. Integer values must be in the range [0, 2**32 - 1].
learning_rate (str, default='optimal') –
The learning rate schedule:
- ‘constant’: eta = eta0
- ‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
- ‘invscaling’: eta = eta0 / pow(t, power_t)
- ‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
  
  New in version 0.20: Added ‘adaptive’ option
eta0 (float, default=0.0) – The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’. Values must be in the range (0.0, inf).
power_t (float, default=0.5) – The exponent for inverse scaling learning rate [default 0.5]. Values must be in the range (-inf, inf).
early_stopping (bool, default=False) –
Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a stratified fraction of training data as validation and terminate training when validation score returned by the score method is not improving by at least tol for n_iter_no_change consecutive epochs.

New in version 0.20: Added ‘early_stopping’ option
validation_fraction (float, default=0.1) –
The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True. Values must be in the range (0.0, 1.0).

New in version 0.20: Added ‘validation_fraction’ option
n_iter_no_change (int, default=5) –
Number of iterations with no improvement to wait before stopping fitting. Convergence is checked against the training loss or the validation loss depending on the early_stopping parameter. Integer values must be in the range [1, max_iter).

New in version 0.20: Added ‘n_iter_no_change’ option
class_weight (dict, {class_label: weight} or "balanced", default=None) –
Preset for the class_weight fit parameter.

Weights associated with classes. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
warm_start (bool, default=False) –
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See the Glossary.

Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution than when calling fit a single time because of the way the data is shuffled. If a dynamic learning rate is used, the learning rate is adapted depending on the number of samples already seen. Calling fit resets this counter, while partial_fit will result in increasing the existing counter.
average (bool or int, default=False) – When set to True, computes the averaged SGD weights across all updates and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples. Integer values must be in the range [1, n_samples].
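
The ‘constant’ and ‘invscaling’ formulas listed under learning_rate can be written as plain functions. This is only a sketch of the formulas as stated: the helper names are hypothetical and not part of the library, and the step counter t is assumed to start at 1.

```python
def eta_constant(eta0, t):
    """'constant' schedule: the step size never changes."""
    return eta0


def eta_invscaling(eta0, t, power_t=0.5):
    """'invscaling' schedule: eta = eta0 / pow(t, power_t)."""
    return eta0 / pow(t, power_t)


# Step sizes for the first few updates with eta0 = 0.1.
steps = [eta_invscaling(0.1, t) for t in (1, 4, 16)]
```

With the default power_t=0.5 the step size halves each time t is multiplied by four.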

coef_#

Weights assigned to the features.

Type:: ndarray of shape (1, n_features) if n_classes == 2 else (n_classes, n_features)

intercept_#

Constants in decision function.

Type:: ndarray of shape (1,) if n_classes == 2 else (n_classes,)

n_iter_#

The actual number of iterations before reaching the stopping criterion. For multiclass fits, it is the maximum over every binary fit.

Type:: int

loss_function_#

Type:: concrete LossFunction

classes_#

Type:: array of shape (n_classes,)

t_#

Number of weight updates performed during training. Same as (n_iter_ * n_samples + 1).

Type:: int

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

sklearn.svm.LinearSVC: Linear support vector classification.
LogisticRegression: Logistic regression.
Perceptron: Inherits from SGDClassifier. Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).

Examples

>>> import numpy as np
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> Y = np.array([1, 1, 2, 2])
>>> # Always scale the input. The most convenient way is to use a pipeline.
>>> clf = make_pipeline(StandardScaler(),
...                     SGDClassifier(max_iter=1000, tol=1e-3))
>>> clf.fit(X, Y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdclassifier', SGDClassifier())])
>>> print(clf.predict([[-0.8, -1]]))
[1]

predict_log_proba(X)[source]#

Log of probability estimates.

This method is only available for log loss and modified Huber loss.

When loss="modified_huber", probability estimates may be hard zeros and ones, so taking the logarithm is not possible.

See predict_proba for details.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data for prediction.
Returns:: T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type:: array-like, shape (n_samples, n_classes)

predict_proba(X)[source]#

Probability estimates.

This method is only available for log loss and modified Huber loss.

Multiclass probability estimates are derived from binary (one-vs.-rest) estimates by simple normalization, as recommended by Zadrozny and Elkan.

Binary probability estimates for loss="modified_huber" are given by (clip(decision_function(X), -1, 1) + 1) / 2. For other loss functions it is necessary to perform proper probability calibration by wrapping the classifier with CalibratedClassifierCV instead.

Parameters:: X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data for prediction.
Returns:: Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type:: ndarray of shape (n_samples, n_classes)
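
The binary formula quoted above for the modified Huber loss can be checked against predict_proba. A minimal sketch on a toy dataset; manual_p1 is an illustrative name for the positive-class probability.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[-1., -1.], [-2., -1.], [1., 1.], [2., 1.]])
y = np.array([0, 0, 1, 1])

clf = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)

# Clip the decision values to [-1, 1], then rescale to [0, 1].
scores = clf.decision_function(X)
manual_p1 = (np.clip(scores, -1, 1) + 1) / 2

proba = clf.predict_proba(X)  # columns follow the order of clf.classes_
```

The second column of proba corresponds to the positive class and matches manual_p1.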

References

Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD’02, https://dl.acm.org/doi/pdf/10.1145/775047.775151

The justification for the formula in the loss="modified_huber" case is in the appendix B in: http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf

class watex.exlib.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)[source]#

Bases: BaseSVC

C-Support Vector Classification.

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer or other kernel_approximation.

The multiclass support is handled according to a one-vs-one scheme.
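
The one-vs-one scheme shows up in the shapes of the fitted model. The sketch below uses synthetic data with four classes (an assumption for illustration) and compares the two settings of the decision_function_shape parameter.

```python
import numpy as np
from sklearn.svm import SVC

# Four well-separated synthetic clusters, 20 samples each.
rng = np.random.RandomState(0)
n_classes = 4
y = np.repeat(np.arange(n_classes), 20)
X = rng.randn(80, 2) + 3 * y[:, None]

ovr = SVC(decision_function_shape="ovr").fit(X, y)
ovo = SVC(decision_function_shape="ovo").fit(X, y)

# Number of pairwise (one-vs-one) classifiers trained internally.
n_pairs = n_classes * (n_classes - 1) // 2

ovr_shape = ovr.decision_function(X).shape
ovo_shape = ovo.decision_function(X).shape
```

With four classes there are six pairwise classifiers, so the ‘ovo’ decision function has six columns while the ‘ovr’ one has four. Training is one-vs-one in both cases, which is why intercept_ has six entries either way.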

For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: svm_kernels.

Read more in the User Guide.

Parameters:

C (float, default=1.0) – Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf') – Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
degree (int, default=3) – Degree of the polynomial kernel function (‘poly’). Must be non-negative. Ignored by all other kernels.
gamma ({'scale', 'auto'} or float, default='scale') –
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
- if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,
- if ‘auto’, uses 1 / n_features
- if float, must be non-negative.
Changed in version 0.22: The default value of gamma changed from ‘auto’ to ‘scale’.
coef0 (float, default=0.0) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
shrinking (bool, default=True) – Whether to use the shrinking heuristic. See the User Guide.
probability (bool, default=False) – Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.
tol (float, default=1e-3) – Tolerance for stopping criterion.
cache_size (float, default=200) – Specify the size of the kernel cache (in MB).
class_weight (dict or 'balanced', default=None) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
verbose (bool, default=False) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
max_iter (int, default=-1) – Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape ({'ovo', 'ovr'}, default='ovr') –
Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, note that internally, one-vs-one (‘ovo’) is always used as a multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.

Changed in version 0.19: decision_function_shape is ‘ovr’ by default.

New in version 0.17: decision_function_shape='ovr' is recommended.

Changed in version 0.17: Deprecated decision_function_shape='ovo' and None.
break_ties (bool, default=False) –
If True, decision_function_shape='ovr', and the number of classes is greater than 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.

New in version 0.22.
random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
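
The value used for gamma='scale' can be computed by hand from the formula given for the gamma parameter. In this sketch the data is synthetic and gamma_scale is an illustrative name; passing the same number as a float should give the same model.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 3) * 5.0
y = (X[:, 0] > 0).astype(int)

# gamma='scale' corresponds to 1 / (n_features * X.var()).
gamma_scale = 1.0 / (X.shape[1] * X.var())

clf_scale = SVC(gamma="scale").fit(X, y)
clf_float = SVC(gamma=gamma_scale).fit(X, y)
```

Both classifiers produce the same decision values on X, which illustrates that ‘scale’ adapts the kernel width to the overall variance of the training data.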

class_weight_#

Multipliers of parameter C for each class. Computed based on the class_weight parameter.

Type:: ndarray of shape (n_classes,)
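
For class_weight='balanced', the multipliers follow the n_samples / (n_classes * np.bincount(y)) formula from the parameter description. A small sketch with a made-up imbalanced target:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.], [1.], [2.], [10.]])
y = np.array([0, 0, 0, 1])  # three samples of class 0, one of class 1

n_samples = len(y)
n_classes = len(np.unique(y))

# Rare classes receive proportionally larger weights.
balanced = n_samples / (n_classes * np.bincount(y))

clf = SVC(class_weight="balanced").fit(X, y)
```

Here balanced is approximately [0.667, 2.0], and clf.class_weight_ holds the same values.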

classes_#

The class labels.

Type:: ndarray of shape (n_classes,)

coef_#

Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.

coef_ is a readonly property derived from dual_coef_ and support_vectors_.

Type:: ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)

dual_coef_#

Dual coefficients of the support vector in the decision function (see sgd_mathematical_formulation), multiplied by their targets. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the multi-class section of the User Guide for details.

Type:: ndarray of shape (n_classes -1, n_SV)

fit_status_#

0 if correctly fitted, 1 otherwise (will raise warning)

Type:: int

intercept_#

Constants in decision function.

Type:: ndarray of shape (n_classes * (n_classes - 1) / 2,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_iter_#

Number of iterations run by the optimization routine to fit the model. The shape of this attribute depends on the number of models optimized which in turn depends on the number of classes.

New in version 1.1.

Type:: ndarray of shape (n_classes * (n_classes - 1) // 2,)

support_#

Indices of support vectors.

Type:: ndarray of shape (n_SV)

support_vectors_#

Support vectors.

Type:: ndarray of shape (n_SV, n_features)

n_support_#

Number of support vectors for each class.

Type:: ndarray of shape (n_classes,), dtype=int32

probA_#

Type:: ndarray of shape (n_classes * (n_classes - 1) / 2)

probB_#

If probability=True, it corresponds to the parameters learned in Platt scaling to produce probability estimates from decision values. If probability=False, it’s an empty array. Platt scaling uses the logistic function 1 / (1 + exp(decision_value * probA_ + probB_)) where probA_ and probB_ are learned from the dataset [2]. For more information on the multiclass case and training procedure see section 8 of [1].

Type:: ndarray of shape (n_classes * (n_classes - 1) / 2)
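
The logistic function used in Platt scaling can be written out directly. This sketch only evaluates the formula quoted above; platt_probability is a hypothetical helper, and the example coefficients are made up rather than taken from a fitted model. It does not address the sign conventions of the decision values returned by a fitted SVC.

```python
import numpy as np


def platt_probability(decision_value, prob_a, prob_b):
    """Evaluate 1 / (1 + exp(decision_value * prob_a + prob_b))."""
    return 1.0 / (1.0 + np.exp(decision_value * prob_a + prob_b))


# With a negative slope, larger decision values map to larger probabilities.
p_zero = platt_probability(0.0, prob_a=-1.0, prob_b=0.0)
p_far = platt_probability(3.0, prob_a=-1.0, prob_b=0.0)
```

A decision value of zero with prob_b = 0 maps to a probability of 0.5.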

shape_fit_#

Array dimensions of training vector X.

Type:: tuple of int of shape (n_dimensions_of_X,)

See also

SVR: Support Vector Machine for Regression implemented using libsvm.
LinearSVC: Scalable Linear Support Vector Machine for classification implemented using liblinear. Check the See Also section of LinearSVC for further comparison.

References

[1]

LIBSVM: A Library for Support Vector Machines

[2]

Platt, John (1999). “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”

Examples

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
>>> clf.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])

>>> print(clf.predict([[-0.8, -1]]))
[1]

class watex.exlib.SelectFromModel(estimator, *, threshold=None, prefit=False, norm_order=1, max_features=None, importance_getter='auto')[source]#

Bases: MetaEstimatorMixin, SelectorMixin, BaseEstimator

Meta-transformer for selecting features based on importance weights.

New in version 0.17.

Read more in the User Guide.

Parameters:

estimator (object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
threshold (str or float, default=None) – The threshold value to use for feature selection. Features whose absolute importance value is greater than or equal to the threshold are kept, while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
prefit (bool, default=False) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
- If an integer, then it specifies the maximum number of features to allow.
- If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
- If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.

New in version 0.20.

Changed in version 1.1: max_features accepts a callable.
importance_getter (str or callable, default='auto') –
If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.

Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.

If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.

New in version 0.24.

estimator_#

The base estimator from which the transformer is built. This attribute exists only when fit has been called.

If prefit=True, it is a deep copy of estimator.
If prefit=False, it is a clone of estimator and fit on the data passed to fit or partial_fit.

Type:: estimator

n_features_in_#

Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

New in version 0.24.

Type:: int

max_features_#

Maximum number of features calculated during fit. Only defined if the max_features is not None.

If max_features is an int, then max_features_ = max_features.
If max_features is a callable, then max_features_ = max_features(X).

New in version 1.1.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

threshold_#

The threshold value used for feature selection.

Type:: float

See also

RFE: Recursive feature elimination based on importance weights.
RFECV: Recursive feature elimination with built-in cross-validated selection of the best number of features.
SequentialFeatureSelector: Sequential cross-validation based feature selection. Does not rely on importance weights.

Notes

Allows NaN/Inf in the input if the underlying estimator does as well.

Examples

>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.linear_model import LogisticRegression
>>> X = [[ 0.87, -1.34,  0.31 ],
...      [-2.79, -0.02, -0.85 ],
...      [-1.34, -0.48, -2.55 ],
...      [ 1.92,  1.48,  0.65 ]]
>>> y = [0, 1, 0, 1]
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform(X)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])
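
The threshold_ value in the example above is the mean of the absolute coefficients, since “mean” is the default for an estimator without an l1 penalty. A sketch that recomputes it by hand (importances and manual_support are illustrative names):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X = [[0.87, -1.34, 0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [1.92, 1.48, 0.65]]
y = [0, 1, 0, 1]

selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)

# For a single row of coefficients, importance is the absolute value.
importances = np.abs(selector.estimator_.coef_).ravel()
mean_threshold = importances.mean()

# Features at or above the threshold are kept.
manual_support = importances >= mean_threshold
```

Both mean_threshold and manual_support agree with selector.threshold_ and selector.get_support().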

Using a callable to create a selector that can use no more than half of the input features.

>>> def half_callable(X):
...     return round(len(X[0]) / 2)
>>> half_selector = SelectFromModel(estimator=LogisticRegression(),
...                                 max_features=half_callable)
>>> _ = half_selector.fit(X, y)
>>> half_selector.max_features_
2

fit(X, y=None, **fit_params)[source]#

Fit the SelectFromModel meta-transformer.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,), default=None) – The target values (integers that correspond to classes in classification, real numbers in regression).
**fit_params (dict) – Other estimator specific parameters.

Returns:

self – Fitted estimator.

Return type:

object

property n_features_in_#: Number of features seen during fit.

partial_fit(X, y=None, **fit_params)[source]#

Fit the SelectFromModel meta-transformer only once.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,), default=None) – The target values (integers that correspond to classes in classification, real numbers in regression).
**fit_params (dict) – Other estimator specific parameters.

Returns:

self – Fitted estimator.

Return type:

object

property threshold_#: Threshold value used for feature selection.

class watex.exlib.SelectKBest(score_func=<function f_classif>, *, k=10)[source]#

Bases: _BaseFilter

Select features according to the k highest scores.

Read more in the User Guide.

Parameters:

score_func (callable, default=f_classif) –
Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. Default is f_classif (see below “See Also”). The default function only works with classification tasks.

New in version 0.18.
k (int or "all", default=10) – Number of top features to select. The “all” option bypasses selection, for use in a parameter search.

scores_#

Scores of features.

Type:: array-like of shape (n_features,)

pvalues_#

p-values of feature scores, None if score_func returned only scores.

Type:: array-like of shape (n_features,)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

f_classif: ANOVA F-value between label/feature for classification tasks.
mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
f_regression: F-value between label/feature for regression tasks.
mutual_info_regression: Mutual information for a continuous target.
SelectPercentile: Select features based on percentile of the highest scores.
SelectFpr: Select features based on a false positive rate test.
SelectFdr: Select features based on an estimated false discovery rate.
SelectFwe: Select features based on family-wise error rate.
GenericUnivariateSelect: Univariate feature selector with configurable mode.

Notes

Ties between features with equal scores will be broken in an unspecified way.

Examples

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
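
The relationship between scores_ and the selected columns can be inspected directly. A minimal sketch using the iris dataset and f_classif, both chosen here only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(f_classif, k=2).fit(X, y)

# Indices of the two largest scores, sorted in ascending index order.
top_k = np.sort(np.argsort(selector.scores_)[-2:])

# Indices of the columns the selector actually keeps.
selected = np.flatnonzero(selector.get_support())
```

When the scores are distinct, as they are here, top_k and selected contain the same indices.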

class watex.exlib.ShrunkCovariance(*, store_precision=True, assume_centered=False, shrinkage=0.1)[source]#

Bases: EmpiricalCovariance

Covariance estimator with shrinkage.

Read more in the User Guide.

Parameters:

store_precision (bool, default=True) – Specify if the estimated precision is stored.
assume_centered (bool, default=False) – If True, data will not be centered before computation. Useful when working with data whose mean is almost, but not exactly zero. If False, data will be centered before computation.
shrinkage (float, default=0.1) – Coefficient in the convex combination used for the computation of the shrunk estimate. Range is [0, 1].

covariance_#

Estimated covariance matrix

Type:: ndarray of shape (n_features, n_features)

location_#

Estimated location, i.e. the estimated mean.

Type:: ndarray of shape (n_features,)

precision_#

Estimated pseudo inverse matrix. (stored only if store_precision is True)

Type:: ndarray of shape (n_features, n_features)

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

EllipticEnvelope: An object for detecting outliers in a Gaussian distributed dataset.
EmpiricalCovariance: Maximum likelihood covariance estimator.
GraphicalLasso: Sparse inverse covariance estimation with an l1-penalized estimator.
GraphicalLassoCV: Sparse inverse covariance with cross-validated choice of the l1 penalty.
LedoitWolf: LedoitWolf Estimator.
MinCovDet: Minimum Covariance Determinant (robust estimator of covariance).
OAS: Oracle Approximating Shrinkage Estimator.

Notes

The regularized covariance is given by:

(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

where mu = trace(cov) / n_features
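
The formula above can be applied by hand to the empirical covariance and compared with the fitted estimator. A sketch on synthetic Gaussian data; manual and fitted are illustrative names.

```python
import numpy as np
from sklearn.covariance import ShrunkCovariance, empirical_covariance

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0, 0],
                            cov=[[.8, .3], [.3, .4]],
                            size=500)

shrinkage = 0.1
cov = empirical_covariance(X)  # maximum likelihood estimate, data centered
n_features = cov.shape[0]
mu = np.trace(cov) / n_features

# Convex combination of the empirical covariance and a scaled identity.
manual = (1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

fitted = ShrunkCovariance(shrinkage=shrinkage).fit(X).covariance_
```

The shrinkage pulls the diagonal entries towards their average mu and scales the off-diagonal entries by (1 - shrinkage).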

Examples

>>> import numpy as np
>>> from sklearn.covariance import ShrunkCovariance
>>> real_cov = np.array([[.8, .3],
...                      [.3, .4]])
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0],
...                             cov=real_cov,
...                             size=500)
>>> cov = ShrunkCovariance().fit(X)
>>> cov.covariance_
array([[0.7387..., 0.2536...],
       [0.2536..., 0.4110...]])
>>> cov.location_
array([0.0622..., 0.0193...])

fit(X, y=None)[source]#

Fit the shrunk covariance model to X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.

Returns:

self – Returns the instance itself.

Return type:

object

class watex.exlib.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose='deprecated', copy=True, add_indicator=False, keep_empty_features=False)[source]#

Bases: _BaseImputer

Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

Read more in the User Guide.

New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.

Parameters:

missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
New in version 0.20: strategy=”constant” for fixed value imputation.
fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. For string or object data types, fill_value must be a string. If None, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
verbose (int, default=0) –
Controls the verbosity of the imputer.

Deprecated since version 1.1: The ‘verbose’ parameter was deprecated in version 1.1 and will be removed in 1.3. A warning will always be raised upon the removal of empty columns in the future version.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
- If X is not an array of floating values;
- If X is encoded as a CSR matrix;
- If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will be stacked onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear in the missing indicator even if there are missing values at transform/test time.
keep_empty_features (bool, default=False) –
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.

New in version 1.2.

statistics_#

The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.

Type:: array of shape (n_features,)

indicator_#

Indicator used to add binary indicators for missing values. None if add_indicator=False.

Type:: MissingIndicator

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

IterativeImputer: Multivariate imputer that estimates values to impute for each feature with missing values from all the others.
KNNImputer: Multivariate imputer that estimates missing features using nearest samples.

Notes

Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.

In a prediction context, simple imputation usually performs poorly when associated with a weak learner. However, with a powerful learner, it can lead to as good or better performance than complex imputation such as IterativeImputer or KNNImputer.
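
The first note (columns that are entirely missing at fit time are discarded) can be illustrated as follows. This is a small sketch with made-up data, assuming the default keep_empty_features=False; depending on the installed version, a warning about the skipped feature may be emitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The second column contains only missing values at fit time.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, 6.0]])

imp = SimpleImputer(strategy="mean")
X_imputed = imp.fit_transform(X)

# No mean can be computed for the empty column, so its statistic is NaN
# and the column is removed from the transformed output.
print(imp.statistics_)
print(X_imputed.shape)
```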

Examples

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

fit(X, y=None)[source]#

Fit the imputer on X.

Parameters:

X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Fitted estimator.

Return type:

object

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

inverse_transform(X)[source]#

Convert the data back to the original representation.

Inverts the transform operation performed on an array. This operation can only be performed after SimpleImputer is instantiated with add_indicator=True.

Note that inverse_transform can only invert the transform in features that have binary indicators for missing values. If a feature has no missing values at fit time, the feature won’t have a binary indicator, and the imputation done at transform time won’t be inverted.

New in version 0.24.

Parameters:: X (array-like of shape (n_samples, n_features + n_features_missing_indicator)) – The imputed data to be reverted to original data. It has to be an augmented array of imputed data and the missing indicator mask.
Returns:: X_original – The original X with missing values as it was prior to imputation.
Return type:: ndarray of shape (n_samples, n_features)
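
As an illustration of the round trip described above, here is a minimal sketch with made-up data. It assumes add_indicator=True so that the indicator columns needed by inverse_transform are present.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 2.0],
              [4.0, np.nan],
              [10.0, 5.0]])

# add_indicator=True appends one binary column per feature that had
# missing values at fit time (here: both features).
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_aug = imp.fit_transform(X)
print(X_aug.shape)  # imputed features followed by the indicator columns

# inverse_transform uses the indicator columns to put the NaNs back.
X_back = imp.inverse_transform(X_aug)
print(np.array_equal(np.isnan(X_back), np.isnan(X)))
```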

transform(X)[source]#

Impute all missing values in X.

Parameters:: X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.
Returns:: X_imputed – X with imputed values.
Return type:: {ndarray, sparse matrix} of shape (n_samples, n_features_out)

class watex.exlib.StackingClassifier(estimators, final_estimator=None, *, cv=None, stack_method='auto', n_jobs=None, passthrough=False, verbose=0)[source]#

Bases: ClassifierMixin, _BaseStacking

Stack of estimators with a final classifier.

Stacked generalization consists of stacking the outputs of the individual estimators and using a classifier to compute the final prediction. Stacking makes it possible to exploit the strength of each individual estimator by using their outputs as the input of a final estimator.

Note that estimators_ are fitted on the full X while final_estimator_ is trained using cross-validated predictions of the base estimators using cross_val_predict.

Read more in the User Guide.

New in version 0.22.

Parameters:

estimators (list of (str, estimator)) –
Base estimators which will be stacked together. Each element of the list is defined as a tuple of string (i.e. name) and an estimator instance. An estimator can be set to ‘drop’ using set_params.

The type of estimator is generally expected to be a classifier. However, one can pass a regressor for some use cases (e.g. ordinal regression).
final_estimator (estimator, default=None) – A classifier which will be used to combine the base estimators. The default classifier is a LogisticRegression.
cv (int, cross-validation generator, iterable, or "prefit", default=None) –
Determines the cross-validation splitting strategy used in cross_val_predict to train final_estimator. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- integer, to specify the number of folds in a (Stratified) KFold,
- An object to be used as a cross-validation generator,
- An iterable yielding train, test splits,
- ”prefit” to assume the estimators are prefit. In this case, the estimators will not be refitted.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

If “prefit” is passed, it is assumed that all estimators have already been fitted. The final_estimator_ is then trained on the estimators' predictions on the full training set, which are not cross-validated predictions. Please note that if the base models were trained on the same data that is used to train the stacking model, there is a very high risk of overfitting.

New in version 1.1: The ‘prefit’ option was added in 1.1

Note

A larger number of splits will provide no benefit if the number of training samples is large enough; it will only increase the training time. cv is not used for model evaluation but for prediction.
stack_method ({'auto', 'predict_proba', 'decision_function', 'predict'}, default='auto') –
Methods called for each base estimator. It can be:
- if ‘auto’, it will try to invoke, for each estimator, ‘predict_proba’, ‘decision_function’ or ‘predict’ in that order.
- otherwise, one of ‘predict_proba’, ‘decision_function’ or ‘predict’. If the method is not implemented by the estimator, it will raise an error.
n_jobs (int, default=None) – The number of jobs to run in parallel for the fit of all estimators. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
passthrough (bool, default=False) – When False, only the predictions of estimators will be used as training data for final_estimator. When True, the final_estimator is trained on the predictions as well as the original training data.
verbose (int, default=0) – Verbosity level.
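
The resolution order of stack_method='auto' can be observed through the fitted stack_method_ attribute. The sketch below is an illustration only; the estimator names ('lr', 'svc') and the max_iter settings are arbitrary choices, not part of the API.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

estimators = [
    # LogisticRegression implements predict_proba, which is tried first.
    ("lr", LogisticRegression(max_iter=1000)),
    # LinearSVC has no predict_proba, so 'auto' falls back to decision_function.
    ("svc", LinearSVC(max_iter=10000, random_state=0)),
]
clf = StackingClassifier(estimators=estimators, stack_method="auto", cv=5)
clf.fit(X, y)

print(clf.stack_method_)
# With three classes, each base estimator contributes three columns.
print(clf.transform(X).shape)
```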

classes_#

Class labels.

Type:: ndarray of shape (n_classes,) or list of ndarray if y is of type “multilabel-indicator”.

estimators_#

The elements of the estimators parameter, having been fitted on the training data. If an estimator has been set to ‘drop’, it will not appear in estimators_. When cv=”prefit”, estimators_ is set to estimators and is not fitted again.

Type:: list of estimators

named_estimators_#

Attribute to access any fitted sub-estimators by name.

Type:: Bunch

n_features_in_#

Number of features seen during fit. Only defined if the underlying classifier exposes such an attribute when fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

final_estimator_#

The classifier which predicts given the output of estimators_.

Type:: estimator

stack_method_#

The method used by each base estimator.

Type:: list of str

See also

StackingRegressor: Stack of estimators with a final regressor.

Notes

When predict_proba is used by each estimator (i.e. most of the time for stack_method=’auto’ or specifically for stack_method=’predict_proba’), the first column predicted by each estimator will be dropped in the case of a binary classification problem, because the two columns would otherwise be perfectly collinear.

In some cases (e.g. ordinal regression), one can pass regressors as the first layer of the StackingClassifier. However, note that y will be internally encoded in a numerically increasing order or lexicographic order. If this ordering is not adequate, one should manually numerically encode the classes in the desired order.
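
The dropped column mentioned in the first note can be seen in the shape returned by transform. This is a small sketch on synthetic binary data; the choice of base estimators is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (two classes by default).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
)
clf.fit(X, y)

# Each base estimator contributes a single column (the probability of the
# second class) instead of two perfectly collinear columns.
meta_features = clf.transform(X)
print(meta_features.shape)
```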

References

[1]

Wolpert, David H. “Stacked generalization.” Neural networks 5.2 (1992): 241-259.

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.svm import LinearSVC
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.ensemble import StackingClassifier
>>> X, y = load_iris(return_X_y=True)
>>> estimators = [
...     ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
...     ('svr', make_pipeline(StandardScaler(),
...                           LinearSVC(random_state=42)))
... ]
>>> clf = StackingClassifier(
...     estimators=estimators, final_estimator=LogisticRegression()
... )
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, stratify=y, random_state=42
... )
>>> clf.fit(X_train, y_train).score(X_test, y_test)
0.9...

decision_function(X)[source]#

Decision function for samples in X using the final estimator.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns:: decisions – The decision function computed by the final estimator.
Return type:: ndarray of shape (n_samples,), (n_samples, n_classes), or (n_samples, n_classes * (n_classes-1) / 2)

fit(X, y, sample_weight=None)[source]#

Fit the estimators.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values. Note that y will be internally encoded in numerically increasing order or lexicographic order. If the order matters (e.g. for ordinal regression), one should numerically encode the target y before calling fit.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Note that this is supported only if all underlying estimators support sample weights.

Returns:

self – Returns a fitted instance of estimator.

Return type:

object

predict(X, **predict_params)[source]#

Predict target for X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
**predict_params (dict of str -> obj) – Parameters to the predict called by the final_estimator. Note that this may be used to return uncertainties from some estimators with return_std or return_cov. Be aware that it will only account for uncertainty in the final estimator.

Returns:

y_pred – Predicted targets.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_output)

predict_proba(X)[source]#

Predict class probabilities for X using the final estimator.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns:: probabilities – The class probabilities of the input samples.
Return type:: ndarray of shape (n_samples, n_classes) or list of ndarray of shape (n_output,)

transform(X)[source]#

Return class labels or probabilities for X for each estimator.

Parameters:: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns:: y_preds – Prediction outputs for each estimator.
Return type:: ndarray of shape (n_samples, n_estimators) or (n_samples, n_classes * n_estimators)

class watex.exlib.StandardScaler(*, copy=True, with_mean=True, with_std=True)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
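
The standard score can be reproduced by hand with NumPy. The sketch below uses made-up data and compares the manual computation with the scaler's output; it relies on NumPy's default ddof=0 for the standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

u = X.mean(axis=0)  # per-feature mean of the training samples
s = X.std(axis=0)   # per-feature standard deviation (NumPy default ddof=0)
z_manual = (X - u) / s

z_scaler = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_scaler))
```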

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variances of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Read more in the User Guide.

Parameters:

copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

scale_#

Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.

New in version 0.17: scale_

Type:: ndarray of shape (n_features,) or None

mean_#

The mean value for each feature in the training set. Equal to None when with_mean=False.

Type:: ndarray of shape (n_features,) or None

var_#

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

Type:: ndarray of shape (n_features,) or None

n_features_in_#

Number of features seen during fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

n_samples_seen_#

The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer; otherwise it will be an array of dtype int. If sample_weight is used, it will be a float (if there is no missing data) or an array of dtype float that sums the weights seen so far. It is reset on new calls to fit, but increments across partial_fit calls.

Type:: int or ndarray of shape (n_features,)

See also

scale: Equivalent function without the estimator API.
PCA: Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]

fit(X, y=None, sample_weight=None)[source]#

Compute the mean and std to be used for later scaling.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.

New in version 0.24: parameter sample_weight support to StandardScaler.

Returns:

self – Fitted scaler.

Return type:

object

inverse_transform(X, copy=None)[source]#

Scale back the data to the original representation.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

partial_fit(X, y=None, sample_weight=None)[source]#

Online computation of mean and std on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.

New in version 0.24: parameter sample_weight support to StandardScaler.

Returns:

self – Fitted scaler.

Return type:

object
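
As a rough illustration of the online computation, the sketch below feeds the data to partial_fit in batches and compares the resulting statistics with a single call to fit. The batch split is a hypothetical stand-in for a data stream.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Reference: statistics computed on the whole dataset at once.
full = StandardScaler().fit(X)

# Incremental: the same data seen as 10 consecutive batches.
incremental = StandardScaler()
for batch in np.array_split(X, 10):
    incremental.partial_fit(batch)

print(np.allclose(full.mean_, incremental.mean_))
print(np.allclose(full.var_, incremental.var_))
print(incremental.n_samples_seen_)
```

Calling fit on the incremental scaler would discard the accumulated statistics, whereas further partial_fit calls keep updating them.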

transform(X, copy=None)[source]#

Perform standardization by centering and scaling.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

class watex.exlib.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)[source]#

Bases: BaseShuffleSplit

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Read more in the User Guide.

Parameters:

n_splits (int, default=10) – Number of re-shuffling & splitting iterations.
test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state (int, RandomState instance or None, default=None) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls. See Glossary.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for i, (train_index, test_index) in enumerate(sss.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[5 2 3]
  Test:  index=[4 1 0]
Fold 1:
  Train: index=[5 1 4]
  Test:  index=[0 2 3]
Fold 2:
  Train: index=[5 0 2]
  Test:  index=[4 3 1]
Fold 3:
  Train: index=[4 1 0]
  Test:  index=[2 3 5]
Fold 4:
  Train: index=[0 5 1]
  Test:  index=[3 4 2]

split(X, y, groups=None)[source]#

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) –
Training data, where n_samples is the number of samples and n_features is the number of features.

Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.
y (array-like of shape (n_samples,) or (n_samples, n_labels)) – The target variable for supervised learning problems. Stratification is done based on the y labels.
groups (object) – Always ignored, exists for compatibility.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
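
To illustrate how the class percentages are preserved, the sketch below splits a deliberately imbalanced label vector and measures the share of the minority class in each test set. The data is made up, and X is only a placeholder since y alone drives the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced labels: 80% class 0, 20% class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
test_fractions = []
for train_idx, test_idx in sss.split(X, y):
    # Fraction of class 1 in the test part of this split.
    test_fractions.append(y[test_idx].mean())

print(test_fractions)
```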

class watex.exlib.TransformerMixin[source]#

Bases: _SetOutputMixin

Mixin class for all transformers in scikit-learn.

If get_feature_names_out is defined, then BaseEstimator will automatically wrap transform and fit_transform to follow the set_output API. See the developer_api_set_output for details.

base.OneToOneFeatureMixin and base.ClassNamePrefixFeaturesOutMixin are helpful mixins for defining get_feature_names_out.

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)
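
A minimal sketch of how a custom transformer obtains fit_transform from the mixin is given below. Log1pTransformer is a hypothetical class written only for this illustration; it is not part of scikit-learn or watex.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer used only for illustration."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so that calls can be chained.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


X = np.array([[0.0, 1.0], [3.0, 7.0]])

# fit_transform is inherited from TransformerMixin: it calls fit, then transform.
X_new = Log1pTransformer().fit_transform(X)
print(np.allclose(X_new, np.log1p(X)))
```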

class watex.exlib.VotingClassifier(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)[source]#

Bases: ClassifierMixin, _BaseVoting

Soft Voting/Majority Rule classifier for unfitted estimators.

Read more in the User Guide.

New in version 0.17.

Parameters:

estimators (list of (str, estimator) tuples) –
Invoking the fit method on the VotingClassifier will fit clones of those original estimators that will be stored in the class attribute self.estimators_. An estimator can be set to 'drop' using set_params().

Changed in version 0.21: 'drop' is accepted. Using None was deprecated in 0.22 and support was removed in 0.24.
voting ({'hard', 'soft'}, default='hard') – If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
weights (array-like of shape (n_classifiers,), default=None) – Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights if None.
n_jobs (int, default=None) –
The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

New in version 0.18.
flatten_transform (bool, default=True) – Affects the shape of the transform output only when voting=’soft’. If voting=’soft’ and flatten_transform=True, the transform method returns a matrix with shape (n_samples, n_classifiers * n_classes). If flatten_transform=False, it returns (n_classifiers, n_samples, n_classes).
verbose (bool, default=False) –
If True, the time elapsed while fitting will be printed as it is completed.

New in version 0.23.
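
The soft voting rule (argmax of the weighted average of the predicted probabilities) can be reproduced by hand. The following is a small sketch on toy data; the choice of classifiers and weights is arbitrary.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
weights = [2, 1, 1]

eclf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("gnb", GaussianNB()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
    weights=weights,
).fit(X, y)

# Weighted average of the per-classifier probabilities, computed by hand.
probas = np.array([est.predict_proba(X) for est in eclf.estimators_])
avg = np.average(probas, axis=0, weights=weights)
manual_pred = eclf.classes_[np.argmax(avg, axis=1)]

print(np.allclose(avg, eclf.predict_proba(X)))
print(np.array_equal(manual_pred, eclf.predict(X)))
```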

estimators_#

The collection of fitted sub-estimators as defined in estimators that are not ‘drop’.

Type:: list of classifiers

named_estimators_#

Attribute to access any fitted sub-estimators by name.

New in version 0.20.

Type:: Bunch

le_#

Transformer used to encode the labels during fit and decode during prediction.

Type:: LabelEncoder

classes_#

The class labels.

Type:: ndarray of shape (n_classes,)

n_features_in_#

Number of features seen during fit. Only defined if the underlying classifier exposes such an attribute when fit.

New in version 0.24.

Type:: int

feature_names_in_#

Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.

New in version 1.0.

Type:: ndarray of shape (n_features_in_,)

See also

VotingRegressor: Prediction voting regressor.

Examples

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = VotingClassifier(estimators=[
...         ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
>>> eclf1 = eclf1.fit(X, y)
>>> print(eclf1.predict(X))
[1 1 1 2 2 2]
>>> np.array_equal(eclf1.named_estimators_.lr.predict(X),
...                eclf1.named_estimators_['lr'].predict(X))
True
>>> eclf2 = VotingClassifier(estimators=[
...         ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...         voting='soft')
>>> eclf2 = eclf2.fit(X, y)
>>> print(eclf2.predict(X))
[1 1 1 2 2 2]

To drop an estimator, set_params() can be used to remove it. Here we dropped one of the estimators, resulting in 2 fitted estimators:

>>> eclf2 = eclf2.set_params(lr='drop')
>>> eclf2 = eclf2.fit(X, y)
>>> len(eclf2.estimators_)
2

Setting flatten_transform=True with voting=’soft’ flattens output shape of transform:

>>> eclf3 = VotingClassifier(estimators=[
...        ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...        voting='soft', weights=[2,1,1],
...        flatten_transform=True)
>>> eclf3 = eclf3.fit(X, y)
>>> print(eclf3.predict(X))
[1 1 1 2 2 2]
>>> print(eclf3.transform(X).shape)
(6, 6)

fit(X, y, sample_weight=None)[source]#

Fit the estimators.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values.
sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights. If None, then samples are equally weighted. Note that this is supported only if all underlying estimators support sample weights.

New in version 0.18.

Returns:

self – Returns the instance itself.

Return type:

object

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) – Not used, present here for API consistency by convention.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

predict(X)[source]#

Predict class labels for X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.

Returns:

maj – Predicted class labels.

Return type:

array-like of shape (n_samples,)

predict_proba(X)[source]#

Compute probabilities of possible outcomes for samples in X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.

Returns:

avg – Weighted average probability for each class per sample.

Return type:

array-like of shape (n_samples, n_classes)

transform(X)[source]#

Return class labels or probabilities for X for each estimator.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns:

If voting=’soft’ and flatten_transform=True: ndarray of shape (n_samples, n_classifiers * n_classes), being class probabilities calculated by each classifier.
If voting=’soft’ and flatten_transform=False: ndarray of shape (n_classifiers, n_samples, n_classes).
If voting=’hard’: ndarray of shape (n_samples, n_classifiers), being class labels predicted by each classifier.

Return type:

probabilities_or_labels
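
The three return shapes listed above can be checked on a toy dataset. The sketch below is illustrative only: the choice of base estimators and the helper name make_voter are assumptions, not part of the API.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])


def make_voter(**kwargs):
    # Hypothetical helper: three classifiers voting on a two-class problem.
    estimators = [
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("gnb", GaussianNB()),
    ]
    return VotingClassifier(estimators=estimators, **kwargs).fit(X, y)


soft_flat = make_voter(voting="soft", flatten_transform=True).transform(X)
soft_3d = make_voter(voting="soft", flatten_transform=False).transform(X)
hard = make_voter(voting="hard").transform(X)

print(soft_flat.shape)  # (n_samples, n_classifiers * n_classes)
print(soft_3d.shape)    # (n_classifiers, n_samples, n_classes)
print(hard.shape)       # (n_samples, n_classifiers)
```

With 6 samples, 3 classifiers and 2 classes, the flattened soft output has 3 * 2 = 6 columns, which matches the (6, 6) shape shown in the class-level example.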

class watex.exlib.XGBClassifier(*, objective='binary:logistic', use_label_encoder=None, **kwargs)[source]#

Bases: XGBModel, ClassifierMixin

Implementation of the scikit-learn API for XGBoost classification.

Parameters:

n_estimators (int) – Number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by gpu_hist tree method.
- uniform: select random training instances uniformly.
- gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data that is to be treated as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.

Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.

Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
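import xgboost as xgb

# parameters_grid, custom_rates and Xy (a DMatrix) are assumed to be defined by the caller
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgb.train(params, Xy, callbacks=callbacks)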
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
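
As an illustration of the objective(y_true, y_pred) -> grad, hess signature described above, here is a minimal NumPy sketch of the binary logistic loss. It assumes y_pred holds raw margins rather than probabilities; the function name logistic_objective and the sample arrays are made up for this example.

```python
import numpy as np


def logistic_objective(y_true, y_pred):
    """Return per-sample gradient and hessian of the binary logistic loss.

    Assumption: y_pred contains raw (untransformed) margin values.
    """
    p = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid maps margins to probabilities
    grad = p - y_true                  # first derivative w.r.t. the margin
    hess = p * (1.0 - p)               # second derivative w.r.t. the margin
    return grad, hess


y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.0, 0.0, 2.0])
grad, hess = logistic_objective(y_true, y_pred)
print(grad, hess)
```

A callable like this could then be passed as the objective parameter of the classifier; both returned arrays have shape [n_samples], as the note requires.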

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)[source]#

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X – Feature matrix
y – Labels
sample_weight – instance weights
base_margin – global bias for each instance.
eval_set – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X – Data to predict with.
output_margin – Whether to output the raw untransformed margin value.
ntree_limit – Deprecated, use iteration_range instead.
validate_features – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin – Margin added to prediction.
iteration_range –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

New in version 1.4.0.

Return type:

prediction

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

Returns:

A numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

watex.exlib.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)[source]#

Accuracy classification score.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Read more in the User Guide.

Parameters:

y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
normalize (bool, default=True) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).

The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type:

float

See also

balanced_accuracy_score: Compute the balanced accuracy to deal with imbalanced datasets.
jaccard_score: Compute the Jaccard similarity coefficient score.
hamming_loss: Compute the average Hamming loss or Hamming distance between two sets of samples.
zero_one_loss: Compute the Zero-one classification loss. By default, the function will return the percentage of imperfectly predicted subsets.

Notes

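In binary classification, this function is not, in general, equal to the jaccard_score function: accuracy counts true negatives as correct predictions, whereas the Jaccard score of the positive class ignores them. The two coincide, for example, when there are no true negatives.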
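
The relationship between accuracy_score and jaccard_score in the binary case can be checked numerically. A minimal sketch on made-up labels, using jaccard_score with its defaults (average='binary', pos_label=1):

```python
from sklearn.metrics import accuracy_score, jaccard_score

# Case 1: true negatives are present, so the two scores differ.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
acc = accuracy_score(y_true, y_pred)   # (TP + TN) / n = 3 / 4
jac = jaccard_score(y_true, y_pred)    # TP / (TP + FP + FN) = 1 / 2

# Case 2: no true negatives, so both reduce to TP / n.
y_true_b = [1, 1, 0]
y_pred_b = [1, 0, 1]
acc_b = accuracy_score(y_true_b, y_pred_b)
jac_b = jaccard_score(y_true_b, y_pred_b)

print(acc, jac, acc_b, jac_b)
```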

Examples

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

In the multilabel case with binary label indicators:

>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5

watex.exlib.classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')[source]#

Build a text report showing the main classification metrics.

Read more in the User Guide.

Parameters:

y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like of shape (n_labels,), default=None) – Optional list of label indices to include in the report.
target_names (list of str of shape (n_labels,), default=None) – Optional display names matching the labels (same order).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
digits (int, default=2) – Number of digits for formatting output floating point values. When output_dict is True, this will be ignored and the returned values will not be rounded.
output_dict (bool, default=False) –
If True, return output as dict.

New in version 0.20.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

Returns:

report – Text summary of the precision, recall, F1 score for each class. Dictionary returned if output_dict is True. Dictionary has the following structure:

{'label 1': {'precision':0.5,
             'recall':1.0,
             'f1-score':0.67,
             'support':1},
 'label 2': { ... },
  ...
}

The reported averages include macro average (averaging the unweighted mean per label), weighted average (averaging the support-weighted mean per label), and sample average (only for multilabel classification). Micro average (averaging the total true positives, false negatives and false positives) is only shown for multi-label or multi-class with a subset of classes, because it corresponds to accuracy otherwise and would be the same for all metrics. See also precision_recall_fscore_support() for more details on averages.

Note that in binary classification, recall of the positive class is also known as “sensitivity”; recall of the negative class is “specificity”.

Return type:

str or dict

See also

precision_recall_fscore_support: Compute precision, recall, F-measure and support for each class.
confusion_matrix: Compute confusion matrix to evaluate the accuracy of a classification.
multilabel_confusion_matrix: Compute a confusion matrix for each class or sample.

Examples

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 2]
>>> y_pred = [0, 0, 2, 2, 1]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5

>>> y_pred = [1, 1, 0]
>>> y_true = [1, 1, 1]
>>> print(classification_report(y_true, y_pred, labels=[1, 2, 3]))
              precision    recall  f1-score   support

           1       1.00      0.67      0.80         3
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0

   micro avg       1.00      0.67      0.80         3
   macro avg       0.33      0.22      0.27         3
weighted avg       1.00      0.67      0.80         3
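
When individual values are needed programmatically, output_dict=True avoids parsing the text report. A short sketch reusing the labels from the first example; note that without target_names the per-class keys are the labels converted to str:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
report = classification_report(y_true, y_pred, output_dict=True)

# Per-class metrics are nested dicts; the averages have their own keys.
recall_class_2 = report["2"]["recall"]
overall_accuracy = report["accuracy"]
macro_f1 = report["macro avg"]["f1-score"]

print(recall_class_2, overall_accuracy, macro_f1)
```

As stated for the digits parameter, the values in the dictionary are not rounded.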

watex.exlib.clone(estimator, *, safe=True)[source]#

Construct a new unfitted estimator with the same parameters.

Clone does a deep copy of the model in an estimator without actually copying attached data. It returns a new estimator with the same parameters that has not been fitted on any data.

Parameters:

estimator ({list, tuple, set} of estimator instance or a single estimator instance) – The estimator or group of estimators to be cloned.
safe (bool, default=True) – If safe is False, clone will fall back to a deep copy on objects that are not estimators.

Returns:

estimator – The deep copy of the input, an estimator if input is an estimator.

Return type:

object

Notes

If the estimator’s random_state parameter is an integer (or if the estimator doesn’t have a random_state parameter), an exact clone is returned: the clone and the original estimator will give the exact same results. Otherwise, a statistical clone is returned: the clone might return different results from the original estimator. More details can be found in randomness.
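
A small sketch of what clone preserves and what it discards; the estimator and data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

X = np.array([[-1.0], [-2.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

fitted = LogisticRegression(C=0.5).fit(X, y)
fresh = clone(fitted)

# Constructor parameters are copied ...
same_params = fresh.get_params() == fitted.get_params()
# ... but attributes learned during fit (such as coef_) are not.
is_unfitted = not hasattr(fresh, "coef_")

print(same_params, is_unfitted)
```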

watex.exlib.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)[source]#

Compute confusion matrix to evaluate the accuracy of a classification.

By definition a confusion matrix \(C\) is such that \(C_{i, j}\) is equal to the number of observations known to be in group \(i\) and predicted to be in group \(j\).

Thus in binary classification, the count of true negatives is \(C_{0,0}\), false negatives is \(C_{1,0}\), true positives is \(C_{1,1}\) and false positives is \(C_{0,1}\).

Read more in the User Guide.

Parameters:

y_true (array-like of shape (n_samples,)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,)) – Estimated targets as returned by a classifier.
labels (array-like of shape (n_classes), default=None) – List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.
sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights.

New in version 0.18.
normalize ({'true', 'pred', 'all'}, default=None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.

Returns:

C – Confusion matrix whose i-th row and j-th column entry indicates the number of samples with true label being i-th class and predicted label being j-th class.

Return type:

ndarray of shape (n_classes, n_classes)

See also

ConfusionMatrixDisplay.from_estimator: Plot the confusion matrix given an estimator, the data, and the label.
ConfusionMatrixDisplay.from_predictions: Plot the confusion matrix given the true and predicted labels.
ConfusionMatrixDisplay: Confusion Matrix visualization.

References

[1]

Wikipedia entry for the Confusion matrix (Wikipedia and other references may use a different convention for axes).

Examples

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
>>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
>>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

In the binary case, we can extract true negatives, false positives, false negatives and true positives as follows:

>>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
>>> (tn, fp, fn, tp)
(0, 2, 1, 1)
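
The normalize parameter can be illustrated with the first example's labels. In this sketch, normalize='true' divides each row by the number of samples of that true class, so every row sums to 1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Row i holds the fraction of class-i samples predicted as each class.
cm_true = confusion_matrix(y_true, y_pred, normalize="true")
row_sums = cm_true.sum(axis=1)

print(cm_true)
print(row_sums)
```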

watex.exlib.cross_val_predict(estimator, X, y=None, *, groups=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', method='predict')[source]#

Generate cross-validated estimates for each input data point.

The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate() and cross_val_score() unless all test sets have equal size and the metric decomposes over samples.

Read more in the User Guide.

Parameters:

estimator (estimator object implementing 'fit' and 'predict') – The object to use to fit the data.
X (array-like of shape (n_samples, n_features)) – The data to fit. Can be, for example a list, or an array at least 2d.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – The target variable to try to predict in the case of supervised learning.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable that generates (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
n_jobs (int, default=None) – Number of jobs to run in parallel. Training the estimator and predicting are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
verbose (int, default=0) – The verbosity level.
fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.
pre_dispatch (int or str, default='2*n_jobs') –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
method ({'predict', 'predict_proba', 'predict_log_proba', 'decision_function'}, default='predict') – The method to be invoked by estimator.

Returns:

predictions –

This is the result of calling method. Shape:

When method is ‘predict’ and in special case where method is ‘decision_function’ and the target is binary: (n_samples,)

When method is one of {‘predict_proba’, ‘predict_log_proba’, ‘decision_function’} (unless special case above): (n_samples, n_classes)

If estimator is multioutput, an extra dimension ‘n_outputs’ is added to the end of each shape above.

Return type:

ndarray

See also

cross_val_score: Calculate score for each CV split.
cross_validate: Calculate one or more scores and timings for each CV split.

Notes

In the case that one or more classes are absent in a training portion, a default score needs to be assigned to all instances for that class if method produces columns per class, as in {‘decision_function’, ‘predict_proba’, ‘predict_log_proba’}. For predict_proba this value is 0. In order to ensure finite output, we approximate negative infinity by the minimum finite float value for the dtype in other cases.

Examples

>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_predict
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> y_pred = cross_val_predict(lasso, X, y, cv=3)
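
The method parameter changes the shape of the result, as described under Returns. A minimal sketch with a classifier on the iris dataset (the estimator and dataset are arbitrary choices), requesting out-of-fold class probabilities:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Each row is predicted by a model that never saw that sample during fitting.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

print(proba.shape)  # (n_samples, n_classes)
```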

watex.exlib.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)[source]#

Evaluate a score by cross-validation.

Read more in the User Guide.

Parameters:

estimator (estimator object implementing 'fit') – The object to use to fit the data.
X (array-like of shape (n_samples, n_features)) – The data to fit. Can be for example a list, or an array.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – The target variable to try to predict in the case of supervised learning.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
scoring (str or callable, default=None) –
A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y) which should return only a single value.

Similar to cross_validate() but only a single metric is permitted.

If None, the estimator’s default scorer (if available) is used.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable that generates (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
n_jobs (int, default=None) – Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
verbose (int, default=0) – The verbosity level.
fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.
pre_dispatch (int or str, default='2*n_jobs') –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
error_score ('raise' or numeric, default=np.nan) –
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised.

New in version 0.20.

Returns:

scores – Array of scores of the estimator for each run of the cross validation.

Return type:

ndarray of float of shape=(len(list(cv)),)

See also

cross_validate: To run cross-validation on multiple metrics and also to return train scores, fit times and score times.
cross_val_predict: Get predictions from each split of cross-validation for diagnostic purposes.
sklearn.metrics.make_scorer: Make a scorer from a performance metric or loss function.

Examples

>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.3315057  0.08022103 0.03531816]
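The rule for int/None cv inputs described in the parameter list can be checked directly. The following is a minimal sketch, assuming the iris dataset and a LogisticRegression classifier (both chosen only for illustration): an integer cv with a classifier and a multiclass target should give the same scores as an explicit, unshuffled StratifiedKFold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv=5 with a classifier and a multiclass y resolves to StratifiedKFold(5).
scores_int = cross_val_score(clf, X, y, cv=5)

# Passing the splitter explicitly (shuffle=False by default) gives the same folds.
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

print(scores_int.shape)
print(np.allclose(scores_int, scores_explicit))
```

Because the splitters are created with shuffle=False, repeated calls reuse the same folds, which is why the two score arrays agree.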

watex.exlib.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]#

Compute the F1 score, also known as balanced F-score or F-measure.

The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.

Read more in the User Guide.

Parameters:

y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

'samples':
Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division, i.e. when all predictions and labels are negative. If set to “warn”, this acts as 0, but warnings are also raised.

Returns:

f1_score – F1 score of the positive class in binary classification or weighted average of the F1 scores of each class for the multiclass task.

Return type:

float or array of float, shape = [n_unique_labels]

See also

fbeta_score: Compute the F-beta score.
precision_recall_fscore_support: Compute the precision, recall, F-score, and support.
jaccard_score: Compute the Jaccard similarity coefficient score.
multilabel_confusion_matrix: Compute a confusion matrix for each class or sample.

Notes

When true positive + false positive == 0, precision is undefined. When true positive + false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0, as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with zero_division.

References

[1]

Wikipedia entry for the F1-score.

Examples

>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true, y_pred, zero_division=1)
1.0...
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> f1_score(y_true, y_pred, average=None)
array([0.66666667, 1.        , 0.66666667])
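The formula F1 = 2 * (precision * recall) / (precision + recall) can also be verified by hand. The sketch below uses a small made-up binary example (the labels are illustrative only), counts true positives, false positives and false negatives for the positive class, and compares the result with f1_score.

```python
from sklearn.metrics import f1_score

# Hypothetical binary labels, positive class is 1.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * (precision * recall) / (precision + recall)

print(f1_manual)                  # 0.75
print(f1_score(y_true, y_pred))   # default average='binary', pos_label=1
```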

watex.exlib.f_classif(X, y)[source]#

Compute the ANOVA F-value for the provided sample.

Read more in the User Guide.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially.
y (ndarray of shape (n_samples,)) – The target vector.

Returns:

f_statistic (ndarray of shape (n_features,)) – F-statistic for each feature.
p_values (ndarray of shape (n_features,)) – P-values associated with the F-statistic.

See also

chi2: Chi-squared stats of non-negative features for classification tasks.
f_regression: F-value between label/feature for regression tasks.
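Examples

A minimal usage sketch, assuming the iris dataset for illustration. f_classif returns one F-statistic and one p-value per feature; the value for a single feature is cross-checked here against scipy.stats.f_oneway, which computes a one-way ANOVA on the per-class groups of that feature.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

# One F-statistic and one p-value per column of X.
f_statistic, p_values = f_classif(X, y)
print(f_statistic.shape, p_values.shape)

# Cross-check the first feature with a one-way ANOVA over the class groups.
groups = [X[y == label, 0] for label in np.unique(y)]
f_first, p_first = f_oneway(*groups)
print(np.isclose(f_statistic[0], f_first))
```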

watex.exlib.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False, fit_params=None)[source]#

Learning curve.

Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.

Read more in the User Guide.

Parameters:

estimator (object type that implements the "fit" and "predict" methods) – An object of that type which is cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
train_sizes (array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
scoring (str or callable, default=None) – A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
exploit_incremental_learning (bool, default=False) – If the estimator supports incremental learning, this will be used to speed up fitting for different training set sizes.
n_jobs (int, default=None) – Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the different training and test sets. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
pre_dispatch (int or str, default='all') – Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like ‘2*n_jobs’.
verbose (int, default=0) – Controls the verbosity: the higher, the more messages.
shuffle (bool, default=False) – Whether to shuffle training data before taking prefixes of it based on train_sizes.
random_state (int, RandomState instance or None, default=None) – Used when shuffle is True. Pass an int for reproducible output across multiple function calls. See Glossary.
error_score ('raise' or numeric, default=np.nan) –
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised.

New in version 0.20.
return_times (bool, default=False) – Whether to return the fit and score times.
fit_params (dict, default=None) –
Parameters to pass to the fit method of the estimator.

New in version 0.24.

Returns:

train_sizes_abs (array of shape (n_unique_ticks,)) – Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.
train_scores (array of shape (n_ticks, n_cv_folds)) – Scores on training sets.
test_scores (array of shape (n_ticks, n_cv_folds)) – Scores on test set.
fit_times (array of shape (n_ticks, n_cv_folds)) – Times spent for fitting in seconds. Only present if return_times is True.
score_times (array of shape (n_ticks, n_cv_folds)) – Times spent for scoring in seconds. Only present if return_times is True.

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.model_selection import learning_curve
>>> X, y = make_classification(n_samples=100, n_features=10, random_state=42)
>>> tree = DecisionTreeClassifier(max_depth=4, random_state=42)
>>> train_size_abs, train_scores, test_scores = learning_curve(
...     tree, X, y, train_sizes=[0.3, 0.6, 0.9]
... )
>>> for train_size, cv_train_scores, cv_test_scores in zip(
...     train_size_abs, train_scores, test_scores
... ):
...     print(f"{train_size} samples were used to train the model")
...     print(f"The average train accuracy is {cv_train_scores.mean():.2f}")
...     print(f"The average test accuracy is {cv_test_scores.mean():.2f}")
24 samples were used to train the model
The average train accuracy is 1.00
The average test accuracy is 0.85
48 samples were used to train the model
The average train accuracy is 1.00
The average test accuracy is 0.90
72 samples were used to train the model
The average train accuracy is 1.00
The average test accuracy is 0.93
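The shapes listed under Returns can be inspected with return_times=True. This sketch reuses the data and estimator from the example above and passes cv=5 explicitly, so that the second axis of each score array has five entries, one per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=42)

# With return_times=True, two extra arrays (fit and score times) are returned.
sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    tree, X, y, train_sizes=[0.3, 0.6, 0.9], cv=5, return_times=True
)

print(sizes)               # absolute training set sizes
print(train_scores.shape)  # (n_ticks, n_cv_folds)
print(fit_times.shape)     # same shape as the score arrays
```

With 5-fold splitting of 100 samples, each training fold holds 80 samples, so the fractions 0.3, 0.6 and 0.9 translate to 24, 48 and 72 samples.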

class watex.exlib.make_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None)[source]#

Bases: object

Create a callable to select columns to be used with ColumnTransformer.

make_column_selector() can select columns based on datatype or on the column names with a regex. When using multiple selection criteria, all criteria must match for a column to be selected.

Parameters:

pattern (str, default=None) – Name of columns containing this regex pattern will be included. If None, columns will not be selected based on pattern.
dtype_include (column dtype or list of column dtypes, default=None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
dtype_exclude (column dtype or list of column dtypes, default=None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().

Returns:

selector – Callable for column selection to be used by a ColumnTransformer.

Return type:

callable

See also

ColumnTransformer: Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.compose import make_column_selector
>>> import numpy as np
>>> import pandas as pd  
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})  
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        make_column_selector(dtype_include=np.number)),  # rating
...       (OneHotEncoder(),
...        make_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)  
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
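The pattern parameter, and the rule that all criteria must match, can be illustrated by calling the selector directly on a DataFrame. This is a small sketch with made-up column names; the selector returns the list of selected column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame: two numeric "temp_*" columns, one numeric id, one text column.
df = pd.DataFrame({
    "temp_in": [20.5, 21.0],
    "temp_out": [5.0, 7.5],
    "station_id": [1, 2],
    "city": ["London", "Paris"],
})

# Select by name only: columns whose name matches the regex.
select_temp = make_column_selector(pattern="^temp_")
print(select_temp(df))

# Combine criteria: the name must match AND the dtype must be numeric.
select_out_numeric = make_column_selector(
    pattern="_out$|^city$", dtype_include="number"
)
print(select_out_numeric(df))
```

In the second selector, "city" matches the pattern but is not numeric, so only "temp_out" is kept.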

watex.exlib.make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False, verbose_feature_names_out=True)[source]#

Construct a ColumnTransformer from the given transformers.

This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting with transformer_weights.

Read more in the User Guide.

Parameters:

*transformers (tuples) –
Tuples of the form (transformer, columns) specifying the transformer objects to be applied to subsets of the data.

transformer ({‘drop’, ‘passthrough’} or estimator) –
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

columns (str, array-like of str, int, array-like of int, slice, array-like of bool or callable) –
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
remainder ({'drop', 'passthrough'} or estimator, default='drop') – By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (the default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.
sparse_threshold (float, default=0.3) – If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.
n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.
verbose_feature_names_out (bool, default=True) –
If True, get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.

New in version 1.0.

Returns:

ct – Returns a ColumnTransformer object.

Return type:

ColumnTransformer

See also

ColumnTransformer: Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> make_column_transformer(
...     (StandardScaler(), ['numerical_column']),
...     (OneHotEncoder(), ['categorical_column']))
ColumnTransformer(transformers=[('standardscaler', StandardScaler(...),
                                 ['numerical_column']),
                                ('onehotencoder', OneHotEncoder(...),
                                 ['categorical_column'])])
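The effect of the remainder parameter can be seen on a small array. In this sketch (toy data, chosen only for illustration), column 0 is standardized; with the default remainder='drop' the other columns are discarded, while remainder='passthrough' appends them unchanged after the transformed column.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

# Default remainder='drop': only the scaled column 0 is returned.
ct_drop = make_column_transformer((StandardScaler(), [0]))
X_drop = ct_drop.fit_transform(X)

# remainder='passthrough': columns 1 and 2 are appended untransformed.
ct_pass = make_column_transformer((StandardScaler(), [0]),
                                  remainder="passthrough")
X_pass = ct_pass.fit_transform(X)

print(X_drop.shape)
print(X_pass.shape)
```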

watex.exlib.make_pipeline(*steps, memory=None, verbose=False)[source]#

Construct a Pipeline from the given estimators.

This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.

Parameters:

*steps (list of Estimator objects) – List of the scikit-learn estimators that are chained together.
memory (str or object with the joblib.Memory interface, default=None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
verbose (bool, default=False) – If True, the time elapsed while fitting each step will be printed as it is completed.

Returns:

p – Returns a scikit-learn Pipeline object.

Return type:

Pipeline

See also

Pipeline: Class for creating a pipeline of transforms with a final estimator.

Examples

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> make_pipeline(StandardScaler(), GaussianNB(priors=None))
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gaussiannb', GaussianNB())])
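The automatically generated step names can be used to reach the fitted estimators through steps or named_steps. A minimal sketch, assuming the iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X, y)

# Names are the lowercased class names of the steps.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# named_steps gives access to each fitted step by its generated name.
fitted_nb = pipe.named_steps["gaussiannb"]
predictions = pipe.predict(X[:2])
print(predictions.shape)
```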

watex.exlib.make_scorer(score_func, *, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)[source]#

Make a scorer from a performance metric or loss function.

This factory function wraps scoring functions for use in GridSearchCV and cross_val_score(). It takes a score function, such as accuracy_score(), mean_squared_error(), adjusted_rand_score() or average_precision_score() and returns a callable that scores an estimator’s output. The signature of the call is (estimator, X, y) where estimator is the model to be evaluated, X is the data and y is the ground truth labeling (or None in the case of unsupervised models).

Read more in the User Guide.

Parameters:

score_func (callable) – Score function (or loss function) with signature score_func(y, y_pred, **kwargs).
greater_is_better (bool, default=True) – Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
needs_proba (bool, default=False) –
Whether score_func requires predict_proba to get probability estimates out of a classifier.

If True, for binary y_true, the score function is supposed to accept a 1D y_pred (i.e., probability of the positive class, shape (n_samples,)).
needs_threshold (bool, default=False) –
Whether score_func takes a continuous decision certainty. This only works for binary classification using estimators that have either a decision_function or predict_proba method.

If True, for binary y_true, the score function is supposed to accept a 1D y_pred (i.e., probability of the positive class or the decision function, shape (n_samples,)).

For example, average_precision or the area under the ROC curve cannot be computed using discrete predictions alone.
**kwargs (additional arguments) – Additional parameters to be passed to score_func.

Returns:

scorer – Callable object that returns a scalar score; greater is better.

Return type:

callable

Notes

If needs_proba=False and needs_threshold=False, the score function is supposed to accept the output of predict. If needs_proba=True, the score function is supposed to accept the output of predict_proba (For binary y_true, the score function is supposed to accept probability of the positive class). If needs_threshold=True, the score function is supposed to accept the output of decision_function or predict_proba when decision_function is not present.

Examples

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> ftwo_scorer
make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
...                     scoring=ftwo_scorer)
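The sign flip applied when greater_is_better=False can be observed by wrapping a loss function. This sketch uses a tiny made-up regression problem: the scorer built from mean_squared_error returns the negated loss, so that a larger scorer value still means a better model.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Toy data, for illustration only.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.1, 1.9, 3.2]
model = LinearRegression().fit(X, y)

# Wrap a loss: the scorer negates the value returned by mean_squared_error.
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

score = neg_mse_scorer(model, X, y)              # called as (estimator, X, y)
mse = mean_squared_error(y, model.predict(X))    # plain loss value

print(score, mse)
```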

watex.exlib.mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)[source]#

Mean squared error regression loss.

Read more in the User Guide.

Parameters:

y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

'raw_values':
Returns a full set of errors in case of multioutput input.

'uniform_average':
Errors of all outputs are averaged with uniform weight.
squared (bool, default=True) – If True returns MSE value, if False returns RMSE value.

Returns:

loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

Return type:

float or ndarray of floats

Examples

>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred, squared=False)
0.612...
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, squared=False)
0.822...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1.        ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...
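For reference, the first result above can be reproduced by hand: the MSE is the mean of the squared differences, and the RMSE (the value returned with squared=False) is its square root. A minimal NumPy sketch:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Squared errors are 0.25, 0.25, 0.0 and 1.0; their mean is the MSE.
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

print(mse_manual)                          # 0.375
print(rmse_manual)
print(mean_squared_error(y_true, y_pred))  # matches mse_manual
```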

watex.exlib.permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)[source]#

Permutation importance for feature evaluation [BRE].

The estimator is required to be a fitted estimator. X can be the data set used to train the estimator or a hold-out set. The permutation importance of a feature is calculated as follows. First, a baseline metric, defined by scoring, is evaluated on a (potentially different) dataset defined by X. Next, a feature column from the validation set is permuted and the metric is evaluated again. The permutation importance is defined to be the difference between the baseline metric and the metric obtained after permuting the feature column.

Read more in the User Guide.

Parameters:

estimator (object) – An estimator that has already been fitted and is compatible with scorer.
X (ndarray or DataFrame, shape (n_samples, n_features)) – Data on which permutation importance will be computed.
y (array-like or None, shape (n_samples, ) or (n_samples, n_classes)) – Targets for supervised or None for unsupervised.
scoring (str, callable, list, tuple, or dict, default=None) –
Scorer to use. If scoring represents a single score, one can use:
- a single string (see scoring_parameter);
- a callable (see scoring) that returns a single value.
If scoring represents multiple scores, one can use:
- a list or tuple of unique strings;
- a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
- a dictionary with metric names as keys and callables as values.
Passing multiple scores to scoring is more efficient than calling permutation_importance for each of the scores as it reuses predictions to avoid redundant computation.

If None, the estimator’s default scorer is used.
n_repeats (int, default=5) – Number of times to permute a feature.
n_jobs (int or None, default=None) – Number of jobs to run in parallel. The computation is done by computing the permutation score for each column and is parallelized over the columns. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
random_state (int, RandomState instance, default=None) – Pseudo-random number generator to control the permutations of each feature. Pass an int to get reproducible results across function calls. See Glossary.
sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights used in scoring.

New in version 0.24.
max_samples (int or float, default=1.0) –
The number of samples to draw from X to compute feature importance in each repeat (without replacement).
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples.
- If max_samples is equal to 1.0 or X.shape[0], all samples will be used.
While using this option may provide less accurate importance estimates, it keeps the method tractable when evaluating feature importance on large datasets. In combination with n_repeats, this makes it possible to control the trade-off between computational speed and statistical accuracy.

New in version 1.0.

Returns:

result – Dictionary-like object, with the following attributes.

importances_mean (ndarray of shape (n_features,)) – Mean of feature importance over n_repeats.
importances_std (ndarray of shape (n_features,)) – Standard deviation over n_repeats.
importances (ndarray of shape (n_features, n_repeats)) – Raw permutation importance scores.

If there are multiple scoring metrics in the scoring parameter result is a dict with scorer names as keys (e.g. ‘roc_auc’) and Bunch objects like above as values.

Return type:

Bunch or dict of such instances

References

[BRE]

L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

Examples

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.inspection import permutation_importance
>>> X = [[1, 9, 9],[1, 9, 9],[1, 9, 9],
...      [0, 9, 9],[0, 9, 9],[0, 9, 9]]
>>> y = [1, 1, 1, 0, 0, 0]
>>> clf = LogisticRegression().fit(X, y)
>>> result = permutation_importance(clf, X, y, n_repeats=10,
...                                 random_state=0)
>>> result.importances_mean
array([0.4666..., 0.       , 0.       ])
>>> result.importances_std
array([0.2211..., 0.       , 0.       ])
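The procedure described at the top of this entry (baseline score, permute one column, score again, take the difference) can be written out by hand. The following is a simplified sketch and not scikit-learn's implementation: it uses the estimator's own score method as the metric, ignores sample_weight and max_samples, runs serially, and relies on synthetic data generated only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 2 informative features and 2 noise features.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

baseline = clf.score(X, y)       # baseline metric on unpermuted data
rng = np.random.default_rng(0)
n_repeats = 5

importances = np.zeros((X.shape[1], n_repeats))
for j in range(X.shape[1]):
    for r in range(n_repeats):
        X_perm = X.copy()
        # Shuffle only column j, breaking its link with the target.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        importances[j, r] = baseline - clf.score(X_perm, y)

importances_mean = importances.mean(axis=1)
importances_std = importances.std(axis=1)
print(importances_mean)
```

The exact numbers differ from those of permutation_importance because the random permutations are drawn differently, but the arrays have the same meaning and shapes as the attributes listed under Returns.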

watex.exlib.precision_recall_curve(y_true, probas_pred, *, pos_label=None, sample_weight=None)[source]#

Compute precision-recall pairs for different probability thresholds.

Note: this implementation is restricted to the binary classification task.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

The first precision and recall values are precision=class balance and recall=1.0 which corresponds to a classifier that always predicts the positive class.

Read more in the User Guide.

Parameters:

y_true (ndarray of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
probas_pred (ndarray of shape (n_samples,)) – Target scores, can either be probability estimates of the positive class, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).
pos_label (int or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

precision (ndarray of shape (n_thresholds + 1,)) – Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
recall (ndarray of shape (n_thresholds + 1,)) – Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
thresholds (ndarray of shape (n_thresholds,)) – Increasing thresholds on the decision function used to compute precision and recall where n_thresholds = len(np.unique(probas_pred)).

See also

PrecisionRecallDisplay.from_estimator: Plot Precision Recall Curve given a binary classifier.
PrecisionRecallDisplay.from_predictions: Plot Precision Recall Curve using predictions from a binary classifier.
average_precision_score: Compute average precision from prediction scores.
det_curve: Compute error rates for different probability thresholds.
roc_curve: Compute Receiver operating characteristic (ROC) curve.

Examples

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> precision
array([0.5       , 0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.1 , 0.35, 0.4 , 0.8 ])
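Each (precision, recall) pair follows from the definitions tp / (tp + fp) and tp / (tp + fn) applied to the predictions with score >= threshold. The sketch below recomputes the pair for the threshold 0.35 by hand on the data from the example above and compares it with the corresponding entry returned by precision_recall_curve. The helper name precision_recall_at is hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted_positive = y_scores >= threshold
    tp = np.sum(predicted_positive & (y_true == 1))
    fp = np.sum(predicted_positive & (y_true == 0))
    fn = np.sum(~predicted_positive & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


p_manual, r_manual = precision_recall_at(0.35)

# Look up the same threshold in the arrays returned by scikit-learn.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
idx = int(np.flatnonzero(np.isclose(thresholds, 0.35))[0])

print(p_manual, r_manual)
print(precision[idx], recall[idx])
```

At the threshold 0.35, three samples are predicted positive, two of them correctly, so precision is 2/3 and recall is 1.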

watex.exlib.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]#

Compute the precision.

The best value is 1 and the worst value is 0.

Read more in the User Guide.

Parameters:

y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

'samples':
Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

Returns:

precision – Precision of the positive class in binary classification or weighted average of the precision of each class for the multiclass task.

Return type:

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support: Compute precision, recall, F-measure and support for each class.
recall_score: Compute the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.
PrecisionRecallDisplay.from_estimator: Plot precision-recall curve given an estimator and some data.
PrecisionRecallDisplay.from_predictions: Plot precision-recall curve given binary class predictions.
multilabel_confusion_matrix: Compute a confusion matrix for each class or sample.

Notes

When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...
>>> precision_score(y_true, y_pred, average='weighted')
0.22...
>>> precision_score(y_true, y_pred, average=None)
array([0.66..., 0.        , 0.        ])
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> precision_score(y_true, y_pred, average=None)
array([0.33..., 0.        , 0.        ])
>>> precision_score(y_true, y_pred, average=None, zero_division=1)
array([0.33..., 1.        , 1.        ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> precision_score(y_true, y_pred, average=None)
array([0.5, 1. , 1. ])
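
The averaging modes described above can be reproduced by hand from a confusion matrix. This is a sketch for the multiclass data of the first example; the variable names (per_class, macro, micro, weighted) are illustrative, and undefined per-class precision is set to 0, which mirrors the default zero_division behaviour without raising the warning.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
tp = np.diag(cm).astype(float)               # true positives per class
predicted = cm.sum(axis=0).astype(float)     # tp + fp per class
support = cm.sum(axis=1).astype(float)       # number of true instances per class

# Per-class precision tp / (tp + fp); classes never predicted get 0.
per_class = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)

macro = per_class.mean()                             # unweighted mean over classes
micro = tp.sum() / predicted.sum()                   # global counts
weighted = np.average(per_class, weights=support)    # mean weighted by support

print(per_class, macro, micro, weighted)
```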

watex.exlib.recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]#

Compute the recall.

The best value is 1 and the worst value is 0.

Read more in the User Guide.

Parameters:

y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall. Weighted recall is equal to accuracy.

'samples':
Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

Returns:

recall – Recall of the positive class in binary classification or weighted average of the recall of each class for the multiclass task.

Return type:

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support: Compute precision, recall, F-measure and support for each class.
precision_score: Compute the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.
balanced_accuracy_score: Compute balanced accuracy to deal with imbalanced datasets.
multilabel_confusion_matrix: Compute a confusion matrix for each class or sample.
PrecisionRecallDisplay.from_estimator: Plot precision-recall curve given an estimator and some data.
PrecisionRecallDisplay.from_predictions: Plot precision-recall curve given binary class predictions.

Notes

When true positive + false negative == 0, recall returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average='micro')
0.33...
>>> recall_score(y_true, y_pred, average='weighted')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average=None)
array([0.5, 0. , 0. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=1)
array([0.5, 1. , 1. ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> recall_score(y_true, y_pred, average=None)
array([1. , 1. , 0.5])
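
The remark that weighted recall equals accuracy can be checked numerically. The sketch below uses a small made-up multiclass sample (the data and variable names are assumptions for illustration) and computes the support-weighted mean of the per-class recalls by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration; class 0 is deliberately over-represented.
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 0])
y_pred = np.array([0, 2, 1, 0, 0, 1, 0, 2])

labels = np.unique(y_true)
# Recall of class k: fraction of true-k samples that were predicted as k.
per_class = np.array([np.mean(y_pred[y_true == k] == k) for k in labels])
support = np.array([np.sum(y_true == k) for k in labels])

# Weighting each recall by its support gives (total correct) / (total samples).
weighted = np.average(per_class, weights=support)
accuracy = accuracy_score(y_true, y_pred)
print(weighted, accuracy)
```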

watex.exlib.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]#

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Note: this implementation can be used with binary, multiclass and multilabel classification, but some restrictions apply (see Parameters).

Read more in the User Guide.

Parameters:

y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –
Target scores.
- In the binary case, it corresponds to an array of shape (n_samples,). Both probability estimates and non-thresholded decision values can be provided. The probability estimates correspond to the probability of the class with the greater label, i.e. estimator.classes_[1] and thus estimator.predict_proba(X)[:, 1]. The decision values correspond to the output of estimator.decision_function(X). See more information in the User guide;
- In the multiclass case, it corresponds to an array of shape (n_samples, n_classes) of probability estimates provided by the predict_proba method. The probability estimates must sum to 1 across the possible classes. In addition, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. See more information in the User guide;
- In the multilabel case, it corresponds to an array of shape (n_samples, n_classes). Probability estimates are provided by the predict_proba method and the non-thresholded decision values by the decision_function method. The probability estimates correspond to the probability of the class with the greater label for each output of the classifier. See more information in the User guide.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –
If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Note: multiclass ROC AUC generally handles only the 'macro' and 'weighted' averages; for multiclass targets, average=None and average='micro' are only implemented for multi_class='ovr'.

'micro':
Calculate metrics globally by considering each element of the label indicator matrix as a label.

'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).

'samples':
Calculate metrics for each instance, and find their average.

Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2] over the range [0, max_fpr] is returned. For the multiclass case, max_fpr should be either None or 1.0, as partial ROC AUC computation is currently not supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –
Only used for multiclass targets. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.

'ovr':
Stands for One-vs-rest. Computes the AUC of each class against the rest [3] [4]. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.

'ovo':
Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.
labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used.

Returns:

auc – Area Under the Curve score.

Return type:

float

See also

average_precision_score: Area under the precision-recall curve.
roc_curve: Compute Receiver operating characteristic (ROC) curve.
RocCurveDisplay.from_estimator: Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.
RocCurveDisplay.from_predictions: Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.

References

[1]

Wikipedia entry for the Receiver operating characteristic

[2]

Analyzing a portion of the ROC curve. McClish, 1989

[3]

Provost, F., Domingos, P. (2000). Well-trained PETs: Improving probability estimation trees (Section 6.2), CeDER Working Paper #IS-00-04, Stern School of Business, New York University.

[4]

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.

[5]

Hand, D.J., Till, R.J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171-186.

Examples

Binary case:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X)[:, 1])
0.99...
>>> roc_auc_score(y, clf.decision_function(X))
0.99...

Multiclass case:

>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear").fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')
0.99...

Multilabel case:

>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> clf = MultiOutputClassifier(clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85... , 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84... , 0.93..., 0.87..., 0.94...])
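
In the binary case, the ROC AUC can also be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below checks this pairwise interpretation on made-up scores; it is an illustration of the metric's meaning, not a description of how roc_auc_score is implemented, and pairwise_auc is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up binary labels and scores, including one tie between classes.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.35, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

The pairwise form costs O(n_pos * n_neg) and is only meant to make the definition concrete.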

watex.exlib.roc_curve(y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True)[source]#

Compute Receiver operating characteristic (ROC).

Note: this implementation is restricted to the binary classification task.

Read more in the User Guide.

Parameters:

y_true (ndarray of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
y_score (ndarray of shape (n_samples,)) – Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
pos_label (int or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
drop_intermediate (bool, default=True) –
Whether to drop some suboptimal thresholds which would not appear on a plotted ROC curve. This is useful in order to create lighter ROC curves.

New in version 0.17: parameter drop_intermediate.

Returns:

fpr (ndarray of shape (>2,)) – Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].
tpr (ndarray of shape (>2,)) – Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].
thresholds (ndarray of shape (n_thresholds,)) – Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

See also

RocCurveDisplay.from_estimator: Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.
RocCurveDisplay.from_predictions: Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.
det_curve: Compute error rates for different probability thresholds.
roc_auc_score: Compute the area under the ROC curve.

Notes

Since the thresholds are sorted from low to high values, they are reversed upon returning them to ensure they correspond to both fpr and tpr, which are sorted in reversed order during their calculation.

References

[1]

Wikipedia entry for the Receiver operating characteristic

[2]

Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition Letters, 2006, 27(8):861-874.

Examples

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])
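
Each (fpr, tpr) pair can be recomputed by thresholding the scores directly. The sketch below does so for the data of the example above, skipping thresholds[0] because that entry is an artificial "predict nothing" point, and then integrates the curve with metrics.auc. The names manual_fpr, manual_tpr and area are illustrative.

```python
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)

positive = y == 2
manual_fpr, manual_tpr = [], []
# thresholds[0] is an artificial point with no predicted positives, so skip it.
for t in thresholds[1:]:
    predicted_positive = scores >= t
    manual_tpr.append(np.sum(predicted_positive & positive) / np.sum(positive))
    manual_fpr.append(np.sum(predicted_positive & ~positive) / np.sum(~positive))

# Area under the ROC curve by the trapezoidal rule.
area = metrics.auc(fpr, tpr)
print(manual_fpr, manual_tpr, area)
```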

watex.exlib.silhouette_samples(X, labels, *, metric='euclidean', **kwds)[source]#

Compute the Silhouette Coefficient for each sample.

The Silhouette Coefficient is a measure of how well samples are clustered with samples that are similar to themselves. Clustering models with a high Silhouette Coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

This function returns the Silhouette Coefficient for each sample.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

Read more in the User Guide.

Parameters:

X (array-like of shape (n_samples_a, n_samples_a) if metric == "precomputed" or (n_samples_a, n_features) otherwise) – An array of pairwise distances between samples, or a feature array.
labels (array-like of shape (n_samples,)) – Label values for each sample.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Returns:

silhouette – Silhouette Coefficients for each sample.

Return type:

array-like of shape (n_samples,)

References

[1]

Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Computational and Applied Mathematics 20: 53-65.

[2]

Wikipedia entry on the Silhouette Coefficient
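
Examples

The formula (b - a) / max(a, b) can be applied by hand and compared with the function's output. This is a minimal sketch on made-up two-cluster data in which every cluster has more than one sample; the array names D, manual and reference are illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Two small, well-separated clusters (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X, metric="euclidean")
manual = np.empty(len(X))
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False                      # exclude the sample itself
    a = D[i, same].mean()                # mean intra-cluster distance
    # mean distance to the nearest other cluster
    b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
    manual[i] = (b - a) / max(a, b)

reference = silhouette_samples(X, labels)
print(manual)
```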

class watex.exlib.sklearndoc#

Bases: object

Machine Learning in Python

Scikit-learn (Sklearn) is one of the most widely used and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (French Institute for Research in Computer Science and Automation), took the project to another level and made the first public release (v0.1 beta) on 1 February 2010. As of this writing, its version history is as follows:

May 2019: scikit-learn 0.21.0

March 2019: scikit-learn 0.20.3

December 2018: scikit-learn 0.20.2

November 2018: scikit-learn 0.20.1

September 2018: scikit-learn 0.20.0

July 2018: scikit-learn 0.19.2

July 2017: scikit-learn 0.19.0

September 2016: scikit-learn 0.18.0

November 2015: scikit-learn 0.17.0

March 2015: scikit-learn 0.16.0

July 2014: scikit-learn 0.15.0

August 2013: scikit-learn 0.14

watex.exlib.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)[source]#

Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to input data into a single call, so that splitting (and optionally subsampling) data becomes a one-liner.

Read more in the User Guide.

Parameters:

*arrays (sequence of indexables with same length / shape[0]) – Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary.
shuffle (bool, default=True) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
stratify (array-like, default=None) – If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide.

Returns:

splitting – List containing train-test split of inputs.

New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

Return type:

list, length=2 * len(arrays)

Examples

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
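
The stratify parameter keeps class proportions the same in both subsets. The sketch below uses a made-up imbalanced target (15 samples of class 0, 5 of class 1) to show the effect; the data and the ratio variables are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 75% class 0, 25% class 1.
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fraction of class 1 in each subset; both match the 25% of the full data.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Without stratify, a small test set such as this one could end up with no sample of the minority class at all.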

watex.exlib.validation_curve(estimator, X, y, *, param_name, param_range, groups=None, cv=None, scoring=None, n_jobs=None, pre_dispatch='all', verbose=0, error_score=nan, fit_params=None)[source]#

Validation curve.

Determine training and test scores for varying parameter values.

Compute scores for an estimator with different values of a specified parameter. This is similar to grid search with one parameter. However, this will also compute training scores and is merely a utility for plotting the results.

Read more in the User Guide.

Parameters:

estimator (object type that implements the "fit" and "predict" methods) – An object of that type which is cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs) or None) – Target relative to X for classification or regression; None for unsupervised learning.
param_name (str) – Name of the parameter that will be varied.
param_range (array-like of shape (n_values,)) – The values of the parameter that will be evaluated.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
scoring (str or callable, default=None) – A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
n_jobs (int, default=None) – Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the combinations of each parameter value and each cross-validation split. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
pre_dispatch (int or str, default='all') – Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like ‘2*n_jobs’.
verbose (int, default=0) – Controls the verbosity: the higher, the more messages.
error_score ('raise' or numeric, default=np.nan) –
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised.

New in version 0.20.
fit_params (dict, default=None) –
Parameters to pass to the fit method of the estimator.

New in version 0.24.

Returns:

train_scores (array of shape (n_ticks, n_cv_folds)) – Scores on training sets.
test_scores (array of shape (n_ticks, n_cv_folds)) – Scores on test set.

Notes

See the scikit-learn example gallery: auto_examples/model_selection/plot_validation_curve.py
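
Examples

A minimal usage sketch follows. The estimator (SVC), the dataset (iris), the varied parameter (gamma) and its range are all assumptions chosen for illustration; any estimator parameter can be scanned the same way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_range = np.logspace(-3, 1, 5)   # five gamma values from 1e-3 to 10

train_scores, test_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="gamma", param_range=param_range, cv=5)

# One row per parameter value, one column per cross-validation fold.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)

# Parameter value with the best mean cross-validated score.
best_gamma = param_range[np.argmax(mean_test)]
print(mean_train, mean_test, best_gamma)
```

A large gap between mean_train and mean_test at a given value suggests overfitting, while low scores on both suggest underfitting.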

class watex.exlib.xgboostdoc#

Bases: object

Extreme Gradient Boosting

XGBoost, which stands for Extreme Gradient Boosting, is an open-source software library that implements optimized, distributed gradient boosting machine learning algorithms under the Gradient Boosting framework.

XGBoost was proposed by researchers at the University of Washington. It is a library written in C++ that optimizes training for gradient boosting [1]. Before understanding XGBoost, we first need to understand trees, especially the decision tree.

A decision tree (DT) is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. A tree can be 'learned' by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is complete when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions [2].
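
The recursive partitioning described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: splits are scored with Gini impurity (the text does not name a criterion), the functions gini, build_tree and predict are hypothetical helpers, and none of this reflects XGBoost's actual tree construction.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition rows; leaves hold the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if not left or not right:
                continue
            # Size-weighted impurity of the two children.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    # Stop when no split reduces impurity (splitting adds no value).
    if best is None or best[0] >= gini(labels):
        return majority
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow attribute tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up data: the first attribute separates the two classes.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 1.0], [7.0, 2.0], [8.0, 3.0]]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(rows, labels)
print(tree, predict(tree, [2.5, 0.0]))
```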

References

[1]

https://www.geeksforgeeks.org/xgboost/

[2]

https://www.nvidia.com/en-us/glossary/data-science/xgboost/
