<no title> — watex 0.3.3 documentation

class watex.models.validation.BaseEvaluation(estimator, cv=4, pipeline=None, prefit=False, scoring='nmse', random_state=42, verbose=0)

Bases: object

Evaluation of dataset using a base estimator.

Quick evaluation of the data after preparation and pipeline construction.

Parameters:

estimator (Callable) – Estimator used to evaluate the training set and labels; typically a class that implements a fit method. Refer to https://scikit-learn.org/stable/modules/classes.html

cv (int, cross-validation generator or an iterable, default=4) –

A cross-validation splitting strategy. It is used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV, which use the predictions of one estimator as training data for another, so as not to overfit the training supervision. Possible inputs for cv are usually:

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within `Scikit-learn`_
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross-validation at all is an option), the default is 4-fold.

scoring (str) – Specifies the score function to be maximized (usually by cross-validation), or, in some cases, multiple score functions to be reported. The score function can be a string accepted by sklearn.metrics.get_scorer() or a callable scorer, not to be confused with an evaluation metric, as the latter has a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See the scoring parameter section of the Scikit-learn User Guide.
pipeline (Callable or Pipeline object) – If pipeline is given, X is transformed accordingly; otherwise evaluation is made purely with the base estimator on the given X. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline for further details.
kind (str, default='GridSearchCV') – Kind of grid search method. Can be GridSearchCV or RandomizedSearchCV.
prefit (bool, default=False) – If False, the cross-validation score does not need to be computed again; if True, it does.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
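
The third form of cv listed above, an iterable yielding train/test splits, can be sketched in plain Python. The helper `kfold_indices` below is hypothetical, written only for this illustration; it is not part of watex or scikit-learn, and it produces contiguous, unshuffled folds.

```python
def kfold_indices(n_samples, n_splits=4):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled K-fold.

    Hypothetical helper for illustration only; real splitters such as
    scikit-learn's KFold also support shuffling and stratification.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


splits = list(kfold_indices(10, n_splits=4))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Every sample lands in exactly one test fold, and each train/test pair is disjoint; a list of such pairs is the kind of object the third option refers to.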

Examples

>>> import watex as wx
>>> from watex.datasets import load_bagoue
>>> from watex.models import BaseEvaluation
>>> X, y = load_bagoue(as_frame=True)
>>> # categorize the labels
>>> yc = wx.smart_label_classifier(y, values=[1, 3, 10],
...                                # labels=['FR0', 'FR1', 'FR2', 'FR4']
...                                )
>>> # drop the subjective columns ['num', 'name']
>>> X = X.drop(columns=['num', 'name'])
>>> # X = wx.cleaner(X, columns='num name', mode='drop')
>>> X.columns
Index(['shape', 'type', 'geol', 'east', 'north', 'power', 'magnitude', 'sfi',
       'ohmS', 'lwi'],
      dtype='object')
>>> X = wx.naive_imputer(X, mode='bi-impute')  # impute data
>>> # create a pipeline for X
>>> pipe = wx.make_naive_pipe(X)
>>> Xtrain, Xtest, ytrain, ytest = wx.sklearn.train_test_split(X, yc)
>>> b = BaseEvaluation(estimator=wx.sklearn.RandomForestClassifier,
...                    scoring='accuracy', pipeline=pipe)
>>> b.fit(Xtrain, ytrain)  # accepts only array
>>> b.cv_scores_
array([0.75409836, 0.72131148, 0.73333333, 0.78333333])
>>> ypred = b.predict(Xtest)
>>> scores = wx.sklearn.accuracy_score(ytest, ypred)
>>> scores
0.7592592592592593

fit(X, y, sample_weight=0.75)

Quick method used to evaluate the estimator and display the error results as well as the sample model predictions.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
sample_weight (float, default=.75) – The ratio used to sample X and y. The default samples 3/4 (75%) of the data. If given, X and y are sampled at that ratio. If None, half of the data is sampled.

Returns:

`self` – BaseEvaluation object.

Return type:

BaseEvaluation
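
sample_weight is described above as a ratio used to sample X and y. How watex draws the sample is not stated here, so the following is only a rough sketch under two assumptions: rows are drawn without replacement, and X and y stay aligned. `sample_rows` is a hypothetical helper, not a watex function.

```python
import random


def sample_rows(X, y, ratio=0.75, seed=42):
    """Draw the same random subset of rows from X and y.

    Hypothetical helper; assumes sampling without replacement.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    n_keep = int(len(X) * ratio)
    rng = random.Random(seed)
    # Sort the drawn indices so the original row order is preserved.
    keep = sorted(rng.sample(range(len(X)), n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[i, i * 2] for i in range(8)]
y = [i % 2 for i in range(8)]
X_sub, y_sub = sample_rows(X, y, ratio=0.75)
print(len(X_sub), len(y_sub))  # 6 6
```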

property inspect: Inspect whether the object is fitted or not.

predict(X)

Quick prediction and get the scores.

Parameters:

X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Test set; denotes data that is observed at testing and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

y – Test predicted target.

Return type:

array-like, shape (M, ) M=m-samples

class watex.models.validation.GridSearch(base_estimator, grid_params, cv=4, kind='GridSearchCV', scoring='nmse', verbose=0, **grid_kws)

Bases: object

Fine-tune hyperparameters using grid search methods.

Grid search fiddles with the hyperparameters until it finds the best combination.

Parameters:

base_estimator (Callable) – Estimator used to evaluate the training set and labels; typically a class that implements a fit method. Refer to https://scikit-learn.org/stable/modules/classes.html

grid_params (list of dict) –

List of hyperparameters to be fine-tuned. For instance:

param_grid=[dict(
    kpca__gamma=np.linspace(0.03, 0.05, 10),
    kpca__kernel=["rbf", "sigmoid"]
    )]

pipeline (Callable or Pipeline object) – If pipeline is given, X is transformed accordingly; otherwise evaluation is made purely with the base estimator on the given X.
prefit (bool, default=False) – If False, the cross-validation score does not need to be computed again; if True, it does.

cv (int, cross-validation generator or an iterable, default=4) –

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within `Scikit-learn`_
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross-validation at all is an option), the default is 4-fold.

kind (str, default='GridSearchCV' or '1') – Kind of grid parameter search. Can be 1 for GridSearchCV or 2 for RandomizedSearchCV.
scoring (str) – Specifies the score function to be maximized (usually by cross-validation), or, in some cases, multiple score functions to be reported. The score function can be a string accepted by sklearn.metrics.get_scorer() or a callable scorer, not to be confused with an evaluation metric, as the latter has a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See the scoring parameter section of the Scikit-learn User Guide.
random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

Examples

>>> from pprint import pprint
>>> from watex.datasets import fetch_data
>>> from watex.models.validation import GridSearch
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> X_prepared, y_prepared = fetch_data('bagoue prepared')
>>> grid_params = [
...     dict(n_estimators=[3, 10, 30], max_features=[2, 4, 6, 8]),
...     dict(bootstrap=[False], n_estimators=[3, 10],
...          max_features=[2, 3, 4])
...     ]
>>> forest_clf = RandomForestClassifier()
>>> grid_search = GridSearch(forest_clf, grid_params)
>>> grid_search.fit(X=X_prepared, y=y_prepared)
>>> pprint(grid_search.best_params_ )
{'max_features': 8, 'n_estimators': 30}
>>> pprint(grid_search.cv_results_)
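
To see how many candidates a list-of-dict grid such as the one above describes, the sketch below enumerates them in plain Python. `expand_grid` is a hypothetical helper that mirrors the usual grid-search convention (a Cartesian product within each dict, then the union across dicts); it is not the watex implementation.

```python
from itertools import product


def expand_grid(grid_params):
    """Enumerate every hyperparameter combination in a list of dict grids."""
    candidates = []
    for grid in grid_params:
        names = sorted(grid)
        # Cartesian product of the value lists within one dict.
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


grid_params = [
    dict(n_estimators=[3, 10, 30], max_features=[2, 4, 6, 8]),
    dict(bootstrap=[False], n_estimators=[3, 10], max_features=[2, 3, 4]),
]
candidates = expand_grid(grid_params)
print(len(candidates))  # 3*4 + 1*2*3 = 18
print(candidates[0])
```

Each candidate is fitted once per fold, so with cv=4 an exhaustive search over this grid amounts to 18 × 4 = 72 fits.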

property base_estimator: Return the base estimator class.

best_estimator_

best_params_

cv

cv_results_

feature_importances_

fit(X, y)

Fit the base estimator and populate the GridSearch attributes.

Parameters:

X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

`self` – GridSearch object.

Return type:

GridSearch

grid_kws

grid_params

property kind: Kind of search: RandomizedSearchCV or GridSearchCV.

scoring

verbose
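
The two values of kind differ in how many candidates they evaluate. As a minimal sketch in plain Python (the grid values, the budget n_iter and the seed are all illustrative assumptions, not watex defaults): an exhaustive search tries every combination, while a randomized search tries a fixed number drawn at random.

```python
import random
from itertools import product

grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01]}
names = sorted(grid)
all_candidates = [
    dict(zip(names, values)) for values in product(*(grid[n] for n in names))
]

# Exhaustive search ('GridSearchCV' style): evaluate every combination.
exhaustive = all_candidates

# Randomized search ('RandomizedSearchCV' style): evaluate a fixed budget
# of combinations drawn at random without replacement.
n_iter = 5
rng = random.Random(42)
randomized = rng.sample(all_candidates, n_iter)

print(len(exhaustive), len(randomized))  # 15 5
```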

class watex.models.validation.GridSearchMultiple(estimators, scoring, grid_params, *, kind='GridSearchCV', cv=7, random_state=42, savejob=False, filename=None, verbose=0, **grid_kws)

Bases: object

Search for and find the best parameters of multiple estimators.

Parameters:

estimators (list of callable obj) –
List of estimator objects whose hyperparameters are to be fine-tuned. For instance:

random_state = 42
# build estimators
logreg_clf = LogisticRegression(random_state=random_state)
linear_svc_clf = LinearSVC(random_state=random_state)
sgd_clf = SGDClassifier(random_state=random_state)
svc_clf = SVC(random_state=random_state)

estimators = (svc_clf, linear_svc_clf, logreg_clf, sgd_clf)

grid_params (list) –

List of parameter grids. For instance:

grid_params = (
    [dict(C=[1e-2, 1e-1, 1, 10, 100], gamma=[5, 2, 1, 1e-1, 1e-2, 1e-3],
          kernel=['rbf']),
     dict(kernel=['poly'], degree=[1, 3, 5, 7], coef0=[1, 2, 3],
          C=[1e-2, 1e-1, 1, 10, 100])],
    [dict(C=[1e-2, 1e-1, 1, 10, 100], loss=['hinge'])],
    [dict()], [dict()]
)

cv (int, cross-validation generator or an iterable, default=7) –

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within `Scikit-learn`_
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross-validation at all is an option), the default is 7-fold, as given in the signature above.

scoring (str) – Specifies the score function to be maximized (usually by cross-validation), or, in some cases, multiple score functions to be reported. The score function can be a string accepted by sklearn.metrics.get_scorer() or a callable scorer, not to be confused with an evaluation metric, as the latter has a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See the scoring parameter section of the Scikit-learn User Guide.
kind (str, default='GridSearchCV' or '1') – Kind of grid parameter search. Can be 1 for GridSearchCV or 2 for RandomizedSearchCV.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
savejob (bool, default=False) – Save the model parameters to an external file using ‘joblib’ or Python's persistent ‘pickle’ module. Defaults to the ‘joblib’ format.
verbose (int, default=0) – Controls the level of verbosity. Higher values lead to more messages.
grid_kws (dict) – Additional keyword arguments passed to the grid search method.
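
What savejob does can be pictured as an ordinary dump/load round trip. The sketch below uses the standard-library pickle module with made-up tuning results and a hypothetical file name; the exact objects and file layout that watex writes are not documented here.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative tuning results; the structure is an assumption.
best_params = {"SVC": {"C": 100, "gamma": 0.01, "kernel": "rbf"}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "tuned_models.pkl"  # hypothetical file name
    with path.open("wb") as fh:
        pickle.dump(best_params, fh)
    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == best_params)  # True
```

joblib exposes a similar pair of calls, joblib.dump(value, filename) and joblib.load(filename), which is generally the better fit for objects holding large NumPy arrays.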

Examples

>>> import watex as wx
>>> from watex.models import GridSearchMultiple, displayFineTunedResults
>>> from watex.exlib import LinearSVC, SGDClassifier, SVC, LogisticRegression
>>> X, y = wx.fetch_data('bagoue prepared')
>>> X
<344x18 sparse matrix of type '<class 'numpy.float64'>'
    with 2752 stored elements in Compressed Sparse Row format>
>>> # As an example, we build four estimators and provide their
>>> # grid parameter ranges for fine-tuning:
>>> random_state = 42
>>> logreg_clf = LogisticRegression(random_state=random_state)
>>> linear_svc_clf = LinearSVC(random_state=random_state)
>>> sgd_clf = SGDClassifier(random_state=random_state)
>>> svc_clf = SVC(random_state=random_state)
>>> estimators = (svc_clf, linear_svc_clf, logreg_clf, sgd_clf)
>>> grid_params = ([dict(C=[1e-2, 1e-1, 1, 10, 100],
...                      gamma=[5, 2, 1, 1e-1, 1e-2, 1e-3], kernel=['rbf']),
...                 dict(kernel=['poly'], degree=[1, 3, 5, 7], coef0=[1, 2, 3],
...                      C=[1e-2, 1e-1, 1, 10, 100])],
...                [dict(C=[1e-2, 1e-1, 1, 10, 100], loss=['hinge'])],
...                [dict()],  # no parameters provided, for demo purposes
...                [dict()]
...                )
>>> # Now we can call :class:`watex.models.GridSearchMultiple` for
>>> # training and self-validating:
>>> gobj = GridSearchMultiple(estimators=estimators,
...                           grid_params=grid_params,
...                           cv=4,
...                           scoring='accuracy',
...                           verbose=1,  # values > 7 are more verbose
...                           savejob=False,  # set True to save the job to a binary file on disk
...                           kind='GridSearchCV').fit(X, y)
>>> # Once the parameters are fine-tuned, we can display the fine-tuning
>>> # results using the ``displayFineTunedResults`` function
>>> displayFineTunedResults(gobj.models.values_)
MODEL NAME = SVC
BEST PARAM = {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
BEST ESTIMATOR = SVC(C=100, gamma=0.01, random_state=42)

MODEL NAME = LinearSVC
BEST PARAM = {'C': 100, 'loss': 'hinge'}
BEST ESTIMATOR = LinearSVC(C=100, loss='hinge', random_state=42)

MODEL NAME = LogisticRegression
BEST PARAM = {}
BEST ESTIMATOR = LogisticRegression(random_state=42)

MODEL NAME = SGDClassifier
BEST PARAM = {}
BEST ESTIMATOR = SGDClassifier(random_state=42)
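
In the example above, estimators and grid_params are given in the same order, which suggests that the i-th grid list tunes the i-th estimator. Assuming that positional pairing, the sketch below counts how many combinations each estimator gets; the estimator names are plain strings here so that the snippet runs without watex, and `n_candidates` is a hypothetical helper.

```python
from math import prod

estimator_names = ("SVC", "LinearSVC", "LogisticRegression", "SGDClassifier")
grid_params = (
    [
        dict(C=[1e-2, 1e-1, 1, 10, 100],
             gamma=[5, 2, 1, 1e-1, 1e-2, 1e-3], kernel=["rbf"]),
        dict(kernel=["poly"], degree=[1, 3, 5, 7], coef0=[1, 2, 3],
             C=[1e-2, 1e-1, 1, 10, 100]),
    ],
    [dict(C=[1e-2, 1e-1, 1, 10, 100], loss=["hinge"])],
    [dict()],
    [dict()],
)


def n_candidates(grids):
    """Count combinations; an empty dict counts as one (all defaults)."""
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


# Pair by position: the i-th grid list belongs to the i-th estimator.
counts = {
    name: n_candidates(grids)
    for name, grids in zip(estimator_names, grid_params)
}
print(counts)
# {'SVC': 90, 'LinearSVC': 5, 'LogisticRegression': 1, 'SGDClassifier': 1}
```

The empty grids explain why BEST PARAM is {} for LogisticRegression and SGDClassifier in the output above: there is a single candidate, the estimator with its default settings.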

Notes

Call get_scorers() or use sklearn.metrics.SCORERS.keys() to get all the metrics used to evaluate model errors. scoring can be any other metric listed in sklearn.metrics.SCORERS.keys(). Furthermore, if scoring is set to None, ‘nmse’ is used as the default; it stands for ‘neg_mean_squared_error’.

fit(X, y)

watex.models.validation.displayCVTables(cvres, cvmodels)

Display the cross-validation results from all models at each k-fold.

Parameters:

cvres (dict of (str, Array-like)) – Cross-validation results collected after training the models over N parameter settings. Each string key names an entry stored during cross-validation, and each value is a NumPy array.
cvmodels (list) – List of fine-tuned models.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.models import GridSearchMultiple, displayCVTables
>>> X, y = fetch_data('bagoue prepared')
>>> # estimators and grid_params are defined as in the GridSearchMultiple example
>>> gobj = GridSearchMultiple(estimators=estimators,
...                           grid_params=grid_params,
...                           cv=4, scoring='accuracy',
...                           verbose=1, savejob=False,
...                           kind='GridSearchCV')
>>> gobj.fit(X, y)
>>> displayCVTables(cvmodels=[gobj.models.SVC],
...                 cvres=[gobj.models.SVC.cv_results_])
...

watex.models.validation.displayFineTunedResults(cvmodels)

Display fine-tuning results.

Parameters:

cvmodels (list) – List of fine-tuned models.

watex.models.validation.displayModelMaxDetails(cvres, cv=4)

Display the max details of each stored model from cross-validation.

Parameters:

cvres (dict of (str, Array-like)) – Cross-validation results collected after training the models over N parameter settings. Each string key names an entry stored during cross-validation, and each value is a NumPy array.
cv (int, default=4) – The number of K-fold splits used while fine-tuning the model parameters.

watex.models.validation.getGlobalScores(cvres)

Retrieve the global mean and standard deviation score from the cross-validation containers.

Parameters:

cvres (dict of (str, Array-like)) – Cross-validation results collected after training the models over N parameter settings. Each string key names an entry stored during cross-validation, and each value is a NumPy array.

Returns:

Scores on CV test data and standard deviation.

Return type:

('mean_test_scores', 'std_test_scores')
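
As a rough sketch of what a global mean and standard deviation could look like, the snippet below averages the per-candidate entries of a cv_results_-like dict. The keys 'mean_test_score' and 'std_test_score' follow scikit-learn's cv_results_ layout; the numbers are made up, and averaging is an assumption about the aggregation, not a statement of how watex computes it.

```python
from statistics import mean

# Toy stand-in for a cv_results_-like dict; values are made up.
cvres = {
    "mean_test_score": [0.70, 0.80, 0.75],
    "std_test_score": [0.05, 0.02, 0.04],
}


def global_scores(cvres):
    """Average the per-candidate mean and std test scores (assumed logic)."""
    return mean(cvres["mean_test_score"]), mean(cvres["std_test_score"])


mean_score, std_score = global_scores(cvres)
print(round(mean_score, 4), round(std_score, 4))
```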

watex.models.validation.getSplitBestScores(cvres, split=0)

Get the best score at each split from the cross-validation results.

Parameters:

cvres (dict of (str, Array-like)) – Cross-validation results collected after training the models over N parameter settings. Each string key names an entry stored during cross-validation, and each value is a NumPy array.
split (int, default=0) – Index of the split to fetch parameters from. The index is zero-based, so it must not exceed the number of cross-validation folds (cv) minus one.

Returns:

bests – Dictionary of the best parameters at the corresponding split in the cross-validation.

Return type:

dict
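
The idea of picking the best entry for one split can be sketched on a toy cv_results_-like dict. The 'params' and 'split<k>_test_score' keys follow scikit-learn's cv_results_ layout; `split_best` and the shape of its return value are hypothetical and only illustrate the zero-based split index.

```python
# Toy stand-in for a cv_results_-like dict; values are made up.
cvres = {
    "params": [{"C": 1}, {"C": 10}, {"C": 100}],
    "split0_test_score": [0.71, 0.78, 0.74],
    "split1_test_score": [0.69, 0.72, 0.80],
}


def split_best(cvres, split=0):
    """Return the best score and its parameters for one zero-based split."""
    key = f"split{split}_test_score"
    if key not in cvres:
        raise ValueError(f"no scores stored for split {split}")
    scores = cvres[key]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return {
        "split": split,
        "score": scores[best_idx],
        "params": cvres["params"][best_idx],
    }


print(split_best(cvres, split=0))  # {'split': 0, 'score': 0.78, 'params': {'C': 10}}
print(split_best(cvres, split=1))  # {'split': 1, 'score': 0.8, 'params': {'C': 100}}
```

With two stored splits, split=2 raises a ValueError, which matches the rule that the index cannot exceed cv minus one.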

watex.models.validation.get_best_kPCA_params(X, n_components=2, *, y=None, param_grid=None, clf=None, cv=7, **grid_kws)

Select the kernel and hyperparameters that lead to the best performance, using GridSearchCV.

Because kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised task (e.g. classification), so grid search can be used to select the kernel and hyperparameters that lead to the best performance on that task. The default implementation creates a two-step pipeline: first reducing dimensionality to two dimensions using kPCA, then applying LogisticRegression for classification. GridSearchCV is then used to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline.

Parameters:

X (ndarray of shape (M, N), where M = number of samples and N = number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M = number of samples) – Train target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
n_components (int or float) – Number of dimensions to preserve. If n_components is between 0 and 1, it indicates the ratio of variance to preserve.

param_grid (list) –

list of parameters grids. For instance:

param_grid=[dict(
    kpca__gamma=np.linspace(0.03, 0.05, 10),
    kpca__kernel=["rbf", "sigmoid"]
    )]

clf (callable, always as a function, classifier estimator) –
A supervised (or semi-supervised) predictor with a finite set of discrete possible output values. A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the binary classification problem. Classifiers must store a classes_ attribute after fitting, and usually inherit from base.ClassifierMixin, which sets their _estimator_type attribute. A classifier can be distinguished from other estimators with is_classifier. It must implement:
```
* fit
* predict
* score
```
It may also be appropriate to implement decision_function, predict_proba and predict_log_proba. It can also be a base estimator or a composite estimator with a pipeline. For instance:

clf = Pipeline([
    ('kpca', KernelPCA(n_components=2)),
    ('log_reg', LogisticRegression())
    ])

cv (int, cross-validation generator or an iterable, default=7) –

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within `Scikit-learn`_
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross-validation at all is an option), the default is 7-fold, as given in the signature above.

grid_kws (dict) – Additional keyword arguments passed to the grid search method.

Examples

>>> import numpy as np
>>> from sklearn.decomposition import KernelPCA
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> from watex.analysis.dimensionality import get_best_kPCA_params
>>> from watex.datasets import fetch_data
>>> X, y = fetch_data('Bagoue analysis data')
>>> param_grid = [dict(
...     kpca__gamma=np.linspace(0.03, 0.05, 10),
...     kpca__kernel=["rbf", "sigmoid"]
...     )]
>>> clf = Pipeline([
...     ('kpca', KernelPCA(n_components=2)),
...     ('log_reg', LogisticRegression())
...     ])
>>> kpca_best_params = get_best_kPCA_params(
...     X, y=y, scoring='accuracy',
...     n_components=2, clf=clf,
...     param_grid=param_grid)
>>> kpca_best_params
{'kpca__gamma': 0.03, 'kpca__kernel': 'rbf'}
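
The keys in param_grid and in the result use the pipeline convention `<step>__<parameter>`, where the step name is the one given in the Pipeline definition ('kpca' in the example). The hypothetical helper below only illustrates how such keys map back to steps; it is not part of watex or scikit-learn.

```python
def route_params(params):
    """Group 'step__param' keys by pipeline step name."""
    routed = {}
    for key, value in params.items():
        step, sep, name = key.partition("__")
        if not sep:
            raise ValueError(f"{key!r} has no '<step>__<param>' prefix")
        routed.setdefault(step, {})[name] = value
    return routed


best = {"kpca__gamma": 0.03, "kpca__kernel": "rbf"}
print(route_params(best))  # {'kpca': {'gamma': 0.03, 'kernel': 'rbf'}}
```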

watex.models.validation.get_scorers(*, scorer=None, check_scorer=False, error='ignore')

Fetch the list of available metrics from scikit-learn, or verify whether the scorer exists in that list of metrics. This check is a prerequisite for model evaluation.

Parameters:

scorer (str) – Must be a metric for model evaluation. Refer to sklearn.metrics.
check_scorer (bool, default=False) – If True, returns a bool indicating whether the scorer exists in the list of metrics for model evaluation. Note that scorer cannot be None if check_scorer is set to True.
error (str, {'raise', 'ignore'}) – If 'raise', raises a ValueError when scorer is not found in the list of metrics and check_scorer is True.

Returns:

scorers (bool or tuple) – True if scorer is in the list of metrics, provided that scorer is not None; otherwise the tuple of scikit-learn metrics. Refer to sklearn.metrics.
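
The interplay of scorer, check_scorer and error described above can be mimicked on a small, hard-coded tuple of scorer names. `lookup_scorer` and `KNOWN_SCORERS` are hypothetical stand-ins written for this sketch; the real function works against the full scikit-learn list.

```python
# A few scikit-learn scorer names, hard-coded for illustration; the real
# list is much longer.
KNOWN_SCORERS = ("accuracy", "f1", "neg_mean_squared_error", "r2", "roc_auc")


def lookup_scorer(*, scorer=None, check_scorer=False, error="ignore"):
    """Mimic the documented behaviour on a fixed tuple of names."""
    if not check_scorer:
        return KNOWN_SCORERS
    if scorer is None:
        raise ValueError("scorer cannot be None when check_scorer is True")
    found = scorer in KNOWN_SCORERS
    if not found and error == "raise":
        raise ValueError(f"unknown scorer: {scorer!r}")
    return found


print(lookup_scorer())                                       # the full tuple
print(lookup_scorer(scorer="accuracy", check_scorer=True))   # True
print(lookup_scorer(scorer="nope", check_scorer=True))       # False
```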

watex.models.validation.naive_evaluation(clf, X, y, cv=7, scoring='accuracy', display='off', **kws)

Quick scores evaluation using cross validation.

Parameters:

clf (callable) – Classifier for testing default data.
X (ndarray) – Training set data.
y (array_like) – Label data.
cv (int) – Number of K-fold splits for data validation.
scoring (str) – Type of error visualization.
display (str or bool) – Whether to show the scores on stdout.
kws (dict) – Additional keyword arguments passed to watex.exlib.sklearn.cross_val_score().

Returns:

scores, mean_score – Scores after evaluation and the mean of the scores.

Return type:

array_like, float

Examples

>>> import watex as wx
>>> from watex.models.validation import naive_evaluation
>>> X, y = wx.fetch_data('bagoue data prepared')
>>> clf = wx.sklearn.DecisionTreeClassifier()
>>> naive_evaluation(clf, X, y, cv=4, display='on')
clf=: DecisionTreeClassifier
scores=: [0.6279 0.7674 0.7093 0.593 ]
scores.mean=: 0.6744186046511629
(array([0.6279, 0.7674, 0.7093, 0.593 ]), 0.6744186046511629)
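
The returned pair is simply the per-fold scores and their arithmetic mean. A quick check in plain Python, using the rounded fold scores from the sample output above:

```python
from statistics import mean

# Rounded fold scores copied from the sample output above.
scores = [0.6279, 0.7674, 0.7093, 0.593]
result = (scores, mean(scores))
print(round(result[1], 4))  # 0.6744
```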