class watex.cases.processing.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#

Bases: object

Base preprocessing class.

Gives a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.

Parameters:
  • tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

  • drop_features (list or str, Optional) – Features that are useless for prediction, given as a column name or a list of column names to drop.

  • random_state (int, default is 42) – The state of data shuffling. The default is 42.

  • default_estimator (callable, F or scikit-learn estimator) –

    The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:

    - ‘dtc’: DecisionTreeClassifier
    - ‘svc’: Support Vector Classifier
    - ‘sdg’: SGDClassifier
    - ‘knn’: KNeighborsClassifier
    - ‘rdf’: RandomForestClassifier
    - ‘ada’: AdaBoostClassifier
    - ‘vtc’: VotingClassifier
    - ‘bag’: BaggingClassifier
    - ‘stc’: StackingClassifier

    If no estimator is given, the default is svm or svc.

  • test_size (float,) – The size of the test set as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.

  • verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
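
The default_estimator prefixes listed above can be pictured as a lookup table from a short key to a configured scikit-learn estimator. The sketch below is only an illustration: `DEFAULT_ESTIMATORS` and `pick_estimator` are hypothetical names, the hyperparameters are placeholders rather than the values cached by watex, and the ‘vtc’ and ‘stc’ entries are left out because they need a list of base estimators.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical prefix table; hyperparameters are illustrative only,
# not the defaults cached by watex.
DEFAULT_ESTIMATORS = {
    "dtc": DecisionTreeClassifier(random_state=42),
    "svc": SVC(random_state=42),
    "sdg": SGDClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "rdf": RandomForestClassifier(random_state=42),
    "ada": AdaBoostClassifier(random_state=42),
    "bag": BaggingClassifier(random_state=42),
}


def pick_estimator(prefix=None):
    """Return the estimator registered under `prefix`, or the SVC fallback."""
    if prefix is None:
        return DEFAULT_ESTIMATORS["svc"]
    try:
        return DEFAULT_ESTIMATORS[prefix.lower()]
    except KeyError:
        raise ValueError(f"Unknown estimator prefix: {prefix!r}") from None


print(type(pick_estimator()).__name__)
print(type(pick_estimator("rdf")).__name__)
```

The actual table used by the class is the one returned by `_getdestimators()`.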

X#

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type:

Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)

y#

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like of shape (M, ), \(M=m-samples\)

Xt#

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type:

Ndarray ( M x N matrix where M=m-samples, & N=n-features)

yt#

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like, shape (M, ) M=m-samples,

data#

Path-like object or Dataframe. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.

Type:

str, filepath_or_buffer or pandas.core.DataFrame
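
As a rough illustration of the file-like case described above, the sketch below reads a small in-memory table with `pandas.read_csv`. It assumes CSV content and made-up values; the reader that watex calls internally may differ.

```python
from io import StringIO

import pandas as pd

# A file-like object: anything exposing a read() method.
buffer = StringIO("power,magnitude,flow\n10.0,3,0\n25.0,7,2\n")

df = pd.read_csv(buffer)
print(df.shape)  # (2, 3)
```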

pipe_#

The pipeline can be built as your own pipeline with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.

Type:

Callable, preprocessor object from sklearn.pipeline

estimator#

Callable estimator method to fit the model:

estimators= SGDClassifier(random_state=13)

Type:

Callable, F or sklearn.metaestimator

model#

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type:

callable, always as a function,

cat_features_#

List of categorical features. If not given, it is found automatically.

Type:

list or str, Optional

num_features_#

List of numerical features. If not given, it is found automatically.

Type:

list of str, Optional
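
One plausible way to find the two feature lists automatically is to split the columns by dtype. The sketch below uses `DataFrame.select_dtypes` on a toy frame; the column names are borrowed from the examples on this page, the values are made up, and the detection rule actually applied by watex is an assumption.

```python
import pandas as pd

# Toy frame with two numeric columns and one categorical column.
df = pd.DataFrame(
    {
        "power": [10.0, 25.0, 40.0],
        "magnitude": [3, 7, 5],
        "geol": ["granite", "schist", "granite"],
    }
)

# Numeric columns play the role of num_features_, the rest of cat_features_.
num_features = df.select_dtypes(include="number").columns.tolist()
cat_features = df.select_dtypes(exclude="number").columns.tolist()

print(num_features)  # ['power', 'magnitude']
print(cat_features)  # ['geol']
```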

model#

Use the predefined pipelines, i.e. a Pipeline that you can build on your own with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.

Type:

Callable, {preprocessor + estimator },

Examples

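>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> pc.fit(data=data)
>>> estimators=dict(
...    RandomForestClassifier=RandomForestClassifier(
...        n_estimators=200, random_state=0),
...    SGDClassifier=SGDClassifier(random_state=0))
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric =True)
>>> pc.base_score_
... 0.72586369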
baseEvaluation(model=None, eval_metric=False, **kws)[source]#

Evaluate a dummy baseline model built from the preprocessing pipeline, or a model supplied by providing an estimator.

Parameters:
  • model (Callable, {preprocessor + estimator},) – A model is a scikit-learn estimator or a composite model built from a Pipeline. If model is None, use the default model from the default preprocessor and estimator. model can also be a dict of multiple estimators, in which case the evaluation of each estimator is stored in a dictionary keyed by the estimator name.

  • eval_metric (bool,) – If set to True, the confusion matrix and classification report scores are evaluated, assuming the supervised learning is a classification problem. Default is False.

  • scorer (str, Callable,) –

    A scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:

    • ‘classification_report’ -> for classification_report,

    • ‘precision_recall’ -> for precision_recall_curve,

    • ‘confusion_matrix’ -> for confusion_matrix,

    • ‘precision’ -> for precision_score,

    • ‘accuracy’ -> for accuracy_score,

    • ‘mse’ -> for mean_squared_error,

    • ‘recall’ -> for recall_score,

    • ‘auc’ -> for roc_auc_score,

    • ‘roc’ -> for roc_curve,

    • ‘f1’ -> for f1_score.

    Any other string prefix raises an error.

  • kws (dict,) – Additional keyword arguments passed to the scikit-learn metric function.
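
The scorer names above correspond to functions in `sklearn.metrics`. As a minimal sketch of how a string could be resolved to a metric, the hypothetical helper `resolve_scorer` below does an exact-name lookup (a simplification of the prefix matching described above) and passes callables through unchanged. It is not the watex implementation.

```python
from sklearn import metrics

# Scorer names listed above, mapped to the scikit-learn functions they name.
SCORERS = {
    "classification_report": metrics.classification_report,
    "precision_recall": metrics.precision_recall_curve,
    "confusion_matrix": metrics.confusion_matrix,
    "precision": metrics.precision_score,
    "accuracy": metrics.accuracy_score,
    "mse": metrics.mean_squared_error,
    "recall": metrics.recall_score,
    "auc": metrics.roc_auc_score,
    "roc": metrics.roc_curve,
    "f1": metrics.f1_score,
}


def resolve_scorer(scorer):
    """Return a metric function from a name, or pass a callable through."""
    if callable(scorer):
        return scorer
    try:
        return SCORERS[str(scorer).lower()]
    except KeyError:
        raise ValueError(f"Unknown scorer: {scorer!r}") from None


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(resolve_scorer("accuracy")(y_true, y_pred))  # 0.75 (3 of 4 correct)
```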

Returns:

`self.base_score_`

Return type:

base score after predicting

Notes

If no estimator is given, the default estimator is svm. Otherwise, provide the prefix to select the convenience estimator from the default dict default_estimator. Get the default dict by calling <instance>._getdestimators().

Examples

>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)

(1) -> default estimator

>>> pc.baseEvaluation (eval_metric=True)
... 0.47126436781609193

(2) -> multiple estimators

>>> from watex.exlib.sklearn import RandomForestClassifier , SGDClassifier, SimpleImputer
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model={
... 'RandomForestClassifier':RandomForestClassifier(
...   n_estimators=200, random_state=0),
... 'SDGC':SGDClassifier(random_state=0)}, eval_metric =True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3,
        1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2,
        1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2,
        2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1],
       dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989,
 'SGDClassifier': 0.14942528735632185}
property data#
property features#

Collect the list of features

fit(X=None, y=None, **fit_params)[source]#

Read the whole dataset, encode the categorical features and populate class attributes.

If X and y are provided, they are considered as the feature set and the target respectively. They are then split into a training set and a test set.

Parameters:
  • X (N-d array, shape (M, N)) – The feature array composed of M samples and N columns. The feature set excludes the target y.

  • y (array-like, shape (M, )) – The target, composed of M examples in supervised learning.

  • data (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing M samples and N features, including the target y. If data is given, it is not necessary to provide X and y. When the target name tname is specified, the target is removed from the data.

  • split_X_y (bool, default {'True'}) – Split the dataset into a training set {X, y} and a test set {Xt, yt}. Otherwise X and y are considered as the training set.

Returns:

``self``

Return type:

Preprocessing instance for easy method chaining.

Examples

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing (drop_features = ['lwi', 'num', 'name']
...                     ).fit(data =data )
>>> len(pc.X ),  len(pc.y), len(pc.Xt ),  len(pc.yt)
... (344, 344, 87, 87) # trainset (X,y) and testset (Xt, yt)
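
The 344/87 sizes reported above are what scikit-learn's `train_test_split` yields for 431 samples with test_size=0.2, since the test share is rounded up (ceil(0.2 × 431) = 87). The sketch below reproduces the sizes on dummy data; whether `fit` delegates to `train_test_split` internally is an assumption, not something this page states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 431  # 344 + 87, the sizes reported in the example above
X = np.arange(n_samples * 2).reshape(n_samples, 2)  # dummy features
y = np.arange(n_samples) % 4  # four dummy classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(y_train), len(X_test), len(y_test))  # 344 344 87 87
```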
get_default_pipe()[source]#

Make a default pipe to preprocess the data.

Create a preprocessor by assembling multiple transformers. The default pipeline is not exhaustive so to have full control of the data, it is recommended to provide a strong preprocessor for the data processing at once.

The method returns self.pipe_ as a callable preprocessor pipeline from the sklearn.pipeline.Pipeline object. The default transformers are composed of:

  • sklearn.pipeline.make_pipeline() for pipeline creation.

  • sklearn.preprocessing.OneHotEncoder() for categorical features encoding.

  • sklearn.preprocessing.PolynomialFeatures() for features engineering.

  • sklearn.preprocessing.RobustScaler() for data scaling.

  • sklearn.compose.make_column_transformer() for data transformation.

  • sklearn.compose.make_column_selector() for features composing.
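
To show how the building blocks listed above fit together, here is a minimal sketch that assembles them into a column-wise preprocessor. The toy data, the polynomial degree and the branch layout are assumptions chosen for illustration; the pipe returned by get_default_pipe may be configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

# Toy data: two numeric columns and one categorical column (values made up).
df = pd.DataFrame(
    {
        "power": [10.0, 25.0, 40.0, 55.0],
        "magnitude": [3.0, 7.0, 5.0, 9.0],
        "geol": ["granite", "schist", "granite", "schist"],
    }
)

# Numeric branch: polynomial feature engineering followed by robust scaling.
numeric_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), RobustScaler()
)
# Categorical branch: one-hot encoding.
categorical_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Route each column to a branch according to its dtype.
preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include=np.number)),
    (categorical_pipe, make_column_selector(dtype_exclude=np.number)),
)

transformed = preprocessor.fit_transform(df)
# 5 polynomial columns (2 linear + 3 quadratic) plus 2 one-hot columns.
print(transformed.shape)  # (4, 7)
```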

property inspect#

Inspect data and trigger plot after checking the data entry. Raises NotFittedError if self is not fitted yet.

makeModel(pipe=None, estimator=None)[source]#

Assemble pipes and estimator to create the model.

The model is composed of the transformers and the estimator. If either is set to None, the default pipe and estimator are used, which might not be the ones expected. Providing both a pipe and an estimator is therefore suggested.

Parameters:
  • pipe (Callable, pipeline or preprocessor) – Callable pipeline. It can be your own pipeline built with different transformers. Refer to sklearn.pipeline.Pipeline for further details. Call get_default_pipe to get the default pipe.

  • estimator (Callable, F or {sklearn estimator}) –

    Callable estimator method to fit the model:

       estimators= SGDClassifier(random_state=13)
    
    Some pre-estimators can be fetched by providing the prefix as
    a key of the estimator default dict. For instance to fetch the
    `DecisionTreeClassifier` estimators::
    
        >>> from watex.cases.processing import Preprocessing
        >>> Preprocessing._getdestimators()['dtc']
        ... DecisionTreeClassifier(max_depth=100, random_state=42)
    

Returns:

`model_`

Return type:

Callable, {preprocessor + estimator }

Examples

(1) -> Get the default preprocessor by merely calling:

>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)
>>> pc.makeModel() # use default model and preprocessor
>>> pc.model_

(2) -> Or build your own preprocessor object using the example below:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
...     steps=[("imputer", SimpleImputer(strategy="median")),
...            ("scaler", StandardScaler())]
...     )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         ("num", numeric_transformer, numeric_features),
...         ("cat", categorical_transformer, categorical_features),
...     ])
>>> pc.makeModel (pipe = preprocessor,
...               estimator =  LogisticRegression())
>>> pc.model_
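
In plain scikit-learn terms, a "preprocessor + estimator" model is a pipeline whose last step is the estimator. The sketch below shows that composition on synthetic data; it illustrates the idea only and is not the code makeModel runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Preprocessor first, estimator last: fitting the pipeline fits both.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

print(model.score(X, y))  # training accuracy, between 0 and 1
```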
class watex.cases.processing.Processing(pipeline=None, estimator=None, **kws)[source]#

Bases: Preprocessing

Processing class for managing baseline model evaluation and learning.

Manages the validation curves after fiddling a little with an estimator's hyperparameters.

Processing is useful before the modeling step. To process data, a default implementation is given for building the data preprocessor. It consists of creating a model pipeline using different transformers. If no pipeline is set and auto is set to ‘True’, a default pipeline is created through the preprocessor to run the base model evaluation. Indeed, a preprocessor is a set of transformers + estimators.

Parameters:
  • auto (bool, default is {'False'}) – Trigger the composite estimator. If True, a composite preprocessor is built and used for base model evaluation. Default is False.

  • pipeline (Callable, F or dict of callable F) – Preprocessing steps encapsulated. If not supplied, a default pipe is used when auto is set to True.

  • estimator (Callable,) – An object which manages the estimation and decoding of a model. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

  • tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

  • drop_features (list or str, Optional) – Features that are useless for prediction, given as a column name or a list of column names to drop.

  • random_state (int, default is 42) – The state of data shuffling. The default is 42.

  • default_estimator (callable, F or scikit-learn estimator) –

    The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:

    - ‘dtc’: DecisionTreeClassifier
    - ‘svc’: Support Vector Classifier
    - ‘sdg’: SGDClassifier
    - ‘knn’: KNeighborsClassifier
    - ‘rdf’: RandomForestClassifier
    - ‘ada’: AdaBoostClassifier
    - ‘vtc’: VotingClassifier
    - ‘bag’: BaggingClassifier
    - ‘stc’: StackingClassifier

    If no estimator is given, the default is svm or svc.

  • test_size (float,) – The size of the test set as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.

  • verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.

X#

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type:

Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)

y#

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like of shape (M, ), \(M=m-samples\)

Xt#

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type:

Ndarray ( M x N matrix where M=m-samples, & N=n-features)

yt#

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like, shape (M, ) M=m-samples,

data#

Path-like object or Dataframe. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.

Type:

str, filepath_or_buffer or pandas.core.DataFrame

pipe_#

The pipeline can be built as your own pipeline with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.

Type:

Callable, preprocessor object from sklearn.pipeline

estimator#

Callable estimator method to fit the model:

estimators= SGDClassifier(random_state=13)

Type:

Callable, F or sklearn.metaestimator

model#

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type:

callable, always as a function,

cat_features_#

List of categorical features. If not given, it is found automatically.

Type:

list or str, Optional

num_features_#

List of numerical features. If not given, it is found automatically.

Type:

list of str, Optional

model#

Use the predefined pipelines, i.e. a Pipeline that you can build on your own with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.

Type:

Callable, {preprocessor + estimator },

model_score_#

Model test score. Observe the test score of your model, using your composite estimator for enhancement.

Type:

float/dict

model_prediction_#

Observe the test prediction of your model, as well as the composite estimator enhancement.

Type:

array_like

preprocessor_#

Compose pipelines and estimators for default model scoring.

Type:

Callable , F

Examples

>>> import numpy as np
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing  import Processing
>>> from watex.exlib.sklearn import (StandardScaler,RandomForestClassifier,
...                                  make_column_selector, PolynomialFeatures,
...                                  SelectKBest, f_classif)
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> my_own_pipeline= {'num_column_selector_':
...                       make_column_selector(dtype_include=np.number),
...                'cat_column_selector_':
...                    make_column_selector(dtype_exclude=np.number),
...                'features_engineering_':
...                    PolynomialFeatures(3,include_bias=True),
...                'selectors_': SelectKBest(f_classif, k=4),
...               'encodages_': StandardScaler()
...                 }
>>> my_estimator={
...    'RandomForestClassifier':RandomForestClassifier(
...    n_estimators=200, random_state=0)
...    }
>>> processObj= Processing (tname = 'flow', drop_features =['lwi', 'name', 'num'],
...                         pipeline= my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data )
>>> processObj.baseEvaluation (eval_metric=True )
... 0.4942528735632184 # score is an ensemble score for both models
property auto#

Trigger the composite pipeline building and create a composite default model estimator CE-SVC.

property estimator#

Get your estimator or the existing default estimator.

get_validation_curve(val_params=None, switch_plot='off', preprocess_step=False, train_pkws=None, val_pkws=None, **kws)[source]#

Compute the validation score and plot the validation curve if the switch_plot argument of the decorator is turned on.

If the validation keyword arguments val_params do not contain a param_range key, the default param_range is the one of the decorator.

Parameters:
  • val_params

    validation_curve keyword arguments. If None, the default is:

    val_params = {"param_name":'C',
                 "param_range": np.arange(1,210,10),
                 "cv":4}
    

  • switch_plot (str, default ='off') – Set to 'on' to visualize the validation plot.

  • preprocess_step (bool, default=False) – Trigger the default step of preprocessing.

  • train_pkws (dict,) – keyword arguments passed to matplotlib.line/scatter plots for the training curve.

  • val_pkws (dict,) – keyword arguments passed to matplotlib.line/scatter plots for the validation curve.

Returns:

  • `train_score` (float or dict of training set scores)

  • `val_score` (float or dict of validation scores)

  • `switch` (turns the validation plot on or off)

  • `kk` (the validation param_range for the plot)
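
For orientation, the default val_params shown above map directly onto scikit-learn's `validation_curve`. The sketch below runs it with an SVC on synthetic data; the dataset, the choice of SVC and the averaging over folds are assumptions made for illustration, not a description of what get_validation_curve does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=120, n_features=6, random_state=42)

val_params = {
    "param_name": "C",
    "param_range": np.arange(1, 210, 10),  # 21 values: 1, 11, ..., 201
    "cv": 4,
}

train_scores, val_scores = validation_curve(SVC(), X, y, **val_params)

# One row per value of C, one column per cross-validation fold.
print(train_scores.shape, val_scores.shape)  # (21, 4) (21, 4)

# Average over folds to get one training and one validation score per C.
train_score = train_scores.mean(axis=1)
val_score = val_scores.mean(axis=1)
```

Plotting `train_score` and `val_score` against `param_range` gives the validation curve.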

Examples

>>> from watex.cases.processing  import Processing
>>> from watex.datasets import fetch_data
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> processObj= Processing (tname = 'flow',
...                         drop_features =['lwi', 'name', 'num'])
>>> processObj.fit(data=data )
>>> processObj.get_validation_curve(
...     switch_plot='on', preprocess_step=True)
property model#

Concatenate preprocessor and estimator to var

property model_prediction#

Get the model prediction once the composite estimator is designed

property model_score#

Get the composite estimator score

property preprocessor#

Preprocessor for composite_estimator design

property processing_model#

Get the default composite model

quick_estimation(estimator_name=None, default_estimator=False)[source]#

Quickly run the model without any processing. If no estimator is provided, the SVC estimator is used.

Parameters:
  • estimators – Callable estimator. If None, an SVC is used for a quick prediction estimate.

  • random_state – The state of data shuffling. The default is 7.

Example:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from watex.cases.processing import Processing
>>> processObj = Processing(
...     data = 'data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator_name=DecisionTreeClassifier(
...     max_depth=100, random_state=13))
>>> processObj.model_score
>>> processObj.model_prediction