- class watex.cases.processing.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#
Bases: object
Base preprocessing class.
Gives a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.
- Parameters:
tname (str) – Target name or label. In supervised learning, the target name is the reference name of the y (label) variable.
drop_features (list or str, optional) – Features that are useless for prediction, or column names to drop.
random_state (int, default is 42) – The state of data shuffling.
default_estimator (callable, F or scikit-learn estimator) – The default estimator name for predicting the tname value. Predefined default estimators, with their parameters, are kept in cache for quick preprocessing:
- 'dtc': DecisionTreeClassifier
- 'svc': Support Vector Classifier
- 'sdg': SGDClassifier
- 'knn': KNeighborsClassifier
- 'rdf': RandomForestClassifier
- 'ada': AdaBoostClassifier
- 'vtc': VotingClassifier
- 'bag': BaggingClassifier
- 'stc': StackingClassifier
If no estimator is given, the default is svm or svc.
test_size (float) – The test set size as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.
verbose (int, default is 0) – Controls the level of verbosity. Higher values lead to more messages.
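The prefix lookup described for default_estimator can be pictured as a small registry. The sketch below is illustrative only: `DEFAULT_ESTIMATORS` and `resolve_estimator` are hypothetical names, not part of watex, and the registry maps prefixes to class names rather than to configured estimator objects.

```python
# Hypothetical registry mirroring the prefixes listed above
# (an illustration, not the watex implementation).
DEFAULT_ESTIMATORS = {
    "dtc": "DecisionTreeClassifier",
    "svc": "SVC",
    "sdg": "SGDClassifier",
    "knn": "KNeighborsClassifier",
    "rdf": "RandomForestClassifier",
    "ada": "AdaBoostClassifier",
    "vtc": "VotingClassifier",
    "bag": "BaggingClassifier",
    "stc": "StackingClassifier",
}


def resolve_estimator(prefix=None):
    """Return the class name registered for a prefix; fall back to 'svc' when None."""
    key = "svc" if prefix is None else str(prefix).strip().lower()
    try:
        return DEFAULT_ESTIMATORS[key]
    except KeyError:
        valid = ", ".join(sorted(DEFAULT_ESTIMATORS))
        raise ValueError(f"Unknown estimator prefix {prefix!r}; expected one of: {valid}") from None


default_name = resolve_estimator()        # no prefix given, so the SVC entry is used
forest_name = resolve_estimator("RDF")    # lookup is case-insensitive in this sketch
```

Failing loudly on an unknown prefix keeps a typo such as 'sgd' (the documented key is 'sdg') from silently selecting the wrong baseline.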
- X#
Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Type:
ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) the number of features
- y#
train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M,), where \(M\) is the number of samples
- Xt#
Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- Type:
ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) the number of features
- yt#
test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M,), where \(M\) is the number of samples
- data#
Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
- Type:
str, filepath_or_buffer or pandas.DataFrame
- pipe_#
A pipeline can be built from your own transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.
- Type:
Callable, preprocessor object from sklearn.pipeline
- estimator#
Callable estimator used to fit the model, e.g.:
estimators = SGDClassifier(random_state=13)
- Type:
Callable, F or scikit-learn meta-estimator
- model#
A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
- Type:
callable, always as a function
- cat_features_#
List of categorical features. If not given, it is found automatically.
- Type:
list or str, optional
- num_features_#
List of numerical features. If not given, it is found automatically.
- Type:
list of str, optional
- model#
Use the predefined pipelines, or a Pipeline you build yourself with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.
- Type:
Callable, {preprocessor + estimator}
Examples
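>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num']).fit(data=data)
>>> estimators = dict(
...     RandomForestClassifier=RandomForestClassifier(
...         n_estimators=200, random_state=0))
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt)  # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric=True)
>>> pc.base_score_
0.72586369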
- baseEvaluation(model=None, eval_metric=False, **kws)[source]#
Dummy baseline model built from the preprocessing pipeline, turned into a model by providing an estimator.
- Parameters:
model (Callable, {preprocessor + estimator}) – A scikit-learn estimator or a composite model built from a Pipeline. If model is None, use the default model from the default preprocessor and estimator. model can also be a dict of multiple estimators; in that case the evaluation results are stored in a dictionary keyed by estimator name.
eval_metric (bool) – If set to True, the confusion matrix and classification report scores are evaluated, assuming that the supervised learning task is a classification problem. Default is False.
scorer (str, Callable) – A scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:
- "classification_report" -> for classification_report,
- "precision_recall" -> for precision_recall_curve,
- "confusion_matrix" -> for confusion_matrix,
- "precision" -> for precision_score,
- "accuracy" -> for accuracy_score,
- "mse" -> for mean_squared_error,
- "recall" -> for recall_score,
- "auc" -> for roc_auc_score,
- "roc" -> for roc_curve,
- "f1" -> for f1_score.
Any other string prefix raises an error.
kws (dict) – Additional keyword arguments passed to the scikit-learn metric function.
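The scorer prefix convention above amounts to a name-to-function dispatch. As a rough sketch covering only four of the listed prefixes, the block below implements the metrics in plain Python for binary labels. `SCORERS` and `get_scorer` are hypothetical names chosen for illustration; they are not the watex API, which relies on the scikit-learn metric functions.

```python
def _counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _ = _counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true, y_pred):
    tp, _, fn = _counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical prefix table; only a subset of the documented prefixes.
SCORERS = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def get_scorer(scorer):
    """Accept a callable as-is, or resolve a string prefix; reject anything else."""
    if callable(scorer):
        return scorer
    try:
        return SCORERS[scorer]
    except KeyError:
        raise ValueError(f"Unknown scorer prefix: {scorer!r}") from None


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
scores = {name: get_scorer(name)(y_true, y_pred) for name in SCORERS}
```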
- Returns:
`self.base_score_`
- Return type:
base score after predicting
Notes
If no estimator is given (None), the default estimator is svm. Otherwise, provide the prefix to select the convenience estimator from the default dict default_estimator. Get the default dict by calling <instance>._getdestimators().
Examples
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
(1) -> default estimator
>>> pc.baseEvaluation(eval_metric=True)
0.47126436781609193
(2) -> multiple estimators
>>> from watex.exlib.sklearn import RandomForestClassifier, SGDClassifier, SimpleImputer
>>> estimators = {
...     'RandomForestClassifier': RandomForestClassifier(n_estimators=200, random_state=0),
...     'SGDClassifier': SGDClassifier(random_state=0)}
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt)  # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric=True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2, 2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1], dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989, 'SGDClassifier': 0.14942528735632185}
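When several estimators are passed as a dict, predictions and scores come back keyed by estimator name. The following sketch imitates that bookkeeping with two toy classifiers written in plain Python. `MajorityClassifier`, `NearestNeighbor1D` and `base_evaluation` are hypothetical stand-ins for illustration, not watex or scikit-learn objects.

```python
from collections import Counter


class MajorityClassifier:
    """Always predicts the most frequent training label (a constant baseline)."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighbor1D:
    """Predicts the label of the closest training point on a single feature."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def closest(x):
            return min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
        return [self.y_[closest(x)] for x in X]


def base_evaluation(models, X, y, Xt, yt):
    """Fit each model on (X, y) and score its accuracy on (Xt, yt), keyed by name."""
    if not isinstance(models, dict):
        models = {type(models).__name__: models}
    ypred, scores = {}, {}
    for name, model in models.items():
        ypred[name] = model.fit(X, y).predict(Xt)
        scores[name] = sum(p == t for p, t in zip(ypred[name], yt)) / len(yt)
    return ypred, scores


X, y = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 1.2], [0, 0, 0, 1, 1, 1, 1]
Xt, yt = [0.15, 0.25, 0.95, 1.05], [0, 0, 1, 1]
ypred, base_score = base_evaluation(
    {"majority": MajorityClassifier(), "nearest": NearestNeighbor1D()}, X, y, Xt, yt)
```

A constant predictor such as `MajorityClassifier` scores only as well as the share of its single label in the test set, which is the pattern visible in the SGDClassifier output above, where every prediction is 3.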
- property data#
- property features#
Collect the list of features
- fit(X=None, y=None, **fit_params)[source]#
Read the whole dataset, encode the categorical features and populate class attributes.
If X and y are provided, they are considered the feature set and the target respectively. They are then split into a training set and a test set.
- Parameters:
X (ndarray of shape (M, N)) – Feature array composed of M samples and N columns. The feature set excludes the target y.
y (array-like of shape (M,)) – The target, composed of M examples in supervised learning.
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features, including the target y. If data is given, it is not necessary to provide X and y. The target named by tname is removed from the data.
split_X_y (bool, default is True) – Split the dataset into a training set {X, y} and a test set {Xt, yt}. Otherwise X and y are considered the training set.
- Returns:
``self``
- Return type:
Preprocessing instance for easy method chaining.
Examples
>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(drop_features=['lwi', 'num', 'name']).fit(data=data)
>>> len(pc.X), len(pc.y), len(pc.Xt), len(pc.yt)
(344, 344, 87, 87)  # trainset (X, y) and testset (Xt, yt)
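The 344/87 sizes in the example are consistent with a 20% test split of 431 samples. The arithmetic can be checked with the short sketch below, which assumes the split follows the rounding convention scikit-learn's train_test_split uses for a float test_size (test count rounded up, remainder used for training). Whether fit delegates to that function is an assumption here, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 344 training rows + 87 test rows = 431 samples in total
n_train, n_test = split_sizes(344 + 87, test_size=0.2)
```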
- get_default_pipe()[source]#
Make a default pipe to preprocess the data.
Create a preprocessor by assembling multiple transformers. The default pipeline is not exhaustive, so to have full control of the data, it is recommended to provide your own, more thorough preprocessor.
The method returns self.pipe_ as a callable preprocessor pipeline, a sklearn.pipeline.Pipeline object. The default transformers are composed of:
- sklearn.pipeline.make_pipeline() for pipeline creation.
- sklearn.preprocessing.OneHotEncoder() for categorical features encoding.
- sklearn.preprocessing.PolynomialFeatures() for features engineering.
- sklearn.preprocessing.RobustScaler() for data scaling.
- sklearn.compose.make_column_transformer() for data transformation.
- sklearn.compose.make_column_selector() for features composing.
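One plausible way to combine the building blocks listed above is sketched here with scikit-learn and pandas. The toy columns, the polynomial degree and the order of the steps are assumptions made for illustration. The pipe actually returned by get_default_pipe may be assembled differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

# Toy frame; the column names and values are illustrative only.
frame = pd.DataFrame({
    "power": [10.0, 20.0, 35.0, 50.0],
    "magnitude": [1.0, 3.0, 2.0, 5.0],
    "shape": ["V", "W", "V", "U"],
})

# Numeric branch: feature engineering followed by outlier-robust scaling.
numeric_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), RobustScaler())
# Categorical branch: one-hot encoding that tolerates unseen categories.
categorical_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Route columns to each branch by dtype rather than by name.
preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include=np.number)),
    (categorical_pipe, make_column_selector(dtype_exclude=np.number)),
)

transformed = preprocessor.fit_transform(frame)
```

Selecting columns by dtype means the same preprocessor can be reused when features are dropped or added, since no column names are hard-coded.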
- property inspect#
Inspect data and trigger plot after checking the data entry. Raises NotFittedError if self is not fitted yet.
- makeModel(pipe=None, estimator=None)[source]#
Assemble pipes and estimator to create the model.
The model is composed of the transformers and the estimator. If either is set to None, the default pipe and estimator are used, which might not be the ones expected. Providing both a pipe and an estimator is therefore recommended.
- Parameters:
pipe (Callable, pipeline or preprocessor) – Callable pipeline. It can be your own pipeline with different transformers. Refer to sklearn.pipeline.Pipeline for further details. Call get_default_pipe to get the default pipe.
estimator (Callable, F or {sklearn estimator}) – Callable estimator used to fit the model, e.g.:
estimators = SGDClassifier(random_state=13)
Some pre-configured estimators can be fetched by providing the prefix as a key of the default estimator dict. For instance, to fetch the DecisionTreeClassifier estimator:
>>> from watex.cases.processing import Preprocessing
>>> Preprocessing._getdestimators()['dtc']
DecisionTreeClassifier(max_depth=100, random_state=42)
- Returns:
`model_`
- Return type:
Callable, {preprocessor + estimator }
Examples
We can get the default preprocessor by merely calling:
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> pc.makeModel()  # use default model and preprocessor
>>> pc.model_
(2) -> Or build your own preprocessor object using the example below:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
...     steps=[("imputer", SimpleImputer(strategy="median")),
...            ("scaler", StandardScaler())])
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         ("num", numeric_transformer, numeric_features),
...         ("cat", categorical_transformer, categorical_features),
...     ])
>>> pc.makeModel(pipe=preprocessor, estimator=LogisticRegression())
>>> pc.model_
- class watex.cases.processing.Processing(pipeline=None, estimator=None, **kws)[source]#
Bases: Preprocessing
Processing class for managing baseline model evaluation and learning.
Manages the validation curves after fiddling a little with the estimator hyperparameters.
Processing is useful before the modeling step. To process data, a default implementation is given for building the data preprocessor. It consists of creating a model pipeline using different transformers. If no pipeline is set and auto is set to True, a default pipeline is created through the preprocessor to run the base model evaluation. Indeed, a preprocessor is a set of transformers + estimators.
- Parameters:
auto (bool, default is False) – Trigger the composite estimator. If True, a composite preprocessor is built and used for base model evaluation.
pipeline (Callable, F or dict of callable F) – Encapsulated preprocessing steps. If not supplied, a default pipe is used when auto is set to True.
estimator (Callable) – An object which manages the estimation and decoding of a model. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
tname (str) – Target name or label. In supervised learning, the target name is the reference name of the y (label) variable.
drop_features (list or str, optional) – Features that are useless for prediction, or column names to drop.
random_state (int, default is 42) – The state of data shuffling.
default_estimator (callable, F or scikit-learn estimator) – The default estimator name for predicting the tname value. Predefined default estimators, with their parameters, are kept in cache for quick preprocessing:
- 'dtc': DecisionTreeClassifier
- 'svc': Support Vector Classifier
- 'sdg': SGDClassifier
- 'knn': KNeighborsClassifier
- 'rdf': RandomForestClassifier
- 'ada': AdaBoostClassifier
- 'vtc': VotingClassifier
- 'bag': BaggingClassifier
- 'stc': StackingClassifier
If no estimator is given, the default is svm or svc.
test_size (float) – The test set size as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.
verbose (int, default is 0) – Controls the level of verbosity. Higher values lead to more messages.
- X#
Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Type:
ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) the number of features
- y#
train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M,), where \(M\) is the number of samples
- Xt#
Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- Type:
ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) the number of features
- yt#
test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M,), where \(M\) is the number of samples
- data#
Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
- Type:
str, filepath_or_buffer or pandas.DataFrame
- pipe_#
A pipeline can be built from your own transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.
- Type:
Callable, preprocessor object from sklearn.pipeline
- estimator#
Callable estimator used to fit the model, e.g.:
estimators = SGDClassifier(random_state=13)
- Type:
Callable, F or scikit-learn meta-estimator
- model#
A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
- Type:
callable, always as a function
- cat_features_#
List of categorical features. If not given, it is found automatically.
- Type:
list or str, optional
- num_features_#
List of numerical features. If not given, it is found automatically.
- Type:
list of str, optional
- model#
Use the predefined pipelines, or a Pipeline you build yourself with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.
- Type:
Callable, {preprocessor + estimator}
- model_score_#
Model test score. Observe your test model score using your composite estimator for enhancement.
- Type:
float/dict
- model_prediction_#
Observe your test model prediction as well as the composite estimator enhancement.
- Type:
array_like
- preprocessor_#
Composes pipelines and estimators for default model scoring.
- Type:
Callable, F
Examples
>>> import numpy as np
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Processing
>>> from watex.exlib.sklearn import (
...     StandardScaler, RandomForestClassifier, make_column_selector,
...     PolynomialFeatures, SelectKBest, f_classif)
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> my_own_pipeline = {
...     'num_column_selector_': make_column_selector(dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(3, include_bias=True),
...     'selectors_': SelectKBest(f_classif, k=4),
...     'encodages_': StandardScaler()
... }
>>> my_estimator = {
...     'RandomForestClassifier': RandomForestClassifier(
...         n_estimators=200, random_state=0)
... }
>>> processObj = Processing(tname='flow', drop_features=['lwi', 'name', 'num'],
...                         pipeline=my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data)
>>> processObj.baseEvaluation(eval_metric=True)
0.4942528735632184  # score is an ensemble score for both models
- property auto#
Trigger the composite pipeline building and create a composite default model estimator CE-SVC.
- property estimator#
Get your estimator or the existing default estimator.
- get_validation_curve(val_params=None, switch_plot='off', preprocess_step=False, train_pkws=None, val_pkws=None, **kws)[source]#
Compute the validation score and plot the validation curve if the turn argument of the decorator is switched to on. If the validation keyword arguments val_curve_kws do not contain a param_range key, the default param_range is the one of the decorator.
- Parameters:
val_params (dict) – validation_curve keyword arguments. If None, the default is:
val_params = {"param_name": 'C', "param_range": np.arange(1, 210, 10), "cv": 4}
switch_plot (str, default is 'off') – Visualize the validation plot.
preprocess_step (bool, default=False) – Trigger the default preprocessing step.
train_pkws (dict) – Keyword arguments passed to matplotlib line/scatter plots for the training curve.
val_pkws (dict) – Keyword arguments passed to matplotlib line/scatter plots for the validation curve.
- Returns:
- `train_score` (float|dict of trainset score)
- `val_score` (float|dict of validation score)
- `switch` (Turn on or off the validation_plot.)
- `kk` (the validation param_range for plot.)
Examples
>>> from watex.cases.processing import Processing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> processObj = Processing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> processObj.fit(data=data)
>>> processObj.get_validation_curve(switch_plot='on', preprocess_step=True)
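For orientation, the default val_params documented above can be passed straight to scikit-learn's validation_curve. The sketch below does this on synthetic data with an SVC, since the default param_name is 'C'. The synthetic dataset and the choice of a bare SVC (without the watex preprocessing pipe) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the training set (X, y).
X, y = make_classification(n_samples=120, n_features=6, random_state=42)

# Same defaults as documented for val_params.
val_params = {"param_name": "C", "param_range": np.arange(1, 210, 10), "cv": 4}

# One row per value of C, one column per cross-validation fold.
train_scores, val_scores = validation_curve(SVC(), X, y, **val_params)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
# Value of C with the highest mean validation score.
best_C = val_params["param_range"][np.argmax(mean_val)]
```

Plotting mean_train and mean_val against param_range gives the validation curve: a widening gap between the two as C grows is the usual sign of overfitting.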
- property model#
Concatenate preprocessor and estimator to var
- property model_prediction#
Get the model prediction after the composite estimator is designed.
- property model_score#
Get the composite estimator score
- property preprocessor#
Preprocessor for composite_estimator design
- property processing_model#
Get the default composite model
- quick_estimation(estimator_name=None, default_estimator=False)[source]#
Quickly run the model without any processing. If no estimator is provided, an SVC estimator is used.
- Parameters:
estimators – Callable estimator. If None, an svc is used to quickly estimate the prediction.
random_state – The state of data shuffling. The default is 7.
- Example:
>>> from sklearn.tree import DecisionTreeClassifier
>>> from watex.cases.processing import Processing
>>> processObj = Processing(data='data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator=DecisionTreeClassifier(
...     max_depth=100, random_state=13))
>>> processObj.model_score
>>> processObj.model_prediction