watex.cases.processing.Preprocessing

class watex.cases.processing.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)

Base preprocessing class.

Provides a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.

Parameters
  • tname (str, default is 'flow') – Target name or label. In supervised learning, the target name is the reference name of the y (label) variable.

  • drop_features (list or str, Optional) – Features that are not useful for prediction, given as a list of column names to drop.

  • random_state (int, default is 42) – Seed that controls data shuffling.

  • default_estimator (callable, F or scikit-learn estimator) –

    The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:

    - ‘dtc’: DecisionTreeClassifier
    - ‘svc’: Support Vector Classifier
    - ‘sdg’: SGDClassifier
    - ‘knn’: KNeighborsClassifier
    - ‘rdf’: RandomForestClassifier
    - ‘ada’: AdaBoostClassifier
    - ‘vtc’: VotingClassifier
    - ‘bag’: BaggingClassifier
    - ‘stc’: StackingClassifier

    If estimator is not given, the default is svm or svc.

  • test_size (float, default is 0.2) – Fraction of the dataset held out as the test set. Must be less than 1. The default of 0.2 corresponds to 20% of the dataset.

  • verbose (int, default is 0) – Controls the level of verbosity. Higher values lead to more messages.
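The short names accepted by default_estimator can be thought of as keys in a lookup table. The sketch below is only an illustration built on scikit-learn: `ALIASES` and `resolve_estimator` are hypothetical names, not part of watex, and the cached parameters watex actually uses are not reproduced. ‘vtc’ and ‘stc’ are left out because VotingClassifier and StackingClassifier need an explicit list of base estimators.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical alias table; watex's internal cache may be organised differently.
ALIASES = {
    "dtc": DecisionTreeClassifier,
    "svc": SVC,
    "sdg": SGDClassifier,
    "knn": KNeighborsClassifier,
    "rdf": RandomForestClassifier,
    "ada": AdaBoostClassifier,
    "bag": BaggingClassifier,
}


def resolve_estimator(name="svc", random_state=42):
    """Return a fresh estimator for a short name (hypothetical helper)."""
    try:
        cls = ALIASES[name.lower()]
    except KeyError:
        raise ValueError(f"unknown estimator alias: {name!r}") from None
    # Not every estimator accepts random_state (KNeighborsClassifier does not),
    # so the seed is only forwarded when the parameter exists.
    if "random_state" in cls().get_params():
        return cls(random_state=random_state)
    return cls()


estimator = resolve_estimator("rdf", random_state=42)
print(type(estimator).__name__)  # RandomForestClassifier
```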

X

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type

ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) is the number of features

y

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type

array-like of shape (M,), where \(M\) is the number of samples

Xt

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type

ndarray of shape (M, N), where \(M\) is the number of samples and \(N\) is the number of features

yt

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type

array-like of shape (M,), where \(M\) is the number of samples
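To make the shapes of X, y, Xt and yt concrete, here is a minimal sketch using scikit-learn's train_test_split on synthetic data. It assumes the class performs a comparable split with test_size and random_state; the source does not state which splitter watex uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))   # 50 samples, 4 features
target = rng.integers(0, 2, size=50)  # binary labels

# 20% of the 50 samples go to the test set, the rest to the training set.
X, Xt, y, yt = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(X.shape, y.shape, Xt.shape, yt.shape)
# (40, 4) (40,) (10, 4) (10,)
```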

data

Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.

Type

str, filepath_or_buffer or pandas.DataFrame
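The accepted input types can be illustrated with plain pandas. In the sketch below, `load_table` is a hypothetical helper (not a watex function) that assumes CSV content, and the column names are made up; it only shows how a DataFrame, a path or URL string, and a file-like object can all be handled.

```python
import io

import pandas as pd


def load_table(data):
    """Return a DataFrame from a DataFrame, path, URL, or buffer (hypothetical)."""
    if isinstance(data, pd.DataFrame):
        return data
    # pandas.read_csv accepts str paths, URLs, os.PathLike and file-like objects.
    return pd.read_csv(data)


# A StringIO buffer stands in for a file on disk.
csv_text = "power,magnitude,flow\n90.0,12.5,1\n55.0,8.1,0\n"
df = load_table(io.StringIO(csv_text))

print(df.shape)             # (2, 3)
print(load_table(df) is df)  # True
```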

pipe_

A custom pipeline can be built from different transformers. For base model prediction, the default pipeline can be used instead. Call get_default_pipe to get the transformation list and steps.

Type

Callable, preprocessor object from sklearn.pipeline
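A custom preprocessor of the kind pipe_ expects might look like the following sketch, which combines scikit-learn transformers in a ColumnTransformer. This is an assumption about a reasonable custom pipeline, not the steps returned by get_default_pipe, and the column names are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; the column names are made up for illustration.
frame = pd.DataFrame({
    "power": [90.0, np.nan, 55.0, 70.0],
    "magnitude": [12.0, 8.0, np.nan, 10.0],
    "shape": pd.Series(["V", "W", "V", "K"], dtype="object"),
})

# Numeric columns: fill missing values with the median, then standardise.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("num", numeric, ["power", "magnitude"]),
    # Categorical column: one-hot encode, ignoring unseen categories later on.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["shape"]),
])

transformed = preprocessor.fit_transform(frame)
print(transformed.shape)  # (4, 5): 2 numeric columns + 3 one-hot columns
```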

estimator

Callable estimator method to fit the model:

estimator = SGDClassifier(random_state=13)

Type

Callable, F or sklearn.metaestimator

model

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type

callable, always as a function,
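The "deterministic function" statement can be checked with a small scikit-learn experiment: two estimators built with the same parameters and random_state, fitted on the same data, should end up with the same coefficients. The data here is synthetic and the sketch does not involve watex itself.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
samples = rng.normal(size=(60, 3))
labels = (samples[:, 0] + samples[:, 1] > 0).astype(int)

# Same constructor parameters, same seed, same data: same fitted model.
first = SGDClassifier(random_state=13).fit(samples, labels)
second = SGDClassifier(random_state=13).fit(samples, labels)
print(np.allclose(first.coef_, second.coef_))  # True

# set_params and get_params are inherited from sklearn.base.BaseEstimator.
third = SGDClassifier().set_params(random_state=13)
print(third.get_params()["random_state"])  # 13
```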

cat_features_

List of categorical features. If not given, it is detected automatically.

Type

list or str, Optional

num_features_

List of numerical features. If not given, it is detected automatically.

Type

list of str, Optional
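One common way to detect the two feature lists automatically is to split columns by dtype with pandas. The sketch below assumes that approach; the source does not say how watex performs the detection, and the column names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    "power": [90.0, 55.0],
    "shape": ["V", "W"],
    "magnitude": [12, 8],
})

# Numeric dtypes (int, float) versus everything else, in column order.
num_features = frame.select_dtypes(include="number").columns.tolist()
cat_features = frame.select_dtypes(exclude="number").columns.tolist()

print(num_features)  # ['power', 'magnitude']
print(cat_features)  # ['shape']
```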

model

A predefined pipeline, i.e. a Pipeline that you can build yourself from different composite estimators. If model is None, the default model built from the default preprocessor and estimator is used.

Type

Callable, {preprocessor + estimator },
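A composite model of the form {preprocessor + estimator} can be sketched as a single scikit-learn Pipeline and scored on held-out data to get a baseline. The data is synthetic, the step names are arbitrary, and the resulting score is not comparable to any score reported for watex.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
target = (features[:, 0] > 0).astype(int)
features[::25, 1] = np.nan  # inject a few missing values

X, Xt, y, yt = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Preprocessing steps followed by the estimator, fitted as one object.
model = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("svc", SVC(random_state=42)),
])
model.fit(X, y)

# Mean accuracy on the test set serves as the baseline score.
base_score = model.score(Xt, yt)
print(round(base_score, 3))
```

Because the imputer and scaler sit inside the pipeline, they are fitted on the training set only and then applied unchanged to the test set.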

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> estimators=dict(
...    RandomForestClassifier=RandomForestClassifier(
...        n_estimators=200, random_state=0),
...    )
>>> # pc: a Preprocessing instance that already holds X, y, Xt and yt
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(estimator=estimators, eval_metric =True)
>>> pc.base_score_
... 0.72586369