watex.cases.processing.Preprocessing#
- class watex.cases.processing.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#
Base preprocessing class.
Provides a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.
- Parameters:
tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.
drop_features (list or str, Optional) – List the useless features for predicting or list of column names to drop out.
random_state (int, default is 42) – The state of data shuffling. The default is 42.
default_estimator (callable, F or scikit-learn estimator) – The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:
- ‘dtc’: DecisionTreeClassifier
- ‘svc’: Support Vector Classifier
- ‘sdg’: SGDClassifier
- ‘knn’: KNeighborsClassifier
- ‘rdf’: RandomForestClassifier
- ‘ada’: AdaBoostClassifier
- ‘vtc’: VotingClassifier
- ‘bag’: BaggingClassifier
- ‘stc’: StackingClassifier
If no estimator is given, the default is svm or svc.
test_size (float,) – The test set size. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
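As a rough illustration of how these parameters interact, the sketch below resolves a shorthand name to a scikit-learn class, holds out 20% of the samples, and computes a baseline score. It does not reproduce the watex internals: the `SHORTHANDS` dictionary and the `resolve` helper are hypothetical names, only a subset of the shorthands is covered, and synthetic data stands in for a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry (an assumption, not the watex implementation).
SHORTHANDS = {
    "dtc": DecisionTreeClassifier,
    "svc": SVC,
    "knn": KNeighborsClassifier,
    "rdf": RandomForestClassifier,
    "ada": AdaBoostClassifier,
}


def resolve(name="svc", **params):
    """Return an unfitted estimator for a shorthand name."""
    if name not in SHORTHANDS:
        raise ValueError(f"Unknown estimator shorthand: {name!r}")
    return SHORTHANDS[name](**params)


# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# test_size=0.2 keeps 20% of the samples for testing;
# random_state=42 makes the shuffling reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

estimator = resolve("svc", random_state=42)
base_score = estimator.fit(X_train, y_train).score(X_test, y_test)
print(X_train.shape, X_test.shape, round(base_score, 3))
```

Mapping names to classes rather than instances keeps each call to `resolve` independent, so no fitted state leaks between runs.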
- X#
training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.
X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Type:
Ndarray of shape (M x N), \(M=m-samples\) & \(N=n-features\)
- y#
train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M, ), \(M=m-samples\)
- Xt#
Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- Type:
Ndarray of shape (M x N), \(M=m-samples\) & \(N=n-features\)
- yt#
test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type:
array-like of shape (M, ), \(M=m-samples\)
- data#
Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
- Type:
str, filepath_or_buffer or pandas.DataFrame
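The accepted input kinds can be sketched with plain pandas, which is presumably what reads the file under the hood. The column names `power` and `magnitude` and the CSV content are made up for illustration; only `flow` (the default tname) comes from the signature above.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative CSV content; the feature columns are hypothetical.
csv_text = "flow,power,magnitude\n1,50.0,3.2\n0,80.0,1.1\n2,65.0,2.4\n"

# 1) File-like object: anything exposing a read() method.
df_from_buffer = pd.read_csv(io.StringIO(csv_text))

# 2) Path object: pandas accepts any os.PathLike.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "table.csv"
    csv_path.write_text(csv_text)
    df_from_path = pd.read_csv(csv_path)

# Separate the target column from the features.
y = df_from_path["flow"]
X = df_from_path.drop(columns="flow")
print(X.shape, y.shape)
```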
- pipe_#
The pipeline can be built from your own transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.
- Type:
Callable, preprocessor object from
sklearn.pipeline
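A custom preprocessing pipeline might look like the sketch below. The choice of transformers (median imputation followed by standard scaling) is an assumption for illustration and is not claimed to match what get_default_pipe returns.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two-step numeric preprocessor: fill missing values, then standardize.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_prepared = pipe.fit_transform(X)

# The steps can be inspected by name, in the order they are applied.
step_names = [name for name, _ in pipe.steps]
print(step_names)
```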
- estimator#
Callable estimator method to fit the model:
estimators= SGDClassifier(random_state=13)
- Type:
Callable, F or
sklearn.metaestimator
- model#
A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
- Type:
callable, always as a function,
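The estimator contract described above can be shown with a toy class. `MajorityClassifier` and its `fallback` parameter are hypothetical and exist only to show that get_params and set_params come for free from BaseEstimator once fit is defined.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Toy estimator that always predicts the most frequent training label."""

    def __init__(self, fallback=0):
        # Constructor arguments are stored unchanged so get_params can find them.
        self.fallback = fallback

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.classes_ = labels
        # Learned state lives in attributes with a trailing underscore.
        self.majority_ = labels[np.argmax(counts)] if len(labels) else self.fallback
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
params_before = clf.get_params()   # inherited from BaseEstimator
clf.set_params(fallback=1)         # inherited from BaseEstimator
predictions = clf.predict([[5], [6]])
print(params_before, predictions)
```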
- cat_features_#
List of categorical features. If not given, it should be found automatically.
- Type:
list or str, Optional
- num_features_#
List of numerical features. If not given, it should be found automatically.
- Type:
list of str, Optional
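One common way to find both feature lists automatically is to inspect column dtypes. The sketch below uses pandas `select_dtypes`; whether watex detects them this way is an assumption, and the column names are invented for the example.

```python
import pandas as pd

# Hypothetical frame mixing numerical and categorical columns.
df = pd.DataFrame({
    "power": [50.0, 80.0, 65.0],
    "shape": ["V", "W", "V"],
    "geol": ["granite", "schist", "granite"],
    "ohmS": [1200, 900, 1500],
})

# Numeric dtypes (int, float) go to one list, everything else to the other.
num_features = df.select_dtypes(include="number").columns.tolist()
cat_features = df.select_dtypes(exclude="number").columns.tolist()
print(num_features, cat_features)
```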
- model#
Use the predefined pipelines, or build your own Pipeline with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.
- Type:
Callable, {preprocessor + estimator}
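A composite model of the form {preprocessor + estimator} can be assembled with scikit-learn alone, as in this sketch. The data, the column names other than `flow`, and the step names are illustrative assumptions; this is not the default model that watex builds.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small hypothetical dataset with one missing numeric value.
df = pd.DataFrame({
    "power": [50.0, None, 65.0, 30.0, 90.0, 45.0],
    "shape": ["V", "W", "V", "U", "W", "U"],
    "flow": [1, 0, 1, 0, 1, 0],
})
X, y = df[["power", "shape"]], df["flow"]

# Preprocessor: impute and scale numbers, one-hot encode categories.
preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(), StandardScaler()), ["power"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["shape"]),
])

# Composite model: the preprocessor followed by the estimator.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("estimator", SGDClassifier(random_state=13)),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Chaining both parts in one Pipeline means the imputer and scaler are fitted only on the data passed to `fit`, which avoids leaking test-set statistics into training.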
Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> estimators=dict(
...     RandomForestClassifier=RandomForestClassifier(
...         n_estimators=200, random_state=0),
... )
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(estimator=estimators, eval_metric =True)
>>> pc.base_score_
... 0.72586369