watex.cases package#

The ‘cases’ subpackage implements functions and modules already used to solve real engineering problems, such as flow rate prediction and boosting using base learners and ensemble paradigms.

The features, processing, modeling and prepare modules provide base step procedures. They can be used for processing and analysis, to give a quick picture of what the data looks like and an estimate of model performance.

class watex.cases.BaseModel(data_fn=None, df=None, **kwargs)[source]#

Bases: object

Base model class. The most interesting and challenging part of modeling is tuning the hyperparameters after designing a composite estimator. Finding the best parameters is a good way to reorganize the created pipeline {transformers + estimators} so that it generalizes well to new data.
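The {transformers + estimators} idea can be sketched with plain scikit-learn. This is only an illustration, not the BaseModel internals: the toy dataframe, its column values and the choice of OneHotEncoder for the categorical branch are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler
from sklearn.svm import SVC

# Toy data standing in for a real feature table (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "power": rng.normal(50, 10, 60),
    "magnitude": rng.normal(100, 20, 60),
    "sfi": rng.normal(1, 0.2, 60),
    "geol": rng.choice(["GRANITES", "GNEISS"], 60),
})
y = (df["power"] + rng.normal(0, 5, 60) > 50).astype(int)

# Numerical branch: feature engineering, selection, then scaling.
numeric = make_pipeline(
    PolynomialFeatures(2, include_bias=False),
    SelectKBest(f_classif, k=2),
    RobustScaler(),
)
# Categorical branch: one-hot encoding (an assumption for this sketch).
categorical = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Transformers are routed to columns by dtype, then chained with the estimator.
preprocessor = make_column_transformer(
    (numeric, make_column_selector(dtype_include=np.number)),
    (categorical, make_column_selector(dtype_exclude=np.number)),
)
composite = make_pipeline(preprocessor, SVC(C=1, gamma=0.1))
composite.fit(df, y)
train_accuracy = composite.score(df, y)
```

Because make_pipeline names each step after its lowercased class, the composite exposes the steps columntransformer and svc, which is where hyperparameter names such as svc__C come from.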

Parameters:
  • data_fn (str) – Path to the analysis data file.

  • df (pd.DataFrame) – Dataframe of features for analysis. Must contain the main parameters, including the target column (a pd.Series among the columns of df).

Holds other optional information in kwargs arguments:

  • auto (bool) – Trigger the composite estimator. If True, an SVC-composite estimator preprocessor is given. Default is False.

  • pipelines (dict) – Collect your own pipeline for model preprocessor triggering. If not given, it should be found automatically.

  • estimators (Callable) – A given estimator. If None, SVM is auto-selected as the default estimator.

  • model_score (float/dict) – Model test score. Observe your test model score using your composite estimator for enhancement or your own pipelines.

  • processor (Callable) – Compose pipelines and estimators for default model scoring, as well as the composite estimator enhancement.

Examples

>>> import numpy as np
>>> from watex.cases import BaseModel
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.compose import make_column_selector
>>> estimator2= RandomForestClassifier()
>>> modelObj = BaseModel(
...     data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines = {
...            'num_column_selector_': make_column_selector(
...                dtype_include=np.number),
...            'cat_column_selector_': make_column_selector(
...                dtype_exclude=np.number),
...            'features_engineering_':PolynomialFeatures(
...                2, include_bias=False),
...            'selectors_': SelectKBest(f_classif, k=2),
...            'encodages_': RobustScaler()
...              },
...     estimator = RandomForestClassifier()
...        )
property feature_importances_#

Get the bar plot of feature importances. An error is raised if the estimator has no feature_importances_ attribute.

get_learning_curve(estimator=None, X_train=None, y_train=None, learning_curve_kws=None, **kws)[source]#

Compute the train score and the validation curve to visualize your learning curve.

Parameters:
  • estimator – The created model. If None, the default estimator is used.

  • X_train – pd.core.frame.DataFrame of the selected trainset.

  • x_test – pd.DataFrame of the selected data for the testset.

  • y_train – array_like of the selected data for the evaluation set.

  • y_test – array_like of the selected data for the model test.

  • val_kws

    validation_curve keyword arguments. If None, the default is:

    val_curve_kws = {"param_name":'C',
                 "param_range": np.arange(1,210,10),
                 "cv":4}
    

  • switch – Turn the learning curve or validation curve on or off.

  • trigDec – Trigger the decorator.

  • N – Number of param range values for plotting.

Returns:

  • train_score: float|dict of trainset score.

  • val_score: float|dict of validation score.

Example:
>>> from watex.cases import BaseModel
>>> processObj = BaseModel(
...     data_fn = 'data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.get_learning_curve (
...     switch_plot='on', preprocessor=True)
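For readers who want to see what the default validation-curve keywords compute, here is a minimal sketch using scikit-learn's validation_curve directly. The synthetic dataset and the SVC settings are assumptions for illustration; whether get_learning_curve calls validation_curve in exactly this way is not confirmed by this page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the training set (hypothetical data).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Same keywords as the documented default.
val_curve_kws = {"param_name": "C",
                 "param_range": np.arange(1, 210, 10),
                 "cv": 4}

# One row per value of C, one column per cross-validation fold.
train_scores, val_scores = validation_curve(SVC(gamma=0.1), X, y, **val_curve_kws)

# Averaging over folds gives the two curves usually plotted against C.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

A widening gap between train_mean and val_mean as C grows is the usual sign of overfitting.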
get_model_prediction(estimator=None, X_test=None, y_test=None, **kws)[source]#

Get the model prediction and a quick plot using the decorator.

The decorator holds many keyword arguments to customize the plot. Refer to watex.utils.decorator.predPlot.

Parameters:
  • estimator – The created model. If None, the default estimator is used.

  • X_test – pd.DataFrame of the selected data for the testset.

  • y_test – array_like of the selected data for the model test.

  • kws – Additional keyword arguments which refer to the data_fn, df and pipelines parameters.

  • switch – Turn the decorator on or off.

Example:
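>>> import numpy as np
>>> from sklearn.compose import make_column_selector
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.svm import SVC
>>> from watex.cases import BaseModel
>>> testim = SVC(C=1, gamma=0.1)
>>> modelObj = BaseModel(
...     data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines ={
...         'num_column_selector_': make_column_selector(
...             dtype_include=np.number),
...         'cat_column_selector_': make_column_selector(
...             dtype_exclude=np.number),
...         'features_engineering_':PolynomialFeatures(2,
...                                         include_bias=False),
...         'selectors_': SelectKBest(f_classif, k=2),
...         'encodages_': RobustScaler()
...           }, estimator = testim)
>>> modelObj.get_model_prediction(estimator =testim, switch ='on')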
property model_#

Get the processor and the estimator that make up the composite model.

property model_score#

Estimate the prediction score of your composite model.

permutation_feature_importance(estimator=None, X_train=None, y_train=None, pfi_kws=None, **kws)[source]#

Evaluate feature importance with tree estimators before and after shuffling.

Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. Refer to https://scikit-learn.org/stable/modules/permutation_importance.html for more details.

Parameters:
  • estimator – The estimator to evaluate the importance of features. The default is RandomForestClassifier.

  • X_train – pd.core.frame.DataFrame of selected trainset.

  • y_train – array_like of selected data for evaluation set.

  • n_estimators – Number of estimators (trees) in the ensemble. The default is 100.

  • n_repeats – Number of times the features are shuffled. The default is 10.

  • pfi_kws – Additional keyword arguments for the permutation_importance callable.

  • pfi_style

    Type of plot. Can be:

    • pfi for permutation feature importance before and after shuffling

    • dendro for a dendrogram plot.

    The default is pfi.

  • switch – Turn the decorator on or off.

Example:
>>> from watex.cases import BaseModel
>>> from sklearn.ensemble import AdaBoostClassifier
>>> modelObj = BaseModel()
>>> modelObj.permutation_feature_importance(
...    estimator = AdaBoostClassifier(random_state=7),
...    data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     switch ='on', pfi_style='pfi')
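The underlying technique can be tried on its own with scikit-learn's permutation_importance. The sketch below uses a synthetic dataset and a RandomForestClassifier with the documented defaults (100 estimators, 10 repeats); it is an assumption that the method forwards its arguments to permutation_importance in a comparable way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of them informative (hypothetical).
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)

# Each feature column is shuffled n_repeats times on held-out data;
# the drop in score measures how much the model relies on that feature.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=7)

# Feature indexes ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
```

Computing the importances on a held-out set rather than the training set avoids rewarding features the model merely memorized.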
property processor#

Get the processor after supplying the pipelines.

tuning_hyperparameters(estimator=None, hyper_params=None, cv=4, grid_kws=None, **kws)[source]#

Tune the hyperparameters of an existing estimator to evaluate its performance, then boost the model using its best parameters.

Parameters:
  • estimator – Callable estimator or model to boost.

  • hyper_params – dict of hyperparameters of the estimator.

  • cv – Number of cross-validation folds. The default is 4.

  • grid_kws – dict of other grid search parameters.

Example:
>>> import numpy as np
>>> from watex.cases import BaseModel
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.svm import SVC
>>> from sklearn.compose import make_column_selector
>>> my_own_pipelines = {
...     'num_column_selector_': make_column_selector(
...         dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(
...         dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(
...         3, include_bias=False),
...     'selectors_': SelectKBest(f_classif, k=3),
...     'encodages_': RobustScaler()
...     }
>>> my_estimator = SVC(C=1, gamma=1e-4, random_state=7)
>>> modelObj = BaseModel(data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...                      pipelines=my_own_pipelines,
...                      estimator=my_estimator)
>>> hyperparams = {
...     'columntransformer__pipeline-1__polynomialfeatures__degree':
...         np.arange(2, 10),
...     'columntransformer__pipeline-1__selectkbest__k': np.arange(2, 7),
...     'svc__C': [1, 10, 100],
...     'svc__gamma': [1e-1, 1e-2, 1e-3]}
>>> my_compose_estimator_ = modelObj.model_
>>> modelObj.tuning_hyperparameters(
...     estimator=my_compose_estimator_,
...     hyper_params=hyperparams,
...     search='rand')
>>> modelObj.best_params_
>>> modelObj.best_score_
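The double-underscore keys in hyper_params follow scikit-learn's step__parameter convention. As a rough, self-contained illustration (not the watex implementation), the sketch below tunes a flat pipeline with GridSearchCV on synthetic data; the dataset and the reduced grid are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (hypothetical).
X, y = make_classification(n_samples=120, n_features=4, random_state=7)

# make_pipeline names each step after its lowercased class name.
model = make_pipeline(
    PolynomialFeatures(include_bias=False),
    SelectKBest(f_classif),
    RobustScaler(),
    SVC(random_state=7),
)

# Keys are "<step name>__<parameter name>".
param_grid = {
    "polynomialfeatures__degree": [2, 3],
    "selectkbest__k": [2, 3],
    "svc__C": [1, 10, 100],
    "svc__gamma": [1e-1, 1e-2, 1e-3],
}

search = GridSearchCV(model, param_grid, cv=4).fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

When the pipeline is nested inside a column transformer, the same rule simply chains one more level, which is how names such as columntransformer__pipeline-1__selectkbest__k arise. For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying them all.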
class watex.cases.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)[source]#

Bases: object

Default Data preparation steps

By default, BaseSteps is used to prepare the DC 1D-resistivity geoelectrical features before prediction. The predicted target was the flow rate.

Parameters:
tname: str,

A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

return_all: bool

Return all the stratified trainset. When the data is too large, set it to False to take a sample of the stratified trainset to evaluate your model.

drop_features: list

List of useless features to drop in order to clean the dataset.

categorizefeature_props: list

List of properties used to categorize a particular feature in the dataset. Each item is composed of the name of the feature whose numerical values are converted into categorical values, then the value range of the data to categorize, and finally the categorical names of those value ranges. For instance:

categorizefeature_props= [
    ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3']))
    ]

Please refer to watex.utils.transformers.CategorizeFeatures for further details.
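To make the (values, labels) pair concrete, here is one plausible reading of it sketched with pandas.cut. The helper name categorize and the treatment of the boundaries (each interval closed on the right, with an open-ended last class) are assumptions; the actual behaviour of CategorizeFeatures may differ.

```python
import numpy as np
import pandas as pd

def categorize(values, edges, labels):
    """Map numbers to labels: <= edges[0] gets labels[0], each interval
    (edges[i-1], edges[i]] gets labels[i], and > edges[-1] gets labels[-1]."""
    bins = [-np.inf, *edges, np.inf]  # len(edges) + 1 intervals
    return pd.cut(values, bins=bins, labels=labels, right=True)

# Hypothetical flow rates in m3/h.
flow = pd.Series([0.0, 0.5, 1.0, 2.4, 3.0, 7.2])
categories = categorize(flow, [0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3'])
```

Three edge values produce four classes, which is why the example props pair three numbers with four labels.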

hash: bool,

If True, it ensures that the data remains consistent across multiple runs, even if the dataset is refreshed. The test-by-id approach is used to hash the training and test sets when the data is split.
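A common way to obtain such a stable split is to hash each row identifier and send the row to the test set when the hash falls below a threshold. The sketch below illustrates that general idea with a CRC32 checksum; the function name in_test_set and the use of integer ids are assumptions, and this page does not confirm which hash BaseSteps uses.

```python
import zlib
import numpy as np

def in_test_set(identifier, test_ratio):
    """Decide from the id alone whether a row belongs to the test set."""
    checksum = zlib.crc32(np.int64(identifier).tobytes()) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32

# First version of the dataset: ids 0..999.
ids = np.arange(1000)
test_mask = np.array([in_test_set(i, 0.2) for i in ids])

# Refreshed dataset with 500 new rows appended.
more_ids = np.arange(1500)
more_mask = np.array([in_test_set(i, 0.2) for i in more_ids])
```

Since the decision depends only on the id, rows that were in the training set stay there after a refresh, so no test sample ever leaks into training across runs.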

add_attributes: bool, default=True

Experiment with the combination of numerical attributes. If True, features are combined to create a new feature value using the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature ‘lwi’ by ‘ohmS’.

operator: str, default=’/’

Type of operation to perform when combining features. Can be one of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].

attribute_indexes: list of int,

List of the indexes of the attributes to combine. For instance:

attribute_indexes = [1, 0] # or [4, 3]

The operator by default is division. These are the indexes of each attribute/feature used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.
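The effect of attribute_indexes and operator can be pictured with a few lines of NumPy. This is a sketch under the assumption that the first index is the left operand and the second the right one (mirroring "divide ‘lwi’ by ‘ohmS’"); the helper add_combined_attribute is hypothetical and is not the CombinedAttributesAdder code.

```python
import operator
import numpy as np

# Supported operators, as listed for the `operator` parameter.
OPERATORS = {"/": operator.truediv, "+": operator.add, "-": operator.sub,
             "*": operator.mul, "%": operator.mod}

def add_combined_attribute(X, attribute_indexes, op="/"):
    """Append a new column built from two existing columns of X."""
    left, right = attribute_indexes
    combined = OPERATORS[op](X[:, left], X[:, right])
    return np.c_[X, combined]

X = np.array([[2.0, 10.0],
              [4.0, 20.0],
              [5.0, 40.0]])
# Column 1 divided by column 0 becomes the third column.
X_new = add_combined_attribute(X, attribute_indexes=[1, 0])
```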

imputer_strategy: str

Type of strategy used to replace the missing values. Refer to SimpleImputer. Default is median.

missing_values: float

The value to be replaced. Default is np.nan.

pipeline: callable

Pipeline to prepare the dataset. Default is defaultPipeline().

test_size: float, default=.2 (i.e. 20%)

The proportion of the data held out for the testing set (Xt, yt); the remainder forms the training set (X, y).

random_state: int, RandomState instance or None, default=42

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

verbose: int, default is 0

Control the level of verbosity. Higher values lead to more messages.

data: Filepath or Dataframe of shape (M, N)

Data is passed here as an additional keyword argument, only to build X and y using the method stratifydata. It is a pandas.DataFrame containing M samples and N features.

Notes

The data preparation is composed of two steps. The first step includes:

  • The data cleaning, by fixing or removing outliers and replacing the missing values by other values using the param imputer_strategy rather than getting rid of the affected instances (examples) or the whole feature.

  • The handling of text and features, which consists of converting the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.

  • The data stratification, which is done before separating the dataset into a trainset and a test set. Indeed, stratification consists of dividing the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a trainset (80%) and a test set (20%).

The second step consists of feature selection, feature engineering, encoding and data scaling using the pipeline passed via the parameter pipeline. If no pipeline is given, the default pipeline is triggered. Feature engineering consists of aggregating features by experimenting with combinations of attributes into promising new features, using the param attribute_indexes after setting the argument add_attributes to True. The final step of the transformation consists of feature scaling. The type of scaling used by default in this module is standardization, because it is less affected by outliers. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (which deals with numerical features) and the categorical pipeline (which deals with categorical features). Both pipelines are combined and applied to the trainset and later to the test set.
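The two steps described in these notes can be sketched with scikit-learn building blocks. Everything in the example is an assumption made for illustration: the toy dataframe, the use of StratifiedShuffleSplit on the flow column, and the OrdinalEncoder for the categorical branch. It is not the default pipeline shipped with BaseSteps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy dataset with missing values and a categorical feature (hypothetical).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "power": rng.normal(60, 15, 100),
    "ohmS": rng.normal(900, 200, 100),
    "geol": rng.choice(["GRANITES", "GNEISS", "SCHISTS"], 100),
    "flow": np.repeat(["FR0", "FR1", "FR2", "FR3"], 25),
})
data.loc[::10, "ohmS"] = np.nan

# Step 1: stratify on the target, then split 80% / 20%.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["flow"]))
train, test = data.iloc[train_idx], data.iloc[test_idx]

# Step 2: numerical and categorical pipelines combined into a full pipeline.
num_cols, cat_cols = ["power", "ohmS"], ["geol"]
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # standardization
])
cat_pipeline = Pipeline([
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=-1)),
])
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

# Fit on the trainset only, then reuse the fitted pipeline on the test set.
X_train = full_pipeline.fit_transform(train[num_cols + cat_cols])
X_test = full_pipeline.transform(test[num_cols + cat_cols])
```

Fitting the pipeline on the trainset and only transforming the test set keeps test-set statistics (medians, means, variances) out of the preparation.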

Examples

../datasets/_p.py

property X_#

keep the stratified testset X

property data#
fit(X, y=None)[source]#

Preparing steps.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

``self`` – returns self for easy method chaining.

Return type:

BaseSteps instance

fit_transform(X=None, y=None, on_testset=False)[source]#

Apply fit and transform at once.

Parameters:
  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like,) – ylabel or target values

  • on_testset (bool) – Whether the dataframe is evaluated on the testset or the trainset.

Returns:

  • X_prepared. Data prepared after transformation.

  • y_prepared. Label prepared after transformation.

stratifyFolds(data)[source]#

Stratify the dataset and return the trainset. Get more details in watex.bases.transformers.StratifiedWithCategoryAdder.

stratifydata(data=None)[source]#

Split and stratify the data, and return the stratified training and test sets.

property tname#
transform(X=None, y=None, on_testset=False)[source]#

Transform data applying the pipeline transformation.

Parameters:
  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like,) – ylabel or target values

  • on_testset (bool) – Whether the dataframe is evaluated on the testset or the trainset.

Returns:

  • X_prepared. Data prepared after transformation.

  • y_prepared. Label prepared after transformation.

property y_#

keep the stratified label y

class watex.cases.FeatureInspection(tname='flow', mapflow=True, sanitize=False, flow_classes=[0.0, 1.0, 3.0], set_index=False, col_name=None, **kws)[source]#

Bases: object

Summarizes the flow features.

It deals with the categorization of data features. When numerical values are provided, a standard qualitative or quantitative analysis is performed.

Parameters:
  • data (str or pd.core.DataFrame) – Path-like object or pandas Dataframe. Must contain the main parameters including the target.

  • tname (str) – The target name for prediction purposes. Here, for groundwater exploration, we specify the name of the target as flow.

  • flow_classes (list or array_like) – The way to classify the flow. Provide the main specific values used to convert the categorical trends to numerical values. Different projects target different flow rates. These might be specified for village hydraulic, improved village hydraulic or urban hydraulic.

  • drop_columns (list) – Items to drop. To analyse the data, we can drop some specific columns so as not to corrupt the data analysis. In a formal dataframe collected straightforwardly from GeoFeatures, the default drop_columns refer to the coordinate positions: [‘east’, ‘north’].

  • mapflow (bool) –

    If set to True, the values in the target column are mapped to categorical values. Commonly the flow rate values are given as a trend of numerical values. For classification purposes, the flow rate must be converted to categorical values which mainly refer to the types of hydraulic. Mostly, the type of hydraulic system is in turn tied to the number of people living in a specific area. For instance, flow classes can be ranged as follows:

    • FR = 0 is for dry boreholes

    • 0 < FR ≤ 3 m3/h for village hydraulic (≤2000 inhabitants)

    • 3 < FR ≤ 6 m3/h for improved village hydraulic (>2000-20 000 inhabitants)

    • 6 < FR ≤ 10 m3/h for urban hydraulic (>200 000 inhabitants).

    Note that this flow range is not exhaustive and can be modified according to the type of hydraulic required on the project.

  • set_index (bool) – Consider a column as the dataframe index. If set to True, please provide the col_name; otherwise the id is used as the default column item.

  • sanitize – Polish the data and remove inconsistent columns which do not refer to the predicting features. For instance, it is able to change the French name eau to ‘water’, which is related to the value of the water inflow feature lwi. This can be useful when the data is given as a path-like object and the features are not described correctly in the case of groundwater. Default is False.

Examples

>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection().fit(data)
>>> fobj.data_.columns
Index(['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type',
       'sfi', 'ohmS', 'lwi', 'geol', 'flow'],
      dtype='object')
property cache#

Generate the cache df_ holding all the eliminated features, kept in a new pd.core.frame.DataFrame.

property data#

Check the extension of the provided feature file. Useful to select how the pd.DataFrame is constructed.

fit(data)[source]#

The main goal of this method is to fit and classify the different flow classes in the dataset. By default, four (04) flow classes are considered, according to the reference below.

Parameters:

data (str or pd.core.DataFrame) – Path-like object or pandas Dataframe. Must contain the main parameters, including the target name (tname).

Returns:

object

Return type:

FeatureInspection object

Examples

>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection()
>>> fobj.fit(data)
>>> fobj.data.iloc[1:3 , :]
...    num name  power  magnitude  ...         ohmS        lwi      geol  flow
1    2   b2   70.0      142.0  ...  1135.551531  21.406531  GRANITES   FR1
2    3   b3   80.0       87.0  ...   767.562500   0.000000  GRANITES   FR1

Notes

The paper [1] mentions four (04) types of hydraulic, according to the population demand and the number of inhabitants. The hydraulic systems are defined as:

  • FR = 0 is for dry boreholes

  • 0 < FR ≤ 3 m3/h for village hydraulic (≤2000 inhabitants)

  • 3 < FR ≤ 6 m3/h for improved village hydraulic (>2000-20 000 inhabitants)

  • 6 < FR ≤ 10 m3/h for urban hydraulic (>200 000 inhabitants).

The flow classes can be modified according to the type of hydraulic proposed for the project.

References

[1]

CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.

property flow_classes#
writedf(df=None, refout=None, to=None, savepath=None, modname='_anEX_', reset_index=False)[source]#

Write the analysis df.

Refer to watex.decorators.exportdf() for more details about the arguments refout, to, savepath, modname and reset_index.

Example:
>>> from watex.cases.features import FeatureInspection
>>> slObj =FeatureInspection(
...   data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...   set_index =True)
>>> slObj.writedf()
class watex.cases.GeoFeatures(**kws)[source]#

Bases: object

Features class. Deals with Electrical Resistivity Profiling (ERP), Vertical Electrical Sounding (VES), Geological (Geol) data and Borehole data (Boreh). Sets all the feature values of the different investigation sites. The Features class is composed of:

  • erp class obtained from watex.methods.erp.ERP_colection

  • geol obtained from watex.geology.geology.Geology

  • boreh obtained from watex.geology.geology.Borehole

Parameters:
  • features_fn (str, Path_like) – Path to the geoelectrical features file.

  • ErpColObjs (object) – Collection object from erp survey lines.

  • vesObjs (object) – Collection object from vertical electrical sounding (VES) curves.

  • geoObjs (object) – Collection object from geol class. See watex.geology.geology.Geology.

  • boreholeObjs (object) – Collection of boreholes of all investigation sites. Refer to watex.geology.geology.Borehole

Holds other optional information in kwargs arguments:

  • df (pd.core.DataFrame) – Container of all features composed of featureLabels.

  • site_ids (array_like) – ID of each survey location.

  • site_names (array_like) – Survey location names.

  • gFname (str) – Filename of features_fn.

  • ErpColObjs (obj) – ERP erp class object.

  • vesObjs (obj) – VES ves class object.

  • geoObjs (obj) – Geology geol class object.

  • borehObjs (obj) – Borehole boreh class object.

Notes

Be sure not to miss any coordinates files. Indeed, each selected anomaly should have a borehole drilled at that place for supervised learning. That means each selected anomaly, referenced by location coordinates and id on erp, must have its own ves, geol and boreh data. For further details about the class objects, please refer to the documentation of the aforementioned classes.

Examples

>>> from watex.cases.features import GeoFeatures
>>> data ='data/geodata/main.bagciv.data.csv'
>>> featObj =GeoFeatures().fit(data )
>>> featObj.id_
array(['e0000001', 'e0000002', 'e0000003', 'e0000004', 'e0000005',
       'e0000006', 'e0000007'], dtype='<U8')
>>> featObj.site_names_[:7]
array(['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'], dtype=object)
static controlObjId(erpObjID, boreObjID, geolObjID, vesObjsID)[source]#

Check the object ids, i.e. whether each selected anomaly from erp matches its ves, geol and borehole.

Parameters:
Returns:

New survey ID

property data#

Check the extension of the provided feature file. Useful to select how the pd.DataFrame is constructed.

data_to_numpy(data_fn)[source]#

Method to get the datatype and set the different features into a numpy array.

exportdf(refout=None, to=None, savepath=None, **kwargs)[source]#

Export the dataframe df to a file, which can be an Excel sheet file or a ‘.json’ file. To get more details about the writef decorator, see watex.decorators.writef().

Parameters:
  • refout – Output filename. If not given, it is created from the export date.

  • to (str) – Export type. Can be .xlsx, .csv, .json and others.

  • savepath – Path to save the refout filename. If not given, it is created.

Returns:

  • ndf: new dataframe built from the df attribute.

Example:
>>> from watex.cases.features import GeoFeatures
>>> featObj = GeoFeatures(
...    features_fn= 'data/geo_fdata/BagoueDataset2.xlsx' )
>>> featObj.exportdf(refout='ybro', to='csv')
featureLabels_ = ['id', 'east', 'north', 'power', 'magnitude', 'shape', 'type', 'sfi', 'ohmS', 'lwi', 'geol', 'flow']#
fit(data=None, geoObj=None, erpObj=None, vesObj=None, boreholeObj=None, **kws)[source]#

Read the class and populate its attributes. Please refer to ~.core.geofeatures.Features for details about the arguments.

from_csv(erp_fn)[source]#

Method essentially created to read a CSV file and collect the horizontal distance values and the apparent resistivity values, then send them to the class for computation purposes.

Parameters:

erp_fn (str) – path_like string of CSV file

Returns:

horizontal distance in meters

Return type:

np.array of all data.

from_json(json_fn, indent=4)[source]#

Collect data from JSON files and retrieve the most insightful contents.

Parameters:

json_fn (str) – json file

from_xml(xml_fn, columns=None)[source]#

Collect data from XML and build a DataFrame.

Parameters:
  • xml_fn – Full path to xml file

  • columns (list) – list of columns of dataset

sanitize_fdataset()[source]#

Sanitize the feature dataset. Recognize the columns provided by the user and reset them according to the available feature labels featureLabels_.

class watex.cases.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#

Bases: object

Base preprocessing class.

Give a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.

Parameters:
  • tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

  • drop_features (list or str, Optional) – List the useless features for predicting or list of column names to drop out.

  • random_state (int, default is 42) – The state of data shuffling. The default is 42.

  • default_estimator (callable, F or scikit-learn estimator) –

    The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing, such as:

    • ‘dtc’: DecisionTreeClassifier

    • ‘svc’: Support Vector Classifier

    • ‘sdg’: SGDClassifier

    • ‘knn’: KNeighborsClassifier

    • ‘rdf’: RandomForestClassifier

    • ‘ada’: AdaBoostClassifier

    • ‘vtc’: VotingClassifier

    • ‘bag’: BaggingClassifier

    • ‘stc’: StackingClassifier

    If the estimator is not given, the default is svm or svc.

  • test_size (float,) – The test set data size. Must be less than 1. The default test size is 0.2, i.e. 20% of the dataset.

  • verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.

X#

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type:

Ndarray of shape (M x N), where M = m-samples and N = n-features

y#

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like of shape (M, ), where M = m-samples

Xt#

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type:

Ndarray ( M x N matrix where M=m-samples, & N=n-features)

yt#

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like, shape (M, ) M=m-samples,

data#

Path-like object or Dataframe. If data is given as a path-like object, the data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.

Type:

str, filepath_or_buffer or pandas.core.DataFrame

pipe_#

The pipeline can be built on your own with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.

Type:

Callable, preprocessor object from sklearn.pipeline

estimator#

Callable estimator method to fit the model:

estimators= SGDClassifier(random_state=13)
Type:

Callable, F or sklearn.metaestimator

model#

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type:

callable, always as a function,

cat_features_#

List of categorical features. If not given, it should be found automatically.

Type:

list or str, Optional

num_features_#

List of numerical features. If not given, it should be found automatically.

Type:

list of str, Optional

model#

Use the predefined pipelines, i.e. a Pipeline you can build on your own with a different composite estimator. If model is None, the default model from the default preprocessor and estimator is used.

Type:

Callable, {preprocessor + estimator },

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> # pc is a fitted Preprocessing instance, built as in baseEvaluation below
>>> estimators=dict(
...    RandomForestClassifier=RandomForestClassifier(
...        n_estimators=200, random_state=0),
...    SDGC=SGDClassifier(random_state=0))
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric =True)
>>> pc.base_score_
... 0.72586369
baseEvaluation(model=None, eval_metric=False, **kws)[source]#

Dummy baseline model from preprocessing pipeline.

onto a model by providing an estimator.

Parameters:
  • model (Callable, {preprocessor + estimator},) – A scikit-learn estimator or a composite model built from a Pipeline. If model is None, the default model from the default preprocessor and estimator is used. model can also be a dict of multiple estimators; in that case the evaluation of each estimator is stored in a dictionary keyed by estimator name.

  • eval_metric (bool,) – if set to True, the confusion matrix and classification report scores are evaluated, assuming the supervised learning task is a classification problem. Default is False.

  • scorer (str, Callable,) –

    A scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:

    • 'classification_report' -> for classification_report,

    • 'precision_recall' -> for precision_recall_curve,

    • 'confusion_matrix' -> for confusion_matrix,

    • 'precision' -> for precision_score,

    • 'accuracy' -> for accuracy_score,

    • 'mse' -> for mean_squared_error,

    • 'recall' -> for recall_score,

    • 'auc' -> for roc_auc_score,

    • 'roc' -> for roc_curve,

    • 'f1' -> for f1_score.

    Other string prefix values raise an error.

  • kws (dict,) – Additional keyword arguments passed to the scikit-learn metric function.
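
The prefix-to-metric mapping listed for scorer can be pictured as a small lookup table. The sketch below is only an illustration: SCORERS and resolve_scorer are hypothetical names, not part of watex, and the lookup uses an exact key match rather than whatever prefix matching the library performs internally.

```python
from sklearn import metrics

# Hypothetical lookup table (not the watex implementation): keys are the
# prefixes listed above, values are the scikit-learn metric functions.
SCORERS = {
    "classification_report": metrics.classification_report,
    "precision_recall": metrics.precision_recall_curve,
    "confusion_matrix": metrics.confusion_matrix,
    "precision": metrics.precision_score,
    "accuracy": metrics.accuracy_score,
    "mse": metrics.mean_squared_error,
    "recall": metrics.recall_score,
    "auc": metrics.roc_auc_score,
    "roc": metrics.roc_curve,
    "f1": metrics.f1_score,
}


def resolve_scorer(scorer):
    """Return a metric callable from a callable or a known string key."""
    if callable(scorer):
        return scorer
    key = str(scorer).lower().strip()
    if key not in SCORERS:
        raise ValueError(
            f"Unknown scorer {scorer!r}; expected one of {sorted(SCORERS)}")
    return SCORERS[key]


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
# Four of the five predictions match the true labels.
accuracy = resolve_scorer("accuracy")(y_true, y_pred)
```

Extra keyword arguments, such as average='macro' for precision_score, would be forwarded to the resolved function in the same way kws is described above.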

Returns:

`self.base_score_`

Return type:

base score after predicting

Notes

If no estimator is given, the default estimator is SVM. Otherwise, provide the prefix to select the matching estimator from the default dict default_estimator. Get the default dict by calling <instance>._getdestimators().

Examples

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)

(1) -> default estimator

>>> pc.baseEvaluation (eval_metric=True)
... 0.47126436781609193

(2) -> multiple estimators

>>> from watex.exlib.sklearn import RandomForestClassifier , SGDClassifier, SimpleImputer
>>> estimators={'RandomForestClassifier':RandomForestClassifier(
...     n_estimators=200, random_state=0),
...     'SDGC':SGDClassifier(random_state=0)}
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric =True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3,
        1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2,
        1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2,
        2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1],
       dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989,
 'SGDClassifier': 0.14942528735632185}
property data#
property features#

Collect the list of features

fit(X=None, y=None, **fit_params)[source]#

Read the whole dataset, encode the categorical features and populate class attributes.

If X and y are provided, they are taken as the feature set and the target respectively, and are split into a training set and a test set.

Parameters:
  • X (N-d array, shape (M, N)) – the feature array composed of M samples and N columns (features). The feature set excludes the target y.

  • y (array-like, shape (M, )) – the target, composed of M examples in supervised learning.

  • data (DataFrame of shape (M, N)) – pandas.DataFrame containing M samples and N features, including the target y. If data is given, X and y need not be provided: the target named tname is removed from the data.

  • split_X_y (bool, default True) – split the dataset into a training set {X, y} and a test set {Xt, yt}. Otherwise X and y are considered as the training set.

Returns:

``self``

Return type:

Preprocessing instance for easy method chaining.

Examples

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing (drop_features = ['lwi', 'num', 'name']
...                     ).fit(data =data )
>>> len(pc.X ),  len(pc.y), len(pc.Xt ),  len(pc.yt)
... (344, 344, 87, 87) # trainset (X,y) and testset (Xt, yt)
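
The target removal and train/test split described for fit can be sketched with plain pandas and scikit-learn. This is an assumption about the general mechanism, not the watex source: the synthetic table, its column names and the use of train_test_split are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a dataset: 431 rows with a 'flow' target column.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "power": rng.normal(size=431),
    "magnitude": rng.normal(size=431),
    "flow": rng.integers(0, 4, size=431),
})

tname = "flow"
y_all = data[tname]                 # target taken out of the data
X_all = data.drop(columns=[tname])  # feature set excludes the target

# 20% of the rows are held out as the test set {Xt, yt}.
X, Xt, y, yt = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)
sizes = (len(X), len(y), len(Xt), len(yt))
```

With 431 rows and test_size=0.2, scikit-learn rounds the test set up to 87 samples and leaves 344 for training.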
get_default_pipe()[source]#

Make a default pipe to preprocess the data.

Create a preprocessor by assembling multiple transformers. The default pipeline is not exhaustive; for full control of the data, it is recommended to provide your own, stronger preprocessor.

The method returns self.pipe_, a callable preprocessor pipeline built from sklearn.pipeline.Pipeline objects. The default transformers are composed of:

  • sklearn.pipeline.make_pipeline() for pipeline creation.

  • sklearn.preprocessing.OneHotEncoder() for categorical features encoding.

  • sklearn.preprocessing.PolynomialFeatures() for features engineering.

  • sklearn.preprocessing.RobustScaler() for data scaling.

  • sklearn.compose.make_column_transformer() for data transformation.

  • sklearn.compose.make_column_selector() for features composing.
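
As a rough illustration of how the building blocks listed above fit together, the sketch below assembles a comparable preprocessor with scikit-learn only. The polynomial degree, the ordering of the steps and the toy data are assumptions; the actual default pipe returned by get_default_pipe may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder, PolynomialFeatures, RobustScaler)

# Numerical branch: feature engineering followed by robust scaling.
num_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # degree is illustrative
    RobustScaler(),
)
# Categorical branch: one-hot encoding.
cat_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Column selectors route each dtype family to its own branch.
preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include=np.number)),
    (cat_pipe, make_column_selector(dtype_exclude=np.number)),
)

X = pd.DataFrame({
    "power": [10.0, 20.0, 35.0, 50.0],
    "magnitude": [1.0, 3.0, 2.0, 5.0],
    "geol": ["granite", "schist", "granite", "gneiss"],
})
Xp = preprocessor.fit_transform(X)
```

Here the two numerical columns expand to five polynomial terms and the single categorical column to three indicator columns, so Xp has eight columns.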

property inspect#

Inspect data and trigger plot after checking the data entry. Raises NotFittedError if self is not fitted yet.

makeModel(pipe=None, estimator=None)[source]#

Assemble pipes and estimator to create the model

The model is composed of the transformers and the estimator. If either is set to None, the default pipe or estimator is used, which might not be the one expected. Providing both a pipe and an estimator is therefore recommended.

Parameters:
  • pipe (Callable, pipeline or preprocessor) – Callable pipeline. It can be your own pipeline with different transformers. Refer to sklearn.pipeline.Pipeline for further details. Call get_default_pipe to get the default pipe.

  • estimator (Callable, F or {sklearn estimator}) –

    Callable estimator method to fit the model:

       estimators= SGDClassifier(random_state=13)
    
    Some pre-estimators can be fetched by providing the prefix as
    a key of the estimator default dict. For instance to fetch the
    DecisionTreeClassifier estimator:
    
        >>> from watex.cases.processing import Preprocessing
        >>> Preprocessing._getdestimators()['dtc']
        ... DecisionTreeClassifier(max_depth=100, random_state=42)
    

Returns:

`model_`

Return type:

Callable, {preprocessor + estimator }

Examples

(1) -> Get the default preprocessor by merely calling:

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)
>>> pc.makeModel() # use default model and preprocessor
>>> pc.model_

(2) -> Or build your own preprocessor object using the example below:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
...     steps=[("imputer", SimpleImputer(strategy="median")),
...            ("scaler", StandardScaler())]
...     )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         ("num", numeric_transformer, numeric_features),
...         ("cat", categorical_transformer, categorical_features),
...     ])
>>> pc.makeModel (pipe = preprocessor,
...               estimator =  LogisticRegression())
>>> pc.model_
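
What "preprocessor + estimator" amounts to can be shown with scikit-learn alone. The sketch below chains a ColumnTransformer and a LogisticRegression into one Pipeline on synthetic data; the step names and the data are assumptions, and makeModel may assemble its model_ differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with one missing-value pattern and one categorical column.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "power": rng.normal(size=n),
    "magnitude": rng.normal(size=n),
    "geol": rng.choice(["granite", "schist"], size=n),
})
X.loc[::10, "power"] = np.nan            # a few NaN values to impute
y = (X["magnitude"] > 0).astype(int)     # toy binary target

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["power", "magnitude"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["geol"]),
])

# The composite model: transformers first, estimator last.
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("estimator", LogisticRegression()),
])
model.fit(X, y)
train_accuracy = model.score(X, y)
```

Keeping the transformers inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking test information into the scaler or imputer.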
class watex.cases.Processing(pipeline=None, estimator=None, **kws)[source]#

Bases: Preprocessing

Processing class for managing baseline model evaluation and learning.

Manages the validation curves after slightly tuning the estimator hyperparameters.

Processing is useful before the modeling step. To process data, a default implementation is given for building the data preprocessor. It consists of creating a model pipeline using different transformers. If no pipeline is set and auto is set to True, a default pipeline is created through the preprocessor to run the base model evaluation. Indeed a preprocessor is a set of transformers + estimators.

Parameters:
  • auto (bool, default is False) – trigger the composite estimator. If True, a composite preprocessor is built and used for base model evaluation. Default is False.

  • pipeline (Callable, F or dict of callable F) – preprocessing steps encapsulated. If not supplied, a default pipe is used when auto is set to True.

  • estimator (Callable,) – An object which manages the estimation and decoding of a model. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

  • tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

  • drop_features (list or str, Optional) – List the useless features for predicting or list of column names to drop out.

  • random_state (int, default is 42) – The state of data shuffling. The default is 42.

  • default_estimator (callable, F or scikit-learn estimator) – The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:

    • 'dtc': DecisionTreeClassifier

    • 'svc': Support Vector Classifier

    • 'sdg': SGDClassifier

    • 'knn': KNeighborsClassifier

    • 'rdf': RandomForestClassifier

    • 'ada': AdaBoostClassifier

    • 'vtc': VotingClassifier

    • 'bag': BaggingClassifier

    • 'stc': StackingClassifier

    If estimator is not given, the default is svm or svc.

  • test_size (float,) – The test set size as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.

  • verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
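
The prefix-based selection of a default estimator can be pictured as a registry keyed by the short names listed for default_estimator. The sketch below is hypothetical: DEFAULT_ESTIMATORS and pick_estimator are illustrative names, and every hyperparameter except those of 'dtc' (shown elsewhere in this documentation) is a placeholder. The 'vtc' and 'stc' entries are left out because VotingClassifier and StackingClassifier require a list of base estimators.

```python
from sklearn.base import clone
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier, RandomForestClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry, not the watex cache itself.
DEFAULT_ESTIMATORS = {
    "dtc": DecisionTreeClassifier(max_depth=100, random_state=42),
    "svc": SVC(random_state=42),
    "sdg": SGDClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "rdf": RandomForestClassifier(random_state=42),
    "ada": AdaBoostClassifier(random_state=42),
    "bag": BaggingClassifier(random_state=42),
}


def pick_estimator(name=None):
    """Return a fresh copy of the estimator registered under `name`.

    Falls back to the support vector classifier when no name is given.
    """
    key = "svc" if name is None else str(name).lower()
    if key not in DEFAULT_ESTIMATORS:
        raise KeyError(f"Unknown estimator prefix {name!r}")
    # clone() returns an unfitted copy, so the registry is never mutated.
    return clone(DEFAULT_ESTIMATORS[key])


default = pick_estimator()
tree = pick_estimator("dtc")
```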

X#

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type:

Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)

y#

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like of shape (M, ), \(M=m-samples\)

Xt#

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type:

Ndarray ( M x N matrix where M=m-samples, & N=n-features)

yt#

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:

array-like, shape (M, ) M=m-samples,

data#

Path -like object or Dataframe. If data is given as path-like object, data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be a file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle e.g. via builtin open function or StringIO.

Type:

str, filepath_or_buffer or pandas.core.DataFrame

pipe_#

The pipeline can be built by yourself with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.

Type:

Callable, preprocessor object from sklearn.pipeline

estimator#

Callable estimator method to fit the model:

estimators= SGDClassifier(random_state=13)

Type:

Callable, F or sklearn.metaestimator

model#

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type:

callable, always as a function,

cat_features_#

List of categorical features. If not given, they are found automatically.

Type:

list or str, Optional

num_features_#

List of numerical features. If not given, they are found automatically.

Type:

list of str, Optional

model#

Use a predefined pipeline, i.e. a Pipeline that you can build yourself with a different composite estimator. If model is None, the default model from the default preprocessor and estimator is used.

Type:

Callable, {preprocessor + estimator },

model_score_#

Model test score. Observe your model's test score when using your composite estimator for enhancement.

Type:

float/dict

model_prediction_#

Observe your model's prediction on the test set, as well as the composite estimator enhancement.

Type:

array_like

preprocessor_#

Composes pipelines and estimators for default model scoring.

Type:

Callable , F

Examples

>>> import numpy as np
>>> from watex.cases.processing  import Processing
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import (StandardScaler,RandomForestClassifier,
...                                  make_column_selector, PolynomialFeatures,
...                                  SelectKBest, f_classif)
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> my_own_pipeline= {'num_column_selector_':
...                       make_column_selector(dtype_include=np.number),
...                'cat_column_selector_':
...                    make_column_selector(dtype_exclude=np.number),
...                'features_engineering_':
...                    PolynomialFeatures(3,include_bias=True),
...                'selectors_': SelectKBest(f_classif, k=4),
...               'encodages_': StandardScaler()
...                 }
>>> my_estimator={
...    'RandomForestClassifier':RandomForestClassifier(
...    n_estimators=200, random_state=0)
...    }
>>> processObj= Processing (tname = 'flow', drop_features =['lwi', 'name', 'num'],
...                         pipeline= my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data )
>>> processObj.baseEvaluation (eval_metric=True )
... 0.4942528735632184 # score is an ensemble score for both model
property auto#

Trigger the composite pipeline building and create a composite default model estimator CE-SVC

property estimator#

Get your estimator or the existing default estimator

get_validation_curve(val_params=None, switch_plot='off', preprocess_step=False, train_pkws=None, val_pkws=None, **kws)[source]#

Compute the validation score and plot the validation curve if the switch_plot argument of the decorator is turned on.

If the validation keyword arguments val_params do not contain a param_range key, the default param_range is the one of the decorator.

Parameters:
  • val_params (dict) –

    validation_curve keyword arguments. If None, the default is:

    val_params = {"param_name":'C',
                 "param_range": np.arange(1,210,10),
                 "cv":4}
    

  • switch_plot (str, default ='off') – visualize the validation plot when set to 'on'.

  • preprocess_step (bool, default=False) – Trigger the default step of preprocessing.

  • train_pkws (dict,) – keyword arguments passed to matplotlib.line/scatter plots for the training curve.

  • val_pkws (dict,) – keyword arguments passed to matplotlib.line/scatter plots for the validation curve.

Returns:

  • train_score (float|dict of training set score)

  • val_score (float|dict of validation score)

  • switch (Turn on or off the validation plot.)

  • kk (the validation param_range for plot.)

Examples

>>> from watex.cases.processing  import Processing
>>> from watex.datasets import fetch_data
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> processObj= Processing (tname = 'flow',
...                         drop_features =['lwi', 'name', 'num'])
>>> processObj.fit(data=data )
>>> processObj.get_validation_curve(
...     switch_plot='on', preprocess_step=True)
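
The default val_params shown above map directly onto scikit-learn's validation_curve. The following sketch uses synthetic data and a bare SVC, both assumptions made for illustration; inside a composite pipeline the parameter name would usually need a step prefix such as 'svc__C'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Small synthetic classification problem standing in for the real dataset.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Same default keywords as documented for val_params.
val_params = {"param_name": "C",
              "param_range": np.arange(1, 210, 10),
              "cv": 4}

# One row per value of C, one column per cross-validation fold.
train_scores, val_scores = validation_curve(SVC(), X, y, **val_params)

# Averaging over folds gives the two curves that would be plotted.
train_score = train_scores.mean(axis=1)
val_score = val_scores.mean(axis=1)
```

A widening gap between train_score and val_score as C grows is the usual sign that the estimator starts to overfit.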
property model#

Concatenate preprocessor and estimator to var

property model_prediction#

Get the model prediction after composite estimator designed

property model_score#

Get the composite estimator score

property preprocessor#

Preprocessor for composite_estimator design

property processing_model#

Get the default composite model

quick_estimation(estimator_name=None, default_estimator=False)[source]#

Quickly run the model without any processing. If no estimator is provided, the SVC estimator is used.

Parameters:
  • estimators – Callable estimator. If None, an SVC is used to quickly estimate the prediction.

  • random_state – The state of data shuffling. The default is 7.

Example:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from watex.cases.processing import Processing
>>> processObj = Processing(
...     data = 'data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator=DecisionTreeClassifier(
...     max_depth=100, random_state=13))
>>> processObj.model_score
>>> processObj.model_prediction
watex.cases.base_transform(X, n_components=0.95, attr_names=None, attr_indexes=None, operator=None, view=False, **kws)[source]#

Transform X using PCA and plot the variance ratio by experimenting with attribute combinations.

Create new attributes using feature indexes or a literal string operator, and prepare the data for the PCA variance plot.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (float or int) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. The default preserves 95% of the variance.

  • attr_names (list of str, optional) – List of features to combine. New feature values are computed with the operator parameter. By default, the combination is the ratio of the given numerical attributes. For instance, attr_names=['lwi', 'ohmS'] divides the feature ‘lwi’ by ‘ohmS’.

  • attr_indexes (list of int,) – Indexes of the features to combine. A user warning is raised if any index does not match the dataframe or array columns.

  • operator (str, default ='/') – Type of operation to perform when combining features. Can be [‘/’, ‘+’, ‘-’, ‘*’, ‘%’]

Returns:

  • X (ndarray or pd.DataFrame) – New array or dataframe with the new attributes combined.

Examples

>>> from watex.view.mlplot import MLPlots
>>> from watex.datasets import fetch_data
>>> from watex.analysis import pcaVarianceRatio
>>> plot_kws = {'lc':(.9,0.,.8),
...         'lw' :3.,           # line width
...         'font_size':7.,
...         'show_grid' :True,        # visualize grid
...        'galpha' :0.2,              # grid alpha
...        'glw':.5,                   # grid line width
...        'gwhich' :'major',          # major ticks
...         # 'fs' :3.,                 # coeff to manage font_size
...         }
>>> X, _ = fetch_data ('Bagoue data analysis')
>>> mlObj =MLPlots(**plot_kws)
>>> pcaVarianceRatio(mlObj,X, plot_var_ratio=True)
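
The two ideas behind base_transform, combining attributes with an operator and keeping enough principal components to preserve 95% of the variance, can be sketched independently of watex. combine_attributes, the OPERATORS table and the random data below are hypothetical and only illustrate the documented behaviour.

```python
import operator as op

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Literal string operators mapped to their Python counterparts.
OPERATORS = {"/": op.truediv, "+": op.add, "-": op.sub,
             "*": op.mul, "%": op.mod}


def combine_attributes(df, attr_names, operator="/"):
    """Return a copy of df with one new column built from two existing ones."""
    left, right = attr_names
    out = df.copy()
    out[f"{left}{operator}{right}"] = OPERATORS[operator](df[left], df[right])
    return out


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1.0, 10.0, size=(50, 4)),
                  columns=["lwi", "ohmS", "power", "magnitude"])
df_new = combine_attributes(df, ["lwi", "ohmS"])   # adds 'lwi/ohmS'

# A float n_components asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that ratio.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(df_new))
explained = pca.explained_variance_ratio_.sum()
```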
watex.cases.default_pipeline(X, num_attributes, cat_attributes, y=None, label_encoding='LabelEncoder', **kws)[source]#

Default pipeline used for preprocessing the Bagoue dataset.

The pipeline can be improved to achieve better results.

Parameters:
  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like,) – ylabel or target

  • num_attributes (list) – Numerical attributes

  • cat_attributes (list) – Categorical attributes

  • label_encoding (str) – Type of encoding used to encode the label. Default is LabelEncoder but can be LabelBinarizer.

Returns:

  • num_pipeline (Pipeline to process numerical features)

  • cat_pipeline (Pipeline to process categorical features.)

  • full_pipeline (Full pipeline as the union of two pipelines)

  • y (ylabel encoded if not None.)
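
The difference between the two label encodings named above is easiest to see on a tiny example. The class labels below are made up for illustration; only the scikit-learn encoders themselves are taken from the parameter description.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y = np.array(["FR1", "FR0", "FR2", "FR1", "FR0"])  # made-up class labels

# LabelEncoder maps each class to one integer, in sorted class order.
y_int = LabelEncoder().fit_transform(y)

# LabelBinarizer produces one indicator column per class instead.
y_onehot = LabelBinarizer().fit_transform(y)
```

LabelEncoder keeps y one-dimensional, which is what most scikit-learn classifiers expect, while LabelBinarizer is convenient when one column per class is needed, for instance for one-vs-rest scoring.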

watex.cases.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)[source]#

Automate the data preparation so that the data is ready for PCA analyses.

Data preparation consists of imputing missing values, scaling the numerical features and encoding the categorical features.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • imputer_strategy (str, default ='most_frequent') – Strategy used to replace the missing values. Can be mean, median or most_frequent. If mean or median is given, make sure that the data does not contain categorical features.

  • missing_values (float) – Value to replace the missing value in X ndarray or dataframe. Default is np.nan.

  • num_indexes – list of indexes to select the numerical data if categorical data columns exist in X ndarray.

  • cat_indexes – list of indexes to select the categorical data if numerical data columns exists in X ndarray.

  • scaler (str, default is StandardScaler) – type of feature scaling applied on numerical features. Can be MinMaxScaler. Default is StandardScaler.

  • encode_cat_features (bool) – Encode categorical data or text attributes. Default is sklearn.preprocessing.OrdinalEncoder.

  • columns (list, Optional,) – list of columns to compose a dataframe if X is given as an NDArray.

Returns:

X

Return type:

NDArray | Dataframe

Notes

num_indexes and cat_indexes are mainly used when the data X is an np.ndarray of shape (m, nf), where m is the number of instances or examples and nf is the number of attributes or features. selector_ is used for dataframe preprocessing.
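
For a plain ndarray, the role of num_indexes and cat_indexes can be sketched as follows. This is not the watex implementation: the toy array and the order of the steps are assumptions, built from the documented defaults (most_frequent imputation, StandardScaler, OrdinalEncoder).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Mixed-type array: two numerical columns and one categorical column.
X = np.array([
    [1.0, 200.0, "granite"],
    [2.0, np.nan, "schist"],
    [3.0, 180.0, "granite"],
    [np.nan, 220.0, "gneiss"],
], dtype=object)
num_indexes, cat_indexes = [0, 1], [2]

# Numerical block: impute the missing values, then scale.
X_num = X[:, num_indexes].astype(float)
X_num = SimpleImputer(missing_values=np.nan,
                      strategy="most_frequent").fit_transform(X_num)
X_num = StandardScaler().fit_transform(X_num)

# Categorical block: map each category to an ordinal code.
X_cat = OrdinalEncoder().fit_transform(X[:, cat_indexes])

# Reassemble a fully numerical array, ready for PCA.
X_prepared = np.hstack([X_num, X_cat])
```

OrdinalEncoder assigns codes in sorted category order, so 'gneiss', 'granite' and 'schist' become 0, 1 and 2.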

Submodules#