watex.cases package#
The ‘cases’ subpackage implements functions and modules already used to solve real engineering problems, such as flow rate prediction and boosting using base learners and an ensemble paradigm.
The features, processing, modeling and prepare modules provide base step procedures. They can be used for processing and analysis to give a quick picture of what the data looks like and an estimate of model performance.
- class watex.cases.BaseModel(data_fn=None, df=None, **kwargs)[source]#
Bases: object
Base model class. The most interesting and challenging part of modeling is tuning the hyperparameters after designing a composite estimator. Finding the best parameters is a good way to reorganize the created pipeline {transformers + estimators} so that it generalizes well to new data.
- Parameters
data_fn (str) – Path to analysis data file.
df (pd.core.DataFrame) – Dataframe of features for analysis. Must contain the main parameters, including the target, which is a pd.core.Series among the columns of df.
Holds other optional information in kwargs arguments:
Attributes (Type) – Description
auto (bool) – Trigger the composite estimator. If True, an SVC-composite estimator preprocessor is given. Default is False.
pipelines (dict) – Collect your own pipeline for model preprocessor triggering. It should be found automatically.
estimators (Callable) – A given estimator. If None, SVM is auto-selected as the default estimator.
model_score (float/dict) – Model test score. Observe your test model score using your composite estimator for enhancement or your own pipelines.
processor (Callable) – Compose pipelines and estimators for default model scoring as well as the composite estimator enhancement.
Examples
>>> import numpy as np
>>> from watex.bases.modeling import BaseModel
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.compose import make_column_selector
>>> estimator2 = RandomForestClassifier()
>>> modelObj = BaseModel(
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines={
...         'num_column_selector_': make_column_selector(
...             dtype_include=np.number),
...         'cat_column_selector_': make_column_selector(
...             dtype_exclude=np.number),
...         'features_engineering_': PolynomialFeatures(
...             2, include_bias=False),
...         'selectors_': SelectKBest(f_classif, k=2),
...         'encodages_': RobustScaler()
...     },
...     estimator=RandomForestClassifier()
... )
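The "pipeline {transformers + estimators}" idea can be illustrated with plain scikit-learn. The following is only a sketch, not the BaseModel implementation: the toy dataframe, its column names and the choice of OneHotEncoder for the categorical branch are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler
from sklearn.svm import SVC

# Hypothetical toy data: two numerical features and one categorical feature.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "power": rng.normal(50, 10, 60),
    "magnitude": rng.normal(100, 20, 60),
    "geol": rng.choice(["GRANITES", "SCHISTS"], 60),
})
y = (X["power"] + rng.normal(0, 5, 60) > 50).astype(int)

# Numerical branch: scaling, feature engineering, then feature selection.
numeric = make_pipeline(
    RobustScaler(),
    PolynomialFeatures(2, include_bias=False),
    SelectKBest(f_classif, k=2),
)
# Categorical branch (assumed encoder, not taken from the documentation).
categorical = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

preprocessor = make_column_transformer(
    (numeric, make_column_selector(dtype_include=np.number)),
    (categorical, make_column_selector(dtype_exclude=np.number)),
)

# Composite model: preprocessor followed by the estimator.
model = make_pipeline(preprocessor, SVC(C=1, gamma=0.1))
model.fit(X, y)
train_accuracy = model.score(X, y)
```

The step names that make_pipeline generates (for example columntransformer and svc) are the prefixes that appear in hyperparameter grids such as 'svc__C'.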
- property feature_importances_#
Get the bar plot of feature importances. An error is raised if the estimator has no feature_importances_ attribute.
- get_learning_curve(estimator=None, X_train=None, y_train=None, learning_curve_kws=None, **kws)[source]#
Compute the training and validation scores to visualize the learning curve.
- Parameters
estimator – The created model. If None, the default estimator is used.
X_train – pd.core.frame.DataFrame of the selected trainset.
x_test – pd.DataFrame of the selected data for the test set.
y_train – array_like of the selected data for the evaluation set.
y_test – array_like of the selected data for the model test.
val_kws –
validation_curve keyword arguments. If None, the default is:
val_curve_kws = {"param_name":'C', "param_range": np.arange(1,210,10), "cv":4}
- Returns
train_score: float|dict of trainset scores.
val_score: float|dict of validation scores.
- switch: Turn on or off the learning curve or validation curve.
- trigDec: Trigger the decorator.
- N: number of values in the param range for plotting.
- Example
>>> from watex.bases.modeling import BaseModel
>>> processObj = BaseModel(
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.get_learning_curve(
...     switch_plot='on', preprocessor=True)
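For reference, the default keyword arguments listed above map directly onto scikit-learn's validation_curve. The sketch below is independent of watex and uses a synthetic dataset and an SVC chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic data stands in for the real trainset (assumption).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Same defaults as documented for val_curve_kws.
val_curve_kws = {"param_name": "C", "param_range": np.arange(1, 210, 10), "cv": 4}

train_scores, val_scores = validation_curve(SVC(gamma=0.1), X, y, **val_curve_kws)

# One row per value of C, one column per cross-validation fold.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

Plotting train_mean and val_mean against the parameter range gives the kind of curve the method is meant to display.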
- get_model_prediction(estimator=None, X_test=None, y_test=None, **kws)[source]#
Get the model prediction and a quick plot using the decorator.
The decorator holds many keyword arguments to customize the plot. Refer to
watex.utils.decorator.predPlot.
- Parameters
estimator – The created model. If None, the default estimator is used.
x_test – pd.DataFrame of the selected data for the test set.
y_test – array_like of the selected data for the model test.
kws – Additional keyword arguments, which refer to the data_fn, df and pipelines parameters.
switch – Turn on or off the decorator.
- Example
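>>> import numpy as np
>>> from sklearn.compose import make_column_selector
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.svm import SVC
>>> from watex.modeling.sl import Modeling
>>> modelObj = Modeling(
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines={
...         'num_column_selector_': make_column_selector(
...             dtype_include=np.number),
...         'cat_column_selector_': make_column_selector(
...             dtype_exclude=np.number),
...         'features_engineering_': PolynomialFeatures(2, include_bias=False),
...         'selectors_': SelectKBest(f_classif, k=2),
...         'encodages_': RobustScaler()
...     },
...     estimator=SVC(C=1, gamma=0.1))
>>> modelObj.get_model_prediction(estimator=testim, switch='on')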
- property model_#
Get the processor and estimator that make up the composite model.
- property model_score#
Estimate your composite model prediction
- permutation_feature_importance(estimator=None, X_train=None, y_train=None, pfi_kws=None, **kws)[source]#
Evaluate feature importance with tree estimators before and after shuffling the trees.
Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. Refer to https://scikit-learn.org/stable/modules/permutation_importance.html for more details.
- Parameters
estimator – The estimator used to evaluate the importance of features. The default is RandomForestClassifier.
X_train – pd.core.frame.DataFrame of the selected trainset.
y_train – array_like of the selected data for the evaluation set.
n_estimators – Number of estimators composing the tree. The default is 100.
n_repeats – Number of tree shufflings. The default is 10.
pfi_kws – Additional keyword arguments of the permutation_importance callable.
pfi_style –
Type of plot. Can be:
- pfi for permutation feature importance before and after shuffling trees
- dendro for a dendrogram plot.
The default is pfi.
switch – Turn on or off the decorator.
- Example
>>> from watex.bases.modeling import BaseModel
>>> from sklearn.ensemble import AdaBoostClassifier
>>> modelObj = BaseModel()
>>> modelObj.permutation_feature_importance(
...     estimator=AdaBoostClassifier(random_state=7),
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     switch='on', pfi_style='pfi')
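The underlying technique can be tried directly with scikit-learn's permutation_importance. This is a minimal sketch on synthetic data, using the documented defaults (a RandomForestClassifier with 100 estimators and 10 repeats); it does not reproduce the plots produced by the method above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic tabular data (assumption): 5 features, 2 of them informative.
X, y = make_classification(
    n_samples=150, n_features=5, n_informative=2, random_state=7
)

forest = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

# Shuffle each feature n_repeats times and measure the drop in score.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=7)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
```

result.importances holds one row per feature and one column per repeat, which is the raw material for a box plot of importances.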
- property processor#
Get the processor after supplying the pipelines.
- tuning_hyperparameters(estimator=None, hyper_params=None, cv=4, grid_kws=None, **kws)[source]#
Tune the hyperparameters of an existing estimator to evaluate its performance, then boost the model using its best parameters.
- Parameters
estimator – Callable estimator or model to boost.
hyper_params – dict of hyperparameters of the estimator.
cv – Number of cross-validation folds. The default is 4.
grid_kws – dict of other grid search parameters.
- Example
>>> import numpy as np
>>> from watex.modeling.basics import SLModeling
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.svm import SVC
>>> from sklearn.compose import make_column_selector
>>> my_own_pipelines = {
...     'num_column_selector_': make_column_selector(
...         dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(
...         dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(
...         3, include_bias=False),
...     'selectors_': SelectKBest(f_classif, k=3),
...     'encodages_': RobustScaler()
... }
>>> my_estimator = SVC(C=1, gamma=1e-4, random_state=7)
>>> modelObj = SLModeling(
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines=my_own_pipelines,
...     estimator=my_estimator)
>>> hyperparams = {
...     'columntransformer__pipeline-1__polynomialfeatures__degree': np.arange(2, 10),
...     'columntransformer__pipeline-1__selectkbest__k': np.arange(2, 7),
...     'svc__C': [1, 10, 100],
...     'svc__gamma': [1e-1, 1e-2, 1e-3]}
>>> my_compose_estimator_ = modelObj.model_
>>> modelObj.tuning_hyperparameters(
...     estimator=my_compose_estimator_,
...     hyper_params=hyperparams,
...     search='rand')
>>> modelObj.best_params_
>>> modelObj.best_score_
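The 'svc__C' style keys in the example follow scikit-learn's step-name convention for pipelines. The sketch below shows the same idea with GridSearchCV on a deliberately small pipeline; the synthetic data and the two-step pipeline are assumptions, and the watex method may wrap the search differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic data stands in for the prepared trainset (assumption).
X, y = make_classification(n_samples=120, n_features=6, random_state=7)

# make_pipeline names each step after its lowercased class name: 'svc' here.
model = make_pipeline(RobustScaler(), SVC(random_state=7))

hyper_params = {
    "svc__C": [1, 10, 100],
    "svc__gamma": [1e-1, 1e-2, 1e-3],
}

# cv=4 matches the documented default of tuning_hyperparameters.
search = GridSearchCV(model, hyper_params, cv=4).fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

Replacing GridSearchCV with RandomizedSearchCV samples the grid instead of exhausting it, which is useful when the grid is large.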
- class watex.cases.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)[source]#
Bases: object
Default data preparation steps.
By default, BaseSteps is used to prepare the DC 1D-resistivity geoelectrical features before prediction. The predicted target was the flow rate.
- Parameters
- tname: str
A target name or label. In supervised learning the target name is considered as the reference name of y or the label variable.
- return_all: bool
Return the whole stratified trainset. When the data is too large, it can be set to False to take a sample of the stratified trainset to evaluate your model.
- drop_features: list
List of useless features to drop in order to clean the dataset.
- categorizefeature_props: list
List of properties used to categorize a particular feature in the dataset. It is composed of the name of the feature whose numerical values are converted into categorical values, then the value ranges to categorize, and finally the categorical names of those value ranges. For instance:
categorizefeature_props= [ ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3'])) ]
Please refer to watex.utils.transformers.CategorizeFeatures for further details.
- hash: bool
If True, it ensures that the data remains consistent across multiple runs, even if the dataset is refreshed. The split by id is used to hash the training and test sets when the data is split.
- add_attributes: list, optional
Experiment with combinations of numerical attributes. List of features to combine. Decide to combine features to create a new feature value from the operator parameter. By default, the combination is the ratio of the given attributes/numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature ‘lwi’ by ‘ohmS’.
- operator: str, default=’/’
Type of operation to perform when combining features. Can be one of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].
- attribute_indexes: list of int
List of indexes of the attributes to combine. For instance:
attribute_indexes = [1, 0] # or [4, 3]
The default operator is division. These are the indexes of each attribute/feature used in the combination. A user warning is raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.
- imputer_strategy: str
Type of strategy used to replace the missing values. Refer to SimpleImputer. Default is median.
- missing_values: float
The value to be replaced. Default is np.nan.
- pipeline: callable
Pipeline used to prepare the dataset. Default is defaultPipeline().
- test_size: float, default=.2, i.e. 20% of (X, y)
The ratio used to split the data into training (X, y) and testing (Xt, yt) sets respectively.
- random_state: int, RandomState instance or None, default=42
Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
- verbose: int, default is 0
Controls the level of verbosity. Higher values lead to more messages.
- data: Filepath or Dataframe of shape (M, N)
Data is passed here as an additional keyword argument, only to build X and y with the stratifydata method. It is a pandas.DataFrame containing M samples and N features.
Notes
Data preparation is composed of two steps. The first step includes:
- Data cleaning: fixing or removing outliers and replacing the missing values with other values, using the imputer_strategy parameter, rather than getting rid of the affected instances (examples) or the whole feature.
- Handling of text and categorical features: converting the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.
- Data stratification, which is done before separating the dataset into a trainset and a test set. Stratification divides the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a trainset (80%) and a test set (20%).
The second step consists of feature selection, feature engineering, encoding and data scaling using the pipeline passed via the pipeline parameter. If no pipeline is given, the default pipeline is triggered. Feature engineering aggregates features by trying combinations of attributes that yield promising new features, using the attribute_indexes parameter after setting the add_attributes argument to True. The final transformation step is feature scaling. The type of scaling used by default in this module is standardization, because it is less affected by outliers than min-max scaling. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (which deals with numerical features) and the categorical pipeline (which deals with categorical features). Both pipelines are combined and applied to the trainset and later to the test set.
Examples
../datasets/_p.py
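The two steps described in the notes can be approximated with scikit-learn alone. This sketch is not the BaseSteps implementation: the toy dataframe, the use of train_test_split with stratify, and the OrdinalEncoder for categorical labels are assumptions chosen to mirror the description (median imputation, labels converted to numbers, standardization, 80/20 split).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical dataset with a categorized 'flow' target and some missing values.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "power": rng.normal(50, 10, 100),
    "magnitude": rng.normal(100, 20, 100),
    "geol": rng.choice(["GRANITES", "SCHISTS"], 100),
    "flow": np.repeat(["FR0", "FR1", "FR2", "FR3"], 25),
})
data.loc[::10, "power"] = np.nan

# Step 1: stratify on the target, then split into trainset (80%) and test set (20%).
train, test = train_test_split(
    data, test_size=0.2, stratify=data["flow"], random_state=42
)

# Step 2: numerical and categorical pipelines combined into a full pipeline.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical = make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder())
pipeline = make_column_transformer(
    (numeric, make_column_selector(dtype_include=np.number)),
    (categorical, make_column_selector(dtype_exclude=np.number)),
)

# Fit on the trainset only, then apply the same transformation to the test set.
X_train = pipeline.fit_transform(train.drop(columns="flow"))
X_test = pipeline.transform(test.drop(columns="flow"))
```

Fitting the pipeline on the trainset and only transforming the test set keeps test-set statistics out of the imputation and scaling.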
- property X_#
keep the stratified testset X
- property data#
- fit(X, y=None)[source]#
Preparing steps.
- Parameters
X (ndarray, M x N matrix where M = m samples and N = n features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M,), M = m samples) – Train target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns
self – returns self for easy method chaining.
- Return type
BaseSteps instance
- fit_transform(X=None, y=None, on_testset=False)[source]#
Apply fit and transform at once.
- Parameters
X (ndarray, pd.DataFrame) – X or dataframe X.
y (array_like) – y label or target.
on_testset (str) – Indicates whether the dataframe is being evaluated on the test set or the trainset.
- Returns
X_prepared – Data prepared after transformation.
y_prepared – Label prepared after transformation.
- stratifyFolds(data)[source]#
Stratify the dataset and return the trainset. See watex.bases.transformers.StratifiedWithCategoryAdder for more details.
- stratifydata(data=None)[source]#
Split and stratify the data, and return the stratified training and test sets.
- property tname#
- transform(X=None, y=None, on_testset=False)[source]#
Transform the data by applying the pipeline transformation.
- Parameters
X (ndarray, pd.DataFrame) – X or dataframe X.
y (array_like) – y label or target values.
on_testset (str) – Indicates whether the dataframe is being evaluated on the test set or the trainset.
- Returns
X_prepared – Data prepared after transformation.
y_prepared – Label prepared after transformation.
- property y_#
keep the stratified label y
- class watex.cases.FeatureInspection(tname='flow', mapflow=True, sanitize=False, flow_classes=[0.0, 1.0, 3.0], set_index=False, col_name=None, **kws)[source]#
Bases: object
Summarizes the flow features.
It deals with the categorization of data features. When numerical values are provided, standard qualitative or quantitative analysis is performed.
- Parameters
data (str or pd.core.DataFrame) – Path-like object or pandas DataFrame. Must contain the main parameters including the target.
tname (str) – The target name for prediction purposes. Here, for groundwater exploration, the target is named flow.
flow_classes (list or array_like) – The way to classify the flow. Provide the main specific values to convert the categorical trends to numerical values. Different projects target different flow rates; the values can be specified for village hydraulic, improved village hydraulic or urban hydraulic.
drop_columns (list) – Items to drop. To analyse the data, some specific columns can be dropped so that they do not corrupt the analysis. In a formal dataframe collected straightforwardly from GeoFeatures, the default drop_columns refer to the coordinate positions: [‘east’, ‘north’].
mapflow (bool) –
If set to True, the values in the target column are mapped to categorical values. Commonly the flow rate values are given as a trend of numerical values. For classification purposes, the flow rate must be converted to categorical values, which mainly refer to the type of hydraulic system. The type of hydraulic system is in turn mostly tied to the number of inhabitants living in a specific area. For instance, flow classes can be ranged as follows:
FR = 0 is for dry boreholes
0 < FR ≤ 3m3/h for village hydraulic (≤2000 inhabitants)
3 < FR ≤ 6m3/h for improved village hydraulic (>2000-20 000 inhabitants)
6 < FR ≤ 10m3/h for urban hydraulic (>200 000 inhabitants).
Note that this flow range is not exhaustive and can be modified according to the type of hydraulic required on the project.
set_index (bool) – Consider a column as the dataframe index. If set to True, please provide the col_name; otherwise the id column is used as the default.
sanitize – Polish the data and remove inconsistent columns that do not refer to the predicting features. For instance, it can change the French name of water, eau, to water, which is related to the value of the water inflow feature lwi. This can be useful when the data is given as a path-like object and the features are not described correctly in the case of groundwater. Default is False.
Examples
>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection().fit(data)
>>> fobj.data_.columns
Index(['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type',
       'sfi', 'ohmS', 'lwi', 'geol', 'flow'],
      dtype='object')
- property cache#
Generate the cache df_ of all eliminated features, kept in a new pd.core.frame.DataFrame.
- property data#
Check the extension of the provided feature file. Useful for selecting how the pd.DataFrame is constructed.
- fit(data)[source]#
The main goal of this method is to fit and classify the different flow classes in the dataset. By default, four (04) flow classes are considered, according to the reference below.
- Parameters
data (str or pd.core.DataFrame) – Path-like object or pandas DataFrame. Must contain the main parameters including the tname.
- Returns
object
- Return type
FeatureInspection object
Examples
>>> from watex.bases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection()
>>> fobj.fit(data)
>>> fobj.data.iloc[1:3, :]
   num name  power  magnitude  ...         ohmS        lwi      geol flow
1    2   b2   70.0      142.0  ...  1135.551531  21.406531  GRANITES  FR1
2    3   b3   80.0       87.0  ...   767.562500   0.000000  GRANITES  FR1
Notes
The referenced paper mentions four types of hydraulic system, according to the population demand and the number of inhabitants. The hydraulic systems are defined as:
FR = 0 is for dry boreholes
0 < FR ≤ 3m3/h for village hydraulic (≤2000 inhabitants)
3 < FR ≤ 6m3/h for improved village hydraulic (>2000-20 000 inhabitants)
6 < FR ≤ 10m3/h for urban hydraulic (>200 000 inhabitants).
The flow classes can be modified according to the type of hydraulic proposed for the project.
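The class boundaries listed above can be expressed as bins. The following is a sketch with pandas, not the FeatureInspection implementation: the sample flow values are invented, the labels FR0 to FR3 are borrowed from the examples in this page, and leaving the last bin open above 10 m3/h is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical flow rates in m3/h.
flow = pd.Series([0.0, 1.5, 3.0, 4.2, 8.0], name="flow")

# Right-closed intervals: FR = 0, 0 < FR <= 3, 3 < FR <= 6, FR > 6.
# The upper bound is left open here (assumption); the notes stop at 10 m3/h.
bins = [-np.inf, 0.0, 3.0, 6.0, np.inf]
labels = ["FR0", "FR1", "FR2", "FR3"]

flow_class = pd.cut(flow, bins=bins, labels=labels, right=True)
```

Changing the bins list is all that is needed to adapt the classes to another type of hydraulic project.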
References
- 1
CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.
- property flow_classes#
- writedf(df=None, refout=None, to=None, savepath=None, modname='_anEX_', reset_index=False)[source]#
Write the analysis df.
Refer to watex.decorators.exportdf() for more details about the arguments refout, to, savepath, modname and reset_index.
- Example
>>> from watex.analysis.bases.features import FeatureInspection
>>> slObj = FeatureInspection(
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     set_index=True)
>>> slObj.writedf()
- class watex.cases.GeoFeatures(**kws)[source]#
Bases: object
Features class. Deals with Electrical Resistivity Profile (ERP), Vertical Electrical Sounding (VES), geological (Geol) and borehole (Boreh) data. Sets all feature values for the different investigation sites. The Features class is composed of:
erp class obtained from watex.methods.erp.ERP_collection
geol obtained from watex.geology.geology.Geology
boreh obtained from watex.geology.geology.Borehole
- Parameters
features_fn (str, path-like) – Path to the geoelectrical features file.
ErpColObjs (object) – Collection object from ERP survey lines.
vesObjs (object) – Collection object from vertical electrical sounding (VES) curves.
geoObjs (object) – Collection object from the geol class. See watex.geology.geology.Geology.
boreholeObjs (object) – Collection of boreholes of all investigation sites. Refer to watex.geology.geology.Borehole.
Holds other optional information in kwargs arguments:
Attributes (Type) – Description
df (pd.core.DataFrame) – Container of all features composed of featureLabels.
site_ids (array_like) – ID of each survey location.
site_names (array_like) – Survey location names.
gFname (str) – Filename of features_fn.
ErpColObjs (obj) – ERP erp class object.
vesObjs (obj) – VES ves class object.
geoObjs (obj) – Geology geol class object.
borehObjs (obj) – Borehole boreh class object.
Notes
Be sure not to miss any coordinates files. Each selected anomaly should have a borehole drilled at that place for supervised learning. That means each selected anomaly, referenced by location coordinates and id on erp, must have its own ves, geol and boreh data. For further details about the class objects, please refer to the documentation of the classes mentioned above.
Examples
>>> from watex.cases.features import GeoFeatures
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> featObj = GeoFeatures().fit(data)
>>> featObj.id_
array(['e0000001', 'e0000002', 'e0000003', 'e0000004', 'e0000005',
       'e0000006', 'e0000007'], dtype='<U8')
>>> featObj.site_names_[:7]
array(['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'], dtype=object)
- static controlObjId(erpObjID, boreObjID, geolObjID, vesObjsID)[source]#
Check the object ids to verify that each anomaly selected from erp matches its ves, geol and borehole data.
- Parameters
erpObjID (str) – ERP object ID. Refer to watex.methods.erp.ERP_collection.
boreObjID (str) – Borehole ID. Refer to watex.geology.drilling.Borehole.
geolObjID – Geology ID. Refer to watex.geology.geology.Geology.
vesObjsID – VES object ID. Refer to watex.methods.electrical.VerticalSounding.
- Returns
New survey ID
- property data#
Check the extension of the provided feature file. Useful for selecting how the pd.DataFrame is constructed.
- exportdf(refout=None, to=None, savepath=None, **kwargs)[source]#
Export the dataframe df to a file, which can be an Excel sheet or a ‘.json’ file. For more details about the writef decorator, see watex.decorators.writef().
- Parameters
refout – Output filename. If not given, it is created from the export date.
to (str) – Export type. Can be .xlsx, .csv, .json or others.
savepath – Path where the refout file is saved. If not given, it is created.
- Returns
ndf: new dataframe built from the df attribute (~.geofeatures.Features.df).
- Example
>>> from watex.bases.features import Features
>>> featObj = Features(
...     features_fn='data/geo_fdata/BagoueDataset2.xlsx')
>>> featObj.exportdf(refout='ybro', to='csv')
- featureLabels_ = ['id', 'east', 'north', 'power', 'magnitude', 'shape', 'type', 'sfi', 'ohmS', 'lwi', 'geol', 'flow']#
- fit(data=None, geoObj=None, erpObj=None, vesObj=None, boreholeObj=None, **kws)[source]#
Read the data and populate the class attributes. Please refer to ~.core.geofeatures.Features for details about the arguments.
- from_csv(erp_fn)[source]#
Method created essentially to read a CSV file and collect the horizontal distance values and the apparent resistivity values, then send them to the class for computation purposes.
- Parameters
erp_fn (str) – path_like string of CSV file
- Returns
horizontal distance in meters
- Return type
np.array of all data
- from_json(json_fn, indent=4)[source]#
Collect data from JSON files and retrieve the most insightful contents.
- Parameters
json_fn (str) – json file
- class watex.cases.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#
Bases: object
Base preprocessing class.
Gives a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.
- Parameters
tname (str) – A target name or label. In supervised learning the target name is considered as the reference name of y or the label variable.
drop_features (list or str, optional) – List of features that are useless for prediction, or list of column names to drop.
random_state (int, default is 42) – The state of data shuffling. The default is 42.
default_estimator (callable, F or scikit-learn estimator) –
The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:
- ‘dtc’: DecisionTreeClassifier
- ‘svc’: Support Vector Classifier
- ‘sdg’: SGDClassifier
- ‘knn’: KNeighborsClassifier
- ‘rdf’: RandomForestClassifier
- ‘ada’: AdaBoostClassifier
- ‘vtc’: VotingClassifier
- ‘bag’: BaggingClassifier
- ‘stc’: StackingClassifier
If no estimator is given, the default is svm or svc.
test_size (float) – The test set size. Must be less than 1. The default test size is 0.2, i.e. 20% of the dataset.
verbose (int, default is 0) – Controls the level of verbosity. Higher values lead to more messages.
- X#
Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Type
ndarray of shape (M x N), M = m samples and N = n features
- y#
Train target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type
array-like of shape (M,), M = m samples
- Xt#
Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- Type
ndarray (M x N matrix where M = m samples and N = n features)
- yt#
Test target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type
array-like of shape (M,), M = m samples
- data#
Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
- Type
str, filepath_or_buffer or pandas.core.DataFrame
- pipe_#
The pipeline can be built on your own with different transformers. For base model prediction, the default pipeline can be used. Call get_default_pipe to get the transformation list and steps.
- Type
Callable, preprocessor object from sklearn.pipeline
- estimator#
Callable estimator method used to fit the model:
estimators= SGDClassifier(random_state=13)
- Type
Callable, F or sklearn.metaestimator
- model#
A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
- Type
callable, always as a function
- cat_features_#
List of categorical features. If not given, it is found automatically.
- Type
list or str, Optional
- num_features_#
List of numerical features. If not given, it is found automatically.
- Type
list of str, Optional
- model#
Use the predefined pipelines, i.e. a Pipeline that you can build on your own with a different composite estimator. If model is None, the default model from the default preprocessor and estimator is used.
- Type
Callable, {preprocessor + estimator}
Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> # pc is a fitted Preprocessing instance (see the baseEvaluation examples)
>>> estimators = dict(
...     RandomForestClassifier=RandomForestClassifier(
...         n_estimators=200, random_state=0))
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt)  # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric=True)
>>> pc.base_score_
0.72586369
- baseEvaluation(model=None, eval_metric=False, **kws)[source]#
Dummy baseline model built from the preprocessing pipeline and turned into a model by providing an estimator.
- Parameters
model (Callable, {preprocessor + estimator}) – A model is a scikit-learn estimator or a composite model built from a Pipeline. If model is None, the default model from the default preprocessor and estimator is used. model can also be a dict of multiple estimators. In that case the evaluation of each estimator is stored in a dictionary where each key is the estimator name.
eval_metric (bool) – If set to True, the confusion matrix and classification report scores are evaluated, assuming that the supervised learning task is a classification problem. Default is False.
scorer (str, Callable) –
A scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:
”classification_report” -> for classification_report,
”precision_recall” -> for precision_recall_curve,
”confusion_matrix” -> for confusion_matrix,
”precision” -> for precision_score,
”accuracy” -> for accuracy_score,
”mse” -> for mean_squared_error,
”recall” -> for recall_score,
”auc” -> for roc_auc_score,
”roc” -> for roc_curve,
”f1” -> for f1_score.
Any other string prefix raises an error.
kws (dict) – Additional keyword arguments passed to the scikit-learn metric function.
- Returns
self.base_score_
- Return type
base score after predicting
Notes
If no estimator is given, the default estimator is svm. Otherwise, provide the prefix to select the convenient estimator from the default dict default_estimator. Get the default dict by calling <instance>._getdestimators().
Examples
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
(1) -> default estimator
>>> pc.baseEvaluation(eval_metric=True)
0.47126436781609193
(2) -> multiple estimators
>>> from watex.exlib.sklearn import RandomForestClassifier, SGDClassifier, SimpleImputer
>>> estimators = {
...     'RandomForestClassifier': RandomForestClassifier(
...         n_estimators=200, random_state=0),
...     'SDGC': SGDClassifier(random_state=0)}
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt)  # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric=True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2, 1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2, 2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1], dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989,
 'SGDClassifier': 0.14942528735632185}
- property data#
- property features#
Collect the list of features
- fit(X=None, y=None, **fit_params)[source]#
Read the whole dataset, encode the categorical features and populate the class attributes.
If X and y are provided, they are considered as the feature set and the target respectively. They are then split into a training set and a test set.
- Parameters
X (N-d array, shape (M, N)) – The feature array composed of M samples and N feature columns. The feature set excludes the target y.
y (array-like, shape (M,)) – The target, composed of M examples in supervised learning.
data (DataFrame of shape (M, N), pandas.DataFrame) – Dataframe containing M samples and N features, including the target y. If data is given, it is not necessary to provide X and y. When the target name tname is specified, the target is removed from the data.
split_X_y (bool, default True) – Split the dataset into a training set {X, y} and a test set {Xt, yt}. Otherwise X and y are considered as the training set.
- Returns
``self``
- Return type
Preprocessing instance for easy method chaining.
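The split into {X, y} and {Xt, yt} can be illustrated with scikit-learn's train_test_split. This is a minimal sketch, assuming a 20% test size and random_state=42 (the defaults documented for the class) and a synthetic 431-row dataset standing in for the Bagoue data; whether fit calls train_test_split internally is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 431 samples, 5 features, 4 classes.
# The row count is an assumption chosen for illustration.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(431, 5))
y_all = rng.integers(0, 4, size=431)

# Hold out 20% of the samples as the test set {Xt, yt}.
X, Xt, y, yt = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
sizes = (len(X), len(y), len(Xt), len(yt))  # (344, 344, 87, 87)
```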
Examples
>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(drop_features=['lwi', 'num', 'name']).fit(data=data)
>>> len(pc.X), len(pc.y), len(pc.Xt), len(pc.yt)
... (344, 344, 87, 87) # trainset (X, y) and testset (Xt, yt)
- get_default_pipe()[source]#
Make a default pipe to preprocess the data.
Create a preprocessor by assembling multiple transformers. The default pipeline is not exhaustive, so to keep full control of the data it is recommended to provide your own, more robust preprocessor.
The method returns self.pipe_, a callable preprocessor pipeline built from sklearn.pipeline.Pipeline. The default transformers are composed of:
sklearn.pipeline.make_pipeline() for pipeline creation.
sklearn.preprocessing.OneHotEncoder() for categorical features encoding.
sklearn.preprocessing.PolynomialFeatures() for features engineering.
sklearn.preprocessing.RobustScaler() for data scaling.
sklearn.compose.make_column_transformer() for data transformation.
sklearn.compose.make_column_selector() for features composing.
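The transformers listed above can be assembled roughly as follows. This is a sketch only: the polynomial degree, the step order and the toy column names are assumptions, and the pipe actually returned by get_default_pipe may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

# Numerical branch: feature engineering followed by robust scaling.
num_pipe = make_pipeline(
    PolynomialFeatures(2, include_bias=False), RobustScaler()
)
# Categorical branch: one-hot encoding.
cat_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Route columns to each branch by dtype.
preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include=np.number)),
    (cat_pipe, make_column_selector(dtype_exclude=np.number)),
)

# Toy frame with two numerical columns and one categorical column.
toy = pd.DataFrame({
    "power": [10.0, 20.0, 35.0, 50.0],
    "magnitude": [1.0, 3.0, 2.0, 5.0],
    "shape": ["V", "W", "V", "W"],
})
Xp = preprocessor.fit_transform(toy)
# 2 numerical columns -> 5 polynomial terms, 1 categorical -> 2 one-hot columns.
```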
- property inspect#
Inspect data and trigger plot after checking the data entry. Raises NotFittedError if self is not fitted yet.
- makeModel(pipe=None, estimator=None)[source]#
Assemble the pipe and the estimator to create the model.
The model is composed of the transformers and the estimator. If either is set to None, the default pipe or estimator is used, which might not be the one expected. Providing both a pipe and an estimator is therefore recommended.
- Parameters
pipe (Callable, pipeline or preprocessor) – Callable pipeline. It can be your own pipeline built with different transformers. Refer to sklearn.pipeline.Pipeline for further details. Call get_default_pipe to get the default pipe.
estimator (Callable, F or {sklearn estimator}) – Callable estimator used to fit the model:
estimators = SGDClassifier(random_state=13)
Some predefined estimators can be fetched by providing the prefix as a key of the default estimator dict. For instance, to fetch the DecisionTreeClassifier estimator:
>>> from watex.cases.processing import Preprocessing
>>> Preprocessing._getdestimators()['dtc']
... DecisionTreeClassifier(max_depth=100, random_state=42)
- Returns
`model_`
- Return type
Callable, {preprocessor + estimator }
Examples
We can get the default preprocessor by merely calling:
>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> pc.makeModel()  # use default model and preprocessor
>>> pc.model_
(2) -> Or build your own preprocessor object using the example below:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
...     steps=[("imputer", SimpleImputer(strategy="median")),
...            ("scaler", StandardScaler())]
... )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         ("num", numeric_transformer, numeric_features),
...         ("cat", categorical_transformer, categorical_features),
...     ])
>>> pc.makeModel(pipe=preprocessor, estimator=LogisticRegression())
>>> pc.model_
- class watex.cases.Processing(pipeline=None, estimator=None, **kws)[source]#
Bases: Preprocessing
Processing class for managing baseline model evaluation and learning.
Manages the validation curves after slightly tuning an estimator's hyperparameters.
Processing is useful before the modeling step. To process data, a default implementation is provided for building the data preprocessor. It consists of creating a model pipeline from different transformers. If no pipeline is set and auto is set to True, a default pipeline is created through the preprocessor to run the base model evaluation. Here a preprocessor is a set of transformers + estimators.
- Parameters
auto (bool, default False) – Trigger the composite estimator. If True, a composite preprocessor is built and used for base model evaluation.
pipeline (Callable, F or dict of callable F) – Encapsulated preprocessing steps. If not supplied, a default pipe is used, as when auto is set to True.
estimator (Callable) – An object which manages the estimation and decoding of a model. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
tname (str) – A target name or label. In supervised learning the target name is considered as the reference name of the y or label variable.
drop_features (list or str, Optional) – Features that are useless for prediction, given as a list of column names to drop.
random_state (int, default 42) – The state of data shuffling.
default_estimator (callable, F or scikit-learn estimator) – The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:
- 'dtc': DecisionTreeClassifier
- 'svc': Support Vector Classifier
- 'sdg': SGDClassifier
- 'knn': KNeighborsClassifier
- 'rdf': RandomForestClassifier
- 'ada': AdaBoostClassifier
- 'vtc': VotingClassifier
- 'bag': BaggingClassifier
- 'stc': StackingClassifier
If estimator is not given, the default is svm or svc.
test_size (float) – The test set size as a fraction of the dataset. Must be less than 1. The default is 0.2, i.e. 20% of the dataset.
verbose (int, default 0) – Control the level of verbosity. Higher values lead to more messages.
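The prefix lookup described for default_estimator can be sketched as a small registry. DEFAULT_ESTIMATORS and pick_estimator below are hypothetical names used for illustration only; apart from the DecisionTreeClassifier settings shown elsewhere on this page, the hyperparameters are assumptions, and the 'vtc' and 'stc' entries are omitted because they need sub-estimators.

```python
from sklearn.base import clone
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry (not the watex cache): prefix -> estimator.
DEFAULT_ESTIMATORS = {
    "dtc": DecisionTreeClassifier(max_depth=100, random_state=42),
    "svc": SVC(random_state=42),
    "sdg": SGDClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "rdf": RandomForestClassifier(random_state=42),
    "ada": AdaBoostClassifier(random_state=42),
    "bag": BaggingClassifier(random_state=42),
}


def pick_estimator(prefix=None):
    """Return a fresh copy of the estimator registered under `prefix`."""
    key = "svc" if prefix is None else str(prefix).lower()
    if key == "svm":  # 'svm' and 'svc' are treated as the same default
        key = "svc"
    if key not in DEFAULT_ESTIMATORS:
        raise ValueError(f"Unknown estimator prefix {prefix!r}")
    # clone() avoids sharing fitted state between callers.
    return clone(DEFAULT_ESTIMATORS[key])


default = pick_estimator()
tree = pick_estimator("dtc")
```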
- X#
Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Type
Ndarray of shape (M x N), where \(M\) is the number of samples and \(N\) the number of features
- y#
Train target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type
array-like of shape (M, ), where \(M\) is the number of samples
- Xt#
Shorthand for "test set"; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- Type
Ndarray of shape (M x N), where \(M\) is the number of samples and \(N\) the number of features
- yt#
Test target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Type
array-like of shape (M, ), where \(M\) is the number of samples
- data#
Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
- Type
str, filepath_or_buffer or pandas.core.DataFrame
- pipe_#
Preprocessing pipeline. You can build your own pipeline with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.
- Type
Callable, preprocessor object from sklearn.pipeline
- estimator#
Callable estimator method to fit the model:
estimators= SGDClassifier(random_state=13)
- Type
Callable, F or sklearn.metaestimator
- model#
A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator's random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
- Type
callable, always as a function
- cat_features_#
List of categorical features. If not given, it is found automatically.
- Type
list or str, Optional
- num_features_#
List of numerical features. If not given, it is found automatically.
- Type
list of str, Optional
- model#
The model built from the predefined pipelines, i.e. a Pipeline that you can build on your own with a different composite estimator. If model is None, the default model built from the default preprocessor and estimator is used.
- Type
Callable, {preprocessor + estimator}
- model_score_#
Model test score. Observe the test score of your model obtained with your composite estimator, for enhancement.
- Type
float/dict
- model_prediction_#
Model test prediction. Observe the test prediction of your model as well as the composite estimator enhancement.
- Type
array_like
- preprocessor_#
Composes the pipelines and estimators for default model scoring.
- Type
Callable, F
Examples
>>> import numpy as np
>>> from watex.cases.processing import Processing
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import (StandardScaler, RandomForestClassifier,
...     make_column_selector, PolynomialFeatures, SelectKBest, f_classif)
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> my_own_pipeline = {
...     'num_column_selector_': make_column_selector(dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(3, include_bias=True),
...     'selectors_': SelectKBest(f_classif, k=4),
...     'encodages_': StandardScaler()
... }
>>> my_estimator = {
...     'RandomForestClassifier': RandomForestClassifier(
...         n_estimators=200, random_state=0)
... }
>>> processObj = Processing(tname='flow',
...     drop_features=['lwi', 'name', 'num'],
...     pipeline=my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data)
>>> processObj.baseEvaluation(eval_metric=True)
... 0.4942528735632184 # score is an ensemble score for both model
- property auto#
Trigger the composite pipeline building and create a default composite model estimator, CE-SVC.
- property estimator#
Get your estimator or the existing default estimator.
- get_validation_curve(val_params=None, switch_plot='off', preprocess_step=False, train_pkws=None, val_pkws=None, **kws)[source]#
Compute the validation score and plot the validation curve if the switch argument of the decorator is turned on. If the validation keyword arguments val_params do not contain a param_range key, the default param_range is the one of the decorator.
- Parameters
val_params (dict) – validation_curve keyword arguments. If None, the default is:
val_params = {"param_name":'C', "param_range": np.arange(1,210,10), "cv":4}
switch_plot (str, default 'off') – Visualize the validation plot when set to 'on'.
preprocess_step (bool, default False) – Trigger the default preprocessing step.
train_pkws (dict) – Keyword arguments passed to matplotlib line/scatter plots for the training curve.
val_pkws (dict) – Keyword arguments passed to matplotlib line/scatter plots for the validation curve.
- Returns
- `train_score` (float|dict of training set score)
- `val_score` (float|dict of validation score)
- `switch` (Turn on or off the validation plot.)
- `kk` (the validation param_range for plot.)
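The default val_params map directly onto scikit-learn's validation_curve. The sketch below shows that call on synthetic data with a plain SVC; the dataset and the absence of a preprocessing step are assumptions, and the method's own plotting decorator is not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the fitted dataset.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Default validation parameters quoted in the documentation above.
val_params = {"param_name": "C", "param_range": np.arange(1, 210, 10), "cv": 4}

# One row per value of C, one column per cross-validation fold.
train_scores, val_scores = validation_curve(SVC(), X, y, **val_params)

# Average over folds to get one training and one validation score per C.
train_score = train_scores.mean(axis=1)
val_score = val_scores.mean(axis=1)
```

Plotting train_score and val_score against val_params["param_range"] gives the validation curve.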
Examples
>>> from watex.cases.processing import Processing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> processObj = Processing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> processObj.fit(data=data)
>>> processObj.get_validation_curve(switch_plot='on', preprocess_step=True)
- property model#
Combine the preprocessor and the estimator into a single model.
- property model_prediction#
Get the model prediction once the composite estimator is designed.
- property model_score#
Get the composite estimator score.
- property preprocessor#
Preprocessor for the composite estimator design.
- property processing_model#
Get the default composite model
- quick_estimation(estimator_name=None, default_estimator=False)[source]#
Quickly run the model without any processing. If no estimator is provided, the SVC estimator is used.
- Parameters
estimators – Callable estimator. If None, a svc is used to quickly estimate the prediction.
random_state – The state of data shuffling. The default is 7.
- Example
>>> from sklearn.tree import DecisionTreeClassifier
>>> from watex.cases.processing import Processing
>>> processObj = Processing(data='data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator=DecisionTreeClassifier(
...     max_depth=100, random_state=13))
>>> processObj.model_score
>>> processObj.model_prediction
- watex.cases.base_transform(X, n_components=0.95, attr_names=None, attr_indexes=None, operator=None, view=False, **kws)[source]#
Transform X using PCA and plot the variance ratio by experimenting with attribute combinations.
Create new attributes using feature indexes or a literal string operator, and prepare the data for the PCA variance plot.
- Parameters
X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (float or int) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None, the ratio of variance to preserve is 95%.
attr_names (list of str, optional) – List of features to combine. The new feature values are computed with the operator parameter. By default, the combination is the ratio of the given numerical attributes. For instance, attr_names=['lwi', 'ohmS'] divides the feature 'lwi' by 'ohmS'.
attr_indexes (list of int) – Indexes of the features to combine. A user warning is raised if an index does not match the dataframe or array columns.
operator (str, default '/') – Type of operation to perform when combining features. Can be one of ['/', '+', '-', '*', '%'].
- Returns
X (ndarray or pd.DataFrame)
New array or dataframe with the new combined attributes.
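The two steps described above, combining attributes and then keeping enough principal components to preserve 95% of the variance, can be sketched as follows. combine_attributes and OPERATORS are hypothetical helpers on synthetic data, not the base_transform internals; the scaling step before PCA is also an assumption.

```python
import operator as op

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical mapping from the literal operator string to a Python function.
OPERATORS = {"/": op.truediv, "+": op.add, "-": op.sub, "*": op.mul, "%": op.mod}


def combine_attributes(df, attr_names, operator="/"):
    """Append a new column built from two existing numerical columns."""
    left, right = attr_names
    out = df.copy()
    out[f"{left}{operator}{right}"] = OPERATORS[operator](df[left], df[right])
    return out


# Synthetic positive features; the column names mimic those used on this page.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(1.0, 10.0, size=(50, 4)),
    columns=["lwi", "ohmS", "sfi", "power"],
)

# Default combination: ratio of the two attributes.
df_new = combine_attributes(df, ["lwi", "ohmS"], "/")

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that ratio.
pca = PCA(n_components=0.95).fit(StandardScaler().fit_transform(df_new))
kept_variance = pca.explained_variance_ratio_.sum()
```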
Examples
>>> from watex.view.mlplot import MLPlots
>>> from watex.datasets import fetch_data
>>> from watex.analysis import pcaVarianceRatio
>>> plot_kws = {'lc': (.9, 0., .8),
...     'lw': 3.,           # line width
...     'font_size': 7.,
...     'show_grid': True,  # visualize grid
...     'galpha': 0.2,      # grid alpha
...     'glw': .5,          # grid line width
...     'gwhich': 'major',  # minor ticks
...     # 'fs': 3.,         # coeff to manage font_size
... }
>>> X, _ = fetch_data('Bagoue data analysis')
>>> mlObj = MLPlots(**plot_kws)
>>> pcaVarianceRatio(mlObj, X, plot_var_ratio=True)
- watex.cases.default_pipeline(X, num_attributes, cat_attributes, y=None, label_encoding='LabelEncoder', **kws)[source]#
Default pipeline used for preprocessing the Bagoue dataset.
The pipeline can be improved to achieve better results.
- Parameters
X (ndarray, pd.DataFrame) – Feature array or dataframe X.
y (array_like) – Label or target.
num_attributes (list) – Numerical attributes.
cat_attributes (list) – Categorical attributes.
label_encoding (str) – Type of encoding used to encode the label. Default is LabelEncoder but can be LabelBinarizer.
- Returns
- `num_pipeline` (Pipeline to process numerical features)
- `cat_pipeline` (Pipeline to process categorical features.)
- `full_pipeline` (Full pipeline as the union of the two pipelines)
- `y` (Encoded label if not None.)
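The difference between the two label encodings can be seen on a short target vector. The class names below are placeholders chosen for illustration, not values taken from the dataset.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Placeholder class labels (assumption, for illustration only).
y = ["FR1", "FR0", "FR2", "FR1"]

# LabelEncoder maps each class to an integer, in sorted class order.
y_int = LabelEncoder().fit_transform(y)

# LabelBinarizer produces one indicator column per class.
y_onehot = LabelBinarizer().fit_transform(y)
```

Here y_int is a 1-D integer array, while y_onehot has one row per sample and one column per class.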
- watex.cases.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)[source]#
Automate the data preparation to make it ready for PCA analyses.
Data preparation consists of imputing missing values, scaling the numerical features and encoding the categorical features.
- Parameters
X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
imputer_strategy (str, default 'most_frequent') – Strategy used to replace the missing values. Can be mean, median or most_frequent. If mean or median is given, make sure that the data does not contain categorical features.
missing_values (float) – Placeholder for the missing values in the X ndarray or dataframe. Default is np.nan.
num_indexes – List of indexes to select the numerical data if categorical data columns exist in the X ndarray.
cat_indexes – List of indexes to select the categorical data if numerical data columns exist in the X ndarray.
scaler (str) – Type of feature scaling applied on numerical features. Can be MinMaxScaler. Default is StandardScaler.
encode_cat_features (bool) – Encode categorical data or text attributes. The default encoder is sklearn.preprocessing.OrdinalEncoder.
columns (list, Optional) – List of columns to compose a dataframe if X is given as an ndarray.
- Returns
X
- Return type
NDArray | Dataframe
Notes
num_indexes and cat_indexes are mainly used when the data X is an np.ndarray of shape (m, nf), where m is the number of instances or examples and nf is the number of attributes or features. selector_ is used for dataframe preprocessing.
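For an ndarray input, the preparation steps can be sketched by slicing the numerical and categorical columns with num_indexes and cat_indexes. This is an illustration on a toy array, assuming the documented defaults (most_frequent imputation, StandardScaler, OrdinalEncoder); it is not the default_preparation implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy mixed-type array with missing values in every column.
X = np.array([
    [1.0, 200.0, "V"],
    [np.nan, 150.0, "W"],
    [3.0, np.nan, "V"],
    [4.0, 100.0, np.nan],
], dtype=object)
num_indexes, cat_indexes = [0, 1], [2]

# Numerical block: impute, then scale.
X_num = X[:, num_indexes].astype(float)
X_num = SimpleImputer(strategy="most_frequent").fit_transform(X_num)
X_num = StandardScaler().fit_transform(X_num)

# Categorical block: impute with the most frequent category, then encode.
X_cat = SimpleImputer(strategy="most_frequent").fit_transform(X[:, cat_indexes])
X_cat = OrdinalEncoder().fit_transform(X_cat)

# Recombine into a single numerical array ready for PCA.
X_ready = np.hstack([X_num, X_cat])
```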