Base data preparation for case studies

The base module automates data preparation in a single pass. It was created for fast data preparation in real engineering case studies. Its steps were used to solve a flow rate prediction problem [1], and following the same procedure helps users quickly reach the same goal as the published paper. As an example, different kinds of Bagoue datasets [2] are prepared using the BaseSteps module.

References

[1] Kouadio, K.L., Kouame, L.N., Drissa, C., Mi, B., Kouamelan, K.S., Gnoleba, S.P.D., Zhang, H., et al. (2022) Groundwater Flow Rate Prediction from Geo‐Electrical Features using Support Vector Machines. Water Resour. Res. DOI:10.1029/2021wr031623

[2] Kouadio, K.L., Nicolas, K.L., Binbin, M., Déguine, G.S.P. & Serge, K.K. (2021, October) Bagoue dataset-Cote d’Ivoire: Electrical profiling, electrical sounding and boreholes data, Zenodo. DOI:10.5281/zenodo.5560937

class watex.cases.prepare.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)

Bases: object

Default data preparation steps

By default, BaseSteps is used to prepare the DC 1D resistivity geoelectrical features before prediction. The predicted target was the flow rate.

Parameters
tname: str

A target name or label. In supervised learning, the target name is the reference name of the y (label) variable.

return_all: bool

If True, return the whole stratified training set. When the data is too large, set it to False to take a sample of the stratified training set to evaluate your model.

drop_features: list

List of useless features to drop when cleaning the dataset.

categorizefeature_props: list

List of properties used to categorize a particular feature in the dataset. Each item is composed of the name of the feature whose numerical values are converted into categorical values, then the value ranges of the data to categorize, and finally the categorical names of those value ranges. For instance:

categorizefeature_props= [
    ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3']))
    ]

Please refer to watex.utils.transformers.CategorizeFeatures for further details.
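The binning described by categorizefeature_props can be sketched in plain Python as follows. This is only an illustration and not the CategorizeFeatures implementation: the helper name `categorize` is hypothetical, and the sketch assumes that each bound is the inclusive upper edge of its class, so that 0 falls in 'FR0' and 3 falls in 'FR2'.

```python
from bisect import bisect_left


def categorize(values, bounds, labels):
    """Map each numerical value to a categorical label.

    Assumption: a value v receives labels[i], where i is the number of
    bounds strictly smaller than v (each bound is an inclusive upper edge).
    """
    if len(labels) != len(bounds) + 1:
        raise ValueError("expected one more label than bounds")
    return [labels[bisect_left(bounds, v)] for v in values]


flow = [0.0, 0.5, 1.0, 2.0, 3.0, 4.2]
classes = categorize(flow, [0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3'])
print(classes)
```

Three bounds produce four classes, which is why the example above lists one more label than value edges.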

hash: bool

If True, it ensures that the data remains consistent across multiple runs, even if the dataset is refreshed. The training and test sets are then separated by hashing the identifier of each instance when the data is split.
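A common way to obtain a split that stays stable when the dataset is refreshed is to hash an instance identifier and compare the checksum with the test ratio. The sketch below illustrates that general technique under this assumption; it is not the code used by BaseSteps, and `in_test_set` and `split_by_id` are hypothetical helpers.

```python
from zlib import crc32


def in_test_set(identifier, test_size=0.2):
    """Return True if the hashed identifier falls in the test fraction."""
    checksum = crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_size * 2**32


def split_by_id(ids, test_size=0.2):
    """Split identifiers into (train, test) lists using their hash only."""
    ids = list(ids)
    train = [i for i in ids if not in_test_set(i, test_size)]
    test = [i for i in ids if in_test_set(i, test_size)]
    return train, test


train_a, test_a = split_by_id(range(100))
# Refreshing the dataset with new rows never moves an old row across sets.
train_b, test_b = split_by_id(range(150))
print(set(test_a) <= set(test_b), set(train_a) <= set(train_b))
```

Because membership depends only on the identifier, an instance assigned to the test set in one run stays there in every later run.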

add_attributes: bool, default=True

Whether to experiment with combinations of numerical attributes. If True, the selected features are combined to create a new feature using the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, combining the features 'lwi' and 'ohmS' divides 'lwi' by 'ohmS'.

operator: str, default='/'

Type of operation to perform when combining features. Can be one of ['/', '+', '-', '*', '%'].

attribute_indexes: list of int

List of attribute indexes to combine. For instance:

attribute_indexes = [1, 0] # or [4, 3]

The default operator is division. These are the indexes of the attributes/features used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.
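The combination of two attributes selected by index can be sketched as follows. This is a minimal illustration on lists of rows, not the CombinedAttributesAdder implementation; `OPERATORS` and `add_combined_attribute` are hypothetical names, and the sketch raises an exception for a bad index instead of emitting a warning.

```python
import operator

# Operators listed for the `operator` parameter, mapped to Python functions.
OPERATORS = {
    '/': operator.truediv,
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '%': operator.mod,
}


def add_combined_attribute(rows, attribute_indexes, op='/'):
    """Append op(row[i], row[j]) to every row, with (i, j) = attribute_indexes."""
    i, j = attribute_indexes
    n_columns = len(rows[0])
    # Assumption: fail loudly here to keep the sketch short.
    if not (0 <= i < n_columns and 0 <= j < n_columns):
        raise IndexError("attribute index does not match the columns")
    combine = OPERATORS[op]
    return [list(row) + [combine(row[i], row[j])] for row in rows]


rows = [[2.0, 4.0], [3.0, 6.0]]
combined = add_combined_attribute(rows, attribute_indexes=[1, 0])
print(combined)
```

With attribute_indexes = [1, 0] and the default operator, the new column holds column 1 divided by column 0.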

imputer_strategy: str

Type of strategy to replace the missing values. Refer to SimpleImputer. Default is median.

missing_values: float

The value to be replaced. Default is np.nan.
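The effect of imputer_strategy and missing_values on a single column can be sketched in plain Python. This only illustrates the idea behind median or mean imputation; it is not SimpleImputer itself, and `impute_column` is a hypothetical helper.

```python
import math
from statistics import median


def impute_column(values, strategy="median", missing_value=math.nan):
    """Replace entries equal to `missing_value` with the column statistic."""
    def is_missing(v):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(missing_value, float) and math.isnan(missing_value):
            return isinstance(v, float) and math.isnan(v)
        return v == missing_value

    observed = [v for v in values if not is_missing(v)]
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = sum(observed) / len(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [fill if is_missing(v) else v for v in values]


column = [1.0, math.nan, 2.0, 6.0]
print(impute_column(column, strategy="median"))
print(impute_column(column, strategy="mean"))
```

The statistic is computed from the observed values only, so the missing entries do not influence the value used to fill them.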

pipeline: callable

Pipeline to prepare the dataset. Default is default_pipeline().

test_size: float, default=0.2 (i.e. 20%)

The ratio used to split the data into training (X, y) and testing (Xt, yt) sets respectively.

random_state: int, RandomState instance or None, default=42

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

verbose: int, default=0

Controls the level of verbosity. Higher values lead to more messages.

data: filepath or DataFrame of shape (M, N)

Data is passed here as an additional keyword argument, only to build X and y using the stratifydata method. It is a pandas.DataFrame containing M samples and N features.

Notes

Data preparation is composed of two steps. The first step includes:

  • Data cleaning: fixing and removing outliers, and replacing the missing values with other values using the parameter imputer_strategy rather than getting rid of the affected instances (examples) or the whole feature.

  • Handling text features: converting the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.

  • Data stratification, which is done before separating the dataset into a training set and a test set. Stratification divides the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a training set (80%) and a test set (20%).
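The stratified split described in the last item can be sketched in plain Python. This is a minimal illustration of the idea, assuming that the stratum of each row is given by a user-supplied function; it is not the code behind stratifydata, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratum_of, test_size=0.2, random_state=42):
    """Split rows so each stratum keeps about `test_size` of its rows for testing."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row in rows:
        groups[stratum_of(row)].append(row)

    train, test = [], []
    for stratum in sorted(groups):
        members = groups[stratum][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# 10 samples of class 'FR0' and 40 samples of class 'FR1'.
rows = [("FR0", i) for i in range(10)] + [("FR1", i) for i in range(40)]
train, test = stratified_split(rows, stratum_of=lambda row: row[0])
print(len(train), len(test))
```

Each class contributes the same fraction of its rows to the test set, so the class proportions of the whole dataset are preserved in both subsets.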

The second step consists of feature selection, feature engineering, encoding and data scaling, using the pipeline given via the parameter pipeline. If no pipeline is given, the default pipeline is triggered. Feature engineering consists of aggregating features by experimenting with combinations of attributes into promising new features, using the parameter attribute_indexes after setting the argument add_attributes to True. The final transformation step is feature scaling. The type of scaling used by default in this module is standardization because it is less affected by outliers. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (deals with numerical features) and the categorical pipeline (deals with categorical features). Both pipelines are combined and applied to the training set and later to the test set.
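The combination of a numerical and a categorical pipeline can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the default pipeline of this module: the toy data and column names are hypothetical, and the choice of median imputation, standardization and ordinal encoding simply mirrors the steps named above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature.
train = pd.DataFrame({"power": [1.0, np.nan, 3.0],
                      "geol": ["granite", "schist", "granite"]})
test = pd.DataFrame({"power": [2.0], "geol": ["schist"]})

# Numerical pipeline: impute missing values, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical pipeline: convert text labels to numbers.
cat_pipeline = Pipeline([
    ("encoder", OrdinalEncoder()),
])
# Full pipeline: each sub-pipeline is applied to its own columns.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["power"]),
    ("cat", cat_pipeline, ["geol"]),
])

X_train = full_pipeline.fit_transform(train)  # fit on the training set only
X_test = full_pipeline.transform(test)        # reuse the fitted statistics
print(X_train.shape, X_test)
```

Fitting on the training set and only transforming the test set keeps the test data from leaking into the imputation and scaling statistics.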

Examples

../datasets/_p.py

property X_

Keep the stratified test set X.

property data

fit(X, y=None)

Preparing steps.

Parameters
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns

self – returns self for easy method chaining.

Return type

BaseSteps instance

fit_transform(X=None, y=None, on_testset=False)

Apply fit and transform at once.

Parameters
  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like) – ylabel or target values

  • on_testset (bool, default=False) – Whether the data being transformed is the test set rather than the training set

Returns

  • X_prepared – Data prepared after transformation.

  • y_prepared – Label prepared after transformation.

stratifyFolds(data)

Stratify the dataset and return the training set. See watex.bases.transformers.StratifiedWithCategoryAdder for more details.

stratifydata(data=None)

Split and stratify the data, then return the stratified training and test sets.

property tname

transform(X=None, y=None, on_testset=False)

Transform data applying the pipeline transformation.

Parameters
  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like) – ylabel or target values

  • on_testset (bool, default=False) – Whether the data being transformed is the test set rather than the training set

Returns

  • X_prepared – Data prepared after transformation.

  • y_prepared – Label prepared after transformation.

property y_

Keep the stratified label y.

watex.cases.prepare.base_transform(X, n_components=0.95, attr_names=None, attr_indexes=None, operator=None, view=False, **kws)

Transform X using PCA and plot the variance ratio by experimenting with attribute combinations.

Create new attributes using feature indexes or a literal string operator, and prepare the data for the PCA variance plot.

Parameters
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • n_components (float or int) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. By default, 95% of the variance is preserved.

  • attr_names (list of str, optional) – List of features to combine. New feature values are created by combining them with the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attr_names=['lwi', 'ohmS'] will divide the feature 'lwi' by 'ohmS'.

  • attr_indexes (list of int) – Index of each feature used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns.

  • operator (str, default='/') – Type of operation to perform when combining features. Can be one of ['/', '+', '-', '*', '%'].

Returns

  • X (ndarray or pd.DataFrame) – New array or dataframe with the new attributes combined.
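How a fractional n_components translates into a number of dimensions can be sketched as follows. The sketch assumes the usual convention that the smallest number of leading components whose cumulative explained variance ratio exceeds the requested fraction is kept; `components_to_keep` and the ratios are hypothetical and only serve the illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def components_to_keep(explained_variance_ratio, n_components=0.95):
    """Return how many leading components to keep.

    An int is used as is. A float in (0, 1) is read as the fraction of
    variance to preserve: keep the smallest number of components whose
    cumulative explained variance ratio exceeds it.
    """
    if isinstance(n_components, int):
        return n_components
    cumulative = list(accumulate(explained_variance_ratio))
    return min(bisect_right(cumulative, n_components) + 1, len(cumulative))


# Hypothetical ratios, sorted in decreasing order as PCA reports them.
ratios = [0.60, 0.25, 0.08, 0.04, 0.03]
print(components_to_keep(ratios, 0.95))
```

With these ratios, the first three components explain about 93% of the variance, so a fourth one is needed to pass the 95% threshold.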

Examples

>>> from watex.view.mlplot import MLPlots
>>> from watex.datasets import fetch_data
>>> from watex.analysis import pcaVarianceRatio
>>> plot_kws = {'lc': (.9, 0., .8),
...             'lw': 3.,                 # line width
...             'font_size': 7.,
...             'show_grid': True,        # visualize grid
...             'galpha': 0.2,            # grid alpha
...             'glw': .5,                # grid line width
...             'gwhich': 'major',        # which ticks the grid follows
...             # 'fs': 3.,               # coeff to manage font_size
...             }
>>> X, _ = fetch_data('Bagoue data analysis')
>>> mlObj = MLPlots(**plot_kws)
>>> pcaVarianceRatio(mlObj, X, plot_var_ratio=True)

watex.cases.prepare.default_pipeline(X, num_attributes, cat_attributes, y=None, label_encoding='LabelEncoder', **kws)

Default pipeline used for preprocessing the Bagoue dataset.

The pipeline can be improved to achieve better results.

Parameters

  • X (ndarray, pd.DataFrame) – X or dataframe X

  • y (array_like) – ylabel or target

  • num_attributes (list) – Numerical attributes

  • cat_attributes (list) – Categorical attributes

  • label_encoding (str) – Type of encoding used to encode the label. Default is LabelEncoder but can be LabelBinarizer.

Returns

  • num_pipeline – Pipeline to process numerical features.

  • cat_pipeline – Pipeline to process categorical features.

  • full_pipeline – Full pipeline as the union of the two pipelines.

  • y – Encoded ylabel if not None.
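The difference between the two label encodings can be sketched in plain Python. This illustrates what each encoding produces for a target with more than two classes; it is not the scikit-learn implementation, and `label_encode` and `label_binarize` are hypothetical helpers.

```python
def label_encode(y):
    """Map each class to its rank among the sorted unique classes."""
    classes = sorted(set(y))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in y], classes


def label_binarize(y):
    """Map each class to a one-hot row (one column per sorted unique class)."""
    codes, classes = label_encode(y)
    onehot = [[int(code == k) for k in range(len(classes))] for code in codes]
    return onehot, classes


y = ['FR1', 'FR0', 'FR2', 'FR0']
codes, classes = label_encode(y)
onehot, _ = label_binarize(y)
print(codes)
print(onehot)
```

Integer codes keep the target in a single column, while the binarized form yields one column per class.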

watex.cases.prepare.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)

Automate the data preparation to be ready for PCA analyses.

Data preparation consists of imputing missing values, scaling the numerical features and encoding the categorical features.

Parameters
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • imputer_strategy (str, default='most_frequent') – Strategy used to replace the missing values. Can be mean, median or most_frequent. Be aware that if mean or median is given, the data must not contain categorical features.

  • missing_values (float) – The value to be replaced in the X ndarray or dataframe. Default is np.nan.

  • num_indexes – List of indexes to select the numerical data if categorical data columns exist in the X ndarray.

  • cat_indexes – List of indexes to select the categorical data if numerical data columns exist in the X ndarray.

  • scaler (str, default is StandardScaler) – Type of feature scaling applied to the numerical features. Can be MinMaxScaler.

  • encode_cat_features (bool) – Whether to encode categorical data or text attributes. The default encoder is sklearn.preprocessing.OrdinalEncoder.

  • columns (list, optional) – List of columns used to compose a dataframe if X is given as an ndarray.

Returns

X

Return type

NDArray | Dataframe

Notes

num_indexes and cat_indexes are mainly used when the data X is an np.ndarray of shape (m, nf), where m is the number of instances or examples and nf is the number of attributes or features. selector_ is used for dataframe preprocessing.
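The role of num_indexes and cat_indexes on array-like data can be sketched in plain Python. This is an illustration only, assuming standardization for the numerical columns and ordinal encoding for the categorical ones; it is not the body of default_preparation, and `prepare_array` and the toy rows are hypothetical.

```python
from statistics import mean, pstdev


def prepare_array(rows, num_indexes, cat_indexes):
    """Standardize numerical columns and ordinal-encode categorical ones."""
    columns = list(zip(*rows))
    prepared = []
    for k in num_indexes:
        mu, sigma = mean(columns[k]), pstdev(columns[k])
        # A constant column has zero spread; map it to zeros.
        prepared.append([(v - mu) / sigma if sigma else 0.0 for v in columns[k]])
    for k in cat_indexes:
        categories = sorted(set(columns[k]))
        prepared.append([float(categories.index(v)) for v in columns[k]])
    # Back to row-major layout: numerical columns first, then categorical ones.
    return [list(row) for row in zip(*prepared)]


rows = [[1.0, 'W'], [3.0, 'V'], [5.0, 'W']]
X_prepared = prepare_array(rows, num_indexes=[0], cat_indexes=[1])
print(X_prepared)
```

The index lists tell the routine which columns to scale and which to encode, since a plain array carries no column names or dtypes per column.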