Base data preparation for case studies#
The base module automates data preparation in a single pass. It was created for fast data preparation in real engineering case studies. The base steps have been used to solve a flow rate prediction problem [1]. Following the same procedure helps users quickly reach the same goal as the published paper. As an example, the Bagoue dataset [2] is prepared using the BaseSteps module.
References
- 1
Kouadio, K.L., Kouame, L.N., Drissa, C., Mi, B., Kouamelan, K.S., Gnoleba, S.P.D., Zhang, H., et al. (2022) Groundwater Flow Rate Prediction from Geo‐Electrical Features using Support Vector Machines. Water Resour. Res. DOI:10.1029/2021wr031623
- 2
Kouadio, K.L., Nicolas, K.L., Binbin, M., Déguine, G.S.P. & Serge, K.K. (2021, October) Bagoue dataset-Cote d’Ivoire: Electrical profiling, electrical sounding and boreholes data, Zenodo. DOI:10.5281/zenodo.5560937
- class watex.cases.prepare.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)[source]#
Bases: object
Default data preparation steps.
By default, BaseSteps is used to prepare the DC 1D resistivity geoelectrical features before prediction. The predicted target was the flow rate.
- Parameters
- tname: str,
A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.
- return_all: bool
Return the whole stratified trainset. When the data is too large, set it to False to take a sample of the stratified trainset to evaluate your model.
- drop_features: list
List of useless features to drop when cleaning the dataset.
- categorizefeature_props: list
List of properties used to categorize a particular feature in the dataset. Each entry is composed of the name of the feature whose numerical values are converted into categorical values, then the value ranges used for the categorization, and finally the categorical names of those value ranges. For instance:
categorizefeature_props= [ ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3'])) ]
Please refer to watex.utils.transformers.CategorizeFeatures for further details.
- hash: bool,
If True, it ensures that the data remain consistent across multiple runs, even if the dataset is refreshed. Each instance is assigned to the training or test set from a hash of its identifier when the data is split.
- add_attributes: list, optional
Experiment with combinations of numerical attributes. List of features to combine. The features are combined to create a new feature using the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature 'lwi' by 'ohmS'.
- operator: str, default =’/’
Type of operation to perform when combining features. Can be [‘/’, ‘+’, ‘-’, ‘*’, ‘%’]
- attribute_indexes: list of int,
List of attribute indexes to combine. For instance:
attribute_indexes = [1, 0] # or [4, 3]
The default operator is division. These are the indexes of each attribute/feature used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.
- imputer_strategy: str
Type of strategy used to replace the missing values. Refer to SimpleImputer. Default is median.
- missing_values: float
The value to be replaced. Default is np.nan values.
- pipeline: callable
Pipeline to prepare the dataset. Default is defaultPipeline().
- test_size: float, default=.2 i.e. 20% (X, y)
The proportion of the data assigned to the testing set (Xt, yt); the remainder forms the training set (X, y).
- random_state: int, RandomState instance or None, default=42
Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
- verbose: int, default is 0
Controls the level of verbosity. Higher values lead to more messages.
- data: Filepath or Dataframe of shape (M, N)
Data is passed here as an additional keyword argument only to build X and y using the method stratifydata. It is a pandas.DataFrame containing M samples and N features.
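The categorizefeature_props example above can be illustrated with a minimal sketch. It assumes that the three boundaries [0., 1., 3.] split the values into four right-closed intervals, and it uses pandas.cut as a stand-in rather than the watex CategorizeFeatures transformer, whose exact boundary handling is not described here. The flow values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flow rate values (illustrative only).
flow = pd.Series([0.0, 0.5, 1.0, 2.0, 3.0, 5.0], name="flow")

# Boundaries and class names taken from the categorizefeature_props example.
boundaries = [0.0, 1.0, 3.0]
labels = ["FR0", "FR1", "FR2", "FR3"]

# Assumption: intervals are right-closed, i.e. flow <= 0 -> FR0,
# 0 < flow <= 1 -> FR1, 1 < flow <= 3 -> FR2, flow > 3 -> FR3.
bins = [-np.inf, *boundaries, np.inf]
flow_classes = pd.cut(flow, bins=bins, labels=labels)

print(flow_classes.tolist())
```

Three boundaries always yield four classes under this assumption, which is why the example lists one more class name than boundary values.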
Notes
Data preparation is composed of two steps.
The first step includes:
- Data cleaning, which fixes and removes outliers and replaces the missing values with other values using the parameter imputer_strategy, rather than getting rid of the affected instances (examples) or the whole feature.
- Handling of text and categorical features, which converts the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.
- Data stratification, which is done before separating the dataset into a trainset and a test set. Stratification divides the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a trainset (80%) and a test set (20%).
The second step consists of feature selection, feature engineering, encoding and data scaling using the pipeline passed via the parameter pipeline. If no pipeline is given, the default pipeline is triggered. Feature engineering aggregates features by experimenting with combinations of attributes into promising new features, using the parameter attribute_indexes after setting the argument add_attributes to True. The final transformation step is feature scaling. The type of scaling used by default in this module is standardization because it is less affected by outliers. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (which deals with numerical features) and the categorical pipeline (which deals with categorical features). Both pipelines are combined and applied to the trainset and later to the test set.
Examples
../datasets/_p.py
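The stratification described in the notes can be sketched with scikit-learn alone. This is a minimal illustration, not the BaseSteps implementation: it assumes that scikit-learn's StratifiedShuffleSplit is an acceptable stand-in for the watex stratification transformer, and the dataset, column names and class proportions are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)

# Hypothetical small dataset: one numerical feature and an imbalanced
# categorical column used for stratification (names are illustrative).
data = pd.DataFrame({
    "feature": rng.normal(size=100),
    "flow": ["FR0"] * 50 + ["FR1"] * 30 + ["FR2"] * 20,
})

# One stratified split: 80% trainset, 20% test set.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["flow"]))
trainset, testset = data.iloc[train_idx], data.iloc[test_idx]

# Class proportions in the test set mirror those of the whole dataset.
overall = data["flow"].value_counts(normalize=True).sort_index()
in_test = testset["flow"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

With a purely random split on a dataset this small, the test set proportions could drift away from the overall ones, which is the bias the notes warn about.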
- property X_#
keep the stratified testset X
- property data#
- fit(X, y=None)[source]#
Preparing steps.
- Parameters
X (ndarray, M x N matrix where M=m-samples and N=n-features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M,), M=m-samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns
self – returns self for easy method chaining.
- Return type
BaseSteps instance
- fit_transform(X=None, y=None, on_testset=False)[source]#
Apply fit and transform at once.
- Parameters
X (ndarray, pd.DataFrame) – X or dataframe X
y (array_like,) – ylabel or target values
on_testset (str) – Indicates whether the data being evaluated is the testset or the trainset.
- Returns
X_prepared – Data prepared after transformation.
y_prepared – Label prepared after transformation.
- stratifyFolds(data)[source]#
Stratify the dataset and return the trainset. Get more details in watex.bases.transformers.StratifiedWithCategoryAdder.
- stratifydata(data=None)[source]#
Split and stratify the data and return the stratified training and test sets.
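The hash parameter of BaseSteps refers to splitting by identifier so that assignments survive a dataset refresh. The sketch below shows one common way to do this with a CRC32 checksum. It is an assumption about the general technique, not the watex implementation, and the helper names is_in_test_set and split_by_id as well as the id column are hypothetical.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    """Return True when the id checksum falls in the lowest test_ratio share."""
    return crc32(np.int64(identifier)) & 0xFFFFFFFF < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    """Split a dataframe into train and test sets from a stable identifier."""
    in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]


# Hypothetical dataset with a stable integer identifier per instance.
data = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 0.1})
train, test = split_by_id(data, test_ratio=0.2, id_column="id")

# Refreshing the dataset with new rows leaves earlier assignments unchanged.
refreshed = pd.DataFrame({"id": np.arange(1200), "value": np.arange(1200) * 0.1})
_, test_refreshed = split_by_id(refreshed, test_ratio=0.2, id_column="id")
old_ids_in_new_test = set(test_refreshed["id"]) & set(data["id"])
print(old_ids_in_new_test == set(test["id"]))
```

Because the assignment depends only on the identifier, an instance that was once in the test set never leaks into the trainset after new data is appended. The test share is only approximately equal to test_ratio.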
- property tname#
- transform(X=None, y=None, on_testset=False)[source]#
Transform data applying the pipeline transformation.
- Parameters
X (ndarray, pd.DataFrame) – X or dataframe X
y (array_like,) – ylabel or target values
on_testset (str) – Indicates whether the data being evaluated is the testset or the trainset.
- Returns
X_prepared – Data prepared after transformation.
y_prepared – Label prepared after transformation.
- property y_#
keep the stratified label y
- watex.cases.prepare.base_transform(X, n_components=0.95, attr_names=None, attr_indexes=None, operator=None, view=False, **kws)[source]#
Transform X using PCA and plot the variance ratio by experimenting with attribute combinations.
Create new attributes using feature indexes or a literal string operator, and prepare the data for the PCA variance plot.
- Parameters
X (ndarray, M x N matrix where M=m-samples and N=n-features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (float or int) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None, the ratio of variance to preserve defaults to 95%.
attr_names (list of str, optional) – List of features to combine. The features are combined into new feature values using the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature 'lwi' by 'ohmS'.
attr_indexes (list of int,) – Index of each feature used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns.
operator (str, default ='/') – Type of operation to perform when combining features. Can be [‘/’, ‘+’, ‘-’, ‘*’, ‘%’]
- Returns
X (ndarray or pd.DataFrame) – New array or dataframe with the new attributes combined.
Examples
>>> from watex.view.mlplot import MLPlots
>>> from watex.datasets import fetch_data
>>> from watex.analysis import pcaVarianceRatio
>>> plot_kws = {'lc': (.9, 0., .8),
...             'lw': 3.,            # line width
...             'font_size': 7.,
...             'show_grid': True,   # visualize grid
...             'galpha': 0.2,       # grid alpha
...             'glw': .5,           # grid line width
...             'gwhich': 'major',   # minor ticks
...             # 'fs': 3.,          # coeff to manage font_size
...             }
>>> X, _ = fetch_data('Bagoue data analysis')
>>> mlObj = MLPlots(**plot_kws)
>>> pcaVarianceRatio(mlObj, X, plot_var_ratio=True)
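The two ideas documented for base_transform, combining attributes by index and keeping a ratio of the variance with PCA, can be sketched independently of watex. The snippet below is a minimal illustration using NumPy and scikit-learn; the helper combine_by_indexes and the random data are hypothetical, and the scaling step before PCA is an assumption rather than documented behaviour.

```python
import operator

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Map the operator strings listed in the parameters to Python functions.
OPERATORS = {"/": operator.truediv, "+": operator.add, "-": operator.sub,
             "*": operator.mul, "%": operator.mod}


def combine_by_indexes(X, attr_indexes, op="/"):
    """Append a column built from two existing columns (hypothetical helper)."""
    first, second = attr_indexes
    new_column = OPERATORS[op](X[:, first], X[:, second])
    return np.column_stack([X, new_column])


rng = np.random.default_rng(42)
# Hypothetical strictly positive features so that the ratio is well defined.
X = rng.uniform(1.0, 10.0, size=(200, 5))

# attr_indexes=[1, 0] divides column 1 by column 0.
X_combined = combine_by_indexes(X, attr_indexes=[1, 0], op="/")

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that ratio.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X_combined))

print(X_combined.shape, X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```

The order of the indexes matters for non-commutative operators: [1, 0] and [0, 1] give inverse ratios.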
- watex.cases.prepare.default_pipeline(X, num_attributes, cat_attributes, y=None, label_encoding='LabelEncoder', **kws)[source]#
Default pipeline used for preprocessing the Bagoue dataset.
The pipeline can be improved to achieve better results.
- Parameters
X (ndarray, pd.DataFrame) – X or dataframe X
- y: array_like,
ylabel or target
- num_attributes:list
Numerical attributes
- cat_attributes: list
categorical attributes
- label_encoding: str
Type of encoding used to encode the label. Default is LabelEncoder but can be LabelBinarizer.
- Returns
- num_pipeline (pipeline to process numerical features)
- cat_pipeline (pipeline to process categorical features)
- full_pipeline (full pipeline as the union of the two pipelines)
- y (ylabel encoded if not None)
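The structure returned by default_pipeline (a numerical pipeline, a categorical pipeline, their combination and an encoded label) can be sketched with scikit-learn. This is an illustration under stated assumptions, not the watex code: ColumnTransformer is used here to combine the two pipelines, OneHotEncoder is picked as the categorical encoder, and the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical data; the column names are illustrative only.
X = pd.DataFrame({
    "power": [10.0, 20.0, np.nan, 40.0],
    "magnitude": [1.0, 2.0, 3.0, 4.0],
    "shape": ["V", "W", "V", "U"],
})
y = pd.Series(["FR1", "FR0", "FR1", "FR2"])
num_attributes, cat_attributes = ["power", "magnitude"], ["shape"]

# Numerical pipeline: impute missing values, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical pipeline: one-hot encode the text categories.
cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
# Full pipeline: apply each sub-pipeline to its own columns and
# always return a dense array (sparse_threshold=0).
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attributes),
    ("cat", cat_pipeline, cat_attributes),
], sparse_threshold=0)

X_prepared = full_pipeline.fit_transform(X)
y_encoded = LabelEncoder().fit_transform(y)
print(X_prepared.shape, y_encoded)
```

Fitting the full pipeline on the trainset and only calling transform on the test set keeps the imputation and scaling statistics from leaking test information.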
- watex.cases.prepare.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)[source]#
Automate the data preparation so that the data is ready for PCA analyses.
Data preparation consists of imputing the missing values, scaling the numerical features and encoding the categorical features.
- Parameters
X (ndarray, M x N matrix where M=m-samples and N=n-features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
imputer_strategy (str, default ='most_frequent') – Strategy used to replace the missing values. Can be mean, median or most_frequent. Be aware that if mean or median is given, the data must not contain categorical features.
missing_values (float) – Value treated as missing and replaced in the X ndarray or dataframe. Default is np.nan.
num_indexes – list of indexes to select the numerical data if categorical data columns exist in the X ndarray.
cat_indexes – list of indexes to select the categorical data if numerical data columns exist in the X ndarray.
scaler (str, default is StandardScaler) – Type of feature scaling applied to the numerical features. Can be MinMaxScaler.
encode_cat_features (bool) – Encode categorical data or text attributes. The default encoder is sklearn.preprocessing.OrdinalEncoder.
columns (list, optional) – List of columns used to compose a dataframe if X is given as an ndarray.
- Returns
X
- Return type
NDArray | Dataframe
Notes
num_indexes and cat_indexes are mainly used when the data X is an np.ndarray of shape (m, nf), where m is the number of instances or examples and nf is the number of attributes or features. selector_ is used for dataframe preprocessing.