watex.cases.prepare.BaseSteps

class watex.cases.prepare.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)

Default data preparation steps

By default, BaseSteps is used to prepare the DC 1D resistivity geoelectrical features before prediction. The predicted target is the flow rate.

Parameters:

tname: str

A target name or label. In supervised learning, the target name is the reference name of the y (label) variable.

return_all: bool

Return the whole stratified trainset. When the data is too large, set it to False to take a sample of the stratified trainset to evaluate your model.

drop_features: list

List of useless features to drop when cleaning the dataset.

categorizefeature_props: list

List of properties used to categorize particular features in the dataset. Each entry is composed of the name of the feature whose numerical values are converted into categorical values, then the value ranges of the data to be categorized, and finally the categorical names of those value ranges. For instance:

categorizefeature_props= [
    ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3']))
    ]

Please refer to watex.utils.transformers.CategorizeFeatures for further details.
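As a rough illustration of what such an entry expresses, the sketch below maps numeric values to labels with the standard library only. It is not the CategorizeFeatures implementation: the helper `categorize` is hypothetical, and the boundary convention (each value range is open on the left and closed on the right) is an assumption.

```python
from bisect import bisect_left


def categorize(value, boundaries, labels):
    """Map a numeric value to a label.

    Assumed convention: label i covers (boundaries[i-1], boundaries[i]],
    so there must be one more label than boundaries.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("expected one more label than boundaries")
    return labels[bisect_left(boundaries, value)]


# Same layout as the categorizefeature_props entry shown above.
name, (boundaries, labels) = ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3']))

flows = [0.0, 0.5, 1.0, 2.4, 3.0, 7.2]  # made-up flow rates
classes = [categorize(v, boundaries, labels) for v in flows]
print(name, classes)
```

Three boundaries delimit four value ranges, which is why the example entry carries four labels.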

hash: bool

If True, it ensures that the data remain consistent across multiple runs, even if the dataset is refreshed. A test by id is used to hash the training and test sets when the data is split.
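The idea of a split by id can be sketched as follows: each sample is assigned to the test set from a checksum of its identifier alone, so the assignment does not change when new samples are added. This is a minimal standard-library sketch, not the watex implementation; the helpers `in_test_set` and `split_by_id` and the choice of CRC32 are assumptions.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    """Decide membership from a checksum of the identifier alone."""
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32


def split_by_id(ids, test_ratio=0.2):
    """Return (train_ids, test_ids), preserving the input order."""
    train = [i for i in ids if not in_test_set(i, test_ratio)]
    test = [i for i in ids if in_test_set(i, test_ratio)]
    return train, test


train_ids, test_ids = split_by_id(range(1000))

# After a refresh that appends new samples, earlier ids keep their side.
_, test_ids_refreshed = split_by_id(range(1500))
```

Because membership depends only on the identifier, the identifiers must be stable and unique for the split to stay consistent.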

add_attributes: list, optional

Experiment with combinations of numerical attributes. List of features to combine. Features are combined to create a new feature value according to the operator parameter. By default, the combination is the ratio of the given attributes/numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature 'lwi' by 'ohmS'.

operator: str, default='/'

Type of operation to perform when combining features. Can be one of ['/', '+', '-', '*', '%'].

attribute_indexes: list of int

List of attribute indexes to combine. For instance:

attribute_indexes = [1, 0] # or [4, 3]

The default operator is division. These are the indexes of each attribute/feature used when experimenting with combinations. A user warning is raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.
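To make the interplay between the indexes and the operator concrete, here is a minimal sketch that appends a combined column to a small table. It is not CombinedAttributesAdder: the helper `add_combined_attribute` is hypothetical, the sample values are made up, and the order (first index combined with the second) is assumed from the 'lwi' by 'ohmS' example.

```python
import operator

# The five operators listed for the `operator` parameter.
OPERATORS = {
    '/': operator.truediv,
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '%': operator.mod,
}


def add_combined_attribute(rows, attribute_indexes, op='/'):
    """Append row[first] <op> row[second] as a new last column."""
    first, second = attribute_indexes
    func = OPERATORS[op]
    return [list(row) + [func(row[first], row[second])] for row in rows]


rows = [[2.0, 10.0], [4.0, 20.0], [5.0, 15.0]]  # two numerical features

# Column 1 divided by column 0 (the default operator is division).
combined = add_combined_attribute(rows, attribute_indexes=[1, 0])
summed = add_combined_attribute(rows, attribute_indexes=[1, 0], op='+')
```

The sketch does not guard against a zero denominator; a real transformer would need to handle that case.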

imputer_strategy: str

Type of strategy to replace the missing values. Refer to SimpleImputer. Default is median.

missing_values: float

The value to be replaced. Default is np.nan.
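The effect of the default 'median' strategy on NaN entries can be sketched with the standard library. This only illustrates the idea and is not the SimpleImputer code; the helper `impute_median` and the sample values are hypothetical.

```python
import math
from statistics import median


def impute_median(column):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in column]


resistivity = [120.0, float("nan"), 80.0, 100.0, float("nan")]  # made-up values
imputed = impute_median(resistivity)
print(imputed)
```

Every sample is kept; only the missing entries change.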

pipeline: callable

Pipeline to prepare the dataset. Default is defaultPipeline().

test_size: float, default=0.2 (i.e. 20%)

The proportion of the data held out as the test set (Xt, yt); the remainder forms the training set (X, y).

random_state: int, RandomState instance or None, default=42

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

verbose: int, default=0

Control the level of verbosity. Higher values lead to more messages.

data: Filepath or DataFrame of shape (M, N)

Data is passed here as an additional keyword argument and is used only to build X and y through the stratifydata method. It is a pandas.DataFrame containing M samples and N features.

Notes

Data preparation is composed of two steps. The first step includes:

- Data cleaning: fixing and removing outliers, and replacing the missing values with other values using the parameter imputer_strategy rather than getting rid of the affected instances (examples) or the whole feature.
- Handling text and categorical features: converting the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.
- Data stratification, done before separating the dataset into a trainset and a test set. Stratification divides the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a trainset (80%) and a test set (20%).
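The stratification described above can be sketched as follows: the samples are grouped by stratum and the same fraction is drawn from every group, so the test set keeps the class proportions of the whole dataset. This is a minimal standard-library sketch, not the watex code; the helper `stratified_split` and the made-up class counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(strata, test_size=0.2, random_state=42):
    """Return (train_indices, test_indices) with test_size taken per stratum."""
    rng = random.Random(random_state)  # fixed seed for reproducible shuffles
    by_stratum = defaultdict(list)
    for index, stratum in enumerate(strata):
        by_stratum[stratum].append(index)

    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical, imbalanced flow classes for 100 samples.
labels = ['FR0'] * 10 + ['FR1'] * 20 + ['FR2'] * 50 + ['FR3'] * 20
train_idx, test_idx = stratified_split(labels)
```

With a purely random split of such a small dataset, a rare class such as 'FR0' could end up under-represented in the test set or missing from it; drawing per stratum avoids that.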

The second step consists of feature selection, feature engineering, encoding and data scaling using the pipeline given by the parameter pipeline. If no pipeline is given, the default pipeline is triggered. Feature engineering aggregates features by experimenting with combinations of attributes to produce promising new features, using the parameter attribute_indexes after setting the argument add_attributes to True. The final transformation step is feature scaling. The type of scaling used by default in this module is standardization because it is less affected by outliers. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (which deals with numerical features) and the categorical pipeline (which deals with categorical features). Both pipelines are combined and applied to the trainset and later to the test set.
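The scaling step and the "trainset first, test set later" order can be illustrated with a small sketch: the mean and standard deviation are estimated on the trainset only and then reused on the test set. This is not the default pipeline of the module; the helper names and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Estimate the mean and population standard deviation on the trainset."""
    return fmean(train_column), pstdev(train_column)


def standardize(column, mean, std):
    """Center and scale a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]


train = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up numerical feature
test = [30.0, 60.0]

mean, std = fit_standardizer(train)          # fitted on the trainset only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuses the trainset statistics
```

Reusing the trainset statistics keeps information from the test set out of the preparation step.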

Examples

../datasets/_p.py