Provides efficient tools for data manipulation and transformation.

class watex.transformers.CategorizeFeatures(num_columns_properties=None)[source]#

Bases: BaseEstimator, TransformerMixin

Transform numerical features into categorical features and return the transformed array.

Parameters:

*num_columns_properties* (list) – List composed of the numerical feature name (or index), the list of feature boundaries, and the corresponding categorized names.

Notes

Feature values are transformed according to the given boundary values. num_columns_properties is composed of:

  • feature name or index: for example the name ‘flow’ or the index of the flow column, ‘12’.

  • feature boundaries: for example [0., 1., 3], which may correspond to:

    • 0: flow values equal to 0. By default, the first value (here 0) is not treated as a range.

    • 0-1: values between 0 and 1.

    • 1-3: values between 1 and 3.

    • >3: all values greater than 3. By default, values greater than the last boundary form their own category.

    If the default classification is not suitable, create your own ranges, like [[0-1], [1-3], 3] (1).

  • categorized names: if the boundaries are provided without explicit ranges such as those in (1), the number of categorized names must be the size of the feature boundaries + 1. For instance, to replace all numerical values in the column flow by:

    -FR0: all flow equal to 0.
    -FR1: flow between 0-1
    -FR2: flow between 1-3
    -FR3: flow greater than 3.

    The size of the categorized names [‘FR0’, ‘FR1’, ‘FR2’, ‘FR3’] is equal to the size of the feature boundaries [0., 1., 3] + 1.
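The boundary rule described in these notes can be sketched with NumPy. This is only an illustration of the idea, not the watex implementation: the helper name `categorize` is hypothetical, and it assumes each boundary closes its range on the right (so a value of exactly 1 falls in the 0-1 category).

```python
import numpy as np


def categorize(values, boundaries, names):
    """Map numerical values to category names (hypothetical helper).

    Assumes len(names) == len(boundaries) + 1 and right-closed ranges.
    """
    if len(names) != len(boundaries) + 1:
        raise ValueError("expected len(names) == len(boundaries) + 1")
    # With right=True, index i means boundaries[i-1] < value <= boundaries[i];
    # values at or below the first boundary get index 0.
    idx = np.digitize(values, boundaries, right=True)
    return np.asarray(names)[idx]


flow = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 7.5])
labels = categorize(flow, [0.0, 1.0, 3.0], ["FR0", "FR1", "FR2", "FR3"])
print(labels)
```

Note that in this sketch any value below the first boundary is also assigned the first category.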

ascertain_mumerical_values(X, y=None)[source]#

Retrieve the indexes of numerical attributes. Returns the DataFrame values if X is a DataFrame, otherwise returns the array values.

fit(X, y=None)[source]#
transform(X, y=None)[source]#

Transform the data and return a new array. fit_transform(), inherited from scikit-learn, can be called directly.

class watex.transformers.CombinedAttributesAdder(attribute_names=None, attribute_indexes=None, operator='/')[source]#

Bases: BaseEstimator, TransformerMixin

Combine attributes using literal string operators, indexes or names.

Create a new attribute using feature indexes or a literal string operator. Inherits from the scikit-learn BaseEstimator and TransformerMixin classes.

Parameters:
  • *attribute_names* (list of str, optional) – List of features to combine. The new feature values are computed with the operation given by the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature ‘lwi’ by ‘ohmS’.

  • *attribute_indexes* (list of int) – Index of each feature to combine. A user warning is raised if any index does not match the DataFrame or array columns.

  • *operator* (str, default='/') – Type of operation to perform. One of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].

Returns:

X – A new array containing the new data from the attribute_indexes operation. If attribute_names and attribute_indexes are both None, the input array is returned unchanged.

Return type:

np.ndarray
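As a rough illustration of the index-based combination and the "return the same array" behaviour, the following NumPy sketch appends one combined column. The helper `add_combined_column` and the `OPERATORS` table are hypothetical names for this example and do not reflect the actual watex code.

```python
import operator
from functools import reduce

import numpy as np

# Assumed mapping from the documented operator strings to Python functions.
OPERATORS = {
    "/": operator.truediv,
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "%": operator.mod,
}


def add_combined_column(X, indexes=None, op="/"):
    """Append a column combining the columns at `indexes` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if not indexes:
        # Nothing to combine: return the array unchanged.
        return X
    # Apply the operator left to right over the selected columns.
    combined = reduce(OPERATORS[op], (X[:, i] for i in indexes))
    return np.column_stack([X, combined])


X = [[6.0, 3.0, 1.0], [8.0, 2.0, 2.0]]
X_ratio = add_combined_column(X, indexes=[0, 1])         # column 0 / column 1
X_sum = add_combined_column(X, indexes=[0, 1, 2], op="+")
X_same = add_combined_column(X)
```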

Notes

A literal string operator can be used. For instance, the division of two numerical features is expressed with the word “per” surrounded by underscores, as in “_per_”. To create a new feature from the division of the features lwi and ohmS, the literal string operator passed as attribute_names could be:

attribute_names='lwi_per_ohmS'

The same literal string form is valid for multiplication (_mul_), subtraction (_sub_), modulo (_mod_) and addition (_add_). Feature indexes can also be used instead of attribute_names, together with the operator parameter.

The indexes of both features in the array can be given like attributes_ix =[(10, 9)], which means lwi and ohmS are found at indexes 10 and 9 respectively. Furthermore, multiple operations can be set by adding multiple literal string operators to a list like attributes_ix = ['power_per_magnitude', 'ohmS_per_lwi'].
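To make the literal string convention concrete, here is a minimal pandas sketch of how a string such as 'lwi_per_ohmS' could be split into two column names and an operation. The helper `combine_from_literal` is hypothetical; the output column suffix (`_div_` for `_per_`) is assumed from the attribute names shown in the Examples section.

```python
import operator

import pandas as pd

# token in the literal string -> (suffix of the new column name, function)
_LITERAL_OPS = {
    "per": ("div", operator.truediv),
    "mul": ("mul", operator.mul),
    "sub": ("sub", operator.sub),
    "mod": ("mod", operator.mod),
    "add": ("add", operator.add),
}


def combine_from_literal(df, literal):
    """Return a copy of df with one column computed from `literal` (hypothetical)."""
    for token, (suffix, func) in _LITERAL_OPS.items():
        left, sep, right = literal.partition(f"_{token}_")
        # Accept the split only if both sides are existing columns.
        if sep and left in df.columns and right in df.columns:
            out = df.copy()
            out[f"{left}_{suffix}_{right}"] = func(df[left], df[right])
            return out
    raise ValueError(f"cannot parse literal operator {literal!r}")


df = pd.DataFrame({"lwi": [1.0, 4.0], "ohmS": [2.0, 8.0]})
out = combine_from_literal(df, "lwi_per_ohmS")
print(out.columns.tolist())
```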

Examples

>>> import pandas as pd
>>> from watex.transformers import CombinedAttributesAdder
>>> from watex.datasets.dload import load_bagoue
>>> X, y = load_bagoue (as_frame =True )
>>> cobj = CombinedAttributesAdder (attribute_names='lwi_per_ohmS')
>>> Xadded = cobj.fit_transform(X)
>>> cobj.attribute_names_
... ['num',
     'name',
     'east',
     'north',
     'power',
     'magnitude',
     'shape',
     'type',
     'sfi',
     'ohmS',
     'lwi',
     'geol',
     'lwi_div_ohmS'] # new attributes with 'lwi'/'ohmS'
>>> df0 = pd.DataFrame (Xadded, columns = cobj.attribute_names_)
>>> df0['lwi_div_ohmS']
... 0           0.0
    1      0.000002
    2      0.000005
    3      0.000004
    4      0.000008
             ...
    426    0.453359
    427    0.382985
    428    0.476676
    429    0.457371
    430    0.379429
    Name: lwi_div_ohmS, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_names=['lwi', 'ohmS', 'power'], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0      1777.165142
    1      1207.551531
    2         850.5625
    3      1051.943553
    4       844.095833
              ...
    426      1708.8585
    427      1705.5375
    428      1568.9825
    429     1570.15625
    430      1666.9185
    Name: lwi_add_ohmS_add_power, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_indexes =[1,6], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0        b1W
    1        b2V
    2        b3V
    3        b4W
    4        b5W
           ...
    426    b427W
    427    b428V
    428    b429V
    429    b430V
    430    b431V
    Name: name_add_shape, Length: 431, dtype: object

fit(X, y=None)[source]#
Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

CombinedAttributesAdder instance

transform(X)[source]#

Transform X and return a new array with the attribute combination applied.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

X (ndarray, M x N+1 matrix) – X transformed, with M samples and N+1 features, including the combined attribute.

New in version 0.1.3.

class watex.transformers.DataFrameSelector(attribute_names=None, select_type=None)[source]#

Bases: BaseEstimator, TransformerMixin

Select data from specific attributes for column transformer.

Select only numerical or categorical columns for operations. Works similarly to scikit-learn's make_column_transformer.

Parameters:
  • *attribute_names* (list or array_like) – List of the main columns whose data is kept.

  • *select_type* (str) – Automatic numerical or categorical selector. If select_type is num, only the numerical values in the DataFrame are retrieved; if cat, only the categorical attributes.

Returns:

X – New array composed of the data of the selected attribute_names.

Return type:

ndarray
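The selection behaviour can be approximated with plain pandas. This is a sketch under assumptions, not the watex code: the helper `select_columns` is hypothetical, explicit attribute names are assumed to take precedence over select_type, and "categorical" is taken to mean every non-numeric column.

```python
import pandas as pd


def select_columns(df, attribute_names=None, select_type=None):
    """Return the selected columns as an ndarray (hypothetical helper)."""
    if attribute_names is not None:
        return df[list(attribute_names)].to_numpy()
    if select_type == "num":
        return df.select_dtypes(include="number").to_numpy()
    if select_type == "cat":
        return df.select_dtypes(exclude="number").to_numpy()
    # No selection requested: return everything.
    return df.to_numpy()


df = pd.DataFrame(
    {"power": [1.0, 2.0], "geol": ["granite", "schist"], "sfi": [0.1, 0.2]}
)
num_values = select_columns(df, select_type="num")
cat_values = select_columns(df, select_type="cat")
named_values = select_columns(df, attribute_names=["power", "sfi"])
```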

Examples

>>> from watex.transformers import DataFrameSelector
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> XObj = DataFrameSelector(attribute_names=['power','magnitude','sfi'],
...                          select_type=None)
>>> cdf = XObj.fit_transform(df)
fit(X, y=None)[source]#
transform(X, y=None)[source]#

Transform data and return numerical or categorical values.

class watex.transformers.FrameUnion(num_attributes=None, cat_attributes=None, scale=True, imput_data=True, encode=True, param_search='auto', strategy='median', scale_mode='StandardScaler', encode_mode='OrdinalEncoder')[source]#

Bases: BaseEstimator, TransformerMixin

Unify categorical and numerical features after the numerical features are scaled and the categorical features are encoded.

Use the DataFrameSelector class to define the categorical and numerical features.

Parameters:
  • num_attributes (list) – List of numerical attributes.

  • cat_attributes (list) – List of categorical attributes.

  • scale (bool) – Feature scaling. Default is True and uses sklearn.preprocessing.StandardScaler.

  • imput_data (bool) – Replace the missing data. Default is True and uses strategy.

  • param_search (bool or str) – If num_attributes and cat_attributes are None, the numerical and categorical features are found automatically. Default is 'auto'.

  • scale_mode (str) – Mode of data scaling. Default is StandardScaler but can be MinMaxScaler.

  • encode_mode (str) – Mode of data encoding. Default is OrdinalEncoder but can be OneHotEncoder, which creates a sparse matrix. With OneHotEncoder, the shape of the transformed X differs from the original shape.
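The combination of imputation, scaling and encoding described by these parameters resembles a standard scikit-learn column pipeline. The sketch below uses only scikit-learn classes and a small made-up DataFrame; it is an approximation of the idea with the documented defaults (median strategy, StandardScaler, OrdinalEncoder), not the FrameUnion implementation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Illustrative data only; column names are borrowed from the examples above.
df = pd.DataFrame(
    {
        "power": [10.0, np.nan, 30.0, 20.0],
        "magnitude": [1.0, 2.0, 3.0, 4.0],
        "geol": ["granite", "schist", "granite", "schist"],
    }
)
num_attributes = ["power", "magnitude"]
cat_attributes = ["geol"]

# Numerical branch: fill missing values with the median, then standardize.
num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Union of both branches: scaled numerical columns, then encoded categories.
union = ColumnTransformer(
    [
        ("num", num_pipeline, num_attributes),
        ("cat", OrdinalEncoder(), cat_attributes),
    ]
)
X = union.fit_transform(df)
print(X.shape)
```

Swapping OrdinalEncoder for OneHotEncoder in this sketch would add one column per category, which is why the output shape changes in that mode.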

Example

>>> from watex.datasets import fetch_data
>>> from watex.transformers import FrameUnion
>>> X_= fetch_data ('Bagoue original').get('data=dfy1')
>>> frameObj = FrameUnion(encode_mode='OneHotEncoder')
>>> X= frameObj.fit_transform(X_)
fit(X)[source]#
transform(X, y=None)[source]#

Transform data and return X numerical and categorial encoded values.

class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Transformer that stratifies the dataset so that the train set and the test set are more representative, especially when the dataset is not large enough.

Parameters:
  • *base_column* (str or int) – Name or index of the base category column used for stratification. If base_column is None, purely random sampling is used.

  • *test_size* (float) – Proportion of the dataset to put in the test set.

  • *random_state* (int) – Seed used to shuffle the instances of the overall dataset. Default is 42.

Usage

If the dataset is not large enough, especially relative to the number of attributes, random sampling runs the risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid a significant bias in the sampling survey. For instance:

>>> from watex.transformers import StratifiedUsingBaseCategory
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
>>> stratifiedObj.fit_transform(X=df)
>>> stats= stratifiedObj.statistics_

Notes

The statistics_ inspection attribute is a good way to compare the test sets generated using purely random sampling and stratified sampling. With stratified sampling, the base_column category proportions are almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
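The effect described here can be reproduced with scikit-learn alone. The sketch below is not the watex transformer: it uses `train_test_split` with its `stratify` argument on a small made-up DataFrame and compares category proportions in the full data and in the stratified test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 80 % 'granite', 20 % 'schist'.
df = pd.DataFrame(
    {"geol": ["granite"] * 80 + ["schist"] * 20, "flow": range(100)}
)

# Stratified split on the base category column.
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["geol"]
)

# Category proportions in the full dataset and in the stratified test set.
full_props = df["geol"].value_counts(normalize=True)
strat_props = test_set["geol"].value_counts(normalize=True)
print(pd.DataFrame({"overall": full_props, "stratified": strat_props}))
```

Dropping the `stratify` argument gives a purely random split, whose test-set proportions generally deviate more from the overall ones.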

fit(X, y=None)[source]#

Fit method; populates the inspection attributes from the hyperparameters.

transform(X, y=None)[source]#

Return the train set and test set using stratified sampling.

If base_column is not given, return the train set and test set using purely random sampling.

class watex.transformers.StratifiedWithCategoryAdder(base_num_feature=None, threshold_operator=1.0, return_train=False, max_category=3, n_splits=1, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Stratified sampling transformer that generates a new category from a numerical attribute and returns the stratified train set and test set.

Parameters:
  • *base_num_feature* (str) – Numerical feature to categorize.

  • *threshold_operator* (float) – Coefficient by which the numerical feature values are divided to normalize the data.

  • *max_category* (int) – Maximum category value; all values greater than it are gathered into this category.

  • *return_train* (bool) – Return the whole stratified train set if set to True. Useful when the dataset is not large enough: it is convenient to train on the whole train set rather than on a small amount of stratified data. Sometimes the stratified subsets are not similar to one another, especially when the dataset is not large enough.

Another way to stratify a dataset is to get insights from the dataset and add a new category as an additional base column. From this new attribute, the data can be stratified after categorizing the numerical feature. Once the data is stratified, the new category is dropped and the stratified train set and test set are returned. For instance:

>>> from watex.transformers import StratifiedWithCategoryAdder
>>> stratifiedNumObj= StratifiedWithCategoryAdder('flow')
>>> stratifiedNumObj.fit_transform(X=df)
>>> stats2 = stratifiedNumObj.statistics_

Usage

In this example, we first categorize the flow attribute using the ceil value (see discretizeCategoriesforStratification()), group the values greater than max_category into the max_category, and put the result in a temporary feature. The categorization is performed from this feature, and the train set and test set are stratified accordingly.

Notes

If base_num_feature is not given, dataset will be stratified using random sampling.
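The ceil-and-cap procedure can be sketched as follows. This is an assumption-laden illustration rather than the watex code: the helper `stratify_on_numeric` and the temporary column name `_tmp_cat` are hypothetical, and scikit-learn's StratifiedShuffleSplit stands in for whatever splitter the class uses internally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def stratify_on_numeric(df, base_num_feature, threshold_operator=1.0,
                        max_category=3, test_size=0.2, random_state=42):
    """Split df after categorizing one numerical column (hypothetical helper)."""
    tmp = df.copy()
    # Normalize, take the ceil value, and cap everything above max_category.
    category = np.ceil(tmp[base_num_feature] / threshold_operator)
    tmp["_tmp_cat"] = category.clip(upper=max_category)
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    train_idx, test_idx = next(splitter.split(tmp, tmp["_tmp_cat"]))
    # Drop the temporary category before returning both sets.
    train = tmp.iloc[train_idx].drop(columns="_tmp_cat")
    test = tmp.iloc[test_idx].drop(columns="_tmp_cat")
    return train, test


# Illustrative data: four flow levels with ten samples each.
df = pd.DataFrame({"flow": [0.0] * 10 + [0.5] * 10 + [2.0] * 10 + [8.0] * 10})
train, test = stratify_on_numeric(df, "flow")
print(test["flow"].value_counts())
```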

fit(X, y=None)[source]#

Fit method

transform(X, y=None)[source]#

Transform data and populate inspections attributes from hyperparameters.