Provides efficient tools for data manipulation and transformation.

class watex.transformers.CombinedAttributesAdder(attribute_names=None, attribute_indexes=None, operator='/')[source]#

Bases: BaseEstimator, TransformerMixin

Combine attributes using literal string operators, indexes or names.

Create a new attribute using feature indexes or a literal string operator. Inherits from the scikit-learn BaseEstimator and TransformerMixin classes.

Parameters:
  • *attribute_names* (list of str, optional) – List of features to combine. The operation applied is set by the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] divides the feature ‘lwi’ by ‘ohmS’.

  • *attribute_indexes* (list of int, optional) – Index of each feature to combine. A user warning is raised if any index does not match the dataframe or array columns.

  • *operator* (str, default ='/') – Type of operation to perform. Can be one of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].

Returns:

X – A new array containing the new data from the attribute_indexes operation. If attribute_names and attribute_indexes are both None, the input array is returned unchanged.

Return type:

np.ndarray

Notes

A literal string operator can be used. For instance, the division of two numerical features can be written with the word “per” surrounded by underscores, as in “_per_”. To create a new feature from the division of the features lwi and ohmS, the literal string operator passed as attribute_names could be:

attribute_names='lwi_per_ohmS'

The same literal string form is valid for multiplication (_mul_), subtraction (_sub_), modulo (_mod_) and addition (_add_). Feature indexes can also be used instead of attribute_names, together with the operator parameter.

For example, the indexes of both features in the array can be given as attributes_ix =[(10, 9)], which means that lwi and ohmS are found at indexes 10 and 9 respectively. Furthermore, multiple operations can be set by putting multiple literal string operators into a list, such as attributes_ix = [ 'power_per_magnitude', 'ohmS_per_lwi'].
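The literal string convention described in these notes can be illustrated with a small standalone sketch. It is not the watex implementation: the names LITERAL_OPS, parse_literal and combine are hypothetical, and plain Python lists stand in for dataframe columns.

```python
import operator

# Hypothetical mapping from the literal tokens named in the notes
# to Python operators (an assumption, not the watex internals).
LITERAL_OPS = {
    "per": operator.truediv,
    "mul": operator.mul,
    "sub": operator.sub,
    "mod": operator.mod,
    "add": operator.add,
}


def parse_literal(expr):
    """Split e.g. 'lwi_per_ohmS' into ('lwi', 'per', 'ohmS')."""
    for token in LITERAL_OPS:
        left, sep, right = expr.partition(f"_{token}_")
        if sep and left and right:
            return left, token, right
    raise ValueError(f"no literal operator found in {expr!r}")


def combine(columns, expr):
    """Apply the parsed operation element-wise to two named columns."""
    left, token, right = parse_literal(expr)
    func = LITERAL_OPS[token]
    return [func(a, b) for a, b in zip(columns[left], columns[right])]


columns = {"lwi": [2.0, 9.0], "ohmS": [4.0, 3.0]}
ratio = combine(columns, "lwi_per_ohmS")   # element-wise lwi / ohmS
total = combine(columns, "lwi_add_ohmS")   # element-wise lwi + ohmS
```

The sketch handles one binary operation per string; chained forms such as the three-feature addition shown in the examples would need an extra loop.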

Examples

>>> import pandas as pd
>>> from watex.transformers import CombinedAttributesAdder
>>> from watex.datasets.dload import load_bagoue
>>> X, y = load_bagoue (as_frame =True )
>>> cobj = CombinedAttributesAdder (attribute_names='lwi_per_ohmS')
>>> Xadded = cobj.fit_transform(X)
>>> cobj.attribute_names_
... ['num',
     'name',
     'east',
     'north',
     'power',
     'magnitude',
     'shape',
     'type',
     'sfi',
     'ohmS',
     'lwi',
     'geol',
     'lwi_div_ohmS'] # new attributes with 'lwi'/'ohmS'
>>> df0 = pd.DataFrame (Xadded, columns = cobj.attribute_names_)
>>> df0['lwi_div_ohmS']
... 0           0.0
    1      0.000002
    2      0.000005
    3      0.000004
    4      0.000008
             ...
    426    0.453359
    427    0.382985
    428    0.476676
    429    0.457371
    430    0.379429
    Name: lwi_div_ohmS, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_names=['lwi', 'ohmS', 'power'], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0      1777.165142
    1      1207.551531
    2         850.5625
    3      1051.943553
    4       844.095833
              ...
    426      1708.8585
    427      1705.5375
    428      1568.9825
    429     1570.15625
    430      1666.9185
    Name: lwi_add_ohmS_add_power, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_indexes =[1,6], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0        b1W
    1        b2V
    2        b3V
    3        b4W
    4        b5W
            ...
    426    b427W
    427    b428V
    428    b429V
    429    b430V
    430    b431V
    Name: name_add_shape, Length: 431, dtype: object

fit(X, y=None)[source]#
Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

CombinedAttributesAdder instance

transform(X)[source]#

Transform X and return a new array with the combined attributes.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

X (ndarray, M x N+1 matrix) – X transformed (M=m-samples, N+1 features) with the combined attribute.

New in version 0.1.3.

class watex.transformers.DataFrameSelector(attribute_names=None, select_type=None)[source]#

Bases: BaseEstimator, TransformerMixin

Select data from specific attributes for column transformer.

Select only numerical or categorical columns for operations. Works like scikit-learn's make_column_transformer.

Parameters:
  • *attribute_names* (list or array_like) – List of the main columns to keep.

  • *select_type* (str) – Automatic numerical or categorical selector. If select_type is num, only the numerical columns of the dataframe are retrieved; if it is cat, only the categorical ones.

Returns:

X – New array composed of the data of the selected attribute_names.

Return type:

ndarray
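The selection logic can be pictured with a small standalone sketch. It is an illustration under stated assumptions, not the watex code: select_columns is a hypothetical helper, rows are plain dictionaries instead of a dataframe, the values 'num' and 'cat' are assumed to be the accepted strings for select_type, and explicit attribute_names are assumed to take precedence over select_type.

```python
from numbers import Number


def select_columns(rows, attribute_names=None, select_type=None):
    """Keep named columns, or pick columns by type ('num' or 'cat')."""
    if attribute_names is None:
        names = list(rows[0])
        # A column counts as numerical when every value in it is a number.
        is_num = {n: all(isinstance(r[n], Number) for r in rows) for n in names}
        if select_type == "num":
            attribute_names = [n for n in names if is_num[n]]
        elif select_type == "cat":
            attribute_names = [n for n in names if not is_num[n]]
        else:
            attribute_names = names
    return [[row[n] for n in attribute_names] for row in rows]


# Illustrative rows only; the values are made up.
rows = [
    {"power": 94.0, "geol": "granite", "sfi": 0.5},
    {"power": 55.0, "geol": "schist", "sfi": 1.2},
]
num_only = select_columns(rows, select_type="num")
cat_only = select_columns(rows, select_type="cat")
named = select_columns(rows, attribute_names=["power", "sfi"])
```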

Examples

>>> from watex.transformers import DataFrameSelector
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> XObj = DataFrameSelector(attribute_names=['power','magnitude','sfi'],
...                          select_type=None)
>>> cdf = XObj.fit_transform(df)
fit(X, y=None)[source]#

Select the Data frame

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

DataFrameSelector instance

transform(X)[source]#

Transform data and return numerical or categorical values.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

class watex.transformers.FrameUnion(num_attributes=None, cat_attributes=None, scale=True, imput_data=True, encode=True, param_search='auto', strategy='median', scale_mode='StandardScaler', encode_mode='OrdinalEncoder')[source]#

Bases: BaseEstimator, TransformerMixin

Unify categorical and numerical features after scaling the numerical features and encoding the categorical ones.

Uses the DataFrameSelector class to define the categorical and numerical features.

Parameters:
  • num_attributes (list) – List of numerical attributes

  • cat_attributes (list) – List of categorical attributes

  • scale (bool) – Feature scaling. Default is True, using sklearn.preprocessing.StandardScaler.

  • imput_data (bool) – Replace missing data. Default is True, using strategy.

  • param_search (bool or str) – If num_attributes and cat_attributes are None, the numerical and categorical features are found automatically. Default is 'auto'.

  • scale_mode (str) – Mode of data scaling. Default is StandardScaler but can be MinMaxScaler.

  • encode_mode (str) – Mode of data encoding. Default is OrdinalEncoder. OneHotEncoder can be used instead, but it creates a sparse matrix. Once selected, the new shape of X differs from the original shape.
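To make the three default steps concrete (median imputation, standard scaling, ordinal encoding), here is a minimal pure-Python sketch. It only mirrors what the scikit-learn defaults named above compute on a toy column; the helper names are hypothetical and FrameUnion's own pipeline may differ in detail.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standard_scale(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ordinal_encode(labels):
    """Map each category to an integer code, in sorted category order."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels]


# Illustrative columns only; the values are made up.
power = [1.0, None, 3.0, 5.0]                        # numerical, one missing
geol = ["granite", "schist", "granite", "gneiss"]    # categorical

power_filled = impute_median(power)
power_scaled = standard_scale(power_filled)
geol_codes = ordinal_encode(geol)

# Join the scaled numerical column with the encoded categorical column.
X = [list(row) for row in zip(power_scaled, geol_codes)]
```

With one-hot encoding instead of ordinal codes, each category would become its own column, which is why the shape of X changes with encode_mode.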

Example

>>> from watex.datasets import fetch_data
>>> from watex.transformers import FrameUnion
>>> X_= fetch_data ('Bagoue original').get('data=dfy1')
>>> frameObj = FrameUnion(encode_mode='OneHotEncoder')
>>> X= frameObj.fit_transform(X_)
fit(X, y=None)[source]#

Does nothing. Exists only for scikit-learn API compatibility.

transform(X)[source]#

Transform data and return X with the numerical values and the encoded categorical values.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

Returns:

X – transformed arraylike, where n_samples is the number of samples and n_features is the number of features.

Return type:

{array-like, sparse matrix} of shape (n_samples, n_features)

class watex.transformers.KMeansFeaturizer(n_clusters=7, target_scale=5.0, random_state=None, n_components=None)[source]#

Bases: object

Transforms numeric data into k-means cluster memberships.

This transformer runs k-means on the input data and converts each data point into the ID of the closest cluster. If a target variable is present, it is scaled and included as input to k-means in order to derive clusters that obey the classification boundary as well as group similar points together.
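The effect of the scaled target can be seen in a small sketch that skips the k-means fitting itself. It only shows the two surrounding steps: appending the scaled target as an extra input column, and labelling points by their nearest centroid. The helper names and the centroid values are hypothetical.

```python
import math


def augment(X, y, target_scale):
    """Append the scaled target as an extra column, as input to k-means."""
    return [row + [target_scale * label] for row, label in zip(X, y)]


def closest_cluster(point, centers):
    """Return the index of the nearest center (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


X = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]]
y = [0, 1, 1]

no_hint = augment(X, y, target_scale=0)     # target column is all zeros
with_hint = augment(X, y, target_scale=10)  # target dominates the distance

# Distance between the first two samples, which belong to different classes.
d_no_hint = math.dist(no_hint[0], no_hint[1])
d_with_hint = math.dist(with_hint[0], with_hint[1])

# Hypothetical centroids (feature dimensions only) used to label points.
centers = [[0.0, 0.0], [3.0, 0.0]]
labels = [closest_cluster(row, centers) for row in X]
```

With target_scale=0 the two neighbouring samples are almost indistinguishable to k-means, while a large target_scale pushes samples of different classes apart, so the fitted clusters tend to follow the class boundary.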

Parameters:
  • n_clusters (int, default=7) – Number of initial clusters

  • target_scale (float, default=5.0) – Scaling applied to the target before it is included in the input data to k-means.

  • n_components (int, optional) – Number of components to which the predictor is reduced. PCA is used to reduce the dimensionality to the most important components.

  • random_state (int, Optional) – State for shuffling the data

km_model#

KMeans featurization model used to transform the data.

Examples

>>> # (1) Use a common dataset
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import make_moons
>>> from watex.utils.plotutils import plot_voronoi
>>> from watex.datasets import load_mxs
>>> from watex.transformers import KMeansFeaturizer
>>> X, y = make_moons(n_samples=5000, noise=0.2)
>>> kmf_hint = KMeansFeaturizer(n_clusters=50, target_scale=10).fit(X,y)
>>> kmf_no_hint = KMeansFeaturizer(n_clusters=50, target_scale=0).fit(X, y)
>>> fig, ax = plt.subplots(2,1, figsize =(7, 7))
>>> plot_voronoi ( X, y ,cluster_centers=kmf_hint.cluster_centers_,
                  fig_title ='KMeans with hint', ax = ax [0] )
>>> plot_voronoi ( X, y ,cluster_centers=kmf_no_hint.cluster_centers_,
                  fig_title ='KMeans No hint' , ax = ax[1])
<AxesSubplot:title={'center':'KMeans No hint'}>
>>> # (2)  Use a concrete data set
>>> X, y = load_mxs ( return_X_y =True, key ='numeric' )
>>> # get the most principal components
>>> from watex.analysis import nPCA
>>> Xpca =nPCA (X, n_components = 2  ) # Voronoi plot expects two-dimensional data
>>> kmf_hint = KMeansFeaturizer(n_clusters=7, target_scale=10).fit(Xpca,y)
>>> kmf_no_hint = KMeansFeaturizer(n_clusters=7, target_scale=0).fit(Xpca, y)
>>> fig, ax = plt.subplots(2,1, figsize =(7, 7))
>>> plot_voronoi ( Xpca, y ,cluster_centers=kmf_hint.cluster_centers_,
                  fig_title ='KMeans with hint', ax = ax [0] )
>>> plot_voronoi ( Xpca, y ,cluster_centers=kmf_no_hint.cluster_centers_,
                  fig_title ='KMeans No hint' , ax = ax[1])
fit(X, y=None)[source]#

Runs k-means on the input data and finds the updated centroids.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

Returns:

Fitted estimator.

Return type:

self

fit_transform(X, y=None)[source]#

Fit and transform the data.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

ndarray of shape (n_samples,)

transform(X)[source]#

Outputs the closest cluster ID for each input data point.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

ndarray of shape (n_samples,)

class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Transformer that stratifies a dataset so that the train set and the test set are more representative, especially when the dataset is not large enough.

Parameters:
  • base_column (str or int) – Hyperparameter. Name or index of the base category column used for stratification. If base_column is None, purely random sampling is used.

  • test_size (float) – Proportion of the dataset to put in the test set.

  • random_state (int) – Seed used to shuffle the instances of the overall dataset. Default is 42.

Usage

If the dataset is not large enough, especially relative to the number of attributes, there is a risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid a significant sampling bias. For instance:

>>> from watex.transformers import StratifiedUsingBaseCategory
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
>>> stratifiedObj.fit_transform(X=df)
>>> stats = stratifiedObj.statistics_
    

Notes

The statistics_ inspection attribute is a good way to compare the test sets generated using purely random sampling and stratified sampling. With stratified sampling, the base_column category proportions are almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
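The proportion-preserving behaviour can be checked on toy data with a standalone sketch. It is a simplified stand-in written with the standard library only, not the watex or scikit-learn implementation; stratified_split and proportions are hypothetical helpers, and the category counts are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, base_column, test_size=0.2, random_state=42):
    """Sample test_size of each category so proportions are preserved."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row in rows:
        groups[row[base_column]].append(row)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def proportions(rows, column):
    """Share of each category in rows."""
    counts = Counter(row[column] for row in rows)
    return {k: v / len(rows) for k, v in counts.items()}


# Toy data: 80 'granite' rows and 20 'schist' rows (illustrative only).
rows = [{"geol": "granite"}] * 80 + [{"geol": "schist"}] * 20
train, test = stratified_split(rows, "geol")

full_props = proportions(rows, "geol")
test_props = proportions(test, "geol")
```

Because each category is sampled separately, the test set keeps the same category shares as the full dataset, which a purely random draw of 20 rows would not guarantee.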

fit(X, y=None)[source]#

Does nothing. Exists only for API compatibility.

transform(X, y=None)[source]#

Split dataset into trainset and testset using stratified sampling.

If base_column is not given, the train set and test set are returned using purely random sampling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) –

Returns:

strat_train_set, strat_test_set – Stratified train set and test set.

Return type:

NDArray, ( n_samples , n_features)

class watex.transformers.StratifiedWithCategoryAdder(base_num_feature=None, threshold_operator=1.0, return_train=False, max_category=3, n_splits=1, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Stratified sampling transformer based on a new category generated from a numerical attribute. Returns the stratified train set and test set.

Parameters:
  • base_num_feature (str,) – Numerical features to categorize.

  • threshold_operator (float) – Coefficient by which the numerical feature values are divided to normalize the data.

  • max_category (int) – Maximum category value. All values greater than it are gathered into this category.

  • return_train (bool) – Return the whole stratified train set if set to True. Useful when the dataset is small: it is then more convenient to train on the whole train set rather than on a small amount of stratified data. With a small dataset, the stratified subsets are not always similar to one another.

Another way to stratify a dataset is to get insights from the dataset and add a new category as an additional attribute. From this new attribute, the data can be stratified after categorizing the numerical features. Once the data is stratified, the new category is dropped and the stratified train set and test set are returned. For instance:

>>> from watex.transformers import StratifiedWithCategoryAdder
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedNumObj = StratifiedWithCategoryAdder('flow')
>>> stratifiedNumObj.fit_transform(X=df)
>>> stats2 = stratifiedNumObj.statistics_

Usage

In this example, the flow attribute is first categorized using the ceil value (see discretizeCategoriesforStratification()), and values greater than max_category are grouped into max_category and stored in a temporary feature. The categorization is performed from this feature, and the train set and the test set are stratified accordingly.

Notes

If base_num_feature is not given, dataset will be stratified using random sampling.
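The categorization step (divide by threshold_operator, take the ceiling, merge everything above max_category) can be sketched as follows. This is an illustration of the rule as described here, not the body of discretizeCategoriesforStratification(); to_category is a hypothetical helper and the flow values are made up.

```python
import math


def to_category(values, threshold_operator=1.0, max_category=3):
    """Ceil of value / threshold, with larger categories merged into max_category."""
    return [
        min(math.ceil(v / threshold_operator), max_category)
        for v in values
    ]


flow = [0.0, 0.8, 1.5, 2.2, 6.0, 12.0]   # illustrative flow values
categories = to_category(flow, threshold_operator=1.0, max_category=3)
# categories == [0, 1, 2, 3, 3, 3]
```

The resulting temporary category column can then serve as the base column for a stratified split and be dropped afterwards.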

fit(X, y=None)[source]#

Does nothing. Exists only for scikit-learn API compatibility.

transform(X, y=None)[source]#
Transform data and populate inspection attributes from hyperparameters.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.

  • y (Ignored) – Keep just for API purpose.

Returns:

X – New data transformed.

Return type:

{array-like, sparse matrix} of shape (n_samples, n_features)

watex.transformers.featurize_X(X, y=None, *, n_clusters=7, target_scale=5, random_state=None, n_components=None, model=None, split_X_y=False, test_ratio=0.2, shuffle=True, return_model=Ellipsis, to_sparse=Ellipsis, sparsity='coo')[source]#

Featurize X with cluster features based on the KMeans featurization.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features. Note that when n_components is set, a sparse matrix is not accepted for X.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

  • n_clusters (int, default=7) – Number of initial clusters.

  • target_scale (float, default=5.0) – Scaling applied to the target y before it is included in the input data to k-means.

  • n_components (int, optional) – Number of components to which the predictor X is reduced. PCA is used to reduce the dimensionality to the most important features.

  • model (KMeansFeaturizer) – KMeansFeaturizer model. A model can be provided to featurize test data separately from the train data. New in version 0.2.4.

  • random_state (int, optional) – State for shuffling the data.

  • split_X_y (bool, default=False) – Split X and y into train data and test data according to the test ratio.

  • test_ratio (float, default=0.2) – Ratio of the data to keep as test data.

  • shuffle (bool, default=True) – Shuffle the dataset.

  • return_model (bool, default=False) – If True, return the KMeans featurization model along with the transformed X.

  • to_sparse (bool, default=False) – Convert X to a sparse matrix. By default the sparse matrix is a coordinate matrix (COO).

  • sparsity (str, default='coo') – Kind of sparse matrix used to convert X. Can be ['csr' | 'coo']. Any other value returns a coordinate matrix unless to_sparse is set to False. New in version 0.2.4.

Returns:
  • X, y (NDArray of shape (m_samples, n_features + 1) or (m_samples, n_sparse_features)) – Array of n_features plus the cluster feature from the KMF featurization procedure. n_sparse_features is created if to_sparse is set to True.

  • X, y, model (NDArray and KMF model) – Transformed arrays X and y, and the model, if return_model is set to True.

  • X, Xtest, y, ytest (NDArray (KMF), ArrayLike) – Transformed train data and test data. This split tuple is returned when split_X_y=True.

Examples

>>> import numpy as np
>>> from watex.transformers import featurize_X
>>> X = np.random.randn (12 , 7 ) ; y = np.arange(12 )
>>> y[ y < 6 ]= 0 ; y [y >0 ]= 1  # for binary data
>>> Xtransf , _ = featurize_X (X, to_sparse =False)
>>> X.shape, Xtransf.shape
((12, 7), (12, 8))
>>> Xtransf, y  = featurize_X (X,y,  to_sparse =True )
>>> Xtransf , y
(<12x8 sparse matrix of type '<class 'numpy.float64'>'
    with 93 stored elements in COOrdinate format>,
 array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]))
>>> featurize_X (X,y,  to_sparse =True, split_X_y=True  )
(<9x8 sparse matrix of type '<class 'numpy.float64'>'
    with 71 stored elements in COOrdinate format>,
 <3x8 sparse matrix of type '<class 'numpy.float64'>'
    with 24 stored elements in COOrdinate format>,
 array([0, 1, 1, 0, 0, 0, 0, 1, 1]),
 array([0, 1, 1]))
>>> *_, kmf_model = featurize_X (X,y,  to_sparse =True, return_model =True)
>>> kmf_model
<'KMeansFeaturizer':n_clusters=7, target_scale=5, random_state=None,
n_components=None>