<no title> — watex 0.3.0.dev1+g10707dc.d20230709 documentation

Provides efficient tools for data manipulation and transformation.

class watex.transformers.CombinedAttributesAdder(attribute_names=None, attribute_indexes=None, operator='/')[source]#

Bases: BaseEstimator, TransformerMixin

Combine attributes using literal string operators, indexes, or names.

Create a new attribute from feature indexes or a literal string operator. Inherits from the scikit-learn BaseEstimator and TransformerMixin classes.

Parameters

*attribute_names* (list of str, optional) – List of features to combine. How the new feature values are computed is set by the operator parameter. By default, the combination is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] divides the feature ‘lwi’ by ‘ohmS’.
*attribute_indexes* (list of int) – Index of each feature to combine. A user warning is raised if any index does not match the dataframe or array columns.
*operator* (str, default='/') – Type of operation to perform. Can be one of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].

Returns

X – A new array containing the new data from the attribute_indexes operation. If attribute_names and attribute_indexes are both None, the input array is returned unchanged.

Return type

np.ndarray

Notes

A literal string operator can be used. For instance, dividing two numerical features can be expressed with the word “per” set off by underscores, as in “_per_”. To create a new feature by dividing the feature lwi by ohmS, the literal string passed to attribute_names could be:

attribute_names='lwi_per_ohmS'

The same literal string form is valid for multiplication (_mul_), subtraction (_sub_), modulo (_mod_) and addition (_add_). Alternatively, feature indexes can be used instead of attribute_names, together with the operator parameter.

For example, attribute_indexes=[10, 9] means that lwi and ohmS are found at indexes 10 and 9 respectively. Furthermore, multiple operations can be set by passing several literal string operators in a list, like attribute_names=['power_per_magnitude', 'ohmS_per_lwi'].
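As a rough illustration of the literal string convention described in these notes, the sketch below splits a string such as 'lwi_per_ohmS' into feature names and an operator symbol. The helper name parse_literal_operator and the lookup table are hypothetical and are not part of watex; the actual parsing in the library may differ.

```python
import re

# Literal tokens named in the notes, mapped to their operator symbols.
LITERAL_TO_OPERATOR = {"per": "/", "mul": "*", "sub": "-", "mod": "%", "add": "+"}


def parse_literal_operator(literal):
    """Split a string such as 'lwi_per_ohmS' into feature names and an operator.

    Hypothetical helper for illustration only; not part of watex.
    """
    pattern = r"_(%s)_" % "|".join(LITERAL_TO_OPERATOR)
    tokens = re.split(pattern, literal)
    names = tokens[0::2]        # feature names sit at even positions
    found = set(tokens[1::2])   # operator tokens sit at odd positions
    if len(found) != 1:
        raise ValueError(f"expected exactly one kind of operator in {literal!r}")
    return names, LITERAL_TO_OPERATOR[found.pop()]


print(parse_literal_operator("lwi_per_ohmS"))  # (['lwi', 'ohmS'], '/')
```

A string without any recognized token raises a ValueError here; whether watex behaves the same way is not stated in this page.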

Examples

>>> import pandas as pd
>>> from watex.transformers import CombinedAttributesAdder
>>> from watex.datasets.dload import load_bagoue
>>> X, y = load_bagoue (as_frame =True )
>>> cobj = CombinedAttributesAdder (attribute_names='lwi_per_ohmS')
>>> Xadded = cobj.fit_transform(X)
>>> cobj.attribute_names_
... ['num',
     'name',
     'east',
     'north',
     'power',
     'magnitude',
     'shape',
     'type',
     'sfi',
     'ohmS',
     'lwi',
     'geol',
     'lwi_div_ohmS'] # new attributes with 'lwi'/'ohmS'
>>> df0 = pd.DataFrame (Xadded, columns = cobj.attribute_names_)
>>> df0['lwi_div_ohmS']
... 0           0.0
    1      0.000002
    2      0.000005
    3      0.000004
    4      0.000008
             ...
    426    0.453359
    427    0.382985
    428    0.476676
    429    0.457371
    430    0.379429
    Name: lwi_div_ohmS, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_names=['lwi', 'ohmS', 'power'], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0      1777.165142
    1      1207.551531
    2         850.5625
    3      1051.943553
    4       844.095833
              ...
    426      1708.8585
    427      1705.5375
    428      1568.9825
    429     1570.15625
    430      1666.9185
    Name: lwi_add_ohmS_add_power, Length: 431, dtype: object

>>> cobj = CombinedAttributesAdder (
    attribute_indexes =[1,6], operator='+')
>>> df0 = pd.DataFrame (cobj.fit_transform(X),
                        columns = cobj.attribute_names_)
>>> df0.iloc [:, -1]
... 0        b1W
    1        b2V
    2        b3V
    3        b4W
    4        b5W
           ...
    426    b427W
    427    b428V
    428    b429V
    429    b430V
    430    b431V
    Name: name_add_shape, Length: 431, dtype: object

fit(X, y=None)[source]#

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns

self – returns self for easy method chaining.

Return type

CombinedAttributesAdder instance

transform(X)[source]#

Transform X and return a new array with the combined attributes.

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns

X (NDArray, M x (N+1) matrix) – The transformed X (M=m-samples, N+1 features) with the combined attribute appended.

New in version 0.1.3.
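To make the "N+1 features" result concrete, here is a minimal pure-Python sketch that combines the columns at the given indexes with an operator and appends the result as a new last column. It is a sketch under assumptions, not the library's implementation: the function add_combined_column is hypothetical, and rows are plain lists rather than an ndarray.

```python
import operator
from functools import reduce

# The five operators listed for the `operator` parameter.
OPERATORS = {"/": operator.truediv, "+": operator.add, "-": operator.sub,
             "*": operator.mul, "%": operator.mod}


def add_combined_column(rows, indexes, op="/"):
    """Return a copy of rows with one extra column combining the given indexes.

    Hypothetical helper for illustration only; not part of watex.
    """
    func = OPERATORS[op]
    return [list(row) + [reduce(func, (row[i] for i in indexes))] for row in rows]


X = [[2.0, 4.0, 1.0],
     [9.0, 3.0, 5.0]]   # toy data: 2 samples, 3 features
X_new = add_combined_column(X, indexes=[0, 1], op="/")
print(X_new)  # [[2.0, 4.0, 1.0, 0.5], [9.0, 3.0, 5.0, 3.0]]
```

With op="+" the same reduction concatenates string columns, which matches the name_add_shape output shown in the class examples (for instance 'b1' + 'W' gives 'b1W').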

class watex.transformers.DataFrameSelector(attribute_names=None, select_type=None)[source]#

Bases: BaseEstimator, TransformerMixin

Select data from specific attributes for column transformer.

Select only numerical or categorical columns for operations. Works similarly to scikit-learn's make_column_transformer.

Parameters

*attribute_names* (list or array_like) – List of the main columns whose data are kept.
*select_type* (str) – Automatic numerical or categorical selector. If select_type is num, only the numerical values in the dataframe are retrieved; use cat for the categorical attributes.

Returns

X – New array composed of the data of the selected attribute_names.

Return type

ndarray
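The selection rule can be pictured with a small pure-Python sketch over a dict of columns. Everything here is an assumption made for illustration: select_columns is a hypothetical stand-in, the precedence of attribute_names over select_type is a guess, and the toy table does not come from any watex dataset.

```python
from numbers import Number


def select_columns(table, attribute_names=None, select_type=None):
    """Pick columns from a dict mapping column name -> list of values.

    Hypothetical stand-in for the selection logic; not part of watex.
    Assumes explicit attribute_names take precedence over select_type.
    """
    if attribute_names is not None:
        return {name: table[name] for name in attribute_names}

    def is_numeric(values):
        # bool is a Number subclass, so exclude it explicitly
        return all(isinstance(v, Number) and not isinstance(v, bool) for v in values)

    if select_type == "num":
        return {k: v for k, v in table.items() if is_numeric(v)}
    if select_type == "cat":
        return {k: v for k, v in table.items() if not is_numeric(v)}
    return dict(table)


# Toy table; the values are made up.
table = {"power": [10.0, 20.0], "magnitude": [1, 2], "geol": ["A", "B"]}
print(list(select_columns(table, select_type="num")))  # ['power', 'magnitude']
print(list(select_columns(table, select_type="cat")))  # ['geol']
```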

Examples

>>> from watex.transformers import DataFrameSelector
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> XObj = DataFrameSelector(attribute_names=['power','magnitude','sfi'],
...                          select_type=None)
>>> cdf = XObj.fit_transform(df)

fit(X, y=None)[source]#

Select the Data frame

Parameters

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns

self – returns self for easy method chaining.

Return type

DataFrameSelector instance

transform(X)[source]#

Transform data and return numerical or categorical values.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

class watex.transformers.FrameUnion(num_attributes=None, cat_attributes=None, scale=True, imput_data=True, encode=True, param_search='auto', strategy='median', scale_mode='StandardScaler', encode_mode='OrdinalEncoder')[source]#

Bases: BaseEstimator, TransformerMixin

Unify categorical and numerical features after scaling the numerical features and encoding the categorical ones.

Uses the DataFrameSelector class to define the categorical and numerical features.

Parameters

num_attributes (list) – List of numerical attributes.
cat_attributes (list) – List of categorical attributes.
scale (bool) – Feature scaling. Default is True, using sklearn.preprocessing.StandardScaler.
imput_data (bool) – Replace the missing data. Default is True, using the imputation strategy set by strategy.
param_search (bool or 'auto') – If num_attributes and cat_attributes are None, the numerical and categorical features are found automatically. Default is 'auto'.
scale_mode (str) – Mode of data scaling. Default is StandardScaler but can be MinMaxScaler.
encode_mode (str) – Mode of data encoding. Default is OrdinalEncoder; OneHotEncoder can also be used, but it creates a sparse matrix. With OneHotEncoder, the shape of the transformed X differs from the original shape.
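The three default steps named by these parameters (median imputation, standard scaling, ordinal encoding) can be sketched in pure Python as follows. This is only an outline of the general techniques, using toy columns and hypothetical helper names; FrameUnion itself is documented here as relying on scikit-learn's scalers and encoders, and its internal order of operations is not stated on this page.

```python
import statistics


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in column if v is not None)
    return [median if v is None else v for v in column]


def standard_scale(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumes the column is not constant
    return [(v - mean) / std for v in column]


def ordinal_encode(column):
    """Map each category to its rank in the sorted list of categories."""
    codes = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]


num = [1.0, None, 3.0, 5.0]   # toy numerical feature with a missing value
cat = ["b", "a", "b", "c"]    # toy categorical feature

scaled = standard_scale(impute_median(num))
encoded = ordinal_encode(cat)
print(encoded)  # [1, 0, 1, 2]
```

Ordinal encoding keeps one column per categorical feature, whereas one-hot encoding would add one column per category, which is why the output shape changes in that mode.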

Example

>>> from watex.datasets import fetch_data
>>> from watex.transformers import FrameUnion
>>> X_ = fetch_data('Bagoue original').get('data=dfy1')
>>> frameObj = FrameUnion(encode_mode='OneHotEncoder')
>>> X= frameObj.fit_transform(X_)

fit(X, y=None)[source]#: Does nothing. Kept for scikit-learn API compatibility.

transform(X)[source]#

Transform data and return X with the numerical and categorical values encoded.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns: X – transformed arraylike, where n_samples is the number of samples and n_features is the number of features.
Return type: {array-like, sparse matrix} of shape (n_samples, n_features)

class watex.transformers.KMeansFeaturizer(n_clusters=7, target_scale=5.0, random_state=None, n_components=None)[source]#

Bases: object

Transforms numeric data into k-means cluster memberships.

This transformer runs k-means on the input data and converts each data point into the ID of the closest cluster. If a target variable is present, it is scaled and included as input to k-means in order to derive clusters that obey the classification boundary as well as group similar points together.
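To see why including the scaled target changes the clustering, consider the following minimal sketch. It assumes the hint is implemented by appending y * target_scale to each sample as one extra coordinate before running k-means, which is a common way to realize this idea but is not confirmed by this page; with_target_hint and squared_distance are hypothetical helpers.

```python
def with_target_hint(X, y, target_scale=5.0):
    """Append the scaled target to every sample as one extra coordinate."""
    return [list(row) + [label * target_scale] for row, label in zip(X, y)]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


X = [[0.0, 0.0], [0.1, 0.0]]   # two nearly identical points...
y = [0, 1]                      # ...that belong to different classes

plain = squared_distance(X[0], X[1])
hinted = squared_distance(*with_target_hint(X, y, target_scale=10))
print(plain, hinted)  # the hinted distance is far larger than the plain one
```

A larger target_scale pushes samples of different classes further apart in the augmented space, so the clusters tend to respect the class boundary; target_scale=0 removes the hint entirely.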

Parameters

n_clusters (int, default=7) – Number of initial clusters
target_scale (float, default=5.0) – Scaling applied to the target before it is included in the input data to k-means.
n_components (int, optional) – Number of components to which the predictor is reduced. PCA is used to reduce the dimension down to the most important components.
random_state (int, Optional) – State for shuffling the data

km_model#

KMeans featurization model used to transform the data.

Examples

>>> # (1) Use a common dataset
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import make_moons
>>> from watex.utils.plotutils import plot_voronoi
>>> from watex.datasets import load_mxs
>>> from watex.transformers import KMeansFeaturizer
>>> X, y = make_moons(n_samples=5000, noise=0.2)
>>> kmf_hint = KMeansFeaturizer(n_clusters=50, target_scale=10).fit(X,y)
>>> kmf_no_hint = KMeansFeaturizer(n_clusters=50, target_scale=0).fit(X, y)
>>> fig, ax = plt.subplots(2,1, figsize =(7, 7))
>>> plot_voronoi ( X, y ,cluster_centers=kmf_hint.cluster_centers_,
                  fig_title ='KMeans with hint', ax = ax [0] )
>>> plot_voronoi ( X, y ,cluster_centers=kmf_no_hint.cluster_centers_,
                  fig_title ='KMeans No hint' , ax = ax[1])
<AxesSubplot:title={'center':'KMeans No hint'}>
>>> # (2)  Use a concrete data set
>>> X, y = load_mxs ( return_X_y =True, key ='numeric' )
>>> # get the most principal components
>>> from watex.analysis import nPCA
>>> Xpca = nPCA(X, n_components=2)  # the Voronoi plot expects two-dimensional data
>>> kmf_hint = KMeansFeaturizer(n_clusters=7, target_scale=10).fit(Xpca,y)
>>> kmf_no_hint = KMeansFeaturizer(n_clusters=7, target_scale=0).fit(Xpca, y)
>>> fig, ax = plt.subplots(2,1, figsize =(7, 7))
>>> plot_voronoi ( Xpca, y ,cluster_centers=kmf_hint.cluster_centers_,
                  fig_title ='KMeans with hint', ax = ax [0] )
>>> plot_voronoi ( Xpca, y ,cluster_centers=kmf_no_hint.cluster_centers_,
                  fig_title ='KMeans No hint' , ax = ax[1])

fit(X, y=None)[source]#

Runs k-means on the input data and finds the updated centroids.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.

Returns

Fitted estimator.

Return type

self

fit_transform(X, y=None)[source]#

Fit and transform the data.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.

Returns: labels – Index of the cluster each sample belongs to.
Return type: ndarray of shape (n_samples,)

transform(X)[source]#

Outputs the closest cluster ID for each input data point.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
Returns: labels – Index of the cluster each sample belongs to.
Return type: ndarray of shape (n_samples,)
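The "closest cluster ID" step amounts to a nearest-centroid lookup. The sketch below shows that lookup in pure Python with made-up centroids; closest_cluster_ids is a hypothetical helper, and the fitted model would supply its own centroids.

```python
def closest_cluster_ids(X, centroids):
    """Return, for each sample, the index of the nearest centroid."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min(range(len(centroids)), key=lambda k: sq_dist(row, centroids[k]))
            for row in X]


centroids = [[0.0, 0.0], [5.0, 5.0]]            # toy cluster centers
X_new = [[0.5, -0.2], [4.0, 6.0], [2.0, 2.0]]   # toy samples
print(closest_cluster_ids(X_new, centroids))    # [0, 1, 0]
```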

class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Transformer that stratifies a dataset so that the train set and the test set are more representative, especially when the data is not large enough.

Parameters

base_column (str or int) – Name or index of the base category column used for stratification. If base_column is None, purely random sampling is used.
test_size (float) – Proportion of the dataset to put in the test set.
random_state (int) – Seed used to shuffle the instances of the overall dataset. Default is 42.

Usage

If the data is not large enough, especially relative to the number of attributes, purely random sampling runs the risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid a significant bias in the sampling survey. For instance:

>>> from watex.transformers import StratifiedUsingBaseCategory
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
>>> stratifiedObj.fit_transform(X=df)
>>> stats= stratifiedObj.statistics_

Notes

The statistics_ inspection attribute is a good way to compare the test sets generated using purely random and stratified sampling. With stratified sampling, the proportions of the base_column categories are almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
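The proportion-preserving behaviour described in this note can be reproduced with a small pure-Python sketch. It is not the library's code: stratified_split and proportions are hypothetical helpers, the category labels are toy values, and only the column name 'geol' is borrowed from the example above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, test_size=0.2, random_state=42):
    """Split rows so each category of `key` keeps its share in the test set."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_size)  # per-category test count
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def proportions(rows, key):
    """Share of each category among the rows."""
    counts = Counter(row[key] for row in rows)
    return {cat: n / len(rows) for cat, n in counts.items()}


# Toy dataset: 60% category A, 30% B, 10% C.
data = [{"geol": "A"}] * 60 + [{"geol": "B"}] * 30 + [{"geol": "C"}] * 10
train, test = stratified_split(data, key="geol")
print(proportions(data, "geol"))
print(proportions(test, "geol"))
```

On this toy data the test set keeps the 60/30/10 split of the full dataset; a purely random 20-row sample has no such guarantee.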

fit(X, y=None)[source]#: Does nothing; kept for API compatibility.

transform(X, y=None)[source]#

Split dataset into trainset and testset using stratified sampling.

If base_column is not given, the train set and test set are produced by purely random sampling.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) –

Returns

strat_train_set, strat_test_set – train set and test set stratified

Return type

NDArray, ( n_samples , n_features)

class watex.transformers.StratifiedWithCategoryAdder(base_num_feature=None, threshold_operator=1.0, return_train=False, max_category=3, n_splits=1, test_size=0.2, random_state=42)[source]#

Bases: BaseEstimator, TransformerMixin

Stratified sampling transformer that generates a new category from a numerical attribute and returns the stratified train set and test set.

Parameters

base_num_feature (str) – Numerical feature to categorize.
threshold_operator (float) – Coefficient by which the numerical feature values are divided to normalize the data.
max_category (int) – Maximum category value; all values greater than it are gathered into this category.
return_train (bool) – Return the whole stratified train set if set to True. Useful when the dataset is small: it is then more convenient to train on the whole train set rather than on a small amount of stratified data. The stratified subsets are not always similar to one another, especially when the dataset is small.

Usage

Another way to stratify a dataset is to get insights from the dataset and add a new category as an additional attribute. From this new attribute, the data can be stratified after categorizing the numerical features. Once the data is stratified, the new category is dropped and the stratified train set and test set are returned. For instance:

>>> from watex.transformers import StratifiedWithCategoryAdder
>>> stratifiedNumObj = StratifiedWithCategoryAdder('flow')
>>> stratifiedNumObj.fit_transform(X=df)
>>> stats2 = stratifiedNumObj.statistics_

In this example, we first categorize the flow attribute using the ceil value (see discretizeCategoriesforStratification()) and group the values greater than max_category into max_category, storing the result in a temporary feature. The categorization is performed from this feature, and the train set and the test set are then stratified.
Notes

If base_num_feature is not given, dataset will be stratified using random sampling.
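The categorization step (divide by threshold_operator, take the ceiling, then cap at max_category) can be sketched as below. This is a guess at the arithmetic based on the description of this class; discretize is a hypothetical helper, not the library's discretizeCategoriesforStratification(), and the flow values are toy data.

```python
import math


def discretize(values, threshold_operator=1.0, max_category=3):
    """Ceil value / threshold_operator, capping every category at max_category."""
    return [min(math.ceil(v / threshold_operator), max_category) for v in values]


flow = [0.0, 0.8, 1.0, 2.5, 7.0]   # toy values for a numerical 'flow' column
print(discretize(flow))            # [0, 1, 1, 3, 3]
```

The resulting integer categories can then serve as the temporary stratification column, which is dropped once the split is made.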

fit(X, y=None)[source]#: Does nothing; kept for scikit-learn API compatibility.

transform(X, y=None)[source]#

Transform data and populate the inspection attributes from the hyperparameters.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to predict.
y (Ignored) – Keep just for API purpose.

Returns

X – New data transformed.

Return type

{array-like, sparse matrix} of shape (n_samples, n_features)

watex.transformers.featurize_X(X, y=None, *, n_clusters=7, target_scale=5, random_state=None, n_components=None, model=None, split_X_y=False, test_ratio=0.2, shuffle=True, return_model=Ellipsis, to_sparse=Ellipsis, sparsity='coo')[source]#

Featurize X with cluster labels obtained from the KMeans featurization.

Parameters

X: {array-like, sparse matrix} of shape (n_samples, n_features): Training vector, where n_samples is the number of samples and n_features is the number of features. Note that when n_components is set, a sparse matrix is not accepted for X.
y: array-like of shape (n_samples,): Target vector relative to X.
n_clusters: int, default=7: Number of initial clusters.
target_scale: float, default=5.0: Scaling applied to the target before it is included in the input data to k-means.
n_components: int, optional: Number of components to which the predictor X is reduced. PCA is used to reduce the dimension down to the most important features.
model: KMeansFeaturizer: A KMeansFeaturizer model. A model can be provided to featurize the test data separately from the train data. New in version 0.2.4.
random_state: int, optional: State for shuffling the data.
split_X_y: bool, default=False: Split X and y into train data and test data according to the test size.
test_ratio: float, default=0.2: Ratio of the data to keep as test data.
shuffle: bool, default=True: Shuffle the dataset.
return_model: bool, default=False: If True, return the KMeans featurization model along with the transformed X.
to_sparse: bool, default=False: Convert X to a sparse matrix. By default, the sparse matrix is a coordinate (COO) matrix.
sparsity: str, default='coo': Kind of sparse matrix used to convert X. It can be ['csr'|'coo']. Any other value returns a coordinate matrix, unless to_sparse is set to False. New in version 0.2.4.
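For readers unfamiliar with the coordinate (COO) layout mentioned for to_sparse and sparsity, the sketch below shows the idea in pure Python: only nonzero entries are kept, each as a (row, column, value) triplet. The helper to_coo is hypothetical and stands in for a real sparse-matrix library.

```python
def to_coo(dense):
    """Return (rows, cols, data) lists for the nonzero entries of a 2-D list."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:          # zeros are implicit and not stored
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


dense = [[0, 2.5],
         [1.0, 0]]                  # toy 2x2 matrix with two zeros
print(to_coo(dense))                # ([0, 1], [1, 0], [2.5, 1.0])
```

This would be consistent with the examples of this function, where a 12x8 matrix reports fewer than 96 stored elements, presumably because zero-valued entries are left out.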

Returns

X, y: NDArray of shape (m_samples, n_features + 1) or (m_samples, n_sparse_features): The n_features of X plus the cluster feature from the KMF featurization procedure. n_sparse_features is created if to_sparse is set to True.
X, y, model: NDArray and KMF model: The transformed arrays X and y, and the model, if return_model is set to True.
X, Xtest, y, ytest: NDArray (KMF), ArrayLike: The transformed train and test data. This split tuple is returned when split_X_y=True.

Examples

>>> import numpy as np
>>> from watex.transformers import featurize_X
>>> X = np.random.randn (12 , 7 ) ; y = np.arange(12 )
>>> y[ y < 6 ]= 0 ; y [y >0 ]= 1  # for binary data
>>> Xtransf , _ = featurize_X (X, to_sparse =False)
>>> X.shape, Xtransf.shape
((12, 7), (12, 8))
>>> Xtransf, y  = featurize_X (X,y,  to_sparse =True )
>>> Xtransf , y
(<12x8 sparse matrix of type '<class 'numpy.float64'>'
    with 93 stored elements in COOrdinate format>,
 array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]))
>>> featurize_X (X,y,  to_sparse =True, split_X_y=True  )
(<9x8 sparse matrix of type '<class 'numpy.float64'>'
    with 71 stored elements in COOrdinate format>,
 <3x8 sparse matrix of type '<class 'numpy.float64'>'
    with 24 stored elements in COOrdinate format>,
 array([0, 1, 1, 0, 0, 0, 0, 1, 1]),
 array([0, 1, 1]))
>>> *_, kmf_model = featurize_X (X,y,  to_sparse =True, return_model =True)
>>> kmf_model
<'KMeansFeaturizer':n_clusters=7, target_scale=5, random_state=None,
n_components=None>