Provides efficient tools for data manipulation and transformation.
- class watex.transformers.CategorizeFeatures(num_columns_properties=None)
Bases: BaseEstimator, TransformerMixin
Transform numerical features into categorical features and return the transformed array.
- Parameters:
*num_columns_properties* (list) – List composed of the numerical feature names, the list of feature boundaries, and their categorized names.
Notes
Feature values are transformed according to the boundary values. num_columns_properties is composed of:
- feature name or index: for instance 'flow', or the index of flow, e.g. '12'.
- feature boundaries: for instance [0., 1., 3], which by default corresponds to:
  - 0: flow values equal to 0. By default, the first value (here 0) is not treated as a range.
  - 0-1: values between 0 and 1.
  - 1-3: values between 1 and 3.
  - >3: all values greater than 3. By default, values greater than the last boundary are grouped into one category.
  If the default classification is not suitable, create your own ranges, like
  [[0-1], [1-3], 3] (1)
- categorized names: if the boundaries are given as plain values, without explicit ranges as in (1), the number of categorized names must be the size of the feature boundaries + 1. For instance, to replace all numerical values in the column flow:
  - FR0: all flow values equal to 0.
  - FR1: flow between 0 and 1.
  - FR2: flow between 1 and 3.
  - FR3: flow greater than 3.
As you can see, the categorized names ['FR0', 'FR1', 'FR2', 'FR3'] have one more element than the feature boundaries [0., 1., 3].
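The boundary-to-name mapping described in these notes can be sketched with the standard library alone. This is only an illustration, not the CategorizeFeatures implementation: the helper name `categorize` is hypothetical, and the choice to place a value equal to a boundary in the lower category (so that 0 maps to FR0 and 1 to FR1) is an assumption the notes do not state.

```python
from bisect import bisect_left


def categorize(values, boundaries, names):
    """Map each numeric value to a category name using sorted boundaries.

    Hypothetical helper for illustration only (not part of watex).
    """
    if len(names) != len(boundaries) + 1:
        raise ValueError("expected len(names) == len(boundaries) + 1")
    # bisect_left counts the boundaries strictly below the value, so with
    # boundaries [0, 1, 3]: 0 -> 0, (0, 1] -> 1, (1, 3] -> 2, above 3 -> 3.
    return [names[bisect_left(boundaries, v)] for v in values]


# Made-up flow values, chosen to hit every category.
flow = [0.0, 0.5, 1.0, 2.7, 3.0, 8.2]
labels = categorize(flow, [0.0, 1.0, 3.0], ["FR0", "FR1", "FR2", "FR3"])
print(labels)
```

The length check mirrors the rule above: three boundaries need four names. Note that in this sketch a negative value would also fall into the first category.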
- ascertain_mumerical_values(X, y=None)
Retrieve the indexes of the numerical attributes. Returns the dataframe values if X is a dataframe, otherwise the array values.
- transform(X, y=None)
Transform the data and return a new array. fit_transform(), inherited from scikit-learn's TransformerMixin, can be called directly.
- class watex.transformers.CombinedAttributesAdder(attribute_names=None, attribute_indexes=None, operator='/')
Bases: BaseEstimator, TransformerMixin
Combine attributes using literal string operators, indexes or names.
Create a new attribute using feature indexes or a literal string operator. Inherits from the scikit-learn BaseEstimator and TransformerMixin classes.
- Parameters:
*attribute_names* (list of str, optional) – List of features to combine. The combination is determined by the operator parameter. By default, it is the ratio of the given numerical features. For instance, attribute_names=['lwi', 'ohmS'] divides the feature 'lwi' by 'ohmS'.
*attribute_indexes* (list of int) – Indexes of the features to combine. A user warning is raised if any index does not match the dataframe or array columns.
*operator* (str, default='/') – Type of operation to perform. Can be one of ['/', '+', '-', '*', '%'].
- Returns:
X – A new array containing the data produced by the attribute_indexes operation. If attribute_names and attribute_indexes are None, the input array is returned unchanged.
- Return type:
np.ndarray
Notes
A literal string operator can be used. For instance, the division of two numerical features can be expressed with the word "per" surrounded by underscores, like "_per_". To create a new feature based on the division of the features lwi and ohmS, the literal string operator passed as attribute_names could be:
attribute_names='lwi_per_ohmS'
The same literal string form is valid for multiplication (_mul_), subtraction (_sub_), modulo (_mod_) and addition (_add_). Alternatively, feature indexes can be used instead of attribute_names, together with the operator parameter. For example, attribute_indexes=[(10, 9)] means that lwi and ohmS are found at indexes 10 and 9 respectively. Furthermore, multiple operations can be set by putting multiple literal string operators into a list, like attribute_names=['power_per_magnitude', 'ohmS_per_lwi'].
Examples
>>> import pandas as pd
>>> from watex.transformers import CombinedAttributesAdder
>>> from watex.datasets.dload import load_bagoue
>>> X, y = load_bagoue(as_frame=True)
>>> cobj = CombinedAttributesAdder(attribute_names='lwi_per_ohmS')
>>> Xadded = cobj.fit_transform(X)
>>> cobj.attribute_names_  # new attribute with 'lwi'/'ohmS'
['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type', 'sfi', 'ohmS', 'lwi', 'geol', 'lwi_div_ohmS']
>>> df0 = pd.DataFrame(Xadded, columns=cobj.attribute_names_)
>>> df0['lwi_div_ohmS']
0           0.0
1      0.000002
2      0.000005
3      0.000004
4      0.000008
         ...
426    0.453359
427    0.382985
428    0.476676
429    0.457371
430    0.379429
Name: lwi_div_ohmS, Length: 431, dtype: object
>>> cobj = CombinedAttributesAdder(attribute_names=['lwi', 'ohmS', 'power'], operator='+')
>>> df0 = pd.DataFrame(cobj.fit_transform(X), columns=cobj.attribute_names_)
>>> df0.iloc[:, -1]
0      1777.165142
1      1207.551531
2         850.5625
3      1051.943553
4       844.095833
          ...
426      1708.8585
427      1705.5375
428      1568.9825
429     1570.15625
430      1666.9185
Name: lwi_add_ohmS_add_power, Length: 431, dtype: object
>>> cobj = CombinedAttributesAdder(attribute_indexes=[1, 6], operator='+')
>>> df0 = pd.DataFrame(cobj.fit_transform(X), columns=cobj.attribute_names_)
>>> df0.iloc[:, -1]
0        b1W
1        b2V
2        b3V
3        b4W
4        b5W
       ...
426    b427W
427    b428V
428    b429V
429    b430V
430    b431V
Name: name_add_shape, Length: 431, dtype: object
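To make the literal string operators from the notes more concrete, here is a rough standard-library sketch of how a string such as 'lwi_per_ohmS' could be split into two feature names and an operation. It is not the CombinedAttributesAdder code: `OPERATORS`, `parse_literal` and `combine` are hypothetical names, the rows are plain dictionaries rather than a dataframe, and the sample values are made up.

```python
import operator

# Keyword -> function table, following the literal keywords listed in the notes.
OPERATORS = {
    "per": operator.truediv,
    "mul": operator.mul,
    "sub": operator.sub,
    "mod": operator.mod,
    "add": operator.add,
}


def parse_literal(expr):
    """Split e.g. 'lwi_per_ohmS' into ('lwi', 'per', 'ohmS')."""
    for keyword in OPERATORS:
        left, sep, right = expr.partition(f"_{keyword}_")
        if sep and left and right:
            return left, keyword, right
    raise ValueError(f"no literal operator found in {expr!r}")


def combine(rows, expr):
    """Return one combined value per row (rows are dicts of column -> value)."""
    left, keyword, right = parse_literal(expr)
    func = OPERATORS[keyword]
    return [func(row[left], row[right]) for row in rows]


# Illustrative values only; they are not taken from the Bagoue dataset.
rows = [{"lwi": 4.0, "ohmS": 2.0}, {"lwi": 9.0, "ohmS": 3.0}]
ratios = combine(rows, "lwi_per_ohmS")
sums = combine(rows, "lwi_add_ohmS")
print(ratios, sums)
```

One limitation of this naive parsing is that a feature name which itself contains a keyword such as "_per_" would be split at the wrong place, so a real implementation has to resolve that ambiguity somehow.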
- fit(X, y=None)
- Parameters:
X (ndarray, M x N matrix where M=m-samples and N=n-features) – Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M,), M=m-samples) – Training target. Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – Returns self for easy method chaining.
- Return type:
CombinedAttributesAdder instance
- transform(X)
Transform X and return a new array with the new attribute combination.
- Parameters:
X (ndarray, M x N matrix where M=m-samples and N=n-features) – Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
X (ndarray, M x (N+1) matrix) – X transformed (M=m-samples, N+1 features), with the combined attribute added.
New in version 0.1.3.
- class watex.transformers.DataFrameSelector(attribute_names=None, select_type=None)
Bases: BaseEstimator, TransformerMixin
Select data from specific attributes for a column transformer.
Select only numerical or categorical columns for operations. Works similarly to scikit-learn's make_column_transformer.
- Parameters:
*attribute_names* (list or array_like) – List of the main columns whose data is kept.
*select_type* (str) – Automatic numerical or categorical selector. If select_type is num, only the numerical values in the dataframe are retrieved; if cat, only the categorical attributes.
- Returns:
X – New array composed of the data of the selected attribute_names.
- Return type:
ndarray
Examples
>>> from watex.transformers import DataFrameSelector
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> XObj = DataFrameSelector(attribute_names=['power', 'magnitude', 'sfi'],
...                          select_type=None)
>>> cdf = XObj.fit_transform(df)
- class watex.transformers.FrameUnion(num_attributes=None, cat_attributes=None, scale=True, imput_data=True, encode=True, param_search='auto', strategy='median', scale_mode='StandardScaler', encode_mode='OrdinalEncoder')
Bases: BaseEstimator, TransformerMixin
Unify categorical and numerical features after scaling the numerical features and encoding the categorical features.
Use the DataFrameSelector class to define the categorical and numerical features.
- Parameters:
num_attributes (list) – List of numerical attributes.
cat_attributes (list) – List of categorical attributes.
scale (bool) – Feature scaling. Default is True, which uses sklearn.preprocessing.StandardScaler.
imput_data (bool) – Replace the missing data. Default is True, which uses strategy.
param_search (bool or str) – If num_attributes and cat_attributes are None, the numerical and categorical features are found automatically. The signature default is 'auto'.
scale_mode (str) – Mode of data scaling. Default is StandardScaler but can be MinMaxScaler.
encode_mode (str) – Mode of data encoding. Default is OrdinalEncoder but can be OneHotEncoder, which creates a sparse matrix. Once selected, the new shape of X may differ from the original shape.
Example
>>> from watex.datasets import fetch_data
>>> from watex.transformers import FrameUnion
>>> X_ = fetch_data('Bagoue original').get('data=dfy1')
>>> frameObj = FrameUnion(encode_mode='OneHotEncoder')
>>> X = frameObj.fit_transform(X_)
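As a rough picture of what the default options amount to (median imputation, standard scaling, ordinal encoding, then joining the columns), the following standard-library sketch processes one numerical and one categorical column. It is an assumption-laden illustration rather than the FrameUnion implementation: the helper names and the sample values are hypothetical, and the scaling uses the population standard deviation, as sklearn's StandardScaler does.

```python
from statistics import mean, median, pstdev


def impute_and_scale(column):
    """Fill None with the column median, then standardize the column."""
    known = [v for v in column if v is not None]
    fill = median(known)
    filled = [fill if v is None else v for v in column]
    mu, sigma = mean(filled), pstdev(filled)
    # A constant column has sigma == 0; map it to zeros instead of dividing.
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]


def ordinal_encode(column):
    """Replace each category by its rank among the sorted unique categories."""
    codes = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]


# Illustrative values only (not taken from the Bagoue dataset).
power = [10.0, None, 30.0, 20.0]                   # numerical, with a gap
geol = ["granite", "schist", "granite", "gneiss"]  # categorical

# Union: one row per sample, scaled numerical value then encoded category.
union = [list(row) for row in zip(impute_and_scale(power), ordinal_encode(geol))]
print(union)
```

With one-hot encoding instead of ordinal encoding, each category would become its own column, which is why the output shape can differ from the input shape.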
- class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)
Bases: BaseEstimator, TransformerMixin
Transformer that stratifies a dataset so that the train set and the test set are more representative, especially when the data is not large enough.
- Parameters:
*base_column* (str or int) – Name or index of the base category used for stratification. If base_column is None, purely random sampling is used.
*test_size* (float) – Size of the test set.
*random_state* (int) – Number used to shuffle the instances of the overall dataset. Default is 42.
Usage
If the data is not large enough, especially relative to the number of attributes, there is a real risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid such a bias. For instance:
>>> from watex.transformers import StratifiedUsingBaseCategory
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
>>> stratifiedObj.fit_transform(X=df)
>>> stats = stratifiedObj.statistics_
Notes
The statistics_ inspection attribute is a good way to compare the test sets generated using purely random and stratified sampling. The stratified sampling has base_column category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
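The idea behind stratifying on a base category can be sketched without any dependency: split each category separately so that its share in the test set matches its share in the full dataset. The function `stratified_split` and the toy rows below are hypothetical and only illustrate the principle; they are not how StratifiedUsingBaseCategory is implemented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, test_size=0.2, random_state=42):
    """Split rows so each category of `key` keeps its share in the test set."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)  # shuffle within the category only
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy dataset: 80% of the samples are 'granite', 20% are 'schist'.
rows = [{"id": i, "geol": "granite" if i < 80 else "schist"} for i in range(100)]
train, test = stratified_split(rows, key="geol")
test_counts = Counter(row["geol"] for row in test)
print(test_counts)
```

Here the test set keeps the 80/20 ratio of the full dataset by construction, whereas a purely random draw of 20 rows could over- or under-represent the smaller category.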
- class watex.transformers.StratifiedWithCategoryAdder(base_num_feature=None, threshold_operator=1.0, return_train=False, max_category=3, n_splits=1, test_size=0.2, random_state=42)
Bases: BaseEstimator, TransformerMixin
Stratified sampling transformer based on a new category generated from a numerical attribute. Returns the stratified train set and test set.
- Parameters:
*base_num_feature* (str) – Numerical feature to categorize.
*threshold_operator* (float) – The coefficient by which the numerical feature values are divided to normalize the data.
*max_category* (int) – Maximum category; all values greater than it are gathered into this category.
*return_train* (bool) – Return the whole stratified train set if set to True. Useful when the dataset is small: it is then more convenient to train on the whole train set rather than on a small amount of stratified data. The stratified subsets are not always similar to one another, especially when the dataset is small.
Another way to stratify a dataset is to get insights from the dataset and to add a new category as an additional attribute. From this new attribute, the data can be stratified after categorizing the numerical features. Once the data is stratified, the new category is dropped and the stratified train set and test set are returned. For instance:
>>> from watex.transformers import StratifiedWithCategoryAdder
>>> stratifiedNumObj = StratifiedWithCategoryAdder('flow')
>>> stratifiedNumObj.fit_transform(X=df)
>>> stats2 = stratifiedNumObj.statistics_
Usage
In this example, we first categorize the flow attribute using the ceil value (see discretizeCategoriesforStratification()), and group the values greater than the max_category value into max_category, placing the result in a temporary feature. The categorization is performed from this feature, and the train set and the test set are stratified.
Notes
If base_num_feature is not given, dataset will be stratified using random sampling.
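The categorization step described in the usage text (divide by the threshold, take the ceiling, then merge everything above max_category) can be sketched as follows. The function name `discretize` is hypothetical and the flow values are made up; this is a reading of the description above, not the code of discretizeCategoriesforStratification().

```python
import math


def discretize(values, threshold_operator=1.0, max_category=3):
    """Ceil value/threshold, then merge everything above max_category into it."""
    categories = []
    for v in values:
        cat = math.ceil(v / threshold_operator)
        categories.append(min(cat, max_category))
    return categories


# Illustrative flow values only.
flow = [0.0, 0.4, 1.0, 2.3, 3.0, 7.5, 12.0]
temp_category = discretize(flow, threshold_operator=1.0, max_category=3)
print(temp_category)
```

The resulting temporary category can then serve as the stratification key, after which it would be dropped from the returned train and test sets.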