Learning utilities for data transformation, model learning and inspections.

watex.utils.mlutils.bi_selector(d, /, features=None, return_frames=False)[source]#

Auto-differentiates the numerical from categorical attributes.

This is useful for separating the categorical features from the numerical ones, and vice versa, when the dataset has many features. Entering features individually becomes tedious and error-prone.

Parameters:
  • d (pandas.DataFrame) – Input dataframe.

  • features (list of str) – List of features in the dataframe columns. An error is raised if any feature does not exist in the frame. Note that if features is None, the function returns the categorical and numerical features instead.

  • return_frames (bool, default=False) – If False, return the given features and the remaining columns (features) as lists. If True, return two frames, one composed of the given features and one of the remaining features.

Returns:

  • Tuple (list, list) – Lists of the given features and the remaining features.

  • Tuple (pd.DataFrame, pd.DataFrame) – Frames of the given features and the remaining features.

Example

>>> from watex.utils.mlutils import bi_selector
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame # get the frame
>>> data.columns
Index(['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name',
       'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp',
       'short_distance_gamma', 'well_diameter', 'aquifer_group',
       'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping',
       'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping',
       'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k',
       'kp', 'r', 'rp', 'remark'],
      dtype='object')
>>> num_features, cat_features = bi_selector (data)
>>> num_features
...['gamma_gamma',
     'depth_top',
     'aquifer_thickness',
     'pumping_depth_at_the_end',
     'section_aperture',
     'remark',
     'depth_starting_pumping',
     'hole_depth_before_pumping',
     'rp',
     'hole_depth_after_pumping',
     'hole_depth_loss',
     'depth_bottom',
     'sp',
     'pumping_depth',
     'kp',
     'resistivity',
     'short_distance_gamma',
     'r',
     'natural_gamma',
     'layer_thickness',
     'k',
     'well_diameter']
>>> cat_features
... ['hole_id', 'strata_name', 'rock_name', 'aquifer_group',
     'pumping_level']
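
The numerical/categorical split shown above can be approximated with plain pandas. The sketch below is not the watex implementation: it assumes that a numeric dtype marks a numerical feature and that everything else is categorical. The helper name split_by_dtype and the toy frame are hypothetical.

```python
import pandas as pd


def split_by_dtype(df):
    """Return (numerical, categorical) column names based on dtype only."""
    num = df.select_dtypes(include="number").columns.tolist()
    cat = [c for c in df.columns if c not in num]
    return num, cat


# Toy frame: the column names echo the hlogs example, the values are made up.
frame = pd.DataFrame({
    "hole_id": ["H1", "H2", "H3"],
    "depth_top": [0.0, 12.5, 30.1],
    "rock_name": ["coal", "siltstone", "coal"],
    "resistivity": [15.2, 33.8, 21.0],
})

num_features, cat_features = split_by_dtype(frame)
print(num_features)
print(cat_features)
```

A dtype-based rule is simple but blunt: a numeric column that encodes classes is still reported as numerical, so the result is worth a quick review.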
watex.utils.mlutils.cattarget(arr, /, func=None, labels=None, rename_labels=None, coerce=False, order='strict')[source]#

Categorize array to hold the given identifier labels.

Classify numerical values according to the given label values. Labels are a list of integers, where each integer identifies the group that a sample in the dataset belongs to.

Parameters:
  • arr (array-like | pandas.Series) – Array or series containing numerical values. An error is raised if non-numerical values are given.

  • func (Callable) – Function used to categorize the target y.

  • labels (int, list of int) – If an integer is given, it is taken as the number of categories into which ‘y’ is split. For instance, labels=3 applied to the first ten integers yields the label values [0, 1, 2]. If labels are given as a list, its items must be contained in the target ‘y’.

  • rename_labels (list of str) – List of strings or values that replace the integer label identifiers.

  • coerce (bool, default=False) – Force the new label names passed to rename_labels to appear in the target, whether or not some integer class labels remain. If coerce is True, the target array takes the dtype of new_array.

Returns:

arr – The categorized array with unique identifier labels.

Return type:

array-like | pandas.Series

Examples

>>> import numpy as np
>>> from watex.utils.mlutils import cattarget
>>> def binfunc(v):
...     if v < 3 : return 0
...     else : return 1
>>> arr = np.arange (10 )
>>> arr
... array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> target = cattarget(arr, func =binfunc)
>>> target
... array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> cattarget(arr, labels =3 )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
>>> array([2, 2, 2, 2, 1, 1, 1, 0, 0, 0])
>>> cattarget(arr, labels =3 , order =None )
... array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
>>> cattarget(arr[::-1], labels =3 , order =None )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2]) # reverse does not change
>>> cattarget(arr, labels =[0 , 2,  4]  )
... array([0, 0, 0, 2, 2, 4, 4, 4, 4, 4])
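
The two ways of categorizing described above (a labelling function, or an integer number of classes) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the cattarget implementation: the helper names are hypothetical, and the equal-width binning used here is only one possible rule, so its boundaries may differ from those cattarget applies.

```python
import numpy as np


def categorize_with_func(arr, func):
    """Apply a scalar labelling function to every element."""
    return np.array([func(v) for v in arr])


def categorize_into_bins(arr, n_bins):
    """Assign integer labels 0..n_bins-1 using equal-width bins (assumed rule)."""
    arr = np.asarray(arr, dtype=float)
    edges = np.linspace(arr.min(), arr.max(), n_bins + 1)
    # Only the interior edges are used, so labels run from 0 to n_bins - 1.
    return np.digitize(arr, edges[1:-1])


arr = np.arange(10)
by_func = categorize_with_func(arr, lambda v: 0 if v < 3 else 1)
by_bins = categorize_into_bins(arr, 3)
print(by_func)
print(by_bins)
```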
watex.utils.mlutils.correlatedfeatures(df, corr='pearson', threshold=0.95, fmt=False)[source]#

Find the correlated features/columns in the dataframe.

Highly correlated columns add little value and can distort feature importances and the interpretation of regression coefficients. When columns are correlated, removing either the level_0 or the level_1 columns from the feature data is a good choice.

Parameters:
  • df (pandas.DataFrame of shape (M, N)) – Dataframe containing M samples and N features.

  • corr (str, ['pearson'|'spearman'|'covariance']) – Correlation method to apply. Note that ‘pearson’ and ‘covariance’ do not support string values; if the data contains such values, set corr to ‘spearman’. The default is ‘pearson’.

  • threshold (float, default is 0.95) – Value from which two features are considered correlated. Should not be greater than 1.

  • fmt (bool, default=False) – Format the correlated dataframe values.

Returns:

df – Dataframe with columns equal to [level_0, level_1, pearson]

Return type:

pandas.DataFrame

Examples

>>> from watex.utils.mlutils import correlatedfeatures
>>> df_corr = correlatedfeatures (data , corr='spearman',
                                  fmt=None, threshold=.95
                                  )
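
The idea behind the [level_0, level_1, pearson] layout can be sketched with pandas: compute the correlation matrix, keep one triangle so each pair appears once, and filter by the threshold. The helper correlated_pairs and the toy data are hypothetical, and the sketch does not claim to reproduce the formatting options of correlatedfeatures.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, method="pearson", threshold=0.95):
    """List column pairs whose absolute correlation reaches the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["level_0", "level_1", method]
    return pairs[pairs[method] >= threshold].reset_index(drop=True)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a"
})
result = correlated_pairs(toy)
print(result)
```

Dropping one column of each reported pair (for instance every level_1 entry) is then a single `DataFrame.drop` call.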
watex.utils.mlutils.default_data_splitting(X, y=None, *, test_size=0.2, target=None, random_state=42, fetch_target=False, **skws)[source]#

Naively split data into training and test sets.

Split data into a training set and a test set. If the target y is not given and you want to use a specific array as the target for supervised learning, set the fetch_target argument to True and set the target argument to a NumPy column index or a pandas dataframe column name.

Parameters:
  • X – np.ndarray or pd.DataFrame

  • y – array_like

  • test_size – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

  • random_state – int, Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls

  • fetch_target – bool. Set to True to retrieve the target values from the whole data X.

  • target – int or str. If int, it should be the index of the target column; otherwise it should be the column name in the pandas DataFrame.

  • skws – additional scikit-learn keyword arguments; see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Returns:

list – List containing the train-test split of the inputs.

Example:
>>> from watex.utils.mlutils import default_data_splitting
>>> from watex.datasets import fetch_data
>>> data = fetch_data ('Bagoue original').get('data=df')
>>> X, XT, y, yT= default_data_splitting(data.values,
                             fetch_target=True,
                             target =12 )
>>> X, XT, y, yT= default_data_splitting(data,
                     fetch_target=True,
                     target ='flow' )
>>> X0= data.copy()
>>> X0.drop('flow', axis =1, inplace=True)
>>> y0 = data ['flow']
>>> X, XT, y, yT= default_data_splitting(X0, y0)
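
The fetch_target behaviour can be pictured as "pop the target column, then split the rows". The sketch below uses NumPy shuffling rather than scikit-learn's train_test_split, so it is an illustration of the idea and not the function's actual code path. The helper split_with_target and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_with_target(data, target, test_size=0.2, random_state=42):
    """Separate the target column from a frame, then shuffle and split rows."""
    X = data.drop(columns=[target])
    y = data[target]
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(data))
    n_test = int(np.ceil(test_size * len(data)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


toy = pd.DataFrame({
    "east": np.arange(10.0),
    "north": np.arange(10.0) * 2,
    "flow": [0, 1] * 5,
})
X, XT, y, yT = split_with_target(toy, target="flow")
print(X.shape, XT.shape)
```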
watex.utils.mlutils.discretizeCategoriesforStratification(data, in_cat=None, new_cat=None, **kws)[source]#

Create a new category attribute to discretize instances.

A new category column in the data makes it possible to stratify the training set consistently with the whole dataset. Values are rounded using ceil.

Parameters:
  • in_cat – column name used to stratify the dataset

  • new_cat – name of the new category column created and inserted into the dataframe.

Returns:

New dataframe with the new category column.

watex.utils.mlutils.dumpOrSerializeData(data, filename=None, savepath=None, to=None, verbose=0)[source]#

Dump and save data to a binary file.

Parameters:
  • data – object. Object to dump into a binary file.

  • filename – str. Name of the file to serialize. If None, a name is created automatically.

  • savepath – str or path-like object. Directory in which to save the file. It is created automatically if it does not exist.

  • to – str. Force the data to be written with a specific module, either joblib or Python's pickle module. Should be joblib or pypickle.

Returns:

str – name of the dumped or serialized file.

Example:
>>> import numpy as np
>>> from watex.utils.mlutils import dumpOrSerializeData
>>> data=(np.array([0, 1, 3]),np.array([0.2, 4]))
>>> dumpOrSerializeData(data, filename ='__XTyT.pkl', to='pickle',
                        savepath='watex/datasets')
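
A dump-and-reload round trip of this kind can be sketched with the standard library only. The helpers dump_data and load_data_back are hypothetical stand-ins, not the watex functions, and the sketch writes to a temporary directory instead of the package's datasets folder.

```python
import pickle
import tempfile
from pathlib import Path


def dump_data(data, filename, savepath):
    """Pickle data to savepath/filename, creating the directory if needed."""
    directory = Path(savepath)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    with open(target, "wb") as fh:
        pickle.dump(data, fh)
    return str(target)


def load_data_back(filename):
    """Load an object previously written by dump_data."""
    with open(filename, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    saved = dump_data(([0, 1, 3], [0.2, 4]), "__XTyT.pkl", Path(tmp) / "datasets")
    restored = load_data_back(saved)

print(restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.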
watex.utils.mlutils.evalModel(model, X, y, Xt, yt=None, scorer='accuracy', eval=False, **kws)[source]#

Evaluate a model and quickly test its score with metric scorers.

Parameters:
  • model (Callable, {'preprocessor + estimator'} | estimator) – the preprocessor is a list of data-handling steps encapsulated in a pipeline. model can also be a simple estimator with fit,

  • X (N-d array, shape (N, M)) – the training set composed of N-columns and the M-samples. The feature set excludes the target y.

  • y (arraylike , shape (M)) – the target is composed of M-examples in supervised learning.

  • Xt (N-d array, shape (N, M)) – test set array composed of N-columns and the M-samples. The feature set excludes the target y.

  • yt (arraylike , shape (M)) – test label (or test target) composed of M-examples in supervised learning.

  • scorer (str, Callable) –

    a scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:

    • 'classification_report' -> for classification_report,

    • 'precision_recall' -> for precision_recall_curve,

    • 'confusion_matrix' -> for confusion_matrix,

    • 'precision' -> for precision_score,

    • 'accuracy' -> for accuracy_score,

    • 'mse' -> for mean_squared_error,

    • 'recall' -> for recall_score,

    • 'auc' -> for roc_auc_score,

    • 'roc' -> for roc_curve,

    • 'f1' -> for f1_score.

    Any other string prefix raises an error.

  • kws (dict) – Additional keyword arguments passed to the scikit-learn metric function.

Returns:

Tuple – the model score or the predicted y if predict is set to True.

Return type:

(score, ypred)

watex.utils.mlutils.exporttarget(df, tname, inplace=True)[source]#

Extract the target from the data and modify the data in place or not.

Parameters:
  • df – A dataframe with features including the target name tname

  • tname – A target name. It should be included in the dataframe columns; otherwise an error is raised.

  • inplace – modify the dataframe in place. If False, the dataframe is returned unmodified. The default is True.

Returns:

Tuple of the target and dataframe (modified or not)

Example:

>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import exporttarget
>>> data0 = fetch_data ('bagoue original').get('data=dfy1')
>>> # no modification
>>> target, data_no = exporttarget (data0 , 'sfi', False )
>>> len(data_no.columns ) , len(data0.columns )
... (13, 13)
>>> # modified in place
>>> target, data= exporttarget (data0 , 'sfi')
>>> len(data.columns ) , len(data0.columns )
... (12, 12)
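
The in-place versus copy behaviour in the example above can be sketched with pandas. The helper pop_target is hypothetical and only mirrors the documented behaviour (the column count is unchanged with inplace=False and reduced by one with inplace=True); it is not the watex implementation.

```python
import pandas as pd


def pop_target(df, tname, inplace=True):
    """Return (target, frame); drop the target column only when inplace is True."""
    if tname not in df.columns:
        raise ValueError(f"Target {tname!r} is missing from the dataframe columns.")
    if inplace:
        target = df.pop(tname)     # removes the column from df itself
    else:
        target = df[tname].copy()  # df keeps all of its columns
    return target, df


# Toy frame with made-up values.
toy = pd.DataFrame({"sfi": [0.1, 0.2], "power": [10, 20], "flow": [1, 2]})

target, untouched = pop_target(toy, "sfi", inplace=False)
n_before = len(toy.columns)

target, modified = pop_target(toy, "sfi")
n_after = len(toy.columns)
print(n_before, n_after)
```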
watex.utils.mlutils.fetchGeoDATA(data_url, data_path, tgz_filename)[source]#

Fetch data from a data repository as a zipped tgz file.

It creates a datasets/data directory in your workspace, downloads the ~.tgz_file, and extracts data.csv from it.

Parameters:
  • data_url – URL of the location where the tgz file is stored

  • data_path – absolute path to the tgz file

  • tgz_filename – tgz filename.

watex.utils.mlutils.fetchModel(file, *, default=True, name=None, storage=None)[source]#

Fetch data or a model saved using the Python pickle or joblib module.

Parameters:
  • file (str or path-like object) – Name of the dumped model file, saved using joblib or the Python pickle module.

  • path (path-like object) – Path to the dumped model file modelfile.

  • default (bool) – Model parameters are saved into a dictionary by default. When default is True, returns a tuple of pairs (the model and its best parameters). If False, returns all values saved from ~.MultipleGridSearch.

  • storage (str, default='joblib') – Module used to pickle the data.

  • name (str) – Name of the model to retrieve from the dumped file. If name is given, only that model and its best parameters are returned.

Returns:

data – Data composed of the models, classes and params for ‘best_model’, ‘best_params_’ and ‘best_scores’ if default is True; otherwise the dumped model and all parameters.

Return type:

Tuple (Dict, )

Example

>>> from watex.utils.mlutils import fetchModel
>>> my_model, = fetchModel ('SVC__LinearSVC__LogisticRegression.pkl',
                            default =False,  name='SVC')
>>> my_model
watex.utils.mlutils.fetch_model(modelfile, modelpath=None, default=True, modname=None, verbose=0)[source]#

Fetch your model saved using Python pickle module or joblib module.

Parameters:
  • modelfile – str or path-like object. Name of the dumped model file, saved using joblib or the Python pickle module.

  • modelpath – path-like object. Path to the dumped model file modelfile.

  • default – bool. Model parameters are saved into a dictionary by default. When default is True, returns a tuple of pairs (the model and its best parameters). If False, returns all values saved from ~.MultipleGridSearch.

  • modname – str. Name of the model to retrieve from the dumped file. If given, only that model and its best parameters are returned.

  • verbose – int, level=0. Controls the verbosity; more messages are shown if greater than 0.

Returns:

  • model_class_params: if default is True

  • pickedfname: model dumped and all parameters if default is False

Example:
>>> from watex.bases import fetch_model
>>> my_model = fetch_model ('SVC__LinearSVC__LogisticRegression.pkl',
                            default =False,  modname='SVC')
>>> my_model
watex.utils.mlutils.findCatandNumFeatures(df=None, features=None, return_frames=False)[source]#

Retrieve the categorical or numerical features from all the features of a dataset.

Parameters:
  • df (Dataframe) – Dataframe whose columns are the features.

  • features (list of str) – List of column names. If the dataframe is big, only the required features can be given. If features are provided, the frame is shrunk to the given features before the search for numerical and categorical features. Note that an error is raised if any of the features is missing from the dataframe.

  • return_frames (bool) – If True, return two separate dataframes (cat & num); otherwise return only the cat and num column names.

Returns:

Tuple

Return type:

cat_features and num_features names or frames

Examples

>>> from watex.datasets import fetch_data
>>> from watex.tools import findCatandNumFeatures
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> cat, num = findCatandNumFeatures(data)
>>> cat, num
... (['type', 'geol', 'shape', 'name', 'flow'],
 ['num', 'east', 'north', 'power', 'magnitude', 'sfi', 'ohmS', 'lwi'])
>>> cat, num = findCatandNumFeatures(
    data, features = ['geol', 'ohmS', 'sfi'])
... (['geol'], ['ohmS', 'sfi'])
watex.utils.mlutils.getGlobalScore(cvres)[source]#

Retrieve the global mean and standard deviation of the scores from the cross-validation containers.

Parameters:

cvres (dict of array-like, shape (N, )) – cross-validation results after training the models, where N is the number of parameters.

Returns:

tuple (‘mean_test_scores’, ‘std_test_scores’)

mean scores on test_score and standard deviation scores

watex.utils.mlutils.labels_validator(t, /, labels, return_bool=False)[source]#

Assert the validity of the labels in the target and return either the labels or a boolean indicating whether all the labels are in the target.

Parameters:
  • t – array-like, target that is expected to contain the labels.

  • labels – int, str or list of (str or int) that is supposed to be in the target t.

  • return_bool – bool, default=False; if True, returns ‘True’ or ‘False’ rather than the labels.

Returns:

bool or labels; ‘True’ or ‘False’ if return_bool is set to True and labels otherwise.

Example:

>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget, labels_validator
>>> _, y = fetch_data ('bagoue', return_X_y=True, as_frame=True)
>>> # binarize target y into [0 , 1]
>>> ybin = cattarget(y, labels=2 )
>>> labels_validator (ybin, [0, 1])
... [0, 1] # all labels exist.
>>> labels_validator (y, [0, 1, 3])
... ValueError: Value '3' is missing in the target.
>>> labels_validator (ybin, 0 )
... [0]
>>> labels_validator (ybin, [0, 5], return_bool=True ) # no raise error
... False
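
The membership check described above can be sketched with NumPy. The helper check_labels is hypothetical; it mirrors the documented behaviour (return the labels, raise on a missing label, or return a boolean when return_bool is True) without claiming to match the error messages or edge cases of labels_validator.

```python
import numpy as np


def check_labels(t, labels, return_bool=False):
    """Verify that every requested label occurs in the target t."""
    labels = np.atleast_1d(labels).tolist()
    present = np.isin(labels, np.unique(t))
    if return_bool:
        return bool(present.all())
    missing = [lab for lab, ok in zip(labels, present) if not ok]
    if missing:
        raise ValueError(f"Value(s) {missing} missing in the target.")
    return labels


ybin = np.array([0, 1, 1, 0, 1])  # toy binarized target
all_present = check_labels(ybin, [0, 1])
single = check_labels(ybin, 0)
as_bool = check_labels(ybin, [0, 5], return_bool=True)
print(all_present, single, as_bool)
```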
watex.utils.mlutils.loadDumpedOrSerializedData(filename, verbose=0)[source]#

Load dumped or serialized data from filename

Parameters:

filename – str or path-like object. Name of the dumped data file.

Returns:

Data loaded from dumped file.

Example:
>>> from watex.utils.mlutils import loadDumpedOrSerializedData
>>> loadDumpedOrSerializedData(filename ='Watex/datasets/__XTyT.pkl')
watex.utils.mlutils.load_data(data=None, delimiter=None, **kws)[source]#

Load csv file to a frame.

Parameters:
  • data – path to the CSV data file

  • delimiter – str, character used to delimit the data.

  • kws – dict, additional keyword arguments passed to pandas.read_csv

Returns:

pandas dataframe

watex.utils.mlutils.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)[source]#

Make a pipeline to transform data at once.

A naive pipeline is useful for quickly preprocessing the data in a single step for quick prediction.

Works with a pandas dataframe. If no features are set, the numerical and categorical features are retrieved automatically.

Parameters:
  • X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • num_features (list of str, optional) – List of numerical features. If num_features are given whereas cat_features are None, cat_features are figured out automatically.

  • cat_features (list of str, optional) – List of categorical features. If cat_features are given whereas num_features are None, num_features are figured out automatically.

  • label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – Kind of encoding used to encode the labels. This assumes ‘y’ is supplied.

  • scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – Kind of scaling used to scale the numerical data. Note that the categorical data are encoded with ‘sklearn.preprocessing.OneHotEncoder’ under the hood.

  • missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

  • impute_strategy (str, default='median') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

      strategy=”constant” for fixed value imputation.

  • sparse_output (bool, default=True) – Used when the label y is given. Binarize labels in a one-vs-all fashion. If True, the array returned by transform is in sparse CSR format.

  • for_pca (bool, default=False) – Transform data for principal component analysis (PCA). If True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.

  • transform (bool, default=False) – Transform the data in place rather than returning the naive pipeline.

Returns:

  • full_pipeline (watex.exlib.sklearn.FeatureUnion) –

    • Full pipeline composed of numerical and categorical pipes

  • (X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) –

    • Transformed data.

Examples

>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs

(1) Make a naive simple pipeline with RobustScaler, StandardScaler

>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs (as_frame=True )# get all the data
>>> pipe = make_naive_pipe(X_, scaler =RobustScaler )

(2) Transform X in place with numerical and categorical features with StandardScaler (default). Returned CSR matrix

>>> make_naive_pipe(X_, transform =True )
... <181x40 sparse matrix of type '<class 'numpy.float64'>'
    with 2172 stored elements in Compressed Sparse Row format>
watex.utils.mlutils.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, keep_empty_features=False, **fit_params)[source]#

Impute missing values in the data.

When the data contains categorical features, passing ‘bi-impute’ to the ‘mode’ parameter imputes both the numerical and the categorical features, rather than raising an error when ‘strategy’ is not set to ‘most_frequent’.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data in which missing values are imputed.

  • y (None) – Not used, present here for API consistency by convention.

  • strategy (str, default='mean') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

      strategy=”constant” for fixed value imputation.

  • mode (str, ['bi-impute'], default=None) – If mode is set to ‘bi-impute’, both the numerical and categorical features are imputed and a single imputed dataframe is returned.

  • drop_features (bool or list, default=False) – Drop a list of features from the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.

  • missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

  • fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

  • keep_empty_features (bool, default=False) –

    If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.

    New in version 0.2.0.

  • verbose (int, default=0) – Controls the verbosity of the imputer.

  • copy (bool, default=True) –

    If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

    • If X is not an array of floating values;

    • If X is encoded as a CSR matrix;

    • If add_indicator=True.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

  • fit_params (dict) – Keyword arguments passed as scikit-learn fitting parameters. More details at https://scikit-learn.org/stable/

Returns:

Xi – Data imputed

Return type:

Dataframe, array-like, sparse matrix of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X= np.random.randn ( 7, 4 )
>>> X[3, :] =np.nan  ; X[:, 3][-4:]=np.nan
>>> naive_imputer  (X)
... array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
           [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
           [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
           [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
           [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
           [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
           [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame (X, columns =['a', 'b', 'c', 'd']  )
>>> # change [bc] types to categorical values.
>>> frame['b']=['pineaple', '', 'cabbage', 'watermelon', 'onion',
                'cabbage', 'onion']
>>> frame['c']=['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> naive_imputer(frame, mode ='bi-impute')
...             b      c         a         d
    0    pineaple   lion  1.347835  0.434558
    1     cabbage    cat  0.368432 -0.293720
    2     cabbage    cat -1.689750  0.189391
    3  watermelon    cat  0.060138  0.110076
    4       onion    dog -0.271291  0.110076
    5     cabbage    cat  1.092240  0.110076
    6       onion  mouse -0.486639  0.110076
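
The ‘bi-impute’ idea, imputing numerical and categorical columns in one pass, can be sketched with pandas. This is a sketch under assumptions, not the naive_imputer implementation: the helper bi_impute is hypothetical, numeric columns are filled with the mean, and other columns are filled with the most frequent value after empty strings are treated as missing.

```python
import numpy as np
import pandas as pd


def bi_impute(frame):
    """Impute numeric columns with the mean and other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Treat empty strings as missing, then use the most frequent value.
            cleaned = out[col].mask(out[col] == "")
            out[col] = cleaned.fillna(cleaned.mode().iloc[0])
    return out


# Toy frame with one missing number and one empty category.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "c": ["lion", "", "cat", "cat"],
})
imputed = bi_impute(toy)
print(imputed)
```

Because `Series.mode` returns its values sorted, taking the first entry matches the "only the smallest is returned" tie rule quoted for ‘most_frequent’.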
watex.utils.mlutils.naive_scaler(X, y=None, *, kind=<class 'sklearn.preprocessing._data.StandardScaler'>, copy=True, with_mean=True, with_std=True, feature_range=(0, 1), clip=False, norm='l2', **fit_params)[source]#

Quick data scaling using the two strategies implemented in scikit-learn, StandardScaler and MinMaxScaler.

The function returns a scaled dataframe if a dataframe is passed, or an ndarray otherwise. For other scaling methods, call scikit-learn directly.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • kind (str, default='StandardScaler') – Kind of data scaling. Can also be ‘MinMaxScaler’ or ‘Normalizer’. The default is ‘StandardScaler’.

  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

  • feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.

  • norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.

  • clip (bool, default=False) – Set to True to clip transformed values of held-out data to provided feature range.

  • fit_params (dict) – Keyword arguments passed as scikit-learn fitting parameters. More details at https://scikit-learn.org/stable/

Returns:

X_sc – Transformed array.

Return type:

{ndarray, sparse matrix} or dataframe of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_scaler
>>> X= np.random.randn (7 , 3 )
>>> X_std = naive_scaler (X )
... array([[ 0.17439644,  1.55683005,  0.24115109],
       [-0.59738672,  1.3166854 ,  1.23748004],
       [-1.6815365 , -1.19775838,  0.71381357],
       [-0.1518278 , -0.32063059, -0.47483155],
       [-0.41335886,  0.13880519,  0.69258621],
       [ 1.45221902, -1.03852015, -0.40157981],
       [ 1.21749443, -0.45541153, -2.00861955]])
>>> # use dataframe
>>> Xdf = pd.DataFrame (X, columns =['a', 'c', 'c'])
>>> naive_scaler (Xdf , kind='Normalizer') # return data frame
...           a         c         c
    0  0.252789  0.967481 -0.008858
    1 -0.265161  0.908862  0.321961
    2 -0.899863 -0.416231  0.130380
    3  0.178203  0.039443 -0.983203
    4 -0.418487  0.800306  0.429394
    5  0.933933 -0.309016 -0.179661
    6  0.795234 -0.051054 -0.604150
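
What the two main scaling strategies compute can be written out in a few lines of NumPy. The functions below are hypothetical illustrations of standardization (zero mean, unit variance per column) and min-max scaling to a feature range; they do not call naive_scaler and ignore sparse input and constant columns.

```python
import numpy as np


def standard_scale(X):
    """Centre each column and scale it to unit (population) variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def minmax_scale(X, feature_range=(0, 1)):
    """Rescale each column linearly onto feature_range."""
    low, high = feature_range
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)  # a constant column would give 0 here
    return low + (X - X.min(axis=0)) / span * (high - low)


rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))
X_std = standard_scale(X)
X_mm = minmax_scale(X)
print(X_std.mean(axis=0).round(6), X_mm.min(axis=0), X_mm.max(axis=0))
```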
watex.utils.mlutils.predict(y_true, y_pred=None, *, X_=None, clf=None, verbose=0)[source]#

Compute quick statistics after prediction.

Parameters:
  • y_true – array-like. y value (label) to predict.

  • y_pred – array-like. Predicted y value.

  • X – ndarray (nexamples, nfeatures). Test sets.

  • clf – callable. Estimator or classifier object.

  • XT – ndarray

  • X – ndarray (nexamples, nfeatures). Training data sets.

  • verbose – int, level=0. Controls the verbosity; more messages are shown if greater than 1.

  • from_c – str. Column for which to visualize statistics. Make sure the column exists in the test sets; otherwise an error is raised.

watex.utils.mlutils.projection_validator(X, Xt=None, columns=None)[source]#

Retrieve x, y coordinates of a dataframe (X, Xt) from column names or indexes.

If X or Xt are given as arrays, columns may hold integers for selecting the coordinates ‘x’ and ‘y’.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.

  • Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

  • columns (list of str or index, optional) – columns is useful when a dataframe is given with a dimension size greater than 2. If such data is passed to X or Xt, columns must hold the names to consider as ‘easting’, ‘northing’ when UTM coordinates are given, or ‘latitude’, ‘longitude’ when latlon are given. If the dimension size is greater than 2 and columns is None, an error is raised to prompt the user to provide the indexes for retrieving the ‘x’ and ‘y’ coordinates.

Returns:

arrays and coordinate labels

Return type:

( x, y, xt, yt ), (xname, yname, xtname, ytname), Tuple of coordinate

watex.utils.mlutils.rename_labels_in(arr, new_names, coerce=False)[source]#

Rename label by a new names

Parameters:
  • arr – array-like or pandas.Series; array or series containing numerical values. If non-numerical values are given, an error is raised.

  • new_names – list of str; list of strings or values that replace the integer label identifiers.

  • coerce – bool, default=False. Force the ‘new_names’ to appear in the target, whether or not it still includes some integer class-label identifiers. If coerce is True, the target array takes the dtype of the new array and coercing the label names does not raise an error. This can therefore introduce unexpected results.

Returns:

array-like, An array-like with full new label names.
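
As a rough illustration of what renaming integer labels means, the sketch below maps each integer class to a name with NumPy. It assumes the names are paired with the sorted unique labels; that pairing is an assumption for the example, not a statement about how rename_labels_in is implemented.

```python
import numpy as np

# Integer class labels and the names that should replace them
# (illustrative values only).
y = np.array([0, 1, 2, 1, 0, 2])
new_names = ["low", "medium", "high"]

# Assumption: names are paired with the sorted unique integer labels.
classes = np.unique(y)
mapping = dict(zip(classes.tolist(), new_names))

# Replace every integer label by its name.
y_named = np.array([mapping[v] for v in y.tolist()])
```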

watex.utils.mlutils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)[source]#

Select features whose importance meets a user-specified threshold after model fitting. This is useful when one wants to use an estimator such as RandomForestClassifier as a feature selector and as an intermediate step in a scikit-learn Pipeline object, which connects different processing steps with an estimator.

Parameters:
  • clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, )) – Target vector, where n_samples is the number of samples. If given, set prefit=False so that the estimator is fitted before the data are transformed for feature selection. If the estimator is already fitted, i.e. prefit=True, ‘y’ is not needed.

  • threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater than or equal to the threshold are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used.

  • prefit (bool, default=True) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.

  • importance_getter (str or callable, default='auto') –

    If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.

    Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.

    If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.

  • norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.

  • max_features (int, callable, default=None) –

    The maximum number of features to select.

    • If an integer, then it specifies the maximum number of features to allow.

    • If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).

    • If None, then all features are kept.

    To only select based on max_features, set threshold=-np.inf.

  • return_selector (bool, default=False) – If True, returns the selector object; otherwise returns the transformed X.

  • verbose (int, default=0) – display the number of features that meet the criterion according to their importance range.

Returns:

Xs or selector – Ndarray of number of samples and features that meet the criterion according to the importance range or selector object

Return type:

ndarray (n_samples, n_criterion_features), or sklearn.feature_selection.SelectFromModel

Examples

>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...      [-2.79, -0.02, -0.85 ],
...      [-1.34, -0.48, -2.55 ],
...      [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]

(1) Use prefit=True and get the selected features Xs

>>> Xs = select_feature_importances (
...     LogisticRegression().fit(X0, y0), X0 , prefit =True )

>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])

(2) Set prefit to False and return the selector object

>>> selector= select_feature_importances (
...     LogisticRegression(), X= X0 ,
...     y =y0  ,
...     prefit =False , return_selector= True
...     )
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])
>>> # For comparison: SelectFromModel used directly, with its default "mean" threshold
>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform (X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])

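
The description above mentions using a random forest as a feature selector inside a scikit-learn Pipeline. A minimal sketch of that idea follows. It uses scikit-learn classes directly rather than the watex wrapper, and the synthetic data, the "median" threshold and the step names are choices made for the illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, only a few of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

pipe = Pipeline([
    # Step 1: keep features whose forest importance is >= the median importance.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    )),
    # Step 2: fit the final estimator on the reduced feature set.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Boolean mask of retained features and the reduced training matrix.
mask = pipe.named_steps["select"].get_support()
X_reduced = pipe.named_steps["select"].transform(X)
```

Because the selector is a pipeline step, the same features are dropped automatically whenever `pipe.predict` is called on new data.
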
watex.utils.mlutils.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#

Select features and return new dataframe.

Parameters:
  • df – a dataframe for features selections

  • features – list of features to select. All listed features must be in the dataframe, otherwise an error occurs.

  • include – the data type to retrieve from the dataframe df, e.g. number.

  • exclude – the data type to exclude from the dataframe df, e.g. number, in which case only non-numeric data are kept in the returned data.

  • coerce – return the whole dataframe with numeric columns converted. Be aware that no selection is done and no error is raised in that case. Default is False.

  • kwd – additional keyword arguments passed to the pandas astype function.

Ref:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
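
The selection modes described above (by name, by included dtype, by excluded dtype) have direct equivalents in plain pandas. The sketch below shows those equivalents on a small made-up frame. It illustrates the concepts and is not the body of selectfeatures.

```python
import pandas as pd

# Small made-up frame mixing text and numeric columns.
df = pd.DataFrame({
    "hole_id": ["H1", "H2", "H3"],
    "depth_top": [0.0, 12.5, 30.1],
    "resistivity": [120.0, 95.5, 210.3],
    "rock_name": ["clay", "sandstone", "coal"],
})

# Select by explicit feature names (raises KeyError if a name is missing).
by_name = df[["depth_top", "resistivity"]]

# Keep only numeric columns.
numeric_only = df.select_dtypes(include="number")

# Keep only non-numeric columns.
non_numeric = df.select_dtypes(exclude="number")
```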

watex.utils.mlutils.split_train_test(df, test_ratio)[source]#

Naively split a dataset into train and test sets according to a ratio, and return the shuffled train set and test set.

Parameters:
  • df – a dataframe containing features

  • test_ratio – the fraction of the data to put in the test set, between 0 and 1 (e.g. 0.2 for 20%).

Returns:

a tuple of train set and test set.
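
A naive ratio-based split can be sketched as a shuffle of the row positions followed by a cut. `naive_split` and its `seed` argument are hypothetical names introduced for this example, not the watex implementation.

```python
import numpy as np
import pandas as pd

def naive_split(df, test_ratio=0.2, seed=None):
    """Hypothetical helper: shuffle row positions, then cut at test_ratio."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))      # random order of row positions
    n_test = int(len(df) * test_ratio)       # number of rows in the test set
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
train, test = naive_split(df, test_ratio=0.2, seed=0)
```

Without a fixed seed, a split of this kind assigns different rows to the test set on every run, which is the weakness that split_train_test_by_id addresses.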

watex.utils.mlutils.split_train_test_by_id(data, test_ratio, id_column=None, keep_colindex=True, hash=<built-in function openssl_md5>)[source]#

Ensure that the split remains consistent across multiple runs, even if the dataset is refreshed.

The new test set will contain 20% of the instances, but it will not contain any instance that was previously in the training set.

Parameters:
  • data – pandas DataFrame

  • test_ratio – ratio of data to put in the test set

  • id_column – identifier index column. If id_column is None, the index of the dataframe data is reset and id_column is set to that index

  • hash – secure hash algorithm. Refer to test_set_check_id()

Returns:

consistent train set and test set
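
The consistency property described above can be illustrated with a hash of each row identifier: an instance goes to the test set when the last byte of its hash falls below 256 * test_ratio. The sketch below is an assumption about the general technique; `in_test_set` is a hypothetical stand-in and is not the body of test_set_check_id().

```python
import hashlib

def in_test_set(identifier, test_ratio, hash_fn=hashlib.md5):
    """Hypothetical check: True if the id's hash falls in the test bucket."""
    digest = hash_fn(str(identifier).encode("utf-8")).digest()
    # The last byte is in 0..255, so roughly test_ratio of all ids pass.
    return digest[-1] < 256 * test_ratio

# First version of the dataset: ids 0..999.
ids_v1 = list(range(1000))
test_v1 = {i for i in ids_v1 if in_test_set(i, 0.2)}

# Refreshed dataset: the same ids plus 500 new ones.
ids_v2 = list(range(1500))
test_v2 = {i for i in ids_v2 if in_test_set(i, 0.2)}

# Old instances keep their assignment; only new ids can join the test set.
old_ids_in_new_test = {i for i in test_v2 if i < 1000}
```

Because the assignment depends only on the identifier, `old_ids_in_new_test` equals `test_v1`: no instance moves from the training set to the test set when data are added.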

watex.utils.mlutils.stratifiedUsingDiscretedCategories(data, cat_name, n_splits=1, test_size=0.2, random_state=42)[source]#

Stratified sampling based on the newly generated category from DiscretizeCategoriesforStratification().

Parameters:
  • data – dataframe holding the new column of category

  • cat_name – new category name inserted into data

  • n_splits – number of splits