watex.utils.naive_imputer

watex.utils.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, keep_empty_features=False, **fit_params)

Impute missing values in the data.

When the data contains categorical features, passing ‘bi-impute’ to the ‘mode’ parameter imputes both the numerical and the categorical features, rather than raising an error when ‘strategy’ is not set to ‘most_frequent’.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data in which missing values are imputed.
y (None) – Not used, present here for API consistency by convention.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
  
  strategy=”constant” for fixed value imputation.
mode (str, {‘bi-impute’}, default=None) – If mode is set to ‘bi-impute’, both the numerical and the categorical features are imputed and a single imputed dataframe is returned.
drop_features (bool or list, default=False) – List of features to drop from the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
keep_empty_features (bool, default=False) –
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.

New in version 0.2.0.
verbose (int, default=0) – Controls the verbosity of the imputer.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
- If X is not an array of floating values;
- If X is encoded as a CSR matrix;
- If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
fit_params (dict) – Keyword arguments passed on as scikit-learn fitting parameters. More details at https://scikit-learn.org/stable/
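The per-column rules described for the strategy parameter can be illustrated on a single column. The block below is a minimal plain-Python sketch, not the watex or scikit-learn implementation: the helper names is_missing and impute_column are hypothetical, and only None and NaN are treated as missing. Unlike the documented default, the sketch expects fill_value to be given explicitly for the “constant” strategy.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column(column, strategy="mean", fill_value=None):
    """Return a copy of `column` with missing entries replaced.

    Hypothetical helper that mirrors the documented strategy rules;
    it assumes the column holds at least one observed value.
    """
    observed = [v for v in column if not is_missing(v)]
    if strategy == "mean":
        replacement = mean(observed)        # numeric data only
    elif strategy == "median":
        replacement = median(observed)      # numeric data only
    elif strategy == "most_frequent":
        counts = Counter(observed)
        top = max(counts.values())
        # On a tie, keep the smallest of the most frequent values.
        replacement = min(v for v, n in counts.items() if n == top)
    elif strategy == "constant":
        replacement = fill_value            # must be supplied in this sketch
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [replacement if is_missing(v) else v for v in column]


numbers = [1.0, float("nan"), 3.0, 8.0]
labels = ["cat", None, "dog", "cat", "dog"]

print(impute_column(numbers, "mean"))           # [1.0, 4.0, 3.0, 8.0]
print(impute_column(numbers, "median"))         # [1.0, 3.0, 3.0, 8.0]
print(impute_column(labels, "most_frequent"))   # ['cat', 'cat', 'dog', 'cat', 'dog']
print(impute_column(labels, "constant", fill_value="missing_value"))
```

The “most_frequent” call shows the tie rule: “cat” and “dog” both occur twice, so the smaller value, “cat”, fills the gap.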

Returns:

Xi – The imputed data.

Return type:

DataFrame, array-like, or sparse matrix of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X = np.random.randn(7, 4)
>>> X[3, :] = np.nan
>>> X[-4:, 3] = np.nan
>>> naive_imputer(X)
array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
       [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
       [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
       [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
       [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
       [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
       [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame(X, columns=['a', 'b', 'c', 'd'])
>>> # change the types of columns 'b' and 'c' to categorical values.
>>> frame['b'] = ['pineaple', '', 'cabbage', 'watermelon', 'onion',
...               'cabbage', 'onion']
>>> frame['c'] = ['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> naive_imputer(frame, mode='bi-impute')
            b      c         a         d
0    pineaple   lion  1.347835  0.434558
1     cabbage    cat  0.368432 -0.293720
2     cabbage    cat -1.689750  0.189391
3  watermelon    cat  0.060138  0.110076
4       onion    dog -0.271291  0.110076
5     cabbage    cat  1.092240  0.110076
6       onion  mouse -0.486639  0.110076
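
The idea behind ‘bi-impute’ (numerical columns filled with one rule, categorical columns with another, then recombined) can be sketched without any dependency. This is an illustration under stated assumptions, not the watex implementation: bi_impute and is_missing are hypothetical names, numeric columns are filled with the mean, other columns with the most frequent value, only None and NaN count as missing, and the categorical columns are placed first simply to mirror the column order of the output shown above.

```python
import math
from collections import Counter
from numbers import Number
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def bi_impute(table):
    """Impute a dict of columns: mean for numeric, most frequent otherwise.

    Hypothetical helper; assumes every column has at least one observed value.
    """
    numeric, categorical = {}, {}
    for name, values in table.items():
        observed = [v for v in values if not is_missing(v)]
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in observed):
            fill = mean(observed)
            target = numeric
        else:
            counts = Counter(observed)
            top = max(counts.values())
            # On a tie, keep the smallest of the most frequent values.
            fill = min(v for v, n in counts.items() if n == top)
            target = categorical
        target[name] = [fill if is_missing(v) else v for v in values]
    # Categorical columns first, then numeric ones.
    return {**categorical, **numeric}


table = {
    "a": [1.0, float("nan"), 2.0],
    "b": ["onion", None, "onion"],
    "d": [0.5, 1.5, float("nan")],
}
result = bi_impute(table)
print(list(result))   # ['b', 'a', 'd']
print(result["a"])    # [1.0, 1.5, 2.0]
print(result["b"])    # ['onion', 'onion', 'onion']
print(result["d"])    # [0.5, 1.5, 1.0]
```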