watex.cases.prepare.default_preparation

watex.cases.prepare.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)

Automate data preparation so that the data is ready for PCA analysis.

Data preparation consists of imputing missing values, scaling the numerical features, and encoding the categorical features.
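As a rough illustration of the three steps named above, the following standard-library sketch imputes missing entries with the most frequent value, standardizes a numerical column, and ordinal-encodes a categorical column. It is not the watex implementation: the helper names (is_missing, impute_most_frequent, standard_scale, ordinal_encode) and the sample data are hypothetical and exist only for this example, and treating both None and NaN as missing is an assumption of the sketch.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_most_frequent(column):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in column if not is_missing(v)]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if is_missing(v) else v for v in column]


def standard_scale(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def ordinal_encode(column):
    """Map each category to its position in the sorted list of categories."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column]


# Hypothetical sample columns: one numerical, one categorical.
ages = [20.0, float("nan"), 30.0, 30.0]
colors = ["red", None, "blue", "red"]

imputed_ages = impute_most_frequent(ages)
scaled_ages = standard_scale(imputed_ages)
encoded_colors = ordinal_encode(impute_most_frequent(colors))
```

The most_frequent strategy is used for both columns here because, unlike mean or median, it is also defined for categorical values.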

Parameters
  • X (ndarray, M x N matrix where M = m samples and N = n features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • imputer_strategy (str, default='most_frequent') – Strategy used to replace the missing values. Can be mean, median, or most_frequent. If mean or median is given, make sure that the data does not contain categorical features.

  • missing_values (float) – Value treated as missing in X (ndarray or dataframe) and replaced by imputation. Default is np.nan.

  • num_indexes – List of indexes used to select the numerical data if categorical data columns exist in the X ndarray.

  • cat_indexes – List of indexes used to select the categorical data if numerical data columns exist in the X ndarray.

  • scaler (str) – Type of feature scaling applied to the numerical features. Can be MinMaxScaler. Default is StandardScaler.

  • encode_cat_features (bool) – Whether to encode categorical data or text attributes. Default is True; the default encoder is sklearn.preprocessing.OrdinalEncoder.

  • columns (list, optional) – List of column names used to build a dataframe if X is given as an ndarray.

Returns

X

Return type

NDArray | Dataframe

Notes

num_indexes and cat_indexes are mainly used when X is an np.ndarray of shape (m, nf), where m is the number of instances (examples) and nf is the number of attributes (features). selector_ is used for dataframe preprocessing.
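To make the role of the two index lists concrete, here is a minimal sketch of selecting numerical and categorical columns by position from array-like rows. The helper split_columns and the sample rows are hypothetical; how watex performs the selection internally is not shown in this page.

```python
def split_columns(rows, num_indexes, cat_indexes):
    """Return (numerical_part, categorical_part) selected by column position."""
    numerical = [[row[i] for i in num_indexes] for row in rows]
    categorical = [[row[i] for i in cat_indexes] for row in rows]
    return numerical, categorical


# Hypothetical mixed-type rows: columns 0 and 2 are numerical, column 1 is categorical.
rows = [
    [12.5, "sand", 3.0],
    [8.1, "clay", 4.5],
]

num_part, cat_part = split_columns(rows, num_indexes=[0, 2], cat_indexes=[1])
```

With a two-dimensional NumPy array X, the same selection can be written as X[:, num_indexes] and X[:, cat_indexes].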