watex.cases.prepare.default_preparation
- watex.cases.prepare.default_preparation(X, imputer_strategy=None, missing_values=nan, num_indexes=None, cat_indexes=None, scaler=None, encode_cat_features=True, columns=None)
Automate data preparation so that the data are ready for PCA analyses.
Data preparation consists of imputing missing values, scaling the numerical features, and encoding the categorical features.
- Parameters:
X (ndarray, M x N matrix where M = m-samples and N = n-features) – Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
imputer_strategy (str, default='most_frequent') – Strategy used to replace the missing values. Can be mean, median, or most_frequent. If mean or median is given, make sure that the data contain no categorical features.
missing_values (float) – Value treated as missing in X (ndarray or dataframe) and replaced by the imputer. Default is np.nan.
num_indexes – list of indexes to select the numerical data if categorical data columns exist in X ndarray.
cat_indexes – list of indexes to select the categorical data if numerical data columns exist in X ndarray.
scaler (str, default is StandardScaler) – type of feature scaling applied to the numerical features. Can be MinMaxScaler.
encode_cat_features (bool) – Encode categorical data or text attributes. The default encoder is sklearn.preprocessing.OrdinalEncoder.
columns (list, optional) – list of columns used to compose a dataframe if X is given as an ndarray.
- Returns:
X
- Return type:
NDArray | Dataframe
Notes
num_indexes and cat_indexes are mainly used when X is an np.ndarray of shape (m, nf), where m is the number of instances or examples and nf is the number of attributes or features. selector_ is used for dataframe preprocessing.
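The three steps described above (impute, scale, encode) and the role of num_indexes and cat_indexes can be illustrated with a minimal standard-library sketch. This is not the watex implementation: prepare_rows is a hypothetical helper, None stands in for the missing-value marker (the function itself defaults to np.nan), the imputation mimics the most_frequent strategy, and the scaling and encoding only approximate what StandardScaler and OrdinalEncoder are documented to do.

```python
from collections import Counter
from statistics import mean, pstdev


def prepare_rows(rows, num_indexes, cat_indexes, missing=None):
    """Hypothetical helper: impute, scale, and encode a list of rows."""
    # Work column by column, like a transformer applied per feature.
    columns = [list(col) for col in zip(*rows)]

    # Step 1: 'most_frequent' imputation on every selected column.
    for j in list(num_indexes) + list(cat_indexes):
        observed = [v for v in columns[j] if v is not missing]
        fill = Counter(observed).most_common(1)[0][0]
        columns[j] = [fill if v is missing else v for v in columns[j]]

    # Step 2: standardize numerical columns (zero mean, unit variance,
    # population standard deviation as in standard scaling).
    for j in num_indexes:
        mu = mean(columns[j])
        sigma = pstdev(columns[j]) or 1.0  # guard against constant columns
        columns[j] = [(v - mu) / sigma for v in columns[j]]

    # Step 3: ordinal-encode categorical columns using sorted categories.
    for j in cat_indexes:
        codes = {c: float(i) for i, c in enumerate(sorted(set(columns[j])))}
        columns[j] = [codes[v] for v in columns[j]]

    # Back to row-major layout.
    return [list(row) for row in zip(*columns)]


# Illustrative data: column 0 is numerical, column 1 is categorical.
rows = [
    [1.0, "sand"],
    [3.0, "clay"],
    [None, "sand"],
    [3.0, None],
]
X = prepare_rows(rows, num_indexes=[0], cat_indexes=[1])
```

With an ndarray input there are no column dtypes to inspect, which is why the column positions have to be passed explicitly in num_indexes and cat_indexes.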