watex.utils.make_naive_pipe
- watex.utils.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)
Make a pipeline to transform data at once.
A naive pipeline is useful for preprocessing the data in a single step before a quick prediction.
Works with a pandas dataframe. If no features are given, the numerical and categorical features are retrieved automatically.
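The automatic retrieval of numerical and categorical features can be pictured with plain pandas. The sketch below is not the watex implementation: `split_feature_names` is a hypothetical helper, and it assumes that features are separated by column dtype (numeric versus everything else).

```python
import numpy as np
import pandas as pd


def split_feature_names(X, num_features=None, cat_features=None):
    """Hypothetical helper: infer any missing feature list from column dtypes.

    Assumption: numeric dtypes are numerical features and all other
    dtypes are categorical features.
    """
    if num_features is None:
        num_features = list(X.select_dtypes(include="number").columns)
    if cat_features is None:
        cat_features = list(X.select_dtypes(exclude="number").columns)
    return list(num_features), list(cat_features)


# Small illustrative frame (made-up values, not a watex dataset).
X = pd.DataFrame(
    {
        "depth": [10.0, 12.5, np.nan],
        "resistivity": [120.0, 95.0, 101.0],
        "rock": ["granite", "schist", "granite"],
    }
)

num, cat = split_feature_names(X)
# Only the numerical list is supplied; the categorical one is inferred.
num_only, cat_inferred = split_feature_names(X, num_features=["depth"])
```

Passing only one of the two lists leaves the other to be inferred, which mirrors the behaviour described for `num_features` and `cat_features`.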
- Parameters
X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported; use a sparse csc_matrix for maximum efficiency.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
num_features (list of str, optional) – List of numerical features. If num_features is given and cat_features is None, cat_features are inferred automatically.
cat_features (list of str, optional) – List of categorical features. If cat_features is given and num_features is None, num_features are inferred automatically.
label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – Kind of encoding used to encode the label. This assumes ‘y’ is supplied.
scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – Kind of scaling applied to the numerical data. Note that the categorical data are encoded with ‘sklearn.preprocessing.OneHotEncoder’ under the hood.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
impute_strategy (str, default='median') –
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
strategy=”constant” for fixed value imputation.
sparse_output (bool, default=True) – Used when the label y is given. Labels are binarized in a one-vs-all fashion. If True, the array returned by the transform is in sparse CSR format.
for_pca (bool, default=False) – Transform data for principal component analysis (PCA). If set to True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.
transform (bool, default=False) – Transform data in place rather than returning the naive pipeline.
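To make the imputation and encoding options above concrete, here is a minimal sketch that calls scikit-learn directly. It assumes that the parameters map onto `SimpleImputer`, `OneHotEncoder` and `OrdinalEncoder` as their descriptions suggest; the arrays are made-up values for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One numerical column with a missing entry.
col = np.array([[1.0], [2.0], [np.nan], [9.0]])

# "median" fills the gap with the median of the observed values (1, 2, 9).
median_filled = SimpleImputer(missing_values=np.nan, strategy="median").fit_transform(col)
# "mean" fills the gap with their mean instead.
mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(col)

# One categorical column with three distinct categories.
rocks = np.array([["granite"], ["schist"], ["granite"], ["gneiss"]])

# One-hot encoding yields one column per category (sparse by default).
onehot = OneHotEncoder().fit_transform(rocks)
# Ordinal encoding keeps a single column of integer codes, which is the
# encoder the for_pca option is described as switching to.
ordinal = OrdinalEncoder().fit_transform(rocks)
```

One-hot encoding widens the feature matrix by one column per category, whereas ordinal encoding keeps a single column per feature.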
- Returns
full_pipeline (watex.exlib.sklearn.FeatureUnion) – Full pipeline composed of numerical and categorical pipes.
(X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) – Transformed data.
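A feature union made of a numerical pipe and a categorical pipe can be approximated with scikit-learn alone. The sketch below is an assumption about the general shape of such a pipeline, not the watex source: `select_columns` and `build_sketch_pipeline` are hypothetical names, and the step layout (column selection, median imputation, scaling, one-hot encoding) is inferred from the parameter descriptions.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def select_columns(X, columns):
    """Hypothetical helper: keep only the named dataframe columns."""
    return X[columns]


def build_sketch_pipeline(num_features, cat_features):
    """Hypothetical builder: union of a numerical and a categorical pipe."""
    num_pipe = Pipeline(
        [
            ("select", FunctionTransformer(select_columns, kw_args={"columns": num_features})),
            ("impute", SimpleImputer(missing_values=np.nan, strategy="median")),
            ("scale", StandardScaler()),
        ]
    )
    cat_pipe = Pipeline(
        [
            ("select", FunctionTransformer(select_columns, kw_args={"columns": cat_features})),
            ("encode", OneHotEncoder()),
        ]
    )
    # FeatureUnion stacks the outputs of both pipes column-wise.
    return FeatureUnion([("num", num_pipe), ("cat", cat_pipe)])


# Made-up data for illustration only.
X = pd.DataFrame(
    {
        "depth": [10.0, 12.5, np.nan],
        "resistivity": [120.0, 95.0, 101.0],
        "rock": ["granite", "schist", "granite"],
    }
)

union = build_sketch_pipeline(["depth", "resistivity"], ["rock"])
Xt = union.fit_transform(X)  # 2 scaled numeric columns + 2 one-hot columns
```

Because the one-hot step returns a sparse matrix, the stacked result is sparse as well, which is consistent with the CSR matrix shown in the examples.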
Examples
>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs
(1) Make a simple naive pipeline with RobustScaler in place of the default StandardScaler
>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs(as_frame=True)  # get all the data
>>> pipe = make_naive_pipe(X_, scaler=RobustScaler)
(2) Transform X in place with numerical and categorical features using StandardScaler (default). Returns a CSR matrix
>>> make_naive_pipe(X_, transform=True)
... <181x40 sparse matrix of type '<class 'numpy.float64'>' with 2172 stored elements in Compressed Sparse Row format>