watex.utils.select_feature_importances

watex.utils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)

Select features whose importance meets a user-specified threshold after model fitting. This is useful when an estimator such as RandomForestClassifier serves as a feature selector and intermediate step in a scikit-learn Pipeline object, which connects different processing steps with an estimator.
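The Pipeline use case mentioned above can be sketched with scikit-learn alone. This is a minimal illustration, not watex's implementation: it uses sklearn.feature_selection.SelectFromModel directly, and the synthetic dataset, the step names ("select", "clf"), and the hyperparameters are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data, purely illustrative (not from the watex documentation).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Step 1 keeps features whose random-forest importance is >= 0.1;
# step 2 trains a classifier on the reduced feature set.
pipe = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=0.1,
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Number of features that passed the importance threshold.
n_kept = int(pipe.named_steps["select"].get_support().sum())
print(n_kept)
```

Because random-forest importances sum to 1, a threshold of 0.1 on 10 features always keeps at least one feature; with other thresholds the selector may keep none, so the value should be chosen with the number of features in mind.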

Parameters:
  • clf (estimator object) – The base estimator from which the transformer is built. This can be either a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,), default=None) – Target vector, where n_samples is the number of samples. If given, set prefit=False so that the estimator is fitted and the data are transformed for feature selection. If the estimator is already fitted, i.e. prefit=True, y is not needed.

  • threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater than or equal to the threshold are kept, while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used when threshold is None.

  • prefit (bool, default=True) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.

  • importance_getter (str or callable, default='auto') –

    If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.

    Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.

    If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.

  • norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.

  • max_features (int, callable, default=None) –

    The maximum number of features to select.

    • If an integer, then it specifies the maximum number of features to allow.

    • If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).

    • If None, then all features are kept.

    To only select based on max_features, set threshold=-np.inf.

  • return_selector (bool, default=False) – If True, returns the selector object; otherwise returns the transformed X.

  • verbose (int, default=0) – Displays the number of features that meet the criterion according to their importance range.
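The threshold and max_features options described above can be sketched as follows. The sketch calls sklearn.feature_selection.SelectFromModel directly rather than select_feature_importances, since whether and how watex forwards these options through **kws is an assumption not confirmed here. The data are the small arrays from the Examples section of this page.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X0 = np.array([[ 0.87, -1.34,  0.31],
               [-2.79, -0.02, -0.85],
               [-1.34, -0.48, -2.55],
               [ 1.92,  1.48,  0.65]])
y0 = [0, 1, 0, 1]

# String threshold with a scaling factor: keep features whose absolute
# coefficient is at least 1.25 times the mean absolute coefficient.
sel_scaled = SelectFromModel(LogisticRegression(), threshold="1.25*mean")
sel_scaled.fit(X0, y0)
print(sel_scaled.threshold_, sel_scaled.get_support())

# Select on max_features only: threshold=-np.inf disables the threshold
# test, so exactly the 2 most important features are kept.
sel_top2 = SelectFromModel(
    LogisticRegression(), threshold=-np.inf, max_features=2
)
sel_top2.fit(X0, y0)
print(sel_top2.get_support())
```

Fitting an unfitted estimator through the selector, as done here, avoids relying on prefit behaviour, which has changed between scikit-learn releases.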

Returns:

Xs or selector – Array of shape (n_samples, n_criterion_features) containing only the features that meet the importance criterion, or the selector object if return_selector=True.

Return type:

ndarray (n_samples, n_criterion_features), or sklearn.feature_selection.SelectFromModel

Examples

>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...      [-2.79, -0.02, -0.85 ],
...      [-1.34, -0.48, -2.55 ],
...      [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]

(1) Use prefit=True and get the important features Xs:

>>> Xs = select_feature_importances(
...     LogisticRegression().fit(X0, y0), X0, prefit=True)

>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])
(2) Set prefit=False and return the selector object:

>>> selector = select_feature_importances(
...     LogisticRegression(), X=X0,
...     y=y0,
...     prefit=False, return_selector=True
...     )
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])
>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform(X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])
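The relationship between threshold_, get_support() and transform() in the examples above can be checked by hand. The following is a minimal sketch using scikit-learn's SelectFromModel directly, with the threshold passed explicitly as "mean"; it assumes the default norm_order=1, under which the importance of each feature for a single-row coef_ is its absolute coefficient.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X0 = np.array([[ 0.87, -1.34,  0.31],
               [-2.79, -0.02, -0.85],
               [-1.34, -0.48, -2.55],
               [ 1.92,  1.48,  0.65]])
y0 = [0, 1, 0, 1]

selector = SelectFromModel(LogisticRegression(), threshold="mean").fit(X0, y0)

# Importance of each feature: absolute value of its coefficient.
importances = np.abs(selector.estimator_.coef_).ravel()

# A feature is kept when its importance is >= the resolved threshold.
mask = importances >= selector.threshold_

# transform() is equivalent to selecting the masked columns of X0.
manual = X0[:, mask]
print(mask)
print(manual)
```

The same check applies to a selector returned with return_selector=True, provided it is a fitted SelectFromModel as the Return type states.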