watex.utils.select_feature_importances
- watex.utils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)
Select features by importance, using a user-specified threshold, after the model has been fitted.
This is useful if one wants to use RandomForestClassifier as a feature selector and as an intermediate step in a scikit-learn
Pipeline object, which allows us to connect different processing steps with an estimator.
- Parameters:
clf (estimator object) – The base estimator from which the transformer is built. This can be either a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None) – Target vector, where n_samples is the number of samples. If given, set prefit=False so that the estimator is fitted before the data are transformed for feature selection. If the estimator is already fitted, i.e. prefit=True, y is not needed.
threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater than or equal to the threshold are kept, while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used when threshold is None.
prefit (bool, default=True) – Whether a prefit model is expected to be passed in directly or not. If True, clf must be a fitted estimator. If False, clf is fitted and updated by calling fit and partial_fit, respectively.
importance_getter (str or callable, default='auto') –
If ‘auto’, uses the feature importance either through a coef_ attribute or a feature_importances_ attribute of the estimator.
Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in the case of TransformedTargetRegressor, or named_steps.clf.feature_importances_ in the case of a Pipeline with its last step named clf.
If callable, overrides the default feature importance getter. The callable is passed the fitted estimator and should return the importance of each feature.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
If an integer, then it specifies the maximum number of features to allow.
If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.
return_selector (bool, default=False) – If True, returns the selector object; otherwise returns the transformed X.
verbose (int, default=0) – Display the number of features that meet the criterion according to their importance range.
- Returns:
Xs or selector – Array holding the samples restricted to the features that meet the importance criterion, or the selector object if return_selector=True.
- Return type:
ndarray (n_samples, n_criterion_features), or
sklearn.feature_selection.SelectFromModel
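The importance_getter description above says that a string path is resolved with attrgetter. The following is a minimal sketch of that idea using only the standard library. The `wrapped` object and the `abs_importance` callable are hypothetical stand-ins for a fitted estimator and a user-supplied getter; this is not the library's internal code.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-in for a fitted wrapper such as TransformedTargetRegressor:
# the inner fitted model lives under `regressor_` and exposes `coef_`.
wrapped = SimpleNamespace(regressor_=SimpleNamespace(coef_=[0.4, -1.2, 0.0]))

# A dotted string path is followed attribute by attribute.
get_importance = attrgetter("regressor_.coef_")
coef = get_importance(wrapped)

# A callable getter receives the fitted estimator and returns one value per feature.
def abs_importance(estimator):
    return [abs(c) for c in estimator.regressor_.coef_]

importances = abs_importance(wrapped)
```

A string path is convenient when the importances already exist as an attribute, while a callable is the more flexible choice when they have to be computed.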
Examples
>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...       [-2.79, -0.02, -0.85 ],
...       [-1.34, -0.48, -2.55 ],
...       [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]
(1) Use prefit=True and get Xs, the selected important features
>>> Xs = select_feature_importances(
...     LogisticRegression().fit(X0, y0), X0, prefit=True)
>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])
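The description mentions using a RandomForestClassifier as a feature selector and intermediate step in a scikit-learn Pipeline. A minimal sketch of that pattern follows. It uses scikit-learn's SelectFromModel directly instead of this function, and the synthetic data, step names and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only): 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Step 1 keeps the features whose forest importance is at least the mean importance;
# step 2 trains the final classifier on the reduced feature set.
pipe = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="mean",
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

selector = pipe.named_steps["select"]
support = selector.get_support()       # boolean mask over the 10 input features
n_kept = int(support.sum())            # number of features passed on to "clf"
X_reduced = selector.transform(X)      # shape (200, n_kept)
```

Placing the selector inside the Pipeline means the importance-based selection is refitted together with the final estimator, for example within cross-validation.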
(2) Set prefit off and return the selector object
>>> selector = select_feature_importances(
...     LogisticRegression(), X=X0, y=y0, prefit=False, return_selector=True)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])
(3) Compare with SelectFromModel used directly, where the threshold resolves to the mean importance
>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform(X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])
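The two threshold values reported above (0.1 and 0.55245...) can be checked by hand. The sketch below is a pure-Python illustration of how a threshold given as a float, “mean”, “median” or a scaled form such as “1.25*median” can be resolved and compared with the absolute importances. `resolve_threshold` and `support_mask` are hypothetical helper names, not part of watex or scikit-learn, and the sketch ignores the None and l1-penalty cases.

```python
from statistics import mean, median

def resolve_threshold(importances, threshold):
    """Hypothetical helper: turn a threshold spec into a float cutoff."""
    if isinstance(threshold, (int, float)):
        return float(threshold)
    spec = threshold.strip()
    scale = 1.0
    if "*" in spec:
        # Scaled form such as "1.25*mean": factor first, then the reference name.
        factor, spec = spec.split("*", 1)
        scale = float(factor)
        spec = spec.strip()
    reference = {"mean": mean, "median": median}[spec]
    return scale * reference(importances)

def support_mask(raw_importances, threshold):
    """Keep features whose absolute importance is >= the resolved cutoff."""
    importances = [abs(v) for v in raw_importances]
    cutoff = resolve_threshold(importances, threshold)
    return [v >= cutoff for v in importances]

# Coefficients taken from the examples above.
coef = [-0.3252302, 0.83462377, 0.49750423]

mask_fixed = support_mask(coef, 0.1)             # every |coef| exceeds 0.1
mask_mean = support_mask(coef, "mean")           # cutoff is about 0.55245
mask_scaled = support_mask(coef, "1.25*median")  # cutoff is about 0.62188
```

With the fixed threshold of 0.1 all three features are kept, whereas the mean-based cutoff keeps only the second one, which matches the get_support() outputs shown in the examples.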