watex.base.SequentialBackwardSelection

class watex.base.SequentialBackwardSelection(estimator=None, k_features=1, scoring='accuracy', test_size=0.25, random_state=42)

Sequential Backward Selection (SBS) is a feature selection algorithm that reduces the dimensionality of the initial feature subspace with a minimum decay in classifier performance, in order to improve computational efficiency. In certain cases, SBS can even improve the predictive power of a model that suffers from overfitting.

The idea behind SBS is simple: it sequentially removes features from the full feature set until the new feature subspace contains the desired number of features. To determine which feature to remove at each stage, a criterion function \(J\) must be defined [1]. The criterion computed by this function can simply be the difference in classifier performance before and after the removal of a particular feature. The feature to remove at each stage is then the one that maximizes this criterion; in simpler terms, at each stage the algorithm eliminates the feature whose removal causes the least performance loss. Based on this definition, the algorithm can be outlined in four steps:

1. Initialize the algorithm with \(k=d\), where \(d\) is the dimensionality of the full feature space \(X_d\).

2. Determine the feature \(x^{-}\) that maximizes the criterion: \(x^{-} = \arg\max J(X_k - x)\), where \(x \in X_k\).

3. Remove the feature \(x^{-}\) from the feature set: \(X_{k-1} = X_k - x^{-}\); \(k = k - 1\).

4. Terminate if \(k\) equals the number of desired features; otherwise go to step 2 [2].
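
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the watex implementation: the function name `sequential_backward_selection`, the `criterion` callable and the toy scores are all assumptions made for the example. Here \(J\) is taken to be the score of the reduced subset itself, so maximizing it is the same as removing the feature that costs the least performance.

```python
from typing import Callable, List, Tuple


def sequential_backward_selection(
    n_features: int,
    k_features: int,
    criterion: Callable[[Tuple[int, ...]], float],
) -> Tuple[List[Tuple[int, ...]], List[float]]:
    """Greedy SBS over the feature indices 0..n_features-1.

    Returns the subsets visited (largest first) and their criterion values.
    """
    if not 1 <= k_features <= n_features:
        raise ValueError("k_features must be in [1, n_features]")

    current = tuple(range(n_features))  # step 1: start with k = d
    subsets = [current]
    scores = [criterion(current)]

    while len(current) > k_features:  # step 4: stop once k is reached
        # step 2: score every subset obtained by dropping exactly one feature
        candidates = [
            tuple(f for f in current if f != dropped) for dropped in current
        ]
        candidate_scores = [criterion(c) for c in candidates]
        best = max(range(len(candidates)), key=candidate_scores.__getitem__)
        # step 3: keep the best reduced subset, so k decreases by one
        current = candidates[best]
        subsets.append(current)
        scores.append(candidate_scores[best])

    return subsets, scores


# Toy criterion (hypothetical): features 0 and 2 are informative,
# every other feature carries a small penalty.
def toy_criterion(subset: Tuple[int, ...]) -> float:
    useful = {0: 0.5, 2: 0.25}
    return sum(useful.get(f, -0.125) for f in subset)


subsets, scores = sequential_backward_selection(4, 2, toy_criterion)
print(subsets[-1], scores[-1])
```

In practice the criterion would fit the estimator on the training split restricted to the candidate columns and return its score on the held-out split; the toy function only stands in for that step.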

Parameters:

estimator (callable or instantiated object) – Callable or instance that has a fit method.
k_features (int, default=1) – The desired number of features at which the selection stops. It must be less than the number of features in the training set; otherwise no feature would be removed.
scoring (callable or str, default='accuracy') – Metric for scoring. Available metrics are ‘precision’, ‘recall’, ‘roc_auc’ and ‘accuracy’. Any other metric raises an error.
test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
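
The float and int forms of test_size can be illustrated with a small sketch. It assumes the split follows the convention of scikit-learn's train_test_split, where a float proportion is rounded up to a whole number of test samples. Whether this class forwards test_size unchanged is an assumption, and `resolve_test_count` is a hypothetical helper, not part of watex.

```python
import math
from typing import Union


def resolve_test_count(n_samples: int, test_size: Union[float, int]) -> int:
    """Return the number of held-out samples for a float or int test_size."""
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie strictly between 0 and 1")
        # A proportion is rounded up to a whole number of samples.
        return math.ceil(test_size * n_samples)
    if not 0 < test_size < n_samples:
        raise ValueError("an int test_size must lie between 1 and n_samples - 1")
    # An integer is already an absolute number of samples.
    return test_size


print(resolve_test_count(200, 0.25))  # proportion of the dataset
print(resolve_test_count(101, 0.25))  # proportion that needs rounding
print(resolve_test_count(200, 30))    # absolute number of samples
```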

References

[1]

Raschka, S., Mirjalili, V., 2019. Python Machine Learning, 3rd ed. Packt.

[2]

Ferri, F., Pudil, P., Hatef, M., Kittler, J., 1994. Comparative study of techniques for large-scale feature selection, pp. 403-413.

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

Type: ndarray of shape (n_features_in_,)

indices_

Indices of the features in the best validated subset.

Type: tuple of column indices of X

subsets_

List of the indices_ tuples recorded at each selection stage.

Type: list

scores_

Scores of the best model obtained at each stage during validation.

Type: list

k_score_

The score obtained with the desired number of features.

Type: float
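
As a rough illustration of how the recorded subsets and scores might be used together, the sketch below picks the best-scoring subset and prefers the smaller one on ties. It assumes that subsets_ and scores_ are parallel lists with one entry per selection stage, as in the reference implementation in [1]; this page does not confirm that layout. The helper `pick_subset` and the numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def pick_subset(
    subsets: Sequence[Tuple[int, ...]], scores: Sequence[float]
) -> Tuple[Tuple[int, ...], float]:
    """Return the highest-scoring subset, preferring fewer features on ties."""
    if not subsets or len(subsets) != len(scores):
        raise ValueError("subsets and scores must be non-empty and aligned")
    best = max(
        range(len(scores)),
        key=lambda i: (scores[i], -len(subsets[i])),
    )
    return subsets[best], scores[best]


# Hypothetical values standing in for sbs.subsets_ and sbs.scores_.
subsets: List[Tuple[int, ...]] = [(0, 1, 2, 3), (0, 2, 3), (0, 2), (0,)]
scores: List[float] = [0.90, 0.95, 0.95, 0.80]

chosen, score = pick_subset(subsets, scores)
print(chosen, score)
```

Preferring the smaller subset on ties matches the stated goal of SBS: the same validation score with fewer features means a cheaper model.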

Examples

>>> from watex.exlib.sklearn import KNeighborsClassifier, train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> X, y = fetch_data('bagoue analysed')  # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs = SequentialBackwardSelection(knn)
>>> sbs.fit(Xtrain, ytrain)