watex.utils.resampling

watex.utils.resampling(X, y, kind='over', strategy='auto', random_state=None, verbose=Ellipsis, **kws)

Combining Random Oversampling and Undersampling

Resampling creates a transformed version of the training dataset in which the selected examples have a different class distribution. It is a simple and effective strategy for imbalanced classification problems, because it yields a more balanced data distribution.

There are two main approaches to random resampling for imbalanced classification: oversampling and undersampling.

  • Random Oversampling: Randomly duplicate examples in the minority class.

  • Random Undersampling: Randomly delete examples in the majority class.
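The two approaches can be sketched in plain Python as follows. This is a minimal illustration of the idea, not the implementation used by resampling; the helper names random_oversample and random_undersample are hypothetical, and both simply balance every class to the size of the largest (or smallest) class.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(y):
    """Map each class label to the list of row indices that carry it."""
    groups = defaultdict(list)
    for idx, label in enumerate(y):
        groups[label].append(idx)
    return groups


def random_oversample(X, y, seed=None):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(y)
    target = max(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        # Keep all original rows, then draw the shortfall with replacement.
        keep.extend(ix)
        keep.extend(rng.choices(ix, k=target - len(ix)))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, seed=None):
    """Drop randomly chosen rows until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(y)
    target = min(len(ix) for ix in groups.values())
    keep = []
    for ix in groups.values():
        # Sample without replacement so no row is repeated.
        keep.extend(rng.sample(ix, target))
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced data: 6 samples of class 0, 2 samples of class 1.
X = [[i] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_over, y_over = random_oversample(X, y, seed=0)
X_under, y_under = random_undersample(X, y, seed=0)
print(Counter(y_over))   # 6 samples in each class
print(Counter(y_under))  # 2 samples in each class
```

Oversampling keeps every original row and only adds duplicates, whereas undersampling discards rows, so the choice between them trades a larger training set against lost information.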

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, )) – Target vector where n_samples is the number of samples.

  • kind (str, {"over", "under"}, default="over") – Kind of sampling to perform. "over" and "under" stand for oversampling and undersampling, respectively.

  • strategy (float, str, dict, callable, default='auto') –

    Sampling information used to resample the data set.

    • When float, it corresponds to the desired ratio of the number of samples in the minority class to the number of samples in the majority class after resampling. For undersampling, the ratio is expressed as \(\alpha_{us} = N_{m} / N_{rM}\), where \(N_{m}\) is the number of samples in the minority class and \(N_{rM}\) is the number of samples in the majority class after resampling.

      Warning

      float is only available for binary classification. An error is raised for multi-class classification.

    • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

      'majority': resample only the majority class;

      'not minority': resample all classes but the minority class;

      'not majority': resample all classes but the majority class;

      'all': resample all classes;

      'auto': equivalent to 'not minority'.

    • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

    • When callable, a function that takes y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

  • random_state (int, RandomState instance, default=None) –

    Control the randomization of the algorithm.

    • If int, random_state is the seed used by the random number generator;

    • If RandomState instance, random_state is the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

  • verbose (bool, default=False) – If True, display the per-class sample counts before and after resampling.
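To make the float and callable forms of strategy concrete, the sketch below works through the arithmetic for undersampling with the class counts used in the example on this page (232 majority, 112 minority samples). The helper names majority_size_for_ratio and cap_at_twice_minority are hypothetical, and whether resampling truncates the result in the same way is an assumption.

```python
from collections import Counter


def majority_size_for_ratio(y, ratio):
    """Majority size N_rM after undersampling so that N_m / N_rM == ratio."""
    counts = Counter(y)
    n_minority = min(counts.values())
    # Rearranging alpha_us = N_m / N_rM gives N_rM = N_m / alpha_us.
    return int(n_minority / ratio)


def cap_at_twice_minority(y):
    """Callable strategy: take y, return {class: desired number of samples}."""
    counts = Counter(y)
    n_minority = min(counts.values())
    return {label: min(n, 2 * n_minority) for label, n in counts.items()}


y = [0] * 232 + [1] * 112

print(majority_size_for_ratio(y, 0.5))  # 224, since 112 / 224 == 0.5
print(majority_size_for_ratio(y, 1.0))  # 112, i.e. fully balanced
print(cap_at_twice_minority(y))         # {0: 224, 1: 112}
```

A dict such as {0: 224, 1: 112}, or the callable itself, corresponds to the dict and callable forms of strategy described above.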

Returns:

X, y – The resampled data and target.

Return type:

NDarray, Arraylike

Examples

>>> import watex as wx
>>> from watex.utils.mlutils import resampling
>>> data, target = wx.fetch_data('bagoue analysed', as_frame=True)
>>> data.shape, target.shape
((344, 8), (344,))
>>> data_us, target_us = resampling(data, target, kind='under',
...                                 verbose=True)
Counters: Auto
                     Raw counter y: Counter({0: 232, 1: 112})
           UnderSampling counter y: Counter({0: 112, 1: 112})
>>> data_us.shape, target_us.shape
((224, 8), (224,))