watex.transformers.StratifiedUsingBaseCategory

class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)

Transformer that stratifies a dataset so that the training set and the test set are more representative of the whole, especially when the dataset is not large.

Parameters
  • base_column (str or int) – Name or index of the base category column used for stratification. If base_column is None, purely random sampling is returned.

  • test_size (float) – Fraction of the dataset to put in the test set. Default is 0.2.

  • random_state (int) – Seed used when shuffling the instances of the overall dataset. Default is 42.

Usage

If the dataset is not large enough, especially relative to the number of attributes, purely random sampling runs the risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid such a bias. For instance:

    >>> from watex.transformers import StratifiedUsingBaseCategory
    >>> from watex.utils.mlutils import load_data
    >>> df = load_data('data/geo_fdata')
    >>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
    >>> stratifiedObj.fit_transform(X=df)
    >>> stats = stratifiedObj.statistics_
    

Notes

The statistics_ inspection attribute is a good way to compare the test sets generated by purely random sampling and by stratified sampling. The stratified test set has base_column category proportions almost identical to those in the full dataset, whereas the test set generated by purely random sampling can be noticeably skewed.
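
The contrast between the two sampling schemes can be illustrated without watex. The following is a minimal sketch in plain Python, not the library's implementation: the helper names (proportions, random_test_split, stratified_test_split) and the synthetic geol labels are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def proportions(labels):
    """Return each category's share of `labels` as a dict."""
    counts = Counter(labels)
    return {cat: n / len(labels) for cat, n in counts.items()}


def random_test_split(labels, test_size=0.2, seed=42):
    """Pick test indices uniformly at random, ignoring categories."""
    rng = random.Random(seed)
    n_test = round(len(labels) * test_size)
    return sorted(rng.sample(range(len(labels)), n_test))


def stratified_test_split(labels, test_size=0.2, seed=42):
    """Pick test indices category by category, keeping each category's share."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for i, cat in enumerate(labels):
        by_cat[cat].append(i)
    test_idx = []
    for idx in by_cat.values():
        # Draw the same fraction from every category.
        test_idx.extend(rng.sample(idx, round(len(idx) * test_size)))
    return sorted(test_idx)


# Hypothetical, imbalanced category column standing in for `base_column`.
geol = ["granite"] * 60 + ["schist"] * 30 + ["gneiss"] * 10

random_idx = random_test_split(geol)
strat_idx = stratified_test_split(geol)

overall = proportions(geol)
random_props = proportions([geol[i] for i in random_idx])
strat_props = proportions([geol[i] for i in strat_idx])

# Print overall, random and stratified test-set proportions per category.
for cat in overall:
    print(cat, overall[cat], random_props.get(cat, 0.0), strat_props[cat])
```

In this sketch the stratified test set reproduces the overall category shares by construction, while the random test set is free to drift from them. The drift tends to matter most for rare categories in small datasets.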