watex.transformers.StratifiedUsingBaseCategory

class watex.transformers.StratifiedUsingBaseCategory(base_column=None, test_size=0.2, random_state=42)

Transformer that stratifies a dataset so that the train set and the test set are more representative of the full data, especially when the dataset is not large enough.

Parameters:

*base_column* (str or int) – Hyperparameter giving the name or index of the base category column used for stratification. If base_column is None, purely random sampling is used instead.
*test_size* (float) – Fraction of the dataset to put in the test set. Default is 0.2.
*random_state* (int) – Seed used to shuffle the instances of the overall dataset. Default is 42.
Usage

If the dataset is not large enough, especially relative to the number of attributes, purely random sampling runs the risk of introducing a significant sampling bias. Stratified sampling is therefore a better way to avoid a significant bias in the sampling survey. For instance:

>>> from watex.transformers import StratifiedUsingBaseCategory
>>> from watex.utils.mlutils import load_data
>>> df = load_data('data/geo_fdata')
>>> stratifiedObj = StratifiedUsingBaseCategory(base_column='geol')
>>> stratifiedObj.fit_transform(X=df)
>>> stats = stratifiedObj.statistics_

Notes

The statistics_ inspection attribute is a good way to compare the test sets generated using purely random sampling and stratified sampling. The stratified test set has base_column category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
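
The comparison described in this note can be illustrated without watex at all. The following is a minimal plain-Python sketch, not the implementation used by StratifiedUsingBaseCategory: the helper functions (`proportions`, `random_split`, `stratified_split`) and the toy category labels are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter, defaultdict


def proportions(labels):
    """Return the share of each category in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: n / total for cat, n in counts.items()}


def random_split(labels, test_size=0.2, seed=42):
    """Purely random sampling: pick test indices without looking at categories."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = round(len(labels) * test_size)
    return indices[n_test:], indices[:n_test]


def stratified_split(labels, test_size=0.2, seed=42):
    """Stratified sampling: draw the same fraction from every category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for idx, cat in enumerate(labels):
        by_category[cat].append(idx)
    train, test = [], []
    for cat in sorted(by_category):
        members = by_category[cat]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical imbalanced category column (a stand-in for something like 'geol').
labels = ["granite"] * 60 + ["schist"] * 30 + ["gneiss"] * 10

_, random_test = random_split(labels)
_, strat_test = stratified_split(labels)

full = proportions(labels)
strat = proportions([labels[i] for i in strat_test])
rand = proportions([labels[i] for i in random_test])

# One row per category: full dataset, stratified test set, random test set.
for cat in sorted(full):
    print(cat, full[cat], strat.get(cat, 0.0), rand.get(cat, 0.0))
```

In this toy setup the stratified test set reproduces the full-dataset shares exactly, because each category size times test_size is a whole number. The shares in the purely random test set depend on the seed and can drift from the full-dataset shares, most visibly for the smallest category.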