watex.utils.bin_counting
- watex.utils.bin_counting(data, bin_columns, tname, odds='N+', return_counts=Ellipsis, tolog=Ellipsis)
- Bin-count a categorical variable and turn it into a probabilistic
ratio.
Bin counting is one of the perennial rediscoveries in machine learning. It has been reinvented and used in a variety of applications, from ad click-through rate prediction to hardware branch prediction [Ra4bd71913a2f-1], [Ra4bd71913a2f-2] and [Ra4bd71913a2f-3].
Given an input variable X and a target variable Y, the odds ratio is defined as:
\[\text{odds ratio} = \frac{P(Y = 1 | X = 1) / P(Y = 0 | X = 1)}{P(Y = 1 | X = 0) / P(Y = 0 | X = 0)}\]
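As a worked illustration of the definition, the sketch below computes the odds ratio for a small binary feature and target using plain pandas. The toy data and the helper name `odds_ratio` are assumptions made for this example; they are not part of watex.

```python
import pandas as pd

# Hypothetical toy data: binary feature X and binary target Y.
df = pd.DataFrame({
    "X": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Y": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})


def odds_ratio(frame, x="X", y="Y"):
    """Odds of Y=1 given X=1, divided by the odds of Y=1 given X=0."""
    # Mean of a 0/1 target per group is P(Y = 1 | X = x).
    p_y1 = frame.groupby(x)[y].mean()
    odds_x1 = p_y1.loc[1] / (1 - p_y1.loc[1])  # P(Y=1|X=1) / P(Y=0|X=1)
    odds_x0 = p_y1.loc[0] / (1 - p_y1.loc[0])  # P(Y=1|X=0) / P(Y=0|X=0)
    return odds_x1 / odds_x0


ratio = odds_ratio(df)
print(ratio)
```

Here P(Y = 1 | X = 1) = 3/4 and P(Y = 1 | X = 0) = 1/6, so the odds are 3 and 0.2 and the ratio is approximately 15.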
Probability ratios can easily become very small or very large, which is where the log transform helps. Another useful property of the logarithm is that it turns a division into a subtraction. To convert the bin-statistic probability values to logarithms, set
tolog=True.
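The division-to-subtraction property can be checked numerically. The snippet below is a minimal sketch using only the standard library; the two probabilities are illustrative values, not output of the function.

```python
import math

# Illustrative positive and negative probabilities for one category.
p_pos, p_neg = 0.656716, 0.343284

ratio = p_pos / p_neg                           # plain probability ratio
log_ratio = math.log(p_pos) - math.log(p_neg)   # same quantity in log space

# log(a / b) equals log(a) - log(b), up to floating-point error.
same = math.isclose(math.log(ratio), log_ratio)

# Swapping numerator and denominator only flips the sign in log space.
flipped = math.isclose(math.log(p_neg / p_pos), -log_ratio)
print(same, flipped)
```

In log space, a ratio and its inverse are symmetric around zero, which keeps very small and very large ratios on a comparable scale.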
- Parameters
- data: dataframe
Data containing the categorical values.
- bin_columns: str or list
The columns to which bin counting is applied.
- tname: str, pd.Series
Name of the target on which the counting is performed. If a Series is passed, it must have the same length as the data.
- odds: str, {“N+”, “N-”, “log_N+”}
The odds ratio of bin counting used to fill the categorical columns.
N+ and N- are the positive and negative probabilistic ratios, whereas log_N+ is the logarithm of the odds ratio, useful when values are very small or very large.
- return_counts: bool, default=True
Return the bin counting dataframes.
- tolog: bool, default=False
Apply the logarithm to the output data ratio. Probability ratios can easily become very small or very large. For instance, there will be users who almost never click on ads, and perhaps users who click on ads much more frequently than not. The log transform keeps such values on a manageable scale. Another useful property of the logarithm is that it turns a division into a subtraction.
- Returns
- d: dataframe
The transformed dataframe, or the bin-counting data.
Examples
>>> import pandas as pd
>>> import watex as wx
>>> from watex.utils.mlutils import bin_counting
>>> X, y = wx.fetch_data ('bagoue analysed', as_frame =True)
>>> # target binarize
>>> y [y <=1] = 0; y [y > 0]=1
>>> X.head(2)
Out[7]:
      power  magnitude       sfi      ohmS       lwi  shape  type  geol
0  0.191800  -0.140799 -0.426916  0.386121  0.638622    4.0   1.0   3.0
1 -0.430644  -0.114022  1.678541 -0.185662 -0.063900    3.0   2.0   2.0
>>> bin_counting (X , bin_columns= 'geol', tname =y).head(2)
Out[8]:
      power  magnitude       sfi      ohmS  ...  shape  type      geol  bin_target
0  0.191800  -0.140799 -0.426916  0.386121  ...    4.0   1.0  0.656716           1
1 -0.430644  -0.114022  1.678541 -0.185662  ...    3.0   2.0  0.219251           0
[2 rows x 9 columns]
>>> bin_counting (X , bin_columns= ['geol', 'shape', 'type'], tname =y).head(2)
Out[10]:
      power  magnitude       sfi  ...      type      geol  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.267241  0.656716           1
1 -0.430644  -0.114022  1.678541  ...  0.385965  0.219251           0
[2 rows x 9 columns]
>>> df = pd.DataFrame ( pd.concat ( [X, pd.Series ( y, name ='flow')], axis =1))
>>> bin_counting (df , bin_columns= ['geol', 'shape', 'type'], tname ="flow", tolog=True).head(2)
Out[12]:
      power  magnitude       sfi      ohmS  ...     shape      type      geol  flow
0  0.191800  -0.140799 -0.426916  0.386121  ...  0.828571  0.364706  1.913043     1
1 -0.430644  -0.114022  1.678541 -0.185662  ...  0.364865  0.628571  0.280822     0
>>> bin_counting (df , bin_columns= ['geol', 'shape', 'type'],odds ="N-", tname =y, tolog=True).head(2)
Out[13]:
      power  magnitude       sfi  ...      geol  flow  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.522727     1           1
1 -0.430644  -0.114022  1.678541  ...  3.560976     0           0
[2 rows x 10 columns]
>>> bin_counting (df , bin_columns= "geol",tname ="flow", tolog=True, return_counts= True )
Out[14]:
     flow  no_flow  total_flow        N+        N-     logN+     logN-
3.0    44       23          67  0.656716  0.343284  1.913043  0.522727
2.0    41      146         187  0.219251  0.780749  0.280822  3.560976
0.0    18       43          61  0.295082  0.704918  0.418605  2.388889
1.0     9       20          29  0.310345  0.689655  0.450000  2.222222
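The columns of the last table can be reproduced from the raw counts alone. The sketch below is not the library's implementation; it uses plain pandas and takes the flow and no_flow counts from the output above. The column names `N+/N-` and `N-/N+` are labels chosen for this example; in the output shown, these two ratios match the values listed under logN+ and logN-.

```python
import pandas as pd

# Counts copied from the return_counts=True output above, indexed by geol category.
counts = pd.DataFrame(
    {"flow": [44, 41, 18, 9], "no_flow": [23, 146, 43, 20]},
    index=pd.Index([3.0, 2.0, 0.0, 1.0], name="geol"),
)

# Total observations per category.
counts["total_flow"] = counts["flow"] + counts["no_flow"]

# Positive and negative probabilities per category.
counts["N+"] = counts["flow"] / counts["total_flow"]
counts["N-"] = counts["no_flow"] / counts["total_flow"]

# Ratios of the two probabilities (labels are assumptions of this sketch).
counts["N+/N-"] = counts["N+"] / counts["N-"]
counts["N-/N+"] = counts["N-"] / counts["N+"]

table = counts.round(6)
print(table)
```

For category 3.0, for example, N+ is 44/67 and N- is 23/67, so their ratio reduces to 44/23.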