watex.utils.bin_counting

watex.utils.bin_counting(data, bin_columns, tname, odds='N+', return_counts=Ellipsis, tolog=Ellipsis)

Bin count a categorical variable and turn it into a probabilistic ratio.

Bin counting is one of the perennial rediscoveries in machine learning. It has been reinvented and used in a variety of applications, from ad click-through rate prediction to hardware branch prediction [Ra4bd71913a2f-1], [Ra4bd71913a2f-2] and [Ra4bd71913a2f-3].

Given an input variable X and a target variable Y, the odds ratio is defined as:

\[\text{odds ratio} = \frac{P(Y = 1 | X = 1) / P(Y = 0 | X = 1)}{P(Y = 1 | X = 0) / P(Y = 0 | X = 0)}\]

Probability ratios can easily become very small or very large, and the log transform keeps them in a manageable range. Another useful property of the logarithm is that it turns a division into a subtraction. To convert the bin statistic probability values to log, set tolog=True.
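
As a minimal sketch of the definition above (independent of watex; the function name `odds_ratio` and the probabilities 0.8 and 0.2 are illustrative assumptions), the odds ratio can be computed from the two conditional probabilities, and its logarithm equals the difference of the two log-odds:

```python
import math


def odds_ratio(p_y1_given_x1, p_y1_given_x0):
    """Odds ratio from P(Y=1|X=1) and P(Y=1|X=0).

    Hypothetical helper for illustration; it is not part of watex.
    """
    odds_x1 = p_y1_given_x1 / (1.0 - p_y1_given_x1)  # P(Y=1|X=1) / P(Y=0|X=1)
    odds_x0 = p_y1_given_x0 / (1.0 - p_y1_given_x0)  # P(Y=1|X=0) / P(Y=0|X=0)
    return odds_x1 / odds_x0


# Assumed example probabilities, chosen only to make the arithmetic easy.
p1, p0 = 0.8, 0.2

ratio = odds_ratio(p1, p0)           # (0.8 / 0.2) / (0.2 / 0.8), about 16
log_ratio = math.log(ratio)

# The logarithm turns the division of odds into a subtraction of log-odds.
log_diff = math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))
```

Working with `log_diff` rather than `ratio` keeps extreme ratios on a compact, symmetric scale.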

Parameters

data: dataframe: Data containing the categorical values.
bin_columns: str or list: The column(s) to which bin counting is applied.
tname: str, pd.Series: The target for which the counts are computed. If a Series, it must have the same length as the data.
odds: str, {“N+”, “N-”, “log_N+”}: The odds ratio used to fill the categorical columns. N+ and N- are the positive and negative probabilities, whereas log_N+ is the logarithm of the odds ratio, useful when values are very small or very large.
return_counts: bool, default=False: If True, return the bin-counting dataframe instead of the transformed data.
tolog: bool, default=False: Apply the logarithm to the output ratio. Probability ratios can easily become very small or very large. For instance, there will be users who almost never click on ads, and perhaps users who click on ads much more frequently than not. The log transform keeps such values in a manageable range. Another useful property of the logarithm is that it turns a division into a subtraction.

Returns

d: dataframe: The transformed data, or the bin-counting table if return_counts=True.

Examples

>>> import pandas as pd
>>> import watex as wx
>>> from watex.utils.mlutils import bin_counting
>>> X, y = wx.fetch_data ('bagoue analysed', as_frame =True)
>>> # target binarize
>>> y [y <=1] = 0;  y [y > 0]=1
>>> X.head(2)
      power  magnitude       sfi      ohmS       lwi  shape  type  geol
0  0.191800  -0.140799 -0.426916  0.386121  0.638622    4.0   1.0   3.0
1 -0.430644  -0.114022  1.678541 -0.185662 -0.063900    3.0   2.0   2.0
>>> bin_counting(X, bin_columns='geol', tname=y).head(2)
      power  magnitude       sfi      ohmS  ...  shape  type      geol  bin_target
0  0.191800  -0.140799 -0.426916  0.386121  ...    4.0   1.0  0.656716           1
1 -0.430644  -0.114022  1.678541 -0.185662  ...    3.0   2.0  0.219251           0
[2 rows x 9 columns]
>>> bin_counting(X, bin_columns=['geol', 'shape', 'type'], tname=y).head(2)
      power  magnitude       sfi  ...      type      geol  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.267241  0.656716           1
1 -0.430644  -0.114022  1.678541  ...  0.385965  0.219251           0
[2 rows x 9 columns]
>>> df = pd.DataFrame(pd.concat([X, pd.Series(y, name='flow')],
...                             axis=1))
>>> bin_counting(df, bin_columns=['geol', 'shape', 'type'],
...              tname="flow", tolog=True).head(2)
      power  magnitude       sfi      ohmS  ...     shape      type      geol  flow
0  0.191800  -0.140799 -0.426916  0.386121  ...  0.828571  0.364706  1.913043     1
1 -0.430644  -0.114022  1.678541 -0.185662  ...  0.364865  0.628571  0.280822     0
>>> bin_counting(df, bin_columns=['geol', 'shape', 'type'], odds="N-",
...              tname=y, tolog=True).head(2)
      power  magnitude       sfi  ...      geol  flow  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.522727     1           1
1 -0.430644  -0.114022  1.678541  ...  3.560976     0           0
[2 rows x 10 columns]
>>> bin_counting(df, bin_columns="geol", tname="flow", tolog=True,
...              return_counts=True)
     flow  no_flow  total_flow        N+        N-     logN+     logN-
3.0    44       23          67  0.656716  0.343284  1.913043  0.522727
2.0    41      146         187  0.219251  0.780749  0.280822  3.560976
0.0    18       43          61  0.295082  0.704918  0.418605  2.388889
1.0     9       20          29  0.310345  0.689655  0.450000  2.222222
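
The counts table above can be approximated with plain pandas. The following is a sketch only, not the watex implementation: it rebuilds a toy dataframe from the flow and no_flow counts printed in the table, then derives the totals, N+, N- and their ratio. The names `counts`, `table`, `odds` and `encoded` are illustrative assumptions. The ratio N+/N- computed here matches the numbers printed in the logN+ column of the example.

```python
import pandas as pd

# geol category -> (flow count, no_flow count), copied from the table above.
counts = {3.0: (44, 23), 2.0: (41, 146), 0.0: (18, 43), 1.0: (9, 20)}

# Rebuild one row per sample so the counting can be done with groupby.
rows = []
for category, (n_pos, n_neg) in counts.items():
    rows += [(category, 1)] * n_pos + [(category, 0)] * n_neg
df = pd.DataFrame(rows, columns=["geol", "flow"])

# Per-category statistics of the binary target.
grouped = df.groupby("geol")["flow"]
table = pd.DataFrame({"flow": grouped.sum(), "total_flow": grouped.count()})
table["no_flow"] = table["total_flow"] - table["flow"]
table["N+"] = table["flow"] / table["total_flow"]      # P(flow = 1 | geol)
table["N-"] = table["no_flow"] / table["total_flow"]   # P(flow = 0 | geol)
table["odds"] = table["N+"] / table["N-"]              # ratio of the two

# Replace each category by its N+ statistic (the bin-counting encoding).
encoded = df["geol"].map(table["N+"])
```

Mapping the per-category statistic back onto the column is what turns a categorical feature into a numeric one; any of the derived columns could be used in place of `N+`.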