watex.utils.label_importance

watex.utils.label_importance(label, arr_k, arr_aq, *, method='naive')
Compute the score of a label and its representativity in the valid array ‘arr_k’.

Parameters:
label: int or str

Class label from the true labels array of the permeability coefficient ‘k’. If a string is given, make sure the array is converted to dtype str. It is recommended to provide data with no NaN values to keep full control of the occurrence results.

arr_k: array-like 1d

Array of true labels containing the permeability coefficient ‘k’.

arr_aq: array-like 1d

True labels of the groups of aquifers, or the predicted naive group of aquifer (NGA labels). See predict_NGA_labels().

method: str [‘naive’, ‘strict’], default=‘naive’

The strategy used to compute the representativity of a label in the predicted array ‘arr_aq’:

  • ‘naive’ computes the importance of the label from the number of occurrences of this specific label in the array ‘k’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in ‘arr_k’.

  • ‘strict’ computes the importance of the label from the number of occurrences in the whole valid ‘arr_k’, i.e. relative to the total number of occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the class labels in ‘arr_k’ are not unbalanced.
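As a rough illustration of the difference between the two strategies, the sketch below follows one possible reading of the descriptions above. It is not the watex implementation: the helper name `group_rates_sketch`, the toy arrays, and the choice of denominators (the number of occurrences of the label for ‘naive’, the length of `arr_k` for ‘strict’) are all assumptions, and the inputs are assumed to contain no NaN.

```python
import numpy as np


def group_rates_sketch(label, arr_k, arr_aq, method="naive"):
    """Hypothetical helper: rate of each aquifer group among samples with `label`.

    Assumption: 'naive' divides by the number of occurrences of `label` in
    arr_k, 'strict' divides by the total number of samples in arr_k.
    """
    arr_k = np.asarray(arr_k)
    arr_aq = np.asarray(arr_aq)
    if arr_k.shape != arr_aq.shape:
        raise ValueError("arr_k and arr_aq must have the same length")

    mask = arr_k == label  # samples whose true label is `label`
    groups, counts = np.unique(arr_aq[mask], return_counts=True)

    denom = int(mask.sum()) if method == "naive" else arr_k.size
    if denom == 0:
        return {}

    rates = {str(g): round(int(c) / denom, 3) for g, c in zip(groups, counts)}
    # Most represented group first, as in a ranked report.
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Toy data (made up for illustration only).
toy_k = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
toy_aq = ["V", "V", "IV", "II", "V", "III", "III", "IV", "V", "V"]

naive_rates = group_rates_sketch(1, toy_k, toy_aq, method="naive")
strict_rates = group_rates_sketch(1, toy_k, toy_aq, method="strict")
print(naive_rates)   # rates relative to the 3 samples labelled 1
print(strict_rates)  # rates relative to all 10 samples
```

Under these assumptions the ‘naive’ rates of a label sum to 1 over the groups, while the ‘strict’ rates sum to the share of that label in the whole array, which is why ‘strict’ scores shrink for rare labels.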

Returns:
label_dict_group_rate: dict

Dictionary of the label and its rate of occurrence in ‘arr_aq’. Each group in ‘arr_aq’ thus has its own rate of representativity of the label in ‘arr_k’.

‘ , r)
label k = 1 :
{'V': 0.17, 'IV': 0.141, 'II': 0.126, 'III': 0.084, 'IV&V': 0.005, 'II&III': 0.003, 'III&IV': 0.003}

label k = 2 :
{'III': 0.052, 'II': 0.05, 'V': 0.05, 'IV': 0.034, 'III&IV': 0.005}

label k = 3 :
{'V': 0.123, 'IV': 0.086, 'III': 0.068}

>>> # **comments:
    # label k=1 has an importance of 17% for group V and 12.6% for group II,
    # whereas label k=2 has a weak rate in the whole dataset (~19% over all
    # groups). The most dominant labels are k=1 and k=3, with 53.14% and
    # 27.74% of the dataset respectively.
    # If the threshold of representativity is set to 50%, none of the true
    # labels k will fit any aquifer group, since the max representativity
    # score is 17%, reached by group V for k=1.
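
The reasoning in these comments can be checked directly from the printed dictionaries. The snippet below is a minimal sketch that only reuses the rates shown above; the variable names and the way the 50% threshold is applied (to the best group of each label) are assumptions made for illustration.

```python
# Rates copied from the output above, one dictionary per label k.
rates_by_label = {
    1: {"V": 0.17, "IV": 0.141, "II": 0.126, "III": 0.084,
        "IV&V": 0.005, "II&III": 0.003, "III&IV": 0.003},
    2: {"III": 0.052, "II": 0.05, "V": 0.05, "IV": 0.034, "III&IV": 0.005},
    3: {"V": 0.123, "IV": 0.086, "III": 0.068},
}
threshold = 0.5  # hypothetical representativity threshold (50%)

# Share of each label over all groups (sum of its rates).
totals = {k: round(sum(r.values()), 3) for k, r in rates_by_label.items()}

# Best represented group for each label, as a (group, rate) pair.
best = {k: max(r.items(), key=lambda kv: kv[1]) for k, r in rates_by_label.items()}

# Labels whose best group reaches the threshold.
passing = {k: group for k, (group, rate) in best.items() if rate >= threshold}

print(totals)   # {1: 0.532, 2: 0.191, 3: 0.277}
print(best)     # {1: ('V', 0.17), 2: ('III', 0.052), 3: ('V', 0.123)}
print(passing)  # {}
```

With a 50% threshold `passing` is empty, since the highest rate is 0.17 (group V for k=1).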