watex.utils.find_similar_labels

watex.utils.find_similar_labels(y_true, y_pred, *, categorize_k=False, threshold=None, func=None, keep_label_0=False, method='naive', return_groups=False, **kwd)

Find similarities between y_true and y_pred and return the similarity rate.

Parameters

y_true (array-like 1d or pandas.Series) – Array containing the true labels of ‘k’.
y_pred (array-like or pandas.Series) – Array containing the predicted naive group of aquifers (NGA).
categorize_k (bool, default=False) – If set to True, the user needs to provide a function func to map or categorize the permeability coefficient ‘k’ into integer labels.
func (callable) – Function used to map the permeability coefficient column of the dataframe or series. If not given, the default function can be enabled instead with the parameter default_func.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to one of the NGA labels in ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to the other labels is considered the most representative and similar. Setting a rule by fixing the threshold is recommended, especially for a huge dataset.
keep_label_0 (bool, default=False) –
Force including 0 in the predicted labels if keep_label_0 is set to True. Label ‘0’ mostly refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore ‘0’ can be excluded, since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, ‘0’ is therefore removed by default in the ‘k’ categorization. However, when the prediction ‘y_pred’ comes from an unsupervised method, the prediction straightforwardly includes ‘0’, i.e. ‘k=0’, as a first class. So the value +1 is used to move all class labels forward, thereby excluding the ‘0’ label. To force including 0 in the labels, set keep_label_0 to True.
method (str ['naive', 'strict'], default='naive') –
The strategy used to compute the representativity of a label in the predicted array ‘y_pred’. It can be ‘naive’ or ‘strict’:
- naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. out of the total occurrences of all the labels that exist in the whole ‘arra_aq’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
return_groups (bool, default=False) – Return the label groups and their value counts in the predicted labels y_pred where ‘k’ values are not missing.
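
The difference between the two method options, and the +1 shift described under keep_label_0, can be illustrated with a small standard-library sketch. This is not the watex implementation: the helper names shift_cluster_labels and group_rates are hypothetical, the toy arrays are made up, and the two normalizations follow one reading of the descriptions above (naive divides by the number of samples carrying the same true label, strict divides by the number of all valid samples). The library's actual normalization may differ.

```python
from collections import Counter, defaultdict


def shift_cluster_labels(y_pred):
    """Move integer cluster labels forward by one so that no class is labelled 0."""
    return [label + 1 for label in y_pred]


def group_rates(y_true, y_pred, method="naive", missing=0):
    """Rate of each predicted group among the samples of each true label.

    Assumption: samples whose true label equals `missing` are skipped,
    i.e. treated as a missing 'k'-value.
    """
    if method not in ("naive", "strict"):
        raise ValueError("method must be 'naive' or 'strict'")

    # Count the predicted groups separately for every true label.
    counts = defaultdict(Counter)
    for true_label, group in zip(y_true, y_pred):
        if true_label == missing:
            continue
        counts[true_label][group] += 1

    # Total number of valid (non-missing) samples, used by 'strict'.
    n_valid = sum(sum(counter.values()) for counter in counts.values())

    rates = {}
    for true_label, counter in counts.items():
        # naive: normalize by this label's own sample count;
        # strict: normalize by the count of all valid samples.
        denom = sum(counter.values()) if method == "naive" else n_valid
        rates[true_label] = {
            group: round(n / denom, 3) for group, n in counter.most_common()
        }
    return rates


# Toy data: true label 0 stands for a missing 'k'-value.
y_true = [1, 1, 1, 1, 2, 2, 0, 0]
clusters = [0, 0, 0, 1, 1, 1, 0, 1]       # raw labels from a clustering step
y_pred = shift_cluster_labels(clusters)   # clusters 0/1 become groups 1/2

naive = group_rates(y_true, y_pred, method="naive")
strict = group_rates(y_true, y_pred, method="strict")
print(naive)
print(strict)
```

Under this reading, the naive rates of each true label sum to 1 on their own, whereas the strict rates sum to 1 only across all labels, so frequent labels dominate in the strict case.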

Returns

g.similarity – Tuple of labels found that are considered similar in predicted labels.
g.group – Tuple of groups that have their similarity in the true labels.

Example

>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import find_similar_labels, classify_k
>>> data = read_data('data/boreholes/hf.csv')
>>> ymap = classify_k(data.k, default_func=True)
>>> # Note that for the demo we use the aquifer group column; in
>>> # practice, y_pred must be the predicted NGA labels. These can be
>>> # obtained with the function <predict_NGA_labels>
>>> sim = find_similar_labels(y_true=ymap, y_pred=data.aquifer_group)
>>> sim
((1, 'V'), (2, 'III'), (3, 'V'))
>>> group = find_similar_labels(ymap, data.aquifer_group, return_groups=True)
>>> group
((1,
  {'V': 0.17,
   'IV': 0.141,
   'II': 0.126,
   'III': 0.084,
   'IV&V': 0.005,
   'II&III': 0.003,
   'III&IV': 0.003}),
 (2, {'III': 0.052, 'II': 0.05, 'V': 0.05, 'IV': 0.034, 'III&IV': 0.005}),
 (3, {'V': 0.123, 'IV': 0.086, 'III': 0.068}))
>>> find_similar_labels(y_true=ymap, y_pred=data.aquifer_group,
...                     threshold=0.15)
[(1, 'V')]
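
The relation between the grouped rates and the two similarity results in the example can be sketched in plain Python. The rates below are copied from the return_groups output shown in the example; the helper pick_similar is hypothetical and only mirrors the documented behaviour of threshold (keep the most frequent group of each label, and, when a threshold is set, only if its rate reaches it). Whether the library compares with >= or >, and how it breaks ties, are assumptions here.

```python
# Rates copied from the `return_groups=True` output in the example.
groups = (
    (1, {"V": 0.17, "IV": 0.141, "II": 0.126, "III": 0.084,
         "IV&V": 0.005, "II&III": 0.003, "III&IV": 0.003}),
    (2, {"III": 0.052, "II": 0.05, "V": 0.05, "IV": 0.034, "III&IV": 0.005}),
    (3, {"V": 0.123, "IV": 0.086, "III": 0.068}),
)


def pick_similar(groups, threshold=None):
    """Pair each true label with its most frequent predicted group.

    With a threshold, a pair is kept only if the rate of that group is
    at least `threshold` (assumed inclusive comparison).
    """
    selected = []
    for label, rates in groups:
        # Group with the highest rate for this true label.
        best_group, best_rate = max(rates.items(), key=lambda item: item[1])
        if threshold is None or best_rate >= threshold:
            selected.append((label, best_group))
    return selected


no_rule = pick_similar(groups)
with_rule = pick_similar(groups, threshold=0.15)
print(no_rule)    # [(1, 'V'), (2, 'III'), (3, 'V')]
print(with_rule)  # [(1, 'V')]
```

With the 0.15 threshold only label 1 survives, because the best rates for labels 2 and 3 (0.052 and 0.123) fall below it. The sketch always returns a list, whereas the example shows a tuple when no threshold is given.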