watex.utils.make_MXS_labels

watex.utils.make_MXS_labels(y_true, y_pred, threshold=None, similar_labels=None, sep=None, prefix=None, method='naive', trailer='*', return_obj=False, **kws)

Create Mixture Learning Strategy (MXS) labels from the true labels ‘y_true’ and the predicted Naive Group of Aquifer (NGA) labels ‘y_pred’.

Parameters

y_true (array-like 1d, pandas.Series) – Array composed of valid k-values and possible missing k-values.
y_pred (array-like 1d, pandas.Series) – Array composed of the valid NGA labels. Note that NGA labels are predicted labels, mostly obtained through unsupervised learning.
threshold (float, default=None) – The threshold above which a label in ‘y_true’ can be considered similar to one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance (occurrence) in the data compared to other labels is taken as the most representative and similar. Setting the rule by fixing the threshold is recommended, especially for a large dataset.
similar_labels (list of tuple, optional) – List of tuples, each a pair (label, similar group). If given, the similar group must be a label existing in the predicted NGA. If None, the auto-similarity is triggered.
sep (str, default='') –
Separator between the true labels ‘y_true’ and the predicted NGA labels. sep is used to rewrite the MXS labels. Mostly, an MXS label is a combination of the true label of the permeability coefficient ‘k’ and the NGA label, composing a new similarity label. For instance:
```
>>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'IV']
>>> # gives
>>> MXS_labels= ['k1_II', 'k2_I', 'k3_IV']
```
where the separator sep is set to _. This happens especially when one of the labels (NGA or true_labels) is not a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’, and so on.
prefix (str, default='') –
prefix is used to rename the true labels, i.e. the valid k-values. For instance:
```
>>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
```
where ‘k’ is the prefix.
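The effect of prefix and sep on label names can be pictured with a minimal pure-Python sketch. The helpers `add_prefix` and `join_labels` are hypothetical names introduced only for illustration; they are not part of watex and only mimic the renaming described above.

```python
def add_prefix(labels, prefix=""):
    """Prepend `prefix` to every label (hypothetical helper, not watex API)."""
    return [f"{prefix}{label}" for label in labels]


def join_labels(true_labels, nga_labels, sep=""):
    """Pair each true label with its NGA label, joined by `sep`."""
    return [f"{t}{sep}{g}" for t, g in zip(true_labels, nga_labels)]


k_valid = [1, 2, 3]
k_new = add_prefix(k_valid, prefix="k")                # ['k1', 'k2', 'k3']
mxs = join_labels(k_new, ["II", "I", "IV"], sep="_")   # ['k1_II', 'k2_I', 'k3_IV']
print(k_new)
print(mxs)
```

In the actual function, a combined label is only produced where a similarity is found; this sketch joins every pair to keep the naming scheme visible.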
method (str ['naive', 'strict'], default='naive') –
The strategy used to compute the representativity of a label in the predicted array ‘y_pred’:
- naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. over the total number of occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced across the labels in y_pred.
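The difference between the two strategies, and the role of threshold, can be sketched as follows. This is one possible reading of the description above, not the watex implementation: the helper names `representativity` and `pick_similar`, the use of None for missing k-values, and the exact denominators are all assumptions made for illustration.

```python
from collections import Counter


def representativity(y_true, y_pred, method="naive"):
    """Return {(label, group): rate} for each co-occurring pair.

    Hypothetical helper (not watex API). Missing true labels are
    represented by None and skipped.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t is not None]
    pair_counts = Counter(pairs)
    label_counts = Counter(t for t, _ in pairs)
    total = len(pairs)
    rates = {}
    for (label, group), n in pair_counts.items():
        if method == "naive":
            denom = label_counts[label]   # only occurrences of this label
        elif method == "strict":
            denom = total                 # occurrences of all valid labels
        else:
            raise ValueError(f"unknown method: {method!r}")
        rates[(label, group)] = n / denom
    return rates


def pick_similar(rates, threshold=None):
    """Keep the best group per label; apply `threshold` if one is given."""
    best = {}
    for (label, group), rate in rates.items():
        if label not in best or rate > best[label][1]:
            best[label] = (group, rate)
    return [(label, group) for label, (group, rate) in best.items()
            if threshold is None or rate >= threshold]


y_true = [1, 1, 1, 2, None, 2, 1, 2]
y_pred = ["V", "V", "I", "III", "I", "III", "V", "II"]
naive = representativity(y_true, y_pred, method="naive")
strict = representativity(y_true, y_pred, method="strict")
print(pick_similar(naive))                  # most frequent group per label
print(pick_similar(naive, threshold=0.7))   # only pairs reaching the threshold
```

With this toy data, label 1 falls in group V three times out of four, so its naive rate is 0.75, while its strict rate is 3/7 because the denominator counts every valid sample.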
trailer (str, default='*') –
The mixture-strategy marker used to differentiate the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same label identifiers are present in both. This is useful to avoid any confusion between the labels in y_true and y_pred. Note that if trailer is set to None and both y_true and y_pred are numeric data, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance:
```
>>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
>>> # with trailer, MXS labels should be
>>> MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
>>> # with no trailer
>>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
```
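The renaming of clashing NGA labels can be sketched in plain Python. The helper `disambiguate_groups` is hypothetical, and the rule used when no trailer is given (numbering clashing groups upward from the largest existing label plus one) is an assumption chosen to reproduce the example above; the library may select new labels differently.

```python
def disambiguate_groups(nga_labels, true_labels, trailer="*"):
    """Map each NGA label to a name that cannot clash with a true label.

    Hypothetical helper (not watex API). With a trailer, clashing labels
    get the marker appended and all labels become strings. Without one,
    clashing numeric labels are renumbered above every existing label
    (an assumed rule).
    """
    true_set = set(true_labels)
    groups = sorted(set(nga_labels))
    if trailer is not None:
        return {g: f"{g}{trailer}" if g in true_set else str(g)
                for g in groups}
    next_free = max(true_set | set(groups)) + 1
    mapping = {}
    for g in groups:
        if g in true_set:
            mapping[g] = next_free   # clash: assign a fresh integer
            next_free += 1
        else:
            mapping[g] = g           # no clash: keep the label as-is
    return mapping


true_labels = [1, 2, 3]
nga_labels = [0, 1, 2]
with_trailer = disambiguate_groups(nga_labels, true_labels, trailer="*")
no_trailer = disambiguate_groups(nga_labels, true_labels, trailer=None)
print(with_trailer)   # {0: '0', 1: '1*', 2: '2*'}
print(no_trailer)     # {0: 0, 1: 4, 2: 5}
```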
return_obj (bool, default=False) –
If True, returns an MXS object (watex.utils.box.Boxspace) with useful attributes such as:
- mxs_classes_: the MXS class labels.
- mxs_labels_: the array-like of MXS labels. It also includes some non-similar labels from NGA.
- mxs_map_classes_: a dict of the original class labels of the array ‘k’ (‘y_true’) and their temporary integer class labels. If the ‘y_true’ class labels are not of a numeric dtype, new labels with an integer dtype are created. The dict is used to wrap the original true labels during the MXS creation. Thus, the original labels are not altered and are mapped back at the end to recover their positions in the new MXS array. It is set to ‘None’ if ‘y_true’ has a numeric dtype.
- mxs_group_classes_: a dict of all the similar group labels with the MXS labels derived from the modified existing NGA groups. Note that non-similar groups are renamed if their labels are also found in the true labels, to avoid any confusion. Thus the dict also maps each non-similar label to its new temporary label.
- mxs_similar_groups_: list of the similar labels found in y_true that have a similarity in NGA.
- mxs_similarity_: tuple of similarity pairs (label, group) existing between the class labels in y_true and NGA.
- mxs_group_labels_: list of the similar groups found in the predicted NGA that have a similarity in the true labels ‘y_true’.
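The role described for mxs_map_classes_ (temporary integer labels standing in for non-numeric true labels, then mapped back) can be illustrated with a small sketch. The variable names and the choice of numbering from 1 are assumptions for illustration only and do not reflect the internal encoding used by watex.

```python
# Non-numeric true labels, as 'y_true' might contain them.
original = ["k1", "k2", "k1", "k3"]

# Assign a temporary integer to each distinct label (assumed scheme).
map_classes = {label: i for i, label in enumerate(sorted(set(original)), start=1)}

# Work on the integer-encoded labels during processing...
encoded = [map_classes[label] for label in original]

# ...then invert the mapping to recover the original labels in place.
inverse = {code: label for label, code in map_classes.items()}
restored = [inverse[code] for code in encoded]

print(map_classes)   # {'k1': 1, 'k2': 2, 'k3': 3}
print(encoded)       # [1, 2, 1, 3]
print(restored)      # ['k1', 'k2', 'k1', 'k3']
```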

Returns

MXS – array-like of MXS labels, or an MXS object containing the useful attributes.

Return type

array-like 1d or Boxspace

See also

predict_NGA_labels: Predicts Naive Group of Aquifer (NGA) labels.

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import classify_k, make_MXS_labels
>>> data = load_hlogs ().frame
>>> # map data.k to categorize k values
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs = make_MXS_labels (ymap, data.aquifer_group)
>>> y_mxs[14:24]
array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> mxs_obj = make_MXS_labels (ymap, data.aquifer_group, return_obj=True )
>>> mxs_obj.mxs_labels_[14: 24]
array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> # now we do the same task using the private data 'hf.csv',
>>> # composed of 11 boreholes. By default, we alternatively use
>>> # the aquifer groups as a fake NGA
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap =  classify_k(data.k , default_func =True)
>>> y_mxs= make_MXS_labels (ymap, data.aquifer_group)
>>> np.unique (y_mxs)
array(['1', '1V', '2', '2III', '3', 'I', 'II', 'III&IV', 'IV'],
      dtype='<U6')
>>> # Comments:
>>> # label '1V' means that group V (expected to be a cluster)
>>> # and label 1 (true label) have a similarity; the same holds
>>> # for label '2III', while the remaining label 3 does not have
>>> # any similarity with the other labels in 'y_pred', expected
>>> # to be NGA labels.