.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "glr_examples/applications/plot_ymxs_label_for_k_prediction.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_glr_examples_applications_plot_ymxs_label_for_k_prediction.py>`
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_glr_examples_applications_plot_ymxs_label_for_k_prediction.py:


=====================================================
k-prediction from MXS: step-by-step guide  
=====================================================

A real-world example of generating the mixture learning strategy (MXS) 
target :math:`y*` for predicting the permeability coefficient 
:math:`k` from two boreholes.  

.. GENERATED FROM PYTHON SOURCE LINES 10-13

.. code-block:: Python

    # Author: L.Kouadio 
    # Licence: BSD-3-clause 


.. GENERATED FROM PYTHON SOURCE LINES 14-16

Note that this example uses only two boreholes, so its results are much less relevant than those of the concrete case study carried out in 
the `Hongliu coal mine <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4326365>`_ with data from 11 boreholes [1]_.  

.. GENERATED FROM PYTHON SOURCE LINES 18-19

We start by importing the required modules.

.. GENERATED FROM PYTHON SOURCE LINES 19-22

.. code-block:: Python

    import pandas as pd 
    from watex.datasets import load_hlogs 


.. GENERATED FROM PYTHON SOURCE LINES 23-27

Preprocess data 
===================
Build a single dataset from the data of two boreholes collected in the Hongliu 
coal mine, h502 and h2601, and reduce its dimensionality if necessary.

.. GENERATED FROM PYTHON SOURCE LINES 27-40

.. code-block:: Python


    # call `load_hlogs` to get the feature names and target names explicitly 
    box = load_hlogs () 
    # combine our test data 
    # data  = load_hlogs().frame + load_hlogs(key= 'h2601').frame  
    data = load_hlogs (key ='*').frame 
    X0, y0 = data [box.feature_names] , data [box.target_names ] 
    # make copies for safety 
    X, y = X0.copy() , y0.copy() 
    # let's visualize the features names and target names 
    print("feature_names:\n" , box.feature_names ) 
    print("target names:\n", box.target_names ) 


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    feature_names:
     ['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name', 'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp', 'short_distance_gamma', 'well_diameter']
    target names:
     ['aquifer_group', 'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping', 'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping', 'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k', 'kp', 'r', 'rp', 'remark']


.. GENERATED FROM PYTHON SOURCE LINES 41-44

The data contain some categorical values. We drop the rock name, the hole_id 
and the well diameter, which are subjective data and not useful for prediction 
purposes, and impute the remaining data using a bi-impute strategy.

.. GENERATED FROM PYTHON SOURCE LINES 44-53

.. code-block:: Python


    from watex.utils import naive_imputer 
    X.drop (columns = ['rock_name', 'hole_id', 'well_diameter'] , inplace =True )

    # * Merge both depths into one to compose only a single depth column 
    X['depth'] = ( X.depth_bottom + X.depth_top )/2 
    X.drop (columns =['depth_top', 'depth_bottom'], inplace =True )
    data_imputed = naive_imputer( X , strategy='mean', mode='bi-impute')  
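
The two preprocessing steps above (merging the depths and imputing the gaps) can be illustrated with plain pandas. This is a minimal sketch on made-up values, not the ``naive_imputer`` implementation: it assumes that "bi-impute" means filling numeric columns with the mean and categorical columns with the most frequent value. The ``logs`` frame is hypothetical.

.. code-block:: Python

    import numpy as np
    import pandas as pd

    # Toy log table; column names mirror the example, values are made up.
    logs = pd.DataFrame({
        "depth_top": [0.0, 10.0, 20.0, 30.0],
        "depth_bottom": [10.0, 20.0, 30.0, 40.0],
        "resistivity": [15.0, np.nan, 25.0, 20.0],
        "strata_name": ["sand", "coal", None, "coal"],
    })

    # Replace the two depth columns with the mid-point of each layer.
    logs["depth"] = (logs["depth_top"] + logs["depth_bottom"]) / 2
    logs = logs.drop(columns=["depth_top", "depth_bottom"])

    # Numeric columns: fill gaps with the column mean.
    num_cols = logs.select_dtypes(include="number").columns
    logs[num_cols] = logs[num_cols].fillna(logs[num_cols].mean())

    # Categorical columns: fill gaps with the most frequent value.
    cat_cols = logs.columns.difference(num_cols)
    for col in cat_cols:
        logs[col] = logs[col].fillna(logs[col].mode().iloc[0])

    print(logs)
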


.. GENERATED FROM PYTHON SOURCE LINES 54-56

* Use PCA to reduce the dimensions down to the important features 
  needed to predict the naive group of aquifer (NGA).

.. GENERATED FROM PYTHON SOURCE LINES 56-73

.. code-block:: Python


    # Note that for PCA, we can remove the only categorical feature, 
    # "strata_name", and scale the remaining features as follows: 

    from watex.utils import to_numeric_dtypes  
    from watex.utils import naive_scaler 

    # pop_cat_features automatically drops the categorical features
    Xpca = to_numeric_dtypes (data_imputed , pop_cat_features= True,
                              reset_index=True, drop_index =True,
                              verbose =True) 
    #  Scale the data  by default 
    Xpca_scaled = naive_scaler( Xpca )  
    Xpca_scaled_columns = list( Xpca_scaled.columns )
    # * Call the normal PCA and plot all components set to None

    from watex.analysis import nPCA 


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    No NaN column found.
    Feature:     1. strata_name  has been dropped from the dataframe.


.. GENERATED FROM PYTHON SOURCE LINES 74-75

* Plot explained variance ratio

.. GENERATED FROM PYTHON SOURCE LINES 75-77

.. code-block:: Python

    pca = nPCA (Xpca_scaled , return_X= False, view = True ) # return PCA object rather than the reduced X  


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_001.png
   :alt: Explained variance as a function of the number of dimension
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_001.png
   :class: sphx-glr-single-img
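
The quantity shown in this plot can be computed by hand. The sketch below uses NumPy only and synthetic data (the ``X_demo`` array and its six features are assumptions, not the borehole logs): it standardizes the features, derives the explained variance ratio from the singular values, and finds the smallest number of components that reaches 98 %.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic data: 200 samples, 6 features driven by 2 latent factors plus noise.
    latent = rng.normal(size=(200, 2))
    mixing = rng.normal(size=(2, 6))
    X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

    # Standardize each feature (zero mean, unit variance) before PCA.
    Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

    # The squared singular values are proportional to the variance per component.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)

    # Smallest number of components whose cumulative ratio reaches 98 %.
    n_keep = int(np.searchsorted(cumulative, 0.98) + 1)
    print(np.round(cumulative, 3), n_keep)
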


.. GENERATED FROM PYTHON SOURCE LINES 78-81

Here, 5 or 6 features are enough, since the explained variance ratio already 
reaches 98 %. 

* Set the number of components and use a convenient plot for both components.   

.. GENERATED FROM PYTHON SOURCE LINES 81-89

.. code-block:: Python


    from watex.utils import plot_pca_components 
    pca = nPCA (Xpca_scaled ,n_components=2,  return_X=False ) # return object for plot purpose 
 
    components = pca.components_ 
    features = pca.feature_names_in_
    plot_pca_components (components, feature_names= features, cmap='jet_r') 


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_002.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 90-102

The matrix plot shows the contributions of all features to the first and second components. 
In the first component, most of the contribution comes from depth, resistivity, gamma-gamma and short-distance gamma, 
which are negatively correlated with layer thickness and natural gamma. However, 
no correlation is found with the SP log data. 
In the second component, depth and natural gamma are the most correlated and are inversely 
correlated with resistivity, gamma-gamma, SP and short-distance gamma, 
whereas the correlation with layer thickness is almost null. 
Summarizing the PC1 and PC2 analysis, all features are useful for prediction, 
although one of them could be skipped. This is consistent with the explained variance ratio: 
with 8 features, the explained variance ratio already 
reaches 98 % after 7 dimensions. Therefore, the skipped features should not influence the result of the 
prediction.
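
To make the reading of signs in a component matrix concrete, here is a small NumPy sketch on synthetic data (the three variables ``a``, ``b`` and ``c`` are hypothetical, not the borehole logs). Features that vary together get loadings of the same sign on the first component, while a feature that varies in the opposite direction gets the opposite sign.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(1)
    t = rng.normal(size=500)
    # Three synthetic logs: a and b rise together, c moves in the opposite direction.
    a = t + 0.1 * rng.normal(size=500)
    b = t + 0.1 * rng.normal(size=500)
    c = -t + 0.1 * rng.normal(size=500)
    X_demo = np.column_stack([a, b, c])

    # Eigen-decomposition of the correlation matrix: eigenvectors are the loadings.
    corr = np.corrcoef(X_demo, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    pc1 = eigvecs[:, np.argmax(eigvals)]  # loadings of the first component

    for name, weight in zip(["a", "b", "c"], pc1):
        print(f"{name}: {weight:+.2f}")

The overall sign of a component is arbitrary, so only the relative signs between features carry meaning.
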

.. GENERATED FROM PYTHON SOURCE LINES 104-108

* Auto-preprocess the data using the default pipe 

Note that the categorical feature "strata_name" is one-hot encoded, which 
generates a sparse matrix ready for prediction. We use the function 'make_naive_pipe'
to quickly encode and transform the data.

.. GENERATED FROM PYTHON SOURCE LINES 108-115

.. code-block:: Python


    from watex.utils  import make_naive_pipe 

    # auto-scale the data and store it in a compressed sparse matrix format 
    csr_data = make_naive_pipe(data_imputed, transform= True) # auto-scales the data using StandardScaler and transforms it 
    csr_data


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <1148x24 sparse matrix of type '<class 'numpy.float64'>'
    	with 9158 stored elements in Compressed Sparse Row format>
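
For readers who want to see what such a pipe does, the sketch below builds a comparable transformer directly with scikit-learn. It is an illustration under assumptions, not the ``make_naive_pipe`` implementation: the toy ``frame`` and the choice of ``ColumnTransformer`` are ours. Numeric columns are standardized, the categorical column is one-hot encoded, and the result is stacked into a sparse matrix.

.. code-block:: Python

    import pandas as pd
    from scipy import sparse
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Toy frame with one categorical and two numeric columns (made-up values).
    frame = pd.DataFrame({
        "strata_name": ["sand", "coal", "mudstone", "coal"],
        "resistivity": [15.0, 22.0, 25.0, 20.0],
        "depth": [5.0, 15.0, 25.0, 35.0],
    })

    pipe = ColumnTransformer(
        [
            ("num", StandardScaler(), ["resistivity", "depth"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["strata_name"]),
        ],
        # Stack as sparse whenever the overall density is below 1.0.
        sparse_threshold=1.0,
    )
    encoded = pipe.fit_transform(frame)

    # Two scaled columns plus one column per distinct stratum.
    print(sparse.issparse(encoded), encoded.shape)
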


.. GENERATED FROM PYTHON SOURCE LINES 116-121

Prediction of Naive Group of Aquifer (NGA) 
============================================
We arbitrarily set the number of clusters to 5, which might correspond to 
the number of aquifer groups in the survey area according to the geological information. 
KMeans is used to predict the class labels, and the clusters are then plotted. 

.. GENERATED FROM PYTHON SOURCE LINES 121-124

.. code-block:: Python


    from watex.exlib.sklearn import KMeans 
    from watex.utils import plot_clusters 


.. GENERATED FROM PYTHON SOURCE LINES 125-126

* Group the two principal components of the PCA into 5 clusters 

.. GENERATED FROM PYTHON SOURCE LINES 126-133

.. code-block:: Python


    km = KMeans (n_clusters =5 , init= 'random' )  
    ykm = km.fit_predict(pca.X  ) 
    km3c = KMeans (n_clusters =3 , init= 'random' )  
    ykm3 = km3c.fit_predict(pca.X  )
    # plot the clusters for the 5 aquifer groups suggested by the general information  
    plot_clusters (5 , pca.X, ykm , km.cluster_centers_ )  


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_003.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 134-137

* Plot the 3 clusters

Now test the samples with only 3 clusters, as a theoretical number of aquifer groups 
based on the distribution of the data.

.. GENERATED FROM PYTHON SOURCE LINES 137-139

.. code-block:: Python


    plot_clusters (3 , pca.X, ykm3 , km3c.cluster_centers_ ) 


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_004.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 140-142

* Plot the feature importances 

We encode the strata_name, add it to the scaled values and plot the feature importances. 

.. GENERATED FROM PYTHON SOURCE LINES 142-155

.. code-block:: Python


    from watex.exlib.sklearn import RandomForestClassifier 
    from watex.utils import plot_rf_feature_importances 

    # add the strata_name to the remaining features 
    strata_column =  pd.Series ( X ['strata_name'].astype ('category').cat.codes , name ='strata_name' ) 
    strata_column.index = range (len(strata_column))  # reindexing

    X_for_fi = pd.concat( [ strata_column , Xpca_scaled ], axis =1, ignore_index=True ) 
    # # plot importance with the predicted label ykm  
    X_for_fi=pd.DataFrame ( X_for_fi.values, columns= ['strata_name'] + Xpca_scaled_columns)
    plot_rf_feature_importances (RandomForestClassifier(), X_for_fi , y =ykm ) 


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_005.png
   :alt: Feature importance
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_005.png
   :class: sphx-glr-single-img
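
The ranking behind such a plot comes from the ``feature_importances_`` attribute of a fitted random forest. The sketch below shows this with scikit-learn on synthetic data, where the labels depend on the first column only; the feature names are hypothetical.

.. code-block:: Python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(300, 3))
    # Labels depend on the first column only; the other two are pure noise.
    labels = (X_demo[:, 0] > 0).astype(int)

    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_demo, labels)

    # Sort the features from the most to the least important.
    names = ["informative", "noise_1", "noise_2"]
    ranking = sorted(zip(names, forest.feature_importances_),
                     key=lambda item: item[1], reverse=True)
    for name, score in ranking:
        print(f"{name}: {score:.3f}")
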


.. GENERATED FROM PYTHON SOURCE LINES 156-157

Plot the elbow to confirm or refute the 5 aquifer clusters suggested by the geological information.

.. GENERATED FROM PYTHON SOURCE LINES 157-160

.. code-block:: Python

    from watex.utils import plot_elbow 
    plot_elbow(pca.X, n_clusters=11)  


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_006.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_006.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <AxesSubplot:xlabel='Number of clusters', ylabel='Distorsion'>
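
An elbow curve of this kind plots the KMeans inertia (the within-cluster sum of squared distances) against the number of clusters. As a minimal sketch with scikit-learn, assuming synthetic points drawn around three centers rather than the PCA-reduced logs, the inertia collapses up to k=3 and then flattens:

.. code-block:: Python

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    # 50 points around each of three well-separated centers.
    points = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

    # Fit KMeans for k = 1..7 and record the inertia of each fit.
    inertias = []
    for k in range(1, 8):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        inertias.append(model.inertia_)

    for k, inertia in enumerate(inertias, start=1):
        print(k, round(inertia, 1))
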


.. GENERATED FROM PYTHON SOURCE LINES 161-169

As we can see, the elbow is located at k=3, i.e. we can classify the aquifers 
of the Hongliu coal mine into three groups based on the current dataset. 
Note that the dataset covers only two boreholes, so it cannot confirm the 
exact number of aquifer groups. In the case study carried out in the Hongliu coal mine with 
11 boreholes, 3 clusters were selected, although the silhouette plot showed that 5 clusters 
were not a bad clustering either. The number of 3 clusters was 
finally ascertained using the hierarchical agglomerative clustering (HAC) dendrogram plot. 
The steps are shown below. 

.. GENERATED FROM PYTHON SOURCE LINES 171-172

Let's check the 5 clusters using the silhouette plot from KMeans.

.. GENERATED FROM PYTHON SOURCE LINES 172-177

.. code-block:: Python


    from watex.view import plotSilhouette 
    # plot silhouette for the 05 clusters with pca reduced data 
    plotSilhouette (pca.X, labels =ykm , prefit =True)  
 

.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_007.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_007.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 178-182

Plot the silhouette for 3 clusters. Here prefit is set to False, since a new 
prediction must be made under the hood 
after n iterations to find the best clustering. Refer to the 
:func:`~watex.view.plotSilhouette` documentation.

.. GENERATED FROM PYTHON SOURCE LINES 182-185

.. code-block:: Python


    plotSilhouette (pca.X, n_clusters= 3 , prefit =False)  


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_008.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_008.png
   :class: sphx-glr-single-img
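
The two silhouette plots can also be compared through a single number, the mean silhouette coefficient. The sketch below uses ``silhouette_score`` from scikit-learn on synthetic points drawn around three centers (an assumption, not the borehole data): splitting three natural groups into five lowers the score.

.. code-block:: Python

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    points = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

    # Mean silhouette coefficient for 3 and for 5 clusters.
    scores = {}
    for k in (3, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        scores[k] = silhouette_score(points, labels)
        print(k, round(scores[k], 3))
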


.. GENERATED FROM PYTHON SOURCE LINES 186-187

Finally, we plot the dendrogram from HAC

.. GENERATED FROM PYTHON SOURCE LINES 187-191

.. code-block:: Python


    from watex.view import plotDendrogram
    plotDendrogram (pca.X , labels = ykm)


.. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_009.png
   :alt: plot ymxs label for k prediction
   :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_009.png
   :class: sphx-glr-single-img
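
A dendrogram like this one is built from a linkage matrix. As a minimal sketch with SciPy, assuming Ward linkage and synthetic points (``plotDendrogram`` may use other settings), the tree can be computed and cut into three flat clusters as follows:

.. code-block:: Python

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    points = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

    # Ward linkage merges, at each step, the pair of clusters
    # whose fusion increases the within-cluster variance the least.
    Z = linkage(points, method="ward")

    # Cut the tree so that at most three flat clusters remain.
    flat_labels = fcluster(Z, t=3, criterion="maxclust")

    # no_plot=True returns the tree layout without drawing anything.
    tree = dendrogram(Z, no_plot=True)
    print(sorted(set(flat_labels.tolist())))
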


.. GENERATED FROM PYTHON SOURCE LINES 192-200

For the MXS target, merging the labels predicted with 5 clusters 
would create many labels of the form y=k33, whereas we expect a target balanced 
with the true labels y (k1, k2 and k3). 
Therefore, 3 clusters are used instead of 5, 
and the predicted NGA labels are combined with 
the true labels y for supervised learning. Note that 
the true labels are not altered by the predicted labels. 

.. GENERATED FROM PYTHON SOURCE LINES 202-205

Before predicting the NGA labels, we can fit the aquifer groups and find, for each 
true k label, the most representative predicted label. 
The NGA labels are then predicted with the number of clusters set to 3. 

.. GENERATED FROM PYTHON SOURCE LINES 205-216

.. code-block:: Python


    from watex.utils.hydroutils import find_aquifer_groups, classify_k
    # categorize the k-values using the default func 
    yk_map =classify_k (y.k , default_func =True)
    groupobj = find_aquifer_groups (yk_map,  ykm ) 
    print(groupobj)
    # now make the prediction 
    from watex.utils import predict_NGA_labels 

    yNGA = predict_NGA_labels(pca.X, n_clusters= 3)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    _Group(Label=[' 1 ', 
                       Preponderance( rate = '53.141  %', 
                                    [('Groups', {5: 0.468, 1: 0.236, 4: 0.217, 2: 0.049, 3: 0.03}),
                                     ('Representativity', ( '5', 0.47)),
                                     ('Similarity', '5')])],
                 Label=[' 2 ', 
                       Preponderance( rate = ' 19.11  %', 
                                    [('Groups', {5: 0.452, 1: 0.301, 4: 0.123, 2: 0.11, 3: 0.014}),
                                     ('Representativity', ( '5', 0.45)),
                                     ('Similarity', '5')])],
                 Label=[' 3 ', 
                       Preponderance( rate = '27.749  %', 
                                    [('Groups', {4: 0.491, 1: 0.443, 5: 0.038, 2: 0.028}),
                                     ('Representativity', ( '4', 0.49)),
                                     ('Similarity', '4')])],
                 )
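
The fields of this object can be read as simple proportions. The sketch below is not the ``find_aquifer_groups`` implementation; it is a standard-library illustration on hypothetical labels that computes, for each k class, its share of the dataset (rate), the distribution of cluster labels inside it (groups) and the dominant cluster (representative).

.. code-block:: Python

    from collections import Counter, defaultdict

    # Hypothetical labels: true k classes and cluster labels for the same samples.
    k_classes = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
    clusters = [5, 5, 4, 1, 5, 5, 1, 4, 4, 1]

    # Count the cluster labels observed inside each k class.
    by_class = defaultdict(Counter)
    for k_label, cluster in zip(k_classes, clusters):
        by_class[k_label][cluster] += 1

    summary = {}
    for k_label, counts in sorted(by_class.items()):
        total = sum(counts.values())
        shares = {c: n / total for c, n in counts.most_common()}
        best_cluster, best_share = max(shares.items(), key=lambda item: item[1])
        summary[k_label] = {
            "rate": total / len(k_classes),   # share of this k class in the dataset
            "groups": shares,                 # cluster distribution within the class
            "representative": (best_cluster, best_share),
        }
        print(k_label, summary[k_label])
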


.. GENERATED FROM PYTHON SOURCE LINES 217-222

Prediction of MXS target :math:`y*`
=====================================
The MXS target can simply be predicted by calling the function 
:func:`~watex.utils.make_MXS_labels` or by using the MXS class (:class:`~watex.methods.MXS`) 
of the module :mod:`~watex.methods.hydro`.

.. GENERATED FROM PYTHON SOURCE LINES 222-228

.. code-block:: Python


    from watex.utils import make_MXS_labels 

    yMXS = make_MXS_labels(y_true=yk_map , y_pred=yNGA )
    # Let's print the first 12 MXS targets 
    print(yMXS[:12])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*']


.. GENERATED FROM PYTHON SOURCE LINES 229-239

The :math:`21` and :math:`2*` labels in :math:`y*(yMXS)`
indicate that a strong similarity is found between label 2 in 
the permeability coefficient dataset :math:`y` and the predicted `yNGA` labels. 
This is validated by the group preponderance object above. Meanwhile, :math:`2*`
indicates that the label `2` in `yNGA` has no similarity found in :math:`y*(yMXS)`. 
The label `3` in `yNGA` has no relationship with any label in :math:`y`; 
therefore it is not modified and is kept as is. 
If the parameter `return_obj` is set to True, the function returns an MXS object 
from which many attributes, such as the class mapping, can be retrieved for a better understanding. 
For instance:

.. GENERATED FROM PYTHON SOURCE LINES 239-254

.. code-block:: Python


    mxso = make_MXS_labels(y_true=yk_map , y_pred=yNGA , return_obj=True )
    # similar labels 
    print(mxso.mxs_similarity_)

    # group classes for mapping 
    print(mxso.mxs_group_classes_) 

    # MXS class mapping. This is useful to know which labels have been 
    # modified based on the similarity computation.
    print(mxso.mxs_classes_)

    # Once :math:`y*(yMXS)` is predicted, the supervised learning model 
    # can be trained with the predictor :math:`X`. 


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [11, 21, 31]
    {1: 11, 2: '2*', 3: '3*'}
    ['1' '11' '2' '2*' '3' '3*']
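
To see how such a class mapping can produce the final target, here is a simplified standard-library sketch. It rests on two assumptions that are ours, not a description of the watex internals: samples with a measured k class keep their true label, and the remaining samples take the MXS class mapped from their NGA label. The input lists are hypothetical; the mapping values are copied from the output above.

.. code-block:: Python

    # Hypothetical inputs: None marks a sample with no measured k class.
    k_true = [None, None, 1, 2, None, 3, None]
    nga = [2, 2, 1, 1, 1, 3, 3]

    # Mapping from NGA label to MXS class, copied from the printed output.
    nga_to_mxs = {1: "11", 2: "2*", 3: "3*"}

    # Keep the true label when it exists, otherwise use the mapped NGA label.
    y_mxs = [
        str(k) if k is not None else nga_to_mxs[g]
        for k, g in zip(k_true, nga)
    ]
    print(y_mxs)
    print(sorted(set(y_mxs)))

With these toy inputs, the sorted set of labels contains the same six classes as ``mxs_classes_`` above, and no true label is modified.
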


.. GENERATED FROM PYTHON SOURCE LINES 255-258

A paper on k-prediction, which presents a concrete case study in the Hongliu coal mine, 
is under publication in Engineering Geology. See the 
reference in the :ref:`citation <citing>` page. 


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 9.420 seconds)


.. _sphx_glr_download_glr_examples_applications_plot_ymxs_label_for_k_prediction.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/watex/watex/0.3.X?urlpath=lab/tree/notebooks/glr_examples/applications/plot_ymxs_label_for_k_prediction.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_ymxs_label_for_k_prediction.ipynb <plot_ymxs_label_for_k_prediction.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_ymxs_label_for_k_prediction.py <plot_ymxs_label_for_k_prediction.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_ymxs_label_for_k_prediction.zip <plot_ymxs_label_for_k_prediction.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_