.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "glr_examples/applications/plot_ymxs_label_for_k_prediction.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_glr_examples_applications_plot_ymxs_label_for_k_prediction.py: ===================================================== k-prediction from MXS: step-by-step guide ===================================================== Real-world examples to generate the mixture learning strategy (MXS) target :math:`y*` for predicting the permeability coefficient :math:`k` parameter from two boreholes. .. GENERATED FROM PYTHON SOURCE LINES 10-13 .. code-block:: Python # Author: L.Kouadio # Licence: BSD-3-clause .. GENERATED FROM PYTHON SOURCE LINES 14-16 Note that this is an example of two boreholes which results is quited less relevant compared to the tangible example implemented in the `Hongliu coal mine `_ with 11 boreholes data [1]_. .. GENERATED FROM PYTHON SOURCE LINES 18-19 We start by importing the required modules .. GENERATED FROM PYTHON SOURCE LINES 19-22 .. code-block:: Python import pandas as pd from watex.datasets import load_hlogs .. GENERATED FROM PYTHON SOURCE LINES 23-27 Preprocess data =================== Make a unique dataset from two boreholes data collected in Hongliu coal mine :h502 and h2601 and reduce down dimensions if necessary .. GENERATED FROM PYTHON SOURCE LINES 27-40 .. code-block:: Python # * load `load_hlogs` to get explicitly the features names and target names box = load_hlogs () # combine our test data # data = load_hlogs().frame + load_hlogs(key= 'h2601').frame data = load_hlogs (key ='*').frame X0, y0 = data [box.feature_names] , data [box.target_names ] # make copies for safety X, y = X0.copy() , y0.copy() # let's visualize the features names and target names print("feature_names:\n" , box.feature_names ) print("target names:\n", box.target_names ) .. rst-class:: sphx-glr-script-out .. code-block:: none feature_names: ['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name', 'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp', 'short_distance_gamma', 'well_diameter'] target names: ['aquifer_group', 'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping', 'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping', 'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k', 'kp', 'r', 'rp', 'remark'] .. GENERATED FROM PYTHON SOURCE LINES 41-44 Data contains some categorical values, we will drop the rock name, the hole_id and well diameter which are subjective data and not useful for prediction puposes and impute the remaining data using a bi-impute strategy .. GENERATED FROM PYTHON SOURCE LINES 44-53 .. code-block:: Python from watex.utils import naive_imputer X.drop (columns = ['rock_name', 'hole_id', 'well_diameter'] , inplace =True ) # * Merge both depths into one to compose only a single depth column X['depth'] = ( X.depth_bottom + X.depth_top )/2 X.drop (columns =['depth_top', 'depth_bottom'], inplace =True ) data_imputed = naive_imputer( X , strategy='mean', mode='bi-impute') .. GENERATED FROM PYTHON SOURCE LINES 54-56 * Use PCA analysis to reduce the dimension to down the important features to predict the naive aquifer group (NGA). .. GENERATED FROM PYTHON SOURCE LINES 56-73 .. code-block:: Python # Note that for PCA analysis, we can remove the only categorial features # "strata_name" and scaled the remaining features as follows: from watex.utils import to_numeric_dtypes from watex.utils import naive_scaler # pop_cat_features auto-drop the only categorial features Xpca = to_numeric_dtypes (data_imputed , pop_cat_features= True, reset_index=True, drop_index =True, verbose =True) # Scale the data by default Xpca_scaled = naive_scaler( Xpca ) Xpca_scaled_columns = list( Xpca_scaled.columns ) # * Call the normal PCA and plot all components set to None from watex.analysis import nPCA .. rst-class:: sphx-glr-script-out .. code-block:: none No NaN column found. Feature: 1. strata_name has been dropped from the dataframe. .. GENERATED FROM PYTHON SOURCE LINES 74-75 * Plot explained variance ratio .. GENERATED FROM PYTHON SOURCE LINES 75-77 .. code-block:: Python pca = nPCA (Xpca_scaled , return_X= False, view = True ) # return PCA object rather than the reduced X .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_001.png :alt: Explained variance as a function of the number of dimension :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 78-81 As a comment, here 5/6 features are enough since the explained variance ratio is already got 98 % * Set the number of components and use a convenient plot the both components .. GENERATED FROM PYTHON SOURCE LINES 81-89 .. code-block:: Python from watex.utils import plot_pca_components pca = nPCA (Xpca_scaled ,n_components=2, return_X=False ) # return object for plot purpose components = pca.components_ features = pca.feature_names_in_ plot_pca_components (components, feature_names= features, cmap='jet_r') .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_002.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 90-102 As comments, the matrix plot shows the contributions of all features in the first and second components. Indeed, while most contributions are got in-depth resistivity gamma and gamma short distance they are negatively correlated with layer thickness and natural gamma. However, no-correlation is found with the sp log data In the second components, the depth and natural gamma are more corollated and inversely correlated with the resistivity gamma, sp, and short distance. whereas the quasi-null correlation exists with layer thickness. By summarizing the PC1 and PC2 analysis, all features are useful as prediction and one of them can be skipped. This validates the explained variance ratio where under 8 features, after 7 dimensions, the explained variance ratio is already reached 98 %. Therefore features skipped should not influence the result of prediction .. GENERATED FROM PYTHON SOURCE LINES 104-108 * Auto-preprocess the data using the default pipe Note that the categorical data "strata_name" is one-hot-encoded and generate a sparse matrix ready for the data for prediction, then we will use the function 'make_naive_pipe' to fast encode and transform the data as output. .. GENERATED FROM PYTHON SOURCE LINES 108-115 .. code-block:: Python from watex.utils import make_naive_pipe # auto scaled the data and store it into a compressed sparse matrix format csr_data = make_naive_pipe(data_imputed, transform= True) # auto-scaled the data using StandardScaler and transform the data in place csr_data .. rst-class:: sphx-glr-script-out .. code-block:: none <1148x24 sparse matrix of type '' with 9158 stored elements in Compressed Sparse Row format> .. GENERATED FROM PYTHON SOURCE LINES 116-121 Prediction of Naive Group of Aquifer (NGA) ============================================ We randomly set the number of clusters to 05 which might correspond to the number of aquifer groups in the survey area according to the geological information. KMeans is used to predict the class label instead and plot the clusters .. GENERATED FROM PYTHON SOURCE LINES 121-124 .. code-block:: Python from watex.exlib.sklearn import KMeans from watex.utils import plot_clusters .. GENERATED FROM PYTHON SOURCE LINES 125-126 * Group the principal two components of PCA into the 5 clusters .. GENERATED FROM PYTHON SOURCE LINES 126-133 .. code-block:: Python km = KMeans (n_clusters =5 , init= 'random' ) ykm = km.fit_predict(pca.X ) km3c = KMeans (n_clusters =3 , init= 'random' ) ykm3 = km3c.fit_predict(pca.X ) # plot clusters into the general information of 5 groups of aquifers plot_clusters (5 , pca.X, ykm , km.cluster_centers_ ) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_003.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 134-137 * Plot the 03 clusters Now test the sample lot with only 03 clusters as a theory group of aquifer based on the distribution of the data. .. GENERATED FROM PYTHON SOURCE LINES 137-139 .. code-block:: Python plot_clusters (3 , pca.X, ykm3 , km3c.cluster_centers_ ) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_004.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 140-142 * Plot the feature’s importance We encode the strata_name and add it to the scale value and plot_the feature importance .. GENERATED FROM PYTHON SOURCE LINES 142-155 .. code-block:: Python from watex.exlib.sklearn import RandomForestClassifier from watex.utils import plot_rf_feature_importances # add the strata_name to the remaining features strata_column = pd.Series ( X ['strata_name'].astype ('category').cat.codes , name ='strata_name' ) strata_column.index = range (len(strata_column)) # reindexing X_for_fi = pd.concat( [ strata_column , Xpca_scaled ], axis =1, ignore_index=True ) # # plot importance with the predicted label ykm X_for_fi=pd.DataFrame ( X_for_fi.values, columns= ['strata_name'] + Xpca_scaled_columns) plot_rf_feature_importances (RandomForestClassifier(), X_for_fi , y =ykm ) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_005.png :alt: Feature importance :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_005.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 156-157 plot elbow to confirm or infirm the 05 clustering of aquifers from geological infos .. GENERATED FROM PYTHON SOURCE LINES 157-160 .. code-block:: Python from watex.utils import plot_elbow plot_elbow(pca.X, n_clusters=11) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_006.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_006.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 161-169 As comments, we can see, the elbow is located at k=3 that i.e we can classify the aquifer group based on the current datasets into three groups in hongliu coal mine. Note that the dataset is only for boreholes, this can not confirm the exact number of the aquifer. In the case study data applied in Honliu coal mine composed of 11 boreholes, the number of 03 clusters is selected although the 05 clusters do not indicate a bad clustering after a silhouette plot. The number of 03 is finally ascertained using the Hierarchical Agglomerative clustering (HAC) dendrogram plot. The step are enumerated below: .. GENERATED FROM PYTHON SOURCE LINES 171-172 Let’s confirm the 05 clusters using the silhouette plot from KMeans .. GENERATED FROM PYTHON SOURCE LINES 172-177 .. code-block:: Python from watex.view import plotSilhouette # plot silhouette for the 05 clusters with pca reduced data plotSilhouette (pca.X, labels =ykm , prefit =True) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_007.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_007.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 178-182 Plot with the 03 custers; plot silhouette for the three clusters by setting prefit to False since a new prediction should be made under the hood after n-iterations to find the best clustering. Refer to :func:`~watex.view.plotSilhouette` documentation. .. GENERATED FROM PYTHON SOURCE LINES 182-185 .. code-block:: Python plotSilhouette (pca.X, n_clusters= 3 , prefit =False) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_008.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_008.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 186-187 Finally, we plot the dendrogram from HAC .. GENERATED FROM PYTHON SOURCE LINES 187-191 .. code-block:: Python from watex.view import plotDendrogram plotDendrogram (pca.X , labels = ykm) .. image-sg:: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_009.png :alt: plot ymxs label for k prediction :srcset: /glr_examples/applications/images/sphx_glr_plot_ymxs_label_for_k_prediction_009.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 192-200 As comments in the case of MXS target, merging the predicted y with cluster =5 with create a lot of y=k33' where we expected to have a list a =balance target with the true labels y (k1, k2 and k3 ) therefore the cluster with 3 labels is used instead of 5 thus the predicted NGA labels with true labels is combined with the the true labels y for supervised learnings. Note that the true labels are not altered by the predicted label y not let plot the dendro-heat .. GENERATED FROM PYTHON SOURCE LINES 202-205 Before predicting the NGA labels, we can fit the aquifer group and find the most representative of the true k labels to the predicted labels test with the number of clusters set to 3 .. GENERATED FROM PYTHON SOURCE LINES 205-216 .. code-block:: Python from watex.utils.hydroutils import find_aquifer_groups, classify_k # categorize the k-values using the default func yk_map =classify_k (y.k , default_func =True) groupobj = find_aquifer_groups (yk_map, ykm ) print(groupobj) # now make the prediction from watex.utils import predict_NGA_labels yNGA = predict_NGA_labels(pca.X, n_clusters= 3) .. rst-class:: sphx-glr-script-out .. code-block:: none _Group(Label=[' 1 ', Preponderance( rate = '53.141 %', [('Groups', {5: 0.468, 1: 0.236, 4: 0.217, 2: 0.049, 3: 0.03}), ('Representativity', ( '5', 0.47)), ('Similarity', '5')])], Label=[' 2 ', Preponderance( rate = ' 19.11 %', [('Groups', {5: 0.452, 1: 0.301, 4: 0.123, 2: 0.11, 3: 0.014}), ('Representativity', ( '5', 0.45)), ('Similarity', '5')])], Label=[' 3 ', Preponderance( rate = '27.749 %', [('Groups', {4: 0.491, 1: 0.443, 5: 0.038, 2: 0.028}), ('Representativity', ( '4', 0.49)), ('Similarity', '4')])], ) .. GENERATED FROM PYTHON SOURCE LINES 217-222 Prediction of MXS target :math:`y*` ===================================== The prediction of MXS can simply be made by calling the function :func:`~watex.utils.make_MXS_labels` or use the MXS class (:class:`~watex.methods.MXS` ) of the module :mod:`~watex.methods.hydro` .. GENERATED FROM PYTHON SOURCE LINES 222-228 .. code-block:: Python from watex.utils import make_MXS_labels yMXS = make_MXS_labels(y_true=yk_map , y_pred=yNGA ) # Let’s print the 12 firstMXS target print(yMXS[:12]) .. rst-class:: sphx-glr-script-out .. code-block:: none ['2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*' '2*'] .. GENERATED FROM PYTHON SOURCE LINES 229-239 As a comment, the existing :math:`21` and math:`2*` in the :math:`y*(yMXS)` indicates that there is a strong similarity found between label 2 in the permeability coefficient dataset :math:`y` and the predicted `yNGA` labels. This is validated by the group preponderance object above. Whilst, the math:`2*` indicates that the label `2` in yNGA has no similarity found in :math:`y*(yMXS)`). The label `3` in `yNGA` has no relationship with any labels in the :math:`y` therefore no modification is occurred and kept safe. If the parameter `return_obj` is set to True, it will return an MXS object where many attributes like class mapping can be retrieved for understanding purposes. for instance: .. GENERATED FROM PYTHON SOURCE LINES 239-254 .. code-block:: Python mxso = make_MXS_labels(y_true=yk_map , y_pred=yNGA , return_obj=True ) # similar labels print(mxso.mxs_similarity_) # group classes for mapping print(mxso.mxs_group_classes_) #M XS class mapping. This is usefull to know the labels that have been # modified based on the similarity computation. print(mxso.mxs_classes_) # Once the :math:`y*(yMXS)` is predicted, the supervised learning model # training can be made with the predictor:math:`X`. .. rst-class:: sphx-glr-script-out .. code-block:: none [11, 21, 31] {1: 11, 2: '2*', 3: '3*'} ['1' '11' '2' '2*' '3' '3*'] .. GENERATED FROM PYTHON SOURCE LINES 255-258 A paper is under puclication in Engineering Geology for k-prediction which explained a concrete study (Case study in Hongliu coal mine). See the reference in the :ref:`citation ` page. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 9.420 seconds) .. _sphx_glr_download_glr_examples_applications_plot_ymxs_label_for_k_prediction.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/watex/watex/0.3.X?urlpath=lab/tree/notebooks/glr_examples/applications/plot_ymxs_label_for_k_prediction.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_ymxs_label_for_k_prediction.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_ymxs_label_for_k_prediction.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_ymxs_label_for_k_prediction.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_