watex package#

A machine learning research in water exploration#

watex stands for WAT-er EX-ploration. Its packages and modules are written to solve engineering problems in the field of groundwater exploration (GWE). It currently deals with:

geophysics (from DC-Electrical to Electromagnetic);
hydrogeology (from drilling to parameters calculation);
hydrogeophysics (predicting permeability coefficient (k), flow rate);
EM (processing noisy NSAMT data and recovering missing tensors);
geology (for stratigraphic model generation);
more…

WATex helps minimize the risk of unsuccessful drillings and unsustainable boreholes, and could hugely reduce the cost of collecting hydrogeological parameters.

class watex.AdalineGradientDescent(eta=0.01, n_iter=50, random_state=42)[source]#

Bases: _Base

Adaptive Linear Neuron Classifier

ADAptive LInear NEuron (Adaline) was published by Bernard Widrow and his doctoral student Ted Hoff only a few years after Rosenblatt’s perceptron algorithm. It can be considered an improvement of the latter (Widrow et al., 1960).

Adaline illustrates the key concepts of defining and minimizing a continuous cost function. This lays the groundwork for understanding more advanced machine learning algorithms for classification, such as Logistic Regression, Support Vector Machines, and Regression models.

The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt’s perceptron is that the weights are updated based on a linear activation function rather than a unit step function as in the perceptron. In Adaline, this linear activation function $\phi(z)$ is simply the identity function of the net input, so that:

\[\phi (w^Tx)= w^Tx\]

The linear activation function is used for learning the weights, while the unit step is applied only afterwards to predict the class labels.
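The learning rule described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the watex implementation: the function names `adaline_gd` and `adaline_predict`, the small random weight initialization and the toy data are all assumptions made for the example.

```python
import numpy as np


def adaline_gd(X, y, eta=0.01, n_iter=50, random_state=42):
    """Train Adaline weights with batch gradient descent (illustrative sketch)."""
    rng = np.random.RandomState(random_state)
    # w[0] is the bias unit, w[1:] are the feature weights (assumed layout)
    w = rng.normal(loc=0.0, scale=0.01, size=X.shape[1] + 1)
    cost = []
    for _ in range(n_iter):
        net_input = X @ w[1:] + w[0]        # z = w^T x
        output = net_input                  # identity activation: phi(z) = z
        errors = y - output
        w[1:] += eta * (X.T @ errors)       # one update from all samples
        w[0] += eta * errors.sum()
        cost.append((errors ** 2).sum() / 2.0)  # sum-of-squares cost per epoch
    return w, cost


def adaline_predict(X, w):
    """Apply the unit step to the linear activation; labels are 1 or -1."""
    return np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)


# Toy, linearly separable data used only for the illustration
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [-1.5, -0.5],
              [1.0, 1.5], [2.0, 1.0], [1.5, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

w, cost = adaline_gd(X, y, eta=0.01, n_iter=50)
print(adaline_predict(X, w))
```

The continuous cost is what makes the gradient well defined; the unit step only appears in `adaline_predict`.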

Parameters:

eta (float,) – Learning rate, between 0.0 and 1.0.
n_iter (int,) – Number of passes (iterations) over the training set.
random_state (int, default is 42) – Random number generator seed for random weight initialization.

w_#

Weight after fitting

Type:: Array-like,

cost_#

Sum-of-squares cost function value in each epoch

Type:: list

References

[1]

Widrow et al., 1960. An Adaptive “Adaline” Neuron Using Chemical “Memistors”, Technical Report No. 1553-2, Stanford Electronics Laboratories, Stanford, CA, October 1960.

activation(X)[source]#

Compute the linear activation

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns:: X
Return type:: activate NDArray

fit(X, y)[source]#

Fit the training data

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

AdalineGradientDescent instance

property inspect#: Inspect whether the object is fitted or not

net_input(X)[source]#

Compute the net input X

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

weight net inputs

predict(X)[source]#

Predict the class label after unit step

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns:: ypred
Return type:: predicted class label after the unit step (1, or -1)

class watex.AdalineStochasticGradientDescent(eta=0.01, n_iter=50, shuffle=True, random_state=42)[source]#

Bases: _Base

Adaptive Linear Neuron Classifier with stochastic gradient descent

Stochastic gradient descent is a popular alternative to batch gradient descent and is sometimes also called iterative or online gradient descent [1]. Instead of updating the weights based on the sum of the accumulated errors over all training examples $x^{(i)}$:

\[\Delta w = \eta \sum_{i} (y^{(i)} - \phi(z^{(i)})) x^{(i)}\]

the weights are updated incrementally for each training example:

\[\Delta w = \eta (y^{(i)} - \phi(z^{(i)})) x^{(i)}\]
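The per-example update can be written out as a short NumPy sketch. It is an assumption-laden illustration rather than the library code: `adaline_sgd`, the weight layout with the bias in `w[0]`, and the averaged epoch cost are choices made for this example only.

```python
import numpy as np


def adaline_sgd(X, y, eta=0.01, n_iter=50, shuffle=True, random_state=42):
    """Train Adaline weights one training example at a time (illustrative sketch)."""
    rng = np.random.RandomState(random_state)
    w = rng.normal(loc=0.0, scale=0.01, size=X.shape[1] + 1)  # w[0] is the bias
    cost = []
    for _ in range(n_iter):
        # Reshuffle every epoch so the update order does not cycle
        order = rng.permutation(len(y)) if shuffle else np.arange(len(y))
        epoch_cost = 0.0
        for xi, target in zip(X[order], y[order]):
            error = target - (xi @ w[1:] + w[0])   # y - phi(z), phi is identity
            w[1:] += eta * error * xi              # update from this sample only
            w[0] += eta * error
            epoch_cost += 0.5 * error ** 2
        cost.append(epoch_cost / len(y))           # average cost over the epoch
    return w, cost


# Toy, linearly separable data used only for the illustration
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [-1.5, -0.5],
              [1.0, 1.5], [2.0, 1.0], [1.5, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

w, cost = adaline_sgd(X, y)
pred = np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)
print(pred)
```

Because the weights move after every example, the cost curve is noisier than with the batch rule, which is why the sketch reports an average per epoch.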

Parameters:

eta (float,) – Learning rate, between 0.0 and 1.0.
n_iter (int,) – Number of passes (iterations) over the training set.
shuffle (bool,) – Shuffle the training data every epoch if True, to prevent cycles.
random_state (int, default is 42) – Random number generator seed for random weight initialization.

w_#

Weight after fitting

Type:: Array-like,

cost_#

Sum-of-squares cost function value in each epoch

Type:: list

See also

AdalineGradientDescent: Adaptive Linear Neuron classifier

References

[1]

Widrow et al., 1960. An Adaptive “Adaline” Neuron Using Chemical “Memistors”, Technical Report No. 1553-2, Stanford Electronics Laboratories, Stanford, CA, October 1960.

activation(X)[source]#

Compute the linear activation

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns:: X
Return type:: activate NDArray

fit(X, y)[source]#

Fit the training data

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

AdalineStochasticGradientDescent instance

property inspect#: Inspect whether the object is fitted or not

net_input(X)[source]#

Compute the net input X

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Return type:: weight net inputs

partial_fit(X, y)[source]#

Fit training data without reinitialising the weights

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

AdalineStochasticGradientDescent instance
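Fitting without reinitialising the weights is what makes online learning on streamed chunks possible. The sketch below shows the idea with a hypothetical `OnlineAdaline` class written for this example; it is not the watex class, and only the "initialise once, then keep updating" behaviour is taken from the description above.

```python
import numpy as np


class OnlineAdaline:
    """Minimal online Adaline, written only to illustrate partial fitting."""

    def __init__(self, eta=0.01, random_state=42):
        self.eta = eta
        self.random_state = random_state
        self.w_ = None  # weights are created on the first call only

    def partial_fit(self, X, y):
        X = np.atleast_2d(X)
        y = np.atleast_1d(y)
        if self.w_ is None:
            rng = np.random.RandomState(self.random_state)
            self.w_ = rng.normal(loc=0.0, scale=0.01, size=X.shape[1] + 1)
        for xi, target in zip(X, y):
            error = target - (xi @ self.w_[1:] + self.w_[0])
            self.w_[1:] += self.eta * error * xi
            self.w_[0] += self.eta * error
        return self  # allows method chaining


# Toy data arriving in two chunks (illustration only)
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

streamed = OnlineAdaline().partial_fit(X[:2], y[:2]).partial_fit(X[2:], y[2:])
at_once = OnlineAdaline().partial_fit(X, y)
print(np.allclose(streamed.w_, at_once.w_))
```

Since the second call continues from the weights left by the first, feeding the data in two chunks gives the same weights as feeding it in one pass in the same order.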

predict(X)[source]#

Predict the class label after unit step

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns:: ypred
Return type:: predicted class label after the unit step (1, or -1)

class watex.BaseModel(data_fn=None, df=None, **kwargs)[source]#

Bases: object

Base model class. The most interesting and challenging part of modeling is tuning the hyperparameters after designing a composite estimator. Getting the best parameters is a good way to reorganize the created pipeline {transformers + estimators} so that it generalizes well to new data.

Parameters:

data_fn (str) – Path to the analysis data file.
df (pd.core.frame.DataFrame) – Dataframe of features for analysis. It must contain the main parameters, including the target name (a pd.core.series.Series among the columns of df).

Other optional information is held in kwargs:

============  ==========  ====================================================
Attributes    Type        Description
============  ==========  ====================================================
auto          bool        Trigger the composite estimator. If True, an SVC-composite estimator preprocessor is given. Default is False.
pipelines     dict        Collect your own pipeline for model preprocessor triggering. It should be found automatically.
estimators    Callable    A given estimator. If None, SVM is auto-selected as the default estimator.
model_score   float/dict  Model test score. Observe your test model score using your composite estimator for enhancement or your own pipelines.
processor     Callable    Compose pipelines and estimators for default model scoring, as well as the composite estimator enhancement.
============  ==========  ====================================================

Examples

>>> import numpy as np
>>> from watex.bases.modeling import BaseModel
>>> from sklearn.preprocessing import RobustScaler,  PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.compose import make_column_selector
>>> estimator2= RandomForestClassifier()
>>> modelObj = BaseModel(
...     data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines = {
...            'num_column_selector_': make_column_selector(
...                dtype_include=np.number),
...            'cat_column_selector_': make_column_selector(
...                dtype_exclude=np.number),
...            'features_engineering_':PolynomialFeatures(
...                2, include_bias=False),
...            'selectors_': SelectKBest(f_classif, k=2),
...            'encodages_': RobustScaler()
...              },
...     estimator = RandomForestClassifier()
...        )

property feature_importances_#: Get the bar plot of feature importances. If the estimator has no feature_importances_ attribute, an error is raised.

get_learning_curve(estimator=None, X_train=None, y_train=None, learning_curve_kws=None, **kws)[source]#

Compute the train score and validation curve to visualize your learning curve.

Parameters:

estimator – The created model. If None
X_train – pd.core.frame.DataFrame of the selected trainset.
x_test – pd.DataFrame of the selected data for the testset.
y_train – array_like of the selected data for the evaluation set.
y_test – array_like of the selected data for the model test.
val_kws – validation_curve keyword arguments. If None, the default should be:

val_curve_kws = {"param_name":'C',
             "param_range": np.arange(1,210,10),
             "cv":4}
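The default keywords above map directly onto scikit-learn's `validation_curve`. The sketch below is an assumption about how such keywords could be used outside watex: the synthetic dataset, the `SVC(gamma=0.1)` estimator and the variable names are invented for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic data standing in for a real feature table (assumption)
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

val_curve_kws = {"param_name": "C",
                 "param_range": np.arange(1, 210, 10),
                 "cv": 4}

# One row per value of C, one column per cross-validation fold
train_scores, val_scores = validation_curve(
    SVC(gamma=0.1), X, y, **val_curve_kws)

print(train_scores.mean(axis=1)[:3])
print(val_scores.mean(axis=1)[:3])
```

Plotting the fold-averaged train and validation scores against `param_range` gives the usual picture of where the model starts to overfit.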

Returns:

train_score: float|dict of trainset score.
val_score: float|dict of validation score.
switch: Turn on or off the learning curve or validation curve.
trigDec: Trigger the decorator.
N: number of param range for plotting.

Example:

>>> from watex.bases.modeling import BaseModel
>>> processObj = BaseModel(
...     data_fn = 'data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.get_learning_curve (
...     switch_plot='on', preprocessor=True)

get_model_prediction(estimator=None, X_test=None, y_test=None, **kws)[source]#

Get the model prediction and quick plot using the surche decorator.

The decorator holds many keyword arguments to customize plot. Refer to watex.utils.decorator.predPlot.

Parameters:

estimator – The creating model. If None
x_test – pd.DataFrame of selected Data for testset
y_test – array_like of selected data for model test
kws – Additional keywords arguments which refer to the data_fn df and pipelines parameters.
switch – Turn on or off the decorator.

Example:

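>>> import numpy as np
>>> from sklearn.compose import make_column_selector
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.svm import SVC
>>> from watex.modeling.sl import Modeling
>>> testim = SVC(C=1, gamma=0.1)  # estimator to evaluate
>>> modelObj = Modeling(
...     data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     pipelines ={
...         'num_column_selector_': make_column_selector(
...             dtype_include=np.number),
...         'cat_column_selector_': make_column_selector(
...             dtype_exclude=np.number),
...         'features_engineering_':PolynomialFeatures(2,
...                                         include_bias=False),
...         'selectors_': SelectKBest(f_classif, k=2),
...         'encodages_': RobustScaler()
...           }, estimator = testim)
>>> modelObj.get_model_prediction(estimator =testim, switch ='on')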

property model_#: Get the set of processors and estimators that compose the composite model

property model_score#: Estimate your composite model prediction

permutation_feature_importance(estimator=None, X_train=None, y_train=None, pfi_kws=None, **kws)[source]#

Evaluate feature importance with tree estimators before and after shuffling the trees.

Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. Refer to the scikit-learn documentation at https://scikit-learn.org/stable/modules/permutation_importance.html for more details.
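To make the technique concrete, here is a minimal sketch using scikit-learn's `permutation_importance` directly. It does not go through the watex method; the synthetic data (only the first feature is informative) is an assumption, and the forest size and number of repeats simply mirror the defaults documented for this method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends on feature 0 only (assumption)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each column n_repeats times and record the drop in score
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

A feature whose shuffling barely changes the score contributes little to the fitted model, whatever its impurity-based importance suggests.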

Parameters:

estimator – The estimator used to evaluate the importance of features. The default is RandomForestClassifier.
X_train – pd.core.frame.DataFrame of the selected trainset.
y_train – array_like of the selected data for the evaluation set.
n_estimators – Number of estimators that compose the tree ensemble. The default is 100.
n_repeats – Number of times the trees are shuffled. The default is 10.
pfi_kws – Additional keyword arguments of the permutation_importance callable.
pfi_style – Type of plot. Can be:
- pfi for permutation feature importance before and after shuffling trees;
- dendro for dendrogram plot.
The default is pfi.
switch – Turn on or off the decorator.

Example:

>>> from watex.bases.modeling import BaseModel
>>> from sklearn.ensemble import AdaBoostClassifier
>>> modelObj = BaseModel()
>>> modelObj.permutation_feature_importance(
...    estimator = AdaBoostClassifier(random_state=7),
...    data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...     switch ='on', pfi_style='pfi')

property processor#: Get the processor after supplying the pipelines

tuning_hyperparameters(estimator=None, hyper_params=None, cv=4, grid_kws=None, **kws)[source]#

Tune the hyperparameters of an existing estimator to evaluate its performance and boost the model using the best parameters found.

Parameters:

estimator – Callable estimator or model to boost.
hyper_params – dict of hyperparameters of the estimator.
cv – Cross-validation splitting strategy. The default is 4.
grid_kws – dict of other grid search parameters.

Example:

>>> import numpy as np
>>> from watex.modeling.basics import SLModeling
>>> from sklearn.preprocessing import RobustScaler,PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.svm import SVC
>>> from sklearn.compose import make_column_selector
>>> my_own_pipelines= {
...         'num_column_selector_': make_column_selector(
...             dtype_include=np.number),
...         'cat_column_selector_': make_column_selector(
...             dtype_exclude=np.number),
...         'features_engineering_':PolynomialFeatures(
...             3, include_bias=False),
...         'selectors_': SelectKBest(f_classif, k=3),
...         'encodages_': RobustScaler()
...           }
>>> my_estimator = SVC(C=1, gamma=1e-4, random_state=7)
>>> modelObj = SLModeling(data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
...                pipelines =my_own_pipelines ,
...                estimator = my_estimator)
>>> hyperparams ={
...     'columntransformer__pipeline-1__polynomialfeatures__degree':
...         np.arange(2,10),
...     'columntransformer__pipeline-1__selectkbest__k': np.arange(2,7),
...     'svc__C': [1, 10, 100],
...     'svc__gamma':[1e-1, 1e-2, 1e-3]}
>>> my_compose_estimator_ = modelObj.model_
>>> modelObj.tuning_hyperparameters(
...                             estimator= my_compose_estimator_ ,
...                             hyper_params= hyperparams,
...                             search='rand')
>>> modelObj.best_params_
>>> modelObj.best_score_
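The hyperparameter keys in the example above follow scikit-learn's `<step>__<parameter>` naming for composite estimators. The sketch below shows where such names come from, using a plain `make_pipeline` and `GridSearchCV` on synthetic data. It is an assumption-based illustration: it uses an exhaustive grid search rather than the `search='rand'` option of the example, and it does not reproduce the column transformer that watex builds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real feature table (assumption)
X, y = make_classification(n_samples=120, n_features=5, n_informative=3,
                           n_redundant=0, random_state=7)

# make_pipeline names each step after its lowercased class name
model = make_pipeline(PolynomialFeatures(2, include_bias=False),
                      SelectKBest(f_classif, k=3),
                      RobustScaler(),
                      SVC(random_state=7))

hyper_params = {"selectkbest__k": [2, 3, 4],
                "svc__C": [1, 10, 100],
                "svc__gamma": [1e-1, 1e-2, 1e-3]}

search = GridSearchCV(model, hyper_params, cv=4).fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Nested composites add one prefix per level, which is how keys such as `columntransformer__pipeline-1__selectkbest__k` arise.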

class watex.BaseSteps(tname=None, return_all=True, drop_features=None, categorizefeature_props=None, add_attributes=True, attribute_indexes=None, operator='/', imputer_strategy='median', missing_values=nan, pipeline=None, test_size=0.2, hash=False, random_state=42, verbose=0, **kwargs)[source]#

Bases: object

Default data preparation steps

By default, BaseSteps is used to prepare the DC 1D-resistivity geoelectrical features before prediction. The predicted target is the flow rate.

Parameters:

tname: str,

A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

return_all: bool

Return the whole stratified trainset. When the data is too large, set it to False to take a sample of the stratified trainset to evaluate your model.

drop_features: list

List of useless features to drop in order to clean the dataset.

categorizefeature_props: list

List of properties used to categorize a particular feature in the dataset. Each item is composed of the name of the feature whose numerical values are converted into categorical values, then the value range of the data to categorize, and finally the categorical names of those value ranges. For instance:

categorizefeature_props= [
    ('flow', ([0., 1., 3.], ['FR0', 'FR1', 'FR2', 'FR3']))
    ]

Please refer to watex.utils.transformers.CategorizeFeatures for further details.

hash: bool,

If True, it ensures that the data will remain consistent across multiple runs, even if the dataset is refreshed. It uses the test-by-id approach to hash the training and test sets when the data is split.

add_attributes: list, optional

Experiment with combinations of numerical attributes. List of features to combine. Decide to combine features to create a new feature value using the operator parameter. By default, the combination is the ratio of the given numerical attributes/features. For instance, attribute_names=['lwi', 'ohmS'] will divide the feature ‘lwi’ by ‘ohmS’.

operator: str, default =’/’

Type of operation to perform when combining features. Can be one of [‘/’, ‘+’, ‘-’, ‘*’, ‘%’].

attribute_indexes: list of int,

List of attribute indexes to combine. For instance:

attribute_indexes = [1, 0] # or [4, 3]

The operator by default is division. These are the indexes of each attribute/feature used to experiment with combinations. A user warning should be raised if any index does not match the dataframe or array columns. For more details, refer to CombinedAttributesAdder.

imputer_strategy: str

Type of strategy to replace the missing values. Refer to SimpleImputer. Default is median.

missing_values: float

The value to be replaced. Default is np.nan values.

pipeline: callable

Pipeline to prepare the dataset. Default is defaultPipeline().

test_size: float, default=.2 i.e. 20% (X, y)

The ratio used to split the data into training (X, y) and testing (Xt, yt) sets respectively.

random_state: int, RandomState instance or None, default=42

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

verbose: int, default is 0

Controls the level of verbosity. Higher values lead to more messages.

data: Filepath or Dataframe or shape (M, N)

Data is passed here as an additional keyword argument, just to build X and y using the method stratifydata. It is a pandas.DataFrame containing M samples and N features.
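Two of the parameters above, categorizefeature_props and attribute_indexes with operator, describe simple array transformations that can be sketched with NumPy. The helpers `categorize` and `combine` below are hypothetical and written for this illustration only; in particular, the rule that a value equal to a bound falls in the lower class is an assumption, not confirmed watex behaviour.

```python
import operator

import numpy as np


def categorize(values, bounds, labels):
    """Map numeric values to labels; needs one more label than bounds."""
    if len(labels) != len(bounds) + 1:
        raise ValueError("expected one more label than bounds")
    # right=True: a value equal to a bound goes to the lower class (assumption)
    codes = np.digitize(values, bounds, right=True)
    return np.asarray(labels)[codes]


OPERATORS = {"/": operator.truediv, "+": operator.add, "-": operator.sub,
             "*": operator.mul, "%": operator.mod}


def combine(X, attribute_indexes, op="/"):
    """Append a column built from two existing columns of X."""
    first, second = attribute_indexes
    new_column = OPERATORS[op](X[:, first], X[:, second])
    return np.column_stack([X, new_column])


flow = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 5.0])
flow_classes = categorize(flow, [0.0, 1.0, 3.0], ["FR0", "FR1", "FR2", "FR3"])
print(flow_classes)

X = np.array([[2.0, 8.0], [4.0, 2.0]])
X_combined = combine(X, attribute_indexes=[1, 0])  # column 1 divided by column 0
print(X_combined)
```

Three bounds define four classes, which is why the example in the parameter description pairs `[0., 1., 3.]` with four labels.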

Notes

Data preparation is composed of two steps. The first step includes:

- Data cleaning: fixing and removing outliers, and replacing the missing values by other values using the parameter imputer_strategy rather than getting rid of the affected instances (examples) or the whole feature.
- Handling text and features: converting the categorical feature labels to numbers so that the algorithm performs well with non-numerical values.
- Data stratification: this is done before separating the dataset into a trainset and a test set. Stratification consists of dividing the whole dataset into homogeneous subgroups to guarantee that the test set is representative of the overall dataset. This is useful in our case because the dataset is not large enough to avoid the risk of introducing a significant bias. Once the data are stratified, they are divided into a trainset (80%) and a test set (20%).

The second step consists of feature selection, feature engineering, encoding and data scaling using the pipeline passed via the parameter pipeline. If no pipeline is given, the default pipeline is triggered. Feature engineering consists of aggregating features by experimenting with combinations of attributes into promising new features, using the parameter attribute_indexes after setting the argument add_attributes to True. The final transformation step is feature scaling. The type of scaling used by default in this module is standardization because it is less affected by outliers. Each transformation step must be executed in the right order; therefore a full pipeline is created, composed of the numerical pipeline (which deals with numerical features) and the categorical pipeline (which deals with categorical features). Both pipelines are combined and applied to the trainset and later to the test set.
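The order of operations in these notes (stratify, split 80/20, then fit the transformations on the trainset and reuse them on the test set) can be sketched with plain scikit-learn. This is an illustration under assumptions: it uses `train_test_split(stratify=...)` instead of the watex stratification transformer, a numerical-only pipeline, and synthetic data with an imbalanced target.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numerical features with a few missing values (assumption)
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 3))
X[rng.rand(100, 3) < 0.05] = np.nan
y = np.array([0] * 80 + [1] * 20)  # imbalanced toy target

# Stratified 80/20 split keeps the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Median imputation followed by standardization
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_prepared = prep.fit_transform(X_train)  # fit on the trainset only
X_test_prepared = prep.transform(X_test)        # reuse the fitted pipeline

print(y_train.mean(), y_test.mean())
```

Fitting the imputer and scaler on the trainset alone keeps information from the test set out of the preparation step.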

Examples

../datasets/_p.py

property X_#: keep the stratified testset X

property data#

fit(X, y=None)[source]#

Preparing steps.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

``self`` – returns self for easy method chaining.

Return type:

BaseSteps instance

fit_transform(X=None, y=None, on_testset=False)[source]#

Apply fit and transform at once.

Parameters:

X (ndarray, pd.DataFrame) – X or dataframe X
y (array_like,) – ylabel or target
on_testset (str) – Check whether the dataframe is evaluated on the testset or the trainset

Returns:

- X_prepared. Data prepared after transformation
- y_prepared. Label prepared after transformation.

stratifyFolds(data)[source]#: Stratify the dataset and return the trainset. Get more details in watex.bases.transformers.StratifiedWithCategoryAdder.

stratifydata(data=None)[source]#: Split and stratify the data, and return the stratified training and test sets

property tname#

transform(X=None, y=None, on_testset=False)[source]#

Transform data applying the pipeline transformation.

Parameters:

X (ndarray, pd.DataFrame) – X or dataframe X
y (array_like,) – ylabel or target values
on_testset (str) – Check whether the dataframe is evaluated on the testset or the trainset

Returns:

- X_prepared. Data prepared after transformation
- y_prepared. Label prepared after transformation.

property y_#: keep the stratified label y

class watex.DCProfiling(stations=None, dipole=10.0, auto=False, keep_params=False, read_sheets=False, force=False, **kws)[source]#

Bases: ElectricalMethods

A collection of DC-resistivity profiling classes.

It reads and computes electrical parameters. Each line composes a specific object and gathers all the attributes of ResistivityProfiling for easy use. For instance, the expected drilling location point and its resistivity value for two survey lines (line1 and line2) can be fetched as:

>>> <object>.line1.sves_ ; <object>.line1.sves_resistivity_
>>> <object>.line2.sves_ ; <object>.line2.sves_resistivity_

Parameters:

stations (list or str (path-like object)) –
List of station names where the drilling is expected to be located. It is strongly linked to the name used to specify the center position of each dipole when the survey data is collected. Each survey can have its own way of numbering the positions; however, if the station is given, it should be one (presumed to be the suitable point for drilling) in the survey lines. Commonly it is called the sves, which means that the DC-sounding will be operated at this point. Be sure to provide the correct station to compute the electrical parameters.

It is recommended to provide the positioning of the station expected to hold the drillings. However, if stations is None, the automatic way of computing electrical features is triggered. Users can also provide the list of stations by hand. In that case, each station should be numbered from 1, not 0. For instance:
- In a survey line of 20 positions, we consider station 13 as the best point to locate the drilling. Therefore the name of the station should be ‘S13’. In another survey line (line2), the second point of the survey is considered the suitable one to locate the drilling. Considering the two survey lines, the list of stations should be [‘S13’, ‘S2’].
- Stations can also be arranged in a single file to be parsed, which corresponds to the string (path-like) argument.
dipole (float) – The dipole length used during the exploration of the area. If the dipole value is set as a keyword argument, the station names are overwritten and henceforth named according to the value of the dipole. For instance, for a dipole equal to 10 m, the first station should be S00, the second S10, the third S20, and so on. However, it is recommended to name the stations using counting numbers rather than the dipole position.
auto (bool) – Auto-detect the best conductive zone. If True, the station position should be the station with the lowest resistivity value in the Electrical Resistivity Profiling.
keep_params (bool, default=False,) – If True, keeps only the predicted parameters in the summary table; otherwise, returns the useful details of the line, such as the geographical coordinates where the DC predicted parameters are computed.
read_sheets (bool,) – Read the data in sheets. Here it assumes that the data of each survey line are arranged in a single Excel worksheet. Note that if read_sheets is set to True and the file is not in Excel format, a TypeError is raised.
force (bool, default=False,) –
By default, DCProfiling expects users to provide either DC objects or pandas dataframes. This assumes users have already transformed their data from sheets to dataframes. If that is not the case, setting force to True constrains the algorithm to do both tasks at once.

New in version 0.2.0.
fit_params (dict) – Additional Electrical Resistivity Profiling keyword arguments
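The dipole-based naming and the auto selection described in the parameters above can be illustrated with two small helpers. Both `station_names` and `auto_station` are hypothetical functions written for this sketch; the two-digit zero padding follows the S00, S10, S20 example in the dipole description and is otherwise an assumption, as are the resistivity values.

```python
def station_names(n_stations, dipole=10.0):
    """Name stations by dipole position: S00, S10, S20, ... for dipole=10."""
    return [f"S{int(i * dipole):02d}" for i in range(n_stations)]


def auto_station(resistivities, names):
    """Pick the station with the lowest resistivity value."""
    lowest = min(range(len(resistivities)), key=resistivities.__getitem__)
    return names[lowest]


# Five positions along one hypothetical survey line
names = station_names(5, dipole=10.0)
print(names)

# Hypothetical apparent resistivity values, one per station
resistivities = [1460, 950, 500, 1300, 1630]
best = auto_station(resistivities, names)
print(best)
```

With counting numbers instead of dipole positions, as the parameter description recommends, the same selection would simply return the position index plus one.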

Examples

(1) -> Get DC-resistivity profiling from the individual Resistivity objects

>>> from watex.methods import ResistivityProfiling
>>> from watex.methods import DCProfiling
>>> robj1= ResistivityProfiling(auto=True) # auto detection
>>> robj1.utm_zone = '50N'
>>> robj1.fit('data/erp/testsafedata.xlsx')
>>> robj1.sves_
... 'S036'
>>> robj2= ResistivityProfiling(auto=True, utm_zone='40S')
>>> robj2.fit('data/erp/l11_gbalo.xlsx')
>>> robj2.sves_
... 'S006'
>>> # read the both objects
>>> dcobjs = DCProfiling()
>>> dcobjs.fit([robj1, robj2])
>>> dcobjs.sves_
... array(['S036', 'S006'], dtype=object)
>>> dcobjs.line1.sves_ # => robj1.sves_
>>> dcobjs.line2.sves_ # => robj2.sves_

(2) -> Read from a collection of Excel data

>>> datapath = r'data/erp'
>>> dcobjs.read_sheets=True
>>> dcobjs.fit(datapath)
>>> dcobjs.nlines_  # getting the number of survey lines
... 9
>>> dcobjs.sves_ # stations of the best conductive zone
... array(['S017', 'S006', 'S000', 'S036', 'S036', 'S036', 'S036', 'S036',
       'S001'], dtype='<U33')
>>> dcobjs.sves_resistivities_ # the lower conductive resistivities
... array([  80,   50, 1101,  500,  500,  500,  500,  500,   93], dtype=int64)
>>> dcobjs.powers_
... array([ 50,  60,  30,  60,  60, 180, 180, 180,  40])

(3) -> Read data and all sheets; assumes all data are arranged in sheets

>>> dcobjs.read_sheets=True
>>> dcobjs.fit(datapath)
>>> dcobjs.nlines_ # here it assumes all the data are in single worksheets.
... 4
>>> dcobjs.line4.conductive_zone_ # conductive zone of the line 4
... array([1460, 1450,  950,  500, 1300, 1630, 1400], dtype=int64)
>>> dcobjs.sfis_
... array([1.05085691, 0.07639077, 0.03592814, 0.07639077, 0.07639077,
       0.07639077, 0.07639077, 0.07639077, 1.08655919])
>>> dcobjs.line3.sfi_ # => robj1.sfi_
... array([0.03592814]) # for line 3

fit(*data, **fit_params)[source]#

Read and fit the collections of data

Parameters:

data (list of path-like objects, or ResistivityProfiling objects) – Data containing the collection of DC-resistivity values of multiple survey areas.
fit_params (str,) – Additional keyword arguments from watex.utils.coreutils.parseStations. It refers to the station_delimiter parameter, used if the attribute stations is given as a path-like object. If the stations are arranged on the same line, it is convenient to provide the delimiter to parse the stations.

Return type:

object instantiated from ResistivityProfiling.

Notes

The stations should be numbered from 1, not 0, and should match the number of survey lines. Each survey line is expected to hold one drilling position.

property inspect#: Inspect whether the object is fitted or not

summary(return_table=True)[source]#

Aggregate the DC-profiling parameters to compose a parameter table

Parameters:

return_table – bool, default=True. Returns the table of DC parameters at all sites if True, and the instantiated DCProfiling object otherwise.

Returns:

The table if return_table is True, and the instantiated DCProfiling object otherwise.

class watex.DCSounding(search=45.0, rho0=None, h0=1.0, read_sheets=False, strategy='HMCMC', vesorder=None, typeofop='mean', objective='coverall', keep_params=False, **kws)[source]#

Bases: ElectricalMethods

Direct-Current Electrical Sounding

A collection of Vertical Electrical Sounding (VES) objects, with the predictor parameters computed accordingly.

The VES is carried out to speculate about the existence of a fracture zone and the layer thicknesses. Commonly, it supplements Electrical Resistivity Profiling once the best conductive zone has been selected, when the survey is one-dimensional. Data from each DC-sounding site can be retrieved using:
>>> <object>.site<number>.<attr>_
where <attr>_ is an attribute of VerticalSounding. For instance, to fetch the DC-sounding data position and the resistivity in depth of the fractured zone for the first site, we use:
>>> <object>.site1.fractured_zone_
>>> <object>.site1.fractured_zone_resistivity_

Parameters:

search: float , list of float

The collection of depths in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45 m. So the search can be specified via the average water inrush value.

rho0: float

Value of the starting resistivity model. If None, rho0 is set to half the minimum value of the apparent resistivity collected. Units are Ω.m, not log10(Ω.m).

h0: float

Thickness in meters of the first layer. If None, the minimum possible thickness, 1 m, is used.

strategy: str

Type of inversion scheme. The default is Hybrid Monte Carlo (HMC), known as HMCMC. Another scheme is the Bayesian neural network approach (BNN).

vesorder: int

The index to retrieve the resistivity data of a specific sounding point. Sometimes the sounding data are composed of different sounding values collected in the same survey area along different Electrical Resistivity Profiling lines. For instance:

+------+------+-----+-----+-----+-----+-----+
| AB/2 | MN/2 | SE1 | SE2 | SE3 | …   | SEn |
+------+------+-----+-----+-----+-----+-----+

Where SE are the electrical sounding data values and n is the number of sounding points selected. SE1, SE2 and SE3 are three points selected for Vertical Electrical Sounding, i.e. 3 sounding points carried out either in the same Electrical Resistivity Profiling or somewhere else. These sounding data are the resistivity data with specific numbers. Commonly the numbers are randomly chosen. They do not refer to the expected best fracture zone selected after the prior interpretation. After transformation via the function vesSelector(), the header of the data should hold the resistivity. For instance, referring to the table above, the data should be:

+----+----+-------------+-------------+-------------+-----+
| AB | MN | resistivity | resistivity | resistivity | …   |
+----+----+-------------+-------------+-------------+-----+

Therefore, the vesorder is used to select the specific resistivity values, i.e. to select the corresponding sounding number of the Vertical Electrical Sounding expected to locate the drilling operations or to be used for computation. For example, vesorder=1 should give:

+------+------+-----+       +----+----+-------------+
| AB/2 | MN/2 | SE2 |  -->  | AB | MN | resistivity |
+------+------+-----+       +----+----+-------------+

If vesorder is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. rhoaIndex equals 0.

typeofop: str

Type of operation to apply to the resistivity values rhoa of the duplicated spacing points AB. The default operation is mean. Sometimes the measurement at a given AB is collected twice after slightly modifying the distance of the potential electrodes (MN). In that case, two or more resistivity values correspond to the same distance AB (AB remains unchanged while MN is changed). The operation then consists of averaging (mean) the resistivity values, taking the median value, or applying leaveOneOut (i.e. keeping one of the resistivity values collected at the same spacing AB). Note that for leaveOneOut, the selected resistivity value is randomly chosen.

objective: str

Type of operation to output. By default, the function outputs the value of the pseudo-area in $\Omega.m^2$. However, for plotting purposes, setting the argument to view gives alternative outputs: the recomputed and projected X and Y, as well as the X and Y values of the expected fractured zone. Here X is the AB dipole spacing when imaging to depth and Y is the computed apparent resistivity.

keep_params: bool, default=False,

If True, keeps only the predicted parameters in the summary table; otherwise, returns the useful details of the site, like the depth AB/2 where the DC predicted area parameter is computed.

kws: dict

Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.
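
The vesorder and typeofop behaviour described above can be sketched in plain Python. This is an illustration only, not the watex implementation: the helper names select_sounding and reduce_duplicates, the table layout (a dict of columns) and the sample values are all hypothetical.

```python
import random
import statistics

def select_sounding(table, vesorder=None):
    """Pick one sounding column from a table shaped like the one above.

    `table` maps column names to lists; every column other than 'AB/2'
    and 'MN/2' is treated as a sounding curve (SE1, SE2, ...).
    """
    sounding_names = [name for name in table if name not in ("AB/2", "MN/2")]
    index = 0 if vesorder is None else vesorder  # first curve by default
    return {
        "AB": table["AB/2"],
        "MN": table["MN/2"],
        "resistivity": table[sounding_names[index]],
    }

def reduce_duplicates(ab, rhoa, typeofop="mean", seed=None):
    """Collapse resistivity values measured at the same AB spacing."""
    groups = {}
    for spacing, value in zip(ab, rhoa):
        groups.setdefault(spacing, []).append(value)
    rng = random.Random(seed)
    ops = {
        "mean": statistics.mean,
        "median": statistics.median,
        "leaveoneout": rng.choice,  # keep one value, chosen at random
    }
    op = ops[typeofop.lower()]
    spacings = sorted(groups)
    return spacings, [op(groups[s]) for s in spacings]

# Made-up data: AB/2 = 3 was measured twice with two different MN/2.
table = {
    "AB/2": [1, 2, 3, 3, 5],
    "MN/2": [0.4, 0.4, 0.4, 1.0, 1.0],
    "SE1": [120.0, 95.0, 80.0, 84.0, 60.0],
    "SE2": [200.0, 150.0, 110.0, 130.0, 90.0],
}
ves = select_sounding(table, vesorder=1)  # index 1 -> second curve, SE2
ab, rhoa = reduce_duplicates(ves["AB"], ves["resistivity"], "mean")
print(ab)    # [1, 2, 3, 5]
print(rhoa)  # [200.0, 150.0, 120.0, 90.0]
```

With typeofop="leaveOneOut" the value kept at AB/2 = 3 would be either 110.0 or 130.0, depending on the random draw.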


fit(*data, **fit_params)[source]#

Fit the DC electrical sounding

Fit the Vertical Electrical Sounding curves, compute the ohmic-area and set all the features for demarcating the fractured zone from the selected anomaly.

Parameters:

data (list of path-like object, or DataFrames) – The string argument is a path-like object. It must be a valid file which contains the data collected in the field. It should be composed of the spacing values AB and the apparent resistivity values rhoa. By convention AB is half-space data, i.e. AB/2. So, if data is given, the params AB and rhoa should be kept to None. If AB and rhoa are to be supplied instead, the user must set data to None for API purposes; otherwise an error is raised. The recommended way is to use the vesSelector tool in watex.utils.vesSelector() to build the Vertical Electrical Sounding data before feeding it to the algorithm. See the example below.
fit_params (dict) – Additional keyword arguments, specific to the readable files. Refer to watex.property.Config.parsers. Use its keys to get all the readable formats.

Returns:

object

Return type:

A collection of Vertical Electrical Sounding objects

property inspect#: Inspect object whether is fitted or not

summary(return_table=True)[source]#

Aggregate the DC-Sounding parameters to compose a param-table

Parameters:

return_table – bool, default=True. Returns the table of DC parameters at all sites if True and the ‘DCSounding’ instantiated object otherwise.

Returns:

table if return_table is True and the DCSounding instantiated object otherwise.

class watex.Data(verbose=0)[source]#

Bases: object

Data base class

Typically, we train a model with a matrix of data. Note that the pandas DataFrame is the most commonly used because it is convenient to have column labels, even though NumPy arrays work as well.

For supervised learning, such as regression or classification, our intent is to have a function that transforms features into a label. Written as an algebraic formula, it would look like:

\[y = f(X)\]

X is a matrix. Each row represents a sample of data or information about an individual. Every column in X is a feature. The output of our function, y, is a vector that contains labels (for classification) or values (for regression).

In Python, by convention, we use the variable name X to hold the sample data, even though capitalizing a variable name violates the standard naming convention (see PEP8).
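
As a minimal sketch of the notation above, with made-up numbers and a toy function f standing in for a trained model (nothing here comes from watex):

```python
# Each row of X is one sample; each column is one feature.
X = [
    [45.0, 120.0],   # sample 0: feature 0, feature 1
    [30.0, 300.0],   # sample 1
    [60.0, 80.0],    # sample 2
]
y = [1, 0, 1]        # one label per sample (classification)

def f(X):
    """Toy stand-in for a trained model: label 1 when feature 1 < 150."""
    return [1 if row[1] < 150 else 0 for row in X]

n_samples, n_features = len(X), len(X[0])
print(n_samples, n_features)  # 3 2
print(f(X) == y)              # True
```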

Parameters:

data (str, filepath_or_buffer or pandas.core.DataFrame) – Path-like object or DataFrame. If data is given as a path-like object, data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be a file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle e.g. via builtin open function or StringIO.
columns (str or list of str) – columns containing the missing data to replace. Can be used with axis equal to 1.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine if rows or columns which contain missing values are removed. * 0, or ‘index’ : Drop rows which contain missing values. * 1, or ‘columns’ : Drop columns which contain missing value. Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
sample (int, Optional,) – Number of rows to visualize, or the limit on the number of samples needed to see the patterns. This is useful when the data is composed of many rows. Shrinking the data to keep some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) –
type of visualization. Can be dendrogram, mbar, bar or corr, for the dendrogram, msno bar, plt bar and msno correlation visualizations respectively:
- bar plot counts the nonmissing data using pandas.
- mbar uses the msno package to count the number of nonmissing data.
- dendrogram shows the clusterings of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different the clusters are. Short arms mean that the branches are similar.
- corr creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of the missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.

Returns:

self – returns self for easy method chaining.

Return type:

Baseclass instance

Examples

property data#: return verified data

property describe#: Get summary stats as well as the count of non-null data. The default behaviour is to report only on numeric columns. To have full control, do it manually.

drop(labels=None, columns=None, inplace=False, axis=0, **kws)[source]#

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters:

labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels)
kws (dict,) – Additional keyword arguments passed to pd.DataFrame.drop().

Returns:

DataFrame without the removed index or column labels, or None if inplace equals True.

Return type:

DataFrame or None
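
Since the extra keywords are passed to pd.DataFrame.drop(), the semantics of labels, axis, columns and inplace can be illustrated directly on a pandas DataFrame. The sketch below uses pandas only, not watex.Data, and the frame is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Dropping columns: labels with axis=1 is equivalent to columns=labels.
by_axis = df.drop(labels=["b"], axis=1)
by_columns = df.drop(columns=["b"])
print(by_axis.equals(by_columns))  # True

# Dropping rows by index label (axis=0 is the default).
without_first = df.drop(labels=[0])
print(list(without_first.index))   # [1, 2]

# inplace=True modifies df itself and returns None.
result = df.drop(columns=["c"], inplace=True)
print(result, list(df.columns))    # None ['a', 'b']
```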

fit(data=None)[source]#

Read, assert and fit the data.

Parameters:: data (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
Returns:: Returns self for easy method chaining.
Return type:: Data instance

property inspect#: Inspect data and trigger plot after checking the data entry. Raises NotFittedError if ExPlot is not fitted yet.

merge()[source]#

Merge two series, whatever their types, with the operator &&.

When a series has dtype object because it holds non-numeric values, the dtypes should be changed to object.

profilingReport(data=None, **kwd)[source]#

Generate a report in a notebook.

It will summarize the types of the columns and allow you to view details of quantile statistics, a histogram, common values and extreme values.

Parameters:: data (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
Returns:: Returns self for easy method chaining.
Return type:: Data instance

Examples

>>> from watex.base import Data
>>> Data().fit(data).profilingReport()

rename(data=None, columns=None, pattern=None)[source]#

Rename the columns of the dataframe, with columns in lowercase and spaces replaced by underscores.

Parameters:

data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
columns (str or list of str, Optional) – the specific columns in the dataframe to rename. However, all columns are put in lowercase. If the columns are not in the dataframe, an error is raised.
pattern (str, Optional,) – Regular expression pattern to strip the data. By default, the pattern is '[ -@*#&+/]'.

Returns:

self – returns self for easy method chaining.

Return type:

Data instance
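
The renaming rule (lowercase, then replace what the pattern matches by an underscore) can be sketched with the standard re module. The helper normalize_columns is hypothetical, and the pattern used here is a plain whitespace pattern chosen for the illustration, not the package default quoted above.

```python
import re

def normalize_columns(columns, pattern=r"\s+"):
    """Lowercase each name and replace every `pattern` match with '_'."""
    regex = re.compile(pattern)
    return [regex.sub("_", str(name).strip().lower()) for name in columns]

print(normalize_columns(["East Coord", "Flow  Rate", "pH"]))
# ['east_coord', 'flow_rate', 'ph']
```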

shrunk(columns, data=None, **kwd)[source]#

Reduce the data to the important features

Parameters:

data (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
columns (str or list of str) – Columns or features to keep in the datasets
kwd (dict,) – Additional keyword arguments from watex.utils.mlutils.selectfeatures().

Returns:: Returns self for easy method chaining.
Return type:: Data instance

class watex.EM(survey_name=None, verbose=0)[source]#

Bases: IsEdi

Create an EM object as a collection of EDI files.

Collect edifiles and create an EM object. It sets the properties from audio-magnetotelluric data. The two (2) components XY and YX will be set and calculated. It can read MT data as well; however, the full handling of transfer functions like Tipper and Spectra is not completed. Use other MT software for long-period data.

Parameters:: survey_name (str) – location name where the data were collected. If survey_name is None, it can be checked from the edifiles.

ediObjs_#

array of the collection of edifiles read successfully

Type:: Array-like of shape (N,)

data_#

array of all edifiles fed into the EM module, whether successfully read or not.

Type:: Array-like of shape (N, )

edinames_#

array of all edi-names successfully read

Type:: array-like of shape (N,)

edifiles_#

array of all edifiles if given.

Type:: array of shape (N, )

freqs_#

Array of the frequency range from EDIs

Type:: array-like of shape (N, )

refreq_#

Reference frequency for data correction. Note that the reference frequency is the highest frequency with clean data.

Type:: float,

Properties#


longitude#

longitude coordinate values collected from EDIs

Type:: array-like, shape (N,)

latitude#

Latitude coordinate values collected from EDIs

Type:: array-like, shape (N, )

elevation#

Elevation coordinates collected from EDIs

Type:: array-like, shape (N,)

property elevation#

exportedis(ediObjs, new_Z, savepath=None, **kws)[source]#

Export EDI files from multiple EDI or z objects

Export new EDI files from the former objects with the given new impedance tensors. The export is assumed to be a new output EDI resulting from the application of multiple corrections.

Parameters:

ediObjs (list of str or watex.edi.Edi) – Full path to EDI files, or EDI objects from EM objects.
new_Z (list of ndarray (nfreq, 2, 2)) – A collection of ndarrays of impedance tensors Z. The tensor Z is a 3D array composed of the number of frequencies nfreq and four components (xx, xy, yx, and yy) in 2x2 matrices. The tensor Z is complex-valued.
savepath (str, Optional) – Path to save a new EDI file. If None, outputs to _outputEDI_ folder.

Return type:

ediObj from watex.edi.Edi

See also

exportedi: Export single EDI from

fit(data)[source]#

Assert and make an EM object from a collection of EDIs.

Parameters:: data (str, or list or pycsamt.core.edi.Edi object) – Full path to EDI files or collection of EDI-objects
Returns:: self
Return type:: EM object from a collection of EDIs

Examples

>>> from watex.methods.em import EM
>>> emObjs = EM().fit (r'data/edis')
>>> emObjs.ediObjs_
...

static get_z_from(edi_obj_list, /)[source]#

Get z object from Edi object.

Parameters:: z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.

New in version v0.1.9.

Returns:: Z – List of impedance tensor Objects.
Return type:: list of watex.externals.z.Z

getfullfrequency(to_log10=False)[source]#

Get the frequency with clean data.

The full or plain frequency is the frequency array with no missing data during the data collection. Note that when using Natural Source Audio-Magnetotellurics, some data are missing because the signal is weak in certain frequency bands, especially in the attenuation band.

Parameters:: to_log10 (bool, default=False,) – export frequency to base 10 logarithm
Returns:: f – frequency with clean data. Out of attenuation band if survey is completed with Natural Source Audio-Magnetotellurics.
Return type:: Arraylike 1d of shape(N, )

See also

watex.utils.exmath.get_full_frequency: Get the complete frequency with no missing signals.

Example

>>> import watex as wx
>>> edi_sample = wx.fetch_data ('edis', return_data=True, samples = 12 )
>>> wx.EM().fit(edi_sample).getfullfrequency(to_log10 =True )
array([4.76937733, 4.71707639, 4.66477553, 4.61247466, 4.56017382,
       4.50787287, 4.45557204, 4.40327104, 4.35097021, 4.29866928,
       4.24636832, 4.19406761, 4.14176668, 4.08946565, 4.03716465,
       ...
       2.67734228, 2.62504479, 2.57274385, 2.52044423, 2.46814047,
       2.41584107, 2.36353677, 2.31124512, 2.25892448, 2.20663701,
       2.15433266, 2.10202186, 2.04972182, 1.99743007])
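
The idea of a full frequency array and of the to_log10 option can be sketched in plain Python. The sketch rests on an assumption that is not stated in the text above: the site with the longest frequency array is taken as the one with no missing frequency. The helper full_frequency and the sample values are hypothetical.

```python
import math

def full_frequency(freqs_per_site, to_log10=False):
    """Return the most complete frequency array found among the sites.

    Assumption for this sketch: the site holding the most frequencies
    carries the complete frequency range.
    """
    complete = max(freqs_per_site, key=len)
    if to_log10:
        return [math.log10(f) for f in complete]
    return list(complete)

sites = [
    [10000.0, 100.0, 10.0],           # 1000 Hz is missing at this site
    [10000.0, 1000.0, 100.0, 10.0],   # complete record
]
print(full_frequency(sites))                 # [10000.0, 1000.0, 100.0, 10.0]
print(full_frequency(sites, to_log10=True))  # [4.0, 3.0, 2.0, 1.0]
```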

getreferencefrequency(to_log10=False)[source]#

Get the reference frequency from a collection of EDI objects.

The highest frequency with clean data should be selected as the reference frequency

Parameters:

data (list of pycsamt.core.edi.Edi or mtpy.core.edi.Edi objects) – Collections of EDI-objects from pycsamt
to_log10 (bool,) – outputs the reference frequency into base 10 logarithm in Hz.

Returns:

rf – the reference frequency at the clean data in Hz

Return type:

float

Examples

>>> from watex.methods.em import EM
>>> edipath ='data/3edis'
>>> ref = EM().getreferencefrequency(edipath, to_log10=True)
>>> ref
... 4.845098040014257 # in Hz

References

http://www.zonge.com/legacy/PDF_DatPro/Astatic.pdf
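
The rule "highest frequency with clean data" can be sketched as follows. The sketch assumes that "clean" means "no NaN at any site for that frequency"; this reading, the helper reference_frequency and the sample values are assumptions made for the illustration.

```python
import math

def reference_frequency(freqs, values_per_freq, to_log10=False):
    """Highest frequency whose values are valid (not NaN) at every site."""
    clean = [
        f for f, row in zip(freqs, values_per_freq)
        if not any(math.isnan(v) for v in row)
    ]
    if not clean:
        raise ValueError("no frequency has clean data at every site")
    rf = max(clean)
    return math.log10(rf) if to_log10 else rf

freqs = [70000.0, 58800.0, 49500.0]
values = [
    [float("nan"), 120.0, 118.0],  # 70000 Hz: first site is missing
    [101.0, 99.5, 98.0],           # 58800 Hz: clean at all sites
    [87.0, 85.0, 90.0],            # 49500 Hz: clean at all sites
]
print(reference_frequency(freqs, values))  # 58800.0
```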

property inspect#: Inspect object whether is fitted or not

is_valid(obj)[source]#

Assert that the given argument is an EDI object from the EDI module, or an EDI object from the pycsamt and MTpy packages. A TypeError is raised otherwise.

Parameters:: obj (str, pycsamt.core.edi.Edi or mtpy.core.edi.Edi) – Full path to an EDI file, or pycsamt or MTpy objects.
Returns:: obj – Identical object after asserting.
Return type:: str, pycsamt.core.edi.Edi or mtpy.core.edi.Edi

property latitude#

property longitude#

make2d(out='resxy', *, kind='complex', **kws)[source]#

Output 2D resistivity, phase, error and tensor matrices from a collection of EDI-objects.

The matrix has shape number of frequencies by number of sites. The function asserts whether all data from all frequencies are available. Missing values are filled with NaN.

Parameters:

data (Path-like object or list of pycsamt.core.edi objects) – Collections of EDI-objects from pycsamt or full path to EDI files.
out (str) – kind of data to output. Be sure to provide the component to retrieve the attribute from the collection object. Except for the error and frequency attributes, a missing component in the attribute name will raise an error. For instance, resxy for the xy component. Default is resxy.
kind (bool or str) – focuses on the tensor output. Note that the tensor is a complex ndarray of shape (nfreq, 2, 2). If set to modulus, the modulus of the complex tensor is output. If real or imag, only that part is returned. Default is complex.
kws (dict) – Additional keyword arguments from getfullfrequency().

Returns:

mat2d – the matrix of number of frequencies by number of EDIs collected, which corresponds to the number of stations/sites.

Return type:

np.ndarray(nfreq, nstations)

Examples

>>> from watex.methods.em import EM
>>> edipath ='data/edis'
>>> emObjs= EM().fit(edipath)
>>> phyx = emObjs.make2d ('phaseyx')
>>> phyx
... array([[ 26.42546593,  32.71066454,  30.9222746 ],
       [ 44.25990541,  40.77911136,  41.0339148 ],
       ...
       [ 37.66594686,  33.03375863,  35.75420802],
       [         nan,          nan,  44.04498791]])
>>> phyx.shape
... (55, 3)
>>> # get the real number of the yy component of tensor z
>>> zyy_r = emObjs.make2d ('zyx', kind ='real')
... array([[ 4165.6   ,  8665.64  ,  5285.47  ],
       [ 7072.81  , 11663.1   ,  6900.33  ],
       ...
       [   90.7099,   119.505 ,   122.343 ],
       [       nan,        nan,    88.0624]])
>>> # get the resistivity error of component 'xy'
>>> resxy_err = emObjs.make2d ('resxy_err')
>>> resxy_err
... array([[0.01329037, 0.02942557, 0.0176034 ],
       [0.0335909 , 0.05238863, 0.03111475],
       ...
       [3.33359942, 4.14684926, 4.38562271],
       [       nan,        nan, 4.35605603]])
>>> phyx.shape ,zyy_r.shape, resxy_err.shape
... ((55, 3), (55, 3), (55, 3))
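
The shape convention (nfreq, nstations) and the NaN filling described above can be sketched with NumPy. This is not the watex code: the helper to_2d, the way sites are passed in and the sample values are hypothetical. The last lines show what the modulus, real and imag options of kind mean for a complex array.

```python
import numpy as np

def to_2d(full_freqs, sites):
    """Stack per-site values into a (nfreq, nstations) matrix.

    `sites` is a list of (freqs, values) pairs; frequencies that a site
    did not record are left as NaN.
    """
    mat = np.full((len(full_freqs), len(sites)), np.nan)
    row_of = {f: i for i, f in enumerate(full_freqs)}
    for col, (freqs, values) in enumerate(sites):
        for f, v in zip(freqs, values):
            mat[row_of[f], col] = v
    return mat

full_freqs = [1000.0, 100.0, 10.0]
sites = [
    ([1000.0, 100.0], [26.4, 44.3]),               # 10 Hz missing
    ([1000.0, 100.0, 10.0], [32.7, 40.8, 33.0]),   # complete
]
mat2d = to_2d(full_freqs, sites)
print(mat2d.shape)             # (3, 2)
print(np.isnan(mat2d[2, 0]))   # True

# The `kind` options applied to a complex tensor component:
z = np.array([3 + 4j, 1 - 2j])
modulus = np.abs(z)   # modulus of each entry
real_part = z.real    # real part only
imag_part = z.imag    # imaginary part only
```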

rewrite(*, by='name', prefix=None, dataid=None, savepath=None, how='py', correct_ll=True, make_coords=False, reflong=None, reflat=None, step='1km', edi_prefix=None, export=True, **kws)[source]#

Rewrite Edis, correct station coordinates and dipole length.

Can rename the dataid, customize sites and correct the positioning latitudes and longitudes.

Parameters:

dataid (list) – list of ids to rename the existing EDI dataid from Head.dataid. If given, it should match the length of the collection of ediObjs. A ValueError occurs if the number of ids provided does not match the number of EDI objects.
by (str) – Rename according to the inner module Id. Can be name, id, number. Default is name. If survey_name is given, the whole survey name is overwritten. Conversely, the argument ix outputs the formatted station numbers, excluding the survey name.
prefix (str) – Prefix the number of the site. It could be the abbreviation of the survey area.
correct_ll (bool,) – Write the scaled positions (longitude and latitude). Default is True.
make_coords (bool) – Useful to hide the real coordinates of the sites by generating ‘fake’ coordinates for specific purposes. When set to True, be sure to provide the reflong and reflat values, otherwise an error occurs.
reflong (float or string) – Reference longitude in decimal degrees or in DD:MM:SS for the site considered as the origin of the landmark.
reflat (float or string) – Reference latitude in decimal degrees or in DD:MM:SS for the reference site considered as the landmark origin.
step (float or str) – Offset or distance of separation between different sites, in meters. If the value is given as a string, it is considered a value in meters unless it carries the km unit. Only meters and kilometers are acceptable. The default separation between sites is 1km.
savepath (str) – Full path of the save directory. If not given, EDIs are output to the created directory.
how (str) – The way to index the stations. Default is Python indexing, i.e. the counting starts at 0. Any other value starts counting the sites from 1.
export (bool,) – Export new edi-files
kws (dict) – Additional keyword arguments from ~Edi.write_edifile and watex.utils.coreutils.makeCoords().

Returns:

EM – Returns self for easy method chaining.

Return type:

EM instance

Examples

>>> from watex.methods.em import EM
>>> edipath = r'data/edis'
>>> savepath =  r'/Users/Daniel/Desktop/ediout'
>>> emObjs = EM().fit(edipath)
>>> emObjs.rewrite(by='id', edi_prefix ='b1',
                   savepath =savepath)
>>> #
>>> # second example to write 7 samples of edi from
>>> # Edi objects inner datasets
>>> #
>>> import watex as wx
>>> edi_sample = wx.fetch_data ('edis', key ='edi',
                                samples =7, return_data =True )
>>> emobj = wx.EM ().fit(edi_sample)
>>> emobj.rewrite(by='station', prefix='PS')
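
Two of the conventions described for rewrite, the step unit and the how indexing, can be sketched in plain Python. The helpers step_to_meters and station_names are hypothetical, and the zero-padded naming format is an assumption made for the illustration; the package may format station names differently.

```python
import re

def step_to_meters(step):
    """Convert a step such as '1km', '500m', '250' or 250 to meters.

    Strings ending in 'km' are scaled by 1000; any other value is
    read as meters.
    """
    if isinstance(step, (int, float)):
        return float(step)
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(km|m)?\s*", step.lower())
    if match is None:
        raise ValueError(f"cannot parse step {step!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * 1000.0 if unit == "km" else value

def station_names(n_sites, prefix="S", how="py"):
    """Number sites from 0 when how='py', from 1 otherwise."""
    start = 0 if how == "py" else 1
    return [f"{prefix}{i:02d}" for i in range(start, start + n_sites)]

print(step_to_meters("1km"))              # 1000.0
print(step_to_meters("500m"))             # 500.0
print(station_names(3, prefix="PS"))      # ['PS00', 'PS01', 'PS02']
print(station_names(3, "PS", how="mat"))  # ['PS01', 'PS02', 'PS03']
```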

property stnames#

tslicer(freqs=None, z=None, component='xy')[source]#

Return a 2D tensor from components

Parameters:

freqs (arraylike) – full frequency that composes the tensor. If None, use the components in
z (ArrayLike 3D) – Tensor composed of a 3D array of shape (n_freqs, 2, 2)
component (str,) – component to retrieve. Can be [‘xx’|’xy’|’yx’|’yy’]

New in version v0.2.0.

Returns:

z or slice – Returns 2D tensor or dictionary of component index slicers.

Return type:

Array-like 2D tensor, or dict
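
Selecting one component from arrays of shape (n_freqs, 2, 2) can be sketched with NumPy index slicers. This is an illustration under stated assumptions, not the tslicer implementation: the SLICERS dictionary and the helper component_2d are hypothetical, and every site is assumed to share the same frequencies.

```python
import numpy as np

# Position of each component inside the 2 x 2 impedance matrix.
SLICERS = {
    "xx": np.s_[:, 0, 0],
    "xy": np.s_[:, 0, 1],
    "yx": np.s_[:, 1, 0],
    "yy": np.s_[:, 1, 1],
}

def component_2d(z_list, component="xy"):
    """Stack one component of every site into an (n_freqs, n_sites) array."""
    return np.column_stack([z[SLICERS[component]] for z in z_list])

n_freqs = 3
site_a = np.arange(n_freqs * 4).reshape(n_freqs, 2, 2) * (1 + 1j)
site_b = site_a * 2
zxy = component_2d([site_a, site_b], "xy")
print(zxy.shape)       # (3, 2)
print(zxy.real[:, 0])  # [1. 5. 9.]
```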

watex.EMProcessing#: alias of Processing

class watex.EvalPlot(tname=None, encode_labels=False, scale=None, cv=None, objective=None, prefix=None, label_values=None, litteral_classes=None, **kws)[source]#

Bases: BasePlot

Metrics, dimensionality and model evaluation plots.

Inherited from BasePlot. Dimensional reduction and metric plots. The class works only with numerical features.

Discouraged

Continuous target values are discouraged for plotting classification metrics. We encourage users to prepare their dataset before using the EvalPlot methods; this is recommended to keep full control of the expected results. Indeed, most of the metric plots implemented here work with supervised methods, and especially with classification problems. So the convenient way is for users to discretize/categorize the target (class labels) before the fit. If that is not the case, as in the demonstration examples under each method implementation, the continuous labels first need to be categorized. There are two choices: either provide the individual class labels as a list of integers using the method EvalPlot._cat_codes_y(), or specify the number of clusters that the target must hold. The latter choice is commonly useful for test or academic purposes. In practice, on a real dataset, this kind of target partition is discouraged since it is far from reality and can lead to misinterpretation.
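
Discretizing a continuous target before the fit can be sketched with the standard bisect module. The helper categorize is hypothetical and is not EvalPlot._cat_codes_y(); the thresholds and the flow values are made up for the illustration.

```python
from bisect import bisect_left

def categorize(y, bounds):
    """Map continuous values to integer class labels.

    A value v gets label k, where k is the number of bounds strictly
    below v; with bounds [1.0, 3.0] this gives
    class 0: v <= 1, class 1: 1 < v <= 3, class 2: v > 3.
    """
    return [bisect_left(bounds, v) for v in y]

flow = [0.0, 0.8, 1.0, 2.5, 3.0, 6.2]
labels = categorize(flow, bounds=[1.0, 3.0])
print(labels)  # [0, 0, 0, 1, 1, 2]
```

Choosing the bounds explicitly keeps the classes tied to thresholds that mean something for the problem, rather than to an automatic partition of the target.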

Parameters:

X (Ndarray of shape ( M x N), $M=m-samples$ & $N=n-features$) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M, ), $M=m-samples$) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.
objective (str, default=None,) – The purpose of the dataset; what problem do we intend to solve? Originally the package was designed for flow rate prediction. Thus, if the objective is set to flow, the plot behaves as for flow rate prediction, and in that case some conditions on the target values need to be fulfilled. Furthermore, if the objective is set to flow, the label_values as well as the litteral_classes parameters need to be supplied to correctly encode the target according to the hydraulic system requirements during the campaign for drinking water supply. For any other purpose, keep the objective to None. Default is None.
encode_labels (bool, default=False,) –
label encoding works with the label_values parameter. If y holds continuous numerical values, we can turn the regression into a classification by setting encode_labels to True. If the value is set to True and the label values are not given, unique identifiers are created, which may not fit the exact needs of the user. So it is recommended to set this parameter in combination with label_values. For instance:
```
encode_labels=True ; label_values =3
```
indicates that the target y values should be categorized to hold the integer identifiers [0, 1, 2]. y is split into three subsets where:
```
classes (c) = [ c{0} <= y. min(), y.min() < c {1}< y.max(),
                 >=y.max {2}]
```
This auto-splitting may not fit the exact classification of the target, so it is recommended to set label_values as a list of class labels, for instance label_values=[0, 1, 2].
scale (str, ['StandardScaler'|'MinMaxScaler'], default ='StandardScaler') – kind of feature scaling to apply to numerical features. Note that when using PCA, it is recommended to set scale and to use fit_transform rather than only fit. Note that the transform method also handles missing (NaN) values in the data, where the default filling strategy is most_frequent.

cv (float,) –

A cross validation splitting strategy. It is used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV, which use the predictions of one estimator as training data for another, so as not to overfit the training supervision. Possible inputs for cv are usually:

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within `Scikit-learn`_
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross validation at all is an option), the default is 4-fold.

prefix (str, optional) – literal string to prefix the integer class labels.
label_values (list of int, optional) – works with the encode_labels parameter. It indicates the different class labels. Refer to the explanation of encode_labels.
litteral_classes (list or str, optional) –
Works when objective is flow. Replaces the integer class names by their literal strings. For instance:
```
label_values =[0, 1, 3, 6]
litteral_classes = ['rate0', 'rate1', 'rate2', 'rate3']
```
yp_ls (str, default='-',) – Line style of Predicted label. Can be [ ‘-’ | ‘.’ | ‘:’ ]
yp_lw (str, default= 3) – Line weight of the Predicted plot
yp_lc (str or matplotlib.cm(), default= ‘k’) – Line color of the Prediction plot. default is k
rs (str, default='--') – Line style of Recall metric
ps (str, default='-') – Line style of Precision metric
rc (str, default=(.6,.6,.6)) – Recall metric colors
pc (str or matplotlib.cm(), default=’k’) – Precision colors from Matplotlib colormaps.
yp_marker (str or matplotlib.markers(), default =’o’) – Style of marker of the Prediction points.
yp_markerfacecolor (str or matplotlib.cm(), default=’k’) – Facecolor of the Predicted label marker.
yp_markeredgecolor (str or matplotlib.cm(), default= ‘r’) – Edgecolor of the Predicted label marker.
yp_markeredgewidth (int, default=2) – Width of the Predicted label marker.
savefig (str, Path-like object,) – savefigure’s name, default is None
fig_dpi (float,) – dots-per-inch resolution of the figure. default is 300
fig_num (int,) – size of figure in inches (width, height). default is [5, 5]
fig_size (Tuple (int, int) or inch) – size of figure in inches (width, height). default is [5, 5]
fig_orientation (str,) – figure orientation. default is landscape
fig_tile (str,) – figure title. default is None
fs (float,) – size of font of axis tick labels, axis labels are fs+2. default is 6
ls (str,) – line style, it can be [ ‘-’ | ‘.’ | ‘:’ ] . default is ‘-’
lc (str, Optional,) – line color of the plot, default is k
lw (float, Optional,) – line weight of the plot, default is 1.5
alpha (float between 0 < alpha < 1,) – transparency number, default is 0.5,
font_weight (str, Optional) – weight of the font , default is bold.
font_style (str, Optional) – style of the font. default is italic
font_size (float, Optional) – size of font in inches (width, height). default is 3.
ms (float, Optional) – size of marker in points. default is 5
marker (str, Optional) – marker of stations default is o.
marker_style (str, Optional) – facecolor of the marker. default is yellow
marker_edgecolor (str, Optional) – edgecolor of the marker. default is yellow
marker_edgewidth (float, Optional) – width of the marker. default is 3.
xminorticks (float, Optional) – minor ticks according to x-axis size. default is 1.
yminorticks (float, Optional) – minor ticks according to y-axis size. default is 1.
bins – histogram element separation between two bars. default is 10.
xlim (tuple (int, int), Optional) – limit of x-axis in plot.
ylim (tuple (int, int), Optional) – limit of y-axis in plot.
xlabel (str, Optional,) – label name of x-axis in plot.
ylabel (str, Optional,) – label name of y-axis in plot.
rotate_xlabel (float, Optional) – angle to rotate xlabel in plot.
rotate_ylabel (float, Optional) – angle to rotate ylabel in plot.
leg_kws (dict, Optional) – keyword arguments of legend. default is empty dict
plt_kws (dict, Optional) – keyword arguments of plot. default is empty dict
glc (str, Optional) – line color of the grid plot, default is k
glw (float, Optional) – line weight of the grid plot, default is 2
galpha (float, Optional,) – transparency number of grid, default is 0.5
gaxis (str ('x', 'y', 'both')) – type of axis to hold the grid, default is both
gwhich (str, Optional) – kind of grid in the plot. default is major
tp_axis (bool,) – axis to apply the ticks params. default is both
tp_labelsize (str, Optional) – labelsize of ticks params. default is italic
tp_bottom (bool,) – position at bottom of ticks params. default is True.
tp_labelbottom (bool,) – put label on the bottom of the ticks. default is False
tp_labeltop (bool,) – put label on the top of the ticks. default is True
cb_orientation (str , ('vertical', 'horizontal')) – orientation of the colorbar, default is vertical
cb_aspect (float, Optional) – aspect of the colorbar. default is 20.
cb_shrink (float, Optional) – shrink size of the colorbar. default is 1.0
cb_pad (float,) – pad of the colorbar of plot. default is .05
cb_anchor (tuple (float, float)) – anchor of the colorbar. default is (0.0, 0.5)
cb_panchor (tuple (float, float)) – proportionality anchor of the colorbar. default is (1.0, 0.5)
cb_label (str, Optional) – label of the colorbar.
cb_spacing (str, Optional) – spacing of the colorbar. default is uniform
cb_drawedges (bool,) – draw edges inside of the colorbar. default is False

Notes

This module works with numerical data, i.e. the data must contain numerical features only. If categorical values are included in the dataset, they should be removed, and the data shrunk accordingly, during the fit methods.

fit(X=None, y=None, **fit_params)[source]#

Fit data and populate the attributes for plotting purposes.

There is no conventional procedure for checking whether an instance is fitted. However, a class that is not fitted should raise watex.exceptions.NotFittedError when a method is called.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
data (Filepath or Dataframe of shape (M, N), from pandas.DataFrame) – Dataframe containing M samples and N features.
fit_params (dict) – Additional keyword arguments from watex.utils.coreutils._is_readable.

Returns:

``self`` – returns self for easy method chaining.

Return type:

EvalPlot instance
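
The fit contract described above (raise when unfitted, return self otherwise) can be sketched in a few lines. This is a minimal illustration, not the watex implementation: `TinyPlot` and the local `NotFittedError` are hypothetical stand-ins, and only the error message mirrors the `msg` attribute documented for ExPlot.

```python
class NotFittedError(Exception):
    """Local stand-in for watex.exceptions.NotFittedError (assumption)."""


class TinyPlot:
    """Hypothetical plot class, used only to illustrate the fit contract."""

    def __init__(self):
        self.X = None  # populated by fit
        self.y = None

    def fit(self, X=None, y=None):
        self.X, self.y = X, y
        return self  # returning self enables method chaining

    def _check_fitted(self):
        if self.X is None:
            raise NotFittedError(
                f"{self.__class__.__name__} instance is not fitted yet. "
                "Call 'fit' with appropriate arguments before using this method."
            )

    def plot(self):
        self._check_fitted()
        return self


# Calling a plot method before fit raises; after fit, calls can be chained.
try:
    TinyPlot().plot()
    raised = False
except NotFittedError:
    raised = True

chained = TinyPlot().fit([[1.0, 2.0]], [0]).plot()
```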

fit_transform(X, y=None, **fit_params)[source]#

Fit and transform at once.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

X – The transformed array or dataframe with numerical features

Return type:

NDArray | Dataframe, shape (M x N)

property inspect#: Inspect data and trigger plot after checking the data entry. Raises NotFittedError if ExPlot is not fitted yet.

plotConfusionMatrix(clf, *, kind=None, labels=None, matshow_kws=None, **conf_mx_kws)[source]#

Plot confusion matrix for error evaluation.

A representation of the confusion matrix for error visualization. If kind is set to map, the plot gives the number of confused instances/items. When kind is set to error, the number of confused items is expressed as a percentage.

Parameters:

clf (callable, always as a function, classifier estimator) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling some target types (binary, for instance). It must store a classes attribute after fitting.
labels (int, or list of int, optional) – Specific class to evaluate the tradeoff of precision and recall. label needs to be specified and must be a value within the target.
kind (str) – can be map or error to visualize the matshow of prediction and errors respectively.
matshow_kws (dict) – matplotlib additional keyword arguments.
conf_mx_kws (dict) – Additional confusion matrix keyword arguments.
ylabel (list) – list of label names to hold the name of each category.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget
>>> from watex.exlib.sklearn import SVC
>>> from watex.view.mlplot import EvalPlot
>>> X, y = fetch_data ('bagoue', return_X_y=True, as_frame =True)
>>> # partition the target into 4 clusters-> just for demo
>>> b= EvalPlot(scale =True, label_values = 4 )
>>> b.fit_transform (X, y)
>>> # prepare our estimator
>>> svc_clf = SVC(C=100, gamma=1e-2, kernel='rbf', random_state =42)
>>> matshow_kwargs ={
...     'aspect': 'auto', # 'auto' or 'equal'
...     'interpolation': None,
...     'cmap':'jet'}
>>> plot_kws ={'lw':3,
...     'lc':(.9, 0, .8),
...     'font_size':15.,
...     'cb_format':None,
...     'xlabel': 'Predicted classes',
...     'ylabel': 'Actual classes',
...     'font_weight':None,
...     'tp_labelbottom':False,
...     'tp_labeltop':True,
...     'tp_bottom': False
...     }
>>> b.plotConfusionMatrix(clf=svc_clf,
...                       matshow_kws = matshow_kwargs,
...                       **plot_kws)
>>> svc_clf = SVC(C=100, gamma=1e-2, kernel='rbf',
...                  random_state =42)
>>> # replace the integer identifiers with literal strings
>>> b.litteral_classes = ['FR0', 'FR1', 'FR2', 'FR3']
>>> b.plotConfusionMatrix(svc_clf, matshow_kws=matshow_kwargs,
...                       kind='error', **plot_kws)
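
To make the difference between the two kinds concrete, here is a small pure-Python sketch of the underlying arithmetic. It assumes that the error view divides each row by the number of items in the actual class and blanks the diagonal so that only the mistakes remain; that convention is an assumption for illustration and is not confirmed to be what plotConfusionMatrix does internally. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count matrix: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def error_rates(matrix):
    """Row-normalise the counts and zero the diagonal to keep errors only."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        rates.append([
            0.0 if (i == j or total == 0) else count / total
            for j, count in enumerate(row)
        ])
    return rates


# Hypothetical targets and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2, 2]
counts = confusion_counts(y_true, y_pred, labels=[0, 1, 2])
errors = error_rates(counts)
```

Here `counts` corresponds to the map view (raw numbers of confused items) and `errors` to the error view (share of each actual class that was misclassified).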

plotPCA(n_components=None, *, n_axes=None, biplot=False, pc1_label='Axis 1', pc2_label='Axis 2', plot_dict=None, **pca_kws)[source]#

Plot PCA component analysis using decomposition.

PCA identifies the axis that accounts for the largest amount of variance in the train set X. It also finds a second axis orthogonal to the first one, that accounts for the largest amount of remaining variance.

Parameters:

n_components – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), the variance ratio to preserve is 95%.
n_axes – Number of most important components for which to retrieve the variance ratio. Default is 2, i.e. the first two components with the highest variance ratio.
biplot (bool,) – biplot plots the PCA feature importance (pc1 and pc2) and visualizes the level of variance and the direction of components for different variables. Refer to Serafeim Loukas.
pc1_label (str, default ='Axis 1') – the first component with most variance, held in ‘Axis 1’. Can be set to any other axis, for instance ‘Axis 3’, to replace the component in ‘Axis 1’ with the one in Axis 3, and so on. This allows visualizing the position of each level of variance for each variable.
pc2_label (str, default ='Axis 2',) – the second component with most variance, held in ‘Axis 2’. Can be set to any other axis, for instance ‘Axis 6’, to replace the component in ‘Axis 2’ with the one in Axis 6, and so on.
plot_dict (dict,) – dictionary of font and marker properties for each sample corresponding to the label_values.
pca_kws (dict,) – additional keyword arguments passed to watex.analysis.dimensionality.nPCA

Returns:

``self`` – self for easy method chaining.

Return type:

EvalPlot instance

Notes

By default, the nPCA method plots the first two principal components, named pc1_label for axis 1 and pc2_label for axis 2. To plot the first component pc1 against the third component, set pc2_label to Axis 3 and set n_components to 3, which is the maximum number of reduced columns to retrieve; otherwise a user warning is displayed. The algorithm should automatically detect the digit 3 in a literal label that includes Axis (e.g. ‘Axis 3’) and treat it as the third component pc3. The same process applies to the other axes.
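
The label handling in this note can be sketched as follows. This is only an illustration under stated assumptions: `axis_index` and `resolve_axes` are hypothetical helpers, the regular expression and the fallback to the first two components are guesses at the behaviour, and the warning text is not the one watex emits.

```python
import re
import warnings


def axis_index(label, default):
    """Return the first integer found in an axis label, else ``default``."""
    match = re.search(r"\d+", str(label))
    return int(match.group()) if match else default


def resolve_axes(n_components, pc1_label="Axis 1", pc2_label="Axis 2"):
    """Map two axis labels to component numbers, falling back to (1, 2)."""
    first = axis_index(pc1_label, 1)
    second = axis_index(pc2_label, 2)
    if max(first, second) > n_components:
        warnings.warn(
            f"Axes {first} and {second} exceed n_components={n_components}; "
            "falling back to the first two components.",
            UserWarning,
        )
        return 1, 2
    return first, second


# Enough components: the digits in the labels are used as given.
ok_axes = resolve_axes(8, "Axis3", "axis4")

# Too few components: a UserWarning is issued and the defaults are used.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback_axes = resolve_axes(2, "Axis 3", "axis 4")
```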

Examples

>>> from watex.datasets import load_bagoue
>>> from watex.view.mlplot import EvalPlot
>>> X , y = load_bagoue(as_frame =True )
>>> b=EvalPlot(tname ='flow', encode_labels=True ,
                  scale = True )
>>> b.fit_transform (X, y)
>>> b.plotPCA (n_components= 2 )
...
>>> # pc1 and pc2 labels > n_components -> raises user warnings
>>> b.plotPCA (n_components= 2 , biplot=False, pc1_label='Axis 3',
               pc2_label='axis 4')
... UserWarning: Number of components and axes might be consistent;
    '2'and '4 are given; default two components are used.
>>> b.plotPCA (n_components= 8 , biplot=False, pc1_label='Axis3',
               pc2_label='axis4')
    # works fine since n_components are greater to the number of axes
... EvalPlot(tname= None, objective= None, scale= True, ... ,
             sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
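
When n_components is given as a ratio between 0 and 1, the number of retained components follows from the cumulative explained-variance ratio. The sketch below shows that selection rule on made-up numbers; `components_for_variance` is a hypothetical helper and the ratios are illustrative, not taken from the Bagoue dataset.

```python
from itertools import accumulate


def components_for_variance(explained_ratios, target=0.95):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches ``target``."""
    for count, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative >= target - 1e-12:  # tolerate float rounding
            return count
    return len(explained_ratios)


# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = [0.55, 0.25, 0.10, 0.06, 0.04]
n_95 = components_for_variance(ratios)        # default 95% target
n_80 = components_for_variance(ratios, 0.80)
```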

plotPR(clf, label, kind=None, method=None, cvp_kws=None, **prt_kws)[source]#

Precision/recall (PR) and tradeoff plots.

PR computes a score based on the decision function and plot the result as a score vs threshold.

Parameters:

clf (callable, always as a function, classifier estimator) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling some target types (binary, for instance). It must store a classes attribute after fitting.
label (int,) – Specific class to evaluate the tradeoff of precision and recall. label needs to be specified and must be a value within the target.
kind (str, ['threshold'|'recall'], default='threshold') – kind of PR plot. If kind is ‘recall’, the method plots the precision vs the recall scores; otherwise the PR tradeoff is plotted against the ‘threshold’.
method (str) – Method to get scores from each instance in the trainset. Could be decision_function or predict_proba. A scikit-learn classifier generally has one of these methods. Default is decision_function.
cvp_kws (dict, optional) – Additional keyword arguments of sklearn.model_selection.cross_val_predict()
prt_kws (dict,) – Additional keyword arguments passed to watex.exlib.sklearn.precision_recall_tradeoff

Examples

>>> from watex.exlib.sklearn import SGDClassifier
>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils import cattarget
>>> from watex.view.mlplot import EvalPlot
>>> X , y = load_bagoue(as_frame =True )
>>> sgd_clf = SGDClassifier(random_state= 42) # our estimator
>>> b= EvalPlot(scale = True , encode_labels=True)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels= 2 ) # can also use labels =[0, 1]
>>> b.y = ybin
>>> # plot the Precision-recall tradeoff
>>> b.plotPR(sgd_clf , label =1) # class=1
... EvalPlot(tname= None, objective= None, scale= True, ... ,
sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
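
The tradeoff that plotPR visualizes can be reproduced by hand on a toy example. The sketch below computes precision and recall for a given decision threshold in pure Python; the scores and targets are made up, and `precision_recall_at` is a hypothetical helper rather than part of watex or scikit-learn.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
    fp = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
    fn = sum(1 for p, t in zip(predicted, y_true) if not p and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical binary targets and decision scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [-2.0, -0.5, 0.1, 0.8, 1.2, 2.5]

low = precision_recall_at(y_true, scores, threshold=0.0)
high = precision_recall_at(y_true, scores, threshold=2.0)
```

Raising the threshold from 0.0 to 2.0 increases precision and lowers recall, which is the tradeoff plotted against the threshold.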

plotROC(clfs, label, method=None, cvp_kws=None, **roc_kws)[source]#

Plot receiver operating characteristic (ROC) curves of classifiers.

Can plot multiple classifiers at once. If multiple classifiers are given, each classifier must be a tuple of (<name>, <classifier>, <method>). For instance, to plot both the sklearn.ensemble.RandomForestClassifier and sklearn.linear_model.SGDClassifier classifiers, they must be arranged as follows:

clfs =[
    ('sgd', SGDClassifier(), "decision_function" ),
    ('forest', RandomForestClassifier(), "predict_proba")
    ]

It is important to check that the method ‘predict_proba’ is valid for each scikit-learn classifier whose ROC curve is to be plotted.
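
One way to perform that check before building the clfs list is to look for the scoring method on the estimator itself. The sketch below is an assumption about how one might do it, not a description of what plotROC does: `pick_score_method`, `MarginModel` and `ProbaModel` are hypothetical names, and the two classes only mimic the method names of real estimators.

```python
def pick_score_method(clf, preferred="decision_function"):
    """Return the name of a scoring method that ``clf`` actually exposes."""
    for name in (preferred, "decision_function", "predict_proba"):
        if callable(getattr(clf, name, None)):
            return name
    raise AttributeError(
        f"{type(clf).__name__} has neither decision_function nor predict_proba"
    )


# Hypothetical stand-ins for estimators, exposing one method each.
class MarginModel:
    def decision_function(self, X):
        return [0.0 for _ in X]


class ProbaModel:
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


clfs = [
    (name, model, pick_score_method(model))
    for name, model in [("margin", MarginModel()), ("proba", ProbaModel())]
]
methods = [method for _, _, method in clfs]
```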

Parameters:

clfs (callables, always as a function, classifier estimators) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling some target types (binary, for instance). It must store a classes attribute after fitting.
label (int,) – Specific class to evaluate the tradeoff of precision and recall. label needs to be specified and must be a value within the target.
kind (str, ['threshold'|'recall'], default='threshold') – kind of PR plot. If kind is ‘recall’, the method plots the precision vs the recall scores; otherwise the PR tradeoff is plotted against the ‘threshold’.
method (str) – Method to get scores from each instance in the trainset. Could be decision_function or predict_proba. A scikit-learn classifier generally has one of these methods. Default is decision_function.
cvp_kws (dict, optional) – Additional keyword arguments of sklearn.model_selection.cross_val_predict()
prt_kws (dict,) – Additional keyword arguments passed to watex.exlib.sklearn.precision_recall_tradeoff
roc_kws (dict) – roc_curve additional keywords arguments.

Returns:

``self`` – self for easy method chaining.

Return type:

EvalPlot instance

Examples

(1)-> Plot ROC for single classifier

>>> from watex.exlib.sklearn import ( SGDClassifier,
                                     RandomForestClassifier
                                     )
>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils import cattarget
>>> from watex.view.mlplot import EvalPlot
>>> X , y = load_bagoue(as_frame =True )
>>> sgd_clf = SGDClassifier(random_state= 42) # our estimator
>>> b= EvalPlot(scale = True , encode_labels=True)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels= 2 ) # can also use labels =[0, 1]
>>> b.y = ybin
>>> # plot the ROC
>>> b.plotROC(sgd_clf , label =1) # class=1
... EvalPlot(tname= None, objective= None, scale= True, ... ,
             sns_height= 4.0, sns_aspect= 0.7, verbose= 0)

(2)-> Plot ROC for multiple classifiers

>>> b= EvalPlot(scale = True , encode_labels=True,
                lw =3., lc=(.9, 0, .8), font_size=7 )
>>> sgd_clf = SGDClassifier(random_state= 42)
>>> forest_clf =RandomForestClassifier(random_state=42)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels= 2 ) # can also use labels =[0, 1]
>>> b.y = ybin
>>> clfs =[('sgd', sgd_clf, "decision_function" ),
       ('forest', forest_clf, "predict_proba")]
>>> b.plotROC (clfs =clfs , label =1 )
... EvalPlot(tname= None, objective= None, scale= True, ... ,
             sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
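
For readers who want to see what an ROC curve is made of, the following pure-Python sketch builds the (false positive rate, true positive rate) points by sweeping the threshold over the scores, then integrates them with the trapezoidal rule. It assumes distinct scores and at least one sample of each class; the data and the helper names `roc_points` and `auc` are illustrative only.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest to the lowest score.
    for _, label in sorted(zip(scores, y_true), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical binary targets and decision scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [-2.0, -0.5, 0.1, 0.8, 1.2, 2.5]
points = roc_points(y_true, scores)
area = auc(points)
```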

save(fig)[source]#: savefigure if figure properties are given.

transform(X, **t_params)[source]#

Transform the data and impute the numerical features.

transform is not convenient if the user wants to keep categorical values in the array.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
t_params (dict,) – Keyword arguments passed to sklearn.impute.SimpleImputer for imputing the missing data (default strategy is ‘most_frequent’), or keyword arguments passed to watex.utils.funcutils.to_numeric_dtypes.

Returns:

X – The transformed array or dataframe with numerical features

Return type:

NDArray |Dataframe , shape (M x N )
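
The ‘most_frequent’ strategy mentioned for t_params simply fills each missing entry with the most common observed value of its column. The sketch below illustrates the idea on a single column in pure Python; `impute_most_frequent` is a hypothetical helper, and ties are resolved here by first occurrence, which may differ from the library's behaviour.

```python
from collections import Counter


def impute_most_frequent(column, missing=None):
    """Replace ``missing`` entries with the most frequent observed value."""
    observed = [value for value in column if value is not missing]
    if not observed:
        return list(column)  # nothing to learn from
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if value is missing else value for value in column]


filled = impute_most_frequent([3.0, None, 3.0, 7.5, None])
```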

class watex.ExPlot(tname=None, inplace=False, **kws)[source]#

Bases: BasePlot

Exploratory plot for data analysis

ExPlot is a shadow class. Exploring the data is needed before creating a model, since it gives a feel for the data and is also a great excuse to meet and discuss issues with the business units that control the data. ExPlot methods return an instantiated object that inherits from the watex.property.Baseplots ABC (Abstract Base Class) for visualization.

Parameters:

savefig (str, Path-like object,) – savefigure’s name, default is None
fig_dpi (float,) – dots-per-inch resolution of the figure. default is 300
fig_num (int,) – size of figure in inches (width, height). default is [5, 5]
fig_size (Tuple (int, int) or inch) – size of figure in inches (width, height). default is [5, 5]
fig_orientation (str,) – figure orientation. default is landscape
fig_tile (str,) – figure title. default is None
fs (float,) – size of font of axis tick labels, axis labels are fs+2. default is 6
ls (str,) – line style, it can be [ ‘-’ | ‘.’ | ‘:’ ] . default is ‘-’
lc (str, Optional,) – line color of the plot, default is k
lw (float, Optional,) – line weight of the plot, default is 1.5
alpha (float between 0 < alpha < 1,) – transparency number, default is 0.5,
font_weight (str, Optional) – weight of the font , default is bold.
font_style (str, Optional) – style of the font. default is italic
font_size (float, Optional) – size of font in inches (width, height). default is 3.
ms (float, Optional) – size of marker in points. default is 5
marker (str, Optional) – marker of stations default is o.
marker_style (str, Optional) – facecolor of the marker. default is yellow
marker_edgecolor (str, Optional) – edgecolor of the marker. default is yellow
marker_edgewidth (float, Optional) – width of the marker. default is 3.
xminorticks (float, Optional) – minor ticks according to x-axis size. default is 1.
yminorticks (float, Optional) – minor ticks according to y-axis size. default is 1.
bins – histogram element separation between two bars. default is 10.
xlim (tuple (int, int), Optional) – limit of x-axis in plot.
ylim (tuple (int, int), Optional) – limit of y-axis in plot.
xlabel (str, Optional,) – label name of x-axis in plot.
ylabel (str, Optional,) – label name of y-axis in plot.
rotate_xlabel (float, Optional) – angle to rotate xlabel in plot.
rotate_ylabel (float, Optional) – angle to rotate ylabel in plot.
leg_kws (dict, Optional) – keyword arguments of legend. default is empty dict
plt_kws (dict, Optional) – keyword arguments of plot. default is empty dict
glc (str, Optional) – line color of the grid plot, default is k
glw (float, Optional) – line weight of the grid plot, default is 2
galpha (float, Optional,) – transparency number of grid, default is 0.5
gaxis (str ('x', 'y', 'both')) – type of axis to hold the grid, default is both
gwhich (str, Optional) – kind of grid in the plot. default is major
tp_axis (bool,) – axis to apply the ticks params. default is both
tp_labelsize (str, Optional) – labelsize of ticks params. default is italic
tp_bottom (bool,) – position at bottom of ticks params. default is True.
tp_labelbottom (bool,) – put label on the bottom of the ticks. default is False
tp_labeltop (bool,) – put label on the top of the ticks. default is True
cb_orientation (str , ('vertical', 'horizontal')) – orientation of the colorbar, default is vertical
cb_aspect (float, Optional) – aspect of the colorbar. default is 20.
cb_shrink (float, Optional) – shrink size of the colorbar. default is 1.0
cb_pad (float,) – pad of the colorbar of plot. default is .05
cb_anchor (tuple (float, float)) – anchor of the colorbar. default is (0.0, 0.5)
cb_panchor (tuple (float, float)) – proportionality anchor of the colorbar. default is (1.0, 0.5)
cb_label (str, Optional) – label of the colorbar.
cb_spacing (str, Optional) – spacing of the colorbar. default is uniform
cb_drawedges (bool,) – draw edges inside of the colorbar. default is False
sns_orient ('v' | 'h', optional) – Orientation of the plot (vertical or horizontal). This is usually inferred based on the type of the input variables, but it can be used to resolve ambiguity when both x and y are numeric or when plotting wide-form data. default is v which refer to ‘vertical’
sns_style (dict, or one of {darkgrid, whitegrid, dark, white, ticks}) – A dictionary of parameters or the name of a preconfigured style.
sns_palette (seaborn color palette | matplotlib colormap | hls | husl) – Palette definition. Should be something color_palette() can process. The palette generates the points with different colors
sns_height (float,) – Proportion of axes extent covered by each rug element. Can be negative. default is 4.
sns_aspect (scalar (float, int)) – Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches. default is .7

Returns:

self – returns self for easy method chaining.

Return type:

Baseclass instance

Examples

>>> import pandas as pd
>>> from watex.view import ExPlot
>>> data = pd.read_csv ('data/geodata/main.bagciv.data.csv' )
>>> ExPlot(fig_size = (12, 4)).fit(data).plotmissing(kind ='corr')
... <watex.view.plot.ExPlot at 0x21162a975e0>

fit(data, **fit_params)[source]#

Fit data and populate the arguments for plotting purposes.

There is no conventional procedure for checking whether an instance is fitted. However, a class that is not fitted should raise exceptions.NotFittedError when a method is called.

Parameters:

data (Filepath or Dataframe of shape (M, N), from pandas.DataFrame) – Dataframe containing M samples and N features.
fit_params (dict) – Additional keyword arguments for reading the data when it is given as a path-like object, passed from watex.utils.coreutils._is_readable.

Returns:

``self`` – returns self for easy method chaining.

Return type:

Plot instance

property inspect#: Inspect data and trigger plot after checking the data entry. Raises NotFittedError if ExPlot is not fitted yet.

msg = "{expobj.__class__.__name__} instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."#

plotbv(xname=None, yname=None, kind='box', **kwd)[source]#

Visualize distributions using the box, boxen or violin plots.

Parameters:

xname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
yname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
kind (str) – style of the plot. Can be [‘box’|’boxen’|’violin’]. default is box
kwd (dict,) – Other keyword arguments are passed down to seaborn.boxplot .

Returns:

``self`` – returns self for easy method chaining.

Return type:

ExPlot instance

Example

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data = fetch_data ('bagoue original').get('data=dfy1')
>>> p= ExPlot(tname='flow').fit(data)
>>> p.plotbv(xname='flow', yname='sfi', kind='violin')

plotcutcomparison(xname=None, yname=None, q=10, bins=3, cmap='viridis', duplicates='drop', **kws)[source]#

Compare the cut or q quantiles values of ordinal categories.

It simulates the binning of ‘xname’ into q quantiles and of ‘yname’ into bins. The plot is normalized so that it fills all the vertical area, which makes the proportions within each quantile easy to compare.

Parameters:

xname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
yname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
q (int or list-like of float) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
bins (int, sequence of scalars, or IntervalIndex) –
The criteria to bin by.
- int: Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
- sequence of scalars: Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
- IntervalIndex: Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.
labels (array or False, default None) – Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.
cmap (str, color or list of color, optional) – The matplotlib colormap of the bar faces.
duplicates ({'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques. default is ‘drop’
kws (dict,) – Other keyword arguments are passed down to pandas.qcut .

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data = fetch_data ('bagoue original').get('data=dfy1')
>>> p= ExPlot(tname='flow').fit(data)
>>> p.plotcutcomparison(xname ='sfi', yname='ohmS')
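
The two binning schemes that this method combines (q quantiles for xname, equal-width bins for yname) behave quite differently on skewed data. The pure-Python sketch below contrasts them under simplifying assumptions: intervals are closed on the left, whereas pandas closes them on the right by default, the values must not all be equal, and `equal_width_bins` and `quantile_bins` are hypothetical helpers rather than the pandas functions.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_width_bins(values, bins):
    """Index of the equal-width bin (0 .. bins-1) each value falls into."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes values are not all equal
    return [min(int((v - low) / width), bins - 1) for v in values]


def quantile_bins(values, q):
    """Index of the quantile bin (0 .. q-1) each value falls into."""
    cuts = quantiles(values, n=q, method="inclusive")  # q - 1 cut points
    return [min(bisect_right(cuts, v), q - 1) for v in values]


# A skewed, made-up sample: one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 100]
by_width = equal_width_bins(values, bins=3)
by_quantile = quantile_bins(values, q=4)
```

With the outlier present, equal-width binning puts almost everything in the first bin, while quantile binning keeps the groups balanced.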

plothist(xname=None, *, kind='hist', **kws)[source]#

A histogram visualization of numerical data.

Parameters:

xname (str , xlabel) – feature name in the dataframe and is the label on x-axis. Raises an error , if it does not exist in the dataframe
kind (str) – Mode of pandas series plotting. the default is hist.
kws (dict,) – additional keyword arguments from pandas.DataFrame.plot

Returns:

``self`` – returns self for easy method chaining.

Return type:

ExPlot instance

plothistvstarget(xname, c=None, *, posilabel=None, neglabel=None, kind='binarize', **kws)[source]#

A histogram of a continuous feature plotted against a binary target.

Parameters:

xname (str,) – the column name to consider on the x-axis. Should be an item in the dataframe columns. Raises an error if the element does not exist.
c (str or int) – the class value in y to consider. Raise an error if not in y. value c can be considered as the binary positive class
posilabel (str, Optional) – the label of c considered as the positive class
neglabel (str, Optional) – the label of other classes (categories) except c considered as the negative class
kind (str, Optional, (default, 'binarize')) – the kind of plot features against target. binarize considers plotting the positive class (‘c’) vs negative class (‘not c’)
kws (dict,) – Additional keyword arguments of seaborn displot

Returns:

``self`` – returns self for easy method chaining.

Return type:

ExPlot instance

Examples

>>> from watex.utils import read_data
>>> from watex.view import ExPlot
>>> data = read_data  ( 'data/geodata/main.bagciv.data.csv' )
>>> p = ExPlot(tname ='flow').fit(data)
>>> p.fig_size = (7, 5)
>>> p.savefig ='bbox.png'
>>> p.plothistvstarget (xname= 'sfi', c = 0, kind = 'binarize',  kde=True,
                  posilabel='dried borehole (m3/h)',
                  neglabel = 'accept. boreholes'
                  )
Out[95]: <'ExPlot':xname='sfi', yname=None , tname='flow'>
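
The binarize kind reduces a multi-class target to class c versus everything else before the feature is plotted. The pure-Python sketch below shows that split; the helper names and the sample values are hypothetical and only meant to illustrate the idea.

```python
def binarize_target(y, c):
    """Map the class ``c`` to 1 (positive) and every other class to 0."""
    if c not in set(y):
        raise ValueError(f"class {c!r} is not present in the target")
    return [1 if label == c else 0 for label in y]


def split_by_class(values, y, c):
    """Feature values of the positive class (``c``) and of all other classes."""
    flags = binarize_target(y, c)
    positive = [v for v, flag in zip(values, flags) if flag == 1]
    negative = [v for v, flag in zip(values, flags) if flag == 0]
    return positive, negative


# Made-up flow classes and feature values.
flow = [0, 2, 1, 0, 3, 0]
sfi = [1.4, 0.6, 0.9, 1.1, 0.3, 1.8]
dry, accepted = split_by_class(sfi, flow, c=0)
```

Each of the two resulting groups would then get its own histogram, labelled with posilabel and neglabel respectively.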

plotjoint(xname, yname=None, corr='pearson', kind='scatter', pkg='sns', yb_kws=None, **kws)[source]#

A fancier scatterplot, called a jointplot, that includes histograms on the edges as well as a regression line.

Parameters:

xname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
yname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
pkg (str, Optional,) – kind or library to use for visualization. can be [‘sns’|’yb’] for ‘seaborn’ or ‘yellowbrick’. default is sns.
kind (str in {'scatter', 'hex'}, default: 'scatter') – The type of plot to render in the joint axes. Note that when kind=’hex’ the target cannot be plotted by color.
corr (str, default: 'pearson') – The algorithm used to compute the relationship between the variables in the joint plot, one of: ‘pearson’, ‘covariance’, ‘spearman’, ‘kendalltau’.
yb_kws (dict,) – Additional keywords arguments from yellowbrick.JointPlotVisualizer
kws (dict,) – Other keyword arguments are passed down to seaborn.jointplot.

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Notes

When using the yellowbrick library, the arrays, i.e. the (x, y) variables in the columns as well as the target array, must not contain inf or NaN values. A ValueError is raised if that is the case.
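
A quick pre-check along the lines of this note can save a failed plotting call. The sketch below scans the columns for NaN or inf values in pure Python; `non_finite_rows` and `check_finite` are hypothetical helpers and the data are made up.

```python
import math


def non_finite_rows(*columns):
    """Row positions where any column holds NaN or +/-inf."""
    return [
        row
        for row, values in enumerate(zip(*columns))
        if not all(math.isfinite(value) for value in values)
    ]


def check_finite(*columns):
    """Raise ValueError if any value is NaN or inf, as the note describes."""
    bad = non_finite_rows(*columns)
    if bad:
        raise ValueError(f"NaN or inf values found at row(s) {bad}")


x = [1.0, 2.0, float("nan"), 4.0]
y = [0.5, float("inf"), 1.5, 2.0]
target = [0, 1, 0, 1]
bad_rows = non_finite_rows(x, y, target)
```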

plotmissing(*, kind=None, sample=None, **kwd)[source]#

Visualize patterns in the missing data.

Parameters:

data (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
kind (str, Optional) –
kind of visualization. Can be dendrogram, mbar or bar for the dendrogram, msno bar and plt visualizations respectively:
- bar: counts the nonmissing data using pandas.
- mbar: uses the msno package to count the number of nonmissing data.
- dendrogram: shows the clusterings of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms indicate how different the clusters are; short arms mean that the branches are similar.
- corr: creates a heat map showing whether there are correlations in where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- mpatterns: the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error
sample (int, Optional) – Number of rows to visualize. This is useful when the data is composed of many rows. Shrinking the data to keep some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kws (dict) – Additional keyword arguments of the msno.matrix plot.

Returns:

``self`` – returns self for easy method chaining.

Return type:

ExPlot instance

Example

>>> import pandas as pd
>>> from watex.view import ExPlot
>>> data = pd.read_csv ('data/geodata/main.bagciv.data.csv' )
>>> p = ExPlot().fit(data)
>>> p.fig_size = (12, 4)
>>> p.plotmissing(kind ='corr')
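
The quantities behind the bar and corr views can be approximated by hand. The sketch below counts the non-missing entries per column and measures how often two columns are missing together; the agreement score is a crude stand-in chosen for illustration, not the nullity correlation that msno computes, and the table and helper names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nonmissing_counts(table):
    """Number of non-missing entries per column of a dict-of-lists table."""
    return {
        name: sum(1 for value in column if not is_missing(value))
        for name, column in table.items()
    }


def nullity_agreement(col_a, col_b):
    """Fraction of rows where both columns are missing or both are present."""
    same = sum(is_missing(a) == is_missing(b) for a, b in zip(col_a, col_b))
    return same / len(col_a)


# Hypothetical table; the column names are illustrative only.
table = {
    "sfi": [1.2, None, 0.7, float("nan")],
    "ohmS": [310.0, None, 155.0, 98.0],
    "flow": [0, 1, 2, 1],
}
counts = nonmissing_counts(table)
agreement = nullity_agreement(table["sfi"], table["ohmS"])
```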

plotpairgrid(xname=None, yname=None, vars=None, **kwd)[source]#

Create a pair grid.

A pair grid is a matrix of columns and kernel density estimations. To color by a column from a dataframe, use the ‘hue’ parameter.

Parameters:

xname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
yname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
vars (list, str) – list of items in the dataframe columns. Raises an error if the items don’t exist in the dataframe columns.
kws (dict,) – Other keyword arguments are passed down to seaborn.joinplot .

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Example

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data = fetch_data ('bagoue original').get('data=dfy1')
>>> p= ExPlot(tname='flow').fit(data)
>>> p.plotpairgrid (vars = ['magnitude', 'power', 'ohmS'] )
... <'ExPlot':xname=(None,), yname=None , tname='flow'>

plotpairwisecomparison(corr='pearson', pkg='sns', **kws)[source]#

Create pairwise comparisons between features.

The plot shows a [‘pearson’|’spearman’|’covariance’] correlation.

Parameters:

corr (str, ['pearson'|'spearman'|'covariance']) – Method of correlation to perform. Note that ‘pearson’ and ‘covariance’ do not support string values. If the data contains such values, set corr to spearman. Default is pearson.
pkg (str, Optional,) – Kind of library to use for visualization. Can be [‘sns’|’yb’] for ‘seaborn’ or ‘yellowbrick’ respectively. Default is sns.
kws (dict,) – Additional keyword arguments are passed down to yellowbrick.Rank2D and seaborn.heatmap.

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Example

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data = fetch_data ('bagoue original').get('data=dfy1')
>>> p= ExPlot(tname='flow').fit(data)
>>> p.plotpairwisecomparison(fmt='.2f', corr='spearman', pkg ='yb',
                             annot=True,
                             cmap='RdBu_r',
                             vmin=-1,
                             vmax=1 )
... <'ExPlot':xname='sfi', yname='ohmS' , tname='flow'>
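
The difference between the pearson and spearman options can be seen on a small example. This is a sketch with plain pandas on invented data, not the watex plotting code: Pearson measures linear association, while Spearman works on ranks and therefore reaches 1 for any strictly increasing relationship.

```python
import pandas as pd

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# y grows monotonically with x, but not linearly.
df = pd.DataFrame({"x": x, "y": [v ** 3 for v in x]})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```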

plotparallelcoords(classes=None, pkg='pd', rxlabel=45, **kwd)[source]#

Use parallel coordinates on multivariate data for clustering visualization.

Parameters:

classes (list, default: None) –
A list of class names for the legend: the class labels for each class in y, ordered by sorted class index. These names act as a label encoder for the legend, identifying integer classes or renaming string labels. If omitted, the class labels will be taken from the unique values in y.

Note that the length of this list must match the number of unique values in y, otherwise an exception is raised.
pkg (str, Optional,) – Kind of library to use for visualization. Can be [‘yb’|’pd’] for ‘yellowbrick’ or ‘pandas’ respectively. Default is pd.
rxlabel (int, default is 45) – Rotation of the x labels when pkg is set to pd.
kws (dict,) – Additional keyword arguments are passed down to yellowbrick.ParallelCoordinates and pandas.plotting.parallel_coordinates().

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data =fetch_data('original data').get('data=dfy1')
>>> p = ExPlot (tname ='flow').fit(data)
>>> p.plotparallelcoords(pkg='yb')
... <'ExPlot':xname=None, yname=None , tname='flow'>

plotradviz(classes=None, pkg='pd', **kwd)[source]#

Plot each sample on a circle or square, with the features on the circumference, to visualize the separation between targets.

Values are normalized and each figure has a spring that pulls samples to it based on the value.

Parameters:

classes (list of int | float, [categorized classes]) – Must be values in the target. The specified classes must match the number of unique values in the target, otherwise an error occurs. The default behaviour, i.e. None, detects all classes from the unique values in the target.
pkg (str, Optional,) – Kind of library to use for visualization. Can be [‘yb’|’pd’] for ’yellowbrick’ or ‘pandas’ respectively. Default is pd.
kws (dict,) – Additional keyword arguments are passed down to yellowbrick.RadViz and pandas.plotting.radviz().

Returns:

``self``

Return type:

ExPlot instance and returns self for easy method chaining.

Examples

(1) Using yellowbrick RadViz

>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data0 = fetch_data('bagoue original').get('data=dfy1')
>>> p = ExPlot(tname ='flow').fit(data0)
>>> p.plotradviz(classes= [0, 1, 2, 3] ) # can set to None

(2) Using pandas radviz plot

>>> # use pandas with
>>> data2 = fetch_data('bagoue original').get('data=dfy2')
>>> p = ExPlot(tname ='flow').fit(data2)
>>> p.plotradviz(classes= None, pkg='pd' )
... <'ExPlot':xname=None, yname=None , tname='flow'>
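
The spring analogy can be made concrete with a few lines of NumPy. The sketch below is an illustration of the usual RadViz projection, not code taken from watex, yellowbrick or pandas; radviz_points is a hypothetical helper name. Each feature gets an anchor on the unit circle, features are min-max normalized, and each sample is placed at the average of the anchors weighted by its normalized values.

```python
import numpy as np

def radviz_points(X):
    """Project samples (rows of X) into the unit circle, RadViz style."""
    X = np.asarray(X, dtype=float)
    # Min-max normalize each feature to [0, 1]; constant columns map to 0.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span
    # One anchor per feature, evenly spaced on the unit circle.
    n_features = X.shape[1]
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each sample is the weighted average of the anchors ("springs").
    weights = Xn.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0  # an all-zero sample stays at the center
    return (Xn @ anchors) / weights, anchors

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
points, anchors = radviz_points(X)
print(points.round(3))
```

A sample dominated by one feature lands on that feature's anchor, while a sample pulled equally by all features sits at the center.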

plotscatter(xname=None, yname=None, c=None, s=None, **kwd)[source]#

Shows the relationship between two numeric columns.

Parameters:

xname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
yname (vectors or keys in data) – Variables that specify positions on the x and y axes. Both are the column names to consider. Should be items in the dataframe columns. Raises an error if the elements do not exist.
c (str, int or array_like, Optional) –
The color of each point. Possible values are:
- A single color string referred to by name, RGB or RGBA code,
  for instance ‘red’ or ‘#a98d19’.
- A sequence of color strings referred to by name, RGB or RGBA
  code, which will be used for each point’s color recursively. For instance, with [‘green’, ’yellow’] all points will be filled in green or yellow, alternately.
- A column name or position whose values will be used to color
  the marker points according to a colormap.
s (scalar or array_like, Optional,) –
The size of each point. Possible values are:
- A single scalar so all points have the same size.
- A sequence of scalars, which will be used for each point’s
  size recursively. For instance, when passing [2, 14] all point sizes will be either 2 or 14, alternately.
kwd (dict,) – Other keyword arguments are passed down to seaborn.scatterplot .

Returns:

``self`` – returns self for easy method chaining.

Return type:

ExPlot instance

Example

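>>> from watex.datasets import fetch_data
>>> from watex.view import ExPlot
>>> data = fetch_data ('bagoue original').get('data=dfy1')
>>> p = ExPlot(tname='flow').fit(data).plotscatter (
    xname ='sfi', yname='ohmS')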
>>> p
...  <'ExPlot':xname='sfi', yname='ohmS' , tname='flow'>
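
The "alternately" behaviour described for sequences passed to c and s amounts to cycling through the sequence until every point has a value. The sketch below only illustrates that idea with the standard library; cycle_values is a hypothetical helper and is not how watex or seaborn implement it.

```python
from itertools import cycle, islice

def cycle_values(values, n_points):
    """Repeat `values` in order until `n_points` entries are produced."""
    return list(islice(cycle(values), n_points))

colors = cycle_values(["green", "yellow"], 5)
sizes = cycle_values([2, 14], 5)

print(colors)
print(sizes)
```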

References

Scatterplot: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
Pd.scatter plot: https://www.w3resource.com/pandas/dataframe/dataframe-plot-scatter.php

save(fig)[source]#: Save the figure if figure properties are given.

class watex.FeatureInspection(tname='flow', mapflow=True, sanitize=False, flow_classes=[0.0, 1.0, 3.0], set_index=False, col_name=None, **kws)[source]#

Bases: object

Summarizes the flow features.

It deals with the categorization of data features. When numerical values are provided, a standard qualitative or quantitative analysis is performed.

Parameters:

data (str or pd.core.DataFrame) – Path-like object or pandas DataFrame. Must contain the main parameters including the target.
tname (str) – The name of the target for prediction purposes. Here, for groundwater exploration, the target is named flow.
flow_classes (list or array_like) – The way to classify the flow. Provide the main specific values used to convert the categorical trends to numerical values. Different projects target different flow rates; the classes might be specified for village hydraulic, improved village hydraulic or urban hydraulic.
drop_columns (list) – Items to drop. To analyse the data, some specific columns can be dropped so that they do not corrupt the analysis. In a dataframe collected straightforwardly from GeoFeatures, the default drop_columns refers to the coordinate positions: [‘east’, ‘north’].
mapflow (bool,) –
If set to True, the values in the target column are mapped to categorical values. Commonly the flow rate values are given as a trend of numerical values. For classification purposes, the flow rate must be converted to categorical values, which mainly refer to the type of hydraulic system. The type of hydraulic system is in turn mostly tied to the number of people living in a specific area. For instance, flow classes can be ranged as follows:
- FR = 0 for dry boreholes
- 0 < FR ≤ 3 m3/h for village hydraulic (≤2000 inhabitants)
- 3 < FR ≤ 6 m3/h for improved village hydraulic (>2000-20 000 inhabitants)
- 6 < FR ≤ 10 m3/h for urban hydraulic (>200 000 inhabitants).
Note that this flow range is not exhaustive and can be modified according to the type of hydraulic system required by the project.
set_index (bool,) – Consider a column as the dataframe index. If set to True, please provide the col_name; otherwise the id column is used as the default.
sanitize (bool,) – Polish the data and remove inconsistent columns that do not refer to the predicting features. For instance, it changes the French word for water, eau, to ‘water’, which relates to the values of the water inflow feature lwi. This can be useful when the data is given as a path-like object and the features are not described correctly for the groundwater case. Default is False.

Examples

>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection().fit(data)
>>> fobj.data_.columns
Out[117]:
Index(['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type',
       'sfi', 'ohmS', 'lwi', 'geol', 'flow'],
      dtype='object')

property cache#: Generate the cache df_ of all eliminated features and keep it in a new pd.core.frame.DataFrame.

property data#: Check the extension of the provided feature file. Useful for selecting how the pd.DataFrame is constructed.

fit(data)[source]#

The main goal of this method is to fit and classify the different flow classes in the dataset. By default, four (04) flow classes are considered according to the reference below.

Parameters: data (str or pd.core.DataFrame) – Path-like object or pandas DataFrame. Must contain the main parameters including the target name.
Returns: object
Return type: FeatureInspection object

Examples

>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection()
>>> fobj.fit(data)
>>> fobj.data.iloc[1:3 , :]
...    num name  power  magnitude  ...         ohmS        lwi      geol  flow
1    2   b2   70.0      142.0  ...  1135.551531  21.406531  GRANITES   FR1
2    3   b3   80.0       87.0  ...   767.562500   0.000000  GRANITES   FR1

Notes

The paper mentions four types of hydraulic systems according to the population demand and the number of inhabitants. The hydraulic systems are defined as:

FR = 0 for dry boreholes

0 < FR ≤ 3 m3/h for village hydraulic (≤2000 inhabitants)

3 < FR ≤ 6 m3/h for improved village hydraulic (>2000-20 000 inhabitants)

6 < FR ≤ 10 m3/h for urban hydraulic (>200 000 inhabitants).

The flow classes can be modified according to the type of hydraulic system proposed for the project.
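
The class boundaries listed above can be turned into a small mapping function. This is a minimal sketch under stated assumptions, not the FeatureInspection implementation: classify_flow is a hypothetical helper, the FR0 to FR3 label names are assumed from the 'FR1' values shown in the example output, and rates above the last threshold are simply placed in the last class.

```python
from bisect import bisect_left

def classify_flow(flow_rate, thresholds=(0.0, 3.0, 6.0), prefix="FR"):
    """Map a flow rate in m3/h to a class label.

    With the default thresholds: FR = 0 -> FR0, 0 < FR <= 3 -> FR1,
    3 < FR <= 6 -> FR2 and FR > 6 -> FR3.
    """
    if flow_rate < 0:
        raise ValueError("flow rate must be non-negative")
    # bisect_left counts the thresholds strictly below flow_rate,
    # which gives right-closed intervals (a, b].
    return f"{prefix}{bisect_left(thresholds, flow_rate)}"

labels = [classify_flow(fr) for fr in (0.0, 2.5, 3.0, 4.0, 6.0, 7.0)]
print(labels)
```

Passing a different thresholds tuple adapts the mapping to another project's definition of its hydraulic classes.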

References

[1]

CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.

property flow_classes#

writedf(df=None, refout=None, to=None, savepath=None, modname='_anEX_', reset_index=False)[source]#

Write the analysis df.

Refer to watex.decorators.exportdf() for more details about the arguments refout, to, savepath, modname and reset_index.

Example:

>>> from watex.analysis.bases.features import FeatureInspection
>>> slObj =FeatureInspection(
...   data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...   set_index =True)
>>> slObj.writedf()

class watex.GridSearch(base_estimator, grid_params, cv=4, kind='GridSearchCV', scoring='nmse', verbose=0, **grid_kws)[source]#

Bases: object

Fine-tune hyperparameters using grid search methods.

Grid search fiddles with the hyperparameters until it finds the best combination of values.

Parameters:

base_estimator (Callable,) – estimator for trainset and label evaluating; something like a class that implements a fit method. Refer to https://scikit-learn.org/stable/modules/classes.html

grid_params (list of dict,) –

List of hyperparameters to be fine-tuned. For instance:

param_grid=[dict(
    kpca__gamma=np.linspace(0.03, 0.05, 10),
    kpca__kernel=["rbf", "sigmoid"]
    )]

pipeline (Callable or Pipeline object) – If a pipeline is given, X is transformed accordingly; otherwise the evaluation is made using purely the base estimator with the given X.
prefit (bool, default=False,) – If False, does not need to compute the cross validation score once again and True otherwise.

cv (int, cross-validation splitter or iterable, default=4) –

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within Scikit-learn.
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross validation at all is an option), the default is 4-fold.

kind (str, default='GridSearchCV' or '1') – Kind of grid parameter searches. Can be 1 for GridSearchCV or 2 for RandomizedSearchCV.
scoring (str,) – Specifies the score function to be maximized (usually by cross validation), or – in some cases – multiple score functions to be reported. The score function can be a string accepted by sklearn.metrics.get_scorer() or a callable scorer, not to be confused with an evaluation metric, as the latter have a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See the scoring parameter section of the Scikit-learn User Guide.
random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

Examples

>>> from pprint import pprint
>>> from watex.datasets import fetch_data
>>> from watex.models.validation import GridSearch
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> X_prepared, y_prepared =fetch_data ('bagoue prepared')
>>> grid_params = [ dict(
...        n_estimators=[3, 10, 30], max_features=[2, 4, 6, 8]),
...        dict(bootstrap=[False], n_estimators=[3, 10],
...                             max_features=[2, 3, 4])
...        ]
>>> forest_clf = RandomForestClassifier()
>>> grid_search = GridSearch(forest_clf, grid_params)
>>> grid_search.fit(X= X_prepared,y =  y_prepared,)
>>> pprint(grid_search.best_params_ )
{'max_features': 8, 'n_estimators': 30}
>>> pprint(grid_search.cv_results_)
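
How a list-of-dict grid_params expands into candidate settings can be sketched with the standard library. This is an illustration that assumes the same semantics as scikit-learn's parameter grids (each dict is a Cartesian product, and the list is the union of those products); expand_grid is a hypothetical helper, not part of watex.

```python
from itertools import product

def expand_grid(grid_params):
    """List every hyperparameter combination described by `grid_params`."""
    candidates = []
    for grid in grid_params:
        names = sorted(grid)
        # Cartesian product of the value lists of one dict.
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

grid_params = [
    dict(n_estimators=[3, 10, 30], max_features=[2, 4, 6, 8]),
    dict(bootstrap=[False], n_estimators=[3, 10], max_features=[2, 3, 4]),
]
candidates = expand_grid(grid_params)
print(len(candidates))  # 3*4 + 1*2*3 = 18
```

With cv=4, each of the 18 candidates is fitted four times, so the search in the example above amounts to 72 fits before any final refit.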

property base_estimator#: Return the base estimator class

best_estimator_#

best_params_#

cv#

cv_results_#

feature_importances_#

fit(X, y)[source]#

Fit the base estimator and populate the GridSearch attributes.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

``self`` – Returns GridSearch

Return type:

GridSearch

grid_kws#

grid_params#

property kind#: Kind of search: RandomizedSearchCV or GridSearchCV.

scoring#

verbose#

class watex.GridSearchMultiple(estimators, scoring, grid_params, *, kind='GridSearchCV', cv=7, random_state=42, savejob=False, filename=None, verbose=0, **grid_kws)[source]#

Bases: object

Search for and find the best parameters of multiple different estimators.

Parameters:

estimators (list of callable obj) –
list of estimator objects to fine-tune their hyperparameters. For instance:

random_state=42
# build estimators
logreg_clf = LogisticRegression(random_state =random_state)
linear_svc_clf = LinearSVC(random_state =random_state)
sgd_clf = SGDClassifier(random_state = random_state)
svc_clf = SVC(random_state =random_state)

estimators =(svc_clf,linear_svc_clf, logreg_clf, sgd_clf )

grid_params (list) –

List of parameter grids. For instance:

grid_params= ([
dict(C=[1e-2, 1e-1, 1, 10, 100], gamma=[5, 2, 1, 1e-1, 1e-2, 1e-3],
             kernel=['rbf']),
dict(kernel=['poly'],degree=[1, 3,5, 7], coef0=[1, 2, 3],
 C=[1e-2, 1e-1, 1, 10, 100])],
[dict(C=[1e-2, 1e-1, 1, 10, 100], loss=['hinge'])],
[dict()], [dict()]
)

cv (int, cross-validation splitter or iterable, default=7) –

* An integer, specifying the number of folds in K-fold cross validation.
    K-fold will be stratified over classes if the estimator is a classifier
    (determined by base.is_classifier) and the targets may represent a
    binary or multiclass (but not multioutput) classification problem
    (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for
    splitters available within Scikit-learn.
* An iterable yielding train/test splits.

With some exceptions (especially where not using cross validation at all is an option), the default is 7-fold.

scoring (str,) – Specifies the score function to be maximized (usually by cross validation), or – in some cases – multiple score functions to be reported. The score function can be a string accepted by sklearn.metrics.get_scorer() or a callable scorer, not to be confused with an evaluation metric, as the latter have a more diverse API. scoring may also be set to None, in which case the estimator’s score method is used. See the scoring parameter section of the Scikit-learn User Guide.
kind (str, default='GridSearchCV' or '1') – Kind of grid parameter search. Can be 1 for GridSearchCV or 2 for RandomizedSearchCV.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
savejob (bool, default=False) – Save the model parameters to an external file using ‘joblib’ or the Python persistent ‘pickle’ module. The default format is ‘joblib’.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
grid_kws (dict,) – Additional keyword arguments passed to the grid search method.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.models import GridSearchMultiple , displayFineTunedResults
>>> from watex.exlib import LinearSVC, SGDClassifier, SVC, LogisticRegression
>>> X, y  = fetch_data ('bagoue prepared')
>>> X
... <344x18 sparse matrix of type '<class 'numpy.float64'>'
... with 2752 stored elements in Compressed Sparse Row format>
>>> # As an example, we build four estimators and provide their
>>> # grid parameter ranges for fine-tuning:
>>> random_state=42
>>> logreg_clf = LogisticRegression(random_state =random_state)
>>> linear_svc_clf = LinearSVC(random_state =random_state)
>>> sgd_clf = SGDClassifier(random_state = random_state)
>>> svc_clf = SVC(random_state =random_state)
>>> estimators =(svc_clf,linear_svc_clf, logreg_clf, sgd_clf )
>>> grid_params= ([dict(C=[1e-2, 1e-1, 1, 10, 100],
                        gamma=[5, 2, 1, 1e-1, 1e-2, 1e-3],kernel=['rbf']),
                   dict(kernel=['poly'],degree=[1, 3,5, 7], coef0=[1, 2, 3],
                        C= [1e-2, 1e-1, 1, 10, 100])],
                [dict(C=[1e-2, 1e-1, 1, 10, 100], loss=['hinge'])],
                [dict()], # no parameters are provided here, for demo purposes
                [dict()]
                )
>>> # Now we can call :class:`watex.models.GridSearchMultiple` for
>>> # training and self-validating:
>>> gobj = GridSearchMultiple(estimators = estimators,
                       grid_params = grid_params ,
                       cv =4,
                       scoring ='accuracy',
                       verbose =1,   # values > 7 are more verbose
                       savejob=False ,  # set to True to save the job in a binary disk file.
                       kind='GridSearchCV').fit(X, y)
>>> # Once the parameters are fine-tuned, we can display the fine-tuning
>>> # results using the displayFineTunedResults function
>>> displayFineTunedResults (gobj.models.values_)
MODEL NAME = SVC
BEST PARAM = {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
BEST ESTIMATOR = SVC(C=100, gamma=0.01, random_state=42)

MODEL NAME = LinearSVC
BEST PARAM = {'C': 100, 'loss': 'hinge'}
BEST ESTIMATOR = LinearSVC(C=100, loss='hinge', random_state=42)

MODEL NAME = LogisticRegression
BEST PARAM = {}
BEST ESTIMATOR = LogisticRegression(random_state=42)

MODEL NAME = SGDClassifier
BEST PARAM = {}
BEST ESTIMATOR = SGDClassifier(random_state=42)

Notes

Call get_scorers() or use sklearn.metrics.SCORERS.keys() to get all the metrics used to evaluate model errors. Any other metric listed in SCORERS.keys() can also be used. Furthermore, if scoring is set to None, nmse is used as the default value, standing for ‘neg_mean_squared_error’.
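
The idea of pairing each estimator with its own list of grids can be sketched with plain scikit-learn. This is an illustration on synthetic data and says nothing about how GridSearchMultiple is implemented internally; the dataset, the reduced grids and the results dictionary are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a prepared training set.
X, y = make_classification(n_samples=120, n_features=6, random_state=42)

estimators = (SVC(random_state=42), LogisticRegression(random_state=42))
grid_params = (
    [dict(C=[0.1, 1, 10], gamma=[0.1, 0.01], kernel=["rbf"])],
    [dict()],  # empty grid: evaluate the estimator with its defaults
)

results = {}
for estimator, grid in zip(estimators, grid_params):
    # One independent search per (estimator, grid) pair.
    search = GridSearchCV(estimator, grid, cv=4, scoring="accuracy")
    search.fit(X, y)
    results[type(estimator).__name__] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(name, params, round(score, 3))
```

The order of grid_params must follow the order of estimators, since the two sequences are matched position by position.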

fit(X, y)[source]#

watex.LLE(X, n_components=None, *, return_X=True, n_neighbors=5, **lle_kws)[source]#

Locally Linear Embedding(LLE)

LLE is a nonlinear dimensionality reduction technique based on closest neighbors (c.n.).

LLE is another powerful nonlinear dimensionality reduction (NLDR) technique. It is a manifold learning technique that does not rely on projections like PCA. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), the ratio of variance to preserve is 95%.
n_neighbors (int, default=5) – Number of neighbors to consider for each point.
return_X (bool, default =True ,) – Return the training set transformed with the most representative variance ratio.
lle_kws (dict,) – Additional keyword arguments passed to sklearn.manifold.LocallyLinearEmbedding.

Returns:

The transformed training set or the LLE container attributes for plotting purposes.

Return type:

X (NDArray) or LLE object

References

Gokhan H. Bakir, Jason Weston and Bernhard Schölkopf, 2004; “Learning to Find Pre-images”; Tübingen, Germany: Max Planck Institute for Biological Cybernetics.

S. Roweis, L. Saul, 2000, Nonlinear Dimensionality Reduction by Locally Linear Embedding.

Notes

Scikit-Learn used the algorithm based on Kernel Ridge Regression

Example

>>> from watex.analysis.dimensionality import LLE
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysed data')
>>> lle_kws ={
...    'n_components': 4,
...    "n_neighbors": 5}
>>> Xtransf=LLE(X,**lle_kws)
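
For comparison, the unrolling of a twisted manifold can be reproduced directly with scikit-learn's LocallyLinearEmbedding. This is a sketch on a synthetic swiss roll, not the watex wrapper; the sample count, noise level and neighbor count are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D swiss roll with a little noise: a classic twisted manifold.
X, _ = make_swiss_roll(n_samples=300, noise=0.05, random_state=42)

# Each point is described by its 10 closest neighbors, then embedded
# in 2-D so that those local linear relationships are preserved.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)

print(X.shape, "->", X_unrolled.shape)
```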

class watex.MXS(kname=None, aqname=None, threshold=None, method='naive', trailer='*', keep_label_0=False, random_state=42, n_groups=3, sep=None, prefix=None, **kws)[source]#

Bases: HData

Mixture Learning Strategy (MXS)

Using machine learning for k-parameter prediction is an alternative way to reduce the cost of data collection. However, borehole data comes with many missing k values, since the parameter is strongly tied to the aquifer and is obtained after the pumping test. In other words, collecting the k-parameter is feasible only if the layer in the well is an aquifer. Unfortunately, predicting a few samples of k in a large set of missing data remains an issue for classical supervised learning methods. We therefore propose an alternative approach, called the mixture learning strategy (MXS), to solve both issues. It entails predicting upstream a naïve group of aquifers (NGA), which is combined with the real k values to counterbalance the missing values and yield an optimal prediction score. The method first involves the K-Means and Hierarchical Agglomerative Clustering (HAC) algorithms, which are used to predict the NGA labels needed for the MXS label merging.

Parameters:

kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
aqname (str, optional,) – Name of the aquifer group column. aqname allows retrieving the aquifer group arr_aq values from a specific dataframe. Commonly aqname needs to be supplied when a dataframe is passed as a positional or keyword argument. Note that it is not mandatory to have a group of aquifers in the log data. It is needed only if the label similarity needs to be calculated.
threshold (float, default=None) – The threshold from which a label in the ‘k’ array can be considered similar to one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to the other labels is considered the most representative and similar. Setting the rule by fixing the threshold is recommended instead, especially for a huge dataset.
n_groups (int, default=3) – The number of aquifer groups to form as well as the number of centroids to generate. If the number of aquifer groups in the area is known, it should be used instead. However, it is recommended to validate this number using the ‘elbow plot’, the ‘silhouette plot’ or the Hierarchical Agglomerative Clustering dendrogram. Refer to plot_elbow(), plotSilhouette() or watex.view.plotDendrogram() for plotting purposes.
keep_label_0 (bool, default=False) –
The prediction already includes the label 0. However, a 0 in the predicted labels refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore ‘0’ can be excluded, since it can also be considered a missing ‘k’-value: a predicted ‘0’ in the target should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, ‘0’ is altered to ‘1’, i.e. +1 is added to shift all class labels forward, thereby excluding the ‘0’ label. To force the inclusion of 0 in the labels, set keep_label_0 to True.

sep (str, default=’’) –
Separator between the true labels ‘y_true’ and the predicted NGA labels. sep is used to rewrite the MXS labels. Mostly, the MXS labels are a combination of the true labels of the permeability coefficient ‘k’ and the NGA labels, composing new similarity labels. For instance:
>>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'UV']
>>> # gives
>>> MXS_labels= ['k1_II', 'k2_I', 'k3_UV']
where the separator sep is set to _. This happens especially when one of the label sets (NGA or true_labels) is not of a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’, and so on.
prefix (str, default=’’) –
prefix is used to rename the true_labels, i.e. the true valid k. For instance:
>>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
where ‘k’ is the prefix.
method (str [‘naive’, ‘strict’], default=’naive’) –
The kind of strategy used to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’:
- naive computes the importance of a label from the number of its
  occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of a label from the number of its
  occurrences in the whole valid y_true, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
trailer (str, default=’*’) –
The mixture strategy marker that differentiates the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same class labels are also present in the true labels under the same label-identifier name. This is useful to avoid any confusion between the labels in y_true and y_pred, for better demarcation and distinction. Note that if trailer is set to None and both y_true and y_pred are numeric, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance:
>>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
>>> # with trailer, MXS labels should be
>>> MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
>>> # with no trailer
>>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
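
The roles of sep and trailer can be illustrated with two small helpers. These are hypothetical functions written for this sketch (join_labels and mark_shared_labels are not watex names), and they only mimic the label renaming shown in the parameter descriptions above, not the similarity search that decides which labels are merged.

```python
def join_labels(true_labels, nga_labels, sep="_"):
    """Combine each true 'k' label with its similar NGA label."""
    return [f"{k}{sep}{g}" for k, g in zip(true_labels, nga_labels)]

def mark_shared_labels(nga_labels, true_labels, trailer="*"):
    """Append `trailer` to NGA labels that also exist among the true labels."""
    taken = set(true_labels)
    return [f"{g}{trailer}" if g in taken else str(g) for g in nga_labels]

joined = join_labels(["k1", "k2", "k3"], ["II", "I", "UV"])
marked = mark_shared_labels([0, 1, 2], [1, 2, 3])

print(joined)
print(marked)
```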

Examples

>>> from watex.datasets import load_hlogs
>>> from watex.methods.hydro import MXS
>>> hdata= load_hlogs (as_frame =True)
>>> # drop the 'remark' columns since there is no valid data
>>> hdata.drop (columns ='remark', inplace =True)
>>> mxs = MXS (kname ='k').fit(hdata)
>>> # predict the default NGA
>>> mxs.predictNGA() # default prediction with n_groups =3
>>> # make MXS labels using the default 'k' categorization
>>> ymxs=mxs.makeyMXS(categorize_k=True, default_func=True)
>>> mxs.yNGA_ [62:74]
Out[43]: array([1, 2, 2, 2, 3, 1, 2, 1, 2, 2, 1, 2])
>>> ymxs[62:74]
Out[44]: array([ 1, 22, 22, 22,  3,  1, 22,  1, 22, 22,  1, 22])
>>> # to get the label similarity, provide the column name
>>> # of the aquifer group and fit again:
>>> mxs = MXS (kname ='k', aqname ='aquifer_group').fit(hdata)
>>> sim = mxs.labelSimilarity()
>>> sim
Out[47]: [(0, 'II')] # group II and label 0 are very similar

aqname = 'aquifer_group'#

kname = 'k'#

labelSimilarity(func=None, categorize_k=False, default_func=False, **sm_kws)[source]#

Find label similarities

Parameters:

func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead with the parameter default_func.
string (bool,) – If set to “True”, the categorized map from ‘k’ is prefixed by “k”. However, if a string value is given, the prefix is changed to that label.
default_func (bool,) –
Whether to use the default function for mapping k. Note that the default may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is as follows:
- k0 {0}: k = 0
- k1 {1}: 0 < k <= .01
- k2 {2}: .01 < k <= .07
- k3 {3}: k> .07
sm_kws (dict,) – Additional keyword arguments passed to find_similar_labels().
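
The default ‘k’ mapping listed above is a binning into right-closed intervals, which can be sketched with NumPy. This is an illustration of the stated boundaries only; categorize_k is a hypothetical helper, not the function watex uses, and negative inputs (which have no physical meaning here) would fall in class 0.

```python
import numpy as np

def categorize_k(k, bounds=(0.0, 0.01, 0.07)):
    """Return 0 for k = 0, 1 for 0 < k <= .01, 2 for .01 < k <= .07, 3 above."""
    k = np.asarray(k, dtype=float)
    # right=True makes every bin right-closed: bounds[i-1] < k <= bounds[i].
    return np.digitize(k, bins=bounds, right=True)

classes = categorize_k([0.0, 0.005, 0.01, 0.05, 0.07, 0.2])
print(classes.tolist())
```

A custom func passed to labelSimilarity() or makeyMXS() would play the same role with project-specific boundaries.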

makeyMXS(y_pred=None, func=None, categorize_k=False, default_func=False, **mxs_kws)[source]#

Construct the MXS target $y*$

Parameters:

y_pred (Array-like 1d, pandas.Series) –
Array composing the valid NGA labels. Note that NGA labels are predicted labels, mostly obtained using unsupervised learning.

seealso:

predict_NGA_labels() for further details.
func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead with the parameter default_func.
string (bool,) – If set to “True”, the categorized map from ‘k’ is prefixed by “k”. However, if a string value is given, the prefix is changed according to this label.
default_func (bool,) –
If set to True, the default function is used for mapping k. Note that this default may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is given as follows:
- k0 {0}: k = 0
- k1 {1}: 0 < k <= .01
- k2 {2}: .01 < k <= .07
- k3 {3}: k> .07
mxs_kws (dict,) – Additional keyword arguments passed to make_MXS_labels().

Returns:

MXS.mxs_labels_ – array-like of MXS labels

Return type:

array-like 1d

Example

>>> from watex.datasets import load_hlogs
>>> from watex.methods.hydro import MXS
>>> hdata = load_hlogs ().frame
>>> # drop the 'remark' columns since there is no valid data
>>> hdata.drop (columns ='remark', inplace=True)
>>> mxs =MXS (kname ='k').fit(hdata) # specify the 'k'columns
>>> # we can predict the NGA labels and yMXS with single line
>>> # of code snippet using the default 'k' classification.
>>> ymxs = mxs.predictNGA().makeyMXS(categorize_k=True, default_func=True)
>>> mxs.yNGA_[:7]
... array([2, 2, 2, 2, 2, 2, 2])
>>> ymxs[:7]
Out[40]: array([22, 22, 22, 22, 22, 22, 22])
>>> mxs.mxs_group_classes_
Out[56]: {1: 1, 2: 22, 3: 3} # transform classes
>>> mxs.mxs_group_labels_
Out[57]: (2,)
>>> # Only the label '2' is transformed to '22' since it is the
>>> # only one that has similarity with the true label 2.
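
The relabelling shown in this example can be reproduced by hand. The sketch below assumes the mapping reported by `mxs_group_classes_` and applies it to the first seven NGA labels from the output above; it illustrates the transformation only and is not the library's implementation.

```python
# Mapping as reported by mxs.mxs_group_classes_ in the example above.
group_classes = {1: 1, 2: 22, 3: 3}

# NGA labels copied from the example output (first seven samples).
y_nga = [2, 2, 2, 2, 2, 2, 2]

# Relabel each NGA label through the mapping.
y_mxs = [group_classes[label] for label in y_nga]
print(y_mxs)  # [22, 22, 22, 22, 22, 22, 22]
```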

predictNGA(n_components=2, return_label=False, **NGA_kws)[source]#

Predicts Naive Group of Aquifer from Hydro-Log data.

Parameters:

n_components (int, default=2) – Number of dimensions to preserve. If `n_components` is a float between 0. and 1., it indicates the ratio of variance to preserve. If None, the ratio of variance to preserve is 95%.
return_label (bool, default=False) – If True, return the predicted NGA labels, otherwise return the MXS instantiated object. If False, the NGA labels can be fetched using the attribute watex.hydro.MXS.yNGA_.
NGA_kws (dict,) – Keyword arguments passed to watex.utils.predict_NGA_labels().

Returns:

yNGA_ or self – Predicted NGA labels if return_label is True, otherwise the MXS instantiated object.

Return type:

array-like 1d of naive group of aquifer labels or MXS instance

Example

>>> from watex.datasets import load_hlogs
>>> from watex.methods.hydro import MXS
>>> hdata = load_hlogs ().frame
>>> # drop the 'remark' columns since there is no valid data
>>> hdata.drop (columns ='remark', inplace=True)
>>> mxs =MXS (kname ='k').fit(hdata) # specify the 'k' column
>>> y_pred = mxs.predictNGA(return_label=True )
>>> y_pred [-12:]
Out[52]: array([1, 3, 1, 3, 3, 3, 3, 1, 3, 3, 3, 3])

sname = None#

verbose = 0#

zname = None#

class watex.Missing(in_percent=False, sample=None, kind=None, drop_columns=None, **kws)[source]#

Bases: Data

Deal with missing values in Data

Most algorithms will not work with missing data. Notable exceptions are the recent boosting libraries such as XGBoost (watex.documentation.xgboost.__doc__), CatBoost and LightGBM. As with many things in machine learning, there are no hard answers for how to treat missing data. Also, missing data could represent different situations. There are four ways to handle missing data:

* Remove any row with missing data
* Remove any columns with missing data
* Impute missing values
* Create an indicator column to flag that data was missing
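
As an illustration, the strategies listed above can be sketched with plain pandas calls. The toy frame and its column names are assumptions made for this example; the snippet does not use the Missing class itself.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative only.
df = pd.DataFrame({
    "power": [10.0, np.nan, 30.0, 40.0],
    "sfi": [0.5, 1.0, np.nan, 2.0],
    "geol": ["granite", "schist", "granite", "schist"],
})

# 1. Remove any row with missing data.
rows_dropped = df.dropna(axis=0)

# 2. Remove any column with missing data.
cols_dropped = df.dropna(axis=1)

# 3. Impute missing values (here with the column median).
imputed = df.fillna({"power": df["power"].median(),
                     "sfi": df["sfi"].median()})

# 4. Add an indicator column flagging where data was missing.
flagged = df.assign(power_missing=df["power"].isna())

print(rows_dropped.shape, cols_dropped.shape)  # (2, 3) (4, 1)
```

Which strategy is appropriate depends on why the values are missing; dropping is simplest but discards information, while an indicator column keeps the fact that a value was absent.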

Parameters:

in_percent (bool,) – give the statistic of missing data in percentage if set to True.
sample (int, Optional,) – Number of rows to visualize, or the limit of the number of samples needed to see the patterns. This is useful when data is composed of many rows. Shrinking the data to keep some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) –
Type of visualization. Can be dendrogram, mbar, bar or corr for the dendrogram, msno bar, plt bar and msno correlation visualizations respectively:
- bar plot counts the nonmissing data using pandas.
- mbar uses the msno package to count the number of nonmissing data.
- dendrogram shows the clusterings of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different clusters are. Short arms mean that branches are similar.
- corr creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of the missing data which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.

Examples

>>> from watex.base import Missing
>>> data ='data/geodata/main.bagciv.data.csv'
>>> ms= Missing().fit(data)
>>> ms.plot_.fig_size = (12, 4 )
>>> ms.plot ()

drop(data=None, columns=None, inplace=False, axis=1, **kwd)[source]#

Remove missing data

Parameters:

data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
columns (str or list of str) – Columns to drop which contain the missing data. Can be used with axis equal to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- ’any’: If any NA values are present, drop that row or column.
- ’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

Returns:

``self`` – returns self for easy method chaining.

Return type:

Missing instance

property get_missing_columns#: Return columns with NaN values

property isnull#: Check the mean of null values in the data, in percentage

plot(figsize=None, **kwd)[source]#

Visualize patterns in the missing data.

Parameters:

data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
kind (str, Optional) –
Kind of visualization. Can be dendrogram, mbar, bar or corr for the dendrogram, msno bar, plt bar and msno correlation visualizations respectively:
- bar plot counts the nonmissing data using pandas.
- mbar uses the msno package to count the number of nonmissing data.
- dendrogram shows the clusterings of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different clusters are. Short arms mean that branches are similar.
- corr creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of the missing data which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
sample (int, Optional) – Number of rows to visualize. This is useful when data is composed of many rows. Shrinking the data to keep some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kws (dict) – Additional keywords arguments of msno.matrix plot.

Returns:

``self`` – returns self for easy method chaining.

Return type:

Missing instance

Examples

>>> from watex.base import Missing
>>> data ='data/geodata/main.bagciv.data.csv'
>>> ms= Missing().fit(data)
>>> ms.plot(figsize = (12, 4 ) )

replace(data=None, columns=None, fill_value=None, new_column_name=None, return_non_null=False, **kwd)[source]#

Replace the missing values to consider.

Use the coalesce function of pyjanitor. It takes a dataframe and a list of columns to consider. This is similar to functionality found in Excel and SQL databases. It returns the first non-null value of each row.
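
The "first non-null value of each row" behaviour can be sketched in pure Python. The function below is a hypothetical illustration of the coalesce idea; it is neither the pyjanitor API nor the code used by this method.

```python
import math


def coalesce(row):
    """Return the first value in `row` that is neither None nor NaN."""
    for value in row:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        return value
    return None  # every value in the row was missing


rows = [
    (None, 2.0, 3.0),
    (float("nan"), None, 7.5),
    (1.0, None, None),
    (None, None, None),
]
print([coalesce(r) for r in rows])  # [2.0, 7.5, 1.0, None]
```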

Parameters:

data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
columns (str or list of str) – Columns to replace which contain the missing data. Can be used with axis equal to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.

Returns:

``self`` – returns self for easy method chaining.

Return type:

Missing instance

property sanity_check#: Ensure that we have dealt with all missing values. Returns a single boolean telling whether any cell is missing in a DataFrame.
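
A check of this kind can be written with pandas alone. The sketch below is an assumption about what such a check looks like, not the body of the property.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# True if at least one cell is missing anywhere in the frame.
has_missing = bool(df.isna().any().any())
print(has_missing)  # True

# After imputation the same check returns False.
still_missing = bool(df.fillna(0.0).isna().any().any())
print(still_missing)  # False
```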

class watex.Preprocessing(tname='flow', drop_features=None, random_state=42, default_estimator='svc', test_size=0.2, verbose=0)[source]#

Bases: object

Base preprocessing class.

Give a baseline preprocessing model with a base score. Useful before fiddling with the model hyperparameters.

Parameters:

tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.
drop_features (list or str, Optional) – List the useless features for predicting or list of column names to drop out.
random_state (int, default is 42) – The state of data shuffling. The default is 42.
default_estimator (callable, F or scikit-learn estimator) –
The default estimator name for predicting the tname value. Predefined default estimator parameters are set and kept in cache for quick preprocessing:
- ‘dtc’: DecisionTreeClassifier
- ‘svc’: Support Vector Classifier
- ‘sdg’: SGDClassifier
- ‘knn’: KNeighborsClassifier
- ‘rdf’: RandomForestClassifier
- ‘ada’: AdaBoostClassifier
- ‘vtc’: VotingClassifier
- ‘bag’: BaggingClassifier
- ‘stc’: StackingClassifier
If the estimator is not given, the default is svm or svc.
test_size (float,) – The test set data size. Must be less than 1. The default test size is 0.2, i.e. 20% of the dataset.
verbose (int, default is 0) – Control the level of verbosity. Higher value lead to more messages.

X#

training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Type:: Ndarray of shape ( M x N), $M=m-samples$ & $N=n-features$

y#

train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:: array-like of shape (M, ), $M=m-samples$

Xt#

Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

Type:: Ndarray ( M x N matrix where M=m-samples, & N=n-features)

yt#

test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Type:: array-like, shape (M, ) M=m-samples,

data#

Path -like object or Dataframe. If data is given as path-like object, data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be a file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle e.g. via builtin open function or StringIO.

Type:: str, filepath_or_buffer or pandas.core.DataFrame

pipe_#

The pipeline can be your own pipeline built with different transformers. For base model prediction, it is possible to use the default pipeline. Call get_default_pipe to get the transformation list and steps.

Type:: Callable, preprocessor object from sklearn.pipeline

estimator#

Callable estimator method to fit the model:

estimators= SGDClassifier(random_state=13)

Type:: Callable, F or sklearn.metaestimator

model#

A model estimator. An object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.

Type:: callable, always as a function,

cat_features_#

List of categorical features. If not given, it is found automatically.

Type:: list or str, Optional

num_features_#

List of numerical features. If not given, it is found automatically.

Type:: list of str, Optional

model#

Use the predefined pipelines, i.e. the model can be a Pipeline that you build on your own with a different composite estimator. If model is None, use the default model from the default preprocessor and estimator.

Type:: Callable, {preprocessor + estimator },

Examples

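>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.impute import SimpleImputer
>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> # fit the preprocessor first so that pc.X and pc.Xt exist
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> pc.fit(data =fetch_data ('bagoue original').get('data=dfy2'))
>>> estimators=dict(
...    RandomForestClassifier=RandomForestClassifier(
...        n_estimators=200, random_state=0))
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model=estimators, eval_metric =True)
>>> pc.base_score_
... 0.72586369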

baseEvaluation(model=None, eval_metric=False, **kws)[source]#

Dummy baseline model from the preprocessing pipeline, applied onto a model by providing an estimator.

Parameters:

model (Callable, {'preprocessor + estimator },) – A model is a scikit-learn estimator or a composite model built from a Pipeline. If model is None, use the default model from the default preprocessor and estimator. model can be a dict of multiple estimators. In that case the evaluation of each estimator is stored in a dictionary where the key is each estimator name.
eval_metric (bool,) – if set to True, confusion matrix and classification report scores are evaluated assuming that the supervised learning is a classification problem. default is False.
scorer (str, Callable,) –
a scorer is a metric function for model evaluation. If given as string it should be the prefix of the following metrics:
- ”classification_report” -> for classification_report,
- ’precision_recall’ -> for precision_recall_curve,
- ”confusion_matrix” -> for a confusion_matrix,
- ’precision’ -> for precision_score,
- ”accuracy” -> for accuracy_score
- ”mse” -> for mean_squared_error,
- ”recall” -> for recall_score,
- ’auc’ -> for roc_auc_score,
- ’roc’ -> for roc_curve
- ’f1’ -> for f1_score,
Any other string prefix value raises an error.
kws (dict,) – Additional keyword arguments from the scikit-learn metric function.
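
One way to picture the string-to-metric lookup described for scorer is a plain dictionary. The sketch below is hypothetical: `SCORERS` and `resolve_scorer` are illustrative names, the lookup assumes an exact match on the short names listed above, and it returns the scikit-learn metric name as a string rather than the function itself.

```python
# Short names copied from the list above, mapped to scikit-learn metric names.
SCORERS = {
    "classification_report": "classification_report",
    "precision_recall": "precision_recall_curve",
    "confusion_matrix": "confusion_matrix",
    "precision": "precision_score",
    "accuracy": "accuracy_score",
    "mse": "mean_squared_error",
    "recall": "recall_score",
    "auc": "roc_auc_score",
    "roc": "roc_curve",
    "f1": "f1_score",
}


def resolve_scorer(name: str) -> str:
    """Return the metric name for a short scorer name, or raise ValueError."""
    key = name.strip().lower()
    if key not in SCORERS:
        raise ValueError(f"Unknown scorer {name!r}; expected one of {sorted(SCORERS)}")
    return SCORERS[key]


print(resolve_scorer("accuracy"))  # accuracy_score
```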

Returns:

`self.base_score_`

Return type:

base score after predicting

Notes

If no estimator is given, the default estimator is svm. Otherwise, provide the prefix to select the convenient estimator from the default dict default_estimator. Get the default dict by calling <instance>._getdestimators().

Examples

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)

(1) -> default estimator

>>> pc.baseEvaluation (eval_metric=True)
... 0.47126436781609193

(2) -> multiple estimators

>>> from watex.exlib.sklearn import RandomForestClassifier , SGDClassifier, SimpleImputer
>>> estimators={'RandomForestClassifier':RandomForestClassifier(
...                 n_estimators=200, random_state=0),
...             'SDGC':SGDClassifier(random_state=0)}
>>> pc.X= SimpleImputer().fit_transform(pc.X)
>>> pc.Xt= SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(model={
... 'RandomForestClassifier':RandomForestClassifier(
...   n_estimators=200, random_state=0),
... 'SDGC':SGDClassifier(random_state=0)}, eval_metric =True)
>>> pc.ypred_
Out[128]:
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3,
        1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2,
        1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2,
        2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1],
       dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       dtype=int8)}
>>> pc.base_score_
Out[130]:
{'RandomForestClassifier': 0.7816091954022989,
 'SGDClassifier': 0.14942528735632185}

property data#

property features#: Collect the list of features

fit(X=None, y=None, **fit_params)[source]#

Read the whole dataset, encode the categorical features and populate class attributes.

If X and y are provided, they are considered as the feature set and the target respectively. They should be split into the training set and the test set.

Parameters:

X (N-d array, shape (N, M)) – the feature arrays composed of N-columns and M-samples. The feature set excludes the target y.
y (arraylike, shape (M)) – the target is composed of M-examples in supervised learning.
data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N including the target y. Note that if the data is given, it is not necessary to provide X and y. By specifying the target name tname, the target is removed from the data.
split_X_y (bool, default {'True'}) – split the dataset into training set {X, y} and test set {Xt, yt}. Otherwise X and y are considered as training sets.

Returns:

``self``

Return type:

Preprocessing instance for easy method chaining.

Examples

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing (drop_features = ['lwi', 'num', 'name']
...                     ).fit(data =data )
>>> len(pc.X ),  len(pc.y), len(pc.Xt ),  len(pc.yt)
... (344, 344, 87, 87) # trainset (X,y) and testset (Xt, yt)
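
The sizes reported above (344 training and 87 test samples, 431 in total) are consistent with a test_size of 0.2 whose test part is rounded up. The following is a minimal sketch under that assumption; `split_indices` is a hypothetical helper and its shuffling does not reproduce the actual split made by fit.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, random_state=42):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test part is rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(431)
print(len(train_idx), len(test_idx))  # 344 87
```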

get_default_pipe()[source]#

Make a default pipe to preprocess the data.

Create a preprocessor by assembling multiple transformers. The default pipeline is not exhaustive, so to have full control of the data, it is recommended to provide a strong preprocessor for the data processing at once.

The method returns self.pipe_ as a callable preprocessor pipeline from the sklearn.pipeline.Pipeline object. The default transformers are composed of:

- sklearn.pipeline.make_pipeline() for pipeline creation.
- sklearn.preprocessing.OneHotEncoder() for categorical features encoding.
- sklearn.preprocessing.PolynomialFeatures() for features engineering.
- sklearn.preprocessing.RobustScaler() for data scaling.
- sklearn.compose.make_column_transformer() for data transformation.
- sklearn.compose.make_column_selector() for features composing.
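
To show how these building blocks fit together, here is a minimal sketch that assembles them with scikit-learn. The ordering of the steps, the polynomial degree and the toy data are assumptions made for illustration; the pipeline returned by get_default_pipe may be configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

# Numerical branch: feature engineering followed by robust scaling.
numeric_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    RobustScaler(),
)
# Categorical branch: one-hot encoding.
categorical_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Route columns to each branch according to their dtype.
preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include=np.number)),
    (categorical_pipe, make_column_selector(dtype_exclude=np.number)),
)

# Toy data; column names are illustrative only.
toy = pd.DataFrame({
    "power": [10.0, 20.0, 30.0, 40.0],
    "sfi": [0.5, 1.0, 1.5, 2.0],
    "geol": ["granite", "schist", "granite", "schist"],
})
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # (4, 7): 5 polynomial features + 2 one-hot columns
```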

property inspect#: Inspect data and trigger plot after checking the data entry. Raises NotFittedError if self is not fitted yet.

makeModel(pipe=None, estimator=None)[source]#

Assemble pipes and estimator to create the model

The model is composed of the transformers and the estimator. If one is set to None, the default pipe or estimator is used, which might not be the one expected. Therefore providing a pipe and an estimator is suggested.

Parameters:

pipe (Callable, pipeline or preprocessor) – Callable pipeline. The pipeline can be your own pipeline with different transformers. Refer to sklearn.pipeline.Pipeline for further details. Call get_default_pipe to get the default pipe.

estimator (Callable, F or {sklearn estimator}) –

Callable estimator method to fit the model:

   estimators= SGDClassifier(random_state=13)

Some pre-estimators can be fetched by providing the prefix as
a key of the estimator default dict. For instance to fetch the
`DecisionTreeClassifier` estimator::

    >>> from watex.cases.processing import Preprocessing
    >>> Preprocessing._getdestimators()['dtc']
    ... DecisionTreeClassifier(max_depth=100, random_state=42)

Returns:

`model_`

Return type:

Callable, {preprocessor + estimator }

Examples

(1)-> We can get the default preprocessor by merely calling:

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> pc.fit(data =data)
>>> pc.makeModel() # use default model and preprocessor
>>> pc.model_

(2)-> Or build your own preprocessor object using the example below:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.impute import SimpleImputer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.linear_model import LogisticRegression
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")),
           ("scaler", StandardScaler())]
    )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
>>> pc.makeModel (pipe = preprocessor,
                  estimator =  LogisticRegression())
>>> pc.model_

class watex.QuickPlot(classes=None, tname=None, mapflow=False, **kws)[source]#

Bases: BasePlot

Special class dealing with analysis modules for quick diagrams, histograms and bar visualizations.

Originally, it was designed for the flow rate prediction, however, it still works with any other dataset by following the parameters details.

Parameters:

data (str, filepath_or_buffer or pandas.core.DataFrame) – Path -like object or Dataframe. If data is given as path-like object, data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be a file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle e.g. via builtin open function or StringIO.
y (array-like of shape (M, ), $M=m-samples$) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
tname (str,) – A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.
classes (list of int | float, [categorized classes]) –
list of the categorical values encoded to numerical. For instance, for flow data analysis in the Bagoue dataset, the classes could be [0., 1., 3.] which means:
```
* 0 m3/h  --> FR0
* > 0 to 1 m3/h --> FR1
* > 1 to 3 m3/h --> FR2
* > 3 m3/h  --> FR3
```
mapflow (bool,) –
Refers to the flow rate prediction using DC-resistivity features and works when the tname is set to flow. If set to True, values in the target column are mapped to categorical values. Commonly the flow rate values are given as a trend of numerical values. For a classification purpose, the flow rate must be converted to categorical values which mainly refer to the types of hydraulic system. The type of hydraulic system is in turn mostly tied to the size of the population living in a specific area. For instance, flow classes can be ranged as follows:
- FR = 0 is for dry boreholes
- 0 < FR ≤ 3m3/h for village hydraulic (≤2000 inhabitants)
- 3 < FR ≤ 6m3/h for improved village hydraulic (>2000-20 000 inhabitants)
- 6 < FR ≤ 10m3/h for urban hydraulic (>200 000 inhabitants).
Note that the flow range from mapflow is not exhaustive and can be modified according to the type of hydraulic required on the project.
savefig (str, Path-like object,) – savefigure’s name, default is None
fig_dpi (float,) – dots-per-inch resolution of the figure. default is 300
fig_num (int,) – size of figure in inches (width, height). default is [5, 5]
fig_size (Tuple (int, int) or inch) – size of figure in inches (width, height). default is [5, 5]
fig_orientation (str,) – figure orientation. default is landscape
fig_title (str,) – figure title. default is None
fs (float,) – size of font of axis tick labels, axis labels are fs+2. default is 6
ls (str,) – line style, it can be [ ‘-’ | ‘.’ | ‘:’ ] . default is ‘-’
lc (str, Optional,) – line color of the plot, default is k
lw (float, Optional,) – line weight of the plot, default is 1.5
alpha (float between 0 < alpha < 1,) – transparency number, default is 0.5,
font_weight (str, Optional) – weight of the font , default is bold.
font_style (str, Optional) – style of the font. default is italic
font_size (float, Optional) – size of font in inches (width, height). default is 3.
ms (float, Optional) – size of marker in points. default is 5
marker (str, Optional) – marker of stations default is o.
marker_style (str, Optional) – facecolor of the marker. default is yellow
marker_edgecolor (str, Optional) – edgecolor of the marker. default is yellow
marker_edgewidth (float, Optional) – width of the marker. default is 3.
xminorticks (float, Optional) – minor ticks according to x-axis size. default is 1.
yminorticks (float, Optional) – minor ticks according to y-axis size. default is 1.
bins – histogram element separation between two bars. default is 10.
xlim (tuple (int, int), Optional) – limit of x-axis in plot.
ylim (tuple (int, int), Optional) – limit of y-axis in plot.
xlabel (str, Optional,) – label name of x-axis in plot.
ylabel (str, Optional,) – label name of y-axis in plot.
rotate_xlabel (float, Optional) – angle to rotate xlabel in plot.
rotate_ylabel (float, Optional) – angle to rotate ylabel in plot.
leg_kws (dict, Optional) – keyword arguments of legend. default is empty dict
plt_kws (dict, Optional) – keyword arguments of plot. default is empty dict
glc (str, Optional) – line color of the grid plot, default is k
glw (float, Optional) – line weight of the grid plot, default is 2
galpha (float, Optional,) – transparency number of grid, default is 0.5
gaxis (str ('x', 'y', 'both')) – type of axis to hold the grid, default is both
gwhich (str, Optional) – kind of grid in the plot. default is major
tp_axis (bool,) – axis to apply the ticks params. default is both
tp_labelsize (str, Optional) – labelsize of ticks params. default is italic
tp_bottom (bool,) – position at bottom of ticks params. default is True.
tp_labelbottom (bool,) – put label on the bottom of the ticks. default is False
tp_labeltop (bool,) – put label on the top of the ticks. default is True
cb_orientation (str , ('vertical', 'horizontal')) – orientation of the colorbar, default is vertical
cb_aspect (float, Optional) – aspect of the colorbar. default is 20.
cb_shrink (float, Optional) – shrink size of the colorbar. default is 1.0
cb_pad (float,) – pad of the colorbar of plot. default is .05
cb_anchor (tuple (float, float)) – anchor of the colorbar. default is (0.0, 0.5)
cb_panchor (tuple (float, float)) – proportionality anchor of the colorbar. default is (1.0, 0.5)
cb_label (str, Optional) – label of the colorbar.
cb_spacing (str, Optional) – spacing of the colorbar. default is uniform
cb_drawedges (bool,) – draw edges inside of the colorbar. default is False
sns_orient ('v' | 'h', optional) – Orientation of the plot (vertical or horizontal). This is usually inferred based on the type of the input variables, but it can be used to resolve ambiguity when both x and y are numeric or when plotting wide-form data. default is v which refer to ‘vertical’
sns_style (dict, or one of {darkgrid, whitegrid, dark, white, ticks}) – A dictionary of parameters or the name of a preconfigured style.
sns_palette (seaborn color palette | matplotlib colormap | hls | husl) – Palette definition. Should be something color_palette() can process. The palette generates the points with different colors.
sns_height (float,) – Proportion of axes extent covered by each rug element. Can be negative. default is 4.
sns_aspect (scalar (float, int)) – Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches. default is .7
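
The classes parameter described above turns numerical flow rates into FR categories. A minimal sketch of that binning, assuming the bounds [0., 1., 3.] from the description and non-negative flow rates, is given below; `map_flow` is a hypothetical helper, not the routine used by QuickPlot.

```python
from bisect import bisect_left


def map_flow(values, classes=(0.0, 1.0, 3.0)):
    """Return FR labels for flow rates (m3/h) given ascending class bounds."""
    bounds = sorted(classes)
    # bisect_left puts a value equal to a bound in the lower class,
    # which matches intervals of the form "lower < FR <= upper".
    return [f"FR{bisect_left(bounds, v)}" for v in values]


print(map_flow([0.0, 0.4, 1.0, 2.5, 3.0, 8.0]))
# ['FR0', 'FR1', 'FR1', 'FR2', 'FR2', 'FR3']
```

Passing other bounds, for instance (0.0, 3.0, 6.0), follows the same logic for a different hydraulic classification.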

Returns:

self – returns self for easy method chaining.

Return type:

Baseclass instance

Examples

>>> from watex.view.plot import  QuickPlot
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> qkObj = QuickPlot(  leg_kws= dict( loc='upper right'),
...          fig_title = '`sfi` vs`ohmS|`geol`',
...            )
>>> qkObj.tname='flow' # target the DC-flow rate prediction dataset
>>> qkObj.mapflow=True  # to hold category FR0, FR1 etc..
>>> qkObj.fit(data)
>>> sns_pkws= dict ( aspect = 2 ,
...          height= 2,
...                  )
>>> map_kws= dict( edgecolor="w")
>>> qkObj.discussingfeatures(features =['ohmS', 'sfi','geol', 'flow'],
...                           map_kws=map_kws,  **sns_pkws
...                         )

barcatdist(basic_plot=True, groupby=None, **kws)[source]#

Bar plot distribution.

Plots a categorical distribution according to the occurrence of the target in the data.

Parameters:

basic_plot (bool,) – Plot only the occurrence of the targeted columns using the matplotlib.pyplot.bar function.
groupby (list or dict, optional) –
Group features for plotting. For instance, it plots other features located in the df columns. The features to plot can be given as a list, in which case the default plot properties are used. To customize the plot, one may provide the features as a dict with convenient properties like:
```
groupby = ['shape', 'type']
# or, with custom properties per feature:
groupby = {'type': {'color': 'b', 'width': 0.25, 'sep': 0.},
           'shape': {'color': 'g', 'width': 0.25, 'sep': 0.25}}
```
kws (dict,) – Additional keyword arguments from seaborn.countplot
data (str or pd.core.DataFrame) – Path-like object or Dataframe. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, `QuickPlot` reads and sanitizes data before plotting. Be aware in this case to provide the target name and possibly the classes for data inspection. Both str and dataframe inputs need the name of the target.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

The data argument must be passed to the fit method; the data parameter is not accepted by the other QuickPlot methods. The description of the data parameter is given here only as a synopsis of the kind of data the plot expects. An error is raised if data is passed as a keyword argument.

Examples

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qplotObj= QuickPlot(xlabel = 'Anomaly type',
...                     ylabel='Number of occurrence (%)',
...                     lc='b', tname='flow')
>>> qplotObj.sns_style = 'darkgrid'
>>> qplotObj.fit(data)
>>> qplotObj.barcatdist(basic_plot =False,
...                      groupby=['shape' ])
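
The counts behind such a bar plot can be sketched with plain pandas. This is only an illustration of the idea (occurrence of the target, optionally broken down by a grouping feature), not the code QuickPlot runs; the toy data and the helper name `occurrence_table` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the example above.
df = pd.DataFrame({
    "shape": ["V", "W", "V", "U", "W", "V"],
    "flow": ["FR0", "FR1", "FR1", "FR2", "FR0", "FR1"],
})


def occurrence_table(frame, target, groupby=None):
    """Count target occurrences, optionally per level of a grouping column.

    `occurrence_table` is a hypothetical helper, not part of watex.
    """
    if groupby is None:
        # Basic plot: one bar per target class.
        return frame[target].value_counts().sort_index()
    # Grouped plot: one row per group level, one column per target class.
    return pd.crosstab(frame[groupby], frame[target])


basic = occurrence_table(df, "flow")
grouped = occurrence_table(df, "flow", groupby="shape")
print(basic)
print(grouped)
```

Each row of `grouped` would correspond to one cluster of bars, which is roughly what a count plot with a grouping variable displays.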

corrmatrix(cortype='num', features=None, method='pearson', min_periods=1, **sns_kws)[source]#

Quickly plots the correlation matrix of the numerical or categorical features.

Set features by providing the names of features for visualization.

Parameters:

cortype (str,) – The type of features whose correlations are visualized. Can be num for numerical features and cat for categorical features. Default is num for quantitative values.
method (str,) – The correlation method. Can be ‘spearman’ or ‘pearson’. Default is pearson.
features (List, optional) – List of the names of the features for correlation analysis. If given, the names must belong to the dataframe columns, otherwise an error will occur. If the features are valid, the dataframe is shrunk to these features before the correlation plot.
min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation. For more details refer to https://www.geeksforgeeks.org/python-pandas-dataframe-corr/
sns_kws (dict) – Other seaborn heatmap arguments. Refer to https://seaborn.pydata.org/generated/seaborn.heatmap.html
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Example

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qplotObj = QuickPlot().fit(data)
>>> sns_kwargs ={'annot': False,
...            'linewidth': .5,
...            'center':0 ,
...            # 'cmap':'jet_r',
...            'cbar':True}
>>> qplotObj.corrmatrix(cortype='cat', **sns_kwargs)
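
The parameters above map closely onto pandas' own `DataFrame.corr`. The sketch below shows, under that assumption, how a feature subset could be validated and reduced before the correlation is computed; the helper `correlation` and the synthetic data are hypothetical and only illustrate the described behaviour.

```python
import numpy as np
import pandas as pd

# Synthetic numerical data; the column names are borrowed from the examples.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
df = pd.DataFrame({
    "ohmS": x,
    "sfi": 2.0 * x + rng.normal(scale=0.1, size=50),
    "lwi": rng.normal(size=50),
})


def correlation(frame, features=None, method="pearson", min_periods=1):
    """Shrink the frame to `features` (if given) and compute correlations.

    Hypothetical helper, not the watex implementation.
    """
    if features is not None:
        missing = [f for f in features if f not in frame.columns]
        if missing:
            raise ValueError(f"Unknown features: {missing}")
        frame = frame[features]
    return frame.corr(method=method, min_periods=min_periods)


corr = correlation(df, features=["ohmS", "sfi"])
print(corr)
```

The resulting square matrix is what a heatmap function such as `seaborn.heatmap` would then render.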

property data#

discussingfeatures(features, *, map_kws=None, map_func=None, **sns_kws)[source]#

Takes the names of at least four features and plots their distributions for discussion.

This method maps a dataset onto multiple axes arrayed in a grid of rows and columns that correspond to levels of features in the dataset. The plots produced are often called “lattice”, “trellis”, or ‘small-multiple’ graphics.

Parameters:

features (list) –

List of features to discuss. The recommended number of features for a better analysis is four (04), arranged as below:

features_disposal = [‘x’, ‘y’, ‘col’, ‘target|hue’]

where:

x is the feature placed on the x-axis, default is ohmS
y is the feature placed on the y-axis, default is sfi
col is the feature used for the column subsets, default is col
target or hue is the feature for the targeted examples, default is flow

If three (03) features are given, the last one is considered as the target.

map_kws (dict, optional) – Extra keyword arguments for the mapping plot.
map_func (callable, optional) – Callable object used as the plot style function. Can be a ‘matplotlib-pyplot’ function like plt.scatter or a ‘seaborn-scatterplot’ function like sns.scatterplot. The default is sns.scatterplot.
sns_kws (dict, optional) – Keyword arguments to control what visual semantics are used to identify the different subsets. For more details, please consult <http://seaborn.pydata.org/generated/seaborn.FacetGrid.html>.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:: Returns self for easy method chaining.
Return type:: QuickPlot instance

Notes

Examples

>>> from watex.view.plot import  QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qkObj = QuickPlot(  leg_kws={'loc':'upper right'},
...          fig_title = '`sfi` vs`ohmS|`geol`',
...            )
>>> qkObj.tname='flow' # target the DC-flow rate prediction dataset
>>> qkObj.mapflow=True  # to hold category FR0, FR1 etc..
>>> qkObj.fit(data)
>>> sns_pkws={'aspect':2 ,
...          "height": 2,
...                  }
>>> map_kws={'edgecolor':"w"}
>>> qkObj.discussingfeatures(features =['ohmS', 'sfi','geol', 'flow'],
...                           map_kws=map_kws,  **sns_pkws
...                         )

fit(data, y=None)[source]#

Fit data and populate the attributes for plotting purposes.

Parameters:

data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.
y (array-like, optional) –

Array of the target. Must be the same length as the data. If y is provided and data is given as a str or a DataFrame, all the data is considered as the X data for analysis.

Returns:

self – Returns self for easy method chaining.

Return type:

QuickPlot instance

Examples

>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> from watex.view.plot import QuickPlot
>>> qplotObj= QuickPlot(xlabel = 'Flow classes in m3/h',
...                     ylabel='Number of occurrence (%)')
>>> qplotObj.tname= None # with the name of the target set to None
>>> qplotObj.fit(data)
>>> qplotObj.data.iloc[1:2, :]
   num name      east      north  ...         ohmS        lwi      geol flow
1  2.0   b2  791227.0  1159566.0  ...  1135.551531  21.406531  GRANITES  0.0
>>> qplotObj.tname= 'flow'
>>> qplotObj.mapflow= True # map the flow from num. values to categ. values
>>> qplotObj.fit(data)
>>> qplotObj.data.iloc[1:2, :]
   num name      east      north  ...         ohmS        lwi      geol flow
1  2.0   b2  791227.0  1159566.0  ...  1135.551531  21.406531  GRANITES  FR0
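
The effect of mapflow, turning numerical flow rates into classes such as FR0, FR1 and so on, can be approximated with `pandas.cut`. The class boundaries below are assumptions chosen for illustration only; the thresholds actually applied by watex are not stated in this section.

```python
import pandas as pd

# Hypothetical flow rates in m3/h.
flow = pd.Series([0.0, 0.5, 1.0, 2.4, 3.0, 7.2], name="flow")

# Assumed class boundaries (right-closed intervals); adapt them to the project.
bins = [-float("inf"), 0.0, 1.0, 3.0, float("inf")]
labels = ["FR0", "FR1", "FR2", "FR3"]

# Each value falls in (lower, upper]: 0.0 -> FR0, (0, 1] -> FR1, and so on.
flow_classes = pd.cut(flow, bins=bins, labels=labels, right=True)
print(flow_classes.tolist())
```

Once the target is categorical, the distribution plots of this class can count occurrences per class instead of handling continuous values.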

histcatdist(stacked=False, **kws)[source]#

Histogram plot distribution.

Plots the distribution of the categorized classes according to their percentage of occurrence.

Parameters:

stacked (bool) – Pile the bins on top of one another as cumulative values. Default is False.
bins (int, optional) – an integer, a sequence, or a string defining the bins
range (list, optional) – the lower and upper range of the bins
density (bool, optional) – a boolean; whether to plot a probability density rather than counts
weights (array-like, optional) – an array of weights, of the same shape as data
bottom (float, optional) – the location of the bottom baseline of each bin
histtype (str, optional) – the type of histogram to draw. {‘bar’, ‘barstacked’, ‘step’, ‘stepfilled’}
align (str, optional) – controls how the histogram is plotted. {‘left’, ‘mid’, ‘right’}
rwidth (float, optional,) – the relative width of the bars as a fraction of the bin width
log (bool, optional) – whether to set the histogram axis to a log scale
color (str, optional) – a color spec or sequence of color specs, one per dataset
label (str , optional) – a string, or sequence of strings to match multiple datasets
normed (bool, optional) – an optional boolean parameter. The density keyword argument is used instead.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qplotObj= QuickPlot(xlabel = 'Flow classes',
...                     ylabel='Number of occurrence (%)',
...                     lc='b', tname='flow')
>>> qplotObj.sns_style = 'darkgrid'
>>> qplotObj.fit(data)
>>> qplotObj.histcatdist()

property inspect#: Inspect whether the object is fitted or not

joint2features(features, *, join_kws=None, marginals_kws=None, **sns_kws)[source]#

Visualizes the correlation between two features in a joint plot.

Draw a plot of two features with bivariate and univariate graphs.

Parameters:

features (list) – List of numerical features to plot for correlation analysis. An error is raised if the features do not exist in the data.
join_kws (dict, optional) – Additional keyword arguments passed to the function used to draw the plot on the joint Axes, superseding items in the joint_kws dictionary.
marginals_kws (dict, optional) – Additional keyword arguments passed to the function used to draw the plot on the marginal Axes.
sns_kws (dict, optional) – Keyword arguments of the seaborn jointplot method. Refer to <http://seaborn.pydata.org/generated/seaborn.jointplot.html> for more details about useful kwargs to customize plots.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qkObj = QuickPlot( lc='b', sns_style ='darkgrid',
...             fig_title='Quantitative features correlation'
...             ).fit(data)
>>> sns_pkws={
...            'kind':'reg' , #'kde', 'hex'
...            # "hue": 'flow',
...               }
>>> joinpl_kws={"color": "r",
...             'zorder':0, 'levels':6}
>>> plmarg_kws={'color':"r", 'height':-.15, 'clip_on':False}
>>> qkObj.joint2features(features=['ohmS', 'lwi'],
...            join_kws=joinpl_kws, marginals_kws=plmarg_kws,
...            **sns_pkws,
...            )

multicatdist(*, x=None, col=None, hue=None, targets=None, x_features=None, y_features=None, kind='count', **kws)[source]#

Figure-level interface for drawing multiple categorical distributions plots onto a FacetGrid.

Draws multiple categorical plots from the targeted pd.Series.

Parameters:

x, y, hue (list, optional) – Names of variables in data. Inputs for plotting long-form data. See examples for interpretation. Here they can correspond to x_features, y_features and targets from the dataframe. Note that each column item can be used as an element of x, y or hue. For instance, x_features refers to the x-axis features; it must contain at least one feature and be given as a list. The y_features might match the column names for sns.catplot. If there is more than one feature, collecting all of them in a list is recommended. The y should fit the sns.catplot argument hue. Like the others, it should be given as a list if there is more than one feature.
row, col (names of variables in data, optional) – Categorical variables that will determine the faceting of the grid.
col_wrap (int) – “Wrap” the column variable at this width, so that the column facets span multiple rows. Incompatible with a row facet.
estimator (string or callable that maps vector -> scalar, optional) – Statistical function to estimate within each categorical bin.
errorbar (string, (string, number) tuple, or callable) – Name of errorbar method (either “ci”, “pi”, “se”, or “sd”), or a tuple with a method name and a level parameter, or a function that maps from a vector to a (min, max) interval.
n_boot (int, optional) – Number of bootstrap samples used to compute confidence intervals.
units (name of variable in data or vector data, optional) – Identifier of sampling units, which will be used to perform a multilevel bootstrap and account for repeated measures design.
seed (int, numpy.random.Generator, or numpy.random.RandomState, optional) – Seed or random number generator for reproducible bootstrapping.
order (lists of strings, optional) – Order to plot the categorical levels in; otherwise the levels are inferred from the data objects.
hue_order (lists of strings, optional) – Order to plot the categorical levels in; otherwise the levels are inferred from the data objects.
row_order (lists of strings, optional) – Order to organize the rows and/or columns of the grid in, otherwise the orders are inferred from the data objects.
col_order (lists of strings, optional) – Order to organize the rows and/or columns of the grid in, otherwise the orders are inferred from the data objects.
height (scalar) – Height (in inches) of each facet. See also: aspect.
aspect (scalar) – Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches.
kind (str, optional) – The kind of plot to draw, corresponds to the name of a categorical axes-level plotting function. Options are: “strip”, “swarm”, “box”, “violin”, “boxen”, “point”, “bar”, or “count”.
native_scale (bool, optional) – When True, numeric or datetime values on the categorical axis will maintain their original scaling rather than being converted to fixed indices.
formatter (callable, optional) – Function for converting categorical data into strings. Affects both grouping and tick labels.
orient ("v" | "h", optional) – Orientation of the plot (vertical or horizontal). This is usually inferred based on the type of the input variables, but it can be used to resolve ambiguity when both x and y are numeric or when plotting wide-form data.
color (matplotlib color, optional) – Single color for the elements in the plot.
palette (palette name, list, or dict) – Colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.
hue_norm (tuple or matplotlib.colors.Normalize object) – Normalization in data units for colormap applied to the hue variable when it is numeric. Not relevant if hue is categorical.
legend (str or bool, optional) – Set to False to disable the legend. With strip or swarm plots, this also accepts a string, as described in the axes-level docstrings.
legend_out (bool) – If True, the figure size will be extended, and the legend will be drawn outside the plot on the center right.
sharex, sharey (bool, 'col', or 'row', optional) – If true, the facets will share y axes across columns and/or x axes across rows.
margin_titles (bool) – If True, the titles for the row variable are drawn to the right of the last column. This option is experimental and may not work in all cases.
facet_kws (dict, optional) – Dictionary of other keyword arguments to pass to FacetGrid.
kwargs (key, value pairings) – Other keyword arguments are passed through to the underlying plotting function.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qplotObj= QuickPlot(lc='b', tname='flow')
>>> qplotObj.sns_style = 'darkgrid'
>>> qplotObj.mapflow=True # to categorize the flow rate
>>> qplotObj.fit(data)
>>> fdict={
...            'x':['shape', 'type', 'type'],
...            'col':['type', 'geol', 'shape'],
...            'hue':['flow', 'flow', 'geol'],
...            }
>>> qplotObj.multicatdist(**fdict)

naiveviz(x=None, y=None, kind='scatter', s_col='lwi', leg_kws={}, **pd_kws)[source]#

Creates a plot to visualize the samples distributions according to the geographical coordinates x and y.

Parameters:

x (str ,) – Column name to hold the x-axis values
y (str,) – Column name to hold the y-axis values
s_col (str) – Column used to size the scatter points. Default is fs times the values of the feature column lwi.
pd_kws (dict, optional,) – Pandas plot keyword arguments
leg_kws (dict, optional) – Matplotlib legend keyword arguments
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> import matplotlib.pyplot as plt
>>> from watex.transformers import StratifiedWithCategoryAdder
>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> df = load_bagoue ().frame
>>> stratifiedNumObj= StratifiedWithCategoryAdder('flow')
>>> strat_train_set , *_= stratifiedNumObj.fit_transform(X=df)
>>> pd_kws ={'alpha': 0.4,
...         'label': 'flow m3/h',
...         'c':'flow',
...         'cmap':plt.get_cmap('jet'),
...         'colorbar':True}
>>> qkObj=QuickPlot(fs=25.)
>>> qkObj.fit(strat_train_set)
>>> qkObj.naiveviz( x= 'east', y='north', **pd_kws)

numfeatures(features=None, coerce=False, map_lower_kws=None, **sns_kws)[source]#

Plots the distribution of quantitative features together with their pairwise correlations. Be sure to provide numerical features as data arguments.

Parameters:

features (list) – List of numerical features to plot for correlation analysis. An error is raised if the features do not exist in the data.
coerce (bool,) – Constrain the data to read all features and keep only the numerical values. An error occurs if False and the data contains some non-numerical features. Default is False.
map_lower_kws (dict, Optional) – A way to customize the plot. It is a dictionary of sns.pairplot map_lower keyword arguments. If the diagram kind is kde, the plot is customized with the provided map_lower_kws arguments. If None, the method checks whether the diag_kind argument in sns_kws is kde before triggering the plotting map.
sns_kws (dict,) – Keyword arguments of seaborn pairplot. Refer to http://seaborn.pydata.org/generated/seaborn.pairplot.html for further details.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> from watex.view.plot import QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qkObj = QuickPlot(mapflow =False, tname='flow'
...                       ).fit(data)
>>> qkObj.sns_style ='darkgrid'
>>> qkObj.fig_title='Quantitative features correlation'
>>> sns_pkws={'aspect':2 ,
...          "height": 2,
...          # 'markers':['o', 'x', 'D', 'H', 's',
...          #            '^', '+', 'S'],
...          'diag_kind':'kde',
...          'corner':False,
...          }
>>> marklow = {'levels':4,
...          'color':".2"}
>>> qkObj.numfeatures(coerce=True, map_lower_kws=marklow, **sns_pkws)

scatteringfeatures(features, *, relplot_kws=None, **sns_kws)[source]#

Draws a scatter plot with the possibility of several semantic feature groupings.

Indeed scatteringfeatures analysis is a process of understanding how features in a dataset relate to each other and how those relationships depend on other features. Visualization can be a core component of this process because, when data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.

Parameters:

features (list) – List of numerical features to plot for correlation analysis. An error is raised if the features do not exist in the data.
relplot_kws (dict, optional) – Extra keyword arguments to show the relationship between two features with semantic mappings of subsets. Refer to <http://seaborn.pydata.org/generated/seaborn.relplot.html#seaborn.relplot> for more details.
sns_kws (dict, optional) – Keyword arguments to control what visual semantics are used to identify the different subsets. For more details, please consult <http://seaborn.pydata.org/generated/seaborn.scatterplot.html>.
data (str or pd.core.DataFrame) – Path-like object or DataFrame. Long-form (tidy) dataset for plotting. Each column should correspond to a variable, and each row should correspond to an observation. If data is given as a path-like object, QuickPlot reads and sanitizes the data before plotting. In this case, be sure to provide the target name and, possibly, the classes for data inspection. Whether data is a str or a DataFrame, the name of the target must be provided.

Returns:

Returns self for easy method chaining.

Return type:

QuickPlot instance

Notes

Examples

>>> from watex.view.plot import  QuickPlot
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue ().frame
>>> qkObj = QuickPlot(lc='b', sns_style ='darkgrid',
...             fig_title='geol vs level of water inflow',
...             xlabel='Level of water inflow (lwi)',
...             ylabel='Flow rate in m3/h'
...            )
>>>
>>> qkObj.tname='flow' # target the DC-flow rate prediction dataset
>>> qkObj.mapflow=True  # to hold category FR0, FR1 etc..
>>> qkObj.fit(data)
>>> marker_list= ['o','s','P', 'H']
>>> markers_dict = {key:mv for key, mv in zip( list (
...                       dict(qkObj.data ['geol'].value_counts(
...                           normalize=True)).keys()),
...                            marker_list)}
>>> sns_pkws={'markers':markers_dict,
...          'sizes':(20, 200),
...          "hue":'geol',
...          'style':'geol',
...         "palette":'deep',
...          'legend':'full',
...          # "hue_norm":(0,7)
...            }
>>> regpl_kws = {'col':'flow',
...             'hue':'lwi',
...             'style':'geol',
...             'kind':'scatter'
...            }
>>> qkObj.scatteringfeatures(features=['lwi', 'flow'],
...                         relplot_kws=regpl_kws,
...                         **sns_pkws,
...                    )

class watex.ResistivityProfiling(station=None, dipole=10.0, auto=False, constraints=None, coerce=False, force=False, **kws)[source]#

Bases: ElectricalMethods

Class deals with the Electrical Resistivity Profiling (ERP).

Electrical resistivity profiling is one of the cheapest geophysical subsurface imaging methods. It is the preferred method for finding groundwater during drinking water supply campaigns, especially in developing countries. Commonly, it is used in combination with Vertical Electrical Sounding to speculate about the layer thicknesses and the existence of a fracture zone.

Parameters:

station: str

Station name where the drilling is expected to be located. Stations should be numbered from 1, not 0. So if S00 is given, the station name should be set to S01. Moreover, if the dipole value is set as a keyword argument, the station is named according to the value of the dipole. For instance, for a dipole equal to 10 m, the first station should be S00, the second S10, the third S20 and so on. However, it is recommended to name the stations using counting numbers rather than the dipole position.

dipole: float

The dipole length used during the survey of the exploration area.

auto: bool

Automatically detect the best conductive zone. If True, the station position is set to the station with the lowest resistivity value in the Electrical Resistivity Profiling.

constraints: list or dict,

It lists the restrictions observed on the site during the survey. Any station close to a restricted area must be listed and is ignored when the best location for drilling operations is automatically detected. Restricted stations can be enumerated as a dictionary with key='restricted station' and value='reason why the station must be ignored'. For instance:
constraints ={'S10': 'Heritage site, no authorization for drilling',
              'S25': 'Close to the household waste',
              "S45": 'Station close to a municipality domain',
              'S50': 'Marsh area',
              ...
              }

Note that constraints is mostly needed when the automatic detection is triggered. However, it can also be coerced when the station is explicitly defined.
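
How constraints and automatic detection interact can be sketched in a few lines: pick the station with the lowest resistivity among those that are not restricted. This is a simplified illustration and not the detection logic of ResistivityProfiling; the function name `best_station` and the resistivity values are hypothetical.

```python
def best_station(stations, resistivity, constraints=None):
    """Return (name, value) of the least resistive unrestricted station.

    Hypothetical helper; `constraints` may be a list of station names or a
    dict mapping station names to the reason they must be ignored.
    """
    restricted = set(constraints or ())  # iterating a dict yields its keys
    candidates = [
        (rho, name)
        for name, rho in zip(stations, resistivity)
        if name not in restricted
    ]
    if not candidates:
        raise ValueError("All stations are restricted.")
    rho, name = min(candidates)  # lowest resistivity first
    return name, rho


stations = ["S10", "S20", "S25", "S30", "S45", "S50"]
resistivity = [80.0, 150.0, 60.0, 95.0, 70.0, 200.0]  # made-up values in ohm.m
constraints = {
    "S10": "Heritage site, no authorization for drilling",
    "S25": "Close to the household waste",
    "S45": "Station close to a municipality domain",
    "S50": "Marsh area",
}

print(best_station(stations, resistivity))               # no restriction
print(best_station(stations, resistivity, constraints))  # restricted stations skipped
```

Without constraints the global minimum is selected; with them, the choice falls back to the best station that can actually be drilled.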

force: bool, default=False,

By default, ResistivityProfiling expects users to provide either DC objects or a pandas dataframe. This supposes that users have already transformed their data from sheets to a data frame. If this is not the case, setting force to True constrains the algorithm to do both tasks at once.

New in version 0.2.0.

kws: dict

Additional Electrical Resistivity Profiling keywords arguments


fit(data, **fit_params)[source]#

Fits the ResistivityProfiling and populates the class attributes.

Parameters:

data (Path-like obj, Array, Series, Dataframe.) – Data containing the resistivity values collected in the survey area.
columns (list,) – Only necessary if the data is given as an array. There is no need to define it explicitly when data is a dataframe or a path-like object.
fit_params (dict,) – Additional keyword arguments; e.g. to force the station to match at least the best minimal resistivity value in the whole data collected in the survey area.

Returns:

self

Return type:

object instantiated for chaining methods.

Notes

Stations should be numbered from 1, not 0. So if S00 is given, the station name should be set to S01. Moreover, if the dipole value is set as a keyword argument, the station is named according to the value of the dipole. For instance, for a dipole equal to 10 m, the first station should be S00, the second S10, the third S20 and so on. However, it is recommended to name the stations using counting numbers rather than the dipole position.

property inspect#: Inspect whether the object is fitted or not

plotAnomaly(**plot_kws)[source]#

Plot the best conductive zone found in the Electrical Resistivity Profiling

Parameters:: plot_kws – dict, additional keyword arguments passed to plotAnomaly().

summary(keep_params=False, return_table=False)[source]#

Summarizes the most important parameters for prediction purposes.

Parameters:

keep_params (bool, default=False,) – If keep_params is set to True, the method outputs only the main parameters important for prediction purposes. Otherwise, it returns all main DC-resistivity attributes.
return_table (bool, default=False,) – Returns the attributes of the parameters in a pandas dataframe.

Returns:

self or table_ – Returns the DC-profiling object or a dataframe.

Return type:

ResistivityProfiling or pd.DataFrame

class watex.SequentialBackwardSelection(estimator=None, k_features=1, scoring='accuracy', test_size=0.25, random_state=42)[source]#

Bases: _Base

Sequential Backward Selection (SBS) is a feature selection algorithm which aims to reduce the dimensionality of the initial feature subspace with a minimum decay in the performance of the classifier, in order to improve computational efficiency. In certain cases, SBS can even improve the predictive power of the model if the model suffers from overfitting.

The idea behind SBS is simple: it sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, a criterion function $J$ to be minimized is needed [1]. The criterion calculated by the criterion function can simply be the difference in performance of the classifier before and after the removal of a particular feature. The feature to be removed at each stage can then simply be defined as the feature that maximizes this criterion; or, in simpler terms, at each stage the feature whose removal causes the least performance loss is eliminated. Based on the preceding definition of SBS, the algorithm can be outlined in a few steps:

1. Initialize the algorithm with $k=d$, where $d$ is the dimensionality of the full feature space, $X_d$.

2. Determine the feature $x^{-}$ that maximizes the criterion: $x^{-} = \operatorname{argmax} J(X_k - x)$, where $x \in X_k$.

3. Remove the feature $x^{-}$ from the feature set: $X_{k-1} = X_k - x^{-}$; $k = k-1$.

4. Terminate if $k$ equals the number of desired features; otherwise go to step 2. [2]
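
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the watex implementation: the helper names `nearest_centroid_accuracy` and `sequential_backward_selection` are hypothetical, and the criterion $J$ is assumed here to be the validation accuracy of a simple nearest-centroid classifier.

```python
from itertools import combinations

import numpy as np


def nearest_centroid_accuracy(X_train, y_train, X_val, y_val):
    """Hypothetical criterion J: validation accuracy of a nearest-centroid classifier."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every validation sample to every class centroid.
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    y_pred = classes[dists.argmin(axis=1)]
    return float((y_pred == y_val).mean())


def sequential_backward_selection(X_train, y_train, X_val, y_val,
                                  k_features=1,
                                  criterion=nearest_centroid_accuracy):
    """Drop one feature per round until only k_features remain."""
    indices = tuple(range(X_train.shape[1]))          # step 1: k = d
    subsets = [indices]
    scores = [criterion(X_train, y_train, X_val, y_val)]
    while len(indices) > k_features:                  # step 4: stop at the desired k
        candidates = []
        # Step 2: score every subset obtained by removing exactly one feature.
        for subset in combinations(indices, len(indices) - 1):
            cols = list(subset)
            score = criterion(X_train[:, cols], y_train, X_val[:, cols], y_val)
            candidates.append((score, subset))
        # Step 3: keep the subset whose removed feature costs the least performance.
        best_score, indices = max(candidates, key=lambda item: item[0])
        subsets.append(indices)
        scores.append(best_score)
    return indices, subsets, scores
```

The returned `indices`, `subsets` and `scores` loosely mirror the `indices_`, `subsets_` and `scores_` attributes documented below; that correspondence is an assumption made for readability.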

Parameters:

estimator (callable or instantiated object,) – callable or instance object that has a fit method.
k_features (int, default=1) – the desired number of features at which the selection stops. It must be less than the number of features in the training set, otherwise it does not make sense.
scoring (callable or str , default='accuracy') – metric for scoring. Available metrics are ‘precision’, ‘recall’, ‘roc_auc’ or ‘accuracy’. Any other metric will raise an error.
test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
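
As a rough illustration of the two accepted forms of test_size, the sketch below converts either form into a number of test samples. The helper `resolve_test_size` is hypothetical, and rounding the fractional case up is an assumption rather than documented watex behaviour.

```python
import math


def resolve_test_size(test_size, n_samples):
    """Return the number of test samples implied by test_size (hypothetical helper)."""
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie strictly between 0 and 1")
        # Proportion of the dataset; rounding up is an assumption of this sketch.
        return math.ceil(test_size * n_samples)
    if isinstance(test_size, int):
        if not 0 < test_size < n_samples:
            raise ValueError("an int test_size must lie strictly between 0 and n_samples")
        # Absolute number of test samples.
        return test_size
    raise TypeError("test_size must be a float or an int")


n_test = resolve_test_size(0.25, 150)   # proportion form
n_train = 150 - n_test                  # remaining samples are used for training
```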

References

[1]

Raschka, S., Mirjalili, V., 2019. Python Machine Learning, 3rd ed. Packt.

[2]

Ferri, F., Pudil, P., Hatef, M., Kittler, J., 1994. Comparative study of techniques for large-scale feature selection, pages 403-413.

feature_names_in_#

Names of features seen during fit. Defined only when X has feature names that are all strings.

Type:: ndarray of shape (n_features_in_,)

indices_#

Indices of the feature subset of the best validated model.

Type:: tuple of dimensionality X

subsets_#

list of indices_

Type:: list,

scores_#

Scores of the best models obtained during cross-validation.

Type:: list,

k_score_#

The score obtained with the desired number of features.

Type:: float,

Examples

>>> from watex.exlib.sklearn import KNeighborsClassifier , train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> X, y = fetch_data('bagoue analysed') # data already standardized
>>> Xtrain, Xt, ytrain,  yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs= SequentialBackwardSelection (knn)
>>> sbs.fit(Xtrain, ytrain )

fit(X, y)[source]#

Fit the training data

Note that SBS splits the dataset into test and training subsets inside the fit function, although the full $X$ is still fed to the algorithm. SBS then creates new subsets for testing (validation) and training, which is why this test set is also called the validation dataset. This approach is necessary to prevent the original test set from becoming part of the training data.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ) M=m-samples,) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

self – returns self for easy method chaining.

Return type:

SequentialBackwardSelection instance

transform(X)[source]#

Transform the training set

Parameters:: X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
Returns:: X – New transformed training set with the selected feature columns
Return type:: Ndarray ( M x N matrix where M=m-samples, & N=n-features)

class watex.Structural(configfile=None, **kwds)[source]#

Bases: GeoBase

Geological structural conventions class.

Note that the structural objects provided here are far fewer than those found in the literature. More structural objects can be added as new structures become known. All geological structural information is held as geostructural objects.

Holds the following information:

More attributes can be added by passing a keyword dictionary.

Example:

>>> from watex.geology import Structural
>>> s=Structural().fit()
>>> s.boudin_axis.code_
... 'lsb'
>>> s.boudin_axis.name_
... 'Boudin Axis'
>>> s.boudin_axis.color_
... 'R128GB'

fit(configfile=None, **kwd)[source]#: Configure the structural data and set each object as attributes

class watex.Structures(configfile=None, **kwds)[source]#

Bases: GeoBase

This class is an auxiliary class that supplements the geodatabase. If the GeoDataBase does not respond to an SQL request, use this class to search for information about structures. If the SQL request succeeds, the program does not call this class as a fallback. It contains more than 150 geological structures.

Attributes	Type	Explanation
names	array_like	names of all geological structures
codes	array_like	codes of all geological structures
**code	str	code of specific geological structure
**label	str	label of specific structure
**name	str	name of specific structure
**pattern	str	pattern of specific structure
**pat_size	str	pattern size of specific structure
**pat_density	str	pattern density of specific structure
**pat_thickness	str	pattern thickness of specific structure
**color	str	color of specific structure

Note

To get the attribute value, simply replace the “**” placeholder with the name of the structure followed by a dot “.”. See the examples:

Examples

To get the names of the different structures, write the script below:

>>> from watex.geology.geology import Structures
>>> geo_structure = Structures().fit()
>>> geo_structure.names_ # get the list of all geological structures

To extract color and to get the code of structure like tonalite:

>>> from watex.geology.geology import Structures
>>> sobj = Structures().fit()
>>> sobj.tonalite.pat_thickness_
... 0.  # -> not implemented
>>> sobj.tonalite.code_
... 'TNL'
>>> sobj.tonalite.color_
... 'RB128'

codef = ['code', 'label', '__description', 'pattern', 'pat_size', 'pat_density', 'pat_thickness', 'color']#

property coerce#: Force configuration if the automatic retrieval of the property file fails.

fit(**kwd)[source]#

Fit and set the geological structures as object attributes. The kwd arguments refer to the parameters below.

Parameters:

configfile (str) – configuration file from ‘AGS0’ data.
fillna (float,) – value used to fill NaN values in the AGS0 file. The default fill value is 0 for False.

Notes

Each geological structure can be retrieved as an attribute. For instance, to get the code, the label and the pattern density of ‘amphibolite’:

Example

>>> from watex.geology import Structures
>>> sobj = Structures().fit()
>>> sobj.amphibolite.code_
... 'AMP'
>>> sobj.amphibolite.label_
... 'AMP'
>>> sobj.amphibolite.pat_density_
... 0. # not set
>>> # To get all the keys (attributes of the structures), use:
>>> sobj.keys

class watex.TPlot(survey_area=None, distance=50.0, prefix='S', how='py', window_size=5, component='xy', mode='same', method='slinear', out='srho', c=2, **kws)[source]#

Bases: BasePlot

Tensor plot from EM processing data.

TPlot is a tensor (impedance, resistivity and phase) plot class. It explores SEG (Society of Exploration Geophysicists) class data and plots recovered tensors. TPlot methods return an instantiated object that inherits from the watex.property.BasePlot ABC (Abstract Base Class) for visualization.

Parameters:

window_size (int) – the length of the window. Must be greater than 1 and preferably an odd integer. Default is 5.
component (str) – field tensor direction. It can be xx, xy, yx or yy. If arr2d is provided, there is no need to give this argument. It becomes useful when a collection of EDI objects is provided. If not specified, the resistivity and phase values at component xy are fetched for correction by default. Change the component value to get the appropriate data for correction. Default is xy.
mode (str , ['valid', 'same'], default='same') – mode of the border trimming. Should be ‘valid’ or ‘same’. ‘valid’ is used for regular trimming whereas ‘same’ is used for appending the first and last values of resistivity. Any argument other than ‘valid’ is treated as ‘same’. Default is same.
method (str, default slinear) – Interpolation technique to use. Can be nearest or pad. Refer to the documentation of interpolate2d.
out (str) – Value to export. Can be sfactor or tensor, for the correction factors and the impedance tensor respectively. Any other value exports the static-corrected resistivity srho.
c (int,) – A window-width expansion factor that must be input to the filter adaptation process to control the roll-off characteristics of the applied Hanning window. It is recommended to select c between 1 and 4. Default is 2.
distance (float) – The step between two stations/sites. If given, it creates an array of position for plotting purpose. Default value is 50 meters.
prefix (str) – string value to add as prefix of given id. Prefix can be the site name. Default is S.
how (str) – Mode used to index the stations. Default is ‘Python indexing’, i.e. the counting of stations starts at 0. Any other mode starts the counting at 1.
savefig (str, Path-like object,) – savefigure’s name, default is None
fig_dpi (float,) – dots-per-inch resolution of the figure. default is 300
fig_num (int,) – number of the figure.
fig_size (Tuple (int, int) or inch) – size of figure in inches (width, height). default is [5, 5]
fig_orientation (str,) – figure orientation. default is landscape
fig_tile (str,) – figure title. default is None
fs (float,) – size of font of axis tick labels, axis labels are fs+2. default is 6
ls (str,) – line style, it can be [ ‘-’ | ‘.’ | ‘:’ ] . default is ‘-’
lc (str, Optional,) – line color of the plot, default is k
lw (float, Optional,) – line weight of the plot, default is 1.5
alpha (float between 0 < alpha < 1,) – transparency number, default is 0.5,
font_weight (str, Optional) – weight of the font , default is bold.
font_style (str, Optional) – style of the font. default is italic
font_size (float, Optional) – size of font. default is 3.
ms (float, Optional) – size of marker in points. default is 5
marker (str, Optional) – marker of stations default is o.
marker_style (str, Optional) – facecolor of the marker. default is yellow
marker_edgecolor (str, Optional) – edge color of the marker. default is yellow
marker_edgewidth (float, Optional) – width of the marker edge. default is 3.
xminorticks (float, Optional) – minor ticks according to x-axis size. default is 1.
yminorticks (float, Optional) – minor ticks according to y-axis size. default is 1.
bins – histogram element separation between two bars. default is 10.
xlim (tuple (int, int), Optional) – limit of x-axis in plot.
ylim (tuple (int, int), Optional) – limit of y-axis in plot.
xlabel (str, Optional,) – label name of x-axis in plot.
ylabel (str, Optional,) – label name of y-axis in plot.
rotate_xlabel (float, Optional) – angle to rotate xlabel in plot.
rotate_ylabel (float, Optional) – angle to rotate ylabel in plot.
leg_kws (dict, Optional) – keyword arguments of legend. default is empty dict
plt_kws (dict, Optional) – keyword arguments of plot. default is empty dict
glc (str, Optional) – line color of the grid plot, default is k
glw (float, Optional) – line weight of the grid plot, default is 2
galpha (float, Optional,) – transparency number of grid, default is 0.5
gaxis (str ('x', 'y', 'both')) – type of axis to hold the grid, default is both
gwhich (str, Optional) – kind of grid in the plot. default is major
tp_axis (str,) – axis to apply the ticks params. default is both
tp_labelsize (str, Optional) – labelsize of ticks params. default is italic
tp_bottom (bool,) – position at bottom of ticks params. default is True.
tp_labelbottom (bool,) – put label on the bottom of the ticks. default is False
tp_labeltop (bool,) – put label on the top of the ticks. default is True
cb_orientation (str , ('vertical', 'horizontal')) – orientation of the colorbar, default is vertical
cb_aspect (float, Optional) – aspect of the colorbar. default is 20.
cb_shrink (float, Optional) – shrink size of the colorbar. default is 1.0
cb_pad (float,) – pad of the colorbar of plot. default is .05
cb_anchor (tuple (float, float)) – anchor of the colorbar. default is (0.0, 0.5)
cb_panchor (tuple (float, float)) – proportionality anchor of the colorbar. default is (1.0, 0.5)
cb_label (str, Optional) – label of the colorbar.
cb_spacing (str, Optional) – spacing of the colorbar. default is uniform
cb_drawedges (bool,) – draw edges inside of the colorbar. default is False
sns_orient ('v' | 'h', optional) – Orientation of the plot (vertical or horizontal). This is usually inferred based on the type of the input variables, but it can be used to resolve ambiguity when both x and y are numeric or when plotting wide-form data. default is v which refer to ‘vertical’
sns_style (dict, or one of {darkgrid, whitegrid, dark, white, ticks}) – A dictionary of parameters or the name of a preconfigured style.
sns_palette (seaborn color palette | matplotlib colormap | hls | husl) – Palette definition. Should be something color_palette() can process. The palette generates the points with different colors.
sns_height (float,) – Proportion of axes extent covered by each rug element. Can be negative. default is 4.
sns_aspect (scalar (float, int)) – Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches. default is .7
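
To make the interplay of distance, prefix and how in the parameter list above concrete, the sketch below builds station positions and names. The helper `make_stations` is hypothetical, and the two-digit zero padding is an assumption inferred from the 'S00' names used in the examples, not a documented rule.

```python
import numpy as np


def make_stations(n_sites, distance=50.0, prefix="S", how="py"):
    """Build positions (in meters) and names for n_sites stations (hypothetical helper)."""
    # 'py' means Python indexing: station numbering starts at 0; any other mode starts at 1.
    start = 0 if how == "py" else 1
    # Positions are evenly spaced by `distance`, starting at 0.
    positions = np.arange(n_sites) * distance
    # Zero-padded names such as 'S00', 'S01' (padding width is an assumption).
    names = [f"{prefix}{i:02d}" for i in range(start, start + n_sites)]
    return positions, names


positions, names = make_stations(3)
```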

Returns:

self – returns self for easy method chaining.

Return type:

Baseclass instance

Examples

>>> from watex.view.plot import TPlot
>>> from watex.datasets import load_edis
>>> plot_kws = dict( ylabel = '$Log_{10}Frequency [Hz]$',
                    xlabel = '$Distance(m)$',
                    cb_label = r'$Log_{10}Rhoa[\Omega.m$]',
                    fig_size =(6, 3),
                    font_size =7.,
                    rotate_xlabel=45,
                    imshow_interp='bicubic',
                    )
>>> edi_data =load_edis (return_data= True, samples=7 )
>>> t= TPlot(**plot_kws ).fit(edi_data)
>>> t.fit(edi_data ).plot_tensor2d (to_log10=True )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|Data collected =  7      |EDI success. read=  7      |Rate     =  100.0  %|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<AxesSubplot:xlabel='$Distance(m)$', ylabel='$Log_{10}Frequency [Hz]$'>

fit(data)[source]#

Fit data and populate attributes.

Parameters:: data (str, or list or pycsamt.core.edi.Edi object) – Full path to EDI files or collection of EDI-objects
Returns:: ``self`` – returns self for chaining methods.
Return type:: watex.view.plot.TPlot instantiated object

property inspect#: Inspect whether the object is fitted or not.

plotSkew(method='Bahr', view='skew', mode=None, threshold_line=None, show_average_sensistivity=True, suppress_outliers=True, **plot_kws)[source]#

Plot phase-sensitive skew visualization

‘Skew’ is also known as the conventional asymmetry parameter based on the Z magnitude.

Mostly, the EM signal is influenced by several factors such as the dimensionality of the propagation medium and the physical anomalies, which can distort the EM field both locally and regionally. The distortion of Z is determined from the quantification of its asymmetry and the deviation from the conditions that define its dimensionality. The parameters used for this purpose are all rotationally invariant because the Z components involved in their definition are independent of the orientation system used. The conventional asymmetry parameter based on the Z magnitude is the skew defined by Swift (1967) [1] and Bahr (1991) [2].
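
For orientation, the two skew parameters can be sketched with NumPy as below. The formulas are the ones commonly quoted in the magnetotelluric literature for the Swift and Bahr skews and are not taken from the watex source; the function names are hypothetical, and the classification thresholds are those documented for the method and threshold_line parameters.

```python
import numpy as np


def _commutator(a, b):
    """Bracket [a, b] = Re(a) Im(b) - Im(a) Re(b) used in Bahr's skew."""
    return a.real * b.imag - a.imag * b.real


def swift_skew(z):
    """Swift skew |Zxx + Zyy| / |Zxy - Zyx| for an impedance array of shape (..., 2, 2)."""
    z = np.asarray(z, dtype=complex)
    s1 = z[..., 0, 0] + z[..., 1, 1]
    d2 = z[..., 0, 1] - z[..., 1, 0]
    return np.abs(s1) / np.abs(d2)


def bahr_skew(z):
    """Phase-sensitive skew sqrt(|[D1, S2] - [S1, D2]|) / |D2|."""
    z = np.asarray(z, dtype=complex)
    s1 = z[..., 0, 0] + z[..., 1, 1]
    s2 = z[..., 0, 1] + z[..., 1, 0]
    d1 = z[..., 0, 0] - z[..., 1, 1]
    d2 = z[..., 0, 1] - z[..., 1, 0]
    return np.sqrt(np.abs(_commutator(d1, s2) - _commutator(s1, d2))) / np.abs(d2)


def classify_skew(eta, method="bahr"):
    """Map a skew value to a dimensionality label using the documented thresholds."""
    if method == "swift":
        return "1D/2D" if eta < 0.4 else "3D"
    if eta > 0.3:
        return "3D"
    if eta >= 0.1:
        return "modified 3D/2D"
    return "1D, 2D or distorted 2D"
```

An ideal 1D tensor (zero diagonal, antisymmetric off-diagonal) yields a skew of zero with both definitions.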

Parameters:

method (str, default='Bahr') –
Kind of correction. Can be:
- swift for the distortion removal proposed by Swift in 1967. A value close to 0 indicates 1D or 2D structures, and 3D otherwise. In the general case, however, an electrical structure with $\eta < 0.4$ can be treated as a 2D medium.
- bahr for the distortion removal proposed by Bahr in 1991. The threshold is set to 0.3. Above this value the structure is 3D.
view (str, default='skew') – phase-sensitive visualization. Can also be the rotational invariant, invariant. In fact, setting it to mu or invariant does not change the interpretation, since the distortions of Z are all rotationally invariant whether the Bahr or Swift method is used.
mode (str, optional) – X-axis coordinates for visualization. Plot either 'frequency' or 'periods'. The default is 'frequency'.
threshold_line (float, optional) –
Visualize the threshold line. Can be [‘bahr’, ‘swift’, ‘both’]:
- When method is set to swift, a value close to $0$ indicates 1D or 2D structures ($\eta <0.4$), and 3D otherwise ($\eta >0.4$). The threshold line for swift is set to $0.4$.
- When method is set to Bahr, $\eta > 0.3$ indicates 3D structures, $[0.1 - 0.3]$ indicates modified 3D/2D structures, and $<0.1$ indicates 1D, 2D or distorted 2D structures.
show_average_sensistivity (bool, default=True) – Display the average value of the skew data over all frequencies. This value can help with the dimensionality interpretation.
suppress_outliers (bool, default=True) – Remove the outliers in the data, if any. It uses the interquartile range (IQR) approach. See the documentation of watex.utils.remove_outliers(). This is useful for a clear interpretation using the skew threshold value.
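
The IQR approach mentioned for suppress_outliers can be sketched as follows. This is a generic illustration, not the body of watex.utils.remove_outliers(): the helper name `remove_outliers_iqr` and the conventional fence factor of 1.5 are assumptions.

```python
import numpy as np


def remove_outliers_iqr(values, factor=1.5):
    """Keep only values inside [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Boolean mask of the samples that fall within the two fences.
    keep = (values >= q1 - factor * iqr) & (values <= q3 + factor * iqr)
    return values[keep]


cleaned = remove_outliers_iqr([1, 2, 3, 4, 100])
```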

See also

watex.methods.Processing.skew: For mathematical skew Bahr and Swift concept formulations.
watex.utils.plot_skew: For phase-sensitive skew visualization - naive plot.

Examples

>>> import watex
>>> test_data = watex.fetch_data ('edis', samples =37, return_data =True )
>>> watex.TPlot(fig_size =(10,  4), marker ='x').fit(
    test_data).plotSkew(method ='swift', threshold_line=True)

References

[1]

Swift, C., 1967. A magnetotelluric investigation of an electrical conductivity anomaly in the southwestern United States. Ph.D. Thesis, MIT Press. Cambridge.

[2]

Bahr, K., 1991. Geological noise in magnetotelluric data: a classification of distortion types. Physics of the Earth and Planetary Interiors 66 (1–2), 24–38.

plot_corrections(fltr='ama', ss_fx=None, ss_fy=None, r=1000.0, nfreq=21, skipfreq=5, tol=0.12, rotate=0.0, distortion=None, distortion_err=None, mode='TE', scale='period', sites=None, seed=None, how='py', show_site=True, survey=None, style=None, errorbar=True, spad=0.5, n_sites=1, mcolors=None, markers=None, **kws)[source]#

Plot apparent resistivity/phase curves and corrections.

Changed in version 0.2.1: Can henceforth display multiple sites by providing the sites as a collection.

Parameters:

fltr (str , default='ama') –
Type of filter to apply. ss removes the static shift using a spatial median filter, whereas dist is for distortion removal. Note that distortion must be provided in that case, otherwise an error is raised. Can also be [‘tma’|’ama’|’flma’] for EMAP filters.
- tma for trimming moving-average
- ama for adaptive moving-average
- flma for fixed-length moving-average
distortion (np.ndarray(2, 2, dtype=real)) – Real distortion tensor as a 2x2 array.
distortion_err (np.ndarray(2, 2, dtype=real), Optional) – Propagation of errors/uncertainties included.
ss_fx (float, Optional) – static shift factor to be applied to x components (ie z[:, 0, :]). This is assumed to be in resistivity scale. If None should be automatically computed using the spatial median filter.
ss_fy (float, optional) – static shift factor to be applied to y components (ie z[:, 1, :]). This is assumed to be in resistivity scale. If None , should be computed using the spatial filter median.
r (float, default=1000.) – radius to look for nearby stations, in meters.
nfreq (int, default=21) – number of frequencies used to calculate the median static shift. This assumes the first frequency is the highest frequency, because the highest frequencies usually sample a 1D earth.
skipfreq (int, default=5) – number of frequencies to skip from the highest frequency. Sometimes the highest frequencies are not reliable due to noise or low signal in the AMT deadband. This allows you to skip those frequencies.
tol (float, default=0.12) – Tolerance on the median static shift correction. If the data is noisy, the correction factor can be biased away from 1. Therefore tol is used to stop that bias. If 1-tol < correction < 1+tol, the correction factor is set to 1.
rotate (float, default=0.) – Rotate Z array by angle alpha in degrees. All angles are referenced to geographic North, positive in clockwise direction. (Mathematically negative!). In non-rotated state, X refs to North and Y to East direction.
mode (str, default='TE',) – Electromagnetic mode. Can be [‘TM’ |’both’]. If both, components xy and yx are expected in the data.
scale (str, default='period') – Scale used for the axis label. Can also be 'frequency'.
sites (int,str, optional) – index or name of the site to plot. The site name must include a position number, for instance 'S13'. If not provided, a random station is selected instead.
seed (int, optional) – Seed used to select a random site when sites is not provided. Set it to get the same site every time.
how (str, default='py') – The way the site is fetched for plot. For instance, in Python indexing (default), the site is numbered from 0. For instance ‘site05’ will fetch the data at index 4. If this positioning is not wished, set to ‘None’.
show_site (bool, default=True,) – Display the number of site.
survey (str, optional) – Method used for the survey. e.g., ‘AMT’ for Audio-Magnetotellurics.
style (str, default='default') – Matplotlib style.
errorbar (bool, default=True) – display the error bar.
spad (float, default=.5,) – pad to display the station at the top of each section plot. New in version 0.2.1.
n_sites (int, default=1) – Number of random sites to select for visualization. It has no effect if the site names are given.
mcolors (str, list, optional) – The list of colors for resistivity and phase.
markers (str, list, optional) – The list of markers for resistivity and phase.
kws (dict,) – Additional keyword arguments passed to Matplotlib.Axes.Scatter plots.
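
The role of tol in the parameter list above can be illustrated with a short sketch: correction factors that fall strictly inside the band (1-tol, 1+tol) are treated as noise and reset to 1. The helper `apply_shift_tolerance` is hypothetical and only mirrors the rule stated for tol; it is not the watex routine.

```python
import numpy as np


def apply_shift_tolerance(correction, tol=0.12):
    """Reset static shift factors within (1 - tol, 1 + tol) to exactly 1 (hypothetical helper)."""
    correction = np.asarray(correction, dtype=float)
    # Factors this close to 1 are assumed to reflect noise rather than a real shift.
    inside = (correction > 1.0 - tol) & (correction < 1.0 + tol)
    return np.where(inside, 1.0, correction)


factors = apply_shift_tolerance([1.05, 1.3, 0.8])
```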

Examples

>>> import numpy as np
>>> import watex as wx
>>> edi_data = wx.fetch_data ('edis', return_data =True, samples =27)
>>> wx.TPlot(show_grid=True).fit(edi_data).plot_corrections (
    seed =52, )
>>> distortion = np.array([[1.1 , 0.6 ],[0.23, 1.9 ]])
>>> wx.TPlot(show_grid=True).fit(edi_data).plot_corrections (
     seed =52, mode ='tm', fltr ='dist', distortion =distortion
     )

plot_ctensor2d(tensor='res', ffilter='tma', sites=None, to_log10=False)[source]#

Plot filtered tensors

Parameters:

tensor (str , ['res','phase', 'z'], default='res') – kind of tensor to plot. Can be resistivity or phase. If phase, customize your plot to not fit the default ‘res’ behaviour.
ffilter (str ['ama', 'flma', 'tma'], default='tma') – kind of filter used to correct the tensor data.
to_log10 (bool, default=False,) – Convert the resistivity data and frequency to log10.
sites (list of str, optional) – List of station/site names. If given, it must have the same length as the positions in the EDI data, i.e. it must match the number of EDI files successfully read.

Returns:

arr2d: 2D filtered tensor array from the component
freqs: array-like 1d of the frequencies in the survey.
positions: Sites/stations positions. It is equal to the distance between stations times the number of sites.
sites: list of the names of the stations/sites
base_plot_kws: plot keyword arguments inherited from watex.property.BasePlot. It holds the last parameters for customizing the plot as a decorated return function.

Return type:

( arr2d , freqs, positions , sites , base_plot_kws)

Examples

>>> from watex.view.plot import TPlot
>>> from watex.datasets import load_edis
>>> # get some 3 samples of EDI for demo
>>> edi_data = load_edis (return_data =True, samples =3 )
>>> # customize plot by adding plot_kws
>>> plot_kws = dict( ylabel = '$Log_{10}Frequency [Hz]$',
                    xlabel = '$Distance(m)$',
                    cb_label = r'$Log_{10}Rhoa[\Omega.m$]',
                    fig_size =(6, 3),
                    font_size =7.
                    )
>>> t= TPlot(**plot_kws ).fit(edi_data)
>>> # plot filtered tensor using the log10 resistivity
>>> t.plot_ctensor2d (to_log10=True)
<AxesSubplot:xlabel='$Distance(m)$', ylabel='$Log_{10}Frequency [Hz]$'>

plot_multi_recovery(sites, colors=None, **kws)[source]#

Plot multiple sites/stations with signal recovery.

Parameters:

sites (list) – list of sites to visualize. Can also be the index of the sites
colors (list of str) – matplotlib colors to customize the raw signal and recovery signal

Returns:

Return type:

Matplotlib subplot axes

Examples

>>> from watex.view.plot import TPlot
>>> from watex.datasets import load_edis
>>> # takes the 03 samples of EDIs
>>> edi_data = load_edis (return_data= True, samples =3 )
>>> TPlot(fig_size =(5, 3)).fit(edi_data).plot_multi_recovery (
    sites =['S00'], colors =['o', 'ok--'])
<AxesSubplot:title={'center':'Recovered tensor $|Z_{xy}|$'},
xlabel='$Frequency [H_z]$', ylabel='$ App.resistivity \quad xy \quad [ \Omega.m]$'>

plot_phase_tensors(mode='frequency', stretch=(7000, 20), linedir='ns', tensor='phimin', ellipse_dict=None, **kws)[source]#

Plot phase tensor pseudosection and skew ellipse visualization.

This method plots the phase tensor ellipses in a pseudosection format. It uses mtpy as a dependency.

Parameters:

mode (str, default ='frequency') – Temporal scale on the y-axis. Can be [‘frequency’ | ‘period’]
stretch (float or tuple (xstretch, ystretch), default=(7000, 20)) – Factor that scales the distance from one station to the next to make the plot readable. It determines the (x, y) aspect ratio of the plot.
linedir (str [ 'ns' | 'ew' ], default='ns') –
The predominant direction of the profile line. It can be [‘ns’ | ‘ew’] where:
- ‘ns’ refers to a north-south line (or a line closer to north-south)
- ‘ew’ refers to an east-west line (or a line closer to east-west)
Default is ‘ns’
tensor (str, default='phimin') –
The quantity used to color the tensor, skew or ellipse visualization. Tensor can be:
[ ‘phimin’ | ‘phimax’ | ‘skew’ | ‘skew_seg’ | ‘normalized_skew’ | ‘normalized_skew_seg’ | ‘phidet’ | ‘ellipticity’ ]
where:
- ‘phimin’ -> colors by minimum phase
- ‘phimax’ -> colors by maximum phase
- ‘skew’ -> colors by skew
- ‘skew_seg’ -> colors by skew in discrete segments defined by the range
- ‘normalized_skew’ -> colors by skew see [Booker, 2014]
- ‘normalized_skew_seg’ -> colors by normalized skew in discrete segments defined by the range
- ‘phidet’ -> colors by determinant of the phase tensor
- ‘ellipticity’ -> colors by ellipticity
Default is ‘phimin’.
ellipse_dict (dict, optional) –
Dictionary of parameters for the phase tensor ellipses with keys:
- ‘size’: float, default=2. Size of the ellipse in points.
- ‘colorby’: str, default=’phimin’. Quantity used to color the ellipses. It accepts the same values as the tensor parameter: [ ‘phimin’ | ‘phimax’ | ‘skew’ | ‘skew_seg’ | ‘phidet’ | ‘ellipticity’ ]
- ‘range’: tuple (min, max, step), default=’colorby’. At least min and max must be given; when using ‘skew_seg’ to plot discrete values, give step as well.
- ‘cmap’: [ ‘mt_yl2rd’ | ‘mt_bl2yl2rd’ | ‘mt_wh2bl’ | ‘mt_rd2bl’ | ‘mt_bl2wh2rd’ | ‘mt_seg_bl2wh2rd’ | ‘mt_rd2gr2bl’ ]
  - ‘mt_yl2rd’ -> yellow to red
  - ‘mt_bl2yl2rd’ -> blue to yellow to red
  - ‘mt_wh2bl’ -> white to blue
  - ‘mt_rd2bl’ -> red to blue
  - ‘mt_bl2wh2rd’ -> blue to white to red
  - ‘mt_bl2gr2rd’ -> blue to green to red
  - ‘mt_rd2gr2bl’ -> red to green to blue
  - ‘mt_seg_bl2wh2rd’ -> discrete blue to white to red
kws (dict) – Additional keyword arguments passed to the MTpy pseudosection phase tensor class PlotPhaseTensorPseudoSection.
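
As an illustration of the ellipse_dict keys listed above, a dictionary for discrete skew coloring might look like the sketch below. The numeric values for size and range are purely illustrative assumptions, and the commented call supposes a fitted TPlot object named tplot.

```python
# Illustrative ellipse settings; the keys follow the documented ellipse_dict entries.
ellipse_dict = {
    "size": 2,                   # ellipse size in points (documented default)
    "colorby": "skew_seg",       # color by skew in discrete segments
    "range": (-6, 6, 2),         # (min, max, step); step is needed for 'skew_seg'
    "cmap": "mt_seg_bl2wh2rd",   # discrete blue to white to red
}

# Hypothetical usage with a fitted TPlot instance:
# tplot.plot_phase_tensors(tensor="skew_seg", ellipse_dict=ellipse_dict)
```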

See also

mtpy.imaging.phase_tensor_pseudosection.PlotPhaseTensorPseudoSection: Phase tensor pseudosection plot from the MTpy package.
watex.utils.plot_skew: Phase sensitive skew visualization.

Examples

>>> import watex as wx
>>> edi_data = wx.fetch_data ('edis', key='edi', return_data =True , samples =17 )
>>> tplot = wx.TPlot ().fit(edi_data )
>>> tplot.plot_phase_tensors (tensor ='skew')

plot_recovery(site='S00')[source]#

Visualize the restored tensor per site.

Parameters:: site (str, int, default ="S00") – Site/station name or index.
Returns:: ``self`` – returns self for chaining methods.
Return type:: watex.view.plot.TPlot instantiated object

Examples

>>> from watex.view import TPlot
>>> from watex.datasets import load_edis
>>> edi_data = load_edis (return_data =True, samples =7)
>>> plot_kws = dict( ylabel = '$Log_{10}Frequency [Hz]$',
            xlabel = '$Distance(m)$',
            cb_label = r'$Log_{10}Rhoa[\Omega.m$]',
            fig_size =(7, 4),
            font_size =7.
            )
>>> t= TPlot(**plot_kws ).fit(edi_data)
>>> # plot recovery of site 'S01'
>>> t.plot_recovery ('S01')

plot_rhoa(mode='TE', scale='period', site=None, seed=None, how='py', show_site=True, survey=None, style=None, errorbar=True, suppress_outliers=False, **kws)[source]#

Plot apparent resistivity and phase curves

Parameters:

mode (str, default='TE',) – Electromagnetic mode. Can be [‘TM’ |’both’]. If both, components xy and yx are expected in the data.
scale (str, default='period') – Scale used for the axis label. Can also be 'frequency'.
site (int,str, optional) – index or name of the site to plot. The site name must include a position number, for instance 'S13'. If not provided, a random station is selected instead.
seed (int, optional) – If site is not provided, seed is used to fetch a site at random. To fetch the same site every time, set the seed value.
how (str, default='py') – The way the site is fetched for plot. For instance, in Python indexing (default), the site is numbered from 0. For instance ‘site05’ will fetch the data at index 4. If this positioning is not wished, set to ‘None’.
show_site (bool, default=True,) – Display the number of site.
survey (str, optional) – Method used for the survey. e.g., ‘AMT’ for Audio-Magnetotellurics.
style (str, default='default') – Matplotlib style.
errorbar (bool, default=True) – display the error bar.
suppress_outliers (bool, default=False,) – Remove outliers in the data before plotting
kws (dict,) – Additional keyword arguments passed to Matplotlib.Axes.Scatter plots.

Examples

>>> import watex as wx
>>> edi_data = wx.fetch_data ('edis', return_data =True, samples =27)
>>> wx.TPlot(show_grid=True).fit(edi_data).plot_rhoa (
    seed =52, mode ='*')

plot_rhophi(sites=None, mode='TE', scale='period', seed=None, how='py', show_site=True, survey=None, style=None, errorbar=True, suppress_outliers=False, n_sites=1, spad=0.5, **kws)[source]#

Plot resistivities and phases from multiple stations.

Parameters:

mode (str, default='TE',) – Electromagnetic mode. Can be [‘TM’ |’both’]. If both, components xy and yx are expected in the data.
sites (int,str, or list, optional) – A collection of indexes or names of the sites. Each site name must include a position number, for instance 'S13'. If not provided, random sites are selected instead using the n_sites parameter.
scale (str, default='period') – Scale used for the axis label. Can also be 'frequency'.
seed (int, optional) – If sites is not provided, seed is used to fetch sites at random. To fetch the same sites every time, set the seed value.
how (str, default='py') – The way the site is fetched for plot. For instance, in Python indexing (default), the site is numbered from 0. For instance ‘site05’ will fetch the data at index 4. If this positioning is not wished, set to ‘None’.
show_site (bool, default=True,) – Display the number of site.
survey (str, optional) – Method used for the survey. e.g., ‘AMT’ for Audio-Magnetotellurics.
style (str, default='default') – Matplotlib style.
errorbar (bool, default=True) – Display the error bars.
suppress_outliers (bool, default=False) – Remove outliers from the data before plotting.
n_sites (int, default=1) – Number of random sites to select for visualization. It has no effect if the site names are given.
spad (float, default=0.5) –
Padding used to display the station name at the top of each section plot.

New in version 0.2.1.
kws (dict) – Additional keyword arguments passed to Matplotlib.Axes.Scatter plots.
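
The interplay between seed and n_sites can be pictured with a small sketch. It is not the TPlot implementation: the helper name pick_sites and the station names are hypothetical, and the sketch only shows why fixing the seed returns the same stations on every call.

```python
import random

def pick_sites(site_names, n_sites=1, seed=None):
    """Hypothetical helper: draw n_sites distinct station names at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(site_names), n_sites)

stations = [f"S{i:02d}" for i in range(27)]  # made-up station names S00..S26
first = pick_sites(stations, n_sites=3, seed=52)
second = pick_sites(stations, n_sites=3, seed=52)
print(first == second)  # True: same seed, same stations
```

With seed=None the generator is seeded from the system, so two calls would usually return different stations.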

Examples

>>> import watex as wx
>>> edi_data = wx.fetch_data ('edis', return_data =True, samples =27)
>>> wx.TPlot(show_grid=True).fit(edi_data).plot_rhophi (
    seed =52, mode ='*', n_sites =3 )

plot_tensor2d(tensor='res', sites=None, to_log10=False)[source]#

Plot a two-dimensional tensor.

Parameters:

freqs (array-like) – y-coordinates. Frequency array of length N, the same as the number of rows of arr2d. It should be the complete set of frequencies used during the survey.
tensor (str , ['res','phase', 'z'], default='res') – Kind of tensor to plot. Can be resistivity or phase. If phase, customize your plot so that it does not follow the default ‘res’ behaviour.
to_log10 (bool, default=False) – Convert the resistivity data and frequency to log10.
sites (list of str, optional) – List of station/site names. If given, it must have the same length as the positions of the EDI data, i.e. it must match the number of ‘EDI’ files successfully read.

Returns:

arr2d: 2D resistivity array from the tensor component.
freqs: 1D array-like of the frequencies in the survey.
positions: Site/station positions. They equal the distance between stations times the number of sites.
sites: list of the station/site names.
base_plot_kws: plot keyword arguments inherited from watex.property.BasePlot. They form the last parameters for customizing the plot as a decorated return function.

Return type:

( arr2d , freqs, positions , sites , base_plot_kws)

Examples

>>> from watex.view.plot import TPlot
>>> from watex.datasets import load_edis
>>> # get some 3 samples of EDI for demo
>>> edi_data = load_edis (return_data =True, samples =3 )
>>> # customize plot by adding plot_kws
>>> plot_kws = dict( ylabel = '$Log_{10}Frequency [Hz]$',
                    xlabel = '$Distance(m)$',
                    cb_label = r'$Log_{10}Rhoa[\Omega.m]$',
                    fig_size =(6, 3),
                    font_size =7.
                    )
>>> t= TPlot(**plot_kws ).fit(edi_data)
>>> # plot recovery2d using the log10 resistivity
>>> t.plot_tensor2d (to_log10=True)
<AxesSubplot:xlabel='$Distance(m)$', ylabel='$Log_{10}Frequency [Hz]$'>

class watex.VerticalSounding(search=45.0, rho0=None, h0=1.0, strategy='HMCMC', vesorder=None, typeofop='mean', objective='coverall', xycoords=None, **kws)[source]#

Bases: ElectricalMethods

Vertical Electrical Sounding (VES) class; inherits from the ElectricalMethods base class.

The VES is carried out to infer the existence of a fracture zone and the layer thicknesses. It commonly supplements Electrical Resistivity Profiling, once the best conductive zone has been selected from a one-dimensional survey.

Parameters:

**search: float**

The depth in meters from which one expects to find a fracture zone free of pollution. The search parameter is used to estimate where groundwater is expected in the fractured rocks below the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45 m, so search can be set to this average value.

**rho0: float**

Value of the starting resistivity model. If None, rho0 should be half the minimum value of the collected apparent resistivity. Units are Ω.m, not log10(Ω.m).

**h0: float**

Thickness in meters of the first layer. If None, it should be the minimum possible thickness, 1. m.

**strategy: str**

Type of inversion scheme. The default is Hybrid Monte Carlo (HMC), known as HMCMC. Another scheme is the Bayesian neural network approach (BNN).

**vesorder: int**

AB/2 | MN/2 | SE1 | SE2 | SE3 | … | SEn

AB | MN | resistivity | resistivity | resistivity | …

AB/2 | MN/2 | SE2 | --> | AB | MN | resistivity

If vesorder is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. rhoaIndex equals 0.

**typeofop: str**

**objective: str**

**kws: dict**

Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.
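
The role of vesorder can be illustrated with a short pandas sketch. It assumes that vesorder is a zero-based position among the sounding columns (consistent with the default of 0 for the first curve); the helper select_sounding and the numeric values are hypothetical and are not part of watex.

```python
import pandas as pd

# Made-up raw sounding table: half-spacings plus three sounding points.
raw = pd.DataFrame({
    "AB/2": [1.0, 2.0, 3.0],
    "MN/2": [0.4, 0.4, 0.4],
    "SE1": [100.0, 90.0, 80.0],
    "SE2": [120.0, 110.0, 95.0],
    "SE3": [70.0, 65.0, 60.0],
})

def select_sounding(frame, vesorder=None):
    """Hypothetical helper: keep AB, MN and one sounding column."""
    index = 0 if vesorder is None else vesorder  # None falls back to the first curve
    sounding_cols = [c for c in frame.columns if c not in ("AB/2", "MN/2")]
    out = frame[["AB/2", "MN/2", sounding_cols[index]]].copy()
    out.columns = ["AB", "MN", "resistivity"]  # header expected after selection
    return out

ves = select_sounding(raw, vesorder=1)  # picks the SE2 column
print(ves)
```

Under that assumption, vesorder=1 keeps the SE2 values and relabels the three columns as AB, MN and resistivity.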

fit(data, **fit_params)[source]#

Fit the Vertical Electrical Sounding curves, compute the ohmic-area and set all the features for demarcating the fractured zone from the selected anomaly.

Parameters:

data (Path-like object, DataFrame) – The string argument is a path-like object. It must be a valid file which contains the data collected in the field. It should be composed of the spacing values AB and the apparent resistivity values rhoa. By convention AB is half-space data, i.e. AB/2. So, if data is given, the params AB and rhoa should be kept to None. If AB and rhoa are to be passed instead, the user must set data to None for API purposes; otherwise an error is raised. The recommended way is to use the vesSelector tool in watex.utils.vesSelector() to build the Vertical Electrical Sounding data before feeding it to the algorithm. See the example below.
AB (array-like) – The spacing of the current electrodes, which increases to explore deeper. Units are in meters. Note that AB is by convention equal to AB/2. It is taken as the half-space of the investigation depth.
MN (array-like) – Potential electrode distances at each investigation depth. Note that by convention the values are half-space and equal to MN/2.
rhoa (array-like) – Apparent resistivity values collected when imaging in depth. Units are Ω.m, not log10(Ω.m).
fit_params (dict) – Additional keyword arguments, specific to the readable files. Refer to watex.property.Config.parsers. Use key() to get all the readable formats.

Returns:

object

Return type:

A DC-resistivity Vertical Electrical Sounding object.

property inspect#: Inspect whether the object is fitted or not.

invert(data, strategy=None, **kwd)[source]#

Perform a 1D inversion of the Vertical Electrical Sounding data collected in the exploration area.

Parameters:

data (pandas DataFrame) – Contains the depth measurements AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
rho0 (float) – Value of the starting resistivity model. If None, rho0 should be half the minimum value of the collected apparent resistivity. Units are Ω.m, not log10(Ω.m).
h0 (float) – Thickness in meters of the first layer. If None, it should be the minimum possible thickness, 1. m.
strategy (str) – Type of inversion scheme. The default is Hybrid Monte Carlo (HMC), known as HMCMC. Another scheme is the Bayesian neural network approach (BNN).
kwd (dict) – Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator for further details.

plotOhmicArea(fbtw=False, **plot_kws)[source]#

Plot the ohmic-area from selected fractured zone.

Parameters:

fbtw (bool, default=False) – If True, fill the computed fractured zone.
plot_kws (dict) – Additional keyword arguments passed to plotOhmicArea().

summary(keep_params=False, return_table=False)[source]#

Summarize the most important features for prediction purposes.

Parameters:

keep_params (bool, default=False) – If set to True, the method outputs only the main parameters that matter for prediction purposes. Otherwise, it returns all the main DC-resistivity attributes.
return_table (bool, default=False) – If True, return only the summarized table.

Returns:

self or table_ – Returns the DC-sounding object or a dataframe.

Return type:

VerticalSounding or pd.DataFrame

class watex.XGBClassifier(*, objective='binary:logistic', use_label_encoder=None, **kwargs)[source]#

Bases: XGBModel, ClassifierMixin

Implementation of the scikit-learn API for XGBoost classification.

Parameters:

n_estimators (int) – Number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by gpu_hist tree method.
- uniform: select random training instances uniformly.
- gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.

Note

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data which needs to be present as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
- For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
- For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
gpu_id (Optional[int]) – Device ordinal.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].
enable_categorical (bool) –

New in version 1.5.0.

Note

This parameter is experimental

Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (FeatureTypes) –

New in version 1.7.0.

Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –

New in version 1.6.0.

Note

This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.
max_cat_threshold (Optional[int]) –

New in version 1.7.0.

Note

This parameter is experimental

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.
eval_metric (Optional[Union[str, List[str], Callable]]) –

New in version 1.6.0.

Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.

If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

See Custom Objective and Evaluation Metric for more.

Note

This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
```
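import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
# Pass the callable metric to the constructor; it is treated as a cost to minimize.
reg = xgb.XGBRegressor(
    tree_method="hist",
    eval_metric=mean_absolute_error,
)
# The metric is evaluated on every entry of eval_set at each boosting round.
reg.fit(X, y, eval_set=[(X, y)])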
```
early_stopping_rounds (Optional[int]) –

New in version 1.6.0.

Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.

Note

This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

Note

States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
```
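import xgboost as xgb

# parameters_grid, custom_rates and Xy (a DMatrix) are assumed to be defined by the caller.
for params in parameters_grid:
    # be sure to (re)initialize the callbacks before each run
    callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
    xgb.train(params, Xy, callbacks=callbacks)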
```
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

Note

**kwargs unsupported by scikit-learn

**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
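
The stopping rule described for early_stopping_rounds can be sketched in plain Python. This is a simplified illustration and not XGBoost code: the function name early_stop_iteration and the metric history are hypothetical, and the metric is assumed to be one that is minimized.

```python
def early_stop_iteration(scores, early_stopping_rounds):
    """Return (last_iteration_run, best_iteration) for a metric to minimize."""
    best_score = float("inf")
    best_iteration = -1
    for iteration, score in enumerate(scores):
        if score < best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            return iteration, best_iteration  # no improvement for too long
    return len(scores) - 1, best_iteration

# Made-up validation metric, one value per boosting round.
history = [0.50, 0.42, 0.40, 0.41, 0.43, 0.44, 0.39]
stopped_at, best = early_stop_iteration(history, early_stopping_rounds=3)
print(stopped_at, best)  # 5 2
```

Training stops at round 5 because rounds 3, 4 and 5 did not improve on round 2, so the later improvement at round 6 is never reached. It also shows why the last iteration and the best iteration can differ.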

Note

Custom objective function

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values

y_pred: array_like of shape [n_samples]
The predicted values

grad: array_like of shape [n_samples]
The value of the gradient for each sample point.

hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
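
As an illustration of this signature, here is a minimal NumPy sketch of a logistic-loss objective. It assumes that y_pred holds raw, untransformed scores (margins) and that labels are 0 or 1; the function name logistic_objective is chosen for the example.

```python
import numpy as np

def logistic_objective(y_true, y_pred):
    """Gradient and Hessian of the log loss with respect to the raw score."""
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid turns scores into probabilities
    grad = prob - y_true                   # first derivative per sample
    hess = prob * (1.0 - prob)             # second derivative per sample
    return grad, hess

y_true = np.array([0.0, 1.0, 1.0])
y_pred = np.array([0.0, 0.0, 2.0])  # raw scores, assumed untransformed
grad, hess = logistic_objective(y_true, y_pred)
print(grad.shape, hess.shape)  # (3,) (3,)
```

Both returned arrays have one entry per sample, matching the [n_samples] shapes listed above. Such a callable would be passed as the objective parameter.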

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)[source]#

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:

X – Feature matrix
y – Labels
sample_weight – instance weights
base_margin – global bias for each instance.
eval_set – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
eval_metric (str, list of str, or callable, optional) –

Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.
early_stopping_rounds (int) –

Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model – file name of stored XGBoost model or ‘Booster’ instance XGBoost model to be loaded before training (allows training continuation).
sample_weight_eval_set – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
base_margin_eval_set – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.
feature_weights – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks –

Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict with X. If the model is trained with early stopping, then best_iteration is used automatically. For tree models, when data is on GPU, like cupy array or cuDF dataframe and predictor is not specified, the prediction is run on GPU automatically, otherwise it will run on CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X – Data to predict with.
output_margin – Whether to output the raw untransformed margin value.
ntree_limit – Deprecated, use iteration_range instead.
validate_features – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin – Margin added to prediction.
iteration_range –
Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open set) are used in this prediction.

New in version 1.4.0.
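
The half-open convention of iteration_range matches Python slicing. The sketch below uses a plain list as a stand-in for the forests built in each round; it does not call XGBoost.

```python
# Stand-in for the forests built in 100 boosting rounds (round indices 0..99).
rounds = list(range(100))

iteration_range = (10, 20)
begin, end = iteration_range
used = rounds[begin:end]  # Python slices are half-open, like [10, 20)

print(used[0], used[-1], len(used))  # 10 19 10
```

Round 10 is included and round 20 is excluded, so ten rounds are used.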

Return type:

prediction

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters:

X (array_like) – Feature matrix.
ntree_limit (int) – Deprecated, use iteration_range instead.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (array_like) – Margin added to prediction.
iteration_range – Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (half-open set) are used in this prediction.

Returns:

A numpy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction

watex.bi_selector(d, /, features=None, return_frames=False)[source]#

Auto-differentiates the numerical from categorical attributes.

This is useful for separating the categorical features from the numerical ones, and vice versa, when there are many features. Entering features individually becomes tedious and error-prone.

Parameters:

d (pandas DataFrame) – Input dataframe.
features (list of str) – List of features in the dataframe columns. Raises an error if any feature does not exist in the frame. Note that if features is None, the categorical and numerical features are returned instead.
return_frames (bool, default=False) – If False, return the given features and the remaining columns (features) as lists. If set to True, return two frames composed of the given features and the remaining features.

Returns:

- Tuple (list, list) – Lists of the features and the remaining features.
- Tuple (pd.DataFrame, pd.DataFrame) – Frames of the features and the remaining features.

Example

>>> from watex.utils.mlutils import bi_selector
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame # get the frame
>>> data.columns
Index(['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name',
       'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp',
       'short_distance_gamma', 'well_diameter', 'aquifer_group',
       'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping',
       'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping',
       'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k',
       'kp', 'r', 'rp', 'remark'],
      dtype='object')
>>> num_features, cat_features = bi_selector (data)
>>> num_features
...['gamma_gamma',
     'depth_top',
     'aquifer_thickness',
     'pumping_depth_at_the_end',
     'section_aperture',
     'remark',
     'depth_starting_pumping',
     'hole_depth_before_pumping',
     'rp',
     'hole_depth_after_pumping',
     'hole_depth_loss',
     'depth_bottom',
     'sp',
     'pumping_depth',
     'kp',
     'resistivity',
     'short_distance_gamma',
     'r',
     'natural_gamma',
     'layer_thickness',
     'k',
     'well_diameter']
>>> cat_features
... ['hole_id', 'strata_name', 'rock_name', 'aquifer_group',
     'pumping_level']
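
The idea behind the split can be sketched with pandas alone. This is not the bi_selector implementation: the helper split_by_dtype and the small frame are hypothetical, and the sketch assumes that every non-numeric column counts as categorical.

```python
import pandas as pd

def split_by_dtype(frame):
    """Hypothetical helper: return (numerical, categorical) column names."""
    num_features = list(frame.select_dtypes(include="number").columns)
    cat_features = [c for c in frame.columns if c not in num_features]
    return num_features, cat_features

# Made-up frame with two text columns and two numeric columns.
demo = pd.DataFrame({
    "hole_id": ["H1", "H2"],
    "depth_top": [0.0, 12.5],
    "rock_name": ["granite", "schist"],
    "resistivity": [150.0, 32.0],
})
num_features, cat_features = split_by_dtype(demo)
print(num_features)  # ['depth_top', 'resistivity']
print(cat_features)  # ['hole_id', 'rock_name']
```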

watex.classify_k(o, /, func=None, kname=None, inplace=False, string=False, default_func=False)[source]#

Categorize the permeability coefficient ‘k’

Map the continuous ‘k’ into categorical classes.

Parameters:

o (ndarray, pd.Series or DataFrame) – Data containing the continuous values of the permeability coefficient k. If data is passed as a pandas dataframe, the name kname of the column containing the k-values needs to be specified.
func (callable) – Function used to map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead with the default_func parameter.
inplace (bool, default=False) – Modify the object in place and return None.
string (bool) – If set to “True”, the categories mapped from ‘k’ are prefixed by “k”. However, if a string value is given, the prefix is changed to that label.
default_func (bool) –
If True, the default function for mapping k is used. Note that it may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is as follows:
- k0 {0}: k = 0
- k1 {1}: 0 < k <= .01
- k2 {2}: .01 < k <= .07
- k3 {3}: k> .07
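
The default mapping listed above can be written as a small function. This is a sketch of the thresholds only, not the watex code: the name default_k_class is hypothetical, k is assumed to be non-negative, and leaving NaN values untouched is an assumption.

```python
import numpy as np

def default_k_class(k):
    """Map one permeability value to the class index listed above."""
    if np.isnan(k):
        return np.nan  # leave missing values untouched (an assumption)
    if k == 0:
        return 0
    if k <= 0.01:
        return 1
    if k <= 0.07:
        return 2
    return 3

values = [0.0, 0.005, 0.054, 0.2]
classes = [default_k_class(v) for v in values]
print(classes)  # [0, 1, 2, 3]
```

A custom callable with the same shape (one value in, one class out) is the kind of function the func parameter expects.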

Returns:

o – None is returned only if a dataframe is given and inplace is set to True, i.e. the object is modified in place.

Return type:

None, ndarray, Series or Dataframe

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k
>>> _, y0 = load_hlogs (as_frame =True)
>>> # let us visualize four non-NaN values in y0
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
...  array([0.054, 0.054, 0.054, 0.054])
>>> classify_k (y0 , kname ='k', inplace =True, default_func=True )
>>> # let us look again at the same four values in the dataframe
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
... array([2., 2., 2., 2.])

watex.cleaner(data, /, columns=None, inplace=False, labels=None, func=None, mode='clean', **kws)[source]#

Sanitize the data or columns by dropping the specified labels from rows or columns.

If data is not a pandas dataframe, it is converted to a dataframe and the index is used to drop the labels.

Parameters:

data (pd.DataFrame or 2D array-like) – Pandas dataframe or two-dimensional NumPy array. If a 2D array is passed, it is first converted to a dataframe, and row indices are dropped according to the index parameters.
columns (single label or list-like) –
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
func (F, callable) – Universal function used to clean the columns. It is applied only when mode is 'clean'.
inplace (bool, default False) – If False, return a copy. Otherwise, do the operation in place and return None.
mode (str, default='clean') – Mode of operation to apply to the data. It can be 'clean' or 'drop'. If 'drop', it behaves like pandas dataframe.drop.

Returns:

The cleaned DataFrame, or the DataFrame without the removed index or column labels; None if inplace=True; or an array if data is passed as an array.

Return type:

DataFrame, array2D or None
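
Since the 'drop' mode is described as behaving like pandas' own drop, the equivalence between columns and labels with axis=1 can be checked directly in pandas. The frame below is made up for the illustration, and the array case only mirrors the conversion described above rather than calling cleaner.

```python
import pandas as pd

frame = pd.DataFrame({
    "station": ["S00", "S01"],
    "resistivity": [120.0, 85.0],
    "remark": ["", "dry"],
})

# columns=labels is equivalent to labels with axis=1.
by_columns = frame.drop(columns=["remark"])
by_labels = frame.drop(labels=["remark"], axis=1)
print(by_columns.equals(by_labels))  # True

# A 2D array has no labels, so it is wrapped in a DataFrame and rows are dropped by index.
array_frame = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
print(array_frame.drop(labels=[1]).to_numpy().tolist())  # [[1, 2], [5, 6]]
```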

watex.erpSelector(f, columns=Ellipsis, force=False, utm_zone=None, epsg=None, verbose=0.0, **kws)[source]#

Read and sanitize the data collected from the survey.

data should be an array, a dataframe, a series, or arranged in .csv or .xlsx format. Be sure to provide the header of each column in the worksheet. If a file is given, the header columns should be arranged as ['station','resistivity' ,'longitude', 'latitude']. Note that the coordinate columns (longitude and latitude) are not compulsory.

Parameters:

f (Path-like object, ndarray, Series or DataFrame) – If a path-like object is given, only the .csv and .xlsx file formats can be parsed. If an ndarray is given and its shape along axis 1 is greater than 4, the ndarray is shrunk.
columns (list) – List of the valuable columns. It can be used to fix specific values along axis 1 of the array. It should contain the prefix or the whole name of each item in ['station','resistivity' ,'longitude', 'latitude'].
force (bool, default=False) – If Vertical Electrical Sounding (VES) data is passed while ERP data is expected, setting force to True treats the VES data as ERP data and uses only the resistivity values in the VES data. This will yield invalid results, especially when parameter computations are needed.
verbose (int) – Verbosity level; outputs more messages when set.
utm_zone (string, optional) –
Zone number and ‘S’ or ‘N’, e.g. ‘55S’. Defaults to the centre point of the provided points. If given, the longitude/latitude are computed from valid easting/northing coordinates.

New in version 0.2.1.
epsg (int) – EPSG number defining the projection (see http://spatialreference.org/ref/ for more info). Overrides utm_zone if both are provided.
kws (dict) – Additional keyword arguments for the pandas pd.read_csv and pd.read_excel methods. Be sure to provide the right arguments when reading f. For instance, providing the sep=',' argument when the file to read is in xlsx format will raise an error, since the sep parameter is only accepted for parsing the .csv file format.

Return type:

DataFrame with valuable column(s).
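
To make the utm_zone format concrete, the sketch below derives a zone string such as ‘55S’ from the centre of a set of points, using the standard 6-degree zone formula. It is an illustration only: the function utm_zone_from_lonlat and the coordinates are hypothetical, the special zones around Norway and Svalbard are ignored, and how watex itself derives the default is not shown here.

```python
import math

def utm_zone_from_lonlat(longitude, latitude):
    """Standard 6-degree UTM zone number plus hemisphere letter."""
    number = int(math.floor((longitude + 180.0) / 6.0)) % 60 + 1
    hemisphere = "N" if latitude >= 0 else "S"
    return f"{number}{hemisphere}"

# Made-up survey points; the zone is taken at their centre.
longitudes = [147.1, 147.3, 147.5]
latitudes = [-37.9, -37.8, -37.7]
centre_lon = sum(longitudes) / len(longitudes)
centre_lat = sum(latitudes) / len(latitudes)
print(utm_zone_from_lonlat(centre_lon, centre_lat))  # 55S
```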

Notes

The number of acceptable columns is 4. If there are more than 4 columns, the data is shrunk to match the expected columns. Furthermore, if the header is not specified in f, the default column arrangement is used, so the second column is considered as the resistivity column.
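
The shrinking behaviour can be pictured with a NumPy and pandas sketch. It assumes that the leading columns are the ones kept, which the text above does not state; the helper shrink_to_expected is hypothetical and is not the erpSelector code.

```python
import numpy as np
import pandas as pd

EXPECTED = ["station", "resistivity", "longitude", "latitude"]

def shrink_to_expected(array):
    """Hypothetical helper: keep at most four columns and name them."""
    array = np.asarray(array)
    kept = array[:, : len(EXPECTED)]  # assumption: leading columns are kept
    return pd.DataFrame(kept, columns=EXPECTED[: kept.shape[1]])

df = shrink_to_expected(np.random.randn(7, 7))
print(df.shape)          # (7, 4)
print(list(df.columns))  # ['station', 'resistivity', 'longitude', 'latitude']
```

With the default arrangement, the second column of the result is the one read as resistivity.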

Examples

>>> import numpy as np
>>> from watex.utils.coreutils import erpSelector
>>> df = erpSelector ('data/erp/testsafedata.csv')
>>> df.shape
... (45, 4)
>>> list(df.columns)
... ['station','resistivity', 'longitude', 'latitude']
>>> df = erpSelector('data/erp/testunsafedata.xlsx')
>>> list(df.columns)
... ['easting', 'station', 'resistivity', 'northing']
>>> df = erpSelector(np.random.randn(7, 7))
>>> df.shape
... (7, 4)
>>> list(df.columns)
... ['station', 'resistivity', 'longitude', 'latitude']

watex.erpSmartDetector(constr, erp, station=None, coerce=False, return_cz=False, view=False, raise_warn=True, **plot_kws)[source]#

Automatically detect the drilling location by involving the constraints observed in the survey area.

Consider the constraints on the survey area and detect a suitable drilling location. Commonly the station is not needed when using the constraints, since giving a station indicates that the user already knows why this station is selected. However, in case of doubt, the user can set the parameter coerce to True.

Parameters:

constr (list, dict) –
List of restricted stations. The constraints, or restricted stations, are the stations to ignore when selecting the best drilling location. This is useful since, in DWSC, not all stations are presumed to be technically suitable for proposing a drilling. For instance, if some stations are close to a household waste site, these stations must be listed and ignored.

If constr is passed as a dictionary, the keys are the restricted stations and the values give the reason why each station is restricted. For instance:
```
constr = {"s02": "station close to the household waste",
          "S25": "station is located in a marsh area."
          }
```
erp (array-like 1d) – DC profiling ERP resistivity values
station (str, optional) – The station of the presumed location for drilling operations. Commonly the station is not needed when using the constraints. If the station is given while coerce=False, an error is raised to warn the user. To force the station to be considered in the auto-detection, coerce must be set to True.
coerce (bool, default=False) – Allow the station to be considered in the auto-detection.
raise_warn (bool, default=True) – Warn the user whether a suitable location is found or not. Returns None otherwise.
view (bool, default=False) – Plot the conductive zone and restricted stations.
plot_kws (dict) – Additional plotting keyword arguments passed to plotAnomaly().

Returns:

(station | None) or cz, cs – Station for the drilling operations detected automatically. If no station is detected, None is returned. If return_cz is True, the station and the conductive zone are returned, as well as the position numbers of the restricted stations.

Return type:

str,

See also

watex.plotAnomaly: Plot DC profiling ERP and conductive zone.

Examples

>>> import numpy as np
>>> from watex.datasets import make_erp
>>> from watex.utils.coreutils import erpSmartDetector
>>> resistivity = make_erp (n_stations =50 , as_frame=True, seed=125).resistivity
>>> # get the index of the minimum resistivity value
>>> resmin_index = int(np.argmin(resistivity.to_numpy()))
>>> resmin_index
42
>>> erpSmartDetector (['s42'], resistivity)
'S13'
>>> # S42 is rejected and another zone presumed to be better is selected.
>>> constraints ={"S00": "Marsh area. ",
                  "S10": " Municipality square, no authorization to make drill",
                  "S29": "Heritage site",
                  "S46": "Household waste site",
                  "S42": "Household waste site"
                  }
>>> erpSmartDetector (constraints, resistivity)
'S16'
>>> erpSmartDetector (['s12', 's40'], resistivity)
'S29'
>>> # station S42, close to S40, is rejected too.
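
The idea of excluding restricted stations before choosing a location can be sketched in a few lines. This is only a toy illustration and not the algorithm used by erpSmartDetector: the helper name pick_station and its buffer parameter are hypothetical, and the selection rule (lowest resistivity among the remaining stations) is an assumption made for the example.

```python
def pick_station(constr, erp, buffer=0):
    """Toy selection: lowest resistivity among unrestricted stations.

    `constr` is a list of station names or a dict {name: reason};
    `buffer` (hypothetical) also blocks that many neighbours on each side.
    """
    # Station names are treated case-insensitively: 's02' and 'S02' both map to 2.
    restricted = {int(str(name).lower().lstrip("s")) for name in constr}
    # Block each restricted station together with its neighbours.
    blocked = {i for r in restricted for i in range(r - buffer, r + buffer + 1)}
    candidates = [i for i in range(len(erp)) if i not in blocked]
    if not candidates:
        return None  # no suitable station left
    best = min(candidates, key=lambda i: erp[i])
    return f"S{best:02d}"


erp = [50.0, 20.0, 5.0, 30.0, 8.0]
print(pick_station(["s02"], erp))
print(pick_station({"S02": "marsh area", "s04": "waste site"}, erp))
```

Iterating over a dict yields its keys, so the same code path handles both accepted forms of constr.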

watex.fetch_data(tag, **kws)[source]#

Fetch dataset from tag.

A tag corresponds to the name of the data collection area or to a level of data processing.

Parameters:

tag (str, ['bagoue', 'tankesse', 'semien', 'iris', 'boundiali', 'gbalo']) –

Name of the area of data to fetch. For instance, setting the tag to bagoue loads the bagoue datasets. If the tag name is followed by a suffix, the latter specifies the stage of the data processing. As an example, bagoue original or bagoue prepared retrieves the original data or the transformed data after applying the default transformers, respectively.

There are different options to retrieve data, such as:

[‘original’] => original or raw data; returns a dict of details. Combine with the get method to get the dataframe, like:
>>> fetch_data ('bagoue original').get ('data=df')
[‘stratified’] => stratified data
[‘mid’|’semi’|’preprocess’|’fit’] => data cleaned with experimental attribute combinations.
[‘pipe’] => default pipeline created during the data preparation.
[‘analyses’|’pca’|’reduce dimension’] => data with text attributes only encoded using the ordinal encoder, plus attribute combinations.
[‘test’] => stratified test set data

Returns:

dict, X, y –

If the tag is followed by a suffix in the case of the ‘bagoue’ area, it returns:

data: Original data
X, y: Stratified train set and training target
X0, y0: Data cleaned after dropping useless features, with combined numerical attributes if True
X_prepared, y_prepared: Data prepared after applying all the transformations via the transformer (pipeline).
XT, yT: Stratified test set and test labels
_X: Stratified training set for data analysis, so no sparse matrix is contained. The text (categorical) attributes are converted using the ordinal encoder.
_pipeline: The default pipeline.

Return type:

frame or Boxspace object

Examples

>>> from watex.datasets import fetch_data
>>> b = fetch_data('bagoue' ) # no suffix returns 'Boxspace' object
>>> b.tnames
... array(['flow'], dtype='<U4')
>>> b.feature_names
... ['num',
     'name',
     'east',
     'north',
     'power',
     'magnitude',
     'shape',
     'type',
     'sfi',
     'ohmS',
     'lwi',
     'geol']
>>> X, y = fetch_data('bagoue prepared' )
>>> X # is transformed  # ready for prediction
>>> X[0]
... <1x18 sparse matrix of type '<class 'numpy.float64'>'
        with 8 stored elements in Compressed Sparse Row format>
>>> y
... array([2, 1, 2, 2, 1, 0, ... , 3, 2, 3, 3, 2], dtype=int64)

watex.fittensor(refreq, compfreq, z, fill_value=nan)[source]#

Fit each tensor component to the complete frequency range.

The complete frequency is the frequency array with clean data. It contains the whole frequency range of the site. During the survey, missing frequencies lead to missing tensor data. The function therefore indicates where the tensor data is missing and fits the tensor to the reference frequencies.

Parameters:

refreq (ArrayLike) – Reference frequency. Should be the complete frequency collected in the field.
compfreq (array-like,) – The specific frequency collected at the site. Sometimes, due to interference, the frequency at an individual site can differ from the complete one. However, the frequency values at the individual site must be included in the complete frequency refreq.
z (array-like,) – The tensor values (real or imaginary part) at the component xx, xy, yx or yy.
fill_value (float, default=NaN) – Value to replace the missing data in tensors.

Returns:

Z – New Z filled with fill_value (NaN by default) where the frequency is missing in the data.

Return type:

Arraylike

Examples

>>> import numpy as np
>>> from watex.utils.exmath import fittensor
>>> refreq = np.linspace(7e7, 1e0, 20) # 20 frequencies as reference
>>> freq_ = np.hstack ((refreq.copy()[:7], refreq.copy()[12:] ))
>>> z = np.random.randn(len(freq_)) *10 # assume length of  freq as
...                 # the same like the tensor Z value
>>> zn  = fittensor (refreq, freq_, z)
>>> z # some frequency values are missing but not visible.
...array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,   9.85956253,
         3.96269863, -10.38325124,  -4.29739755,  -8.2591703 ,
        21.7930423 ,   0.21709129,   4.07815217])
>>> # zn shows where the frequencies are missing
>>> # a NaN value means a missing value in tensor Z at that specific frequency
>>> zn
... array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,          nan,
                nan,          nan,          nan,          nan,
         9.85956253,   3.96269863, -10.38325124,  -4.29739755,
        -8.2591703 ,  21.7930423 ,   0.21709129,   4.07815217])
>>> # let us visualize where the frequency values are missing in tensor Z
>>> refreq
... array([7.00000000e+07, 6.63157895e+07, 6.26315791e+07, 5.89473686e+07,
       5.52631581e+07, 5.15789476e+07, 4.78947372e+07, 4.42105267e+07*,
       4.05263162e+07*, 3.68421057e+07*, 3.31578953e+07*, 2.94736848e+07*,
       2.57894743e+07, 2.21052638e+07, 1.84210534e+07, 1.47368429e+07,
       1.10526324e+07, 7.36842195e+06, 3.68421147e+06, 1.00000000e+00])
>>> refreq[np.isnan(zn)] #we can see the missing value between [7:12](*) in refreq
... array([44210526.68421052, 40526316.21052632, 36842105.73684211,
       33157895.2631579 , 29473684.78947368])
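
The alignment described above can be approximated with plain NumPy. The sketch below is not the watex implementation: the function name fit_to_reference is hypothetical, and it assumes that the site frequencies are exact members of refreq and that z is ordered like refreq.

```python
import numpy as np


def fit_to_reference(refreq, compfreq, z, fill_value=np.nan):
    """Place z on the reference frequency grid, filling the gaps."""
    refreq = np.asarray(refreq, dtype=float)
    compfreq = np.asarray(compfreq, dtype=float)
    z = np.asarray(z, dtype=float)
    # Mark the reference frequencies that were actually recorded at the site.
    present = np.isin(refreq, compfreq)
    if present.sum() != z.size:
        raise ValueError("every site frequency must exist in refreq")
    out = np.full(refreq.shape, fill_value, dtype=float)
    out[present] = z  # assumes z follows the same ordering as refreq
    return out


refreq = np.linspace(7e7, 1e0, 20)
freq_ = np.hstack((refreq[:7], refreq[12:]))  # five frequencies are missing
z = np.arange(freq_.size, dtype=float)
zn = fit_to_reference(refreq, freq_, z)
print(np.flatnonzero(np.isnan(zn)))
```

Exact matching with np.isin is enough here because freq_ is sliced from refreq; measured frequencies with rounding noise would need a tolerance-based comparison instead.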

watex.get2dtensor(z_or_edis_obj_list, /, tensor='z', component='xy', kind='modulus', return_freqs=False, **kws)[source]#

Build a two-dimensional array from a collection of impedance tensors Z.

Output a 2D resistivity, phase, error or tensor matrix from a collection of EDI objects.

The matrix shape is the number of frequencies times the number of sites. The function asserts whether data from all frequencies are available. The missing values should be filled with NaN. Note that each element of z has dimension (nfreq, 2, 2) for:

xx ( 0, 0) ------- xy ( 0, 1)
yx ( 1, 0) ------- yy ( 1, 1)

Parameters:

z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='z') – Tensor name. Can be [resistivity|phase|z|frequency].
component (str, default='xy' (TE mode)) – EM mode. Can be [‘xx’, ‘xy’, ‘yx’, ‘yy’].
out (str) – Kind of data to output. Be sure to provide the component to retrieve the attribute from the collection object. Except for the error and frequency attributes, a missing component in the attribute raises an error. For instance, resxy for the xy component. Default is resxy.
kind (str, default='modulus') – Focuses on the tensor output. Note that the tensor is a complex ndarray of shape (nfreq, 2, 2). If set to modulus, the modulus of the complex tensor is output. If real or imag, only that part is returned. Default is modulus.
return_freqs (bool, default=False) – If True, also returns the full frequency range.
kws (dict) – Additional keyword arguments passed to EM.getfullfrequency.

Returns:

mat2d – Matrix of shape (number of frequencies, number of EDI objects collected), where the latter corresponds to the number of stations/sites.

Return type:

arraylike2d

Examples

>>> from watex.datasets import load_huayuan
>>> from watex.methods import get2dtensor
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> data = box.data
>>> phase_yx = get2dtensor ( data, tensor ='phase', component ='yx')
>>> phase_yx.shape
(56, 7)
>>> phase_yx [0, :]
array([        nan,         nan,         nan,         nan, 18.73244951,
       35.00516522, 59.91093054])
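
The (nfreq, 2, 2) layout and the component indices listed above suggest how such a matrix can be assembled. The following is a minimal sketch under stated assumptions, not the watex code: component_matrix is a hypothetical helper, it takes raw arrays instead of EDI objects, and it assumes every site already shares the same frequency axis (missing values filled with NaN).

```python
import numpy as np

# Position of each component inside the 2 x 2 impedance tensor.
_INDEX = {"xx": (0, 0), "xy": (0, 1), "yx": (1, 0), "yy": (1, 1)}


def component_matrix(z_list, component="xy", kind="modulus"):
    """Stack one tensor component of every site into an (nfreq, nsites) array."""
    i, j = _INDEX[component]
    # One column per site, one row per frequency.
    mat = np.column_stack([np.asarray(z)[:, i, j] for z in z_list])
    if kind == "modulus":
        return np.abs(mat)
    if kind == "real":
        return mat.real
    if kind == "imag":
        return mat.imag
    return mat  # any other value keeps the complex values


# Three synthetic sites with four frequencies each.
z_site = np.full((4, 2, 2), 3 + 4j)
mat2d = component_matrix([z_site, z_site, z_site], component="yx")
print(mat2d.shape)
```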

watex.get_scorers(*, scorer=None, check_scorer=False, error='ignore')[source]#

Fetch the list of available metrics from scikit-learn or verify whether the scorer exists in that list of metrics. This is necessary prior to the model evaluation.

Parameters:

scorer (str) – Must be a metric for model evaluation. Refer to sklearn.metrics.
check_scorer (bool, default=False) – If True, returns a bool indicating whether the scorer exists in the list of metrics for the model evaluation. Note that scorer cannot be None if check_scorer is set to True.
error (str, [‘raise’, ‘ignore’]) – If ‘raise’, a ValueError is raised when scorer is not found in the list of metrics and check_scorer is True.

Returns:

scorers (bool, tuple) – True if scorer is in the list of metrics, provided that scorer is not None; otherwise the tuple of scikit-learn metrics from sklearn.metrics.

watex.iPCA(X, n_components=None, *, view=False, n_batches=None, return_X=True, store_in_binary_file=False, filename=None, **ipca_kws)[source]#

Incremental PCA

iPCA splits the training set into mini-batches and feeds the algorithm one mini-batch at a time.

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory for the SVD algorithm to run. iPCA is useful for large training sets, and also for applying PCA online (i.e., on the fly as new instances arrive).

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), the variance ratio to preserve is 95%.
n_batches (int, optional) – Number of batches to split the training set.
store_in_binary_file (bool, default=False) – Alternatively, the numpy memmap class is used to manipulate a large array stored in a binary file on disk as if it were entirely in memory. The class loads only the data it needs in memory, when it needs it.
filename (str, optional) – Default binary filename used to store the array on disk.
return_X (bool, default=True,) – Return the train set transformed with the most representative variance ratio.
view (bool, default=False,) – Plot the explained variance as a function of the number of dimensions.
ipca_kws (dict,) – Additional keyword arguments passed to sklearn.decomposition.IncrementalPCA

Returns:

The transformed training set or the iPCA container attributes for plotting purposes.

Return type:

X (NDArray) or iPCA object,

Examples

>>> from watex.analysis.dimensionality import iPCA
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysed data')
>>> Xtransf = iPCA(X,n_components=None,n_batches=100, view=True)

watex.kPCA(X, n_components=None, *, return_X=True, kernel='rbf', reconstruct_pre_image=False, **kpca_kws)[source]#

Kernel PCA

kPCA performs complex nonlinear projections for dimensionality reduction.

The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification or regression with SVMs. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), the variance ratio to preserve is 95%.
return_X (bool, default=True,) – Return the train set transformed with the most representative variance ratio.
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'cosine', 'precomputed'}, default='rbf') – Kernel used for PCA.
kpca_kws (dict,) – Additional keyword arguments passed to sklearn.decomposition.KernelPCA

Returns:

The transformed training set or the kPCA container attributes for plotting purposes.

Return type:

X (NDArray) or kPCA object,

Examples

>>> from watex.analysis.dimensionality import kPCA
>>> from watex.datasets import fetch_data
>>> X, _=fetch_data('Bagoue analysis data')
>>> Xtransf=kPCA(X,n_components=None,kernel='rbf',
                            gamma=0.04, view=True)

watex.magnitude(cz)[source]#

Compute the magnitude of selected conductive zone.

The magnitude parameter is the absolute difference between the minimum $\min \rho_a$ and maximum $\max \rho_a$ resistivity values of the selected anomaly:

\[magnitude=|\min\rho_a -\max\rho_a|\]

Parameters:

cz (array-like) – Array of apparent resistivity values composing the conductive zone.

Returns:

Absolute value of the anomaly magnitude in ohm.meters.
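
As a quick worked instance of the formula, the sketch below computes the magnitude of a small synthetic conductive zone with NumPy. The helper name anomaly_magnitude and the resistivity values are made up for illustration; it is a stand-in, not the watex function itself.

```python
import numpy as np


def anomaly_magnitude(cz):
    """Return |min(rho_a) - max(rho_a)| for the selected conductive zone."""
    cz = np.asarray(cz, dtype=float)
    return float(np.abs(cz.min() - cz.max()))


# Apparent resistivity values (ohm.m) across a hypothetical conductive zone.
cz = [120.0, 80.0, 45.0, 60.0, 150.0]
print(anomaly_magnitude(cz))  # |45 - 150| = 105.0
```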

watex.make_MXS_labels(y_true, y_pred, threshold=None, similar_labels=None, sep=None, prefix=None, method='naive', trailer='*', return_obj=False, **kws)[source]#

Create Mixture Learning Strategy (MXS) labels from the true labels ‘y_true’ and the predicted Naive Group of Aquifer (NGA) labels ‘y_pred’.

Parameters:

y_true (array-like 1d, pandas.Series) – Array composed of valid k-values and possible missing k-values.
y_pred (array-like 1d, pandas.Series) – Array composed of the valid NGA labels. Note that the NGA labels are predicted labels, mostly obtained using unsupervised learning.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to the other labels is considered the most representative and similar. Setting the rule by fixing the threshold is recommended, especially for a huge dataset.
similar_labels (list of tuple, optional) – List of tuples in pairs (label, similar group). If given, the similar group must be a label existing in the predicted NGA. If None, the auto-similarity is triggered.
sep (str, default='') –
Separator between the true labels ‘y_true’ and the predicted NGA labels. sep is used to rewrite the MXS labels. Mostly, an MXS label is a combination of the true label of the permeability coefficient ‘k’ and the NGA label, composing a new similarity label. For instance
```
>>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'IV']
>>> # gives
>>> MXS_labels= ['k1_II', 'k2_I', 'k3_IV']
```
where the separator sep is set to _. This happens especially when one of the labels (NGA or true_labels) is not of a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’, and so on.
prefix (str, default='') –
prefix is used to rename the true_labels i.e the true valid-k. For instance:
```
>>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
```
where ‘k’ is the prefix.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy used to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’. Indeed:
- naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. over the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
trailer (str, default='*') –
The Mixture strategy marker used to differentiate the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same class labels are also present in the true labels with the same label-identifier name. This is useful to avoid any confusion between the labels in y_true and y_pred, for better demarcation and distinction. Note that if the trailer is set to None and both y_true and y_pred are numeric data, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance
```
>>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
>>> # with trailer , MXS labels should be
>>>  MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
>>> # with no trailer
>>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
```
return_obj (bool, default=False) –
If True, returns an MXS object (watex.utils.box.Boxspace) with useful attributes such as:
- mxs_classes_ = the MXS class labels.
- mxs_labels_ = the array-like of MXS labels. It also includes some non-similar labels from NGA.
- mxs_map_classes_ = a dict of the original class labels of the array ‘k’ <’y_true’> and their temporary integer class labels. If the ‘y_true’ class labels are not of a numeric dtype, new labels with an integer dtype are created. The dict is used to wrap the true (original) labels during the MXS creation. Thus, the original labels are not altered and are mapped back at the end to recover their positions in the new MXS array. It is set to ‘None’ if ‘y_true’ has a numeric dtype.
- mxs_group_classes_ = dict of all the similar group labels with the MXS labels related to the modified existing groups of NGA. Note that the non-similar groups are modified if their labels are also found in the true labels, to avoid any confusion. Thus the dict wraps the non-similar labels with their new temporary labels.
- mxs_similar_groups_ = list of the similar labels found in y_true that have a similarity in NGA.
- mxs_similarity_ = tuple of similarity pairs (label, group) existing between the class labels in y_true and NGA.
- mxs_group_labels_ = list of the similar groups found in the predicted NGA that have a similarity in the true labels ‘y_true’.

Returns:

MXS – Array-like of MXS labels or MXS object containing the useful attributes.

Return type:

array-like 1d or Boxspace

See also

predict_NGA_labels: Predicts Naive group of Aquifers labels.

Examples

>>> from watex.datasets import load_hlogs
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import classify_k, make_MXS_labels
>>> data = load_hlogs ().frame
>>> # map data.k to categorize k values
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs = make_MXS_labels (ymap, data.aquifer_group)
>>> y_mxs[14:24]
...  array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> mxs_obj = make_MXS_labels (ymap, data.aquifer_group, return_obj=True )
>>> mxs_obj.mxs_labels_[14: 24]
... array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> # now we do the same task using the private data 'hf.csv'
>>> # composed of 11 boreholes. By default, we use
>>> # the aquifer groups as a fake NGA
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap =  classify_k(data.k , default_func =True)
>>> y_mxs= make_MXS_labels (ymap, data.aquifer_group)
>>> np.unique (y_mxs)
... array(['1', '1V', '2', '2III', '3', 'I', 'II', 'III&IV', 'IV'],
      dtype='<U6')
>>> # *comments:
    # label '1V' means that group V (expected to be a cluster)
    # and label 1 (true labels) have a similarity.
    # The same holds for label '2III', while the remaining label 3 does not
    # have any similarity with the other labels in 'y_pred', expected
    # to be NGA labels.

watex.make_erp(*, n_stations=42, max_rho=1000.0, min_rho=1.0, step=20.0, reflong='110:29:09.00', reflat='26:03:05.00', utm_zone='29N', order='+', full_coordinates=True, raise_warning=False, as_frame=False, seed=None, is_utm=False, epsg=None, **coord_kws)[source]#

Generate Electrical Resistivity Profiling (ERP) data from stations and coordinates points.

To generate samples from a specific area, it is better to provide both the latitude and longitude values of a single station of this area as arguments passed to the parameters reflat and reflong respectively. Also specify the utm_zone for the lat/lon coordinates conversion into UTM if necessary. If the conversion is not useful, set the parameter full_coordinates to False.

Parameters:

n_stations (int, default=42) – number of measurements stations
max_rho (float, default=1e3) – maximum resistivity value on the survey area in $\Omega.m$
min_rho (float, default=1e0) – minimum resistivity value on the survey area in $\Omega.m$
reflong (float or string or list of [start, stop], default='110:29:09.00') – Reference longitude in degree decimal or in DD:MM:SS for the first station considered as the origin of the landmark.
reflat (float or string or list of [start, stop], default='26:03:05.00') – Reference latitude in degree decimal or in DD:MM:SS for the reference site considered as the landmark origin. If value is given in a list, it can contain the start point and the stop point.
step (float or str, default=20) – Offset or distance of separation between different sites in meters. If the value is given as a string, it is considered as a value in meters unless the km unit is specified. Only meters and kilometers are acceptable.
order (str, default='+') – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with angle r set to 45 degrees. Could be - for descending order. Any other value is treated as ascending order.
utm_zone (string (##N or ##S), default='29N') – UTM zone in the form of a number and the North or South hemisphere, e.g. 10S or 03N. Must be given if utm2deg is set to True.
full_coordinates (bool, default=True,) – Convert latitude and longitude to approximate UTM values. Easting and northing are computed using the reference ellipsoid (=23) with WGS84. If False, easting and northing are not computed and are set to null.
raise_warning (bool, default=False,) – Raises warnings if GDAL is not set or about the accuracy status of the coordinates.
as_frame (bool, default=False,) – If True, outputs the data as a pandas dataframe, Boxspace object otherwise.
seed (int, Optional,) – Allows reproducing the same data. If a value is passed, the same data is reproduced at the same sample points.
is_utm (bool, default=False) –
Type of coordinates passed to the reflat and reflong params for generating longitude-latitude coordinates. If is_utm is explicitly set to True, the reflong and reflat values are in UTM coordinates, and the conversion to longitude-latitude is performed. However, if is_utm is False while the reflat and reflong values are greater than 90 and 180 degrees respectively, an error is raised.

New in version 0.2.1.
epsg (int, str, Optional) – EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.
coord_kws (dict,) – Additional keywords passed to makeCoords().

Return type:

(pd.Dataframe | Boxspace )

Examples

>>> from watex.datasets.gdata import make_erp
>>> erp_data = make_erp (n_stations =50 , step =30  , as_frame =True)
>>> erp_data.head(3)
   station  longitude  latitude        easting    northing  resistivity
0        0 -13.488511  0.000997  668210.580864  110.183287   225.265306
1       30 -13.488511  0.000997  668210.581109  110.183482   327.204082
2       60 -13.488510  0.000997  668210.581355  110.183676   204.877551

watex.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)[source]#

Make a pipeline to transform data at once.

A naive pipeline is useful to quickly preprocess the data at once for quick prediction.

Works with a pandas dataframe. If no features are set, the numerical and categorical features are automatically retrieved.

Parameters:

X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
num_features (list or str, optional) – Numerical features put on the list. If num_features are given whereas cat_features are None, cat_features are figured out automatically.
cat_features (list of str, optional) – Categorical features put on the list. If cat_features are given whereas num_features are None, num_features are figured out automatically.
label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – Kind of encoding used to encode the label. This assumes ‘y’ is supplied.
scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – Kind of scaling used to scale the numerical data. Note that for the categorical data encoding, ‘sklearn.preprocessing.OneHotEncoder’ is implemented under the hood instead.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
impute_strategy (str, default='median') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
  
  strategy=”constant” for fixed value imputation.
sparse_output (bool, default=True) – Used when the label y is given. Binarize labels in a one-vs-all fashion. If True, the array returned from transform is in sparse CSR format.
for_pca (bool, default=False,) – Transform data for principal component analysis (PCA). If set to True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.
transform (bool, default=False,) – Transform data in place rather than returning the naive pipeline.

Returns:

full_pipeline (watex.exlib.sklearn.FeatureUnion) –
- Full pipeline composed of numerical and categorical pipes
(X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) –
- Transformed data.

Examples

>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs

(1) Make a naive simple pipeline with RobustScaler, StandardScaler

>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs (as_frame=True )# get all the data
>>> pipe = make_naive_pipe(X_, scaler =RobustScaler )

(2) Transform X in place with numerical and categorical features with StandardScaler (default). Returned CSR matrix

>>> make_naive_pipe(X_, transform =True )
... <181x40 sparse matrix of type '<class 'numpy.float64'>'
    with 2172 stored elements in Compressed Sparse Row format>

watex.make_ves(*, samples=31, min_rho=10.0, max_rho=1000.0, max_depth=100.0, order='-', as_frame=False, seed=None, iorder=3, xy=None, is_utm=False, add_xy=False, utm_zone=None, epsg=None)[source]#

Generate Vertical Electrical Sounding (VES) data from pseudo-depth measurements.

For a large number of pseudo-depth measurements, the number of samples can be set to a larger value. By default, samples=31 measurements in depth are presumed to be collected.

Parameters:

samples (int, default=31) – Number of depth measurements AB/2 in meters.
max_rho (float, default=1e3) – Maximum resistivity value expected in depth on the survey area in $\Omega.m$
min_rho (float, default=1e1) – Minimum resistivity value expected in depth on the survey area in $\Omega.m$
order (str , default='-') – Direction of the projection line. By default the projected line is in ascending order i.e. from SW to NE with angle r set to 45 degrees. Could be - for descending order. Any other value should be in ascending order.
max_depth (float, default=100) – Maximum depth expected to be reached by AB/2 in meters.
as_frame (bool, default=False,) – If True, outputs the data as a pandas dataframe, Boxspace object otherwise.
seed (int, Optional,) – Allows reproducing the same data. If a value is passed, the same data is reproduced at the same sample points.
iorder (int, default=3) – Inflexion order. It is a positive value greater than 0. If None, it is computed using the length of the extrema (local + global). It should also be as low as possible to make the fitted VES curve more realistic.
xy (tuple, optional) –
Coordinates of the point (easting, northing) or (lon, lat) corresponding to the VES points sves. If coordinate values are not given, coordinates are randomly generated as (lon, lat) and stored into the attribute xy. To return the xy auto-coordinates when as_frame=True, set add_xy to True.

New in version 0.2.1.
is_utm (bool, default=False) – In principle, xy is expected to be in longitude-latitude coordinates. However, if the coordinates are passed in UTM, such as easting-northing, the user can specify the utm_zone to convert the xy values into valid longitude and latitude coordinates.
add_xy (bool, default=False) – Add xy coordinates to the VES dataframe.
utm_zone (str, Optional) – To generate coordinates xy from a specific zone, utm_zone can be specified, otherwise 29N is used instead.
epsg (int, str, Optional) – EPSG number defining projection. See http://spatialreference.org/ref/ for moreinfo. Overrides utm_zone if both are provided

Return type:

(pd.Dataframe | Boxspace )

Notes

When returning the Boxspace object, each column of the ‘VES’ data can be retrieved as an attribute. Check the examples below.

Examples

>>> from watex.datasets.gdata import make_ves
>>> b = make_ves (samples =50 , order ='+') # 50 measurements in deeper
>>> b.resistivity [:7]
array([429.873 , 434.255 , 438.5707, 442.8203, 447.0042, 451.1228,
       457.5775])
>>> b.frame.head(3)
    AB   MN  resistivity
0  1.0  0.6   429.872999
1  2.0  0.6   434.255018
2  3.0  0.6   438.570675
>>> ves_data = make_ves (samples =50 , min_rho =10, max_rho =1e5 ,
                         as_frame =True, add_xy= True ,
                         xy = ( 3143965.855 , 336704.455) ,
                         is_utm = True , utm_zone = '49N', epsg =None)
>>> ves_data.head(2)
Out[316]:
    AB   MN   resistivity   longitude   latitude
0  1.0  0.6  51544.426685  107.901553 -61.802165
1  2.0  0.6  51420.739513  107.901553 -61.802165

watex.nPCA(X, n_components=None, *, view=False, return_X=True, plot_kws=None, n_axes=None, **pca_kws)[source]#

Normal Principal Components analysis (PCA)

PCA is by far the most popular dimensionality reduction algorithm. It first identifies the hyperplane that lies closest to the data and then projects the data onto it.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
n_components (int, optional) – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the ratio of variance to preserve. If None (the default), 95% of the variance is preserved.
return_X (bool, default =True ,) – Return the training set transformed with the most representative variance ratio.
view (bool,default=False,) – Plot the explained variance as a function of the number of dimensions.
n_axes (int, optional,) – Number of most important components for which the variance ratio is retrieved. If None, the feature importance is computed using the cumulative variance representing 95%.
pca_kws (dict,) – Additional keyword arguments passed to sklearn.decomposition.PCA
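The 95% rule mentioned for n_components can be sketched as follows. This is a plain-Python illustration and not the watex implementation: the helper name n_components_for_variance and the sample variances are hypothetical, and the per-component variances are assumed to be sorted in descending order.

```python
from itertools import accumulate


def n_components_for_variance(explained_variance, ratio=0.95):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `ratio`."""
    total = sum(explained_variance)
    cumulative = accumulate(v / total for v in explained_variance)
    for k, cum in enumerate(cumulative, start=1):
        if cum >= ratio:
            return k
    return len(explained_variance)


# Hypothetical per-component variances, sorted in descending order.
variances = [4.0, 3.0, 2.0, 0.6, 0.4]
print(n_components_for_variance(variances))        # 4
print(n_components_for_variance(variances, 0.5))   # 2
```

With these values the first three components carry about 90% of the variance, so a fourth one is needed to reach the 95% threshold.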

Returns:

The transformed training set or the PCA container attributes for plotting purpose.

Return type:

X or nPCA object

Examples

>>> from watex.analysis.dimensionality import nPCA
>>> from watex.datasets import fetch_data
>>> X, _= fetch_data('Bagoue analysed dataset')
>>> pca = nPCA(X, 0.95, n_axes =3, return_X=False)
>>> pca.components_
>>> pca.feature_importances_

watex.naive_evaluation(clf, X, y, cv=7, scoring='accuracy', display='off', **kws)[source]#

Quick scores evaluation using cross validation.

Parameters:

clf (callable) – Classifier to evaluate on the data.
X (ndarray) – Training set data.
y (array_like) – Label data.
cv (int) – Number of K-folds used for validation.
scoring (str) – Scoring metric used for the evaluation.
display (str or bool,) – Show the scores on stdout.
kws (dict,) – Additional keyword arguments passed to watex.exlib.slearn.cross_val_score().

Returns:

scores, mean_score – scores after evaluation and the mean of the scores

Return type:

array_like, float
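The idea behind the returned pair (per-fold scores and their mean) can be sketched without any dependency. The helpers below (kfold_indices, cross_val_accuracy, majority_class) are hypothetical names used only for illustration; watex delegates this work to scikit-learn's cross_val_score, which also supports stratified folds and other metrics.

```python
from statistics import mean


def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_accuracy(fit_predict, X, y, cv=4):
    """Return (scores, mean score). `fit_predict(X_train, y_train, X_test)`
    must return the predicted labels for X_test."""
    scores = []
    for train, test in kfold_indices(len(X), cv):
        y_pred = fit_predict([X[i] for i in train],
                             [y[i] for i in train],
                             [X[i] for i in test])
        hits = sum(p == y[i] for p, i in zip(y_pred, test))
        scores.append(hits / len(test))
    return scores, mean(scores)


def majority_class(X_train, y_train, X_test):
    """Toy classifier: always predict the most common training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


X = [[i] for i in range(8)]
y = [0, 0, 0, 1, 0, 0, 1, 0]
scores, mean_score = cross_val_accuracy(majority_class, X, y, cv=4)
print(scores, mean_score)  # [1.0, 0.5, 1.0, 0.5] 0.75
```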

Examples

>>> import watex as wx
>>> from watex.models.validation import naive_evaluation
>>> X,  y = wx.fetch_data ('bagoue data prepared')
>>> clf = wx.sklearn.DecisionTreeClassifier()
>>> naive_evaluation(clf, X, y , cv =4 , display ='on' )
clf=: DecisionTreeClassifier
scores=: [0.6279 0.7674 0.7093 0.593 ]
scores.mean=: 0.6744186046511629
Out[57]: (array([0.6279, 0.7674, 0.7093, 0.593 ]), 0.6744186046511629)

watex.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, keep_empty_features=False, **fit_params)[source]#

Impute missing values in the data.

When the data contains categorical features, passing ‘bi-impute’ to the mode parameter imputes both the numerical and the categorical features rather than raising an error when the strategy is not set to ‘most_frequent’.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data whose missing values are imputed along the features axis.
y (None) – Not used, present here for API consistency by convention.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
  
  strategy=”constant” for fixed value imputation.
mode (str, ['bi-impute'], default=None) – If mode is set to ‘bi-impute’, both the numerical and categorical features are imputed and a single imputed dataframe is returned.
drop_features (bool or list, default =False,) – Drop a list of features from the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
keep_empty_features (bool, default=False) –
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.

New in version 0.2.0.
verbose (int, default=0) – Controls the verbosity of the imputer.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
- If X is not an array of floating values;
- If X is encoded as a CSR matrix;
- If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
fit_params (dict,) – Keyword arguments passed to the scikit-learn fit method. More details on https://scikit-learn.org/stable/
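The four strategies described above can be sketched for a single column in plain Python. This is only an illustration under simplifying assumptions: the helper impute_column is hypothetical, missing entries are marked with None instead of np.nan, and the real work in watex is delegated to scikit-learn.

```python
from collections import Counter
from statistics import mean, median


def impute_column(values, strategy="mean", fill_value=None, missing=None):
    """Replace the entries equal to `missing` according to `strategy`."""
    observed = [v for v in values if v is not missing]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        counts = Counter(observed)
        top = max(counts.values())
        # Ties are broken by taking the smallest of the most frequent values.
        fill = min(v for v, c in counts.items() if c == top)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is missing else v for v in values]


print(impute_column([1.0, None, 3.0, 5.0]))
# [1.0, 3.0, 3.0, 5.0]
print(impute_column(["cat", None, "dog", "cat", "dog"], strategy="most_frequent"))
# ['cat', 'cat', 'dog', 'cat', 'dog']
```

The second call shows why “most_frequent” is the only statistic-based strategy usable on string columns: a mean or median is not defined for them.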

Returns:

Xi – Data imputed

Return type:

Dataframe, array-like, sparse matrix of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X= np.random.randn ( 7, 4 )
>>> X[3, :] =np.nan  ; X[:, 3][-4:]=np.nan
>>> naive_imputer  (X)
... array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
           [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
           [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
           [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
           [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
           [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
           [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame (X, columns =['a', 'b', 'c', 'd']  )
>>> # change [bc] types to categorical values.
>>> frame['b']=['pineaple', '', 'cabbage', 'watermelon', 'onion',
                'cabbage', 'onion']
>>> frame['c']=['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> naive_imputer(frame, mode ='bi-impute')
...             b      c         a         d
    0    pineaple   lion  1.347835  0.434558
    1     cabbage    cat  0.368432 -0.293720
    2     cabbage    cat -1.689750  0.189391
    3  watermelon    cat  0.060138  0.110076
    4       onion    dog -0.271291  0.110076
    5     cabbage    cat  1.092240  0.110076
    6       onion  mouse -0.486639  0.110076

watex.naive_scaler(X, y=None, *, kind=<class 'sklearn.preprocessing._data.StandardScaler'>, copy=True, with_mean=True, with_std=True, feature_range=(0, 1), clip=False, norm='l2', **fit_params)[source]#

Quick data scaling using the strategies implemented in scikit-learn, such as StandardScaler and MinMaxScaler.

The function returns a scaled dataframe if a dataframe is passed, otherwise an ndarray. For other kinds of scaling, call scikit-learn directly.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
kind (str, default='StandardScaler') – Kind of data scaling. Can also be [‘MinMaxScaler’, ‘Normalizer’]. The default is ‘StandardScaler’
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
clip (bool, default=False) – Set to True to clip transformed values of held-out data to provided feature range.
fit_params (dict,) – Keyword arguments passed to the scikit-learn fit method. More details on https://scikit-learn.org/stable/
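The three kinds of scaling differ in what they operate on: standard and min-max scaling work per feature (column), whereas normalization works per sample (row). A minimal plain-Python sketch, with hypothetical helper names and no handling of constant columns or zero rows, could look like this; watex itself relies on the scikit-learn transformers.

```python
from math import sqrt


def standard_scale(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu = sum(column) / len(column)
    sigma = sqrt(sum((v - mu) ** 2 for v in column) / len(column))
    return [(v - mu) / sigma for v in column]


def minmax_scale(column, feature_range=(0, 1)):
    """Linearly map the column onto `feature_range`."""
    lo, hi = min(column), max(column)
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in column]


def l2_normalize(row):
    """Rescale one sample so that its Euclidean norm equals 1."""
    norm = sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


print(standard_scale([1.0, 2.0, 3.0]))
print(minmax_scale([1.0, 2.0, 3.0]))   # [0.0, 0.5, 1.0]
print(l2_normalize([3.0, 4.0]))        # [0.6, 0.8]
```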

Returns:

X_sc – Transformed array.

Return type:

{ndarray, sparse matrix} or dataframe of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_scaler
>>> X= np.random.randn (7 , 3 )
>>> X_std = naive_scaler (X )
... array([[ 0.17439644,  1.55683005,  0.24115109],
       [-0.59738672,  1.3166854 ,  1.23748004],
       [-1.6815365 , -1.19775838,  0.71381357],
       [-0.1518278 , -0.32063059, -0.47483155],
       [-0.41335886,  0.13880519,  0.69258621],
       [ 1.45221902, -1.03852015, -0.40157981],
       [ 1.21749443, -0.45541153, -2.00861955]])
>>> # use dataframe
>>> Xdf = pd.DataFrame (X, columns =['a', 'c', 'c'])
>>> naive_scaler (Xdf , kind='Normalizer') # return data frame
...           a         c         c
    0  0.252789  0.967481 -0.008858
    1 -0.265161  0.908862  0.321961
    2 -0.899863 -0.416231  0.130380
    3  0.178203  0.039443 -0.983203
    4 -0.418487  0.800306  0.429394
    5  0.933933 -0.309016 -0.179661
    6  0.795234 -0.051054 -0.604150

watex.ohmicArea(data=None, search=45.0, sum=False, objective='ohmS', **kws)[source]#

Compute the ohmic-area from the Vertical Electrical Sounding data collected in exploration area.

Parameters:

data (pandas DataFrame) – Contains the depth measurements AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region the average depth of water inrush is around 45 m, so search can be set to the average water inrush value.
objective (str) – Type of output. By default, the function outputs the value of the pseudo-area in $\Omega .m^2$. However, for plotting purposes, setting the argument to view alternatively outputs the recomputed and projected X and Y as well as the X and Y values of the expected fractured zone, where X is the AB dipole spacing when imaging to depth and Y is the computed apparent resistivity.
kws (dict) – Additional keyword arguments from VES data operations. See watex.utils.exmath.vesDataOperator() for further details.

Returns:

List of two tuples:

Tuple (ohmS, error, roots):
- ohmS is the computed pseudo-area of the expected fractured zone.
- error is the integration error.
- roots are the integration boundaries of the expected fractured zone, where the basement rock is located above the resistivity transform function. At these points the two curves intersect, i.e. their difference is zero.
Tuple (XY, fit XY, XYohmSarea):
- XY is the ndarray(nvalues, 2) of the processed AB dipole spacing and resistivity rhoa values.
- fit XY is the fitted ndarray(nvalues, 2) used to redraw the dummy resistivity transform function.
- XYohmSarea is the ndarray(nvalues, 2) of the dipole spacing and resistivity values of the expected fracture zone.

Raises:

VESError: If the search is greater or equal to the maximum investigation depth in meters.

Notes

The ohmS value calculated from the pseudo-area is a fully data-driven parameter. It is used to evaluate a pseudo-area of the fracture zone from the depth where the basement rock is supposed to start. Usually, when exploring deeper with the Vertical Electrical Sounding, we are looking for groundwater in the fractured rock that lies outside the anthropic pollution (Biemi, 1992). Since the VES is an indirect method, we cannot ascertain whether the presumed fractured rock contains water. However, we assume that the fracture zone could exist and should contain groundwater. Mathematically, based on the VES1D model proposed by Koefoed, O. (1976), we consider a function $\rho_T(l)$, a set of reduced resistivity transform functions that lower the boundary plane at half the current electrode spacing $(l)$. From the sounding curve $\rho_T(l)$, an imaginary basement rock curve $b_r(l)$ with a slope of 45° to the horizontal $h(l)$ is created. A pseudo-area $S(l)$ is defined by extending the $b_r(l)$ curve from $h(l)$ wherever the sounding curve $\rho_T(l)$ is below $b_r(l)$; otherwise $S(l)$ is zero. The computed area is called the ohmic-area $ohmS$, expressed in $\Omega .m^2$, and constitutes the expected fractured zone. Thus $ohmS \neq 0$ confirms the existence of the fracture zone while $ohmS = 0$ raises doubts. The equation to determine the parameter is given as:

\[
\begin{aligned}
ohmS &= \int_{l_i}^{l_{i+1}} S(l)\,dl \quad \text{s.t.} \\
S(l) &= \begin{cases} b_r(l) - \rho_T(l) & \text{if} \quad b_r(l) > \rho_T(l) \\ 0 & \text{if} \quad b_r(l) \leq \rho_T(l) \end{cases} \\
b_r(l) &= l + h(l) \quad ; \quad h(l) = \beta \\
\rho_T(l) &= l^2 \int_{0}^{\infty} T_i(\lambda)\, h_1(\lambda l)\, \lambda\, d\lambda
\end{aligned}
\]

where $l_i$ and $l_{i+1}$ solve the equation $S(l)=0$; $l$ is half the current electrode spacing $AB/2$; $h_1$ denotes the first-order Bessel function of the first kind; $\beta$ is the y-axis coordinate of the intercept of $b_r(l)$ and $h(l)$; $T_i(\lambda)$ is the resistivity transform function; $\lambda$ denotes the integration variable; $n$ denotes the number of layers; and $\rho_i$ and $h_i$ are the resistivity and thickness of the $i$-th layer, respectively. Further explanation and clarification of the formula can be found in the paper of Kouadio et al., 2022.
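As a numerical illustration of the definition of $ohmS$, the sketch below integrates $S(l) = \max(b_r(l) - \rho_T(l), 0)$ with the trapezoidal rule over sampled values. It is not the watex implementation: the helper name ohmic_area is hypothetical, $\rho_T$ is supplied as precomputed samples instead of being fitted from field data, and the search depth is ignored.

```python
def ohmic_area(l_values, rho_t, beta=0.0):
    """Trapezoidal estimate of the integral of
    S(l) = max(b_r(l) - rho_T(l), 0), with b_r(l) = l + beta."""
    s = [max((l + beta) - r, 0.0) for l, r in zip(l_values, rho_t)]
    area = 0.0
    for i in range(1, len(l_values)):
        area += 0.5 * (s[i] + s[i - 1]) * (l_values[i] - l_values[i - 1])
    return area


# Toy case: constant transform rho_T = 5 on l in [0, 10] with beta = 0.
# b_r(l) = l exceeds rho_T only for l > 5, so the area is the triangle
# under (l - 5) between 5 and 10, i.e. 12.5.
l_values = [i / 100 for i in range(1001)]
rho_t = [5.0] * len(l_values)
print(round(ohmic_area(l_values, rho_t), 6))  # 12.5
```

In this toy case the roots of $S(l)=0$ reduce to the single crossing at $l = 5$ and the end of the sampled interval.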


Examples

>>> from watex.utils.exmath import ohmicArea
>>> from watex.utils.coreutils import vesSelector
>>> data = vesSelector (f= 'data/ves/ves_gbalo.xlsx')
>>> (ohmS, err, roots), *_ = ohmicArea(data = data, search =45, sum =True )
... (13.46012197818152, array([5.8131967e-12]), array([45.        , 98.07307307]))
# pseudo-area is computed between the spacing point AB =[45, 98] depth.
>>> _, (XY, XYfit, XYohms_area) = ohmicArea(
                AB= data.AB, rhoa =data.resistivity, search =45,
                objective ='plot')
>>> XY.shape, XYfit.shape, XYohms_area.shape
... ((26, 2), (1000, 2), (8, 2))

class watex.pModels(model='svm', target='bin', kernel=None, oob_score=False, objective='fr')[source]#

Bases: object

Pretrained Models class.

The pretrained model class is composed of estimators already trained in a case study region in West Africa, the Bagoue region. Refer to Kouadio et al., 2022 for further details. It is a set of support vector machines, decision tree, k-nearest neighbors, extreme gradient boosting machines, benchmark voting classifier, and bagging classifier. Each pretrained model is considered as a class object whose attributes hold the training parameters from the cross-validation results.

Parameters:

model: str

Name of the pretrained model. Note that the pretrained SVM is composed of four kernels: rbf for radial basis function, poly for polynomial, sig for sigmoid and lin for linear. Default is rbf. Each kernel is a model attribute of the SVM class. For instance, to retrieve the pretrained model with kernel = ‘poly’ after fitting the pModels class, use:

>>> pModels(model='svm', kernel='poly').fit().SVM.poly.best_estimator_
... SVC(C=128.0, coef0=7, degree=5, gamma=0.00048828125, kernel='poly', tol=0.01)
>>> # or
>>> pModels(model='svm', kernel='poly').fit().estimator_
... SVC(C=128.0, coef0=7, degree=5, gamma=0.00048828125, kernel='poly', tol=0.01)

kernel: str

kernel refers to the SVM kernels. It can be rbf for radial basis function, poly for polynomial, sig for sigmoid and lin for linear. There is no need to provide it since it can be retrieved as an attribute of the SVM model:

>>> pModels(model='svm').fit().SVM.rbf # is an object instance
>>> # to retrieve the rbf values use the attribute `best_estimator_`
>>> pModels(model='svm').fit().SVM.rbf.best_estimator_
...  SVC(C=2.0, coef0=0, degree=1, gamma=0.125)

target: str

Two types of classification are predicted: the binary classification bin and the multiclass classification multi. Default is bin. When setting target to multi, be aware that only the SVMs are trained for multiclass prediction. Furthermore, bin predicts the flow rate (FR) with labels {0} and {1}, where {0} means $FR \leq 1 m^3/hr$ and {1} means $FR > 1 m^3/hr$. For multi, four classes are predicted:

\[
\begin{aligned}
FR0 &: \quad FR = 0 \\
FR1 &: \quad 0 < FR \leq 1 \; m^3/hr \\
FR2 &: \quad 1 < FR \leq 3 \; m^3/hr \\
FR3 &: \quad FR > 3 \; m^3/hr
\end{aligned}
\]
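The two labelling schemes can be written as a small helper. This is only a sketch of the thresholds stated above: the function name flow_rate_class is hypothetical, and the way the pretrained models actually encode the multiclass labels (strings or integers) is not specified here, so the string labels are an assumption.

```python
def flow_rate_class(fr, target="bin"):
    """Label a flow rate `fr` (in m^3/hr) for the 'bin' or 'multi' target."""
    if fr < 0:
        raise ValueError("flow rate cannot be negative")
    if target == "bin":
        return 0 if fr <= 1 else 1
    if target == "multi":
        if fr == 0:
            return "FR0"
        if fr <= 1:
            return "FR1"
        if fr <= 3:
            return "FR2"
        return "FR3"
    raise ValueError(f"target must be 'bin' or 'multi', got {target!r}")


print([flow_rate_class(fr) for fr in (0, 1, 2.5, 7)])
# [0, 0, 1, 1]
print([flow_rate_class(fr, "multi") for fr in (0, 1, 2.5, 7)])
# ['FR0', 'FR1', 'FR2', 'FR3']
```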

oob_score: bool,

Out-of-bag score. Setting oob_score to True retrieves the pretrained models that were trained with oob_score set to True. The fine-tuned pretrained models trained with oob_score set to True are ‘RandomForest’ and ‘Extratrees’.

objective: str, default=’fr’

The prediction goal, i.e. the reason for storing the pretrained models. The default objective is ‘fr’, i.e. flow rate prediction. Other objectives will be added as new engineering problems are solved and published.


fit(X=None, y=None, **fit_params)[source]#

Fit X and y with the pretrained models.

Note that to retrieve only the pretrained model, nothing should be passed to the fit method. For instance, to fetch the best SVM estimator with kernel = ‘sigmoid’, one just needs to fit the pModels class as follows:

>>> pModels(model='svm', kernel='sigmoid').fit().estimator_
Out[24]: SVC(C=512.0, coef0=0, degree=1, gamma=0.001953125, kernel='sigmoid', tol=1.0)

If model=’svm’ and no kernel is passed, rbf is used as the default.

Parameters:

X (Ndarray of shape ( M x N), $M=m-samples x N=n-features$) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M, ), $M=m-samples$) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Returns:

Returns self for easy method chaining.

Return type:

pModels instance

property inspect#: Inspect whether the object is fitted or not

pdefaults_ = [('xgboost', 'ExtremeGradientBoosting'), ('svc', 'SupportVectorClassifier'), ('dtc', 'DecisionTreeClassifier'), ('stc', 'StackingClassifier'), ('bag', 'BaggingClassifier'), ('logit', 'LogisticRegression'), ('vtc', 'VotingClassifier'), ('rdf', 'RandomForestClassifier'), ('ada', 'AdaBoostClassifier'), ('extree', 'ExtraTreesClassifier'), ('knn', 'KNeighborsClassifier')]#

predict(X)[source]#

Predict object from the pretrained model

Parameters:

X (Ndarray of shape ( M x N), $M=m-samples x N=n-features$) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

Returns:

y_pred – the predicted target values from X.

Return type:

Array-like, shape (M, )

watex.plotAnomaly(erp, cz=None, station=None, fig_size=(10, 4), fig_dpi=300, savefig=None, show_fig_title=True, style='seaborn', fig_title_kws=Ellipsis, czkws=Ellipsis, legkws=Ellipsis, how='py', **kws)[source]#

Plot the whole Electrical Resistivity Profiling line and selected conductive zone.

The conductive zone can be supplied manually as a subset of the erp or by specifying the station expected for the drilling location, for instance S07 for the seventh station. Furthermore, for automatic detection, one should set the station argument to auto. However, it is recommended to provide the cz or the station to have full control. The conductive zone is overlaid on the whole Electrical Resistivity Profiling survey. The user can customize the cz plot by passing additional Matplotlib pyplot keyword arguments through czkws.

Parameters:

erp: array_like 1d

The Electrical Resistivity Profiling survey line. The line is an array of resistivity values. Note that if a dataframe is passed, be sure that the frame matches the DC resistivity data (ERP), otherwise an error occurs. At a minimum, the frame columns must include the resistivity and the stations.

cz: array_like 1d

The selected conductive zone. If None, only the erp is displayed. Note that cz is a subset of the erp array.

station: str, optional

The station location given as a string (e.g. station="S10") or as a station number (indexing; e.g. station=10). If the value is set to "auto", the station is found automatically and cz is fetched as well.

fig_size: tuple, default =(10, 4)

Tuple value of figure size. Refer to the web resources Matplotlib figure.

fig_dpi: int , default=300,

figure resolution “dot per inch”. Refer to Matplotlib figure.

savefig: str, optional,

save the figure. Refer to Matplotlib figure.

show_fig_title: bool, default =True

display the title of the figure.

fig_title_kws: dict,

Keyword arguments of the figure suptitle. Refer to Matplotlib figsuptitle.

style: str, default=’seaborn’

The style for customizing the visualization. For instance, to get the first seven available styles in pyplot, one can run the script below:

plt.style.available[:7]

Further details can be found in the web resources below or on GeeksforGeeks.

how: str, default=’py’

By default (how='py'), the stations are named following Python indexing, i.e. counting starts from station 00 (S00). Any other value starts the station naming from 1.

czkws: dict,

Additional Matplotlib pyplot keyword arguments to customize the cz plot.

legkws: dict,

Additional keywords Matplotlib legend arguments.

kws: dict,

Additional Matplotlib pyplot keyword arguments to customize the erp plot.
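The effect of the how parameter on station naming can be sketched as follows. The helper station_names is hypothetical and the two-digit zero padding is an assumption inferred from the names S00, S07 and S10 used in this description; it is not taken from the watex source.

```python
def station_names(n_stations, how="py"):
    """Name stations 'S00', 'S01', ... when how='py' (Python indexing),
    otherwise start counting from 1 ('S01', 'S02', ...)."""
    start = 0 if how == "py" else 1
    return [f"S{i:02d}" for i in range(start, start + n_stations)]


print(station_names(3))             # ['S00', 'S01', 'S02']
print(station_names(3, how="one"))  # ['S01', 'S02', 'S03']
```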

See also

watex.erpSmartDetector: Detects the conductive zone by applying the constraints. Set view=True to visualize the constraints.


Examples

>>> import numpy as np
>>> from watex.utils import plotAnomaly, defineConductiveZone
>>> test_array = np.abs (np.random.randn (10)) *1e2
>>> selected_cz ,*_ = defineConductiveZone(test_array, 7)
>>> plotAnomaly(test_array, selected_cz )
>>> plotAnomaly(test_array, selected_cz , station= 5)
>>> plotAnomaly(test_array, station= 's02')
>>> plotAnomaly(test_array)

watex.plotDendrogram(df, columns=None, labels=None, metric='euclidean', method='complete', kind=None, return_r=False, verbose=False, **kwd)[source]#

Visualizes the linkage matrix as a dendrogram.

Note that categorical features, if they exist in the dataframe, are automatically discarded.

Parameters:

df (dataframe or NDArray of (n_samples, n_features)) – Dataframe or Ndarray. If an array is given, the column names must be specified and must match the array shape along axis 1.
columns (list) – List of labels to name each column of the array of (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.
kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – Kind of approach used to build the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix; this is the form that pdist returns. Alternatively, a collection of $m$ observation vectors in $n$ dimensions may be passed as an $m$ by $n$ array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. Alternatively, the squareform distance matrix can be used, although it yields different distance values than expected. The design approach uses the complete input example matrix, also called the ‘design matrix’, to obtain a correct linkage matrix similar to squareform and condense.
metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section below for full descriptions in watex.utils.exmath.linkage_matrix()
labels (ndarray, optional) – By default, labels is None so the index of the original observation is used to label the leaf nodes. Otherwise, this is an $n$-sized sequence, with n == Z.shape[0] + 1. The labels[i] value is the text to put under the $i$ th leaf node only if it corresponds to an original observation and not a non-singleton cluster.
return_r (bool, default='False',) – Return the r-dictionary if set to ‘True’, otherwise return nothing.
verbose (int, bool, default='False') – If True, output a message with the names of the dropped categorical features.
kwd (dict) – Additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram()
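The difference between the condensed and squareform layouts mentioned for kind can be made concrete with a small plain-Python sketch. The helpers condensed_distances and to_squareform are hypothetical stand-ins for what scipy.spatial.distance.pdist and squareform compute with the Euclidean metric; they are shown only to illustrate the two layouts.

```python
from math import dist


def condensed_distances(X):
    """Flat list of pairwise Euclidean distances for i < j,
    i.e. the upper triangle of the distance matrix read row by row."""
    n = len(X)
    return [dist(X[i], X[j]) for i in range(n) for j in range(i + 1, n)]


def to_squareform(condensed, n):
    """Expand a condensed distance list into a symmetric n x n matrix
    with zeros on the diagonal."""
    square = [[0.0] * n for _ in range(n)]
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    for (i, j), d in zip(pairs, condensed):
        square[i][j] = square[j][i] = d
    return square


# Design matrix: 3 observations in 2 dimensions.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
condensed = condensed_distances(X)
print(condensed)                    # [5.0, 10.0, 5.0]
print(to_squareform(condensed, 3))
# [[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]]
```

For $m$ observations the condensed form holds $m(m-1)/2$ values, whereas the squareform holds $m \times m$ values with redundant symmetric entries.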

Returns:

r – A dictionary of data structures computed to render the dendrogram. It has the following keys:

'color_list': A list of color names. The k’th element represents the color of the k’th link.
'icoord' and 'dcoord': Each of them is a list of lists. Let icoord = [I1, I2, ..., Ip] where Ik = [xk1, xk2, xk3, xk4] and dcoord = [D1, D2, ..., Dp] where Dk = [yk1, yk2, yk3, yk4], then the k’th link painted is (xk1, yk1) - (xk2, yk2) - (xk3, yk3) - (xk4, yk4).
'ivl': A list of labels corresponding to the leaf nodes.
'leaves': For each i, H[i] == j, cluster node j appears in position i in the left-to-right traversal of the leaves, where $j < 2n-1$ and $i < n$. If j is less than n, the i-th leaf node corresponds to an original observation. Otherwise, it corresponds to a non-singleton cluster.
'leaves_color_list': A list of color names. The k’th element represents the color of the k’th leaf.

Return type:

dict

Examples

>>> from watex.datasets import load_iris
>>> from watex.view import plotDendrogram
>>> data = load_iris ()
>>> X =data.data[:, :2]
>>> plotDendrogram (X, columns =['X1', 'X2' ] )

watex.plotLearningInspections(models, X, y, fig_size=(22, 18), cv=None, savefig=None, titles=None, subplot_kws=None, **kws)[source]#

Inspect multiple models from their learning curves.

Multiple inspection plots that generate, for each model, the test and training learning curves, the training samples vs fit times curve, and the fit times vs score curve.

Parameters:

models (list of estimator instances) – Each estimator instance implements fit and predict methods which will be cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples) or (n_samples, n_features)) – Target relative to X for classification or regression; None for unsupervised learning.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- integer, to specify the number of folds.
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold used. If the estimator is not a classifier or if y is neither binary nor multiclass, KFold is used.

Refer to the Scikit-learn User Guide for the various cross-validators that can be used here.
savefig (str, default =None ,) – The path where the figures are saved. The argument is passed to the matplotlib.Figure class.
titles (str, list) – List of model names if changes are needed. If None, model names are used by default.
kws (dict,) – Additional keywords argument passed to plotLearningInspection().

Returns:

axes

Return type:

Matplotlib axes

See also

plotLearningInspection: Inspect single model

Examples

>>> from watex.datasets import fetch_data
>>> from watex.models.premodels import p
>>> from watex.view.mlplot import plotLearningInspections
>>> # import sparse  matrix from Bagoue dataset
>>> X, y = fetch_data ('bagoue prepared')
>>> # import the two pretrained models from SVM
>>> models = [p.SVM.rbf.best_estimator_ , p.SVM.poly.best_estimator_]
>>> plotLearningInspections (models , X, y, ylim=(0.7, 1.01) )

watex.plotOhmicArea(data=None, search=45.0, pre_computed=False, xy=None, xyf=None, xyarea=None, colors=None, fbtw=False, **plot_kws)[source]#

Plot the Vertical Electrical Sounding data ohmic-area.

Parameters:

data (pandas DataFrame) – Contains the depth measurements AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region the average depth of water inrush is around 45 m, so search can be set to the average water inrush value.
pre_computed (bool, default=False,) – If True, compute the ohmic_area parameters. If False, the ohmic area arguments must be passed to xy, xyf and xyarea, otherwise an error is raised.
xy (array-like of shape (n_AB, 2)) – Array-like of the sanitized depth measurements AB from the current electrodes n_AB. See vesDataOperator().
xyf (array-like of shape (n_fit_samples, 2)) – Array-like of the fitted samples, i.e. the points used for fitting the sounding resistivity values from the surface down to the total depth. The fitted rhoa shows a smooth curve. The default number of points is 1000.
xyarea (array-like of shape (n_area, 2)) – Array-like of the resistivity positions of the depth measurements AB where the fractured zone is found.
fbtw (bool, default=False,) – If True, fill the computed fractured zone using the parameters computed from xyf and xyarea.
kws (dict) – Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.

Notes

The first and second columns of xy, xyf and xyarea are the positions AB/2 and their corresponding resistivity values.

Examples

>>> from watex.datasets import load_semien
>>> from watex.utils.exmath import plotOhmicArea
>>> ves_data = load_semien ()
>>> plotOhmicArea (ves_data)

watex.plotProjection(X, Xt=None, *, columns=None, test_kws=None, **baseplot_kws)[source]#

Visualize train and test dataset based on the geographical coordinates.

When geographical information is available (latitude/longitude or easting/northing), it is a good idea to create a scatterplot of all instances to visualize the data.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.
Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
columns (list of str or index, optional) – columns is useful when a dataframe is given with a dimension size greater than 2. If such data is passed to X or Xt, columns must hold the names to be considered as ‘easting’, ‘northing’ when UTM coordinates are given or ‘latitude’, ‘longitude’ when latlon are given. If the dimension size is greater than 2 and columns is None, an error is raised to prompt the user to provide the index for ‘y’ and ‘x’ coordinate retrieval.
test_kws (dict,) – Keyword arguments passed to matplotlib.pyplot.scatter() as test location font and color properties.
baseplot_kws (dict,) – All the keyword arguments passed to the watex.property.BasePlot class.

Examples

>>> from watex.datasets import fetch_data
>>> from watex.view.mlplot import plotProjection
>>> # Discard all the non-numeric data
>>> # then impute the numerical data
>>> from watex.utils import to_numeric_dtypes, naive_imputer
>>> X, Xt, *_ = fetch_data ('bagoue', split_X_y =True, as_frame =True)
>>> X =to_numeric_dtypes(X, pop_cat_features=True )
>>> X= naive_imputer(X)
>>> Xt = to_numeric_dtypes(Xt, pop_cat_features=True )
>>> Xt= naive_imputer(Xt)
>>> plot_kws = dict (fig_size=(8, 12),
                 lc='k',
                 marker='o',
                 lw =3.,
                 font_size=15.,
                 xlabel= 'easting (m) ',
                 ylabel='northing (m)' ,
                 markerfacecolor ='k',
                 markeredgecolor='r',
                 alpha =1.,
                 markeredgewidth=2.,
                 show_grid =True,
                 galpha =0.2,
                 glw=.5,
                 rotate_xlabel =90.,
                 fs =3.,
                 s =None )
>>> plotProjection( X, Xt , columns= ['east', 'north'],
                    trainlabel='train location',
                    testlabel='test location', **plot_kws
                   )

watex.plotSilhouette(X, labels=None, prefit=True, n_clusters=3, n_init=10, max_iter=300, random_state=None, tol=10000.0, metric='euclidean', **kwd)[source]#

Quantify the quality of clustering samples.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
labels (array-like 1d of shape (n_samples,)) – Label values for each sample.
n_clusters (int, default=3) – The number of clusters to form as well as the number of centroids to generate.
prefit (bool, default=True) – Whether prefit labels are expected to be passed into the function directly or not. If True, labels must be the predicted values from a fitted model. If False, labels is fitted and updated from X by calling the fit_predict method. Any other value passed to labels is discarded.
n_init (int, default=10) – Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=None) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Note

The silhouette coefficient is bounded between -1 and 1.
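As a rough illustration of what the silhouette coefficient measures, the sketch below calls scikit-learn directly instead of plotSilhouette. The synthetic blobs and all variable names are assumptions made for the example; how plotSilhouette computes its values internally is not stated here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data: three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Fit k-means and get one cluster label per sample.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-sample coefficients and their mean; each value lies in [-1, 1].
sample_values = silhouette_samples(X, labels, metric="euclidean")
mean_value = silhouette_score(X, labels, metric="euclidean")
print(f"mean silhouette: {mean_value:.3f}")
```

Values close to 1 indicate samples that sit well inside their cluster, while values near 0 or below suggest overlapping or misassigned samples.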

watex.plot_clusters(n_clusters, X, y_pred, cluster_centers=None, savefig=None)[source]#

Visualize the clusters that k-means identified in the dataset.

Parameters:

n_clusters – int, number of clusters to visualize.
X – NDArray, data containing the features, expected to be two-dimensional.
y_pred – array-like, array containing the predicted class labels.
cluster_centers – NDArray containing the coordinates of the centroids or the similar points with continuous features.

Example:

>>> from watex.exlib.sklearn import KMeans, MinMaxScaler
>>> from watex.utils.plotutils import plot_clusters
>>> from watex.datasets import fetch_data
>>> h= fetch_data('hlogs').frame
>>> # collect two features 'resistivity' and gamma-gamma logging values
>>> h2 = h[['resistivity', 'gamma_gamma']]
>>> km = KMeans (n_clusters =3 , init= 'random' )
>>> # scale the data with MinMaxScaler, i.e. to the range (0, 1)
>>> h2_scaled = MinMaxScaler().fit_transform(h2)
>>> ykm = km.fit_predict(h2_scaled )
>>> plot_clusters (3 , h2_scaled, ykm , km.cluster_centers_ )

watex.plot_confidence_in(z_or_edis_obj_list, /, tensor='res', view='1d', drop_outliers=True, distance=None, c_line=False, view_ci=True, figsize=(6, 2), fontsize=4.0, dpi=300.0, top_label='Stations', rotate_xlabel=90.0, fbtw=True, savefig=None, **plot_kws)[source]#

Plot data confidence from tensor errors.

The default tensor for evaluating the data confidence is the resistivity at TE mode (‘xy’).

Checking the confidence in the data before starting the actual processing is meaningful. In areas with complex terrain, where high topography adds to interference noise, signals are weak or missing, especially in AMT surveys. The most common technique is to eliminate the bad frequencies and interpolate the remaining ones. However, the tricks for eliminating frequencies differ from one author to another. Here, the data confidence is used to indicate which frequencies to eliminate (at which stations/sites) and which ones are still recoverable using the tensor recovery strategy.

The plot implements three levels of confidence:

High confidence: $conf. \geq 0.95$, i.e. values greater than or equal to 95%.
Soft confidence: $0.5 \leq conf. < 0.95$. The data in this confidence range can be beneficial for tensor recovery to restore the weak and missing signals.
Bad confidence: $conf. < 0.5$. Data in this interval must be deleted.
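The three levels above can be sketched as a small classification helper. This is only an illustration of the thresholds; the function name `confidence_level` is hypothetical and is not part of watex.

```python
import numpy as np

def confidence_level(conf):
    """Label confidence ratios as 'high', 'soft' or 'bad' (hypothetical helper)."""
    conf = np.asarray(conf, dtype=float)
    # conf >= 0.95 -> high; 0.5 <= conf < 0.95 -> soft; otherwise bad.
    return np.where(conf >= 0.95, "high", np.where(conf >= 0.5, "soft", "bad"))

levels = confidence_level([0.99, 0.95, 0.7, 0.5, 0.2])
print(levels.tolist())
```

With this labeling, 'soft' entries are the candidates for tensor recovery and 'bad' entries are the ones to delete.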

Parameters:

z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='res') – Tensor name. Can be [‘resistivity’|‘phase’|‘z’|‘frequency’].
view (str, default='1d') – Type of plot. Can be [‘1D’|‘2D’].
drop_outliers (bool, default=True) – Suppress the outliers in the data if True.
distance (float, optional) – Distance between stations/sites.
fontsize (float, default=4.) – Label font size.
figsize (Tuple, default=(6, 2)) – Figure size.
c_line (bool, default=False,) – Display the confidence line in the two-dimensional view.
dpi (int, default=300) – Image resolution in dots per inch.
rotate_xlabel (float, default=90.) – Angle to rotate the stations/sites labels.
top_label (str, default='Stations') – Label for the sites, or the survey name.
view_ci (bool, default=True,) – Show the marker of the confidence interval.
fbtw (bool, default=True,) – Fill between the confidence interval.
plot_kws (dict,) – Additional keyword arguments passed to plot().

See also

watex.methods.Processing.zrestore: For more details about the function for tensor recovering technique.

Examples

>>> from watex.utils.exmath import plot_confidence_in
>>> from watex.datasets import fetch_data
>>> emobj  = fetch_data ( 'huayuan', samples = 25, clear_cache =True,
                         key='raw').emo
>>> plot_confidence_in (emobj.ediObjs_ ,
                        distance =20 ,
                        view ='2d',
                        figsize =(6, 2)
                        )
>>> plot_confidence_in (emobj.ediObjs_ , distance =20 ,
                        view ='1d', figsize =(6, 3), fontsize =5,
                        )

watex.plot_confusion_matrices(clfs, Xt, yt, annot=True, pkg=None, normalize='true', sample_weight=None, encoder=None, fig_size=(22, 6), savefig=None, subplot_kws=None, **scorer_kws)[source]#

Plot the confusion matrices of multiple models inline using either scikit-learn or ‘yellowbrick’.

Parameters:

clfs (list of classifier estimators) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. Note that the classifier must be fitted beforehand.
Xt (ndarray or DataFrame of shape (M, N)) – A matrix of M instances with N features. Preferably, the matrix represents the test data for error evaluation.
yt (ndarray of shape (M,) or Series of length M) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.
pkg (str, optional, default='sklearn') – The library to handle the plot. It could be ‘yellowbrick’. The basic confusion matrix is handled by the scikit-learn package.
normalize ({'true', 'pred', 'all'}, default='true') – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.

annot (bool, default=True) – Annotate the number of samples (right or wrong predictions) in the plot. Set to False to mute the display.
fig_size (tuple (width, height), default=(22, 6)) – The matplotlib figure size given as a tuple of width and height.
savefig (str, default=None) – The path to save the figures. Argument is passed to matplotlib.Figure class.

Examples

>>> import matplotlib.pyplot as plt
>>> plt.style.use ('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models.premodels import p
>>> from watex.utils.plotutils import plot_confusion_matrices
>>> # split the data. Note that fetch_data outputs X and y
>>> X, Xt, y, yt  = train_test_split (* fetch_data ('bagoue analysed'), test_size =.25  )
>>> # compose the models
>>> # from SVM (RBF), logistic regression and random forest
>>> models =[ p.SVM.rbf.best_estimator_,
         p.LogisticRegression.best_estimator_,
         p.RandomForest.best_estimator_
         ]
>>> models
[SVC(C=2.0, coef0=0, degree=1, gamma=0.125), LogisticRegression(),
 RandomForestClassifier(criterion='entropy', max_depth=16, n_estimators=350)]
>>> # now fit all estimators
>>> fitted_models = [model.fit(X, y) for model in models ]
>>> plot_confusion_matrices(fitted_models , Xt, yt)
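To make the effect of the normalize options concrete, here is a minimal sketch that uses scikit-learn's confusion_matrix directly on toy labels. The label arrays are invented for the illustration, and the sketch says nothing about how plot_confusion_matrices renders the result.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): three classes, six samples.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

raw = confusion_matrix(y_true, y_pred)                        # sample counts
by_true = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1
by_pred = confusion_matrix(y_true, y_pred, normalize="pred")  # each column sums to 1
by_all = confusion_matrix(y_true, y_pred, normalize="all")    # whole matrix sums to 1

print(raw)
print(by_true.round(2))
```

Row normalization ('true') reads as per-class recall on the diagonal, while column normalization ('pred') reads as per-class precision.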

watex.plot_elbow(X, n_clusters, n_init=10, max_iter=300, random_state=42, fig_size=(10, 4), marker='o', savefig=None, **kwd)[source]#

Plot the elbow method to find the optimal number of clusters, k, for a given dataset.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
n_init (int, default=10) – Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=42) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
marker (str, default='o',) – cluster marker point.
kwd (dict) – Additional keyword arguments passed to matplotlib.pyplot.plot()

Returns:

Return type:

Matplotlib.pyplot axes objects

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_elbow
>>> # get only the resistivity and gamma-gamma values, for example
>>> res_gamma = load_hlogs ().frame[['resistivity', 'gamma_gamma']]
>>> plot_elbow(res_gamma, n_clusters=11)
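The idea behind the elbow method can be sketched without the plotting layer: fit k-means for a range of k and record the inertia (within-cluster sum of squares). The sketch below uses scikit-learn on synthetic blobs; treating n_clusters as the upper bound of the tested range is an assumption, not something the description above confirms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three real clusters (illustrative only).
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=42)

max_clusters = 8  # assumed upper bound of the range of k to test
inertias = []
for k in range(1, max_clusters + 1):
    km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, inertia in enumerate(inertias, start=1):
    print(k, round(inertia, 1))
```

The "elbow" is the k after which the inertia stops dropping sharply; for this synthetic data one would expect it near k=3.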

watex.plot_learning_curves(models, X, y, *, cv=None, train_sizes=None, baseline_score=0.4, convergence_line=True, fig_size=(20, 6), sns_style=None, savefig=None, set_legend=True, subplot_kws=None, **kws)[source]#

Horizontal visualization of the learning curves of multiple models.

Determines cross-validated training and test scores for different training set sizes.

Parameters:

models (list of estimators) – A list of estimators, each implementing the fit and predict methods, which will be cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

cv default value if None changed from 3-fold to 4-fold.
train_sizes (array-like of shape (n_ticks,), default=np.linspace(0.1, 1, 50)) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
baseline_score (float, default=.4) – Base score at which the y-axis (score) starts.
convergence_line (bool, default=True) – Whether to display the convergence line that indicates the level of bias between the training and validation curves.
fig_size (tuple (width, height), default=(20, 6)) – The matplotlib figure size given as a tuple of width and height.
sns_style (str, optional,) – the seaborn style .
set_legend (bool, default=True) – display legend in each figure. Note the default location of the legend is ‘best’ from legend()
subplot_kws (dict, default is dict(left=0.0625, right = 0.95, wspace = 0.1)) – the subplot keywords arguments passed to matplotlib.subplots_adjust()
kws (dict,) – keyword arguments passed to sklearn.model_selection.learning_curve()

Examples

-> plot via a metaestimator already cross-validated.

>>> from watex.models.premodels import p
>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_learning_curves
>>> X, y = fetch_data ('bagoue prepared') # yields a sparse matrix
>>> # collect four estimators already cross-validated from SVMs
>>> models = [ p.SVM.linear , p.SVM.rbf , p.SVM.sigmoid , p.SVM.poly ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')

-> plot with multiple models not cross-validated yet.

>>> from watex.exlib.sklearn import (LogisticRegression,
                                     RandomForestClassifier,
                                     SVC , KNeighborsClassifier
                                     )
>>> models =[LogisticRegression(), RandomForestClassifier(), SVC() ,
             KNeighborsClassifier() ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')
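Since kws is forwarded to sklearn.model_selection.learning_curve(), it may help to see what that function returns for a single estimator. The following is a minimal sketch on synthetic data; the dataset, the estimator settings and the choice of five training sizes are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=4,                                  # 4-fold cross-validation
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of the training folds
)

# One row per training size, one column per fold.
print(train_sizes)
print(train_scores.mean(axis=1).round(3))
print(test_scores.mean(axis=1).round(3))
```

Averaging each row gives the training and validation curves; the gap between them at the largest training size is what a convergence line would summarize.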

watex.plot_logging(X, y=None, zname=None, tname=None, labels=None, impute_nan=True, normalize=False, log10=False, columns_to_skip=None, pattern=None, strategy='mean', posiy=None, fill_value=None, fig_size=(16, 7), fig_dpi=300, colors=None, cs4_colors=False, sns_style=False, savefig=None, draw_spines=False, seed=None, verbose=0, **kws)[source]#

Plot logging data

The plot expects a collection of logging data. Each logging data composes a column of data collected in the field. Note that the function can also plot any kind of data provided that it contains numerical values. The function does not accept categorical data. If categorical data are given, they are discarded.

Parameters:

X (Dataframe of shape (n_samples, n_features)) – where n_samples is the number of data, expected to be the data collected at different depths, and n_features is the number of columns (features) to be plotted. Note that X must include the depth column. If not given, a relative depth is created according to the number of samples in X.
y (array-like or series of shape (n_samples,), optional) – Target relative to X for classification or regression. If given, the target plot is located at the last position by default. However, with the posiy argument, the target plot can be moved to the desired position.
zname (str, default='depth' or 'None') – The name of the depth column in X. If the depth column is not named ‘depth’, another column name that matches the depth can be indicated so that the function sets this column aside as the depth column for plotting. If set to None, zname holds the name depth and assumes that depth exists in the X columns.
tname (str, optional,) – Name of the target. This renames the target if y is given as a pandas series, or names the target if y is given as an array-like. If not provided, the name of the target series is used when y is not None.
normalize (bool, default = False) – Normalize all the data to range between (0, 1), except the depth.
labels (list or str, optional) – If labels are given, they should fit the number of columns. The given labels replace the old column names in X and appear in the plot. This is useful to change the column labels in the dataframe to new labels that best describe the plot, for instance by including the units. Note that if the labels do not match the number of columns in X, a warning is issued and no operation is performed.
impute_nan (bool, default=True,) – Replace the NaN values in the dataframe. Note that the default behaviour for replacing NaN is the mean. However, if the fill_value argument is provided, the latter is used to replace ‘NaN’ in X.
log10 (bool, default=False) – Convert values to log10. This can be useful when working with logarithmic data. However, not all data can undergo this operation, for instance negative data. In that case, the columns_to_skip argument is useful to skip those columns when converting values to log10.
columns_to_skip (list or str, optional,) – Columns to skip when performing some operation like ‘log10’. These columns will not be affected by the ‘log10’ operations. Note that columns_to_skip can also be given as a literal string. In that case, the pattern is needed to parse the columns into a list of strings.
pattern (str, default = '[#&*@!,;\s]\s*') –
Regex pattern to parse the columns_to_skip into a list of strings where each item is a column name, especially when the latter is given as a literal text string. For instance:
```
columns_to_skip='depth_top, thickness, sp, gamma_gamma'
-> ['depth_top', 'thickness', 'sp', 'gamma_gamma']
```
by using the default pattern. To have full control over how the columns are split, it is recommended to provide your own pattern to avoid wrong parsing, which can lead to an error.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
fill_value (str or numerical value, optional) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types. If not given and impute_nan is True, the mean strategy is used instead.
posiy (int, optional) – The position at which to place the target plot y. By default, the target plot, if given, is located at the last position behind the logging plots.
colors (str, list of Matplotlib.colors map, optional) –
The colors for plotting each column of X except the depth. If not given, default colors are auto-generated.

If colors is a string and includes ‘cs4’ or ‘xkcd’, matplotlib.colors.CSS4_COLORS or matplotlib.colors.XKCD_COLORS is used instead. In addition, if ‘cs4’ or ‘xkcd’ is suffixed by a colon and an integer value like cs4:4 or xkcd:4, the CSS4 or XKCD colors are used from index 4.

New in version 0.2.3: matplotlib.colors.CSS4_COLORS or matplotlib.colors.XKCD_COLORS can be used by setting colors to 'cs4' or 'xkcd'. To reproduce the same CSS4 or XKCD colors, set the seed parameter to a specific value.
draw_spines (bool, tuple (-lim, +lim), default= False,) – Only draw spines between the y-ticks. -lim and +lim are the lower and upper bounds, i.e. a range to draw the spines on the y-axis.
fig_size (tuple (width, height), default =(16, 7)) – The matplotlib figure size given as a tuple of width and height.
fig_dpi (float or 'figure', default=300) – The resolution in dots per inch. If ‘figure’, use the figure’s dpi value.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
seed (int, optional) –
Allows reproducing the matplotlib.colors.CSS4_COLORS selection if colors is set to cs4.

New in version 0.2.3.
verbose (int, default=0) – Output the number of categorical features dropped from the dataframe.
kws (dict,) – Additional keyword arguments passed to matplotlib.axes.Axes.plot()

Examples

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_logging
>>> X0, y = load_hlogs (as_frame =True) # get the frames rather than the object
>>> # plot the default logging with normalize=True
>>> plot_logging (X0, normalize =True)
>>> # Include the target in the plot
>>> plot_logging ( X0,  y = y.kp , posiy = 0,
                  columns_to_skip=['thickness', 'sp'],
                  log10 =True,
                  )
>>> # draw spines and limit plot from (0, 700) m depth
>>> plot_logging (X0 , y= y.kp, draw_spines =(0, 700) )
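The parsing of a literal columns_to_skip string described for the pattern parameter can be sketched with the standard re module. The sketch assumes the default pattern is meant to be `[#&*@!,;\s]\s*` (one separator character followed by optional whitespace); the helper name `parse_columns` is hypothetical, and whether plot_logging uses re.split internally is not confirmed by the text.

```python
import re

# Assumed default pattern: one separator among # & * @ ! , ; or whitespace,
# followed by any amount of extra whitespace.
DEFAULT_PATTERN = r"[#&*@!,;\s]\s*"

def parse_columns(text, pattern=DEFAULT_PATTERN):
    """Split a literal string of column names into a list (hypothetical helper)."""
    # Drop empty strings produced by leading or trailing separators.
    return [name for name in re.split(pattern, text) if name]

print(parse_columns("depth_top, thickness, sp, gamma_gamma"))
print(parse_columns("depth_top;thickness  sp"))
```

Note that a column name containing a space or any of the separator characters would be split as well, which is why a custom pattern is recommended in such cases.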

watex.plot_naive_dendrogram(X, *ybounds, fig_size=(12, 5), savefig=None, **kws)[source]#

Quick plot dendrogram using the ward clustering function from Scipy.

Parameters:

X – ndarray of shape (n_samples, n_features) Array of features
ybounds – int, integer values to draw horizontal cluster lines that indicate the number of clusters.
fig_size – tuple (width, height), default =(12,5) the matplotlib figure size given as a tuple of width and height
kws – dict, Additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram()

Examples:

>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_naive_dendrogram
>>> X, _= fetch_data('Bagoue analysed') # data is already scaled
>>> # get the two features 'power' and  'magnitude'
>>> data = X[['power', 'magnitude']]
>>> plot_naive_dendrogram(data )
>>> # add the horizontal line of the cluster at ybounds = (20, 20)
>>> # for a single cluster (cluster 1)
>>> plot_naive_dendrogram(data , 20, 20 )
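For readers who want to see the underlying SciPy calls, the sketch below builds a Ward linkage and a dendrogram layout on synthetic points, then cuts the tree at a fixed height. The data, the cut height of 5.0 and the variable names are assumptions for the illustration; the exact calls made inside plot_naive_dendrogram are not shown in this page.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, ward

rng = np.random.default_rng(0)
# Two illustrative groups of 2-D points, ten samples each.
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])

# Ward linkage: each row records one merge (cluster a, cluster b, distance, size).
Z = ward(X)

# no_plot=True returns the dendrogram layout only, so no figure is drawn here.
tree = dendrogram(Z, no_plot=True)

# Cutting the tree at a given height plays the role of a horizontal line:
# merges above that height are undone, leaving flat clusters.
labels = fcluster(Z, t=5.0, criterion="distance")
print(sorted(set(labels)))
```

The number of vertical branches crossed by a horizontal line at a given height equals the number of flat clusters obtained by cutting there.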

watex.plot_pca_components(components, *, feature_names=None, cmap='viridis', savefig=None, **kws)[source]#

Visualize the coefficient of principal component analysis (PCA) as a heatmap

Parameters:

components – Ndarray of shape (n_components, n_features) or PCA object. Array of the PCA components or object from watex.analysis.dimensionality.nPCA. If the object is given, it is not necessary to set the feature_names.
feature_names – list or str, optional. List of the feature names to locate in the map. feature_names and the number of eigenvectors must have the same length. If a PCA object is passed as the components argument, there is no need to set the feature_names; the names of the features are retrieved automatically.
cmap – str, default=’viridis’. The matplotlib color map for matshow visualization.
kws – dict, Additional keyword arguments passed to matplotlib.pyplot.matshow

Examples:

(1)-> with PCA object

>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_pca_components
>>> from watex.analysis import nPCA
>>> X, _= fetch_data('bagoue pca')
>>> pca = nPCA (X, n_components=2, return_X =False)# to return object
>>> plot_pca_components (pca)

(2)-> use the components and features individually

>>> components = pca.components_
>>> features = pca.feature_names_in_
>>> plot_pca_components (components, feature_names= features,
                         cmap='jet_r')
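To show what the components array holds before it is drawn as a heatmap, here is a minimal sketch using scikit-learn's PCA on random data rather than nPCA. The feature names are hypothetical labels chosen for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
feature_names = ["f1", "f2", "f3", "f4"]  # hypothetical labels

pca = PCA(n_components=2).fit(X)
components = pca.components_  # shape (n_components, n_features)

# Each row is one principal axis; each entry is the weight of one feature.
for i, row in enumerate(components, start=1):
    weights = ", ".join(f"{name}={w:+.2f}" for name, w in zip(feature_names, row))
    print(f"PC{i}: {weights}")
```

A heatmap of this array has one row per component and one column per feature, so large absolute weights show which features drive each component.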

watex.plot_profiling(erp, /, station=None, cz=None, *, style='classic', fig_size=(10, 4), cz_plot_kws=None, marker_kws=None, savefig=None, **plot_kws)[source]#

Visualizes the resistivity profiling of ERP data.

The function can overlay the selected conductive zone on the ERP if cz is given.

Parameters:

erp (array_like 1d) –
The electrical resistivity profiling array. If a dataframe is passed, the resistivity column must be included.

Changed in version 0.2.1: Can henceforth accept a dataframe that contains resistivity values.
station (str, int, optional) –
Station is used to visualize the conductive zone in the erp profile. This is useful if cz is not given. When station=’auto’, it automatically detects the best conductive zone, assuming the very low resistivity in the profile, and plots the conductive zone. To obtain the expected results, the station position or cz must be given.

New in version 0.2.1: Can henceforth pass the station to plot the conductive zone.
cz (array_like, optional,) – The selected conductive zone. If None, cz is not plotted.
style (str, default='classic') – Matplotlib plotting style.
fig_size (tuple, default= (10, 4)) – Matplotlib figure size.
marker_kws (dict, default = {'marker':'o', 'c':'#9EB3DD' }) – The dictionary to customize the markers in the plot.
cz_plot_kws (dict, default = {'ls':'-','c':'#0A4CEE', 'lw':2 }) – The dictionary to customize the conductive zone in the plot.
savefig (str, optional) – Save figure name. The default resolution in dots per inch is 300.
plot_kws (dict,) – Additional keyword arguments passed to matplotlib.pyplot.plot() function

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

Examples

>>> from watex.datasets import make_erp
>>> from watex.utils.plotutils import plot_profiling
>>> d= make_erp (n_stations =56, seed = 42)
>>> plot_profiling  (d.resistivity)
>>> # read the frame and get the resistivity values
>>> plot_profiling (d.frame, station ='s07' )
<AxesSubplot:xlabel='Stations', ylabel='App.resistivity ($\Omega.m$)'>

watex.plot_regularization_path(X, y, c_range=(-4.0, 6.0), fig_size=(8, 5), sns_style=False, savefig=None, **kws)[source]#

Plot the regularisation path from Logit / LogisticRegression

Vary the regularization strength and plot the weight coefficients of the different features for each regularization strength.

Note that it is recommended to standardize the data first.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features. X is expected to be standardized.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
c_range (list or tuple [start, stop]) – Regularization strength range. It goes from the strongest (start) to the weakest (stop) regularization. Note that ‘C’ is the inverse of the Logistic Regression regularization parameter $\lambda$.
fig_size (tuple (width, height), default =(8, 5)) – The matplotlib figure size given as a tuple of width and height.
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
kws (dict,) – Additional keywords arguments passed to sklearn.linear_model.LogisticRegression

Examples

>>> from watex.utils.plotutils import plot_regularization_path
>>> from watex.datasets import fetch_data
>>> X, y = fetch_data ('bagoue analysed' ) # data already standardized
>>> plot_regularization_path (X, y )
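The computation behind a regularization path can be sketched as a loop over C values. The sketch below assumes that c_range=(-4., 6.) is read as a range of base-10 exponents, which the description does not state explicitly, and uses synthetic, standardized data with scikit-learn's LogisticRegression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, standardized as recommended (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

exponents = np.arange(-4, 6)   # assumed reading of c_range=(-4., 6.)
c_values = 10.0 ** exponents   # C = 1/lambda: small C means strong regularization

weights = []
for c in c_values:
    clf = LogisticRegression(C=c, max_iter=1000).fit(X, y)
    weights.append(clf.coef_[0])   # one weight per feature
weights = np.array(weights)        # shape (n_C_values, n_features)

print(np.abs(weights).max(axis=1).round(3))
```

Plotting each column of weights against c_values on a logarithmic x-axis gives one curve per feature; the weights shrink toward zero as C decreases.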

watex.plot_rf_feature_importances(clf, X=None, y=None, fig_size=(8, 4), savefig=None, n_estimators=500, verbose=0, sns_style=None, **kws)[source]#

Plot features importance with RandomForest.

Parameters:

clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default =(8, 4)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
verbose (int, default=0) – print the feature labels with the rate of their importances.
kws (dict,) – Additional keyword arguments passed to sklearn.ensemble.RandomForestClassifier

Examples

>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> from watex.utils.plotutils import plot_rf_feature_importances
>>> X, y = fetch_data ('bagoue analysed' )
>>> plot_rf_feature_importances (
    RandomForestClassifier(), X=X, y=y , sns_style=True)
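
For readers who want to see what is being ranked, here is a minimal sketch that uses scikit-learn directly. It is not the watex code path: the data are synthetic, and the feature labels `f0` … `f5` are hypothetical. It fits a random forest and lists the features by decreasing `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stand in for a real dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical labels

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # normalized, sums to 1
order = np.argsort(importances)[::-1]      # most important feature first
for rank, idx in enumerate(order, start=1):
    print(f"{rank}) {feature_names[idx]:<4} {importances[idx]:.4f}")
```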

watex.plot_sbs_feature_selection(sbs_estimator, /, X=None, y=None, fig_size=(8, 5), sns_style=False, savefig=None, verbose=0, **sbs_kws)[source]#

Plot Sequential Backward Selection (SBS) for feature selection.

SBS collects the scores of the best feature subset at each stage.

Parameters:

sbs_estimator (SequentialBackwardSelection estimator object) – The Sequential Backward Selection estimator can either be fitted or not. If it is not fitted, provide the training X and y; otherwise an error will occur.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default =(8, 5)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
verbose (int, default=0) – print the feature labels with the rate of their importances.
sbs_kws (dict,) – Additional keyword arguments passed to SequentialBackwardSelection

Examples

(1)-> Plot fitted SBS in action

>>> from watex.exlib.sklearn import KNeighborsClassifier , train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> from watex.utils.plotutils import plot_sbs_feature_selection
>>> X, y = fetch_data('bagoue analysed') # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs= SequentialBackwardSelection (knn)
>>> sbs.fit(Xtrain, ytrain )
>>> plot_sbs_feature_selection(sbs, sns_style= True)

(2)-> Plot estimator with no prefit SBS.

>>> plot_sbs_feature_selection(knn, Xtrain, ytrain) # yield the same result
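
What "collecting the score of the best feature subset at each stage" means can be illustrated with a small, generic sequential backward selection loop. This is a sketch under assumptions, not the SequentialBackwardSelection class itself: the helper `sbs_scores`, its hold-out split and its accuracy scoring are hypothetical choices, and the data are synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def sbs_scores(estimator, X, y, k_features=1, random_state=0):
    """Drop one feature per stage, keeping the best-scoring subset each time."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )

    def score(cols):
        cols = list(cols)
        estimator.fit(X_tr[:, cols], y_tr)
        return estimator.score(X_te[:, cols], y_te)  # hold-out accuracy

    current = tuple(range(X.shape[1]))
    subsets, scores = [current], [score(current)]
    while len(current) > k_features:
        # Try every subset with exactly one feature removed.
        candidates = list(combinations(current, len(current) - 1))
        cand_scores = [score(c) for c in candidates]
        best = int(np.argmax(cand_scores))
        current = candidates[best]
        subsets.append(current)
        scores.append(cand_scores[best])
    return subsets, scores


X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5)
subsets, scores = sbs_scores(knn, X, y)
for cols, s in zip(subsets, scores):
    print(len(cols), f"{s:.3f}")
```

Plotting `scores` against the subset sizes gives the kind of curve this function is described as drawing.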

watex.plot_sfi(cz, p=None, s=None, dipolelength=None, fig_size=(10, 4), style='classic', **plotkws)[source]#

Plot sfi parameter components.

Parameters:

cz (array-like 1d,) – Selected conductive zone
p (array-like 1d,) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, it will be set automatically using the default value to match the cz size. The default value is 10.
fig_size (tuple, default=(10, 4)) – Matplotlib (MPL) figure size; should be a tuple value of integers

See also

watex.utils.exmath.sfi: for more details about the sfi parameter computation.

Examples

>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import plot_sfi
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))*1e2
>>> plotkws  = dict (rlabel = 'Selected conductive zone (cz)',
                     color=f'{P().frcolortags.get("fr3")}',
                     )
>>> plot_sfi (condzone, **plotkws)

watex.plot_silhouette(X, labels, metric='euclidean', savefig=None, **kwds)[source]#

Plot the silhouette to quantify the quality of clustering.

Parameters:

X (array-like of shape (n_samples_a, n_samples_a) if metric == “precomputed”, or (n_samples_a, n_features) otherwise) – An array of pairwise distances between samples, or a feature array.
labels (array-like of shape (n_samples,)) – Label values for each sample.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

See also

watex.view.mlplot.plotSilhouette: Gives a consistent plot; its prefit parameter controls whether labels are expected to be passed into the function directly or not.

Examples

>>> import numpy as np
>>> from watex.exlib.sklearn import KMeans
>>> from watex.datasets import load_iris
>>> from watex.utils.plotutils import plot_silhouette
>>> d= load_iris ()
>>> X= d.data [:, 0][:, np.newaxis] # take the first axis
>>> km= KMeans (n_clusters =3 , init='k-means++', n_init =10 ,
                max_iter = 300 ,
                tol=1e-4,
                random_state =0
                )
>>> y_km = km.fit_predict(X)
>>> plot_silhouette (X, y_km)
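
The quantities behind a silhouette plot can be computed with scikit-learn directly. The sketch below is only an illustration on synthetic blobs, not the watex plotting code: it computes the per-sample silhouette coefficients, groups them by cluster as a silhouette plot would, and compares their mean with `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data stand in for a real dataset.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One coefficient per sample, each in [-1, 1].
values = silhouette_samples(X, labels, metric="euclidean")
for k in np.unique(labels):
    cluster_values = np.sort(values[labels == k])  # one horizontal bar group per cluster
    print(f"cluster {k}: n={cluster_values.size}, mean={cluster_values.mean():.3f}")

# The average silhouette is the mean of the per-sample coefficients.
average = silhouette_score(X, labels, metric="euclidean")
print(f"average silhouette: {average:.3f}")
```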

watex.power(p)[source]#

Compute the power of the selected conductive zone. The anomaly power is closely related to the width of the conductive zone.

The power parameter implicitly defines the width of the conductive zone and is evaluated as the difference between the abscissas of the start point $X_{LB}$ and the end point $X_{UB}$ of the selected anomaly:

\[power=|X_{LB} - X_{UB} |\]

Parameters:

p – array-like. Station position of conductive zone.

Returns:

Absolute value of the width of conductive zone in meters.
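
As a worked instance of the formula, the sketch below computes $|X_{LB} - X_{UB}|$ for a list of station positions. It is not the watex source: the helper name `anomaly_power` is hypothetical, and it assumes the lower and upper bounds are the smallest and largest positions of the zone.

```python
def anomaly_power(positions):
    """Width of the conductive zone: |X_LB - X_UB|."""
    # Assumption: the bounds are the smallest and largest station positions.
    x_lb, x_ub = min(positions), max(positions)
    return abs(x_lb - x_ub)


stations = [40.0, 50.0, 60.0, 70.0, 80.0]  # positions in meters
print(anomaly_power(stations))  # 40.0
```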

watex.predict_NGA_labels(X, /, n_clusters, random_state=0, keep_label_0=False, n_init='auto', return_cluster_centers=False, **kws)[source]#

Predict the Naive Group of Aquifer (NGA) labels.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
random_state (int, RandomState instance or None, default=0) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
keep_label_0 (bool, default=False) – The prediction initially includes the label 0. However, a predicted label 0 would suggest ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient, so a ‘0’ in the target should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, +1 is added to all class labels, which turns ‘0’ into ‘1’ and thereby excludes the ‘0’ label. To force the inclusion of 0 in the labels, set keep_label_0 to True.
n_init ('auto' or int, default='auto') –
Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).

When n_init='auto', the number of runs will be 10 if using init='random', and 1 if using init='k-means++'.

New in version 0.2.0: Added ‘auto’ option for n_init.
return_cluster_centers (bool, default=False,) – export the array of clusters centers if True.
kws (dict,) – Additional keyword arguments passed to sklearn.cluster.KMeans.

Returns:

NGA (array_like of shape (n_samples, n_features)) – Predicted NGA labels.
(NGA, cluster_centers) (Tuple of array-like,) – NGA and cluster centers if return_cluster_centers is set to True.
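
The label shift described for keep_label_0 can be sketched with scikit-learn's KMeans. This is an illustration on synthetic data, assuming the labels are plain KMeans cluster indices shifted by +1. It is not the watex implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data stand in for real aquifer features.
X, _ = make_blobs(n_samples=120, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)  # KMeans labels start at 0

keep_label_0 = False
if not keep_label_0:
    labels = labels + 1     # shift every class so that no cluster is labelled 0

print(np.unique(labels))
print(km.cluster_centers_.shape)
```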

watex.qc(z_or_edis_obj_list, /, tol=0.5, *, interpolate_freq=False, return_freq=False, tensor='res', return_data=False, to_log10=False, return_qco=False)[source]#

Run a quality control check on a collection of Z or EDI objects.

Analyse the data in the EDI collection and return the quality control value. It indicates what percentage of the data is representative.

Parameters:

tol (float, default=.5) – the tolerance parameter. The value indicates the rate from which the data can be considered meaningful. It should preferably be greater than 0 and less than 1. The default .5 means 50 %. The analysis becomes soft with higher tol values and severe otherwise.
interpolate_freq (bool, default=False) – interpolate the valid frequencies after removing the frequencies whose data fall under the (1-tol) % goodness threshold.
return_freq (bool, default=False) – returns the interpolated frequency.
return_data (bool, default=False) – returns the valid data, up to the (1-tol) % goodness threshold.
tensor (str, default='res') – Tensor name. Can be [ resistivity|phase|z|frequency]. Impedance is used for data quality assessment.
to_log10 (bool, default=False) – convert the frequency value to log10.
return_qco (bool, default=False) –
returns the quality control object that wraps all useful information after control. The following attributes can be fetched:
- rate_: the rate of the quality of the data
- component_: The selected component from which data is taken for analysis. By default, either xy or yx is used.
- mode_: The EM mode. Either the [‘TE’|’TM’] modes
- freqs_: The valid frequency in the data selected according to the tol parameters. Note that if interpolate_freq is True, it is used instead.
- invalid_freqs_: Useless frequency dropped in the data during control
- data_: Valid tensor data either in TE or TM mode.

Returns:

Returns the quality control value and the interpolated frequency if return_freq is set to True; otherwise returns only the quality control ratio.

Returns the quality control object if return_qco is set to True.

Return type:

Tuple (float ) or (float, array-like, shape (N, )) or QCo

Examples

>>> import watex as wx
>>> data = wx.fetch_data ('huayuan', samples =20, return_data =True ,
                          key='raw')
>>> r,= wx.qc (data)
>>> r
0.75
>>> r, = wx.qc (data, tol=.2 )
>>> r
0.75
>>> r, = wx.qc (data, tol=.1 )

watex.read_data(f, sanitize=Ellipsis, reset_index=Ellipsis, verbose=Ellipsis, **read_kws)[source]#

Assert and read the specific files and URLs allowed by the package.

Readable files are systematically converted to a data frame.

Parameters:

f (str, Path-like object) – File path or Pathlib object. Must contain a valid file name and should be a readable file or url
sanitize (bool, default=False,) –
Apply a minimum sanitization of the data such as:
- replace non-alphabetic characters in column names with the pattern ‘_’
- cast data values to numeric if applicable
- drop full NaN columns and rows in the data
reset_index (bool, default=False,) –
Reset index if full NaN columns are dropped after sanitization.

New in version 0.2.5: Apply minimum data sanitization after reading data.
read_kws (dict,) – Additional keyword arguments passed to the pandas reader functions.

Returns:

f – A dataframe with head contents by default.

Return type:

pandas.DataFrame
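
The three sanitization steps listed for the sanitize parameter can be sketched with plain pandas. This is an assumption-laden illustration, not what read_data runs internally: the regular expression, the sample column names and the order of the steps are all hypothetical.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "site id": ["s01", "s02", None],
        "rho (ohm.m)": ["12.5", "7", None],
        "empty": [None, None, None],
    }
)

# 1) replace runs of non-alphanumeric characters in column names with '_'
df.columns = [re.sub(r"[^0-9a-zA-Z]+", "_", str(c)).strip("_") for c in df.columns]

# 2) cast values to numeric where the whole column can be converted
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # keep non-numeric columns untouched

# 3) drop columns and rows that are entirely NaN, then reset the index
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all").reset_index(drop=True)
print(df)
```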

watex.reduce_samples(*data, sname, zname=None, kname=None, section_indexes=None, error='raise', strategy='average', verify_integrity=False, ignore_index=False, **kws)[source]#

Create a new dataframe by squeezing/compressing the non valid data.

The m-samples reduction is necessary for datasets with many missing k-values. Shrinking the number of k0-values (missing k-values) is a relevant technique. It consists of compressing the rows with missing k-values, from the top (depth equal to 0) down to the upper section of the first (shallowest) aquifer, into a single vector $x_r$ of dimension (1×n), i.e. one that contains the n features.

Parameters:

data (list of dataframes) – Data that mainly contains the aquifer values. It must contain the depth values, found in the column whose name is passed to zname, and the permeability coefficient k, found in the column passed to kname. Both arguments need to be supplied when dataframes are passed as positional arguments.
sname (str) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column. The integer value must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed to a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ in a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column. Note that the integer value must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays with two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
strategy (str, default='average' or 'mean') – strategy used to select or compute the numerical data into a single series. It can also be ‘naive’; in that case, a single series is randomly picked from the base strata data.
section_indexes (tuple or list of int) – list of pairs (tuples or lists) of integers. These are the indexes of the valid sections (upper and lower) of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is considered the last position, the bottom of the section; therefore, its value is included in the data.
error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to ‘raise’.
verify_integrity (bool, default=False) –
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method. If True, remove the duplicate rows from the DataFrame.

- subset: By default, rows that have the same values in all columns are considered duplicates. This parameter specifies the only columns to consider when identifying duplicates.
- keep: Determines which duplicates (if any) to keep. It accepts: first – drop duplicates except for the first occurrence (the default behavior); last – drop duplicates except for the last occurrence; False – drop all duplicates.
- inplace: Boolean flag, False by default, that specifies whether to return a new DataFrame or update the existing one.
ignore_index (bool, default=False,) – Boolean flag indicating whether the row index should be reset after dropping duplicate rows. False: keeps the original row index. True: resets the index, and the resulting rows are labeled 0, 1, …, n – 1.

Returns:

df_new – new dataframes with reduced samples.

Return type:

List of pandas.dataframes

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import reduce_samples
>>> data = load_hlogs ().frame # get the frames
>>> # add explicitly the aquifer section indices
>>> dfnew= reduce_samples (data.copy(), sname='strata_name',
                             section_indexes = (16, 29 ),)
>>> dfnew[0]
...    hole_number               strata_name     rock_name  ...      r     rp  remark
    0         H502                  mudstone           J2z  ...    NaN    NaN     NaN
    16        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    17        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    18        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    19        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    20        H502                  mudstone           NaN  ...  35.74  59.23     NaN
    21        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    22        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    23        H502                 siltstone           NaN  ...  59.61  59.23     NaN
    24        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    25        H502  Coarse-grained sandstone           NaN  ...  59.61  59.23     NaN
    26        H502                  mudstone           NaN  ...  82.33  59.23     NaN
    27        H502    fine-grained sandstone           NaN  ...  82.33  59.23     NaN
    28        H502  Coarse-grained sandstone           J2z  ...  82.33  59.23     NaN
    29        H502                      coal  (J2y)  2coal  ...  82.33  59.23     NaN
    0         H502                 siltstone           NaN  ...    NaN    NaN     NaN

[16 rows x 23 columns]
>>> # specify the column name and kname without section indexes
>>> dfnew= reduce_samples (
        data.copy(), sname='strata_name', zname='depth', kname='k',
        ignore_index= True )
>>> dfnew[0].index # index is reset
... RangeIndex(start=0, stop=16, step=1)

watex.reshape(arr, axis=None)[source]#

Detect the array shape and reshape it accordingly, back to the given axis.

Parameters:

arr – array-like with a number of dimensions equal to 1 or 2
axis – axis along which to reshape the array back. If ‘axis’ is None and the number of dimensions is greater than 1, the array is reshaped back to a one-dimensional array

Returns:

New reshaped array

Example:

>>> import numpy as np
>>> from watex.utils.funcutils import reshape
>>> array = np.random.randn(50 )
>>> array.shape
... (50,)
>>> ar1 = reshape(array, 1)
>>> ar1.shape
... (1, 50)
>>> ar2 =reshape(ar1 , 0)
>>> ar2.shape
... (50, 1)
>>> ar3 = reshape(ar2, axis = None)
>>> ar3.shape # goes back to the original array
... (50,)

watex.select_base_stratum(d, /, sname=None, stratum=None, return_rate=False, return_counts=False)[source]#

Select the base stratum from the strata column in the logging data.

Find the most recurrent stratum in the data and compute the rate of occurrence.

Parameters:

d (array-like 1D , pandas.Series or DataFrame) – Valid data containing the strata. If dataframe is passed, ‘sname’ is needed to fetch strata values.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered and the same stratum is returned; however, its rate and occurrences can still be obtained if return_rate or return_counts is set to True.
return_rate (bool,default=False,) – Returns the rate of occurrence of the base stratum in the data.
return_counts (bool, default=False,) – Returns each stratum name and its occurrences (count) in the data.

Returns:

bs (str) – base stratum, contained in the data
r (float) – rate of occurrence of the base stratum in the data
c (tuple (str, int)) – Tuple of each stratum with its occurrences in the data.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import select_base_stratum
>>> data = load_hlogs().frame # get only the frame
>>> select_base_stratum(data, sname ='strata_name')
... 'siltstone'
>>> select_base_stratum(data, sname ='strata_name', return_rate =True)
... 0.287292817679558
>>> select_base_stratum(data, sname ='strata_name', return_counts=True)
... [('siltstone', 52),
     ('fine-grained sandstone', 40),
     ('mudstone', 37),
     ('coal', 24),
     ('Coarse-grained sandstone', 15),
     ('carbonaceous mudstone', 9),
     ('medium-grained sandstone', 2),
     ('topsoil', 1),
     ('gravel layer', 1)]
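
The "most recurrent stratum" and its rate can be reproduced on a toy list with the standard library. This is a minimal sketch of the idea, not the watex implementation, and the strata list below is made up for illustration.

```python
from collections import Counter

# Made-up strata column for illustration.
strata = ["siltstone", "mudstone", "siltstone", "coal", "siltstone", "mudstone"]

counts = Counter(strata).most_common()  # [(stratum, occurrences), ...], most frequent first
base_stratum, n_base = counts[0]
rate = n_base / len(strata)             # share of samples belonging to the base stratum

print(base_stratum)  # siltstone
print(rate)          # 0.5
print(counts)
```

The same arithmetic explains the documented rate for the base stratum: 52 occurrences of siltstone out of the 181 counted samples give 0.2873.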

watex.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#

Select features and return new dataframe.

Parameters:

df – a dataframe for feature selection
features – list of features to select. The features must be in the dataframe; otherwise an error occurs.
include – the type of data to retrieve from the dataframe df. Can be number.
exclude – the type of data to exclude from the dataframe df. Can be number, i.e. only non-numeric data is kept in the returned data.
coerce – return the whole dataframe with the numeric columns transformed. Be aware that no selection is done and no error is raised in that case. Default is False.
kwd – additional keyword arguments passed to pandas.DataFrame.astype

Ref:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
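
The three selection modes (by name, by included type, by excluded type) map naturally onto standard pandas calls. The sketch below shows those pandas equivalents on a made-up frame. It is an illustration of the behaviour described above, not the body of selectfeatures.

```python
import pandas as pd

# Made-up frame with one categorical and two numeric columns.
df = pd.DataFrame(
    {
        "shape": ["V", "W", "U"],
        "power": [40.0, 60.0, 20.0],
        "magnitude": [12.5, 30.1, 8.4],
    }
)

by_name = df[["power", "magnitude"]]              # explicit list of features
numeric = df.select_dtypes(include="number")      # keep only numeric columns
non_numeric = df.select_dtypes(exclude="number")  # keep only non-numeric columns

print(list(numeric.columns))
print(list(non_numeric.columns))
```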

watex.sfi(cz, p=None, s=None, dipolelength=None, view=False, raw=False, return_components=False, **plotkws)[source]#

Compute the pseudo-fracturing index known as sfi.

The sfi parameter does not indicate the degree of rock fracturing underground; it is used to speculate about the apparent resistivity dispersion ratio around the cumulative sum of the resistivity values of the selected anomaly. It uses an approach similar to the IF parameter proposed by Dieng et al. (2004). Furthermore, its threshold is set to $\sqrt{2}$ for a symmetrical anomaly characterized by a perfect distribution of resistivity in a homogeneous medium. The formula is given by:

\[sfi=\sqrt{(P_a^{*}/P_a )^2+(M_a^{*}/M_a )^2}\]

where $P_a$ and $M_a$ are the anomaly power and the magnitude respectively. $P_a^{*}$ and $M_a^{*}$ are the projected power and magnitude of the lower point of the selected anomaly.
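
As a worked instance of the formula, the sketch below evaluates it for given components. The helper `sfi_from_components` is hypothetical and only applies the expression above; how watex derives the projected values $P_a^{*}$ and $M_a^{*}$ is not reproduced here. When the projected values equal the measured ones, the result is the $\sqrt{2}$ threshold.

```python
import math


def sfi_from_components(pa, pa_star, ma, ma_star):
    """Evaluate sqrt((Pa*/Pa)^2 + (Ma*/Ma)^2) for given components."""
    return math.sqrt((pa_star / pa) ** 2 + (ma_star / ma) ** 2)


# Symmetrical case: projected and measured values coincide.
value = sfi_from_components(pa=40.0, pa_star=40.0, ma=25.0, ma_star=25.0)
print(value)  # 1.4142135623730951
```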

Parameters:

cz (array-like,) – Selected conductive zone
p (array-like,) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, it will be set automatically using the default value to match the cz size. The default value is 10.
view (bool, default=False,) – Visualize the fitting curve. Default is False.
raw (bool,default=False,) – Overlay the fitting curve with the raw curve from cz.
return_components (bool, default=False,) – If True, it returns the different components used to compute sfi, especially for external visualization.
plotkws (dict) – Matplotlib plot keyword arguments.

Returns:

sfi – value computed for pseudo-fracturing index

Return type:

float

Examples

>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import sfi
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))
>>> # no visualization and default value `s` with global minimal rho
>>> pfi = sfi (condzone)
... 3.35110143
>>> # visualize fitting curve
>>> plotkws  = dict (rlabel = 'Conductive zone (cz)',
                     label = 'fitting model',
                     color=f'{P().frcolortags.get("fr3")}',
                     )
>>> sfi (condzone, view= True , s= 5, figsize =(7, 7),
          **plotkws )
... (array([ 0., 10., 20., 30.]), 1)

References

See Numpy Polyfit.
See Stackoverflow: the answer of AkaRem edited by Tobu and Migilson.
See Numpy Errorstate and how to implement the context manager.

watex.shape(cz, s=Ellipsis, p=Ellipsis)[source]#

Compute the shape of anomaly.

The shape parameter is mostly used in the basement medium to identify the best conductive zone for the drilling location. According to Sombo et al. (2011; 2012), various shapes of anomalies can be described such as:

“V”, “U”, “W”, “M”, “K”, “C”, and “H”

Computing the shape consists of feeding the algorithm with the Electrical Resistivity Profiling resistivity values and specifying the station $(S_{VES})$. Indeed, $S_{VES}$ is mostly the station with a very low resistivity value, expected to be the drilling location.

Parameters:

cz – array-like - Conductive zone resistivity values
s – int, str - Station position index or name.
p – Array-like - Should be the position of the conductive zone.

Note

If s is given, p should be provided. If p is missing, an error is raised.

Returns:

str - the shape of anomaly.

Example:

>>> import numpy as np
>>> rang = np.random.RandomState(42)
>>> from watex.utils.exmath import shape
>>> test_array1 = np.arange(10)
>>> shape (test_array1)
...  'C'
>>> test_array2 = rang.randn (7)
>>> shape(test_array2)
... 'H'
>>> test_array3 = np.power(10, test_array2 , dtype =np.float32)
>>> shape (test_array3)
... 'H'   # does not change whatever the resistivity values.

References

Sombo, P. A., Williams, F., Loukou, K. N., & Kouassi, E. G. (2011). Contribution de la Prospection Électrique à L’identification et à la Caractérisation des Aquifères de Socle du Département de Sikensi (Sud de la Côte d’Ivoire). European Journal of Scientific Research, 64(2), 206–219.
Sombo, P. A. (2012). Application des methodes de resistivites electriques dans la determination et la caracterisation des aquiferes de socle en Cote d’Ivoire. Cas des departements de Sikensi et de Tiassale (Sud de la Cote d’Ivoire). Universite Felix Houphouet Boigny.

watex.show_versions()[source]#

Print useful debugging information.

New in version 0.1.3.

watex.smart_label_classifier(arr, /, values=None, labels=None, order='soft', func=None, raise_warn=True)[source]#

Smartly map a numeric array into class labels using a mapping function or given fixed values.

New classes created from the fixed values can be renamed if labels are supplied.

Parameters:

arr (array-like 1d) – array-like whose items are expected to be categorized.
values (float, list of float) – The threshold values from which the default categorization is fixed.
labels (int | str | list of [str, int]) – The label values that correspond to the fixed values. Note that the number of labels must equal the number of fixed values plus one; otherwise a ValueError is raised if order is set to ‘strict’.
order (str, ['soft'|'strict'], default='soft') – If order is ‘strict’, the values passed to values must be items of arr; with ‘soft’, a warning is raised instead.
func (callable, optional) – Function to map the given array. If given, values does not need to be supplied.
raise_warn (bool, default=True) – Raise a warning message if order='soft' and the fixed values are not found in arr. Also raise a warning if the labels argument does not match the number of classes derived from the fixed values.

Returns:

arr – categorized array with the same length as the raw

Return type:

array-like 1d

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import smart_label_classifier
>>> sc = np.arange (0, 7, .5 )
>>> smart_label_classifier (sc, values = [1, 3.2 ])
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
>>> # rename labels <=1 : 'l1', ]1; 3.2]: 'l2' and >3.2 :'l3'
>>> smart_label_classifier (sc, values = [1, 3.2 ], labels =['l1', 'l2', 'l3'])
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> def f (v):
        if v <=1: return 'l1'
        elif 1< v<=3.2: return "l2"
        else : return "l3"
>>> smart_label_classifier (sc, func= f )
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> smart_label_classifier (sc, values = 1.)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> smart_label_classifier (sc, values = 1., labels='l1')
array(['l1', 'l1', 'l1', 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=object)
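
The threshold-based mapping in the first examples behaves like NumPy's `np.digitize` with right-closed bins. The sketch below reproduces those class codes that way. It is an equivalent illustration, assuming intervals of the form ]a; b], and not the function's actual implementation.

```python
import numpy as np

sc = np.arange(0, 7, 0.5)
thresholds = [1, 3.2]

# right=True gives intervals x <= 1, 1 < x <= 3.2 and x > 3.2.
codes = np.digitize(sc, thresholds, right=True)
print(codes)

# Renaming the classes is a simple lookup into a label array.
names = np.array(["l1", "l2", "l3"])[codes]
print(names)
```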

watex.to_numeric_dtypes(arr, *, columns=None, return_feature_types=Ellipsis, missing_values=nan, pop_cat_features=Ellipsis, sanitize_columns=Ellipsis, regex=None, fill_pattern=None, drop_nan_columns=True, how='all', reset_index=Ellipsis, drop_index=True, verbose=Ellipsis)[source]#

Convert array to dataframe and coerce arguments to appropriate dtypes.

Function includes additional tools to manipulate the transformed data such as:

pop_cat_features to remove the categorical attributes,
sanitize_columns to clean the columns of the dataframe by removing the undesirable characters,
drop_nan_columns to drop all the columns and/or rows that contain only NaN, …

Parameters:

arr (Ndarray or Dataframe, shape (m_samples, n_features)) – Array or dataframe to create, to sanitize or to auto-detect feature categories (numerical or categorical).
columns (list of str, optional) – Useful to create a dataframe when an array is given. It must match the number of array columns (shape[1]).
return_feature_types (bool, default=False,) – return the lists of numerical and categorical features.
missing_values (float, default='NaN') – Value that replaces the missing values or empty strings, if any, in the dataframe.
pop_cat_features (bool, default=False,) – remove the categorical features from the DataFrame.
sanitize_columns (bool, default=False,) –
remove undesirable characters in the data columns using the default argument of the regex parameter.

New in version 0.1.9.
regex (re object,) –
Regular expression object used to polish the data columns.
The default is:
>>> import re
>>> re.compile (r'[_#&.)(*@!_,;\s-]\s*', flags=re.IGNORECASE)
New in version 0.1.9.
fill_pattern (str, default='') – Pattern to replace the non-alphabetic character in each item of columns.
drop_nan_columns (bool, default=True) –
Remove all columns filled by NaN values.
how (str, default='all') – Also drop the NaN rows, i.e. the rows composed entirely of NaN or Null values.
reset_index (bool, default=False) –
Reset the index of the dataframe.
drop_index (bool, default=True,) –
Drop index in the dataframe after resetting.
verbose (bool, default=False,) – outputs a message listing the categorical items dropped from the dataframe, if any.

Returns:

df or (df, nf, cf) – Dataframe of values cast to numeric types. Also returns nf and cf if return_feature_types is set to True.

Return type:

Dataframe of values cast to numeric types

Examples

>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils.funcutils import to_numeric_dtypes
>>> X, y = load_bagoue (as_frame =True )
>>> X0 =X[['shape', 'power', 'magnitude']]
>>> X0.dtypes
... shape        object
    power        object
    magnitude    object
    dtype: object
>>> df = to_numeric_dtypes(X0)
>>> df.dtypes
... shape         object
    power        float64
    magnitude    float64
    dtype: object
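
The coercion shown in the example can be approximated with plain pandas: try to convert each column and keep it unchanged when the conversion fails. The sketch below is an assumption about the general approach, not the function's source, and the frame is made up. The lists `nf` and `cf` mirror the numerical and categorical feature lists mentioned under return_feature_types.

```python
import pandas as pd

# Made-up frame where every column is stored as object (strings).
X0 = pd.DataFrame(
    {
        "shape": ["V", "W", "U"],
        "power": ["40", "60", "20"],
        "magnitude": ["12.5", "30.1", "8.4"],
    },
    dtype=object,
)

df = X0.copy()
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])  # succeeds only if every item is numeric
    except (ValueError, TypeError):
        pass  # categorical column: leave it as is

nf = df.select_dtypes(include="number").columns.tolist()  # numerical features
cf = df.select_dtypes(exclude="number").columns.tolist()  # categorical features
print(df.dtypes)
print(nf, cf)
```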

watex.type_(erp)[source]#

Compute the type of anomaly.

The type parameter is defined by the African Hydraulic Study Committee report (CIEH, 2001). Later it was implemented by authors such as (Adam et al., 2020; Michel et al., 2013; Nikiema, 2012). Type comes to help the differenciation of two or several anomalies with the same shape. For instance, two anomalies with the same shape W will differ from the order of priority of their types. The type depends on the lateral resistivity distribution of underground (resulting from the pace of the apparent resistivity curve) along with the whole Electrical Resistivity Profiling survey line. Indeed, four types of anomalies were emphasized:

“EC”, “CB2P”, “NC” and “CP”.

For more details, refer to the references.

Parameters:

erp – array-like - Array of Electrical Resistivity Profiling line composed of apparent resistivity values.

Returns:

str - The type of anomaly.

Example:

>>> import numpy as np
>>> from watex.utils.exmath import type_
>>> rang = np.random.RandomState(42)
>>> test_array2 = rang.randn (7)
>>> type_(np.abs(test_array2))
... 'EC'
>>> long_array = np.abs (rang.randn(71))
>>> type_(long_array)
... 'PC'

References

Adam, B. M., Abubakar, A. H., Dalibi, J. H., Khalil Mustapha, M., & Abubakar, A. H. (2020). Assessment of Gaseous Emissions and Socio-Economic Impacts From Diesel Generators used in GSM BTS in Kano Metropolis. African Journal of Earth and Environmental Sciences, 2(1), 517–523. https://doi.org/10.11113/ajees.v3.n1.104
CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.
Michel, K. A., Drissa, C., Blaise, K. Y., & Jean, B. (2013). Application de méthodes géophysiques à l’étude de la productivité des forages d’eau en milieu cristallin : cas de la région de Toumodi (Centre de la Côte d’Ivoire). International Journal of Innovation and Applied Studies, 2(3), 324–334.
Nikiema, D. G. C. (2012). Essai d’optimisation de l’implantation géophysique des forages en zone de socle : Cas de la province de Séno, Nord Est du Burkina Faso (IRD). (I. / I. Ile-de-France, Ed.). IST / IRD Ile-de-France, Ouagadougou, Burkina Faso, West-africa. Retrieved from http://documentation.2ie-edu.org/cdi2ie/opac_css/doc_num.php?explnum_id=148

watex.vesSelector(data=None, *, rhoa=None, AB=None, MN=None, index_rhoa=None, xy_coords=None, is_utm=False, utm_zone=None, epsg=None, **kws)[source]#

Assert the validity of Vertical Electrical Sounding data and return a sanitized dataframe.

param rhoa:

array-like - Apparent resistivities collected during the sounding.

param AB:

array-like - Investigation distance between the current electrodes. Note that AB is, by convention, equal to AB/2. It is taken as the half-space of the investigation depth.

param MN:

array-like - Potential electrode distances at each investigation depth. Note that, by convention, the values are half-space and equal to MN/2.

param data:

Path-like object or sounding dataframe. If given, the other parameters can keep their None values.

param index_rhoa:

int - The index used to retrieve the resistivity data of a specific sounding point. Sometimes the sounding data gather the values of several sounding points collected in the same survey area along different Electrical Resistivity Profiling lines. For instance:

AB/2    MN/2    SE1    SE2    SE3    …    SEn

Where SE are the electrical sounding data values and n is the number of sounding points selected. SE1, SE2 and SE3 are three points selected for Vertical Electrical Sounding, i.e. 3 sounding points carried out either on the same Electrical Resistivity Profiling line or somewhere else. These sounding data are resistivity data labeled with specific numbers. Commonly the numbers are randomly chosen; they do not refer to the expected best fracture zone selected after the prior interpretation. After transformation via the function ves_selector, the header of the data should hold the resistivity. For instance, referring to the table above, the data should be:

AB    MN    resistivity    resistivity    resistivity    …

Therefore, index_rhoa is used to select specific resistivity values, i.e. the sounding number of the Vertical Electrical Sounding at which the drilling operations are expected to be located, or the one needed for computation. For example, index_rhoa=1 should give:

AB/2    MN/2    SE2    -->    AB    MN    resistivity

If index_rhoa is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. index_rhoa equals 0.
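
The selection performed by index_rhoa can be sketched as follows. This is a minimal illustration on plain Python lists, not the watex implementation: select_sounding is a hypothetical helper, the resistivity values are made up, and raising an error for an out-of-range index is an assumption.

```python
def select_sounding(header, rows, index_rhoa=None):
    """Hypothetical helper mirroring the documented role of index_rhoa."""
    # Every column after AB/2 and MN/2 is treated as one sounding curve.
    n_curves = len(header) - 2
    if n_curves < 1:
        raise ValueError("expected at least one sounding column after AB/2 and MN/2")
    # None falls back to the first curve, as stated in the text.
    idx = 0 if index_rhoa is None else index_rhoa
    if not 0 <= idx < n_curves:
        # Assumption: an out-of-range index is rejected.
        raise IndexError(f"index_rhoa={idx} is out of range for {n_curves} curve(s)")
    new_header = ['AB', 'MN', 'resistivity']
    new_rows = [(row[0], row[1], row[2 + idx]) for row in rows]
    return new_header, new_rows


# Made-up sounding table with three curves SE1, SE2 and SE3.
header = ['AB/2', 'MN/2', 'SE1', 'SE2', 'SE3']
rows = [
    (1, 0.4, 100, 200, 300),
    (2, 0.4, 110, 210, 310),
    (3, 0.4, 120, 220, 320),
]

# index_rhoa=1 keeps the SE2 column and renames it to 'resistivity'.
selected_header, selected_rows = select_sounding(header, rows, index_rhoa=1)

# Without index_rhoa, the first curve (SE1) is kept.
default_header, default_rows = select_sounding(header, rows)

print(selected_header, selected_rows)
```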

param xy_coords:

tuple (float, float) - Coordinates of the sounding point. Must be (‘longitude’, ‘latitude’) or (‘easting’, ‘northing’). If xy_coords is given as (‘easting’, ‘northing’), specify is_utm=True so that the conversion to (‘longitude’, ‘latitude’) is triggered. If False, a warning occurs if values are greater than 180 and 90 degrees for longitude and latitude respectively. Note that if the coordinates exist in the dataframe, they take priority.

New in version 0.2.1.

param is_utm:

bool, default=False - Allow conversion of the (‘easting’, ‘northing’) coordinates from xy_coords to (‘longitude’, ‘latitude’).

param utm_zone:

default=’49R’ - Needed for the conversion when xy_coords is passed as (‘easting’, ‘northing’).

param epsg:

int, str, optional - EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.

param kws:

dict - Additional keyword arguments passed when reading the data into a Pandas dataframe.

return:

dataframe - Sanitized Vertical Electrical Sounding dataframe with AB, MN and resistivity as the column headers.

Example:
>>> from watex.utils.coreutils import vesSelector
>>> df = vesSelector (data='data/ves/ves_gbalo.csv')
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          943
    1   2  0.4         1179
    2   3  0.4         1103
>>> df = vesSelector ('data/ves/ves_gbalo.csv', index_rhoa=3 )
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          457
    1   2  0.4          582
    2   3  0.4          558
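
The rules stated for xy_coords, is_utm, utm_zone and epsg can be sketched with the standard library. This is a minimal illustration, not the watex code: check_lonlat and pick_projection are hypothetical helpers, the coordinates are made-up values, and comparing absolute values against 180 and 90 is an assumption.

```python
import warnings


def check_lonlat(xy_coords, is_utm=False):
    """Hypothetical check: warn when values cannot be degrees."""
    x, y = xy_coords
    if is_utm:
        # Easting/northing values are expected to exceed the degree ranges.
        return True
    ok = abs(x) <= 180 and abs(y) <= 90
    if not ok:
        warnings.warn(
            "xy_coords exceed 180/90 degrees; pass is_utm=True "
            "if they are (easting, northing)."
        )
    return ok


def pick_projection(utm_zone=None, epsg=None):
    """Hypothetical helper: epsg overrides utm_zone when both are given."""
    if epsg is not None:
        return ('epsg', epsg)
    if utm_zone is not None:
        return ('utm_zone', utm_zone)
    raise ValueError("either utm_zone or epsg is needed for the conversion")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Plausible (longitude, latitude) pair: no warning.
    geographic_ok = check_lonlat((110.49, 26.05))
    # Projected values passed without is_utm=True: one warning.
    projected_ok = check_lonlat((336000.0, 2880000.0))
n_warnings = len(caught)

print(geographic_ok, projected_ok, n_warnings)
print(pick_projection(utm_zone='49R', epsg=32649))
```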
