watex.utils package

The utils sub-package offers several tools for data handling, parameter computation, model estimation and evaluation, and graph visualization. The extension of the mathematical concepts and the core of the program are handled by the modules exmath and coreutils respectively, whereas the machine learning utilities and additional functionalities are provided by mlutils and funcutils respectively. The plot utilities from plotutils give several plotting tools for visualization.

watex.utils.bi_selector(d, /, features=None, return_frames=False)

Auto-differentiates the numerical from categorical attributes.

This is useful for separating the categorical features from the numerical features, and vice versa, when there are many features. Entering features individually becomes tedious and error-prone.

Parameters:

d (pandas dataframe) – The pandas dataframe.
features (list of str) – List of features in the dataframe columns. An error is raised if any feature does not exist in the frame. Note that if features is None, the categorical and numerical features are returned instead.
return_frames (bool, default =False) – If False, return the given features and the remaining columns (features) as lists. If set to True, return bi-frames composed of the given features and the remaining features.

Returns:

- Tuple ( list, list) – list of features and remaining features
- Tuple ( pd.DataFrame, pd.DataFrame ) – frames of the given features and of the remaining features.

Example

>>> from watex.utils.mlutils import bi_selector
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame # get the frame
>>> data.columns
Index(['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name',
       'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp',
       'short_distance_gamma', 'well_diameter', 'aquifer_group',
       'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping',
       'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping',
       'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k',
       'kp', 'r', 'rp', 'remark'],
      dtype='object')
>>> num_features, cat_features = bi_selector (data)
>>> num_features
...['gamma_gamma',
     'depth_top',
     'aquifer_thickness',
     'pumping_depth_at_the_end',
     'section_aperture',
     'remark',
     'depth_starting_pumping',
     'hole_depth_before_pumping',
     'rp',
     'hole_depth_after_pumping',
     'hole_depth_loss',
     'depth_bottom',
     'sp',
     'pumping_depth',
     'kp',
     'resistivity',
     'short_distance_gamma',
     'r',
     'natural_gamma',
     'layer_thickness',
     'k',
     'well_diameter']
>>> cat_features
... ['hole_id', 'strata_name', 'rock_name', 'aquifer_group',
     'pumping_level']
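
The numerical/categorical split described above can be approximated with plain pandas. The block below is a minimal sketch, not the watex implementation: the helper name `split_num_cat` and the toy frame are hypothetical, and it assumes that "numerical" means any column with a numeric dtype.

```python
import pandas as pd


def split_num_cat(frame: pd.DataFrame):
    """Return (numeric column names, remaining column names).

    Hypothetical helper: a column counts as numerical when pandas
    stores it with a numeric dtype; everything else is categorical.
    """
    num = frame.select_dtypes(include="number").columns.tolist()
    cat = [c for c in frame.columns if c not in num]
    return num, cat


# Toy frame loosely inspired by the column names shown above.
toy = pd.DataFrame({
    "hole_id": ["H1", "H2", "H3"],
    "depth_top": [0.0, 12.5, 30.1],
    "rock_name": ["coal", "siltstone", "mudstone"],
    "resistivity": [15.2, 40.7, 22.9],
})
num_features, cat_features = split_num_cat(toy)
```

Unlike the example output above, this sketch keeps the original column order in both lists.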

watex.utils.cattarget(arr, /, func=None, labels=None, rename_labels=None, coerce=False, order='strict')

Categorize array to hold the given identifier labels.

Classify numerical values according to the given label values. Labels are a list of integers where each integer is a group identifier for samples in the dataset.

Parameters:

arr (array-like |pandas.Series) – array or series containing numerical values. If non-numerical values are given, an error is raised.
func (Callable) – Function to categorize the target y.
labels (int, list of int) – if an integer value is given, it is taken as the number of categories into which to split ‘y’. For instance, labels=3 applied to the first ten numbers yields the label values [0, 1, 2]. If labels are given as a list, the items must be contained in the target ‘y’.
rename_labels (list of str) – list of strings or values to replace the integer label identifiers.
coerce (bool, default =False,) – force the new label names passed to rename_labels to appear in the target including or not some integer identifier class label. If coerce is True, the target array holds the dtype of new_array.

Returns:

arr – The category array with unique identifier labels

Return type:

Arraylike |pandas.Series

Examples

>>> import numpy as np
>>> from watex.utils.mlutils import cattarget
>>> def binfunc(v):
...     if v < 3 : return 0
...     else : return 1
>>> arr = np.arange (10 )
>>> arr
... array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> target = cattarget(arr, func =binfunc)
>>> target
... array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> cattarget(arr, labels =3 )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
>>> array([2, 2, 2, 2, 1, 1, 1, 0, 0, 0])
>>> cattarget(arr, labels =3 , order =None )
... array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
>>> cattarget(arr[::-1], labels =3 , order =None )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2]) # reverse does not change
>>> cattarget(arr, labels =[0 , 2,  4]  )
... array([0, 0, 0, 2, 2, 4, 4, 4, 4, 4])
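
As a rough illustration of the `func` path only, a callable can be applied element-wise to produce the class labels. This is a sketch under the assumption that `func` receives one value at a time; the helper name `categorize_with` is hypothetical, and the `labels`, `rename_labels` and `order` behaviours are not reproduced here.

```python
import numpy as np


def categorize_with(arr, func):
    """Apply `func` to each value of `arr` and return the labels as an array."""
    return np.array([func(v) for v in np.asarray(arr)])


def binfunc(v):
    # Two classes: values below 3 map to 0, all others to 1.
    return 0 if v < 3 else 1


labels = categorize_with(np.arange(10), binfunc)
```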

watex.utils.check_flow_objectivity(y, /, values, classes)

Function checks the flow rate objectivity

If objective is set to flow, i.e. the prediction focuses on the flow rate, the target y must meet some conditions when values are passed for class categorization.

Parameters:

values – list of values used to encode the numerical target y, for instance values=[0, 1, 2].
objective – str, relates to the flow rate prediction. Set to None for any other prediction.
prefix –
the prefix to add to the class labels. For instance, if the prefix equals FR, class labels will become:
```
[0, 1, 2] => [FR0, FR1, FR2]
```
classes –
list of class names to replace the default FR that is used to specify the flow rate. For instance, it can be:
```
[0, 1, 2] => [sf0, sf1, sf2]
```

Returns:

(y, classes) – Tuple:
- y: array-like 1d of categorized y
- classes: list of flow rate classes.

watex.utils.classify_k(o, /, func=None, kname=None, inplace=False, string=False, default_func=False)

Categorize the permeability coefficient ‘k’

Map the continuous ‘k’ into categorical classes.

Parameters:

o (ndarray, pd.Series or Dataframe) – data containing the continuous values of the permeability coefficient k. If data is passed as a pandas dataframe, the column containing the k-values, kname, needs to be specified.
func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead through the parameter default_func.
inplace (bool, default=False) – Modify the object in place and return None.
string (bool,) – If set to “True”, the categorized values mapped from ‘k’ are prefixed by “k”. However, if a string value is given, the prefix is changed according to this label.
default_func (bool,) –
If set to True, the default function for mapping k is used. Note that it may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is as follows:
- k0 {0}: k = 0
- k1 {1}: 0 < k <= .01
- k2 {2}: .01 < k <= .07
- k3 {3}: k> .07

Returns:

o – return None only if dataframe is given and inplace is set to True i.e modified object inplace.

Return type:

None, ndarray, Series or Dataframe

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k
>>> _, y0 = load_hlogs (as_frame =True)
>>> # let us look at four non-NaN values in y0
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
...  array([0.054, 0.054, 0.054, 0.054])
>>> classify_k (y0 , kname ='k', inplace =True, default_func=True )
>>> # let us look again at the same four values in the dataframe
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
... array([2., 2., 2., 2.])
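
The default thresholds listed in the parameter description (k0 to k3) can be written as a small mapping function. This is a minimal sketch rather than the library code: the name `default_k_mapping` is hypothetical, NaN is passed through unchanged, and k is assumed to be non-negative.

```python
import math


def default_k_mapping(k):
    """Map a permeability coefficient to the classes 0..3 listed above.

    Assumes k >= 0; NaN (missing k) is returned unchanged.
    """
    if math.isnan(k):
        return math.nan
    if k == 0:
        return 0          # k0: k = 0
    if k <= 0.01:
        return 1          # k1: 0 < k <= .01
    if k <= 0.07:
        return 2          # k2: .01 < k <= .07
    return 3              # k3: k > .07


classes = [default_k_mapping(k) for k in (0.0, 0.005, 0.054, 0.2)]
```

With these thresholds the value 0.054 from the example above falls into class 2, which is consistent with the output shown after the in-place call.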

watex.utils.correlatedfeatures(df, corr='pearson', threshold=0.95, fmt=False)

Find the correlated features/columns in the dataframe.

Indeed, highly correlated columns don’t add value and can throw off feature importance and the interpretation of regression coefficients. If there are correlated columns, removing either the columns from level_0 or those from level_1 from the feature data is a good choice.

Parameters:

df (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing M samples and N features
corr (str, ['pearson'|'spearman'|'covariance']) – Method of correlation to perform. Note that ‘pearson’ and ‘covariance’ don’t support string values. If such data is given, set corr to spearman. Default is pearson.
threshold (float, default is 0.95) – the value from which columns are considered correlated. Should not be greater than 1.
fmt (bool, default {False}) – format the correlated dataframe values

Returns:

df – Dataframe with columns equal to [level_0, level_1, pearson]

Return type:

pandas.DataFrame

Examples

>>> from watex.utils.mlutils import correlatedfeatures
>>> df_corr = correlatedfeatures (data , corr='spearman',
                                  fmt=None, threshold=.95
                                  )
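
To make the [level_0, level_1, pearson] layout concrete, here is a minimal pandas sketch of how such a table of correlated pairs can be built. It is an illustration under stated assumptions, not the watex implementation: the helper name `correlated_pairs` is hypothetical, only numeric columns are used, the absolute correlation is compared against the threshold, and only the 'pearson' and 'spearman' methods are considered.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, method="pearson", threshold=0.95):
    """List column pairs whose absolute correlation reaches `threshold`."""
    corr = df.select_dtypes(include="number").corr(method=method).abs()
    # Keep the upper triangle only, so each pair appears once and the
    # diagonal (always 1) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["level_0", "level_1", method]
    # NaN entries (masked cells, if any survive stacking) fail this test.
    return pairs[pairs[method] >= threshold].reset_index(drop=True)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],   # almost exactly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to a and b
})
pairs = correlated_pairs(toy)
```

In this toy frame only the pair (a, b) passes the 0.95 threshold, so dropping either `a` or `b` would remove the redundancy.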

watex.utils.defineConductiveZone(erp, station=None, position=None, auto=False, index='py', **kws)

Define conductive zone as subset of the erp line.

Indeed, the conductive zone is a specific zone expected to hold the drilling location station. If the drilling location is not provided, it defaults to the lowest resistivity values found in the erp line.

Parameters:

erp (array_like,) – the array containing the apparent resistivity values
station (str or int,) – the station position name.
position (float,) – station position value.
auto (bool) – If True, the station position is taken as the position of the lowest resistivity value in the Electrical Resistivity Profiling.
indexing (str,) –

Returns:

- conductive zone of resistivity values
- conductive zone positioning
- station position index in the conductive zone
- station position index in the whole ERP line

Example:

>>> import numpy as np
>>>
>>> from watex.utils.coreutils import defineConductiveZone
>>> test_array = np.random.randn (10)
>>> selected_cz ,*_ = defineConductiveZone(test_array, 's20')
>>> shortPlot(test_array, selected_cz )
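
The four return values can be pictured with a simple window around the lowest resistivity value. The sketch below is only an illustration of that idea: the helper name `conductive_window` is hypothetical, and the window half-width of three stations is an assumption, not a documented watex setting.

```python
import numpy as np


def conductive_window(erp, half_width=3):
    """Return (zone values, zone positions, index in zone, index in line).

    The zone is centred on the lowest resistivity value and clipped at
    the ends of the profile. `half_width` is an assumed parameter.
    """
    erp = np.asarray(erp, dtype=float)
    center = int(np.argmin(erp))
    start = max(center - half_width, 0)
    stop = min(center + half_width + 1, erp.size)
    positions = np.arange(start, stop)
    return erp[start:stop], positions, center - start, center


line = [100, 90, 80, 70, 30, 75, 85, 95, 110, 120]
cz, cz_pos, idx_in_cz, idx_in_line = conductive_window(line)
```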

watex.utils.erpSelector(f, columns=Ellipsis, force=False, **kws)

Read and sanitize the data collected from the survey.

Data should be an array, a dataframe, a series, or arranged in .csv or .xlsx formats. Be sure to provide the header of each column in the worksheet. If a file is given, header columns should be arranged as ['station','resistivity' ,'longitude', 'latitude']. Note that the coordinate columns (longitude and latitude) are not compulsory.

Parameters:

f (Path-like object, ndarray, Series or Dataframe,) – If a path-like object is given, only .csv and .xlsx file formats can be parsed. However, if an ndarray is given and its shape along axis 1 is greater than 4, the ndarray is shrunk.
columns (list) – list of the valuable columns. It can be used to fix the specific values along axis 1 of the array. It should contain the prefix or the whole name of each item in ['station','resistivity' ,'longitude', 'latitude'].
force (bool, default=False,) – If Vertical Electrical Sounding (VES) data is passed while ERP data is expected, setting force to True treats the VES data as ERP data and uses only the resistivity values in the VES data. This will yield invalid results, especially when parameter computations are needed.
kws (dict) – Additional keyword arguments of the pandas pd.read_csv and pd.read_excel methods. Be sure to provide the right arguments when reading f. For instance, providing the sep= ',' argument when the file to read is in xlsx format will raise an error. Indeed, the sep parameter is acceptable for parsing the .csv file format only.

Return type:

DataFrame with valuable column(s).

Notes

The number of acceptable columns is 4. If there are more than 4 columns, the data is shrunk to match the expected columns. Furthermore, if the header is not specified in f, the default column arrangement is used. Therefore, the second column is considered the resistivity column.

Examples

>>> import numpy as np
>>> from watex.utils.coreutils import erpSelector
>>> df = erpSelector ('data/erp/testsafedata.csv')
>>> df.shape
... (45, 4)
>>> list(df.columns)
... ['station','resistivity', 'longitude', 'latitude']
>>> df = erpSelector('data/erp/testunsafedata.xlsx')
>>> list(df.columns)
... ['easting', 'station', 'resistivity', 'northing']
>>> df = erpSelector(np.random.randn(7, 7))
>>> df.shape
... (7, 4)
>>> list(df.columns)
... ['station', 'resistivity', 'longitude', 'latitude']

watex.utils.erpSmartDetector(constr, erp, station=None, coerce=False, return_cz=False, view=False, raise_warn=True, **plot_kws)

Automatically detect the drilling location by involving the constraints observed in the survey area.

Consider the constraints on the survey area and detect a suitable drilling location. Commonly the station is not needed when using the constraints, since passing a station indicates that the user already knows the reason for selecting it. However, in case of doubt, the user can set the parameter coerce to True.

Parameters:

constr (list, dict) –
List of restricted stations. The constraints or restricted stations are the stations to ignore when selecting the best drilling location. Indeed, this is useful since in DWSC, not all stations are presumed to be technically suitable for proposing a drilling. For instance, if some stations are close to a household waste site, these stations must be listed and ignored.

If constr is passed as a dictionary, it should contain the restricted stations as keys and the reason why each station is restricted as values. For instance:
```
constr = {"s02": "station close to the household waste",
          "S25": "station is located in a marsh area."
          }
```
erp (array-like 1d) – DC profiling ERP resistivity values
station (str, optional) – The station of the presumed location for drilling operations. Commonly the station is not needed when using the constraints. If the station is given while coerce=False, an error is raised to warn the user. To force considering the station in the auto-detection, coerce must be set to True.
coerce (bool, default=False,) – Allow the station to be considered in the auto-detection.
raise_warn (bool, default=True,) – warn the user whether a suitable location is found or not. Returns None otherwise.
view (bool, default=False,) – Plot the conductive zone and restricted stations.
plot_kws (dict,) – Additional plotting keywords arguments passed to plotAnomaly().

Returns:

(station |None) or cz, cs – station for the drilling operations detected automatically. If no station is detected, None is returned. If return_cz is True, the station and the conductive zone are returned, as well as the restricted station position number.

Return type:

str,

See also

watex.plotAnomaly: Plot DC profiling ERP and conductive zone.

Examples

>>> import numpy as np
>>> from watex.datasets import make_erp
>>> from watex.utils.coreutils import erpSmartDetector
>>> resistivity = make_erp (n_stations =50 , as_frame=True, seed=125).resistivity
>>> # get the min value of the resistivity
>>> resmin_index = np.where ( resistivity==resistivity.min())
42
>>> erpSmartDetector (constr =['s42'], erp =resistivity )
'S13'
>>> # S42 is rejected and another zone presumed to be better is selected.
>>> constraints ={"S00": "Marsh area. ",
                  "S10": " Municipality square, no authorization to make drill",
                  "S29": "Heritage site",
                  "S46": "Household waste site",
                  "S42": "Household waste site"
                  }
>>> erpSmartDetector (constraints, resistivity)
'S16'
>>> erpSmartDetector (['s12', 's40'], resistivity)
'S29'
>>> # station 42, close to s40, is rejected too.
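
The core idea of ignoring restricted stations can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the watex algorithm: the helper name `pick_station` is hypothetical, stations are assumed to be named "S00", "S01", ... by position, and, unlike the examples above, stations that are merely close to a restricted one are not rejected.

```python
def pick_station(resistivity, restricted=()):
    """Return the name of the lowest-resistivity station not restricted.

    `restricted` may be a list of station names or a dict whose keys are
    station names (the values, i.e. the reasons, are ignored here).
    Names are compared case-insensitively. Returns None if every
    station is restricted.
    """
    banned = {str(name).upper() for name in restricted}
    candidates = [
        (rho, f"S{i:02d}")
        for i, rho in enumerate(resistivity)
        if f"S{i:02d}" not in banned
    ]
    if not candidates:
        return None
    return min(candidates)[1]


rho = [120.0, 80.0, 35.0, 60.0, 20.0, 90.0]
free_choice = pick_station(rho)
constrained = pick_station(rho, {"s04": "Household waste site"})
```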

watex.utils.evalModel(model, X, y, Xt, yt=None, scorer='accuracy', eval=False, **kws)

Evaluate the model and quickly test the score with metric scorers.

Parameters:

model (Callable, {'preprocessor + estimator } | estimator,) – the preprocessor is a list of steps for data handling, all encapsulated in the pipeline. model can also be a simple estimator with fit,
X (N-d array, shape (N, M)) – the training set composed of N-columns and the M-samples. The feature set excludes the target y.
y (arraylike , shape (M)) – the target is composed of M-examples in supervised learning.
Xt (N-d array, shape (N, M)) – test set array composed of N-columns and the M-samples. The feature set excludes the target y.
yt (arraylike , shape (M)) – test label (or test target) composed of M-examples in supervised learning.
scorer (str, Callable,) –
a scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:
- "classification_report" -> for classification_report,
- "precision_recall" -> for precision_recall_curve,
- "confusion_matrix" -> for confusion_matrix,
- "precision" -> for precision_score,
- "accuracy" -> for accuracy_score,
- "mse" -> for mean_squared_error,
- "recall" -> for recall_score,
- "auc" -> for roc_auc_score,
- "roc" -> for roc_curve,
- "f1" -> for f1_score.
Other string prefix values raise an error.
kws (dict,) – Additional keyword arguments of the scikit-learn metric function.

Returns:

Tuple – the model score or the predicted y if predict is set to True.

Return type:

(score, ypred)
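
One way to read the scorer list above is as a lookup from a short key to the name of a function in `sklearn.metrics`. The sketch below shows such a lookup in plain Python. It is a hypothetical interpretation, not the watex code: the names `SCORER_NAMES` and `resolve_scorer` are made up, and the rule "exact key first, then a unique prefix, otherwise ValueError" is an assumption about how prefixes could be resolved.

```python
# Scorer keys from the list above, mapped to sklearn.metrics function names.
SCORER_NAMES = {
    "classification_report": "classification_report",
    "precision_recall": "precision_recall_curve",
    "confusion_matrix": "confusion_matrix",
    "precision": "precision_score",
    "accuracy": "accuracy_score",
    "mse": "mean_squared_error",
    "recall": "recall_score",
    "auc": "roc_auc_score",
    "roc": "roc_curve",
    "f1": "f1_score",
}


def resolve_scorer(name):
    """Return the metric function name for a scorer key or unique prefix."""
    key = str(name).lower().strip()
    if key in SCORER_NAMES:                 # exact key wins
        return SCORER_NAMES[key]
    matches = [k for k in SCORER_NAMES if k.startswith(key)]
    if len(matches) != 1:                   # unknown or ambiguous prefix
        raise ValueError(f"cannot resolve scorer {name!r}; matches: {matches}")
    return SCORER_NAMES[matches[0]]


metric_name = resolve_scorer("conf")
```

With scikit-learn installed, the returned name can then be looked up with `getattr(sklearn.metrics, metric_name)` to obtain the actual metric function.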

watex.utils.findCatandNumFeatures(df=None, features=None, return_frames=False)

Retrieve the categorical or numerical features from the whole feature set of the dataset.

Parameters:

df (Dataframe) – Dataframe with columns composing the features
features (list of str,) – list of the column names. If the dataframe is big, only the required features can be set. If features are provided, the frame is shrunk to the given features before the search for numerical and categorical features. Note that an error is raised if any of the features is missing in the dataframe.
return_frames (bool,) – if set to True, it returns two separate dataframes (cat & num); otherwise, it only returns the cat and num column names.

Returns:

Tuple

Return type:

cat_features and num_features names or frames

Examples

>>> from watex.datasets import fetch_data
>>> from watex.utils import findCatandNumFeatures
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> cat, num = findCatandNumFeatures(data)
>>> cat, num
... (['type', 'geol', 'shape', 'name', 'flow'],
 ['num', 'east', 'north', 'power', 'magnitude', 'sfi', 'ohmS', 'lwi'])
>>> cat, num = findCatandNumFeatures(
    data, features = ['geol', 'ohmS', 'sfi'])
... (['geol'], ['ohmS', 'sfi'])

watex.utils.find_aquifer_groups(arr_k, /, arr_aq=None, kname=None, aqname=None, subjectivity=False, default_arr=None, keep_label_0=False, method='naive')

Fit the group of aquifer and find the representative of each true label in array ‘k’ in the aquifer group array.

The idea is to find the aquifer group that best fits the true label ‘X’ in ‘y_true’.

‘arr_k’ and ‘arr_aq’ must contain class labels, not continuous values.

Parameters:

arr_k (array_like, pandas series or dataframe) – array-like that contains the permeability coefficients ‘k’. If a dataframe is supplied, the permeability coefficient column name ‘kname’ must be specified.
arr_aq (array-like , pandas series or dataframe) – array-like that contains the aquifer groups. If NaN values exist in the aquifer groups, it is suggested to impute them before feeding the data to the algorithm. Missing values are not allowed. If a dataframe is supplied, the aquifer group column name ‘aqname’ must be specified.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
aqname (str, optional,) – Name of the aquifer group column. aqname allows retrieving the aquifer group arr_aq values from a specific dataframe. Commonly aqname needs to be supplied when a dataframe is passed as a positional or keyword argument.
subjectivity (bool, default=False) – Considers each class label as a naive group of aquifer. Subjectivity occurs when no group of aquifer is found in the data. Therefore, each class label is considered as a naive group of aquifer. It is strongly recommended to pass a default group to the parameter default_arr to substitute the group of aquifers for more practical reasons. For instance, it can be the layer collected at a specific depth, like the ‘strata’ columns.
default_arr (array-like, pd.Series) – Array used as default to substitute the group of aquifers if the latter is missing. This is a heuristic option because it might lead to breaking code or invalid results.
keep_label_0 (bool, default=False) – The prediction already includes the label 0. However, a 0 in the predicted labels refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore, ‘0’ can be excluded, since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. Therefore, to avoid any confusion, ‘0’ is altered to ‘1’: the value +1 is used to shift all class labels forward, thereby excluding the ‘0’ label. To force including 0 in the labels, set keep_label_0 to True.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy to compute the representativity of a label in the predicted array ‘arr_aq’. It can also be ‘strict’. Indeed:
- naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘k’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in arr_k.
- strict computes the importance of the label by the number of occurrences in the whole valid arr_k, i.e. over the total of occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in arr_k.

Returns:

_Group – Use attribute .groups to find the group values.

Return type:

_Group class object

Examples

(1) Use the real aquifer group collected in the area

>>> from watex.utils import naive_imputer, read_data, reshape
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k, find_aquifer_groups
>>> b= load_hlogs () #just taking the target names
>>> data = read_data ('data/boreholes/hf.csv') # read complete data
>>> y = data [b.target_names]
>>> # impute the missing values found in aquifer group columns
>>> # reshape 1d array along axis 0 for imputation
>>> agroup_imputed = naive_imputer ( reshape (y.aquifer_group, axis =0 ) ,
...                                    strategy ='most_frequent')
>>> # reshape back to array_like 1d
>>> y.aquifer_group =reshape (agroup_imputed)
>>> # categorize the 'k' continuous value in 'y.k' using the default
>>> # 'k' mapping func
>>> y.k = classify_k (y.k , default_func =True)
>>> # get the group obj
>>> group_obj = find_aquifer_groups(y.k, y.aquifer_group)
>>> group_obj
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141  %',
                           [('Groups', {'V': 0.32, 'IV': 0.266, 'II': 0.236,
                                        'III': 0.158, 'IV&V': 0.01,
                                        'II&III': 0.005, 'III&IV': 0.005}),
                            ('Representativity', ( 'V', 0.32)),
                            ('Similarity', 'V')])],
        Label=[' 2 ',
              Preponderance( rate = ' 19.11  %',
                           [('Groups', {'III': 0.274, 'II': 0.26, 'V': 0.26,
                                        'IV': 0.178, 'III&IV': 0.027}),
                            ('Representativity', ( 'III', 0.27)),
                            ('Similarity', 'III')])],
        Label=[' 3 ',
              Preponderance( rate = '27.749  %',
                           [('Groups', {'V': 0.443, 'IV': 0.311, 'III': 0.245}),
                            ('Representativity', ( 'V', 0.44)),
                            ('Similarity', 'V')])],
             )
(2) Use the subjectivity and set the strata columns as default array

>>> find_aquifer_groups(y.k, subjectivity=True, default_arr= X.strata_name )
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141  %',
                           [('Groups', {'siltstone': 0.35, 'coal': 0.227,
                                        'fine-grained sandstone': 0.158,
                                        'medium-grained sandstone': 0.094,
                                        'mudstone': 0.079,
                                        'carbonaceous mudstone': 0.054,
                                        'coarse-grained sandstone': 0.03,
                                        'coarse': 0.01}),
                            ('Representativity', ( 'siltstone', 0.35)),
                            ('Similarity', 'siltstone')])],
        Label=[' 2 ',
              Preponderance( rate = ' 19.11  %',
                           [('Groups', {'mudstone': 0.288, 'siltstone': 0.205,
                                        'coal': 0.192,
                                        'coarse-grained sandstone': 0.137,
                                        'fine-grained sandstone': 0.137,
                                        'carbonaceous mudstone': 0.027,
                                        'medium-grained sandstone': 0.014}),
                            ('Representativity', ( 'mudstone', 0.29)),
                            ('Similarity', 'mudstone')])],
        Label=[' 3 ',
              Preponderance( rate = '27.749  %',
                           [('Groups', {'mudstone': 0.245, 'coal': 0.226,
                                        'siltstone': 0.217,
                                        'fine-grained sandstone': 0.123,
                                        'carbonaceous mudstone': 0.066,
                                        'medium-grained sandstone': 0.066,
                                        'coarse-grained sandstone': 0.057}),
                            ('Representativity', ( 'mudstone', 0.24)),
                            ('Similarity', 'mudstone')])],
             )
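
The difference between the 'naive' and 'strict' strategies described for `method` comes down to the denominator used when counting groups per label. The sketch below illustrates this reading with the standard library only. It is an interpretation of the parameter description, not the library code, and the helper name `group_rates` is hypothetical.

```python
from collections import Counter, defaultdict


def group_rates(k_labels, groups, method="naive"):
    """Rate of each aquifer group within each 'k' label.

    naive:  counts are divided by the number of samples of that label.
    strict: counts are divided by the total number of samples.
    """
    per_label = defaultdict(Counter)
    for k, g in zip(k_labels, groups):
        per_label[k][g] += 1
    total = len(k_labels)
    rates = {}
    for k, counts in per_label.items():
        denom = sum(counts.values()) if method == "naive" else total
        rates[k] = {g: round(n / denom, 3) for g, n in counts.most_common()}
    return rates


k = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
aq = ["V", "V", "IV", "II", "III", "II", "V", "V", "V", "IV"]
naive = group_rates(k, aq, method="naive")
strict = group_rates(k, aq, method="strict")
# The most represented group for label 1 under the naive strategy.
representative = max(naive[1], key=naive[1].get)
```

Under the naive strategy the rates of each label sum to 1, which is why it is less sensitive to unbalanced labels; under the strict strategy a rare label can never reach a high rate.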

watex.utils.find_similar_labels(y_true, y_pred, *, categorize_k=False, threshold=None, func=None, keep_label_0=False, method='naive', return_groups=False, **kwd)

Find similarities between y_true and y_pred and return the rate.

Parameters:

y_true (array-like 1d or pandas.Series) – Array containing the true labels of ‘k’
y_pred (array_like, or pandas.Series) – array containing the predicted naive group of aquifers (NGA)
categorize_k (bool,) – If set to True, the user needs to provide a function func to map or categorize the permeability coefficient ‘k’ into integer labels.
func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead through the parameter default_func.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to the one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to other labels is considered the most representative and similar. Fixing the threshold instead is recommended, especially for a huge dataset.
keep_label_0 (bool, default=False) –
Force including 0 in the predicted label if keep_label_0 is set to True. Mostly, label ‘0’ refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore, ‘0’ can be excluded, since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. Therefore, to avoid any confusion, ‘0’ is removed by default in the ‘k’ categorization. However, when the prediction ‘y_pred’ is made with an unsupervised method, the prediction straightforwardly includes ‘0’, i.e. ‘k=0’, as a first class. So the value +1 is used to shift all class labels forward, thereby excluding the ‘0’ label. To force including 0 in the labels, set keep_label_0 to True.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’. Indeed:
- naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. over the total of occurrences of all the labels that exist in the whole ‘y_pred’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
return_groups (bool, default=False) – Returns label groups and their values counts in the predicted labels y_pred where ‘k’ values are not missing.

Returns:

g.similarity (Tuple) – labels found that are considered similar in the predicted labels.
g.group (Tuple) – groups that have their similarity in the true labels.

Example

>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import find_similar_labels, classify_k
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap = classify_k(data.k , default_func =True)
>>> # Note that for the demo we use the group of aquifer columns; however,
>>> # in practice, y_pred must be the predicted NGA labels. This
>>> # is possible using the function <predict_NGA_labels>
>>> sim = find_similar_labels(y_true= ymap, y_pred=data.aquifer_group)
>>> sim
... ((1, 'V'), (2, 'III'), (3, 'V'))
>>> group= find_similar_labels(ymap, data.aquifer_group, return_groups=True)
>>> group
... ((1,
  {'V': 0.17,
   'IV': 0.141,
   'II': 0.126,
   'III': 0.084,
   'IV&V': 0.005,
   'II&III': 0.003,
   'III&IV': 0.003}),
 (2, {'III': 0.052, 'II': 0.05, 'V': 0.05, 'IV': 0.034, 'III&IV': 0.005}),
 (3, {'V': 0.123, 'IV': 0.086, 'III': 0.068}))
>>> find_similar_labels(y_true= ymap, y_pred=data.aquifer_group,
                              threshold = 0.15)
... [(1, 'V')]

watex.utils.fittensor(refreq, compfreq, z, fill_value=nan)

Fit each tensor component to the complete frequency range.

The complete frequency is the frequency with clean data. It contains the whole frequency range of the site. During the survey, missing frequencies lead to missing tensor data. The function therefore indicates where the tensor data is missing and fits the data to the prior frequencies.

Parameters:

refreq – Reference frequency. Should be the complete frequency collected in the field.
compfreq – array-like, should be the frequency of the survey area.
z – array-like, should be the tensor value (real or imaginary part) at the component xx, xy, yx, yy.
fill_value – float. Value to replace the missing data in tensors. Default is NaN.
return – new Z filled with the invalid value NaN where the frequency is missing in the data.

Example:

>>> import numpy as np
>>> from watex.utils.exmath import fittensor
>>> refreq = np.linspace(7e7, 1e0, 20) # 20 frequencies as reference
>>> freq_ = np.hstack ((refreq.copy()[:7], refreq.copy()[12:] ))
>>> z = np.random.randn(len(freq_)) *10 # assume freq_ has the same
...                 # length as the tensor Z values
>>> zn  = fittensor (refreq, freq_, z)
>>> z # some frequency values are missing but not visible.
...array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,   9.85956253,
         3.96269863, -10.38325124,  -4.29739755,  -8.2591703 ,
        21.7930423 ,   0.21709129,   4.07815217])
>>> # zn shows where the frequencies are missing
>>> # a NaN value means a missing value in tensor Z at that frequency
>>> zn
... array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,          nan,
                nan,          nan,          nan,          nan,
         9.85956253,   3.96269863, -10.38325124,  -4.29739755,
        -8.2591703 ,  21.7930423 ,   0.21709129,   4.07815217])
>>> # let us visualize where the frequency values are missing in tensor Z
>>> refreq
... array([7.00000000e+07, 6.63157895e+07, 6.26315791e+07, 5.89473686e+07,
       5.52631581e+07, 5.15789476e+07, 4.78947372e+07, 4.42105267e+07*,
       4.05263162e+07*, 3.68421057e+07*, 3.31578953e+07*, 2.94736848e+07*,
       2.57894743e+07, 2.21052638e+07, 1.84210534e+07, 1.47368429e+07,
       1.10526324e+07, 7.36842195e+06, 3.68421147e+06, 1.00000000e+00])
>>> refreq[np.isnan(zn)] #we can see the missing value between [7:12](*) in refreq
... array([44210526.68421052, 40526316.21052632, 36842105.73684211,
       33157895.2631579 , 29473684.78947368])
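
The alignment described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the watex source: the helper name fit_to_reference is hypothetical, and the surveyed frequencies are assumed to match entries of the reference array exactly.

```python
import numpy as np

def fit_to_reference(refreq, compfreq, z, fill_value=np.nan):
    """Spread z (sampled at compfreq) over refreq, padding the gaps."""
    refreq = np.asarray(refreq, dtype=float)
    out = np.full(refreq.shape, fill_value, dtype=float)
    # Map every reference frequency to its row; assumes exact float matches.
    row_of = {f: i for i, f in enumerate(refreq.tolist())}
    for f, value in zip(np.asarray(compfreq, dtype=float).tolist(), z):
        out[row_of[f]] = value
    return out

refreq = np.linspace(7e7, 1e0, 20)
freq_ = np.hstack((refreq[:7], refreq[12:]))   # rows 7..11 were not recorded
z = np.arange(len(freq_), dtype=float)         # stand-in tensor values
zn = fit_to_reference(refreq, freq_, z)
print(np.flatnonzero(np.isnan(zn)))            # [ 7  8  9 10 11]
```

A dictionary lookup is used rather than a tolerance search to keep the sketch short; real field data may need a tolerance when the frequencies are not bit-identical.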

watex.utils.get2dtensor(z_or_edis_obj_list, /, tensor='z', component='xy', kind='modulus', return_freqs=False, **kws)[source]#

Build a two-dimensional array from a collection of impedance tensors Z.

Outputs a 2D resistivity, phase, error or tensor matrix from a collection of EDI objects.

The matrix has shape (number of frequencies, number of sites). The function checks whether data are available at all frequencies; missing values are filled with NaN. Note that each element of z has shape (nfreq, 2, 2), with the components laid out as:

xx ( 0, 0) ------- xy ( 0, 1)
yx ( 1, 0) ------- yy ( 1, 1)

Parameters:

z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='z') – Tensor name. Can be [resistivity|phase|z|frequency].
component (str, default='xy' (TE mode)) – EM mode. Can be [‘xx’, ‘xy’, ‘yx’, ‘yy’].
out (str) – Kind of data to output. Be sure to provide the component when retrieving the attribute from the collection object, for instance resxy for the xy component. Except for the error and frequency attributes, a missing component raises an error. Default is resxy.
kind (str, default='modulus') – Controls the tensor output. Note that the tensor is a complex ndarray of shape (nfreq, 2, 2). If set to modulus, the modulus of the complex tensor is output. If real or imag, only that part is returned. Default is modulus.
kws (dict) – Additional keyword arguments passed to EM.getfullfrequency.

Returns:

name, m2 – The matrix has shape (number of frequencies, number of EDI objects), where the number of EDI objects corresponds to the number of stations/sites.

Return type:

name of tensor and components

Examples

>>> from watex.datasets import load_huayuan
>>> from watex.utils import get2dtensor
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> data = box.data
>>> phase_yx = get2dtensor ( data, tensor ='phase', component ='yx')
>>> phase_yx.shape
(56, 7)
>>> phase_yx [0, :]
array([        nan,         nan,         nan,         nan, 18.73244951,
       35.00516522, 59.91093054])
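
To make the (nfreq, 2, 2) layout and the NaN padding concrete, here is a small NumPy sketch. It is an assumption-laden illustration rather than the watex code: stack_component, its arguments and the toy data are all hypothetical, and plain arrays stand in for EDI objects.

```python
import numpy as np

# Position of each component inside the 2 x 2 tensor.
COMPONENTS = {'xx': (0, 0), 'xy': (0, 1), 'yx': (1, 0), 'yy': (1, 1)}

def stack_component(z_list, freq_list, full_freq, component='xy', kind='modulus'):
    """Build an (nfreq, nsites) matrix for one tensor component."""
    i, j = COMPONENTS[component]
    full_freq = np.asarray(full_freq, dtype=float)
    m2 = np.full((full_freq.size, len(z_list)), np.nan)
    for col, (z, freq) in enumerate(zip(z_list, freq_list)):
        values = np.asarray(z)[:, i, j]
        if kind == 'modulus':
            values = np.abs(values)
        elif kind in ('real', 'imag'):
            values = getattr(values, kind)
        else:
            raise ValueError(f"unknown kind: {kind!r}")
        # Rows of the frequencies this site actually recorded.
        rows = [int(np.flatnonzero(full_freq == f)[0]) for f in freq]
        m2[rows, col] = values
    return m2

full_freq = [100.0, 10.0, 1.0]
site_a = np.full((3, 2, 2), 3 + 4j)     # recorded at every frequency
site_b = np.full((2, 2, 2), 1j)         # 10 Hz is missing at this site
m2 = stack_component([site_a, site_b],
                     [[100.0, 10.0, 1.0], [100.0, 1.0]], full_freq)
print(m2.shape)                          # (3, 2)
```

Each column is one site and each row one frequency, so the missing 10 Hz sample of the second site stays NaN.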

watex.utils.getGlobalScore(cvres)[source]#

Retrieve the global mean and standard deviation score from the cross validation containers.

Parameters:

cvres (dict of Array-like, Shape (N, )) – cross validation results after training the models of number of parameters equals to N.

Returns:

tuple (‘mean_test_scores’, ‘std_test_scores’)

Global mean test score and global standard deviation of the test scores.
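
As an illustration, the global score can be obtained by averaging the per-candidate results. This sketch assumes that cvres follows the scikit-learn cv_results_ layout, with the keys mean_test_score and std_test_score, and that both global values are plain averages; the function name global_score is hypothetical.

```python
from statistics import fmean

def global_score(cvres):
    """Average the per-candidate mean and std test scores (assumed layout)."""
    return fmean(cvres['mean_test_score']), fmean(cvres['std_test_score'])

# Toy results for three parameter candidates.
cvres = {
    'mean_test_score': [0.80, 0.90, 0.70],
    'std_test_score': [0.02, 0.04, 0.03],
}
mean_score, mean_std = global_score(cvres)
print(round(mean_score, 3), round(mean_std, 3))   # 0.8 0.03
```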

watex.utils.get_aquifer_section(arr_k, /, zname=None, kname=None, z=None, return_index=False, return_sections=True)[source]#

Detect a single aquifer section (upper and lower) in depth.

This is a useful tip for computing the thickness of the aquifer.

Parameters:

arr_k (ndarray or dataframe) – Data that contains mainly the aquifer values. It can also contain the depth values. If the depth is included in arr_k, zname needs to be supplied to recover the depth.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column. The integer must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column. The integer must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of the aquifer found in the dataframe arr_k.
return_sections (bool, default=True) – Returns the sections (upper and lower) of the aquifers.

Returns:

up, low –

(upix, lowix): Tuple of indexes of the upper and lower sections.
(up, low): Tuple of aquifer sections (upper and lower).
(upix, lowix), (up, low): Positions and section values of the aquifer if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_section
>>> data = load_hlogs ().frame # return all data including the 'depth' values
>>> get_aquifer_section (data , zname ='depth', kname ='k')
... [197.12, 369.71] # section starts from 197.12 -> 369.71 m
>>> get_aquifer_section (data , zname ='depth', kname ='k', return_index=True)
... ([16, 29], [197.12, 369.71]) # upper and lower-> position 16 and 29.
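
The idea of a single section can be sketched without watex. The snippet below assumes that rows outside the aquifer carry NaN in the ‘k’ column, so the section runs from the first to the last valid ‘k’ value; this is an assumption about the detection rule, and aquifer_section is a hypothetical name.

```python
import math

def aquifer_section(depth, k):
    """Return ((upix, lowix), (up, low)) from the first/last valid k value."""
    valid = [i for i, v in enumerate(k) if v is not None and not math.isnan(v)]
    if not valid:
        raise ValueError("no permeability value found in 'k'")
    upix, lowix = valid[0], valid[-1]
    return (upix, lowix), (depth[upix], depth[lowix])

# Toy borehole: k is only known between 197.12 m and 369.71 m.
depth = [0.0, 50.3, 197.12, 250.0, 369.71, 400.0]
k = [math.nan, math.nan, 1.0, 2.0, 1.0, math.nan]
indexes, section = aquifer_section(depth, k)
print(indexes, section)   # (2, 4) (197.12, 369.71)
```

The thickness of the aquifer then follows as the difference between the lower and upper section values.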

watex.utils.get_aquifer_sections(*data, zname, kname, return_index=False, return_data=False, error='ignore', **kws)[source]#

Get the section of each aquifer from multiple dataframes.

The unique section (‘upper’, ‘lower’) is the range of the whole data that is considered valid. The indexes are needed to shrink the data of the whole boreholes. Mostly, the data within the section is considered the valid data, used as the predictor Xr. Data outside the aquifer section can be discarded or compressed to the top of Xr.

Returns valid section indexes if ‘return_index’ is set to True.

Parameters:

data (list of pandas dataframe) – Data that contains mainly the aquifer values. The name of the depth column zname and the name of the permeability column kname need to be specified.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column. The integer must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column. The integer must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of each aquifer found in each dataframe.
error (str, default='ignore') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed if invalid data is found. Any other value of error will set error to raise.
return_data (bool, default=False) – Return valid data. It is useful when ‘error’ is set to ‘ignore’ to collect the valid data.
kws (dict) – Additional keyword arguments passed to get_aquifer_section().

Returns:

up, low –

(upix, lowix): Tuple of indexes of the upper and lower sections.
(up, low): Tuple of aquifer sections (upper and lower).
(upix, lowix), (up, low): Positions and section values of the aquifers if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

See also

watex.utils.hydroutils.get_aquifer_sections: compute multiple aquifer sections

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_sections
>>> data = load_hlogs ().frame
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' )
... [[197.12, 369.71], [197.12, 369.71]]
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' ,
                           return_index =True )
...  [[16, 29], [16, 29]]

watex.utils.get_compressed_vector(d, /, sname, stratum=None, strategy='average', as_frame=False, random_state=None)[source]#

Compresses the base stratum data into a single vector composed of all feature names in the targeted data d.

Parameters:

d (pandas DataFrame) – Valid data containing the strata. If a dataframe is passed, ‘sname’ is needed to fetch the strata values.
sname (str) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered. It returns the same stratum; however, it can give the rate and occurrence of this stratum if return_rate or return_counts is set to True.
strategy (str, default='average' or 'mean') – Strategy used to select or reduce the numerical data into a single series. It can also be ‘naive’; in that case, a single series is randomly picked from the base strata data.
as_frame (bool, default=False) – Returns the compressed vector as a dataframe rather than keeping it as a series.
random_state (int, optional) – State for randomly selecting a compressed vector when naive is passed as strategy.

Returns:

ms – A compressed vector, as a pandas series composed of all features. Note that the vector here is not a mathematical vector composed of numerical values only. A compressed vector is a series obtained by averaging the numerical features of the base stratum and including its corresponding categorical values. ms can therefore contain categorical values and has the same number of features as the original frame d.

Return type:

pandas series/dataframe

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_compressed_vector
>>> data = load_hlogs().frame # get only the frame
>>> get_compressed_vector (data, sname='strata_name')[:4]
... hole_number           H502
    strata_name      siltstone
    aquifer_group           II
    pumping_level       ZFSAII
    dtype: object
>>> get_compressed_vector (data, sname='strata_name', as_frame=True )
...   hole_number strata_name aquifer_group  ...        r     rp remark
    0        H502   siltstone            II  ...  41.7075  59.23    NaN
    [1 rows x 23 columns]
>>> get_compressed_vector (data, sname='strata_name', strategy='naive')
... hole_number          H502
    depth_top          379.15
    depth_bottom        379.7
    strata_name     siltstone
    Name: 39, dtype: object
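
The ‘average’ strategy can be pictured with a few lines of pandas. This is a sketch under assumptions, not the library code: the base stratum is taken to be the most frequent one, numerical columns are averaged, and categorical columns take their most frequent value. The name compress_stratum and the toy columns are hypothetical.

```python
import pandas as pd

def compress_stratum(d, sname, stratum=None):
    """Collapse the rows of one stratum into a single series."""
    if stratum is None:
        stratum = d[sname].mode().iloc[0]        # assumed: most frequent stratum
    base = d.loc[d[sname] == stratum]
    numeric = base.select_dtypes(include='number').mean()
    categorical = (base.select_dtypes(exclude='number')
                   .apply(lambda s: s.mode().iloc[0]))
    # Keep the column order of the original frame.
    return pd.concat([categorical, numeric]).reindex(d.columns)

d = pd.DataFrame({
    'hole': ['H1', 'H1', 'H1', 'H1'],
    'strata_name': ['siltstone', 'coal', 'siltstone', 'siltstone'],
    'resistivity': [10.0, 50.0, 20.0, 30.0],
})
ms = compress_stratum(d, 'strata_name')
print(ms['strata_name'], ms['resistivity'])      # siltstone 20.0
```

The result mixes categorical and numerical entries, which is why it is returned as an object series rather than a numeric vector.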

watex.utils.get_full_frequency(z_or_edis_obj_list, /, to_log10=False)[source]#

Get the frequency with clean data.

The full or plain frequency is the frequency array with no frequency missing during data collection. Note that when using Natural Source Audio-Magnetotellurics, some data are missing because the signal is weak or absent in certain bands, especially in the attenuation band.

Parameters:

z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
to_log10 (bool, default=False) – Export frequency to base 10 logarithm

Returns:

f – frequency with clean data. Out of attenuation band if survey is completed with Natural Source Audio-Magnetotellurics.

Return type:

Arraylike of shape(N, )

Examples

>>> from watex.datasets import load_huayuan
>>> from watex.methods.em import get_full_frequency
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> edi_data = box.data
>>> f = get_full_frequency (edi_data )
>>> f
array([8.19200e+04, 7.00000e+04, 5.88000e+04, 4.95000e+04, 4.16000e+04,
       3.50000e+04, 2.94000e+04, 2.47000e+04, 2.08000e+04, 1.75000e+04,
       ...
       3.25000e+01, 2.75000e+01, 2.25000e+01, 1.87500e+01, 1.62500e+01,
       1.37500e+01, 1.12500e+01, 9.37500e+00, 8.12500e+00, 6.87500e+00,
       5.62500e+00])
>>> len(f)
56
>>> # Get only the z component objects
>>> zobjs = [ box.emo.ediObjs_[i].Z for i in  range (len(box.emo.ediObjs_))]
>>> len(zobjs)
56
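
One simple way to picture the full frequency is to take the site that recorded the most frequencies. The sketch below relies on that assumption, which may differ from what watex does internally; full_frequency is a hypothetical name and plain lists stand in for EDI objects.

```python
import math

def full_frequency(freq_list, to_log10=False):
    """Return the longest frequency array among the sites (assumed rule)."""
    full = max(freq_list, key=len)
    return [math.log10(f) for f in full] if to_log10 else list(full)

site_freqs = [
    [1000.0, 100.0, 1.0],            # 10 Hz missing
    [1000.0, 100.0, 10.0, 1.0],      # complete
    [1000.0, 1.0],
]
print(full_frequency(site_freqs))                  # [1000.0, 100.0, 10.0, 1.0]
print(full_frequency(site_freqs, to_log10=True))   # [3.0, 2.0, 1.0, 0.0]
```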

watex.utils.get_profile_angle(easting=None, northing=None, msg='ignore')[source]#

Compute geoprofile angle.

Parameters:

easting (array_like) – Easting coordinates values.
northing (array_like) – Northing coordinates values.
msg (str) – Output a little message if msg is set to “raises”.

Returns:

float – profile_angle
float – geo_electric_strike
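
A possible way to obtain such an angle is a least-squares line through the station coordinates. The sketch below is an assumption, not the documented algorithm: the angle is measured in degrees east of north from the slope of northing against easting, and the geoelectric strike is taken as perpendicular to the profile. The function name profile_angle is hypothetical.

```python
import math

def profile_angle(easting, northing):
    """Angle of the best-fit profile line, in degrees east of north."""
    n = len(easting)
    mean_e = sum(easting) / n
    mean_n = sum(northing) / n
    sxx = sum((e - mean_e) ** 2 for e in easting)
    sxy = sum((e - mean_e) * (nn - mean_n) for e, nn in zip(easting, northing))
    if sxx == 0:                     # all stations share one easting: due north
        return 0.0
    slope = sxy / sxx                # northing = slope * easting + intercept
    return (90.0 - math.degrees(math.atan(slope))) % 180.0

angle = profile_angle([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
strike = (angle + 90.0) % 180.0      # assumed: strike perpendicular to profile
print(round(angle, 6), round(strike, 6))   # 45.0 135.0
```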

watex.utils.get_sections_from_depth(z, z_range, return_index=False)[source]#

Gets aquifer sections (‘upper’, ‘lower’) in data ‘z’ from the depth range.

This might be useful to compute the thickness of the aquifer.

Parameters:

z (array-like 1d or pd.Series) – Array or pandas series containing the depth values.
z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.
return_index (bool, default=False) – Returns the indices of the sections [‘upper’, ‘lower’] of the aquifer and of the non-valid sections too.

Returns:

sections (Tuple (float, float)) – Real values of the upper and lower sections of the aquifer.

If return_index is ‘True’, the function returns:

(upix, lowix): Tuple (int, int) – indices of the upper and lower sections in the depth array z.
(invix): list of Tuple (int, int) – list of indices of the invalid sections.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_sections_from_depth
>>> data= load_hlogs().frame
>>> # get real sections from depth 16.25 to 125.83 m
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83))
...  (22.46, 128.23)
>>> # aquifer depth from 16.25 m to the end
>>> get_sections_from_depth ( data.depth_top, ( 16.25,))
... (22.46, 693.37)
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83),
                             return_index =True )
... ((3, 11), [(0, 3), (11, 180)])
>>> get_sections_from_depth ( data.depth_top, ( 16.25,),
                             return_index =True )
... ((3, 181), [(0, 3)])
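
The examples above show requested depths being replaced by depths that actually exist in z. The rule used for that matching is not stated here; the sketch below assumes each bound snaps to the nearest sampled depth and that a missing lower bound means the bottom of the well. The names sections_from_depth and nearest_index are hypothetical.

```python
def nearest_index(z, value):
    """Index of the sampled depth closest to value."""
    return min(range(len(z)), key=lambda i: abs(z[i] - value))

def sections_from_depth(z, z_range, return_index=False):
    """Snap a requested depth range onto the sampled depths (assumed rule)."""
    z = list(z)
    bounds = list(z_range)
    if len(bounds) == 1:              # only the top is given: go to the bottom
        bounds.append(z[-1])
    upix, lowix = (nearest_index(z, b) for b in bounds)
    return (upix, lowix) if return_index else (z[upix], z[lowix])

z = [0.0, 2.5, 9.0, 22.46, 60.0, 128.23, 300.0]   # toy depth column
print(sections_from_depth(z, (16.25, 125.83)))                      # (22.46, 128.23)
print(sections_from_depth(z, (16.25,)))                             # (22.46, 300.0)
print(sections_from_depth(z, (16.25, 125.83), return_index=True))   # (3, 5)
```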

watex.utils.get_strike(profile_angle=None, easting=None, northing=None, gstrike=None, msg='ignore')[source]#

Compute geoelectric strike from profile angle, easting and northing.

Parameters:

profile_angle – If not provided, it is computed from the easting and northing coordinates.
easting – Easting coordinates values.
northing – Northing coordinates values.
gstrike – Strike value. If provided, the geoelectric strike is recomputed.
msg – Output a little message if msg is set to “raises”.

Returns:

float – profile_angle in degree E of N
float – geo_electric_strike in degrees E of N

watex.utils.get_target(ar, /, tname, drop_target=True, columns=None, as_frame=False)[source]#

Extract target from multidimensional array or dataframe.

Parameters:

ar (arraylike2d or pd.DataFrame) – Array that is supposed to contain the target value.
tname (int/str, list of int/str) – Index or name of the target. If an int is passed, it should be less than the number of columns of the array, i.e. shape[1] in the case of np.ndarray. If a list of indexes or names is given, the returned target is a two-dimensional array.
drop_target (bool, default=True) – Removes the target from the 2D array or dataframe when the target exists, and returns the data excluding the target.
columns (list, default=None) – Column names used to compose the dataframe when an array is given rather than a dataframe. The list of column names must match the number of columns in the two-dimensional array, otherwise an error occurs.
as_frame (bool, default=False) – Returns a dataframe/series for the target rather than an array when an array is supplied. This is useful when column names are supplied.

Returns:

t, ar – Return the targets and the array/dataframe of the target.

Return type:

array-like/pd.Series , array-like/pd.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils import get_target
>>> ar = np.random.randn ( 2, 3 )
>>> df0 = pd.DataFrame ( ar, columns = ['x1', 'x2', 'tname'])
>>> df= df0.copy()
>>> get_target (df, 'tname', drop_target= False )
(      tname
 0 -0.542861
 1  0.781198,
          x1        x2     tname
 0 -1.424061 -0.493320 -0.542861
 1  0.416050 -1.156182  0.781198)

>>> get_target (df, [ 'tname', 'x1']) # drop is True by default
(      tname        x1
 0 -0.542861 -1.424061
 1  0.781198  0.416050,
          x2
 0 -0.493320
 1 -1.156182)
>>> df = df0.copy()
>>> # when array is passed
>>> get_target (df.values , '2', drop_target= False )
(array([[-0.54286148],
        [ 0.7811981 ]]),
 array([[-1.42406091, -0.49331988, -0.54286148],
        [ 0.41605005, -1.15618243,  0.7811981 ]]))
>>> get_target (df.values , 'tname') # raise error
ValueError: 'tname' ['tname'] is not valid...
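
For the dataframe case, the behaviour shown above amounts to a column selection followed by an optional drop. Here is a minimal pandas sketch of that idea; split_target is a hypothetical name and the error message is illustrative, not the one raised by watex.

```python
import pandas as pd

def split_target(df, tname, drop_target=True):
    """Return (target, data); data excludes the target when drop_target."""
    names = [tname] if isinstance(tname, str) else list(tname)
    missing = [n for n in names if n not in df.columns]
    if missing:
        raise ValueError(f"{missing} not found in the dataframe columns")
    target = df[names]
    data = df.drop(columns=names) if drop_target else df
    return target, data

df = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0], 'tname': [0, 1]})
t, rest = split_target(df, 'tname')
print(list(t.columns), list(rest.columns))   # ['tname'] ['x1', 'x2']
```

Selecting with a list of names always yields a two-dimensional target, which mirrors the note on tname when several targets are requested.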

watex.utils.get_unique_section(*data, zname, kname, return_index=False, return_data=False, error='raise', **kws)[source]#

Get the section to consider unique in multiple aquifers.

The unique section (‘upper’, ‘lower’) is the valid range covering the sections of all the aquifers. It is considered the main valid section, in which data must be neither compressed nor altered. For instance, indexes are needed to shrink the data outside this valid section. Mostly, the data from the section is considered the valid data, used as the predictor Xr. Data outside the aquifer section can be discarded or compressed to the top of Xr.

Returns valid section indexes if ‘return_index’ is set to True.

Parameters:

data (list of pandas dataframe) – Data that contains mainly the aquifer values. The name of the depth column zname and the name of the permeability column kname need to be specified.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column. The integer must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column. The integer must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of the shallowest and deepest aquifers found in the whole set of dataframes.
return_data (bool, default=False) – Return valid data. It is useful when ‘error’ is set to ‘ignore’ to collect the valid data.
error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.
kws (dict) – Additional keyword arguments passed to get_aquifer_sections().

Returns:

up, low –

(upix, lowix): Tuple of indexes of the upper and lower sections.
(up, low): Tuple of aquifer sections (upper and lower).
(upix, lowix), (up, low): Positions and section values of the aquifers if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

See also

watex.utils.hydroutils.get_aquifer_section: compute single section
watex.utils.hydroutils.get_aquifer_sections: compute multiple sections

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils import get_unique_section
>>> data = load_hlogs ().frame
>>> get_unique_section (data.copy() , zname ='depth', kname ='k', )
... array([197.12, 369.71], dtype=float32)
>>> get_unique_section (data.copy() , zname ='depth', kname ='k',
                                return_index =True)
... array([16, 29])
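
Given the individual sections of several boreholes, a unique section that covers all of them can be sketched as the shallowest upper bound and the deepest lower bound. This reading of "shallowest and deepest aquifers" is an assumption; unique_section is a hypothetical name and the second and third sections below are made-up values for illustration.

```python
def unique_section(sections):
    """Envelope of several (upper, lower) sections (assumed rule)."""
    uppers, lowers = zip(*sections)
    return [min(uppers), max(lowers)]

# One (upper, lower) pair per borehole, in metres.
sections = [[197.12, 369.71], [150.0, 320.5], [210.4, 401.9]]
print(unique_section(sections))   # [150.0, 401.9]
```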

watex.utils.get_xs_xr_splits(data, /, z_range=None, zname=None, section_indexes=None)[source]#

Split data into matrix \(X_s\) with samples \(m_s\) (unwanted data) and \(X_r\) with samples \(m_r\) (valid aquifer data).

Parameters:

data (pandas dataframe) – Dataframe for compressing.
zname (str, int) – Name of the depth column. zname needs to be supplied when section_indexes is not provided.
z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.
section_indexes (tuple or list of int) – A pair of integers, given as a tuple or a list. These are the indexes of the valid section (upper and lower) of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is taken as the last position, the bottom of the section; therefore its value is included in the data.

Returns:

- xs (list of pandas dataframe) – Shrinking part of the data, for compressing. Note that it is a list because it holds the dataframes of the non-valid sections.
- xr (pandas dataframe) – Valid data corresponding to, or including, the aquifer data.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_xs_xr_splits
>>> data = load_hlogs ().frame
>>> xs, xr = get_xs_xr_splits (data, 3.11, section_indexes = (17, 20 ) )
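
When section_indexes is supplied, the split reduces to positional slicing in which the bottom index is kept in the valid part. The pandas sketch below illustrates this; xs_xr_split is a hypothetical name and the handling of empty parts is an assumption.

```python
import pandas as pd

def xs_xr_split(data, section_indexes):
    """Split rows into the non-valid parts (xs) and the valid section (xr)."""
    upix, lowix = section_indexes
    xr = data.iloc[upix:lowix + 1]                    # bottom index is included
    parts = (data.iloc[:upix], data.iloc[lowix + 1:])
    xs = [part for part in parts if len(part)]        # skip empty parts
    return xs, xr

data = pd.DataFrame({'depth': [float(i) for i in range(10)]})
xs, xr = xs_xr_split(data, (3, 6))
print(len(xr), [len(part) for part in xs])            # 4 [3, 3]
```

xs is a list because the rows above and below the aquifer form two separate frames.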

watex.utils.interpolate1d(arr, kind='slinear', method='mean', order=None, fill_value='extrapolate', limit=None, **kws)[source]#

Interpolate array containing invalid values NaN

Useful function to interpolate the missing frequency values in the tensor components.

Parameters:

arr (array_like) – Array to interpolate, containing invalid values. The invalid value here is NaN.
kind (str or int, optional) – Specifies the kind of interpolation as a string or as an integer specifying the order of the spline interpolator to use. The string has to be one of linear, nearest, nearest-up, zero, slinear, quadratic, cubic, previous, or next. zero, slinear, quadratic and cubic refer to a spline interpolation of zeroth, first, second or third order; previous and next simply return the previous or next value of the point; nearest-up and nearest differ when interpolating half-integers (e.g. 0.5, 1.5) in that nearest-up rounds up and nearest rounds down. If the method param is set to pd, which refers to the pandas interpolation method, kind can be set to polynomial or pad interpolation. Note that polynomial requires you to specify an order, while pad requires you to specify the limit. Default is slinear.
method (str, optional) – Method of interpolation. Can be base for scipy.interpolate.interp1d, mean or bff for the scaling methods, and pd for the pandas interpolation methods. Note that the first method is fast and efficient when the number of NaN in the array is relatively small. The base interpolation is less accurate when the data contains many missing values. Alternatively, the scaled methods (mean and bff) are proposed as a more efficient alternative. Indeed, when the mean argument is set, the function fills the NaN values from the nonzero values of the raw array using their mean, then fits the data. The fit yields a smooth curve, and each NaN in the raw array is replaced by the value of the fit at the same index. The same approach is used for the bff method, except that, rather than averaging the nonzero values, it uses a backward and forward fill to replace the NaN before scaling. mean and bff are more efficient when the data contains a lot of missing values. When the interpolation method is set to pd, the function uses the pandas interpolation but ends with a forward/backward NaN fill, since the pandas interpolation does not handle NaN at the beginning or at the end of the array. Default is mean.
fill_value (array-like or (array-like, array_like) or extrapolate, optional) – If a ndarray (or float), this value will be used to fill in for requested points outside of the data range. If not provided, then the default is NaN. The array-like must broadcast properly to the dimensions of the non-interpolation axes. If a two-element tuple, then the first element is used as a fill value for x_new < x[0] and the second element is used for x_new > x[-1]. Anything that is not a 2-element tuple (e.g., list or ndarray, regardless of shape) is taken to be a single array-like argument meant to be used for both bounds as below, above = fill_value, fill_value. Using a two-element tuple or ndarray requires bounds_error=False. Default is extrapolate.
kws (dict) – Additional keyword arguments from spi.interp1d.

Return type:

array like - New interpolated array. NaN values are interpolated.

Notes

When interpolating over the complete frequencies, i.e. all the frequency values, with the base method, the missing data in arr can fall outside the range of arr. So, for consistency and to keep all values within the frequency range, it is better to set the fill_value param in the kws argument of spi.interp1d to extrapolate. This avoids an error being raised when the value to interpolate lies outside the bounds of arr.

References

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html
https://www.askpython.com/python/examples/interpolation-to-fill-missing-entries

Examples

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils.exmath import interpolate1d
>>> z = np.random.randn(17) *10 # assume 17 freq for 17 values of tensor Z
>>> z [[7, 10, 16]] =np.nan # replace some indexes by NaN values
>>> zit = interpolate1d (z, kind ='linear')
>>> z
... array([ -1.97732415, -16.5883156 ,   8.44484348,   0.24032979,
          8.30863276,   4.76437029, -15.45780568,          nan,
         -4.11301794, -10.94003412,          nan,   9.22228383,
        -15.40298253,  -7.24575491,  -7.15149205, -20.9592011 ,
                 nan])
>>> zit
...array([ -1.97732415, -16.5883156 ,   8.44484348,   0.24032979,
         8.30863276,   4.76437029, -15.45780568,  -4.11301794,
       -10.94003412,   9.22228383, -15.40298253,  -7.24575491,
        -7.15149205, -20.9592011 , -34.76691014, -48.57461918,
       -62.38232823])
>>> zmean = interpolate1d (z,  method ='mean')
>>> zbff = interpolate1d (z, method ='bff')
>>> zpd = interpolate1d (z,  method ='pd')
>>> plt.plot( np.arange (len(z)),  zit, 'v--',
          np.arange (len(z)), zmean, 'ok-',
          np.arange (len(z)), zbff, '^g:',
          np.arange (len(z)), zpd,'<b:',
          np.arange (len(z)), z,'o',
          )
>>> plt.legend(['interp1d', 'mean strategy', 'bff strategy',
                'pandas strategy', 'data'], loc='best')
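
The base strategy and the role of fill_value='extrapolate' can be illustrated directly with scipy.interpolate.interp1d. This is a minimal sketch of the idea, assuming the array index plays the role of the x axis; interpolate_nan is a hypothetical name, not the watex function.

```python
import numpy as np
from scipy.interpolate import interp1d

def interpolate_nan(arr, kind='slinear'):
    """Fill NaN entries by interpolating over the valid positions."""
    arr = np.asarray(arr, dtype=float)
    x = np.arange(arr.size)
    valid = ~np.isnan(arr)
    # 'extrapolate' lets NaN at either end of the array be filled as well.
    f = interp1d(x[valid], arr[valid], kind=kind, fill_value='extrapolate')
    out = arr.copy()
    out[~valid] = f(x[~valid])
    return out

z = np.array([1.0, 2.0, np.nan, 4.0, np.nan, np.nan, 7.0, 8.0, np.nan])
zi = interpolate_nan(z)
print(np.isnan(zi).any())    # False
```

Because the toy data lies on a straight line, the filled values fall back on that line, including the trailing NaN, which is only reachable through extrapolation.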

watex.utils.interpolate2d(arr2d, method='slinear', **kws)[source]#

Interpolate the data in a 2D array.

If the data contains missing values, they are replaced by the interpolated values.

Parameters:

arr2d (np.ndarray, shape (N, M)) – 2D dimensional data
method (str, default='slinear') – Interpolation technique to use. Can be nearest or pad.
kws (dict) – Additional keywords. Refer to interpolate1d().

Returns:

arr2d – 2D dimensional data interpolated

Return type:

np.ndarray, shape (N, M)

Examples

>>> from watex.methods.em import EM
>>> from watex.utils.exmath import interpolate2d
>>> # make 2d matrix of frequency
>>> emObj = EM().fit(r'data/edis')
>>> freq2d = emObj.make2d (out = 'freq')
>>> freq2d_i = interpolate2d(freq2d )
>>> freq2d.shape
...(55, 3)
>>> freq2d
... array([[7.00000e+04, 7.00000e+04, 7.00000e+04],
       [5.88000e+04, 5.88000e+04, 5.88000e+04],
       ...
        [6.87500e+00, 6.87500e+00, 6.87500e+00],
        [        nan,         nan, 5.62500e+00]])
>>> freq2d_i
... array([[7.000000e+04, 7.000000e+04, 7.000000e+04],
       [5.880000e+04, 5.880000e+04, 5.880000e+04],
       ...
       [6.875000e+00, 6.875000e+00, 6.875000e+00],
       [5.625000e+00, 5.625000e+00, 5.625000e+00]])

References

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.interp2d.html

watex.utils.label_importance(label, arr_k, arr_aq, *, method='naive')[source]#

Compute the score for the label and its representativity in the valid array ‘arr_k’.

Parameters:

label: int, or string

Class label from the true labels array of permeability coefficient ‘k’. If a string, be sure to convert the array to dtype str. It is recommended to provide data with no NaN to keep full control of the occurrence results.

arr_k: array-like 1d

True labels of array containing the permeability coefficient ‘k’.

arr_aq: array_like 1d

True labels of the groups of aquifers or the predicted naive group of aquifer (NGA labels). See predict_NGA_labels().

method: str [‘naive’, ‘strict’], default=’naive’

The kind of strategy used to compute the representativity of a label in the predicted array ‘arr_aq’. It can also be ‘strict’. Indeed:

‘naive’ computes the importance of the label by the number of its occurrences for this specific label in the array ‘k’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in ‘arr_k’.
‘strict’ computes the importance of the label by the number of occurrences in the whole valid ‘arr_k’, i.e. over the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in ‘arr_k’.

Returns:

label_dict_group_rate: dict, Dictionary of the label and its rate of occurrence in arr_aq. Thus each group in arr_aq has its rate of representativity of the label in arr_k.

label k = 1 :

{‘V’: 0.17, ‘IV’: 0.141, ‘II’: 0.126, ‘III’: 0.084, ‘IV&V’: 0.005, ‘II&III’: 0.003, ‘III&IV’: 0.003}

label k = 2 :

{‘III’: 0.052, ‘II’: 0.05, ‘V’: 0.05, ‘IV’: 0.034, ‘III&IV’: 0.005}

label k = 3 :

{‘V’: 0.123, ‘IV’: 0.086, ‘III’: 0.068}

>>> # **comments:
    # label k=1 has 17% importance for group V and 12.6% for group II, whereas
    # label k=2 has a weak rate in the whole dataset, ~19% over all groups.
    # The most dominant labels are k=1 and k=3 with 53.14% and 27.74%
    # respectively in the dataset.
    # If the threshold of representativity is set to 50%, none of the true
    # labels k will fit any aquifer group, since the max representativity
    # score is 17%, for group V with k=1.

watex.utils.labels_validator(t, /, labels, return_bool=False)[source]#

Assert the validity of the labels in the target. Return the labels, or a boolean indicating whether all items of labels are in the target.

Parameters:

t – array-like, target that is expected to contain the labels.
labels – int, str or list of (str or int), the labels expected to be in the target t.
return_bool – bool, default=False; if set to True, returns ‘True’ or ‘False’ rather than the labels.

Returns:

bool or labels; ‘True’ or ‘False’ if return_bool is set to True, and the labels otherwise.

Example:

>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget, labels_validator
>>> _, y = fetch_data ('bagoue', return_X_y=True, as_frame=True)
>>> # binarize target y into [0 , 1]
>>> ybin = cattarget(y, labels=2 )
>>> labels_validator (ybin, [0, 1])
... [0, 1] # all labels exist.
>>> labels_validator (y, [0, 1, 3])
... ValueError: Value '3' is missing in the target.
>>> labels_validator (ybin, 0 )
... [0]
>>> labels_validator (ybin, [0, 5], return_bool=True ) # no raise error
... False
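
The behaviour shown in the example can be approximated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the watex source: `check_labels` is a hypothetical stand-in, and the exact error message of the library may differ.

```python
import numpy as np

def check_labels(t, labels, return_bool=False):
    """Hypothetical stand-in for a label validity check."""
    if isinstance(labels, (int, str)):
        labels = [labels]  # accept a single label
    present = set(np.unique(np.asarray(t)).tolist())
    missing = [lab for lab in labels if lab not in present]
    if return_bool:
        return not missing  # True only if every label is in the target
    if missing:
        raise ValueError(f"Value {missing[0]!r} is missing in the target.")
    return list(labels)

ybin = [0, 1, 1, 0, 1]
found = check_labels(ybin, [0, 1])
single = check_labels(ybin, 0)
flag = check_labels(ybin, [0, 5], return_bool=True)
```

Returning a boolean instead of raising is convenient when the check is only one condition among several in calling code.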

watex.utils.linkage_matrix(df, columns=None, kind='design', metric='euclidean', method='complete', as_frame=False, optimal_ordering=False)[source]#

Compute the distance matrix from the hierarchical clustering algorithm.

Parameters:

df (dataframe or NDArray of (n_samples, n_features)) – DataFrame or ndarray. If an array is given, the column names must be specified and must match the second dimension of the array shape.
columns (list) – list of labels to name each column of the array of (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.
kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – kind of approach used to build the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix; this is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. The squareform distance matrix can also be used, but it may yield different distance values than expected. The design approach uses the complete input example matrix, also called the ‘design matrix’, and leads to a correct linkage matrix similar to squareform and condense.
metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
method (str, optional, default is {'complete'}) – The linkage algorithm to use. See scipy.cluster.hierarchy.linkage() for full descriptions of the linkage methods.
optimal_ordering (bool, optional) – If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. defaults to False, because this algorithm can be slow, particularly on large datasets. See also scipy.cluster.hierarchy.linkage().

Returns:

row_clusters – consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members of each cluster, and the third column reports the distance between those members.

Return type:

linkage matrix
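
As an illustration of the ‘design’ and ‘condense’ inputs described above, the sketch below calls SciPy directly rather than watex. It assumes SciPy is installed; the random data and variable names are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 samples, 3 features

# 'design' input: pass the full (n_samples, n_features) matrix.
Z_design = linkage(X, method="complete", metric="euclidean")

# 'condense' input: pass the condensed distance vector returned by pdist.
Z_condensed = linkage(pdist(X, metric="euclidean"), method="complete")

same = bool(np.allclose(Z_design, Z_condensed))
```

Both calls produce the same (n_samples - 1, 4) linkage matrix, one row per merge.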

watex.utils.magnitude(cz)[source]#

Compute the magnitude of selected conductive zone.

The magnitude parameter is the absolute difference between the minimum \(\min \rho_a\) and maximum \(\max \rho_a\) resistivity values of the selected anomaly:

\[magnitude=|\min\rho_a -\max\rho_a|\]

Parameters: cz – array-like. Array of apparent resistivity values composing the conductive zone.
Returns: Absolute value of anomaly magnitude in ohm.meters.
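
The formula can be checked by hand with NumPy. The resistivity values below are invented for illustration only.

```python
import numpy as np

# apparent resistivity values (ohm.m) of a hypothetical conductive zone
cz = np.array([80.0, 50.0, 30.0, 45.0, 90.0])

# magnitude = |min(rho_a) - max(rho_a)|
magnitude = float(np.abs(cz.min() - cz.max()))
```

Here the minimum is 30 and the maximum is 90, so the magnitude is 60 ohm.meters.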

watex.utils.makeCoords(reflong, reflat, nsites, *, r=45.0, utm_zone=None, step='1km', order='+', todms=False, is_utm=False, raise_warning=True, **kws)[source]#

Generate multiple stations coordinates (longitudes, latitudes) from a reference station/site.

One degree of latitude equals approximately 364,000 feet (69 miles), one minute equals 6,068 feet (1.15 miles), and one second equals 101 feet. One degree of longitude equals 288,200 feet (54.6 miles), one minute equals 4,800 feet (0.91 mile), and one second equals 80 feet (1 foot ≈ 0.3048 meter).

Parameters:

reflong (float or string or list of [start, stop]) – Reference longitude in decimal degrees or in DD:MM:SS for the first site, considered as the origin of the landmark.
reflat (float or string or list of [start, stop]) – Reference latitude in decimal degrees or in DD:MM:SS for the reference site, considered as the landmark origin. If the value is given as a list, it can contain the start point and the stop point.
nsites (int or float) – Number of sites to generate the coordinates for.
r (float or int) – Rotation angle in degrees. It sets the direction of the projection line. Default is 45 degrees.
step (float or str) – Offset, i.e. the distance of separation between sites, in meters. If the value is given as a string, any unit other than km is considered as meters. Only meters and kilometers are acceptable.
order (str) – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with angle r set to 45 degrees. Could be - for descending order. Any other value is treated as ascending order.
is_utm (bool) – Consider the first two positional arguments as UTM coordinate values, i.e. assume reflong and reflat are the UTM ‘easting’ and ‘northing’. If is_utm is False, any value greater than 180 degrees for longitude and 90 degrees for latitude will raise an error. Default is False.
utm_zone (string (##N or ##S)) – UTM zone in the form of a number followed by the North or South hemisphere, e.g. 10S or 03N. Must be given if is_utm is set to True.
todms (bool) – Convert the decimal degree values into DD:MM:SS. Default is False.
raise_warning (bool, default=True) – Raise warnings if GDAL is not set or about the accuracy status of the coordinates.
kws (dict) – Additional keywords of gistools.project_point_utm2ll().

Returns:

Tuple of generated projected coordinates (longitudes and latitudes), either in decimal degrees or in DD:MM:SS.

Notes

The distances vary. A degree, minute, or second of latitude remains fairly constant from the equator to the poles; however a degree, minute, or second of longitude can vary greatly as one approaches the poles and the meridians converge.
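
The convergence of the meridians can be quantified with a spherical-Earth approximation. The sketch below is not part of watex: the function name and the mean Earth radius constant are assumptions, and a real geodetic computation would use an ellipsoid.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def meters_per_degree(lat_deg):
    """Approximate ground length (m) of one degree of latitude and longitude."""
    lat_len = math.pi * EARTH_RADIUS_M / 180.0            # nearly constant
    lon_len = lat_len * math.cos(math.radians(lat_deg))   # shrinks toward poles
    return lat_len, lon_len

lat_eq, lon_eq = meters_per_degree(0.0)    # at the equator
lat_60, lon_60 = meters_per_degree(60.0)   # at 60 degrees north
```

At 60 degrees of latitude, one degree of longitude spans about half the distance it spans at the equator, while a degree of latitude is unchanged in this approximation.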

References

https://math.answers.com/Q/How_do_you_convert_degrees_to_meters

Examples

>>> from watex.utils.coreutils import makeCoords
>>> rlons, rlats = makeCoords('110:29:09.00', '26:03:05.00',
...                                     nsites = 7, todms=True)
>>> rlons
... array(['110:29:09.00', '110:29:35.77', '110:30:02.54', '110:30:29.30',
       '110:30:56.07', '110:31:22.84', '110:31:49.61'], dtype='<U12')
>>> rlats
... array(['26:03:05.00', '26:03:38.81', '26:04:12.62', '26:04:46.43',
       '26:05:20.23', '26:05:54.04', '26:06:27.85'], dtype='<U11')
>>> rlons, rlats = makeCoords ((116.7, 119.90) , (44.2 , 40.95),
...                                     nsites = 238, step =20. ,
...                                     order = '-', r= 125)
>>> rlons
... array(['119:54:00.00', '119:53:11.39', '119:52:22.78', '119:51:34.18',
       '119:50:45.57', '119:49:56.96', '119:49:08.35', '119:48:19.75',
       ...
       '116:46:03.04', '116:45:14.43', '116:44:25.82', '116:43:37.22',
       '116:42:48.61', '116:42:00.00'], dtype='<U12')
>>> rlats
... array(['40:57:00.00', '40:57:49.37', '40:58:38.73', '40:59:28.10',
       '41:00:17.47', '41:01:06.84', '41:01:56.20', '41:02:45.57',
       ...
   '44:07:53.16', '44:08:42.53', '44:09:31.90', '44:10:21.27',
   '44:11:10.63', '44:12:00.00'], dtype='<U11')

watex.utils.make_MXS_labels(y_true, y_pred, threshold=None, similar_labels=None, sep=None, prefix=None, method='naive', trailer='*', return_obj=False, **kws)[source]#

Create Mixture Learning Strategy (MXS) labels from the true labels ‘y_true’ and the predicted Naive Group of Aquifer (NGA) labels ‘y_pred’.

Parameters:

y_true (array-like 1d, pandas.Series) – Array composed of valid k-values and possible missing k-values.
y_pred (Array-like 1d, pandas.Series) – Array composed of the valid NGA labels. Note that the NGA labels are predicted labels, mostly obtained using unsupervised learning.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to the one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to other labels is considered the most representative and similar. Setting the rule by fixing the threshold is recommended, especially for a huge dataset.
similar_labels (list of tuple, optional) – list of tuples in pairs (label, similar group). If given, the similar group must be a label existing in the predicted NGA. If None, the auto-similarity is triggered.
sep (str, default='') –
Separator between the true labels ‘y_true’ and the predicted NGA labels. sep is used to rewrite the MXS labels. Mostly, an MXS label is a combination of the true label of the permeability coefficient ‘k’ and the NGA label, composing a new similarity label. For instance
```
>>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'UV']
>>> # gives
>>> MXS_labels= ['k1_II', 'k2_I', 'k3_UV']
```
where the separator sep is set to _. This happens especially when one of the labels (NGA or true_labels) is not a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’ and so on.
prefix (str, default='') –
prefix is used to rename the true_labels, i.e. the true valid k. For instance:
```
>>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
```
where ‘k’ is the prefix.
method (str ['naive', 'strict'], default='naive') –
The strategy used to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’:
- naive computes the importance of the label from the number of occurrences of this specific label in the array ‘y_true’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in y_true.
- strict computes the importance of the label from its number of occurrences in the whole valid y_true, i.e. relative to the total number of occurrences of all the labels that exist in the whole ‘y_pred’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
trailer (str, default='*') –
The Mixture strategy marker used to differentiate the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same class labels are also present in the true labels with the same label-identifier name. This is useful to avoid any confusion between the labels in y_true and y_pred, for better demarcation and distinction. Note that if the trailer is set to None and both y_true and y_pred are numeric data, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance
```
>>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
>>> # with trailer , MXS labels should be
>>>  MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
>>> # with no trailer
>>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
```
return_obj (bool, default=False) –
If True, returns an MXS object (watex.utils.box.Boxspace) with useful attributes such as:
- mxs_classes_ = the MXS class labels
- mxs_labels_ = the array-like of MXS labels. It also includes some non-similar labels from NGA
- mxs_map_classes_ = a dict of the original class labels of the array ’k’ <’y_true’> and their temporary integer class labels. Indeed, if the ‘y_true’ class labels are not of a numeric dtype, new labels with an integer dtype are created. The dict is used to wrap the true labels (original ones) during the MXS creation. Thus, the original labels are not altered and are mapped back at the end to recover their positions in the new MXS array. It is set to ‘None’ if ‘y_true’ has a numeric dtype.
- mxs_group_classes_ = dict of all the similar group labels with the MXS labels related to the modified existing groups of NGA. Note that the non-similar groups are modified if their labels are also found in the true_labels, to avoid any confusion. Thus the dict wraps the non-similar labels with their new temporary labels.
- mxs_similar_groups_ = list of the similar labels found in y_true that have a similarity in NGA.
- mxs_similarity_ = tuple of similarity in pairs (label, group) existing between the class labels in y_true and NGA.
- mxs_group_labels_ = list of the similar groups found in the predicted NGA that have a similarity in the true labels ‘y_true’.

Returns:

MXS – array-like of MXS labels or MXS object containing the useful attributes.

Return type:

array-like 1d or Boxspace

See also

predict_NGA_labels: Predicts Naive group of Aquifers labels.

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import classify_k, make_MXS_labels
>>> data = load_hlogs ().frame
>>> # map data.k to categorize k values
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs = make_MXS_labels (ymap, data.aquifer_group)
>>> y_mxs[14:24]
...  array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> mxs_obj = make_MXS_labels (ymap, data.aquifer_group, return_obj=True )
>>> mxs_obj.mxs_labels_[14: 24]
... array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> # now do the same task using the private data 'hf.csv'
>>> # composed of 11 boreholes. By default, we alternatively use
>>> # the aquifer groups as a fake NGA
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap =  classify_k(data.k , default_func =True)
>>> y_mxs= make_MXS_labels (ymap, data.aquifer_group)
>>> np.unique (y_mxs)
... array(['1', '1V', '2', '2III', '3', 'I', 'II', 'III&IV', 'IV'],
      dtype='<U6')
>>> # *comments:
    # label '1V' means that group V (expected to be a cluster)
    # and label 1 (true labels) have a similarity.
    # The same holds for label '2III', while the remaining label 3 does not
    # have any similarity with the other labels in 'y_pred', expected
    # to be NGA labels.
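
How sep and prefix shape the resulting labels can be pictured with a small pure-Python sketch. It is only a reading of the parameter descriptions and of the example output, not the watex algorithm: `compose_mxs` is a hypothetical helper, missing k-values are represented by None, and the similarity pairs are supplied by hand instead of being detected.

```python
def compose_mxs(y_true, y_pred, similarity, sep="", prefix=""):
    """Hypothetical helper, not the watex implementation."""
    similar = dict(similarity)  # true label -> similar NGA group
    mxs = []
    for k, group in zip(y_true, y_pred):
        if k is None:                    # missing k: fall back to the NGA label
            mxs.append(str(group))
        elif similar.get(k) == group:    # similar pair: merge both labels
            mxs.append(f"{prefix}{k}{sep}{group}")
        else:                            # valid k without similarity
            mxs.append(f"{prefix}{k}")
    return mxs

labels = compose_mxs([1, 2, 3, None], ["II", "I", "UV", "II"],
                     similarity=[(1, "II"), (2, "I")], sep="_", prefix="k")
```

With sep="_" and prefix="k", the similar pairs become 'k1_II' and 'k2_I', the label 3 stays 'k3', and the sample with a missing k keeps its NGA label.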

watex.utils.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)[source]#

Make a pipeline to transform data at once.

A naive pipeline is useful to quickly preprocess the data at once for a fast prediction.

Works with a pandas dataframe. If no features are set, the numerical and categorical features are automatically retrieved.

Parameters:

X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
num_features (list or str, optional) – List of numerical features. If num_features are given whereas cat_features are None, cat_features are figured out automatically.
cat_features (list of str, optional) – List of categorical features. If cat_features are given whereas num_features are None, num_features are figured out automatically.
label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – kind of encoding used to encode label. This assumes ‘y’ is supplied.
scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – kind of scaling used to scale the numerical data. Note that for the categorical data encoding, ‘sklearn.preprocessing.OneHotEncoder’ is used under the hood instead.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
impute_strategy (str, default='median') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
  
sparse_output (bool, default=True) – Used when the label y is given. Binarize labels in a one-vs-all fashion. If True, the array returned from transform is in sparse CSR format.
for_pca (bool, default=False) – Transform data for principal component analysis (PCA). If set to True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.
transform (bool, default=False) – Transform the data in place rather than returning the naive pipeline.

Returns:

full_pipeline (watex.exlib.sklearn.FeatureUnion) –
- Full pipeline composed of numerical and categorical pipes
(X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) –
- Transformed data.

Examples

>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs

(1) Make a naive simple pipeline with RobustScaler, StandardScaler

>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs (as_frame=True )# get all the data
>>> pipe = make_naive_pipe(X_, scaler =RobustScaler )

(2) Transform X in place with numerical and categorical features with StandardScaler (default). Returned CSR matrix

>>> make_naive_pipe(X_, transform =True )
... <181x40 sparse matrix of type '<class 'numpy.float64'>'
    with 2172 stored elements in Compressed Sparse Row format>

watex.utils.moving_average(y, *, window_size=3, method='sma', mode='same', alpha=0.5)[source]#

A moving average is used with time series data to smooth out short-term fluctuations and highlight longer-term trends or cycles.

The function analyzes data points by creating a series of averages of different subsets of the full data set.

Parameters:

y (array_like, shape (N,)) – the values of the time history of the signal.
window_size (int) – the length of the window. Must be greater than 1 and preferably an odd integer. Default is 3.
method (str) – variant of moving average. Can be sma, cma, wma and ema for simple, cumulative, weighted and exponential moving average. Default is sma.
mode (str) – convolution mode. Can be full, same and valid. full returns the convolution at each point of overlap, with an output shape of (N+M-1,). At the end-points of the convolution, the signals do not overlap completely, and boundary effects may be seen. See numpy.convolve for more details. Default is same.
alpha (float) – smoothing factor. Only used in the exponential moving average. Default is .5.

Returns:

ya – Averaged time history of the signal

Return type:

array like, shape (N,)

Notes

The first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next value in the subset.
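
The shifting-window idea can be sketched with NumPy. The two helpers below are illustrative assumptions, not the watex code: `sma` uses a uniform convolution kernel, and `ema` uses the common recursive form seeded with the first sample, which may differ from the variant implemented in the library.

```python
import numpy as np

def sma(y, window_size=3, mode="same"):
    """Simple moving average as a convolution with uniform weights."""
    weights = np.ones(window_size) / window_size
    return np.convolve(y, weights, mode=mode)

def ema(y, alpha=0.5):
    """Exponential moving average: out[i] = alpha*y[i] + (1-alpha)*out[i-1]."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]  # seed with the first sample
    for i in range(1, len(y)):
        out[i] = alpha * y[i] + (1 - alpha) * out[i - 1]
    return out

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sma_valid = sma(y, 3, mode="valid")  # only fully overlapping windows
sma_same = sma(y, 3, mode="same")    # same length as y, edges zero-padded
ema_half = ema(y, alpha=0.5)
```

With mode="valid" the output has N - M + 1 points and no boundary effect; with mode="same" the first and last values are biased because the window overlaps implicit zeros.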

Examples

>>> import numpy as np ; import matplotlib.pyplot as plt
>>> from watex.utils.exmath  import moving_average
>>> data = np.random.randn (37)
>>> # add gaussian noise to the data
>>> data = 2 * np.sin( data)  + np.random.normal (0, 1 , len(data))
>>> window = 5  # fixed size to 5
>>> # window_size is keyword-only in the signature above
>>> sma = moving_average(data, window_size=window)
>>> cma = moving_average(data, window_size=window, method ='cma' )
>>> wma = moving_average(data, window_size=window, method ='wma' )
>>> ema = moving_average(data, window_size=window, method ='ema' , alpha =0.6)
>>> x = np.arange(len(data))
>>> plt.plot (x, data, 'o', x, sma , 'ok--', x, cma, 'g-.', x, wma, 'b:')
>>> plt.legend (['data', 'sma', 'cma', 'wma'])

watex.utils.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, **fit_params)[source]#

Impute missing values in the data.

When the data contains categorical features, the ‘bi-impute’ argument passed to the ‘mode’ parameter imputes both the numerical and categorical features rather than raising an error when the ‘strategy’ is not set to ‘most_frequent’.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data whose missing values are imputed.
y (None) – Not used, present here for API consistency by convention.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
  
mode (str, ['bi-impute'], default=None) – If mode is set to ‘bi-impute’, it imputes both the numerical and categorical features and returns a single imputed dataframe.
drop_features (bool or list, default=False) – drop a list of features in the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
verbose (int, default=0) – Controls the verbosity of the imputer.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
- If X is not an array of floating values;
- If X is encoded as a CSR matrix;
- If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
fit_params (dict) – keyword arguments passed to the scikit-learn fitting parameters. More details on https://scikit-learn.org/stable/

Returns: Xi – Data imputed
Return type: Dataframe, array-like, sparse matrix of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X= np.random.randn ( 7, 4 )
>>> X[3, :] =np.nan  ; X[:, 3][-4:]=np.nan
>>> naive_imputer  (X)
... array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
           [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
           [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
           [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
           [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
           [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
           [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame (X, columns =['a', 'b', 'c', 'd']  )
>>> # change [bc] types to categorical values.
>>> frame['b']=['pineaple', '', 'cabbage', 'watermelon', 'onion',
                'cabbage', 'onion']
>>> frame['c']=['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> naive_imputer(frame, mode ='bi-impute')
...             b      c         a         d
    0    pineaple   lion  1.347835  0.434558
    1     cabbage    cat  0.368432 -0.293720
    2     cabbage    cat -1.689750  0.189391
    3  watermelon    cat  0.060138  0.110076
    4       onion    dog -0.271291  0.110076
    5     cabbage    cat  1.092240  0.110076
    6       onion  mouse -0.486639  0.110076
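
The ‘mean’ strategy on purely numerical data amounts to replacing each NaN by the mean of its column, which is why the imputed entries of a column share the same value in the first output above. The NumPy sketch below illustrates that idea only; it does not call watex or scikit-learn, and the small array is made up.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# column means computed while ignoring the missing entries
col_means = np.nanmean(X, axis=0)

# replace each NaN by the mean of its own column (broadcast along rows)
X_imputed = np.where(np.isnan(X), col_means, X)
```

The ‘median’ strategy follows the same pattern with np.nanmedian in place of np.nanmean.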

watex.utils.naive_scaler(X, y=None, *, kind=<class 'sklearn.preprocessing._data.StandardScaler'>, copy=True, with_mean=True, with_std=True, feature_range=(0, 1), clip=False, norm='l2', **fit_params)[source]#

Quick data scaling using both strategies implemented in scikit-learn with StandardScaler and MinMaxScaler.

The function returns a scaled dataframe if a dataframe is passed, otherwise an ndarray. For other kinds of scaling, call scikit-learn instead.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
kind (str, default='StandardScaler') – Kind of data scaling. Can also be [‘MinMaxScaler’, ‘Normalizer’]. The default is ‘StandardScaler’
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
clip (bool, default=False) – Set to True to clip transformed values of held-out data to provided feature range.
fit_params (dict) – keyword arguments passed to the scikit-learn fitting parameters. More details on https://scikit-learn.org/stable/

Returns:

X_sc – Transformed array.

Return type:

{ndarray, sparse matrix} or dataframe of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_scaler
>>> X= np.random.randn (7 , 3 )
>>> X_std = naive_scaler (X )
>>> X_std
... array([[ 0.17439644,  1.55683005,  0.24115109],
       [-0.59738672,  1.3166854 ,  1.23748004],
       [-1.6815365 , -1.19775838,  0.71381357],
       [-0.1518278 , -0.32063059, -0.47483155],
       [-0.41335886,  0.13880519,  0.69258621],
       [ 1.45221902, -1.03852015, -0.40157981],
       [ 1.21749443, -0.45541153, -2.00861955]])
>>> # use dataframe
>>> Xdf = pd.DataFrame (X, columns =['a', 'c', 'c'])
>>> naive_scaler (Xdf , kind='Normalizer') # return data frame
...           a         c         c
    0  0.252789  0.967481 -0.008858
    1 -0.265161  0.908862  0.321961
    2 -0.899863 -0.416231  0.130380
    3  0.178203  0.039443 -0.983203
    4 -0.418487  0.800306  0.429394
    5  0.933933 -0.309016 -0.179661
    6  0.795234 -0.051054 -0.604150

watex.utils.normalizer(arr, /, method='naive')[source]#

Normalize values to be between 0 and 1.

This normalizer handles NaN values and translates each value such that the data lies in the given range, e.g. between zero and one.

Note that when the method is set to 'MinMax', the transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_normed = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Parameters:

arr (array-like) – Array to normalize; it can contain NaN values.
method (str) – Normalization method. Can use the ‘scikit-learn’ MinMaxScaler for normalization. Any other value uses the naive normalization.

Returns:

arr_norm – Normalized array.

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import normalizer
>>> np.random.seed (42)
>>> arr = np.random.randn (3, 2 )
>>> arr
array([[ 0.49671415, -0.1382643 ],
       [ 0.64768854,  1.52302986],
       [-0.23415337, -0.23413696]])
>>> normalizer (arr )
array([[4.15931313e-01, 5.45697636e-02],
       [5.01849720e-01, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max')  # normalize data along axis=0
array([[0.82879654, 0.05456093],
       [1.        , 1.        ],
       [0.        , 0.        ]])
>>> arr [0, 1] = np.nan; arr [1, 0] = np.nan
>>> normalizer (arr )
array([[4.15931313e-01,            nan],
       [           nan, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max')
array([[ 1., nan],
       [nan,  1.],
       [ 0.,  0.]])
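
The outputs above are consistent with two readings: the naive method rescales with the global minimum and maximum of the whole array, while the min-max method rescales each column separately. The NumPy sketch below checks this interpretation on the same numbers; it is an inference from the printed results, not a copy of the watex code.

```python
import numpy as np

arr = np.array([[ 0.49671415, -0.1382643 ],
                [ 0.64768854,  1.52302986],
                [-0.23415337, -0.23413696]])

# naive reading: one global min and max, NaN ignored
naive = (arr - np.nanmin(arr)) / (np.nanmax(arr) - np.nanmin(arr))

# min-max reading: min and max taken per column (axis=0), NaN ignored
col_min = np.nanmin(arr, axis=0)
col_max = np.nanmax(arr, axis=0)
per_column = (arr - col_min) / (col_max - col_min)
```

Using np.nanmin and np.nanmax keeps NaN entries as NaN in the result while the remaining values are still scaled, matching the behaviour shown in the last two calls of the example.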

watex.utils.ohmicArea(data=None, search=45.0, sum=False, objective='ohmS', **kws)[source]#

Compute the ohmic-area from the Vertical Electrical Sounding data collected in exploration area.

Parameters:

data (pandas DataFrame) – contains the depth measurements AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. The search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45 m, so search can be set to the average water inrush value.
objective (str) – Type of operation to output. By default, the function outputs the value of the pseudo-area in \(\Omega .m^2\). However, for plotting purposes, setting the argument to view gives alternative outputs: X and Y, recomputed and projected, as well as the X and Y values of the expected fractured zone, where X is the AB dipole spacing when imaging to the depth and Y is the computed apparent resistivity.
kws (dict) – Additional keyword arguments from VES data operations. See watex.utils.exmath.vesDataOperator() for further details.

Returns:

List of two tuples:

Tuple (ohmS, error, roots):
- ohmS is the computed pseudo-area, expected to be a fractured zone
- error is the integration error
- roots are the integration boundaries of the expected fractured zone, where the basement rock is located above the resistivity transform function. At these points the difference between both curves is null.
Tuple (XY, fit XY, XYohmSarea):
- XY is the ndarray(nvalues, 2) of the processed AB dipole spacing and resistivity rhoa values.
- fit XY is the fitting ndarray(nvalues, 2) used to redraw the dummy resistivity transform function.
- XYohmSarea is the ndarray(nvalues, 2) of the dipole spacing and resistivity values of the expected fracture zone.

Raises:

VESError: If the search is greater or equal to the maximum investigation depth in meters.

Notes

The ohmS value calculated from the pseudo-area is a fully data-driven parameter, used to evaluate a pseudo-area of the fracture zone from the depth where the basement rock is supposed to start. Usually, when exploring deeper with Vertical Electrical Sounding, we are looking for groundwater in the fractured rock that lies outside the anthropic pollution (Biemi, 1992). Since VES is an indirect method, we cannot ascertain whether the presumed fractured rock contains water. However, we assume that the fracture zone could exist and should contain groundwater. Mathematically, based on the VES1D model proposed by Koefoed, O. (1976), we consider a function \(\rho_T(l)\), a set of reducing resistivity transform functions to lower the boundary plane at half the current electrode spacing \((l)\). From the sounding curve \(\rho_T(l)\), an imaginary basement rock curve \(b_r (l)\) of slope equal to 45° with the horizontal \(h(l)\) is created. A pseudo-area \(S(l)\) is defined by extending the \(b_r (l)\) curve from \(h(l)\) when the sounding curve \(\rho_T(l)\) is below \(b_r(l)\); otherwise \(S(l)\) is equal to null. The computed area is called the ohmic-area \(ohmS\), expressed in \(\Omega .m^2\), and constitutes the expected fractured zone. Thus \(ohmS \neq 0\) confirms the existence of the fracture zone while \(ohmS=0\) raises doubts. The equation to determine the parameter is given as:

\[ \begin{align}\begin{aligned}ohmS & = &\int_{ l_i}^{l_{i+1}} S(l)dl \quad {s.t.}\\\begin{split}S(l) & = & b_r (l) - \rho_T (l) \quad \text{if} \quad b_r (l) > \rho_T (l) \\ & = & 0. \quad \text{if} \quad b_r (l) \leq \rho_T (l)\end{split}\\b_r(l) & = & l + h(l) \quad ; \quad h(l) = \beta\\\rho_T(l) & = & l^2 \int_{0}^{\infty} T_i( \lambda ) h_1( \lambda l) \lambda d\lambda\end{aligned}\end{align} \]

where \(l_i\) and \(l_{i+1}\) solve the equation \(S(l)=0\); \(l\) is half the current electrode spacing \(AB/2\); \(h_1\) denotes the first-order Bessel function of the first kind; \(\beta\) is the y-axis coordinate of the intercept of \(b_r(l)\) and \(h(l)\); \(T_i(\lambda )\) is the resistivity transform function; \(\lambda\) denotes the integral variable; \(n\) denotes the number of layers; and \(\rho_i\) and \(h_i\) are the resistivity and thickness of the \(i\)-th layer, respectively. Further explanation and clarification of the formula are given in the paper of Kouadio et al. 2022.
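As a rough illustration of the integral above, the sketch below estimates \(ohmS\) from curves that are already sampled, using the trapezoidal rule. It is not the library's implementation: ohmicArea fits the sounding curve and locates the roots before integrating, whereas this hypothetical helper ohmic_area simply clips \(S(l)\) at zero on toy values.

```python
def ohmic_area(l, rho_t, beta=0.0):
    """Trapezoidal estimate of ohmS from sampled curves (illustrative only).

    l     : increasing half-spacings AB/2 (hypothetical sample points)
    rho_t : sounding-curve values rho_T(l) at those points
    beta  : intercept h(l) of the imaginary basement line b_r(l) = l + beta
    """
    # S(l) = b_r(l) - rho_T(l) where the basement line lies above the curve, else 0
    s = [max((x + beta) - r, 0.0) for x, r in zip(l, rho_t)]
    # integrate S(l) dl with the trapezoidal rule
    return sum(
        0.5 * (s[i] + s[i + 1]) * (l[i + 1] - l[i])
        for i in range(len(l) - 1)
    )


# Toy values, not survey data
area = ohmic_area([0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
no_area = ohmic_area([0.0, 1.0, 2.0], [10.0, 10.0, 10.0])
```

A non-zero result corresponds to the case where the sounding curve dips below the basement line; a zero result corresponds to the doubtful case described above.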


Examples

>>> from watex.utils.exmath import ohmicArea
>>> from watex.utils.coreutils import vesSelector
>>> data = vesSelector (f= 'data/ves/ves_gbalo.xlsx')
>>> (ohmS, err, roots), *_ = ohmicArea(data = data, search =45, sum =True )
>>> ohmS, err, roots
(13.46012197818152, array([5.8131967e-12]), array([45.        , 98.07307307]))
>>> # pseudo-area is computed between the spacing point AB =[45, 98] depth.
>>> _, (XY, XYfit, XYohms_area) = ohmicArea(
...                 AB= data.AB, rhoa =data.resistivity, search =45,
...                 objective ='plot')
>>> XY.shape, XYfit.shape, XYohms_area.shape
((26, 2), (1000, 2), (8, 2))

watex.utils.plotAnomaly(erp, cz=None, station=None, fig_size=(10, 4), fig_dpi=300, savefig=None, show_fig_title=True, style='seaborn', fig_title_kws=Ellipsis, czkws=Ellipsis, legkws=Ellipsis, how='py', **kws)[source]#

Plot the whole Electrical Resistivity Profiling line and selected conductive zone.

The conductive zone can be supplied manually as a subset of the erp or by specifying the station expected for the drilling location, for instance S07 for the seventh station. Furthermore, for automatic detection, one should set the station argument to auto. However, it is recommended to provide the cz or the station to keep full control. The conductive zone is overlaid on the whole Electrical Resistivity Profiling survey. The user can customize the cz plot by passing additional Matplotlib pyplot keyword arguments through czkws.

Parameters:

erp: array_like 1d

The Electrical Resistivity Profiling survey line. The line is an array of resistivity values. Note that if a dataframe is passed, be sure that the frame matches the DC resistivity data (ERP), otherwise an error occurs. The frame columns must include at least the resistivity and the stations.

cz: array_like 1d

The selected conductive zone. If None, only the erp is displayed. Note that cz is a subset of the erp array.

station: str, optional

The station location given as a string (e.g. station="S10") or as a station number (indexing; e.g. station=10). If the value is set to "auto", the station is found automatically and cz is fetched as well.

fig_size: tuple, default =(10, 4)

Tuple value of figure size. Refer to the web resources Matplotlib figure.

fig_dpi: int, default=300

Figure resolution in dots per inch. Refer to Matplotlib figure.

savefig: str, optional,

save the figure. Refer to Matplotlib figure.

show_fig_title: bool, default =True

display the title of the figure.

fig_title_kws: dict,

Keyword arguments of the figure suptitle. Refer to Matplotlib figsuptitle.

style: str, default='seaborn'

The style for customizing the visualization. For instance, to get the first seven available styles in pyplot, one can run the script below:

plt.style.available[:7]

Further details can be found in the web resources below or on GeeksforGeeks.

how: str, default=’py’

By default (how='py'), stations are named following Python indexing, i.e. counting starts from station 00 (S00). Any other value starts the station naming from 1.

czkws: dict,

Additional Matplotlib pyplot keyword arguments to customize the cz plot.

legkws: dict,

Additional Matplotlib legend keyword arguments.

kws: dict,

Additional Matplotlib pyplot keyword arguments to customize the erp plot.
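The how parameter above controls how stations are named. The sketch below mirrors the described convention with a hypothetical helper, station_names, that is not part of watex; the two-digit zero padding is an assumption drawn from labels such as S00 and S07 used in this page.

```python
def station_names(n_stations, how="py"):
    """Build station labels (hypothetical helper, not the watex API).

    how='py' counts from S00 (Python indexing); any other value counts from S01.
    """
    start = 0 if how == "py" else 1
    # two-digit zero padding is assumed from the 'S00' / 'S07' examples
    return [f"S{i:02d}" for i in range(start, start + n_stations)]


py_names = station_names(3)              # Python-style indexing
other_names = station_names(3, how="1")  # naming starts from 1
```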

See also

watex.erpSmartDetector: Detects the conductive zone by applying the constraints. Set view=True to visualize the constraints.


Examples

>>> import numpy as np
>>> from watex.utils import plotAnomaly, defineConductiveZone
>>> test_array = np.abs (np.random.randn (10)) *1e2
>>> selected_cz ,*_ = defineConductiveZone(test_array, 7)
>>> plotAnomaly(test_array, selected_cz )
>>> plotAnomaly(test_array, selected_cz , station= 5)
>>> plotAnomaly(test_array, station= 's02')
>>> plotAnomaly(test_array)

watex.utils.plotOhmicArea(data=None, search=45.0, pre_computed=False, xy=None, xyf=None, xyarea=None, colors=None, fbtw=False, **plot_kws)[source]#

Plot the ohmic-area of the Vertical Electrical Sounding data.

Parameters:

data (Dataframe pandas) – contains the depth measurements AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. The search parameter is used to speculate about the expected groundwater in the fractured rocks below the average level of water inrush in a specific area. For instance, in the Bagoue region the average depth of water inrush is around 45 m, so search can be set to the average water-inrush depth.
pre_computed (bool, default=False) – If True, the ohmic-area parameters are taken as already computed and must be passed to xy, xyf and xyarea, otherwise an error is raised. If False, they are computed from data.
xy (array-like of shape (n_AB, 2)) – Array-like of the sanitized depth measurements AB from the current electrodes, where n_AB is the number of measurements. See vesDataOperator().
xyf (array-like of shape (n_fit_samples, 2)) – Array-like of the fitted samples, i.e. the points used for fitting the sounding resistivity values from the surface to the total depth. The fitted rhoa shows a smooth curve. The default number of points is 1000.
xyarea (array-like of shape (n_area, 2)) – Array-like of the resistivity positions of the depth measurements AB where the fractured zone is found.
fbtw (bool, default=False) – If True, fills the computed fractured zone using the parameters computed from xyf and xyarea.
kws (dict) – Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.

Notes

The first and second columns of xy, xyf and xyarea are the positions AB/2 and their corresponding resistivity values.

Examples

>>> from watex.datasets import load_semien
>>> from watex.utils.exmath import plotOhmicArea
>>> ves_data = load_semien ()
>>> plotOhmicArea (ves_data)

watex.utils.plot_clusters(n_clusters, X, y_pred, cluster_centers=None, savefig=None)[source]#

Visualize the clusters that k-means identified in the dataset.

Parameters:

n_clusters – int, number of clusters to visualize
X – NDArray, data containing the features, expected to be two-dimensional
y_pred – array-like, array containing the predicted class labels.
cluster_centers – NDArray containing the coordinates of the centroids or the similar points with continuous features.

Example:

>>> from watex.exlib.sklearn import KMeans, MinMaxScaler
>>> from watex.utils.plotutils import plot_clusters
>>> from watex.datasets import fetch_data
>>> h= fetch_data('hlogs').frame
>>> # collect two features 'resistivity' and gamma-gamma logging values
>>> h2 = h[['resistivity', 'gamma_gamma']]
>>> km = KMeans (n_clusters =3 , init= 'random' )
>>> # scaled the data with MinMax scaler i.e. between ( 0-1)
>>> h2_scaled = MinMaxScaler().fit_transform(h2)
>>> ykm = km.fit_predict(h2_scaled )
>>> plot_clusters (3 , h2_scaled, ykm , km.cluster_centers_ )

watex.utils.plot_confidence_in(z_or_edis_obj_list, /, tensor='res', view='1d', drop_outliers=True, distance=None, c_line=False, view_ci=True, figsize=(6, 2), fontsize=4.0, dpi=300.0, top_label='Stations', rotate_xlabel=90.0, fbtw=True, savefig=None, **plot_kws)[source]#

Plot data confidence from tensor errors.

The default tensor for evaluating the data confidence is the resistivity at TE mode (‘xy’).

Checking confidence in the data before starting the actual processing is worthwhile. In areas with complex terrain, where high topography adds to interference noise, signals are weak or missing, especially in AMT surveys. The most common technique is to eliminate the bad frequencies and interpolate the remaining ones. However, the tricks for eliminating frequencies differ from one author to another. Here, the data confidence indicates which frequencies to eliminate (at which stations/sites) and which ones are still recoverable using the tensor recovering strategy.

The plot implements three levels of confidence:

High confidence: \(conf. \geq 0.95\). Values greater than or equal to 95%.
Soft confidence: \(0.5 \leq conf. < 0.95\). The data in this confidence range can be beneficial for tensor recovery to restore the weak and missing signals.
Bad confidence: \(conf. <0.5\). Data in this interval must be deleted.
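The three levels can be expressed as a simple threshold rule. The sketch below only illustrates the intervals listed above; confidence_level is a hypothetical helper, and how the confidence value itself is derived from the tensor errors is not shown here.

```python
def confidence_level(conf):
    """Map a confidence value in [0, 1] to the level names used above."""
    if conf >= 0.95:
        return "high"   # keep as is
    if conf >= 0.5:
        return "soft"   # candidate for tensor recovery
    return "bad"        # to be deleted


levels = [confidence_level(c) for c in (0.99, 0.95, 0.7, 0.5, 0.2)]
```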

Parameters:

z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='res') – Tensor name. Can be [ resistivity|phase|z|frequency]
view (str, default='1d') – Type of plot. Can be [‘1D’|’2D’]
drop_outliers (bool, default=True) – Suppress the outliers in the data if True.
distance (float, optional) – Distance between stations/sites
fontsize (float, default=4.) – Label font size.
figsize (Tuple, default=(6, 2)) – Figure size.
c_line (bool, default=False) – Display the confidence line in the two-dimensional view.
dpi (float, default=300.) – Image resolution in dots per inch
rotate_xlabel (float, default=90.) – Angle to rotate the stations/sites labels
top_label (str, default='Stations') – Label of the sites, e.g. the survey name.
view_ci (bool, default=True) – Show the marker of the confidence interval.
fbtw (bool, default=True) – Fill between the confidence interval.
plot_kws (dict) – Additional keywords passed to plot()

See also

watex.methods.Processing.zrestore: For more details about the function for tensor recovering technique.

Examples

>>> from watex.utils.exmath import plot_confidence_in
>>> from watex.datasets import fetch_data
>>> emobj  = fetch_data ( 'huayuan', samples = 25, clear_cache =True,
                         key='raw').emo
>>> plot_confidence_in (emobj.ediObjs_ ,
                        distance =20 ,
                        view ='2d',
                        figsize =(6, 2)
                        )
>>> plot_confidence_in (emobj.ediObjs_ , distance =20 ,
                        view ='1d', figsize =(6, 3), fontsize =5,
                        )

watex.utils.plot_confusion_matrices(clfs, Xt, yt, annot=True, pkg=None, normalize='true', sample_weight=None, encoder=None, fig_size=(22, 6), savefig=None, subplot_kws=None, **scorer_kws)[source]#

Plot inline multiple model confusion matrices using either scikit-learn or ‘yellowbrick’.

Parameters:

clfs (list of classifier estimators) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. Note that the classifier must be fitted beforehand.
Xt (ndarray or DataFrame of shape (M, N)) – A matrix of M instances with N features. Preferably, the matrix represents the test data for error evaluation.
yt (ndarray of shape (M, ) or Series of length (M, )) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.
pkg (str, optional , default ='sklearn') – the library to handle the plot. It could be ‘yellowbrick’. The basic confusion matrix is handled by the scikit-learn package.
normalize ({'true', 'pred', 'all'}, default='true') – Normalizes the confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, the confusion matrix will not be normalized.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.
return_scores (bool, default=True,) – Returns a dictionary of accuracy, precision, recall and AUC scores.
annot (bool, default=True) – Annotate the number of samples (right or wrong prediction ) in the plot. Set False to mute the display.
fig_size (tuple (width, height), default =(22, 6)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
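To make the normalize options above concrete, the sketch below normalizes a small confusion matrix over rows ('true'), columns ('pred') or the whole population ('all'). It is a plain-Python illustration with a hypothetical helper, normalize_confusion, and it assumes every row and column has a non-zero sum; it is not how the plotting function is implemented.

```python
def normalize_confusion(cm, normalize=None):
    """Normalize a confusion matrix given as a list of rows (illustrative)."""
    if normalize is None:
        return [list(row) for row in cm]            # raw counts, untouched
    if normalize == "true":                         # each row sums to 1
        return [[v / sum(row) for v in row] for row in cm]
    if normalize == "pred":                         # each column sums to 1
        col_sums = [sum(col) for col in zip(*cm)]
        return [[v / c for v, c in zip(row, col_sums)] for row in cm]
    if normalize == "all":                          # whole matrix sums to 1
        total = sum(sum(row) for row in cm)
        return [[v / total for v in row] for row in cm]
    raise ValueError("normalize must be 'true', 'pred', 'all' or None")


cm = [[2, 2], [1, 3]]   # toy counts: rows are true classes, columns predictions
by_true = normalize_confusion(cm, "true")
by_pred = normalize_confusion(cm, "pred")
by_all = normalize_confusion(cm, "all")
```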

Examples

>>> import matplotlib.pyplot as plt
>>> plt.style.use ('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models.premodels import p
>>> from watex.utils.plotutils import plot_confusion_matrices
>>> # split the  data . Note that fetch_data output X and y
>>> X, Xt, y, yt  = train_test_split (* fetch_data ('bagoue analysed'), test_size =.25  )
>>> # compose the models: SVM (RBF kernel), logistic regression and random forest
>>> models =[ p.SVM.rbf.best_estimator_,
         p.LogisticRegression.best_estimator_,
         p.RandomForest.best_estimator_
         ]
>>> models
[SVC(C=2.0, coef0=0, degree=1, gamma=0.125), LogisticRegression(),
 RandomForestClassifier(criterion='entropy', max_depth=16, n_estimators=350)]
>>> # now fit all estimators
>>> fitted_models = [model.fit(X, y) for model in models ]
>>> plot_confusion_matrices(fitted_models , Xt, yt)

watex.utils.plot_cost_vs_epochs(regs, *, fig_size=(10, 4), marker='o', savefig=None, **kws)[source]#

Plot the cost against the number of epochs for different learning rates.

Parameters:

regs (Callable, single or list of regression estimators) – Estimators should already be fitted.
fig_size (tuple , default is (10, 4)) – the size of the figure
kws (dict ,) – Additional keyword arguments passed to matplotlib.pyplot.plot()

Returns:

Return type:

Matplotlib.pyplot axes objects

Examples

>>> from watex.datasets import load_iris
>>> from watex.base import AdalineGradientDescent
>>> from watex.utils.plotutils import plot_cost_vs_epochs
>>> X, y = load_iris (return_X_y= True )
>>> ada1 = AdalineGradientDescent (n_iter= 10 , eta= .01 ).fit(X, y)
>>> ada2 = AdalineGradientDescent (n_iter=10 , eta =.0001 ).fit(X, y)
>>> plot_cost_vs_epochs (regs = [ada1, ada2] )

watex.utils.plot_elbow(X, n_clusters, n_init=10, max_iter=300, random_state=42, fig_size=(10, 4), marker='o', savefig=None, **kwd)[source]#

Plot the elbow method to find the optimal number of clusters, k, for the given data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
n_init (int, default=10) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=42) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
marker (str, default='o',) – cluster marker point.
kwd (dict) – Additional keyword arguments passed to matplotlib.pyplot.plot()
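The n_init description above ranks runs by inertia, which is also the quantity an elbow plot usually tracks against the number of clusters. As a reminder of what that number is, the sketch below computes the within-cluster sum of squared distances for given labels and centroids; inertia is a hypothetical helper written in plain Python, not a watex or scikit-learn function.

```python
def inertia(points, labels, centers):
    """Within-cluster sum of squared distances to the assigned centroid."""
    total = 0.0
    for point, k in zip(points, labels):
        # squared Euclidean distance between the point and its centroid
        total += sum((a - b) ** 2 for a, b in zip(point, centers[k]))
    return total


points = [(0.0, 0.0), (2.0, 0.0)]                          # toy data
one_cluster = inertia(points, [0, 0], [(1.0, 0.0)])
two_clusters = inertia(points, [0, 1], [(0.0, 0.0), (2.0, 0.0)])
```

Inertia can only decrease as clusters are added, which is why the elbow (the point where the decrease flattens) rather than the minimum is read off the plot.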

Returns:

Return type:

Matplotlib.pyplot axes objects

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_elbow
>>> # get only the resistivity and gamma-gamma values for the example
>>> res_gamma = load_hlogs ().frame[['resistivity', 'gamma_gamma']]
>>> plot_elbow(res_gamma, n_clusters=11)

watex.utils.plot_learning_curves(models, X, y, *, cv=None, train_sizes=None, baseline_score=0.4, convergence_line=True, fig_size=(20, 6), sns_style=None, savefig=None, set_legend=True, subplot_kws=None, **kws)[source]#

Horizontal visualization of the learning curves of multiple models.

Determines cross-validated training and test scores for different training set sizes.

Parameters:

models (list of estimators) – Estimator instances, cross-validated beforehand or not, that implement fit and predict methods and which will be cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

cv default value if None changed from 3-fold to 4-fold.
train_sizes (array-like of shape (n_ticks,), default=np.linspace(0.1, 1, 50)) –
Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
baseline_score (float, default=.4) – base score from which the score y-axis starts
convergence_line (bool, default=True) – display or not the convergence line that indicates the level of bias between the training and validation curves.
fig_size (tuple (width, height), default =(20, 6)) – the matplotlib figure size given as a tuple of width and height
sns_style (str, optional,) – the seaborn style .
set_legend (bool, default=True) – display legend in each figure. Note the default location of the legend is ‘best’ from legend()
subplot_kws (dict, default is dict(left=0.0625, right = 0.95, wspace = 0.1)) – the subplot keywords arguments passed to matplotlib.subplots_adjust()
kws (dict,) – keyword arguments passed to sklearn.model_selection.learning_curve()
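The train_sizes description above distinguishes fractions from absolute counts. The sketch below shows that interpretation on toy numbers; to_absolute_sizes is a hypothetical helper, and checking the type per element (rather than the array dtype) is a simplification made for this illustration.

```python
def to_absolute_sizes(train_sizes, n_max):
    """Translate train_sizes into absolute numbers of training samples.

    Floats are read as fractions of n_max and must lie in (0, 1];
    anything else is read as an absolute count.
    """
    sizes = []
    for size in train_sizes:
        if isinstance(size, float):
            if not 0.0 < size <= 1.0:
                raise ValueError("fractions must be within (0, 1]")
            sizes.append(int(size * n_max))
        else:
            sizes.append(int(size))
    return sizes


# 80 is an assumed maximum training-set size for the example
from_fractions = to_absolute_sizes([0.25, 0.5, 1.0], 80)
from_counts = to_absolute_sizes([10, 30], 80)
```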

Examples

-> plot via a metaestimator already cross-validated.

>>> from watex.models.premodels import p
>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_learning_curves
>>> X, y = fetch_data ('bagoue prepared') # yields a sparse matrix
>>> # collect four estimators already cross-validated from SVMs
>>> models = [ p.SVM.linear , p.SVM.rbf , p.SVM.sigmoid , p.SVM.poly ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')

-> plot with multiple models not cross-validated yet.

>>> from watex.exlib.sklearn import (LogisticRegression,
                                     RandomForestClassifier,
                                     SVC , KNeighborsClassifier
                                     )
>>> models =[LogisticRegression(), RandomForestClassifier(), SVC() ,
             KNeighborsClassifier() ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')

watex.utils.plot_logging(X, y=None, zname=None, tname=None, labels=None, impute_nan=True, normalize=False, log10=False, columns_to_skip=None, pattern=None, strategy='mean', posiy=None, fill_value=None, fig_size=(16, 7), fig_dpi=300, colors=None, sns_style=False, savefig=None, draw_spines=False, verbose=0, **kws)[source]#

Plot logging data

The plot expects a collection of logging data. Each log composes a column of data collected in the field. Note that any kind of data can also be plotted provided that it contains numerical values. The function does not accept categorical data; if categorical data are given, they are discarded.

Parameters:

X (Dataframe of shape (n_samples, n_features)) – where n_samples is the number of data points, expected to be collected at different depths, and n_features is the number of columns (features) to be plotted. Note that X must include the depth column. If it is not given, a relative depth is created according to the number of samples that compose X.
y (array-like or series of shape (n_samples,), optional) – Target relative to X for classification or regression. If given, the target plot is located at the last position by default. However, with the posiy argument, the target plot can be moved to the desired position.
zname (str, default=None) – The name of the depth column in X. If the depth column is not named ‘depth’, another column name that holds the depth can be indicated so that the function sets this column aside as the depth column for plotting purposes. If set to None, zname holds the name depth and assumes that depth exists in the X columns.
tname (str, optional,) – Name of the target. It renames the target if y is given as a pandas series, or names the target if y is given as an array-like. If not provided, the name of the target series is used if y is not None.
normalize (bool, default = False) – Normalize all the data to range between (0, 1), except the depth.
labels (list or str, optional) – If labels are given, their number should fit the number of columns. The given labels replace the old columns in X and appear in the plot. This is useful to change the column labels in the dataframe to new labels that best describe the plot, for instance by including the units in the new labels. Note that if the labels do not match the size of the old columns in X, a warning is issued to the user and no operation is performed.
impute_nan (bool, default=True,) – Replace the NaN values in the dataframe. Note that the default behaviour for replacing NaN is the mean. However, if the fill_value argument is provided, the latter is used to replace ‘NaN’ in X.
log10 (bool, default=False) – Convert values to log10. This can be useful when working with logarithmic data. However, not all data support this operation, for instance negative data. In that case, the columns_to_skip argument is useful to skip those columns when converting values to log10.
columns_to_skip (list or str, optional,) – Columns to skip when performing some operations like ‘log10’. These columns will not be affected by the ‘log10’ operation. Note that columns_to_skip can also be given as a literal string. In that case, the pattern is needed to parse the columns into a list of strings.
pattern (str, default = '[#&*@!,;\s]\s*') –
Regex pattern to parse columns_to_skip into a list of strings where each item is a column name, especially when the latter is given as a literal text string. For instance:
```
columns_to_skip='depth_top, thickness, sp, gamma_gamma'
-> ['depth_top', 'thickness', 'sp', 'gamma_gamma']
```
by using the default pattern. To have full control of the split columns, it is recommended to provide your own pattern, since wrong parsing can lead to an error.
strategy (str, default='mean') –
The imputation strategy.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
fill_value (str or numerical value, optional) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types. If not given and impute_nan is True, the mean strategy is used instead.
posiy (int, optional) – the position to place the target plot y . By default the target plot if given is located at the last position behind the logging plots.
colors (list of Matplotlib.colors map, optional) – The colors for plotting each columns of X except the depth. If not given, default colors are auto-generated.
draw_spines (bool, tuple (-lim, +lim), default= False,) – Only draw spine between the y-ticks. -lim and +lim are lower and upper bound i.e. a range to draw the spines in y-axis.
fig_size (tuple (width, height), default =(16, 7)) – the matplotlib figure size given as a tuple of width and height
fig_dpi (float or 'figure', default=300) – The resolution in dots per inch. If ‘figure’, use the figure’s dpi value.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
verbose (int, default=0) – Output the number of categorical features dropped from the dataframe.
kws (dict,) – Additional keyword arguments passed to matplotlib.axes.plot()
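The parsing of columns_to_skip described above can be pictured with a regular-expression split. In the sketch below, the pattern is written with the backslashes that the rendered default appears to have lost (an assumption), and parse_columns is a hypothetical helper, not the function's internal code.

```python
import re

# Assumed form of the default pattern: one separator character
# followed by optional whitespace.
PATTERN = r"[#&*@!,;\s]\s*"


def parse_columns(columns_to_skip, pattern=PATTERN):
    """Return column names as a list, splitting a literal string if needed."""
    if isinstance(columns_to_skip, str):
        # drop empty items produced by leading or trailing separators
        return [name for name in re.split(pattern, columns_to_skip) if name]
    return list(columns_to_skip)


parsed = parse_columns("depth_top, thickness, sp, gamma_gamma")
```

Column names that themselves contain one of the separator characters would be split too, which is why a custom pattern is recommended in such cases.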

Examples

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_logging
>>> X0, y = load_hlogs (as_frame =True) # get the frames rather than object
>>> # plot the default logging with Normalize =True
>>> plot_logging (X0, normalize =True)
>>> # Include the target in the plot
>>> plot_logging ( X0,  y = y.kp , posiy = 0,
                  columns_to_skip=['thickness', 'sp'],
                  log10 =True,
                  )
>>> # draw spines and limit plot from (0, 700) m depth
>>> plot_logging (X0 , y= y.kp, draw_spines =(0, 700) )

watex.utils.plot_mlxtend_heatmap(df, columns=None, savefig=None, **kws)[source]#

Plot correlation matrix array as a heat map

Parameters:

df – dataframe pandas
columns – list of features. If given, only the dataframe with those features is considered.
kws – additional keyword arguments passed to mlxtend.plotting.heatmap()

Returns:

mlxtend.plotting.heatmap() axes object

Example:

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_mlxtend_heatmap
>>> h=load_hlogs()
>>> features = ['gamma_gamma', 'sp',
            'natural_gamma', 'resistivity']
>>> plot_mlxtend_heatmap (h.frame , columns =features, cmap ='PuOr')

watex.utils.plot_mlxtend_matrix(df, columns=None, fig_size=(10, 8), alpha=0.5, savefig=None)[source]#

Visualize the pair wise correlation between the different features in the dataset in one place.

Parameters:

df – dataframe pandas
columns – list of features. If given, only the dataframe with those features is considered.
fig_size – tuple of int (width, height). Size of the displayed figure
alpha – figure transparency, default is .5.

Returns:

mlxtend.plotting.scatterplotmatrix() axes object

Example:

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_mlxtend_matrix
>>> import pandas as pd
>>> import numpy as np
>>> h=load_hlogs()
>>> features = ['gamma_gamma', 'natural_gamma', 'resistivity']
>>> data = pd.DataFrame ( np.log10 (h.frame[features]), columns =features )
>>> plot_mlxtend_matrix (data, columns =features)

watex.utils.plot_naive_dendrogram(X, *ybounds, fig_size=(12, 5), savefig=None, **kws)[source]#

Quick plot dendrogram using the ward clustering function from Scipy.

Parameters:

X – ndarray of shape (n_samples, n_features) Array of features
ybounds – int, integer values to draw horizontal cluster lines that indicate the number of clusters.
fig_size – tuple (width, height), default =(12,5) the matplotlib figure size given as a tuple of width and height
kws – dict, Additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram()

Examples:

>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_naive_dendrogram
>>> X, _= fetch_data('Bagoue analysed') # data is already scaled
>>> # get the two features 'power' and  'magnitude'
>>> data = X[['power', 'magnitude']]
>>> plot_naive_dendrogram(data )
>>> # add the horizontal line of the cluster at ybounds = (20 , 20 )
>>> # for a single cluster (cluster 1)
>>> plot_naive_dendrogram(data , 20, 20 )

watex.utils.plot_pca_components(components, *, feature_names=None, cmap='viridis', savefig=None, **kws)[source]#

Visualize the coefficients of the principal component analysis (PCA) as a heatmap

Parameters:

components – Ndarray of shape (n_components, n_features) or PCA object. Array of the PCA components or object from watex.analysis.dimensionality.nPCA. If the object is given, it is not necessary to set feature_names.
feature_names – list or str, optional. List of the feature names to locate in the map. feature_names and the number of eigenvectors must have the same length. If a PCA object is passed as components, there is no need to set feature_names; the feature names are retrieved automatically.
cmap – str, default=’viridis’. The matplotlib color map for matshow visualization.
kws – dict, Additional keyword arguments passed to matplotlib.pyplot.matshow

Examples:

(1)-> with PCA object

>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_pca_components
>>> from watex.analysis import nPCA
>>> X, _= fetch_data('bagoue pca')
>>> pca = nPCA (X, n_components=2, return_X =False)# to return object
>>> plot_pca_components (pca)

(2)-> use the components and features individually

>>> components = pca.components_
>>> features = pca.feature_names_in_
>>> plot_pca_components (components, feature_names= features,
                         cmap='jet_r')

watex.utils.plot_profiling(erp, cz=None, *, style='classic', fig_size=(10, 4), cz_plot_kws=None, marker_kws=None, savefig=None, **plot_kws)[source]#

Visualizes the resistivity profiling of ERP data.

The function can overlay the selected conductive zone on the ERP if cz is given.

Parameters:

erp (array_like 1d) – The electrical resistivity profiling array.
cz (array_like, optional) – The selected conductive zone. If None, cz is not plotted.
style (str, default='classic') – Matplotlib plotting style.
fig_size (tuple, default= (10, 4)) – Matplotlib figure size.
marker_kws (dict, default = {'marker':'o', 'c':'#9EB3DD' }) – The dictionary used to customize the markers in the plot.
cz_plot_kws (dict, default = {'ls':'-','c':'#0A4CEE', 'lw':2 }) – The dictionary used to customize the conductive zone in the plot.
savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.
plot_kws (dict,) – Additional keyword arguments passed to matplotlib.pyplot.plot() function

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

Examples

>>> from watex.datasets import make_erp
>>> from watex.utils import defineConductiveZone
>>> from watex.utils.plotutils import plot_profiling
>>> d= make_erp (n_stations =56, seed = 42)
>>> plot_profiling  (d.resistivity)

watex.utils.plot_regularization_path(X, y, c_range=(-4.0, 6.0), fig_size=(8, 5), sns_style=False, savefig=None, **kws)[source]#

Plot the regularization path from Logit / LogisticRegression.

Vary the regularization strength and plot the weight coefficients of the different features for each regularization strength.

Note that it is recommended to standardize the data first.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features. X is expected to be standardized.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
c_range (list or tuple [start, stop]) – Regularization strength range. It runs from the strongest (start) to the weakest (stop) regularization. Note that ‘C’ is the inverse of the Logistic Regression regularization parameter \(\lambda\).
fig_size (tuple (width, height), default =(8, 5)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
kws (dict,) – Additional keyword arguments passed to sklearn.linear_model.LogisticRegression
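
To illustrate how c_range relates to the regularization strength, here is a minimal sketch. It assumes that the two bounds are base-10 exponents, so that each integer step c in [start, stop) gives C = 10**c. This page does not confirm that interpretation, and the helper name c_grid is hypothetical.

```python
def c_grid(c_range=(-4.0, 6.0)):
    """Hypothetical helper: turn (start, stop) exponents into C values.

    Assumption: C = 10**c for each integer step c in [start, stop).
    """
    start, stop = (int(v) for v in c_range)
    return [10.0 ** c for c in range(start, stop)]


cs = c_grid()
# lambda is the inverse of C, so a small C means strong regularization.
lambdas = [1.0 / c for c in cs]
print("strongest:", cs[0], "lambda =", lambdas[0])
print("weakest:  ", cs[-1], "lambda =", lambdas[-1])
```

Under this assumption the default range covers ten values of C, and the first one carries the largest \(\lambda\).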

Examples

>>> from watex.utils.plotutils import plot_regularization_path
>>> from watex.datasets import fetch_data
>>> X, y = fetch_data ('bagoue analysed' ) # data already standardized
>>> plot_regularization_path (X, y )

watex.utils.plot_rf_feature_importances(clf, X=None, y=None, fig_size=(8, 4), savefig=None, n_estimators=500, verbose=0, sns_style=None, **kws)[source]#

Plot features importance with RandomForest.

Parameters:

clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default =(8, 4)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
verbose (int, default=0) – print the feature labels with the rate of their importances.
kws (dict,) – Additional keyword arguments passed to sklearn.ensemble.RandomForestClassifier

Examples

>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> from watex.utils.plotutils import plot_rf_feature_importances
>>> X, y = fetch_data ('bagoue analysed' )
>>> plot_rf_feature_importances (
    RandomForestClassifier(), X=X, y=y , sns_style=True)

watex.utils.plot_sbs_feature_selection(sbs_estimator, /, X=None, y=None, fig_size=(8, 5), sns_style=False, savefig=None, verbose=0, **sbs_kws)[source]#

Plot Sequential Backward Selection (SBS) for feature selection.

SBS collects the scores of the best feature subset at each stage.

Parameters:

sbs_estimator (SequentialBackwardSelection estimator object) – The Sequential Backward Selection estimator, either fitted or not. If it is not fitted, provide the training X and y; otherwise an error occurs.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default =(8, 5)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.
sns_style (str, optional,) – the seaborn style.
verbose (int, default=0) – print the feature labels with the rate of their importances.
sbs_kws (dict,) – Additional keyword arguments passed to SequentialBackwardSelection

Examples

(1)-> Plot fitted SBS in action

>>> from watex.exlib.sklearn import KNeighborsClassifier , train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> from watex.utils.plotutils import plot_sbs_feature_selection
>>> X, y = fetch_data('bagoue analysed') # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs= SequentialBackwardSelection (knn)
>>> sbs.fit(Xtrain, ytrain )
>>> plot_sbs_feature_selection(sbs, sns_style= True)

(2)-> Plot estimator with no prefit SBS.

>>> plot_sbs_feature_selection(knn, Xtrain, ytrain) # yield the same result

watex.utils.plot_sfi(cz, p=None, s=None, dipolelength=None, fig_size=(10, 4), style='classic', **plotkws)[source]#

Plot sfi parameter components.

Parameters:

cz (array-like 1d,) – Selected conductive zone
p (array-like 1d,) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, it will be set automatically using the default value to match the cz size. The default value is 10.
fig_size (tuple, default=(10, 4)) – Matplotlib (MPL) figure size; should be a tuple value of integers

See also

watex.utils.exmath.sfi: for more details about the sfi parameter computation.

Examples

>>> import numpy as np
>>> from watex.utils.exmath import plot_sfi
>>> from watex.property import P # P supplies the color tag used below
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))*1e2
>>> plotkws  = dict (rlabel = 'Selected conductive zone (cz)',
                     color=f'{P().frcolortags.get("fr3")}',
                     )
>>> plot_sfi (condzone, **plotkws)

watex.utils.plot_silhouette(X, labels, metric='euclidean', savefig=None, **kwds)[source]#

Plot the silhouette to quantify the quality of clustering.

Parameters:

X (array-like of shape (n_samples_a, n_samples_a) if metric == “precomputed”, or (n_samples_a, n_features) otherwise) – An array of pairwise distances between samples, or a feature array.
labels (array-like of shape (n_samples,)) – Label values for each sample.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

See also

watex.view.mlplot.plotSilhouette: Gives a consistent plot through the prefit parameter, which checks whether labels are expected to be passed into the function directly or not.

Examples

>>> import numpy as np
>>> from watex.exlib.sklearn import KMeans
>>> from watex.datasets import load_iris
>>> from watex.utils.plotutils import plot_silhouette
>>> d= load_iris ()
>>> X= d.data [:, 0][:, np.newaxis] # take the first axis
>>> km= KMeans (n_clusters =3 , init='k-means++', n_init =10 ,
                max_iter = 300 ,
                tol=1e-4,
                random_state =0
                )
>>> y_km = km.fit_predict(X)
>>> plot_silhouette (X, y_km)

watex.utils.plot_skew(edi_obj, method='Bahr', mode=None, threshold_line=None, fig_size=(7, 5), savefig=None, view=None, style=None, **kws)[source]#

Plot phase sensitive skew visualization.

Phase Sensitivity Skew (\(\eta\)) is a dimensionality tool that represents a measure of the skew of the phases of the impedance tensor. The parameter is thus unaffected by the distortion effect, unlike the Swift-skew and ellipticity dimensionality tools [1].

Values of \(\eta\) > 0.3 are considered to represent 3D data. Phase-sensitive skews less than 0.1 indicate 1D, 2D or distorted 2D (3D/2D) cases. Values of \(\eta\) between 0.1 and 0.3 indicate modified 3D/2D structures [2] according to Bahr's method. With the Swift method (values \(\eta \geq 0.2\)), however, the smaller the value of \(\eta\) (close to \(0.\)), the closer the structure is to a 2D structure, and vice versa. It is generally considered that an electrical structure with \(\eta < 0.4\) can be treated as a 2D medium. Hence the threshold_line for \(\eta\) using the Swift method should be set to 0.4.
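
The Bahr thresholds quoted above can be turned into a small lookup. This is only an illustrative sketch: the function name classify_bahr_skew and the label strings are hypothetical and not part of watex; only the 0.1 and 0.3 cut-offs come from the text.

```python
def classify_bahr_skew(eta):
    """Label a phase-sensitive skew value using the Bahr cut-offs.

    Hypothetical helper: only the 0.1 and 0.3 thresholds come from the
    documentation; the returned strings are illustrative.
    """
    if eta > 0.3:
        return "3D"
    if eta < 0.1:
        return "1D, 2D or distorted 2D"
    return "modified 3D/2D"


for eta in (0.05, 0.2, 0.45):
    print(eta, "->", classify_bahr_skew(eta))
```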

New in version 0.1.5.

Parameters:

edi_obj (str, watex.edi.Edi) – Full path to edifiles or Edi object.
method (str, default='Bahr') – Kind of correction. Can be swift for the distortion removal proposed by Swift in 1967 [3]: values close to 0. indicate 1D and 2D structures, and 3D otherwise. Conversely, bahr is the distortion removal proposed by Bahr in 1991 [2]. The latter threshold is set to 0.3; above this value the structure is 3D.
threshold_line (float, optional) –
Visualize the threshold line. Can be [‘bahr’, ‘swift’, ‘both’]:
- Note that when method is set to swift, values close to \(0.\) indicate 1D and 2D structures, and 3D otherwise.
- When method is set to Bahr, \(\mu > 0.3\) indicates 3D structures, values in \([0.1 - 0.3]\) indicate modified 3D/2D structures, whereas \(<0.1\) indicates 1D, 2D or distorted 2D.
mode (str, optional) – X-axis coordinates for visualisation. plot either 'frequency' or 'periods'. The default is 'frequency'
view (str, default='skew') – Phase-sensitive visualization. Can be the rotational invariant (invariant). Note that setting view to mu or invariant does not change the interpretation, since the distortions of Z are all rotationally invariant whether using the Bahr or Swift method.
fig_size (tuple, default= (7, 5)) – Matplotlib figure size.
savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.
style (str, default='classic') – Matplotlib plotting style.
kws (dict,) – Additional keyword arguments for the Matplotlib Axes scatter plot.

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

See also

watex.methods.em.Processing.skew: Skew equation formulations.
watex.view.TPlot.plotSkew: Gives a consistent plot where the user can customize the plot using the plot parameters of the watex.property.BasePlot class.

References

[1]

Bahr, K. (1988) Interpretation of the magnetotelluric impedance tensor: regional induction and local telluric distortion. J. Geophys. Res., 62, 119–127.

[2] (1,2)

Bahr, K. (1991) Geological noise in magnetotelluric data: a classification of distortion types. Phys. Earth Planet. Inter., 66, 24–38.

[3]

Bahr, K., 1991. Geological noise in magnetotelluric data: a classification of distortion types. Physics of the Earth and Planetary Interiors 66 (1–2), 24–38.

Examples

>>> import watex as wx
>>> from watex.utils.plotutils import plot_skew
>>> edi_sk = wx.fetch_data ("edis", return_data =True , samples = 20 )
>>> plot_skew (edi_sk)
>>> plot_skew (edi_sk, threshold_line= True)

watex.utils.plot_strike(list_of_edis, /, kind=2, period_tolerance=0.05, text_pad=1.65, rot_z=0.0, **kws)[source]#

Plot the strike estimated from the invariants, phase tensor and the tipper in either a rose diagram or an xy plot.

Parameters:

list_of_edis (list,) – full paths to .edi files to plot or list of EDI-files.
kind (int, default=2) –
Can be [ 1 | 2 ] where:
- 1 to plot individual decades in one plot
- 2 to plot all period ranges into one polar diagram for each strike angle estimation
One could try also plot_type = 1 to plot by decade
fig_num (int, default=1,) – figure number to be plotted. Default is 1
font_size (float, default=10,) – Font size
rot_z (float, default=0.,) – angle of rotation clockwise positive.
period_tolerance (float, default=.05) – Tolerance level to match periods from different edi files. Default is 0.05
text_pad (float, default=1.65) – padding of the angle label at the bottom of each polar diagram. Default is 1.65
plot_range (str, tuple) –
The period range to estimate the strike angle. It can be [ ‘data’ | (period_min,period_max) ]. Options are:
- ’data’ for estimating the strike for all periods in the data.
- (pmin,pmax) for period min and period max, input as (log10(pmin),log10(pmax))
plot_tipper ([ True | False ]) –
- True to plot the tipper strike
- False to not plot tipper strike
pt_error_floor (int, optional) – Maximum error in degrees that is allowed to estimate strike. Default is None allowing all estimates to be used.
fold ([ True | False ]) –
- True to plot only from 0 to 180
- False to plot from 0 to 360
plot_orthogonal ([ True | False]) –
- True to plot the orthogonal strike directions
- False to not
color ([ True | False ]) –
- True to plot shade colors
- False to plot all in one color
color_inv (str,) – color of invariants plots
color_pt (str,) – color of phase tensor plots
color_tip (str) – color of tipper plots
ring_spacing (float, optional) – spacing of rings in polar plots
ring_limits (tuple of int,) – plot limits (min count, max count) set each plot have these limits
plot_orientation (str, [ 'h' | 'v' ]) – horizontal or vertical plots

watex.utils.plot_yb_confusion_matrix(clf, Xt, yt, labels=None, encoder=None, savefig=None, fig_size=(6, 6), **kws)[source]#

Confusion matrix plot using the ‘yellowbrick’ package.

Creates a heatmap visualization of the sklearn.metrics.confusion_matrix(). A confusion matrix shows each combination of the true and predicted classes for a test data set.

The default color map uses a yellow/orange/red color scale. The user can choose between displaying values as the percent of true (cell value divided by sum of row) or as direct counts. If percent of true mode is selected, 100% accurate predictions are highlighted in green.

Requires a classification model.

Be sure ‘yellowbrick’ is installed before using the function; otherwise an ImportError is raised.

Parameters:

clf (classifier estimator) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. If the internal model is not fitted, it is fit when the visualizer is fitted, unless otherwise specified by is_fitted.
Xt (ndarray or DataFrame of shape n x m) – A matrix of n instances with m features. Preferably, matrix represents the test data for error evaluation.
yt (ndarray or Series of length n) – An array or series of target or class values. Preferably, the array represent the test class labels data for error evaluation.
ax (matplotlib Axes, default: None) – The axes to plot the figure on. If not specified the current axes will be used (or generated if required).
sample_weight (array-like of shape = [n_samples], optional) – Passed to confusion_matrix to weight the samples.
encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.
labels (list of str, default: None) – The class labels to use for the legend ordered by the index of the sorted classes discovered in the fit() method. Specifying classes in this manner is used to change the class names to a more specific format or to label encoded integer classes. Some visualizers may also use this field to filter the visualization for specific classes. For more advanced usage specify an encoder rather than class labels.
fig_size (tuple (width, height), default =(6, 6)) – the matplotlib figure size given as a tuple of width and height
savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.

Returns:

cmo – return a yellowbrick confusion matrix object instance.

Return type:

yellowbrick.classifier.confusion_matrix.ConfusionMatrix

Examples

>>> # Import the required modules and fetch an extreme gradient boosting model,
>>> # for instance, then plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> plt.style.use ('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models import pModels
>>> from watex.utils.plotutils import plot_yb_confusion_matrix
>>> # split the  data . Note that fetch_data output X and y
>>> X, Xt, y, yt  = train_test_split (* fetch_data ('bagoue analysed'),
                                      test_size =.25  )
>>> # train the model with the best estimator
>>> pmo = pModels (model ='xgboost' )
>>> pmo.fit(X, y )
>>> print(pmo.estimator_ ) # pmo.XGB.best_estimator_
>>> #%%
>>> # Predict using, under the hood, the best estimator
>>> # for the XGBoost classifier
>>> ypred = pmo.predict(Xt)
>>> # now plot the score
>>> plot_yb_confusion_matrix (pmo.XGB.best_estimator_, Xt, yt  )

watex.utils.power(p)[source]#

Compute the power of the selected conductive zone. Anomaly power is closely related to the width of the conductive zone.

The power parameter implicitly defines the width of the conductive zone and is evaluated from the difference between the abscissa \(X_{LB}\) and the end \(X_{UB}\) points of the selected anomaly:

\[power=|X_{LB} - X_{UB} |\]

Parameters:

p (array-like) – Station position of conductive zone.

Returns:

Absolute value of the width of conductive zone in meters.
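
As a quick check of the formula above, the sketch below computes the width from a list of station positions. It assumes that the first and last positions are the lower and upper bounds \(X_{LB}\) and \(X_{UB}\); the name power_sketch and the sample positions are made up for illustration.

```python
def power_sketch(p):
    """Width of the conductive zone from its station positions.

    Assumption: p[0] is X_LB and p[-1] is X_UB.
    """
    x_lb, x_ub = p[0], p[-1]
    return abs(x_lb - x_ub)


positions = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up positions in meters
print(power_sketch(positions))  # 40.0
```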

watex.utils.predict_NGA_labels(X, /, n_clusters, random_state=0, keep_label_0=False, return_cluster_centers=False, **kws)[source]#

Predict the Naive Group of Aquifer (NGA) labels.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.
random_state (int, RandomState instance or None, default=0) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
keep_label_0 (bool, default=False) – The raw prediction already includes the label 0. However, a predicted label of 0 would refer to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is treated as an undefined permeability coefficient, so ‘0’ can be excluded since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. Therefore, to avoid any confusion, ‘0’ is altered to ‘1’: all class labels are shifted by +1, thereby excluding the ‘0’ label. To force 0 to be included in the labels, set keep_label_0 to True.
return_cluster_centers (bool, default=False,) – export the array of cluster centers if True.
kws (dict,) – Additional keyword arguments passed to sklearn.cluster.KMeans.

Returns:

NGA (array_like of shape (n_samples, n_features)) – Predicted NGA labels.
( NGA , cluster_centers) (Tuple of array-like,) – NGA labels and cluster centers if return_cluster_centers is set to True.
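
The effect of keep_label_0 can be pictured with a few lines of plain Python. This is a sketch of the +1 shift described above, not the library's implementation; the helper name shift_labels and the rule "shift only when 0 is present" are assumptions.

```python
def shift_labels(labels, keep_label_0=False):
    """Shift cluster labels by +1 so that 0 is not used as a class label.

    Hypothetical helper: the real function may decide differently
    when to apply the shift.
    """
    if keep_label_0 or 0 not in labels:
        return list(labels)
    return [label + 1 for label in labels]


raw = [0, 2, 1, 0, 1]  # e.g. raw cluster indices from a KMeans fit
print(shift_labels(raw))                     # [1, 3, 2, 1, 2]
print(shift_labels(raw, keep_label_0=True))  # [0, 2, 1, 0, 1]
```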

watex.utils.projection_validator(X, Xt=None, columns=None)[source]#

Retrieve x, y coordinates of a dataframe ( X, Xt ) from column names or indexes.

If X or Xt are given as arrays, columns may hold integers for selecting the coordinates ‘x’ and ‘y’.

Parameters:

X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.
Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
columns (list of str or index, optional) – columns is useful when a dataframe is given with a dimension size greater than 2. If such data is passed to X or Xt, columns must hold the names to consider as ‘easting’, ‘northing’ when UTM coordinates are given, or ‘latitude’, ‘longitude’ when latlon are given. If the dimension size is greater than 2 and columns is None, an error is raised to prompt the user to provide the indexes for retrieving the ‘y’ and ‘x’ coordinates.

Returns:

arrays and coordinate labels

Return type:

( x, y, xt, yt ), (xname, yname, xtname, ytname), Tuple of coordinate

watex.utils.qc(z_or_edis_obj_list, /, tol=0.5, *, interpolate_freq=False, return_freq=False, tensor='res', return_data=False, to_log10=False, return_qco=False)[source]#

Check the quality control in the collection of Z or EDI objects.

Analyse the data in the EDI collection and return the quality control value. It indicates what percentage of the data is representative.

Parameters:

tol (float, default=.5) – the tolerance parameter. The value indicates the rate from which the data can be considered meaningful. Preferably it should be less than 1 and greater than 0. The default .5 means 50 %. The analysis becomes softer with higher tol values and more severe otherwise.
interpolate_freq (bool,) – interpolate the valid frequencies after removing the frequencies whose data fall under the 1-tol % goodness threshold.
return_freq (bool, default=False) – returns the interpolated frequency.
return_data (bool, default= False,) – returns the valid data from up to 1-tol % goodness.
tensor (str, default='res') – Tensor name. Can be [ resistivity|phase|z|frequency]. Impedance is used for data quality assessment.
to_log10 (bool, default=False) – convert the frequency value to log10.
return_qco (bool, default=False) –
returns the quality control object that wraps all useful information after control. The following attributes can be fetched:
- rate_: the rate of the quality of the data
- component_: The component from which data is selected for analysis. By default either xy or yx is used.
- mode_: The EM mode. Either the [‘TE’|’TM’] modes
- freqs_: The valid frequency in the data selected according to the tol parameters. Note that if interpolate_freq is True, it is used instead.
- invalid_freqs_: Useless frequency dropped in the data during control
- data_: Valid tensor data either in TE or TM mode.

Returns:

The quality control value and the interpolated frequency if return_freq is set to True; otherwise only the quality control ratio. If return_qco is True, the quality control object is returned.

Return type:

Tuple (float ) or (float, array-like, shape (N, )) or QCo

Examples

>>> import watex as wx
>>> data = wx.fetch_data ('huayuan', samples =20, return_data =True ,
                          key='raw')
>>> r,= wx.qc (data)
>>> r
0.75
>>> r, = wx.qc (data, tol=.2 )
>>> r
0.75
>>> r, = wx.qc (data, tol=.1 )

watex.utils.read_data(f, **read_kws)[source]#

Assert and read specific files and URLs allowed by the package.

Readable files are systematically converted to a pandas dataframe.

Parameters:

f (str, Path-like object) – File path or Pathlib object. Must contain a valid file name and should be a readable file or url
read_kws (dict,) – Additional keyword arguments passed to the pandas file reader.

Returns:

f – A dataframe with head contents by default.

Return type:

pandas.DataFrame

watex.utils.reduce_samples(*data, sname, zname=None, kname=None, section_indexes=None, error='raise', strategy='average', verify_integrity=False, ignore_index=False, **kws)[source]#

Create a new dataframe by squeezing/compressing the non valid data.

The m-samples reduction is necessary for datasets with many missing k-values. The technique shrinks the number of k0-values (missing k-values): it consists of compressing the rows with missing \(k\)-values, from the top (depth equals 0) down to the upper section of the first (shallowest) aquifer, into a single vector \(x_r\) with dimension (1×n), i.e. containing the n features.
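
A minimal sketch of the compression idea, assuming the 'average' strategy means a column-wise mean over the rows above the aquifer section. The helper name compress_rows and the numbers are invented for illustration; the library works on dataframes and also handles non-numeric columns, which this sketch ignores.

```python
from statistics import mean


def compress_rows(rows):
    """Average a block of rows column-wise into a single 1 x n vector."""
    return [mean(col) for col in zip(*rows)]


# Made-up numeric features for three samples with missing k-values.
upper_rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]
x_r = compress_rows(upper_rows)
print(x_r)  # [2.0, 20.0, 200.0]
```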

Parameters:

data (list of dataframes) – Data that mainly contains the aquifer values. It must contain the depth values referring to the column name passed to zname and the permeability coefficient k passed to kname. Both arguments need to be supplied when dataframes are passed as positional arguments.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Don't confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column. The integer value must not exceed the dataframe size along axis 1. Commonly zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ in a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column. Note that the integer value must not exceed the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised since zname is used to fetch and overwrite z from the dataframe.
strategy (str, default='average' or 'mean') – strategy used to select or compute the numerical data into a single series. It can also be ‘naive’; in that case, a single series is randomly picked from the base strata data.
section_indexes (tuple or list of int) – a pair (tuple or list) of integers. These are the valid section indexes (upper and lower) of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is considered the last position, the bottom of the section; therefore its value is included in the data.
error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.
verify_integrity (bool, default=False) –
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method. If True, remove the duplicate rows from a DataFrame.

- subset: By default, if the rows have the same values in all the columns, they are considered duplicates. This parameter is used to specify the columns that only need to be considered for identifying duplicates.
- keep: Determines which duplicates (if any) to keep. It takes inputs as, first – Drop duplicates except for the first occurrence. This is the default behavior. last – Drop duplicates except for the last occurrence. False – Drop all duplicates.
- inplace: It is used to specify whether to return a new DataFrame or update an existing one. It is a boolean flag with default False.
ignore_index (bool, default=False,) – It is a boolean flag to indicate if the row index should be reset after dropping duplicate rows. False: It keeps the original row index. True: It resets the index, and the resulting rows will be labeled 0, 1, …, n – 1.

Returns:

df_new – new dataframes with reduced samples.

Return type:

List of pandas.dataframes

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import reduce_samples
>>> data = load_hlogs ().frame # get the frames
>>> # add explicitly the aquifer section indices
>>> dfnew= reduce_samples (data.copy(), sname='strata_name',
                             section_indexes = (16, 29 ),)
>>> dfnew[0]
...    hole_number               strata_name     rock_name  ...      r     rp  remark
    0         H502                  mudstone           J2z  ...    NaN    NaN     NaN
    16        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    17        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    18        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    19        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    20        H502                  mudstone           NaN  ...  35.74  59.23     NaN
    21        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    22        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    23        H502                 siltstone           NaN  ...  59.61  59.23     NaN
    24        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    25        H502  Coarse-grained sandstone           NaN  ...  59.61  59.23     NaN
    26        H502                  mudstone           NaN  ...  82.33  59.23     NaN
    27        H502    fine-grained sandstone           NaN  ...  82.33  59.23     NaN
    28        H502  Coarse-grained sandstone           J2z  ...  82.33  59.23     NaN
    29        H502                      coal  (J2y)  2coal  ...  82.33  59.23     NaN
    0         H502                 siltstone           NaN  ...    NaN    NaN     NaN

[16 rows x 23 columns]
>>> # specify the column name and kname without section indexes
>>> dfnew= reduce_samples (data.copy(), sname='strata_name',
                           zname='depth', kname='k', ignore_index= True )[0]
>>> dfnew.index # index is reset
RangeIndex(start=0, stop=16, step=1)

watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1)[source]#

Efficient strategy to remove outliers in the data.

An outlier is a data point of a given sample, observation, or distribution that lies outside the overall pattern. A commonly used rule considers a data point an outlier if it lies more than 1.5 IQR below the first quartile or above the third.

Two approaches are used to remove the outliers.

Inter Quartile Range (IQR): IQR is the most commonly used and most trusted approach in the research field. Low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to calculate the median and the quartiles Q1 and Q3, from which the IQR follows.

\[ \begin{align}\begin{aligned}Q1 = 1/4 (n + 1)\\Q3 = 3/4 (n + 1)\\IQR = Q3 - Q1\end{aligned}\end{align} \]

Outliers are the values that fall above or below the normal range of the dataset, which is delimited by the upper and lower bounds (computed with 1.5*IQR):

\[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]

In the above formula, following common statistical practice, the IQR is scaled up by 0.5 (\(new\_IQR = IQR + 0.5*IQR\)) so that all data within about 2.7 standard deviations of the mean of a Gaussian distribution are retained.

Z-score: Also called the standard score. This score tells how far a data point is from the mean, in units of standard deviation. After setting a threshold value, one can use the z-scores of the data points to define the outliers.

\[Zscore = (\text{data\_point} - \text{mean}) / \text{std. deviation}\]

To define an outlier, a threshold value is chosen, generally 3.0, since 99.7% of the data points lie within +/- 3 standard deviations of the mean (assuming a Gaussian distribution).
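The two rules described above can be sketched with plain NumPy. This is only an illustration of the IQR fences and the z-score test, not the implementation used by remove_outliers; the helper names iqr_bounds and zscores and the sample values are assumptions made for the example. The z-score threshold is lowered to 2.5 here because, with only eight samples, no point can reach a z-score of 3.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscores(values):
    """Standard score of every point: (x - mean) / std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# hypothetical sample with one obvious outlier (95)
data = np.array([10., 12., 11., 13., 12., 11., 10., 95.])

lower, upper = iqr_bounds(data)
iqr_mask = (data < lower) | (data > upper)   # True where the IQR rule flags a point

z_mask = np.abs(zscores(data)) > 2.5         # True where |z| exceeds the threshold

cleaned = data[~iqr_mask]                    # drop the flagged points
filled = np.where(iqr_mask, np.nan, data)    # or replace them by a fill value
```

Dropping versus filling mirrors the role of the fill_value parameter described below: without a fill value the flagged points are removed, otherwise they are replaced in place.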

Parameters:

ar (array-like) – Array containing outliers to remove.
method (str, default='IQR') – The approach used to remove the outliers. It can be [‘IQR’|‘Z-score’]. See above for the explanation of each approach. Note that when selecting "z-score", the threshold value greatly influences which data points are considered outliers.
threshold (float, default=3) – Threshold value, useful for "z-score": data whose score lies above it are considered outliers.
fill_value (float, optional) – Value to replace the outliers. If not given, outliers are removed from the array.
axis (int, default=1) – Axis from which to remove values. This is useful when a two-dimensional array is supplied. By default, outliers are deleted from the rows.

Returns:

arr – New array with the outliers removed.

Return type:

Array_like

Examples

>>> import numpy as np
>>> np.random.seed (42 )
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn (7, 3 )
>>> data_r = remove_outliers ( data )
>>> data.shape , data_r.shape
(7, 3) (5, 3)
>>> remove_outliers ( data, fill_value =np.nan )
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for one dimensional
>>> remove_outliers ( data[:, 0] , fill_value =np.nan )
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])

watex.utils.rename_labels_in(arr, new_names, coerce=False)[source]#

Rename labels with new names.

Parameters:

arr – array-like or pandas.Series; array or series containing numerical values. If non-numerical values are given, an error is raised.
new_names – list of str; list of strings or values to replace the integer label identifiers.
coerce – bool, default=False; force the ‘new_names’ to appear in the target, whether or not it still includes some integer class label identifiers. If coerce is True, the target array takes the dtype of the new array; coercing the label names will not raise an error and can consequently introduce unexpected results.

Returns:

array-like, An array-like with full new label names.

watex.utils.reshape(arr, axis=None)[source]#

Detect the array shape and reshape it accordingly, back to the given axis.

Parameters:

arr – array_like with a number of dimensions equal to 1 or 2
axis – axis along which to reshape the array back. If ‘axis’ is None and the number of dimensions is greater than 1, it reshapes the array back to a one-dimensional array.

Returns:

New reshaped array

Example:

>>> import numpy as np
>>> from watex.utils.funcutils import reshape
>>> array = np.random.randn(50 )
>>> array.shape
... (50,)
>>> ar1 = reshape(array, 1)
>>> ar1.shape
... (1, 50)
>>> ar2 =reshape(ar1 , 0)
>>> ar2.shape
... (50, 1)
>>> ar3 = reshape(ar2, axis = None)
>>> ar3.shape # goes back to the original array
... (50,)

watex.utils.rhoa2z(rhoa, phs, freq)[source]#

Convert apparent resistivity to impedance tensor z.

Parameters:

rhoa (ndarray, shape (N, M)) – Apparent resistivity in \(\Omega.m\)
phs (ndarray, shape (N, M)) – Phase in degrees
freq (array-like , shape (N, )) – Frequency in Hertz

Returns:

Impedance tensor; the tensor is complex-valued, in \(\Omega\).

Return type:

ndarray, shape (N, M), dtype = ‘complex’

Example:

>>> import numpy as np
>>> from watex.utils import rhoa2z
>>> rhoa = np.array([1623.73691735])
>>> phz = np.array([45.])
>>> f = np.array ([1014])
>>> rhoa2z(rhoa, phz, f)
... array([[2.54950976+2.54950976j]])
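The numbers in the example above are consistent with the usual SI relation \(|Z| = \sqrt{\rho_a \omega \mu_0}\) with \(\omega = 2\pi f\), the phase giving the argument of the complex impedance. The sketch below reproduces that arithmetic with NumPy. It assumes this relation and is not taken from the library source; the name rhoa_to_z is hypothetical, and the sketch returns a 1-D array rather than the 2-D array shown above.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # magnetic permeability of free space (H/m)

def rhoa_to_z(rhoa, phs, freq):
    """Complex impedance from apparent resistivity (ohm.m), phase (degrees)
    and frequency (Hz), assuming |Z| = sqrt(rhoa * omega * mu0)."""
    rhoa, phs, freq = (np.asarray(a, dtype=float) for a in (rhoa, phs, freq))
    omega = 2 * np.pi * freq                    # angular frequency
    magnitude = np.sqrt(rhoa * omega * MU0)     # |Z|
    return magnitude * np.exp(1j * np.deg2rad(phs))

z = rhoa_to_z([1623.73691735], [45.0], [1014])
```

With a 45 degree phase the real and imaginary parts are equal, which is why both components of the result above share the same value.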

watex.utils.scalePosition(ydata, xdata=None, func=None, c_order=0, show=False, **kws)[source]#

Correct data location or position and return new corrected location

Parameters:

ydata (array_like, series or dataframe) – The dependent data, a length M array - nominally f(xdata, ...).
xdata (array_like or object) – The independent variable where the data is measured. Should usually be an M-length sequence or an (k,M)-shaped array for functions with k predictors, but can actually be any object. If None, xdata is generated by default using the length of the given ydata.
func (callable) – The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. The default func is a linear function, i.e. f(x) = ax + b, where a is the slope and b is the intercept. Setting your own function for better fitting is recommended.
c_order (int or str) – The index or the column name if ydata is given as a dataframe to select the right column for scaling.
show (bool) – Quick visualization of data distribution.
kws (dict) – Additional keyword arguments passed to scipy.optimize.curve_fit. Refer to scipy.optimize.curve_fit.

Returns:

- ydata - array-like - The scaled data.
- popt - array-like - Optimal values for the parameters so that the sum of the squared residuals of f(xdata, \*popt) - ydata is minimized.
- pcov - array-like - The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)). How the sigma parameter affects the estimated covariance depends on the absolute_sigma argument, as described in scipy.optimize.curve_fit. If the Jacobian matrix at the solution doesn’t have a full rank, then the ‘lm’ method returns a matrix filled with np.inf; on the other hand, the ‘trf’ and ‘dogbox’ methods use the Moore-Penrose pseudoinverse to compute the covariance matrix.
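As a rough sketch of what the returned values mean, the snippet below fits the default linear model f(x) = ax + b with scipy.optimize.curve_fit and evaluates it to obtain corrected values. The synthetic data and the names linear and y_scaled are assumptions for illustration; the actual function may preprocess its inputs differently.

```python
import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    """Default-style model: slope a and intercept b."""
    return a * x + b

# synthetic positions: a straight line plus measurement noise
rng = np.random.RandomState(0)
x = np.arange(20, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

popt, pcov = curve_fit(linear, x, y)   # optimal (a, b) and their covariance
y_scaled = linear(x, *popt)            # corrected values lying on the fitted line
perr = np.sqrt(np.diag(pcov))          # one standard deviation error on (a, b)
```

Passing a different callable in place of linear corresponds to supplying func, and any extra keyword accepted by curve_fit plays the role of kws.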

Examples

>>> from watex.utils import erpSelector, scalePosition
>>> df = erpSelector('data/erp/l10_gbalo.xlsx')
>>> df.columns
... Index(['station', 'resistivity', 'longitude', 'latitude', 'easting',
'northing'],
dtype='object')
>>> # correcting northing coordinates from easting data
>>> northing_corrected, popt, pcov = scalePosition(ydata =df.northing ,
xdata = df.easting, show=True)
>>> len(df.northing.values) , len(northing_corrected)
... (20, 20)
>>> popt  # by default popt =(slope:a ,intercept: b)
...  array([1.01151734e+00, 2.93731377e+05])
>>> # corrected easting coordinates using the default x.
>>> easting_corrected, *_= scalePosition(ydata =df.easting , show=True)
>>> df.easting.values
... array([790284, 790281, 790277, 790270, 790265, 790260, 790254, 790248,
...       790243, 790237, 790231, 790224, 790218, 790211, 790206, 790200,
...       790194, 790187, 790181, 790175], dtype=int64)
>>> easting_corrected
... array([790288.18571705, 790282.30300999, 790276.42030293, 790270.53759587,
...       790264.6548888 , 790258.77218174, 790252.88947468, 790247.00676762,
...       790241.12406056, 790235.2413535 , 790229.35864644, 790223.47593938,
...       790217.59323232, 790211.71052526, 790205.8278182 , 790199.94511114,
...       790194.06240407, 790188.17969701, 790182.29698995, 790176.41428289])

watex.utils.scaley(y, x=None, deg=None, func=None)[source]#

Scaling value using a fitting curve.

Create a polyfit function from specific data points x to correct the y values.

Parameters:

y – array-like of y-axis. The array of values to be scaled.
x – array-like of x-axis. If x is given, it should have the same length as y, otherwise an error occurs. Default is None.
func – callable - The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. func can be a linear function, i.e. f(x) = ax + b, where a is the slope and b is the intercept. Depending on the distribution of the y values, setting up a custom function is recommended for better fitting. If func is given, deg is not needed.
deg – polynomial degree. If the value is None, it is computed using the number of extrema (local and/or global) values.

Returns:

y: scaled array - projected sample values obtained from f.
x: new x-axis - new axis x_new generated from the samples.
f: the linear or polynomial function.
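A minimal sketch of the polynomial-fit idea, using only np.polyfit and np.poly1d. The degree is fixed to 1 here for simplicity, whereas the documented function can infer it from the extrema when deg is None; the variable names are illustrative and not taken from the library.

```python
import numpy as np

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x + rng.randn(50) - 1        # noisy line with slope 2 and intercept -1

coeffs = np.polyfit(x, y, deg=1)     # least-squares polynomial coefficients
f = np.poly1d(coeffs)                # callable polynomial f(x)

order = np.argsort(x)
x_new = x[order]                     # sorted x-axis
y_scaled = f(x_new)                  # y values projected onto the fitted curve
```

The triple (y_scaled, x_new, f) has the same roles as the three returned items listed above.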

References:

Wikipedia, Curve fitting, https://en.wikipedia.org/wiki/Curve_fitting Wikipedia, Polynomial interpolation, https://en.wikipedia.org/wiki/Polynomial_interpolation

Example:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils import scaley
>>> rdn = np.random.RandomState(42)
>>> x0 =10 * rdn.rand(50)
>>> y = 2 * x0  +  rdn.randn(50) -1
>>> plt.scatter(x0, y)
>>> yc, x , f = scaley(y)
>>> plt.plot(x, y, x, yc)

watex.utils.select_base_stratum(d, /, sname=None, stratum=None, return_rate=False, return_counts=False)[source]#

Selects the base stratum from the strata column in the logging data.

Find the most recurrent stratum in the data and compute the rate of occurrence.

Parameters:

d (array-like 1D, pandas.Series or DataFrame) – Valid data containing the strata. If a dataframe is passed, ‘sname’ is needed to fetch the strata values.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of a valid layer/rock in the array/Series of strata.
stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered. It returns the same stratum; however, it can give the rate and occurrence of this stratum if return_rate or return_counts is set to True.
return_rate (bool, default=False) – Returns the rate of occurrence of the base stratum in the data.
return_counts (bool, default=False) – Returns each stratum name and its occurrences (count) in the data.

Returns:

bs (str) – base stratum, contained in the data
r (float) – rate of occurrence of the base stratum in the data
c (tuple (str, int)) – Tuples of each stratum with its occurrence in the data.
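The three return values can be understood with a small standard-library sketch: count each stratum, take the most frequent one as the base stratum, and divide its count by the total number of samples. The list of strata below is made up for illustration and the code is not the library's implementation.

```python
from collections import Counter

# hypothetical strata column
strata = ['siltstone', 'mudstone', 'siltstone', 'coal', 'siltstone', 'mudstone']

counts = Counter(strata).most_common()   # [(name, count), ...] sorted by count
base_stratum, n_base = counts[0]         # most recurrent stratum
rate = n_base / len(strata)              # rate of occurrence of the base stratum
```

In the example below, the same arithmetic gives 52 / 181 ≈ 0.2873 for 'siltstone'.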

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import select_base_stratum
>>> data = load_hlogs().frame # get only the frame
>>> select_base_stratum(data, sname ='strata_name')
... 'siltstone'
>>> select_base_stratum(data, sname ='strata_name', return_rate =True)
... 0.287292817679558
>>> select_base_stratum(data, sname ='strata_name', return_counts=True)
... [('siltstone', 52),
     ('fine-grained sandstone', 40),
     ('mudstone', 37),
     ('coal', 24),
     ('Coarse-grained sandstone', 15),
     ('carbonaceous mudstone', 9),
     ('medium-grained sandstone', 2),
     ('topsoil', 1),
     ('gravel layer', 1)]

watex.utils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)[source]#

Select features by importance, based on a user-specified threshold after model fitting. This is useful if one wants to use RandomForestClassifier as a feature selector and intermediate step in a scikit-learn Pipeline object, which allows connecting different processing steps with an estimator.

Parameters:

clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples, )) – Target vector where n_samples is the number of samples. If given, set prefit=False for estimator to fit and transform the data for feature importance selecting. If estimator is already fitted i.e. prefit=True, ‘y’ is not needed.
threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater than or equal to the threshold are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
prefit (bool, default=True) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.
importance_getter (str or callable, default='auto') –
If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.

Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.

If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
- If an integer, then it specifies the maximum number of features to allow.
- If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
- If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.
return_selector (bool, default=False,) – Returns selector object if True., otherwise returns the transformed X.
verbose (int, default=0) – display the number of features that meet the criterion according to their importance range.
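The effect of the threshold parameter can be sketched with NumPy alone: a feature is kept when its absolute importance is greater than or equal to the threshold. The helper support_mask is hypothetical and simplified (it flattens a single-row coef_ and only handles a number, "mean" or "median"); the coefficients are the ones printed in the examples below.

```python
import numpy as np

def support_mask(importances, threshold):
    """Boolean mask of kept features and the numeric threshold actually used."""
    importances = np.abs(np.ravel(importances))
    if isinstance(threshold, str):
        if threshold == "mean":
            value = importances.mean()
        elif threshold == "median":
            value = np.median(importances)
        else:
            raise ValueError("this sketch only handles 'mean' and 'median'")
    else:
        value = float(threshold)
    return importances >= value, value

coef = np.array([[-0.3252302, 0.83462377, 0.49750423]])

mask_fixed, _ = support_mask(coef, 0.1)            # fixed threshold of 0.1
mask_mean, thr_mean = support_mask(coef, "mean")   # mean of absolute importances
```

With the fixed threshold of 0.1 every feature is kept, while the "mean" rule keeps only the second feature, which is the difference visible between the two selectors in the examples.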

Returns:

Xs or selector – Ndarray of number of samples and features that meet the criterion according to the importance range or selector object

Return type:

ndarray (n_samples, n_criterion_features), or sklearn.feature_selection.SelectFromModel

Examples

>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...      [-2.79, -0.02, -0.85 ],
...      [-1.34, -0.48, -2.55 ],
...      [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]

(1) Use prefit=True and get the Xs importance features

>>> Xs = select_feature_importances (
...     LogisticRegression().fit(X0, y0), X0 , prefit =True )

>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])

(2) Set prefit off and return the selector object

>>> selector= select_feature_importances (
    LogisticRegression(), X= X0 ,
    y =y0  ,
    prefit =False , return_selector= True
    )
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])

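>>> # for comparison, scikit-learn's SelectFromModel with its default threshold
>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)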
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform (X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])

watex.utils.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#

Select features and return new dataframe.

Parameters:

df – a dataframe for feature selection
features – list of features to select. The features must be in the dataframe, otherwise an error occurs.
include – the type of data to retrieve from the dataframe df. Can be number.
exclude – the type of data to exclude from the dataframe df. Can be number, i.e. only non-numeric data will be kept in the returned data.
coerce – return the whole dataframe after transforming the numeric columns. Be aware that no selection is done and no error is raised. Default is False.
kwd – additional keyword arguments from the pd.astype function

Ref:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

watex.utils.sfi(cz, p=None, s=None, dipolelength=None, view=False, raw=False, return_components=False, **plotkws)[source]#

Compute the pseudo-fracturing index known as sfi.

The sfi parameter does not indicate the rock fracturing degree in the underground but it is used to speculate about the apparent resistivity dispersion ratio around the cumulated sum of the resistivity values of the selected anomaly. It uses an approach similar to the IF parameter proposed by Dieng et al (2004). Furthermore, its threshold is set to \(\sqrt{2}\) for a symmetrical anomaly characterized by a perfect distribution of resistivity in a homogeneous medium. The formula is given by:

\[sfi=\sqrt{(P_a^{*}/P_a )^2+(M_a^{*}/M_a )^2}\]

where \(P_a\) and \(M_a\) are the anomaly power and magnitude respectively, and \(P_a^{*}\) and \(M_a^{*}\) are the projected power and magnitude of the lower point of the selected anomaly.
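The formula itself is simple arithmetic once the four quantities are known. The sketch below only evaluates the expression; it does not show how the library derives the projected power and magnitude from the fitted curve, and the function name and input values are illustrative assumptions.

```python
import math

def pseudo_fracturing_index(pa, ma, pa_star, ma_star):
    """Evaluate sqrt((Pa*/Pa)^2 + (Ma*/Ma)^2)."""
    return math.sqrt((pa_star / pa) ** 2 + (ma_star / ma) ** 2)

# when the projected values equal the measured ones, both ratios are 1
symmetric = pseudo_fracturing_index(pa=40.0, ma=120.0, pa_star=40.0, ma_star=120.0)
```

When both ratios equal 1 the expression evaluates to \(\sqrt{2}\), which matches the threshold stated above for a symmetrical anomaly.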

Parameters:

cz (array-like,) – Selected conductive zone
p (array-like,) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, it will be set automatically using the default value to match the cz size. The default value is 10.
view (bool, default=False) – Visualize the fitting curve. Default is False.
raw (bool, default=False) – Overlay the fitting curve with the raw curve from cz.
return_components (bool, default=False) – If True, it returns the different components used to compute sfi, especially for external visualization.
plotkws (dict) – Matplotlib plot keyword arguments.

Returns:

sfi – value computed for pseudo-fracturing index

Return type:

float

Examples

>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import sfi
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))
>>> # no visualization and default value `s` with global minimal rho
>>> pfi = sfi (condzone)
... 3.35110143
>>> # visualize fitting curve
>>> plotkws  = dict (rlabel = 'Conductive zone (cz)',
                     label = 'fitting model',
                     color=f'{P().frcolortags.get("fr3")}',
                     )
>>> sfi (condzone, view= True , s= 5, figsize =(7, 7),
          **plotkws )
(array([ 0., 10., 20., 30.]), 1)

References

See Numpy Polyfit.
See Stackoverflow, the answer of AkaRem edited by Tobu and Migilson.
See Numpy Errorstate and how to implement the context manager.

watex.utils.shape(cz, s=Ellipsis, p=Ellipsis)[source]#

Compute the shape of anomaly.

The shape parameter is mostly used in the basement medium to depict the better conductive zone for the drilling location. According to Sombo et al. (2011; 2012), various shapes of anomalies can be described such as:

“V”, “U”, “W”, “M”, “K”, “C”, and “H”

Computing the shape consists of feeding the algorithm with the Electrical Resistivity Profiling resistivity values and specifying the station \((S_{VES})\). Indeed, \(S_{VES}\) is mostly the station with a very low resistivity value, expected to be the drilling location.

Parameters:

cz – array-like - Conductive zone resistivity values
s – int, str - Station position index or name.
p – Array-like - Should be the position of the conductive zone.

Note

If s is given, p should be provided. If p is missing, an error is raised.

Returns:

str - the shape of anomaly.

Example:

>>> import numpy as np
>>> rang = np.random.RandomState(42)
>>> from watex.utils.exmath import shape
>>> test_array1 = np.arange(10)
>>> shape (test_array1)
...  'C'
>>> test_array2 = rang.randn (7)
>>> shape(test_array2)
... 'H'
>>> test_array3 = np.power(10, test_array2 , dtype =np.float32)
>>> shape (test_array3)
... 'H'   # does not change whatever the resistivity values.

References

Sombo, P. A., Williams, F., Loukou, K. N., & Kouassi, E. G. (2011). Contribution de la Prospection Électrique à L’identification et à la Caractérisation des Aquifères de Socle du Département de Sikensi (Sud de la Côte d’Ivoire). European Journal of Scientific Research, 64(2), 206–219.
Sombo, P. A. (2012). Application des methodes de resistivites electriques dans la determination et la caracterisation des aquiferes de socle en Cote d’Ivoire. Cas des departements de Sikensi et de Tiassale (Sud de la Cote d’Ivoire). Universite Felix Houphouet Boigny.

watex.utils.smart_label_classifier(arr, /, values=None, labels=None, order='soft', func=None, raise_warn=True)[source]#

Smartly map a numeric array into class labels using a map function or given fixed values.

New classes created from the fixed values can be renamed if labels are supplied.

Parameters:

arr (Arraylike 1d,) – array-like whose items are expected to be categorized.
values (float, list of float) – The threshold values from which the default categorization is fixed.
labels (int, str, or list of [str, int]) – The label values that correspond to the fixed values. Note that the number of labels must be consistent with the number of fixed values plus one, otherwise a ValueError is raised if order is set to strict.
order (str, ['soft'|'strict'], default='soft') – If order is 'strict', the values passed to values must be contained as items in arr. With 'soft', a warning is raised instead when they are not found.
func (callable, optional) – Function to map the given array. If given, values do not need to be supplied.
raise_warn (bool, default=True) – Raise a warning message if order=soft and the fixed values are not found in arr. Also raise a warning if the labels argument does not match the number of classes derived from the fixed values.

Returns:

arr – categorized array with the same length as the raw

Return type:

array-like 1d

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import smart_label_classifier
>>> sc = np.arange (0, 7, .5 )
>>> smart_label_classifier (sc, values = [1, 3.2 ])
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
>>> # rename labels <=1 : 'l1', ]1; 3.2]: 'l2' and >3.2 :'l3'
>>> smart_label_classifier (sc, values = [1, 3.2 ], labels =['l1', 'l2', 'l3'])
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> def f (v):
...     if v <=1: return 'l1'
...     elif 1< v<=3.2: return "l2"
...     else : return "l3"
>>> smart_label_classifier (sc, func= f )
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> smart_label_classifier (sc, values = 1.)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> smart_label_classifier (sc, values = 1., labels='l1')
array(['l1', 'l1', 'l1', 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=object)
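The threshold-based categorization shown in the examples above behaves like right-closed binning, which np.digitize can reproduce. This is a sketch of the idea rather than the function's actual code; the variable names are illustrative.

```python
import numpy as np

sc = np.arange(0, 7, 0.5)
thresholds = [1, 3.2]

# right=True makes each bin right-closed: x <= 1 -> 0, 1 < x <= 3.2 -> 1, x > 3.2 -> 2
codes = np.digitize(sc, thresholds, right=True)

# optional renaming of the integer classes (one label per class)
names = np.array(['l1', 'l2', 'l3'], dtype=object)
labels = names[codes]
```

Two thresholds produce three classes, which is why the number of labels is expected to be the number of fixed values plus one.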

watex.utils.split_train_test(df, test_ratio)[source]#

A naive split of a dataset into train and test sets from a ratio; returns a shuffled train set and test set.

Parameters:

df – a dataframe containing features
test_ratio – the ratio of the data assigned to the test set, between 0 and 1 (for example 0.2 for 20%).

Returns:

a tuple of train set and test set.
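A naive ratio-based split is commonly written as a shuffle of the row indices followed by a cut at the test size. The sketch below shows that pattern on a NumPy array; the function name naive_split and the seed argument are assumptions, and a DataFrame would be indexed with df.iloc instead of direct indexing.

```python
import numpy as np

def naive_split(data, test_ratio, seed=0):
    """Shuffle the row indices and cut them into train and test parts."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(len(data))       # random ordering of the rows
    test_size = int(len(data) * test_ratio)     # number of rows for the test set
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return data[train_idx], data[test_idx]

X = np.arange(20).reshape(10, 2)
train, test = naive_split(X, test_ratio=0.2)
```

Because the split depends only on a random permutation, it is not stratified and a different seed yields different sets.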

watex.utils.to_numeric_dtypes(arr, *, columns=None, return_feature_types=False, missing_values=nan, pop_cat_features=False, verbose=False)[source]#

Convert array to dataframe and coerce arguments to appropriate dtypes.

Parameters:

arr (Ndarray or Dataframe, shape (M=samples, N=features)) – Array or dataframe to convert.
columns (list of str, optional) – Useful to create a dataframe when an array is given. Make sure it fits the number of array columns (shape[1]).
return_feature_types (bool, default=False) – Return the list of numerical and categorical features.
missing_values (float) – Value used to replace missing values or empty strings, if any exist in the dataframe.
pop_cat_features (bool, default=False) – Remove the categorical features from the DataFrame.
verbose (bool, default=False) – Output a message listing the categorical items dropped from the dataframe, if any.

Returns:

df or (df, nf, cf) – also returns nf and cf if return_feature_types is set to True.

Return type:

Dataframe of values casted to numeric types

Examples

>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils.funcutils import to_numeric_dtypes
>>> X, y = load_bagoue (as_frame =True )
>>> X0 =X[['shape', 'power', 'magnitude']]
>>> X0.dtypes
... shape        object
    power        object
    magnitude    object
    dtype: object
>>> df = to_numeric_dtypes(X0)
>>> df.dtypes
... shape         object
    power        float64
    magnitude    float64
    dtype: object

watex.utils.type_(erp)[source]#

Compute the type of anomaly.

The type parameter is defined by the African Hydraulic Study Committee report (CIEH, 2001). Later it was implemented by authors such as (Adam et al., 2020; Michel et al., 2013; Nikiema, 2012). Type helps to differentiate two or several anomalies with the same shape. For instance, two anomalies with the same shape W will differ in the order of priority of their types. The type depends on the lateral resistivity distribution of the underground (resulting from the pace of the apparent resistivity curve) along the whole Electrical Resistivity Profiling survey line. Indeed, four types of anomalies were emphasized:

“EC”, “CB2P”, “NC” and “CP”.

For more details, refer to the references.

Parameters:

erp – array-like - Array of Electrical Resistivity Profiling line composed of apparent resistivity values.

Returns:

str -The type of anomaly.

Example:

>>> import numpy as np
>>> from watex.utils.exmath import type_
>>> rang = np.random.RandomState(42)
>>> test_array2 = rang.randn (7)
>>> type_(np.abs(test_array2))
... 'EC'
>>> long_array = np.abs (rang.randn(71))
>>> type_(long_array)
... 'PC'

References

Adam, B. M., Abubakar, A. H., Dalibi, J. H., Khalil Mustapha, M., & Abubakar, A. H. (2020). Assessment of Gaseous Emissions and Socio-Economic Impacts From Diesel Generators used in GSM BTS in Kano Metropolis. African Journal of Earth and Environmental Sciences, 2(1), 517–523. https://doi.org/10.11113/ajees.v3.n1.104
CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.
Michel, K. A., Drissa, C., Blaise, K. Y., & Jean, B. (2013). Application de méthodes géophysiques à l’étude de la productivité des forages d’eau en milieu cristallin : cas de la région de Toumodi (Centre de la Côte d’Ivoire). International Journal of Innovation and Applied Studies, 2(3), 324–334.
Nikiema, D. G. C. (2012). Essai d’optimisation de l’implantation géophysique des forages en zone de socle : Cas de la province de Séno, Nord Est du Burkina Faso (IRD). (I. / I. Ile-de-France, Ed.). IST / IRD Ile-de-France, Ouagadougou, Burkina Faso, West-africa. Retrieved from http://documentation.2ie-edu.org/cdi2ie/opac_css/doc_num.php?explnum_id=148

watex.utils.vesDataOperator(AB=None, rhoa=None, data=None, typeofop=None, outdf=False)[source]#

Check the data in the given deep measurement and set the suitable operation for duplicated spacing distances of the current electrodes AB.

Sometimes the measurements at a given AB are collected twice after slightly modifying the distance of the potential electrodes (MN). At this point, two or more resistivity values are assigned to the same distance AB (AB remains unchanged while MN is changed). The operation therefore consists of either averaging (mean) the resistivity values, taking their median, or applying leaveOneOut (i.e. keeping one resistivity value among the different values collected at the same point AB) at the same spacing AB. Note that for leaveOneOut, the selected resistivity value is randomly chosen.

Parameters:

AB (array-like 1d) – Spacing of the current electrodes when exploring at depth. It is the depth measurement (AB/2) using the current electrodes AB. Units are in meters.
rhoa (array-like 1d) – Apparent resistivity values collected by imaging in depth. Units are in \(\Omega {.m}\) not \(log10(\Omega {.m})\)
data (DataFrame) – It is composed of the spacing values AB and the apparent resistivity values rhoa. If data is given, params AB and rhoa should be kept to None.
typeofop (str, ['mean'|'median'|'leaveoneout'], default='mean') – Type of operation to apply to the resistivity values rhoa of the duplicated spacing points AB. The default operation is mean.
outdf (bool, default=False) – Output a new dataframe composed of AB and rhoa; data renewed.

Returns:

- Tuple of (AB, rhoa) (New values computed from typeofop)
- DataFrame (New dataframe output only if outdf is True.)
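The mean and median operations on duplicated spacings can be sketched with NumPy by grouping the resistivity values on the unique AB values. The arrays below are invented for illustration and the code is not the library's implementation; the random leaveOneOut choice is left out.

```python
import numpy as np

# hypothetical sounding: AB = 3 is measured twice and AB = 5 three times
AB = np.array([1., 2., 3., 3., 5., 5., 5., 8.])
rhoa = np.array([100., 90., 80., 84., 60., 66., 63., 40.])

# group index of every measurement within the sorted unique spacings
ab_unique, inverse = np.unique(AB, return_inverse=True)

# 'mean': sum of the values in each group divided by the group size
rhoa_mean = np.bincount(inverse, weights=rhoa) / np.bincount(inverse)

# 'median': median of the values sharing the same spacing
rhoa_median = np.array([np.median(rhoa[AB == ab]) for ab in ab_unique])
```

Either way, the output has one resistivity value per unique spacing, which explains why the number of rows shrinks in the example below.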

Notes

By convention AB and MN are half-space dipole length which correspond to AB/2 and MN/2 respectively.

Examples

>>> from watex.utils.exmath import vesDataOperator
>>> from watex.utils.coreutils import vesSelector
>>> data = vesSelector ('data/ves/ves_gbalo.xlsx')
>>> data.shape
... (32, 3) # includes the potential electrode values `MN`
>>> df= vesDataOperator(data.AB, data.resistivity,
                        typeofop='leaveOneOut', outdf =True)
>>> df.shape
... (26, 2) # exclude `MN` values and reduce(-6) the duplicated values.

watex.utils.vesSelector(data=None, *, rhoa=None, AB=None, MN=None, index_rhoa=None, **kws)[source]#

Assert the validity of Vertical Electrical Sounding data and return a sanitize dataframe.

Parameters:

rhoa (array-like) – Apparent resistivities collected during the sounding.
AB (array-like) – Investigation distance between the current electrodes. Note that, by convention, AB equals AB/2: it is taken as the half-space of the investigation depth.
MN (array-like) – Potential electrode distances at each investigation depth. Note that, by convention, the values are half-space and equal MN/2.
data (path-like object or DataFrame) – Path to the sounding file, or the sounding dataframe. If given, the other parameters can be left as None.
index_rhoa (int) – Index used to retrieve the resistivity data of a specific sounding point. Sounding data are sometimes composed of several sounding values collected in the same survey area on different Electrical Resistivity Profiling lines. For instance:

AB/2    MN/2    SE1    SE2    SE3    …    SEn

Where SE are the electrical sounding data values and n is the number of sounding points selected. SE1, SE2 and SE3 are three points selected for Vertical Electrical Sounding, i.e. three sounding points carried out either on the same Electrical Resistivity Profiling line or elsewhere. These sounding data are resistivity data labelled with a specific number. The numbers are commonly chosen at random and do not refer to the expected best fracture zone selected after the prior interpretation. After transformation via the function vesSelector, the header of the data should hold resistivity. For instance, referring to the table above, the data should be:

AB    MN    resistivity    resistivity    resistivity    …

Therefore, index_rhoa is used to select specific resistivity values, i.e. the sounding number of the Vertical Electrical Sounding where the drilling operations are expected to be located, or the one needed for computation. For example, index_rhoa=1 should give:

AB/2    MN/2    SE2    -->    AB    MN    resistivity

If index_rhoa is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. index_rhoa equals 0.

kws (dict) – Additional keyword arguments passed to the Pandas dataframe reader.

Returns:

DataFrame – Sanitized Vertical Electrical Sounding dataframe with AB, MN and resistivity as the column headers.

Example:
>>> from watex.utils.coreutils import vesSelector
>>> df = vesSelector (data='data/ves/ves_gbalo.csv')
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          943
    1   2  0.4         1179
    2   3  0.4         1103
>>> df = vesSelector ('data/ves/ves_gbalo.csv', index_rhoa=3 )
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          457
    1   2  0.4          582
    2   3  0.4          558
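The column selection described for index_rhoa can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation of vesSelector: the helper name `select_sounding`, the list-based table layout and the sample values are all hypothetical.

```python
def select_sounding(header, rows, index_rhoa=None):
    """Hypothetical helper: pick one sounding curve from a multi-curve table.

    header: column names such as ['AB/2', 'MN/2', 'SE1', 'SE2', ...]
    rows:   list of rows, each ordered like `header`
    """
    n_curves = len(header) - 2  # every column after AB/2 and MN/2 is a curve
    if n_curves < 1:
        raise ValueError("expected at least one sounding column")
    # With no index given, fall back to the first sounding curve.
    idx = 0 if index_rhoa is None else index_rhoa
    if not 0 <= idx < n_curves:
        raise IndexError(f"index_rhoa must be in [0, {n_curves - 1}]")
    col = 2 + idx
    return {
        "AB": [row[0] for row in rows],
        "MN": [row[1] for row in rows],
        "resistivity": [row[col] for row in rows],
    }

# Made-up table holding three sounding curves.
header = ["AB/2", "MN/2", "SE1", "SE2", "SE3"]
rows = [
    [1, 0.4, 100.0, 200.0, 300.0],
    [2, 0.4, 110.0, 210.0, 310.0],
]
picked = select_sounding(header, rows, index_rhoa=1)
print(picked["resistivity"])  # the SE2 column
```

Whatever the original sounding label was, the selected column comes back under the single name resistivity, next to AB and MN.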


watex.utils.z2rhoa(z, freq)[source]#

Convert impedance tensor z to apparent resistivity.

Parameters:

z (ndarray, shape (N, M)) – Impedance tensor in \(\Omega\)
freq (array-like , shape (N, )) – Frequency in Hertz

Returns:

ndarray, shape (N, M) – Apparent resistivity in \(\Omega.m\)

Example:

>>> import numpy as np
>>> from watex.utils import z2rhoa
>>> z = np.array([2 + 1j *3 ])
>>> f = np.array ([1014])
>>> z2rhoa(z, f)
... array([[1623.73691735]])
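For readers who want to check the conversion by hand, the sketch below applies the standard magnetotelluric relation \(\rho_a = |z|^2 / (\mu_0 \omega)\) with \(\omega = 2\pi f\). That z2rhoa uses exactly this relation is an assumption, although it does reproduce the value printed in the example above. The helper name `impedance_to_rhoa` is hypothetical, and z is taken as a 2-D array with one row per frequency.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum magnetic permeability in H/m (classical value)

def impedance_to_rhoa(z, freq):
    """Hypothetical helper: apparent resistivity from a complex impedance.

    z:    complex array, shape (N, M), in ohm
    freq: array, shape (N,), in hertz
    """
    z = np.asarray(z, dtype=complex)
    freq = np.asarray(freq, dtype=float)
    omega = 2.0 * np.pi * freq  # angular frequency in rad/s
    # One frequency per row of z: add a trailing axis so that the division
    # broadcasts across the M columns.
    return np.abs(z) ** 2 / (MU0 * omega[:, None])

z = np.array([[2 + 3j]])
f = np.array([1014.0])
rhoa = impedance_to_rhoa(z, f)
print(rhoa)  # about 1623.74 ohm.m for this input
```

Only the modulus of z enters the result, so the phase of the impedance has no effect on the apparent resistivity.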

watex.utils package#

Submodules#