watex.utils package

The utils sub-package offers tools for data handling, parameter computation, model estimation and evaluation, and graph visualization. Extensions of the mathematical concepts and the core of the program are provided by the modules exmath and coreutils, respectively, whereas the machine learning utilities and additional functionalities are provided by mlutils and funcutils, respectively. The plotutils module provides several plotting tools for visualization.

watex.utils.adaptive_moving_average(data, /, window_size_factor=0.1)

Adaptive moving average as a smoothing technique.

Parameters:
  • data (Arraylike) – Noisy data to smooth.

  • window_size_factor (float, default=0.1) – Parameter to control the adaptiveness of the moving average.

Returns:

result – Smoothed data

Return type:

Arraylike

Example

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils.exmath import adaptive_moving_average
>>> # Sample magnetotelluric data (replace this with your own data)
>>> # Example data: a sine wave with noise
>>> time = np.linspace(0, 10, 1000)  # Replace with your actual time values
>>> mt_data = np.sin(2 * np.pi * 1 * time) + 0.2 * np.random.randn(1000)  # Example data
>>> # Define the window size factor (adjust as needed)
>>> window_size_factor = 0.1  # Adjust this value based on your data characteristics
>>> # Apply adaptive moving average to the magnetotelluric data
>>> smoothed_data = adaptive_moving_average(mt_data, window_size_factor)
>>> # Plot the original and smoothed data
>>> plt.figure(figsize=(10, 6))
>>> plt.plot(time, mt_data, 'b-', label='Original Data')
>>> plt.plot(time, smoothed_data, 'r-', label='Smoothed Data (AMA)')
>>> plt.xlabel('Time')
>>> plt.ylabel('Amplitude')
>>> plt.title('Adaptive Moving Average (AMA) Smoothing')
>>> plt.legend()
>>> plt.grid(True)
>>> plt.show()
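For intuition, a plain moving average whose window length is a fraction of the series length can be sketched in pure Python as follows. This is only an illustrative sketch: the helper name moving_average_sketch and the rule window = max(1, int(factor * n)) are assumptions, not the watex implementation, which may adapt its window differently.

```python
def moving_average_sketch(data, window_size_factor=0.1):
    """Centered moving average; hypothetical helper, not the watex code."""
    n = len(data)
    # Assumed rule: the window length is a fraction of the series length.
    window = max(1, int(window_size_factor * n))
    half = window // 2
    smoothed = []
    for i in range(n):
        # Clip the window at both ends of the series.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(data[lo:hi]) / (hi - lo))
    return smoothed


# Made-up alternating signal standing in for noisy data.
noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smooth = moving_average_sketch(noisy, window_size_factor=0.3)
```

A larger window_size_factor averages over more neighbours and therefore flattens the signal more strongly.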
watex.utils.bi_selector(d, /, features=None, return_frames=False)

Auto-differentiates the numerical from categorical attributes.

This is useful for separating the categorical features from the numerical features, and vice versa, when there are many features. Entering features individually becomes tedious and error-prone.

Parameters:
  • d (pandas dataframe) – Pandas dataframe.

  • features (list of str) – List of features in the dataframe columns. An error is raised if any feature does not exist in the frame. Note that if features is None, the categorical and numerical features are returned instead.

  • return_frames (bool, default=False) – If False, return the given features and the remaining columns (features) as lists. If True, return two frames composed of the given features and the remaining features.

Returns:

  • Tuple (list, list) – Lists of the given features and the remaining features.

  • Tuple (pd.DataFrame, pd.DataFrame) – Frames of the given features and the remaining features.

Example

>>> from watex.utils.mlutils import bi_selector
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame # get the frame
>>> data.columns
Index(['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name',
       'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp',
       'short_distance_gamma', 'well_diameter', 'aquifer_group',
       'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping',
       'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping',
       'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k',
       'kp', 'r', 'rp', 'remark'],
      dtype='object')
>>> num_features, cat_features = bi_selector (data)
>>> num_features
...['gamma_gamma',
     'depth_top',
     'aquifer_thickness',
     'pumping_depth_at_the_end',
     'section_aperture',
     'remark',
     'depth_starting_pumping',
     'hole_depth_before_pumping',
     'rp',
     'hole_depth_after_pumping',
     'hole_depth_loss',
     'depth_bottom',
     'sp',
     'pumping_depth',
     'kp',
     'resistivity',
     'short_distance_gamma',
     'r',
     'natural_gamma',
     'layer_thickness',
     'k',
     'well_diameter']
>>> cat_features
... ['hole_id', 'strata_name', 'rock_name', 'aquifer_group',
     'pumping_level']
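The numerical/categorical split can be pictured with a small pure-Python sketch over a dict of columns. The helper split_features_sketch, the rule "a column is numerical when all its non-missing values are numbers", and the toy table are all assumptions made for illustration; they are not how watex decides column types.

```python
from numbers import Number


def split_features_sketch(table):
    """Split the column names of a dict-of-lists table by value type."""
    num_features, cat_features = [], []
    for name, values in table.items():
        # Assumed rule: ignore missing entries, then require numbers only.
        present = [v for v in values if v is not None]
        is_numeric = bool(present) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        (num_features if is_numeric else cat_features).append(name)
    return num_features, cat_features


# Made-up values; only the column names echo the example above.
table = {
    "hole_id": ["H1", "H1"],
    "depth_top": [0.0, 2.5],
    "rock_name": ["coal", None],
    "resistivity": [15.0, 18.0],
}
num, cat = split_features_sketch(table)
```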
watex.utils.bin_counting(data, bin_columns, tname, odds='N+', return_counts=Ellipsis, tolog=Ellipsis)

Bin count a categorical variable and turn it into a probabilistic ratio.

Bin counting is one of the perennial rediscoveries in machine learning. It has been reinvented and used in a variety of applications, from ad click-through rate prediction to hardware branch prediction [Ra4bd71913a2f-1], [Ra4bd71913a2f-2] and [Ra4bd71913a2f-3].

Given an input variable X and a target variable Y, the odds ratio is defined as:

\[\text{odds ratio} = \frac{P(Y = 1 \mid X = 1) / P(Y = 0 \mid X = 1)}{P(Y = 1 \mid X = 0) / P(Y = 0 \mid X = 0)}\]

Probability ratios can easily become very small or very large. The log transform again comes to our rescue. Another useful property of the logarithm is that it turns a division into a subtraction. To convert the bin statistic probability values to their logarithm, set tolog=True.

Parameters:
data: dataframe

Data containing the categorical values.

bin_columns: str or list

The columns to which bin counting is applied.

tname: str, pd.Series

The target name for which the counting is operated. If series, it must have the same length as the data.

odds: str, {“N+”, “N-”, “log_N+”}

The odds ratio of bin counting used to fill the categorical column. N+ and N- are the positive and negative probabilities, whereas log_N+ is the logarithm of the odds ratio, useful when values are very small or very large.

return_counts: bool, default=True

Return the bin-counting dataframe.

tolog: bool, default=False

Apply the logarithm to the output data ratio. Probability ratios can easily become very small or very large. For instance, there will be users who almost never click on ads, and perhaps users who click on ads much more frequently than not. The log transform again comes to our rescue. Another useful property of the logarithm is that it turns a division into a subtraction.

Returns:
d: dataframe

Dataframe transformed or bin-counting data

Examples

>>> import pandas as pd
>>> import watex as wx
>>> from watex.utils.mlutils import bin_counting
>>> X, y = wx.fetch_data ('bagoue analysed', as_frame =True)
>>> # target binarize
>>> y [y <=1] = 0;  y [y > 0]=1
>>> X.head(2)
Out[7]:
      power  magnitude       sfi      ohmS       lwi  shape  type  geol
0  0.191800  -0.140799 -0.426916  0.386121  0.638622    4.0   1.0   3.0
1 -0.430644  -0.114022  1.678541 -0.185662 -0.063900    3.0   2.0   2.0
>>>  bin_counting (X , bin_columns= 'geol', tname =y).head(2)
Out[8]:
      power  magnitude       sfi      ohmS  ...  shape  type      geol  bin_target
0  0.191800  -0.140799 -0.426916  0.386121  ...    4.0   1.0  0.656716           1
1 -0.430644  -0.114022  1.678541 -0.185662  ...    3.0   2.0  0.219251           0
[2 rows x 9 columns]
>>>  bin_counting (X , bin_columns= ['geol', 'shape', 'type'], tname =y).head(2)
Out[10]:
      power  magnitude       sfi  ...      type      geol  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.267241  0.656716           1
1 -0.430644  -0.114022  1.678541  ...  0.385965  0.219251           0
[2 rows x 9 columns]
>>> df = pd.DataFrame ( pd.concat ( [X, pd.Series ( y, name ='flow')],
                                   axis =1))
>>> bin_counting (df , bin_columns= ['geol', 'shape', 'type'],
                  tname ="flow", tolog=True).head(2)
Out[12]:
      power  magnitude       sfi      ohmS  ...     shape      type      geol  flow
0  0.191800  -0.140799 -0.426916  0.386121  ...  0.828571  0.364706  1.913043     1
1 -0.430644  -0.114022  1.678541 -0.185662  ...  0.364865  0.628571  0.280822     0
>>> bin_counting (df , bin_columns= ['geol', 'shape', 'type'],odds ="N-",
                  tname =y, tolog=True).head(2)
Out[13]:
      power  magnitude       sfi  ...      geol  flow  bin_target
0  0.191800  -0.140799 -0.426916  ...  0.522727     1           1
1 -0.430644  -0.114022  1.678541  ...  3.560976     0           0
[2 rows x 10 columns]
>>> bin_counting (df , bin_columns= "geol",tname ="flow", tolog=True,
                  return_counts= True )
Out[14]:
     flow  no_flow  total_flow        N+        N-     logN+     logN-
3.0    44       23          67  0.656716  0.343284  1.913043  0.522727
2.0    41      146         187  0.219251  0.780749  0.280822  3.560976
0.0    18       43          61  0.295082  0.704918  0.418605  2.388889
1.0     9       20          29  0.310345  0.689655  0.450000  2.222222
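The first row of the counts table above (category 3.0, with 44 positive and 23 negative samples) can be reproduced by hand. The sketch below uses a hypothetical helper, bin_odds_sketch, and is not the watex implementation; it only shows the arithmetic behind the N+ and N- columns and the two ratio columns, which in that table equal N+/N- and N-/N+.

```python
def bin_odds_sketch(n_pos, n_neg):
    """Return (N+, N-, N+/N-, N-/N+) for one category; hypothetical helper."""
    total = n_pos + n_neg
    p_pos = n_pos / total  # N+: share of positive targets in the bin
    p_neg = n_neg / total  # N-: share of negative targets in the bin
    return p_pos, p_neg, p_pos / p_neg, p_neg / p_pos


p_pos, p_neg, odds_pos, odds_neg = bin_odds_sketch(44, 23)
print(round(p_pos, 6), round(p_neg, 6), round(odds_pos, 6), round(odds_neg, 6))
# prints: 0.656716 0.343284 1.913043 0.522727
```

Taking math.log of either ratio would turn the division into a difference of logarithms, which is the property the description refers to.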
watex.utils.build_random_thickness(depth, /, n_layers=None, h0=1, shuffle=True, dirichlet_dist=False, random_state=None, unit='m')

Generate random thickness values for a number of layers in depth.

Parameters:
  • depth (ArrayLike, float) – Depth data. If a float is given, the number of layers n_layers must be specified; otherwise an error is raised.

  • n_layers (int, Optional) – Number of layers that fit the samples in depth. If depth is passed as an ArrayLike, n_layers is ignored.

  • h0 (int, default=1) – Thickness of the first layer.

  • shuffle (bool, default=True) – Shuffle the randomly generated thicknesses.

  • dirichlet_dist (bool, default=False) – Draw samples from the Dirichlet distribution. A Dirichlet-distributed random variable can be seen as a multivariate generalization of a Beta distribution. The Dirichlet distribution is a conjugate prior of a multinomial distribution in Bayesian inference.

  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

  • unit (str, default='m') – The reference unit for generated layer thicknesses. Default is meters.

Returns:

thickness – Array of thicknesses with one entry per layer.

Return type:

Arraylike of shape (n_layers, )

Examples

>>> from watex.utils.geotools import build_random_thickness
>>> build_random_thickness (7, 10, random_state =42  )
array([0.41865079, 0.31785714, 1.0234127 , 1.12420635, 0.51944444,
       0.92261905, 0.6202381 , 0.8218254 , 0.72103175, 1.225     ])
>>> build_random_thickness (7, 10, random_state =42 , dirichlet_dist=True )
array([1.31628992, 0.83342521, 1.16073915, 1.03137592, 0.79986286,
       0.8967135 , 0.97709521, 1.34502617, 1.01632075, 0.62315132])
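To illustrate what drawing from a Dirichlet distribution means here, the sketch below samples positive proportions that sum to 1 and scales them to a total thickness. Everything in it is an assumption for illustration: the helper dirichlet_thickness_sketch, the symmetric Dirichlet(1, ..., 1) choice, and the scaling to a fixed total. How watex combines depth, h0, and n_layers is not shown in this section.

```python
import random


def dirichlet_thickness_sketch(total, n_layers, seed=None):
    """Split `total` into n_layers random positive thicknesses."""
    rng = random.Random(seed)
    # A symmetric Dirichlet(1, ..., 1) draw is a vector of Gamma(1, 1)
    # draws divided by their sum: positive proportions summing to 1.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_layers)]
    norm = sum(draws)
    return [total * d / norm for d in draws]


layers = dirichlet_thickness_sketch(total=10.0, n_layers=10, seed=42)
```

Fixing the seed makes the draw reproducible, which is the role random_state plays in the signature above.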
watex.utils.butterworth_filter(data, /, freqs, fs=None, frange=None, order=5, plot=False)

Defines a bandpass filter using a Butterworth filter and then applies it to your AFMT data to remove frequencies outside the specified range.

Adjust the low and high cutoffs given in frange according to the desired frequency range for your data. Removing bad frequencies from data typically involves filtering the data to eliminate unwanted noise or artifacts.

Parameters:
  • data (arraylike 1D) – Noisy data to filter.

  • freqs (Arraylike 1d) – Array of frequencies to which the bandpass filter is applied.

  • fs (int, optional) – Sampling frequency. If None, the number of original frequencies is used.

  • frange (list, Optional) – Frequency range (min/200., max/5) for the bandpass filter (in Hz). By default, the minimum and maximum of the original frequency array are used. Note that digital filter critical frequencies must satisfy 0 < Wn < 1.

  • order (int, default=5) – Order for butter bandpass.

  • plot (bool, default=False) – Visualize the filtered data.

Returns:

y

Return type:

filtered data

Example

>>> import numpy as np
>>> from watex.utils.exmath import butterworth_filter
>>> time = np.linspace(0, 1, 1000)  # Replace with your actual time values
>>> freqs = np.linspace ( 1, 1000, 500)
>>> data = np.sin(2 * np.pi * 10 *freqs) + 0.5 * np.sin(2 * np.pi * 50 *freqs)
>>> _=butterworth_filter (data , freqs , fs = 1000, frange=( 5, 20), plot=True )
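The constraint 0 < Wn < 1 arises when band edges are expressed as fractions of the Nyquist frequency fs / 2, which is the convention scipy.signal.butter uses when no sampling frequency is passed to it. The sketch below shows that normalization in pure Python; the helper normalized_band_sketch is hypothetical, and whether watex normalizes frange in exactly this way is an assumption.

```python
def normalized_band_sketch(frange, fs):
    """Scale band edges (Hz) by the Nyquist frequency fs / 2."""
    nyquist = 0.5 * fs
    low, high = (f / nyquist for f in frange)
    # Both normalized edges must lie strictly between 0 and 1.
    if not 0 < low < high < 1:
        raise ValueError(
            f"band edges must satisfy 0 < low < high < {nyquist} Hz"
        )
    return low, high


wn = normalized_band_sketch((5, 20), fs=1000)
print(wn)  # prints: (0.01, 0.04)
```

With fs = 1000 as in the example above, a band edge at or above 500 Hz would be rejected.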
watex.utils.cattarget(arr, /, func=None, labels=None, rename_labels=None, coerce=False, order='strict')

Categorize array to hold the given identifier labels.

Classifies numerical values according to the given label values. Labels are a list of integers where each integer is the unique identifier of a group of samples in the dataset.

Parameters:
  • arr (array-like |pandas.Series) – Array or series containing numerical values. If non-numerical values are given, an error is raised.

  • func (Callable,) – Function to categorize the target y.

  • labels (int, list of int,) – If an integer is given, it is taken as the number of categories into which ‘y’ is split. For instance, labels=3 applied to the first ten integers yields the label values [0, 1, 2]. If labels are given as a list, its items must be contained in the target ‘y’.

  • rename_labels (list of str;) – List of strings or values that replace the integer label identifiers.

  • coerce (bool, default =False,) – Force the new label names passed to rename_labels to appear in the target, whether or not some integer class labels remain. If coerce is True, the target array takes the dtype of the new array.

Returns:

arr – The category array with unique identifier labels

Return type:

Arraylike |pandas.Series

Examples

>>> import numpy as np
>>> from watex.utils.mlutils import cattarget
>>> def binfunc(v):
...     if v < 3 : return 0
...     else : return 1
>>> arr = np.arange (10 )
>>> arr
... array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> cattarget(arr, func =binfunc)
... array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> cattarget(arr, labels =3 )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
>>> array([2, 2, 2, 2, 1, 1, 1, 0, 0, 0])
>>> cattarget(arr, labels =3 , order =None )
... array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
>>> cattarget(arr[::-1], labels =3 , order =None )
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2]) # reverse does not change
>>> cattarget(arr, labels =[0 , 2,  4]  )
... array([0, 0, 0, 2, 2, 4, 4, 4, 4, 4])
watex.utils.check_flow_objectivity(y, /, values, classes)

Checks the flow rate objectivity.

If the objective is set to flow, i.e. the prediction focuses on the flow rate, the target y needs to meet some conditions when values are passed for class categorization.

Parameters:
  • values – list of values used to encode the numerical target y, for instance values=[0, 1, 2].

  • objective – str, relates to the flow rate prediction. Set to None for any other predictions.

  • prefix

    the prefix to add to the class labels. For instance, if the prefix is FR, class labels will become:

    [0, 1, 2] => [FR0, FR1, FR2]
    

  • classes

    list of class names to replace the default FR that is used to specify the flow rate. For instance, it can be:

    [0, 1, 2] => [sf0, sf1, sf2]
    

Returns:

(y, classes): Tuple, - y: array-like 1d of categorized y - classes: list of flow rate classes.

watex.utils.classify_k(o, /, func=None, kname=None, inplace=False, string=False, default_func=False)

Categorize the permeability coefficient ‘k’

Map the continuous ‘k’ into categorical classes.

Parameters:
  • o (ndarray, pd.Series or Dataframe) – Data containing the continuous values of the permeability coefficient k. If data is passed as a pandas dataframe, the name kname of the column containing the k-values needs to be specified.

  • func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead with the parameter default_func.

  • inplace (bool, default=False) – Modify the object in place and return None.

  • string (bool,) – If set to True, the categories mapped from ‘k’ are prefixed with “k”. However, if a string value is given, the prefix is changed to that string.

  • default_func (bool,) –

    If True, the default function for mapping k is used. Note that it may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is as follows:

    • k0 {0}: k = 0

    • k1 {1}: 0 < k <= .01

    • k2 {2}: .01 < k <= .07

    • k3 {3}: k> .07

Returns:

o – The categorized object. None is returned only if a dataframe is given and inplace is set to True, i.e. the object is modified in place.

Return type:

None, ndarray, Series or Dataframe

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k
>>> _, y0 = load_hlogs (as_frame =True)
>>> # let us visualize four non-NaN values in y0
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
...  array([0.054, 0.054, 0.054, 0.054])
>>> classify_k (y0 , kname ='k', inplace =True, default_func=True )
>>> # let us look again at the same four values in the dataframe
>>> y0.k.values [ ~np.isnan (y0.k ) ][:4]
... array([2., 2., 2., 2.])
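The default thresholds listed under the parameters (k0 to k3) amount to a small step function. The sketch below spells it out in pure Python; the helper default_k_class_sketch is hypothetical, it assumes k is non-negative, and leaving missing values untouched is an assumption rather than documented watex behaviour.

```python
import math


def default_k_class_sketch(k):
    """Map a permeability coefficient to the class index 0..3."""
    if k is None or math.isnan(k):
        return k  # assumption: missing values are left untouched
    if k == 0:
        return 0  # k0: k = 0
    if k <= 0.01:
        return 1  # k1: 0 < k <= .01
    if k <= 0.07:
        return 2  # k2: .01 < k <= .07
    return 3      # k3: k > .07


classes = [default_k_class_sketch(k) for k in (0.0, 0.005, 0.054, 0.2)]
print(classes)  # prints: [0, 1, 2, 3]
```

The value 0.054 from the example above falls in the k2 interval, which is consistent with the 2. shown in the output.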
watex.utils.cleaner(data, /, columns=None, inplace=False, labels=None, func=None, mode='clean', **kws)

Sanitize data or columns by dropping specified labels from rows or columns.

If data is not a pandas dataframe, it is converted to a dataframe and the index is used to drop the labels.

Parameters:
  • data (pd.Dataframe or arraylike2D.) – Pandas dataframe or two-dimensional NumPy array. If a 2D array is passed, it is first converted to a dataframe, and rows are dropped by index.

  • columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

  • labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.

  • func (F, callable) – Universal function used to clean the columns. It is applied only when mode is 'clean'.

  • inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.

  • mode (str, default='clean') – Options or mode of operation to do on the data. It could be [‘clean’|’drop’]. If drop, it behaves like dataframe.drop of pandas.

Returns:

Cleaned DataFrame, or DataFrame without the removed index or column labels; None if inplace=True; an array if data is passed as an array.

Return type:

DataFrame, array2D or None

watex.utils.correlatedfeatures(df, corr='pearson', threshold=0.95, fmt=False)

Find the correlated features/columns in the dataframe.

Highly correlated columns don’t add value and can throw off feature importance and the interpretation of regression coefficients. If there are correlated columns, removing either the level_0 or the level_1 columns from the feature data is a good choice.

Parameters:
  • df (pandas.DataFrame of shape (M, N)) – Dataframe containing M samples and N features.

  • corr (str, ['pearson'|'spearman'|'covariance']) – Method of correlation to perform. Note that ‘pearson’ and ‘covariance’ don’t support string values. If such data is given, set corr to ‘spearman’. Default is ‘pearson’.

  • threshold (float, default is 0.95) – The value from which data are considered correlated. Should not be greater than 1.

  • fmt (bool, default {False}) – format the correlated dataframe values

Returns:

df – Dataframe with columns equal to [level_0, level_1, pearson]

Return type:

pandas.DataFrame

Examples

>>> from watex.utils.mlutils import correlatedfeatures
>>> df_corr = correlatedfeatures (data , corr='spearman',
                                 fmt=None, threshold=.95
                                 )
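The idea of listing column pairs whose correlation exceeds a threshold can be sketched without pandas. The helpers pearson and correlated_pairs_sketch, the use of the absolute value of the coefficient, and the toy columns are assumptions for illustration; this is not the watex implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs_sketch(columns, threshold=0.95):
    """Return (level_0, level_1, r) for pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:  # assumption: sign is ignored
            pairs.append((name_a, name_b, r))
    return pairs


columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so perfectly correlated
    "c": [1.0, -1.0, 1.0, -1.0],
}
pairs = correlated_pairs_sketch(columns)
```

Only the pair ("a", "b") passes the 0.95 threshold here, so one of those two columns would be a candidate for removal.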
watex.utils.defineConductiveZone(erp, station=None, position=None, auto=False, index='py', **kws)

Define conductive zone as subset of the erp line.

The conductive zone is a specific zone expected to hold the drilling location station. If the drilling location is not provided, it defaults to the very low resistivity values found in the erp line.

Parameters:
  • erp (array_like,) – Array containing the apparent resistivity values.

  • station (str or int,) – Station position name.

  • position (float,) – Station position value.

  • auto (bool) – If True, the station position is the position of the lowest resistivity value in Electrical Resistivity Profiling.

  • index (str,) –

Returns:

  • conductive zone of resistivity values

  • conductive zone positioning

  • station position index in the conductive zone

  • station position index in the whole ERP line

Example:
>>> import numpy as np
>>>
>>> from watex.utils.coreutils import defineConductiveZone
>>> test_array = np.random.randn (10)
>>> selected_cz ,*_ = defineConductiveZone(test_array, 's20')
>>> shortPlot(test_array, selected_cz )
watex.utils.erpSelector(f, columns=Ellipsis, force=False, utm_zone=None, epsg=None, verbose=0.0, **kws)

Read and sanitize the data collected from the survey.

Data should be an array, a dataframe, a series, or arranged in .csv or .xlsx format. Be sure to provide the header of each column in the worksheet. If a file is given, the header columns should be arranged as ['station','resistivity' ,'longitude', 'latitude']. Note that the coordinate columns (longitude and latitude) are not compulsory.

Parameters:
  • f (Path-like object, ndarray, Series or Dataframe,) – If a path-like object is given, only the .csv and .xlsx file formats can be parsed. However, if an ndarray is given and its shape along axis 1 is greater than 4, the ndarray is shrunk.

  • columns (list) – list of the valuable columns. It can be used to fix along the axis 1 of the array the specific values. It should contain the prefix or the whole name of each item in ['station','resistivity' ,'longitude', 'latitude'].

  • force (bool, default=False,) – If vertical electrical sounding (VES) data is passed while ERP data is expected, setting force to True treats the VES data as ERP data and uses only the resistivity values in the VES data. This will yield invalid results, especially when parameter computation is needed.

  • verbose (int,) – Verbosity level; outputs more messages if True.

  • utm_zone (string, optional) –

    zone number and ‘S’ or ‘N’ e.g. ‘55S’. Default to the centre point of the provided points. If given, the longitude/latitude are computed from valid easting/northing coordinates.

    New in version 0.2.1.

  • epsg (int) – epsg number defining projection (see http://spatialreference.org/ref/ for more info). Overrides utm_zone if both are provided

  • kws (dict) – Additional keyword arguments for the pandas pd.read_csv and pd.read_excel methods. Be sure to provide the right arguments when reading f. For instance, providing the sep=',' argument when the file to read is in xlsx format will raise an error. Indeed, the sep parameter is acceptable for parsing the .csv file format only.

Return type:

DataFrame with valuable column(s).

Notes

The length of acceptable columns is 4. If the size of the columns is higher than 4, the data is shrunk to match the expected columns. Furthermore, if the header is not specified in f, the default column arrangement is used. Therefore, the second column is considered as the resistivity column.

Examples

>>> import numpy as np
>>> from watex.utils.coreutils import erpSelector
>>> df = erpSelector ('data/erp/testsafedata.csv')
>>> df.shape
... (45, 4)
>>> list(df.columns)
... ['station','resistivity', 'longitude', 'latitude']
>>> df = erpSelector('data/erp/testunsafedata.xlsx')
>>> list(df.columns)
... ['easting', 'station', 'resistivity', 'northing']
>>> df = erpSelector(np.random.randn(7, 7))
>>> df.shape
... (7, 4)
>>> list(df.columns)
... ['station', 'resistivity', 'longitude', 'latitude']
watex.utils.erpSmartDetector(constr, erp, station=None, coerce=False, return_cz=False, view=False, raise_warn=True, **plot_kws)

Automatically detect the drilling location by involving the constraints observed in the survey area.

Consider the constraints on the survey area and detect the suitable drilling location. Commonly the station is not needed when using the constraints, since giving a station indicates that the user is aware of the reason for selecting it. However, in case of doubt, the user can set the parameter coerce to True.

Parameters:
  • constr (list, dict) –

    List of restricted stations. The constraint or restricted stations are the stations to ignore when selecting the best drilling location. This is useful since, in DWSC, not all stations are presumed to be technically suitable for proposing a drilling. For instance, if some stations are close to a household waste site, these stations must be listed and ignored.

    If constr is passed as a dictionary, the keys are the restricted stations and the values are the reasons why each station is restricted. For instance:

    constr = {"s02": "station close to the household waste",
              "S25": "station is located in a marsh area."
              }
    

  • erp (array-like 1d) – DC profiling ERP resistivity values

  • station (str, optional) – The station of the presumed location for drilling operations. Commonly the station is not needed when using the constraints. If the station is given whereas coerce=False, an error is raised to warn the user. To force considering the station in the auto-detection, coerce must be set to True.

  • coerce (bool, default=False,) – Allow the station to be considered in the auto-detection.

  • raise_warn (bool, default=True,) – warn the user whether a suitable location is found or not. Returns None otherwise.

  • view (bool, default=False,) – Plot the conductive zone and restricted stations.

  • plot_kws (dict,) – Additional plotting keywords arguments passed to plotAnomaly().

Returns:

(station |None) or cz, cs – Station for the drilling operations detected automatically. If no station is detected, None is returned. If return_cz is True, the station and the conductive zone are returned, as well as the restricted station position numbers.

Return type:

str,

See also

watex.plotAnomaly

Plot DC profiling ERP and conductive zone.

Examples

>>> import numpy as np
>>> from watex.datasets import make_erp
>>> from watex.utils.coreutils import erpSmartDetector
>>> resistivity = make_erp (n_stations =50 , as_frame=True, seed=125).resistivity
>>> # get the index of the minimum resistivity value
>>> int(np.argmin(resistivity))
42
>>> erpSmartDetector (['s42'], resistivity )
'S13'
>>> # S42 is rejected and selected another zone presumed to be better.
>>> constraints ={"S00": "Marsh area. ",
                  "S10": " Municipality square, no authorization to make drill",
                  "S29": "Heritage site",
                  "S46": "Household waste site",
                  "S42": "Household waste site"
                  }
>>> erpSmartDetector (constraints, resistivity)
'S16'
>>> erpSmartDetector (['s12', 's40'], resistivity)
'S29'
>>> # station 42 close s40 is rejected too.
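The core idea of excluding restricted stations before choosing a location can be sketched as below. This is a deliberately naive illustration: the helper pick_station_sketch, the "S" plus two-digit index naming, and the rule "take the lowest remaining resistivity" are assumptions. As the examples above suggest, the watex function also reasons about conductive zones and neighbouring stations, which this sketch does not attempt.

```python
def pick_station_sketch(resistivity, constraints):
    """Name of the lowest-resistivity station not listed in constraints."""
    # Accept a list of names or a dict {name: reason}; ignore letter case.
    restricted = {str(name).upper() for name in constraints}
    best_name, best_value = None, None
    for index, value in enumerate(resistivity):
        name = f"S{index:02d}"  # assumed naming scheme: S00, S01, ...
        if name in restricted:
            continue
        if best_value is None or value < best_value:
            best_name, best_value = name, value
    return best_name


values = [120.0, 80.0, 35.0, 60.0, 40.0]  # made-up resistivities
print(pick_station_sketch(values, {"s02": "marsh area"}))  # prints: S04
```

Without constraints the sketch would return S02, the global minimum; restricting it moves the choice to the next-lowest station.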
watex.utils.evalModel(model, X, y, Xt, yt=None, scorer='accuracy', eval=False, **kws)

Evaluate a model and quickly test its score with metric scorers.

Parameters:
  • model (Callable, {'preprocessor + estimator'} | estimator,) – The preprocessor is a list of steps for data handling, all encapsulated in the pipeline. model can also be a simple estimator with a fit method.

  • X (N-d array, shape (N, M)) – The training set composed of N-columns and the M-samples. The feature set excludes the target y.

  • y (arraylike , shape (M)) – the target is composed of M-examples in supervised learning.

  • Xt (N-d array, shape (N, M)) – test set array composed of N-columns and the M-samples. The feature set excludes the target y.

  • yt (arraylike , shape (M)) – test label (or test target) composed of M-examples in supervised learning.

  • scorer (str, Callable,) –

    a scorer is a metric function for model evaluation. If given as string it should be the prefix of the following metrics:

    • 'classification_report' -> for classification_report,

    • 'precision_recall' -> for precision_recall_curve,

    • 'confusion_matrix' -> for confusion_matrix,

    • 'precision' -> for precision_score,

    • 'accuracy' -> for accuracy_score,

    • 'mse' -> for mean_squared_error,

    • 'recall' -> for recall_score,

    • 'auc' -> for roc_auc_score,

    • 'roc' -> for roc_curve,

    • 'f1' -> for f1_score.

    Any other string prefix raises an error.

  • kws (dict,) – Additional keyword arguments passed to the scikit-learn metric function.

Returns:

Tuple – the model score or the predicted y if predict is set to True.

Return type:

(score, ypred)
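The scorer strings listed under the scorer parameter map to scikit-learn metric names. A minimal lookup can be sketched as follows; the table SCORER_NAMES and the helper resolve_scorer_sketch are hypothetical, the sketch uses exact-key matching only (how watex matches prefixes is not specified here), and it returns metric names as strings so that it runs without scikit-learn installed.

```python
# Scorer key -> name of the corresponding function in sklearn.metrics.
SCORER_NAMES = {
    "classification_report": "classification_report",
    "precision_recall": "precision_recall_curve",
    "confusion_matrix": "confusion_matrix",
    "precision": "precision_score",
    "accuracy": "accuracy_score",
    "mse": "mean_squared_error",
    "recall": "recall_score",
    "auc": "roc_auc_score",
    "roc": "roc_curve",
    "f1": "f1_score",
}


def resolve_scorer_sketch(scorer):
    """Return the metric name for a scorer key, or raise ValueError."""
    key = str(scorer).strip().lower()
    try:
        return SCORER_NAMES[key]
    except KeyError:
        raise ValueError(
            f"unknown scorer {scorer!r}; expected one of {sorted(SCORER_NAMES)}"
        ) from None


metric_name = resolve_scorer_sketch("accuracy")
```

With scikit-learn available, getattr(sklearn.metrics, metric_name) would then give the callable to apply to the true and predicted targets.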

watex.utils.findCatandNumFeatures(df=None, features=None, return_frames=False)

Retrieve the categorical and numerical features from all the features of a dataset.

Parameters:
  • df (Dataframe) – Dataframe with columns composing the features

  • features (list of str,) – List of the column names. If the dataframe is big, only the required features can be set. If features are provided, the frame is shrunk to the given features before the search for numerical and categorical features. Note that an error is raised if any of the features is missing in the dataframe.

  • return_frames (bool,) – If set to True, two separate dataframes (cat & num) are returned; otherwise, only the cat and num column names are returned.

Returns:

Tuple

Return type:

cat_features and num_features names or frames

Examples

>>> from watex.datasets import fetch_data
>>> from watex.utils import findCatandNumFeatures
>>> data = fetch_data ('bagoue original').get('data=dfy2')
>>> cat, num = findCatandNumFeatures(data)
>>> cat, num
... (['type', 'geol', 'shape', 'name', 'flow'],
 ['num', 'east', 'north', 'power', 'magnitude', 'sfi', 'ohmS', 'lwi'])
>>> cat, num = findCatandNumFeatures(
    data, features = ['geol', 'ohmS', 'sfi'])
... (['geol'], ['ohmS', 'sfi'])
watex.utils.find_aquifer_groups(arr_k, /, arr_aq=None, kname=None, aqname=None, subjectivity=False, default_arr=None, keep_label_0=False, method='naive')

Fit the group of aquifer and find the representative of each true label in array ‘k’ in the aquifer group array.

The idea is to find the aquifer group that best fits the true label ‘X’ in ‘y_true’.

‘arr_k’ and ‘arr_aq’ must contain class labels, not continuous values.

Parameters:
  • arr_k (array_like, pandas series or dataframe) – Array-like that contains the permeability coefficients ‘k’. If a dataframe is supplied, the permeability coefficient column name ‘kname’ must be specified.

  • arr_aq (array-like , pandas series or dataframe) – Array-like that contains the aquifer groups. If NaN values exist in the aquifer groups, it is suggested to impute them before feeding the data to the algorithm. Missing values are not allowed. If a dataframe is supplied, the aquifer group column name ‘aqname’ must be specified.

  • kname (str, int) –

    Name of permeability coefficient columns. kname allows to retrieve the permeability coefficient ‘k’ in a specific dataframe. If integer is passed, it assumes the index of the dataframe fits the ‘k’ columns. Note that integer value must not be out the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • aqname (str, optional,) –

    Name of aquifer group columns. aqname allows to retrieve the aquifer group arr_aq value in a specific dataframe. Commonly aqname needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • subjectivity (bool, default=False) – Considers each class label as a naive group of aquifer. Subjectivity occurs when no aquifer group is found in the data, so each class label is considered as a naive group of aquifer. For practical reasons, it is strongly recommended to pass a default group to the parameter default_arr to substitute for the aquifer groups. For instance, it can be the layer collected at a specific depth, like the ‘strata’ column.

  • default_arr (array-like, pd.Series) – Array used as default to substitute for the aquifer groups if the latter are missing. This is a heuristic option and it might lead to breaking code or invalid results.

  • keep_label_0 (bool, default=False) – The prediction may already include the label 0. However, a predicted label of 0 would refer to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is treated as an undefined permeability coefficient, so a predicted ‘0’ in the target should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, all class labels are shifted by +1, so that ‘0’ becomes ‘1’ and the ‘0’ label is excluded. To force the inclusion of 0 in the labels, set keep_label_0 to True.

  • method (str ['naive', 'strict'], default='naive') –

    The strategy used to compute the representativity of a label in the predicted array ‘arr_aq’:

    • naive computes the importance of a group from its number of occurrences for one specific label in the array ‘k’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in arr_k.

    • strict computes the importance of a group from its number of occurrences in the whole valid arr_k, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable results if the labels in arr_k are not unbalanced.

Returns:

_Group – Use attribute .groups to find the group values.

Return type:

_Group class object

Examples

  1. Use the real aquifer group collected in the area

>>> from watex.utils import naive_imputer, read_data, reshape
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k, find_aquifer_groups
>>> b= load_hlogs () #just taking the target names
>>> data = read_data ('data/boreholes/hf.csv') # read complete data
>>> y = data [b.target_names]
>>> # impute the missing values found in aquifer group columns
>>> # reshape 1d array along axis 0 for imputation
>>> agroup_imputed = naive_imputer ( reshape (y.aquifer_group, axis =0 ) ,
...                                    strategy ='most_frequent')
>>> # reshape back to array_like 1d
>>> y.aquifer_group =reshape (agroup_imputed)
>>> # categorize the 'k' continous value in 'y.k' using the default
>>> # 'k' mapping func
>>> y.k = classify_k (y.k , default_func =True)
>>> # get the group obj
>>> group_obj = find_aquifer_groups(y.k, y.aquifer_group)
>>> group_obj
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141  %',
                           [('Groups', {'V': 0.32, 'IV': 0.266, 'II': 0.236,
                                        'III': 0.158, 'IV&V': 0.01,
                                        'II&III': 0.005, 'III&IV': 0.005}),
                            ('Representativity', ( 'V', 0.32)),
                            ('Similarity', 'V')])],
        Label=[' 2 ',
              Preponderance( rate = ' 19.11  %',
                           [('Groups', {'III': 0.274, 'II': 0.26, 'V': 0.26,
                                        'IV': 0.178, 'III&IV': 0.027}),
                            ('Representativity', ( 'III', 0.27)),
                            ('Similarity', 'III')])],
        Label=[' 3 ',
              Preponderance( rate = '27.749  %',
                           [('Groups', {'V': 0.443, 'IV': 0.311, 'III': 0.245}),
                            ('Representativity', ( 'V', 0.44)),
                            ('Similarity', 'V')])],
             )

  2. Use subjectivity and set the strata column as the default array

>>> find_aquifer_groups(y.k, subjectivity=True, default_arr= X.strata_name )
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141  %',
                           [('Groups', {'siltstone': 0.35, 'coal': 0.227,
                                        'fine-grained sandstone': 0.158,
                                        'medium-grained sandstone': 0.094,
                                        'mudstone': 0.079,
                                        'carbonaceous mudstone': 0.054,
                                        'coarse-grained sandstone': 0.03,
                                        'coarse': 0.01}),
                            ('Representativity', ( 'siltstone', 0.35)),
                            ('Similarity', 'siltstone')])],
        Label=[' 2 ',
              Preponderance( rate = ' 19.11  %',
                           [('Groups', {'mudstone': 0.288, 'siltstone': 0.205,
                                        'coal': 0.192,
                                        'coarse-grained sandstone': 0.137,
                                        'fine-grained sandstone': 0.137,
                                        'carbonaceous mudstone': 0.027,
                                        'medium-grained sandstone': 0.014}),
                            ('Representativity', ( 'mudstone', 0.29)),
                            ('Similarity', 'mudstone')])],
        Label=[' 3 ',
              Preponderance( rate = '27.749  %',
                           [('Groups', {'mudstone': 0.245, 'coal': 0.226,
                                        'siltstone': 0.217,
                                        'fine-grained sandstone': 0.123,
                                        'carbonaceous mudstone': 0.066,
                                        'medium-grained sandstone': 0.066,
                                        'coarse-grained sandstone': 0.057}),
                            ('Representativity', ( 'mudstone', 0.24)),
                            ('Similarity', 'mudstone')])],
             )
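
The difference between the ‘naive’ and ‘strict’ strategies can be sketched on a toy example. This is only an interpretation of the parameter description above, not the library code: the data, the variable names and the normalization choices below are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: categorized 'k' labels and the aquifer group of each sample.
arr_k = [1, 1, 1, 2]
arr_aq = ["V", "V", "IV", "III"]

pairs = Counter(zip(arr_k, arr_aq))   # occurrences of each (label, group) pair
per_label = Counter(arr_k)            # occurrences of each 'k' label
total = len(arr_k)                    # all valid samples

# 'naive' reading: share of a group among the samples of one label only.
naive = {pair: n / per_label[pair[0]] for pair, n in pairs.items()}

# 'strict' reading: share of a (label, group) pair among all valid samples.
strict = {pair: n / total for pair, n in pairs.items()}

print(round(naive[(1, "V")], 3))   # 0.667
print(strict[(1, "V")])            # 0.5
```

With the naive reading the rates of one label sum to 1, which is what the ‘Groups’ dictionaries in the first example display.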
watex.utils.find_similar_labels(y_true, y_pred, *, categorize_k=False, threshold=None, func=None, keep_label_0=False, method='naive', return_groups=False, **kwd)[source]#

Find similarities between y_true and y_pred and return the rate.

Parameters:
  • y_true (array-like 1d or pandas.Series) – Array containing the true labels of ‘k’.

  • y_pred (array-like or pandas.Series) – Array containing the predicted naive group of aquifers (NGA).

  • categorize_k (bool) – If set to True, the user needs to provide a function func to map or categorize the permeability coefficient ‘k’ into integer labels.

  • func (callable) – Function used to map the permeability coefficient column of the dataframe or series. If not given, the default function can be enabled instead through the parameter default_func.

  • threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to one of the NGA labels in ‘y_pred’. The default is ‘None’, which means that no rule is applied and the label with the highest preponderance or occurrence in the data is considered the most representative and similar. Fixing the threshold is recommended, especially for a huge dataset.

  • keep_label_0 (bool, default=False) – Force the inclusion of 0 in the predicted labels if set to True. Label ‘0’ mostly refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is treated as an undefined permeability coefficient, so a predicted ‘0’ in the target should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, ‘0’ is removed by default in the ‘k’ categorization. However, when the prediction ‘y_pred’ comes from an unsupervised method, it straightforwardly includes ‘0’, i.e. ‘k=0’, as a first class. So the value +1 is used to shift all class labels, thereby excluding the ‘0’ label. To force the inclusion of 0 in the labels, set keep_label_0 to True.

  • method (str ['naive', 'strict'], default='naive') –

    The strategy used to compute the representativity of a label in the predicted array ‘y_pred’:

    • naive computes the importance of a group from its number of occurrences for one specific label in the array ‘y_true’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in y_true.

    • strict computes the importance of a group from its number of occurrences in the whole valid y_true, i.e. relative to the total occurrences of all the labels that exist in the whole ‘y_pred’. This can give suitable results if the labels are not unbalanced.

  • return_groups (bool, default=False) – Returns the label groups and their value counts in the predicted labels y_pred where ‘k’ values are not missing.

Returns:

  • g.similarity – Tuple of labels found that are considered similar in the predicted labels.

  • g.group – Tuple of groups that have their similarity in the true labels.

Example

>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import find_similar_labels, classify_k
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap = classify_k(data.k , default_func =True)
>>> # Note that for the demo we use the group of aquifer columns, however
>>> # in a practical example, y_pred must be the predicted NGA labels. This
>>> # is possible using the function <predict_NGA_labels>
>>> sim = find_similar_labels(y_true= ymap, y_pred=data.aquifer_group)
>>> sim
... ((1, 'V'), (2, 'III'), (3, 'V'))
>>> group= find_similar_labels(ymap, data.aquifer_group, return_groups=True)
>>> group
... ((1,
  {'V': 0.17,
   'IV': 0.141,
   'II': 0.126,
   'III': 0.084,
   'IV&V': 0.005,
   'II&III': 0.003,
   'III&IV': 0.003}),
 (2, {'III': 0.052, 'II': 0.05, 'V': 0.05, 'IV': 0.034, 'III&IV': 0.005}),
 (3, {'V': 0.123, 'IV': 0.086, 'III': 0.068}))
>>> find_similar_labels(y_true= ymap, y_pred=data.aquifer_group,
...                     threshold = 0.15)
... [(1, 'V')]
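
The effect of threshold can be illustrated from the group rates printed above. The helper below is hypothetical (pick_similar is not part of watex) and only reproduces one plausible reading of the rule: keep the dominant group of each label, and drop it when its rate falls below the threshold.

```python
# Group rates copied from the return_groups=True output above.
groups = (
    (1, {"V": 0.17, "IV": 0.141, "II": 0.126, "III": 0.084,
         "IV&V": 0.005, "II&III": 0.003, "III&IV": 0.003}),
    (2, {"III": 0.052, "II": 0.05, "V": 0.05, "IV": 0.034, "III&IV": 0.005}),
    (3, {"V": 0.123, "IV": 0.086, "III": 0.068}),
)

def pick_similar(groups, threshold=None):
    """Hypothetical helper: keep the dominant group of each label."""
    similar = []
    for label, rates in groups:
        best = max(rates, key=rates.get)      # most represented group
        if threshold is None or rates[best] >= threshold:
            similar.append((label, best))
    return similar

print(pick_similar(groups))                  # [(1, 'V'), (2, 'III'), (3, 'V')]
print(pick_similar(groups, threshold=0.15))  # [(1, 'V')]
```

Under this reading only label 1 passes a threshold of 0.15, because its dominant group ‘V’ has a rate of 0.17, which agrees with the output shown in the example.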
watex.utils.fittensor(refreq, compfreq, z, fill_value=nan)[source]#

Fit each tensor component to the complete frequency range.

The complete frequency is the frequency array with clean data. It contains the whole frequency range recorded on the site. During the survey, missing frequencies lead to missing tensor data. The function therefore marks where the tensor data is missing and fits the tensor to the reference frequencies.

Parameters:
  • refreq (ArrayLike) – Reference frequency. Should be the complete frequency collected in the field.

  • compfreq (array-like) – The specific frequency collected at the site. Sometimes, due to interference, the frequency at an individual site can differ from the complete one. However, the frequency values at the individual site must be included in the complete frequency refreq.

  • z (array-like) – Tensor values (real or imaginary part) at the component xx, xy, yx or yy.

  • fill_value (float, default='NaN') – Value used to replace the missing data in tensors.

Returns:

Z – New Z filled with the invalid value NaN where the frequency is missing in the data.

Return type:

Arraylike

Examples

>>> import numpy as np
>>> from watex.utils.exmath import fittensor
>>> refreq = np.linspace(7e7, 1e0, 20) # 20 frequencies as reference
>>> freq_ = np.hstack ((refreq.copy()[:7], refreq.copy()[12:] ))
>>> z = np.random.randn(len(freq_)) *10 # assume length of  freq as
...                 # the same like the tensor Z value
>>> zn  = fittensor (refreq, freq_, z)
>>> z # some frequency values are missing but not visible.
...array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,   9.85956253,
         3.96269863, -10.38325124,  -4.29739755,  -8.2591703 ,
        21.7930423 ,   0.21709129,   4.07815217])
>>> # zn shows where the frequencies are missing
>>> # a NaN value means a missing value in tensor Z at that specific frequency
>>> zn
... array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,          nan,
                nan,          nan,          nan,          nan,
         9.85956253,   3.96269863, -10.38325124,  -4.29739755,
        -8.2591703 ,  21.7930423 ,   0.21709129,   4.07815217])
>>> # let us visualize where the frequency values are missing in tensor Z
>>> refreq
... array([7.00000000e+07, 6.63157895e+07, 6.26315791e+07, 5.89473686e+07,
       5.52631581e+07, 5.15789476e+07, 4.78947372e+07, 4.42105267e+07*,
       4.05263162e+07*, 3.68421057e+07*, 3.31578953e+07*, 2.94736848e+07*,
       2.57894743e+07, 2.21052638e+07, 1.84210534e+07, 1.47368429e+07,
       1.10526324e+07, 7.36842195e+06, 3.68421147e+06, 1.00000000e+00])
>>> refreq[np.isnan(zn)] #we can see the missing value between [7:12](*) in refreq
... array([44210526.68421052, 40526316.21052632, 36842105.73684211,
       33157895.2631579 , 29473684.78947368])
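
The idea behind the fit can be sketched with NumPy. This is a simplified, hypothetical re-implementation (fit_to_reference is not a watex function): it assumes that compfreq is ordered like refreq and that its values match refreq exactly.

```python
import numpy as np

def fit_to_reference(refreq, compfreq, z, fill_value=np.nan):
    """Hypothetical sketch: place z on the reference frequency grid."""
    refreq = np.asarray(refreq, dtype=float)
    compfreq = np.asarray(compfreq, dtype=float)
    z = np.asarray(z, dtype=float)
    # Start from an array filled with the placeholder for missing data.
    z_full = np.full(refreq.shape, fill_value, dtype=float)
    # Boolean mask of the reference frequencies recorded at this site.
    recorded = np.isin(refreq, compfreq)
    # Assumes len(z) equals the number of recorded frequencies.
    z_full[recorded] = z
    return z_full

refreq = np.array([100.0, 50.0, 25.0, 10.0, 1.0])
compfreq = np.array([100.0, 50.0, 1.0])   # 25 Hz and 10 Hz are missing
z = np.array([3.0, 2.0, 1.0])
zn = fit_to_reference(refreq, compfreq, z)
print(refreq[np.isnan(zn)])               # frequencies without tensor data
```

Real field frequencies rarely match bit for bit, so a tolerance-based comparison would be a safer choice than the exact match used here.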
watex.utils.get2dtensor(z_or_edis_obj_list, /, tensor='z', component='xy', kind='modulus', return_freqs=False, **kws)[source]#

Build a two-dimensional array from a collection of impedance tensors Z.

Outputs a 2D resistivity, phase, error or tensor matrix from a collection of EDI objects.

The matrix shape is the number of frequencies times the number of sites. The function checks whether the data of all frequencies are available; missing values are filled with NaN. Note that each element of z has dimension (nfreq, 2, 2), for:

xx ( 0, 0) ------- xy ( 0, 1)
yx ( 1, 0) ------- yy ( 1, 1)
Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.

  • tensor (str, default='z') – Tensor name. Can be [ resistivity|phase|z|frequency]

  • component (str, default='xy' (TE mode)) – EM mode. Can be [‘xx’, ‘xy’, ‘yx’, ‘yy’]

  • out (str) – Kind of data to output. Be sure to provide the component so that the attribute can be retrieved from the collection object, for instance resxy for the xy component. Except for the error and frequency attributes, omitting the component raises an error. Default is resxy.

  • kind (str, default='modulus') – Controls the tensor output. Note that the tensor is a complex ndarray of shape (nfreq, 2, 2). If set to modulus, the modulus of the complex tensor is returned. If real or imag, only the corresponding part is returned. Default is modulus.

  • return_freqs (bool, default=False) – If True, also returns the full frequency range.

  • kws (dict) – Additional keyword arguments passed to EM.getfullfrequency.

Returns:

mat2d – Matrix of shape (number of frequencies, number of EDI objects collected), the latter corresponding to the number of stations/sites.

Return type:

arraylike2d

Examples

>>> from watex.datasets import load_huayuan
>>> from watex.methods import get2dtensor
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> data = box.data
>>> phase_yx = get2dtensor ( data, tensor ='phase', component ='yx')
>>> phase_yx.shape
(56, 7)
>>> phase_yx [0, :]
array([        nan,         nan,         nan,         nan, 18.73244951,
       35.00516522, 59.91093054])
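
The reshaping described above can be sketched with plain NumPy arrays. The snippet does not use EDI objects: z_list is a hypothetical stand-in for the per-site tensors, to_2d is an illustrative helper, and every site is assumed to share the same frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_sites = 4, 3
# Hypothetical per-site tensors: each one is complex with shape (nfreq, 2, 2).
z_list = [rng.normal(size=(n_freq, 2, 2)) + 1j * rng.normal(size=(n_freq, 2, 2))
          for _ in range(n_sites)]

# Row/column position of each component in the 2 x 2 tensor.
index = {"xx": (0, 0), "xy": (0, 1), "yx": (1, 0), "yy": (1, 1)}

def to_2d(z_list, component="xy", kind="modulus"):
    """Stack one component of every site into an (nfreq, nsites) matrix."""
    i, j = index[component]
    mat = np.column_stack([z[:, i, j] for z in z_list])
    if kind == "modulus":
        return np.abs(mat)
    if kind == "real":
        return mat.real
    if kind == "imag":
        return mat.imag
    return mat  # complex values kept as-is

mat2d = to_2d(z_list, component="yx", kind="modulus")
print(mat2d.shape)  # (4, 3)
```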
watex.utils.getGlobalScore(cvres)[source]#

Retrieve the global mean and standard deviation score from the cross validation containers.

Parameters:

cvres (dict of Array-like, Shape (N, )) – Cross-validation results obtained after training the models, where N is the number of parameter combinations.

Returns:

tuple (‘mean_test_scores’, ‘std_test_scores’) – Mean scores on the test set and the standard deviation of these scores.
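
As a rough sketch of what such a summary could look like, the snippet below averages the per-candidate scores of a cross-validation container. The key names follow scikit-learn's cv_results_ convention and are an assumption about the content of cvres; global_score is a hypothetical helper, not the library function.

```python
from statistics import fmean

# Hypothetical cross-validation container; the key names are borrowed from
# scikit-learn's `cv_results_` and are an assumption about `cvres`.
cvres = {
    "mean_test_score": [0.80, 0.90, 0.70],
    "std_test_score": [0.02, 0.04, 0.03],
}

def global_score(cvres):
    """Average the per-candidate mean and standard deviation scores."""
    return fmean(cvres["mean_test_score"]), fmean(cvres["std_test_score"])

mean_score, std_score = global_score(cvres)
print(round(mean_score, 3), round(std_score, 3))  # 0.8 0.03
```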

watex.utils.get_aquifer_section(arr_k, /, zname=None, kname=None, z=None, return_index=False, return_sections=True)[source]#

Detect a single aquifer section (upper and lower) in depth.

This is a useful trick to compute the thickness of the aquifer.

Parameters:
  • arr_k (ndarray or dataframe) – Data that mainly contains the aquifer values. It can also contain the depth values. If the depth is included in arr_k, zname needs to be supplied so that the depth can be recovered.

  • zname (str, int) – Name of the depth column. zname allows retrieving the depth column from a dataframe. If an integer is passed, it is taken as the positional index of the depth column and must not exceed the dataframe size along axis 1. zname usually needs to be supplied when a dataframe is passed as a function argument.

  • kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the positional index of the ‘k’ column and must not exceed the dataframe size along axis 1. kname usually needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two dimensions or more are not allowed. If z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.

  • return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of the aquifer found in the dataframe arr_k.

  • return_sections (bool, default=True) – Returns the sections (upper and lower) of the aquifer.

Returns:

up, low

  • (upix, lowix): Tuple of indexes of the upper and lower sections

  • (up, low): Tuple of aquifer sections (upper and lower)

  • (upix, lowix), (up, low): Positions and section values of the aquifer, if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_section
>>> data = load_hlogs ().frame # return all data including the 'depth' values
>>> get_aquifer_section (data , zname ='depth', kname ='k')
... [197.12, 369.71] # section starts from 197.12 -> 369.71 m
>>> get_aquifer_section (data , zname ='depth', kname ='k', return_index=True)
... ([16, 29], [197.12, 369.71]) # upper and lower-> position 16 and 29.
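
To make the notion of section concrete, here is a minimal sketch on a toy log. It assumes that the aquifer section runs from the first to the last depth where ‘k’ is available, which is an interpretation and not necessarily the rule applied by the library; all values are made up.

```python
import math

# Hypothetical log: depth in metres and permeability coefficient 'k'
# (NaN where no 'k' was measured).
depth = [10.0, 25.0, 40.0, 60.0, 85.0, 100.0]
k = [math.nan, math.nan, 0.12, 0.30, 0.05, math.nan]

# Positions where 'k' is available.
valid = [i for i, value in enumerate(k) if not math.isnan(value)]
upix, lowix = valid[0], valid[-1]        # upper and lower positions
up, low = depth[upix], depth[lowix]      # upper and lower depths
thickness = low - up                     # aquifer thickness in metres

print([upix, lowix], [up, low], thickness)  # [2, 4] [40.0, 85.0] 45.0
```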
watex.utils.get_aquifer_sections(*data, zname, kname, return_index=False, return_data=False, error='ignore', **kws)[source]#

Get the section of each aquifer from multiple dataframes.

The unique section (‘upper’, ‘lower’) is the range of the whole data that is considered valid. The index is needed to shrink the data of the whole borehole. Usually the data within the section is taken as the valid predictor Xr. Data outside the aquifer section can be discarded or compressed to the top of Xr.

Returns valid section indexes if ‘return_index’ is set to True.

Parameters:
  • data (list of pandas dataframe) – Data that mainly contains the aquifer values. The name of the depth column zname and the name of the permeability column kname must both be specified.

  • zname (str, int) – Name of the depth column. zname allows retrieving the depth column from a dataframe. If an integer is passed, it is taken as the positional index of the depth column and must not exceed the dataframe size along axis 1. zname usually needs to be supplied when a dataframe is passed as a function argument.

  • kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the positional index of the ‘k’ column and must not exceed the dataframe size along axis 1. kname usually needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two dimensions or more are not allowed. If z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.

  • return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of each aquifer found in each dataframe.

  • error (str, default='ignore') – Controls what happens when trouble occurs while computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value sets error to raise.

  • return_data (bool, default=False) – Returns the valid data. It is useful for collecting the valid data when ‘error’ is set to ‘ignore’.

  • kws (dict) – Additional keyword arguments passed to get_aquifer_section().

Returns:

up, low

  • (upix, lowix): Tuple of indexes of the upper and lower sections

  • (up, low): Tuple of aquifer sections (upper and lower)

  • (upix, lowix), (up, low): Positions and section values of the aquifers, if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

See also

watex.utils.hydroutils.get_aquifer_sections

compute multiples aquifer sections

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_sections
>>> data = load_hlogs ().frame
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' )
... [[197.12, 369.71], [197.12, 369.71]]
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' ,
...                       return_index =True )
...  [[16, 29], [16, 29]]
watex.utils.get_azimuth(xlon, ylat, *, data=None, utm_zone=None, projection='ll', isdeg=True, mode='soft', extrapolate=Ellipsis, view=Ellipsis)[source]#

Compute azimuth from coordinate locations ( latitude, longitude).

If easting and northing are given rather than longitude and latitude, the projection should be explicitly set to UTM to perform the proper conversion. However, if mode is set to soft (default), the type of projection is detected automatically. Note that when UTM coordinates are provided, xlon and ylat correspond to easting and northing respectively.

Parameters:
  • xlon (Arraylike 1d or str) – One-dimensional array-like of easting/longitude values (xlon) and of northing/latitude values (ylat). In principle, if data is supplied, they must be series. If xlon and ylat are given as strings, data must be supplied and must contain columns with those names, otherwise an error is raised.

  • ylat (Arraylike 1d or str) – One-dimensional array-like of easting/longitude values (xlon) and of northing/latitude values (ylat). In principle, if data is supplied, they must be series. If xlon and ylat are given as strings, data must be supplied and must contain columns with those names, otherwise an error is raised.

  • data (pd.DataFrame) – Data containing the x and y names. Needs to be supplied when x and y are given as string names.

  • utm_zone (Optional, string) – Zone number and ‘S’ or ‘N’, e.g. ‘55S’. Defaults to the centre point of the coordinate points in the survey area. It should be a string (##N or ##S) made of the zone number and the North or South hemisphere, e.g. 10S or 03N.

  • projection (str, ['utm'|'ll']) – The coordinate system in which the data points of the profile are collected. When mode=’soft’, auto-detection is triggered to find the suitable coordinate system. However, it is recommended to provide projection explicitly when the data is in UTM coordinates. Note that if x and y contain values greater than 180 degrees for longitude and 90 degrees for latitude while mode is still soft, they are treated as UTM coordinates rather than longitude-latitude.

  • isdeg (bool, default=True) – By default xlon and ylat are in degree coordinates. If both arguments are given in radians, set to False instead.

  • mode (str, ['soft'|'strict']) – strict mode does not convert one coordinate system to another unless it is explicitly set through projection, whereas soft mode does.

  • extrapolate (bool, default=False) – In principle, the azimuth is computed between two points. Thus, the number of values computed for \(N\) stations should be \(N-1\). To make the number of values match the size of the array, extrapolate should be True. In that case, the first station holds a “fake” azimuth, taken as the closest value computed from the interpolation of all azimuths.

  • view (bool, default=False) – Quick view of the azimuth. It is especially useful when extrapolate is set to True.

Returns:

azim – Azimuth computed from locations.

Return type:

ArrayLike

Examples

>>> import watex as wx
>>> from watex.utils.exmath import get_azimuth
>>> # generate a data from ERP
>>> data = wx.make_erp (n_stations =7 ).frame
>>> get_azimuth ( data.longitude, data.latitude)
array([54.575, 54.575, 54.575, 54.575, 54.575, 54.575])
>>> get_azimuth ( data.longitude, data.latitude, view =True, extrapolate=True)
array([54.57500007, 54.575     , 54.575     , 54.575     , 54.575     ,
       54.575     , 54.575     ])
watex.utils.get_bearing(latlon1, latlon2, to_deg=True)[source]#

Calculate the bearing between two points.

A bearing can be defined as the direction of one point relative to another point, usually given as an angle measured clockwise from north. The formula of the bearing \(\beta\) between two points 1(lat1, lon1) and 2(lat2, lon2) is expressed as below:

\[\beta = \operatorname{atan2}\left(\sin(y_2-y_1)\cos(x_2),\ \cos(x_1)\sin(x_2) - \sin(x_1)\cos(x_2)\cos(y_2-y_1)\right)\]

where:

  • \(x_1\) (lat1): the latitude of the first coordinate

  • \(y_1\) (lon1): the longitude of the first coordinate

  • \(x_2\) (lat2): the latitude of the second coordinate

  • \(y_2\) (lon2): the longitude of the second coordinate

Parameters:
  • latlon1 (Tuple (latitude, longitude)) – Latitude and longitude coordinates of the first point, in degrees.

  • latlon2 (Tuple (latitude, longitude)) – Latitude and longitude coordinates of the second point, in degrees.

  • to_deg (bool, default=True) – Convert the bearing from radians to degrees.

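Returns:

float – Bearing from the first point to the second, in degrees when to_deg is True and in radians otherwise.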

Examples

>>> from watex.utils import get_bearing
>>> latlon1 = (28.41196763902007, 109.3328724432221) # (lat, lon) point 1
>>> latlon2= (28.38756530909265, 109.36931920880758) # (lat, lon) point 2
>>> get_bearing (latlon1, latlon2 )
127.26739270447973 # in degree
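
The bearing formula above can be written directly with the standard math module. The function below is a sketch for illustration, not the library implementation; it takes (latitude, longitude) pairs in degrees as in the example.

```python
import math

def bearing(latlon1, latlon2, to_deg=True):
    """Bearing from point 1 to point 2, both given as (lat, lon) in degrees."""
    x1, y1 = map(math.radians, latlon1)   # x: latitude, y: longitude
    x2, y2 = map(math.radians, latlon2)
    b = math.atan2(
        math.sin(y2 - y1) * math.cos(x2),
        math.cos(x1) * math.sin(x2)
        - math.sin(x1) * math.cos(x2) * math.cos(y2 - y1),
    )
    return math.degrees(b) if to_deg else b

latlon1 = (28.41196763902007, 109.3328724432221)
latlon2 = (28.38756530909265, 109.36931920880758)
print(round(bearing(latlon1, latlon2), 2))  # 127.27
```

Note that atan2 returns values in (-180, 180] degrees; a westward bearing therefore comes out negative unless it is wrapped with a modulo 360.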
watex.utils.get_compressed_vector(d, /, sname, stratum=None, strategy='average', as_frame=False, random_state=None)[source]#

Compresses the base stratum data into a single vector composed of all the feature names in the targeted data d.

Parameters:
  • d (pandas DataFrame) – Valid data containing the strata. If a dataframe is passed, ‘sname’ is needed to fetch the strata values.

  • sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.

  • stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered. The same stratum is returned, although its rate and occurrence can be given if return_rate or return_counts is set to True.

  • strategy (str, default='average' or 'mean') – Strategy used to select or compute the numerical data into a single series. It can also be ‘naive’. In that case, a single series is randomly picked from the base strata data.

  • as_frame (bool, default=False) – Returns the compressed vector as a dataframe rather than keeping it as a series.

  • random_state (int, optional) – State used to randomly select a compressed vector when naive is passed as strategy.

Returns:

ms – Compressed vector as a pandas series composed of all features. Note that the vector here is not a mathematical vector made of numerical values only. A compressed vector is a series that results from averaging the numerical features of the base stratum and including its corresponding categorical values. ms can therefore contain categorical values, and it has the same number of features as the original frame d.

Return type:

pandas series/dataframe

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_compressed_vector
>>> data = load_hlogs().frame # get only the frame
>>> get_compressed_vector (data, sname='strata_name')[:4]
... hole_number           H502
    strata_name      siltstone
    aquifer_group           II
    pumping_level       ZFSAII
    dtype: object
>>> get_compressed_vector (data, sname='strata_name', as_frame=True )
...   hole_number strata_name aquifer_group  ...        r     rp remark
    0        H502   siltstone            II  ...  41.7075  59.23    NaN
    [1 rows x 23 columns]
>>> get_compressed_vector (data, sname='strata_name', strategy='naive')
... hole_number          H502
    depth_top          379.15
    depth_bottom        379.7
    strata_name     siltstone
    Name: 39, dtype: object
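
The ‘average’ strategy can be approximated with a few pandas operations. The sketch below relies on two assumptions that are not confirmed by the documentation: the base stratum is taken as the most frequent stratum, and categorical features are summarized by their most frequent value. The frame d is made up.

```python
import pandas as pd

# Hypothetical borehole frame with one strata column and mixed feature types.
d = pd.DataFrame({
    "strata_name": ["siltstone", "coal", "siltstone", "mudstone"],
    "aquifer_group": ["II", "III", "II", "IV"],
    "resistivity": [40.0, 12.0, 44.0, 20.0],
})

# Assumption: the base stratum is the most frequent one.
stratum = d["strata_name"].mode().iloc[0]
base = d[d["strata_name"] == stratum]

# Average the numeric columns, keep the most frequent value of the others.
numeric = base.select_dtypes(include="number").mean()
categorical = base.select_dtypes(exclude="number").apply(lambda s: s.mode().iloc[0])
ms = pd.concat([categorical, numeric])[d.columns]   # restore the column order

print(ms["strata_name"], ms["resistivity"])  # siltstone 42.0
```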
watex.utils.get_distance(x, y, *, return_mean_dist=False, is_latlon=False, **kws)[source]#

Compute distance between points

Parameters:
  • x (ArrayLike 1d) – One-dimensional array. x can be considered as the abscissa of the landmark and y as the ordinate array.

  • y (ArrayLike 1d) – One-dimensional array. x can be considered as the abscissa of the landmark and y as the ordinate array.

  • return_mean_dist (bool, default=False) – Returns the average value of the distance between the different points.

  • is_latlon (bool, default=False) – Converts the x and y latitude and longitude coordinate values into UTM before computing the distance. x and y are then considered as easting and northing respectively.

  • kws (dict) – Keyword arguments passed to watex.site.Location.to_utm_in()

Returns:

d – Is the distance between points.

Return type:

Arraylike of shape (N-1)

Examples

>>> import numpy as np
>>> from watex.utils.exmath import get_distance
>>> x = np.random.rand (7) *10
>>> y = np.abs ( np.random.randn (7) * 12 )
>>> get_distance (x, y)
array([ 8.7665511 , 12.47545656,  8.53730212, 13.54998351, 14.0419387 ,
       20.12086781])
>>> get_distance (x, y, return_mean_dist= True)
12.91534996818084
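
Since the result has shape (N-1), the distance is presumably computed between consecutive points. A minimal sketch of that reading, using Euclidean distances and made-up coordinates, could look like this:

```python
import math

x = [0.0, 3.0, 3.0]
y = [0.0, 4.0, 8.0]

# Euclidean distance between each point and the next one:
# N points give N-1 values.
d = [math.hypot(x[i + 1] - x[i], y[i + 1] - y[i]) for i in range(len(x) - 1)]
mean_dist = sum(d) / len(d)

print(d, mean_dist)  # [5.0, 4.0] 4.5
```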
watex.utils.get_full_frequency(z_or_edis_obj_list, /, to_log10=False)[source]#

Get the frequency with clean data.

The full or plain frequency is the frequency array with no frequency missing during the data collection. Note that when using Natural Source Audio-Magnetotellurics, some data are missing because the signal is weak or missing at certain frequencies, especially in the attenuation band.

Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI- or Impedances tensors objects.

  • to_log10 (bool, default=False) – Export frequency to base 10 logarithm

Returns:

f – Frequency with clean data, i.e. outside the attenuation band if the survey is carried out with Natural Source Audio-Magnetotellurics.

Return type:

Arraylike of shape(N, )

Examples

>>> from watex.datasets import load_huayuan
>>> from watex.methods.em import get_full_frequency
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> edi_data = box.data
>>> f = get_full_frequency (edi_data )
>>> f
array([8.19200e+04, 7.00000e+04, 5.88000e+04, 4.95000e+04, 4.16000e+04,
       3.50000e+04, 2.94000e+04, 2.47000e+04, 2.08000e+04, 1.75000e+04,
       ...
       3.25000e+01, 2.75000e+01, 2.25000e+01, 1.87500e+01, 1.62500e+01,
       1.37500e+01, 1.12500e+01, 9.37500e+00, 8.12500e+00, 6.87500e+00,
       5.62500e+00])
>>> len(f)
56
>>> # Get only the z component objects
>>> zobjs = [ box.emo.ediObjs_[i].Z for i in  range (len(box.emo.ediObjs_))]
>>> len(zobjs)
56
watex.utils.get_profile_angle(easting=None, northing=None, msg='ignore')[source]#

Compute the geoprofile angle.

Parameters:
  • easting (array_like) – Easting coordinate values

  • northing (array_like) – Northing coordinate values

  • msg (str, default='ignore') – Output a little message if msg is set to “raises”

Returns:

  • float – profile_angle

  • float – geo_electric_strike

watex.utils.get_sections_from_depth(z, z_range, return_index=False)[source]#

Gets aquifer sections (‘upper’, ‘lower’) in data ‘z’ from the depth range.

This might be useful to compute the thickness of the aquifer.

Parameters:
  • z (array-like 1d or pd.Series) – Array or pandas series containing the depth values.

  • z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.

  • return_index (bool, default=False) – If True, return the indices of the sections [‘upper’, ‘lower’] of the aquifer and of the non-valid sections too.

Returns:

  • sections (Tuple (float, float)) – Real values of the upper and lower sections of the aquifer.

  • If return_index is ‘True’, function returns –

    (upix, lowix): Tuple (int, int )

    indices of upper and lower sections in the depth array z

    (invix): list of Tuple (int, int)

    list of indices of invalid sections

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_sections_from_depth
>>> data= load_hlogs().frame
>>> # get real sections from depth 16.25 to 125.83 m
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83))
...  (22.46, 128.23)
>>> # aquifer depth from 16.25 m to the end
>>> get_sections_from_depth ( data.depth_top, ( 16.25,))
... (22.46, 693.37)
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83),
                             return_index =True )
... ((3, 11), [(0, 3), (11, 180)])
>>> get_sections_from_depth ( data.depth_top, ( 16.25,),
                             return_index =True )
... ((3, 181), [(0, 3)])
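
The mapping from a requested depth range to recorded depths can be sketched as follows. The sketch assumes each bound is snapped to the nearest recorded depth; the actual matching rule used by the library is not stated here, so `sections_from_depth_sketch` is only a hypothetical stand-in.

```python
import numpy as np

def sections_from_depth_sketch(z, z_range):
    """Snap (upper, lower) bounds to the nearest recorded depths."""
    z = np.asarray(z, dtype=float)
    upper, lower = z_range
    if upper > lower or lower > z.max():
        raise ValueError("z_range must be ordered and within the well depth")
    upix = int(np.abs(z - upper).argmin())
    lowix = int(np.abs(z - lower).argmin())
    return (float(z[upix]), float(z[lowix])), (upix, lowix)

depth = [0.0, 10.0, 20.0, 35.0, 50.0, 80.0]
sections, indexes = sections_from_depth_sketch(depth, (18.0, 52.0))
thickness = sections[1] - sections[0]  # aquifer thickness from the sections
```

With these depths, the bounds 18 and 52 snap to 20 and 50 (positions 2 and 4), which gives a thickness of 30.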
watex.utils.get_strike(profile_angle=None, easting=None, northing=None, gstrike=None, msg='ignore')[source]#

Compute geoelectric strike from profile angle, easting and northing.

Parameters:
  • profile_angle (float) – Profile angle. If not provided, it is computed from the easting and northing coordinates.

  • easting (array_like) – Easting coordinate values.

  • northing (array_like) – Northing coordinate values.

  • gstrike (float) – Strike value. If provided, the geoelectric strike is recomputed from it.

  • msg (str) – Output a short message if msg is set to “raises”.

Returns:

  • float – profile_angle in degree E of N

  • float – geo_electric_strike in degrees E of N

watex.utils.get_target(ar, /, tname, drop_target=True, columns=None, as_frame=False)[source]#

Extract target from multidimensional array or dataframe.

Parameters:
  • ar (arraylike2d or pd.DataFrame) – Array that supposed to contain the target value.

  • tname (int/str, list of int/str) – Index or name of the target. If an int is passed, it must be less than the number of columns of the array, i.e. shape[1] in the case of np.ndarray. If a list of indexes or names is given, the returned target is a two-dimensional array.

  • drop_target (bool, default=True) – Remove the target from the 2D array or dataframe when the target exists, and return the data excluding the target.

  • columns (list, default=None) – Column names used to compose the dataframe when an array is given rather than a dataframe. The list of column names must match the number of columns of the two-dimensional array, otherwise an error occurs.

  • as_frame (bool, default=False) – Return a dataframe/series of the target rather than an array when an array is supplied. This is useful when column names are supplied.

Returns:

t, ar – Return the targets and the array/dataframe of the target.

Return type:

array-like/pd.Series , array-like/pd.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mtutils import get_target
>>> ar = np.random.randn ( 3, 3 )
>>> df0 = pd.DataFrame ( ar, columns = ['x1', 'x2', 'tname'])
>>> df= df0.copy()
>>> get_target (df, 'tname', drop_target= False )
(      tname
 0 -0.542861
 1  0.781198,
          x1        x2     tname
 0 -1.424061 -0.493320 -0.542861
 1  0.416050 -1.156182  0.781198)

>>> get_target (df, [ 'tname', 'x1']) # drop is True by default
(      tname        x1
 0 -0.542861 -1.424061
 1  0.781198  0.416050,
          x2
 0 -0.493320
 1 -1.156182)
>>> df = df0.copy()
>>> # when array is passed
>>> get_target (df.values , '2', drop_target= False )
(array([[-0.54286148],
        [ 0.7811981 ]]),
 array([[-1.42406091, -0.49331988, -0.54286148],
        [ 0.41605005, -1.15618243,  0.7811981 ]]))
>>> get_target (df.values , 'tname') # raise error
ValueError: 'tname' ['tname'] is not valid...
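
For the plain array case, the extraction and the effect of drop_target can be sketched with NumPy alone. This is a simplified stand-in limited to integer indexes (the name `get_target_sketch` is hypothetical); it does not reproduce the name handling or dataframe support of the real function.

```python
import numpy as np

def get_target_sketch(ar, tindex, drop_target=True):
    """Split a 2D array into (target columns, remaining data)."""
    ar = np.asarray(ar)
    idx = [tindex] if isinstance(tindex, int) else list(tindex)
    if any(not -ar.shape[1] <= i < ar.shape[1] for i in idx):
        raise ValueError("target index out of range for the array columns")
    target = ar[:, idx]  # fancy indexing keeps the target two-dimensional
    rest = np.delete(ar, idx, axis=1) if drop_target else ar
    return target, rest

ar = np.arange(12).reshape(4, 3)
t, rest = get_target_sketch(ar, 2)
t_keep, full = get_target_sketch(ar, 2, drop_target=False)
```

The target comes back as a column of shape (4, 1); with drop_target left to True, the remaining array keeps only the other two columns.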
watex.utils.get_unique_section(*data, zname, kname, return_index=False, return_data=False, error='raise', **kws)[source]#

Get the section to consider unique in multiple aquifers.

The unique section (‘upper’, ‘lower’) is the valid range covering the sections of all the aquifers. It is considered the main valid section, within which data must be neither compressed nor altered. For instance, the indexes are needed to shrink the data outside this valid section. Mostly, the data from this section is considered the valid data used as the predictor Xr. Outside the range of the aquifer section, data can be discarded or compressed to top Xr.

Returns valid section indexes if ‘return_index’ is set to True.

Parameters:
  • data (list of pandas dataframe) – Data that mainly contains the aquifer values. The name of the depth column zname and the name of the permeability column kname must be specified.

  • zname (str, int) – Name of the depth column. zname is used to retrieve the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer value must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.

  • kname (str, int) – Name of the permeability coefficient column. kname is used to retrieve the permeability coefficient ‘k’ from a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised since zname is used to fetch z from the dataframe and overwrite it.

  • return_index (bool, default=False) – Return the positions (indexes) of the upper and lower sections of the shallowest and deepest aquifers found in all the dataframes.

  • return_data (bool, default=False) – Return the valid data. It is useful when ‘error’ is set to ‘ignore’ to collect the valid data.

  • error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.

  • kws (dict,) – Additional keywords arguments passed to get_aquifer_sections().

Returns:

up, low

  • (upix, lowix): Tuple of indexes of the upper and lower sections, if return_index is True.

  • (up, low): Tuple of aquifer sections (upper and lower).

  • (upix, lowix), (up, low): Positions and section values of the aquifers, if return_index and return_sections are True.

Return type:

list of upper and lower section values of aquifer.

See also

watex.utils.hydroutils.get_aquifer_section

compute single section

watex.utils.hydroutils.get_aquifer_sections

compute multiple sections

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils import get_unique_section
>>> data = load_hlogs ().frame
>>> get_unique_section (data.copy() , zname ='depth', kname ='k', )
... array([197.12, 369.71], dtype=float32)
>>> get_unique_section (data.copy() , zname ='depth', kname ='k',
                                return_index =True)
... array([16, 29])
watex.utils.get_xs_xr_splits(data, /, z_range=None, zname=None, section_indexes=None)[source]#

Split data into the matrix \(X_s\) of \(m_s\) samples (unwanted data) and the matrix \(X_r\) of \(m_r\) samples (valid aquifer data).

Parameters:
  • data (pandas dataframe) – Dataframe for compressing.

  • zname (str, int) – The name of the depth column. zname needs to be supplied when section_indexes is not provided.

  • z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.

  • section_indexes (tuple or list of int) – A pair of integers given as a tuple or list: the indexes of the valid section (upper and lower) of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is considered as the last position, i.e. the bottom of the section; therefore its value is included in the data.

Returns:

  • xs (list of pandas dataframe) – Shrinking part of the data, intended for compression. Note that it is a list because each dataframe corresponds to one of the non-valid sections.

  • xr (pandas dataframe) – Valid data corresponding to the aquifer part, or including the aquifer data.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_xs_xr_splits
>>> data = load_hlogs ().frame
>>> xs, xr = get_xs_xr_splits (data, 3.11, section_indexes = (17, 20 ) )
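
The split by section indexes, with the last index included in the valid part, can be sketched on a plain Python list. The helper `xs_xr_split_sketch` is hypothetical and only mirrors the behaviour described above; the real function operates on dataframes.

```python
def xs_xr_split_sketch(rows, section_indexes):
    """Split rows into non-valid parts (xs) and the valid section (xr)."""
    up, low = section_indexes
    if not 0 <= up <= low < len(rows):
        raise ValueError("section indexes must lie within the data")
    # The lower index is the bottom of the section, so it is included.
    xr = rows[up:low + 1]
    # Keep only the non-empty parts above and below the valid section.
    xs = [part for part in (rows[:up], rows[low + 1:]) if part]
    return xs, xr

rows = list(range(10))
xs, xr = xs_xr_split_sketch(rows, (3, 6))
```

With ten rows and the section (3, 6), rows 3 to 6 form the valid part, while rows 0 to 2 and rows 7 to 9 end up as two separate non-valid parts.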
watex.utils.interpolate1d(arr, kind='slinear', method=None, order=None, fill_value='extrapolate', limit=None, **kws)[source]#

Interpolate an array containing invalid values (NaN).

Useful function to interpolate the missing frequency values in the tensor components.

Parameters:
  • arr (array_like) – Array to interpolate, containing invalid values. The invalid value here is NaN.

  • kind (str or int, optional) – Specifies the kind of interpolation as a string, or as an integer specifying the order of the spline interpolator to use. The string has to be one of linear, nearest, nearest-up, zero, slinear, quadratic, cubic, previous, or next. zero, slinear, quadratic and cubic refer to a spline interpolation of zeroth, first, second or third order; previous and next simply return the previous or next value of the point; nearest-up and nearest differ when interpolating half-integers (e.g. 0.5, 1.5) in that nearest-up rounds up and nearest rounds down. If the method param is set to pd, which refers to the pandas interpolation method, kind can be set to polynomial or pad interpolation. Note that polynomial requires you to specify an order while pad requires you to specify the limit. Default is slinear.

  • method (str, optional) – Method of interpolation. Can be base for scipy.interpolate.interp1d, mean or bff for the scaling methods, and pd for the pandas interpolation methods. Note that the base method is fast and efficient when the number of NaN in the array is relatively small; it is less accurate when the data is composed of many missing values. The scaling methods are proposed as a more efficient alternative in that case. When mean is set, the function replaces the NaN values by the nonzero values of the raw array and then uses the mean to fit the data. The fit creates a smooth curve, and each NaN in the raw array is replaced by the corresponding value of the fit. The same approach is used for the bff method; however, rather than averaging the nonzero values, it uses the backward and forward strategy to fill the NaN before scaling. mean and bff are more efficient when the data is composed of a lot of missing values. When the interpolation method is set to pd, the function uses the pandas interpolation but ends the interpolation with a forward/backward NaN filling, since the pandas interpolation does not deal with the NaN at the beginning or at the end of the array. Default is base.

  • fill_value (array-like or (array-like, array_like) or extrapolate, optional) – If a ndarray (or float), this value will be used to fill in for requested points outside of the data range. If not provided, then the default is NaN. The array-like must broadcast properly to the dimensions of the non-interpolation axes. If a two-element tuple, then the first element is used as a fill value for x_new < x[0] and the second element is used for x_new > x[-1]. Anything that is not a 2-element tuple (e.g., list or ndarray, regardless of shape) is taken to be a single array-like argument meant to be used for both bounds as below, above = fill_value, fill_value. Using a two-element tuple or ndarray requires bounds_error=False. Default is extrapolate.

  • kws (dict) – Additional keyword arguments passed to scipy.interpolate.interp1d.

Return type:

array like – New interpolated array. NaN values are interpolated.

Notes

When interpolating throughout the complete frequencies (i.e. all the frequency values) using the base method, the missing data in arr can fall outside the arr range. So, for consistency and to keep all values within the frequency range, it is better to set the fill_value param in the kws argument of scipy.interpolate.interp1d to extrapolate. This avoids raising an error when the value to interpolate is out of the bounds of arr.

References

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html

https://www.askpython.com/python/examples/interpolation-to-fill-missing-entries

Examples

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils.exmath  import interpolate1d
>>> z = np.random.randn(17) *10 # assume 17 freq for 17 values of tensor Z
>>> z [[7, 10, 16]] =np.nan # replace some indexes by NaN values
>>> zit = interpolate1d (z, kind ='linear')
>>> z
... array([ -1.97732415, -16.5883156 ,   8.44484348,   0.24032979,
          8.30863276,   4.76437029, -15.45780568,          nan,
         -4.11301794, -10.94003412,          nan,   9.22228383,
        -15.40298253,  -7.24575491,  -7.15149205, -20.9592011 ,
                 nan])
>>> zit
...array([ -1.97732415, -16.5883156 ,   8.44484348,   0.24032979,
         8.30863276,   4.76437029, -15.45780568,  -4.11301794,
       -10.94003412,   9.22228383, -15.40298253,  -7.24575491,
        -7.15149205, -20.9592011 , -34.76691014, -48.57461918,
       -62.38232823])
>>> zmean = interpolate1d (z,  method ='mean')
>>> zbff = interpolate1d (z, method ='bff')
>>> zpd = interpolate1d (z,  method ='pd')
>>> plt.plot( np.arange (len(z)),  zit, 'v--',
          np.arange (len(z)), zmean, 'ok-',
          np.arange (len(z)), zbff, '^g:',
          np.arange (len(z)), zpd,'<b:',
          np.arange (len(z)), z,'o',
          )
>>> plt.legend(['interp1d', 'mean strategy', 'bff strategy',
                'pandas strategy', 'data'], loc='best')
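
The base strategy described in the Notes can be sketched directly with scipy.interpolate.interp1d: fit on the valid positions, then evaluate at the NaN positions with fill_value set to extrapolate so that trailing NaN do not raise an error. The helper `interpolate_nan_sketch` is hypothetical and covers only this one strategy, not the mean, bff or pd methods.

```python
import numpy as np
from scipy.interpolate import interp1d

def interpolate_nan_sketch(arr, kind="slinear"):
    """Replace NaN by values interpolated over the array positions."""
    arr = np.asarray(arr, dtype=float)
    x = np.arange(arr.size)
    valid = ~np.isnan(arr)
    # 'extrapolate' lets NaN at the edges be evaluated outside the data range.
    f = interp1d(x[valid], arr[valid], kind=kind, fill_value="extrapolate")
    out = arr.copy()
    out[~valid] = f(x[~valid])
    return out

z = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
zi = interpolate_nan_sketch(z)
```

In this small case the interior NaN is filled by linear interpolation and the trailing NaN by linear extrapolation of the last segment.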
watex.utils.interpolate2d(arr2d, method='slinear', **kws)[source]#

Interpolate the data in a two-dimensional array.

If the data contains missing values, they are replaced by the interpolated values.

Parameters:
  • arr2d (np.ndarray, shape (N, M)) – Two-dimensional data.

  • method (str, default slinear) – Interpolation technique to use. Can be nearest or pad.

  • kws (dict) – Additional keywords. Refer to interpolate1d().

Returns:

arr2d – 2D dimensional data interpolated

Return type:

np.ndarray, shape (N, M)

Examples

>>> from watex.methods.em import EM
>>> from watex.utils.exmath import interpolate2d
>>> # make 2d matrix of frequency
>>> emObj = EM().fit(r'data/edis')
>>> freq2d = emObj.make2d (out = 'freq')
>>> freq2d_i = interpolate2d(freq2d )
>>> freq2d.shape
...(55, 3)
>>> freq2d
... array([[7.00000e+04, 7.00000e+04, 7.00000e+04],
       [5.88000e+04, 5.88000e+04, 5.88000e+04],
       ...
        [6.87500e+00, 6.87500e+00, 6.87500e+00],
        [        nan,         nan, 5.62500e+00]])
>>> freq2d_i
... array([[7.000000e+04, 7.000000e+04, 7.000000e+04],
       [5.880000e+04, 5.880000e+04, 5.880000e+04],
       ...
       [6.875000e+00, 6.875000e+00, 6.875000e+00],
       [5.625000e+00, 5.625000e+00, 5.625000e+00]])

References

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.interp2d.html

watex.utils.interpolate_grid(arr, /, method='cubic', fill_value='auto', view=False)[source]#

Interpolate data containing missing values.

Parameters:
  • arr (ArrayLike2D) – Two dimensional array for interpolation

  • method (str, default='cubic') – kind of interpolation. It could be [‘nearest’|’linear’|’cubic’].

  • fill_value (float, str, default='auto') – Value used to fill the interpolated grid at the edges or around NaN. auto uses the forward and backward fill strategy.

  • view (bool, default=False) – Quickly visualize the interpolated grid.

Changed in version 0.2.8: One-dimensional array is henceforth possible. Error no longer raises.

Returns:

arri – Interpolated 2D grid.

Return type:

ArrayLike2d

See also

spi.griddata

Scipy interpolate Grid data

fillNaN

Fill missing data strategy.

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import interpolate_grid
>>> x = [28, np.nan, 50, 60] ; y = [np.nan, 1000, 2000, 3000]
>>> xy = np.vstack ((x, y)).T
>>> xyi = interpolate_grid (xy, view=True )
>>> xyi
array([[  28.        ,   28.        ],
       [  22.78880663, 1000.        ],
       [  50.        , 2000.        ],
       [  60.        , 3000.        ]])
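
Since the entry points to scipy's griddata, a grid interpolation of this kind can be sketched as below. The sketch uses the linear method on a planar grid so that the result is easy to check, and falls back to nearest-neighbour values for cells that remain NaN; both choices, and the name `interpolate_grid_sketch`, are assumptions that differ from the cubic default and the forward/backward fill described above.

```python
import numpy as np
from scipy.interpolate import griddata

def interpolate_grid_sketch(arr, method="linear"):
    """Fill NaN cells of a 2D grid from the valid cells around them."""
    arr = np.asarray(arr, dtype=float)
    rows, cols = np.indices(arr.shape)
    valid = ~np.isnan(arr)
    points = np.column_stack((rows[valid], cols[valid]))
    filled = griddata(points, arr[valid], (rows, cols), method=method)
    # Cells outside the convex hull of valid points stay NaN: use nearest.
    still_nan = np.isnan(filled)
    if still_nan.any():
        filled[still_nan] = griddata(
            points, arr[valid], (rows[still_nan], cols[still_nan]),
            method="nearest")
    return filled

grid = np.add.outer(np.arange(4.0), np.arange(4.0))  # plane: row + col
grid[1, 2] = np.nan
grid_i = interpolate_grid_sketch(grid)
```

Because the sample grid is a plane, linear interpolation recovers the missing cell exactly as the sum of its row and column positions.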
watex.utils.label_importance(label, arr_k, arr_aq, *, method='naive')[source]#

Compute the score for the label and its representativity in the valid array ‘arr_k’.

Parameters:
  • label (int or str) – Class label from the true labels array of permeability coefficient ‘k’. If a string, be sure to convert the array to hold the dtype str. It is recommended to provide data with no NaN to have full control of the occurrence results.

  • arr_k (array-like 1d) – True labels of the array containing the permeability coefficient ‘k’.

  • arr_aq (array-like 1d) – True labels of the groups of aquifers or predicted naive group of aquifer (NGA) labels. See predict_NGA_labels().

  • method (str [‘naive’, ‘strict’], default=’naive’) –

    The kind of strategy to compute the representativity of a label in the predicted array ‘arr_aq’. It can also be ‘strict’. Indeed:

    • ‘naive’ computes the importance of the label by the number of its occurrences for this specific label in the array ‘k’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in ‘arr_k’.

    • ‘strict’ computes the importance of the label by the number of occurrences in the whole valid ‘arr_k’, i.e. over the total number of occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in ‘arr_k’.

Returns:

label_dict_group_rate – Dictionary of the label and its rate of occurrence in arr_aq. Thus each group in arr_aq has its rate of representativity of the label in arr_k.

Return type:

dict

Example output for each label k:

label k = 1 :

{‘V’: 0.17, ‘IV’: 0.141, ‘II’: 0.126, ‘III’: 0.084, ‘IV&V’: 0.005, ‘II&III’: 0.003, ‘III&IV’: 0.003}

label k = 2 :

{‘III’: 0.052, ‘II’: 0.05, ‘V’: 0.05, ‘IV’: 0.034, ‘III&IV’: 0.005}

label k = 3 :

{‘V’: 0.123, ‘IV’: 0.086, ‘III’: 0.068}

>>> # **comments:
    # label k=1 has 17% importance for group V and 12.6% for group II, whereas
    # label k=2 has a weak rate in the whole dataset, ~19% for all groups.
    # The most dominant labels are k=1 and k=3 with 53.14% and 27.74%
    # respectively in the dataset.
    # If the threshold of representativity is set to 50%, none of the true
    # labels k will fit any aquifer group since the max representativity
    # score is 17%, reached by group V for k=1.
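
The difference between the two strategies can be sketched with a simple counter. The sketch reads ‘naive’ as a rate relative to the occurrences of the label alone and ‘strict’ as a rate relative to all samples; this reading and the helper name `label_importance_sketch` are assumptions, and the library's exact normalization may differ.

```python
from collections import Counter

def label_importance_sketch(label, arr_k, arr_aq, method="naive"):
    """Rate of each aquifer group among the samples carrying `label`."""
    groups = [g for k, g in zip(arr_k, arr_aq) if k == label]
    if not groups:
        return {}
    # naive: normalize by the label count; strict: by all samples.
    total = len(groups) if method == "naive" else len(arr_k)
    return {g: round(n / total, 3) for g, n in Counter(groups).most_common()}

arr_k = [1, 1, 1, 2, 2, 3, 1, 3]
arr_aq = ["V", "V", "IV", "III", "II", "V", "II", "IV"]
naive = label_importance_sketch(1, arr_k, arr_aq)
strict = label_importance_sketch(1, arr_k, arr_aq, method="strict")
```

Under this reading, group V represents half of the samples labelled 1 with the naive strategy, but only a quarter of the whole dataset with the strict one.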
watex.utils.labels_validator(t, /, labels, return_bool=False)[source]#

Assert the validity of the label in the target and return the label or the boolean whether all items of label are in the target.

Parameters:
  • t – array-like, target that is expected to contain the labels.

  • labels – int, str or list of (str or int) that is supposed to be in the target t.

  • return_bool – bool, default=False; returns ‘True’ or ‘False’ rather than the labels if set to True.

Returns:

bool or labels; ‘True’ or ‘False’ if return_bool is set to True and labels otherwise.

Example:

>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget, labels_validator
>>> _, y = fetch_data ('bagoue', return_X_y=True, as_frame=True)
>>> # binarize target y into [0 , 1]
>>> ybin = cattarget(y, labels=2 )
>>> labels_validator (ybin, [0, 1])
... [0, 1] # all labels exist.
>>> labels_validator (y, [0, 1, 3])
... ValueError: Value '3' is missing in the target.
>>> labels_validator (ybin, 0 )
... [0]
>>> labels_validator (ybin, [0, 5], return_bool=True ) # no raise error
... False
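
The behaviour shown in the example can be sketched in a few lines of plain Python. The helper `labels_validator_sketch` is hypothetical and only mirrors the documented outcomes (return the labels, raise on a missing label, or return a boolean).

```python
def labels_validator_sketch(target, labels, return_bool=False):
    """Check that every requested label occurs in the target."""
    wanted = list(labels) if isinstance(labels, (list, tuple, set)) else [labels]
    present = set(target)
    missing = [lab for lab in wanted if lab not in present]
    if return_bool:
        return not missing
    if missing:
        raise ValueError(f"Value '{missing[0]}' is missing in the target.")
    return wanted

ybin = [0, 1, 1, 0, 1]
all_found = labels_validator_sketch(ybin, [0, 1])
single = labels_validator_sketch(ybin, 0)
flag = labels_validator_sketch(ybin, [0, 5], return_bool=True)
```

A scalar label is wrapped in a list, and asking for a boolean avoids the exception when a label is absent.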
watex.utils.linkage_matrix(df, columns=None, kind='design', metric='euclidean', method='complete', as_frame=False, optimal_ordering=False)[source]#

Compute the distance matrix from the hierarchical clustering algorithm.

Parameters:
  • df (dataframe or NDArray of (n_samples, n_features)) – Dataframe or ndarray. If an array is given, the column names must be specified to match the array shape along axis 1.

  • columns (list) – List of labels to name each column of the array of shape (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.

  • kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – Kind of approach used to build the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. Passing the squareform distance matrix instead may yield distance values different from the expected ones. The design approach uses the complete input example matrix, also called the ‘design matrix’, to produce a correct linkage matrix similar to squareform and condense.

  • metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.

  • method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section of scipy.cluster.hierarchy.linkage() for full descriptions.

  • optimal_ordering (bool, optional) – If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets. See also scipy.cluster.hierarchy.linkage().

Returns:

row_clusters – Consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members of each cluster and the third column reports the distance between those members.

Return type:

linkage matrix
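
The relation between the condensed and design inputs can be checked directly with SciPy. The sketch below calls scipy.cluster.hierarchy.linkage itself rather than the watex wrapper, on random data generated for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((5, 3))  # 5 samples, 3 features (the design matrix)

# Condensed input: flat upper triangle of the distance matrix from pdist.
from_condensed = linkage(pdist(X, metric="euclidean"), method="complete")

# Design input: the raw sample matrix, distances computed internally.
from_design = linkage(X, method="complete", metric="euclidean")
```

Both calls produce the same linkage matrix with n_samples - 1 rows, one per merge. Passing a square distance matrix instead would be interpreted as a design matrix and would give different distances.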

watex.utils.magnitude(cz)[source]#

Compute the magnitude of selected conductive zone.

The magnitude parameter is the absolute difference between the minimum \(\min \rho_a\) and maximum \(\max \rho_a\) apparent resistivity values of the selected anomaly:

\[magnitude=|\min\rho_a -\max\rho_a|\]
Parameters:

cz – array-like. Array of apparent resistivity values composing the conductive zone.

Returns:

Absolute value of anomaly magnitude in ohm.meters.
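
As a worked instance of the formula above, the following sketch applies it to a handful of made-up apparent resistivity values; the helper name `magnitude_sketch` and the sample values are illustrative only.

```python
def magnitude_sketch(cz):
    """Absolute difference between the min and max apparent resistivity."""
    values = [float(v) for v in cz]
    return abs(min(values) - max(values))

# Hypothetical apparent resistivity values (ohm.meters) of a conductive zone.
cz = [120.0, 85.0, 60.0, 95.0, 150.0]
mag = magnitude_sketch(cz)
```

Here the minimum is 60 and the maximum is 150, so the magnitude is 90 ohm.meters.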

watex.utils.makeCoords(reflong, reflat, nsites, *, r=45.0, utm_zone=None, step='1km', order='+', todms=False, is_utm=False, raise_warning=True, **kws)[source]#

Generate multiple stations coordinates (longitudes, latitudes) from a reference station/site.

One degree of latitude equals approximately 364,000 feet (69 miles), one minute equals 6,068 feet (1.15 miles), and one second equals 101 feet. One degree of longitude equals 288,200 feet (54.6 miles), one minute equals 4,800 feet (0.91 mile), and one second equals 80 feet; these longitude values hold at about 38 degrees of latitude, since the meridians converge toward the poles (1 foot = 0.3048 meter).

Parameters:
  • reflong (float or string or list of [start, stop]) – Reference longitude in degree decimal or in DD:MM:SS for the first site considered as the origin of the landmark.

  • reflat (float or string or list of [start, stop]) – Reference latitude in degree decimal or in DD:MM:SS for the reference site considered as the landmark origin. If value is given in a list, it can contain the start point and the stop point.

  • nsites (int or float) – Number of sites to generate the coordinates for.

  • r (float or int) – The rotation angle in degrees. It sets the direction of the projection line. Default value is 45 degrees.

  • step (float or str) – Offset, i.e. the distance of separation between consecutive sites, in meters. If the value is given as a string, it is taken as a value in meters unless it carries the km unit. Only meters and kilometers are acceptable.

  • order (str) – Direction of the projection line. By default the projected line is in ascending order i.e. from SW to NE with angle r set to 45 degrees. Could be - for descending order. Any other value should be in ascending order.

  • is_utm (bool) – Consider the first two positional arguments as UTM coordinate values. This is an alternative way to indicate that reflong and reflat are the UTM coordinates ‘easting’ and ‘northing’. If is_utm is False, any value greater than 180 degrees for longitude and 90 degrees for latitude will raise an error. Default is False.

  • utm_zone (string (##N or ##S)) – UTM zone in the form of a number followed by the North or South hemisphere, e.g. 10S or 03N. Must be given if is_utm is set to True.

  • todms (bool) – Convert the decimal degree values into DD:MM:SS. Default is False.

  • raise_warning (bool, default=True) – Raise warnings if GDAL is not set or about the accuracy status of the coordinates.

  • kws (dict,) – Additional keywords of gistools.project_point_utm2ll().

Returns:

Tuple of generated projected coordinates (longitudes, latitudes), either in decimal degrees or in DD:MM:SS.

Notes

The distances vary. A degree, minute, or second of latitude remains fairly constant from the equator to the poles; however a degree, minute, or second of longitude can vary greatly as one approaches the poles and the meridians converge.

References

https://math.answers.com/Q/How_do_you_convert_degrees_to_meters

Examples

>>> from watex.utils.coreutils import makeCoords
>>> rlons, rlats = makeCoords('110:29:09.00', '26:03:05.00',
...                                     nsites = 7, todms=True)
>>> rlons
... array(['110:29:09.00', '110:29:35.77', '110:30:02.54', '110:30:29.30',
       '110:30:56.07', '110:31:22.84', '110:31:49.61'], dtype='<U12')
>>> rlats
... array(['26:03:05.00', '26:03:38.81', '26:04:12.62', '26:04:46.43',
       '26:05:20.23', '26:05:54.04', '26:06:27.85'], dtype='<U11')
>>> rlons, rlats = makeCoords ((116.7, 119.90) , (44.2 , 40.95),
                                        nsites = 238, step =20. ,
                                        order = '-', r= 125)
>>> rlons
... array(['119:54:00.00', '119:53:11.39', '119:52:22.78', '119:51:34.18',
       '119:50:45.57', '119:49:56.96', '119:49:08.35', '119:48:19.75',
       ...
       '116:46:03.04', '116:45:14.43', '116:44:25.82', '116:43:37.22',
       '116:42:48.61', '116:42:00.00'], dtype='<U12')
>>> rlats
... array(['40:57:00.00', '40:57:49.37', '40:58:38.73', '40:59:28.10',
       '41:00:17.47', '41:01:06.84', '41:01:56.20', '41:02:45.57',
       ...
   '44:07:53.16', '44:08:42.53', '44:09:31.90', '44:10:21.27',
   '44:11:10.63', '44:12:00.00'], dtype='<U11')
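
The DD:MM:SS strings in the outputs above follow a fixed pattern that can be sketched with a small converter from decimal degrees. The helper `to_dms_sketch` is hypothetical and only illustrates the formatting that todms refers to; it is not the library's conversion routine.

```python
def to_dms_sketch(dd):
    """Format decimal degrees as DD:MM:SS.ss (illustrative only)."""
    sign = "-" if dd < 0 else ""
    # Work in seconds, rounded first to avoid outputs such as 59.999...
    total = round(abs(dd) * 3600, 2)
    degrees = int(total // 3600)
    minutes = int((total - degrees * 3600) // 60)
    seconds = total - degrees * 3600 - minutes * 60
    return f"{sign}{degrees}:{minutes:02d}:{seconds:05.2f}"

start_lon = to_dms_sketch(116.7)
stop_lon = to_dms_sketch(119.90)
stop_lat = to_dms_sketch(40.95)
```

The three values converted here are the bounds passed in the second example, and they come out in the same DD:MM:SS form as the end points of the arrays shown there.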
watex.utils.make_MXS_labels(y_true, y_pred, threshold=None, similar_labels=None, sep=None, prefix=None, method='naive', trailer='*', return_obj=False, **kws)[source]#

Create Mixture Learning Strategy (MXS) labels from the true labels ‘y_true’ and the predicted Naive Group of Aquifer (NGA) labels ‘y_pred’.

Parameters:
  • y_true (array-like 1d, pandas.Series) – Array composed of valid k-values and possible missing k-values.

  • y_pred (Array-like 1d, pandas.Series) – Array composed of the valid NGA labels. Note that NGA labels are predicted labels, mostly obtained with unsupervised learning.

  • threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to the other labels is considered the most representative and similar. Setting the rule by fixing the threshold is recommended, especially for a huge dataset.

  • similar_labels (list of tuple, optional) – list of tuple in pair (label and similar group). If given, the similar group must be the label existing in the predicted NGA. If None, the auto-similarity is triggered.

  • sep (str, default='') –

    Separator between the true labels ‘y_true’ and the predicted NGA labels. sep is used to rewrite the MXS labels. Mostly, the MXS labels are a combination of the true label of permeability coefficient ‘k’ and the label of NGA to compose new similarity labels. For instance

    >>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'IV']
    >>> # gives
    >>> MXS_labels= ['k1_II', 'k2_I', 'k3_IV']
    

    where the separator sep is set to _. This happens especially when one of the labels (NGA or true_labels) is not a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’ and so on.

  • prefix (str, default='') –

    prefix is used to rename the true_labels i.e the true valid-k. For instance:

    >>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
    

    where ‘k’ is the prefix.

  • method (str ['naive', 'strict'], default='naive') –

    The kind of strategy to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’. Indeed:

    • naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrences of other existing labels. This is useful for unbalanced class labels in y_true.

    • strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. over the total number of occurrences of all the labels that exist in the whole ‘y_pred’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.

  • trailer (str, default='*') –

    The Mixture strategy marker used to differentiate the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same class labels are also present in the true labels with the same label-identifier name. This is useful to avoid any confusion between the labels in y_true and y_pred, for better demarcation and distinction. Note that if trailer is set to None and both y_true and y_pred are numeric data, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance

    >>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
    >>> # with trailer , MXS labels should be
    >>>  MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
    >>> # with no trailer
    >>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
    

  • return_obj (bool, default=False) –

    If True, returns a MXS object (watex.utils.box.Boxspace) with useful attributes such as:

    • mxs_classes_: the MXS class labels.

    • mxs_labels_: the array-like of MXS labels. It also includes some non-similar labels from NGA.

    • mxs_map_classes_: a dict of the original class labels of the array ‘k’ (‘y_true’) and their temporary integer class labels. Indeed, if the ‘y_true’ class labels are not of a numeric dtype, new labels with integer dtype are created. The dict is used to wrap the true (original) labels during the MXS creation. Thus, the original labels are not altered and are mapped back at the end to recover their positions in the new MXS array. It is set to ‘None’ if ‘y_true’ has a numeric dtype.

    • mxs_group_classes_: dict of all the similar group labels with the MXS labels related to the modified existing groups of NGA. Note that the non-similar groups are modified if their labels are also found in the true labels, to avoid any confusion. Thus the dict wraps the non-similar labels with their new temporary labels.

    • mxs_similar_groups_: list of the similar labels found in y_true that have a similarity in NGA.

    • mxs_similarity_: tuple of similarities in pairs (label, group) existing between the label classes in y_true and NGA.

    • mxs_group_labels_: list of the similar groups found in the predicted NGA that have a similarity in the true labels ‘y_true’.

Returns:

MXS – array-like of MXS labels, or MXS object containing the useful attributes.

Return type:

array-like 1d or Boxspace

See also

predict_NGA_labels

Predicts Naive group of Aquifers labels.

Examples

>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import classify_k, make_MXS_labels
>>> data = load_hlogs ().frame
>>> # map data.k to categorize k values
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs = make_MXS_labels (ymap, data.aquifer_group)
>>> y_mxs[14:24]
...  array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> mxs_obj = make_MXS_labels (ymap, data.aquifer_group, return_obj=True )
>>> mxs_obj.mxs_labels_[14: 24]
... array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> # now we do the same task using the private data 'hf.csv'
>>> # composed of 11 boreholes. By default, we alternatively use
>>> # the aquifer groups as a fake NGA
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap =  classify_k(data.k , default_func =True)
>>> y_mxs= make_MXS_labels (ymap, data.aquifer_group)
>>> np.unique (y_mxs)
... array(['1', '1V', '2', '2III', '3', 'I', 'II', 'III&IV', 'IV'],
      dtype='<U6')
>>> # *comments:
    # label '1V' means that the group V (expected to be a cluster)
    # and the label 1 (true labels) have a similarity;
    # the same holds for label '2III', while the remaining label 3 does not
    # have any similarity with the other labels in the 'y_pred', expected
    # to be NGA labels.
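
The trailer behaviour described for the trailer parameter can be illustrated with a minimal pure-Python sketch. The helper `rename_nga_labels` is hypothetical and is not part of watex; it only reproduces the renaming of colliding NGA labels shown in the small `true_labels` / `NGA_labels` example above, under the assumption that the labels are integers when no trailer is used.

```python
def rename_nga_labels(true_labels, nga_labels, trailer="*"):
    """Hypothetical helper: make NGA labels distinct from the true labels."""
    true_set = set(true_labels)
    if trailer is not None:
        # Append the trailer to every NGA label that collides with a true label.
        return [f"{lab}{trailer}" if lab in true_set else str(lab)
                for lab in nga_labels]
    # No trailer: give colliding numeric labels fresh integers beyond the
    # largest label already in use.
    next_label = max(list(true_labels) + list(nga_labels)) + 1
    renamed = []
    for lab in nga_labels:
        if lab in true_set:
            renamed.append(next_label)
            next_label += 1
        else:
            renamed.append(lab)
    return renamed


true_labels = [1, 2, 3]
nga_labels = [0, 1, 2]
with_trailer = rename_nga_labels(true_labels, nga_labels)           # ['0', '1*', '2*']
without_trailer = rename_nga_labels(true_labels, nga_labels, None)  # [0, 4, 5]
```

The sketch only covers the renaming step; the similarity search between y_true and the NGA groups is left out.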
watex.utils.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)[source]#

Make a pipeline to transform data at once.

A naive pipeline is useful to quickly preprocess the data in one step for quick prediction.

Works with a pandas dataframe. If no features are given, the numerical and categorical features are automatically retrieved.

Parameters:
  • X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported, use sparse csc_matrix for maximum efficiency.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • num_features (list of str, optional) – List of numerical features. If num_features are given whereas cat_features are None, cat_features are figured out automatically.

  • cat_features (list of str, optional) – List of categorical features. If cat_features are given whereas num_features are None, num_features are figured out automatically.

  • label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – Kind of encoding used to encode the label. This assumes ‘y’ is supplied.

  • scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – Kind of scaling used to scale the numerical data. Note that for the categorical data encoding, ‘sklearn.preprocessing.OneHotEncoder’ is used under the hood instead.

  • missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

  • impute_strategy (str, default='median') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value, i.e. fixed-value imputation. Can be used with strings or numeric data.

  • sparse_output (bool, default=True) – Used when the label y is given, to binarize labels in a one-vs-all fashion. If True, the array returned from transform is in sparse CSR format.

  • for_pca (bool, default=False) – Transform data for principal component analysis (PCA). If set to True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.

  • transform (bool, default=False) – Transform data in place rather than returning the naive pipeline.

Returns:

  • full_pipeline (watex.exlib.sklearn.FeatureUnion) –

    • Full pipeline composed of numerical and categorical pipes

  • (X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) –

    • Transformed data.

Examples

>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs

(1) Make a naive simple pipeline with RobustScaler, StandardScaler

>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs (as_frame=True )# get all the data
>>> pipe = make_naive_pipe(X_, scaler =RobustScaler )

(2) Transform X in place with numerical and categorical features using StandardScaler (default). Returns a CSR matrix.

>>> make_naive_pipe(X_, transform =True )
... <181x40 sparse matrix of type '<class 'numpy.float64'>'
    with 2172 stored elements in Compressed Sparse Row format>
watex.utils.moving_average(y, *, window_size=3, method='sma', mode='same', alpha=0.5)[source]#

A moving average is used with time series data to smooth out short-term fluctuations and highlight longer-term trends or cycles.

The function analyzes data points by creating a series of averages of different subsets of the full data set.

Parameters:
  • y (array_like, shape (N,)) – The values of the time history of the signal.

  • window_size (int) – The length of the window. Must be greater than 1 and preferably an odd integer. Default is 3.

  • method (str) – Variant of moving average. Can be sma, cma, wma or ema for simple, cumulative, weighted and exponential moving average, respectively. Default is sma.

  • mode (str) – Convolution mode. Can be full, same or valid. With full, the convolution is returned at each point of overlap, with an output shape of (N+M-1,); at the end-points of the convolution, the signals do not overlap completely, and boundary effects may be seen. See numpy.convolve for more details. Default is same.

  • alpha (float) – Smoothing factor. Only used in the exponential moving average. Default is 0.5.

Returns:

ya – Averaged time history of the signal

Return type:

array like, shape (N,)

Notes

The first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next value in the subset.

Examples

>>> import numpy as np ; import matplotlib.pyplot as plt
>>> from watex.utils.exmath  import moving_average
>>> data = np.random.randn (37)
>>> # add Gaussian noise to the data
>>> data = 2 * np.sin( data)  + np.random.normal (0, 1 , len(data))
>>> window = 5  # fixed size to 5
>>> # window_size is keyword-only in the signature
>>> sma = moving_average(data, window_size=window)
>>> cma = moving_average(data, window_size=window, method ='cma' )
>>> wma = moving_average(data, window_size=window, method ='wma' )
>>> ema = moving_average(data, window_size=window, method ='ema' , alpha =0.6)
>>> x = np.arange(len(data))
>>> plt.plot (x, data, 'o', x, sma , 'ok--', x, cma, 'g-.', x, wma, 'b:')
>>> plt.legend (['data', 'sma', 'cma', 'wma'])
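
The variants named in the method parameter can be sketched with plain NumPy. The functions below follow the textbook definitions of the simple, cumulative and exponential moving averages; their names are hypothetical and they are not claimed to match the internal implementation of moving_average.

```python
import numpy as np


def simple_moving_average(y, window_size=3, mode="same"):
    # Convolve the signal with a uniform kernel of length window_size.
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window_size) / window_size
    return np.convolve(y, kernel, mode=mode)


def cumulative_moving_average(y):
    # Mean of all the samples seen so far at each position.
    y = np.asarray(y, dtype=float)
    return np.cumsum(y) / np.arange(1, len(y) + 1)


def exponential_moving_average(y, alpha=0.5):
    # Recursive form: out[t] = alpha * y[t] + (1 - alpha) * out[t - 1].
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]
    for t in range(1, len(y)):
        out[t] = alpha * y[t] + (1 - alpha) * out[t - 1]
    return out


y = [1.0, 2.0, 3.0, 4.0, 5.0]
sma = simple_moving_average(y, window_size=3)   # [1., 2., 3., 4., 3.]
cma = cumulative_moving_average(y)              # [1., 1.5, 2., 2.5, 3.]
ema = exponential_moving_average(y, alpha=0.5)  # [1., 1.5, 2.25, 3.125, 4.0625]
```

With mode="same" the first and last values of the simple average are computed from an incomplete window, which is the boundary effect mentioned in the mode parameter.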

watex.utils.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, keep_empty_features=False, **fit_params)[source]#

Impute missing values in the data.

When the data contains categorical features, passing ‘bi-impute’ to the ‘mode’ parameter imputes both the numerical and categorical features rather than raising an error when the ‘strategy’ is not set to ‘most_frequent’.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to impute.

  • y (None) – Not used, present here for API consistency by convention.

  • strategy (str, default='mean') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value, i.e. fixed-value imputation. Can be used with strings or numeric data.

  • mode (str, ['bi-impute'], default=None) – If mode is set to ‘bi-impute’, both the numerical and categorical features are imputed and a single imputed dataframe is returned.

  • drop_features (bool or list, default=False) – Drop a list of features from the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.

  • missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.

  • fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

  • keep_empty_features (bool, default=False) –

    If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.

    New in version 0.2.0.

  • verbose (int, default=0) – Controls the verbosity of the imputer.

  • copy (bool, default=True) –

    If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

    • If X is not an array of floating values;

    • If X is encoded as a CSR matrix;

    • If add_indicator=True.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

  • fit_params (dict) – Keyword arguments passed to the scikit-learn fitting parameters. More details at https://scikit-learn.org/stable/

Returns:

Xi – Imputed data.

Return type:

Dataframe, array-like, sparse matrix of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X= np.random.randn ( 7, 4 )
>>> X[3, :] =np.nan  ; X[:, 3][-4:]=np.nan
>>> naive_imputer  (X)
... array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
           [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
           [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
           [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
           [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
           [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
           [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame (X, columns =['a', 'b', 'c', 'd']  )
>>> # change [bc] types to categorical values.
>>> frame['b']=['pineaple', '', 'cabbage', 'watermelon', 'onion',
                'cabbage', 'onion']
>>> frame['c']=['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> naive_imputer(frame, mode ='bi-impute')
...             b      c         a         d
    0    pineaple   lion  1.347835  0.434558
    1     cabbage    cat  0.368432 -0.293720
    2     cabbage    cat -1.689750  0.189391
    3  watermelon    cat  0.060138  0.110076
    4       onion    dog -0.271291  0.110076
    5     cabbage    cat  1.092240  0.110076
    6       onion  mouse -0.486639  0.110076
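
The “mean” strategy described above can be sketched with NumPy alone. The helper `mean_impute` is a hypothetical illustration of column-wise mean imputation for numeric arrays; it is not the watex or scikit-learn implementation and does not handle categorical features or columns made only of NaN.

```python
import numpy as np


def mean_impute(X):
    # Replace every NaN by the mean of the non-missing values of its column.
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
Xi = mean_impute(X)
# Xi -> [[1., 2.], [2., 4.], [3., 3.]]
```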
watex.utils.naive_scaler(X, y=None, *, kind=<class 'sklearn.preprocessing._data.StandardScaler'>, copy=True, with_mean=True, with_std=True, feature_range=(0, 1), clip=False, norm='l2', **fit_params)[source]#

Quick data scaling using the strategies implemented in scikit-learn, such as StandardScaler and MinMaxScaler.

The function returns a scaled dataframe if a dataframe is passed, otherwise an ndarray. For other scaling strategies, call scikit-learn directly.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • kind (str, default='StandardScaler') – Kind of data scaling. Can also be [‘MinMaxScaler’, ‘Normalizer’]. The default is ‘StandardScaler’

  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

  • feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.

  • norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.

  • clip (bool, default=False) – Set to True to clip transformed values of held-out data to provided feature range.

  • fit_params (dict) – Keyword arguments passed to the scikit-learn fitting parameters. More details at https://scikit-learn.org/stable/

Returns:

X_sc – Transformed array.

Return type:

{ndarray, sparse matrix} or dataframe of shape (n_samples, n_features)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_scaler
>>> X= np.random.randn (7 , 3 )
>>> X_std = naive_scaler (X )
... array([[ 0.17439644,  1.55683005,  0.24115109],
       [-0.59738672,  1.3166854 ,  1.23748004],
       [-1.6815365 , -1.19775838,  0.71381357],
       [-0.1518278 , -0.32063059, -0.47483155],
       [-0.41335886,  0.13880519,  0.69258621],
       [ 1.45221902, -1.03852015, -0.40157981],
       [ 1.21749443, -0.45541153, -2.00861955]])
>>> # use dataframe
>>> Xdf = pd.DataFrame (X, columns =['a', 'c', 'c'])
>>> naive_scaler (Xdf , kind='Normalizer') # return data frame
...           a         c         c
    0  0.252789  0.967481 -0.008858
    1 -0.265161  0.908862  0.321961
    2 -0.899863 -0.416231  0.130380
    3  0.178203  0.039443 -0.983203
    4 -0.418487  0.800306  0.429394
    5  0.933933 -0.309016 -0.179661
    6  0.795234 -0.051054 -0.604150
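
The three kinds of scaling accepted by the kind parameter can be written out with NumPy to show what each one computes. This is a sketch of the standard formulas under the default options (with_mean=True, with_std=True, feature_range=(0, 1), norm='l2'); the function names are hypothetical and the code does not call watex or scikit-learn.

```python
import numpy as np


def standard_scale(X):
    # Zero mean and unit (population) standard deviation per column.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def minmax_scale(X, feature_range=(0, 1)):
    # Rescale each column to the requested range.
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X_std * (hi - lo) + lo


def l2_normalize(X):
    # Rescale each row (sample) to unit Euclidean norm.
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))
X_std = standard_scale(X)
X_mm = minmax_scale(X)
X_l2 = l2_normalize(X)
```

Note the difference in axis: the standard and min-max scalers work per feature (column), whereas the normalizer works per sample (row), which is why each row of the ‘Normalizer’ output above has unit norm.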
watex.utils.normalizer(arr, /, method='naive')[source]#

Normalize values to be between 0 and 1.

This normalizer handles NaN values and translates data individually such that it is in the given range on the training set, e.g. between zero and one.

Note that when the method is set to 'MinMax', the transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_normed = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Parameters:
  • arr (Arraylike) – Array to normalize; can contain NaN values.

  • method (str) – Normalization method. Can use the scikit-learn MinMaxScaler for normalization; any other value uses the naive normalization.

Returns:

arr_norm – Normalized array.

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import normalizer
>>> np.random.seed (42)
>>> arr = np.random.randn (3, 2 )
array([[ 0.49671415, -0.1382643 ],
       [ 0.64768854,  1.52302986],
       [-0.23415337, -0.23413696]])
>>> normalizer (arr )
array([[4.15931313e-01, 5.45697636e-02],
       [5.01849720e-01, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max')  # normalize data along axis=0
array([[0.82879654, 0.05456093],
       [1.        , 1.        ],
       [0.        , 0.        ]])
>>> arr [0, 1] = np.nan; arr [1, 0] = np.nan
>>> normalizer (arr )
array([[4.15931313e-01,            nan],
       [           nan, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max')
array([[ 1., nan],
       [nan,  1.],
       [ 0.,  0.]])
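
The outputs above are consistent with a naive normalization that uses the global minimum and maximum of the array while ignoring NaN, and a min-max normalization applied column by column. The NumPy sketch below reproduces both on the same seeded array; the function names are hypothetical and the code is an illustration, not the watex implementation.

```python
import numpy as np


def naive_normalize(arr):
    # Global min-max over the whole array, ignoring NaN values.
    arr = np.asarray(arr, dtype=float)
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    return (arr - lo) / (hi - lo)


def columnwise_minmax(arr):
    # Min-max along axis=0 (one range per column), ignoring NaN values.
    arr = np.asarray(arr, dtype=float)
    lo = np.nanmin(arr, axis=0)
    hi = np.nanmax(arr, axis=0)
    return (arr - lo) / (hi - lo)


arr = np.array([[0.49671415, -0.1382643],
                [0.64768854, 1.52302986],
                [-0.23415337, -0.23413696]])
naive = naive_normalize(arr)
by_column = columnwise_minmax(arr)

# NaN entries stay NaN and do not affect the range of the other values.
arr_nan = arr.copy()
arr_nan[0, 1] = np.nan
arr_nan[1, 0] = np.nan
naive_nan = naive_normalize(arr_nan)
```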
watex.utils.ohmicArea(data=None, search=45.0, sum=False, objective='ohmS', **kws)[source]#

Compute the ohmic-area from the Vertical Electrical Sounding data collected in exploration area.

Parameters:
  • data (pandas DataFrame) – Contains the depth measurement AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.

  • search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45 m, so search can be set to the average water inrush value.

  • objective (str) – Type of operation to output. By default, the function outputs the value of the pseudo-area in \(\Omega .m^2\). However, for plotting purposes, setting the argument to view alternatively outputs the X and Y values, recomputed and projected, as well as the X and Y values of the expected fractured zone, where X is the AB dipole spacing when imaging to the depth and Y is the computed apparent resistivity.

  • kws (dict) – Additional keyword arguments from Vertical Electrical Sounding (VES) data operations. See watex.utils.exmath.vesDataOperator() for further details.

Returns:
List of two tuples:
  • Tuple (ohmS, error, roots):
    • ohmS is the computed pseudo-area, expected to be a fractured zone.

    • error is the integration error.

    • roots are the integration boundaries of the expected fractured zone, where the basement rock curve is located above the resistivity transform function. At these points, the difference between both curves is null.

  • Tuple (XY, fit XY, XYohmSarea):
    • XY is the ndarray(nvalues, 2) of the processed AB dipole spacing and resistivity rhoa values.

    • fit XY is the fitting ndarray(nvalues, 2) used to redraw the dummy resistivity transform function.

    • XYohmSarea is the ndarray(nvalues, 2) of the dipole spacing and resistivity values of the expected fracture zone.

Raises:
VESError

If search is greater than or equal to the maximum investigation depth in meters.

Notes

The ohmS value calculated from the pseudo-area is a fully data-driven parameter and is used to evaluate a pseudo-area of the fracture zone from the depth where the basement rock is supposed to start. Usually, when exploring deeper using the Vertical Electrical Sounding, we are looking for groundwater in the fractured rock that is outside the anthropic pollution (Biemi, 1992). Since the VES is an indirect method, we cannot ascertain whether the presumed fractured rock contains water. However, we assume that the fracture zone could exist and should contain groundwater. Mathematically, based on the VES1D model proposed by Koefoed, O. (1976), we consider a function \(\rho_T(l)\), a set of reducing resistivity transform functions to lower the boundary plane at half the current electrode spacing \((l)\). From the sounding curve \(\rho_T(l)\), an imaginary basement rock curve \(b_r (l)\) of slope equal to 45° with the horizontal \(h(l)\) was created. A pseudo-area \(S(l)\) is defined by extending the \(b_r (l)\) curve from \(h(l)\) when the sounding curve \(\rho_T(l)\) is below \(b_r(l)\); otherwise \(S(l)\) is equal to zero. The computed area is called the ohmic-area \(ohmS\), expressed in \(\Omega .m^2\), and constitutes the expected fractured zone. Thus \(ohmS > 0\) confirms the existence of the fracture zone while \(ohmS = 0\) raises doubts. The equation to determine the parameter is given as:

\[ \begin{aligned} ohmS &= \int_{l_i}^{l_{i+1}} S(l)\,dl \quad \text{s.t.} \\ S(l) &= \begin{cases} b_r(l) - \rho_T(l) & \text{if} \quad b_r(l) > \rho_T(l) \\ 0 & \text{if} \quad b_r(l) \leq \rho_T(l) \end{cases} \\ b_r(l) &= l + h(l) \quad ; \quad h(l) = \beta \\ \rho_T(l) &= l^2 \int_{0}^{\infty} T_i(\lambda)\, h_1(\lambda l)\, \lambda \, d\lambda \end{aligned} \]

where \(l_i\) and \(l_{i+1}\) solve the equation \(S(l) = 0\); \(l\) is half the current electrode spacing \(AB/2\); \(h_1\) denotes the first-order Bessel function of the first kind; \(\beta\) is the y-axis coordinate of the intercept of \(b_r(l)\) and \(h(l)\); \(T_i(\lambda)\) is the resistivity transform function; \(\lambda\) denotes the integration variable; \(n\) denotes the number of layers; and \(\rho_i\) and \(h_i\) are the resistivity and thickness of the \(i\)-th layer, respectively. More explanations and clarifications of the formula can be found in the paper of Kouadio et al. (2022).

. _Cote d’Ivoire: https://en.wikipedia.org/wiki/Ivory_Coast

Examples

>>> from watex.utils.exmath import ohmicArea
>>> from watex.utils.coreutils import vesSelector
>>> data = vesSelector (f= 'data/ves/ves_gbalo.xlsx')
>>> (ohmS, err, roots), *_ = ohmicArea(data = data, search =45, sum =True )
... (13.46012197818152, array([5.8131967e-12]), array([45.        , 98.07307307]))
>>> # pseudo-area is computed between the spacing points AB = [45, 98] in depth.
>>> _, (XY.shape, XYfit.shape, XYohms_area.shape) = ohmicArea(
                AB= data.AB, rhoa =data.resistivity, search =45,
                objective ='plot')
... ((26, 2), (1000, 2), (8, 2))
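
The integral defined in the Notes can be approximated numerically once the sounding curve is sampled. The sketch below is a toy illustration under stated assumptions: the function name `ohmic_area` is hypothetical, the sounding curve \(\rho_T(l)\) is given as sampled values instead of being derived from the resistivity transform, and the integration runs over the whole grid with a hand-written trapezoidal rule rather than between the roots found by watex.

```python
import numpy as np


def ohmic_area(l, rho_t, beta=0.0):
    """Trapezoidal estimate of the integral of S(l) = max(b_r(l) - rho_T(l), 0)."""
    l = np.asarray(l, dtype=float)
    rho_t = np.asarray(rho_t, dtype=float)
    b_r = l + beta                        # basement line with a 45 degree slope
    s = np.clip(b_r - rho_t, 0.0, None)   # zero wherever rho_T is above b_r
    return float(np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(l)))


# Toy curves: constant rho_T = 5 and b_r(l) = l, so S(l) is a triangle
# between l = 5 and l = 10 whose area is 0.5 * 5 * 5.
l = np.linspace(0.0, 10.0, 1001)
rho_t = np.full_like(l, 5.0)
area = ohmic_area(l, rho_t)
```

For these toy curves the estimate is approximately 12.5, the area of the triangle; real VES data would first need the fitted resistivity curve.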
watex.utils.plotAnomaly(erp, cz=None, station=None, fig_size=(10, 4), fig_dpi=300, savefig=None, show_fig_title=True, style='seaborn', fig_title_kws=Ellipsis, czkws=Ellipsis, legkws=Ellipsis, how='py', **kws)[source]#

Plot the whole Electrical Resistivity Profiling line and selected conductive zone.

The conductive zone can be supplied manually as a subset of the erp or by specifying the station expected for the drilling location, for instance S07 for the seventh station. Furthermore, for automatic detection, one should set the station argument to auto. However, it is recommended to provide the cz or the station to have full control. The conductive zone overlays the whole Electrical Resistivity Profiling survey. The user can customize the cz plot by passing additional Matplotlib pyplot keyword arguments through the keyword argument czkws.

Parameters:
  • erp (array_like 1d) – The Electrical Resistivity Profiling survey line. The line is an array of resistivity values. Note that if a dataframe is passed, be sure that the frame matches the DC resistivity data (ERP), otherwise an error occurs. The frame columns must include at least the resistivity and stations.

  • cz (array_like 1d) – The selected conductive zone. If None, only the erp is displayed. Note that cz is a subset of the erp array.

  • station (str, optional) – The station location given as a string (e.g. station="S10") or as a station number (indexing; e.g. station=10). If the value is set to "auto", the station is found automatically and cz is fetched as well.

  • fig_size (tuple, default=(10, 4)) – Figure size. Refer to Matplotlib figure.

  • fig_dpi (int, default=300) – Figure resolution in dots per inch. Refer to Matplotlib figure.

  • savefig (str, optional) – Save the figure. Refer to Matplotlib figure.

  • show_fig_title (bool, default=True) – Display the title of the figure.

  • fig_title_kws (dict) – Keyword arguments of the figure suptitle. Refer to Matplotlib figsuptitle.

  • style (str) – The style for customizing the visualization. For instance, to get the first seven available styles in pyplot, one can run the script below:

    plt.style.available[:7]

    Further details can be found in the web resources or on GeeksforGeeks.

  • how (str, default='py') – By default (how='py'), the stations are named following the Python indexing, i.e. counting starts from station 00 (S00). Any other value starts the station naming from 1.

  • czkws (dict) – Additional Matplotlib pyplot keyword arguments to customize the cz plot.

  • legkws (dict) – Additional Matplotlib legend keyword arguments.

  • kws (dict) – Additional Matplotlib pyplot keyword arguments to customize the erp plot.

See also

watex.erpSmartDetector

Detects the conductive zone by applying constraints. Set view=True to visualize the constraints.

Examples

>>> import numpy as np
>>> from watex.utils import plotAnomaly, defineConductiveZone
>>> test_array = np.abs (np.random.randn (10)) *1e2
>>> selected_cz ,*_ = defineConductiveZone(test_array, 7)
>>> plotAnomaly(test_array, selected_cz )
>>> plotAnomaly(test_array, selected_cz , station= 5)
>>> plotAnomaly(test_array, station= 's02')
>>> plotAnomaly(test_array)
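
The effect of the how parameter on station naming can be sketched in a few lines. The helper `station_names` is hypothetical and assumes two-digit, zero-padded names such as S00 and S07, as used in the description above; it is not the naming routine of watex.

```python
def station_names(n_stations, how="py"):
    # how='py' counts from S00 (Python indexing); any other value counts from S01.
    start = 0 if how == "py" else 1
    return [f"S{i:02d}" for i in range(start, start + n_stations)]


py_names = station_names(3)                      # ['S00', 'S01', 'S02']
natural_names = station_names(3, how="natural")  # ['S01', 'S02', 'S03']
```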
watex.utils.plotOhmicArea(data=None, search=45.0, pre_computed=False, xy=None, xyf=None, xyarea=None, colors=None, fbtw=False, **plot_kws)[source]#

Plot the ohmic-area of the Vertical Electrical Sounding data.

Parameters:
  • data (pandas DataFrame) – Contains the depth measurement AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.

  • search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45 m, so search can be set to the average water inrush value.

  • pre_computed (bool, default=False) – If True, the ohmic-area parameters are assumed to be already computed and must be passed to xy, xyf and xyarea, otherwise an error is raised. If False, they are computed from data.

  • xy (array-like of shape (n_AB, 2)) – Array-like of the sanitized depth measurements AB from the current electrodes, where n_AB is the number of measurements. See vesDataOperator().

  • xyf (array-like of shape (n_fit_samples, 2)) – Array-like of the fitted samples, i.e. the points used for fitting the sounding resistivity values from the surface through the total depth. The fitted rhoa shows a smooth curve. The default number of points is 1000.

  • xyarea (array-like of shape (n_area, 2)) – Array-like of the resistivity positions of the depth measurement AB where the fractured zone is found.

  • fbtw (bool, default=False) – If True, fill the computed fractured zone using the parameters computed from xyf and xyarea.

  • kws (dict) – Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.

Notes

The first and second columns of xy, xyf and xyarea are the AB/2 positions and their corresponding resistivity values.

Examples

>>> from watex.datasets import load_semien
>>> from watex.utils.exmath import plotOhmicArea
>>> ves_data = load_semien ()
>>> plotOhmicArea (ves_data)
watex.utils.plot_clusters(n_clusters, X, y_pred, cluster_centers=None, savefig=None)[source]#

Visualize the clusters that k-means identified in the dataset.

Parameters:
  • n_clusters – int, number of clusters to visualize.

  • X – NDArray, data containing the features; expected to be two-dimensional.

  • y_pred – array-like, array containing the predicted class labels.

  • cluster_centers – NDArray containing the coordinates of the centroids or the similar points with continuous features.

Example:

>>> from watex.exlib.sklearn import KMeans, MinMaxScaler
>>> from watex.utils.plotutils import plot_clusters
>>> from watex.datasets import fetch_data
>>> h= fetch_data('hlogs').frame
>>> # collect two features 'resistivity' and gamma-gamma logging values
>>> h2 = h[['resistivity', 'gamma_gamma']]
>>> km = KMeans (n_clusters =3 , init= 'random' )
>>> # scaled the data with MinMax scaler i.e. between ( 0-1)
>>> h2_scaled = MinMaxScaler().fit_transform(h2)
>>> ykm = km.fit_predict(h2_scaled )
>>> plot_clusters (3 , h2_scaled, ykm , km.cluster_centers_ )
watex.utils.plot_confidence_in(z_or_edis_obj_list, /, tensor='res', view='1d', drop_outliers=True, distance=None, c_line=False, view_ci=True, figsize=(6, 2), fontsize=4.0, dpi=300.0, top_label='Stations', rotate_xlabel=90.0, fbtw=True, savefig=None, **plot_kws)[source]#

Plot data confidence from tensor errors.

The default tensor for evaluating the data confidence is the resistivity at TE mode (‘xy’).

Checking the confidence in the data before starting the actual processing is worthwhile. In areas with complex terrain, where high topography adds to interference noise, signals are weak or missing, especially when using an AMT survey. The most common technique to deal with this is to eliminate the bad frequencies and interpolate the remaining ones. However, the tricks for eliminating frequencies differ from one author to another. Here, the data confidence indicates which frequencies to eliminate (at which stations/sites) and which ones are still recoverable using the tensor recovery strategy.

The plot implements three levels of confidence:

  • High confidence: \(conf. \geq 0.95\), i.e. values greater than or equal to 95%.

  • Soft confidence: \(0.5 \leq conf. < 0.95\). The data in this confidence range can be beneficial for tensor recovery to restore the weak and missing signals.

  • Bad confidence: \(conf. <0.5\). Data in this interval must be deleted.
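
The three levels can be expressed as a small threshold function. The helper `confidence_level` is hypothetical; it only encodes the thresholds listed above and assumes the confidence values are already available as fractions between 0 and 1. How the confidence is derived from the tensor errors is not covered here.

```python
def confidence_level(conf):
    # Thresholds taken from the three levels listed above.
    if conf >= 0.95:
        return "high"
    if conf >= 0.5:
        return "soft"   # candidate for tensor recovery
    return "bad"        # data to be deleted


levels = [confidence_level(c) for c in (0.99, 0.95, 0.7, 0.5, 0.2)]
# levels -> ['high', 'high', 'soft', 'soft', 'bad']
```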

Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI- or Impedances tensors objects.

  • tensor (str, default='res') – Tensor name. Can be [ ‘resistivity’|’phase’|’z’|’frequency’]

  • view (str, default='1d') – Type of plot. Can be [‘1D’|’2D’]

  • drop_outliers (bool, default=True) – Suppress the ouliers in the data if True.

  • distance (float, optional) – Distance between stations/sites

  • fontsize (float, default=3.) – label font size.

  • figsize (Tuple, default=(6, 2)) – Figure size.

  • c_line (bool, default=True,) – Display the confidence line in two dimensinal view.

  • dpi (int, default=300) – Image resolution in dot-per-inch

  • rotate_xlabel (float, default=90.) – Angle to rotate the stations/sites labels

  • top_label (str,default='Stations') – Labels the sites either using the survey name.

  • view_ci (bool,default=True,) – Show the marker of confidence interval.

  • fbtw (bool, default=True,) – Fill between confidence interval.

  • plot_kws (dict) – Additional keyword arguments passed to plot().

See also

watex.methods.Processing.zrestore

For more details about the tensor recovery technique.

Examples

>>> from watex.utils.exmath import plot_confidence_in
>>> from watex.datasets import fetch_data
>>> emobj  = fetch_data ( 'huayuan', samples = 25, clear_cache =True,
                         key='raw').emo
>>> plot_confidence_in (emobj.ediObjs_ ,
                        distance =20 ,
                        view ='2d',
                        figsize =(6, 2)
                        )
>>> plot_confidence_in (emobj.ediObjs_ , distance =20 ,
                        view ='1d', figsize =(6, 3), fontsize =5,
                        )
watex.utils.plot_confusion_matrices(clfs, Xt, yt, annot=True, pkg=None, normalize='true', sample_weight=None, encoder=None, fig_size=(22, 6), savefig=None, subplot_kws=None, **scorer_kws)[source]#

Plot multiple model confusion matrices inline using either scikit-learn or ‘yellowbrick’.

Parameters:
  • clfs (list of classifier estimators) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. Note that the classifier must be fitted beforehand.

  • Xt (ndarray or DataFrame of shape (M X N)) – A matrix of n instances with m features. Preferably, matrix represents the test data for error evaluation.

  • yt (ndarray of shape (M, ) or Series of length (M, )) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.

  • pkg (str, optional , default ='sklearn') – the library to handle the plot. It could be ‘yellowbrick’. The basic confusion matrix is handled by the scikit-learn package.

  • normalize ({'true', 'pred', 'all'}, default='true') – Normalizes the confusion matrix over the true (rows) or predicted (columns) conditions, or over all the population. If None, the confusion matrix will not be normalized.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.

  • annot (bool, default=True) – Annotate the number of samples (right or wrong predictions) in the plot. Set to False to mute the display.

  • fig_size (tuple (width, height), default=(22, 6)) – The matplotlib figure size given as a tuple of width and height.

  • savefig (str, default=None) – The path to save the figures. Argument is passed to matplotlib.Figure class.
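To make the three `normalize` options concrete, here is a minimal pure-Python sketch of what each one does to a matrix of counts. The helper `normalize_confusion` and the counts are hypothetical; the plotting function itself relies on scikit-learn or yellowbrick for this step.

```python
def normalize_confusion(cm, normalize=None):
    """Normalize a confusion matrix given as a list of rows (true classes)."""
    cm = [[float(v) for v in row] for row in cm]
    if normalize == "true":   # each row (true class) sums to 1
        return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in cm]
    if normalize == "pred":   # each column (predicted class) sums to 1
        col_sums = [sum(col) for col in zip(*cm)]
        return [[v / s if s else 0.0 for v, s in zip(row, col_sums)] for row in cm]
    if normalize == "all":    # the whole matrix sums to 1
        total = sum(sum(row) for row in cm)
        return [[v / total if total else 0.0 for v in row] for row in cm]
    return cm                 # None: raw counts

# Hypothetical counts: rows are true classes, columns are predictions.
counts = [[8, 2], [1, 9]]
by_true = normalize_confusion(counts, "true")
by_pred = normalize_confusion(counts, "pred")
by_all = normalize_confusion(counts, "all")
print(by_true, by_pred, by_all, sep="\n")
```

With `'true'`, the diagonal reads as per-class recall; with `'pred'`, it reads as per-class precision.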

Examples

>>> import matplotlib.pyplot as plt
>>> plt.style.use ('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models.premodels import p
>>> from watex.utils.plotutils import plot_confusion_matrices
>>> # split the  data . Note that fetch_data output X and y
>>> X, Xt, y, yt  = train_test_split (* fetch_data ('bagoue analysed'), test_size =.25  )
>>> # compose the models
>>> # from RBF, and poly
>>> models =[ p.SVM.rbf.best_estimator_,
         p.LogisticRegression.best_estimator_,
         p.RandomForest.best_estimator_
         ]
>>> models
[SVC(C=2.0, coef0=0, degree=1, gamma=0.125), LogisticRegression(),
 RandomForestClassifier(criterion='entropy', max_depth=16, n_estimators=350)]
>>> # now fit all estimators
>>> fitted_models = [model.fit(X, y) for model in models ]
>>> plot_confusion_matrices(fitted_models , Xt, yt)
watex.utils.plot_cost_vs_epochs(regs, *, fig_size=(10, 4), marker='o', savefig=None, **kws)[source]#

Plot the cost against the number of epochs for the different learning rates.

Parameters:
  • regs (Callable, single or list of regression estimators) – Estimator should be already fitted.

  • fig_size (tuple , default is (10, 4)) – the size of figure

  • kws (dict) – Additional keyword arguments passed to matplotlib.pyplot.plot()

Returns:

ax

Return type:

Matplotlib.pyplot axes objects

Examples

>>> from watex.datasets import load_iris
>>> from watex.base import AdalineGradientDescent
>>> from watex.utils.plotutils import plot_cost_vs_epochs
>>> X, y = load_iris (return_X_y= True )
>>> ada1 = AdalineGradientDescent (n_iter= 10 , eta= .01 ).fit(X, y)
>>> ada2 = AdalineGradientDescent (n_iter=10 , eta =.0001 ).fit(X, y)
>>> plot_cost_vs_epochs (regs = [ada1, ada2] )
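To show what the plotted curves represent, the sketch below records the cost per epoch of a batch-gradient-descent Adaline for two learning rates. It is a simplified pure-Python stand-in, not watex's `AdalineGradientDescent`; the function name `adaline_costs` and the toy data are assumptions made for illustration.

```python
def adaline_costs(X, y, eta, n_iter):
    """Return the sum-of-squared-errors cost (halved) recorded at each epoch."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)          # w[0] is the bias
    costs = []
    for _ in range(n_iter):
        outputs = [w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)) for row in X]
        errors = [yi - oi for yi, oi in zip(y, outputs)]
        # Batch update: every sample contributes to each weight change.
        for j in range(n_features):
            w[j + 1] += eta * sum(e * row[j] for e, row in zip(errors, X))
        w[0] += eta * sum(errors)
        costs.append(0.5 * sum(e * e for e in errors))
    return costs

# Toy one-feature data set with labels in {-1, 1}.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1.0, -1.0, 1.0, 1.0]
costs_small = adaline_costs(X, y, eta=0.01, n_iter=10)  # converges
costs_large = adaline_costs(X, y, eta=0.1, n_iter=10)   # overshoots and diverges
print(costs_small[0], costs_small[-1])
print(costs_large[0], costs_large[-1])
```

Plotting both lists against the epoch index gives the kind of comparison the function draws: a decreasing curve for a suitable learning rate and a growing one when the rate is too large for unscaled data.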
watex.utils.plot_elbow(X, n_clusters, n_init=10, max_iter=300, random_state=42, fig_size=(10, 4), marker='o', savefig=None, **kwd)[source]#

Plot the elbow method to find the optimal number of clusters, k, for given data.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.

  • n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.

  • n_init (int, default=10) – Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

  • max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.

  • tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

  • verbose (int, default=0) – Verbosity mode.

  • random_state (int, RandomState instance or None, default=42) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.

  • marker (str, default='o',) – cluster marker point.

  • kwd (dict) – Additional keyword arguments passed to matplotlib.pyplot.plot()

Returns:

ax

Return type:

Matplotlib.pyplot axes objects

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_elbow
>>> # get only the resistivity and gamma-gamma values for example
>>> res_gamma = load_hlogs ().frame[['resistivity', 'gamma_gamma']]
>>> plot_elbow(res_gamma, n_clusters=11)
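The quantity behind an elbow plot is the inertia (within-cluster sum of squared distances) for each candidate k. The sketch below computes it with a small pure-Python k-means. It is an illustration only: the deterministic farthest-point seeding (`maximin_init`) is an assumption chosen to keep the example reproducible and is not what scikit-learn or watex use, and the points are made up.

```python
import math

def maximin_init(points, k):
    """Deterministic seeding: first point, then the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, max_iter=300):
    """Run Lloyd's algorithm and return the final inertia."""
    centers = maximin_init(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Three well-separated groups of made-up 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9), (1, 9), (0.5, 8), (1.5, 8.5)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

The inertia drops sharply up to k=3 and only marginally afterwards; that bend is the "elbow" read off the plot.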
watex.utils.plot_l_curve(rms, roughness, tau=None, hansen_point=None, rms_target=None, view_tline=False, hpoint_kws={}, fig_size=(10, 4), ax=None, fig=None, style='classic', savefig=None, **plot_kws)[source]#

Plot the Hansen L-curve.

The L-curve criterion is used to determine the suitable model after running multiple inversions with different \(\tau\) values. The function plots RMS vs. roughness with an option to highlight a specific point named the Hansen point [1].

The \(\tau\) represents the measure of compromise between data fit and model smoothness. To find an appropriate \(\tau\)-value, the inversion is carried out with different \(\tau\)-values. The RMS error obtained from each inversion is plotted against model roughness.

Parameters:
  • rms (ArrayLike, list,) – Corresponding list or ArrayLike of RMS values.

  • roughness (Arraylike, list,) – List or ArrayLike of roughness values.

  • tau (Arraylike or list, optional) – List of tau values to visualize as text mark in the plot.

  • hansen_point (A tuple (roughness_value, RMS_value) , optional) – The Hansen point to visualize in the plot. It can be determined automatically if highlight_point='auto'.

  • rms_target (float, optional) – The root-mean-squared target. If set and view_tline is False, the target value is used as the axis limit.

  • view_tline (bool, default=False) – Whether the target line should be displayed.

  • hpoint_kws (dict, optional) – Keyword arguments used to highlight the Hansen point in the figure.

  • ax (Matplotlib.pyplot.Axes, optional) – Axes to collect the figure. Could be used to support other axes.

  • fig (Matplotlib.pyplot.figure, optional) – Supply fig to save automatically the plot, otherwise, keep it to None.

  • savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

References

[1] Hansen, P. C., & O’Leary, D. P. (1993). The use of the L-Curve in the regularization of discrete ill-posed problems. SIAM Journal on Scientific Computing, 14(6), 1487–1503. https://doi.org/10.1137/0914086.

Examples

>>> from watex.utils.plotutils import plot_l_curve
>>> # Test the function with the provided data points and
>>> # highlighting point (50, 3.12)
>>> roughness_data = [0, 50, 100, 150, 200, 250, 300, 350]
>>> RMS_data = [3.16, 3.12, 3.1, 3.08, 3.06, 3.04, 3.02, 3]
>>> highlight_data = (50, 3.12)
>>> plot_l_curve(RMS_data, roughness_data, hansen_point=highlight_data)
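When no Hansen point is known in advance, a corner of the L-curve can be estimated from the data. The sketch below uses a simple heuristic, which is an assumption and not the curvature-based method of [1] nor necessarily what watex does: scale both axes to [0, 1] and pick the point closest to the origin. The function name `closest_to_origin` and the values are hypothetical.

```python
def closest_to_origin(roughness, rms):
    """Pick the (roughness, rms) pair nearest the origin after min-max scaling."""
    def scale(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    r_scaled, e_scaled = scale(roughness), scale(rms)
    best = min(range(len(rms)), key=lambda i: r_scaled[i] ** 2 + e_scaled[i] ** 2)
    return roughness[best], rms[best]

# Made-up L-shaped trade-off between model roughness and data misfit.
roughness = [10, 12, 15, 20, 40, 80, 160]
rms = [5.0, 3.0, 2.0, 1.5, 1.3, 1.2, 1.15]
corner = closest_to_origin(roughness, rms)
print(corner)
```

The returned tuple follows the `(roughness_value, RMS_value)` order expected by `hansen_point`.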
watex.utils.plot_learning_curves(models, X, y, *, cv=None, train_sizes=None, baseline_score=0.4, scoring=None, convergence_line=True, fig_size=(20, 6), sns_style=None, savefig=None, set_legend=True, subplot_kws=None, **kws)[source]#

Visualize the learning curves of multiple models horizontally.

Determines cross-validated training and test scores for different training set sizes.

Parameters:
  • models (list of estimators) – Estimator instances that implement the fit and predict methods and which will be cloned for each validation.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • cv (int, cross-validation generator or an iterable, default=None) –

    Determines the cross-validation splitting strategy.

    Possible inputs for cv are:

    • None, to use the default 5-fold cross validation,

    • int, to specify the number of folds in a (Stratified)KFold,

    • CV splitter,

    • An iterable yielding (train, test) splits as arrays of indices.

    For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

    Refer to the User Guide for the various cross-validation strategies that can be used here.

    cv default value if None changed from 3-fold to 4-fold.

  • train_sizes (array-like of shape (n_ticks,), default=np.linspace(0.1, 1, 50)) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.

  • baseline_score (float, default=.4) – Base score from which the score y-axis starts.

  • scoring (str or callable, default=None) – A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).

  • convergence_line (bool, default=True) – display the convergence line or not that indicate the level of bias between the training and validation curve.

  • fig_size (tuple (width, height), default =(20, 6)) – the matplotlib figure size given as a tuple of width and height

  • sns_style (str, optional,) – the seaborn style .

  • set_legend (bool, default=True) – display legend in each figure. Note the default location of the legend is ‘best’ from legend()

  • subplot_kws (dict, default is dict(left=0.0625, right = 0.95, wspace = 0.1)) – the subplot keywords arguments passed to matplotlib.subplots_adjust()

  • kws (dict,) – keyword arguments passed to sklearn.model_selection.learning_curve()
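The float-versus-integer rule for `train_sizes` described above can be sketched as follows. The helper `absolute_train_sizes` is hypothetical and only mirrors the documented behaviour; the actual conversion is done inside `sklearn.model_selection.learning_curve`, and the sample counts are made up.

```python
def absolute_train_sizes(train_sizes, n_max):
    """Translate train_sizes into absolute sample counts.

    Floats are fractions of n_max within (0, 1]; integers are taken as counts.
    """
    if all(isinstance(s, float) for s in train_sizes):
        if any(not 0 < s <= 1 for s in train_sizes):
            raise ValueError("fractions must lie within (0, 1]")
        return [max(1, int(s * n_max)) for s in train_sizes]
    if any(not 0 < s <= n_max for s in train_sizes):
        raise ValueError("absolute sizes must lie within (0, n_max]")
    return [int(s) for s in train_sizes]

# With 100 samples and 5 folds, each training split holds at most 80 samples.
n_samples, n_folds = 100, 5
n_max = n_samples - n_samples // n_folds
from_fractions = absolute_train_sizes([0.1, 0.5, 1.0], n_max)
from_counts = absolute_train_sizes([10, 50], n_max)
print(n_max, from_fractions, from_counts)
```

The maximum training size depends on the cross-validation scheme, which is why the same fractions give different absolute sizes when `cv` changes.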

Examples

  1. -> plot via a metaestimator already cross-validated.

>>> from watex.models.premodels import p
>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_learning_curves
>>> X, y = fetch_data ('bagoue prepared') # yields a sparse matrix
>>> # let's collect 4 estimators already cross-validated from SVMs
>>> models = [ p.SVM.linear , p.SVM.rbf , p.SVM.sigmoid , p.SVM.poly ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')
  2. -> plot with multiple models not cross-validated yet.

>>> from watex.exlib.sklearn import (LogisticRegression,
                                     RandomForestClassifier,
                                     SVC , KNeighborsClassifier
                                     )
>>> models =[LogisticRegression(), RandomForestClassifier(), SVC() ,
             KNeighborsClassifier() ]
>>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')
watex.utils.plot_logging(X, y=None, zname=None, tname=None, labels=None, impute_nan=True, normalize=False, log10=False, columns_to_skip=None, pattern=None, strategy='mean', posiy=None, fill_value=None, fig_size=(16, 7), fig_dpi=300, colors=None, cs4_colors=False, sns_style=False, savefig=None, draw_spines=False, seed=None, verbose=0, **kws)[source]#

Plot logging data

The plot expects a collection of logging data. Each log is a column of data collected in the field. Note that any kind of data can also be plotted provided it contains numerical values. The function does not accept categorical data; if categorical data are given, they are discarded.

Parameters:
  • X (Dataframe of shape (n_samples, n_features)) – where n_samples is the number of data, expected to be the data collected at different depths, and n_features is the number of columns (features) to plot. Note that X must include the depth column. If it is not given, a relative depth is created according to the number of samples in X.

  • y (array-like or series of shape (n_samples,), optional) – Target relative to X for classification or regression; If given, by default the target plot should be located at the last position. However with the argument of posiy , target plot can be toggled to the desired position.

  • zname (str, default='depth' or 'None') – The name of the depth column in X. If the depth column is not named ‘depth’, another column name that holds the depth can be given so that the function sets this column aside as the depth column for plotting. If set to None, zname defaults to ‘depth’ and assumes that ‘depth’ exists in the X columns.

  • tname (str, optional,) – Name of the target. It renames the target if y is given as a pandas series, or names it if y is given as an array-like. If not provided, the name of the target series is used when y is not None.

  • normalize (bool, default = False) – Normalize all the data to range between (0, 1), except the depth.

  • labels (list or str, optional) – If labels are given, they should match the number of columns. The given labels replace the old column names in X and appear in the plot. This is useful to change the column labels in the dataframe to new labels that best describe the plot, for instance by including the units in the new labels. Note that if the labels do not match the number of old columns in X, a warning is issued to the user and no operation is performed.

  • impute_nan (bool, default=True,) – Replace the NaN values in the dataframe. Note that the default behaviour for replacing NaN is the mean. However, if the fill_value argument is provided, the latter is used to replace ‘NaN’ in X.

  • log10 (bool, default=False) – Convert values to log10. This can be useful when working with logarithmic data. However, not all data support this operation, for instance negative data. In that case, the columns_to_skip argument is useful to skip those columns when converting values to log10.

  • columns_to_skip (list or str, optional,) –

    Columns to skip when performing some operations like ‘log10’. These columns will not be affected by the ‘log10’ operation. Note that columns_to_skip can also be given as a literal string. In that case, the pattern is needed to parse the columns into a list of strings.

  • pattern (str, default = '[#&*@!,;\s]\s*') –

    Regex pattern to parse columns_to_skip into a list of strings, where each item is a column name, when columns_to_skip is given as a literal text string. For instance:

    columns_to_skip='depth_top, thickness, sp, gamma_gamma'
    -> ['depth_top', 'thickness', 'sp', 'gamma_gamma']

    by using the default pattern. To have full control over how the columns are split, it is recommended to provide your own pattern; a wrong parsing can lead to an error.

  • strategy (str, default='mean') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

  • fill_value (str or numerical value, optional) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types. If not given and impute_nan is True, the mean strategy is used instead.

  • posiy (int, optional) – the position to place the target plot y . By default the target plot if given is located at the last position behind the logging plots.

  • colors (str, list of Matplotlib.colors map, optional) –

    The colors for plotting each columns of X except the depth. If not given, default colors are auto-generated.

    If colors is a string that includes ‘cs4’ or ‘xkcd’, Matplotlib.colors.CS4_COLORS or Matplotlib.colors.XKCD_COLORS is used instead. In addition, if ‘cs4’ or ‘xkcd’ is suffixed by a colon and an integer value, like cs4:4 or xkcd:4, the CS4 or XKCD colors are used starting from index 4.

    New in version 0.2.3: Matplotlib.colors.CS4_COLORS or Matplotlib.colors.XKCD_COLORS can be used by setting colors to 'cs4' or 'xkcd'. To reproduce the same CS4 or XKCD colors, set the seed parameter to a specific value.

  • draw_spines (bool, tuple (-lim, +lim), default= False,) – Only draw spine between the y-ticks. -lim and +lim are lower and upper bound i.e. a range to draw the spines in y-axis.

  • fig_size (tuple (width, height), default =(16, 7)) – the matplotlib figure size given as a tuple of width and height

  • fig_dpi (float or 'figure', default: rcParams["savefig.dpi"] (default: 'figure')) – The resolution in dots per inch. If ‘figure’, use the figure’s dpi value.

  • savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.

  • sns_style (str, optional,) – the seaborn style.

  • seed (int, optional) –

    Allow to reproduce the Matplotlib.colors.CS4_COLORS if colors is set to cs4.

    New in version 0.2.3.

  • verbose (int, default=0) – Output the number of categorical features dropped from the dataframe.

  • kws (dict,) – Additional keyword arguments passed to matplotlib.axes.plot()
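The way a literal `columns_to_skip` string is turned into a list of column names can be sketched with `re.split`. The pattern below is an assumption: it is the documented default with the whitespace escapes written out (`\s`), and the helper name `parse_columns` is hypothetical rather than part of watex.

```python
import re

# Assumed default: one separator character (or whitespace), then optional whitespace.
DEFAULT_PATTERN = r"[#&*@!,;\s]\s*"

def parse_columns(columns_to_skip, pattern=DEFAULT_PATTERN):
    """Return a list of column names from a list or a literal string."""
    if isinstance(columns_to_skip, str):
        # Drop empty strings left by leading or trailing separators.
        return [name for name in re.split(pattern, columns_to_skip) if name]
    return list(columns_to_skip)

parsed = parse_columns("depth_top, thickness, sp, gamma_gamma")
mixed = parse_columns("depth_top;thickness  sp#gamma_gamma")
print(parsed)
print(mixed)
```

Because whitespace is itself a separator in this pattern, a column name containing a space would be split in two; that is the kind of case where supplying a custom pattern is the safer choice.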

Examples

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_logging
>>> X0, y = load_hlogs (as_frame =True) # get the frames rather than object
>>> # plot the default logging with Normalize =True
>>> plot_logging (X0, normalize =True)
>>> # Include the target in the plot
>>> plot_logging ( X0,  y = y.kp , posiy = 0,
                  columns_to_skip=['thickness', 'sp'],
                  log10 =True,
                  )
>>> # draw spines and limit plot from (0, 700) m depth
>>> plot_logging (X0 , y= y.kp, draw_spines =(0, 700) )
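The four imputation strategies listed for `strategy` can be illustrated on a single column. This is a pure-Python sketch of the documented rules and not the imputer that watex calls; the function name `impute` and the sample column are hypothetical.

```python
import math
import statistics

def impute(values, strategy="mean", fill_value=None):
    """Replace NaN entries of one numeric column according to the strategy."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        # With several modes, keep the smallest one, as described above.
        fill = min(statistics.multimode(observed))
    elif strategy == "constant":
        fill = 0.0 if fill_value is None else fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if math.isnan(v) else v for v in values]

# Made-up log values with one missing reading.
column = [1.0, float("nan"), 1.0, 4.0, 10.0]
filled = {s: impute(column, s)[1] for s in ("mean", "median", "most_frequent", "constant")}
print(filled)
```

On a skewed column like this one, the strategies give clearly different fill values, which is worth keeping in mind before normalizing or taking log10 of the imputed data.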
watex.utils.plot_mlxtend_heatmap(df, columns=None, savefig=None, **kws)[source]#

Plot correlation matrix array as a heat map

Parameters:
  • df – pandas DataFrame

  • columns – list of features. If given, only the columns of the dataframe with those features are considered.

  • kws – additional keyword arguments passed to mlxtend.plotting.heatmap()

Returns:

mlxtend.plotting.heatmap() axes object

Example:

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_mlxtend_heatmap
>>> h=load_hlogs()
>>> features = ['gamma_gamma', 'sp',
            'natural_gamma', 'resistivity']
>>> plot_mlxtend_heatmap (h.frame , columns =features, cmap ='PuOr')
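The heat map displays a matrix of pairwise correlation coefficients. As a rough sketch of that underlying matrix, assuming Pearson correlation, the snippet below computes it in pure Python for a few made-up columns. The helper names are hypothetical, and the real function delegates the drawing to `mlxtend.plotting.heatmap()`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values standing in for three logging features.
data = {
    "resistivity": [1.0, 2.0, 3.0, 4.0],
    "gamma_gamma": [2.0, 4.0, 6.0, 8.0],
    "sp": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(v, 2) for v in row.values()])
```

The matrix is symmetric with ones on the diagonal, so only one triangle carries independent information.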
watex.utils.plot_mlxtend_matrix(df, columns=None, fig_size=(10, 8), alpha=0.5, savefig=None)[source]#

Visualize the pairwise correlation between the different features in the dataset in one place.

Parameters:
  • df – pandas DataFrame

  • columns – list of features. If given, only the columns of the dataframe with those features are considered.

  • fig_size – tuple of int (width, height). Size of the displayed figure.

  • alpha – figure transparency, default is .5.

Returns:

mlxtend.plotting.scatterplotmatrix() axes object

Example:

>>> from watex.datasets import load_hlogs
>>> from watex.utils.plotutils import plot_mlxtend_matrix
>>> import pandas as pd
>>> import numpy as np
>>> h=load_hlogs()
>>> features = ['gamma_gamma', 'natural_gamma', 'resistivity']
>>> data = pd.DataFrame ( np.log10 (h.frame[features]), columns =features )
>>> plot_mlxtend_matrix (data, columns =features)
watex.utils.plot_naive_dendrogram(X, *ybounds, fig_size=(12, 5), savefig=None, **kws)[source]#

Quick plot dendrogram using the ward clustering function from Scipy.

Parameters:
  • X – ndarray of shape (n_samples, n_features) Array of features

  • ybounds – int, integer values at which to draw the horizontal cluster lines that indicate the number of clusters.

  • fig_size – tuple (width, height), default =(12,5) the matplotlib figure size given as a tuple of width and height

  • kws – dict, Additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram()

Examples:
>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_naive_dendrogram
>>> X, _= fetch_data('Bagoue analysed') # data is already scaled
>>> # get the two features 'power' and  'magnitude'
>>> data = X[['power', 'magnitude']]
>>> plot_naive_dendrogram(data )
>>> # add the horizontal line of the cluster at ybounds = (20 , 20 )
>>> # for a single cluster (cluster 1)
>>> plot_naive_dendrogram(data , 20, 20 )
watex.utils.plot_pca_components(components, *, feature_names=None, cmap='viridis', savefig=None, **kws)[source]#

Visualize the coefficient of principal component analysis (PCA) as a heatmap

Parameters:
  • components – ndarray of shape (n_components, n_features) or PCA object. Array of the PCA components or object from watex.analysis.dimensionality.nPCA. If the object is given, it is not necessary to set feature_names.

  • feature_names – list or str, optional. List of the feature names to locate in the map. feature_names and the number of eigenvectors must be the same length. If a PCA object is passed as the components argument, there is no need to set feature_names; the names of the features are retrieved automatically.

  • cmap – str, default=’viridis’ the matplotlib color map for matshow visualization.

  • kws – dict, Additional keyword arguments passed to matplotlib.pyplot.matshow

Examples:

(1)-> with PCA object

>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_pca_components
>>> from watex.analysis import nPCA
>>> X, _= fetch_data('bagoue pca')
>>> pca = nPCA (X, n_components=2, return_X =False)# to return object
>>> plot_pca_components (pca)

(2)-> use the components and features individually

>>> components = pca.components_
>>> features = pca.feature_names_in_
>>> plot_pca_components (components, feature_names= features,
                         cmap='jet_r')
watex.utils.plot_profiling(erp, /, station=None, cz=None, *, style='classic', fig_size=(10, 4), cz_plot_kws=None, marker_kws=None, savefig=None, ax=None, fig=None, **plot_kws)[source]#

Visualizes the resistivity profiling of ERP data.

The function can overlay the selected conductive zone on the ERP if cz is given.

Parameters:
  • erp (array_like 1d) –

    The electrical resistivity profiling array. If dataframe is passed, resistivity column must be included.

    Changed in version 0.2.1: Can henceforth accept dataframe that contains resistivity values.

  • station (str, int, optional) –

    Station is used to visualize the conductive zone in the erp profile.

    This seems useful if cz is not given. When station=’auto’, it automatically detects the best conductive zone, assuming the very low resistivity in the profile, and plots the conductive zone. To have the expected results, station position or cz must be given or the .

    New in version 0.2.1: Can henceforth pass the station to plot the conductive zone.

  • cz (array_like, optional,) – The selected conductive zone. If None, cz should not be plotted.

  • style (str, default='classic') – Matplotlib plotting style.

  • fig_size (tuple, default= (10, 4)) – Matplotlib figure size.

  • marker_kws (dict, default = {'marker':'o', 'c':'#9EB3DD' }) – The dictionary to customize the marker in the plot.

  • cz_plot_kws (dict, default = {'ls':'-','c':'#0A4CEE', 'lw':2 }) – The dictionary to customize the conductive zone in the plot.

  • savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.

  • ax (Matplotlib.pyplot.Axes, optional) –

    Axes to collect the figure.

    New in version 0.2.8.

  • fig (Matplotlib.pyplot.figure, optional) – Supply fig to save automatically the plot, otherwise, keep it to None.

  • plot_kws (dict) – Additional keyword arguments passed to the matplotlib.pyplot.plot() function.

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

Examples

>>> from watex.datasets import make_erp
>>> from watex.utils.plotutils import plot_profiling
>>> d= make_erp (n_stations =56, seed = 42)
>>> plot_profiling  (d.resistivity)
>>> # read the frame and get the resistivity values
>>> plot_profiling (d.frame, station ='s07' )
<AxesSubplot:xlabel='Stations', ylabel='App.resistivity ($\Omega.m$)'>
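The idea behind `station='auto'`, picking the conductive zone around the lowest resistivity, can be sketched as below. This is a deliberately naive illustration and not the watex detection routine: the function name `auto_conductive_zone`, the `half_width` parameter, and the profile values are all hypothetical.

```python
def auto_conductive_zone(resistivity, half_width=2):
    """Return the index of the lowest resistivity and the indices of a window around it."""
    center = min(range(len(resistivity)), key=lambda i: resistivity[i])
    start = max(0, center - half_width)
    stop = min(len(resistivity), center + half_width + 1)
    return center, list(range(start, stop))

# Made-up apparent resistivity profile (ohm.m), one value per station.
profile = [320.0, 280.0, 190.0, 95.0, 60.0, 110.0, 240.0, 300.0]
center, zone = auto_conductive_zone(profile)
cz = [profile[i] for i in zone]
print(center, zone, cz)
```

The window is clipped at the ends of the profile, so a minimum located at the first or last station yields a shorter zone.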
watex.utils.plot_regularization_path(X, y, c_range=(-4.0, 6.0), fig_size=(8, 5), sns_style=False, savefig=None, **kws)[source]#

Plot the regularisation path from Logit / LogisticRegression

Vary the regularization strength and plot the weight coefficients of the different features for each regularization strength.

Note that, it is recommended to standardize the data first.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features. X is expected to be standardized.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • c_range (list or tuple [start, stop]) – Regularization strength range. It goes from strong (start) to weak (stop) regularization. Note that ‘C’ is the inverse of the Logistic Regression regularization parameter \(\lambda\).

  • fig_size (tuple (width, height), default =(8, 5)) – the matplotlib figure size given as a tuple of width and height

  • savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.

  • sns_style (str, optional,) – the seaborn style.

  • kws (dict,) – Additional keyword arguments passed to sklearn.linear_model.LogisticRegression

Examples

>>> from watex.utils.plotutils import plot_regularization_path
>>> from watex.datasets import fetch_data
>>> X, y = fetch_data ('bagoue analysed' ) # data already standardized
>>> plot_regularization_path (X, y )
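The relation between `c_range`, C and \(\lambda\) can be made concrete with a few lines. The sketch assumes that the bounds of `c_range` are base-10 exponents stepped by one, so that C runs from 10^start up to, but not including, 10^stop; this reading is an assumption and is not confirmed by the description above.

```python
import math

start, stop = -4.0, 6.0   # the default c_range

# Assumed interpretation: one model per integer exponent in [start, stop).
exponents = [start + i for i in range(int(stop - start))]
c_values = [10.0 ** e for e in exponents]

# C is the inverse of the regularization parameter lambda.
lambdas = [1.0 / c for c in c_values]

for c, lam in zip(c_values, lambdas):
    print(f"C={c:g}  lambda={lam:g}")
```

Small C (large \(\lambda\)) shrinks the weights toward zero, which is why the coefficient curves in a regularization path fan out as C grows.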
watex.utils.plot_rf_feature_importances(clf, X=None, y=None, fig_size=(8, 4), savefig=None, n_estimators=500, verbose=0, sns_style=None, **kws)[source]#

Plot features importance with RandomForest.

Parameters:
  • clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • n_estimators (int, default=500) – The number of trees in the forest.

  • fig_size (tuple (width, height), default =(8, 4)) – the matplotlib figure size given as a tuple of width and height

  • savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.

  • sns_style (str, optional,) – the seaborn style.

  • verbose (int, default=0) – print the feature labels with the rate of their importances.

  • kws (dict,) – Additional keyword arguments passed to sklearn.ensemble.RandomForestClassifier

Examples

>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> from watex.utils.plotutils import plot_rf_feature_importances
>>> X, y = fetch_data ('bagoue analysed' )
>>> plot_rf_feature_importances (
    RandomForestClassifier(), X=X, y=y , sns_style=True)
watex.utils.plot_roc_curves(clfs, /, X, y, names=Ellipsis, colors=Ellipsis, ncols=3, score=False, kind='inone', ax=None, fig_size=(7, 7), **roc_kws)[source]#

Quick plot of the Receiver Operating Characteristic (ROC) of fitted models

Parameters:
  • clfs (list,) – List of models for ROC evaluation. Models should be scikit-learn or XGBoost estimators.

  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.

  • y (ndarray or Series of length (n_samples, )) – An array or series of target or class values. Preferably, the array represent the test class labels data for error evaluation.

  • names (list,) – List of model names. If not given, a raw name of the model is passed instead.

  • kind (str, default='inone') –

    If ['individual'|'2'|'single'], plot each ROC model separately. Any other value groups the ROC curves into a single plot.

    Changed in version 0.2.5: Parameter all is deprecated and replaced by kind. It henceforth accepts arguments allinone|1|grouped or individual|2|single for plotting multiple ROC curves in one plot or each ROC curve separately, respectively.

  • colors (str, list) – Colors to specify each model plot.

  • ncols (int, default=3) – Number of plot to be placed inline before skipping to the next column. This is feasible if many is set to True.

  • score (bool,default=False) –

    Append the Area Under the curve score to the legend.

    New in version 0.2.4.

  • roc_kws (dict) – Keyword arguments passed to sklearn.metrics.roc_curve().

Returns:

ax

Return type:

Axes.Subplot.
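
The ROC curve and the area under the curve (AUC) reported when score=True can be illustrated in plain Python. The block below is a minimal sketch and not the watex or scikit-learn implementation; the helper names roc_points and auc are hypothetical, and the labels are assumed to be binary (0/1) with both classes present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores.

    Hypothetical helper: assumes binary labels 0/1 with both classes present.
    """
    # Rank the samples from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties share a point.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve computed with the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

A curve that hugs the top-left corner gives an area close to 1, whereas a random classifier gives an area close to 0.5.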

Examples

>>> from watex.utils.plotutils import plot_roc_curves
>>> from sklearn.datasets import make_moons
>>> from watex.exlib import ( train_test_split, KNeighborsClassifier, SVC ,
XGBClassifier, LogisticRegression )
>>> X, y = make_moons (n_samples=2000, noise=0.2)
>>> X, Xt, y, yt = train_test_split (X, y, test_size=0.2)
>>> clfs = [ m().fit(X, y) for m in ( KNeighborsClassifier, SVC ,
                                     XGBClassifier, LogisticRegression)]
>>> plot_roc_curves(clfs, Xt, yt)
<AxesSubplot:xlabel='False Positive Rate (FPR)', ylabel='True Positive Rate (FPR)'>
>>> plot_roc_curves(clfs, Xt, yt,kind='2', ncols = 4 , fig_size = (10, 4))
watex.utils.plot_sbs_feature_selection(sbs_estimator, /, X=None, y=None, fig_size=(8, 5), sns_style=False, savefig=None, verbose=0, **sbs_kws)[source]#

Plot Sequential Backward Selection (SBS) for feature selection.

SBS collects the scores of the best feature subset at each stage.

Parameters:
  • sbs_estimator (SequentialBackwardSelection estimator object) – The Sequential Backward Selection estimator, either fitted or not. If it is not fitted, provide the training X and y; otherwise an error occurs.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.

  • n_estimators (int, default=500) – The number of trees in the forest.

  • fig_size (tuple (width, height), default =(8, 5)) – the matplotlib figure size given as a tuple of width and height

  • savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.

  • sns_style (str, optional,) – the seaborn style.

  • verbose (int, default=0) – print the feature labels with the rate of their importances.

  • sbs_kws (dict,) – Additional keyword arguments passed to SequentialBackwardSelection

Examples

(1)-> Plot fitted SBS in action

>>> from watex.exlib.sklearn import KNeighborsClassifier , train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> from watex.utils.plotutils import plot_sbs_feature_selection
>>> X, y = fetch_data('bagoue analysed') # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs= SequentialBackwardSelection (knn)
>>> sbs.fit(Xtrain, ytrain )
>>> plot_sbs_feature_selection(sbs, sns_style= True)

(2)-> Plot estimator with no prefit SBS.

>>> plot_sbs_feature_selection(knn, Xtrain, ytrain) # yield the same result

watex.utils.plot_sfi(cz, p=None, s=None, dipolelength=None, fig_size=(10, 4), style='classic', **plotkws)[source]#

Plot sfi parameter components.

Parameters:
  • cz (array-like 1d,) – Selected conductive zone

  • p (array-like 1d,) – Station positions of the conductive zone.

  • dipolelength (float) – If p is not given, it will be set automatically using the default value to match the cz size. The default value is 10.

  • fig_size (tuple, default=(10, 4)) – Matplotlib (MPL) figure size; should be a tuple value of integers

See also

watex.utils.exmath.sfi

for more details about the sfi parameter computation.

Examples

>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import plot_sfi
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))*1e2
>>> plotkws  = dict (rlabel = 'Selected conductive zone (cz)',
                     color=f'{P().frcolortags.get("fr3")}',
                     )
>>> plot_sfi (condzone, **plotkws)
watex.utils.plot_silhouette(X, labels, metric='euclidean', savefig=None, **kwds)[source]#

Plot the silhouette to quantify the quality of the clustering.

Parameters:
  • X (array-like of shape (n_samples_a, n_samples_a) if metric == “precomputed”, or (n_samples_a, n_features) otherwise) – An array of pairwise distances between samples, or a feature array.

  • labels (array-like of shape (n_samples,)) – Label values for each sample.

  • metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.

  • savefig (str, default =None ,) – the path to save the figure. Argument is passed to matplotlib.Figure class.

  • **kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

See also

watex.view.mlplot.plotSilhouette

Gives a consistent plot through the prefit parameter, which checks whether labels are expected to be passed into the function directly or not.
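
The quantity shown in a silhouette plot is the silhouette coefficient of each sample, s = (b - a) / max(a, b), where a is the mean distance to the other samples of the same cluster and b is the mean distance to the samples of the nearest other cluster. The block below is a minimal sketch with the Euclidean metric; the helper name silhouette_samples is used here for illustration only and this is not the watex or scikit-learn code. It assumes at least two clusters and distinct points.

```python
import math


def silhouette_samples(points, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every sample.

    Illustrative helper: needs at least two clusters and distinct points.
    """
    # Group the sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Usual convention: a singleton cluster gets a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Scores close to 1 indicate well-separated clusters, while scores near 0 or negative suggest overlapping or misassigned samples.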

Examples

>>> import numpy as np
>>> from watex.exlib.sklearn import KMeans
>>> from watex.datasets import load_iris
>>> from watex.utils.plotutils import plot_silhouette
>>> d= load_iris ()
>>> X= d.data [:, 0][:, np.newaxis] # take the first axis
>>> km= KMeans (n_clusters =3 , init='k-means++', n_init =10 ,
                max_iter = 300 ,
                tol=1e-4,
                random_state =0
                )
>>> y_km = km.fit_predict(X)
>>> plot_silhouette (X, y_km)
watex.utils.plot_skew(edis_list, /, method='Bahr', sensitivity='skew', mode='frequency', show_skewness=Ellipsis, view='1d', interpolate=True, threshold_line=None, tolog10=True, interp_method='cubic', fill_value='auto', get_sites_by=None, top_label='Stations', cb_label='Sensitivity (S)', spacing=80, fig=None, fig_size=(7, 5), dpi=300, font_size=5.0, cmap='jet_r', plot_style=None, rotate_xlabel=0.0, plot_contours=Ellipsis, style=None, savefig=None, ax=None, **plot_kws)[source]#

Visualize the phase-sensitive skew in one or two dimensions.

Phase Sensitivity Skew (\(\eta\)) is a dimensionality tool that represents a measure of the skew of the phases of the impedance tensor. The parameter is thus unaffected by the distortion effect, unlike the Swift-skew and ellipticity dimensionality tools [1].

Values of \(\eta\) > 0.3 are considered to represent 3D data. Phase-sensitive skews less than 0.1 indicate 1D, 2D or distorted 2D (3D/2D) cases. Values of \(\eta\) between 0.1 and 0.3 indicate modified 3D/2D structures [2] according to Bahr’s method. With the Swift method (values \(\eta \geq 0.2\)), the smaller the value of \(\eta\) (close to \(0.\)), the closer the structure is to a 2D structure, and vice versa. However, it is generally considered that an electrical structure with \(\eta < 0.4\) can be treated as a 2D medium. Hence, the threshold_line for \(\eta\) using the Swift method should be set to 0.4.
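
The thresholds quoted above can be written as a small decision rule. This is only a sketch of the interpretation given in the text; the function names classify_bahr_skew and swift_is_2d are hypothetical and are not part of watex.

```python
def classify_bahr_skew(eta):
    """Interpret a Bahr phase-sensitive skew value with the quoted thresholds."""
    if eta > 0.3:
        return "3D"
    if eta >= 0.1:
        return "modified 3D/2D"
    return "1D, 2D or distorted 2D"


def swift_is_2d(eta, threshold=0.4):
    """With the Swift method, skews below the threshold are treated as 2D."""
    return eta < threshold


for value in (0.05, 0.2, 0.35):
    print(value, classify_bahr_skew(value), swift_is_2d(value))
```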

Parameters:
  • edis_list (str, watex.edi.Edi) –

    Full path to edifiles.

    Changed in version 0.3.1: The parameter edi_obj is replaced by edis_list, which indicates a collection of EDI files.

  • method (str, default='Bahr') – Kind of correction. Can be swift for the distortion removal proposed by Swift in 1967 [3]. Values close to 0. indicate 1D and 2D structures, and 3D otherwise. Conversely, use bahr for the distortion removal proposed by Bahr in 1991 [2]. The latter threshold is set to 0.3; above this value the structure is 3D.

  • sensitivity (str, default='skew') –

    Phase-sensitive visualization. Can also be the rotational invariant (invariant). Note that setting it to mu or invariant does not change the interpretation, since the distortions of Z are all rotationally invariant whether the Bahr or the Swift method is used.

    Changed in version 0.3.1: The parameter view is deprecated and replaced with sensitivity.

  • mode (str, optional) – X-axis coordinates for visualisation. plot either 'frequency' or 'periods'. The default is 'frequency'

  • show_skewness (bool,default=False) –

    Display the average skewness value.

    New in version 0.3.1: show_skewness display the average value of the whole Z tensor at each frequency.

  • view (str, ['1D', '2D'], default ='1D') – Type of skewness visualisation.

  • interpolate (bool, default=True) – Interpolate the data if NaN is found.

  • tolog10 (bool, default=True) –

    Compute the logarithm base 10 of the frequency array. If the frequency data is already passed as log10 values, it should be set to False.

  • interp_method (str, default='cubic') – Data interpolation method. It could be [‘nearest’|‘linear’|‘cubic’].

  • fill_value (float, str, default='auto') – Fill the interpolated grid at the edges or surrounding NaN with a fill value. The auto option uses the forward and backward fill strategy.

  • get_sites_by (str, optional) – Fetch the sites and place names on the map. It should be [‘dataid’|‘name’]. The former uses the names collected in the EDI data id, whereas the latter generates new names from the site ids and the survey name. In that case, it expects the survey name to be specified. By default, it merely uses the site ids.

  • threshold_line (float, optional) –

    Visualize the threshold line. Can be [‘bahr’, ‘swift’, ‘both’]:

    • Note that when method is set to swift, values close to \(0.\) indicate 1D and 2D structures, and 3D otherwise.

    • When method is set to Bahr, \(\mu > 0.3\) indicates 3D structures, values in \([0.1 - 0.3]\) indicate modified 3D/2D structures, whereas \(<0.1\) indicates 1D, 2D or distorted 2D.

  • top_label (str, default='Stations') – Label used to name the xticks in upper.

  • cb_label (str, default='Sensitivity (S)') – The colorbar label.

  • spacing (float, default=80.) – The step in meters between two stations/sites. If given, it creates an array of positions.

  • fig_size (tuple, default= (7, 5)) – Matplotlib figure size.

  • dpi (int, default=300) – Image resolution in dot-per-inch

  • cmap (str, default='jet_r') – Matplotlib colormap

  • plot_style (str, optional) – The kind of plot. It could be [‘pcolormesh’|’imshow’]. The default is pcolormesh.

  • rotate_xlabel (float, Optional) – The degree angle to rotate the station/site label accordingly.

  • prefix (str) – string value to add as prefix of given id. Prefix can be the site name. Default is S.

  • how (str) – Mode to index the station. Default is ‘Python indexing’ i.e. the counting of stations would starts by 0. Any other mode will start the counting by 1.

  • to_log10 (bool, default=False) – Recompute the ar in logarithm base 10 values. Note when True, the y should be also in log10.

  • plot_contours (bool, default=True) – Plot the contours map. Is available only if the plot_style is set to pcolormesh.

  • savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.

  • plot_kws (dict,) – Matplotlib Axes scatterplot additional keywords arguments.

  • ax (Matplotlib.pyplot.Axes, optional) – Axes to collect the figure. Could be used to support other axes.

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

See also

watex.methods.em.Processing.skew

Skew equation formulations.

watex.view.TPlot.plotSkew

Give a consistent plot where user can customize the plot using the plot parameter of watex.property.BasePlot class.

References

[1]

Bahr, K. (1988) Interpretation of the magnetotelluric impedance tensor: regional induction and local telluric distortion. J. Geophys., 62, 119–127.

[2] (1,2)

Bahr, K. (1991) Geological noise in magnetotelluric data: a classification of distortion types. Phys. Earth Planet. Inter., 66, 24–38.

[3]

Bahr, K., 1991. Geological noise in magnetotelluric data: a classification of distortion types. Physics of the Earth and Planetary Interiors 66 (1–2), 24–38.

Example

>>> import watex as wx
>>> from watex.utils.plotutils import plot_skew
>>> edi_sk = wx.fetch_data ("edis", return_data =True , samples = 20 )
>>> # Get 1d visualization with Swift skewness method
>>> plot_skew (edi_sk, threshold_line= True, method ='Swift',
               fig_size =( 12, 4))
>>> # plot the 2D with Bahr method with period in y-axis
>>> plot_skew (edi_sk, view='2d', show_skewness=True, interpolate=True,
                 get_sites_by='name', mode='periods', fig_size =(6, 2))
watex.utils.plot_sounding(ves, /, style='bmh', fig_size=(10, 4), cz_plot_kws=None, marker_kws=None, savefig=None, ax=None, fig=None, **plot_kws)[source]#

Visualize the vertical electrical sounding.

Function plots the sounding curve from AB/2 sounding points.

Parameters:
  • ves (array_like 1d) – The vertical electrical resistivity sounding array. If a dataframe is passed, the resistivity column must be included.

  • style (str, default='bmh') – Matplotlib plotting style.

  • fig_size (tuple, default= (10, 4)) – Matplotlib figure size.

  • marker_kws (dict, default = {'marker':'o', 'c':'#9EB3DD' }) – The dictionary to customize the markers in the plot.

  • cz_plot_kws (dict, default = {'ls':'-','c':'#0A4CEE', 'lw':2 }) – The dictionary to customize the conductive zone in the plot.

  • savefig (str, optional) – Save figure name. The default resolution dot-per-inch is 300.

  • ax (Matplotlib.pyplot.Axes, optional) – Axes to collect the figure.

  • fig (Matplotlib.pyplot.figure, optional) – Supply fig to save automatically the plot, otherwise, keep it to None.

  • plot_kws (dict,) – Additional keyword arguments passed to matplotlib.pyplot.plot() function

Returns:

ax – Return axis

Return type:

Matplotlib.pyplot.Axis

See also

watex.utils.exmath.plotOhmicArea

plot the Ohmic Area including the computed fracture zone.

Examples

>>> from watex.datasets import make_ves
>>> from watex.utils.plotutils import plot_sounding
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots ( 2, 1, figsize = (10, 10))
>>> d= make_ves (samples =56, seed = 42)
>>> plot_sounding  (d.resistivity, ax =ax [0], color ='k', marker ='D', )
>>> ax[0].set_title ("VES: samples=56, seed =42")
>>> # read the frame and get the resistivity values
>>> ax[1] = plot_sounding(make_ves (order ='+', max_rho =1e4, seed =65 ,
                                    as_frame=True,iorder =5),
                          ax= ax[1], ls=':', marker ='o', color ='blue')
>>> ax[1].set_title ("VES:samples=41, order='+', iorder=5,"
                     " max_rho=10000.$\Omega.m$, seed=65")
watex.utils.plot_stratalog(thick, layers, station, *, zoom=None, hatch=None, color=None, fig_size=(10, 4), **annot_kws)[source]#

Plot the stratigraphic log with an annotated figure.

Parameters:
  • thick (list,) – List of the layer thicknesses.

  • layers (list,) – List of the layer names.

  • hatch (list,) – List of the layer patterns.

  • color (list,) – List of the layer colors.

  • zoom (float, list) –

    If a float value is given, it is considered as a zoom ratio and should range between 0 and 1. For instance:

    • 0.25 –> 25% plot start from 0. to max depth * 0.25 m.

    Otherwise if values given are in the list, they should be composed of two items which are the top and bottom of the plot. For instance:

    • [10, 120] –> top =10m and bottom = 120 m.

    Note that if the length of the zoom list is greater than 2, the function plots the whole log and raises no error.

  • fig_size (tuple, default=(10, 4)) – Figure size.
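
The zoom rules described above can be sketched as a small helper that turns zoom into a (top, bottom) depth window. The name resolve_zoom is hypothetical and this is not the watex code; raising a ValueError for a ratio outside (0, 1] is an assumption, since the text only states the expected range.

```python
def resolve_zoom(zoom, thicknesses):
    """Return the (top, bottom) depth window implied by zoom.

    Hypothetical helper illustrating the documented rules only.
    """
    max_depth = float(sum(thicknesses))
    if zoom is None:
        return (0.0, max_depth)
    if isinstance(zoom, (int, float)):
        # A float is a ratio of the maximum depth, measured from the surface.
        if not 0 < zoom <= 1:
            raise ValueError("zoom ratio must be between 0 and 1")  # assumption
        return (0.0, max_depth * zoom)
    if len(zoom) == 2:
        # A two-item list gives the top and the bottom of the plot directly.
        top, bottom = zoom
        return (float(top), float(bottom))
    # Any other length falls back to the whole log without raising an error.
    return (0.0, max_depth)


thicknesses = [59.0, 150.0, 590.0, 200.0]
print(resolve_zoom(0.25, thicknesses))
print(resolve_zoom([10, 120], thicknesses))
print(resolve_zoom([10, 120, 300], thicknesses))
```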

Examples

>>> import watex.utils.geotools as GU
>>> layers= ['$(i)$', 'granite', '$(i)$', 'granite']
>>> thicknesses= [59.0, 150.0, 590.0, 200.0]
>>> hatch =['//.', '.--', '+++.', 'oo+.']
>>> color =[(0.5019607843137255, 0.0, 1.0), 'b', (0.8, 0.6, 1.), 'lime']
>>> GU.plot_stratalog (thicknesses, layers, hatch =hatch ,
                   color =color, station='S00')
>>> GU.plot_stratalog ( thicknesses,layers,hatch =hatch,
                        zoom =0.25, color =color, station='S00')
watex.utils.plot_strike(list_of_edis, /, kind=2, period_tolerance=0.05, text_pad=1.65, rot_z=0.0, **kws)[source]#

Plot the strike estimated from the invariants and the phase tensor in a rose diagram or an xy plot.

Parameters:
  • list_of_edis (list,) –

    full paths to .edi files to plot or list of EDI files.

    Changed in version 0.2.0: There is no need to provide a list of EDI files. Henceforth list_of_edis also accepts the path-like object of a single EDI file, then asserts the validity of the EDI files afterward.

  • kind (int, default=2) –

    Can be [ 1 | 2 ] where:

    • 1 to plot individual decades in one plot

    • 2 to plot all period ranges into one polar diagram for each strike angle estimation

    One could also try kind=1 to plot by decade.

  • fig_num (int, default=1,) – Figure number to be plotted. Default is 1.

  • font_size (float, default=10,) – Font size.

  • rot_z (float, default=0.,) – angle of rotation clockwise positive.

  • period_tolerance (float, default=.05) – Tolerance level to match periods from different edi files. Default is 0.05

  • text_pad (float, default=1.65) –

    Padding of the angle label at the bottom of each polar diagram. Default is 1.65.

  • plot_range (str, tuple) –

    The period range to estimate the strike angle. It can be [ ‘data’ | (period_min,period_max) ]. Options are:

    • ’data’ for estimating the strike for all periods in the data.

    • (pmin,pmax) for period min and period max, input as (log10(pmin),log10(pmax))

  • plot_tipper ([ True | False ]) –

    • True to plot the tipper strike

    • False to not plot tipper strike

  • pt_error_floor (int, optional) – Maximum error in degrees that is allowed to estimate strike. Default is None allowing all estimates to be used.

  • fold ([ True | False ]) –

    • True to plot only from 0 to 180

    • False to plot from 0 to 360

  • plot_orthogonal ([ True | False]) –

    • True to plot the orthogonal strike directions

    • False to not

  • color ([ True | False ]) –

    • True to plot shade colors

    • False to plot all in one color

  • color_inv (str,) – color of invariants plots

  • color_pt (str,) – color of phase tensor plots

  • color_tip (str) – color of tipper plots

  • ring_spacing (float, optional) – spacing of rings in polar plots

  • ring_limits (tuple of int,) – plot limits (min count, max count) set each plot have these limits

  • plot_orientation (str, [ 'h' | 'v' ]) – horizontal or vertical plots
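
Two of the options above lend themselves to a short illustration: plot_range expects the period bounds as log10 values, and fold restricts the strike angles to 0–180 degrees. The sketch below shows the arithmetic only; period_range_to_log10 and fold_strike are hypothetical helpers, and reducing the angles with a modulo is an assumption about how folding behaves, based on the 180-degree ambiguity of the strike direction.

```python
import math


def period_range_to_log10(pmin, pmax):
    """Convert a period range in seconds to the (log10(pmin), log10(pmax)) form."""
    return (math.log10(pmin), math.log10(pmax))


def fold_strike(angle_deg, fold=True):
    """Map a strike angle to [0, 180) when fold is True, else to [0, 360)."""
    return angle_deg % 180.0 if fold else angle_deg % 360.0


plot_range = period_range_to_log10(0.01, 100.0)
print(plot_range)
print(fold_strike(200.0), fold_strike(-30.0), fold_strike(200.0, fold=False))
```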

watex.utils.plot_tensors(z_or_edis_obj_list, /, station='S00', zplot=False, show_error_bars=False, **kwargs)[source]#

Plot resistivity and phase tensors or the real and imaginary impedance.

Plots the real and imaginary impedance and induction vector if present.

Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.

  • station (int or str, default='S00') – Station for which to visualize the resistivity, phase or impedance tensors. Default is the first station. Note that station counting starts from index 0.

  • zplot (bool, default=False,) – Visualize the impedance tensor values Z.

  • kwargs (dict) – Additional keyword arguments.

To get further details about the way to control the plot, refer to the following attributes.

Attributes and their descriptions:

  • color_mode – [ ‘color’ | ‘bw’ ] color or black and white plots

  • cted – color for data Z_XX and Z_XY mode

  • ctem – color for model Z_XX and Z_XY mode

  • ctmd – color for data Z_YX and Z_YY mode

  • ctmm – color for model Z_YX and Z_YY mode

  • data_fn – full path to data file

  • data_object – WSResponse instance

  • e_capsize – cap size of error bars in points (default is .5)

  • e_capthick – cap thickness of error bars in points (default is 1)

  • fig_dpi – resolution of figure in dots-per-inch (300)

  • fig_list – list of matplotlib.figure instances for plots

  • fig_size – size of figure in inches (default is [6, 6])

  • font_size – size of font for tick labels, axes labels are font_size+2 (default is 7)

  • legend_border_axes_pad – padding between legend box and axes

  • legend_border_pad – padding between border of legend and symbols

  • legend_handle_text_pad – padding between text labels and symbols of legend

  • legend_label_spacing – padding between labels

  • legend_loc – location of legend

  • legend_marker_scale – scale of symbols in legend

  • lw – line width data curves (default is .5)

  • ms – size of markers (default is 1.5)

  • lw_r – line width response curves (default is .5)

  • ms_r – size of markers response curves (default is 1.5)

  • mted – marker for data Z_XX and Z_XY mode

  • mtem – marker for model Z_XX and Z_XY mode

  • mtmd – marker for data Z_YX and Z_YY mode

  • mtmm – marker for model Z_YX and Z_YY mode

  • phase_limits – limits of phase

  • plot_component – [ 2 | 4 ] 2 for TE and TM or 4 for all components

  • plot_style – [ 1 | 2 ] 1 to plot each mode in a separate subplot and 2 to plot xx, xy and yx, yy in same plots

  • plot_type – [ ‘1’ | list of station name ] ‘1’ to plot all stations in data file or input a list of station names to plot if station_fn is input, otherwise input a list of integers associated with the index with in the data file, ie 2 for 2nd station

  • plot_z – [ True | False ] default is True to plot impedance, False for plotting resistivity and phase

  • plot_yn – [ ‘n’ | ‘y’ ] to plot on instantiation

  • res_limits – limits of resistivity in linear scale

  • resp_fn – full path to response file

  • resp_object – WSResponse object for resp_fn, or list of WSResponse objects if resp_fn is a list of response files

  • station_fn – full path to station file written by WSStation

  • subplot_bottom – space between axes and bottom of figure

  • subplot_hspace – space between subplots in vertical direction

  • subplot_left – space between axes and left of figure

  • subplot_right – space between axes and right of figure

  • subplot_top – space between axes and top of figure

  • subplot_wspace – space between subplots in horizontal direction

Examples

>>> import watex as wx
>>> edi_data = wx.fetch_data ('edis', samples= 17 , return_data =True )
>>> wx.utils.plotutils.plot_tensors ( edi_data, station =4 )
watex.utils.plot_tensors2(z_or_edis_obj_list, /, station='S00', plot_z=False, show_error_bars=True, **kwargs)[source]#

Plot resistivity and phase tensors or the real and imaginary impedance.

This function plots the apparent resistivity and phase or the real and imaginary parts of impedance tensors for a given station from a list of Z or EDI objects. It supports extensive customization for the plots including the option to show or hide error bars, control over color schemes, marker styles, and much more.

Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI- or Impedance tensor objects. The list can contain objects directly representing impedance tensors or EDI objects from which impedance tensors can be extracted.

  • station (int or str, default 'S00') – The station to visualize. Can be specified as an index (int) or as a string including the station name or number. For example, ‘S00’ or 0 for the first station. The counting starts from 0.

  • plot_z (bool, default False) – If True, visualize the real and imaginary parts of the impedance tensors (Z). If False, visualize the apparent resistivity and phase tensors.

  • show_error_bars (bool, default True) – Whether to show error bars in the plots. If False, error bars are omitted for a cleaner visualization.

  • **kwargs (dict) – Additional keyword arguments for plot customization. These can include matplotlib parameters for markers, lines, colors, and other plot attributes.

Returns:

The Z object for the specified station, containing the impedance tensor data and any computed properties like resistivity and phase.

Return type:

object

Examples

Plotting the apparent resistivity and phase for the fourth station from a list of EDI objects:

>>> import watex as wx
>>> edi_objects = wx.fetch_data('edis', samples=17, return_data=True)
>>> wx.utils.plot_tensors2(edi_objects, station=3)

Plotting the real and imaginary parts of the impedance tensor for the first station, without error bars:

>>> wx.utils.plot_tensors2(edi_objects, station='S00', plot_z=True,
                           show_error_bars=False)

Notes

This function is a part of the watex visualization utilities and requires a matplotlib environment to display the plots. Ensure that your environment supports graphical output or adjust your environment accordingly.

See also

watex.methods.EM

Class for electromagnetic method processing.

watex.utils.plotutils.plot_errorbar

Helper function to plot error bars.

watex.utils.plot_voronoi(X, y, *, cluster_centers, ax=None, show_vertices=False, line_colors='k', line_width=1.0, line_alpha=1.0, fig_size=(7, 7), cmap='Set1', show_grid=True, alpha=0.2, fig_title='')[source]#

Plots the Voronoi diagram of the k-means clusters overlaid with the data

Parameters:
  • X (NDarray, Arraylike 1d) – Training data X. Must have the same length as y.

  • y (NDarray, Arraylike 1d) – Training data y. Must have the same length as X.

  • cluster_centers (array-like,) – Cluster centers. They can be obtained with the KMeans algorithm.

  • show_vertices (bool, optional) – Add the Voronoi vertices to the plot.

  • line_colors (string, optional) – Specifies the line color for polygon boundaries

  • line_width (float, optional) – Specifies the line width for polygon boundaries

  • line_alpha (float, optional) – Specifies the line alpha for polygon boundaries

  • point_size (float, optional) – Specifies the size of points

  • ax (Matplotlib.Axes) – Matplotlib axes. If None, an axis is created instead.

  • fig_size (tuple, default = (7, 7)) – Size of the figures.

Returns:

ax – Axes to support the figure

Return type:

Matplotlib.Axes
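
A Voronoi cell of a k-means solution contains every point that is closer to its cluster center than to any other center. The sketch below illustrates that membership rule with the Euclidean distance; nearest_center is a hypothetical helper and the coordinates are made-up sample values, not watex data.

```python
import math


def nearest_center(point, centers):
    """Return the index of the cluster center closest to point."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


# Made-up centers and points for illustration.
centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(1.0, 1.0), (3.5, -0.5), (1.9, 0.0)]

# Each point belongs to the Voronoi cell of its nearest center.
cells = [nearest_center(p, centers) for p in points]
print(cells)
```

The boundary between the two cells is the set of points equidistant from both centers, here the vertical line x = 2.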

Examples

>>> from sklearn.datasets import make_moons
>>> from sklearn.cluster import KMeans
>>> from watex.utils.plotutils import plot_voronoi
>>> X, y = make_moons(n_samples=2000, noise=0.2)
>>> km = KMeans (n_init ='auto').fit(X, y )
>>> plot_voronoi ( X, y , cluster_centers = km.cluster_centers_)
watex.utils.plot_yb_confusion_matrix(clf, Xt, yt, labels=None, encoder=None, savefig=None, fig_size=(6, 6), **kws)[source]#

Confusion matrix plot using the ‘yellowbrick’ package.

Creates a heatmap visualization of the sklearn.metrics.confusion_matrix(). A confusion matrix shows each combination of the true and predicted classes for a test data set.

The default color map uses a yellow/orange/red color scale. The user can choose between displaying values as the percent of true (cell value divided by sum of row) or as direct counts. If percent of true mode is selected, 100% accurate predictions are highlighted in green.

Requires a classification model.

Make sure ‘yellowbrick’ is installed before using the function; otherwise an ImportError is raised.

Parameters:
  • clf (classifier estimator) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. If the internal model is not fitted, it is fit when the visualizer is fitted, unless otherwise specified by is_fitted.

  • Xt (ndarray or DataFrame of shape n x m) – A matrix of n instances with m features. Preferably, matrix represents the test data for error evaluation.

  • yt (ndarray or Series of length n) – An array or series of target or class values. Preferably, the array represent the test class labels data for error evaluation.

  • ax (matplotlib Axes, default: None) – The axes to plot the figure on. If not specified the current axes will be used (or generated if required).

  • sample_weight (array-like of shape = [n_samples], optional) – Passed to confusion_matrix to weight the samples.

  • encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.

  • labels (list of str, default: None) – The class labels to use for the legend ordered by the index of the sorted classes discovered in the fit() method. Specifying classes in this manner is used to change the class names to a more specific format or to label encoded integer classes. Some visualizers may also use this field to filter the visualization for specific classes. For more advanced usage specify an encoder rather than class labels.

  • fig_size (tuple (width, height), default =(6, 6)) – the matplotlib figure size given as a tuple of width and height

  • savefig (str, default =None ,) – the path to save the figures. Argument is passed to matplotlib.Figure class.

Returns:

cmo – return a yellowbrick confusion matrix object instance.

Return type:

yellowbrick.classifier.confusion_matrix.ConfusionMatrix
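
The “percent of true” display mentioned above divides each cell by the sum of its row. The block below is a minimal sketch of that normalization in plain Python; the helpers confusion_matrix and percent_of_true are written here for illustration and are not the yellowbrick or scikit-learn functions, and the labels are made-up sample values.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count the samples for each (true class, predicted class) pair."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def percent_of_true(matrix):
    """Divide every cell by the sum of its row and express it in percent."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([100.0 * v / total if total else 0.0 for v in row])
    return out


# Made-up test labels and predictions.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, classes=[0, 1])
pct = percent_of_true(cm)
print(cm)
print(pct)
```

Rows correspond to the true classes, so each row of the percent matrix sums to 100.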

Examples

>>> # Import the required modules and fetch an extreme gradient boosting model,
>>> # for instance, then plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> plt.style.use ('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models import pModels
>>> from watex.utils.plotutils import plot_yb_confusion_matrix
>>> # split the  data . Note that fetch_data output X and y
>>> X, Xt, y, yt  = train_test_split (* fetch_data ('bagoue analysed'),
                                      test_size =.25  )
>>> # train the model with the best estimator
>>> pmo = pModels (model ='xgboost' )
>>> pmo.fit(X, y )
>>> print(pmo.estimator_ ) # pmo.XGB.best_estimator_
>>> #%%
>>> # Predict the score using, under the hood, the best estimator
>>> # of the XGBoost classifier
>>> ypred = pmo.predict(Xt)
>>> # now plot the score
>>> plot_yb_confusion_matrix (pmo.XGB.best_estimator_, Xt, yt  )
watex.utils.power(p)[source]#

Compute the power of the selected conductive zone. The anomaly power is closely related to the width of the conductive zone.

The power parameter implicitly defines the width of the conductive zone and is evaluated from the difference between the abscissa \(X_{LB}\) and the end \(X_{UB}\) points of the selected anomaly:

\[power=|X_{LB} - X_{UB} |\]
Parameters:

p (array-like) – Station positions of the conductive zone.

Returns:

Absolute value of the width of conductive zone in meters.
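
The formula above reduces to one subtraction. The sketch below assumes that the lower and upper bounds \(X_{LB}\) and \(X_{UB}\) are the first and last station positions of the conductive zone; anomaly_power is a hypothetical name and the positions are made-up sample values.

```python
def anomaly_power(positions):
    """Width of the conductive zone: |X_LB - X_UB|.

    Assumption: X_LB and X_UB are the first and last station positions.
    """
    return abs(positions[0] - positions[-1])


# Made-up station positions in meters.
positions = [40.0, 50.0, 60.0, 70.0]
width = anomaly_power(positions)
print(width)
```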

watex.utils.predict_NGA_labels(X, /, n_clusters, random_state=0, keep_label_0=False, n_init='auto', return_cluster_centers=False, **kws)[source]#

Predict the Naive Group of Aquifer (NGA) labels.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.

  • n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.

  • random_state (int, RandomState instance or None, default=0) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • keep_label_0 (bool, default=False) – The prediction already includes the label 0. However, keeping 0 in the predicted labels would refer to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient, so a predicted ‘0’ in the target should mean a missing ‘k’-value rather than a concrete label. To avoid any confusion, ‘0’ is changed to ‘1’: the value +1 is added to all class labels, thereby excluding the ‘0’ label. To force 0 to be included in the labels, set keep_label_0 to True.

  • n_init ('auto' or int, default='auto') –

    Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).

    When n_init='auto', the number of runs will be 10 if using init='random', and 1 if using init='k-means++'.

    New in version 0.2.0: Added ‘auto’ option for n_init.

  • return_cluster_centers (bool, default=False) – Export the array of cluster centers if True.

  • kws (dict) – Additional keyword arguments passed to sklearn.cluster.KMeans.

Returns:

  • NGA (array_like of shape (n_samples, n_features)) – Predicted NGA labels.

  • (NGA, cluster_centers) (tuple of array-like) – NGA labels and cluster centers if return_cluster_centers is set to True.
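
The label shift described for keep_label_0 can be sketched as follows. This is only an illustration of the +1 offset applied to raw cluster labels; the helper name `shift_labels` and the sample labels are hypothetical, and the clustering step itself is left out:

```python
import numpy as np

def shift_labels(raw_labels, keep_label_0=False):
    """Move every class label forward by one unless label 0 must be kept."""
    raw_labels = np.asarray(raw_labels)
    return raw_labels if keep_label_0 else raw_labels + 1

# Hypothetical raw cluster labels, as a k-means run with 3 clusters could return.
raw = np.array([0, 2, 1, 0, 2])
nga = shift_labels(raw)                      # labels start at 1
nga_with_0 = shift_labels(raw, keep_label_0=True)
print(nga)  # [1 3 2 1 3]
```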

watex.utils.projection_validator(X, Xt=None, columns=None)[source]#

Retrieve x, y coordinates of a dataframe ( X, Xt ) from column names or indexes.

If X or Xt are given as arrays, columns may hold integers for selecting the coordinates ‘x’ and ‘y’.

Parameters:
  • X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.

  • Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

  • columns (list of str or index, optional) – columns is useful when a dataframe is given with a dimension size greater than 2. If such data is passed to X or Xt, columns must hold the names to consider as ‘easting’, ‘northing’ when UTM coordinates are given or ‘latitude’, ‘longitude’ when latlon are given. If the dimension size is greater than 2 and columns is None, an error is raised to prompt the user to provide the indexes for ‘y’ and ‘x’ coordinate retrieval.

Returns:

arrays and coordinate labels

Return type:

( x, y, xt, yt ), (xname, yname, xtname, ytname), Tuple of coordinate

watex.utils.qc(z_or_edis_obj_list, /, tol=0.5, *, interpolate_freq=False, return_freq=False, tensor='res', return_data=False, to_log10=False, return_qco=False)[source]#

Check the quality control in the collection of Z or EDI objects.

Analyse the data in the EDI collection and return the quality control value. It indicates the proportion of the data that is representative.

Parameters:
  • z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.

  • tol (float, default=.5) – The tolerance parameter. It sets the rate from which the data can be considered meaningful. Preferably it should be greater than 0 and less than 1. The default .5 means 50 %. The analysis becomes softer with higher tol values and more severe with lower ones.

  • interpolate_freq (bool, default=False) – Interpolate the valid frequencies after removing the frequencies whose data fall under the 1-tol goodness threshold.

  • return_freq (bool, default=False) – Return the interpolated frequency.

  • return_data (bool, default=False) – Return the valid data, i.e. the data that meet the 1-tol goodness threshold.

  • tensor (str, default='res') – Tensor name. Can be [resistivity|phase|z|frequency]. Impedance is used for data quality assessment.

  • to_log10 (bool, default=False) – Convert the frequency values to log10.

  • return_qco (bool, default=False) –

    Return the quality control object that wraps all useful information after control. The following attributes can be fetched:

    • rate_: The rate of the quality of the data.

    • component_: The component selected for analysis. By default, either xy or yx is used.

    • mode_: The EM mode. Either the ['TE'|'TM'] modes.

    • freqs_: The valid frequencies in the data, selected according to the tol parameter. Note that if interpolate_freq is True, the interpolated frequencies are used instead.

    • invalid_freqs_: Unusable frequencies dropped from the data during control.

    • data_: Valid tensor data either in TE or TM mode.

Returns:

  • The quality control value and the interpolated frequency if return_freq is set to True; otherwise only the quality control ratio.

  • The quality control object if return_qco is set to True.

Return type:

Tuple (float ) or (float, array-like, shape (N, )) or QCo

Examples

>>> import watex as wx
>>> data = wx.fetch_data ('huayuan', samples =20, return_data =True ,
...                       key='raw')
>>> r, = wx.qc (data)
>>> r
0.75
>>> r, = wx.qc (data, tol=.2 )
>>> r
0.75
>>> r, = wx.qc (data, tol=.1 )
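
To make the role of tol more concrete, here is a rough sketch of one possible reading of the description above. It is not the watex implementation: it assumes the data are a plain array with NaN for missing values, that a frequency is valid when its share of non-missing data reaches 1-tol, and that the ratio is the share of valid frequencies. The name `quality_ratio` is hypothetical.

```python
import numpy as np

def quality_ratio(data, tol=0.5):
    """data: (n_freqs, n_stations) array, NaN marks a missing value."""
    data = np.asarray(data, dtype=float)
    goodness = 1.0 - np.isnan(data).mean(axis=1)  # valid share per frequency
    valid = goodness >= 1.0 - tol                 # assumed validity rule
    return float(valid.mean()), valid

nan = np.nan
tensor = np.array([[1.0, 2.0, 3.0, 4.0],    # complete row
                   [1.0, 2.0, 3.0, nan],    # 75 % valid
                   [1.0, 2.0, nan, nan],    # 50 % valid
                   [1.0, nan, nan, nan]])   # 25 % valid
soft_ratio, soft_mask = quality_ratio(tensor, tol=0.5)
strict_ratio, strict_mask = quality_ratio(tensor, tol=0.2)
print(soft_ratio, strict_ratio)  # 0.75 0.25
```

Under these assumptions a higher tol keeps more frequencies, which matches the statement that the analysis becomes softer as tol grows.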
watex.utils.random_sampling(d, /, samples=None, replace=False, random_state=None, shuffle=True)[source]#

Sampling data.

Parameters:
  • d ({array-like, sparse matrix} of shape (n_samples, n_features)) – Data for sampling, where n_samples is the number of samples and n_features is the number of features.

  • samples (int,optional) – Ratio or number of items from axis to return. Default = 1 if samples is None.

  • replace (bool, default=False) – Allow or disallow sampling of the same row more than once.

  • random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

  • shuffle (bool, default=True) – Shuffle the data before sampling

Returns:

d ({array-like, sparse matrix} of shape (n_samples, n_features)) – Sampled data based on the given samples.

Examples

>>> from watex.utils.funcutils import random_sampling
>>> from watex.datasets import load_hlogs
>>> data= load_hlogs().frame
>>> random_sampling( data, samples = 7 ).shape
(7, 27)
watex.utils.random_selector(arr, /, value, seed=None, shuffle=False)[source]#

Randomly select a number of values from an array.

Parameters:
  • arr (ArrayLike) – Array of values

  • value (float, arraylike) – If a float is passed, it indicates the number of values to select from arr. If an array is passed, its values should all be contained in arr. If a string containing % is given, it is read as the ratio of values to randomly select.

  • seed (int, Optional) – Allows retrieving identical randomly selected values from the given array.

  • shuffle (bool, default=False) – If True, shuffle the selected values.

Returns:

arr – Array containing the selected values.

Return type:

ArrayLike

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import random_selector
>>> dat= np.arange (42 )
>>> random_selector (dat , 7, seed = 42 )
array([0, 1, 2, 3, 4, 5, 6])
>>> random_selector ( dat, ( 23, 13 , 7))
array([ 7, 13, 23])
>>> random_selector ( dat , "7%", seed =42 )
array([0, 1])
>>> random_selector ( dat , "70%", seed =42 , shuffle =True )
array([ 0,  5, 20, 25, 13,  7, 22, 10, 12, 27, 23, 21, 16,  3,  1, 17,  8,
        6,  4,  2, 19, 11, 18, 24, 14, 15,  9, 28, 26])
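
The percentage form of value can be pictured with the sketch below. It assumes the count is the floor of ratio times the array length, which agrees with the sizes shown above (2 items for "7%" and 29 items for "70%" of 42). The helper names are hypothetical, and the random draw uses NumPy's generator rather than whatever watex uses internally, so the selected values will differ from the outputs above.

```python
import numpy as np

def n_from_ratio(value, n_total):
    """Turn a string such as '70%' into a number of items (floored)."""
    ratio = float(str(value).strip().rstrip("%")) / 100.0
    return int(ratio * n_total)

def select_ratio(arr, value, seed=None):
    """Draw the requested share of arr without replacement."""
    arr = np.asarray(arr)
    rng = np.random.default_rng(seed)
    return rng.choice(arr, size=n_from_ratio(value, arr.size), replace=False)

dat = np.arange(42)
picked = select_ratio(dat, "70%", seed=42)
print(n_from_ratio("7%", dat.size), picked.size)  # 2 29
```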
watex.utils.read_data(f, sanitize=Ellipsis, reset_index=Ellipsis, comments='#', delimiter=None, columns=None, npz_objkey=None, verbose=Ellipsis, **read_kws)[source]#

Assert and read the specific files and URLs allowed by the package.

Readable files are systematically converted to a data frame.

Parameters:
  • f (str, Path-like object) – File path or Pathlib object. Must contain a valid file name and should be a readable file or URL.

  • sanitize (bool, default=False) –

    Apply a minimum sanitization to the data, such as:
    • replace non-alphabetic characters in column names with the pattern ‘_’

    • cast data values to numeric if applicable

    • drop full NaN columns and rows in the data

  • reset_index (bool, default=False,) –

    Reset index if full NaN columns are dropped after sanitization.

    New in version 0.2.5: Apply minimum data sanitization after reading data.

  • comments (str or sequence of str or None, default='#') – The characters or list of characters used to indicate the start of a comment. None implies no comments. For backwards compatibility, byte strings will be decoded as ‘latin1’.

  • delimiter (str, optional) – The character used to separate the values. For backwards compatibility, byte strings will be decoded as ‘latin1’. The default is whitespace.

  • npz_objkey (str, optional) –

    Dataset key to identify an array in multi-array storage in ‘.npz’ format. If no key is set when saving to ‘npz’, arr_0 should be used.

    New in version 0.2.7: Can read text and numpy formats (‘.npy’ and ‘.npz’). Note that when data is stored in the compressed ‘.npz’ format, the ‘.npz’ object key should be provided as the argument of parameter npz_objkey. If None, only the first array is read, i.e. npz_objkey='arr_0'.

  • verbose (bool, default=0) – Output messages to guide the user.

  • read_kws (dict) – Additional keyword arguments passed to the pandas file readers.

Returns:

f – A dataframe with head contents by default.

Return type:

pandas.DataFrame

See also

np.loadtxt

load text file.

np.load

Load uncompressed or compressed numpy .npy and .npz formats.

watex.utils.baseutils.save_or_load

Save or load numpy arrays.
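
The arr_0 default key mentioned for npz_objkey comes from NumPy itself: arrays saved positionally with np.savez are stored under arr_0, arr_1, and so on. The short sketch below shows this with plain NumPy and an in-memory buffer; it does not call read_data.

```python
import io
import numpy as np

# Save two arrays without naming them; NumPy assigns the keys itself.
buffer = io.BytesIO()
np.savez(buffer, np.arange(3), np.ones((2, 2)))
buffer.seek(0)

with np.load(buffer) as archive:
    keys = list(archive.files)   # keys assigned by np.savez
    first = archive["arr_0"]     # the array a default key would point to

print(keys)   # ['arr_0', 'arr_1']
print(first)  # [0 1 2]
```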

watex.utils.reduce_samples(*data, sname, zname=None, kname=None, section_indexes=None, error='raise', strategy='average', verify_integrity=False, ignore_index=False, **kws)[source]#

Create a new dataframe by squeezing/compressing the invalid data.

The m-samples reduction is necessary for datasets with many missing k-values. Shrinking the number of k0-values (missing k-values) is a relevant approach: it consists of compressing the samples with missing \(k\)-values, from the top (depth equal to 0) to the upper section of the first (shallowest) aquifer, into a single vector \(x_r\) of dimension \((1 \times n)\), i.e. one that contains the n features.

Parameters:
  • data (list of dataframes) – Data that mainly contains the aquifer values. It must contain the depth values, referred to by the column name passed to zname, and the permeability coefficient k, passed to kname. Both arguments need to be supplied when dataframes are passed as positional arguments.

  • sname (str) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.

  • zname (str, int) – Name of the depth column. zname allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer value must not exceed the dataframe size along axis 1. Commonly zname needs to be supplied when a dataframe is passed as a function argument.

  • kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ in a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.

  • z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.

  • strategy (str, default='average' or 'mean') – Strategy used to select or compute a single series from the numerical data. It can also be ‘naive’; in that case, a single series is randomly picked from the base strata data.

  • section_indexes (tuple or list of int) – A pair (tuple or list) of integers giving the indexes of the valid (upper and lower) sections of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is considered the last position, i.e. the bottom of the section; therefore, its value is included in the data.

  • error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.

  • verify_integrity (bool, default=False) –

    Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method. If True, remove the duplicate rows from a DataFrame.

    • subset: By default, if the rows have the same values in all the columns, they are considered duplicates. This parameter is used to specify the columns that only need to be considered for identifying duplicates.

    • keep: Determines which duplicates (if any) to keep. It takes as inputs: first, drop duplicates except for the first occurrence (this is the default behavior); last, drop duplicates except for the last occurrence; False, drop all duplicates.

    • inplace: Specifies whether to return a new DataFrame or update an existing one. It is a boolean flag with default False.

  • ignore_index (bool, default=False) – A boolean flag indicating whether the row index should be reset after dropping duplicate rows. False: keep the original row index. True: reset the index, so the resulting rows are labeled 0, 1, …, n - 1.

Returns:

df_new – New dataframes with reduced samples.

Return type:

List of pandas.DataFrame

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import reduce_samples
>>> data = load_hlogs ().frame # get the frames
>>> # add explicitly the aquifer section indices
>>> dfnew= reduce_samples (data.copy(), sname='strata_name',
                             section_indexes = (16, 29 ),)
>>> dfnew[0]
...    hole_number               strata_name     rock_name  ...      r     rp  remark
    0         H502                  mudstone           J2z  ...    NaN    NaN     NaN
    16        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    17        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    18        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    19        H502    fine-grained sandstone           NaN  ...  35.74  59.23     NaN
    20        H502                  mudstone           NaN  ...  35.74  59.23     NaN
    21        H502                 siltstone           NaN  ...  35.74  59.23     NaN
    22        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    23        H502                 siltstone           NaN  ...  59.61  59.23     NaN
    24        H502    fine-grained sandstone           NaN  ...  59.61  59.23     NaN
    25        H502  Coarse-grained sandstone           NaN  ...  59.61  59.23     NaN
    26        H502                  mudstone           NaN  ...  82.33  59.23     NaN
    27        H502    fine-grained sandstone           NaN  ...  82.33  59.23     NaN
    28        H502  Coarse-grained sandstone           J2z  ...  82.33  59.23     NaN
    29        H502                      coal  (J2y)  2coal  ...  82.33  59.23     NaN
    0         H502                 siltstone           NaN  ...    NaN    NaN     NaN

[16 rows x 23 columns]
>>> # specify the column name and kname without section indexes
>>> dfnew= reduce_samples (data.copy(), sname='strata_name', zname='depth',
...                          kname='k', ignore_index= True )[0]
>>> dfnew.index # index is reset
RangeIndex(start=0, stop=16, step=1)

watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1, interpolate=False, kind='linear')[source]#

Efficient strategy to remove outliers in the data.

An outlier is a data point of a given sample, observation, or distribution that lies outside the overall pattern. A commonly used rule considers a data point an outlier if it lies more than 1.5 IQR below the first quartile or above the third.

Two approaches are used to remove the outliers.

  • Inter Quartile Range (IQR): IQR is the most commonly used approach in the research field. Low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to calculate the median and the quartiles Q1 and Q3, from which the IQR follows.

    \[ \begin{align}\begin{aligned}Q1 = 1/4 (n + 1)\\Q3 = 3/4 (n + 1)\\IQR = Q3 - Q1\end{aligned}\end{align} \]

    Outliers fall above or below the normal range of the dataset, which is delimited by the upper and lower bounds (a 1.5*IQR margin is considered):

    \[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]

    In the above formula, the IQR is scaled up by 0.5 (\(new\_IQR = IQR + 0.5*IQR\)) so that all the data within about 2.7 standard deviations of the mean of a Gaussian distribution are considered.

  • Z-score: Also called a standard score. This score tells how far a data point is from the mean, in units of standard deviation. After setting a threshold value, the z-scores of the data points can be used to define the outliers.

    \[Zscore = (\text{data\_point} - \text{mean}) / \text{std. deviation}\]

To define an outlier, a threshold value is chosen, generally 3.0, since 99.7% of the data points lie within +/- 3 standard deviations (under a Gaussian distribution).
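
Both rules can be sketched in a few lines of NumPy. This is an illustration of the two criteria only, not the code behind remove_outliers; the function names and the sample values are made up, and quartiles are taken from np.percentile rather than the positional formula above.

```python
import numpy as np

def iqr_mask(a, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(a, [25, 75])
    iqr = q3 - q1
    return (a < q1 - k * iqr) | (a > q3 + k * iqr)

def zscore_mask(a, threshold=3.0):
    """True where |(value - mean) / std| exceeds the threshold."""
    z = (a - a.mean()) / a.std()
    return np.abs(z) > threshold

# Nineteen ordinary readings around 10 and one obvious outlier (50).
values = np.concatenate([np.tile([9.0, 10.0, 11.0], 6), [10.0, 50.0]])
iqr_out = iqr_mask(values)
z_out = zscore_mask(values)
cleaned = values[~iqr_out]      # suppress the outliers
print(values[iqr_out], values[z_out])  # [50.] [50.]
```

With very small samples the z-score of a single extreme point is bounded, so a threshold of 3.0 may never be reached; the IQR rule does not have that limitation.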

Parameters:
  • ar (Arraylike, pd.dataframe) –

    Arraylike containing outliers to remove.

    New in version 0.2.7: Accepts dataframe and can remove outliers using the z_score.

  • method (str, default='IQR') – The selected approach to remove the outliers. It can be ['IQR'|'Z-score']. See above for the explanation of each approach. Note that when selecting "z-score", the threshold value greatly influences which data points are considered outliers.

  • threshold (float, default=3) – Threshold value, useful for "z-score": data whose score lies above it are considered outliers.

  • fill_value (float, optional) – Value to replace the outliers. If not given, outliers are suppressed in the array.

  • axis (int, default=1) – Axis from which to remove values. This is useful when a two-dimensional array is supplied. By default, outliers are deleted from the rows.

  • interpolate (bool, default=False) – If fill_value='NaN', interpolation can be triggered to get the closest value in the array to replace missing values. Note that fill_value should be NaN for the interpolation to take effect.

  • kind (str, default='linear') –

    Kind of interpolation. It could be ['nearest'|'linear'|'cubic'].

    New in version 0.2.8: Interpolate NaN values after outlier removal.

Returns:

arr – New array with outliers removed.

Return type:

Array_like

Examples

>>> import numpy as np
>>> np.random.seed (42 )
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn (7, 3 )
>>> data_r = remove_outliers ( data )
>>> data.shape , data_r.shape
((7, 3), (5, 3))
>>> remove_outliers ( data, fill_value =np.nan )
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for one dimensional
>>> remove_outliers ( data[:, 0] , fill_value =np.nan )
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])
>>> remove_outliers ( data[:, 0] , fill_value =np.nan, interpolate=True  )
>>> import matplotlib.pyplot as plt
>>> plt.plot (np.arange (len(data )), data )
watex.utils.rename_labels_in(arr, new_names, coerce=False)[source]#

Rename labels with new names.

Parameters:
  • arr – array-like or pandas.Series; array or series containing numerical values. If non-numerical values are given, an error is raised.

  • new_names – list of str; list of strings or values to replace the integer label identifiers.

  • coerce – bool, default=False; force the ‘new_names’ to appear in the target, whether or not some integer class-label identifiers are included. If coerce is True, the target array takes the dtype of the new array; coercing the label names will not raise an error and can consequently introduce unexpected results.

Returns:

array-like, An array-like with full new label names.

watex.utils.replace_data(X, y=None, n_times=1, axis=0, reset_index=Ellipsis)[source]#

Replace items in data \(n\) times.

Parameters:
  • X (Arraylike 1D or pd.DataFrame) – Data to replace. Note that sparse matrices are not allowed. Use random_sampling() instead.

  • y (Arraylike 1d) – Preferably one-dimensional data.

  • n_times (int) – Number of times all items should be replaced in data.

  • reset_index (bool, default=False) – If True and X is a dataframe, the index is reset and dropped.

Returns:

X or (X, y) – Tuple is returned if y is passed.

Return type:

Tuple of data replaced

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import replace_data
>>> X, y = np.random.randn ( 7, 2 ), np.arange(7)
>>> X.shape, y.shape
((7, 2), (7,))
>>> X_new, y_new = replace_data (X, y, n_times =10 )
>>> X_new.shape , y_new.shape
((70, 2), (70,))
watex.utils.resampling(X, y, kind='over', strategy='auto', random_state=None, verbose=Ellipsis, **kws)[source]#

Combining Random Oversampling and Undersampling

Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution. This is a simple and effective strategy for imbalanced classification problems.

Applying re-sampling strategies to obtain a more balanced data distribution is an effective solution to the imbalance problem. There are two main approaches to random resampling for imbalanced classification; they are oversampling and undersampling.

  • Random Oversampling: Randomly duplicate examples in the minority class.

  • Random Undersampling: Randomly delete examples in the majority class.
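
The two strategies can be sketched with NumPy alone. This is a simplified stand-in, not the resampling routine documented here: it always equalizes every class to the largest (over) or smallest (under) class size, and the name `random_resample` is hypothetical.

```python
import numpy as np

def random_resample(X, y, kind="over", random_state=None):
    """Balance classes by duplicating or dropping randomly chosen rows."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if kind == "over" else counts.min()
    picked = []
    for label, count in zip(classes, counts):
        idx = np.flatnonzero(y == label)
        if count < target:    # oversampling: duplicate random minority rows
            extra = rng.choice(idx, size=target - count, replace=True)
            idx = np.concatenate([idx, extra])
        elif count > target:  # undersampling: keep a random majority subset
            idx = rng.choice(idx, size=target, replace=False)
        picked.append(idx)
    picked = np.concatenate(picked)
    return X[picked], y[picked]

X = np.arange(22).reshape(11, 2)
y = np.array([0] * 8 + [1] * 3)
X_over, y_over = random_resample(X, y, kind="over", random_state=0)
X_under, y_under = random_resample(X, y, kind="under", random_state=0)
print(np.bincount(y_over), np.bincount(y_under))  # [8 8] [3 3]
```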

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, )) – Target vector where n_samples is the number of samples.

  • kind (str, {"over", "under"} , default="over") – kind of sampling to perform. "over" and "under" stand for oversampling and undersampling respectively.

  • strategy (float, str, dict, callable, default='auto') –

    Sampling information to sample the data set.

    • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as \(\alpha_{us} = N_{m} / N_{rM}\) where \(N_{m}\) is the number of samples in the minority class and \(N_{rM}\) is the number of samples in the majority class after resampling.

      Warning

      float is only available for binary classification. An error is raised for multi-class classification.

    • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

      'majority': resample only the majority class;

      'not minority': resample all classes but the minority class;

      'not majority': resample all classes but the majority class;

      'all': resample all classes;

      'auto': equivalent to 'not minority'.

    • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

    • When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

  • random_state (int, RandomState instance, default=None) –

    Control the randomization of the algorithm.

    • If int, random_state is the seed used by the random number generator;

    • If RandomState instance, random_state is the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

  • verbose (bool, default=False) – Display the sample counts.

Returns:

X, y – Arraylike sampled

Return type:

NDarray, Arraylike

Examples

>>> import watex as wx
>>> from watex.utils.mlutils import resampling
>>> data, target = wx.fetch_data ('bagoue analysed', as_frame =True)
>>> data.shape, target.shape
>>> data_us, target_us = resampling (data, target, kind ='under',
...                                  verbose=True)
Counters: Auto
                     Raw counter y: Counter({0: 232, 1: 112})
           UnderSampling counter y: Counter({0: 112, 1: 112})
>>> data_us.shape, target_us.shape
((224, 8), (224,))
watex.utils.reshape(arr, axis=None)[source]#

Detect the array shape and reshape it accordingly, back to the given axis.

Parameters:
  • arr – array_like with number of dimensions equal to 1 or 2

  • axis – axis along which to reshape the array. If ‘axis’ is None and the number of dimensions is greater than 1, the array is reshaped back to a one-dimensional array

Returns:

New reshaped array

Example:
>>> import numpy as np
>>> from watex.utils.funcutils import reshape
>>> array = np.random.randn(50 )
>>> array.shape
... (50,)
>>> ar1 = reshape(array, 1)
>>> ar1.shape
... (1, 50)
>>> ar2 =reshape(ar1 , 0)
>>> ar2.shape
... (50, 1)
>>> ar3 = reshape(ar2, axis = None)
>>> ar3.shape # goes back to the original array
... (50,)
watex.utils.rhoa2z(rhoa, phs, freq)[source]#

Convert apparent resistivity to impedance tensor z

Parameters:
  • rhoa (ndarray, shape (N, M)) – Apparent resistivity in \(\Omega.m\)

  • phs (ndarray, shape (N, M)) – Phase in degrees

  • freq (array-like , shape (N, )) – Frequency in Hertz

Returns:

Impedance tensor; the tensor is a complex number in \(\Omega\).

Return type:

ndarray, shape (N, M), dtype = 'complex'

Example:

>>> import numpy as np
>>> from watex.utils import rhoa2z
>>> rhoa = np.array([1623.73691735])
>>> phz = np.array([45.])
>>> f = np.array ([1014])
>>> rhoa2z(rhoa, phz, f)
... array([[2.54950976+2.54950976j]])
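
The value in the example above is consistent with the SI relation \(|Z| = \sqrt{\rho_a \, \omega \, \mu_0}\) with \(\omega = 2\pi f\), combined with the phase as \(Z = |Z| e^{i\phi}\). The sketch below assumes that relation; it is a check of the arithmetic, not the library source, and the function name `rhoa_phase_to_z` is made up.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum magnetic permeability in H/m

def rhoa_phase_to_z(rhoa, phs_deg, freq):
    """Complex impedance from apparent resistivity, phase (degrees), frequency."""
    omega = 2.0 * np.pi * np.asarray(freq, dtype=float)
    modulus = np.sqrt(np.asarray(rhoa, dtype=float) * omega * MU0)
    return modulus * np.exp(1j * np.deg2rad(phs_deg))

z = rhoa_phase_to_z(np.array([1623.73691735]), np.array([45.0]),
                    np.array([1014.0]))
print(np.round(z, 4))  # approximately 2.5495+2.5495j
```

Unlike the function documented here, this sketch keeps the input shape and returns a one-dimensional array.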
watex.utils.rhophi2z(rho, phi, freq)[source]#

Convert impedance-style information given in Rho/Phi format into complex valued Z.

Parameters:
  • rho (ArrayLike 1D/2D) – Resistivity array in \(\Omega.m\). If array is two-dimensional, it should be 2x2 array (real).

  • phi (ArrayLike 1D/2D) – Phase array in degree (\(\degree\)). If array is two-dimensional, it should be 2x2 array (real).

  • freq (float, arraylike 1d) – Frequency in Hz

Returns:

Z – The dimension of Z depends on the input arrays rho and phi.

Return type:

Arraylike 1d or 2d , complex

Examples

>>> import numpy as np
>>> from watex.utils.exmath import rhophi2z
>>> rhophi2z (823 , 25 , 500 )
array([1300.00682824+606.20313966j])
>>> rho = np.array ([[823, 700], [723, 526]] )
>>> phi = np.array ([[45, 50], [90, 180]])
>>> rhophi2z (rho, phi , freq= 500  )
array([[ 1.01427314e+03+1.01427314e+03j,  8.50328081e+02+1.01338154e+03j],
       [ 8.23227764e-14+1.34443297e+03j, -1.14673449e+03+1.40434473e-13j]])
>>> rhophi2z (np.array ( [ 823, 700])  , np.array ([45, 50 ])  , [500, 700] )
array([1014.27313876+1014.27313876j, 1006.12175325+1199.04921402j])
>>> rho  = np.abs (np.random.randn (7, 3 ) * 100 )
>>> phi = np.abs ( np.random.randn (7, 3 ) *180 % 90 )
>>> freq = np.abs ( np.random.randn (7) * 100 )
>>> rhophi2z (rho   , phi  , freq )
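
The outputs listed above agree with the practical-unit relation \(|Z| = \sqrt{5 f \rho}\) and \(Z = |Z| e^{i\phi}\), applied element by element. The following sketch assumes that relation in order to reproduce the first three results; it is not taken from the library source, and `rho_phi_to_z` is a hypothetical name.

```python
import numpy as np

def rho_phi_to_z(rho, phi_deg, freq):
    """Element-wise Z from resistivity, phase in degrees and frequency."""
    rho = np.asarray(rho, dtype=float)
    freq = np.asarray(freq, dtype=float)
    modulus = np.sqrt(5.0 * freq * rho)
    return modulus * np.exp(1j * np.deg2rad(phi_deg))

z_scalar = rho_phi_to_z(823, 25, 500)
z_pair = rho_phi_to_z([823, 700], [45, 50], [500, 700])
print(np.round(z_scalar, 2))  # approximately 1300.01+606.2j
print(np.round(z_pair, 2))    # approximately 1014.27+1014.27j, 1006.12+1199.05j
```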
watex.utils.savejob(job, savefile, *, protocol=None, append_versions=True, append_date=True, fix_imports=True, buffer_callback=None, **job_kws)[source]#

Quickly save your job using ‘joblib’ or the persistent Python pickle module.

Parameters:
  • job (Any) – Anything to save, preferably models in a dict.

  • savefile (str, or path-like object) – Name of the file to store the model. The file argument must have a write() method that accepts a single bytes argument. It can thus be a file object opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

  • append_versions (bool, default=True) – Append the version of the Joblib or Python Pickle module, followed by the scikit-learn, numpy and pandas versions. This helps identify the versions needed to load the file after the system or modules have been upgraded, and avoids trouble when data has been stored for a long time and the user has forgotten the date and versions in use when the file was saved.

  • append_date (bool, default=True,) –

    Append the date of the day to the filename.

    New in version 0.2.3.

  • protocol (int, optional) –

    The optional protocol argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2, 3, 4 and 5. The default protocol is 4. It was introduced in Python 3.4, and is incompatible with previous versions.

    Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

  • fix_imports (bool, default=True,) – If fix_imports is True and protocol is less than 3, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.

  • buffer_callback (callable, optional) –

    If buffer_callback is None (the default), buffer views are serialized into file as part of the pickle stream.

    If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.

    It is an error if buffer_callback is not None and protocol is None or smaller than 5.

  • job_kws (dict,) – Additional keywords arguments passed to joblib.dump().

Returns:

savefile – returns the filename

Return type:

str
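
The protocol and fix_imports arguments mirror those of the standard pickle module. As a small sketch with the standard library only (it does not call savejob), the round trip below writes to an in-memory buffer with a negative protocol, which selects the highest protocol available. The helper `dated_name` is a made-up illustration of appending the date; the actual naming scheme used by savejob is not shown here.

```python
import io
import pickle
from datetime import date

def dated_name(savefile, append_date=True):
    """Hypothetical naming scheme: add today's date after the file name."""
    return f"{savefile}.{date.today().isoformat()}" if append_date else savefile

job = {"model": "my-estimator", "params": {"C": 1.0}}

buffer = io.BytesIO()  # any object with a write() method accepting bytes
pickle.dump(job, buffer, protocol=-1, fix_imports=True)
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == job)                   # True
print(dated_name("mymodel", False))      # mymodel
```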

watex.utils.scalePosition(ydata, xdata=None, func=None, c_order=0, show=False, **kws)[source]#

Correct data location or position and return new corrected location

Parameters:
  • ydata (array_like, series or dataframe) – The dependent data, a length M array - nominally f(xdata, ...).

  • xdata (array_like or object) – The independent variable where the data is measured. Should usually be an M-length sequence or an (k,M)-shaped array for functions with k predictors, but can actually be any object. If None, xdata is generated by default using the length of the given ydata.

  • func (callable) – The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. The default func is linear function i.e for f(x)= ax +b. where a is slope and b is the intercept value. Setting your own function for better fitting is recommended.

  • c_order (int or str) – The index or the column name if ydata is given as a dataframe to select the right column for scaling.

  • show (bool) – Quick visualization of data distribution.

  • kws (dict) – Additional keyword arguments passed to scipy.optimize.curve_fit.

Returns:

  • ydata (array-like) – Data scaled.

  • popt (array-like) – Optimal values for the parameters so that the sum of the squared residuals of f(xdata, \*popt) - ydata is minimized.

  • pcov (array-like) – The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)). How the sigma parameter affects the estimated covariance depends on the absolute_sigma argument of scipy.optimize.curve_fit. If the Jacobian matrix at the solution doesn’t have a full rank, then the ‘lm’ method returns a matrix filled with np.inf; on the other hand, the ‘trf’ and ‘dogbox’ methods use the Moore-Penrose pseudoinverse to compute the covariance matrix.

Examples

>>> from watex.utils import erpSelector, scalePosition
>>> df = erpSelector('data/erp/l10_gbalo.xlsx')
>>> df.columns
Index(['station', 'resistivity', 'longitude', 'latitude', 'easting',
       'northing'],
      dtype='object')
>>> # correct the northing coordinates from the easting data
>>> northing_corrected, popt, pcov = scalePosition(ydata=df.northing,
...                                                xdata=df.easting, show=True)
>>> len(df.northing.values), len(northing_corrected)
(20, 20)
>>> popt  # by default popt = (slope: a, intercept: b)
array([1.01151734e+00, 2.93731377e+05])
>>> # correct the easting coordinates using the default x
>>> easting_corrected, *_ = scalePosition(ydata=df.easting, show=True)
>>> df.easting.values
array([790284, 790281, 790277, 790270, 790265, 790260, 790254, 790248,
       790243, 790237, 790231, 790224, 790218, 790211, 790206, 790200,
       790194, 790187, 790181, 790175], dtype=int64)
>>> easting_corrected
array([790288.18571705, 790282.30300999, 790276.42030293, 790270.53759587,
       790264.6548888 , 790258.77218174, 790252.88947468, 790247.00676762,
       790241.12406056, 790235.2413535 , 790229.35864644, 790223.47593938,
       790217.59323232, 790211.71052526, 790205.8278182 , 790199.94511114,
       790194.06240407, 790188.17969701, 790182.29698995, 790176.41428289])
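
The fit-then-project idea behind this function can be sketched directly with scipy.optimize.curve_fit. This is only an illustration under stated assumptions, not the watex implementation: the helper fit_and_project, the linear model and the synthetic positions are all hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear(x, a, b):
    """Linear model f(x) = a*x + b (slope a, intercept b)."""
    return a * x + b


def fit_and_project(ydata, xdata=None, func=linear):
    """Fit func to the data and return projected values, popt and perr."""
    ydata = np.asarray(ydata, dtype=float)
    # Assumption: a missing xdata is replaced by 0..M-1, as the docs describe.
    if xdata is None:
        xdata = np.arange(ydata.size, dtype=float)
    xdata = np.asarray(xdata, dtype=float)
    popt, pcov = curve_fit(func, xdata, ydata)
    # One-standard-deviation errors on the fitted parameters.
    perr = np.sqrt(np.diag(pcov))
    return func(xdata, *popt), popt, perr


# Hypothetical noisy positions (not taken from the ERP file above).
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = -6.0 * x + 288.0 + rng.normal(scale=1.5, size=x.size)
y_corrected, popt, perr = fit_and_project(y, x)
```

Passing a custom func with more parameters works the same way, since curve_fit infers the number of parameters from the function signature.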

watex.utils.scaley(y, x=None, deg=None, func=None)[source]#

Scaling value using a fitting curve.

Create a polyfit function from specific data points x to correct the y values.

Parameters:
  • y – array-like of y-axis. Is the array of value to be scaled.

  • x – array-like of x-axis. If x is given, it should be the same length as y; otherwise an error occurs. Default is None.

  • func – callable - The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. func can be a linear function, i.e. f(x) = ax + b, where a is the slope and b is the intercept. Depending on the distribution of the y values, setting up a custom function is recommended for a better fit. If func is given, deg is not needed.

  • deg – polynomial degree. If None, it is computed from the number of extrema (local and/or global) values.

Returns:

  • y: array scaled - projected sample values got from f.

  • x: new x-axis - new axis x_new generated from the samples.

  • f: the linear or polynomial function.

References:

Wikipedia, Curve fitting, https://en.wikipedia.org/wiki/Curve_fitting

Wikipedia, Polynomial interpolation, https://en.wikipedia.org/wiki/Polynomial_interpolation

Example:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils import scaley
>>> rdn = np.random.RandomState(42)
>>> x0 = 10 * rdn.rand(50)
>>> y = 2 * x0 + rdn.randn(50) - 1
>>> plt.scatter(x0, y)
>>> yc, x, f = scaley(y)
>>> plt.plot(x, y, x, yc)
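
A rough sketch of the polynomial-fit scaling described above, using numpy.polyfit and numpy.poly1d. The helper polyfit_scale is hypothetical, and it assumes a fixed degree rather than one derived from the extrema of y as the deg parameter describes.

```python
import numpy as np


def polyfit_scale(y, x=None, deg=1):
    """Project y onto a fitted polynomial of degree deg."""
    y = np.asarray(y, dtype=float)
    # Assumption: a missing x is replaced by the sample indices.
    x = np.arange(y.size, dtype=float) if x is None else np.asarray(x, dtype=float)
    if x.size != y.size:
        raise ValueError("x and y must have the same length")
    f = np.poly1d(np.polyfit(x, y, deg))  # fitted polynomial function
    return f(x), x, f


rng = np.random.RandomState(42)
x0 = np.sort(10 * rng.rand(50))
y = 2 * x0 + rng.randn(50) - 1
yc, x, f = polyfit_scale(y, x0, deg=1)
```

Here f.coeffs[0] holds the fitted slope, which should be close to the true slope of 2 used to build the synthetic data.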
watex.utils.select_base_stratum(d, /, sname=None, stratum=None, return_rate=False, return_counts=False)[source]#

Select the base stratum from the strata column in the logging data.

Find the most recurrent stratum in the data and compute the rate of occurrence.

Parameters:
  • d (array-like 1D , pandas.Series or DataFrame) – Valid data containing the strata. If dataframe is passed, ‘sname’ is needed to fetch strata values.

  • sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of a valid layer/rock in the array/Series of strata.

  • stratum (str, optional) – Name of the base stratum. It must be present as an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered and the same stratum is returned; however, its rate and number of occurrences can still be obtained by setting return_rate or return_counts to True.

  • return_rate (bool,default=False,) – Return the rate of occurrence of the base stratum in the data.

  • return_counts (bool, default=False,) – Return each stratum name and its number of occurrences (count) in the data.

Returns:

  • bs (str) – base stratum, present in the data

  • r (float) – rate of occurrence of the base stratum in the data

  • c (tuple (str, int)) – Tuple of each stratum with its number of occurrences in the data.

Example

>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import select_base_stratum
>>> data = load_hlogs().frame # get only the frame
>>> select_base_stratum(data, sname ='strata_name')
'siltstone'
>>> select_base_stratum(data, sname ='strata_name', return_rate =True)
0.287292817679558
>>> select_base_stratum(data, sname ='strata_name', return_counts=True)
[('siltstone', 52),
 ('fine-grained sandstone', 40),
 ('mudstone', 37),
 ('coal', 24),
 ('Coarse-grained sandstone', 15),
 ('carbonaceous mudstone', 9),
 ('medium-grained sandstone', 2),
 ('topsoil', 1),
 ('gravel layer', 1)]
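
The "most recurrent stratum and its rate" logic can be approximated with collections.Counter. This is a minimal sketch on a hypothetical list of strata; the helper base_stratum is not part of watex.

```python
from collections import Counter


def base_stratum(strata):
    """Return the most frequent stratum, its rate and all counts."""
    strata = list(strata)
    counts = Counter(strata).most_common()  # sorted by decreasing count
    name, n = counts[0]
    rate = n / len(strata)
    return name, rate, counts


# Hypothetical strata column, not the load_hlogs data.
strata = ["siltstone", "coal", "siltstone", "mudstone", "siltstone", "coal"]
bs, rate, counts = base_stratum(strata)
```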
watex.utils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)[source]#

Select feature importance based on a user-specified threshold after model fitting.

This is useful if one wants to use RandomForestClassifier as a feature selector and intermediate step in a scikit-learn Pipeline object, which allows different processing steps to be connected with an estimator.

Parameters:
  • clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.

  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, )) – Target vector where n_samples is the number of samples. If given, set prefit=False for estimator to fit and transform the data for feature importance selecting. If estimator is already fitted i.e. prefit=True, ‘y’ is not needed.

  • threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.

  • prefit (bool, default=True) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.

  • importance_getter (str or callable, default='auto') –

    If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.

    Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.

    If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.

  • norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.

  • max_features (int, callable, default=None) –

    The maximum number of features to select.

    • If an integer, then it specifies the maximum number of features to allow.

    • If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).

    • If None, then all features are kept.

    To only select based on max_features, set threshold=-np.inf.

  • return_selector (bool, default=False,) – Return the selector object if True; otherwise return the transformed X.

  • verbose (int, default=0) – Display the number of features that meet the criterion according to their importance range.

Returns:

Xs or selector – Ndarray of number of samples and features that meet the criterion according to the importance range or selector object

Return type:

ndarray (n_samples, n_criterion_features), or sklearn.feature_selection.SelectFromModel

Examples

>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...      [-2.79, -0.02, -0.85 ],
...      [-1.34, -0.48, -2.55 ],
...      [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]

(1) Use prefit=True and get the important features Xs

>>> Xs = select_feature_importances (
...     LogisticRegression().fit(X0, y0), X0 , prefit =True )
>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])

(2) Set prefit off and return the selector object

>>> selector= select_feature_importances (
...     LogisticRegression(), X= X0 ,
...     y =y0  ,
...     prefit =False , return_selector= True
...     )
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])

(3) For comparison, use scikit-learn's SelectFromModel directly with its default threshold (the mean importance)

>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform (X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])
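
To see why the two thresholds above keep different features, the selection rule (keep a feature when its absolute importance is greater than or equal to the threshold) can be reproduced by hand on the coefficients shown in the examples. The helper importance_mask is a hypothetical sketch; it assumes a single coefficient vector and supports only numeric, "mean" and "median" thresholds.

```python
import numpy as np


def importance_mask(importances, threshold=0.1):
    """Boolean mask of features whose |importance| >= threshold."""
    imp = np.abs(np.asarray(importances, dtype=float)).ravel()
    if threshold == "mean":
        thr = imp.mean()
    elif threshold == "median":
        thr = float(np.median(imp))
    else:
        thr = float(threshold)
    return imp >= thr, thr


# Coefficients copied from the LogisticRegression example above.
coef = [[-0.3252302, 0.83462377, 0.49750423]]
mask_fixed, _ = importance_mask(coef, 0.1)
mask_mean, thr_mean = importance_mask(coef, "mean")
```

With the fixed threshold of 0.1 every coefficient passes, whereas the mean of the absolute coefficients (about 0.55245) keeps only the second feature.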
watex.utils.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#

Select features and return new dataframe.

Parameters:
  • df – a dataframe for features selections

  • features – list of features to select. List of features must be in the dataframe otherwise an error occurs.

  • include – the type of data to retrieve in the dataframe df. Can be number.

  • exclude – the type of data to exclude from the dataframe df. Can be number, i.e. only non-numeric data is kept in the returned data.

  • coerce – return the whole dataframe after converting the numeric columns. Be aware that no selection is done and no error is raised in this case. Default is False.

  • kwd – additional keyword arguments passed to the pd.astype function

Ref:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

watex.utils.sfi(cz, p=None, s=None, dipolelength=None, view=False, raw=False, return_components=False, **plotkws)[source]#

Compute the pseudo-fracturing index known as sfi.

The sfi parameter does not indicate the degree of rock fracturing underground; it is used to estimate the apparent resistivity dispersion ratio around the cumulative sum of the resistivity values of the selected anomaly. It uses an approach similar to the IF parameter proposed by Dieng et al. (2004). Furthermore, its threshold is set to \(\sqrt{2}\) for a symmetrical anomaly characterized by a perfect distribution of resistivity in a homogeneous medium. The formula is given by:

\[sfi=\sqrt{(P_a^{*}/P_a )^2+(M_a^{*}/M_a )^2}\]

where \(P_a\) and \(M_a\) are the anomaly power and magnitude respectively, and \(P_a^{*}\) and \(M_a^{*}\) are the projected power and magnitude of the lower point of the selected anomaly.
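
As a quick numeric check of the formula, the sketch below evaluates it for given power and magnitude values. The function sfi_from_components and the input numbers are hypothetical; how watex derives the projected quantities from cz is not shown here.

```python
import math


def sfi_from_components(pa_star, pa, ma_star, ma):
    """Evaluate sqrt((Pa*/Pa)^2 + (Ma*/Ma)^2)."""
    return math.sqrt((pa_star / pa) ** 2 + (ma_star / ma) ** 2)


# Symmetrical anomaly: projected values equal the measured ones,
# so both ratios are 1 and the index equals sqrt(2).
symmetric = sfi_from_components(40.0, 40.0, 120.0, 120.0)

# Hypothetical asymmetric anomaly: ratios 0.75 and 0.5.
asymmetric = sfi_from_components(30.0, 40.0, 60.0, 120.0)
```

The symmetric case reproduces the \(\sqrt{2}\) threshold mentioned above.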

Parameters:
  • cz (array-like,) – Selected conductive zone

  • p (array-like,) – Station positions of the conductive zone.

  • dipolelength (float) – Dipole length. If p is not given, p is generated automatically from the default dipole length to match the cz size. The default value is 10.

  • view (bool, default=False,) – Visualize the fitting curve. Default is False.

  • raw (bool,default=False,) – Overlay the fitting curve with the raw curve from cz.

  • return_components (bool, default=False,) – If True, return the different components used to compute sfi, especially for external visualization.

  • plotkws (dict) – Matplotlib plot keyword arguments.

Returns:

sfi – value computed for pseudo-fracturing index

Return type:

float

Examples

>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import sfi
>>> rang = np.random.RandomState (42)
>>> condzone = np.abs(rang.randn (7))
>>> # no visualization and default value `s` with global minimal rho
>>> pfi = sfi (condzone)
>>> pfi
3.35110143
>>> # visualize fitting curve
>>> plotkws  = dict (rlabel = 'Conductive zone (cz)',
...                  label = 'fitting model',
...                  color=f'{P().frcolortags.get("fr3")}',
...                  )
>>> sfi (condzone, view= True , s= 5, figsize =(7, 7),
...       **plotkws )
(array([ 0., 10., 20., 30.]), 1)

References

watex.utils.shape(cz, s=Ellipsis, p=Ellipsis)[source]#

Compute the shape of anomaly.

The shape parameter is mostly used in the basement medium to depict the better conductive zone for the drilling location. According to Sombo et al. (2011; 2012), various shapes of anomalies can be described such as:

“V”, “U”, “W”, “M”, “K”, “C”, and “H”

The shape computation consists of feeding the algorithm with the Electrical Resistivity Profiling resistivity values and specifying the station \((S_{VES})\). In most cases, \(S_{VES}\) is the station with a very low resistivity value, expected to be the drilling location.

Parameters:
  • cz – array-like - Conductive zone resistivity values

  • s – int, str - Station position index or name.

  • p – Array-like - Should be the position of the conductive zone.

Note

If s is given, p should be provided. If p is missing, an error is raised.

Returns:

str - the shape of anomaly.

Example:
>>> import numpy as np
>>> rang = np.random.RandomState(42)
>>> from watex.utils.exmath import shape
>>> test_array1 = np.arange(10)
>>> shape (test_array1)
'C'
>>> test_array2 = rang.randn (7)
>>> shape(test_array2)
'H'
>>> test_array3 = np.power(10, test_array2 , dtype =np.float32)
>>> shape (test_array3)
'H'
>>> # the shape does not change whatever the resistivity values

References

Sombo, P. A., Williams, F., Loukou, K. N., & Kouassi, E. G. (2011). Contribution de la Prospection Électrique à L’identification et à la Caractérisation des Aquifères de Socle du Département de Sikensi (Sud de la Côte d’Ivoire). European Journal of Scientific Research, 64(2), 206–219.

Sombo, P. A. (2012). Application des methodes de resistivites electriques dans la determination et la caracterisation des aquiferes de socle en Cote d’Ivoire. Cas des departements de Sikensi et de Tiassale (Sud de la Cote d’Ivoire). Universite Felix Houphouet Boigny.

watex.utils.smart_label_classifier(arr, /, values=None, labels=None, order='soft', func=None, raise_warn=True)[source]#

Smartly map a numeric array to class labels using a mapping function or given fixed values.

New classes created from the fixed values can be renamed if labels are supplied.

Parameters:
  • arr (Arraylike 1d,) – array-like whose items are expected to be categorized.

  • values (float, list of float,) – The threshold values from which the default categorization is fixed.

  • labels (int |str| or List of [str, int],) – The label names corresponding to the fixed values. Note that the number of labels must equal the number of fixed values plus one; otherwise a ValueError is raised if order is set to strict.

  • order (str, ['soft'|'strict'], default='soft',) – If order is strict, each item passed to values must be present in arr, and a warning is raised otherwise.

  • func (callable, optional) – Function to map the given array. If given, values does not need to be supplied.

  • raise_warn (bool, default=True) – Raise a warning message if order=soft and the fixed values are not found in arr. Also raise a warning if the labels argument does not match the number of classes derived from the fixed values.

Returns:

arr – categorized array with the same length as the raw array

Return type:

array-like 1d

Examples

>>> import numpy as np
>>> from watex.utils.funcutils import smart_label_classifier
>>> sc = np.arange (0, 7, .5 )
>>> smart_label_classifier (sc, values = [1, 3.2 ])
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
>>> # rename labels <=1 : 'l1', ]1; 3.2]: 'l2' and >3.2 :'l3'
>>> smart_label_classifier (sc, values = [1, 3.2 ], labels =['l1', 'l2', 'l3'])
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> def f (v):
...     if v <=1: return 'l1'
...     elif 1< v<=3.2: return "l2"
...     else : return "l3"
>>> smart_label_classifier (sc, func= f )
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> smart_label_classifier (sc, values = 1.)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> smart_label_classifier (sc, values = 1., labels='l1')
array(['l1', 'l1', 'l1', 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=object)
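
The threshold-based binning shown in these examples (values up to and including a threshold fall in the lower class) behaves like numpy.digitize with right=True. The sketch below is an assumption about equivalent behaviour, not the watex code, and the helper classify is hypothetical.

```python
import numpy as np


def classify(arr, values, labels=None):
    """Bin arr at the given thresholds; optionally rename the classes."""
    arr = np.asarray(arr, dtype=float)
    bins = np.sort(np.atleast_1d(values).astype(float))
    # right=True: bins[i-1] < x <= bins[i] is assigned class i.
    codes = np.digitize(arr, bins, right=True)
    if labels is None:
        return codes
    if len(labels) != bins.size + 1:
        raise ValueError("expected one label more than the number of values")
    return np.asarray(labels, dtype=object)[codes]


sc = np.arange(0, 7, 0.5)
codes = classify(sc, [1, 3.2])
named = classify(sc, [1, 3.2], labels=["l1", "l2", "l3"])
```

On the array sc used in the examples, codes matches the first output above: three items in class 0, four in class 1 and seven in class 2.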
watex.utils.smart_thickness_ranker(t, /, depth=None, regex=None, sep=':-', surface_value=0.0, mode='strict', return_thickness=Ellipsis, verbose=Ellipsis)[source]#

Compute the layer thicknesses and rank strata accordingly.

Extract the layer depth ranges from the literal string to find the ranking of the layer thicknesses.

Parameters:
  • t (str, or List of Any) –

    Literal string that contains the data arrangement. The kinds of data that can be provided for thickness arrangement are:

    • t-value: Composed only of the layer thickness values. For instance, t= "10 20 7 58" indicates four layers with thicknesses equal to 10, 20, 7 and 58 (meters) respectively.

    • tb-range: Composed only of the thickness range at each depth. For instance, t= "0-10 10-30 40-47 101-159". Note that the character used to separate a thickness range is '-'. Any other character must be specified using the parameter sep. Here, the top (roof) and bottom (wall) of the layers are 0 (top) and 10 (bottom), 10 and 30, 40 and 47, and 101 and 159 for strata 1, 2, 3 and 4 respectively.

    • mixed: Mixed data is composed of both t-value and tb-range. When this kind of data is provided, the user must set the operation mode to soft to parse the data smartly. However, to avoid any unexpected result, it is suggested to use either the t-value or the tb-range layer thickness naming.

  • depth (float, optional) – Depth is mostly used when the t-value thickness arrangement is provided. It adds an additional layer at the bottom of the given thicknesses and recomputes the last layer thickness. However, for a sampling such as geochemistry sampling, specifying the depth is not necessary.

  • regex (re object) –

    Regular expression object used to parse the literal string t. If not given, the default is:

    >>> import re
    >>> re.compile (r'[_#&.)(*@!,;\s-]\s*', flags=re.IGNORECASE)

  • sep (str, default=':-') – The characters used to separate the top and bottom of a layer thickness range. Any other character must be specified. For example, with the default, a range can be written "10-35" or "10:35".

  • surface_value (float, default=0.) – The top value of the first layer. The default is sea level. For instance, if the first layer l0 is 10 m thick, the top (roof) and the bottom (wall) of l0 are 0 and 10 for surface_value=0.

  • return_thickness (bool, default=False) – If True, return the calculated thickness of each stratum.

  • mode (str, default='strict') – Control the layer thickness ranking. It can be [‘soft’|’strict’]. Any other value is treated as ‘soft’. The mode is used to retrieve, arrange and compute the layer thicknesses. For instance, in strict mode, any bad arrangement or wrongly entered layer thickness raises an error. In ‘soft’ mode, however, bad arrangements are systematically dropped, especially when the top and bottom values of the layers are null.

  • verbose (bool, default=False) – Warn the user about the layer ranking and thickness calculation.

Returns:
dh_from, dh_to | thickness: Tuple of Arraylike
  • dh_from: Arraylike of each layer roof (top)

  • dh_to: Arraylike of each layer wall (bottom)

  • thickness: Arraylike composed of each stratum thickness. Values are returned if return_thickness=True.

Examples

>>> from watex.utils.geotools import smart_thickness_ranker
>>> smart_thickness_ranker ("10 15 70 125")
(array([ 0., 10., 25., 95.]), array([ 10.,  25.,  95., 220.]))
>>> smart_thickness_ranker ("10 15 70 125", depth =300,
...                         return_thickness= True)
(array([  0.,  10.,  25.,  95., 220.]),
 array([ 10.,  25.,  95., 220., 300.]),
 array([ 10.,  15.,  70., 125.,  80.]))
>>> smart_thickness_ranker ("10-15 70-125")
(array([10., 70.]), array([ 15., 125.]))
>>> smart_thickness_ranker ("10-15 70-125", depth =300)
(array([ 10.,  70., 125.]), array([ 15., 125., 300.]))
>>> smart_thickness_ranker ("7 10-15 13 70-125 ",mode='soft')
(array([ 0., 10., 15., 70.]), array([  7.,  15.,  28., 125.]))
>>> smart_thickness_ranker ("7 10-15 13 70-125 ",depth =300, mode='soft',
...                         return_thickness=True)
(array([  0.,  10.,  15.,  70., 125.]),
 array([  7.,  15.,  28., 125., 300.]),
 array([  7.,   5.,  13.,  55., 175.]))
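
For the plain t-value arrangement, the tops and bottoms shown above are just running sums of the thicknesses. The sketch below reproduces that case only; the helper rank_thicknesses is hypothetical and does not handle tb-range or mixed input.

```python
from itertools import accumulate


def rank_thicknesses(t, surface_value=0.0, depth=None):
    """Turn a 't-value' string into layer tops, bottoms and thicknesses."""
    thick = [float(v) for v in t.split()]
    # Each bottom is the surface value plus the cumulative thickness.
    bottoms = [surface_value + c for c in accumulate(thick)]
    tops = [surface_value] + bottoms[:-1]
    # Assumption: a deeper total depth adds one extra layer at the bottom.
    if depth is not None and depth > bottoms[-1]:
        tops.append(bottoms[-1])
        bottoms.append(float(depth))
        thick.append(bottoms[-1] - tops[-1])
    return tops, bottoms, thick


tops, bottoms, thick = rank_thicknesses("10 15 70 125", depth=300)
```

With depth=300 the extra layer runs from 220 to 300, which gives the fifth thickness of 80 seen in the second example above.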
watex.utils.smooth1d(ar, /, drop_outliers=True, ma=True, absolute=False, interpolate=False, view=False, x=None, xlabel=None, ylabel=None, fig_size=(10, 5))[source]#

Smooth one-dimensional array.

Parameters:
  • ar (ArrayLike 1d) – One-dimensional array.

  • drop_outliers (bool, default=True) – Remove the outliers in the data before smoothing

  • ma (bool, default=True,) – Use the moving average for smoothing array value. This seems more realistic.

  • interpolate (bool, default=False) –

    Interpolate value to fit the original data size after NaN filling.

    New in version 0.2.8.

  • absolute (bool, default=False,) – Keep the extrapolated scaled values positive. When scaling data, negative values can appear because of the polyfit function. To take the absolute value of these, set absolute=True. Note that converting values to positive should be considered a last resort, for cases where the values in the array must be positive.

  • view (bool, default =False) – Display curves

  • x (ArrayLike, optional) – Abscissa array for visualization. If given, it must be consistent with the given array ar. Raises error otherwise.

  • xlabel (str, optional) – Label of x

  • ylabel (str, optional) – label of y

  • fig_size (tuple , default=(10, 5)) – Matplotlib figure size

Returns:

yc – Smoothed array value.

Return type:

ArrayLike

Examples

>>> import numpy as np
>>> from watex.utils.exmath import smooth1d
>>> # add Gaussian noise
>>> np.random.seed (42)
>>> ar = np.random.randn (20 ) * 20 + np.random.normal ( 20 )
>>> ar [:7 ]
array([6.42891445e+00, 3.75072493e-02, 1.82905357e+01, 2.92957265e+01,
       6.20589038e+01, 2.26399535e+01, 1.12596434e+01])
>>> arc = smooth1d (ar, view =True , ma =False )
>>> arc [:7 ]
array([12.08603102, 15.29819907, 18.017749  , 20.27968322, 22.11900412,
       23.5707141 , 24.66981557])
>>> arc = smooth1d (ar, view =True )# ma=True by default
>>> arc [:7 ]
array([ 5.0071604 ,  5.90839339,  9.6264018 , 13.94679804, 17.67369252,
       20.34922943, 22.00836725])
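
For readers unfamiliar with moving-average smoothing, here is a generic centred moving average built on numpy.convolve. It is a sketch of the general technique only: the window size, the edge padding and the helper moving_average are assumptions, and the numbers it produces will not match the smooth1d outputs above.

```python
import numpy as np


def moving_average(y, window=3):
    """Centred moving average that keeps the input length."""
    y = np.asarray(y, dtype=float)
    if window < 1 or window > y.size:
        raise ValueError("window must be between 1 and len(y)")
    kernel = np.ones(window) / window
    pad = window // 2
    # Repeat the edge values so the output has as many points as the input.
    padded = np.pad(y, (pad, window - 1 - pad), mode="edge")
    return np.convolve(padded, kernel, mode="valid")


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```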
watex.utils.smoothing(ar, /, drop_outliers=True, ma=True, absolute=False, interpolate=False, axis=0, view=False, fig_size=(7, 7), xlabel=None, ylabel=None, cmap='binary')[source]#

Smooth data along axis.

Parameters:
  • ar (ArrayLike 1d or 2d) – One dimensional or two dimensional array.

  • drop_outliers (bool, default=True) – Remove the outliers in the data before smoothing along the given axis

  • ma (bool, default=True,) – Use the moving average for smoothing the array values along the axis. This seems more realistic than using only the scaling method.

  • absolute (bool, default=False,) – Keep the extrapolated scaled values positive. When scaling data, negative values can appear because of the polyfit function. To take the absolute value of these, set absolute=True. Note that converting values to positive should be considered a last resort, for cases where the values in the array must be positive.

  • axis (int, default=0) – Axis along which the data must be smoothed. The default is along the rows.

  • view (bool, default =False) – Visualize the two dimensional raw and smoothing grid.

  • xlabel (str, optional) – Label of x

  • ylabel (str, optional) – label of y

  • fig_size (tuple , default=(7, 7)) – Matplotlib figure size

  • cmap (str, default='binary') – Matplotlib.colormap to manage the view color

Returns:

arr0 – Smoothed array value.

Return type:

ArrayLike

Examples

>>> import numpy as np
>>> from watex.utils.exmath import smoothing
>>> # add Gaussian noise
>>> np.random.seed (42)
>>> ar = np.random.randn (20, 7 ) * 20 + np.random.normal ( 20, 7 )
>>> ar [:3, :3 ]
array([[ 31.5265026 ,  18.82693352,  34.5459903 ],
       [ 36.94091413,  12.20273182,  32.44342041],
       [-12.90613711,  10.34646896,   1.33559714]])
>>> arc = smoothing (ar, view =True , ma =False )
>>> arc [:3, :3 ]
array([[32.20356863, 17.18624398, 41.22258603],
       [33.46353806, 15.56839464, 19.20963317],
       [23.22466498, 13.8985316 ,  5.04748584]])
>>> arcma = smoothing (ar, view =True )# ma=True by default
>>> arcma [:3, :3 ]
array([[23.96547827,  8.48064226, 31.81490918],
       [26.21374675, 13.33233065, 12.29345026],
       [22.60143346, 16.77242118,  2.07931194]])
>>> arcma_1 = smoothing (ar, view =True, axis =1 )
>>> arcma_1 [:3, :3 ]
array([[18.74017857, 26.91532187, 32.02914421],
       [18.4056216 , 21.81293014, 21.98535213],
       [-1.44359989,  3.49228057,  7.51734762]])
watex.utils.split_train_test(df, test_ratio)[source]#

Naively split a dataset into train and test sets according to a ratio, and return the shuffled train set and test set.

Parameters:
  • df – a dataframe containing features

  • test_ratio – the ratio of the data assigned to the test set. test_ratio ranges between 0 and 1, e.g. 0.2 for a 20% test set.

Returns:

a tuple of train set and test set.
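
A naive shuffled split of this kind can be sketched on row indices with NumPy. The helper naive_split and its seed argument are hypothetical; the actual function takes a dataframe rather than a sample count.

```python
import numpy as np


def naive_split(n_samples, test_ratio, seed=None):
    """Return shuffled (train, test) row indices for a given ratio."""
    if not 0 < test_ratio < 1:
        raise ValueError("test_ratio must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = naive_split(10, 0.2, seed=42)
```

With a pandas dataframe, the two sets would then be taken with df.iloc[train_idx] and df.iloc[test_idx].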

watex.utils.storeOrwritehdf5(d, /, key=None, mode='a', kind=None, path_or_buf=None, encoding='utf8', csv_sep=',', index=Ellipsis, columns=None, sanitize_columns=Ellipsis, func=None, args=(), applyto=None, **func_kwds)[source]#

Store data to hdf5 or write data to csv file.

Note that by default the data is neither stored nor written; the function returns the data if a frame is passed, or transforms the path-like object into a dataframe.

Parameters:
  • d (Dataframe, shape (m_samples, n_features)) – data to store or write or sanitize.

  • key (str) – Identifier for the group in the store.

  • mode ({'a', 'w', 'r+'}, default 'a') –

    Mode to open file:

    • ’w’: write, a new file is created (an existing file with the same name would be deleted).

    • ’a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

    • ’r+’: similar to ‘a’, but the file must already exist.

  • kind (str, {'store', 'write', None} , default=None) –

    Type of task to perform:

    • ’store’: Store data to hdf5

    • ’write’: export data to csv file.

    • None: construct a dataframe if array is passed or sanitize it.

  • path_or_buf (str, pandas.HDFStore, path object, file-like object, or None, default=None) – File path or HDFStore object. String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If kind='write' and path_or_buf is None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.

  • encoding (str, default='utf8') – A string representing the encoding to use in the output file, Encoding is not supported if path_or_buf is a non-binary file object.

  • csv_sep (str, default=',',) – String of length 1. Field delimiter for the output file.

  • index (bool, default=False,) – Write data to csv with index or not.

  • columns (list of str, optional) – Useful for creating a dataframe when an array is passed. Make sure it matches the number of array columns (shape[1]).

  • sanitize_columns (bool, default=False,) –

    Remove undesirable characters in the data columns using the default argument of the regex parameter and replace the matched pattern with an underscore ‘_’. The default regex implementation is:

    >>> import re
    >>> re.compile (r'[_#&.)(*@!,;\s-]\s*', flags=re.IGNORECASE)
    

  • func (callable, Optional) – A custom sanitizing function to apply to each column of the dataframe. If provided, the target columns must be listed in the applyto parameter.

  • args (tuple, optional) – Positional arguments of the sanitizing function func.

  • applyto (str or list of str, Optional) – The list of columns to which the function func is applied. To apply the function to all columns, use ‘*’ instead.

  • func_kwds (dict,) – Keyword arguments of the sanitizing function func.

Returns:

None or d – returns None if kind is set to write or store; otherwise returns the dataframe.

Return type:

None or dataframe.

Examples

>>> from watex.utils.funcutils import storeOrwritehdf5
>>> from watex.datasets import load_bagoue
>>> data = load_bagoue().frame
>>> data.geol[:5]
0    VOLCANO-SEDIM. SCHISTS
1                  GRANITES
2                  GRANITES
3                  GRANITES
4          GEOSYN. GRANITES
Name: geol, dtype: object
>>> data = storeOrwritehdf5 ( data, sanitize_columns = True)
>>> data[['type', 'geol', 'shape']] # put all to lowercase
  type                    geol shape
0   cp  volcano-sedim. schists     w
1   ec                granites     v
2   ec                granites     v
>>> # compute using func
>>> def test_func ( a, times  , to_percent=False ):
...     return ( a * times / 100)   if to_percent else ( a *times )
>>> data.sfi[:5]
0    0.388909
1    1.340127
2    0.446594
3    0.763676
4    0.068501
Name: sfi, dtype: float64
>>> d = storeOrwritehdf5 ( data,  func = test_func, args =(7,), applyto='sfi')
>>> d.sfi[:5]
0    2.722360
1    9.380889
2    3.126156
3    5.345733
4    0.479507
Name: sfi, dtype: float64
>>> storeOrwritehdf5 ( data,  func = test_func, args =(7,),
...                   applyto='sfi', to_percent=True).sfi[:5]
0    0.027224
1    0.093809
2    0.031262
3    0.053457
4    0.004795
Name: sfi, dtype: float64
>>> # write data to hdf5 and output to the current directory
>>> storeOrwritehdf5 ( d, key='test0', path_or_buf= 'test_data.h5',
...                   kind ='store')
>>> # export data to csv
>>> storeOrwritehdf5 ( d, key='test0', path_or_buf= 'test_data',
...                   kind ='write')
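
The effect of the default sanitizing regex quoted in the sanitize_columns description can be tried on a few column names with the standard re module. The helper sanitize and the column names are hypothetical; only the pattern itself comes from the documentation above.

```python
import re

# Default pattern quoted in the sanitize_columns description.
PATTERN = re.compile(r'[_#&.)(*@!,;\s-]\s*', flags=re.IGNORECASE)


def sanitize(columns, fill_pattern="_"):
    """Replace every match of the pattern with fill_pattern."""
    return [PATTERN.sub(fill_pattern, str(c).strip()) for c in columns]


cols = sanitize(["Well ID", "depth (m)", "east-ing", "sfi"])
```

Each matched character (plus any whitespace that follows it) becomes a single underscore, so "depth (m)" turns into "depth__m_".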
watex.utils.to_numeric_dtypes(arr, *, columns=None, return_feature_types=Ellipsis, missing_values=nan, pop_cat_features=Ellipsis, sanitize_columns=Ellipsis, regex=None, fill_pattern='_', drop_nan_columns=True, how='all', reset_index=Ellipsis, drop_index=True, verbose=Ellipsis)[source]#

Convert array to dataframe and coerce arguments to appropriate dtypes.

Function includes additional tools to manipulate the transformed data such as:

  • pop_cat_features to remove the categorical attributes,

  • sanitize_columns to clean the columns of the dataframe by removing the undesirable characters,

  • drop_nan_columns to drop all the columns and/or rows that contain only NaN, …

Parameters:
  • arr (Ndarray or Dataframe, shape (m_samples, n_features)) – Array or dataframe to create, sanitize, or use to auto-detect feature categories (numerical or categorical).

  • columns (list of str, optional) – Useful for creating a dataframe when an array is given. Make sure it matches the number of array columns (shape[1]).

  • return_feature_types (bool, default=False,) – Return the lists of numerical and categorical features.

  • missing_values (float, default=NaN) – Value used to replace missing or empty strings, if any exist in the dataframe.

  • pop_cat_features (bool, default=False,) – Remove the categorical features from the DataFrame.

  • sanitize_columns (bool, default=False) –

    Remove undesirable characters from the data columns using the default argument of the regex parameter.

    New in version 0.1.9.

  • regex (re object) –

    Regular expression object used to clean the data columns.

    The default is:

    >>> import re
    >>> re.compile (r'[_#&.)(*@!_,;\s-]\s*', flags=re.IGNORECASE)
    

    New in version 0.1.9.

  • fill_pattern (str, default='_') – Pattern used to replace the non-alphabetic characters in each column name.

  • drop_nan_columns (bool, default=True) –

    Remove all columns filled by NaN values.

  • how (str, default='all') – Also drop the NaN rows. With 'all', a row is dropped only when it is composed entirely of NaN or Null values.

  • reset_index (bool, default=False) –

    Reset the index of the dataframe.

  • drop_index (bool, default=True) –

    Drop the index in the dataframe after resetting.

  • verbose (bool, default=False) – Output a message listing the categorical items dropped from the dataframe, if any.

Returns:

df or (df, nf, cf) – nf and cf are also returned if return_feature_types is set to True.

Return type:

Dataframe of values casted to numeric types

Examples

>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils.funcutils import to_numeric_dtypes
>>> X, y = load_bagoue (as_frame =True )
>>> X0 =X[['shape', 'power', 'magnitude']]
>>> X0.dtypes
... shape        object
    power        object
    magnitude    object
    dtype: object
>>> df = to_numeric_dtypes(X0)
>>> df.dtypes
... shape         object
    power        float64
    magnitude    float64
    dtype: object
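
The coercion described above can be approximated with plain pandas. The block below is a minimal sketch under stated assumptions, not the watex implementation: `coerce_numeric_sketch` is a hypothetical helper that cleans the column names with the documented default regex, converts a column with `pandas.to_numeric` only when every value parses as a number, and drops columns made up entirely of NaN. The sample frame is invented for illustration.

```python
import re

import pandas as pd

# Default pattern quoted in the regex parameter description above.
DEFAULT_REGEX = re.compile(r'[_#&.)(*@!_,;\s-]\s*', flags=re.IGNORECASE)


def coerce_numeric_sketch(df, regex=DEFAULT_REGEX, fill_pattern="_"):
    """Hypothetical helper, for illustration only (not part of watex)."""
    out = df.copy()
    # Replace undesirable characters in the column names.
    out.columns = [regex.sub(fill_pattern, str(c)) for c in out.columns]
    for col in out.columns:
        try:
            # Convert only when every value in the column parses as a number.
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave the column untouched (treated as categorical)
    # Drop the columns made up entirely of NaN.
    out = out.dropna(axis=1, how="all")
    nf = list(out.select_dtypes(include="number").columns)  # numerical
    cf = [c for c in out.columns if c not in nf]            # categorical
    return out, nf, cf


# Invented sample data: every value is stored as a string.
frame = pd.DataFrame({
    "shape": ["V", "W"],
    "power": ["10", "20"],
    "magni tude": ["1.5", "2.5"],
})
clean, nf, cf = coerce_numeric_sketch(frame)
print(clean.dtypes)
print(nf, cf)
```

Using a try/except around `pandas.to_numeric` rather than `errors='coerce'` is a deliberate choice in this sketch: it keeps text columns such as `shape` intact instead of silently turning them into NaN.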
watex.utils.torres_verdin_filter(arr, /, weight_factor=0.1, beta=1.0, logify=False, axis=Ellipsis)[source]#

Calculates the adaptive moving average (AMA) of a given data array, following the Torres-Verdin and Bostick algorithm [1].

Parameters:
  • arr (Arraylike 1d) – List or array-like of data points. If a two-dimensional array is passed, axis must be specified so that the filter is applied along it.

  • weight_factor (float, default=.1) – Base smoothing factor for the window size, which gets adjusted by a factor dependent on the rate of change in the data.

  • beta (float, default=1.) – Scaling factor that controls how much weight_factor is adjusted during periods of high volatility.

  • logify (bool, default=False) – By default, the filter uses an exponential moving average. If True, the values are log-transformed so that the weights stay between 0 and 1. This is important when the data are resistivity or phase values.

  • axis (int, default=0) – Axis along which to apply the AMA filter.

Returns:

ama

Return type:

Adaptive moving average

References

[1]

Torres-Verdin and Bostick, 1992, Principles of spatial surface electric field filtering in magnetotellurics: electromagnetic array profiling (EMAP), Geophysics, v57, p603-622. https://doi.org/10.1190/1.2400625

Example

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils.exmath import torres_verdin_filter
>>> data = np.random.randn(100)
>>> ama = torres_verdin_filter(data)
>>> plt.plot (range (len(data)), data, 'k', range(len(data)), ama, '-or')
>>> # apply on a two-dimensional array
>>> data2d = np.random.randn(7, 10)
>>> ama2d = torres_verdin_filter ( data2d, axis =0)
>>> fig, ax  = plt.subplots (nrows = 1, ncols = 2 , sharey= True,
...                          figsize = (7,7) )
>>> ax[0].imshow(data2d , label ='Raw data', cmap = 'binary' )
>>> ax[1].imshow (ama2d,  label = 'AMA data', cmap ='binary' )
>>> ax[0].set_title ('Raw data')
>>> ax[1].set_title ('AMA data')
>>> plt.show ()
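
To make the roles of weight_factor and beta concrete, here is a minimal sketch of a generic adaptive exponential moving average. It is an assumption-laden illustration, not the watex implementation nor the exact Torres-Verdin and Bostick filter: `adaptive_ema_sketch` is a hypothetical name, and the rule used to adapt the weight (base factor scaled by the absolute change between consecutive samples, clipped to 1) is only one plausible reading of the parameter descriptions above.

```python
import numpy as np


def adaptive_ema_sketch(arr, weight_factor=0.1, beta=1.0):
    """Hypothetical adaptive EMA, for illustration only (not part of watex)."""
    x = np.asarray(arr, dtype=float)
    ama = np.empty_like(x)
    ama[0] = x[0]
    for i in range(1, x.size):
        # Rate of change between consecutive samples (assumed volatility proxy).
        change = abs(x[i] - x[i - 1])
        # The base weight grows with the change; it is clipped to stay <= 1.
        alpha = min(1.0, weight_factor * (1.0 + beta * change))
        # Standard exponential update with the adapted weight.
        ama[i] = alpha * x[i] + (1.0 - alpha) * ama[i - 1]
    return ama


rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 2 * np.pi, 200)) + 0.2 * rng.standard_normal(200)
smoothed = adaptive_ema_sketch(noisy)
flat = adaptive_ema_sketch(np.full(10, 5.0))
```

Because each output sample is a convex combination of the current input and the previous output, the smoothed series in this sketch always stays within the range of the input data.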
watex.utils.twinning(*d, on=None, parse_on=False, mode='strict', coerce=False, force=False, decimals=7, raise_warn=True)[source]#
Find identical objects in all data and concatenate them using a merge intersection (cross) strategy.

Parameters:
  • d (List of DataFrames) – List of pandas DataFrames

  • on (str, label or list) –

    Column or index level names to join on. These must be found in all DataFrames. If on is None and not merging on indexes, then a concatenation along the columns axis is performed on all DataFrames. Note that on works with parse_on if its argument is a list of column names passed as a single literal string. For instance:

    on ='longitude latitude' --[parse_on=True]-> ['longitude' , 'latitude']
    

  • parse_on (bool, default=False) – Parse the on argument if it is given as a string and return an iterable of column names.

  • mode (str, default='strict') – Mode used to check the data. Can be [‘soft’|’strict’]. In strict mode, all the data passed must be DataFrames, otherwise an error is raised. In soft mode, non-DataFrame items are ignored. Note that any other value is treated as strict mode.

  • coerce (bool, default=False) – Truncate all DataFrames to the size of the shortest one before performing the merge.

  • force (bool, default=False) – Force the on items to be present in all DataFrames. This is possible only if the on items exist in at least one DataFrame. If they are missing from all data, an error is raised.

  • decimals (int, default=7) – Number of decimals used for comparison between numeric labels in the on columns. If set, it rounds the values of the on items in all data before performing the merge.

  • raise_warn (bool, default=True) – Warn the user that the data are concatenated along the columns axis if on is None.

Returns:

data – A DataFrame of the merged objects.

Return type:

DataFrames

Examples

>>> import watex as wx
>>> from watex.utils.funcutils import twinning
>>> data = wx.make_erp (seed =42 , n_stations =12, as_frame =True )
>>> table1 = wx.methods.DCProfiling ().fit(data).summary()
>>> table1
       dipole   longitude  latitude  ...  shape  type       sfi
line1      10  110.486111  26.05174  ...      C    EC  1.141844
>>> data_no_xy = wx.make_ves ( seed=0 , as_frame =True)
>>> data_no_xy.head(2)
    AB   MN  resistivity
0  1.0  0.4   448.860148
1  2.0  0.4   449.060335
>>> data_xy = wx.make_ves ( seed =0 , as_frame =True , add_xy =True )
>>> data_xy.head(2)
    AB   MN  resistivity   longitude  latitude
0  1.0  0.4   448.860148  109.332931  28.41193
1  2.0  0.4   449.060335  109.332931  28.41193
>>> table = wx.methods.VerticalSounding (
    xycoords = (110.486111,   26.05174)).fit(data_no_xy).summary()
>>> table.table_
         AB    MN   arrangememt  ... nareas   longitude  latitude
area                             ...
None  200.0  20.0  schlumberger  ...      1  110.486111  26.05174
>>> twinning (table1, table.table_,  )
       dipole   longitude  latitude  ...  nareas   longitude  latitude
line1    10.0  110.486111  26.05174  ...     NaN         NaN       NaN
None      NaN         NaN       NaN  ...     1.0  110.486111  26.05174
>>> twinning (table1, table.table_, on =['longitude', 'latitude'] )
Empty DataFrame
>>> # An empty dataframe is returned because the number of decimals is too
>>> # large, so the longitude and latitude values are considered different.
>>> twinning (table1, table.table_, on =['longitude', 'latitude'], decimals =5 )
    dipole  longitude  latitude  ...  max_depth  ohmic_area  nareas
0      10  110.48611  26.05174  ...      109.0  690.063003       1
>>> # The rows with matching coordinates are now found and merged.
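
The effect of decimals can be reproduced with plain pandas. The following is a minimal sketch, not the watex implementation: `twin_sketch` is a hypothetical helper that rounds the on columns and then chains inner merges, and the two small frames use invented coordinates that differ only at the seventh decimal.

```python
import pandas as pd


def twin_sketch(frames, on, decimals=7):
    """Hypothetical helper, for illustration only (not part of watex)."""
    rounded = []
    for frame in frames:
        frame = frame.copy()
        # Round the join keys so nearly identical coordinates compare equal.
        frame[on] = frame[on].round(decimals)
        rounded.append(frame)
    merged = rounded[0]
    for frame in rounded[1:]:
        # Inner merge keeps only the rows whose keys exist in both frames.
        merged = merged.merge(frame, on=on, how="inner")
    return merged


# Invented data: the longitudes differ at the seventh decimal only.
a = pd.DataFrame({"longitude": [110.486111], "latitude": [26.05174],
                  "dipole": [10]})
b = pd.DataFrame({"longitude": [110.4861113], "latitude": [26.05174],
                  "nareas": [1]})

keys = ["longitude", "latitude"]
strict = twin_sketch([a, b], on=keys, decimals=7)  # no row matches
loose = twin_sketch([a, b], on=keys, decimals=5)   # one merged row
print(len(strict), len(loose))
```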
watex.utils.type_(erp)[source]#

Compute the type of anomaly.

The type parameter is defined by the African Hydraulic Study Committee report (CIEH, 2001). Later it was implemented by authors such as (Adam et al., 2020; Michel et al., 2013; Nikiema, 2012). The type helps to differentiate two or several anomalies with the same shape. For instance, two anomalies with the same shape W will differ in the order of priority of their types. The type depends on the lateral resistivity distribution of the underground (resulting from the pace of the apparent resistivity curve) along the whole Electrical Resistivity Profiling survey line. Indeed, four types of anomalies were emphasized:

“EC”, “CB2P”, “NC” and “CP”.

For more details, refer to the references.

Parameters:

erp – array-like - Array of Electrical Resistivity Profiling line composed of apparent resistivity values.

Returns:

str - The type of anomaly.

Example:
>>> import numpy as np
>>> from watex.utils.exmath import type_
>>> rang = np.random.RandomState(42)
>>> test_array2 = rang.randn (7)
>>> type_(np.abs(test_array2))
... 'EC'
>>> long_array = np.abs (rang.randn(71))
>>> type_(long_array)
... 'PC'

References

Adam, B. M., Abubakar, A. H., Dalibi, J. H., Khalil Mustapha, M., & Abubakar, A. H. (2020). Assessment of Gaseous Emissions and Socio-Economic Impacts From Diesel Generators used in GSM BTS in Kano Metropolis. African Journal of Earth and Environmental Sciences, 2(1), 517–523. https://doi.org/10.11113/ajees.v3.n1.104

CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.

Michel, K. A., Drissa, C., Blaise, K. Y., & Jean, B. (2013). Application de méthodes géophysiques à l’étude de la productivité des forages d’eau en milieu cristallin : cas de la région de Toumodi (Centre de la Côte d’Ivoire). International Journal of Innovation and Applied Studies, 2(3), 324–334.

Nikiema, D. G. C. (2012). Essai d’optimisation de l’implantation géophysique des forages en zone de socle : Cas de la province de Séno, Nord Est du Burkina Faso (IRD). (I. / I. Ile-de-France, Ed.). IST / IRD Ile-de-France, Ouagadougou, Burkina Faso, West-africa. Retrieved from http://documentation.2ie-edu.org/cdi2ie/opac_css/doc_num.php?explnum_id=148

watex.utils.vesDataOperator(AB=None, rhoa=None, data=None, typeofop='mean', outdf=False)[source]#

Process VES data to handle duplicated spacing distances (AB) by applying specified operations to the corresponding resistivity values (rhoa).

In VES measurements, it’s common to encounter duplicated AB values with different resistivity readings due to minor adjustments in the MN distance. This function consolidates such duplicates by either averaging, taking the median, or randomly selecting one of the resistivity readings.

Parameters:
  • AB (Optional[ArrayLike], optional) – 1D array of AB spacings from current electrodes, representing the exploration depth measurements in meters. If data is provided, this parameter is ignored.

  • rhoa (Optional[ArrayLike], optional) – 1D array of apparent resistivity values corresponding to AB spacings, measured in ohm-meters (Ω·m). If data is provided, this parameter is ignored.

  • data (Optional[DataFrame], optional) – DataFrame containing both AB spacings and corresponding rhoa values. Overrides AB and rhoa parameters when provided.

  • typeofop (str, optional) –

    Specifies the operation to apply to rhoa values for duplicated AB spacings:

    - ‘mean’: Calculates the mean of rhoa values for each unique AB spacing.

    - ‘median’: Determines the median of rhoa values for each unique AB spacing.

    - ‘leaveoneout’: Randomly selects one rhoa value from the duplicates for each unique AB spacing. This approach is useful for experiments with significant measurement variance at the same AB spacing.

    Default is ‘mean’.

  • outdf (bool, optional) – Determines the format of the output. If True, returns a DataFrame with processed AB and rhoa values; otherwise, returns a tuple (AB, rhoa). Default is False.

Returns:

Processed AB and rhoa values. The format is dictated by the outdf parameter: a DataFrame if True, or a tuple (AB, rhoa) if False.

Return type:

Union[Tuple[ArrayLike, ArrayLike], DataFrame]

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.exmath import vesDataOperator
>>> AB = np.array([10, 10, 20, 30, 30])
>>> rhoa = np.array([100, 105, 150, 200, 195])
>>> # Processing with mean operation
>>> AB_proc, rhoa_proc = vesDataOperator(AB, rhoa, typeofop='mean')
>>> print(AB_proc)
[10 20 30]
>>> print(rhoa_proc)
[102.5 150 197.5]
>>> # Using DataFrame input and median operation
>>> data = pd.DataFrame({'AB': AB, 'rhoa': rhoa})
>>> df_proc = vesDataOperator(data=data, typeofop='median', outdf=True)
>>> print(df_proc)
     AB  rhoa
0  10  102.5
1  20  150.0
2  30  197.5
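
The consolidation of duplicated AB spacings can be sketched with a pandas group-by. This is a minimal illustration under stated assumptions, not the watex implementation: `dedup_ab_sketch` and its `seed` argument are hypothetical, and the input values are the ones used in the example above.

```python
import numpy as np
import pandas as pd


def dedup_ab_sketch(AB, rhoa, typeofop="mean", seed=None):
    """Hypothetical helper, for illustration only (not part of watex)."""
    df = pd.DataFrame({"AB": AB, "rhoa": rhoa})
    # One group per unique AB spacing, sorted by increasing spacing.
    grouped = df.groupby("AB", sort=True)["rhoa"]
    if typeofop == "mean":
        out = grouped.mean()
    elif typeofop == "median":
        out = grouped.median()
    elif typeofop == "leaveoneout":
        # Keep one randomly chosen reading per duplicated spacing.
        rng = np.random.default_rng(seed)
        out = grouped.apply(lambda s: rng.choice(s.to_numpy()))
    else:
        raise ValueError(f"Unknown operation: {typeofop!r}")
    return out.index.to_numpy(), out.to_numpy()


AB = np.array([10, 10, 20, 30, 30])
rhoa = np.array([100, 105, 150, 200, 195])

ab_mean, rho_mean = dedup_ab_sketch(AB, rhoa, typeofop="mean")
ab_one, rho_one = dedup_ab_sketch(AB, rhoa, typeofop="leaveoneout", seed=0)
print(ab_mean, rho_mean)
```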
watex.utils.vesSelector(data=None, *, rhoa=None, AB=None, MN=None, index_rhoa=None, xy_coords=None, is_utm=False, utm_zone=None, epsg=None, **kws)[source]#

Assert the validity of Vertical Electrical Sounding data and return a sanitized dataframe.

param rhoa:

array-like - Apparent resistivities collected during the sounding.

param AB:

array-like - Investigation distance between the current electrodes. Note that AB is by convention equal to AB/2. It is taken as the half-space of the investigation depth.

param MN:

array-like - Potential electrode distances at each investigation depth. Note that by convention the values are half-spaces and equal to MN/2.

param data:

Path-like object or sounding dataframe. If given, the other parameters can keep their None values.

param index_rhoa:

int - The index used to retrieve the resistivity data of a specific sounding point. Sometimes the sounding data are composed of different sounding values collected in the same survey area along different Electrical Resistivity Profiling lines. For instance:

| AB/2 | MN/2 | SE1 | SE2 | SE3 | SEn |

Where SE are the electrical sounding data values and n is the number of sounding points selected. SE1, SE2 and SE3 are three points selected for Vertical Electrical Sounding, i.e. 3 sounding points carried out either on the same Electrical Resistivity Profiling line or somewhere else. These sounding data are the resistivity data with specific numbers. Commonly the numbers are randomly chosen and do not refer to the expected best fracture zone selected after the prior interpretation. After transformation via the function vesSelector, the header of the data should hold the resistivity. For instance, referring to the table above, the data should be:

| AB | MN | resistivity | resistivity | resistivity |

Therefore, index_rhoa is used to select the specific resistivity values, i.e. to select the corresponding sounding number of the Vertical Electrical Sounding expected to locate the drilling operations or to be used for computation. For example, index_rhoa=1 should give:

| AB/2 | MN/2 | SE2 |  –>  | AB | MN | resistivity |

If index_rhoa is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. index_rhoa equals 0.

param xy_coords:

tuple (float, float) Coordinates of the sounding point. Must be (‘longitude’, ‘latitude’) or (‘easting’, ‘northing’). If xy_coords is given as (‘easting’, ‘northing’), specify is_utm=True so that the conversion to (‘longitude’, ‘latitude’) is triggered. If is_utm is False, a warning is issued when the values are greater than 180 and 90 degrees for longitude and latitude respectively. Note that if the coordinates exist in the dataframe, they take priority.

New in version 0.2.1.

param is_utm:

bool, default=False. Allow conversion of the (‘easting’, ‘northing’) coordinates from xy_coords to (‘longitude’, ‘latitude’).

param utm_zone:

default=’49R’ Is needed when xy_coords is passed as (‘easting’, ‘northing’) for conversion.

param epsg:

int, str, optional. EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.

param kws:

dict - Additional keyword arguments passed to the pandas dataframe reader.

return:

dataframe - Sanitized Vertical Electrical Sounding dataframe with AB, MN and resistivity as the column headers.

Example:
>>> from watex.utils.coreutils import vesSelector
>>> df = vesSelector (data='data/ves/ves_gbalo.csv')
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          943
    1   2  0.4         1179
    2   3  0.4         1103
>>> df = vesSelector ('data/ves/ves_gbalo.csv', index_rhoa=3 )
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          457
    1   2  0.4          582
    2   3  0.4          558
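
The role of index_rhoa can be illustrated with a short pandas sketch. It is not the watex implementation: `select_sounding_sketch` is a hypothetical helper that assumes the first two columns hold AB/2 and MN/2 and that every remaining column is one sounding curve, and the readings in the sample frame are made up.

```python
import pandas as pd


def select_sounding_sketch(df, index_rhoa=None):
    """Hypothetical helper, for illustration only (not part of watex)."""
    # The first sounding curve is used when no index is given.
    idx = 0 if index_rhoa is None else index_rhoa
    # Assumption: columns are AB/2, MN/2, then SE1 ... SEn.
    sounding_cols = list(df.columns[2:])
    return pd.DataFrame({
        "AB": df.iloc[:, 0].to_numpy(),
        "MN": df.iloc[:, 1].to_numpy(),
        "resistivity": df[sounding_cols[idx]].to_numpy(),
    })


# Made-up readings for two sounding points.
raw = pd.DataFrame({
    "AB/2": [1, 2, 3],
    "MN/2": [0.4, 0.4, 0.4],
    "SE1": [900, 1100, 1050],
    "SE2": [450, 580, 560],
})
first = select_sounding_sketch(raw)                 # SE1 by default
second = select_sounding_sketch(raw, index_rhoa=1)  # SE2
print(second)
```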


watex.utils.z2rhoa(z, freq)[source]#

Convert impedance tensor z to apparent resistivity.

Parameters:
  • z (ndarray, shape (N, M)) – Impedance tensor in \(\Omega\)

  • freq (array-like , shape (N, )) – Frequency in Hertz

Returns:

Apparent resistivity in \(\Omega.m\)

Return type:

ndarray, shape (N, M)

Example:

>>> import numpy as np
>>> from watex.utils import z2rhoa
>>> z = np.array([2 + 1j *3 ])
>>> f = np.array ([1014])
>>> z2rhoa(z, f)
... array([[1623.73691735]])
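
The value in the example above is consistent with the standard magnetotelluric relation \(\rho_a = |Z|^2 / (\mu_0 \omega)\), with \(\omega = 2\pi f\) and \(\mu_0 = 4\pi \times 10^{-7}\) H/m. The block below is a minimal sketch under that assumption, not the watex source: `z2rhoa_sketch` is a hypothetical name, and the function expects z with shape (N, M) and freq with shape (N,) as documented in the parameters.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # magnetic permeability of free space, in H/m


def z2rhoa_sketch(z, freq):
    """Hypothetical helper: rho_a = |Z|^2 / (mu0 * omega)."""
    # A 1D input of length M is treated as a single row of shape (1, M).
    z = np.atleast_2d(np.asarray(z, dtype=complex))
    # Angular frequency as a column so it broadcasts over the M columns of z.
    omega = 2.0 * np.pi * np.asarray(freq, dtype=float).reshape(-1, 1)
    return np.abs(z) ** 2 / (MU0 * omega)


rho = z2rhoa_sketch(np.array([2 + 3j]), np.array([1014]))
print(rho)
```

For the inputs of the example (z = 2 + 3j, f = 1014 Hz), this relation gives about 1623.737, which matches the documented output.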

Submodules#