watex.utils package#
The utils sub-package offers tools for data handling, parameter computation,
model estimation and evaluation, and graph visualization. The extension of the
mathematical concepts and the core of the program are implemented in the modules
exmath and coreutils respectively, whereas
the machine learning utilities and additional functionalities are provided
by mlutils and funcutils respectively.
The plot utilities in plotutils provide several plotting
tools for visualization.
- watex.utils.bi_selector(d, /, features=None, return_frames=False)[source]#
Auto-differentiates the numerical from categorical attributes.
This is useful for separating the categorical features from the numerical features, and vice versa, when there are many features. Entering features individually becomes tedious and error-prone.
- Parameters:
d (pandas dataframe) – pandas DataFrame.
features (list of str) – List of features in the dataframe columns. Raises an error if the feature(s) do not exist in the frame. Note that if features is None, it returns the categorical and numerical features instead.
return_frames (bool, default=False) – Return the difference columns (features) from the given features as a list. If set to True, returns bi-frames composed of the given features and the remaining features.
- Returns:
- Tuple (list, list) – list of features and remaining features
- Tuple (pd.DataFrame, pd.DataFrame) – frames of the features and the remaining features.
Example
>>> from watex.utils.mlutils import bi_selector
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame  # get the frame
>>> data.columns
... Index(['hole_id', 'depth_top', 'depth_bottom', 'strata_name', 'rock_name',
       'layer_thickness', 'resistivity', 'gamma_gamma', 'natural_gamma', 'sp',
       'short_distance_gamma', 'well_diameter', 'aquifer_group',
       'pumping_level', 'aquifer_thickness', 'hole_depth_before_pumping',
       'hole_depth_after_pumping', 'hole_depth_loss', 'depth_starting_pumping',
       'pumping_depth_at_the_end', 'pumping_depth', 'section_aperture', 'k',
       'kp', 'r', 'rp', 'remark'],
      dtype='object')
>>> num_features, cat_features = bi_selector(data)
>>> num_features
... ['gamma_gamma', 'depth_top', 'aquifer_thickness', 'pumping_depth_at_the_end',
     'section_aperture', 'remark', 'depth_starting_pumping',
     'hole_depth_before_pumping', 'rp', 'hole_depth_after_pumping',
     'hole_depth_loss', 'depth_bottom', 'sp', 'pumping_depth', 'kp',
     'resistivity', 'short_distance_gamma', 'r', 'natural_gamma',
     'layer_thickness', 'k', 'well_diameter']
>>> cat_features
... ['hole_id', 'strata_name', 'rock_name', 'aquifer_group', 'pumping_level']
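The numerical/categorical split described for bi_selector can be illustrated without pandas. The following is only a sketch of the idea, not the watex implementation: the helper split_num_cat, the dict-of-lists table, and the sample values are all hypothetical.

```python
from numbers import Number


def split_num_cat(table):
    """Split the column names of a dict-of-lists table into (numerical, categorical).

    Hypothetical helper: a column counts as numerical when every
    non-missing value is a number (booleans excluded).
    """
    num, cat = [], []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        is_num = bool(present) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        (num if is_num else cat).append(name)
    return num, cat


# Made-up sample values; only the column names echo the example above.
table = {
    "hole_id": ["H502", "H502", "H601"],
    "depth_top": [0.0, 159.3, 0.0],
    "rock_name": ["coal", None, "siltstone"],
    "resistivity": [15.2, 16.1, 22.0],
}
num_features, cat_features = split_num_cat(table)
print(num_features)
print(cat_features)
```

With a real dataframe the same separation is usually obtained from the column dtypes rather than by inspecting values one by one.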
- watex.utils.cattarget(arr, /, func=None, labels=None, rename_labels=None, coerce=False, order='strict')[source]#
Categorize array to hold the given identifier labels.
Classifies numerical values according to the given label values. Labels are a list of integers where each integer identifies a group of samples in the dataset.
- Parameters:
arr (array-like | pandas.Series) – Array or series containing numerical values. If non-numerical values are given, an error is raised.
func (Callable) – Function to categorize the target y.
labels (int, list of int) – If an integer value is given, it is taken as the number of categories into which to split ‘y’. For instance, labels=3 applied to the first ten numbers gives the label values [0, 1, 2]. If labels are given as a list, the items must be contained in the target ‘y’.
rename_labels (list of str) – List of strings or values to replace the integer label identifiers.
coerce (bool, default=False) – Force the new label names passed to rename_labels to appear in the target, whether or not they include some integer class labels. If coerce is True, the target array holds the dtype of new_array.
- Returns:
arr – The categorized array with unique identifier labels
- Return type:
Arraylike |pandas.Series
Examples
>>> import numpy as np
>>> from watex.utils.mlutils import cattarget
>>> def binfunc(v):
...     if v < 3: return 0
...     else: return 1
>>> arr = np.arange(10)
>>> arr
... array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> target = cattarget(arr, func=binfunc)
... array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> cattarget(arr, labels=3)
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
>>> array([2, 2, 2, 2, 1, 1, 1, 0, 0, 0])
>>> cattarget(arr, labels=3, order=None)
... array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
>>> cattarget(arr[::-1], labels=3, order=None)
... array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # reverse does not change
>>> cattarget(arr, labels=[0, 2, 4])
... array([0, 0, 0, 2, 2, 4, 4, 4, 4, 4])
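The func-based path of cattarget amounts to applying a user function to every value. The sketch below reproduces the binfunc case from the example in plain Python. The helper categorize and the label names in names are hypothetical, and the integer-labels splitting is deliberately left out because its exact rule is not described here.

```python
def categorize(values, func):
    """Apply func to every value and return the list of class labels."""
    return [func(v) for v in values]


def binfunc(v):
    # Same rule as in the example above: values below 3 go to class 0.
    return 0 if v < 3 else 1


arr = list(range(10))
target = categorize(arr, binfunc)
print(target)

# Optional renaming of the integer labels, in the spirit of rename_labels.
names = {0: "low", 1: "high"}  # hypothetical label names
renamed = [names[t] for t in target]
print(renamed)
```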
- watex.utils.check_flow_objectivity(y, /, values, classes)[source]#
Function checks the flow rate objectivity
If objective is set to flow, i.e. the prediction focuses on the flow rate, the target y needs to meet some conditions when values are passed for class categorization.
- Parameters:
values – list of values used to encode the numerical target y, for instance values=[0, 1, 2]
objective – str, relates to the flow rate prediction. Set to None for any other predictions.
prefix – the prefix to add to the class labels. For instance, if the prefix equals FR, class labels will become: [0, 1, 2] => [FR0, FR1, FR2]
classes – list of class names to replace the default FR that is used to specify the flow rate. For instance, it can be: [0, 1, 2] => [sf0, sf1, sf2]
- Returns:
(y, classes): Tuple, - y: array-like 1d of categorized y - classes: list of flow rate classes.
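The label renaming described for prefix and classes is a plain string concatenation. A minimal sketch follows; the helper name prefix_classes is hypothetical and not part of watex.

```python
def prefix_classes(values, prefix="FR"):
    """Turn integer class values into prefixed class names."""
    return [f"{prefix}{v}" for v in values]


print(prefix_classes([0, 1, 2]))               # default flow-rate prefix
print(prefix_classes([0, 1, 2], prefix="sf"))  # custom class prefix
```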
- watex.utils.classify_k(o, /, func=None, kname=None, inplace=False, string=False, default_func=False)[source]#
Categorize the permeability coefficient ‘k’
Map the continuous ‘k’ into categorical classes.
- Parameters:
o (ndarray, pd.Series or Dataframe) – data containing the continuous values of the permeability coefficient k. If data is passed as a pandas dataframe, the column containing the k-values, kname, needs to be specified.
func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead with the parameter default_func.
inplace (bool, default=False) – Modify the object in place and return None.
string (bool) – If set to True, the categories mapped from ‘k’ are prefixed by “k”. However, if a string value is given, the prefix is changed according to this label.
default_func (bool) – Set to True to enable the default function for mapping k. Note that this may not fit your own data, so it is recommended to provide your own function for mapping ‘k’. The default ‘k’ mapping is as follows:
k0 {0}: k = 0
k1 {1}: 0 < k <= .01
k2 {2}: .01 < k <= .07
k3 {3}: k > .07
- Returns:
o – returns None only if a dataframe is given and inplace is set to True, i.e. the object is modified in place.
- Return type:
None, ndarray, Series or Dataframe
Examples
>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k
>>> _, y0 = load_hlogs(as_frame=True)
>>> # let's visualize four non-zero values in y0
>>> y0.k.values[~np.isnan(y0.k)][:4]
... array([0.054, 0.054, 0.054, 0.054])
>>> classify_k(y0, kname='k', inplace=True, default_func=True)
>>> # let's see the same four values in the dataframe again
>>> y0.k.values[~np.isnan(y0.k)][:4]
... array([2., 2., 2., 2.])
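The default ‘k’ mapping listed in the parameters (k0 to k3) can be written as a small function. This is a sketch of those thresholds only: the names default_k_class and classify are hypothetical, and the handling of NaN and negative values is an assumption, not documented behaviour.

```python
import math


def default_k_class(k):
    """Map a continuous k value to the class 0..3 of the default mapping."""
    if math.isnan(k):
        return math.nan  # assumption: missing values stay missing
    if k < 0:
        raise ValueError("k is expected to be non-negative")  # assumption
    if k == 0:
        return 0
    if k <= 0.01:
        return 1
    if k <= 0.07:
        return 2
    return 3


def classify(values):
    """Classify a sequence of k values with the default mapping."""
    return [default_k_class(k) for k in values]


print(classify([0.054, 0.054, 0.054, 0.054]))
print(classify([0.0, 0.005, 0.01, 0.07, 0.2]))
```

The value 0.054 falls in the k2 interval, which is consistent with the 2. entries shown in the example above.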
- watex.utils.cleaner(data, /, columns=None, inplace=False, labels=None, func=None, mode='clean', **kws)[source]#
Sanitize data in the data or columns by dropping specified labels from rows or columns.
If data is not a pandas dataframe, it is converted to a dataframe and the index is used to drop the labels.
- Parameters:
data (pd.DataFrame or array-like 2D) – pandas DataFrame or two-dimensional NumPy array. If a 2D array is passed, it is first converted to a dataframe by default and the row index is used to drop the labels.
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
func (F, callable) – Universal function used to clean the columns. It is applied only when mode is set to clean.
inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
mode (str, default='clean') – Options or mode of operation to do on the data. It could be [‘clean’|’drop’]. If drop, it behaves like dataframe.drop of pandas.
- Returns:
The cleaned DataFrame, or the DataFrame without the removed index or column labels; None if inplace=True; or an array if data is passed as an array.
- Return type:
DataFrame, array2D or None
Find the correlated features/columns in the dataframe.
Highly correlated columns don’t add value and can throw off feature importance and the interpretation of regression coefficients. If there are correlated columns, removing either the columns from level_0 or those from level_1 from the features data is a good choice.
- Parameters:
df (Dataframe or shape (M, N) from pandas.DataFrame) – Dataframe containing M samples and N features
corr (str, ['pearson'|'spearman'|'covariance']) – Method of correlation to perform. Note that ‘pearson’ and ‘covariance’ don’t support string values. If such data is given, set corr to spearman. Default is pearson.
threshold (float, default is 0.95) – the value from which columns are considered correlated. Should not be greater than 1.
fmt (bool, default False) – format the correlated dataframe values
- Returns:
df – Dataframe with columns equal to [level_0, level_1, pearson]
- Return type:
pandas.DataFrame
Examples
>>> from watex.utils.mlutils import correlatedcolumns
>>> df_corr = correlatedcolumns(data, corr='spearman', fmt=None, threshold=.95)
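The idea of listing column pairs whose correlation reaches a threshold can be sketched with the standard library. This is an illustration under assumptions, not the watex routine: the helpers pearson and correlated_pairs, the use of the absolute value, and the sample data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(table, threshold=0.95):
    """Return (level_0, level_1, r) rows for pairs with |r| >= threshold."""
    rows = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:  # assumption: sign is ignored
            rows.append((a, b, round(r, 3)))
    return rows


# Made-up numbers: 'north' is almost a multiple of 'east'.
table = {
    "east": [1.0, 2.0, 3.0, 4.0],
    "north": [2.1, 4.0, 6.2, 7.9],
    "power": [5.0, 1.0, 4.0, 2.0],
}
pairs = correlated_pairs(table)
print(pairs)
```

Each returned row names the two columns involved, so one of them can be dropped from the feature set.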
- watex.utils.defineConductiveZone(erp, station=None, position=None, auto=False, index='py', **kws)[source]#
Define conductive zone as subset of the erp line.
Indeed the conductive zone is a specific zone expected to hold the drilling location station. If drilling location is not provided, it would be by default the very low resistivity values found in the erp line.
- Parameters:
erp (array_like,) – the array contains the apparent resistivity values
station (str or int,) – is the station position name.
position (float,) – station position value.
auto (bool) – If True, the station position should be the position of the lower resistivity value in Electrical Resistivity Profiling.
indexing (str,) –
- Returns:
- conductive zone of resistivity values
- conductive zone positioning
- station position index in the conductive zone
- station position index in the whole ERP line
- Example:
>>> import numpy as np
>>> from watex.utils.coreutils import defineConductiveZone
>>> test_array = np.random.randn(10)
>>> selected_cz, *_ = defineConductiveZone(test_array, 's20')
>>> shortPlot(test_array, selected_cz)
- watex.utils.erpSelector(f, columns=Ellipsis, force=False, utm_zone=None, epsg=None, verbose=0.0, **kws)[source]#
Read and sanitize the data collected from the survey.
Data should be an array, a dataframe, a series, or arranged in .csv or .xlsx formats. Be sure to provide the header of each column in the worksheet. If a file is given, header columns should be arranged as ['station','resistivity' ,'longitude', 'latitude']. Note that the coordinate columns (longitude and latitude) are not compulsory.
- Parameters:
f (Path-like object, ndarray, Series or Dataframe) – If a path-like object is given, only .csv and .xlsx file formats can be parsed. However, if an ndarray is given and its shape along axis 1 is greater than 4, the ndarray is shrunk.
columns (list) – list of the valuable columns. It can be used to fix the specific values along axis 1 of the array. It should contain the prefix or the whole name of each item in ['station','resistivity' ,'longitude', 'latitude'].
force (bool, default=False) – If vertical electrical sounding (VES) data is passed while ERP data is expected, setting force to True treats the VES data as ERP data and uses only the resistivity values in the VES data. This gives invalid results, especially when parameter computation is needed.
verbose (int) – Show the verbosity; outputs more messages if True.
utm_zone (string, optional) –
zone number and ‘S’ or ‘N’ e.g. ‘55S’. Default to the centre point of the provided points. If given, the longitude/latitude are computed from valid easting/northing coordinates.
New in version 0.2.1.
epsg (int) – EPSG number defining the projection (see http://spatialreference.org/ref/ for more info). Overrides utm_zone if both are provided.
kws (dict) – Additional keyword arguments of pandas pd.read_csv and pd.read_excel. Be sure to provide the right arguments when reading f. For instance, providing the sep=',' argument when the file to read is in xlsx format will raise an error, since the sep parameter is only accepted when parsing the .csv file format.
- Return type:
DataFrame with valuable column(s).
Notes
The length of acceptable columns is 4. If the number of columns is higher than 4, the data is shrunk to match the expected columns. Furthermore, if the header is not specified in f, the default column arrangement is used. Therefore, the second column is considered as the resistivity column.
Examples
>>> import numpy as np
>>> from watex.utils.coreutils import erpSelector
>>> df = erpSelector('data/erp/testsafedata.csv')
>>> df.shape
... (45, 4)
>>> list(df.columns)
... ['station','resistivity', 'longitude', 'latitude']
>>> df = erpSelector('data/erp/testunsafedata.xlsx')
>>> list(df.columns)
... ['easting', 'station', 'resistivity', 'northing']
>>> df = erpSelector(np.random.randn(7, 7))
>>> df.shape
... (7, 4)
>>> list(df.columns)
... ['station', 'resistivity', 'longitude', 'latitude']
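The columns parameter accepts either the whole name or a prefix of each expected column. One way such prefix matching could work is sketched below. The helper resolve_columns is hypothetical, and raising on an ambiguous prefix is an assumption rather than documented behaviour.

```python
EXPECTED = ["station", "resistivity", "longitude", "latitude"]


def resolve_columns(names):
    """Map each user-supplied name to the expected column it is a prefix of."""
    resolved = []
    for name in names:
        key = name.strip().lower()
        matches = [col for col in EXPECTED if key and col.startswith(key)]
        if len(matches) != 1:
            # Either nothing matches or the prefix is ambiguous (e.g. 'l').
            raise ValueError(f"cannot resolve column {name!r}")
        resolved.append(matches[0])
    return resolved


print(resolve_columns(["sta", "res", "lon", "lat"]))
```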
- watex.utils.erpSmartDetector(constr, erp, station=None, coerce=False, return_cz=False, view=False, raise_warn=True, **plot_kws)[source]#
Automatically detect the drilling location by involving the constraints observed in the survey area.
Consider the constraints on the survey area and detect the suitable drilling location. Commonly the station is not needed when using the constraints, since giving a station indicates that the user already has a reason for selecting it. However, in case of doubt, the user can set the parameter coerce to True.
- Parameters:
constr (list, dict) –
List of restricted stations. The constraints or restricted stations are the stations to ignore when selecting the best drilling location. This is useful since, in DWSC, not all stations are presumed suitable for proposing a drilling from a technical point of view. For instance, if some stations are close to a household waste site, these stations must be listed and ignored.
If constr is passed as a dictionary, the keys are the restricted stations and the values are the reasons why each station is restricted. For instance:
constr = {"s02": "station close to the household waste", "S25": "station is located in a marsh area."}
erp (array-like 1d) – DC profiling ERP resistivity values
station (str, optional) – The station of the presumed location for drilling operations. Commonly the station is not needed when using the constraints. If the station is given while coerce=False, an error is raised to warn the user. To force considering the station in the auto-detection, coerce must be set to True.
coerce (bool, default=False) – Allow the station to be considered in the auto-detection.
raise_warn (bool, default=True) – Warn the user whether a suitable location is found or not. Returns None otherwise.
view (bool, default=False) – Plot the conductive zone and restricted stations.
plot_kws (dict) – Additional plotting keyword arguments passed to plotAnomaly().
- Returns:
(station |None) or cz, cs – station for the drilling operations detected automatically. If no station is detected, None is returned. If return_cz is True, the station and the conductive zone are returned as well as the restricted station position number.
- Return type:
str,
See also
watex.plotAnomaly – Plot DC profiling ERP and conductive zone.
Examples
>>> import numpy as np
>>> from watex.datasets import make_erp
>>> from watex.utils.coreutils import erpSmartDetector
>>> resistivity = make_erp(n_stations=50, as_frame=True, seed=125).resistivity
>>> # get the min value of the resistivity
>>> resmin_index = np.where(resistivity == resistivity.min())
42
>>> erpSmartDetector(constr=['s42'], erp=resistivity)
'S13'
>>> # S42 is rejected and another zone presumed to be better is selected.
>>> constraints = {"S00": "Marsh area. ",
...                "S10": " Municipality square, no authorization to make drill",
...                "S29": "Heritage site",
...                "S46": "Household waste site",
...                "S42": "Household waste site"}
>>> erpSmartDetector(constraints, resistivity)
'S16'
>>> erpSmartDetector(['s12', 's40'], resistivity)
'S29'
>>> # station 42, close to s40, is rejected too.
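The core idea of ignoring restricted stations can be shown with a deliberately simplified stand-in: pick the lowest-resistivity station that is not in the constraints. The real detector evidently does more than this (in the example above, stations close to a restricted one are rejected too), so the helper pick_station and its rule are hypothetical.

```python
def pick_station(resistivity, constraints):
    """Return the lowest-resistivity station name that is not restricted.

    constraints may be a list of names or a dict keyed by station name;
    names such as 's42' and 'S42' are treated alike.
    """
    restricted = {int(str(name).lower().lstrip("s")) for name in constraints}
    candidates = [i for i in range(len(resistivity)) if i not in restricted]
    if not candidates:
        return None  # no suitable station left
    best = min(candidates, key=lambda i: resistivity[i])
    return f"S{best:02d}"


values = [120.0, 80.0, 30.0, 95.0, 45.0]  # made-up resistivity values
print(pick_station(values, []))
print(pick_station(values, ["s02"]))
print(pick_station(values, {"S02": "marsh area", "s04": "household waste"}))
```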
- watex.utils.evalModel(model, X, y, Xt, yt=None, scorer='accuracy', eval=False, **kws)[source]#
Evaluate a model and quickly test its score with metric scorers.
- Parameters:
model (Callable, {'preprocessor + estimator } | estimator,) – the preprocessor is a list of steps for data handling, all encapsulated in the pipeline. model can also be a simple estimator with fit,
X (N-d array, shape (N, M)) –
- the training set composed of N-columns and the M-samples. The
feature set excludes the target y.
y (arraylike , shape (M)) – the target is composed of M-examples in supervised learning.
Xt (N-d array, shape (N, M)) – test set array composed of N-columns and the M-samples. The feature set excludes the target y.
yt (arraylike , shape (M)) – test label (or test target) composed of M-examples in supervised learning.
scorer (str, Callable,) –
a scorer is a metric function for model evaluation. If given as a string, it should be the prefix of one of the following metrics:
"classification_report" -> for classification_report,
"precision_recall" -> for precision_recall_curve,
"confusion_matrix" -> for confusion_matrix,
"precision" -> for precision_score,
"accuracy" -> for accuracy_score,
"mse" -> for mean_squared_error,
"recall" -> for recall_score,
"auc" -> for roc_auc_score,
"roc" -> for roc_curve,
"f1" -> for f1_score.
Other string prefix values raise an error.
kws (dict) – Additional keyword arguments for the scikit-learn metric function.
- Returns:
Tuple – the model score or the predicted y if predict is set to True.
- Return type:
(score, ypred)
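The scorer strings listed above map to scikit-learn metric names. A prefix lookup of that kind could be sketched as follows, using only the names from the list. The helper resolve_scorer and its tie-breaking rule (an exact key wins over a longer key sharing the prefix) are assumptions, not the evalModel implementation.

```python
SCORERS = {
    "classification_report": "classification_report",
    "precision_recall": "precision_recall_curve",
    "confusion_matrix": "confusion_matrix",
    "precision": "precision_score",
    "accuracy": "accuracy_score",
    "mse": "mean_squared_error",
    "recall": "recall_score",
    "auc": "roc_auc_score",
    "roc": "roc_curve",
    "f1": "f1_score",
}


def resolve_scorer(name):
    """Return the metric name for a scorer string or an unambiguous prefix."""
    key = str(name).strip().lower()
    if key in SCORERS:  # exact key first, so 'precision' is not ambiguous
        return SCORERS[key]
    matches = [k for k in SCORERS if key and k.startswith(key)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous scorer {name!r}")
    return SCORERS[matches[0]]


print(resolve_scorer("accuracy"))
print(resolve_scorer("conf"))
```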
- watex.utils.findCatandNumFeatures(df=None, features=None, return_frames=False)[source]#
Retrieve the categorical or numerical features from the whole feature set of the dataset.
- Parameters:
df (Dataframe) – Dataframe with columns composing the features
features (list of str) – list of the column names. If the dataframe is big, only the required features can be set. If features are provided, the frame is shrunk to the given features before the search for numerical and categorical features. Note that an error is raised if any of the features is missing from the dataframe.
return_frames (bool,) – if set to
True, it returns two separated dataframes (cat & num) otherwise, it only returns the cat and num columns names.
- Returns:
Tuple
- Return type:
cat_features and num_features names or frames
Examples
>>> from watex.datasets import fetch_data
>>> from watex.tools import findCatandNumFeatures
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> cat, num = findCatandNumFeatures(data)
>>> cat, num
... (['type', 'geol', 'shape', 'name', 'flow'],
     ['num', 'east', 'north', 'power', 'magnitude', 'sfi', 'ohmS', 'lwi'])
>>> cat, num = findCatandNumFeatures(data, features=['geol', 'ohmS', 'sfi'])
... (['geol'], ['ohmS', 'sfi'])
- watex.utils.find_aquifer_groups(arr_k, /, arr_aq=None, kname=None, aqname=None, subjectivity=False, default_arr=None, keep_label_0=False, method='naive')[source]#
Fit the groups of aquifers and find the representative of each true label in array ‘k’ in the aquifer group array.
The idea is to find the aquifer group that best fits each true label ‘X’ in ‘y_true’.
‘arr_k’ and ‘arr_aq’ must contain class labels, not continuous values.
- Parameters:
arr_k (array_like, pandas series or dataframe) – array-like that contains the permeability coefficients ‘k’. If a dataframe is supplied, the permeability coefficient column name ‘kname’ must be specified.
arr_aq (array-like, pandas series or dataframe) – array-like that contains the aquifer groups. If NaN values exist in the aquifer groups, it is suggested to impute them before feeding the data to the algorithm. Missing values are not allowed. If a dataframe is supplied, the aquifer group column name ‘aqname’ must be specified.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ in a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
aqname (str, optional) – Name of the aquifer group column. aqname allows retrieving the aquifer group arr_aq value in a specific dataframe. Commonly aqname needs to be supplied when a dataframe is passed as a positional or keyword argument.
subjectivity (bool, default=False) – Considers each class label as a naive group of aquifer. Subjectivity occurs when a group of aquifer is not found in the data. Therefore, each class label is considered as a naive group of aquifer. It is strongly recommended to pass a default group to the parameter default_arr to substitute for the group of aquifers, for more practical reasons. For instance, it can be the layer collected at a specific depth, like the ‘strata’ columns.
default_arr (array-like, pd.Series) – Array used as default to substitute for the group of aquifers if the latter is missing. This is a heuristic option because it might lead to breaking code or invalid results.
keep_label_0 (bool, default=False) – The prediction already includes the label 0. However, including 0 in the predicted label refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore ‘0’ can be excluded, since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. Therefore, to avoid any confusion, ‘0’ is altered to ‘1’, so the value +1 is used to move forward all class labels, thereby excluding the ‘0’ label. To force including 0 in the label, set keep_label_0 to True.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy used to compute the representativity of a label in the predicted array ‘arr_aq’. It can also be ‘strict’. Indeed:
naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘k’. It does not take into account the occurrence of other existing labels. This is useful for unbalanced class labels in arr_k.
strict computes the importance of the label by the number of occurrences in the whole valid arr_k, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in arr_k.
- Returns:
_Group – Use attribute .groups to find the group values.
- Return type:
_Group class object
Examples
(1) Use the real aquifer group collected in the area
>>> from watex.utils import naive_imputer, read_data, reshape
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import classify_k, find_aquifer_groups
>>> b = load_hlogs()  # just taking the target names
>>> data = read_data('data/boreholes/hf.csv')  # read complete data
>>> y = data[b.target_names]
>>> # impute the missing values found in aquifer group columns
>>> # reshape 1d array along axis 0 for imputation
>>> agroup_imputed = naive_imputer(reshape(y.aquifer_group, axis=0),
...                                strategy='most_frequent')
>>> # reshape back to array_like 1d
>>> y.aquifer_group = reshape(agroup_imputed)
>>> # categorize the 'k' continuous value in 'y.k' using the default
>>> # 'k' mapping func
>>> y.k = classify_k(y.k, default_func=True)
>>> # get the group obj
>>> group_obj = find_aquifer_groups(y.k, y.aquifer_group)
>>> group_obj
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141 %',
                          [('Groups', {'V': 0.32, 'IV': 0.266, 'II': 0.236, 'III': 0.158, 'IV&V': 0.01, 'II&III': 0.005, 'III&IV': 0.005}),
                           ('Representativity', ( 'V', 0.32)),
                           ('Similarity', 'V')])],
       Label=[' 2 ',
             Preponderance( rate = ' 19.11 %',
                          [('Groups', {'III': 0.274, 'II': 0.26, 'V': 0.26, 'IV': 0.178, 'III&IV': 0.027}),
                           ('Representativity', ( 'III', 0.27)),
                           ('Similarity', 'III')])],
       Label=[' 3 ',
             Preponderance( rate = '27.749 %',
                          [('Groups', {'V': 0.443, 'IV': 0.311, 'III': 0.245}),
                           ('Representativity', ( 'V', 0.44)),
                           ('Similarity', 'V')])],
       )
(2) Use the subjectivity and set the strata columns as default array
>>> find_aquifer_groups(y.k, subjectivity=True, default_arr=X.strata_name)
_Group(Label=[' 1 ',
             Preponderance( rate = '53.141 %',
                          [('Groups', {'siltstone': 0.35, 'coal': 0.227, 'fine-grained sandstone': 0.158, 'medium-grained sandstone': 0.094, 'mudstone': 0.079, 'carbonaceous mudstone': 0.054, 'coarse-grained sandstone': 0.03, 'coarse': 0.01}),
                           ('Representativity', ( 'siltstone', 0.35)),
                           ('Similarity', 'siltstone')])],
       Label=[' 2 ',
             Preponderance( rate = ' 19.11 %',
                          [('Groups', {'mudstone': 0.288, 'siltstone': 0.205, 'coal': 0.192, 'coarse-grained sandstone': 0.137, 'fine-grained sandstone': 0.137, 'carbonaceous mudstone': 0.027, 'medium-grained sandstone': 0.014}),
                           ('Representativity', ( 'mudstone', 0.29)),
                           ('Similarity', 'mudstone')])],
       Label=[' 3 ',
             Preponderance( rate = '27.749 %',
                          [('Groups', {'mudstone': 0.245, 'coal': 0.226, 'siltstone': 0.217, 'fine-grained sandstone': 0.123, 'carbonaceous mudstone': 0.066, 'medium-grained sandstone': 0.066, 'coarse-grained sandstone': 0.057}),
                           ('Representativity', ( 'mudstone', 0.24)),
                           ('Similarity', 'mudstone')])],
       )
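The difference between the naive and strict strategies comes down to the denominator used for each rate. The sketch below reflects one reading of the parameter description, not the watex code: the helper group_rates and the toy labels are hypothetical.

```python
from collections import Counter


def group_rates(labels, groups, method="naive"):
    """Rate of each aquifer group per 'k' label.

    naive  -> share of the group among the samples carrying that label
    strict -> share of the group among all valid samples
    """
    pairs = [(lab, grp) for lab, grp in zip(labels, groups) if lab is not None]
    per_label = Counter(lab for lab, _ in pairs)
    rates = {}
    for (label, group), n in Counter(pairs).items():
        denom = per_label[label] if method == "naive" else len(pairs)
        rates.setdefault(label, {})[group] = round(n / denom, 3)
    return rates


# Toy data: ten samples with a categorized k label and an aquifer group.
labels = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
groups = ["V", "V", "IV", "II", "III", "II", "V", "V", "V", "IV"]

naive = group_rates(labels, groups)
strict = group_rates(labels, groups, method="strict")
print(naive[1], strict[1])

# The representative group of a label is the one with the highest rate.
representative = max(naive[3], key=naive[3].get)
print(representative)
```

Under this reading the naive rates of one label sum to 1, which matches the 'Groups' dictionaries printed in the examples above.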
- watex.utils.find_similar_labels(y_true, y_pred, *, categorize_k=False, threshold=None, func=None, keep_label_0=False, method='naive', return_groups=False, **kwd)[source]#
Find similarities between y_true and y_pred and return their rates.
- Parameters:
y_true (array-like 1d or pandas.Series) – Array containing the true labels of ‘k’
y_pred (array_like, or pandas.Series) – array containing the predicted naive group of aquifers (NGA)
categorize_k (bool) – If set to True, the user needs to provide a function func to map or categorize the permeability coefficient ‘k’ into integer labels.
func (callable) – Function to specifically map the permeability coefficient column in the dataframe or series. If not given, the default function can be enabled instead from param default_func.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to the one in the NGA labels ‘y_pred’. The default is ‘None’, which means no rule is applied and the label with the highest preponderance or occurrence in the data compared to other labels is considered the most representative and similar. Setting the rule by fixing the threshold is recommended, especially for a huge dataset.
keep_label_0 (bool, default=False) –
Force including 0 in the predicted label if keep_label_0 is set to True. Mostly label ‘0’ refers to ‘k=0’, i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient ‘k’. Here ‘k=0’ is considered an undefined permeability coefficient. Therefore ‘0’ can be excluded, since it can also be considered a missing ‘k’-value. If a predicted ‘0’ is in the target, it should mean a missing ‘k’-value rather than a concrete label. Therefore, to avoid any confusion, ‘0’ is removed by default in the ‘k’ categorization. However, when the prediction ‘y_pred’ is made with an unsupervised method, the prediction straightforwardly includes ‘0’, i.e. ‘k=0’, as a first class. So the value +1 is used to move forward all class labels, thereby excluding the ‘0’ label. To force including 0 in the label, set keep_label_0 to True.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy used to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’. Indeed:
naive computes the importance of the label by the number of its occurrences for this specific label in the array ‘y_true’. It does not take into account the occurrence of other existing labels. This is useful for unbalanced class labels in y_true.
strict computes the importance of the label by the number of occurrences in the whole valid y_true, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
return_groups (bool, default=False) – Returns label groups and their value counts in the predicted labels y_pred where ‘k’ values are not missing.
- Returns:
g.similarity – Tuple of labels found that are considered similar in the predicted labels.
g.group – Tuple of groups that have their similarity in the true labels.
Example
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import find_similar_labels, classify_k
>>> data = read_data('data/boreholes/hf.csv')
>>> ymap = classify_k(data.k, default_func=True)
>>> # Note that for the demo we use the group of aquifer columns; however,
>>> # in a practical example, y_pred must be predicted NGA labels. This
>>> # is possible using the function <predict_NGA_labels>
>>> sim = find_similar_labels(y_true=ymap, y_pred=data.aquifer_group)
>>> sim
... ((1, 'V'), (2, 'III'), (3, 'V'))
>>> group = find_similar_labels(ymap, data.aquifer_group, return_groups=True)
>>> group
... ((1, {'V': 0.17, 'IV': 0.141, 'II': 0.126, 'III': 0.084, 'IV&V': 0.005, 'II&III': 0.003, 'III&IV': 0.003}),
     (2, {'III': 0.052, 'II': 0.05, 'V': 0.05, 'IV': 0.034, 'III&IV': 0.005}),
     (3, {'V': 0.123, 'IV': 0.086, 'III': 0.068}))
>>> find_similar_labels(y_true=ymap, y_pred=data.aquifer_group, threshold=0.15)
... [(1, 'V')]
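The effect of threshold can be traced by hand from the group rates printed in the example. The sketch below reuses those numbers. The helper similar_labels is hypothetical, and the inclusive comparison (>=) is an assumption.

```python
def similar_labels(groups, threshold=None):
    """Keep, for each label, its top-rated group if it passes the threshold."""
    out = []
    for label, rates in groups:
        best = max(rates, key=rates.get)  # most preponderant group
        if threshold is None or rates[best] >= threshold:
            out.append((label, best))
    return out


# Rates copied from the return_groups=True output shown above.
groups = (
    (1, {"V": 0.17, "IV": 0.141, "II": 0.126, "III": 0.084,
         "IV&V": 0.005, "II&III": 0.003, "III&IV": 0.003}),
    (2, {"III": 0.052, "II": 0.05, "V": 0.05, "IV": 0.034, "III&IV": 0.005}),
    (3, {"V": 0.123, "IV": 0.086, "III": 0.068}),
)
print(similar_labels(groups))
print(similar_labels(groups, threshold=0.15))
```

Only label 1 has a top rate (0.17) at or above 0.15, which agrees with the thresholded result in the example.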
- watex.utils.fittensor(refreq, compfreq, z, fill_value=nan)[source]#
Fit each tensor component to the complete frequency range.
The complete frequency is the frequency array with clean data. It contains the whole frequency range of the site. During the survey, missing frequencies lead to missing tensor data. The function therefore indicates where the tensor data is missing and fits the tensor to the reference frequencies.
- Parameters:
refreq (ArrayLike) – Reference frequency - Should be the complete frequency collected in the field.
compfreq (array-like) – The specific frequencies collected at the site. Sometimes, due to interferences, the frequencies at an individual site can differ from the complete ones. However, the frequency values at the individual site must be included in the complete frequency refreq.
z (array-like) – should be the tensor value (real or imaginary part) at the component xx, xy, yx, yy.
fill_value (float, default='NaN') – Value to replace the missing data in tensors.
- Returns:
Z – new Z filled by invalid value NaN where the frequency is missing in the data.
- Return type:
Arraylike
Examples
>>> import numpy as np
>>> from watex.utils.exmath import fittensor
>>> refreq = np.linspace(7e7, 1e0, 20)  # 20 frequencies as reference
>>> freq_ = np.hstack((refreq.copy()[:7], refreq.copy()[12:]))
>>> z = np.random.randn(len(freq_)) * 10  # assume length of freq as
...                     # the same like the tensor Z value
>>> zn = fittensor(refreq, freq_, z)
>>> z  # some frequency values are missing but not visible.
...array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,   9.85956253,
         3.96269863, -10.38325124,  -4.29739755,  -8.2591703 ,
        21.7930423 ,   0.21709129,   4.07815217])
>>> # zn shows where the frequencies are missing
>>> # a NaN value means a missing value in tensor Z at that frequency
>>> zn
... array([-23.23448367,   2.93185982,  10.81194723, -12.46326732,
         1.57312908,   7.23926576, -14.65645799,          nan,
                nan,          nan,          nan,          nan,
         9.85956253,   3.96269863, -10.38325124,  -4.29739755,
        -8.2591703 ,  21.7930423 ,   0.21709129,   4.07815217])
>>> # let's visualize where the frequency values are missing in tensor Z
>>> refreq
... array([7.00000000e+07, 6.63157895e+07, 6.26315791e+07, 5.89473686e+07,
       5.52631581e+07, 5.15789476e+07, 4.78947372e+07, 4.42105267e+07*,
       4.05263162e+07*, 3.68421057e+07*, 3.31578953e+07*, 2.94736848e+07*,
       2.57894743e+07, 2.21052638e+07, 1.84210534e+07, 1.47368429e+07,
       1.10526324e+07, 7.36842195e+06, 3.68421147e+06, 1.00000000e+00])
>>> refreq[np.isnan(zn)]  # we can see the missing values between [7:12](*) in refreq
... array([44210526.68421052, 40526316.21052632, 36842105.73684211,
       33157895.2631579 , 29473684.78947368])
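The fitting step places each tensor value at the position of its frequency in the reference array and fills the gaps. A plain-Python sketch of that idea follows. The helper fit_to_reference is hypothetical, it matches frequencies by exact equality (an assumption), and the sample values are made up.

```python
import math


def fit_to_reference(refreq, compfreq, z, fill_value=math.nan):
    """Place each z value at the position of its frequency in refreq."""
    if len(compfreq) != len(z):
        raise ValueError("compfreq and z must have the same length")
    unknown = set(compfreq) - set(refreq)
    if unknown:
        # The site frequencies must be included in the reference frequencies.
        raise ValueError(f"frequencies not in refreq: {sorted(unknown)}")
    lookup = dict(zip(compfreq, z))
    return [lookup.get(f, fill_value) for f in refreq]


refreq = [1000.0, 500.0, 250.0, 125.0, 62.5]  # complete frequency range
compfreq = [1000.0, 500.0, 62.5]              # frequencies kept at the site
z = [3.2, -1.5, 0.7]                          # one tensor component
zn = fit_to_reference(refreq, compfreq, z)
print(zn)
```

The output has the length of refreq, with the fill value at the two frequencies that the site is missing.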
- watex.utils.get2dtensor(z_or_edis_obj_list, /, tensor='z', component='xy', kind='modulus', return_freqs=False, **kws)[source]#
Build a two-dimensional array from a collection of impedance tensors Z.
Outputs the 2D resistivity, phase-error and tensor matrices from a collection of EDI objects.
The matrix shape is the number of frequencies by the number of sites. The function checks whether the data are available at all frequencies. Missing values should be filled with NaN. Note that each element of z has dimension (nfreq, 2, 2), with the components indexed as:
xx ( 0, 0) ------- xy ( 0, 1)
yx ( 1, 0) ------- yy ( 1, 1)
- Parameters:
z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='z') – Tensor name. Can be [ resistivity|phase|z|frequency]
component (str, default='xy' (TE mode)) – EM mode. Can be [‘xx’, ‘xy’, ‘yx’, ‘yy’]
out (str) – Kind of data to output. Be sure to provide the component to retrieve the attribute from the collection object. Except for the error and frequency attributes, an attribute given without its component raises an error. For instance, use resxy for the xy component. Default is resxy.
kind (str, default='modulus') – Controls the form of the tensor output. Note that the tensor is a complex ndarray of shape (nfreq, 2, 2). If set to modulus, the modulus of the complex tensor is returned. If real or imag, only that part is returned. Default is modulus.
return_freqs (bool, default=False) – If True, also returns the full frequency range.
kws (dict) – Additional keyword arguments from :meth:`~EM.getfullfrequency`.
- Returns:
mat2d – Matrix of shape (number of frequencies, number of EDI objects collected), where the number of EDI objects corresponds to the number of stations/sites.
- Return type:
arraylike2d
Examples
>>> from watex.datasets import load_huayuan
>>> from watex.methods import get2dtensor
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> data = box.data
>>> phase_yx = get2dtensor ( data, tensor ='phase', component ='yx')
>>> phase_yx.shape
(56, 7)
>>> phase_yx [0, :]
array([ nan, nan, nan, nan, 18.73244951,
       35.00516522, 59.91093054])
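The stacking of one tensor component into a (nfreq, nsites) matrix can be sketched as follows. This is only an illustration with raw NumPy arrays of shape (nfreq, 2, 2) instead of EDI objects; the helper `tensor2d_sketch` is hypothetical and assumes every site is already aligned on the same frequency axis, with NaN where data are missing.

```python
import numpy as np

# Component positions in each (2, 2) tensor, as listed in the description above.
COMPONENT_INDEX = {"xx": (0, 0), "xy": (0, 1), "yx": (1, 0), "yy": (1, 1)}

def tensor2d_sketch(z_list, component="xy", kind="modulus"):
    """Stack one component of each site tensor into a (nfreq, nsites) matrix."""
    i, j = COMPONENT_INDEX[component]
    mat2d = np.column_stack([np.asarray(z)[:, i, j] for z in z_list])
    if kind == "modulus":
        return np.abs(mat2d)
    if kind == "real":
        return mat2d.real
    if kind == "imag":
        return mat2d.imag
    return mat2d  # any other value keeps the complex matrix

site1 = np.zeros((3, 2, 2), dtype=complex)
site1[:, 0, 1] = 3 + 4j
site2 = np.ones((3, 2, 2), dtype=complex)
site2[0, 0, 1] = np.nan            # missing frequency at the second site
mat = tensor2d_sketch([site1, site2], component="xy")
```

Each column of `mat` is one site and each row one frequency, so a missing frequency shows up as NaN in the matching cell.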
- watex.utils.getGlobalScore(cvres)[source]#
Retrieve the global mean and standard deviation score from the cross validation containers.
- Parameters:
cvres (dict of Array-like, Shape (N, )) – Cross-validation results obtained after training the models, where N is the number of parameter settings evaluated.
- Returns:
tuple (‘mean_test_scores’, ‘std_test_scores’)
Mean score on the test sets and standard deviation score.
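As a rough sketch of the idea, assuming cvres follows the layout of scikit-learn's `cv_results_` (keys `mean_test_score` and `std_test_score`, which is an assumption here) and that the global value is the plain average over the N candidates:

```python
import numpy as np

def global_score_sketch(cvres):
    """Average the per-candidate mean and std test scores (assumed key names)."""
    mean_test = np.nanmean(np.asarray(cvres["mean_test_score"], dtype=float))
    std_test = np.nanmean(np.asarray(cvres["std_test_score"], dtype=float))
    return float(mean_test), float(std_test)

cvres = {
    "mean_test_score": [0.80, 0.90, 0.70],
    "std_test_score": [0.02, 0.04, 0.06],
}
mean_score, std_score = global_score_sketch(cvres)
```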
- watex.utils.get_aquifer_section(arr_k, /, zname=None, kname=None, z=None, return_index=False, return_sections=True)[source]#
Detect a single aquifer section (upper and lower) in depth.
This is a useful step for computing the thickness of the aquifer.
- Parameters:
arr_k (ndarray or dataframe) – Data that mainly contains the aquifer values. It can also contain the depth values. If the depth is included in arr_k, zname must be supplied to recover the depth.
zname (str, int) – Name of the depth column. zname is used to retrieve the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer must not exceed the dataframe size along axis 1. zname commonly needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname is used to retrieve the permeability coefficient ‘k’ from a dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer must not exceed the dataframe size along axis 1. kname commonly needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depths or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of the aquifer found in the dataframe arr_k.
return_sections (bool, default=True) – Returns the sections (upper and lower) of the aquifers.
- Returns:
up, low –
(upix, lowix): Tuple of indexes of lower and upper sections
(up, low): Tuple of aquifer sections (upper and lower)
(upix, lowix), (up, low): positions and section values of the aquifers if return_index and return_sections are True.
- Return type:
list of upper and lower section values of aquifer.
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_section
>>> data = load_hlogs ().frame # return all data including the 'depth' values
>>> get_aquifer_section (data , zname ='depth', kname ='k')
... [197.12, 369.71] # section starts from 197.12 -> 369.71 m
>>> get_aquifer_section (data , zname ='depth', kname ='k', return_index=True)
... ([16, 29], [197.12, 369.71]) # upper and lower-> position 16 and 29.
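One way to picture the detection is the sketch below. It assumes the aquifer is marked by the rows where the permeability coefficient is not missing, which may differ from the rule the library actually applies; the helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def aquifer_section_sketch(df, zname, kname):
    """Return ([upix, lowix], [up, low]) for the rows where k is measured."""
    # Assumption: rows with a non-missing k value belong to the aquifer.
    valid = np.flatnonzero(df[kname].notna().to_numpy())
    if valid.size == 0:
        raise ValueError("no valid permeability value found")
    upix, lowix = int(valid[0]), int(valid[-1])
    depth = df[zname].to_numpy()
    return [upix, lowix], [float(depth[upix]), float(depth[lowix])]

logs = pd.DataFrame({
    "depth": [10.0, 20.0, 30.0, 40.0, 50.0],
    "k": [np.nan, 1.0, 2.0, 1.0, np.nan],
})
indexes, section = aquifer_section_sketch(logs, zname="depth", kname="k")
```

The thickness of the aquifer then follows as `section[1] - section[0]`.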
- watex.utils.get_aquifer_sections(*data, zname, kname, return_index=False, return_data=False, error='ignore', **kws)[source]#
Get the section of each aquifer from multiple dataframes.
The unique section ‘upper’ and ‘lower’ is the valid range of the whole data to consider as valid data. The indexes are needed to shrink the data of the whole boreholes. Mostly, the data from the section are considered the valid data, i.e. the predictor Xr. Data outside the range of the aquifer section can be discarded or compressed to top Xr.
Returns valid section indexes if ‘return_index’ is set to True.
- Parameters:
data (list of pandas dataframe) – Data that mainly contains the aquifer values. The name of the depth column zname and the name of the permeability column kname must be specified.
zname (str, int) – Name of the depth column. zname is used to retrieve the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer must not exceed the dataframe size along axis 1. zname commonly needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname is used to retrieve the permeability coefficient ‘k’ from a dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer must not exceed the dataframe size along axis 1. kname commonly needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depths or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of each aquifer found in each dataframe.
error (str, default='ignore') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed if invalid data is found. Any other value of error will set error to raise.
return_data (bool, default=False) – Return valid data. It is useful when ‘error’ is set to ‘ignore’ to collect the valid data.
kws (dict) – Additional keyword arguments passed to get_aquifer_sections().
- Returns:
up, low –
(upix, lowix): Tuple of indexes of lower and upper sections
(up, low): Tuple of aquifer sections (upper and lower)
(upix, lowix), (up, low): positions and section values of the aquifers if return_index and return_sections are True.
- Return type:
list of upper and lower section values of aquifer.
See also
watex.utils.hydroutils.get_aquifer_sections: compute multiple aquifer sections
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_aquifer_sections
>>> data = load_hlogs ().frame
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' )
... [[197.12, 369.71], [197.12, 369.71]]
>>> get_aquifer_sections (data, data , zname ='depth', kname ='k' , return_index =True )
... [[16, 29], [16, 29]]
- watex.utils.get_bearing(latlon1, latlon2, to_deg=True)[source]#
Calculate the bearing between two points.
A bearing can be defined as the direction of one point relative to another point, usually given as an angle measured clockwise from north. The formula of the bearing \(\beta\) between two points 1(lat1, lon1) and 2(lat2, lon2) is expressed as below:
\[\beta = \operatorname{atan2}\left(\sin(y_2-y_1)\cos(x_2),\ \cos(x_1)\sin(x_2) - \sin(x_1)\cos(x_2)\cos(y_2-y_1)\right)\]
where \(x_1, x_2\) are the latitudes and \(y_1, y_2\) the longitudes of points 1 and 2.
- Parameters:
latlon1 (Tuple ( latitude, longitude)) – Latitude and longitude coordinates of the first point, in degrees.
latlon2 (Tuple ( latitude, longitude)) – Latitude and longitude coordinates of the second point, in degrees.
to_deg (bool, default=True) – Convert the bearing from radians to degrees.
- Returns:
b – Value of the bearing, in degrees (default).
See More
For more details, see https://mapscaping.com/how-to-calculate-bearing-between-two-coordinates/
Examples
>>> from watex.utils import get_bearing
>>> latlon1 = (28.41196763902007, 109.3328724432221) # (lat, lon) point 1
>>> latlon2= (28.38756530909265, 109.36931920880758) # (lat, lon) point 2
>>> get_bearing (latlon1, latlon2 )
127.26739270447973 # in degree
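The bearing formula above translates directly into standard-library Python. The sketch below is an independent illustration (the function name `bearing_sketch` is hypothetical), and wrapping the angle into [0, 360) is an assumption that may not match the library's behaviour.

```python
import math

def bearing_sketch(latlon1, latlon2, to_deg=True):
    """Bearing from point 1 to point 2, both given as (lat, lon) in degrees."""
    x1, y1 = (math.radians(v) for v in latlon1)  # x: latitude, y: longitude
    x2, y2 = (math.radians(v) for v in latlon2)
    b = math.atan2(
        math.sin(y2 - y1) * math.cos(x2),
        math.cos(x1) * math.sin(x2)
        - math.sin(x1) * math.cos(x2) * math.cos(y2 - y1),
    )
    # Assumption: wrap the angle to [0, 360) so it reads clockwise from north.
    return math.degrees(b) % 360.0 if to_deg else b

north = bearing_sketch((0.0, 0.0), (1.0, 0.0))
east = bearing_sketch((0.0, 0.0), (0.0, 1.0))
b = bearing_sketch((28.41196763902007, 109.3328724432221),
                   (28.38756530909265, 109.36931920880758))
```

A point due north gives 0 degrees and a point due east gives 90 degrees. For the two coordinates of the example above, the sketch gives about 127.27 degrees.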
- watex.utils.get_compressed_vector(d, /, sname, stratum=None, strategy='average', as_frame=False, random_state=None)[source]#
Compresses the base stratum data into a single vector composed of all the feature names in the targeted data d.
- Parameters:
d (pandas DataFrame) – Valid data containing the strata. If a dataframe is passed, ‘sname’ is needed to fetch the strata values.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of a valid layer/rock in the array/Series of strata.
stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered. It returns the same stratum; however, it can give the rate and occurrence of this stratum if return_rate or return_counts is set to True.
strategy (str, default='average' or 'mean') – Strategy used to select or compute the numerical data into a single series. It can also be ‘naive’. In that case, a single series is randomly picked from the base stratum data.
as_frame (bool, default=False) – Returns the compressed vector as a dataframe rather than keeping it as a series.
random_state (int, optional) – State for randomly selecting a compressed vector when naive is passed as strategy.
- Returns:
ms – Returns a compressed vector as a pandas series composed of all the features. Note that the vector here does not refer to a mathematical vector composed of numerical values only. A compressed vector here is a series that results from averaging the numerical features of the base stratum and including its corresponding categorical values. Note that ms can therefore contain categorical values, and it has the same number of features as the original frame d.
- Return type:
pandas series/dataframe
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_compressed_vector
>>> data = load_hlogs().frame # get only the frame
>>> get_compressed_vector (data, sname='strata_name')[:4]
... hole_number H502
strata_name siltstone
aquifer_group II
pumping_level ZFSAII
dtype: object
>>> get_compressed_vector (data, sname='strata_name', as_frame=True )
... hole_number strata_name aquifer_group ... r rp remark
0 H502 siltstone II ... 41.7075 59.23 NaN
[1 rows x 23 columns]
>>> get_compressed_vector (data, sname='strata_name', strategy='naive')
... hole_number H502
depth_top 379.15
depth_bottom 379.7
strata_name siltstone
Name: 39, dtype: object
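The ‘average’ strategy can be pictured with the pandas sketch below. It is not the library's code: the helper name is hypothetical, and it assumes that the base stratum is the most frequent one when none is given and that each categorical column keeps its most frequent value.

```python
import pandas as pd

def compressed_vector_sketch(d, sname, stratum=None):
    """Collapse the rows of the base stratum into one series (illustrative)."""
    if stratum is None:
        # Assumption: the base stratum is the most frequent stratum.
        stratum = d[sname].value_counts().idxmax()
    base = d[d[sname] == stratum]
    numeric_cols = set(base.select_dtypes(include="number").columns)
    values = {
        # Numerical features are averaged; categorical ones keep their mode.
        col: base[col].mean() if col in numeric_cols else base[col].mode().iloc[0]
        for col in d.columns
    }
    return pd.Series(values)

frame = pd.DataFrame({
    "hole_id": ["H1", "H1", "H1"],
    "strata_name": ["siltstone", "coal", "siltstone"],
    "depth_top": [1.0, 2.0, 3.0],
})
ms = compressed_vector_sketch(frame, sname="strata_name")
```

The resulting series keeps the same features as the input frame, mixing averaged numerical values with categorical ones.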
- watex.utils.get_distance(x, y, *, return_mean_dist=False, is_latlon=False, **kws)[source]#
Compute distance between points
- Parameters:
x (ArrayLike 1d,) – One-dimensional array. x can be considered as the abscissas of the landmarks.
y (ArrayLike 1d,) – One-dimensional array. y can be considered as the ordinates of the landmarks.
return_mean_dist (bool, default=False) – Returns the average value of the distances between the different points.
is_latlon (bool, default=False) – Convert the x and y latitude and longitude coordinate values into UTM before computing the distance. x and y should be considered as easting and northing respectively.
kws (dict) – Keyword arguments passed to watex.site.Location.to_utm_in()
- Returns:
d – Is the distance between points.
- Return type:
Arraylike of shape (N-1)
Examples
>>> import numpy as np
>>> from watex.utils.exmath import get_distance
>>> x = np.random.rand (7) *10
>>> y = np.abs ( np.random.randn (7) * 12 )
>>> get_distance (x, y)
array([ 8.7665511 , 12.47545656, 8.53730212, 13.54998351, 14.0419387 ,
       20.12086781])
>>> get_distance (x, y, return_mean_dist= True)
12.91534996818084
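Since the returned array has shape (N-1), the distances are presumably taken between consecutive points. Under that assumption, a NumPy sketch (hypothetical helper, without the latitude/longitude conversion) looks like this:

```python
import numpy as np

def distance_sketch(x, y, return_mean_dist=False):
    """Euclidean distance between consecutive (x, y) points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.hypot(np.diff(x), np.diff(y))  # shape (N-1,)
    return float(d.mean()) if return_mean_dist else d

x = [0.0, 3.0, 3.0]
y = [0.0, 4.0, 0.0]
d = distance_sketch(x, y)
mean_d = distance_sketch(x, y, return_mean_dist=True)
```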
- watex.utils.get_full_frequency(z_or_edis_obj_list, /, to_log10=False)[source]#
Get the frequency with clean data.
The full or plain frequency is the frequency array with no frequency missing during the data collection. Note that when using Natural Source Audio-Magnetotellurics, some data are missing because frequencies are weak or missing in certain bands, especially in the attenuation band.
- Parameters:
z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
to_log10 (bool, default=False) – Export the frequency to base 10 logarithm.
- Returns:
f – Frequency with clean data, outside the attenuation band if the survey was carried out with Natural Source Audio-Magnetotellurics.
- Return type:
Arraylike of shape(N, )
Examples
>>> from watex.datasets import load_huayuan
>>> from watex.methods.em import get_full_frequency
>>> box= load_huayuan ( key ='raw', clear_cache = True, samples =7)
>>> edi_data = box.data
>>> f = get_full_frequency (edi_data )
>>> f
array([8.19200e+04, 7.00000e+04, 5.88000e+04, 4.95000e+04, 4.16000e+04,
       3.50000e+04, 2.94000e+04, 2.47000e+04, 2.08000e+04, 1.75000e+04,
       ...
       3.25000e+01, 2.75000e+01, 2.25000e+01, 1.87500e+01, 1.62500e+01,
       1.37500e+01, 1.12500e+01, 9.37500e+00, 8.12500e+00, 6.87500e+00,
       5.62500e+00])
>>> len(f)
56
>>> # Get only the z component objects
>>> zobjs = [ box.emo.ediObjs_[i].Z for i in range (len(box.emo.ediObjs_))]
>>> len(zobjs)
56
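A simple way to picture the full frequency is to take the longest frequency array found among the sites. The sketch below works on plain arrays rather than EDI objects, and the "longest array is the complete one" rule is an assumption rather than the library's documented behaviour.

```python
import numpy as np

def full_frequency_sketch(site_freqs, to_log10=False):
    """Pick the most complete frequency array among the sites."""
    # Assumption: the site with the most frequencies holds the complete set.
    full = np.asarray(max(site_freqs, key=len), dtype=float)
    return np.log10(full) if to_log10 else full

site_freqs = [
    np.array([1e4, 1e3, 1e1]),        # site missing the 1e2 Hz frequency
    np.array([1e4, 1e3, 1e2, 1e1]),   # complete site
]
f = full_frequency_sketch(site_freqs)
f_log = full_frequency_sketch(site_freqs, to_log10=True)
```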
- watex.utils.get_profile_angle(easting=None, northing=None, msg='ignore')[source]#
Compute the geoprofile angle.
- Parameters:
easting (array_like) – Easting coordinate values
northing (array_like) – Northing coordinate values
msg (str, default='ignore') – Outputs a little message if msg is set to “raises”
- Returns:
float – profile_angle
float – geo_electric_strike
- watex.utils.get_sections_from_depth(z, z_range, return_index=False)[source]#
Gets aquifer sections (‘upper’, ‘lower’) in data ‘z’ from the depth range.
This might be useful for computing the thickness of the aquifer.
- Parameters:
z (array-like 1d or pd.Series) – Array or pandas series containing the depth values
z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.
return_index (bool, default=False) – Returns the indices of the sections [‘upper’, ‘lower’] of the aquifer, and of the non-valid sections too.
- Returns:
sections (Tuple (float, float)) – Real values of the upper and lower sections of the aquifer.
If return_index is ‘True’, the function returns:
(upix, lowix): Tuple (int, int) – indices of the upper and lower sections in the depth array z
(invix): list of Tuple (int, int) – list of indices of the invalid sections
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_sections_from_depth
>>> data= load_hlogs().frame
>>> # get real sections from depth 16.25 to 125.83 m
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83))
... (22.46, 128.23)
>>> # aquifer depth from 16.25 m to the end
>>> get_sections_from_depth ( data.depth_top, ( 16.25,))
... (22.46, 693.37)
>>> get_sections_from_depth ( data.depth_top, ( 16.25, 125.83), return_index =True )
... ((3, 11), [(0, 3), (11, 180)])
>>> get_sections_from_depth ( data.depth_top, ( 16.25,), return_index =True )
... ((3, 181), [(0, 3)])
- watex.utils.get_strike(profile_angle=None, easting=None, northing=None, gstrike=None, msg='ignore')[source]#
Compute geoelectric strike from profile angle, easting and northing.
- Parameters:
profile_angle – If not provided, it is computed from the easting and northing coordinates
easting – Easting coordinate values
northing – Northing coordinate values
gstrike – Strike value. If provided, the geoelectric strike is recomputed.
msg – Outputs a little message if msg is set to “raises”
- Returns:
float – profile_angle in degree E of N
float – geo_electric_strike in degrees E of N
- watex.utils.get_target(ar, /, tname, drop_target=True, columns=None, as_frame=False)[source]#
Extract target from multidimensional array or dataframe.
- Parameters:
ar (arraylike2d or pd.DataFrame) – Array that is supposed to contain the target value.
tname (int/str, list of int/str) – Index or name of the target. If an int is passed, it should be less than the number of columns of the array, i.e. a.shape[1] in the case of np.ndarray. If a list of indexes or names is given, the returned target is a two-dimensional array.
drop_target (bool, default=True) – Removes the target from the 2D array or dataframe when the target exists, and returns the data excluding the target.
columns (list, default=None) – Column names used to compose the dataframe when an array is given rather than a dataframe. The list of column names must match the number of columns in the two-dimensional array, otherwise an error occurs.
as_frame (bool, default=False) – Returns a dataframe/series for the target rather than an array when an array is supplied. This is useful when column names are supplied.
- Returns:
t, ar – Return the targets and the array/dataframe of the target.
- Return type:
array-like/pd.Series , array-like/pd.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mtutils import get_target
>>> ar = np.random.randn ( 3, 3 )
>>> df0 = pd.DataFrame ( ar, columns = ['x1', 'x2', 'tname'])
>>> df= df0.copy()
>>> get_target (df, 'tname', drop_target= False )
( tname
0 -0.542861
1 0.781198,
x1 x2 tname
0 -1.424061 -0.493320 -0.542861
1 0.416050 -1.156182 0.781198)
>>> get_target (df, [ 'tname', 'x1']) # drop is True by default
( tname x1
0 -0.542861 -1.424061
1 0.781198 0.416050,
x2
0 -0.493320
1 -1.156182)
>>> df = df0.copy()
>>> # when array is passed
>>> get_target (df.values , '2', drop_target= False )
(array([[-0.54286148], [ 0.7811981 ]]),
array([[-1.42406091, -0.49331988, -0.54286148],
       [ 0.41605005, -1.15618243, 0.7811981 ]]))
>>> get_target (df.values , 'tname') # raise error
ValueError: 'tname' ['tname'] is not valid...
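For the dataframe case, the behaviour described above can be approximated with a few lines of pandas. The helper below is a hypothetical sketch that handles column names only (not integer indexes or raw arrays).

```python
import pandas as pd

def get_target_sketch(df, tname, drop_target=True):
    """Return (target, remaining data) from a dataframe (illustrative only)."""
    names = [tname] if isinstance(tname, str) else list(tname)
    missing = [n for n in names if n not in df.columns]
    if missing:
        raise ValueError(f"{missing} not found in the dataframe columns")
    target = df[names]  # always two-dimensional, even for a single name
    rest = df.drop(columns=names) if drop_target else df
    return target, rest

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0], "tname": [0, 1]})
t, remaining = get_target_sketch(df, "tname")
t_keep, all_data = get_target_sketch(df, "tname", drop_target=False)
```

With `drop_target=False` the second element still contains the target column, as in the first call of the example above.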
- watex.utils.get_unique_section(*data, zname, kname, return_index=False, return_data=False, error='raise', **kws)[source]#
Get the section to consider unique in multiple aquifers.
The unique section ‘upper’ and ‘lower’ is the valid range covering the sections of all the aquifers. It is considered the main valid section, whose data must not be compressed or altered. For instance, the indexes are needed to shrink the data outside this valid section. Mostly, the data from the section are considered the valid data, i.e. the predictor Xr. Data outside the range of the aquifer section can be discarded or compressed to top Xr.
Returns valid section indexes if ‘return_index’ is set to True.
- Parameters:
data (list of pandas dataframe) – Data that mainly contains the aquifer values. The name of the depth column zname and the name of the permeability column kname must be specified.
zname (str, int) – Name of the depth column. zname is used to retrieve the depth column from a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer must not exceed the dataframe size along axis 1. zname commonly needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname is used to retrieve the permeability coefficient ‘k’ from a dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer must not exceed the dataframe size along axis 1. kname commonly needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depths or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
return_index (bool, default=False) – Returns the positions (indexes) of the upper and lower sections of the shallowest and deepest aquifers found in the whole set of dataframes.
return_data (bool, default=False) – Return valid data. It is useful when ‘error’ is set to ‘ignore’ to collect the valid data.
error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.
kws (dict) – Additional keyword arguments passed to get_aquifer_sections().
- Returns:
up, low –
(upix, lowix): Tuple of indexes of lower and upper sections
(up, low): Tuple of aquifer sections (upper and lower)
(upix, lowix), (up, low): positions and section values of the aquifers if return_index and return_sections are True.
- Return type:
list of upper and lower section values of aquifer.
See also
watex.utils.hydroutils.get_aquifer_section: compute single section
watex.utils.hydroutils.get_aquifer_sections: compute multiple sections
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils import get_unique_section
>>> data = load_hlogs ().frame
>>> get_unique_section (data.copy() , zname ='depth', kname ='k', )
... array([197.12, 369.71], dtype=float32)
>>> get_unique_section (data.copy() , zname ='depth', kname ='k', return_index =True)
... array([16, 29])
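Given the sections of several aquifers, one plausible reading of the unique section is the range from the shallowest upper bound to the deepest lower bound. The sketch below encodes that reading as an assumption; the helper name and the input values are illustrative only.

```python
import numpy as np

def unique_section_sketch(sections):
    """Combine (upper, lower) pairs into one enclosing section."""
    sections = np.asarray(sections, dtype=float)
    # Assumption: shallowest upper bound and deepest lower bound.
    return np.array([sections[:, 0].min(), sections[:, 1].max()])

unique = unique_section_sketch([
    [197.12, 369.71],
    [150.0, 300.5],
    [210.0, 410.2],
])
```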
- watex.utils.get_xs_xr_splits(data, /, z_range=None, zname=None, section_indexes=None)[source]#
Split data into the matrix \(X_s\) with \(m_s\) samples (unwanted data) and the matrix \(X_r\) with \(m_r\) samples (valid aquifer data).
- Parameters:
data (pandas dataframe) – Dataframe for compressing.
zname (str, int) – The name of the depth column. ‘zname’ needs to be supplied when section_indexes is not provided.
z_range (tuple (float),) – Section [‘upper’, ‘lower’] of the aquifer at different depths. The depth range must be a pair of values and cannot be greater than the maximum depth of the well.
section_indexes (tuple or list of int) – A pair of integers, given as a tuple or a list. These are the indexes of the valid section (upper and lower) of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is taken as the last position, the bottom of the section. Its value is therefore included in the data.
- Returns:
- xs (list of pandas dataframe) – Shrinking part of the data for compressing. Note that it is a list because it holds the dataframes of the non-valid sections.
- xr (pandas dataframe) – Valid data reflecting the aquifer part or including the aquifer data.
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import get_xs_xr_splits
>>> data = load_hlogs ().frame
>>> xs, xr = get_xs_xr_splits (data, 3.11, section_indexes = (17, 20 ) )
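When section_indexes is given, the split reduces to positional slicing with the lower index included. The following pandas sketch illustrates this. The helper name is hypothetical, and dropping empty parts from xs is an assumption.

```python
import pandas as pd

def xs_xr_splits_sketch(data, section_indexes):
    """Split a frame into non-valid parts (xs) and the valid section (xr)."""
    up, low = section_indexes
    xr = data.iloc[up:low + 1]  # the bottom index is included
    # Assumption: empty leading/trailing parts are left out of xs.
    xs = [part for part in (data.iloc[:up], data.iloc[low + 1:]) if len(part)]
    return xs, xr

data = pd.DataFrame({
    "depth": list(range(10)),
    "k": [None] * 3 + [1, 2, 3] + [None] * 4,
})
xs, xr = xs_xr_splits_sketch(data, (3, 5))
```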
- watex.utils.interpolate1d(arr, kind='slinear', method=None, order=None, fill_value='extrapolate', limit=None, **kws)[source]#
Interpolate array containing invalid values NaN
Useful function to interpolate the missing frequency values in the tensor components.
- Parameters:
arr (array_like) – Array to interpolate, containing invalid values. The invalid value here is NaN.
kind (str or int, optional) – Specifies the kind of interpolation as a string, or as an integer specifying the order of the spline interpolator to use. The string has to be one of linear, nearest, nearest-up, zero, slinear, quadratic, cubic, previous, or next. zero, slinear, quadratic and cubic refer to a spline interpolation of zeroth, first, second or third order; previous and next simply return the previous or next value of the point; nearest-up and nearest differ when interpolating half-integers (e.g. 0.5, 1.5) in that nearest-up rounds up and nearest rounds down. If the method parameter is set to pd, which refers to the pd.interpolate method, kind can be set to polynomial or pad interpolation. Note that polynomial requires you to specify an order, while pad requires you to specify the limit. Default is slinear.
method (str, optional, default='mean') – Method of interpolation. Can be base for scipy.interpolate.interp1d, mean or bff for the scaling methods, and pd for the pandas interpolation methods. Note that the first method is fast and efficient when the number of NaN in the array is relatively small. The base interpolation is less accurate when the data contain many missing values. Alternatively, the scaled methods (mean and bff) are proposed as a more efficient option. Indeed, when mean is set, the function replaces the NaN values using the non-zero values of the raw array and then uses the mean to fit the data. The fit produces a smooth curve, and each NaN in the raw array is replaced by the value of the fit at the same index. The same approach is used for the bff method. However, rather than averaging the non-zero values, it uses the backward and forward strategy to fill the NaN before scaling. mean and bff are more efficient when the data contain a lot of missing values. When the interpolation method is set to pd, the function uses the pandas interpolation but ends the interpolation with a forward/backward NaN filling, since the pandas interpolation does not deal with all the NaN at the beginning or at the end of the array. Default is base.
fill_value (array-like or (array-like, array_like) or extrapolate, optional) – If a ndarray (or float), this value will be used to fill in for requested points outside of the data range. If not provided, then the default is NaN. The array-like must broadcast properly to the dimensions of the non-interpolation axes. If a two-element tuple, then the first element is used as a fill value for x_new < x[0] and the second element is used for x_new > x[-1]. Anything that is not a 2-element tuple (e.g., list or ndarray, regardless of shape) is taken to be a single array-like argument meant to be used for both bounds as below, above = fill_value, fill_value. Using a two-element tuple or ndarray requires bounds_error=False. Default is extrapolate.
kws (dict) – Additional keyword arguments from spi.interp1d.
- Return type:
array like - New interpolated array. NaN values are interpolated.
Notes
When interpolating throughout the complete frequencies, i.e. all the frequency values, using the base method, the missing data in arr can fall outside the arr range. So, for consistency and to keep all the values within the frequency range, it is better to set the fill_value parameter in the kws argument of spi.interp1d to extrapolate. This avoids raising an error when the value to interpolate is out of the bounds of arr.
References
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html
https://www.askpython.com/python/examples/interpolation-to-fill-missing-entries
Examples
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils.exmath import interpolate1d
>>> z = np.random.randn(17) *10 # assume 17 freq for 17 values of tensor Z
>>> z [[7, 10, 16]] =np.nan # replace some indexes by NaN values
>>> zit = interpolate1d (z, kind ='linear')
>>> z
... array([ -1.97732415, -16.5883156 , 8.44484348, 0.24032979,
       8.30863276, 4.76437029, -15.45780568, nan,
       -4.11301794, -10.94003412, nan, 9.22228383,
       -15.40298253, -7.24575491, -7.15149205, -20.9592011 ,
       nan])
>>> zit
...array([ -1.97732415, -16.5883156 , 8.44484348, 0.24032979,
       8.30863276, 4.76437029, -15.45780568, -4.11301794,
       -10.94003412, 9.22228383, -15.40298253, -7.24575491,
       -7.15149205, -20.9592011 , -34.76691014, -48.57461918,
       -62.38232823])
>>> zmean = interpolate1d (z, method ='mean')
>>> zbff = interpolate1d (z, method ='bff')
>>> zpd = interpolate1d (z, method ='pd')
>>> plt.plot( np.arange (len(z)), zit, 'v--',
          np.arange (len(z)), zmean, 'ok-',
          np.arange (len(z)), zbff, '^g:',
          np.arange (len(z)), zpd,'<b:',
          np.arange (len(z)), z,'o',
          )
>>> plt.legend(['interp1d', 'mean strategy', 'bff strategy',
          'pandas strategy', 'data'], loc='best')
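The base strategy, with fill_value set to extrapolate as recommended in the notes, can be sketched with scipy.interpolate.interp1d. This is an illustration under simple assumptions (array positions are used as the interpolation axis, and the helper name is hypothetical), not the library's own code.

```python
import numpy as np
from scipy.interpolate import interp1d

def interpolate_nan_sketch(arr, kind="slinear", fill_value="extrapolate"):
    """Replace NaN entries by values interpolated from the valid ones."""
    arr = np.asarray(arr, dtype=float)
    index = np.arange(arr.size)  # assumption: positions act as the x axis
    valid = ~np.isnan(arr)
    f = interp1d(index[valid], arr[valid], kind=kind, fill_value=fill_value)
    out = arr.copy()
    out[~valid] = f(index[~valid])  # trailing NaN are extrapolated
    return out

z = np.array([0.0, 1.0, np.nan, 3.0, np.nan])
zi = interpolate_nan_sketch(z)
```

Without `fill_value="extrapolate"`, the trailing NaN at position 4 lies outside the valid range and interp1d would raise an error.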
- watex.utils.interpolate2d(arr2d, method='slinear', **kws)[source]#
Interpolate the data in 2D dimensional array.
If the data contain missing values, they are replaced by the interpolated values.
- Parameters:
arr2d (np.ndarray, shape (N, M)) – 2D dimensional data
method (str, default='slinear') – Interpolation technique to use. Can be nearest or pad.
kws (dict) – Additional keywords. Refer to interpolate1d().
- Returns:
arr2d – 2D dimensional data interpolated
- Return type:
np.ndarray, shape (N, M)
Examples
>>> from watex.methods.em import EM
>>> from watex.utils.exmath import interpolate2d
>>> # make 2d matrix of frequency
>>> emObj = EM().fit(r'data/edis')
>>> freq2d = emObj.make2d (out = 'freq')
>>> freq2d_i = interpolate2d(freq2d )
>>> freq2d.shape
...(55, 3)
>>> freq2d
... array([[7.00000e+04, 7.00000e+04, 7.00000e+04],
       [5.88000e+04, 5.88000e+04, 5.88000e+04],
       ...
       [6.87500e+00, 6.87500e+00, 6.87500e+00],
       [ nan, nan, 5.62500e+00]])
>>> freq2d_i
... array([[7.000000e+04, 7.000000e+04, 7.000000e+04],
       [5.880000e+04, 5.880000e+04, 5.880000e+04],
       ...
       [6.875000e+00, 6.875000e+00, 6.875000e+00],
       [5.625000e+00, 5.625000e+00, 5.625000e+00]])
References
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.interp2d.html
- watex.utils.interpolate_grid(arr, /, method='cubic', fill_value='auto', view=False)[source]#
Interpolate data containing missing values.
- Parameters:
arr (ArrayLike2D) – Two dimensional array for interpolation
method (str, default='cubic') – kind of interpolation. It could be [‘nearest’|’linear’|’cubic’].
fill_value (float, str, default='auto') – Fills the interpolated grid at the edges, or the surrounding NaN, with a fill value. The auto fill uses the forward and backward fill strategy.
view (bool, default=False) – Quickly visualize the interpolated grid.
- Returns:
arri – Interpolated 2D grid.
- Return type:
ArrayLike2d
See also
spi.griddata: Scipy interpolate Grid data
fillNaN: Fill missing data strategy.
Examples
>>> import numpy as np
>>> from watex.utils.funcutils import interpolate_grid
>>> x = [28, np.nan, 50, 60] ; y = [np.nan, 1000, 2000, 3000]
>>> xy = np.vstack ((x, y)).T
>>> xyi = interpolate_grid (xy, view=True )
>>> xyi
array([[ 28. , 22.78880936, 50. , 60. ],
       [1000. , 1000. , 2000. , 3000. ]])
- watex.utils.label_importance(label, arr_k, arr_aq, *, method='naive')[source]#
- Compute the score for the label and its representativity in the valid
array ‘arr_k’
- Parameters:
- label: int, or string
Class label from the true labels array of the permeability coefficient ‘k’. If a string, be sure to convert the array to hold the dtype str. It is recommended to provide data with no NaN, to keep full control over the occurrence results.
- arr_k: array-like 1d
True labels of array containing the permeability coefficient ‘k’.
- arr_aq: array_like 1d
- True labels of the groups of aquifers or predicted naive group of
aquifer (NGA labels). See
predict_NGA_labels().
- method: str [‘naive’, ‘strict’], default=’naive’
The kind of strategy used to compute the representativity of a label in the predicted array ‘arr_aq’. It can also be ‘strict’. Indeed:
- ‘naive’ computes the importance of the label from the number of occurrences of this specific label in the array ‘k’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in ‘arr_k’.
- ‘strict’ computes the importance of the label from the number of occurrences in the whole valid ‘arr_k’, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in ‘arr_k’.
- Returns:
- label_dict_group_rate: dict,
Dictionary of the label and its rate of occurrence in arr_aq. Thus each group in arr_aq has its rate of representativity of the label in arr_k.
Examples
- label k = 1 :
- {‘V’: 0.17, ‘IV’: 0.141, ‘II’: 0.126, ‘III’: 0.084, ‘IV&V’: 0.005,
‘II&III’: 0.003, ‘III&IV’: 0.003}
- label k = 2 :
{‘III’: 0.052, ‘II’: 0.05, ‘V’: 0.05, ‘IV’: 0.034, ‘III&IV’: 0.005}
- label k = 3 :
{‘V’: 0.123, ‘IV’: 0.086, ‘III’: 0.068}
>>> # **comments:
>>> # label k=1 has 17% importance for group V and 12.6% for group II, whereas
>>> # label k=2 has a weak rate in the whole dataset (~19% for all groups).
>>> # The most dominant labels are k=1 and k=3 with 53.14% and 27.74%,
>>> # respectively, in the dataset.
>>> # If the threshold of representativity is set to 50%, none of the true
>>> # labels k will fit any aquifer group, since the max representativity
>>> # score is 17%, reached for group V with k=1.
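One possible reading of the ‘naive’ and ‘strict’ strategies is sketched below in plain Python. The function `label_rates` and the toy arrays are assumptions made for illustration only; the watex implementation may count and round differently.

```python
from collections import Counter

def label_rates(label, arr_k, arr_aq, method="naive"):
    """Rate of each aquifer group among the samples carrying `label`.

    'naive'  divides by the number of samples that carry `label`;
    'strict' divides by the number of all valid (non-missing) samples.
    """
    pairs = [(k, g) for k, g in zip(arr_k, arr_aq) if k is not None]
    group_counts = Counter(g for k, g in pairs if k == label)
    if method == "naive":
        total = sum(group_counts.values())
    else:
        total = len(pairs)
    return {g: round(c / total, 3) for g, c in group_counts.most_common()}

# toy data: None marks a missing k value
arr_k = [1, 1, 1, 2, 2, 3, None, 1]
arr_aq = ["V", "V", "IV", "II", "V", "IV", "II", "II"]

naive = label_rates(1, arr_k, arr_aq)
strict = label_rates(1, arr_k, arr_aq, method="strict")
print(naive)
print(strict)
```

With the naive strategy the rates of a label always sum to one, whereas with the strict strategy they sum to the share of that label in the valid data.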
- watex.utils.labels_validator(t, /, labels, return_bool=False)[source]#
Assert that the labels exist in the target, and return either the labels or a boolean telling whether all items of labels are in the target.
- Parameters:
t – array-like, target that is expected to contain the labels.
labels – int, str or list of (str or int) that is supposed to be in the target t.
return_bool – bool, default=False; returns ‘True’ or ‘False’ rather than the labels if set to True.
- Returns:
bool or labels; ‘True’ or ‘False’ if return_bool is set to True, and the labels otherwise.
- Example:
>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget, labels_validator
>>> _, y = fetch_data ('bagoue', return_X_y=True, as_frame=True)
>>> # binarize target y into [0 , 1]
>>> ybin = cattarget(y, labels=2 )
>>> labels_validator (ybin, [0, 1])
... [0, 1] # all labels exist.
>>> labels_validator (y, [0, 1, 3])
... ValueError: Value '3' is missing in the target.
>>> labels_validator (ybin, 0 )
... [0]
>>> labels_validator (ybin, [0, 5], return_bool=True ) # no raise error
... False
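The behaviour described above can be sketched in a few lines of plain Python. `check_labels` is a hypothetical stand-in written for illustration, not the watex function itself, and its error message simply mimics the one shown in the example.

```python
def check_labels(target, labels, return_bool=False):
    """Return the labels as a list if all of them occur in `target`.

    With return_bool=True, return True/False instead of raising.
    """
    if isinstance(labels, (int, str)):
        labels = [labels]          # accept a single label
    present = set(target)
    missing = [lab for lab in labels if lab not in present]
    if return_bool:
        return not missing
    if missing:
        raise ValueError(f"Value {missing[0]!r} is missing in the target.")
    return list(labels)

ybin = [0, 1, 1, 0, 1]
print(check_labels(ybin, [0, 1]))                    # all labels exist
print(check_labels(ybin, 0))                         # single label is wrapped in a list
print(check_labels(ybin, [0, 5], return_bool=True))  # no error raised
```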
- watex.utils.linkage_matrix(df, columns=None, kind='design', metric='euclidean', method='complete', as_frame=False, optimal_ordering=False)[source]#
Compute the linkage matrix using the hierarchical clustering algorithm.
- Parameters:
df (dataframe or NDArray of (n_samples, n_features)) – Dataframe or ndarray. If an array is given, the column names must be specified to match the array shape along axis 1.
columns (list) – List of labels to name each column of the array of (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.
kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – Kind of approach used to build the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. Alternatively, the squareform distance matrix could be used, which yields different distance values than expected. The design approach uses the complete input example matrix, also called the ‘design matrix’, to lead to a correct linkage matrix similar to squareform and condense.
metric (str or callable, default is {'euclidean'}) – The metric to use when calculating the distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section below for full descriptions.
optimal_ordering (bool, optional) – If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets. See also scipy.cluster.hierarchy.linkage().
- Returns:
row_clusters – Consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members of each cluster and the third column reports the distance between those members.
- Return type:
linkage matrix
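As an illustration of the condensed and design approaches, the sketch below calls SciPy directly rather than watex. It assumes SciPy is installed and uses random data, so it only shows that both inputs lead to the same linkage matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((5, 3))          # 5 samples, 3 features: the design matrix

# condensed distance matrix: flat upper triangle, as returned by pdist
condensed = pdist(X, metric="euclidean")
clusters_condensed = linkage(condensed, method="complete")

# design-matrix approach: pass the samples directly
clusters_design = linkage(X, method="complete", metric="euclidean")

# one row per merge; the third column holds the merge distance
print(clusters_design.shape)
print(np.allclose(clusters_condensed, clusters_design))
```

For n samples the linkage matrix has n - 1 rows, one per merge.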
- watex.utils.magnitude(cz)[source]#
Compute the magnitude of selected conductive zone.
The magnitude parameter is the absolute difference between the minimum \(\min \rho_a\) and maximum \(\max \rho_a\) apparent resistivity values of the selected anomaly:
\[magnitude=|\min\rho_a -\max\rho_a|\]
- Parameters:
cz – array-like. Array of apparent resistivity values composing the conductive zone.
- Returns:
Absolute value of anomaly magnitude in ohm.meters.
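The formula above translates directly into NumPy. The helper name `anomaly_magnitude` and the resistivity values are made up for illustration.

```python
import numpy as np

def anomaly_magnitude(cz):
    """|min(rho_a) - max(rho_a)| over the conductive zone, in ohm.meters."""
    cz = np.asarray(cz, dtype=float)
    return float(np.abs(cz.min() - cz.max()))

cz = [1101, 1147, 1345, 1369]   # made-up apparent resistivities
mag = anomaly_magnitude(cz)
print(mag)
```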
- watex.utils.makeCoords(reflong, reflat, nsites, *, r=45.0, utm_zone=None, step='1km', order='+', todms=False, is_utm=False, raise_warning=True, **kws)[source]#
Generate multiple stations coordinates (longitudes, latitudes) from a reference station/site.
One degree of latitude equals approximately 364,000 feet (69 miles), one minute equals 6,068 feet (1.15 miles), and one second equals 101 feet. One degree of longitude equals 288,200 feet (54.6 miles), one minute equals 4,800 feet (0.91 mile), and one second equals 80 feet (1 foot ≈ 0.3048 meter).
- Parameters:
reflong (float or string or list of [start, stop]) – Reference longitude in decimal degrees or in DD:MM:SS for the first site, considered as the origin of the landmark.
reflat (float or string or list of [start, stop]) – Reference latitude in decimal degrees or in DD:MM:SS for the reference site, considered as the landmark origin. If the value is given as a list, it can contain the start point and the stop point.
nsites (int or float) – Number of sites for which the coordinates are generated.
r (float or int) – The rotation angle in degrees. It sets the direction of the projection line. Default value is 45 degrees.
step (float or str) – Offset or distance of separation between the sites, in meters. If the value is given as a string, any unit other than km is treated as an m value. Only meters and kilometers are accepted.
order (str) – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with the angle r set to 45 degrees. Could be - for descending order. Any other value is treated as ascending order.
is_utm (bool) – Consider the first two positional arguments as UTM coordinate values, i.e. assume that reflong and reflat are the UTM ‘easting’ and ‘northing’. If is_utm is False, any value greater than 180 degrees for longitude and 90 degrees for latitude will raise an error. Default is False.
utm_zone (string (##N or ##S)) – UTM zone in the form of a number and the North or South hemisphere, e.g. 10S or 03N. Must be given if is_utm is set to True.
todms (bool) – Convert the decimal degree values into DD:MM:SS. Default is False.
raise_warning (bool, default=True) – Raise a warning if GDAL is not set, or about the accuracy status of the coordinates.
kws (dict) – Additional keywords of gistools.project_point_utm2ll().
- Returns:
Tuple of generated projected coordinates (longitudes, latitudes), either in decimal degrees or in DD:MM:SS.
Notes
The distances vary. A degree, minute, or second of latitude remains fairly constant from the equator to the poles; however a degree, minute, or second of longitude can vary greatly as one approaches the poles and the meridians converge.
References
https://math.answers.com/Q/How_do_you_convert_degrees_to_meters
Examples
>>> from watex.utils.coreutils import makeCoords
>>> rlons, rlats = makeCoords('110:29:09.00', '26:03:05.00',
...                           nsites = 7, todms=True)
>>> rlons
... array(['110:29:09.00', '110:29:35.77', '110:30:02.54', '110:30:29.30',
           '110:30:56.07', '110:31:22.84', '110:31:49.61'], dtype='<U12')
>>> rlats
... array(['26:03:05.00', '26:03:38.81', '26:04:12.62', '26:04:46.43',
           '26:05:20.23', '26:05:54.04', '26:06:27.85'], dtype='<U11')
>>> rlons, rlats = makeCoords ((116.7, 119.90) , (44.2 , 40.95),
...                            nsites = 238, step =20. , order = '-', r= 125)
>>> rlons
... array(['119:54:00.00', '119:53:11.39', '119:52:22.78', '119:51:34.18',
           '119:50:45.57', '119:49:56.96', '119:49:08.35', '119:48:19.75',
           ...
           '116:46:03.04', '116:45:14.43', '116:44:25.82', '116:43:37.22',
           '116:42:48.61', '116:42:00.00'], dtype='<U12')
>>> rlats
... array(['40:57:00.00', '40:57:49.37', '40:58:38.73', '40:59:28.10',
           '41:00:17.47', '41:01:06.84', '41:01:56.20', '41:02:45.57',
           ...
           '44:07:53.16', '44:08:42.53', '44:09:31.90', '44:10:21.27',
           '44:11:10.63', '44:12:00.00'], dtype='<U11')
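The DD:MM:SS strings used in the example can be converted to and from decimal degrees as sketched below. `dms_to_deg` and `deg_to_dms` are hypothetical helpers written for illustration; they are not the converters used internally by makeCoords.

```python
def dms_to_deg(dms):
    """Convert a 'DD:MM:SS.ss' string to decimal degrees."""
    d, m, s = dms.split(":")
    sign = -1.0 if d.strip().startswith("-") else 1.0
    return sign * (abs(float(d)) + float(m) / 60 + float(s) / 3600)

def deg_to_dms(deg):
    """Convert decimal degrees to a 'DD:MM:SS.ss' string."""
    sign = "-" if deg < 0 else ""
    total = round(abs(deg) * 3600, 2)      # total arc-seconds, rounded once
    d = int(total // 3600)
    m = int((total - d * 3600) // 60)
    s = total - d * 3600 - m * 60
    return f"{sign}{d}:{m:02d}:{s:05.2f}"

lon = dms_to_deg("110:29:09.00")
print(round(lon, 6))
print(deg_to_dms(lon))
print(deg_to_dms(116.7))
```

Rounding the total number of arc-seconds before splitting avoids outputs such as 59.999 or 60.00 seconds caused by floating-point error.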
- watex.utils.make_MXS_labels(y_true, y_pred, threshold=None, similar_labels=None, sep=None, prefix=None, method='naive', trailer='*', return_obj=False, **kws)[source]#
Create Mixture Learning Strategy (MXS) labels from the true labels ‘y_true’ and the predicted Naive Group of Aquifer (NGA) labels ‘y_pred’.
- Parameters:
y_true (array-like 1d, pandas.Series) – Array composed of valid k-values and possible missing k-values.
y_pred (Array-like 1d, pandas.Series) – Array composed of the valid NGA labels. Note that the NGA labels are predicted labels, mostly obtained with unsupervised learning.
threshold (float, default=None) – The threshold from which a label in ‘y_true’ can be considered similar to the one in the NGA labels ‘y_pred’. The default is ‘None’, which means that no rule is applied and the label with the highest preponderance or occurrence in the data compared to the other labels is considered the most representative and similar. Fixing the threshold is recommended instead, especially for a huge dataset.
similar_labels (list of tuple, optional) – List of tuples in pairs (label, similar group). If given, the similar group must be a label existing in the predicted NGA. If None, the auto-similarity is triggered.
sep (str, default='') –
Separator between the true labels ‘y_true’ and the predicted NGA labels. Sep is used to rewrite the MXS labels. Mostly, the MXS labels are a combination of the true label of the permeability coefficient ‘k’ and the label of the NGA to compose new similarity labels. For instance:
>>> true_labels=['k1', 'k2', 'k3'] ; NGA_labels =['II', 'I', 'IV']
>>> # gives
>>> MXS_labels= ['k1_II', 'k2_I', 'k3_IV']
where the separator sep is set to _. This happens especially when one of the labels (NGA or true_labels) is not a numeric datatype and a similarity is found between ‘k1’ and ‘II’, ‘k2’ and ‘I’ and so on.
prefix (str, default='') –
prefix is used to rename the true_labels, i.e. the true valid k. For instance:
>>> k_valid =[1, 2, ..] -> k_new = [k1, k2, ...]
where ‘k’ is the prefix.
method (str ['naive', 'strict'], default='naive') –
The kind of strategy to compute the representativity of a label in the predicted array ‘y_pred’. It can also be ‘strict’. Indeed:
naive computes the importance of the label from the number of occurrences of this specific label in the array ‘y_true’. It does not take into account the occurrences of the other existing labels. This is useful for unbalanced class labels in y_true.
strict computes the importance of the label from the number of occurrences in the whole valid y_true, i.e. relative to the total occurrences of all the labels that exist in the whole ‘arr_aq’. This can give suitable analysis results if the data is not unbalanced for each label in y_pred.
trailer (str, default='*') –
The Mixture strategy marker used to differentiate the existing class labels in ‘y_true’ from the predicted labels ‘y_pred’, especially when the same class labels are also present in the true labels with the same label-identifier name. This is useful to avoid any confusion between the labels in y_true and y_pred, for better demarcation and distinction. Note that if the trailer is set to None and both y_true and y_pred are numeric data, the labels in y_pred are systematically renamed to be distinct from the ones in ‘y_true’. For instance:
>>> true_labels=[1, 2, 3] ; NGA_labels =[0, 1, 2]
>>> # with trailer , MXS labels should be
>>> MXS_labels= ['0', '1*', '2*', '3'] # 1 and 2 are in true_labels
>>> # with no trailer
>>> MXS_labels= [0, 4, 5, 3] # 1 and 2 have been changed to [4, 5]
return_obj (watex.utils.box.Boxspace) –
If True, returns a MXS object with useful attributes such as:
- mxs_classes_ = the MXS class labels.
- mxs_labels_ = the array-like of MXS labels. It also includes some non-similar labels from the NGA.
- mxs_map_classes_ = a dict of the original class labels of the array ‘k’ <‘y_true’> and their temporary integer class labels. Indeed, if the ‘y_true’ class labels are not of a numeric dtype, new labels with an integer dtype are created. The dict is used to wrap the true labels (original ones) during the MXS creation. Thus, the original labels are not altered and will be mapped in turn at the end to recover their positions in the new MXS array. It is set to ‘None’ if ‘y_true’ has a numeric dtype.
- mxs_group_classes_ = dict of all the similar group labels with the MXS labels related to the modified existing groups of the NGA. Note that the non-similar groups are modified if their labels are also found in the true_labels, to avoid any confusion. Thus the dict wraps the non-similar labels with their new temporary labels.
- mxs_similar_groups_ = list of the similar labels found in y_true that have a similarity in the NGA.
- mxs_similarity_ = tuple of similarity in pairs (label, group) existing between the label class in y_true and the NGA.
- mxs_group_labels_ = list of the similar groups found in the predicted NGA that have a similarity in the true labels ‘y_true’.
- Returns:
MXS – Array-like of MXS labels, or MXS object containing the useful attributes.
- Return type:
array-like 1d or
Boxspace
See also
predict_NGA_labels: Predicts Naive Group of Aquifer labels.
Examples
>>> import numpy as np
>>> from watex.datasets import load_hlogs
>>> from watex.utils import read_data
>>> from watex.utils.hydroutils import classify_k, make_MXS_labels
>>> data = load_hlogs ().frame
>>> # map data.k to categorize k values
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs = make_MXS_labels (ymap, data.aquifer_group)
>>> y_mxs[14:24]
... array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> mxs_obj = make_MXS_labels (ymap, data.aquifer_group, return_obj=True )
>>> mxs_obj.mxs_labels_[14: 24]
... array(['I', 'I', 2, 2, 2, 2, 2, 2, 2, 2], dtype=object)
>>> # now we do the same task using the private data 'hf.csv'
>>> # composed of 11 boreholes. By default, we alternatively use
>>> # the aquifer groups as a fake NGA
>>> data = read_data ('data/boreholes/hf.csv')
>>> ymap = classify_k(data.k , default_func =True)
>>> y_mxs= make_MXS_labels (ymap, data.aquifer_group)
>>> np.unique (y_mxs)
... array(['1', '1V', '2', '2III', '3', 'I', 'II', 'III&IV', 'IV'],
          dtype='<U6')
>>> # *comments:
>>> # label '1V' means that the group V (expected to be a cluster)
>>> # and the label 1 (true labels) have a similarity.
>>> # The same holds for label '2III', while the remaining label 3 does
>>> # not have any similarity with the other labels in 'y_pred', expected
>>> # to be NGA labels.
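The naming rule described for the prefix and sep parameters can be sketched as follows. `combine_labels` is a hypothetical helper that only illustrates how a prefixed true label and a similar NGA group are joined; it does not detect similarities and is not part of watex.

```python
def combine_labels(true_labels, nga_labels, prefix="", sep=""):
    """Join each (true label, similar NGA group) pair into one MXS-style name."""
    return [f"{prefix}{k}{sep}{g}" for k, g in zip(true_labels, nga_labels)]

# pairs assumed to be already identified as similar
mxs = combine_labels([1, 2, 3], ["II", "I", "IV"], prefix="k", sep="_")
print(mxs)

# with empty prefix and separator, the names look like '1V' or '2III'
plain = combine_labels([1, 2], ["V", "III"])
print(plain)
```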
- watex.utils.make_naive_pipe(X, y=None, *, num_features=None, cat_features=None, label_encoding='LabelEncoder', scaler='StandardScaler', missing_values=nan, impute_strategy='median', sparse_output=True, for_pca=False, transform=False)[source]#
Make a pipeline to transform data at once.
A naive pipeline is useful to quickly preprocess the data at once for a quick prediction.
Works with a pandas dataframe. If no features are given, the numerical and categorical features are retrieved automatically.
- Parameters:
X (pandas dataframe of shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also supported; use a sparse csc_matrix for maximum efficiency.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
num_features (list or str, optional) – Numerical features put in a list. If num_features are given whereas cat_features are None, cat_features are figured out automatically.
cat_features (list of str, optional) – Categorical features put in a list. If cat_features are given whereas num_features are None, num_features are figured out automatically.
label_encoding (callable or str, default='sklearn.preprocessing.LabelEncoder') – Kind of encoding used to encode the label. This assumes ‘y’ is supplied.
scaler (callable or str, default='sklearn.preprocessing.StandardScaler') – Kind of scaling used to scale the numerical data. Note that for the categorical data encoding, ‘sklearn.preprocessing.OneHotEncoder’ is used under the hood instead.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
impute_strategy (str, default='median') –
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
strategy=”constant” for fixed value imputation.
sparse_output (bool, default=True) – Used when the label y is given. Binarize labels in a one-vs-all fashion. If True, the array returned from transform is in sparse CSR format.
for_pca (bool, default=False) – Transform data for principal component analysis (PCA). If set to True, watex.exlib.sklearn.OrdinalEncoder is used instead of watex.exlib.sklearn.OneHotEncoder.
transform (bool, default=False) – Transform the data in place rather than returning the naive pipeline.
- Returns:
full_pipeline (watex.exlib.sklearn.FeatureUnion) – Full pipeline composed of numerical and categorical pipes.
(X_transformed &| y_transformed) ({array-like, sparse matrix} of shape (n_samples, n_features)) – Transformed data.
Examples
>>> from watex.utils.mlutils import make_naive_pipe
>>> from watex.datasets import load_hlogs
(1) Make a naive simple pipeline with RobustScaler, StandardScaler
>>> from watex.exlib.sklearn import RobustScaler
>>> X_, y_ = load_hlogs (as_frame=True )# get all the data
>>> pipe = make_naive_pipe(X_, scaler =RobustScaler )
(2) Transform X in place with numerical and categorical features with StandardScaler (default). Returns a CSR matrix
>>> make_naive_pipe(X_, transform =True )
... <181x40 sparse matrix of type '<class 'numpy.float64'>'
    with 2172 stored elements in Compressed Sparse Row format>
- watex.utils.moving_average(y, *, window_size=3, method='sma', mode='same', alpha=0.5)[source]#
A moving average is used with time series data to smooth out short-term fluctuations and highlight longer-term trends or cycles.
The function analyzes data points by creating a series of averages of different subsets of the full data set.
- Parameters:
y (array_like, shape (N,)) – the values of the time history of the signal.
window_size (int) – The length of the window. Must be greater than 1 and preferably an odd integer. Default is 3.
method (str) – Variant of the moving average. Can be sma, cma, wma and ema for simple, cumulative, weighted and exponential moving average. Default is sma.
mode (str) – Convolution mode. Can be full, same and valid. With full, the convolution is returned at each point of overlap, with an output shape of (N+M-1,). At the end-points of the convolution, the signals do not overlap completely, and boundary effects may be seen. See np.convolve for more details. Default is same.
alpha (float) – Smoothing factor. Only used in the exponential moving average. Default is .5.
- Returns:
ya – Averaged time history of the signal
- Return type:
array like, shape (N,)
Notes
The first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next value in the subset.
Examples
>>> import numpy as np ; import matplotlib.pyplot as plt
>>> from watex.utils.exmath import moving_average
>>> data = np.random.randn (37)
>>> # add Gaussian noise to the data
>>> data = 2 * np.sin( data) + np.random.normal (0, 1 , len(data))
>>> window = 5 # fixed size to 5
>>> # window_size is keyword-only in the signature above
>>> sma = moving_average(data, window_size=window)
>>> cma = moving_average(data, window_size=window, method ='cma' )
>>> wma = moving_average(data, window_size=window, method ='wma' )
>>> ema = moving_average(data, window_size=window, method ='ema' , alpha =0.6)
>>> x = np.arange(len(data))
>>> plt.plot (x, data, 'o', x, sma , 'ok--', x, cma, 'g-.', x, wma, 'b:')
>>> plt.legend (['data', 'sma', 'cma', 'wma'])
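For readers who want to see what the simple and exponential variants compute, the sketch below uses the textbook definitions with NumPy. The functions `sma` and `ema` are illustrative assumptions; the watex implementation may treat the window edges and the first EMA value differently.

```python
import numpy as np

def sma(y, window_size=3, mode="same"):
    """Simple moving average as a convolution with a uniform kernel."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(y, kernel, mode=mode)

def ema(y, alpha=0.5):
    """Exponential moving average: out[i] = alpha*y[i] + (1-alpha)*out[i-1]."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]                      # seed with the first sample
    for i in range(1, y.size):
        out[i] = alpha * y[i] + (1 - alpha) * out[i - 1]
    return out

y = [1, 2, 3, 4, 5]
print(sma(y, 3, mode="valid"))   # only full windows
print(sma(y, 3, mode="same"))    # same length as y, edges padded with zeros
print(ema([1, 2, 3], alpha=0.5))
```

With mode 'same' the first and last values are pulled toward zero because the window is implicitly zero-padded, which is the boundary effect mentioned for the mode parameter.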
- watex.utils.naive_imputer(X, y=None, strategy='mean', mode=None, drop_features=False, missing_values=nan, fill_value=None, verbose='deprecated', add_indicator=False, copy=True, keep_empty_features=False, **fit_params)[source]#
Impute missing values in the data.
When the data contains categorical features, passing ‘bi-impute’ to the ‘mode’ parameter imputes both the numerical and categorical features rather than raising an error when ‘strategy’ is not set to ‘most_frequent’.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data in which the missing values are imputed.
y (None) – Not used, present here for API consistency by convention.
strategy (str, default='mean') –
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
strategy=”constant” for fixed value imputation.
mode (str, ['bi-impute'], default=None) – If mode is set to ‘bi-impute’, it imputes both the numerical and categorical features and returns a single imputed dataframe.
drop_features (bool or list, default=False) – Drop a list of features in the dataframe before imputation. If True and no list of features is supplied, the categorical features are dropped.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
keep_empty_features (bool, default=False) –
If True, features that consist exclusively of missing values when fit is called are returned in results when transform is called. The imputed value is always 0 except when strategy=”constant” in which case fill_value will be used instead.
New in version 0.2.0.
verbose (int, default=0) – Controls the verbosity of the imputer.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
fit_params (dict) – Keyword arguments passed to the scikit-learn fitting parameters. More details on https://scikit-learn.org/stable/
- Returns:
Xi – Data imputed
- Return type:
Dataframe, array-like, sparse matrix of shape (n_samples, n_features)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_imputer
>>> X= np.random.randn ( 7, 4 )
>>> X[3, :] =np.nan ; X[:, 3][-4:]=np.nan
>>> naive_imputer (X)
... array([[ 1.34783528,  0.53276798, -1.57704281,  0.43455785],
           [ 0.36843174, -0.27132106, -0.38509441, -0.29371997],
           [-1.68974996,  0.15268509, -2.54446498,  0.18939122],
           [ 0.06013775,  0.36687602, -0.21973368,  0.11007637],
           [-0.27129147,  1.18103398,  1.78985393,  0.11007637],
           [ 1.09223954,  0.12924661,  0.52473794,  0.11007637],
           [-0.48663864,  0.47684353,  0.87360825,  0.11007637]])
>>> frame = pd.DataFrame (X, columns =['a', 'b', 'c', 'd'] )
>>> # change [bc] types to categorical values.
>>> frame['b']=['pineaple', '', 'cabbage', 'watermelon', 'onion',
...             'cabbage', 'onion']
>>> frame['c']=['lion', '', 'cat', 'cat', 'dog', '', 'mouse']
>>> # 'bi-impute' is passed to the 'mode' parameter of the signature
>>> naive_imputer(frame, mode ='bi-impute')
...             b      c         a         d
    0    pineaple   lion  1.347835  0.434558
    1     cabbage    cat  0.368432 -0.293720
    2     cabbage    cat -1.689750  0.189391
    3  watermelon    cat  0.060138  0.110076
    4       onion    dog -0.271291  0.110076
    5     cabbage    cat  1.092240  0.110076
    6       onion  mouse -0.486639  0.110076
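The two imputation rules combined by ‘bi-impute’ (mean for numerical columns, most frequent value for categorical ones) can be sketched without scikit-learn. `impute_mean` and `impute_most_frequent` are hypothetical helpers, and treating the empty string as the missing marker is an assumption taken from the example above.

```python
import numpy as np
from collections import Counter

def impute_mean(X):
    """Replace NaN by the mean of its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def impute_most_frequent(values, missing=""):
    """Replace the missing marker by the most frequent value.

    Ties are resolved by taking the smallest value, as stated for
    the 'most_frequent' strategy.
    """
    counts = Counter(v for v in values if v != missing)
    top = max(counts.values())
    fill = min(v for v, c in counts.items() if c == top)
    return [fill if v == missing else v for v in values]

num = impute_mean([[1.0, np.nan], [3.0, 4.0]])
cat = impute_most_frequent(["lion", "", "cat", "cat", "dog", "", "mouse"])
print(num)
print(cat)
```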
- watex.utils.naive_scaler(X, y=None, *, kind=<class 'sklearn.preprocessing._data.StandardScaler'>, copy=True, with_mean=True, with_std=True, feature_range=(0, 1), clip=False, norm='l2', **fit_params)[source]#
Quick data scaling using both strategies implemented in scikit-learn with StandardScaler and MinMaxScaler.
The function returns a scaled dataframe if a dataframe is passed, otherwise an ndarray. For other scaling, call scikit-learn instead.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
kind (str, default='StandardScaler') – Kind of data scaling. Can also be [‘MinMaxScaler’, ‘Normalizer’]. The default is ‘StandardScaler’
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
clip (bool, default=False) – Set to True to clip transformed values of held-out data to provided feature range.
fit_params (dict,) – keywords arguments passed to the scikit-learn fitting parameters More details on https://scikit-learn.org/stable/
- Returns:
X_sc – Transformed array.
- Return type:
{ndarray, sparse matrix} or dataframe of shape (n_samples, n_features)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from watex.utils.mlutils import naive_scaler
>>> X= np.random.randn (7 , 3 )
>>> X_std = naive_scaler (X )
... array([[ 0.17439644,  1.55683005,  0.24115109],
           [-0.59738672,  1.3166854 ,  1.23748004],
           [-1.6815365 , -1.19775838,  0.71381357],
           [-0.1518278 , -0.32063059, -0.47483155],
           [-0.41335886,  0.13880519,  0.69258621],
           [ 1.45221902, -1.03852015, -0.40157981],
           [ 1.21749443, -0.45541153, -2.00861955]])
>>> # use dataframe
>>> Xdf = pd.DataFrame (X, columns =['a', 'c', 'c'])
>>> naive_scaler (Xdf , kind='Normalizer') # return data frame
...           a         c         c
    0  0.252789  0.967481 -0.008858
    1 -0.265161  0.908862  0.321961
    2 -0.899863 -0.416231  0.130380
    3  0.178203  0.039443 -0.983203
    4 -0.418487  0.800306  0.429394
    5  0.933933 -0.309016 -0.179661
    6  0.795234 -0.051054 -0.604150
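As a reminder of what the default ‘StandardScaler’ kind computes, the sketch below standardizes each column with NumPy. `standardize` is an illustrative helper, not the watex or scikit-learn code, and it assumes that no column is constant.

```python
import numpy as np

def standardize(X, with_mean=True, with_std=True):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    if with_mean:
        X = X - X.mean(axis=0)
    if with_std:
        X = X / X.std(axis=0)   # population standard deviation (ddof=0)
    return X

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)
print(Z.mean(axis=0))   # close to 0 for every column
print(Z.std(axis=0))    # close to 1 for every column
```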
- watex.utils.normalizer(arr, /, method='naive')[source]#
Normalize values to be between 0 and 1.
This normalizer handles NaN values and translates the data such that it lies in the given range, e.g. between zero and one.
Note that when method='MinMax', the transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_normed = X_std * (max - min) + min
where min, max = feature_range.
This transformation is often used as an alternative to zero mean, unit variance scaling.
- Parameters:
arr (Arraylike,) – Array to normalize, can contain NaN values.
method (str) – Can be set to use the scikit-learn MinMaxScaler for normalization. Any other value uses the naive normalization.
- Returns:
arr_norm
- Return type:
Normalized array.
Examples
>>> import numpy as np
>>> from watex.utils.funcutils import normalizer
>>> np.random.seed (42)
>>> arr = np.random.randn (3, 2 )
>>> arr
array([[ 0.49671415, -0.1382643 ],
       [ 0.64768854,  1.52302986],
       [-0.23415337, -0.23413696]])
>>> normalizer (arr )
array([[4.15931313e-01, 5.45697636e-02],
       [5.01849720e-01, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max') # normalize data along axis=0
array([[0.82879654, 0.05456093],
       [1.        , 1.        ],
       [0.        , 0.        ]])
>>> arr [0, 1] = np.nan; arr [1, 0] = np.nan
>>> normalizer (arr )
array([[4.15931313e-01,            nan],
       [           nan, 1.00000000e+00],
       [0.00000000e+00, 9.34323403e-06]])
>>> normalizer (arr , method ='min-max')
array([[ 1., nan],
       [nan,  1.],
       [ 0.,  0.]])
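The outputs above are consistent with a naive normalization that uses the global minimum and maximum while ignoring NaN, and with a min-max normalization applied column by column. The sketch below reproduces both with NumPy; `normalize` and its `per_column` flag are assumptions made for illustration, not the watex API.

```python
import numpy as np

def normalize(arr, per_column=False):
    """Scale values to [0, 1], ignoring NaN when looking for min and max."""
    arr = np.asarray(arr, dtype=float)
    axis = 0 if per_column else None   # None: global min/max over the array
    lo = np.nanmin(arr, axis=axis)
    hi = np.nanmax(arr, axis=axis)
    return (arr - lo) / (hi - lo)

arr = np.array([[0.49671415, -0.1382643],
                [0.64768854, 1.52302986],
                [-0.23415337, -0.23413696]])
naive = normalize(arr)                       # global scaling
by_column = normalize(arr, per_column=True)  # scaling along axis 0

arr_nan = arr.copy()
arr_nan[0, 1] = np.nan
with_nan = normalize(arr_nan)                # NaN stays NaN, others unchanged
print(naive)
print(by_column)
```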
- watex.utils.ohmicArea(data=None, search=45.0, sum=False, objective='ohmS', **kws)[source]#
Compute the ohmic-area from the Vertical Electrical Sounding data collected in the exploration area.
- Parameters:
data (pandas DataFrame) – Contains the depth measurement AB from the current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (float) – The depth in meters from which one expects to find a fracture zone outside of pollution. Indeed, the search parameter is used to speculate about the expected groundwater in the fractured rocks under the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45m. So the search can be specified via the average water inrush value.
objective (str) – Type of operation to output. By default, the function outputs the value of the pseudo-area in \(\Omega .m^2\). However, for plotting purposes, setting the argument to view alternatively outputs the X and Y values, recomputed and projected, as well as the X and Y values of the expected fractured zone, where X is the AB dipole spacing when imaging to the depth and Y is the computed apparent resistivity.
kws (dict) – Additional keyword arguments from VES data operations. See watex.utils.exmath.vesDataOperator() for further details.
- Returns:
- List of two tuples:
- Tuple(ohmS, error, roots):
- ohmS is the computed pseudo-area of the expected fractured zone.
- error is the integration error.
- roots are the integration boundaries of the expected fractured
zone, where the basement rock curve lies above the resistivity transform function. At these points the two curves intersect, so their difference is null.
- Tuple (XY, fit XY, XYohmSarea):
- XY is the ndarray(nvalues, 2) of the processed AB dipole
spacing and apparent resistivity (rhoa) values.
- fit XY is the fitted ndarray(nvalues, 2) used to redraw the
dummy resistivity transform function.
- XYohmSarea is the ndarray(nvalues, 2) of the dipole spacing and
resistivity values of the expected fracture zone.
- Raises:
- VESError
If the search is greater than or equal to the maximum investigation depth in meters.
Notes
The ohmS value calculated from the pseudo-area is a fully data-driven parameter used to evaluate a pseudo-area of the fracture zone from the depth where the basement rock is supposed to start. Usually, when exploring deeper using the Vertical Electrical Sounding, we are looking for groundwater in the fractured rock that is outside the anthropic pollution (Biemi, 1992). Since VES is an indirect method, we cannot ascertain whether the presumed fractured rock contains water. However, we assume that the fracture zone could exist and should contain groundwater. Mathematically, based on the VES1D model proposed by Koefoed, O. (1976), we consider a function \(\rho_T(l)\), a set of reducing resistivity transform functions to lower the boundary plane at half the current electrode spacing \((l)\). From the sounding curve \(\rho_T(l)\), an imaginary basement rock curve \(b_r(l)\) with a slope of 45° to the horizontal \(h(l)\) is created. A pseudo-area \(S(l)\) is defined by extending the \(b_r(l)\) curve from \(h(l)\) when the sounding curve \(\rho_T(l)\) is below \(b_r(l)\); otherwise \(S(l)\) is null. The computed area is called the ohmic-area \(ohmS\), expressed in \(\Omega .m^2\), and constitutes the expected fractured zone. Thus \(ohmS \neq 0\) confirms the existence of the fracture zone, while \(ohmS = 0\) raises doubts. The equation to determine the parameter is given as:
\[ \begin{aligned} ohmS &= \int_{l_i}^{l_{i+1}} S(l)\,dl \quad \text{s.t.} \\ S(l) &= \begin{cases} b_r(l) - \rho_T(l) & \text{if } b_r(l) > \rho_T(l) \\ 0 & \text{if } b_r(l) \leq \rho_T(l) \end{cases} \\ b_r(l) &= l + h(l), \quad h(l) = \beta \\ \rho_T(l) &= l^2 \int_{0}^{\infty} T_i(\lambda)\, h_1(\lambda l)\, \lambda \, d\lambda \end{aligned} \]
where \(l_i\) and \(l_{i+1}\) solve the equation \(S(l) = 0\); \(l\) is half the current electrode spacing \(AB/2\); \(h_1\) denotes the first-order Bessel function of the first kind; \(\beta\) is the y-axis coordinate of the intercept of \(b_r(l)\) and \(h(l)\); \(T_i(\lambda)\) is the resistivity transform function; \(\lambda\) denotes the integration variable; \(n\) denotes the number of layers; and \(\rho_i\) and \(h_i\) are the resistivity and thickness of the \(i\)-th layer, respectively. Further explanation and clarification of the formula can be found in the paper of Kouadio et al. (2022).
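As a rough numerical illustration of the definition of \(ohmS\), the sketch below integrates \(S(l) = \max(b_r(l) - \rho_T(l), 0)\) with the trapezoidal rule over sampled curves. It is an assumption-laden toy and not the watex routine: the helper name ohmic_area_sketch is hypothetical, the sounding curve is synthetic, and the search depth and root finding used by ohmicArea are left out.

```python
import numpy as np

def ohmic_area_sketch(l, rho_t, beta=0.0):
    """Integrate S(l) = max(b_r(l) - rho_T(l), 0) over the sampled positions l."""
    l = np.asarray(l, dtype=float)
    rho_t = np.asarray(rho_t, dtype=float)
    b_r = l + beta                        # basement curve with unit slope (45 degrees)
    s = np.clip(b_r - rho_t, 0.0, None)   # S(l) is null where rho_T(l) >= b_r(l)
    # Trapezoidal rule written out explicitly to avoid version-specific NumPy helpers.
    return float(np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(l)))

l = np.linspace(0.0, 100.0, 1001)   # synthetic AB/2 positions in metres
rho_t = np.full_like(l, 50.0)       # synthetic flat sounding curve (assumption)
area = ohmic_area_sketch(l, rho_t)  # analytic value: 50**2 / 2 = 1250
```

With this flat synthetic curve the basement line exceeds the sounding curve only for \(l > 50\), so the integral reduces to the area of a triangle.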
Examples
>>> from watex.utils.exmath import ohmicArea >>> from watex.utils.coreutils import vesSelector >>> data = vesSelector (f= 'data/ves/ves_gbalo.xlsx') >>> (ohmS, err, roots), *_ = ohmicArea(data = data, search =45, sum =True ) ... (13.46012197818152, array([5.8131967e-12]), array([45. , 98.07307307])) # pseudo-area is computed between the spacing point AB =[45, 98] depth. >>> _, (XY.shape, XYfit.shape, XYohms_area.shape) = ohmicArea( AB= data.AB, rhoa =data.resistivity, search =45, objective ='plot') ... ((26, 2), (1000, 2), (8, 2))
- watex.utils.plotAnomaly(erp, cz=None, station=None, fig_size=(10, 4), fig_dpi=300, savefig=None, show_fig_title=True, style='seaborn', fig_title_kws=Ellipsis, czkws=Ellipsis, legkws=Ellipsis, how='py', **kws)[source]#
Plot the whole Electrical Resistivity Profiling line and selected conductive zone.
The conductive zone can be supplied manually as a subset of the erp or by specifying the station expected for the drilling location, for instance
S07 for the seventh station. Furthermore, for automatic detection, one should set the station argument to auto. However, it is recommended to provide the cz or the station to have full control. The conductive zone is overlaid on the whole Electrical Resistivity Profiling survey. The user can customize the cz plot by passing additional Matplotlib pyplot keyword arguments through czkws.
- Parameters:
- erp: array_like 1d
the Electrical Resistivity Profiling survey line. The line is an array of resistivity values. Note that if a dataframe is passed, be sure that the frame matches the DC resistivity data (ERP), otherwise an error occurs. At a minimum, the frame columns must include the resistivity and the stations.
- cz: array_like 1d
the selected conductive zone. If
None, only the erp is displayed. Note that cz is a subset of the erp array.
- station: str, optional
The station location given as a string (e.g.
station="S10") or as a station number (indexing; e.g. station=10). If the value is set to "auto", the station is found automatically and the cz is fetched as well.
- fig_size: tuple, default =(10, 4)
Tuple value of figure size. Refer to the web resources Matplotlib figure.
- fig_dpi: int , default=300,
figure resolution in dots per inch. Refer to Matplotlib figure.
- savefig: str, optional,
save the figure. Refer to Matplotlib figure.
- show_fig_title: bool, default =True
display the title of the figure.
- fig_title_kws: dict,
Keyword arguments of the figure suptitle. Refer to Matplotlib figsuptitle.
- style: str - the style for customizing visualization. For instance, to
get the first seven available styles in pyplot, one can run the script below:
plt.style.available[:7]
Further details can be found in the web resources (e.g. GeeksforGeeks).
- how: str, default='py'
By default (
how='py'), stations are named following Python indexing, i.e. counting starts from station 00 (S00). Any other value starts the station naming from 1.
- czkws: dict,
additional Matplotlib pyplot keyword arguments to customize the cz plot.
- legkws: dict,
Additional Matplotlib legend keyword arguments.
- kws: dict,
additional Matplotlib pyplot keyword arguments to customize the erp plot.
See also
watex.erpSmartDetector
Detect the conductive zone by applying the constraints. Set
view=True for constraint visualization.
Examples
>>> import numpy as np >>> from watex.utils import plotAnomaly, defineConductiveZone >>> test_array = np.abs (np.random.randn (10)) *1e2 >>> selected_cz ,*_ = defineConductiveZone(test_array, 7) >>> plotAnomaly(test_array, selected_cz ) >>> plotAnomaly(test_array, selected_cz , station= 5) >>> plotAnomaly(test_array, station= 's02') >>> plotAnomaly(test_array)
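The station naming controlled by the how parameter can be pictured with a small sketch. The helper station_names_sketch is hypothetical and is not part of watex; the two-digit zero padding is an assumption based on the S00 and S07 labels quoted in the description.

```python
def station_names_sketch(n_stations, how="py"):
    """Build station labels; how='py' counts from S00, anything else from S01."""
    start = 0 if how == "py" else 1
    return [f"S{i:02d}" for i in range(start, start + n_stations)]

names_py = station_names_sketch(3)                # Python-style indexing
names_one = station_names_sketch(3, how="other")  # naming starts from 1
```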
- watex.utils.plotOhmicArea(data=None, search=45.0, pre_computed=False, xy=None, xyf=None, xyarea=None, colors=None, fbtw=False, **plot_kws)[source]#
Plot the ohmic-area of the Vertical Electrical Sounding data.
- Parameters:
data (*) – contains the depth measurement AB from current electrodes, the potential electrodes MN and the collected apparent resistivities.
search (*) – The depth in meters from which one expects to find a fracture zone outside of pollution. The search parameter is used to speculate about the expected groundwater in the fractured rocks below the average level of water inrush in a specific area. For instance, in the Bagoue region, the average depth of water inrush is around 45m, so search can be set to the average water inrush value.
pre_computed (bool, default=False) – If True, compute the ohmic-area parameters. If False, the ohmic-area arguments must be passed to xy, xyf and xyarea, otherwise an error is raised.
xy (array-like of shape (n_AB, 2)) – Array-like of the sanitized depth measurement AB from the n_AB current electrodes. See vesDataOperator().
xyf (array-like of shape (n_fit_samples, 2)) – Array-like of the fitted samples, i.e. the number of points for fitting the sounding resistivity values from the surface through the total depth. The fitted rhoa shows a smooth curve. The default number of points is 1000.
xyarea (array-like of shape (n_area, 2)) – Array-like of the resistivity positions of the depth measurement AB where the fractured zone is found.
fbtw (bool, default=False) – If True, fill the computed fractured zone using the parameters computed from xyf and xyarea.
kws (dict) – Additional keyword arguments from Vertical Electrical Sounding data operations. See watex.utils.exmath.vesDataOperator() for further details.
Notes
The first and second columns of xy, xyf and xyarea are the AB/2 positions and their corresponding resistivity values.
Examples
>>> from watex.datasets import load_semien >>> from watex.utils.exmath import plotOhmicArea >>> ves_data = load_semien () >>> plotOhmicArea (ves_data)
- watex.utils.plot_clusters(n_clusters, X, y_pred, cluster_centers=None, savefig=None)[source]#
Visualize the cluster that k-means identified in the dataset
- Parameters:
n_clusters – int, number of clusters to visualize
X – NDArray, data containing the features, expected to be two-dimensional
y_pred – array-like, array containing the predicted class labels.
cluster_centers – NDArray containing the coordinates of the centroids or the similar points with continuous features.
- Example:
>>> from watex.exlib.sklearn import KMeans, MinMaxScaler >>> from watex.utils.plotutils import plot_clusters >>> from watex.datasets import fetch_data >>> h= fetch_data('hlogs').frame >>> # collect two features 'resistivity' and gamma-gamma logging values >>> h2 = h[['resistivity', 'gamma_gamma']] >>> km = KMeans (n_clusters =3 , init= 'random' ) >>> # scaled the data with MinMax scaler i.e. between ( 0-1) >>> h2_scaled = MinMaxScaler().fit_transform(h2) >>> ykm = km.fit_predict(h2_scaled ) >>> plot_clusters (3 , h2_scaled, ykm , km.cluster_centers_ )
- watex.utils.plot_confidence_in(z_or_edis_obj_list, /, tensor='res', view='1d', drop_outliers=True, distance=None, c_line=False, view_ci=True, figsize=(6, 2), fontsize=4.0, dpi=300.0, top_label='Stations', rotate_xlabel=90.0, fbtw=True, savefig=None, **plot_kws)[source]#
Plot data confidence from tensor errors.
The default tensor for evaluating the data confidence is the resistivity at TE mode ('xy').
Checking confidence in the data before starting the actual processing is worthwhile. In areas with complex terrain, where high topography adds to interference noise, signals are weak or missing, especially in AMT surveys. The most common technique is to eliminate the bad frequencies and interpolate the remaining ones. However, the tricks for eliminating frequencies differ from one author to another. Here, the data confidence indicates which frequencies to eliminate (at which stations/sites) and which ones are still recoverable using the tensor recovery strategy.
The plot implements three levels of confidence:
High confidence: \(conf. \geq 0.95\) values greater than 95%
Soft confidence: \(0.5 \leq conf. < 0.95\). The data in this confidence range can be beneficial for tensor recovery to restore the weak and missing signals.
Bad confidence: \(conf. <0.5\). Data in this interval must be deleted.
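The three levels listed above translate directly into two threshold tests. The sketch below is only an illustration; the function name confidence_level_sketch and the returned labels are hypothetical and not part of watex.

```python
def confidence_level_sketch(conf):
    """Map a confidence value in [0, 1] to one of the three levels listed above."""
    if conf >= 0.95:
        return "high"   # conf >= 0.95
    if conf >= 0.5:
        return "soft"   # 0.5 <= conf < 0.95, candidate for tensor recovery
    return "bad"        # conf < 0.5, to be deleted

levels = [confidence_level_sketch(c) for c in (0.99, 0.95, 0.7, 0.5, 0.2)]
```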
- Parameters:
z_or_edis_obj_list (list of watex.edi.Edi or watex.externals.z.Z) – A collection of EDI or impedance tensor objects.
tensor (str, default='res') – Tensor name. Can be ['resistivity'|'phase'|'z'|'frequency']
view (str, default='1d') – Type of plot. Can be ['1D'|'2D']
drop_outliers (bool, default=True) – Suppress the outliers in the data if True.
distance (float, optional) – Distance between stations/sites
fontsize (float, default=4.) – label font size.
figsize (Tuple, default=(6, 2)) – Figure size.
c_line (bool, default=False) – Display the confidence line in the two-dimensional view.
dpi (int, default=300) – Image resolution in dots per inch
rotate_xlabel (float, default=90.) – Angle to rotate the stations/sites labels
top_label (str, default='Stations') – Labels the sites either using the survey name.
view_ci (bool, default=True) – Show the marker of the confidence interval.
fbtw (bool, default=True) – Fill between the confidence interval.
plot_kws (dict) – Additional keywords passed to
plot()
See also
watex.methods.Processing.zrestore
For more details about the tensor recovery function.
Examples
>>> from watex.utils.exmath import plot_confidence_in >>> from watex.datasets import fetch_data >>> emobj = fetch_data ( 'huayuan', samples = 25, clear_cache =True, key='raw').emo >>> plot_confidence_in (emobj.ediObjs_ , distance =20 , view ='2d', figsize =(6, 2) ) >>> plot_confidence_in (emobj.ediObjs_ , distance =20 , view ='1d', figsize =(6, 3), fontsize =5, )
- watex.utils.plot_confusion_matrices(clfs, Xt, yt, annot=True, pkg=None, normalize='true', sample_weight=None, encoder=None, fig_size=(22, 6), savefig=None, subplot_kws=None, **scorer_kws)[source]#
Plot multiple model confusion matrices inline using either scikit-learn or 'yellowbrick'.
- Parameters:
clfs (list of classifier estimators) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. Note that the classifier must be fitted beforehand.
Xt (ndarray or DataFrame of shape (M X N)) – A matrix of n instances with m features. Preferably, matrix represents the test data for error evaluation.
yt (ndarray of shape (M, ) or Series of length (M, )) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.
pkg (str, optional , default ='sklearn') – the library to handle the plot. It could be 'yellowbrick'. The basic confusion matrix is handled by the scikit-learn package.
normalize ({'true', 'pred', 'all'}, default='true') – Normalizes the confusion matrix over the true (rows) or predicted (columns) conditions, or over the whole population. If None, the confusion matrix will not be normalized.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to
fit()orscore(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.
- annot: bool, default=True
Annotate the number of samples (right or wrong predictions) in the plot. Set
False to mute the display.
- fig_size: tuple (width, height), default =(22, 6)
the matplotlib figure size given as a tuple of width and height
- savefig: str, default =None,
the path to save the figures. Argument is passed to the matplotlib.Figure class.
Examples
>>> import matplotlib.pyplot as plt >>> plt.style.use ('classic') >>> from watex.datasets import fetch_data >>> from watex.exlib.sklearn import train_test_split >>> from watex.models.premodels import p >>> from watex.utils.plotutils import plot_confusion_matrices >>> # split the data . Note that fetch_data output X and y >>> X, Xt, y, yt = train_test_split (* fetch_data ('bagoue analysed'), test_size =.25 ) >>> # compose the models >>> # from RBF, and poly >>> models =[ p.SVM.rbf.best_estimator_, p.LogisticRegression.best_estimator_, p.RandomForest.best_estimator_ ] >>> models [SVC(C=2.0, coef0=0, degree=1, gamma=0.125), LogisticRegression(), RandomForestClassifier(criterion='entropy', max_depth=16, n_estimators=350)] >>> # now fit all estimators >>> fitted_models = [model.fit(X, y) for model in models ] >>> plot_confusion_matrices(fitted_models , Xt, yt)
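The effect of the normalize options can be checked by hand on a small matrix. The NumPy sketch below is illustrative only; normalize_confusion_sketch is a hypothetical helper, not the scikit-learn or watex code, and it assumes the usual convention of true labels on rows and predicted labels on columns.

```python
import numpy as np

def normalize_confusion_sketch(cm, normalize=None):
    """Normalize a confusion matrix over rows, columns or the whole population."""
    cm = np.asarray(cm, dtype=float)
    if normalize == "true":   # each row (true class) sums to 1
        return cm / cm.sum(axis=1, keepdims=True)
    if normalize == "pred":   # each column (predicted class) sums to 1
        return cm / cm.sum(axis=0, keepdims=True)
    if normalize == "all":    # all entries sum to 1
        return cm / cm.sum()
    return cm                 # None: raw counts

cm = [[2, 1], [1, 4]]
by_true = normalize_confusion_sketch(cm, "true")
by_pred = normalize_confusion_sketch(cm, "pred")
by_all = normalize_confusion_sketch(cm, "all")
```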
- watex.utils.plot_cost_vs_epochs(regs, *, fig_size=(10, 4), marker='o', savefig=None, **kws)[source]#
Plot the cost against the number of epochs for different learning rates.
- Parameters:
regs (Callable, single or list of regression estimators) – Estimator should be already fitted.
fig_size (tuple , default is (10, 4)) – the size of figure
kws (dict) – Additional keyword arguments passed to
matplotlib.pyplot.plot()
- Returns:
ax
- Return type:
Matplotlib.pyplot axes objects
Examples
>>> from watex.datasets import load_iris >>> from watex.base import AdalineGradientDescent >>> from watex.utils.plotutils import plot_cost_vs_epochs >>> X, y = load_iris (return_X_y= True ) >>> ada1 = AdalineGradientDescent (n_iter= 10 , eta= .01 ).fit(X, y) >>> ada2 = AdalineGradientDescent (n_iter=10 , eta =.0001 ).fit(X, y) >>> plot_cost_vs_epochs (regs = [ada1, ada2] )
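To make the plotted quantity concrete, the sketch below records a sum-of-squared-errors cost at each epoch of a batch gradient-descent Adaline for two learning rates. It is a minimal stand-alone illustration on toy data; adaline_costs_sketch is a hypothetical function and is not claimed to match the internals of watex.base.AdalineGradientDescent.

```python
import numpy as np

def adaline_costs_sketch(X, y, eta=0.1, n_iter=10):
    """Return the cost (half the sum of squared errors) recorded at each epoch."""
    w = np.zeros(X.shape[1])   # feature weights
    b = 0.0                    # bias term
    costs = []
    for _ in range(n_iter):
        errors = y - (X @ w + b)                         # linear activation residuals
        costs.append(0.5 * float(np.sum(errors ** 2)))   # cost before the update
        w += eta * (X.T @ errors)                        # batch gradient step
        b += eta * errors.sum()
    return costs

X = np.array([[-1.0], [-0.5], [0.5], [1.0]])   # toy, centred feature
y = np.array([-1.0, -1.0, 1.0, 1.0])
costs_fast = adaline_costs_sketch(X, y, eta=0.1)
costs_slow = adaline_costs_sketch(X, y, eta=0.001)
```

Plotting both lists against the epoch index gives the kind of comparison the function draws: the larger learning rate converges faster on this toy problem, while the smaller one barely moves in ten epochs.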
- watex.utils.plot_elbow(X, n_clusters, n_init=10, max_iter=300, random_state=42, fig_size=(10, 4), marker='o', savefig=None, **kwd)[source]#
Plot the elbow method to find the optimal number of clusters, k, for a given dataset.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int, default=8) – The number of clusters to form as well as the number of centroids to generate.
n_init (int, default=10) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=1e-4) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=42) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
savefig (str, default =None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
marker (str, default='o') – cluster marker point.
kwd (dict) – Additional keyword arguments passed to
matplotlib.pyplot.plot()
- Returns:
ax
- Return type:
Matplotlib.pyplot axes objects
Example
>>> from watex.datasets import load_hlogs >>> from watex.utils.plotutils import plot_elbow >>> # get the only resistivy and gamma-gama values for example >>> res_gamma = load_hlogs ().frame[['resistivity', 'gamma_gamma']] >>> plot_elbow(res_gamma, n_clusters=11)
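An elbow curve usually plots the inertia (the within-cluster sum of squared distances to the centroids) against the number of clusters; it is assumed here that plot_elbow follows that convention. The sketch below computes the inertia for a given labelling with NumPy only. The helper inertia_sketch is hypothetical and the labels are set by hand rather than produced by k-means.

```python
import numpy as np

def inertia_sketch(X, labels):
    """Sum of squared distances of each sample to the centroid of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += float(np.sum((members - centroid) ** 2))
    return total

X = np.array([[0.0], [2.0], [10.0], [12.0]])
one_cluster = inertia_sketch(X, [0, 0, 0, 0])    # single centroid at 6
two_clusters = inertia_sketch(X, [0, 0, 1, 1])   # centroids at 1 and 11
```

The sharp drop from one to two clusters on this toy data is the kind of bend the elbow plot is meant to reveal.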
- watex.utils.plot_learning_curves(models, X, y, *, cv=None, train_sizes=None, baseline_score=0.4, convergence_line=True, fig_size=(20, 6), sns_style=None, savefig=None, set_legend=True, subplot_kws=None, **kws)[source]#
Horizontal visualization of multiple models' learning curves.
Determines cross-validated training and test scores for different training set sizes.
- Parameters:
models (list of estimators) – Estimator instances, fitted or not, that implement fit and predict methods and will be cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
cv (int, cross-validation generator or an iterable, default=None) –
- Determines the cross-validation splitting strategy.
Possible inputs for cv are:
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and
y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
Refer to the User Guide for the various cross-validation strategies that can be used here.
cv default value if None changed from 3-fold to 4-fold.
- train_sizes: array-like of shape (n_ticks,), default=np.linspace(0.1, 1, 50)
Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
baseline_score (float, default=.4) – base score to start counting in score y-axis (score)
convergence_line (bool, default=True) – whether to display the convergence line that indicates the level of bias between the training and validation curves.
fig_size (tuple (width, height), default =(20, 6)) – the matplotlib figure size given as a tuple of width and height
sns_style (str, optional) – the seaborn style.
set_legend (bool, default=True) – display the legend in each figure. Note the default location of the legend is 'best' from legend()
subplot_kws (dict, default is dict(left=0.0625, right = 0.95, wspace = 0.1)) – the subplot keyword arguments passed to matplotlib.subplots_adjust()
kws (dict) – keyword arguments passed to
sklearn.model_selection.learning_curve()
Examples
-> plot via a metaestimator already cross-validated.
>>> from watex.models.premodels import p >>> from watex.datasets import fetch_data >>> from watex.utils.plotutils import plot_learning_curves >>> X, y = fetch_data ('bagoue prepared') # yields a sparse matrix >>> # let collect 04 estimators already cross-validated from SVMs >>> models = [ p.SVM.linear , p.SVM.rbf , p.SVM.sigmoid , p.SVM.poly ] >>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')
-> plot with multiple models not cross-validated yet.
>>> from watex.exlib.sklearn import (LogisticRegression, RandomForestClassifier, SVC , KNeighborsClassifier ) >>> models =[LogisticRegression(), RandomForestClassifier(), SVC() , KNeighborsClassifier() ] >>> plot_learning_curves (models, X, y, cv=4, sns_style = 'darkgrid')
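The float-versus-integer rule described for train_sizes can be illustrated with a short sketch. The helper resolve_train_sizes_sketch and the n_max argument are hypothetical; the actual resolution is done inside sklearn.model_selection.learning_curve, whose exact rounding may differ.

```python
def resolve_train_sizes_sketch(train_sizes, n_max):
    """Turn fractions in (0, 1] into absolute sizes; keep integers as given."""
    resolved = []
    for size in train_sizes:
        if isinstance(size, float):
            if not 0.0 < size <= 1.0:
                raise ValueError("float train sizes must lie within (0, 1]")
            resolved.append(int(size * n_max))   # fraction of the largest training set
        else:
            resolved.append(int(size))           # absolute number of samples
    return resolved

sizes_from_fractions = resolve_train_sizes_sketch([0.1, 0.5, 1.0], n_max=200)
sizes_from_counts = resolve_train_sizes_sketch([10, 50], n_max=200)
```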
- watex.utils.plot_logging(X, y=None, zname=None, tname=None, labels=None, impute_nan=True, normalize=False, log10=False, columns_to_skip=None, pattern=None, strategy='mean', posiy=None, fill_value=None, fig_size=(16, 7), fig_dpi=300, colors=None, cs4_colors=False, sns_style=False, savefig=None, draw_spines=False, seed=None, verbose=0, **kws)[source]#
Plot logging data
The plot expects a collection of logging data, where each log is a column of data collected in the field. Note that any kind of data can also be plotted, provided that it contains numerical values. The function does not accept categorical data; if categorical data are given, they are discarded.
- Parameters:
X (Dataframe of shape (n_samples, n_features)) – where n_samples is the number of data points, expected to be collected at different depths, and n_features is the number of columns (features) to plot. Note that X must include the depth column. If not given, a relative depth is created according to the number of samples in X.
y (array-like or series of shape (n_samples,), optional) – Target relative to X for classification or regression. If given, by default the target plot is located at the last position. However, with the posiy argument, the target plot can be moved to the desired position.
zname (str, default='depth' or 'None') – The name of the depth column in X. If the depth column is not named 'depth', another column name that matches the depth can be given so that the function sets this column aside as the depth column for plotting. If set to None, zname holds the name depth and assumes that depth exists in the X columns.
tname (str, optional) – name of the target. This renames the target if y is given as a pandas series, or names the target if y is given as an array-like. If not provided, the name of the target series is used if y is not None.
normalize (bool, default = False) – Normalize all the data to range between (0, 1), except the depth.
labels (list or str, optional) – If labels are given, they should fit the number of columns. The given labels replace the old columns in X and appear in the plot. This is useful to change the column labels in the dataframe to new labels that best describe the plot, for instance by including the units in the new labels. Note that if the labels do not match the number of old columns in X, a warning is issued to the user and no operation is performed.
impute_nan (bool, default=True) – Replace the NaN values in the dataframe. Note that the default behaviour for replacing NaN is the mean. However, if the fill_value argument is provided, the latter is used to replace 'NaN' in X.
log10 (bool, default=False) – Convert values to log10. This can be useful when using logarithmic data. However, not all data support this operation, for instance negative data. In that case, the columns_to_skip argument is useful to skip those columns when converting values to log10.
columns_to_skip (list or str, optional) –
- Columns to skip when performing some operation like 'log10'. These
columns will not be affected by the 'log10' operation. Note that
columns_to_skip can also be given as a literal string. In that case, the pattern is needed to parse the columns into a list of strings.
pattern (str, default = '[#&*@!,;\s]\s*') –
Regex pattern to parse the columns_to_skip into a list of strings where each item is a column name, especially when the latter is given as a literal text string. For instance:
columns_to_skip='depth_top, thickness, sp, gamma_gamma' -> ['depth_top', 'thickness', 'sp', 'gamma_gamma']
by using the default pattern. To have full control of the column splitting, it is recommended to provide your own pattern to avoid wrong parsing that can lead to an error.
strategy (str, default='mean') –
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
fill_value (str or numerical value, optional) – When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types. If not given and impute_nan is True, the mean strategy is used instead.
posiy (int, optional) – the position to place the target plot y. By default the target plot, if given, is located at the last position behind the logging plots.
colors (str, list of Matplotlib.colors map, optional) –
The colors for plotting each column of X except the depth. If not given, default colors are auto-generated.
If colors is a string that includes 'cs4' or 'xkcd', matplotlib.colors.CSS4_COLORS or matplotlib.colors.XKCD_COLORS is used instead. In addition, if 'cs4' or 'xkcd' is suffixed by a colon and an integer value, like cs4:4 or xkcd:4, the CSS4 or XKCD colors are used starting from index 4.
New in version 0.2.3: matplotlib.colors.CSS4_COLORS or matplotlib.colors.XKCD_COLORS can be used by setting colors to 'cs4' or 'xkcd'. To reproduce the same CSS4 or XKCD colors, set the seed parameter to a specific value.
draw_spines (bool, tuple (-lim, +lim), default= False) – Only draw spines between the y-ticks. -lim and +lim are the lower and upper bounds, i.e. the range over which to draw the spines on the y-axis.
fig_size (tuple (width, height), default =(16, 7)) – the matplotlib figure size given as a tuple of width and height
fig_dpi (float or 'figure', default: rcParams["savefig.dpi"] (default: 'figure')) – The resolution in dots per inch. If 'figure', use the figure's dpi value.
savefig (str, default =None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
sns_style (str, optional) – the seaborn style.
seed (int, optional) –
Allow to reproduce the matplotlib.colors.CSS4_COLORS if colors is set to cs4.
New in version 0.2.3.
verbose (int, default=0) – Output the number of categorical features dropped from the dataframe.
kws (dict) – Additional keyword arguments passed to
matplotlib.axes.Axes.plot()
Examples
>>> from watex.datasets import load_hlogs >>> from watex.utils.plotutils import plot_logging >>> X0, y = load_hlogs (as_frame =True) # get the frames rather than object >>> # plot the default logging with Normalize =True >>> plot_logging (X0, normalize =True) >>> # Include the target in the plot >>> plot_logging ( X0, y = y.kp , posiy = 0, columns_to_skip=['thickness', 'sp'], log10 =True, ) >>> # draw spines and limit plot from (0, 700) m depth >>> plot_logging (X0 , y= y.kp, draw_spines =(0, 700) )
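How a literal columns_to_skip string could be split with the pattern parameter is sketched below with the standard re module. This is an illustration under assumptions: the default pattern is taken to be [#&*@!,;\s]\s* (with the whitespace escapes restored), and parse_columns_sketch is a hypothetical helper rather than the parser used by plot_logging.

```python
import re

# Assumed default: one separator character followed by optional whitespace.
DEFAULT_PATTERN = r"[#&*@!,;\s]\s*"

def parse_columns_sketch(text, pattern=DEFAULT_PATTERN):
    """Split a literal string of column names and drop empty items."""
    return [item for item in re.split(pattern, text) if item]

cols = parse_columns_sketch("depth_top, thickness, sp, gamma_gamma")
mixed = parse_columns_sketch("a;b  c")   # other separators work the same way
```

Column names that themselves contain one of the separator characters would be split as well, which is why supplying a custom pattern is recommended in that case.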
- watex.utils.plot_mlxtend_heatmap(df, columns=None, savefig=None, **kws)[source]#
Plot correlation matrix array as a heat map
- Parameters:
df – pandas dataframe
columns – list of features. If given, only the dataframe with those features is considered.
kws – additional keyword arguments passed to
mlxtend.plotting.heatmap()
- Returns:
mlxtend.plotting.heatmap() axes object
- Example:
>>> from watex.datasets import load_hlogs >>> from watex.utils.plotutils import plot_mlxtend_heatmap >>> h=load_hlogs() >>> features = ['gamma_gamma', 'sp', 'natural_gamma', 'resistivity'] >>> plot_mlxtend_heatmap (h.frame , columns =features, cmap ='PuOr')
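The correlation matrix that such a heat map displays can be computed with NumPy alone. The sketch below uses synthetic data and is only an assumption about the kind of matrix being plotted (Pearson correlation between columns); it does not call mlxtend or watex.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
# Three synthetic features: the second is an exact linear function of the first.
data = np.column_stack([x, 2.0 * x + 1.0, rng.normal(size=50)])

# rowvar=False treats each column as a variable (feature).
corr = np.corrcoef(data, rowvar=False)
```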
- watex.utils.plot_mlxtend_matrix(df, columns=None, fig_size=(10, 8), alpha=0.5, savefig=None)[source]#
Visualize the pairwise correlation between the different features in the dataset in one place.
- Parameters:
df – pandas dataframe
columns – list of features. If given, only the dataframe with those features is considered.
fig_size – tuple of int (width, height). Size of the displayed figure.
alpha – figure transparency, default is
.5.
- Returns:
mlxtend.plotting.scatterplotmatrix() axes object
- Example:
>>> from watex.datasets import load_hlogs >>> from watex.utils.plotutils import plot_mlxtend_matrix >>> import pandas as pd >>> import numpy as np >>> h=load_hlogs() >>> features = ['gamma_gamma', 'natural_gamma', 'resistivity'] >>> data = pd.DataFrame ( np.log10 (h.frame[features]), columns =features ) >>> plot_mlxtend_matrix (data, columns =features)
- watex.utils.plot_naive_dendrogram(X, *ybounds, fig_size=(12, 5), savefig=None, **kws)[source]#
Quick plot dendrogram using the ward clustering function from Scipy.
- Parameters:
X – ndarray of shape (n_samples, n_features) Array of features
ybounds – int, integer values to draw horizontal cluster lines that indicate the number of clusters.
fig_size – tuple (width, height), default =(12,5) the matplotlib figure size given as a tuple of width and height
kws – dict, additional keyword arguments passed to
scipy.cluster.hierarchy.dendrogram()
- Examples:
>>> from watex.datasets import fetch_data >>> from watex.utils.plotutils import plot_naive_dendrogram >>> X, _= fetch_data('Bagoue analysed') # data is already scaled >>> # get the two features 'power' and 'magnitude' >>> data = X[['power', 'magnitude']] >>> plot_naive_dendrogram(data ) >>> # add the horizontal line of the cluster at ybounds = (20 , 20 ) >>> # for a single cluster (cluser 1) >>> plot_naive_dendrogram(data , 20, 20 )
- watex.utils.plot_pca_components(components, *, feature_names=None, cmap='viridis', savefig=None, **kws)[source]#
Visualize the coefficients of a principal component analysis (PCA) as a heatmap.
- Parameters:
components – ndarray of shape (n_components, n_features) or PCA object. Array of the PCA components or an object from watex.analysis.dimensionality.nPCA. If the object is given, it is not necessary to set feature_names.
feature_names – list or str, optional. List of the feature names to locate in the map. feature_names and the eigenvectors must have the same length. If a PCA object is passed as components, there is no need to set feature_names; the feature names are retrieved automatically.
cmap – str, default='viridis'. The matplotlib color map for the matshow visualization.
kws – dict. Additional keyword arguments passed to matplotlib.pyplot.matshow.
- Examples:
(1)-> with PCA object
>>> from watex.datasets import fetch_data
>>> from watex.utils.plotutils import plot_pca_components
>>> from watex.analysis import nPCA
>>> X, _ = fetch_data('bagoue pca')
>>> pca = nPCA(X, n_components=2, return_X=False)  # to return object
>>> plot_pca_components(pca)
(2)-> use the components and features individually
>>> components = pca.components_
>>> features = pca.feature_names_in_
>>> plot_pca_components(components, feature_names=features, cmap='jet_r')
- watex.utils.plot_profiling(erp, /, station=None, cz=None, *, style='classic', fig_size=(10, 4), cz_plot_kws=None, marker_kws=None, savefig=None, **plot_kws)[source]#
Visualizes the resistivity profiling of ERP data.
The function can overlay the selected conductive zone on the ERP if cz is given.
- Parameters:
erp (array_like 1d) –
The electrical resistivity profiling array. If a dataframe is passed, it must include a resistivity column.
Changed in version 0.2.1: Can henceforth accept a dataframe that contains resistivity values.
station (str, int, optional) –
The station used to visualize the conductive zone in the ERP profile. This is useful if cz is not given. When station='auto', the function automatically detects the best conductive zone, assuming the very low resistivity in the profile, and plots it. To get the expected results, the station position or cz must be given.
New in version 0.2.1: Can henceforth pass the station to plot the conductive zone.
cz (array_like, optional) – The selected conductive zone. If None, cz is not plotted.
style (str, default='classic') – Matplotlib plotting style.
fig_size (tuple, default=(10, 4)) – Matplotlib figure size.
marker_kws (dict, default={'marker': 'o', 'c': '#9EB3DD'}) – The dictionary to customize the markers in the plot.
cz_plot_kws (dict, default={'ls': '-', 'c': '#0A4CEE', 'lw': 2}) – The dictionary to customize the conductive zone in the plot.
savefig (str, optional) – Save figure name. The default resolution in dots per inch is 300.
plot_kws (dict) – Additional keyword arguments passed to the matplotlib.pyplot.plot() function.
- Returns:
ax – Return axis
- Return type:
Matplotlib.pyplot.Axis
Examples
>>> from watex.datasets import make_erp
>>> from watex.utils.plotutils import plot_profiling
>>> d = make_erp(n_stations=56, seed=42)
>>> plot_profiling(d.resistivity)
>>> # read the frame and get the resistivity values
>>> plot_profiling(d.frame, station='s07')
<AxesSubplot:xlabel='Stations', ylabel='App.resistivity ($\Omega.m$)'>
- watex.utils.plot_regularization_path(X, y, c_range=(-4.0, 6.0), fig_size=(8, 5), sns_style=False, savefig=None, **kws)[source]#
Plot the regularization path from Logit / LogisticRegression.
Vary the regularization strength and plot the weight coefficients of the different features for each regularization strength.
Note that it is recommended to standardize the data first.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features. X is expected to be standardized.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
c_range (list or tuple [start, stop]) – Regularization strength range. It runs from the strongest (start) to the weakest (stop) regularization. Note that 'C' is the inverse of the Logistic Regression regularization parameter \(\lambda\).
fig_size (tuple (width, height), default=(8, 5)) – the matplotlib figure size given as a tuple of width and height.
savefig (str, default=None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
sns_style (str, optional) – the seaborn style.
kws (dict) – Additional keyword arguments passed to sklearn.linear_model.LogisticRegression.
Examples
>>> from watex.utils.plotutils import plot_regularization_path
>>> from watex.datasets import fetch_data
>>> X, y = fetch_data('bagoue analysed')  # data already standardized
>>> plot_regularization_path(X, y)
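How c_range relates to the regularization strength can be sketched in plain Python. This is only an illustration: it assumes the exponents in c_range are stepped by 1 and converted with C = 10**c, which is a common convention but is not confirmed by the text above. The helper name c_values is hypothetical and not part of watex.

```python
def c_values(c_range=(-4.0, 6.0)):
    """Return (C, lambda) pairs for unit-spaced exponents in [start, stop).

    Assumption: C = 10**c and lambda = 1 / C (C is the inverse of lambda).
    """
    start, stop = c_range
    pairs = []
    c = float(start)
    while c < stop:
        C = 10.0 ** c               # small C means strong regularization
        pairs.append((C, 1.0 / C))  # lambda is the inverse of C
        c += 1.0
    return pairs


pairs = c_values()
for C, lam in pairs:
    print(f"C={C:g}  lambda={lam:g}")
```

With the default range, the first pair has the smallest C (strongest regularization) and the last pair the largest C (weakest regularization), which matches the start-to-stop ordering described for c_range.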
- watex.utils.plot_rf_feature_importances(clf, X=None, y=None, fig_size=(8, 4), savefig=None, n_estimators=500, verbose=0, sns_style=None, **kws)[source]#
Plot feature importances with RandomForest.
- Parameters:
clf (estimator object) – The base estimator from which the transformer is built. This can be either a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default=(8, 4)) – the matplotlib figure size given as a tuple of width and height.
savefig (str, default=None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
sns_style (str, optional) – the seaborn style.
verbose (int, default=0) – print the feature labels with their importance rates.
kws (dict) – Additional keyword arguments passed to sklearn.ensemble.RandomForestClassifier.
Examples
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import RandomForestClassifier
>>> from watex.utils.plotutils import plot_rf_feature_importances
>>> X, y = fetch_data('bagoue analysed')
>>> plot_rf_feature_importances(RandomForestClassifier(), X=X, y=y, sns_style=True)
- watex.utils.plot_roc_curves(clfs, /, X, y, names=Ellipsis, colors=Ellipsis, ncols=3, get_score=False, all=False, ax=None, fig_size=(7, 7), **roc_kws)[source]#
Quick plot of the Receiver Operating Characteristic (ROC) curves of fitted models.
- Parameters:
clfs (list) – list of models for ROC evaluation. Models should be scikit-learn or XGBoost estimators.
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.
y (ndarray or Series of length (n_samples, )) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.
colors (str, list) – Colors to specify each model plot.
ncols (int, default=3) – Number of plots placed in a row before wrapping to the next row. This applies only if all is set to True.
get_score (bool, default=False) –
Append the area under the curve (AUC) to the legend.
New in version 0.2.4.
all (bool, default=False) – if True, plot the ROC curve of each model separately.
names (list) – List of model names. If not given, the raw name of the model is used instead.
roc_kws (dict) – keyword arguments passed to sklearn.metrics.roc_curve().
- Returns:
ax
- Return type:
Axes.Subplot.
Examples
>>> from watex.utils.plotutils import plot_roc_curves
>>> from sklearn.datasets import make_moons
>>> from watex.exlib import (train_test_split, KNeighborsClassifier, SVC, XGBClassifier, LogisticRegression)
>>> X, y = make_moons(n_samples=2000, noise=0.2)
>>> X, Xt, y, yt = train_test_split(X, y, test_size=0.2)
>>> clfs = [m().fit(X, y) for m in (KNeighborsClassifier, SVC, XGBClassifier, LogisticRegression)]
>>> plot_roc_curves(clfs, Xt, yt)
<AxesSubplot:xlabel='False Positive Rate (FPR)', ylabel='True Positive Rate (FPR)'>
>>> plot_roc_curves(clfs, Xt, yt, all=True, ncols=4, fig_size=(10, 4))
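What a ROC curve and the area under it (the value that get_score appends to the legend) represent can be sketched without any plotting library. The helpers below, roc_points and auc, are hypothetical names written for illustration; they are not how watex or scikit-learn implement the computation, and they assume binary labels (0/1) with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    # Visit samples by decreasing score: each step lowers the threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples tied at this score are counted.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(fpr, tpr, auc(fpr, tpr))
```

An AUC of 1.0 corresponds to a perfect ranking of positives above negatives, while 0.5 corresponds to a random ranking.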
- watex.utils.plot_sbs_feature_selection(sbs_estimator, /, X=None, y=None, fig_size=(8, 5), sns_style=False, savefig=None, verbose=0, **sbs_kws)[source]#
Plot Sequential Backward Selection (SBS) for feature selection.
SBS collects the scores of the best feature subset at each stage.
- Parameters:
sbs_estimator (SequentialBackwardSelection estimator object) – The Sequential Backward Selection estimator, either fitted or not. If it is not fitted, provide the training X and y; otherwise an error will occur.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target relative to X for classification or regression; None for unsupervised learning.
n_estimators (int, default=500) – The number of trees in the forest.
fig_size (tuple (width, height), default=(8, 5)) – the matplotlib figure size given as a tuple of width and height.
savefig (str, default=None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
sns_style (str, optional) – the seaborn style.
verbose (int, default=0) – print the feature labels with their importance rates.
sbs_kws (dict) – Additional keyword arguments passed to SequentialBackwardSelection.
Examples
(1)-> Plot fitted SBS in action
>>> from watex.exlib.sklearn import KNeighborsClassifier, train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> from watex.utils.plotutils import plot_sbs_feature_selection
>>> X, y = fetch_data('bagoue analysed')  # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs = SequentialBackwardSelection(knn)
>>> sbs.fit(Xtrain, ytrain)
>>> plot_sbs_feature_selection(sbs, sns_style=True)
(2)-> Plot estimator with no prefit SBS.
>>> plot_sbs_feature_selection(knn, Xtrain, ytrain)  # yields the same result
- watex.utils.plot_sfi(cz, p=None, s=None, dipolelength=None, fig_size=(10, 4), style='classic', **plotkws)[source]#
Plot sfi parameter components.
- Parameters:
cz (array-like 1d) – Selected conductive zone.
p (array-like 1d) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, it is set automatically using the default value to match the cz size. The default value is 10.
fig_size (tuple, default=(10, 4)) – Matplotlib (MPL) figure size; should be a tuple of integers.
See also
watex.utils.exmath.sfi for more details about the sfi parameter computation.
Examples
>>> import numpy as np
>>> from watex.utils.exmath import plot_sfi
>>> rang = np.random.RandomState(42)
>>> condzone = np.abs(rang.randn(7)) * 1e2
>>> plotkws = dict(rlabel='Selected conductive zone (cz)', color=f'{P().frcolortags.get("fr3")}')
>>> plot_sfi(condzone, **plotkws)
- watex.utils.plot_silhouette(X, labels, metric='euclidean', savefig=None, **kwds)[source]#
Plot the silhouette to quantify the quality of a clustering.
- Parameters:
X (array-like of shape (n_samples_a, n_samples_a) if metric == "precomputed", or (n_samples_a, n_features) otherwise) – An array of pairwise distances between samples, or a feature array.
labels (array-like of shape (n_samples,)) – Label values for each sample.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use "precomputed" as the metric. Precomputed distance matrices must have 0 along the diagonal.
savefig (str, default=None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.
See also
watex.view.mlplot.plotSilhouette
Gives a consistent plot, with a prefit parameter that controls whether labels are expected to be passed into the function directly or not.
Examples
>>> import numpy as np
>>> from watex.exlib.sklearn import KMeans
>>> from watex.datasets import load_iris
>>> from watex.utils.plotutils import plot_silhouette
>>> d = load_iris()
>>> X = d.data[:, 0][:, np.newaxis]  # take the first axis
>>> km = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=1e-4, random_state=0)
>>> y_km = km.fit_predict(X)
>>> plot_silhouette(X, y_km)
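The quantity drawn in a silhouette plot is the per-sample silhouette coefficient s(i) = (b - a) / max(a, b), where a is the mean distance to the other samples of the same cluster and b is the mean distance to the samples of the nearest other cluster. The sketch below computes it in plain Python with the Euclidean metric; silhouette_values is a hypothetical helper for illustration, not the watex or scikit-learn implementation, and it assumes at least two clusters.

```python
import math


def silhouette_values(X, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for each sample."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Singleton cluster: the coefficient is conventionally 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(X[i], X[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(X[i], X[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


X = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s = silhouette_values(X, labels)
print(s)
```

Values close to 1 indicate samples that sit well inside their cluster; values near 0 or below indicate samples close to, or on the wrong side of, a cluster boundary.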
- watex.utils.plot_skew(edi_obj, method='Bahr', mode=None, threshold_line=None, fig_size=(7, 5), savefig=None, view=None, style=None, **kws)[source]#
Plot phase sensitive skew visualization.
Phase Sensitivity Skew (\(\eta\)) is a dimensionality tool that represents a measure of the skew of the phases of the impedance tensor. The parameter is thus unaffected by the distortion effect, unlike the Swift-skew and ellipticity dimensionality tools [1].
Values of \(\eta\) > 0.3 are considered to represent 3D data. Phase-sensitive skews less than 0.1 indicate 1D, 2D or distorted 2D (3-D/2-D) cases. Values of \(\eta\) between 0.1 and 0.3 indicate modified 3D/2D structures [2] according to Bahr's method. However, for values \(\eta \geq 0.2\) using the Swift method, the smaller the value of \(\eta\) (close to \(0.\)), the closer the structure is to a 2D structure, and vice versa. It is generally considered that an electrical structure with \(\eta < 0.4\) can be treated as a 2D medium. The threshold_line for eta() using the Swift method should therefore be set to 0.4.
New in version 0.1.5.
- Parameters:
edi_obj (str, watex.edi.Edi) – Full path to EDI files or Edi object.
method (str, default='Bahr') – Kind of correction. Can be swift for the distortion removal proposed by Swift in 1967 [3]; values close to 0 indicate 1D and 2D structures, and 3D otherwise. Conversely, bahr applies the distortion removal proposed by Bahr in 1991 [2]. The latter threshold is set to 0.3; above this value the structure is 3D.
threshold_line (float, optional) –
Visualize the threshold line. Can be ['bahr', 'swift', 'both']:
Note that when method is set to swift, values close to \(0.\) indicate 1D and 2D structures, and 3D otherwise.
When method is set to Bahr, \(\mu > 0.3\) indicates 3D structures, values in \([0.1 - 0.3]\) indicate modified 3D/2D structures, whereas \(<0.1\) indicates 1D, 2D or distorted 2D.
mode (str, optional) – X-axis coordinates for visualization. Plot either 'frequency' or 'periods'. The default is 'frequency'.
view (str, default='skew') – phase sensitive visualization. Can also be the rotational invariant (invariant). Note that setting it to mu or invariant does not change any interpretation, since the distortions of Z are all rotationally invariant whether using the Bahr or swift method.
fig_size (tuple, default=(7, 5)) – Matplotlib figure size.
savefig (str, optional) – Save figure name. The default resolution in dots per inch is 300.
style (str, default='classic') – Matplotlib plotting style.
kws (dict) – Additional keyword arguments for the Matplotlib Axes scatter plot.
- Returns:
ax – Return axis
- Return type:
Matplotlib.pyplot.Axis
See also
watex.methods.em.Processing.skew
Skew equation formulations.
watex.view.TPlot.plotSkew
Gives a consistent plot where the user can customize the plot using the plot parameters of the watex.property.BasePlot class.
References
[1] Bahr, K. (1988) Interpretation of the magnetotelluric impedance tensor: regional induction and local telluric distortion. J. Geophys. Res., 62, 119–127.
[2] Bahr, K. (1991) Geological noise in magnetotelluric data: a classification of distortion types. Phys. Earth Planet. Inter., 66, 24–38.
[3] Bahr, K., 1991. Geological noise in magnetotelluric data: a classification of distortion types. Physics of the Earth and Planetary Interiors 66 (1–2), 24–38.
Examples
>>> import watex as wx
>>> from watex.utils.plotutils import plot_skew
>>> edi_sk = wx.fetch_data("edis", return_data=True, samples=20)
>>> plot_skew(edi_sk)
>>> plot_skew(edi_sk, threshold_line=True)
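The Bahr thresholds quoted above (below 0.1, between 0.1 and 0.3, above 0.3) can be read as a simple lookup. The sketch below only restates those three ranges; classify_bahr_skew is a hypothetical helper and is not part of watex, and the handling of the exact boundary values 0.1 and 0.3 is an assumption.

```python
def classify_bahr_skew(eta):
    """Map a phase-sensitive skew value to the dimensionality class
    described for Bahr's method."""
    if eta < 0.1:
        return "1D, 2D or distorted 2D"
    if eta <= 0.3:
        # Assumption: both boundary values fall in the intermediate class.
        return "modified 3D/2D"
    return "3D"


for eta in (0.05, 0.2, 0.45):
    print(eta, "->", classify_bahr_skew(eta))
```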
- watex.utils.plot_strike(list_of_edis, /, kind=2, period_tolerance=0.05, text_pad=1.65, rot_z=0.0, **kws)[source]#
Plot the strike estimated from the invariants and phase tensor in a rose diagram or xy plot.
- Parameters:
list_of_edis (list,) –
Full paths to the .edi files to plot, or a list of EDI files.
Changed in version 0.2.0: No need to provide a list of EDI files. Henceforth list_of_edis accepts the path-like object of a single EDI file, then asserts the validity of the EDI files afterward.
kind (int, default=2) –
Can be [ 1 | 2 ] where:
1 to plot individual decades in one plot
2 to plot all period ranges into one polar diagram for each strike angle estimation
One could also try kind=1 to plot by decade.
fig_num (int, default=1,) – figure number to be plotted. Default is 1
font_size (float, default=10) – Font size.
rot_z (float, default=0.,) – angle of rotation clockwise positive.
period_tolerance (float, default=.05) – Tolerance level to match periods from different edi files. Default is 0.05
text_pad (float, default=1.65) –
Padding of the angle label at the bottom of each polar diagram. Default is 1.65.
plot_range (str, tuple) –
The period range to estimate the strike angle. It can be [ ‘data’ | (period_min,period_max) ]. Options are:
’data’ for estimating the strike for all periods in the data.
(pmin,pmax) for period min and period max, input as (log10(pmin),log10(pmax))
plot_tipper ([ True | False ]) –
True to plot the tipper strike
False to not plot tipper strike
pt_error_floor (int, optional) – Maximum error in degrees that is allowed to estimate strike. Default is None allowing all estimates to be used.
fold ([ True | False ]) –
True to plot only from 0 to 180
False to plot from 0 to 360
plot_orthogonal ([ True | False]) –
True to plot the orthogonal strike directions
False to not
color ([ True | False ]) –
True to plot shade colors
False to plot all in one color
color_inv (str,) – color of invariants plots
color_pt (str,) – color of phase tensor plots
color_tip (str) – color of tipper plots
ring_spacing (float, optional) – spacing of rings in polar plots
ring_limits (tuple of int,) – plot limits (min count, max count) set each plot have these limits
plot_orientation (str, [ 'h' | 'v' ]) – horizontal or vertical plots
- watex.utils.plot_voronoi(X, y, *, cluster_centers, ax=None, show_vertices=False, line_colors='k', line_width=1.0, line_alpha=1.0, fig_size=(7, 7), fig_title='')[source]#
Plots the Voronoi diagram of the k-means clusters overlaid with the data
- Parameters:
X (NDarray, Arraylike 1d) – Training data X and y. Must have the same length.
y (NDarray, Arraylike 1d) – Training data X and y. Must have the same length.
cluster_centers (array-like) – Cluster centers. They can be obtained with the KMeans algorithm.
show_vertices (bool, optional) – Add the Voronoi vertices to the plot.
line_colors (string, optional) – Specifies the line color for polygon boundaries
line_width (float, optional) – Specifies the line width for polygon boundaries
line_alpha (float, optional) – Specifies the line alpha for polygon boundaries
point_size (float, optional) – Specifies the size of points
ax (Matplotlib.Axes) – Matplotlib axes. If None, an axis is created instead.
fig_size (tuple, default = (7, 7)) – Size of the figures.
- Returns:
ax – Axes to support the figure
- Return type:
Matplotlib.Axes
Examples
>>> from sklearn.datasets import make_moons
>>> from sklearn.cluster import KMeans
>>> from watex.utils.plotutils import plot_voronoi
>>> X, y = make_moons(n_samples=2000, noise=0.2)
>>> km = KMeans(n_init='auto').fit(X, y)
>>> plot_voronoi(X, y, cluster_centers=km.cluster_centers_)
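The Voronoi cells drawn around k-means centers have a simple meaning: each cell contains the points that are closer to its center than to any other center. The sketch below illustrates this with a plain nearest-center lookup; nearest_center is a hypothetical helper for illustration and is not how plot_voronoi builds the diagram.

```python
import math


def nearest_center(point, cluster_centers):
    """Index of the center whose Voronoi cell contains `point`."""
    return min(
        range(len(cluster_centers)),
        key=lambda k: math.dist(point, cluster_centers[k]),
    )


centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = [(0.5, 0.2), (3.0, 1.0), (1.0, 3.5)]
cells = [nearest_center(p, centers) for p in points]
print(cells)  # [0, 1, 2]
```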
- watex.utils.plot_yb_confusion_matrix(clf, Xt, yt, labels=None, encoder=None, savefig=None, fig_size=(6, 6), **kws)[source]#
Confusion matrix plot using the ‘yellowbrick’ package.
Creates a heatmap visualization of the sklearn.metrics.confusion_matrix(). A confusion matrix shows each combination of the true and predicted classes for a test data set.
The default color map uses a yellow/orange/red color scale. The user can choose between displaying values as the percent of true (cell value divided by sum of row) or as direct counts. If percent of true mode is selected, 100% accurate predictions are highlighted in green.
Requires a classification model.
Make sure 'yellowbrick' is installed before using the function; otherwise an ImportError is raised.
- Parameters:
clf (classifier estimator) – A scikit-learn estimator that should be a classifier. If the model is not a classifier, an exception is raised. If the internal model is not fitted, it is fit when the visualizer is fitted, unless otherwise specified by is_fitted.
Xt (ndarray or DataFrame of shape n x m) – A matrix of n instances with m features. Preferably, the matrix represents the test data for error evaluation.
yt (ndarray or Series of length n) – An array or series of target or class values. Preferably, the array represents the test class labels for error evaluation.
ax (matplotlib Axes, default: None) – The axes to plot the figure on. If not specified the current axes will be used (or generated if required).
sample_weight (array-like of shape = [n_samples], optional) – Passed to confusion_matrix to weight the samples.
encoder (dict or LabelEncoder, default: None) – A mapping of classes to human readable labels. Often there is a mismatch between desired class labels and those contained in the target variable passed to fit() or score(). The encoder disambiguates this mismatch ensuring that classes are labeled correctly in the visualization.
labels (list of str, default: None) – The class labels to use for the legend ordered by the index of the sorted classes discovered in the fit() method. Specifying classes in this manner is used to change the class names to a more specific format or to label encoded integer classes. Some visualizers may also use this field to filter the visualization for specific classes. For more advanced usage specify an encoder rather than class labels.
fig_size (tuple (width, height), default=(6, 6)) – the matplotlib figure size given as a tuple of width and height.
savefig (str, default=None) – the path to save the figure. Argument is passed to the matplotlib.Figure class.
- Returns:
cmo – return a yellowbrick confusion matrix object instance.
- Return type:
yellowbrick.classifier.confusion_matrix.ConfusionMatrix
Examples
>>> # Import the required models and fetch an extreme gradient boosting
>>> # model, for instance, then plot the confusion matrix
>>> import matplotlib.pyplot as plt
>>> plt.style.use('classic')
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import train_test_split
>>> from watex.models import pModels
>>> from watex.utils.plotutils import plot_yb_confusion_matrix
>>> # split the data. Note that fetch_data outputs X and y
>>> X, Xt, y, yt = train_test_split(*fetch_data('bagoue analysed'), test_size=.25)
>>> # train the model with the best estimator
>>> pmo = pModels(model='xgboost')
>>> pmo.fit(X, y)
>>> print(pmo.estimator_)  # pmo.XGB.best_estimator_
>>> # Predict using, under the hood, the best estimator
>>> # for the XGBoost classifier
>>> ypred = pmo.predict(Xt)
>>> # now plot the score
>>> plot_yb_confusion_matrix(pmo.XGB.best_estimator_, Xt, yt)
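The two display modes mentioned above, direct counts and percent of true (each cell divided by the sum of its row), can be sketched in plain Python. The helpers confusion_counts and percent_of_true are hypothetical names used for illustration; they are not the yellowbrick or scikit-learn implementation.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {lab: k for k, lab in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


def percent_of_true(cm):
    """Divide each cell by the sum of its row (the true-class total)."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([cell / total if total else 0.0 for cell in row])
    return out


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
cm = confusion_counts(y_true, y_pred, labels=[0, 1])
pct = percent_of_true(cm)
print(cm)   # [[1, 1], [1, 2]]
print(pct)
```

In percent-of-true mode every row sums to 1, so a diagonal cell equal to 1.0 means that every sample of that class was predicted correctly.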
- watex.utils.power(p)[source]#
Compute the power of the selected conductive zone. The anomaly power is closely related to the width of the conductive zone.
The power parameter implicitly defines the width of the conductive zone and is evaluated from the difference between the abscissas of the start point \(X_{LB}\) and the end point \(X_{UB}\) of the selected anomaly:
\[power=|X_{LB} - X_{UB}|\]
- Parameters:
p – array-like. Station positions of the conductive zone.
- Returns:
Absolute value of the width of conductive zone in meters.
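The formula above amounts to one subtraction. The sketch below assumes that p is ordered along the profile, so that its first and last entries are \(X_{LB}\) and \(X_{UB}\); anomaly_power is a hypothetical name chosen to avoid suggesting that this is the library's own code.

```python
def anomaly_power(p):
    """Width of the conductive zone: |X_LB - X_UB|.

    Assumption: p is ordered, so p[0] is X_LB and p[-1] is X_UB.
    """
    return abs(p[0] - p[-1])


positions = [40, 50, 60, 70]     # station positions in meters
print(anomaly_power(positions))  # 30
```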
- watex.utils.predict_NGA_labels(X, /, n_clusters, random_state=0, keep_label_0=False, n_init='auto', return_cluster_centers=False, **kws)[source]#
Predict the Naive Group of Aquifer (NGA) labels.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
random_state (int, RandomState instance or None, default=0) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
keep_label_0 (bool, default=False) – The prediction already includes the label 0. However, including 0 in the predicted labels refers to 'k=0', i.e. a permeability coefficient equal to 0, which is not true in principle because all rocks have a permeability coefficient 'k'. Here 'k=0' is considered an undefined permeability coefficient, so '0' can be excluded since it can also be considered a missing 'k'-value. If a predicted '0' is in the target, it should mean a missing 'k'-value rather than a concrete label. Therefore, to avoid any confusion, '0' is altered to '1': the value +1 is used to move all class labels forward, thereby excluding the '0' label. To force the inclusion of 0 in the labels, set keep_label_0 to True.
n_init ('auto' or int, default='auto') –
Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).
When n_init='auto', the number of runs will be 10 if using init='random', and 1 if using init='k-means++'.
New in version 0.2.0: Added 'auto' option for n_init.
return_cluster_centers (bool, default=False) – export the array of cluster centers if True.
kws (dict) – Additional keyword arguments passed to sklearn.cluster.KMeans.
- Returns:
NGA (array_like of shape (n_samples, n_features)) – Predicted NGA labels.
(NGA, cluster_centers) (Tuple of array-like) – NGA and cluster centers if return_cluster_centers is set to True.
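The effect of keep_label_0 described above can be sketched as a plain label shift. This is an illustration of the documented behaviour only: shift_labels is a hypothetical helper, and whether the library shifts unconditionally or only when a 0 is present is an assumption here.

```python
def shift_labels(labels, keep_label_0=False):
    """Move every class label forward by 1 unless keep_label_0 is True,
    so that 0 stays reserved for a missing k-value."""
    if keep_label_0:
        return list(labels)
    return [lab + 1 for lab in labels]


predicted = [0, 2, 1, 0, 2]
print(shift_labels(predicted))                     # [1, 3, 2, 1, 3]
print(shift_labels(predicted, keep_label_0=True))  # [0, 2, 1, 0, 2]
```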
- watex.utils.projection_validator(X, Xt=None, columns=None)[source]#
Retrieve the x, y coordinates of a dataframe (X, Xt) from column names or indexes.
If X or Xt are given as arrays, columns may hold integers for selecting the coordinates 'x' and 'y'.
- Parameters:
X (Ndarray (M x N matrix where M=m-samples and N=n-features)) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.
Xt (Ndarray (M x N matrix where M=m-samples and N=n-features)) – Shorthand for "test set"; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
columns (list of str or index, optional) – columns is useful when a dataframe with more than 2 columns is given. If such data is passed to X or Xt, columns must hold the names to consider as 'easting', 'northing' when UTM coordinates are given, or 'latitude', 'longitude' when latlon are given. If there are more than 2 columns and columns is None, an error is raised to prompt the user to provide the indexes for retrieving the 'y' and 'x' coordinates.
- Returns:
arrays and coordinate labels
- Return type:
( x, y, xt, yt ), (xname, yname, xtname, ytname), Tuple of coordinate
- watex.utils.qc(z_or_edis_obj_list, /, tol=0.5, *, interpolate_freq=False, return_freq=False, tensor='res', return_data=False, to_log10=False, return_qco=False)[source]#
Check the quality control in the collection of Z or EDI objects.
Analyse the data in the EDI collection and return the quality control value. It indicates what percentage of the data is representative.
- Parameters:
tol (float, default=.5) – the tolerance parameter. The value indicates the rate from which the data can be considered meaningful. Preferably it should be less than 1 and greater than 0. The default .5 means 50 %. The analysis becomes softer with higher tol values and more severe otherwise.
interpolate_freq (bool) – interpolate the valid frequencies after removing the frequencies whose data falls under the 1-tol % goodness threshold.
return_freq (bool, default=False) – returns the interpolated frequency.
return_data (bool, default=False) – returns the valid data with up to 1-tol % goodness.
tensor (str, default='res') – Tensor name. Can be [resistivity|phase|z|frequency]. Impedance is used for data quality assessment.
to_log10 (bool, default=False) – convert the frequency values to log10.
return_qco (bool, default=False) –
Returns a quality control object that wraps all useful information after control. The following attributes can be fetched:
rate_: the rate of the quality of the data.
component_: The selected component from which data is selected for analysis. By default either xy or yx is used.
freqs_: The valid frequencies in the data, selected according to the tol parameter. Note that if interpolate_freq is True, the interpolated frequency is used instead.
invalid_freqs_: Useless frequencies dropped from the data during control.
data_: Valid tensor data either in TE or TM mode.
- Returns:
Returns the quality control value and the interpolated frequency if return_freq is set to True; otherwise returns only the quality control ratio. Returns the quality control object if return_qco is True.
- Return type:
Tuple (float ) or (float, array-like, shape (N, )) or QCo
Examples
>>> import watex as wx
>>> data = wx.fetch_data('huayuan', samples=20, return_data=True, key='raw')
>>> r, = wx.qc(data)
>>> r
0.75
>>> r, = wx.qc(data, tol=.2)
>>> r
0.75
>>> r, = wx.qc(data, tol=.1)
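One plausible reading of the tol parameter is sketched below: a frequency is kept when its share of missing values does not exceed tol, and the quality rate is the fraction of frequencies kept. This is an assumption made for illustration and is not taken from the watex source; quality_rate and the toy data layout (rows are frequencies, columns are stations, None marks a missing value) are hypothetical.

```python
def quality_rate(data, tol=0.5):
    """Return (rate, valid_mask) for a frequency-by-station table.

    Assumption: a frequency is valid when its fraction of missing
    values (None) is at most `tol`.
    """
    valid = []
    for row in data:
        missing = sum(1 for v in row if v is None) / len(row)
        valid.append(missing <= tol)
    rate = sum(valid) / len(valid)
    return rate, valid


data = [
    [10.0, 12.0, 11.0, 9.0],    # 0 % missing
    [10.0, None, 11.0, 9.0],    # 25 % missing
    [None, None, 11.0, 9.0],    # 50 % missing
    [None, None, None, 9.0],    # 75 % missing
]
print(quality_rate(data, tol=0.5))  # (0.75, [True, True, True, False])
print(quality_rate(data, tol=0.2))  # (0.25, [True, False, False, False])
```

Under this reading a higher tol tolerates more missing data per frequency, which is consistent with the statement that the analysis becomes softer with higher tol values.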
- watex.utils.random_sampling(d, /, samples=None, replace=False, random_state=None, shuffle=True)[source]#
Sampling data.
- Parameters:
d ({array-like, sparse matrix} of shape (n_samples, n_features)) – Data for sampling, where n_samples is the number of samples and n_features is the number of features.
samples (int, optional) – Ratio or number of items from axis to return. Default = 1 if samples is None.
replace (bool, default=False) – Allow or disallow sampling of the same row more than once.
random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
split_Xy –
- Returns:
d ({array-like, sparse matrix} of shape (n_samples, n_features)) – Data sampled according to the given samples.
Examples
>>> from watex.utils.funcutils import random_sampling
>>> from watex.datasets import load_hlogs
>>> data = load_hlogs().frame
>>> random_sampling(data, samples=7).shape
(7, 27)
- watex.utils.random_selector(arr, /, value, seed=None, shuffle=False)[source]#
Randomly select a number of values from an array.
- Parameters:
arr (ArrayLike) – Array of values
value (float, arraylike) – If a float value is passed, it indicates the number of values to select from arr. If an array is passed, its values must all be contained in the given arr. However, if a string containing % is given, it is used as the ratio of values to select randomly.
seed (int, optional) – Allows retrieving identical randomly selected values from the given array.
shuffle (bool, default=False) – If True, shuffle the selected values.
- Returns:
arr
- Return type:
Array containing the selected values
Examples
>>> import numpy as np
>>> from watex.utils.funcutils import random_selector
>>> dat = np.arange(42)
>>> random_selector(dat, 7, seed=42)
array([0, 1, 2, 3, 4, 5, 6])
>>> random_selector(dat, (23, 13, 7))
array([ 7, 13, 23])
>>> random_selector(dat, "7%", seed=42)
array([0, 1])
>>> random_selector(dat, "70%", seed=42, shuffle=True)
array([ 0,  5, 20, 25, 13,  7, 22, 10, 12, 27, 23, 21, 16,  3,  1, 17,  8,
        6,  4,  2, 19, 11, 18, 24, 14, 15,  9, 28, 26])
- watex.utils.read_data(f, **read_kws)[source]#
Assert and read the specific files and URLs allowed by the package.
Readable files are systematically converted to a pandas dataframe.
- Parameters:
f (str, Path-like object) – File path or Pathlib object. Must contain a valid file name and should be a readable file or URL.
read_kws (dict) – Additional keyword arguments passed to the pandas reader used for the file.
- Returns:
f – A dataframe with head contents by default.
- Return type:
pandas.DataFrame
- watex.utils.reduce_samples(*data, sname, zname=None, kname=None, section_indexes=None, error='raise', strategy='average', verify_integrity=False, ignore_index=False, **kws)[source]#
Create a new dataframe by squeezing/compressing the non-valid data.
The m-samples reduction is necessary for datasets with a lot of missing k-values. Shrinking the number of k0-values (missing k-values) is a relevant technique. It consists of compressing the missing k-values from the top (depth equal to 0) to the upper section of the first aquifer with the lowest depth into a single vector \(x_r\) of dimension (1×n), i.e. a vector that contains the n features.
- Parameters:
data (list of dataframes) – Data that contains mainly the aquifer values. It must contain the depth values referring to the column name passed to zname and the permeability coefficient k passed to kname. Both arguments need to be supplied when dataframes are passed as positional arguments.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
zname (str, int) – Name of the depth column. zname allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer value must not be out of the dataframe size along axis 1. Commonly zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname allows retrieving the permeability coefficient ‘k’ in a specific dataframe. If an integer is passed, it is taken as the index of the ‘k’ column in the dataframe. Note that the integer value must not be out of the dataframe size along axis 1. Commonly kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised since zname is used to fetch z from the dataframe and overwrite it.
strategy (str, default='average' or 'mean') – Strategy used to select or compute the numerical data into a single series. It can be [‘naive’]. In that case, a single series is randomly picked from the base strata data.
section_indexes (tuple or list of int) – List of pair tuples or list of integers. These are the valid section (upper and lower) indexes of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is considered the last position, i.e. the bottom of the section; therefore its value is included in the data.
error (str, default='raise') – Raise errors if trouble occurs when computing the section of each aquifer. If ‘ignore’, a UserWarning is displayed when invalid data is found. Any other value of error will set error to raise.
verify_integrity (bool, default=False) –
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method. If True, remove the duplicate rows from a DataFrame. The removal follows the usual duplicate-dropping options:
subset: By default, rows that have the same values in all the columns are considered duplicates. This parameter specifies the only columns to consider for identifying duplicates.
keep: Determines which duplicates (if any) to keep. It takes first (drop duplicates except for the first occurrence, the default behavior), last (drop duplicates except for the last occurrence) or False (drop all duplicates).
inplace: Specifies whether to return a new DataFrame or update the existing one. It is a boolean flag with default False.
ignore_index (bool, default=False,) – It is a boolean flag to indicate if row index should be reset after dropping duplicate rows. False: It keeps the original row index. True: It reset the index, and the resulting rows will be labeled 0, 1, …, n – 1.
- Returns:
df_new – new dataframes with reduced samples.
- Return type:
List of pandas.dataframes
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import reduce_samples
>>> data = load_hlogs().frame  # get the frames
>>> # add explicitly the aquifer section indices
>>> dfnew = reduce_samples(data.copy(), sname='strata_name',
...                        section_indexes=(16, 29))
>>> dfnew[0]
   hole_number               strata_name    rock_name  ...      r     rp  remark
0         H502                  mudstone          J2z  ...    NaN    NaN     NaN
16        H502                 siltstone          NaN  ...  35.74  59.23     NaN
17        H502    fine-grained sandstone          NaN  ...  35.74  59.23     NaN
18        H502                 siltstone          NaN  ...  35.74  59.23     NaN
19        H502    fine-grained sandstone          NaN  ...  35.74  59.23     NaN
20        H502                  mudstone          NaN  ...  35.74  59.23     NaN
21        H502                 siltstone          NaN  ...  35.74  59.23     NaN
22        H502    fine-grained sandstone          NaN  ...  59.61  59.23     NaN
23        H502                 siltstone          NaN  ...  59.61  59.23     NaN
24        H502    fine-grained sandstone          NaN  ...  59.61  59.23     NaN
25        H502  Coarse-grained sandstone          NaN  ...  59.61  59.23     NaN
26        H502                  mudstone          NaN  ...  82.33  59.23     NaN
27        H502    fine-grained sandstone          NaN  ...  82.33  59.23     NaN
28        H502  Coarse-grained sandstone          J2z  ...  82.33  59.23     NaN
29        H502                      coal  (J2y) 2coal  ...  82.33  59.23     NaN
0         H502                 siltstone          NaN  ...    NaN    NaN     NaN
[16 rows x 23 columns]
>>> # specify the column names zname and kname without section indexes
>>> dfnew = reduce_samples(data.copy(), sname='strata_name', zname='depth',
...                        kname='k', ignore_index=True)[0]
>>> dfnew.index  # index is reset
RangeIndex(start=0, stop=16, step=1)
- watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1)[source]#
Efficient strategy to remove outliers in the data.
An outlier is a data point of a given sample, observation, or distribution that lies outside the overall pattern. A commonly used rule considers a data point an outlier if it lies more than 1.5 IQR below the first quartile or above the third quartile.
Two approaches are used to remove the outliers.
Inter Quartile Range (IQR): IQR is the most commonly used and most trusted approach in the research field. Said differently, low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to calculate the median and the quartiles Q1 and Q3, then the IQR. For n sorted values, the quartile positions and the IQR are:
\[ \begin{align}\begin{aligned}Q1 = \frac{1}{4}(n + 1)\\Q3 = \frac{3}{4}(n + 1)\\IQR = Q3 - Q1\end{aligned}\end{align} \]
The outlier limits are defined above and below the normal range of the dataset, namely the upper and lower bounds (the 1.5*IQR value is considered):
\[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]
In the above formula, the 0.5 scale-up of the IQR, \(new\_IQR = IQR + 0.5*IQR\), is taken so that all the data within about 2.7 standard deviations of a Gaussian distribution are kept.
Z-score: Also called a standard score. This score tells how far a data point is from the mean. After setting a threshold value, the z-scores of the data points can be used to define the outliers.
\[Zscore = (\text{data\_point} - \text{mean}) / \text{std. deviation}\]
The outlier threshold is generally set to 3.0, since 99.7% of the data points lie within +/- 3 standard deviations (using the Gaussian distribution approach).
- Parameters:
ar (Arraylike,) – Array containing outliers to remove
method (str, default='IQR') – The selected approach to remove the outliers. It can be [‘IQR’|’Z-score’]. See above for the explanation of each approach. Note that when selecting "z-score", the threshold value greatly influences which data are considered outliers.
threshold (float, default=3) – Threshold value used with "z-score": data whose score lies above it are considered outliers.
fill_value (float, optional) – Value to replace the outliers. If not given, outliers are suppressed in the array.
axis (int, default=1) – axis from which to remove values. This is useful when two dimensional array is supplied. Default, delete outlier from the rows.
- Returns:
arr – New array with the outliers removed.
- Return type:
Array_like
Examples
>>> import numpy as np
>>> np.random.seed(42)
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn(7, 3)
>>> data_r = remove_outliers(data)
>>> data.shape, data_r.shape
((7, 3), (5, 3))
>>> remove_outliers(data, fill_value=np.nan)
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for one dimensional
>>> remove_outliers(data[:, 0], fill_value=np.nan)
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])
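The two rules described for this function can be tried on a small list with the standard library. This is a minimal sketch and not the watex code: `iqr_bounds` and `drop_outliers` are hypothetical names, the quartiles come from `statistics.quantiles` (whose interpolation may differ from the one watex uses), and the z-score uses the population standard deviation.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # Q1, median, Q3
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, method="IQR", threshold=3.0):
    """Hypothetical helper: keep only the values that pass the chosen rule."""
    if method.lower() == "iqr":
        lower, upper = iqr_bounds(values)
        return [v for v in values if lower <= v <= upper]
    # Z-score rule: distance from the mean in units of standard deviation.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [v for v in values if abs((v - mean) / std) <= threshold]


sample = [10.0, 11.0, 12.0, 11.5, 10.5, 12.5, 11.2, 95.0]
print(drop_outliers(sample))                                   # IQR rule
print(drop_outliers(sample, method="z-score", threshold=2.0))  # z-score rule
```

On a sample this small a single extreme value inflates the standard deviation, so the z-score rule with the default threshold of 3.0 keeps the value 95.0 while the IQR rule drops it. This illustrates why the threshold strongly influences the z-score result.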
- watex.utils.rename_labels_in(arr, new_names, coerce=False)[source]#
Rename labels with new names.
- Parameters:
arr – array-like | pandas.Series; array or series containing numerical values. If non-numerical values are given, an error is raised.
new_names – list of str; list of strings or values to replace the integer label identifiers.
coerce – bool, default=False; force the ‘new_names’ to appear in the target, whether or not some integer class label identifiers remain. If coerce is True, the target array holds the dtype of the new array; coercing the label names will not raise an error and can consequently introduce unexpected results.
- Returns:
array-like, An array-like with full new label names.
- watex.utils.replace_data(X, y=None, n_times=1, axis=0, reset_index=Ellipsis)[source]#
Replace items in data \(n\) times
- Parameters:
X (Arraylike 1D or pd.DataFrame) – Data to replace. Note that sparse matrices are not allowed. Use random_sampling() instead.
y (Arraylike 1d) – Preferably one-dimensional data.
n_times (int) – Number of times all items should be replaced in data.
reset_index (bool, default=False) – If True and X is a dataframe, the index is reset and dropped.
- Returns:
X or (X, y) – Tuple is returned if y is passed.
- Return type:
Tuple of data replaced
Examples
>>> import numpy as np
>>> from watex.utils.funcutils import replace_data
>>> X, y = np.random.randn(7, 2), np.arange(7)
>>> X.shape, y.shape
((7, 2), (7,))
>>> X_new, y_new = replace_data(X, y, n_times=10)
>>> X_new.shape, y_new.shape
((70, 2), (70,))
- watex.utils.reshape(arr, axis=None)[source]#
Detect the array shape and reshape it accordingly, back to the given axis.
- Parameters:
arr – array-like with a number of dimensions equal to 1 or 2
axis – axis along which to reshape the array back. If ‘axis’ is None and the number of dimensions is greater than 1, it reshapes the array back to a one-dimensional array-like
- Returns:
New reshaped array
- Example:
>>> import numpy as np
>>> from watex.utils.funcutils import reshape
>>> array = np.random.randn(50)
>>> array.shape
(50,)
>>> ar1 = reshape(array, 1)
>>> ar1.shape
(1, 50)
>>> ar2 = reshape(ar1, 0)
>>> ar2.shape
(50, 1)
>>> ar3 = reshape(ar2, axis=None)
>>> ar3.shape  # goes back to the original array
(50,)
- watex.utils.rhoa2z(rhoa, phs, freq)[source]#
Convert apparent resistivity to impedance tensor z
- Parameters:
rhoa (ndarray, shape (N, M)) – Apparent resistivity in \(\Omega.m\)
phs (ndarray, shape (N, M)) – Phase in degrees
freq (array-like , shape (N, )) – Frequency in Hertz
- Returns:
Impedance tensor; the tensor is a complex number in \(\Omega\).
- Return type:
ndarray, shape (N, M), dtype = ‘complex’
- Example:
>>> import numpy as np
>>> from watex.utils import rhoa2z
>>> rhoa = np.array([1623.73691735])
>>> phz = np.array([45.])
>>> f = np.array([1014])
>>> rhoa2z(rhoa, phz, f)
array([[2.54950976+2.54950976j]])
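The numbers in the example above are consistent with the SI relation \(|Z| = \sqrt{\rho_a \omega \mu_0}\), with \(\omega = 2\pi f\), and the phase applied as the argument of the complex number. That relation is inferred here from the example values, not from the watex source, so the scalar sketch below is an assumption; `rhoa_phase_to_z` is a hypothetical name.

```python
import cmath
import math

MU0 = 4 * math.pi * 1e-7  # vacuum magnetic permeability in H/m


def rhoa_phase_to_z(rhoa, phase_deg, freq):
    """Hypothetical scalar helper: impedance from resistivity and phase."""
    omega = 2 * math.pi * freq                 # angular frequency in rad/s
    magnitude = math.sqrt(rhoa * omega * MU0)  # |Z| in ohm (assumed relation)
    return cmath.rect(magnitude, math.radians(phase_deg))


z = rhoa_phase_to_z(1623.73691735, 45.0, 1014)
print(z)  # real and imaginary parts are equal at a 45 degree phase
```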
- watex.utils.rhophi2z(rho, phi, freq)[source]#
Convert impedance-style information given in Rho/Phi format into complex valued Z.
- Parameters:
rho (ArrayLike 1D/2D) – Resistivity array in \(\Omega.m\). If array is two-dimensional, it should be 2x2 array (real).
phi (ArrayLike 1D/2D) – Phase array in degree (\(\degree\)). If array is two-dimensional, it should be 2x2 array (real).
freq (float, arraylike 1d) – Frequency in Hz
- Returns:
Z – The dimension of Z depends on the input arrays rho and phi.
- Return type:
Arraylike 1d or 2d , complex
Examples
>>> import numpy as np
>>> from watex.utils.exmath import rhophi2z
>>> rhophi2z(823, 25, 500)
array([1300.00682824+606.20313966j])
>>> rho = np.array([[823, 700], [723, 526]])
>>> phi = np.array([[45, 50], [90, 180]])
>>> rhophi2z(rho, phi, freq=500)
array([[ 1.01427314e+03+1.01427314e+03j,  8.50328081e+02+1.01338154e+03j],
       [ 8.23227764e-14+1.34443297e+03j, -1.14673449e+03+1.40434473e-13j]])
>>> rhophi2z(np.array([823, 700]), np.array([45, 50]), [500, 700])
array([1014.27313876+1014.27313876j, 1006.12175325+1199.04921402j])
>>> rho = np.abs(np.random.randn(7, 3) * 100)
>>> phi = np.abs(np.random.randn(7, 3) * 180 % 90)
>>> freq = np.abs(np.random.randn(7) * 100)
>>> rhophi2z(rho, phi, freq)
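Unlike the SI form, the outputs shown for this function match the practical magnetotelluric convention \(\rho = 0.2\,T\,|Z|^2\), that is \(|Z| = \sqrt{5 f \rho}\). This reading is an inference from the printed values only; the scalar helper `rho_phi_to_z` below is hypothetical and does not handle the 2x2 case.

```python
import cmath
import math


def rho_phi_to_z(rho, phi_deg, freq):
    """Hypothetical scalar helper using the assumed rho = 0.2 * T * |Z|**2."""
    magnitude = math.sqrt(5.0 * freq * rho)  # |Z| from rho and frequency
    return cmath.rect(magnitude, math.radians(phi_deg))


z_scalar = rho_phi_to_z(823, 25, 500)
z_entry = rho_phi_to_z(700, 50, 500)  # one entry of the 2x2 example
print(z_scalar)
print(z_entry)
```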
- watex.utils.savejob(job, savefile, *, protocol=None, append_versions=True, append_date=True, fix_imports=True, buffer_callback=None, **job_kws)[source]#
Quickly save your job using ‘joblib’ or the persistent Python pickle module.
- Parameters:
job (Any) – Anything to save, preferably models in a dict.
savefile (str, or path-like object) – Name of the file to store the model. The file argument must have a write() method that accepts a single bytes argument. It can thus be a file object opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.
append_versions (bool, default=True) – Append the version of the Joblib module or Python pickle module, followed by the scikit-learn, numpy and pandas versions. This is useful to know the versions in use when the file was saved, so that it can still be loaded after the system or the modules have been upgraded. It avoids trouble when data have been stored for a long time and the user has forgotten the date and versions at the time of saving.
append_date (bool, default=True,) –
Append the date of the day to the filename.
New in version 0.2.3.
protocol (int, optional) –
The optional protocol argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2, 3, 4 and 5. The default protocol is 4. It was introduced in Python 3.4, and is incompatible with previous versions.
Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.
fix_imports (bool, default=True,) – If fix_imports is True and protocol is less than 3, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.
buffer_callback (callable, optional) –
If buffer_callback is None (the default), buffer views are serialized into file as part of the pickle stream.
If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream.
It is an error if buffer_callback is not None and protocol is None or smaller than 5.
job_kws (dict) – Additional keyword arguments passed to joblib.dump().
- Returns:
savefile – returns the filename
- Return type:
str,
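The idea of storing version and date information next to the saved object can be sketched with the standard pickle module. This is an illustration under assumptions, not the watex implementation: `save_job` is a hypothetical helper, it writes to any binary file object, and it stores the metadata inside the pickled payload instead of appending it to the file name.

```python
import datetime
import io
import pickle
import platform


def save_job(job, fileobj, protocol=None, append_versions=True, append_date=True):
    """Hypothetical helper: pickle `job` together with optional metadata."""
    payload = {"job": job}
    if append_versions:
        # Record what produced the file, to help when loading it later.
        payload["versions"] = {
            "python": platform.python_version(),
            "pickle_highest_protocol": pickle.HIGHEST_PROTOCOL,
        }
    if append_date:
        payload["date"] = datetime.date.today().isoformat()
    pickle.dump(payload, fileobj, protocol=protocol)
    return payload


buffer = io.BytesIO()  # any object with a write() method accepting bytes
save_job({"model": "svc", "C": 1.0}, buffer)
buffer.seek(0)
restored = pickle.load(buffer)
print(restored["job"], sorted(restored))
```

Keeping the metadata in the payload makes the file self-describing; putting it in the file name, as the parameters above describe, makes it visible without loading the file.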
- watex.utils.scalePosition(ydata, xdata=None, func=None, c_order=0, show=False, **kws)[source]#
Correct data location or position and return new corrected location
- Parameters:
ydata (array_like, series or dataframe) – The dependent data, a length M array - nominally f(xdata, ...).
xdata (array_like or object) – The independent variable where the data is measured. Should usually be an M-length sequence or an (k,M)-shaped array for functions with k predictors, but can actually be any object. If None, xdata is generated by default using the length of the given ydata.
func (callable) – The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. The default func is the linear function, i.e. f(x) = ax + b, where a is the slope and b is the intercept value. Setting your own function for better fitting is recommended.
c_order (int or str) – The index or the column name used to select the right column for scaling if ydata is given as a dataframe.
show (bool) – Quick visualization of data distribution.
kws (dict) – Additional keyword arguments passed to scipy.optimize.curve_fit. Refer to scipy.optimize.curve_fit.
- Returns:
ydata (array-like) – Data scaled.
popt (array-like) – Optimal values for the parameters so that the sum of the squared residuals of f(xdata, *popt) - ydata is minimized.
pcov (array-like) – The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)). How the sigma parameter affects the estimated covariance depends on the absolute_sigma argument of scipy.optimize.curve_fit. If the Jacobian matrix at the solution doesn’t have a full rank, then the ‘lm’ method returns a matrix filled with np.inf; on the other hand, the ‘trf’ and ‘dogbox’ methods use the Moore-Penrose pseudoinverse to compute the covariance matrix.
Examples
>>> from watex.utils import erpSelector, scalePosition
>>> df = erpSelector('data/erp/l10_gbalo.xlsx')
>>> df.columns
Index(['station', 'resistivity', 'longitude', 'latitude', 'easting',
       'northing'], dtype='object')
>>> # correcting northing coordinates from easting data
>>> northing_corrected, popt, pcov = scalePosition(ydata=df.northing,
...                                                xdata=df.easting, show=True)
>>> len(df.northing.values), len(northing_corrected)
(20, 20)
>>> popt  # by default popt = (slope: a, intercept: b)
array([1.01151734e+00, 2.93731377e+05])
>>> # corrected easting coordinates using the default x.
>>> easting_corrected, *_ = scalePosition(ydata=df.easting, show=True)
>>> df.easting.values
array([790284, 790281, 790277, 790270, 790265, 790260, 790254, 790248,
       790243, 790237, 790231, 790224, 790218, 790211, 790206, 790200,
       790194, 790187, 790181, 790175], dtype=int64)
>>> easting_corrected
array([790288.18571705, 790282.30300999, 790276.42030293, 790270.53759587,
       790264.6548888 , 790258.77218174, 790252.88947468, 790247.00676762,
       790241.12406056, 790235.2413535 , 790229.35864644, 790223.47593938,
       790217.59323232, 790211.71052526, 790205.8278182 , 790199.94511114,
       790194.06240407, 790188.17969701, 790182.29698995, 790176.41428289])
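With the default linear model f(x) = ax + b, the correction amounts to fitting a straight line and replacing each position by its value on that line. The sketch below does this with an ordinary least-squares fit written in plain Python; it is not the curve_fit-based implementation, and `fit_line` and `scale_position` are hypothetical names.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scale_position(ydata, xdata=None):
    """Hypothetical helper: project ydata onto its fitted straight line."""
    if xdata is None:
        # As documented, x defaults to positions built from the length of ydata.
        xdata = list(range(len(ydata)))
    slope, intercept = fit_line(xdata, ydata)
    corrected = [slope * xi + intercept for xi in xdata]
    return corrected, (slope, intercept)


easting = [790284, 790281, 790277, 790270, 790265, 790260]
corrected, (a, b) = scale_position(easting)
print(corrected)
print(a, b)
```

A custom model function, as recommended for `func`, would replace `fit_line` with a fit suited to the distribution of the data.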
- watex.utils.scaley(y, x=None, deg=None, func=None)[source]#
Scaling value using a fitting curve.
Create a polyfit function from specific data points x to correct the y values.
- Parameters:
y – array-like of y-axis. The array of values to be scaled.
x – array-like of x-axis. If x is given, it should be the same length as y, otherwise an error will occur. Default is None.
func – callable - The model function, f(x, ...). It must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. func can be a linear function, i.e. f(x) = ax + b, where a is the slope and b is the intercept value. Depending on the distribution of the y values, it is recommended to set up a custom function for better fitting. If func is given, deg is not needed.
deg – polynomial degree. If value is None, it is computed using the number of extrema (local and/or global) values.
- Returns:
y: array scaled - projected sample values obtained from f.
x: new x-axis - new axis x_new generated from the samples.
f: linear or polynomial function f
- References:
Wikipedia, Curve fitting, https://en.wikipedia.org/wiki/Curve_fitting Wikipedia, Polynomial interpolation, https://en.wikipedia.org/wiki/Polynomial_interpolation
- Example:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from watex.utils import scaley
>>> rdn = np.random.RandomState(42)
>>> x0 = 10 * rdn.rand(50)
>>> y = 2 * x0 + rdn.randn(50) - 1
>>> plt.scatter(x0, y)
>>> yc, x, f = scaley(y)
>>> plt.plot(x, y, x, yc)
- watex.utils.select_base_stratum(d, /, sname=None, stratum=None, return_rate=False, return_counts=False)[source]#
Selects the base stratum from the strata column in the logging data.
Find the most recurrent stratum in the data and compute the rate of occurrence.
- Parameters:
d (array-like 1D , pandas.Series or DataFrame) – Valid data containing the strata. If dataframe is passed, ‘sname’ is needed to fetch strata values.
sname (str, optional) – Name of the column in the dataframe that contains the strata values. Do not confuse ‘sname’ with ‘stratum’, which is the name of the valid layer/rock in the array/Series of strata.
stratum (str, optional) – Name of the base stratum. It must be an item of the strata data. Note that if stratum is passed, the auto-detection of the base stratum is not triggered. The same stratum is returned; however, its rate and occurrence can still be obtained if return_rate or return_counts is set to True.
return_rate (bool, default=False) – Returns the rate of occurrence of the base stratum in the data.
return_counts (bool, default=False) – Returns each stratum name and its occurrences (count) in the data.
- Returns:
bs (str) – base stratum, contained in the data
r (float) – rate of occurrence of the base stratum in the data
c (tuple (str, int)) – Tuple of each stratum with its occurrence in the data.
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import select_base_stratum
>>> data = load_hlogs().frame  # get only the frame
>>> select_base_stratum(data, sname='strata_name')
'siltstone'
>>> select_base_stratum(data, sname='strata_name', return_rate=True)
0.287292817679558
>>> select_base_stratum(data, sname='strata_name', return_counts=True)
[('siltstone', 52),
 ('fine-grained sandstone', 40),
 ('mudstone', 37),
 ('coal', 24),
 ('Coarse-grained sandstone', 15),
 ('carbonaceous mudstone', 9),
 ('medium-grained sandstone', 2),
 ('topsoil', 1),
 ('gravel layer', 1)]
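Finding the most recurrent stratum and its rate is a counting problem. The sketch below rebuilds a strata list from the counts printed in the example and recovers the same base stratum and rate with `collections.Counter`; the helper name `base_stratum` is hypothetical and the code only illustrates the idea, not the watex implementation.

```python
from collections import Counter


def base_stratum(strata):
    """Hypothetical helper: most frequent stratum, its rate, and all counts."""
    counts = Counter(strata).most_common()  # sorted by decreasing count
    name, occurrences = counts[0]
    rate = occurrences / len(strata)        # share of the base stratum
    return name, rate, counts


# Strata list rebuilt from the counts shown in the example above.
example_counts = [
    ("siltstone", 52), ("fine-grained sandstone", 40), ("mudstone", 37),
    ("coal", 24), ("Coarse-grained sandstone", 15),
    ("carbonaceous mudstone", 9), ("medium-grained sandstone", 2),
    ("topsoil", 1), ("gravel layer", 1),
]
strata = [name for name, n in example_counts for _ in range(n)]

name, rate, counts = base_stratum(strata)
print(name, rate)
```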
- watex.utils.select_feature_importances(clf, X, y=None, *, threshold=0.1, prefit=True, verbose=0, return_selector=False, **kws)[source]#
Select feature importance based on a user-specified threshold after model fitting.
This is useful if one wants to use RandomForestClassifier as a feature selector and intermediate step in a scikit-learn Pipeline object, which allows us to connect different processing steps with an estimator.
- Parameters:
clf (estimator object) – The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator should have a feature_importances_ or coef_ attribute after fitting. Otherwise, the importance_getter parameter should be used.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples, )) – Target vector where n_samples is the number of samples. If given, set prefit=False for the estimator to fit and transform the data for feature importance selection. If the estimator is already fitted, i.e. prefit=True, ‘y’ is not needed.
threshold (str or float, default=0.1) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
prefit (bool, default=True) – Whether a prefit model is expected to be passed into the constructor directly or not. If True, estimator must be a fitted estimator. If False, estimator is fitted and updated by calling fit and partial_fit, respectively.
importance_getter (str or callable, default='auto') –
If ‘auto’, uses the feature importance either through a coef_ attribute or feature_importances_ attribute of estimator.
Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of Pipeline with its last step named clf.
If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
If an integer, then it specifies the maximum number of features to allow.
If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.
return_selector (bool, default=False) – Returns the selector object if True, otherwise returns the transformed X.
verbose (int, default=0) – Display the number of features that meet the criterion according to their importance range.
- Returns:
Xs or selector – Ndarray of number of samples and features that meet the criterion according to the importance range or selector object
- Return type:
ndarray (n_samples, n_criterion_features), or
sklearn.feature_selection.SelectFromModel
Examples
>>> from watex.utils.mlutils import select_feature_importances
>>> from watex.exlib.sklearn import LogisticRegression
>>> X0 = [[ 0.87, -1.34,  0.31 ],
...       [-2.79, -0.02, -0.85 ],
...       [-1.34, -0.48, -2.55 ],
...       [ 1.92,  1.48,  0.65 ]]
>>> y0 = [0, 1, 0, 1]
>>> # (1) use prefit=True and get the Xs importance features
>>> Xs = select_feature_importances(
...     LogisticRegression().fit(X0, y0), X0, prefit=True)
>>> Xs
array([[ 0.87, -1.34,  0.31],
       [-2.79, -0.02, -0.85],
       [-1.34, -0.48, -2.55],
       [ 1.92,  1.48,  0.65]])
>>> # (2) set prefit off and return the selector object
>>> selector = select_feature_importances(
...     LogisticRegression(), X=X0, y=y0, prefit=False, return_selector=True)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.1
>>> selector.get_support()
array([ True,  True,  True])
>>> # the same estimator with scikit-learn's SelectFromModel and its default threshold
>>> from sklearn.feature_selection import SelectFromModel
>>> selector = SelectFromModel(estimator=LogisticRegression()).fit(X0, y0)
>>> selector.estimator_.coef_
array([[-0.3252302 ,  0.83462377,  0.49750423]])
>>> selector.threshold_
0.55245...
>>> selector.get_support()
array([False,  True, False])
>>> selector.transform(X0)
array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])
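The threshold rule described in the parameters (a number, "mean", "median", or a scaled form such as "1.25*mean") can be reproduced by hand on the coefficients printed above. The helpers `resolve_threshold` and `support_mask` are hypothetical names for this sketch and do not cover the l1-penalty special case.

```python
import statistics


def resolve_threshold(importances, threshold):
    """Turn a float or a string such as '1.25*mean' into a number."""
    if isinstance(threshold, str):
        factor, _, reference = threshold.rpartition("*")
        scale = float(factor) if factor else 1.0  # optional scaling factor
        reference = reference.strip()
        if reference == "mean":
            base = statistics.fmean(importances)
        elif reference == "median":
            base = statistics.median(importances)
        else:
            raise ValueError(f"unknown threshold reference: {reference!r}")
        return scale * base
    return float(threshold)


def support_mask(coefs, threshold):
    """Keep features whose absolute importance reaches the threshold."""
    importances = [abs(c) for c in coefs]
    limit = resolve_threshold(importances, threshold)
    return [imp >= limit for imp in importances]


coefs = [-0.3252302, 0.83462377, 0.49750423]  # coef_ from the example above
print(support_mask(coefs, 0.1))     # fixed threshold of 0.1
print(support_mask(coefs, "mean"))  # mean of the absolute coefficients
```

With the fixed value 0.1 every coefficient passes, while the mean of the absolute coefficients (about 0.55245) keeps only the second feature, in line with the two selector outputs shown in the example.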
- watex.utils.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#
Select features and return new dataframe.
- Parameters:
df – a dataframe for features selections
features – list of features to select. The features must be in the dataframe, otherwise an error occurs.
include – the type of data to retrieve in the dataframe df. Can be number.
exclude – the type of data to exclude from the dataframe df. Can be number, i.e. only non-numeric data will be kept in the returned data.
coerce – return the whole dataframe with numeric columns transformed. Be aware that no selection is done and no error is raised. Default is False.
kwd – additional keyword arguments passed to the pd.astype function
- Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
- watex.utils.sfi(cz, p=None, s=None, dipolelength=None, view=False, raw=False, return_components=False, **plotkws)[source]#
Compute the pseudo-fracturing index known as sfi.
The sfi parameter does not indicate the rock fracturing degree in the underground but it is used to speculate about the apparent resistivity dispersion ratio around the cumulated sum of the resistivity values of the selected anomaly. It uses an approach similar to the IF parameter proposed by Dieng et al (2004). Furthermore, its threshold is set to \(\sqrt{2}\) for a symmetrical anomaly characterized by a perfect distribution of resistivity in a homogeneous medium. The formula is given by:
\[sfi=\sqrt{(P_a^{*}/P_a )^2+(M_a^{*}/M_a )^2}\]
where \(P_a\) and \(M_a\) are the anomaly power and the magnitude respectively. \(P_a^{*}\) and \(M_a^{*}\) are the projected power and magnitude of the lower point of the selected anomaly.
- Parameters:
cz (array-like) – Selected conductive zone.
p (array-like) – Station positions of the conductive zone.
dipolelength (float) – If p is not given, p is set automatically using the default dipole length to match the cz size. The default value is 10.
view (bool, default=False) – Visualize the fitting curve.
raw (bool, default=False) – Overlay the fitting curve with the raw curve from cz.
return_components (bool, default=False) – If True, returns the different components used to compute sfi, especially for external visualization.
plotkws (dict) – Matplotlib plot keyword arguments.
- Returns:
sfi – value computed for pseudo-fracturing index
- Return type:
float
Examples
>>> import numpy as np
>>> from watex.property import P
>>> from watex.utils.exmath import sfi
>>> rang = np.random.RandomState(42)
>>> condzone = np.abs(rang.randn(7))
>>> # no visualization and default value `s` with global minimal rho
>>> pfi = sfi(condzone)
>>> pfi
3.35110143
>>> # visualize fitting curve
>>> plotkws = dict(rlabel='Conductive zone (cz)',
...                label='fitting model',
...                color=f'{P().frcolortags.get("fr3")}',
...                )
>>> sfi(condzone, view=True, s=5, figsize=(7, 7), **plotkws)
(array([ 0., 10., 20., 30.]), 1)
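Once the four components are known, the sfi formula is a two-term Euclidean norm. The sketch below evaluates it for made-up component values (they are illustrative and not taken from any dataset); `sfi_from_components` is a hypothetical name, and how watex derives the projected power and magnitude is not reproduced here.

```python
import math


def sfi_from_components(pa_star, pa, ma_star, ma):
    """sfi = sqrt((Pa*/Pa)**2 + (Ma*/Ma)**2), from precomputed components."""
    return math.hypot(pa_star / pa, ma_star / ma)


# Symmetrical anomaly: projected and actual values coincide, so both
# ratios equal 1 and sfi reaches the sqrt(2) threshold mentioned above.
symmetrical = sfi_from_components(40.0, 40.0, 120.0, 120.0)
# Illustrative asymmetrical case with ratios 0.75 and 0.5.
asymmetrical = sfi_from_components(30.0, 40.0, 60.0, 120.0)
print(symmetrical, asymmetrical)
```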
References
- See Numpy Polyfit.
- See Stackoverflow, the answer of AkaRem edited by Tobu and Migilson.
- See Numpy Errorstate and how to implement the context manager.
- watex.utils.shape(cz, s=Ellipsis, p=Ellipsis)[source]#
Compute the shape of anomaly.
The shape parameter is mostly used in the basement medium to depict the better conductive zone for the drilling location. According to Sombo et al. (2011; 2012), various shapes of anomalies can be described such as:
“V”, “U”, “W”, “M”, “K”, “C”, and “H”
Computing the shape consists of feeding the algorithm with the Electrical Resistivity Profiling resistivity values and specifying the station \((S_{VES})\). Most often, \(S_{VES}\) is the station with a very low resistivity value, expected to be the drilling location.
- Parameters:
cz – array-like - Conductive zone resistivity values
s – int, str - Station position index or name.
p – Array-like - Should be the position of the conductive zone.
Note
If s is given, p should also be provided. If p is missing, an error is raised.
- Returns:
str - the shape of anomaly.
- Example:
>>> import numpy as np
>>> rang = np.random.RandomState(42)
>>> from watex.utils.exmath import shape
>>> test_array1 = np.arange(10)
>>> shape (test_array1)
... 'C'
>>> test_array2 = rang.randn (7)
>>> shape(test_array2)
... 'H'
>>> test_array3 = np.power(10, test_array2 , dtype =np.float32)
>>> shape (test_array3)
... 'H' # does not change whatever the resistivity values.
References
- Sombo, P. A., Williams, F., Loukou, K. N., & Kouassi, E. G. (2011).
Contribution de la Prospection Électrique à L’identification et à la Caractérisation des Aquifères de Socle du Département de Sikensi (Sud de la Côte d’Ivoire). European Journal of Scientific Research, 64(2), 206–219.
- Sombo, P. A. (2012). Application des methodes de resistivites electriques
dans la determination et la caracterisation des aquiferes de socle en Cote d’Ivoire. Cas des departements de Sikensi et de Tiassale (Sud de la Cote d’Ivoire). Universite Felix Houphouet Boigny.
- watex.utils.smart_label_classifier(arr, /, values=None, labels=None, order='soft', func=None, raise_warn=True)[source]#
Smartly map a numeric array into class labels, either from a mapping function or from given fixed values.
New classes created from the fixed values can be renamed if labels are supplied.
- Parameters:
arr (ArrayLike 1d) – Array-like whose items are expected to be categorized.
values (float, list of float) – The threshold values from which the default categorization is fixed.
labels (int | str | list of [str, int]) – The label values that correspond to the fixed values. Note that the number of labels should equal the number of fixed values plus one; otherwise a ValueError is raised if order is set to strict.
order (str, ['soft'|'strict'], default='soft') – If order is strict, each item passed to values must be present in arr. Otherwise (order is soft), only a warning is raised.
func (callable, optional) – Function used to map the given array. If given, values does not need to be supplied.
raise_warn (bool, default=True) – Raise a warning message if order=soft and the fixed values are not found in arr. Also raise a warning if the labels argument does not match the number of classes derived from the fixed values.
- Returns:
arr – Categorized array with the same length as the raw array.
- Return type:
array-like 1d
Examples
>>> import numpy as np
>>> from watex.utils.funcutils import smart_label_classifier
>>> sc = np.arange (0, 7, .5 )
>>> smart_label_classifier (sc, values = [1, 3.2 ])
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
>>> # rename labels <=1 : 'l1', ]1; 3.2]: 'l2' and >3.2 :'l3'
>>> smart_label_classifier (sc, values = [1, 3.2 ], labels =['l1', 'l2', 'l3'])
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> def f (v):
...     if v <=1: return 'l1'
...     elif 1< v<=3.2: return "l2"
...     else : return "l3"
>>> smart_label_classifier (sc, func= f )
array(['l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2', 'l3', 'l3', 'l3', 'l3',
       'l3', 'l3', 'l3'], dtype=object)
>>> smart_label_classifier (sc, values = 1.)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> smart_label_classifier (sc, values = 1., labels='l1')
array(['l1', 'l1', 'l1', 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=object)
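The threshold-based categorization described above can be approximated with NumPy alone. The following is a minimal sketch, not the watex implementation: the helper name `classify_by_thresholds` is hypothetical, and it assumes right-closed intervals (`<=1`, `]1; 3.2]`, `>3.2`) as in the example comments.

```python
import numpy as np

def classify_by_thresholds(arr, values, labels=None):
    """Hypothetical helper: bin `arr` into right-closed intervals set by `values`."""
    arr = np.asarray(arr, dtype=float)
    bounds = np.sort(np.atleast_1d(values).astype(float))
    # right=True yields the bins ]-inf, v0], ]v0, v1], ..., ]vn, +inf[
    codes = np.digitize(arr, bounds, right=True)
    if labels is None:
        return codes
    labels = np.asarray(labels, dtype=object)
    if len(labels) != len(bounds) + 1:
        # Assumption: behave like a strict check on the number of labels.
        raise ValueError("expected len(values) + 1 labels")
    return labels[codes]

sc = np.arange(0, 7, 0.5)
codes = classify_by_thresholds(sc, [1, 3.2])
named = classify_by_thresholds(sc, [1, 3.2], labels=["l1", "l2", "l3"])
```

Unlike the documented function, this sketch always requires a complete set of labels and does not emit warnings.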
- watex.utils.smooth1d(ar, /, drop_outliers=True, ma=True, absolute=False, view=False, x=None, xlabel=None, ylabel=None, fig_size=(10, 5))[source]#
Smooth one-dimensional array.
- Parameters:
ar (ArrayLike 1d) – One-dimensional array.
drop_outliers (bool, default=True) – Remove the outliers in the data before smoothing.
ma (bool, default=True) – Use the moving average to smooth the array values. This seems more realistic.
absolute (bool, default=False) – Keep the extrapolated scaled values positive. When scaling data, negative values can appear because of the polyfit function. To take the absolute value, set absolute=True. Note that converting values to positive should be considered a last resort, used only when the values in the array must be positive.
view (bool, default=False) – Display curves.
x (ArrayLike, optional) – Abscissa array for visualization. If given, it must be consistent with the given array ar. Raises an error otherwise.
xlabel (str, optional) – Label of x.
ylabel (str, optional) – Label of y.
fig_size (tuple, default=(10, 5)) – Matplotlib figure size.
- Returns:
yc – Smoothed array value.
- Return type:
ArrayLike
Examples
>>> import numpy as np
>>> from watex.utils.exmath import smooth1d
>>> # add Gaussian noise
>>> np.random.seed (42)
>>> ar = np.random.randn (20 ) * 20 + np.random.normal ( 20 )
>>> ar [:7 ]
array([6.42891445e+00, 3.75072493e-02, 1.82905357e+01, 2.92957265e+01,
       6.20589038e+01, 2.26399535e+01, 1.12596434e+01])
>>> arc = smooth1d (ar, view =True , ma =False )
>>> arc [:7 ]
array([12.08603102, 15.29819907, 18.017749  , 20.27968322, 22.11900412,
       23.5707141 , 24.66981557])
>>> arc = smooth1d (ar, view =True )# ma=True by default
>>> arc [:7 ]
array([ 5.0071604 ,  5.90839339,  9.6264018 , 13.94679804, 17.67369252,
       20.34922943, 22.00836725])
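To illustrate what the ma option refers to, here is a minimal moving-average sketch. It is only an illustration under stated assumptions: the helper name `moving_average_1d`, the odd `window` argument and the edge padding are choices made for this example, not the method used by smooth1d.

```python
import numpy as np

def moving_average_1d(y, window=3):
    """Hypothetical helper: centered moving average that keeps the input length."""
    y = np.asarray(y, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    # Repeat the edge values so that the output has as many points as the input.
    padded = np.pad(y, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

rng = np.random.RandomState(42)
noisy = rng.randn(20) * 20 + 20
smoothed = moving_average_1d(noisy, window=5)
ramp = moving_average_1d([0.0, 3.0, 6.0], window=3)
```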
- watex.utils.smoothing(ar, /, drop_outliers=True, ma=True, absolute=False, axis=0, view=False, fig_size=(7, 7), xlabel=None, ylabel=None, cmap='binary')[source]#
Smooth data along axis.
- Parameters:
ar (ArrayLike 1d or 2d) – One-dimensional or two-dimensional array.
drop_outliers (bool, default=True) – Remove the outliers in the data before smoothing along the given axis.
ma (bool, default=True) – Use the moving average to smooth the array values along the axis. This seems more realistic than using only the scaling method.
absolute (bool, default=False) – Keep the extrapolated scaled values positive. When scaling data, negative values can appear because of the polyfit function. To take the absolute value, set absolute=True. Note that converting values to positive should be considered a last resort, used only when the values in the array must be positive.
axis (int, default=0) – Axis along which the data must be smoothed. The default is along the row.
view (bool, default=False) – Visualize the two-dimensional raw and smoothed grids.
xlabel (str, optional) – Label of x.
ylabel (str, optional) – Label of y.
fig_size (tuple, default=(7, 7)) – Matplotlib figure size.
cmap (str, default='binary') – Matplotlib colormap used to manage the view color.
- Returns:
arr0 – Smoothed array value.
- Return type:
ArrayLike
Examples
>>> import numpy as np
>>> from watex.utils.exmath import smoothing
>>> # add Gaussian noise
>>> np.random.seed (42)
>>> ar = np.random.randn (20, 7 ) * 20 + np.random.normal ( 20, 7 )
>>> ar [:3, :3 ]
array([[ 31.5265026 ,  18.82693352,  34.5459903 ],
       [ 36.94091413,  12.20273182,  32.44342041],
       [-12.90613711,  10.34646896,   1.33559714]])
>>> arc = smoothing (ar, view =True , ma =False )
>>> arc [:3, :3 ]
array([[32.20356863, 17.18624398, 41.22258603],
       [33.46353806, 15.56839464, 19.20963317],
       [23.22466498, 13.8985316 ,  5.04748584]])
>>> arcma = smoothing (ar, view =True )# ma=True by default
>>> arcma [:3, :3 ]
array([[23.96547827,  8.48064226, 31.81490918],
       [26.21374675, 13.33233065, 12.29345026],
       [22.60143346, 16.77242118,  2.07931194]])
>>> arcma_1 = smoothing (ar, view =True, axis =1 )
>>> arcma_1 [:3, :3 ]
array([[18.74017857, 26.91532187, 32.02914421],
       [18.4056216 , 21.81293014, 21.98535213],
       [-1.44359989,  3.49228057,  7.51734762]])
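Smoothing a grid "along an axis" amounts to applying a one-dimensional smoother to every slice taken along that axis. The sketch below shows the idea with `np.apply_along_axis`. The helper name `smooth_along_axis` and the simple edge-padded moving average are assumptions for illustration; the axis follows NumPy's convention, which may differ from the wording used for the axis parameter above.

```python
import numpy as np

def smooth_along_axis(ar, axis=0, window=3):
    """Hypothetical helper: moving average applied to each 1-D slice along `axis`."""
    def _moving_average(v):
        half = window // 2  # `window` is assumed to be odd
        padded = np.pad(v, half, mode="edge")
        return np.convolve(padded, np.ones(window) / window, mode="valid")
    return np.apply_along_axis(_moving_average, axis, np.asarray(ar, dtype=float))

# Each column is the ramp 0..4, so every row is constant.
grid = np.tile(np.arange(5.0)[:, None], (1, 3))
along_rows = smooth_along_axis(grid, axis=1)   # constant rows stay unchanged
along_cols = smooth_along_axis(grid, axis=0)   # ramps are flattened at both ends
```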
- watex.utils.split_train_test(df, test_ratio)[source]#
Naively split a dataset into train and test sets using a ratio, and return the shuffled train set and test set.
- Parameters:
df – A dataframe containing the features.
test_ratio – Ratio of the test set, between 0 and 1 (for instance 0.2 for 20%).
- Returns:
A tuple of train set and test set.
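A ratio-based split of this kind can be sketched with a random permutation of the row positions. This is an illustrative version only: the function name `naive_split` and the `seed` argument are hypothetical and are not part of the documented signature.

```python
import numpy as np
import pandas as pd

def naive_split(df, test_ratio, seed=42):
    """Hypothetical helper: shuffle row positions, then cut at the test ratio."""
    if not 0 <= test_ratio <= 1:
        raise ValueError("test_ratio must be between 0 and 1")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    test_size = int(len(df) * test_ratio)
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return df.iloc[train_idx], df.iloc[test_idx]

frame = pd.DataFrame({"resistivity": np.arange(10.0), "depth": np.arange(10)})
train_set, test_set = naive_split(frame, test_ratio=0.2)
```

Because the split depends only on a random permutation, the test set changes whenever the dataset is reordered or extended, which is why such a split is called naive.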
- watex.utils.to_numeric_dtypes(arr, *, columns=None, return_feature_types=False, missing_values=nan, pop_cat_features=False, sanitize_columns=False, regex=None, fill_pattern=None, drop_nan_columns=True, how='all', reset_index=False, drop_index=True, verbose=False)[source]#
Convert array to dataframe and coerce arguments to appropriate dtypes.
- Parameters:
arr (Ndarray or DataFrame, shape (M=samples, N=features)) – Array or DataFrame from which the dataframe is created.
columns (list of str, optional) – Useful to create a dataframe when an array is given. Make sure it matches the number of array columns (shape[1]).
return_feature_types (bool, default=False) – Return the lists of numerical and categorical features.
missing_values (float) – Value used to replace missing values or empty strings, if any exist in the dataframe.
pop_cat_features (bool, default=False) – Remove the categorical features from the DataFrame.
sanitize_columns (bool, default=False) –
Remove undesirable characters in the data columns using the default argument of the regex parameter.
New in version 0.1.9.
regex (re object) –
Regular expression object used to polish the data columns. The default is:
>>> import re
>>> re.compile (r'[_#&.)(*@!_,;\s-]\s*', flags=re.IGNORECASE)
New in version 0.1.9.
fill_pattern (str, default='') – Pattern that replaces the non-alphabetic characters in each item of columns.
drop_nan_columns (bool, default=True) –
Remove all columns filled with NaN values.
how (str, default='all') – Also drop rows of NaN data, i.e. rows composed entirely of NaN or Null values.
reset_index (bool, default=False) –
Reset the index of the dataframe.
drop_index (bool, default=True) –
Drop the index in the dataframe after resetting.
verbose (bool, default=False) – Output a message listing the categorical items dropped from the dataframe, if any.
- Returns:
df or (df, nf, cf) – Also returns nf and cf if return_feature_types is set to True.
- Return type:
Dataframe of values cast to numeric types
Examples
>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils.funcutils import to_numeric_dtypes
>>> X, y = load_bagoue (as_frame =True )
>>> X0 =X[['shape', 'power', 'magnitude']]
>>> X0.dtypes
... shape        object
    power        object
    magnitude    object
    dtype: object
>>> df = to_numeric_dtypes(X0)
>>> df.dtypes
... shape         object
    power        float64
    magnitude    float64
    dtype: object
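The effect of sanitize_columns, regex and fill_pattern can be pictured with a plain `re.sub` over the column names. The sketch below reuses the default pattern quoted in the parameter list. The helper name `sanitize_column_names` is hypothetical, and how watex applies the pattern internally is an assumption.

```python
import re
import pandas as pd

# Default pattern quoted in the `regex` parameter description above.
DEFAULT_REGEX = re.compile(r'[_#&.)(*@!_,;\s-]\s*', flags=re.IGNORECASE)

def sanitize_column_names(df, regex=DEFAULT_REGEX, fill_pattern=""):
    """Hypothetical helper: replace undesirable characters in column names."""
    cleaned = [regex.sub(fill_pattern, str(col)).strip() for col in df.columns]
    return df.rename(columns=dict(zip(df.columns, cleaned)))

frame = pd.DataFrame(
    {"depth top": [1.0], "rock#name": ["granite"], "well  diameter": [0.2]}
)
clean = sanitize_column_names(frame, fill_pattern="_")
```

The trailing `\s*` in the pattern swallows any whitespace that follows a matched character, so a run of spaces collapses into a single replacement.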
- watex.utils.twinning(*d, on=None, parse_on=False, mode='strict', coerce=False, force=False, decimals=7, raise_warn=True)[source]#
Find identical objects in all data and concatenate them using the merge intersection (cross) strategy.
- Parameters:
d (List of DataFrames) – List of pandas DataFrames.
on (str, label or list) –
Column or index level names to join on. These must be found in all DataFrames. If on is None and not merging on indexes, then a concatenation along the columns axis is performed in all DataFrames. Note that on works with parse_on if its argument is a list of column names passed as a single literal string. For instance:
on ='longitude latitude' --[parse_on=True]-> ['longitude' , 'latitude']
parse_on (bool, default=False) – Parse the on argument if given as a string and return iterable objects.
mode (str, default='strict') – Mode to handle the data. Can be [‘soft’|’strict’]. In strict mode, all the data passed must be DataFrames, otherwise an error is raised. In soft mode, non-DataFrame items are ignored. Note that any other value falls back to strict mode.
coerce (bool, default=False) – Truncate all DataFrames to match the size of the shortest one before performing the merge.
force (bool, default=False) – Force the on items to be in all DataFrames. This is possible only if the on items are in at least one DataFrame. If they are missing in all data, an error occurs.
decimals (int, default=7) –
Decimals used for comparison between numeric labels in the on columns. If set, it rounds the values of the on items in all data before performing the merge.
raise_warn (bool, default=True) –
Warn the user that data is concatenated along the column axis if on is None.
- Returns:
data – A DataFrame of the merged objects.
- Return type:
DataFrames
Examples
>>> import watex as wx
>>> from watex.utils.funcutils import twinning
>>> data = wx.make_erp (seed =42 , n_stations =12, as_frame =True )
>>> table1 = wx.DCProfiling ().fit(data).summary()
>>> table1
       dipole   longitude  latitude  ...  shape  type       sfi
line1      10  110.486111  26.05174  ...      C    EC  1.141844
>>> data_no_xy = wx.make_ves ( seed=0 , as_frame =True)
>>> data_no_xy.head(2)
    AB   MN  resistivity
0  1.0  0.4   448.860148
1  2.0  0.4   449.060335
>>> data_xy = wx.make_ves ( seed =0 , as_frame =True , add_xy =True )
>>> data_xy.head(2)
    AB   MN  resistivity   longitude  latitude
0  1.0  0.4   448.860148  109.332931  28.41193
1  2.0  0.4   449.060335  109.332931  28.41193
>>> table = wx.methods.VerticalSounding (
...     xycoords = (110.486111, 26.05174)).fit(data_no_xy).summary()
>>> table.table_
         AB    MN   arrangememt  ...  nareas   longitude  latitude
area                             ...
None  200.0  20.0  schlumberger  ...       1  110.486111  26.05174
>>> twinning (table1, table.table_, )
       dipole   longitude  latitude  ...  nareas   longitude  latitude
line1    10.0  110.486111  26.05174  ...     NaN         NaN       NaN
None      NaN         NaN       NaN  ...     1.0  110.486111  26.05174
>>> twinning (table1, table.table_, on =['longitude', 'latitude'] )
Empty DataFrame
>>> # comments: an empty dataframe appears because decimals is too large,
>>> # so the longitude and latitude values are considered different
>>> twinning (table1, table.table_, on =['longitude', 'latitude'], decimals =5 )
   dipole  longitude  latitude  ...  max_depth  ohmic_area  nareas
0      10  110.48611  26.05174  ...      109.0  690.063003       1
>>> # Now it is able to find the rows with close enough coordinates.
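The role of decimals in the example above can be reproduced with plain pandas: rounding the on columns before an inner merge lets nearly identical coordinates match. This is a minimal sketch under that assumption. The helper name `merge_on_rounded` and the sample coordinates are made up for illustration and do not reflect the internals of twinning.

```python
import pandas as pd

def merge_on_rounded(left, right, on, decimals=5):
    """Hypothetical helper: round the key columns, then inner-merge on them."""
    left, right = left.copy(), right.copy()
    left[on] = left[on].round(decimals)
    right[on] = right[on].round(decimals)
    return pd.merge(left, right, on=on, how="inner")

keys = ["longitude", "latitude"]
erp = pd.DataFrame({"longitude": [110.486111], "latitude": [26.05174], "dipole": [10]})
ves = pd.DataFrame({"longitude": [110.4861112], "latitude": [26.0517401], "nareas": [1]})

strict = pd.merge(erp, ves, on=keys, how="inner")      # exact float comparison
loose = merge_on_rounded(erp, ves, keys, decimals=5)   # comparison after rounding
```

With exact comparison the two tables share no key, so the merge is empty. After rounding to five decimals, both rows carry the same coordinates and are joined.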
- watex.utils.type_(erp)[source]#
Compute the type of anomaly.
The type parameter is defined by the African Hydraulic Study Committee report (CIEH, 2001). It was later implemented by authors such as (Adam et al., 2020; Michel et al., 2013; Nikiema, 2012). The type helps to differentiate two or more anomalies with the same shape. For instance, two anomalies with the same shape W will differ in the order of priority of their types. The type depends on the lateral resistivity distribution of the underground (resulting from the pace of the apparent resistivity curve) along the whole Electrical Resistivity Profiling survey line. Four types of anomalies were emphasized:
“EC”, “CB2P”, “NC” and “CP”.
For more details, refer to the references.
- Parameters:
erp – array-like - Array of Electrical Resistivity Profiling line composed of apparent resistivity values.
- Returns:
str -The type of anomaly.
- Example:
>>> import numpy as np
>>> from watex.utils.exmath import type_
>>> rang = np.random.RandomState(42)
>>> test_array2 = rang.randn (7)
>>> type_(np.abs(test_array2))
... 'EC'
>>> long_array = np.abs (rang.randn(71))
>>> type_(long_array)
... 'PC'
References
- Adam, B. M., Abubakar, A. H., Dalibi, J. H., Khalil Mustapha, M., & Abubakar,
A. H. (2020). Assessment of Gaseous Emissions and Socio-Economic Impacts From Diesel Generators used in GSM BTS in Kano Metropolis. African Journal of Earth and Environmental Sciences, 2(1), 517–523. https://doi.org/10.11113/ajees.v3.n1.104
- CIEH. (2001). L’utilisation des méthodes géophysiques pour la recherche
d’eaux dans les aquifères discontinus. Série Hydrogéologie, 169.
- Michel, K. A., Drissa, C., Blaise, K. Y., & Jean, B. (2013). Application
de méthodes géophysiques à l’étude de la productivité des forages d’eau en milieu cristallin : cas de la région de Toumodi (Centre de la Côte d’Ivoire). International Journal of Innovation and Applied Studies, 2(3), 324–334.
- Nikiema, D. G. C. (2012). Essai d‘optimisation de l’implantation géophysique
des forages en zone de socle : Cas de la province de Séno, Nord Est du Burkina Faso (IRD). (I. / I. Ile-de-France, Ed.). IST / IRD Ile-de-France, Ouagadougou, Burkina Faso, West-africa. Retrieved from http://documentation.2ie-edu.org/cdi2ie/opac_css/doc_num.php?explnum_id=148
- watex.utils.vesDataOperator(AB=None, rhoa=None, data=None, typeofop=None, outdf=False)[source]#
Check the data in the given deep measurement and set the suitable operation for duplicated spacing distances of the current electrodes AB.
Sometimes, at the potential electrodes (MN), the measurements of AB are collected twice after slightly modifying the distance of MN. At this point, two or more resistivity values are assigned to the same distance AB (AB remains unchanged while MN is changed). So the operation consists of either averaging (mean) the resistivity values, taking the median value, or leaveOneOut (i.e. keeping one value of resistivity among the different values collected at the same point AB) at the same spacing AB. Note that for leaveOneOut, the selected resistivity value is randomly chosen.
- Parameters:
AB (array-like 1d) – Spacing of the current electrodes when exploring in depth. It is the depth measurement (AB/2) using the current electrodes AB. Units are in meters.
rhoa (array-like 1d) – Apparent resistivity values collected by imaging in depth. Units are in \(\Omega {.m}\), not \(log10(\Omega {.m})\).
data (DataFrame) – It is composed of the spacing values AB and the apparent resistivity values rhoa. If data is given, params AB and rhoa should be kept to None.
typeofop (str, ['mean'|'median'|'leaveoneout'], default='mean') – Type of operation to apply to the resistivity values rhoa of the duplicated spacing points AB. The default operation is mean.
outdf (bool, default=False) – Output a new dataframe composed of AB and rhoa; data renewed.
- Returns:
- Tuple of (AB, rhoa) (New values computed from typeofop)
- DataFrame (New dataframe output only if outdf is True.)
Notes
By convention, AB and MN are half-space dipole lengths, which correspond to AB/2 and MN/2 respectively.
Examples
>>> from watex.utils.exmath import vesDataOperator
>>> from watex.utils.coreutils import vesSelector
>>> data = vesSelector ('data/ves/ves_gbalo.xlsx')
>>> data.shape
... (32, 3) # include the potential electrode values `MN`
>>> df= vesDataOperator(data.AB, data.resistivity, typeofop='leaveOneOut', outdf =True)
>>> df.shape
... (26, 2) # exclude `MN` values and reduce(-6) the duplicated values.
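The three operations on duplicated AB spacings (mean, median, leaveOneOut) map naturally onto a pandas group-by. The sketch below is an illustration of that idea rather than the watex code: the helper name `aggregate_duplicated_ab` and the `seed` argument are hypothetical.

```python
import numpy as np
import pandas as pd

def aggregate_duplicated_ab(AB, rhoa, typeofop="mean", seed=None):
    """Hypothetical helper: keep one resistivity value per AB spacing."""
    df = pd.DataFrame({"AB": AB, "resistivity": rhoa})
    grouped = df.groupby("AB", sort=True)["resistivity"]
    op = str(typeofop).lower()
    if op == "mean":
        out = grouped.mean()
    elif op == "median":
        out = grouped.median()
    elif op == "leaveoneout":
        # Pick one of the duplicated values at random for each spacing.
        rng = np.random.default_rng(seed)
        out = grouped.agg(lambda s: rng.choice(s.to_numpy()))
    else:
        raise ValueError("typeofop must be 'mean', 'median' or 'leaveoneout'")
    return out.reset_index()

AB = [1.0, 2.0, 2.0, 3.0]             # spacing 2.0 was measured twice
rhoa = [100.0, 200.0, 300.0, 400.0]
averaged = aggregate_duplicated_ab(AB, rhoa, typeofop="mean")
picked = aggregate_duplicated_ab(AB, rhoa, typeofop="leaveOneOut", seed=0)
```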
- watex.utils.vesSelector(data=None, *, rhoa=None, AB=None, MN=None, index_rhoa=None, xy_coords=None, is_utm=False, utm_zone=None, epsg=None, **kws)[source]#
Assert the validity of Vertical Electrical Sounding data and return a sanitized dataframe.
- param rhoa:
array-like - Apparent resistivities collected during the sounding.
- param AB:
array-like - Investigation distance between the current electrodes. Note that AB is by convention equal to AB/2. It is taken as the half-space of the investigation depth.
- param MN:
array-like - Potential electrode distances at each investigation depth. Note that by convention the values are half-space and equal to MN/2.
- param data:
Path-like object or sounding dataframe. If given, the other parameters can keep their None values.
- param index_rhoa:
int - The index used to retrieve the resistivity data of a specific sounding point. Sometimes the sounding data are composed of different sounding values collected in the same survey area on different Electrical Resistivity Profiling lines. For instance:
AB/2 | MN/2 | SE1 | SE2 | SE3 | … | SEn
Where SE are the electrical sounding data values and n is the number of sounding points selected. SE1, SE2 and SE3 are three points selected for Vertical Electrical Sounding, i.e. 3 sounding points carried out either on the same Electrical Resistivity Profiling line or somewhere else. These sounding data are the resistivity data with specific numbers. Commonly, the numbers are randomly chosen. They do not refer to the expected best fracture zone selected after the prior interpretation. After transformation via the function vesSelector, the header of the data should hold the resistivity. For instance, referring to the table above, the data should be:
AB | MN | resistivity | resistivity | resistivity | …
Therefore, index_rhoa is used to select the specific resistivity values, i.e. to select the corresponding sounding number of the Vertical Electrical Sounding expected to locate the drilling operations or for computation. For example, index_rhoa=1 should yield:
AB/2 | MN/2 | SE2 --> AB | MN | resistivity
If index_rhoa is None and there is more than one sounding curve, the first sounding curve is selected by default, i.e. index_rhoa equals 0.
- param xy_coords:
tuple (float, float) Coordinates of the sounding point. Must be (‘longitude’, ’latitude’) or (‘easting’, ‘northing’). If xy_coords is given as (‘easting’, ‘northing’), specify is_utm=True so the conversion to (‘longitude’, ‘latitude’) is triggered. If False, a warning occurs if values are greater than 180 and 90 degrees for longitude and latitude respectively. Note that if the coordinates exist in the dataframe, they take priority.
New in version 0.2.1.
- param is_utm:
bool, default=False. Allow conversion of the (‘easting’, ‘northing’) coordinates from xy_coords to (‘longitude’, ‘latitude’).
- param utm_zone:
default=’49R’. Needed when xy_coords is passed as (‘easting’, ‘northing’) for conversion.
- param epsg:
int, str, optional. EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.
- param kws:
dict - Additional keyword arguments for reading the pandas dataframe.
- return:
DataFrame - Sanitized Vertical Electrical Sounding dataframe with AB, MN and resistivity as the column headers.
- Example:
>>> from watex.utils.coreutils import vesSelector
>>> df = vesSelector (data='data/ves/ves_gbalo.csv')
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          943
    1   2  0.4         1179
    2   3  0.4         1103
>>> df = vesSelector ('data/ves/ves_gbalo.csv', index_rhoa=3 )
>>> df.head(3)
...    AB   MN  resistivity
    0   1  0.4          457
    1   2  0.4          582
    2   3  0.4          558
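The index_rhoa selection described in the parameter list can be pictured as picking one sounding column from a wide table and renaming the headers. The sketch below assumes the first two columns are AB/2 and MN/2 and that all remaining columns are sounding curves. The helper name `select_sounding` and the sample values are hypothetical.

```python
import pandas as pd

def select_sounding(df, index_rhoa=None):
    """Hypothetical helper: keep AB, MN and one sounding curve as `resistivity`."""
    idx = 0 if index_rhoa is None else int(index_rhoa)
    n_curves = df.shape[1] - 2  # columns after AB/2 and MN/2
    if not 0 <= idx < n_curves:
        raise IndexError(f"index_rhoa must be in [0, {n_curves - 1}]")
    out = df.iloc[:, [0, 1, 2 + idx]].copy()
    out.columns = ["AB", "MN", "resistivity"]
    return out

raw = pd.DataFrame({
    "AB/2": [1, 2, 3],
    "MN/2": [0.4, 0.4, 0.4],
    "SE1": [943, 1179, 1103],
    "SE2": [500, 610, 590],
    "SE3": [457, 582, 558],
})
first = select_sounding(raw)                  # index_rhoa=None selects SE1
second = select_sounding(raw, index_rhoa=1)   # selects SE2
```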
- watex.utils.z2rhoa(z, freq)[source]#
Convert impedance tensor z to apparent resistivity.
- Parameters:
z (ndarray, shape (N, M)) – Impedance tensor in \(\Omega\)
freq (array-like , shape (N, )) – Frequency in Hertz
- Returns:
Apparent resistivity in \(\Omega.m\)
- Return type:
ndarray, shape (N, M)
- Example:
>>> import numpy as np
>>> from watex.utils.exmath import z2rhoa
>>> z = np.array([2 + 1j *3 ])
>>> f = np.array ([1014])
>>> z2rhoa(z, f)
... array([[1623.73691735]])
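The value in the example above is consistent with the usual SI relation \(\rho_a = |Z|^2 / (\omega \mu_0)\), with \(\omega = 2\pi f\) and \(\mu_0 = 4\pi \times 10^{-7}\) H/m. The sketch below applies that relation. The function name `impedance_to_rhoa` is hypothetical, and whether watex computes the result in exactly this way is an assumption.

```python
import numpy as np

MU0 = 4 * np.pi * 1e-7  # magnetic permeability of free space, in H/m

def impedance_to_rhoa(z, freq):
    """Hypothetical helper: apparent resistivity |Z|^2 / (omega * mu0)."""
    z = np.asarray(z, dtype=complex)
    freq = np.asarray(freq, dtype=float)
    if z.ndim == 1:
        z = z[:, None]                  # treat a 1-D input as shape (N, 1)
    omega = 2 * np.pi * freq[:, None]   # one angular frequency per row of z
    return np.abs(z) ** 2 / (omega * MU0)

rhoa = impedance_to_rhoa(np.array([2 + 3j]), np.array([1014]))
```

Here \(|Z|^2 = 2^2 + 3^2 = 13\), and dividing by \(\omega \mu_0 \approx 8.006 \times 10^{-3}\) gives roughly 1623.74 \(\Omega.m\), in line with the documented output.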