watex.datasets package

The datasets subpackage fetches data from the local machine. If the data does not exist or has been deleted, a remote search (repository or Zenodo record) is triggered via the rload module.

class watex.datasets.DATASET

Bases: object

DOC = "WATex dataset is composed of different kind of data for software implementation. \n - ERP data found in 'gbalo', 'boundiali' localities in northern part of \n Cote d'Ivoire <'https://en.wikipedia.org/wiki/Ivory_Coast'>'\n - VES data collected in 'gbalo', 'semien', 'tankesse' in center and \n eastearn part of Cote d'Ivoire'.\n - FLOW RATE FEATURES data computed from Bagoue ERP and VES data. \n Refer to paper :doi:`https://doi.org/10.1029/2021wr031623`. \n - COMMON MACHINE LEARNING popular data sets such IRIS. \n - EDI datasets from huayaun county in China, Hunan Province. \n - HLOGS and MXS data for Hongliu coal mine raw and preprocessed borehole data. \n"

HOW = "Fetch data using 'load_<type_of_data|area_name>'like: 1- BAGOUE : load_bagoue() 2- GBALO : load_gbalo() 3- IRIS : load_iris() 4- SEMIEN : load_semien() 5- TANKESSE: load_tankesse() 6- BOUNDIALI: load_boundiali() 7- HLOGS : load_hlogs() 8- NLOGS : load_nlogs() 9- MXS : load_mxs() 10- HUAYUAN: load_huayuan() 11- EDIS : load_edis() or using ufunc 'fetch_data (<type_of_data|area_name>)'."

KIND = ('bagoue', 'gbalo', 'iris', 'semien', 'tankesse', 'boundiali', 'hlogs', 'nlogs', 'mxs', 'huayuan', 'edis')
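The naming rule stated in HOW (one load_<name> function per entry of KIND) can be illustrated with a small, self-contained sketch. The helper loader_name is hypothetical and is not part of watex; it only shows how a dataset name maps to the documented loader name.

```python
# Names copied from the KIND tuple documented above.
KIND = ('bagoue', 'gbalo', 'iris', 'semien', 'tankesse', 'boundiali',
        'hlogs', 'nlogs', 'mxs', 'huayuan', 'edis')


def loader_name(area):
    """Return the 'load_<area>' name for a known dataset (hypothetical helper)."""
    area = area.strip().lower()
    if area not in KIND:
        raise ValueError(f"Unknown dataset {area!r}; expected one of {KIND}")
    return f"load_{area}"


# One loader name per documented dataset kind.
names = [loader_name(kind) for kind in KIND]
```

With watex installed, the returned name could be looked up on the watex.datasets module, for example with getattr; that step is left out here so the sketch stays runnable on its own.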

watex.datasets.fetch_data(tag, **kws)

Fetch dataset from tag.

A tag corresponds to the name of the data collection area, optionally followed by the stage of data processing.

Parameters:

tag (str, ['bagoue', 'tankesse', 'semien', 'iris', 'boundiali', 'gbalo']) –

Name of the area of the data to fetch. For instance, setting the tag to bagoue loads the bagoue datasets. If the tag name is followed by a suffix, the latter specifies the stage of the data processing. As an example, bagoue original or bagoue prepared retrieves the original data or the data transformed by the default transformers, respectively.

There are different options to retrieve data such as:

['original'] => original (raw) data; returns a dict of details. Combine it with the get method to retrieve the dataframe, like:
>>> fetch_data ('bagoue original').get ('data=df')
['stratified'] => stratified data
['mid' | 'semi' | 'preprocess' | 'fit'] => cleaned data, with the experimental attribute combinations added.
['pipe'] => default pipeline created during data preparation.
['analyses' | 'pca' | 'reduce dimension'] => data whose text attributes are encoded using the ordinal encoder only, plus the attribute combinations.
['test'] => stratified test set data
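To make the tag convention concrete, here is a minimal sketch of how a tag could be split into an area name and an optional processing stage. The function split_tag is hypothetical; the actual parsing done inside fetch_data is not shown in this page and may differ.

```python
def split_tag(tag):
    """Split 'area [stage]' into (area, stage); stage is None when absent.

    Hypothetical helper: it only mirrors the convention described above.
    """
    parts = tag.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("tag must not be empty")
    area = parts[0].lower()
    # Everything after the first word is the stage, so multi-word
    # suffixes such as 'reduce dimension' are kept intact.
    stage = parts[1].lower() if len(parts) == 2 else None
    return area, stage


plain = split_tag("bagoue")
staged = split_tag("bagoue original")
multi_word = split_tag("bagoue reduce dimension")
```

Splitting only on the first run of whitespace is a design choice that keeps multi-word stages usable without extra quoting.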

Returns:

dict, X, y –

If the tag is followed by a suffix, in the case of the 'bagoue' area, it returns:

data: original data
X, y: stratified training set and training target
X0, y0: data cleaned after dropping useless features, with combined numerical attributes added if True
X_prepared, y_prepared: data prepared after applying all the transformations via the transformer (pipeline).
XT, yT: stratified test set and test labels
_X: stratified training set for data analysis. It contains no sparse matrix; the text (categorical) attributes are converted using the ordinal encoder.
_pipeline: the default pipeline.

Return type:

frame of Boxspace object

Examples

>>> from watex.datasets import fetch_data
>>> b = fetch_data('bagoue' ) # no suffix returns 'Boxspace' object
>>> b.tnames
... array(['flow'], dtype='<U4')
>>> b.feature_names
... ['num',
     'name',
     'east',
     'north',
     'power',
     'magnitude',
     'shape',
     'type',
     'sfi',
     'ohmS',
     'lwi',
     'geol']
>>> X, y = fetch_data('bagoue prepared' )
>>> X # is transformed  # ready for prediction
>>> X[0]
... <1x18 sparse matrix of type '<class 'numpy.float64'>'
        with 8 stored elements in Compressed Sparse Row format>
>>> y
... array([2, 1, 2, 2, 1, 0, ... , 3, 2, 3, 3, 2], dtype=int64)

watex.datasets.load_bagoue(*, return_X_y=False, as_frame=False, split_X_y=False, test_size=0.3, tag=None, data_names=None, **kws)

Load the Bagoue dataset.

The Bagoue dataset is a classic multi-class classification dataset. Refer to the description for more details.

Parameters:

return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.1.
split_X_y (bool, default=False) – If True, the data is split into the training set (X, y) and the testing set (Xt, yt) according to the test size ratio.
test_size (float, default=0.3) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets. The default 0.3 keeps 30% of the data for testing.
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.

Returns:

data (Boxspace) – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe} of shape (150, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.

target: {ndarray, Series} of shape (150,)
The classification target. If as_frame=True, target will be a pandas Series.

feature_names: list
The names of the dataset columns.

target_names: list
The names of target classes.

frame: DataFrame of shape (150, 5)
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.2.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first contains a 2D array of shape (n_samples, n_features), with each row representing one sample and each column representing a feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – The training set (X, y) and the testing set (Xt, yt). X and Xt are 2D arrays of shape:

\[ \begin{align}\begin{aligned}\text{shape}(X) = ((1 - \text{test_ratio}) \times n_{samples},\ n_{features})\\\text{shape}(Xt) = (\text{test_ratio} \times n_{samples},\ n_{features})\end{aligned}\end{align} \]

where test_ratio is the value of test_size, each row represents one sample and each column represents a feature. y and yt are the matching 1D arrays containing the target samples.
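As a worked instance of the shapes above, the sketch below computes the train and test shapes for the documented (150, 4) data matrix with test_size=0.3. The helper split_shapes is hypothetical, and rounding the test count up is an assumption borrowed from scikit-learn's train_test_split, not a confirmed watex behaviour.

```python
import math


def split_shapes(n_samples, n_features, test_size=0.3):
    """Return ((n_train, n_features), (n_test, n_features)).

    Hypothetical helper; assumes the test count is rounded up and the
    remaining rows form the training set.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return (n_train, n_features), (n_test, n_features)


train_shape, test_shape = split_shapes(150, 4, test_size=0.3)
print(train_shape, test_shape)  # (105, 4) (45, 4)
```

Only the number of rows changes with the ratio; the number of feature columns is the same in X and Xt.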

Examples

Let’s say you are interested in the samples 10, 25, and 50, and want to know their class name:

>>> from watex.datasets import load_bagoue
>>> d = load_bagoue ()
>>> d.target[[10, 25, 50]]
array([0, 2, 0])
>>> list(d.target_names)
['flow']

watex.datasets.load_boundiali(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)

DC vertical electrical sounding resistivity data collected from Boundiali during the National Drinking Water Supply Program (PNAEP), which took place in 2014 in Cote d’Ivoire. The data is arranged as follows:

AB/2	MN/2	SE1	SE2	SE…
1	0.4	107	93	75
2	0.4	97	91	49
…	…	…	…	…
100	10	79	96	98
110	10	84	104	104

Parameters:

as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (33, 6)
  The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (33,)
  The resistivity of the sounding point.
- MN: {array-like} of shape (33,)
  The step value of the potential electrodes, increasing in meters.
- AB: {array-like} of shape (33,)
  The step value of the current electrodes, increasing in meters.
- feature_names: list
  The names of the dataset columns. Added in version 0.23.
- DESCR: str
  The full description of the dataset.
- filename: str
  The path to the location of the data. Added in version 0.20.
Added in version 0.1.2.
index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when many sounding values are collected in the same survey area. index_rhoa=0 fetches the first sounding values in the collection of all values.
tag (None) – Always None, for API consistency.
data_names (None) – Always None, for API consistency.
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. The total number of soundings performed, with the prefix ‘SE’, is 4. AB and MN are in meters and the SE columns are in ohm.meters as apparent resistivity values. Use the index_rhoa parameter to select a sounding by its rank. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).
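The role of index_rhoa can be sketched on the rows shown in the illustration above. This is a plain-Python model of the column selection, not the watex implementation; the helper select_sounding is hypothetical and the elided rows of the table are simply left out.

```python
# Rows copied from the illustration: (AB/2, MN/2, SE1, SE2, SE...).
rows = [
    (1, 0.4, 107, 93, 75),
    (2, 0.4, 97, 91, 49),
    (100, 10, 79, 96, 98),
    (110, 10, 84, 104, 104),
]


def select_sounding(rows, index_rhoa=0):
    """Keep AB/2, MN/2 and the resistivity column ranked index_rhoa."""
    n_soundings = len(rows[0]) - 2  # the first two columns are AB/2 and MN/2
    if not 0 <= index_rhoa < n_soundings:
        raise IndexError(f"index_rhoa must be in [0, {n_soundings - 1}]")
    return [(r[0], r[1], r[2 + index_rhoa]) for r in rows]


first = select_sounding(rows, index_rhoa=0)   # SE1 values
second = select_sounding(rows, index_rhoa=1)  # SE2 values
```

Each returned triple holds the two electrode spacings and one apparent resistivity value, which is the layout a single sounding curve needs.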

watex.datasets.load_edis(*, return_data=False, as_frame=False, key=None, tag=None, samples=None, data_names=None, **kws)

Load SEG-Electrical Data Interchange (EDI) object

EDI data is a sample of data collected in Huayuan county in China.

Parameters:

return_data (bool, default=False) – If True, returns the data as a 1D array-like instead of a Boxspace object. Note that the data is only a collection of EDI objects from watex.edi.Edi.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
samples (int, default=None) – The sample of data to retrieve.
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
key (str, {'site', 'edi', 'latitude', '*', 'longitude'}, default='edi') – Kind of EDI data to fetch. If the star is given, all the data is fetched into a single frame.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe} of shape (50, 4): The data matrix. If as_frame=True, data will be a pandas DataFrame.
feature_names: list: The names of the dataset columns.
frame: DataFrame of shape (50, 4): Only present when as_frame=True. DataFrame with data and no target. Added in version 0.1.2.
DESCR: str: The full description of the dataset.
filename: str: The path to the location of the data. Added in version 0.1.2.

Return type:

Boxspace

Examples

>>> from watex.datasets.dload import load_edis
>>> load_edis ().frame [:3]
                edi
0  Edi( verbose=0 )
1  Edi( verbose=0 )
2  Edi( verbose=0 )
>>> load_edis (as_frame =True, key='longitude latitude', samples = 7)
    latitude   longitude
0  26.051390  110.485833
1  26.051794  110.486153
2  26.052198  110.486473
3  26.052602  110.486793
4  26.053006  110.487113
5  26.053410  110.487433
6  26.053815  110.487753

watex.datasets.load_gbalo(*, kind='erp', as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)

Load Gbalo ERP and VES datasets.

The dataset is composed of DC electrical resistivity profiling (ERP) and vertical electrical sounding (VES) data collected from the Gbalo locality. To change the kind of dataset, pass erp or ves to the kind parameter. The data was collected during the National Drinking Water Supply Program (PNAEP), which took place in 2012-2014 in Cote d’Ivoire. Refer to erp_doc and ves_doc for an illustration of the data arrangement.

Parameters:

kind (str, ['ves'|'erp'], default='erp') – The kind of DC data to retrieve. If kind is set to ves, the VES data is fetched; otherwise the ERP data is fetched.
as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

data: {ndarray, dataframe} of shape (33, 6) and (45, 4) for VES and ERP
The data matrix. If as_frame=True, data will be a pandas DataFrame.

resistivity: {array-like} of shape (33,) and (45,) for VES and ERP
The resistivity of the sounding point.

station: {array-like} of shape (33,) and (45,) for VES and ERP
The distance of each station along the profile, increasing in meters. It can be considered as the station point for data collection.

northing: {array-like} of shape (33,) and (45,) for VES and ERP
The northing coordinates in UTM in meters at each station where the data is collected.

easting: {array-like} of shape (33,) and (45,) for VES and ERP
The easting coordinates in UTM in meters at each station where the data is collected.

latitude: {array-like} of shape (33,) and (45,) for VES and ERP
The latitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.

longitude: {array-like} of shape (33,) and (45,) for VES and ERP
The longitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data.
tag (None) – Always None, for API consistency.
data_names (None) – Always None, for API consistency.
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Example

>>> from watex.datasets import load_gbalo
>>> b= load_gbalo (as_frame =False , kind ='erp')
>>> b.station  # retrieve the station position
... array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
       110., 120., 130., 140., 150., 160., 170., 180., 190., 200., 210.,
       220., 230., 240., 250., 260., 270., 280., 290., 300., 310., 320.,
       330., 340., 350., 360., 370., 380., 390., 400., 410., 420., 430.,
       440.])

Notes

The array configuration is Schlumberger and the maximum investigation depth is 100 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed to 100 meters whereas \(MN/2\) (potential electrodes) is fixed to 10 meters. The total number of stations where data was collected is 45, while the number of sounding points is estimated at 33. station, easting and northing are in meters and the rho columns are in ohm.meters as apparent resistivity values. Furthermore, the total number of soundings performed with the prefix ‘SE’ is 4.

watex.datasets.load_hlogs(*, return_X_y=False, as_frame=False, key=None, split_X_y=False, test_size=0.3, tag=None, tnames=None, data_names=None, **kws)

Load the hydro-logging dataset.

The dataset contains multiple targets and can be used for a classification or a regression problem.

Parameters:

return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.3.
split_X_y (bool, default=False) – If True, the data is split into the training set (X, y) and the testing set (Xt, yt) according to the test size ratio.
test_size (float, default=0.3) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets. The default 0.3 keeps 30% of the data for testing.
tnames (str, optional) – The name of the target to retrieve. If None, all the target columns are collected and compose a multioutput y. For a single classification or regression problem, it is recommended to indicate the name of the target needed for the learning task.
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
key (str, default='h502') –
Kind of logging data to fetch. Can also be the borehole [“h2601”, “*”]. If key='*', all the data is aggregated into a single borehole frame.

Added in version 0.2.3: Adds data for 8 new boreholes from logging, strata, layer thicknesses and rock_names.
drop_observations (bool, default=False) – Drop the remark column in the logging data if set to True. Added in version 0.1.5.

Returns:

data (Boxspace) – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.

target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.

feature_names: list
The names of the dataset columns.

target_names: list
The names of target classes.

frame: DataFrame
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.1.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first contains a 2D array of shape (n_samples, n_features), with each row representing one sample and each column representing a feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – The training set (X, y) and the testing set (Xt, yt). X and Xt are 2D arrays of shape:

\[ \begin{align}\begin{aligned}\text{shape}(X) = ((1 - \text{test_ratio}) \times n_{samples},\ n_{features})\\\text{shape}(Xt) = (\text{test_ratio} \times n_{samples},\ n_{features})\end{aligned}\end{align} \]

where test_ratio is the value of test_size, each row represents one sample and each column represents a feature. y and yt are the matching arrays containing the target samples.

Examples

Let’s say we have no idea which columns compose the target. The best approach is then to run the function without passing any parameters:

>>> from watex.datasets.dload import load_hlogs
>>> b= load_hlogs()
>>> b.target_names
['aquifer_group', 'pumping_level', 'aquifer_thickness', 'hole_depth', 'pumping_depth', 'section_aperture', 'k', 'kp', 'r', 'rp', 'remark']
>>> # Let's say we are interested in the targets 'pumping_level' and
>>> # 'aquifer_thickness' and want `y` returned
>>> _, y = load_hlogs (as_frame=True, # return as frame X and y
                       tnames =['pumping_level','aquifer_thickness'],
                       )
>>> list(y.columns)
... ['pumping_level', 'aquifer_thickness']

watex.datasets.load_huayuan(*, samples=None, key=None, as_frame=False, return_data=False, tag=None, data_names=None, **kws)

Load AMT data from Huayuan locality.

The data is a batch of 50 SEG EDI files collected in Huayuan county, Hunan province, China. It is AMT data collected along the first survey line.

Parameters:

return_data (bool, default=False) – If True, returns the data as a 1D array-like instead of a Boxspace object. Note that the data is only a collection of EDI objects from watex.edi.Edi.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
samples (int, str, default=None) – The number of EDI files to collect. None or * retrieves all 50 EDI sites.
key (str, optional) – The kind of data to fetch. Can be ['preprocessed'|'raw']. None defaults to 'pre-processed'. The raw data contains missing tensors and interferences (weak frequencies or noisy data).
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
savepath (str) – Path where the extracted Huayuan data is temporarily stored. If not supplied, the default cache is used.
clear_cache (bool, default=False) – Clear the cache before storing the new EDI files. For instance, if all the EDI files are fetched on the first run, passing samples on a second run has no effect: all the EDI files are still retrieved. To get exactly the requested number of samples, set clear_cache to True.
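The interaction between samples and clear_cache described above can be pictured with a toy in-memory model. This is only a sketch of the documented behaviour: the function fetch_edis, the dict standing in for the cache directory and the placeholder values are all hypothetical, and the real loader extracts EDI files to disk.

```python
def fetch_edis(cache, samples=None, clear_cache=False, total=50):
    """Toy model: fill the cache once, then keep serving what it holds."""
    if clear_cache:
        cache.clear()
    if not cache:
        # None or '*' means all sites, as documented for `samples`.
        n = total if samples in (None, "*") else int(samples)
        cache.update({f"S{i:02d}": f"edi-{i:02d}" for i in range(n)})
    return list(cache.values())


cache = {}
first_run = fetch_edis(cache)                                # fills the cache
second_run = fetch_edis(cache, samples=17)                   # cache already full
third_run = fetch_edis(cache, samples=17, clear_cache=True)  # cache rebuilt
print(len(first_run), len(second_run), len(third_run))  # 50 50 17
```

The second call ignores samples because the cache is already populated, which is the situation clear_cache=True is meant to resolve.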

Returns:

data – Dictionary-like object, with the following attributes:

data: {ndarray, dataframe} of shape (50, 4): The data matrix. If as_frame=True, data will be a pandas DataFrame.
path: str: Path where the EDI data is stored. By default, it is stored in the cache ~/watex_data.
frame: DataFrame of shape (50, 4): Only present when as_frame=True. DataFrame with data and no target.
feature_names: list: The names of the dataset columns. None if as_frame=False.
emo: EM object.: Object from EM class.
filename: str: The path to the location of the data.
DESCR: str: The full description of the dataset.

Return type:

Boxspace

See also

load_edis: Another collection of EDI stored as a supplement inner dataset.

Examples

>>> from watex.datasets import load_huayuan
>>> box = load_huayuan ()
>>> len(box.data)
50
>>> len(load_huayuan ( return_data =True ))
50
>>> data_sample = load_huayuan (samples=17, return_data =True, clear_cache=True )
>>> len(data_sample)
17
>>> data_sample = load_huayuan (samples=17 , as_frame=True)
                edi   longitude   latitude   site   id
0  Edi( verbose=0 )  110.485833  26.051389  s.E00  S00
1  Edi( verbose=0 )  110.486483  26.052210  s.E01  S01
2  Edi( verbose=0 )  110.487134  26.053032  s.E02  S02

watex.datasets.load_iris(*, return_X_y=False, as_frame=False, tag=None, data_names=None, **kws)

Load and return the iris dataset (classification). The iris dataset is a classic and very easy multi-class classification dataset.

Parameters:

return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.2.
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.

Returns:

data (Boxspace) – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe} of shape (150, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.

target: {ndarray, Series} of shape (150,)
The classification target. If as_frame=True, target will be a pandas Series.

feature_names: list
The names of the dataset columns.

target_names: list
The names of target classes.

frame: DataFrame of shape (150, 5)
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.2.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data. Added in version 0.1.2.
(data, target) (tuple if return_X_y is True) – A tuple of two ndarrays. The first contains a 2D array of shape (n_samples, n_features), with each row representing one sample and each column representing a feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.

Notes

Changed in version 0.1.1: Fixed two wrong data points according to Fisher’s paper. The new version is the same as in R, but not as in the UCI Machine Learning Repository.

Examples

Let’s say you are interested in the samples 10, 25, and 50, and want to know their class name.

>>> from watex.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']

watex.datasets.load_mxs(*, return_X_y=False, as_frame=False, key=None, tag=None, samples=None, tnames=None, data_names=None, split_X_y=False, seed=None, shuffle=False, test_ratio=0.2, **kws)

Load the dataset after implementing the mixture learning strategy (MXS).

The dataset is composed of 11 merged boreholes with multiple targets and can be used for a classification problem.

Parameters:

return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
split_X_y (bool, default=False) – If True, the data is split into the training set (X, y) and the testing set (Xt, yt) based on the test_ratio value.
tnames (str, optional) – The name of the target to retrieve. If None, all the target columns are collected and compose a multioutput y. For a single classification or regression problem, it is recommended to indicate the name of the target needed for the learning task.
tag (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
data_names (None) – Has no effect; kept for API consistency. It allows fetching the same data using watex.datasets.fetch_data, since the latter already takes tag and data_names as parameters.
samples (int, optional) – Ratio or number of items from axis to fetch in the data. If samples is None, 0.5 is used.
key (str, default='data') –
Kind of MXS data to fetch. Can also be:
- "sparse": a compressed sparse row matrix format of the train set X.
- "scale": returns a scaled X using the standardization strategy.
- "num": numerical data exclusively; the 'strata' feature is excluded.
- "test": test data X and y.
- "train": train data X and y with preprocessing already performed.
- "raw": the original dataset X and y with no preprocessing.
- "data": default when key is not supplied. It returns the Boxspace object.
When key is not supplied, "data" is used instead and a Boxspace object is returned, where:
- target_map: is the mapping of MXS labels in the target y.
- nga_labels: is the y predicted for the Naive Group of Aquifer.
drop_observations (bool, default=False) – Drop the remark column in the logging data if set to True.
seed (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
shuffle (bool, default=False) – If True, the borehole data is shuffled before sampling.
test_ratio (float, default=0.2) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets. The default 0.2 keeps 20% of the data for testing.
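Since samples accepts either a ratio or a number of items, a short sketch may help to see how the two readings differ. The helper resolve_samples is hypothetical; treating a float in (0, 1] as a ratio and truncating the result are assumptions, not documented watex behaviour. The fallback of 0.5 and the star value come from the description and the example on this page.

```python
def resolve_samples(samples, n_items):
    """Return how many of n_items to keep (hypothetical helper)."""
    if samples is None:
        samples = 0.5  # documented fallback for load_mxs
    if samples == "*":
        return n_items  # the star selects every item
    if isinstance(samples, float) and 0.0 < samples <= 1.0:
        # Assumption: a float in (0, 1] is a ratio, truncated, at least 1 item.
        return max(1, int(samples * n_items))
    count = int(samples)
    if not 1 <= count <= n_items:
        raise ValueError(f"samples must be between 1 and {n_items}")
    return count


default_count = resolve_samples(None, 11)  # half of the 11 boreholes
all_items = resolve_samples("*", 11)
explicit = resolve_samples(3, 11)
```

Note that under these assumptions the integer 1 means one item while the float 1.0 means all items, which is why the type check comes before the count check.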

Returns:

data (Boxspace) – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.

target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.

feature_names: list
The names of the dataset columns.

target_names: list
The names of target classes.

target_map: dict,
is the mapping of MXS labels in the target y.

nga_labels: array-like 1D,
is the y predicted for Naive Group of Aquifer.

frame: DataFrame
Only present when as_frame=True. DataFrame with data and target.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data.
data, target (tuple if return_X_y is True) – A tuple of two ndarray. The first containing a 2D array of shape (n_samples, n_features) with each row representing one sample and each column representing the features. The second ndarray of shape (n_samples,) containing the target samples.
X, Xt, y, yt (tuple if split_X_y is True) – The training and test data (X, Xt) as 2D arrays, together with the training and test labels (y, yt). The number of samples in each set is based on test_ratio.

Examples

>>> from watex.datasets.dload import load_mxs
>>> load_mxs (return_X_y= True, key ='sparse', samples ='*')
(<1038x21 sparse matrix of type '<class 'numpy.float64'>'
        with 8298 stored elements in Compressed Sparse Row format>,
 array([1, 1, 1, ..., 5, 5, 5], dtype=int64))

watex.datasets.load_nlogs(*, return_X_y=False, as_frame=False, key=None, years=None, split_X_y=False, test_ratio=0.3, tag=None, tnames=None, data_names=None, samples=None, seed=None, shuffle=False, **kws)

Load the Nanshang Engineering and hydrogeological drilling dataset.

The dataset contains multiple targets and can be used for a classification or a regression problem.

Parameters:

return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bowlspace object. See below for more information about the data and target object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) according to the test size ratio.
test_ratio (float, default=0.3) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets. The default of 0.3 holds out 30% of the data for testing.
tnames (str, optional) – The name of the target to retrieve. If None, all target columns are collected and compose a multioutput y. For a single classification or regression problem, it is recommended to indicate the name of the target needed for the learning task. When collecting land subsidence data with key="ls", tnames and years are used interchangeably.
tag, data_names (None) – tag and data_names do nothing. They exist only for API consistency, so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
key (str, default='b0') – Kind of drilling data to fetch. Can also be one of [“ns”, “ls”]. The ns data refer mostly to engineering drilling whereas b0 refers to purely hydrogeological drilling. In the former case, the 'ground_height_distance' attribute used to control soil settlement is the target, while in the latter the targets are the water inflow, the drawdown and the static water level. The “ls” key is used for collecting the time-series land subsidence data from 2015-2018. It should be used in combination with the years parameter to collect the data of a specific year. The default land-subsidence data is 2022.
years (str, default="2022") –
The year of land subsidence. Note that land subsidence data are collected from 2015 to 2022. For instance, to select two years of subsidence data, separate the years with a space, like years ="2015 2022". The star * argument selects the data of all years.

Added in version 0.2.7: Years of Nanshan land subsidence data collected are added. Use key ls and years for retrieving the subsidence data of each year.
samples (int,optional) – Ratio or number of items from axis to fetch in the data. fetch all data if samples is None.
seed (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
shuffle (bool, default=False) – If True, the borehole data is shuffled before sampling.
drop_display_rate (bool, default=True) –
If True, the display rate column is dropped from the data. The display rate is used for image visualization, to increase the number of image pixels.

Returns:

data (Boxspace) – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.

target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.

feature_names: list
The names of the dataset columns.

target_names: list
The names of target classes.

frame: DataFrame
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.1.

DESCR: str
The full description of the dataset.

filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first contains a 2D array of shape (n_samples, n_features), with each row representing one sample and each column representing one feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – A tuple holding the training set (X, y) and the testing set (Xt, yt). X and Xt are 2D arrays whose shapes follow:

\[ \begin{align}\begin{aligned}\text{shape}(X, y) = (1- \text{test_ratio}) * (n_{samples}, n_{features}) *100\\\text{shape}(Xt, yt)= \text{test_ratio} * (n_{samples}, n_{features}) *100\end{aligned}\end{align} \]

where each row represents one sample and each column represents one feature. y and yt are ndarrays of shape (n_samples,) containing the target samples.
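
As a rough illustration of how test_ratio drives the size of each set, the sketch below computes the number of rows that land in the training and testing sets. The helper name split_sizes and its rounding rule are assumptions made for illustration; the rounding actually applied by load_nlogs may differ.

```python
def split_sizes(n_samples, test_ratio=0.3):
    """Return (n_train, n_test) for a given test ratio.

    Hypothetical helper: the rounding rule is an assumption of this
    sketch, not necessarily the one used by watex.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    # Rows held out for testing; the remainder is used for training.
    n_test = round(n_samples * test_ratio)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(100, test_ratio=0.3)
print(n_train, n_test)
```

Whatever the rounding, the two sizes always add up to the total number of samples.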

Examples

Let’s say we have no idea which columns compose the target. The best approach is then to run the function without passing any parameters and read the DESCR attribute to get the unit of each attribute:

>>> from watex.datasets.dload import load_nlogs
>>> b= load_nlogs()
>>> b.target_names
['static_water_level', 'drawdown', 'water_inflow', 'unit_water_inflow', 'water_inflow_in_m3_d']
>>> b.DESCR
... (...)
>>> # Let's say we are interested in the targets 'drawdown' and
>>> # 'static_water_level' and want `y` returned
>>> _, y = load_nlogs (as_frame=True, # return as frame X and y
...                    tnames =['drawdown','static_water_level'], )
>>> list(y.columns)
['drawdown', 'static_water_level']
>>> y.head(2)
   drawdown  static_water_level
0     70.03                4.21
1      7.38                3.60
>>> # Let's say we want the subsidence data of 2015 and 2018 with the
>>> # display resolution rate. Because the display rate is dropped by default,
>>> # we must set drop_display_rate to False to keep it in the data.
>>> n= load_nlogs (key ='ls', samples = 3 , years = "2015 2018 disp",
...                drop_display_rate =False )
>>> n.frame
        easting      northing   longitude  ...      2015       2018  disp_rate
0  2.531191e+06  1.973515e+07  113.291328  ... -0.494959 -27.531837  -7.352538
1  2.531536e+06  1.973519e+07  113.291847  ... -1.104473 -21.852705  -7.999145
2  2.531479e+06  1.973520e+07  113.291847  ... -1.139404 -22.022655  -7.894940
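
The years argument is a space-separated string, with * standing for every year. The sketch below shows one way such a string could be interpreted. The helper parse_years, the AVAILABLE_YEARS constant (taken from the 2015 to 2022 range stated in the parameter description) and the choice to ignore tokens that are not years are all assumptions of this illustration, not the parsing performed by watex.

```python
# Years assumed available, following the 2015 to 2022 range stated above.
AVAILABLE_YEARS = tuple(str(year) for year in range(2015, 2023))


def parse_years(years="2022"):
    """Split a space-separated year string; '*' selects every year.

    Hypothetical helper. Tokens that are not known years (for example
    'disp') are ignored here, which is an assumption of this sketch.
    """
    if years.strip() == "*":
        return list(AVAILABLE_YEARS)
    return [token for token in years.split() if token in AVAILABLE_YEARS]


print(parse_years("2015 2018 disp"))
print(len(parse_years("*")))
```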

watex.datasets.load_semien(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)[source]#

DC vertical electrical resistivity data collected from SEMIEN during the National Drinking Water Supply Program (PNAEP), which took place in 2014 in Cote d’Ivoire. An illustration of the data arrangement is the following:

AB/2	MN/2	SE1	SE2	SE…
1	0.4	107	93	75
2	0.4	97	91	49
…	…	…	…	…
100	10	79	96	98
110	10	84	104	104

Parameters:

as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (33, 5)
  The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (33,)
  The resistivity of the sounding point.
- MN: {array-like} of shape (33,)
  The step value of potential electrodes increasing in meters
- AB: {array-like} of shape (33,)
  The step value of current electrodes increasing in meters
- feature_names: list
  The names of the dataset columns. Added in version 0.23.
- DESCR: str
  The full description of the dataset.
- filename: str
  The path to the location of the data. Added in version 0.20.
Added in version 0.1.2.
index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when several soundings are collected in the same survey area. index_rhoa=0 fetches the first sounding values in the collection of all values.
tag (None,) – Always None for API consistency
data_names (None,) – Always None for API consistency
kws (dict,) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

- data: {ndarray, dataframe}
  The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. The total number of soundings performed is 3, with the prefix ‘SE’. AB and MN are in meters and the SE columns are apparent resistivity values in ohm.meters. Use the index_rhoa parameter to select a sounding by its rank. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).
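
To make the role of index_rhoa concrete, the sketch below selects one sounding column from a few rows copied from the illustration table above. The helper select_sounding is hypothetical and only mimics the indexing described in the notes; it is not the code used by load_semien.

```python
# Rows copied from the illustration table: AB/2, MN/2, then one
# apparent resistivity value per sounding (prefix 'SE').
rows = [
    (1, 0.4, 107, 93, 75),
    (2, 0.4, 97, 91, 49),
    (100, 10, 79, 96, 98),
    (110, 10, 84, 104, 104),
]


def select_sounding(rows, index_rhoa=0):
    """Return (AB/2, MN/2, resistivity) triples for one sounding.

    Hypothetical helper: index_rhoa=0 picks the first sounding column.
    """
    n_soundings = len(rows[0]) - 2  # the first two columns are AB/2 and MN/2
    if not 0 <= index_rhoa < n_soundings:
        raise IndexError(f"index_rhoa must be in [0, {n_soundings - 1}]")
    return [(row[0], row[1], row[2 + index_rhoa]) for row in rows]


first = select_sounding(rows, index_rhoa=0)
print([rho for _, _, rho in first])
```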

watex.datasets.load_tankesse(*, as_frame=True, tag=None, data_names=None, **kws)[source]#

DC electrical resistivity profiling data collected from TANKESSE during the National Drinking Water Supply Program (PNAEP), which took place in 2014 in Cote d’Ivoire. An example of the data arrangement is the following:

pk	east	north	rho
0	382741	896203	79
10	382743	896193	62
20	382747	896184	51
…	…	…	…
980	382705	894887	55
990	382704	895879	58

Parameters:

as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (100, 4)
  The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (100,)
  The resistivity of the sounding point.
- station: {array-like} of shape (100,)
  The distance of each station along the profile, increasing in meters. It can be considered as the station point for data collection.
- northing: {array-like} of shape (100,)
  The northing coordinates in UTM in meters at each station where the data is collected.
- easting: {array-like} of shape (100,)
  The easting coordinates in UTM are in meters at each station where the data is collected.
- latitude: {array-like} of shape (100,)
  The latitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.
- longitude: {array-like} of shape (100,)
  The longitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.
- DESCR: str
  The full description of the dataset.
- filename: str
  The path to the location of the data.
tag (None,) – Always None for API consistency
data_names (None,) – Always None for API consistency
kws (dict,) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

- data: {ndarray, dataframe}
  The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 100.0 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed to 100.0 meters whereas \(MN/2\) (potential electrodes) is fixed to 10.0 meters. The total number of stations collected is 99. The station, easting, and northing columns are in meters and the rho column holds apparent resistivity values in ohm.meters. Furthermore, if the UTM coordinates (easting and northing) are given together with the UTM_zone, the latitude and longitude data are auto-computed, and vice versa. The user does not need to provide both coordinate types (UTM or DD:MM.SS).

watex.datasets.make_erp(*, n_stations=42, max_rho=1000.0, min_rho=1.0, step=20.0, reflong='110:29:09.00', reflat='26:03:05.00', utm_zone='29N', order='+', full_coordinates=True, raise_warning=False, as_frame=False, seed=None, is_utm=False, epsg=None, **coord_kws)[source]#

Generate Electrical Resistivity Profiling (ERP) data from stations and coordinates points.

To generate samples for a specific area, it is better to provide the latitude and longitude of a single station of this area through the reflat and reflong parameters respectively. Also specify the utm_zone for converting the lat/lon coordinates to UTM if necessary. If the conversion is not needed, set full_coordinates to False.

Parameters:

n_stations (int, default=42) – number of measurements stations
max_rho (float, default=1e3) – maximum resistivity value on the survey area in \(\Omega.m\)
min_rho (float, default=1e0) – minimum resistivity value on the survey area in \(\Omega.m\)
reflong (float or string or list of [start, stop], default='110:29:09.00') – Reference longitude in degree decimal or in DD:MM:SS for the first station considered as the origin of the landmark.
reflat (float or string or list of [start, stop], default='26:03:05.00') – Reference latitude in degree decimal or in DD:MM:SS for the reference site considered as the landmark origin. If value is given in a list, it can contain the start point and the stop point.
step (float or str, default=20) – Offset, i.e. the distance of separation between sites, in meters. If the value is given as a string, it is read as meters unless it carries the km unit. Only meters and kilometers are accepted.
order (str, default='+') – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with angle r set to 45 degrees. Use - for descending order. Any other value is treated as ascending order.
utm_zone (string (##N or ##S), default='29N') – UTM zone in the form of a number followed by the North or South hemisphere, e.g. 10S or 03N. Must be given if utm2deg is set to True.
full_coordinates (bool, default=True) – Convert latitude and longitude to approximate UTM values. Easting and northing are obtained using the reference ellipsoid =23 with WGS84. If False, easting and northing are not computed and are set to null.
raise_warning (bool, default=False) – Raise warnings if GDAL is not set or about the accuracy of the coordinates.
as_frame (bool, default=False) – If True, outputs the data as a pandas DataFrame, a Boxspace object otherwise.
seed (int, optional) – Seed for reproducibility. If a value is passed, the same data are reproduced at the same sample points.
is_utm (bool, default=False) –
Type of coordinates passed to the reflat and reflong parameters for generating longitude-latitude coordinates. If is_utm is explicitly set to True, the reflong and reflat values are taken to be UTM coordinates and are converted to longitude-latitude. However, if is_utm is False while the reflat and reflong values are greater than 90 and 180 degrees respectively, an error is raised.

Added in version 0.2.1.
epsg (int, str, Optional) – EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.
coord_kws (dict,) – Additional keywords passed to makeCoords().
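
The reflong and reflat parameters accept either decimal degrees or a DD:MM:SS string. As an illustration of what that notation means, the sketch below converts such a string to decimal degrees. The helper dms_to_decimal is hypothetical and is not part of watex; it only applies the usual degrees + minutes/60 + seconds/3600 rule.

```python
def dms_to_decimal(value):
    """Convert a 'DD:MM:SS' string (or a plain number) to decimal degrees.

    Hypothetical helper written for this illustration only.
    """
    if isinstance(value, (int, float)):
        return float(value)
    parts = value.strip().split(":")
    if len(parts) != 3:
        raise ValueError("expected a value formatted as DD:MM:SS")
    # A leading minus sign on the degrees applies to the whole angle.
    sign = -1.0 if parts[0].startswith("-") else 1.0
    degrees, minutes, seconds = (abs(float(part)) for part in parts)
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


# Default reference coordinates from the signature of make_erp.
print(round(dms_to_decimal("110:29:09.00"), 6))
print(round(dms_to_decimal("26:03:05.00"), 6))
```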

Return type:

pd.DataFrame | Boxspace

Examples

>>> from watex.datasets.gdata import make_erp
>>> erp_data = make_erp (n_stations =50 , step =30  , as_frame =True)
>>> erp_data.head(3)
   station  longitude  latitude        easting    northing  resistivity
0        0 -13.488511  0.000997  668210.580864  110.183287   225.265306
1       30 -13.488511  0.000997  668210.581109  110.183482   327.204082
2       60 -13.488510  0.000997  668210.581355  110.183676   204.877551
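
The station column in the output above grows by step at each site. The sketch below reproduces only that idea: it reads a step given in meters or kilometers, lays out station offsets, and draws one resistivity per station between min_rho and max_rho. The helpers parse_step and sketch_erp are hypothetical, and the uniform random draw is a stand-in chosen for this illustration, not the generator used by make_erp.

```python
import random


def parse_step(step):
    """Return the station spacing in meters from a number or a string
    such as '20', '20m' or '2km' (hypothetical helper)."""
    if isinstance(step, (int, float)):
        return float(step)
    text = step.strip().lower()
    if text.endswith("km"):  # check 'km' before 'm'
        return float(text[:-2]) * 1000.0
    if text.endswith("m"):
        return float(text[:-1])
    return float(text)  # a bare number is read as meters


def sketch_erp(n_stations=42, step=20.0, min_rho=1.0, max_rho=1000.0, seed=None):
    """Return a list of (station offset in meters, resistivity) pairs.

    The uniform draw is an assumption of this sketch.
    """
    rng = random.Random(seed)  # the same seed reproduces the same values
    spacing = parse_step(step)
    return [(i * spacing, rng.uniform(min_rho, max_rho)) for i in range(n_stations)]


profile = sketch_erp(n_stations=50, step=30, seed=42)
print([station for station, _ in profile[:3]])
```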

watex.datasets.make_ves(*, samples=31, min_rho=10.0, max_rho=1000.0, max_depth=100.0, order='-', as_frame=False, seed=None, iorder=3, xy=None, is_utm=False, add_xy=False, utm_zone=None, epsg=None)[source]#

Generate Vertical Electrical Sounding (VES) data from pseudo-depth measurements.

For a large number of pseudo-depth measurements, increase the number of samples. By default, samples=31 depth measurements are assumed to be collected.

Parameters:

samples (int, default=31) – Number of depth measurements (AB/2, in meters).
max_rho (float, default=1e3) – Maximum resistivity value expected at depth on the survey area in \(\Omega.m\)
min_rho (float, default=1e1) – Minimum resistivity value expected at depth on the survey area in \(\Omega.m\)
order (str , default='-') – Direction of the projection line. By default the projected line is in ascending order i.e. from SW to NE with angle r set to 45 degrees. Could be - for descending order. Any other value should be in ascending order.
max_depth (float, default=100) – Maximum depth expected to be reached by AB/2, in meters.
as_frame (bool, default=False) – If True, outputs the data as a pandas DataFrame, a Boxspace object otherwise.
seed (int, optional) – Seed for reproducibility. If a value is passed, the same data are reproduced at the same sample points.
iorder (int, default=3) – Inflexion order. It is a positive value greater than 0. If None, it is computed using the length of the extrema (local + global). It should be kept as low as possible to make the fitted VES curve more realistic.
xy (tuple, optional) –
Coordinates of the point (easting, northing) or (lon, lat) corresponding to the VES point sves. If no coordinates are given, they are randomly generated as (lon, lat) and stored in the attribute xy. To return the auto-generated xy coordinates when as_frame=True, set add_xy to True.

Added in version 0.2.1.
is_utm (bool, default=False) – In principle, xy is expected to be in longitude-latitude coordinates. However, if the coordinates are passed as UTM easting-northing, the user can specify the utm_zone to convert the xy values into valid longitude and latitude coordinates.
add_xy (bool, default=False) – Add xy coordinates to the VES dataframe.
utm_zone (str, Optional) – To generate coordinates xy from a specific zone, utm_zone can be specified, otherwise 29N is used instead.
epsg (int, str, Optional) – EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.

Return type:

pd.DataFrame | Boxspace

Notes

When the Boxspace object is returned, each column of the ‘VES’ data can be retrieved as an attribute. Check the examples below.
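
The attribute-style access described in this note can be pictured with a small dictionary subclass. The Box class below is a stand-in written for this illustration and is not the Boxspace implementation shipped with watex; the values are the first rows shown in the example output on this page.

```python
class Box(dict):
    """Minimal dictionary with attribute access.

    Stand-in for illustration only, not watex's Boxspace.
    """

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so the regular
        # dict methods keep working as usual.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


ves = Box(
    AB=[1.0, 2.0, 3.0],
    MN=[0.6, 0.6, 0.6],
    resistivity=[429.873, 434.255, 438.5707],
)
print(ves.resistivity[:2])
print(sorted(ves.keys()))
```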

Examples

>>> from watex.datasets.gdata import make_ves
>>> b = make_ves (samples =50 , order ='+') # 50 depth measurements
>>> b.resistivity [:7]
array([429.873 , 434.255 , 438.5707, 442.8203, 447.0042, 451.1228,
       457.5775])
>>> b.frame.head(3)
    AB   MN  resistivity
0  1.0  0.6   429.872999
1  2.0  0.6   434.255018
2  3.0  0.6   438.570675
>>> ves_data = make_ves (samples =50 , min_rho =10, max_rho =1e5 ,
...                      as_frame =True, add_xy= True ,
...                      xy = ( 3143965.855 , 336704.455) ,
...                      is_utm = True , utm_zone = '49N', epsg =None)
>>> ves_data.head(2)
    AB   MN   resistivity   longitude   latitude
0  1.0  0.6  51544.426685  107.901553 -61.802165
1  2.0  0.6  51420.739513  107.901553 -61.802165
