watex.datasets package
The datasets subpackage fetches data from the local machine. If the data does not exist or has been deleted, a remote search (repository or Zenodo record) is triggered via the rload module.
- class watex.datasets.DATASET
Bases: object
- DOC = "WATex dataset is composed of different kind of data for software implementation. \n - ERP data found in 'gbalo', 'boundiali' localities in northern part of \n Cote d'Ivoire <'https://en.wikipedia.org/wiki/Ivory_Coast'>'\n - VES data collected in 'gbalo', 'semien', 'tankesse' in center and \n eastearn part of Cote d'Ivoire'.\n - FLOW RATE FEATURES data computed from Bagoue ERP and VES data. \n Refer to paper :doi:`https://doi.org/10.1029/2021wr031623`. \n - COMMON MACHINE LEARNING popular data sets such IRIS. \n - EDI datasets from huayaun county in China, Hunan Province. \n - HLOGS and MXS data for Hongliu coal mine raw and preprocessed borehole data. \n"
- HOW = "Fetch data using 'load_<type_of_data|area_name>'like: 1- BAGOUE : load_bagoue() 2- GBALO : load_gbalo() 3- IRIS : load_iris() 4- SEMIEN : load_semien() 5- TANKESSE: load_tankesse() 6- BOUNDIALI: load_boundiali() 7- HLOGS : load_hlogs() 8- NLOGS : load_nlogs() 9- MXS : load_mxs() 10- HUAYUAN: load_huayuan() 11- EDIS : load_edis() or using ufunc 'fetch_data (<type_of_data|area_name>)'."
- KIND = ('bagoue', 'gbalo', 'iris', 'semien', 'tankesse', 'boundiali', 'hlogs', 'nlogs', 'mxs', 'huayuan', 'edis')
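The naming rule stated in HOW ('load_<type_of_data|area_name>') can be illustrated with a small sketch. The helper `loader_name` is hypothetical and is not part of watex; it only builds the function name as a string from the KIND tuple listed above.

```python
# KIND is copied from DATASET.KIND; the 'load_<kind>' pattern comes from DATASET.HOW.
KIND = ('bagoue', 'gbalo', 'iris', 'semien', 'tankesse', 'boundiali',
        'hlogs', 'nlogs', 'mxs', 'huayuan', 'edis')


def loader_name(kind):
    """Return the 'load_<kind>' name for a dataset kind (hypothetical helper)."""
    key = kind.strip().lower()
    if key not in KIND:
        raise ValueError(f"unknown dataset kind: {kind!r}")
    return f"load_{key}"


print([loader_name(k) for k in KIND[:3]])
# ['load_bagoue', 'load_gbalo', 'load_iris']
```

With watex installed, the resulting name could presumably be resolved with `getattr(watex.datasets, name)`; that step is left out here so the sketch stays self-contained.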
- watex.datasets.fetch_data(tag, **kws)
Fetch a dataset from a tag.
A tag corresponds to the name of the data collection area or to a stage of data processing.
- Parameters:
tag (str, ['bagoue', 'tankesse', 'semien', 'iris', 'boundiali', 'gbalo']) –
Name of the area of data to fetch. For instance, setting the tag to bagoue loads the Bagoue dataset. If the tag name is followed by a suffix, the latter specifies the stage of data processing. For example, bagoue original and bagoue prepared retrieve the original data and the data transformed with the default transformers, respectively. The available suffixes are:
- ['original'] => original or raw data; returns a dict of contextual details. Combine it with the get method to obtain the dataframe:
>>> fetch_data('bagoue original').get('data=df')
- ['stratified'] => stratified data
- ['mid'|'semi'|'preprocess'|'fit'] => cleaned data with experimental attribute combinations
- ['pipe'] => default pipeline created during data preparation
- ['analyses'|'pca'|'reduce dimension'] => data whose text attributes are encoded with the ordinal encoder only, plus attribute combinations
- ['test'] => stratified test set
- Returns:
dict, X, y –
- If the tag is followed by a suffix in the case of the 'bagoue' area, it returns:
- data: original data
- X, y: stratified training set and training target
- X0, y0: data cleaned after dropping useless features and, if True, combining numerical attributes
- X_prepared, y_prepared: data prepared after applying all the transformations via the transformer (pipeline)
- XT, yT: stratified test set and test labels
- _X: stratified training set for data analysis, so it contains no sparse matrix. The text (categorical) attributes are converted using the ordinal encoder.
- _pipeline: the default pipeline
- Return type:
frame of Boxspace object
Examples
>>> from watex.datasets import fetch_data
>>> b = fetch_data('bagoue')  # no suffix returns 'Boxspace' object
>>> b.tnames
... array(['flow'], dtype='<U4')
>>> b.feature_names
... ['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type', 'sfi', 'ohmS', 'lwi', 'geol']
>>> X, y = fetch_data('bagoue prepared')
>>> X  # is transformed and ready for prediction
>>> X[0]
... <1x18 sparse matrix of type '<class 'numpy.float64'>' with 8 stored elements in Compressed Sparse Row format>
>>> y
... array([2, 1, 2, 2, 1, 0, ... , 3, 2, 3, 3, 2], dtype=int64)
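To make the "area name plus optional suffix" convention of the tag concrete, here is a minimal sketch of how such a tag could be split. `split_tag` is a hypothetical helper written for illustration; how fetch_data actually parses its tag is not shown in this page.

```python
def split_tag(tag):
    """Split a tag such as 'bagoue prepared' into (area, suffix).

    The suffix is None when the tag holds only an area name.
    Hypothetical helper, not the watex implementation.
    """
    area, _, suffix = tag.strip().partition(" ")
    suffix = suffix.strip().lower()
    return area.lower(), (suffix or None)


print(split_tag("bagoue"))                   # ('bagoue', None)
print(split_tag("bagoue prepared"))          # ('bagoue', 'prepared')
print(split_tag("bagoue reduce dimension"))  # ('bagoue', 'reduce dimension')
```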
- watex.datasets.load_bagoue(*, return_X_y=False, as_frame=False, split_X_y=False, test_size=0.3, tag=None, data_names=None, **kws)
Load the Bagoue dataset.
The Bagoue dataset is a classic multi-class classification dataset. Refer to the description for more details.
- Parameters:
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.1.
split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) according to the test size ratio.
test_size (float, default=0.3, i.e. 30% of (X, y)) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
- Returns:
data (Boxspace) – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe} of shape (150, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- target: {ndarray, Series} of shape (150,)
The classification target. If as_frame=True, target will be a pandas Series.
- feature_names: list
The names of the dataset columns.
- target_names: list
The names of target classes.
- frame: DataFrame of shape (150, 5)
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.2.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – A tuple of four arrays. X and Xt are 2D arrays of shape:
\[ \begin{aligned}\text{shape}(X) &= \big((1 - \text{test\_size}) \cdot n_{samples},\ n_{features}\big)\\ \text{shape}(Xt) &= \big(\text{test\_size} \cdot n_{samples},\ n_{features}\big)\end{aligned} \]
where each row represents one sample and each column one feature. y and yt are the matching 1D arrays of target samples.
Examples
Let's say you are interested in the samples 10, 25, and 50, and want to know their class name:
>>> from watex.datasets import load_bagoue
>>> d = load_bagoue()
>>> d.target[[10, 25, 50]]
array([0, 2, 0])
>>> list(d.target_names)
['flow']
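The split_X_y return value described above divides the rows between a training and a testing part according to test_size. The sketch below works out the row counts for a given ratio. It assumes the test count is rounded up, as scikit-learn's train_test_split does; whether load_bagoue rounds the same way is an assumption, and `split_counts` is a hypothetical helper.

```python
import math


def split_counts(n_samples, test_size=0.3):
    """Return (n_train, n_test) for a ratio split.

    Assumption: the test count is rounded up and the training set
    takes the remaining rows.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_counts(100, test_size=0.25)
print(n_train, n_test)  # 75 25
```

Under this assumption X has n_train rows and Xt has n_test rows, while both keep all n_features columns.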
- watex.datasets.load_boundiali(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)
DC vertical electrical resistivity data collected in Boundiali during the National Drinking Water Supply Program (PNAEP), carried out in 2014 in Cote d'Ivoire. The data is arranged as follows:
AB/2   MN/2   SE1   SE2   SE…
1      0.4    107   93    75
2      0.4    97    91    49
…      …      …     …     …
100    10     79    96    98
110    10     84    104   104
- Parameters:
as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (33, 6)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (33,)
The resistivity of the sounding point.
- MN: {array-like} of shape (33,)
The step value of the potential electrodes, increasing in meters.
- AB: {array-like} of shape (33,)
The step value of the current electrodes, increasing in meters.
- feature_names: list
The names of the dataset columns. Added in version 0.23.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.20.
Added in version 0.1.2.
index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when several soundings are collected in the same survey area. index_rhoa=0 fetches the first sounding values in the collection.
tag (None) – Always None, for API consistency.
data_names (None) – Always None, for API consistency.
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.
- Returns:
data – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- Return type:
Boxspace
Notes
The array configuration is Schlumberger and the maximum investigation depth is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters, whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. Four soundings were performed, identified by the prefix 'SE'. AB and MN are in meters, and the SE columns are apparent resistivity values in ohm.meters. Use the index_rhoa parameter to select a sounding by its rank. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).
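The role of index_rhoa can be pictured on the rows shown in the data arrangement illustration. The sketch below keeps AB/2, MN/2 and the sounding column ranked index_rhoa. It uses plain tuples and a hypothetical `select_rhoa` helper; it is not how watex reads the file.

```python
# Rows copied from the illustration: (AB/2, MN/2, SE1, SE2).
ROWS = [
    (1, 0.4, 107, 93),
    (2, 0.4, 97, 91),
    (100, 10, 79, 96),
    (110, 10, 84, 104),
]


def select_rhoa(rows, index_rhoa=0):
    """Keep AB/2, MN/2 and the sounding column ranked `index_rhoa`.

    Hypothetical helper: index_rhoa=0 selects SE1, 1 selects SE2, and so on.
    """
    col = 2 + index_rhoa  # the first two columns are AB/2 and MN/2
    if not 2 <= col < len(rows[0]):
        raise IndexError(f"no sounding column at index_rhoa={index_rhoa}")
    return [(row[0], row[1], row[col]) for row in rows]


print(select_rhoa(ROWS, index_rhoa=0)[0])  # (1, 0.4, 107)
print(select_rhoa(ROWS, index_rhoa=1)[0])  # (1, 0.4, 93)
```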
- watex.datasets.load_edis(*, return_data=False, as_frame=False, key=None, tag=None, samples=None, data_names=None, **kws)
Load SEG Electrical Data Interchange (EDI) objects.
The EDI data is a sample of data collected in Huayuan county, China.
- Parameters:
return_data (bool, default=False) – If True, returns data as a 1D array-like instead of a Boxspace object. Note that the data is only a collection of EDI objects from watex.edi.Edi.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
samples (int, default=None) – The number of data samples to retrieve.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
key (str, {'site', 'edi', 'latitude', '*', 'longitude'}, default='edi') – Kind of EDI data to fetch. If the star is given, all the data is fetched in a single frame.
- Returns:
data – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe} of shape (50, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- feature_names: list
The names of the dataset columns.
- frame: DataFrame of shape (50, 4)
Only present when as_frame=True. DataFrame with data and no target. Added in version 0.1.2.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.1.2.
- Return type:
Boxspace
Examples
>>> from watex.datasets.dload import load_edis
>>> load_edis().frame[:3]
                edi
0  Edi( verbose=0 )
1  Edi( verbose=0 )
2  Edi( verbose=0 )
>>> load_edis(as_frame=True, key='longitude latitude', samples=7)
    latitude   longitude
0  26.051390  110.485833
1  26.051794  110.486153
2  26.052198  110.486473
3  26.052602  110.486793
4  26.053006  110.487113
5  26.053410  110.487433
6  26.053815  110.487753
- watex.datasets.load_gbalo(*, kind='erp', as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)
Load Gbalo ERP and VES datasets.
The dataset is composed of DC electrical resistivity profiling (ERP) and vertical electrical sounding (VES) data collected in the Gbalo locality. To change the kind of dataset, pass erp or ves to the kind parameter. The data was collected during the National Drinking Water Supply Program (PNAEP), carried out in 2012-2014 in Cote d'Ivoire. Refer to erp_doc and ves_doc for an illustration of the data arrangement.
- Parameters:
kind (str, ['ves'|'erp'], default='erp') – The kind of DC data to retrieve. If kind is set to ves, VES data is fetched; otherwise ERP data is fetched.
as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (33, 6) for VES and (45, 4) for ERP
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (33,) for VES and (45,) for ERP
The resistivity of the sounding point.
- station: {array-like} of shape (33,) for VES and (45,) for ERP
The distance of each station along the profile, increasing in meters. It can be considered as the station point for data collection.
- northing: {array-like} of shape (33,) for VES and (45,) for ERP
The northing coordinates in UTM, in meters, at each station where the data is collected.
- easting: {array-like} of shape (33,) for VES and (45,) for ERP
The easting coordinates in UTM, in meters, at each station where the data is collected.
- latitude: {array-like} of shape (33,) for VES and (45,) for ERP
The latitude coordinates in decimal degrees or 'DD:MM.SS' at each station where the data is collected.
- longitude: {array-like} of shape (33,) for VES and (45,) for ERP
The longitude coordinates in decimal degrees or 'DD:MM.SS' at each station where the data is collected.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data.
tag (None) – Always None, for API consistency.
data_names (None) – Always None, for API consistency.
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.
- Returns:
data – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- Return type:
Boxspace
Example
>>> from watex.datasets import load_gbalo
>>> b = load_gbalo(as_frame=False, kind='erp')
>>> b.station  # retrieve the station positions
... array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
           110., 120., 130., 140., 150., 160., 170., 180., 190., 200., 210.,
           220., 230., 240., 250., 260., 270., 280., 290., 300., 310., 320.,
           330., 340., 350., 360., 370., 380., 390., 400., 410., 420., 430.,
           440.])
Notes
The array configuration is Schlumberger and the maximum investigation depth is 100 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed to 100 meters, whereas \(MN/2\) (potential electrodes) is fixed to 10 meters. A total of 45 stations were collected, while the number of sounding points is estimated at 33. station, easting and northing are in meters, and the rho columns are apparent resistivity values in ohm.meters. Furthermore, the total number of soundings performed with the prefix 'SE' is 4.
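As a quick consistency check between the notes and the example output, the 45 ERP stations can be regenerated from the spacing visible in b.station. The 10-meter spacing is read from that output (0., 10., 20., ...); it is not a documented parameter of load_gbalo.

```python
n_stations = 45  # total number of stations stated in the notes
spacing = 10.0   # assumption: spacing read from the example output above

stations = [i * spacing for i in range(n_stations)]

print(stations[:3])   # [0.0, 10.0, 20.0]
print(stations[-1])   # 440.0
print(len(stations))  # 45
```

The last position, 440 meters, matches the final value of the array printed in the example.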
- watex.datasets.load_hlogs(*, return_X_y=False, as_frame=False, key=None, split_X_y=False, test_size=0.3, tag=None, tnames=None, data_names=None, **kws)
Load the hydro-logging dataset.
The dataset contains multiple targets and can be used for a classification or regression problem.
- Parameters:
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.3.
split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) according to the test size ratio.
test_size (float, default=0.3, i.e. 30% of (X, y)) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets.
tnames (str, optional) – The name of the target to retrieve. If None, all target columns are collected and compose a multioutput y. For a single classification or regression problem, it is recommended to indicate the name of the target needed for the learning task.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
key (str, default='h502') –
Kind of logging data to fetch. Can also be the borehole ["h2601", "*"]. If key='*', all the data is aggregated in a single borehole frame.
Added in version 0.2.3: Added data for 8 new boreholes from logging, strata, layer thicknesses and rock_names.
drop_observations (bool, default=False) – Drop the remark column in the logging data if set to True. Added in version 0.1.5.
- Returns:
data (Boxspace) – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.
- feature_names: list
The names of the dataset columns.
- target_names: list
The names of target classes.
- frame: DataFrame
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.1.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – A tuple of four arrays. X and Xt are 2D arrays of shape:
\[ \begin{aligned}\text{shape}(X) &= \big((1 - \text{test\_size}) \cdot n_{samples},\ n_{features}\big)\\ \text{shape}(Xt) &= \big(\text{test\_size} \cdot n_{samples},\ n_{features}\big)\end{aligned} \]
where each row represents one sample and each column one feature. y and yt are the matching 1D arrays of target samples.
Examples
Suppose we have no idea which columns compose the target. The best approach is then to run the function without passing any parameters:
>>> from watex.datasets.dload import load_hlogs
>>> b = load_hlogs()
>>> b.target_names
['aquifer_group', 'pumping_level', 'aquifer_thickness', 'hole_depth', 'pumping_depth', 'section_aperture', 'k', 'kp', 'r', 'rp', 'remark']
>>> # Let's say we are interested in the targets 'pumping_level' and
>>> # 'aquifer_thickness' and want `y` returned
>>> _, y = load_hlogs(as_frame=True, tnames=['pumping_level', 'aquifer_thickness'])  # return X and y as frames
>>> list(y.columns)
... ['pumping_level', 'aquifer_thickness']
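The effect of tnames on a multi-target table can be sketched without pandas. In the snippet below, `split_targets` is a hypothetical helper and the record values and the 'feature_a' column are made up for illustration; only the target column names come from the example above.

```python
def split_targets(records, target_names, tnames=None):
    """Split rows into features X and selected targets y.

    Hypothetical helper: every column listed in `target_names` is removed
    from X; y keeps only the columns in `tnames` (all targets if None).
    """
    selected = list(target_names) if tnames is None else list(tnames)
    missing = [t for t in selected if t not in target_names]
    if missing:
        raise ValueError(f"unknown target(s): {missing}")
    X = [{k: v for k, v in r.items() if k not in target_names} for r in records]
    y = [{k: r[k] for k in selected} for r in records]
    return X, y


# Made-up values; 'feature_a' is a placeholder feature column.
TARGETS = ['pumping_level', 'aquifer_thickness', 'hole_depth']
records = [
    {'feature_a': 1.0, 'pumping_level': 10.0, 'aquifer_thickness': 2.0, 'hole_depth': 100.0},
    {'feature_a': 2.0, 'pumping_level': 12.0, 'aquifer_thickness': 3.0, 'hole_depth': 120.0},
]

X, y = split_targets(records, TARGETS, tnames=['pumping_level', 'aquifer_thickness'])
print(list(y[0]))  # ['pumping_level', 'aquifer_thickness']
print(X[0])        # {'feature_a': 1.0}
```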
- watex.datasets.load_huayuan(*, samples=None, key=None, as_frame=False, return_data=False, tag=None, data_names=None, **kws)
Load AMT data from the Huayuan locality.
The data is a batch of 50 SEG EDI files collected in Huayuan county, Hunan province, China. It is AMT data collected along the first line.
- Parameters:
return_data (bool, default=False) – If True, returns data as a 1D array-like instead of a Boxspace object. Note that the data is only a collection of EDI objects from watex.edi.Edi.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
samples (int, str, default=None) – The number of EDI files to collect. None or * retrieves all 50 EDI sites.
key (str, optional) – The kind of data to fetch. Can be ['preprocessed'|'raw']. A None value defaults to 'pre-processed'. The raw data contains missing tensors and weak-frequency or noise-affected data.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
savepath (str) – Path where the extracted Huayuan data is temporarily stored. If not supplied, the default cache is used.
clear_cache (bool, default=False) – Clear the cache before storing the new EDI files. For instance, if all EDI files are fetched on the first run, passing samples on a second run has no effect: all the EDI files are still retrieved. To get exactly the requested number of samples, set clear_cache to True.
- Returns:
data – Dictionary-like object, with the following attributes:
- data: {ndarray, dataframe} of shape (50, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- path: str
Path where the EDI data is stored. By default, it is stored in the cache ~/watex_data.
- frame: DataFrame of shape (50, 4)
Only present when as_frame=True. DataFrame with data and no target.
- feature_names: list
The names of the dataset columns. None if as_frame=False.
- emo: EM object
Object from the EM class.
- filename: str
The path to the location of the data.
- DESCR: str
The full description of the dataset.
- Return type:
Boxspace
Examples
>>> from watex.datasets import load_huayuan
>>> box = load_huayuan()
>>> len(box.data)
50
>>> len(load_huayuan(return_data=True))
50
>>> data_sample = load_huayuan(samples=17, return_data=True, clear_cache=True)
>>> len(data_sample)
17
>>> data_sample = load_huayuan(samples=17, as_frame=True)
                edi   longitude   latitude   site   id
0  Edi( verbose=0 )  110.485833  26.051389  s.E00  S00
1  Edi( verbose=0 )  110.486483  26.052210  s.E01  S01
2  Edi( verbose=0 )  110.487134  26.053032  s.E02  S02
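The interaction between samples and clear_cache described in the parameters can be modelled with a toy cache. `EdiCache` below is purely illustrative and is not the watex caching code; it only reproduces the documented behaviour that a populated cache ignores samples until it is cleared.

```python
class EdiCache:
    """Toy model of the documented cache behaviour (not watex code)."""

    def __init__(self, n_total=50):
        self.n_total = n_total  # 50 EDI files in the Huayuan batch
        self._stored = None     # number of items currently cached

    def fetch(self, samples=None, clear_cache=False):
        if clear_cache:
            self._stored = None
        if self._stored is None:
            # The cache is only populated when it is empty.
            self._stored = self.n_total if samples in (None, "*") else int(samples)
        # A populated cache is returned as-is, whatever `samples` says.
        return list(range(self._stored))


cache = EdiCache()
print(len(cache.fetch()))                              # 50
print(len(cache.fetch(samples=17)))                    # 50, samples is ignored
print(len(cache.fetch(samples=17, clear_cache=True)))  # 17
```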
- watex.datasets.load_iris(*, return_X_y=False, as_frame=False, tag=None, data_names=None, **kws)
Load and return the iris dataset (classification). The iris dataset is a classic and very easy multi-class classification dataset.
- Parameters:
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. Added in version 0.1.2.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. Added in version 0.1.2.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
- Returns:
data (Boxspace) – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe} of shape (150, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- target: {ndarray, Series} of shape (150,)
The classification target. If as_frame=True, target will be a pandas Series.
- feature_names: list
The names of the dataset columns.
- target_names: list
The names of target classes.
- frame: DataFrame of shape (150, 5)
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.2.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.1.2.
(data, target) (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target samples. Added in version 0.1.2.
Notes
Changed in version 0.1.1: Fixed two wrong data points according to Fisher's paper. The new version is the same as in R, but not as in the UCI Machine Learning Repository.
Examples
Let's say you are interested in the samples 10, 25, and 50, and want to know their class name.
>>> from watex.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']
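The example above returns integer labels and the list of class names separately. Mapping one onto the other is a plain lookup, sketched here with the values printed in the example (no watex import is needed).

```python
# Values copied from the example output above.
target_names = ['setosa', 'versicolor', 'virginica']
targets = [0, 0, 1]  # data.target[[10, 25, 50]]

# Each integer label is an index into target_names.
class_names = [target_names[t] for t in targets]
print(class_names)  # ['setosa', 'setosa', 'versicolor']
```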
- watex.datasets.load_mxs(*, return_X_y=False, as_frame=False, key=None, tag=None, samples=None, tnames=None, data_names=None, split_X_y=False, seed=None, shuffle=False, test_ratio=0.2, **kws)
Load the dataset after implementing the mixture learning strategy (MXS).
The dataset is composed of 11 merged boreholes with multiple targets and can be used for a classification problem.
- Parameters:
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) based on the test_ratio value.
tnames (str, optional) – The name of the target to retrieve. If None, all target columns are collected and compose a multioutput y. For a single classification or regression problem, it is recommended to indicate the name of the target needed for the learning task.
tag (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
data_names (None) – Does nothing; kept for API consistency so that the same data can be fetched with watex.datasets.fetch_data(), which already takes tag and data_names as parameters.
samples (int, optional) – Ratio or number of items from axis to fetch in the data. Defaults to 0.5 if samples is None.
key (str, default='data') –
Kind of MXS data to fetch. Can also be:
- "sparse": a compressed sparse row matrix format of train set X.
- "scale": returns a scaled X using the standardization strategy.
- "num": numerical data only, excluding the 'strata' feature.
- "test": test data X and y.
- "train": train data X and y with preprocessing already performed.
- "raw": original dataset X and y with no preprocessing.
- "data": default when key is not supplied. It returns the Boxspace object.
When key is not supplied, "data" is used instead and a Boxspace object is returned, where:
- target_map: is the mapping of MXS labels in the target y.
- nga_labels: is the y predicted for Naive Group of Aquifer.
drop_observations (bool, default=False) – Drop the remark column in the logging data if set to True.
seed (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for the random number generator. If np.random.RandomState or np.random.Generator, use as given.
shuffle (bool, default=False) – If True, borehole data is shuffled before sampling.
test_ratio (float, default=0.2, i.e. 20% of (X, y)) – The ratio used to split the data into training (X, y) and testing (Xt, yt) sets.
- Returns:
data (Boxspace) – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.
- feature_names: list
The names of the dataset columns.
- target_names: list
The names of target classes.
- target_map: dict
The mapping of MXS labels in the target y.
- nga_labels: array-like 1D
The y predicted for Naive Group of Aquifer.
- frame: DataFrame
Only present when as_frame=True. DataFrame with data and target.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target samples.
X, Xt, y, yt (tuple if split_X_y is True) – A tuple of four arrays. X and Xt are 2D arrays of training and test data, whereas y and yt are the training and test labels. The number of samples in each is based on test_ratio.
Examples
>>> from watex.datasets.dload import load_mxs
>>> load_mxs(return_X_y=True, key='sparse', samples='*')
(<1038x21 sparse matrix of type '<class 'numpy.float64'>'
 with 8298 stored elements in Compressed Sparse Row format>,
 array([1, 1, 1, ..., 5, 5, 5], dtype=int64))
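The samples parameter accepts either a ratio or a count, defaults to 0.5 when None, and the example above passes '*' to get everything. One way to picture that resolution is the sketch below. `resolve_samples` is a hypothetical helper; in particular, treating a float in (0, 1] as a ratio and truncating the result are assumptions, not documented behaviour.

```python
def resolve_samples(samples, n_items):
    """Turn `samples` into a number of rows to keep (hypothetical helper).

    None -> half of the rows (documented default of 0.5)
    '*'  -> all rows
    float in (0, 1] -> ratio of the rows (assumption: truncated)
    int  -> that many rows, capped at n_items
    """
    if samples is None:
        samples = 0.5
    if samples == "*":
        return n_items
    if isinstance(samples, float) and 0.0 < samples <= 1.0:
        return int(samples * n_items)
    return min(int(samples), n_items)


n_rows = 1038  # number of rows in the sparse matrix of the example
print(resolve_samples(None, n_rows))  # 519
print(resolve_samples("*", n_rows))   # 1038
print(resolve_samples(100, n_rows))   # 100
```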
- watex.datasets.load_nlogs(*, return_X_y=False, as_frame=False, key=None, years=None, split_X_y=False, test_ratio=0.3, tag=None, tnames=None, data_names=None, samples=None, seed=None, shuffle=False, **kws)[source]#
Load the Nanshang Engineering and hydrogeological drilling dataset.
Dataset contains multi-target and can be used for a classification or regression problem.
- Parameters:
return_X_y (bool, default=False) – If True, returns
(data, target)instead of a Bowlspace object. See below for more information about the data and target object.as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
split_X_y (bool, default=False,) – If True, the data is splitted to hold the training set (X, y) and the testing set (Xt, yt) with the according to the test size ratio.
test_ratio (float, default is {{.3}} i.e. 30% (X, y)) – The ratio to split the data into training (X, y) and testing (Xt, yt) set respectively.
tnames (str, optional) – the name of the target to retreive. If
Nonethe full target columns are collected and compose a multioutput y. For a singular classification or regression problem, it is recommended to indicate the name of the target that is needed for the learning task. When collecting data for land subsidence withkey="ls", tnames and years are used interchangeability.(tag (None) – tag and data_names do nothing. just for API purpose and to allow fetching the same data uing the func:~watex.data.fetch_data since the latter already holds tag and data_names as parameters.
data_names) (None) – tag and data_names do nothing. just for API purpose and to allow fetching the same data uing the func:~watex.data.fetch_data since the latter already holds tag and data_names as parameters.
key (str, default='b0') – Kind of drilling data to fetch. Can also be "ns" or "ls". The ns data refer mostly to engineering drilling, whereas b0 refers to pure hydrogeological drilling. In the former case, the 'ground_height_distance' attribute used to control soil settlement is the target, while in the latter the targets are the water inflow, the drawdown and the static water level. The "ls" key is used for collecting the time-series land subsidence data from 2015 to 2022. It should be used in combination with the years parameter to collect the data of specific years. The default land-subsidence data is 2022.
years (str, default="2022") –
The year of land subsidence. Note that land subsidence data are collected from 2015 to 2022. For instance, to select two years of subsidence data, separate the years with a space, like years="2015 2022". The star * argument can be used to select the data of all years.
Added in version 0.2.7: Years of Nanshan land subsidence data collected are added. Use key ls and years for retrieving the subsidence data of each year.
samples (int, optional) – Ratio or number of items from axis to fetch in the data. Fetches all data if samples is None.
seed (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) – If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
shuffle (bool, default=False) – If True, borehole data is shuffled before sampling.
drop_display_rate (bool, default=True) –
Whether to drop the display rate from the land subsidence data. The display rate is used for image visualization, to increase the image pixels.
- Returns:
data (Boxspace) – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- target: {ndarray, Series}
The classification target. If as_frame=True, target will be a pandas Series.
- feature_names: list
The names of the dataset columns.
- target_names: list
The names of target classes.
- frame: DataFrame
Only present when as_frame=True. DataFrame with data and target. Added in version 0.1.1.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.1.2.
data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first contains a 2D array of shape (n_samples, n_features), with each row representing one sample and each column representing a feature. The second ndarray, of shape (n_samples,), contains the target samples. Added in version 0.1.2.
X, Xt, y, yt (tuple if split_X_y is True) – A tuple of training and testing arrays. X and Xt are 2D arrays whose shapes follow:\[ \begin{aligned}\text{shape}(X, y) &= (1 - \text{test\_ratio}) \times (n_{samples}, n_{features}) \times 100\\ \text{shape}(Xt, yt) &= \text{test\_ratio} \times (n_{samples}, n_{features}) \times 100\end{aligned} \]where each row represents one sample and each column represents a feature. y and yt are the ndarrays containing the matching target samples.
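As a rough illustration of how test_ratio divides the samples, the sketch below computes the sizes of the training and testing sets. The helper split_sizes is hypothetical and not part of watex; it assumes the test size is rounded up, as scikit-learn's train_test_split does, and the rounding actually used by load_nlogs may differ.

```python
import math


def split_sizes(n_samples, test_ratio=0.3):
    """Return (n_train, n_test) for a given test ratio.

    Assumption: the test size is rounded up, mirroring scikit-learn's
    train_test_split; the library itself may round differently.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must lie strictly between 0 and 1")
    n_test = math.ceil(test_ratio * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# With 200 samples and the default ratio, 60 rows are held out for testing.
n_train, n_test = split_sizes(200, test_ratio=0.3)
print(n_train, n_test)
```

Under this assumption, X would have n_train rows and Xt would have n_test rows, with the same number of feature columns in both.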
Examples
Let's say we have no idea which columns compose the target. The best approach is then to run the function without passing any parameters and read the target_names and DESCR attributes to get the name and unit of each attribute:
>>> from watex.datasets.dload import load_nlogs
>>> b = load_nlogs()
>>> b.target_names
['static_water_level', 'drawdown', 'water_inflow', 'unit_water_inflow', 'water_inflow_in_m3_d']
>>> b.DESCR
... (...)
>>> # Let's say we are interested in the targets 'drawdown' and
>>> # 'static_water_level' and return `y`
>>> _, y = load_nlogs(as_frame=True,  # return X and y as frames
...                   tnames=['drawdown', 'static_water_level'])
>>> list(y.columns)
['drawdown', 'static_water_level']
>>> y.head(2)
   drawdown  static_water_level
0     70.03                4.21
1      7.38                3.60
>>> # Let's say we want the subsidence data of 2015 and 2018 with the
>>> # display resolution rate. Because the display rate is dropped by
>>> # default, we must set drop_display_rate to False to keep it in the data.
>>> n = load_nlogs(key='ls', samples=3, years="2015 2018 disp",
...                drop_display_rate=False)
>>> n.frame
        easting      northing   longitude  ...      2015       2018  disp_rate
0  2.531191e+06  1.973515e+07  113.291328  ... -0.494959 -27.531837  -7.352538
1  2.531536e+06  1.973519e+07  113.291847  ... -1.104473 -21.852705  -7.999145
2  2.531479e+06  1.973520e+07  113.291847  ... -1.139404 -22.022655  -7.894940
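The years string convention described above (space-separated years, or the star * for everything) can be sketched as follows. The helper parse_years and the constant AVAILABLE_YEARS are hypothetical names, not part of watex, and the sketch assumes that one data column exists for every year from 2015 to 2022.

```python
# Assumption: one column per year from 2015 to 2022, per the parameter notes.
AVAILABLE_YEARS = tuple(range(2015, 2023))


def parse_years(years="2022"):
    """Split a space-separated `years` string into a list of year labels.

    Hypothetical helper: it only mirrors the documented conventions
    ("2015 2022" and the star "*"), not the library's own parsing.
    """
    tokens = str(years).split()
    if "*" in tokens:
        return [str(y) for y in AVAILABLE_YEARS]
    selected = []
    for token in tokens:
        if not token.isdigit():
            continue  # non-year tokens such as "disp" are ignored here
        if int(token) not in AVAILABLE_YEARS:
            raise ValueError(f"no subsidence data for year {token}")
        if token not in selected:
            selected.append(token)
    return selected


print(parse_years("2015 2018 disp"))
print(parse_years("*"))
```

Returning strings rather than integers keeps the labels directly usable as column names such as "2015" in the frame shown above.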
- watex.datasets.load_semien(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)
DC vertical electrical sounding (VES) resistivity data collected at SEMIEN during the National Drinking Water Supply Program (PNAEP), which took place in 2014 in Cote d’Ivoire. The data are arranged as follows:
=====  =====  =====  =====  =====
AB/2   MN/2   SE1    SE2    SE…
=====  =====  =====  =====  =====
1      0.4    107    93     75
2      0.4    97     91     49
…      …      …      …      …
100    10     79     96     98
110    10     84     104    104
=====  =====  =====  =====  =====
- Parameters:
as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (33, 5)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (33,)
The resistivity of the sounding point.
- MN: {array-like} of shape (33,)
The step value of the potential electrodes, increasing, in meters.
- AB: {array-like} of shape (33,)
The step value of the current electrodes, increasing, in meters.
- feature_names: list
The names of the dataset columns. Added in version 0.23.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data. Added in version 0.20.
Added in version 0.1.2.
index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when many soundings are collected in the same survey area. index_rhoa=0 fetches the first sounding values in the collection of all values.
tag (None,) – Always None for API consistency
data_names (None,) – Always None for API consistency
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.
- Returns:
data – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- Return type:
Boxspace
Notes
The array configuration is Schlumberger and the maximum depth of investigation is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters, whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. The total number of soundings performed is 3, with the prefix ‘SE’. AB and MN are in meters and the SE values are apparent resistivities in ohm-meters. Use the parameter index_rhoa to select the sounding resistivity values by rank. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).
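To make the role of index_rhoa concrete, the following sketch picks one sounding column out of the table layout illustrated above. The helper select_sounding is hypothetical and not part of watex; the rows are the values shown in the illustration, and the sketch assumes the first two columns always hold AB/2 and MN/2.

```python
# Rows copied from the illustration above: (AB/2, MN/2, SE1, SE2, SE...)
rows = [
    (1, 0.4, 107, 93, 75),
    (2, 0.4, 97, 91, 49),
    (100, 10, 79, 96, 98),
    (110, 10, 84, 104, 104),
]


def select_sounding(rows, index_rhoa=0):
    """Return (AB/2, MN/2, rho) triples for one sounding column.

    Hypothetical helper: the first two columns hold the electrode
    spacings, so sounding `index_rhoa` sits in column 2 + index_rhoa.
    """
    n_soundings = len(rows[0]) - 2
    if not 0 <= index_rhoa < n_soundings:
        raise IndexError(f"index_rhoa must be in [0, {n_soundings - 1}]")
    return [(ab, mn, rho[index_rhoa]) for ab, mn, *rho in rows]


# index_rhoa=0 selects SE1, index_rhoa=1 selects SE2.
print(select_sounding(rows, index_rhoa=0))
print(select_sounding(rows, index_rhoa=1))
```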
- watex.datasets.load_tankesse(*, as_frame=True, tag=None, data_names=None, **kws)
DC electrical resistivity profiling (ERP) data collected at TANKESSE during the National Drinking Water Supply Program (PNAEP), which took place in 2014 in Cote d’Ivoire. An example of the data arrangement is the following:
=====  ========  ========  =====
pk     east      north     rho
=====  ========  ========  =====
0      382741    896203    79
10     382743    896193    62
20     382747    896184    51
…      …         …         …
980    382705    894887    55
990    382704    895879    58
=====  ========  ========  =====
- Parameters:
as_frame (bool, default=True) –
If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:
- data: {ndarray, dataframe} of shape (100, 4)
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- resistivity: {array-like} of shape (100,)
The resistivity of the sounding point.
- station: {array-like} of shape (100,)
The distance of each station along the profile, increasing, in meters. It can be considered as the station point for data collection.
- northing: {array-like} of shape (100,)
The northing coordinates in UTM, in meters, at each station where the data is collected.
- easting: {array-like} of shape (100,)
The easting coordinates in UTM, in meters, at each station where the data is collected.
- latitude: {array-like} of shape (100,)
The latitude coordinates in decimal degrees or ‘DD:MM.SS’ at each station where the data is collected.
- longitude: {array-like} of shape (100,)
The longitude coordinates in decimal degrees or ‘DD:MM.SS’ at each station where the data is collected.
- DESCR: str
The full description of the dataset.
- filename: str
The path to the location of the data.
tag (None,) – Always None for API consistency
data_names (None,) – Always None for API consistency
kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.
- Returns:
data – Dictionary-like object, with the following attributes.
- data: {ndarray, dataframe}
The data matrix. If as_frame=True, data will be a pandas DataFrame.
- Return type:
Boxspace
Notes
The array configuration is Schlumberger and the maximum depth of investigation is 100.0 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed to 100.0 meters, whereas \(MN/2\) (potential electrodes) is fixed to 10.0 meters. The total number of stations where data were collected is 100. station, easting, and northing are in meters and the rho column holds apparent resistivity values in ohm-meters. Furthermore, if the UTM coordinates (easting and northing) are given together with the UTM_zone, the latitude and longitude are computed automatically, and vice versa. The user does not need to provide both coordinate types (UTM or DD:MM.SS).
- watex.datasets.make_erp(*, n_stations=42, max_rho=1000.0, min_rho=1.0, step=20.0, reflong='110:29:09.00', reflat='26:03:05.00', utm_zone='29N', order='+', full_coordinates=True, raise_warning=False, as_frame=False, seed=None, is_utm=False, epsg=None, **coord_kws)
Generate Electrical Resistivity Profiling (ERP) data from stations and coordinates points.
To generate samples from a specific area, it is better to provide both the latitude and longitude values of a single station of this area as arguments to the reflat and reflong parameters respectively. Also specify the utm_zone for converting the lat/lon coordinates into UTM if necessary. If the conversion is not needed, set the parameter full_coordinates to False.
- Parameters:
n_stations (int, default=42) – number of measurement stations
max_rho (float, default=1e3) – maximum resistivity value on the survey area in \(\Omega.m\)
min_rho (float, default=1e0) – minimum resistivity value on the survey area in \(\Omega.m\)
reflong (float or string or list of [start, stop], default='110:29:09.00') – Reference longitude in degree decimal or in DD:MM:SS for the first station considered as the origin of the landmark.
reflat (float or string or list of [start, stop], default='26:03:05.00') – Reference latitude in degree decimal or in DD:MM:SS for the reference site considered as the landmark origin. If value is given in a list, it can contain the start point and the stop point.
step (float or str, default=20) – Offset, or distance of separation, between sites in meters. If the value is given as a string, it is treated as a value in meters (m) unless it carries the km suffix. Only meters and kilometers are acceptable.
order (str, default='+') – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with angle r set to 45 degrees. Can be - for descending order. Any other value is treated as ascending order.
utm_zone (string (##N or ##S), default='29N') – UTM zone in the form of a number and the North or South hemisphere, e.g. 10S or 03N. Must be given if utm2deg is set to True.
full_coordinates (bool, default=True) – Convert latitude and longitude to approximate UTM values. Easting and northing are obtained using the reference ellipsoid=23 with WGS84. If False, easting and northing are not computed and are set to null.
raise_warning (bool, default=False) – Raises warnings if GDAL is not set or about the accuracy status of the coordinates.
as_frame (bool, default=False) – If True, outputs the data as a pandas DataFrame, and as a Boxspace object otherwise.
seed (int, optional) – Allows reproducing the same data. If a value is passed, the same data are reproduced at the same sample points.
is_utm (bool, default=False) –
Type of coordinates passed to the reflat and reflong parameters for generating longitude-latitude coordinates. If is_utm is explicitly set to True, the reflong and reflat values are in UTM coordinates, and they are converted to longitude-latitude. However, if is_utm is False while the reflat and reflong values are greater than 90 and 180 degrees respectively, an error is raised.
Added in version 0.2.1.
epsg (int, str, optional) – EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.
coord_kws (dict) – Additional keywords passed to makeCoords().
- Return type:
(pd.DataFrame | Boxspace)
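The reflong and reflat defaults are written in DD:MM:SS. As a reminder of what those strings mean in decimal degrees, here is a minimal sketch of the conversion. The helper dms_to_decimal is a hypothetical name and not a watex function; the library may handle signs and formats differently.

```python
def dms_to_decimal(value):
    """Convert 'DD:MM:SS' (or a plain number) to decimal degrees.

    Hypothetical helper, written only to illustrate the arithmetic:
    degrees + minutes / 60 + seconds / 3600, with the sign of the degrees.
    """
    if isinstance(value, (int, float)):
        return float(value)
    text = value.strip()
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 'DD:MM:SS', got {value!r}")
    degrees, minutes, seconds = (float(p) for p in parts)
    sign = -1.0 if text.startswith("-") else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)


# The default reference point of make_erp, in decimal degrees.
reflong = dms_to_decimal("110:29:09.00")
reflat = dms_to_decimal("26:03:05.00")
print(round(reflong, 6), round(reflat, 6))
```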
Examples
>>> from watex.datasets.gdata import make_erp
>>> erp_data = make_erp(n_stations=50, step=30, as_frame=True)
>>> erp_data.head(3)
   station  longitude  latitude        easting    northing  resistivity
0        0 -13.488511  0.000997  668210.580864  110.183287   225.265306
1       30 -13.488511  0.000997  668210.581109  110.183482   327.204082
2       60 -13.488510  0.000997  668210.581355  110.183676   204.877551
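The station column in the output above increases by step for each station. The sketch below reproduces that spacing and the documented rule for string steps (kilometers only with the km suffix, meters otherwise). Both helpers, step_to_meters and station_offsets, are hypothetical and are not the functions make_erp uses internally.

```python
def step_to_meters(step=20.0):
    """Convert a step given as a number or a string to meters.

    Hypothetical helper following the documented rule: a string with
    the "km" suffix is in kilometers, anything else is in meters.
    """
    if isinstance(step, (int, float)):
        return float(step)
    text = step.strip().lower()
    if text.endswith("km"):
        return float(text[:-2]) * 1000.0
    if text.endswith("m"):
        text = text[:-1]
    return float(text)


def station_offsets(n_stations=42, step=20.0):
    """Offsets in meters of each station from the first one."""
    spacing = step_to_meters(step)
    return [i * spacing for i in range(n_stations)]


# Same settings as the example above: 50 stations, 30 m apart.
offsets = station_offsets(n_stations=50, step=30)
print(offsets[:3], offsets[-1])
```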
- watex.datasets.make_ves(*, samples=31, min_rho=10.0, max_rho=1000.0, max_depth=100.0, order='-', as_frame=False, seed=None, iorder=3, xy=None, is_utm=False, add_xy=False, utm_zone=None, epsg=None)
Generate Vertical Electrical Sounding (VES) data from pseudo-depth measurements.
For a large number of pseudo-depth measurements, one can increase the number of samples. By default, samples=31 measurements at depth are presumed collected.
- Parameters:
samples (int, default=31) – Number of depth measurements AB/2 in meters.
max_rho (float, default=1e3) – Maximum resistivity value expected at depth in the survey area, in \(\Omega.m\)
min_rho (float, default=1e1) – Minimum resistivity value expected at depth in the survey area, in \(\Omega.m\)
order (str, default='-') – Direction of the projection line. By default the projected line is in ascending order, i.e. from SW to NE with angle r set to 45 degrees. Can be - for descending order. Any other value is treated as ascending order.
max_depth (float, default=100) – Maximum depth expected to be reached by AB/2, in meters.
as_frame (bool, default=False) – If True, outputs the data as a pandas DataFrame, and as a Boxspace object otherwise.
seed (int, optional) – Allows reproducing the same data. If a value is passed, the same data are reproduced at the same sample points.
iorder (int, default=3) – Inflexion order. It is a positive value greater than 0. If None, it is computed using the length of the extrema (local + global). It should also be kept as low as possible to make the fitted VES curve more realistic.
xy (tuple, optional) –
Coordinates (easting, northing) or (lon, lat) corresponding to the VES point sves. If coordinate values are not given, coordinates are randomly generated as (lon, lat) and stored in the attribute xy. To return the auto-generated xy coordinates when as_frame=True, set add_xy to True.
Added in version 0.2.1.
is_utm (bool, default=False) – In principle, xy is expected in longitude-latitude coordinates. However, if the coordinates are passed in UTM, such as easting-northing, the user can specify the utm_zone to convert the xy values into valid longitude and latitude coordinates.
add_xy (bool, default=False) – Add xy coordinates to the VES dataframe.
utm_zone (str, optional) – To generate the xy coordinates from a specific zone, utm_zone can be specified; otherwise 29N is used.
epsg (int, str, optional) – EPSG number defining the projection. See http://spatialreference.org/ref/ for more info. Overrides utm_zone if both are provided.
- Return type:
(pd.DataFrame | Boxspace)
Notes
When returning the Boxspace object, each column of the ‘VES’ data can be retrieved as an attribute. Check the examples below.
Examples
>>> from watex.datasets.gdata import make_ves
>>> b = make_ves(samples=50, order='+')  # 50 measurements at depth
>>> b.resistivity[:7]
array([429.873 , 434.255 , 438.5707, 442.8203, 447.0042, 451.1228,
       457.5775])
>>> b.frame.head(3)
    AB   MN  resistivity
0  1.0  0.6   429.872999
1  2.0  0.6   434.255018
2  3.0  0.6   438.570675
>>> ves_data = make_ves(samples=50, min_rho=10, max_rho=1e5, as_frame=True,
...                     add_xy=True, xy=(3143965.855, 336704.455),
...                     is_utm=True, utm_zone='49N', epsg=None)
>>> ves_data.head(2)
    AB   MN   resistivity   longitude   latitude
0  1.0  0.6  51544.426685  107.901553 -61.802165
1  2.0  0.6  51420.739513  107.901553 -61.802165
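Whether to set is_utm depends on the kind of values held in xy. A simple range check, sketched below, can help decide. The helper looks_like_lonlat is hypothetical and not part of watex; it is only a heuristic, since very small UTM values would also pass the test.

```python
def looks_like_lonlat(xy):
    """Return True when (x, y) is a plausible (longitude, latitude) pair.

    Heuristic only: longitudes lie in [-180, 180] and latitudes in
    [-90, 90]; anything outside is assumed to be easting-northing.
    """
    x, y = xy
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# The xy pair from the example above is far outside degree ranges,
# so it would have to be passed with is_utm=True.
xy = (3143965.855, 336704.455)
print(looks_like_lonlat(xy))
print(looks_like_lonlat((107.901553, -61.802165)))
```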
Subpackages
Submodules
- load different data as a function
load_bagoue(), load_boundiali(), load_edis(), load_gbalo(), load_hlogs(), load_huayuan(), load_iris(), load_semien(), load_tankesse(), make_erp(), make_ves()
RemoteMetadata, csv_data_loader(), description_loader(), get_data(), remove_data(), text_files_loader()
- Remote Loader
Loader, fetchSingleRARData(), fetchSingleZIPData(), loadBagoueDataset(), move_file(), retrieveZIPmember()
DATASET, fetch_data(), load_bagoue(), load_boundiali(), load_edis(), load_gbalo(), load_hlogs(), load_huayuan(), load_iris(), load_mxs(), load_nlogs(), load_semien(), load_tankesse(), make_erp(), make_ves()
configuration()