Load different data as a function

Inspired by the dataset loaders popular in machine learning libraries.

Created on Thu Oct 13 16:26:47 2022 @author: Daniel

watex.datasets.dload.load_bagoue(*, return_X_y=False, as_frame=False, split_X_y=False, test_size=0.3, tag=None, data_names=None)

Load the Bagoue dataset.

The Bagoue dataset is a classic multi-class classification dataset. Refer to the description for more details.

Parameters:
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. New in version 0.1.2.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. New in version 0.1.1.

  • split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) according to the test size ratio.

  • test_size (float, default=0.3) – The fraction of the data assigned to the testing set (Xt, yt); the remainder forms the training set (X, y). The default holds out 30% of the samples for testing.

  • tag (None) – tag and data_names do nothing; they exist only for API consistency. They allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • data_names (None) – tag and data_names do nothing; they exist only for API consistency. They allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

Returns:

  • data (Boxspace) – Dictionary-like object, with the following attributes.

    data: {ndarray, dataframe} of shape (150, 4)

    The data matrix. If as_frame=True, data will be a pandas DataFrame.

    target: {ndarray, Series} of shape (150,)

    The classification target. If as_frame=True, target will be a pandas Series.

    feature_names: list

    The names of the dataset columns.

    target_names: list

    The names of target classes.

    frame: DataFrame of shape (150, 5)

    Only present when as_frame=True. DataFrame with data and target. New in version 0.1.2.

    DESCR: str

    The full description of the dataset.

    filename: str

    The path to the location of the data. New in version 0.1.2.

  • data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target values. New in version 0.1.2.

  • X, Xt, y, yt (tuple if split_X_y is True) – A tuple of four arrays. X and Xt are 2D arrays holding the training and testing features, with approximate shapes:

    \[ \begin{align}\begin{aligned}\text{shape}(X) = \big((1 - \text{test_size}) \cdot n_{samples},\ n_{features}\big)\\\text{shape}(Xt) = \big(\text{test_size} \cdot n_{samples},\ n_{features}\big)\end{aligned}\end{align} \]

    where each row represents one sample and each column one feature. y and yt are 1D arrays holding the target values for the rows of X and Xt, respectively.

Examples

Let’s say you are interested in the samples 10, 25, and 50, and want to know their class name:

>>> from watex.datasets import load_bagoue
>>> d = load_bagoue()
>>> d.target[[10, 25, 50]]
array([0, 2, 0])
>>> list(d.target_names)
['flow']
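
The effect of test_size on the split returned with split_X_y=True can be checked with simple arithmetic. The sketch below is not watex code: split_shapes is a hypothetical helper, and it assumes the test set size is rounded up with the training set taking the remaining rows (the convention used by scikit-learn's train_test_split), which the documentation above does not confirm for load_bagoue.

```python
import math


def split_shapes(n_samples, n_features, test_size=0.3):
    """Return the expected shapes of (X, Xt, y, yt) for a given test ratio.

    Hypothetical helper, not part of watex. Assumes the test set size is
    rounded up and the training set takes the remaining rows.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must lie strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return (
        (n_train, n_features),  # X: training features
        (n_test, n_features),   # Xt: testing features
        (n_train,),             # y: training targets
        (n_test,),              # yt: testing targets
    )


X_shape, Xt_shape, y_shape, yt_shape = split_shapes(150, 4, test_size=0.3)
print(X_shape, Xt_shape, y_shape, yt_shape)
```

Under that assumption, the documented shape (150, 4) with the default test_size=0.3 gives 105 training rows and 45 testing rows.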

watex.datasets.dload.load_boundiali(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)

DC vertical electrical sounding (VES) resistivity data collected at BOUNDIALI in 2014 during the National Drinking Water Supply Program (PNAEP) in Cote d’Ivoire. An illustration of the data arrangement is the following:

AB/2    MN/2    SE1    SE2    SE…
----    ----    ---    ---    ---
1       0.4     107    93     75
2       0.4     97     91     49
100     10      79     96     98
110     10      84     104    104

Parameters:
  • as_frame (bool, default=True) –

    If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

    • data: {ndarray, dataframe} of shape (33, 6)

      The data matrix. If as_frame=True, data will be a pandas DataFrame.

    • resistivity: {array-like} of shape (33,)

      The resistivity of the sounding point.

    • MN: {array-like} of shape (33,)

      The step value of the potential electrodes, increasing in meters.

    • AB: {array-like} of shape (33,)

      The step value of the current electrodes, increasing in meters.

    • feature_names: list

      The names of the dataset columns. New in version 0.23.

    • DESCR: str

      The full description of the dataset.

    • filename: str

      The path to the location of the data. New in version 0.20.

    New in version 0.1.2.

  • index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when several soundings are collected in the same survey area. index_rhoa=0 fetches the values of the first sounding in the collection.

  • tag (None) – Always None for API consistency.

  • data_names (None) – Always None for API consistency.

  • kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}

The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters, whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. A total of 4 soundings were performed, identified by the prefix ‘SE’. AB and MN are in meters, and the SE columns are apparent resistivity values in ohm.meters. Use the parameter index_rhoa to select which sounding's resistivity values to retrieve. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).
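
The role of index_rhoa can be illustrated on the rows shown in the data arrangement above. This is only a sketch: select_sounding is a hypothetical helper, not the watex implementation, and it assumes that sounding columns are the ones whose name starts with the ‘SE’ prefix, counted in their order of appearance. Only the SE1 and SE2 columns of the illustration are used.

```python
# Rows copied from the data-arrangement illustration (SE1 and SE2 only).
COLUMNS = ["AB/2", "MN/2", "SE1", "SE2"]
ROWS = [
    [1.0, 0.4, 107.0, 93.0],
    [2.0, 0.4, 97.0, 91.0],
    [100.0, 10.0, 79.0, 96.0],
    [110.0, 10.0, 84.0, 104.0],
]


def select_sounding(columns, rows, index_rhoa=0):
    """Return (AB/2, MN/2, resistivity) triples for one sounding.

    Hypothetical helper: sounding columns are assumed to be those whose
    name starts with 'SE', in their order of appearance.
    """
    sounding_cols = [i for i, name in enumerate(columns) if name.startswith("SE")]
    if not 0 <= index_rhoa < len(sounding_cols):
        raise IndexError(
            f"index_rhoa={index_rhoa} is out of range for {len(sounding_cols)} soundings"
        )
    ab = columns.index("AB/2")
    mn = columns.index("MN/2")
    rho = sounding_cols[index_rhoa]
    return [(row[ab], row[mn], row[rho]) for row in rows]


# index_rhoa=0 picks SE1, index_rhoa=1 picks SE2.
print(select_sounding(COLUMNS, ROWS, index_rhoa=0))
print(select_sounding(COLUMNS, ROWS, index_rhoa=1))
```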

watex.datasets.dload.load_edis(*, return_data=False, as_frame=False, key=None, tag=None, samples=None, data_names=None, **kws)

Load SEG-Electrical Data Interchange (EDI) object

EDI data is a sample of data collected in Huayuan county in China.

Parameters:
  • return_data (bool, default=False) – If True, returns the data as a 1D array-like instead of a Boxspace object. Note that the data is only a collection of EDI objects from watex.edi.Edi.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).

  • samples (int, default=None) – The number of samples to retrieve.

  • tag (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • data_names (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • key (str, {'site', 'edi', 'latitude', '*', 'longitude'}, default='edi') – Kind of EDI data to fetch. If '*' is given, all the data is fetched into a single frame.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe} of shape (50, 4)

The data matrix. If as_frame=True, data will be a pandas DataFrame.

feature_names: list

The names of the dataset columns.

frame: DataFrame of shape (50, 4)

Only present when as_frame=True. DataFrame with data and no target. New in version 0.1.2.

DESCR: str

The full description of the dataset.

filename: str

The path to the location of the data. New in version 0.1.2.

Return type:

Boxspace
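
A "dictionary-like object with attributes" means that each field can be read either as a key or as an attribute. The sketch below shows one common way to build such a container. AttrDict is an illustrative stand-in and not the watex Boxspace class, whose actual implementation may differ; the coordinate values are the first two rows of the example output shown in this section.

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read and written as attributes.

    Illustrative stand-in only; this is not the watex Boxspace class.
    """

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


box = AttrDict(
    data=[[26.051390, 110.485833], [26.051794, 110.486153]],
    feature_names=["latitude", "longitude"],
    DESCR="toy description",
)
print(box.feature_names)       # attribute access
print(box["feature_names"])    # key access returns the same list
print(hasattr(box, "frame"))   # missing fields behave like missing attributes
```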

Examples

>>> from watex.datasets.dload import load_edis
>>> load_edis().frame[:3]
                edi
0  Edi( verbose=0 )
1  Edi( verbose=0 )
2  Edi( verbose=0 )
>>> load_edis(as_frame=True, key='longitude latitude', samples=7)
    latitude   longitude
0  26.051390  110.485833
1  26.051794  110.486153
2  26.052198  110.486473
3  26.052602  110.486793
4  26.053006  110.487113
5  26.053410  110.487433
6  26.053815  110.487753

watex.datasets.dload.load_gbalo(*, kind='erp', as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)

Load Gbalo ERP and VES datasets.

The dataset is composed of DC electrical resistivity profiling (ERP) and vertical electrical sounding (VES) data collected in the Gbalo locality. To change the kind of dataset, pass erp or ves to the parameter kind. The data was collected during the National Drinking Water Supply Program (PNAEP), carried out in 2012-2014 in Cote d’Ivoire. Refer to erp_doc and ves_doc for an illustration of the data arrangement.

Parameters:
  • kind (str, {'ves', 'erp'}, default='erp') – The kind of DC data to retrieve. If kind is set to ves, the VES data is fetched; otherwise the ERP data is fetched.

  • as_frame (bool, default=True) –

    If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

    data: {ndarray, dataframe} of shape (33, 6) for VES and (45, 4) for ERP

    The data matrix. If as_frame=True, data will be a pandas DataFrame.

    resistivity: {array-like} of shape (33,) and (45,) for VES and ERP

    The resistivity of the sounding point.

    station: {array-like} of shape (33,) and (45,) for VES and ERP

    The distance of each station along the profile, increasing in meters. It can be considered as the station point for data collection.

    northing: {array-like} of shape (33,) and (45,) for VES and ERP

    The northing coordinates in UTM in meters at each station where the data is collected.

    easting: {array-like} of shape (33,) and (45,) for VES and ERP

    The easting coordinates in UTM in meters at each station where the data is collected.

    latitude: {array-like} of shape (33,) and (45,) for VES and ERP

    The latitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.

    longitude: {array-like} of shape (33,) and (45,) for VES and ERP

    The longitude coordinates in degree decimals or ‘DD:MM.SS’ at each station where the data is collected.

    DESCR: str

    The full description of the dataset.

    filename: str

    The path to the location of the data.

  • tag (None) – Always None for API consistency.

  • data_names (None) – Always None for API consistency.

  • kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}

The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Example

>>> from watex.datasets import load_gbalo
>>> b = load_gbalo(as_frame=False, kind='erp')
>>> b.station  # retrieve the station positions
array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
       110., 120., 130., 140., 150., 160., 170., 180., 190., 200., 210.,
       220., 230., 240., 250., 260., 270., 280., 290., 300., 310., 320.,
       330., 340., 350., 360., 370., 380., 390., 400., 410., 420., 430.,
       440.])

Notes

The array configuration is Schlumberger and the maximum investigation depth is 100 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed at 100 meters, whereas \(MN/2\) (potential electrodes) is fixed at 10 meters. The total number of stations collected is 45, while the number of sounding points is estimated at 33. station, easting and northing are in meters, and the rho columns are apparent resistivity values in ohm.meters. Furthermore, the total number of soundings performed with the prefix ‘SE’ is 4.
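
The station array printed in the example can be reproduced by arithmetic alone: 45 stations starting at 0 meters. The 10-meter spacing is read off the example output rather than stated in the Notes, so it is an assumption of this sketch, which does not call watex.

```python
N_STATIONS = 45   # number of ERP stations stated in the Notes
STEP_M = 10.0     # spacing inferred from the example output (assumption)

# Station positions along the profile, in meters.
stations = [i * STEP_M for i in range(N_STATIONS)]

# Total length covered by the profile, in meters.
profile_length = stations[-1] - stations[0]

print(stations[:3], stations[-1])
print(profile_length)
```

With these values the last station sits at 440 meters, which matches the final entry of the array shown in the example.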

watex.datasets.dload.load_hlogs(*, return_X_y=False, as_frame=False, key=None, split_X_y=False, test_size=0.3, tag=None, tnames=None, data_names=None, **kws)

Load the hydro-logging dataset.

The dataset contains multiple targets and can be used for classification or regression problems.

Parameters:
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. New in version 0.1.2.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. New in version 0.1.3.

  • split_X_y (bool, default=False) – If True, the data is split into a training set (X, y) and a testing set (Xt, yt) according to the test size ratio.

  • test_size (float, default=0.3) – The fraction of the data assigned to the testing set (Xt, yt); the remainder forms the training set (X, y). The default holds out 30% of the samples for testing.

  • tnames (str, optional) – The name of the target to retrieve. If None, all target columns are collected and compose a multi-output y. For a single classification or regression problem, it is recommended to give the name of the target needed for the learning task.

  • tag (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • data_names (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • key (str, default='h502') –

    Kind of logging data to fetch. Can also be the borehole “h2601” or “*”. If key='*', all the data is aggregated into a single frame of boreholes. New in version 0.1.5.

    New in version 0.2.3: Added 8 new boreholes with logging, strata, layer thickness and rock_names data.

  • drop_observations (bool, default=False) – Drop the remark column in the logging data if set to True. New in version 0.1.5.

Returns:

  • data (Boxspace) – Dictionary-like object, with the following attributes.

    data: {ndarray, dataframe}

    The data matrix. If as_frame=True, data will be a pandas DataFrame.

    target: {ndarray, Series}

    The classification target. If as_frame=True, target will be a pandas Series.

    feature_names: list

    The names of the dataset columns.

    target_names: list

    The names of target classes.

    frame: DataFrame

    Only present when as_frame=True. DataFrame with data and target. New in version 0.1.1.

    DESCR: str

    The full description of the dataset.

    filename: str

    The path to the location of the data. New in version 0.1.2.

  • data, target (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target values. New in version 0.1.2.

  • X, Xt, y, yt (tuple if split_X_y is True) – A tuple of four arrays. X and Xt are 2D arrays holding the training and testing features, with approximate shapes:

    \[ \begin{align}\begin{aligned}\text{shape}(X) = \big((1 - \text{test_size}) \cdot n_{samples},\ n_{features}\big)\\\text{shape}(Xt) = \big(\text{test_size} \cdot n_{samples},\ n_{features}\big)\end{aligned}\end{align} \]

    where each row represents one sample and each column one feature. y and yt are 1D arrays holding the target values for the rows of X and Xt, respectively.

Examples

Suppose we have no idea which columns compose the target. The best approach is then to run the function without passing any parameters:

>>> from watex.datasets.dload import load_hlogs
>>> b = load_hlogs()
>>> b.target_names
['aquifer_group', 'pumping_level', 'aquifer_thickness', 'hole_depth', 'pumping_depth', 'section_aperture', 'k', 'kp', 'r', 'rp', 'remark']

>>> # Say we are interested in the targets 'pumping_level' and
>>> # 'aquifer_thickness' and want them returned as y
>>> _, y = load_hlogs(as_frame=True,  # return X and y as frames
...                   tnames=['pumping_level', 'aquifer_thickness'],
...                   )
>>> list(y.columns)
['pumping_level', 'aquifer_thickness']
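
The idea behind tnames, separating the chosen target columns from the remaining columns, can be sketched without watex. split_features_targets is a hypothetical helper and not the library's implementation. The target names come from the list above, while the 'resistivity' feature column and all numeric values are made up for illustration.

```python
def split_features_targets(rows, tnames):
    """Split a list of row dicts into feature rows X and target rows y.

    Hypothetical helper, not the watex implementation.
    """
    tnames = [tnames] if isinstance(tnames, str) else list(tnames)
    missing = [t for t in tnames if rows and t not in rows[0]]
    if missing:
        raise KeyError(f"unknown target name(s): {missing}")
    # Everything that is not a requested target stays in X.
    X = [{k: v for k, v in row.items() if k not in tnames} for row in rows]
    # y keeps the requested targets, in the order they were asked for.
    y = [{t: row[t] for t in tnames} for row in rows]
    return X, y


# Made-up values; 'resistivity' is an invented feature column.
rows = [
    {"resistivity": 120.0, "pumping_level": 35.0, "aquifer_thickness": 12.0},
    {"resistivity": 95.0, "pumping_level": 28.0, "aquifer_thickness": 9.5},
]
X, y = split_features_targets(rows, ["pumping_level", "aquifer_thickness"])
print(list(y[0]))
print(list(X[0]))
```

Passing a single name, such as tnames="pumping_level", yields a single-output y in this sketch, which mirrors the recommendation given for tnames.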

watex.datasets.dload.load_iris(*, return_X_y=False, as_frame=False, tag=None, data_names=None, **kws)

Load and return the iris dataset (classification). The iris dataset is a classic and very easy multi-class classification dataset.

Parameters:
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Boxspace object. See below for more information about the data and target object. New in version 0.1.2.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below. New in version 0.1.2.

  • tag (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

  • data_names (None) – tag and data_names do nothing; they exist only for API consistency and allow the same data to be fetched with watex.datasets.fetch_data, which already takes tag and data_names as parameters.

Returns:

  • data (Boxspace) – Dictionary-like object, with the following attributes.

    data: {ndarray, dataframe} of shape (150, 4)

    The data matrix. If as_frame=True, data will be a pandas DataFrame.

    target: {ndarray, Series} of shape (150,)

    The classification target. If as_frame=True, target will be a pandas Series.

    feature_names: list

    The names of the dataset columns.

    target_names: list

    The names of target classes.

    frame: DataFrame of shape (150, 5)

    Only present when as_frame=True. DataFrame with data and target. New in version 0.1.2.

    DESCR: str

    The full description of the dataset.

    filename: str

    The path to the location of the data. New in version 0.1.2.

  • (data, target) (tuple if return_X_y is True) – A tuple of two ndarrays. The first is a 2D array of shape (n_samples, n_features), with each row representing one sample and each column one feature. The second is an ndarray of shape (n_samples,) containing the target values. New in version 0.1.2.

Notes

Changed in version 0.1.1: Fixed two wrong data points according to Fisher’s paper. The new version is the same as in R, but not as in the UCI Machine Learning Repository.

Examples

Let’s say you are interested in the samples 10, 25, and 50, and want to know their class name:

>>> from watex.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']
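
The example returns integer class codes and the list of class names separately. Mapping one onto the other is a plain index lookup, sketched here with the values copied from the example output so that it runs without watex installed.

```python
# Values copied from the example output.
target_names = ["setosa", "versicolor", "virginica"]
selected_targets = [0, 0, 1]  # data.target[[10, 25, 50]]

# Each integer code is the position of its class in target_names.
class_names = [target_names[code] for code in selected_targets]
print(class_names)  # ['setosa', 'setosa', 'versicolor']
```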

watex.datasets.dload.load_semien(*, as_frame=True, index_rhoa=0, tag=None, data_names=None, **kws)

DC vertical electrical sounding (VES) resistivity data collected at SEMIEN in 2014 during the National Drinking Water Supply Program (PNAEP) in Cote d’Ivoire. An illustration of the data arrangement is the following:

AB/2    MN/2    SE1    SE2    SE…
----    ----    ---    ---    ---
1       0.4     107    93     75
2       0.4     97     91     49
100     10      79     96     98
110     10      84     104    104

Parameters:
  • as_frame (bool, default=True) –

    If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

    • data: {ndarray, dataframe} of shape (33, 5)

      The data matrix. If as_frame=True, data will be a pandas DataFrame.

    • resistivity: {array-like} of shape (33,)

      The resistivity of the sounding point.

    • MN: {array-like} of shape (33,)

      The step value of the potential electrodes, increasing in meters.

    • AB: {array-like} of shape (33,)

      The step value of the current electrodes, increasing in meters.

    • feature_names: list

      The names of the dataset columns. New in version 0.23.

    • DESCR: str

      The full description of the dataset.

    • filename: str

      The path to the location of the data. New in version 0.20.

    New in version 0.1.2.

  • index_rhoa (int, default=0) – Index of the resistivity column to retrieve. This is useful when several soundings are collected in the same survey area. index_rhoa=0 fetches the values of the first sounding in the collection.

  • tag (None) – Always None for API consistency.

  • data_names (None) – Always None for API consistency.

  • kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}

The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 110.0 meters for \(AB/2\) (current electrodes). The sounding steps \(AB\) range from 1.0 to 110.0 meters, whereas \(MN/2\) (potential electrodes) ranges from 0.4 to 10.0 meters. A total of 3 soundings were performed, identified by the prefix ‘SE’. AB and MN are in meters, and the SE columns are apparent resistivity values in ohm.meters. Use the parameter index_rhoa to select which sounding's resistivity values to retrieve. For instance, index_rhoa=0 fetches the first array of resistivity values (SE1).

watex.datasets.dload.load_tankesse(*, as_frame=True, tag=None, data_names=None, **kws)

DC electrical resistivity profiling (ERP) data collected at TANKESSE in 2014 during the National Drinking Water Supply Program (PNAEP) in Cote d’Ivoire. An example of the data arrangement is the following:

pk     east      north     rho
---    ------    ------    ---
0      382741    896203    79
10     382743    896193    62
20     382747    896184    51
980    382705    894887    55
990    382704    895879    58

Parameters:
  • as_frame (bool, default=True) –

    If True, the data is a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

    • data: {ndarray, dataframe} of shape (100, 4)

      The data matrix. If as_frame=True, data will be a pandas DataFrame.

    • resistivity: {array-like} of shape (100,)

      The resistivity of the sounding point.

    • station: {array-like} of shape (100,)

      The distance of each station along the profile, increasing in meters. It can be considered as the station point for data collection.

    • northing: {array-like} of shape (100,)

      The northing coordinates in UTM, in meters, at each station where the data is collected.

    • easting: {array-like} of shape (100,)

      The easting coordinates in UTM, in meters, at each station where the data is collected.

    • latitude: {array-like} of shape (100,)

      The latitude coordinates in decimal degrees or ‘DD:MM.SS’ at each station where the data is collected.

    • longitude: {array-like} of shape (100,)

      The longitude coordinates in decimal degrees or ‘DD:MM.SS’ at each station where the data is collected.

    • DESCR: str

      The full description of the dataset.

    • filename: str

      The path to the location of the data.

  • tag (None,) – Always None for API consistency

  • data_names (None,) – Always None for API consistency

  • kws (dict) – Keyword arguments passed to the _is_readable() function for parsing data.

Returns:

data – Dictionary-like object, with the following attributes.

data: {ndarray, dataframe}

The data matrix. If as_frame=True, data will be a pandas DataFrame.

Return type:

Boxspace

Notes

The array configuration is Schlumberger and the maximum investigation depth is 100.0 meters for \(AB/2\) (current electrodes). The profiling step \(AB\) is fixed at 100.0 meters, whereas \(MN/2\) (potential electrodes) is fixed at 10.0 meters. The total number of stations collected is 99. station, easting, and northing are in meters, and the rho columns are apparent resistivity values in ohm.meters. Furthermore, if the UTM coordinates (easting and northing) are given together with the UTM_zone, the latitude and longitude are computed automatically, and vice versa. The user does not need to provide both types of coordinates (UTM or DD:MM.SS).
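
Latitude and longitude may be given either in decimal degrees or in a sexagesimal form. The conversion between the two is simple arithmetic, sketched below. dms_to_decimal is a hypothetical helper and not a watex function, and it assumes colon-separated degrees, minutes and seconds ('DD:MM:SS'), which may differ from the exact ‘DD:MM.SS’ layout that watex parses.

```python
def dms_to_decimal(text):
    """Convert 'DD:MM:SS' (optionally signed) to decimal degrees.

    Hypothetical helper; the colon-separated layout is an assumption.
    """
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 'DD:MM:SS', got {text!r}")
    # Keep the sign separately so that '-0:30:00' stays negative.
    sign = -1.0 if parts[0].startswith("-") else 1.0
    degrees = abs(float(parts[0]))
    minutes = float(parts[1])
    seconds = float(parts[2])
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must lie in [0, 60)")
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


print(dms_to_decimal("26:03:05"))
print(dms_to_decimal("-0:30:00"))
```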