Base IO code for managing all the datasets. Created on Thu Oct 13 14:26:47 2022.

class watex.datasets.io.RemoteMetadata(file, url, checksum)

Bases: tuple

checksum

Alias for field number 2

file

Alias for field number 0

url

Alias for field number 1
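Because RemoteMetadata derives from tuple and exposes file, url and checksum as aliases for fields 0, 1 and 2, it can be read like a named tuple. The sketch below uses a stand-in built with collections.namedtuple rather than the class imported from watex, and the file name, URL and checksum values are placeholders, not real watex resources.

```python
from collections import namedtuple

# Stand-in with the documented field order (assumption: the real class
# behaves like an ordinary named tuple).
RemoteMetadata = namedtuple("RemoteMetadata", ["file", "url", "checksum"])

# Placeholder values for illustration only.
meta = RemoteMetadata(
    file="example.csv",
    url="https://example.org/data/",
    checksum="0" * 64,
)

# Each field is reachable by name or by its documented position.
assert meta.file == meta[0]
assert meta.url == meta[1]
assert meta.checksum == meta[2]

# Being a tuple, the record also unpacks positionally.
file_name, url, checksum = meta
```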

watex.datasets.io.csv_data_loader(data_file, *, data_module='watex.datasets.data', descr_file=None, descr_module='watex.datasets.descr', include_headline=False)

Loads data_file from data_module with importlib.resources.

Parameters:
  • data_file (str) – Name of csv file to be loaded from data_module/data_file. For example ‘bagoue.csv’.

  • data_module (str or module, default='watex.datasets.data') – Module where data lives. The default is ‘watex.datasets.data’.

  • descr_file (str, default=None) – Name of rst file to be loaded from descr_module/descr_file. For example ‘bagoue.rst’. See also description_loader(). If not None, the corresponding description of the dataset is also returned.

  • descr_module (str or module, default='watex.datasets.descr') – Module where descr_file lives. See also description_loader(). The default is ‘watex.datasets.descr’.

Returns:

  • data (ndarray of shape (n_samples, n_features)) – A 2D array with each row representing one sample and each column representing the features of a given sample.

  • target (ndarray of shape (n_samples,)) – A 1D array holding target variables for all the samples in data. For example target[0] is the target variable for data[0].

  • target_names (ndarray of shape (n_samples,)) – A 1D array containing the names of the classifications. For example target_names[0] is the name of the target[0] class.

  • descr (str, optional) – Description of the dataset (the content of descr_file). Only returned if descr_file is not None.

watex.datasets.io.description_loader(descr_file, *, descr_module='watex.datasets.descr', encoding='utf8')

Load descr_file from descr_module with importlib.resources.

Parameters:
  • descr_file (str) – Name of rst file to be loaded from descr_module/descr_file. For example ‘bagoue.rst’.

  • descr_module (str or module, default='watex.datasets.descr') – Module where descr_file lives. The default is ‘watex.datasets.descr’.

Returns:

fdescr – Content of descr_file.

Return type:

str
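As a rough illustration of reading a description file from a package with importlib.resources, the sketch below defines a hypothetical helper, load_description, and runs it against a throwaway package (demo_descr, containing demo.rst) created in a temporary directory. None of these names belong to watex, and the actual implementation may differ.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def load_description(descr_file, *, descr_module, encoding="utf8"):
    """Hypothetical helper: return the text of descr_file inside descr_module."""
    resource = importlib.resources.files(descr_module).joinpath(descr_file)
    return resource.read_text(encoding=encoding)


# Build a throwaway package so the sketch runs without watex installed.
tmp_root = tempfile.mkdtemp()
pkg_dir = Path(tmp_root) / "demo_descr"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("", encoding="utf8")
(pkg_dir / "demo.rst").write_text("Demo dataset\n============\n", encoding="utf8")
sys.path.insert(0, tmp_root)
importlib.invalidate_caches()

text = load_description("demo.rst", descr_module="demo_descr")
print(text.splitlines()[0])
```

With the real package, the documented call would take ‘bagoue.rst’ as descr_file and leave descr_module at its default.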

watex.datasets.io.get_data(data=None)

Return the path of the watex data directory.

By default the data directory is set to a folder named ‘watex_data’ in the user home folder. Alternatively, it can be set by the ‘WATEX_DATA’ environment variable or programmatically by giving an explicit folder path. The ‘~’ symbol is expanded to the user home folder. If the folder does not already exist, it is automatically created.

Parameters:

data (str, default=None) – The path to watex data directory. If None, the default path is ~/watex_data.

Returns:

data – The path to watex data directory.

Return type:

str
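The lookup described above (explicit path, then the ‘WATEX_DATA’ environment variable, then ~/watex_data, with ‘~’ expansion and automatic folder creation) can be sketched as follows. The helper name resolve_data_dir is hypothetical, and the precedence of an explicit argument over the environment variable is an assumption rather than something the documentation states.

```python
import os
import tempfile


def resolve_data_dir(data=None):
    """Hypothetical helper mirroring the documented lookup order."""
    if data is None:
        # Assumed order: environment variable first, then ~/watex_data.
        data = os.environ.get("WATEX_DATA", os.path.join("~", "watex_data"))
    # Expand a leading '~' to the user home folder.
    data = os.path.expanduser(data)
    # Create the folder if it does not exist yet.
    os.makedirs(data, exist_ok=True)
    return data


# Use a temporary location so the demo does not touch the real home folder.
demo_dir = os.path.join(tempfile.mkdtemp(), "watex_data_demo")
path = resolve_data_dir(demo_dir)
print(os.path.isdir(path))
```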

watex.datasets.io.remove_data(data=None)

Delete all the content of the data home cache.

Parameters:

data (str, default=None) – The path to watex data directory. If None, the default path is ~/watex_data.

watex.datasets.io.text_files_loader(container_path, *, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=42, allowed_extensions=None)

Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/
    category_1_folder/
        file1.txt
        file2.txt
        ...
        file30.txt
    category_2_folder/
        file31.txt
        file32.txt
        ...

The folder names are used as supervised signal label names. The individual file names are not important.

If load_content is False, the files are not loaded into memory. If you set load_content=True, you should also specify the encoding of the text using the ‘encoding’ parameter. For many modern text files, ‘utf-8’ will be the correct encoding. To load only files with specific extensions (e.g. .txt), pass a list of those extensions to allowed_extensions.

Parameters:
  • container_path (str) – Path to the main folder holding one subfolder per category.

  • description (str, default=None) – A paragraph describing the characteristic of the dataset: its source, reference, etc.

  • categories (list of str, default=None) – If None (default), load all the categories. If not None, list of category names to load (other categories ignored).

  • load_content (bool, default=True) – Whether to load the content of the files. If True, a ‘data’ attribute containing the text information is present in the returned data structure. If False, a ‘filenames’ attribute gives the paths to the files.

  • shuffle (bool, default=True) – Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.

  • encoding (str, default=None) – If None, do not try to decode the content of the files (e.g. for images or other non-text content). If not None, encoding to use to decode text files to Unicode if load_content is True.

  • decode_error ({'strict', 'ignore', 'replace'}, default='strict') – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. Passed as keyword argument ‘errors’ to bytes.decode.

  • random_state (int, RandomState instance or None, default=42) – Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.

  • allowed_extensions (list of str, default=None) – List of desired file extensions to filter the files to be loaded.

Returns:

data – Dictionary-like object with the following attributes:

  • data (list of str) – Only present when load_content=True. The raw text data to learn.

  • target (ndarray) – The target labels (integer index).

  • target_names (list) – The names of target classes.

  • DESCR (str) – The full description of the dataset.

  • filenames (ndarray) – The filenames holding the dataset.

Return type:

Boxspace
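To make the folder convention concrete, here is a minimal sketch of how a two-level tree can be turned into file paths, integer labels and label names. The helper scan_categories is hypothetical and uses only the standard library; it omits shuffling, content loading and decoding, and it is not the watex implementation.

```python
import os
import tempfile


def scan_categories(container_path, categories=None, allowed_extensions=None):
    """Hypothetical helper: collect file paths and integer labels per subfolder."""
    folders = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    if categories is not None:
        # Keep only the requested categories; others are ignored.
        folders = [name for name in folders if name in categories]

    filenames, target = [], []
    for label, folder in enumerate(folders):
        folder_path = os.path.join(container_path, folder)
        for fname in sorted(os.listdir(folder_path)):
            if allowed_extensions is not None:
                if os.path.splitext(fname)[1] not in allowed_extensions:
                    continue
            filenames.append(os.path.join(folder_path, fname))
            target.append(label)  # the subfolder index is the label
    return filenames, target, folders


# Build a tiny two-level tree to exercise the helper.
root = tempfile.mkdtemp()
for folder, names in {"cat_a": ["1.txt", "2.md"], "cat_b": ["3.txt"]}.items():
    os.makedirs(os.path.join(root, folder))
    for name in names:
        with open(os.path.join(root, folder, name), "w", encoding="utf-8") as fh:
            fh.write("sample text")

filenames, target, target_names = scan_categories(root, allowed_extensions=[".txt"])
print(target_names, target)
```

Here the subfolder names play the role of target_names and each file inherits the index of its subfolder as its target, which matches the description of the returned attributes.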