Base IO code for managing all the datasets.
Created on Thu Oct 13 14:26:47 2022
- class watex.datasets.io.RemoteMetadata(file, url, checksum)
Bases: tuple
- checksum
Alias for field number 2
- file
Alias for field number 0
- url
Alias for field number 1
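The field aliases above are the usual named-tuple behaviour. A minimal sketch, assuming a stand-in tuple built with collections.namedtuple rather than the real watex class, and using hypothetical values throughout:

```python
from collections import namedtuple

# Stand-in for watex.datasets.io.RemoteMetadata, rebuilt here so the
# example runs without watex installed. Field order follows the docs.
RemoteMetadata = namedtuple("RemoteMetadata", ["file", "url", "checksum"])

# Hypothetical values, for illustration only.
meta = RemoteMetadata(
    file="example.csv",
    url="https://example.com/data/example.csv",
    checksum="0" * 64,
)

print(meta.file)             # access by field name
print(meta[1])               # access by position, same as meta.url
file, url, checksum = meta   # unpacks like any tuple
```

Each field can be read by name or by its position (0, 1, 2), which is what "Alias for field number N" refers to.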
- watex.datasets.io.csv_data_loader(data_file, *, data_module='watex.datasets.data', descr_file=None, descr_module='watex.datasets.descr', include_headline=False)
Loads data_file from data_module with importlib.resources.
- Parameters
data_file (str) – Name of csv file to be loaded from data_module/data_file. For example ‘bagoue.csv’.
data_module (str or module, default='watex.datasets.data') – Module where data lives. The default is ‘watex.datasets.data’.
descr_file (str, default=None) – Name of rst file to be loaded from descr_module/descr_file. For example ‘bagoue.rst’. See also description_loader(). If not None, also returns the corresponding description of the dataset.
descr_module (str or module, default='watex.datasets.descr') – Module where descr_file lives. See also description_loader(). The default is ‘watex.datasets.descr’.
- Returns
data (ndarray of shape (n_samples, n_features)) – A 2D array with each row representing one sample and each column representing the features of a given sample.
target (ndarray of shape (n_samples,)) – A 1D array holding target variables for all the samples in data. For example target[0] is the target variable for data[0].
target_names (ndarray of shape (n_samples,)) – A 1D array containing the names of the classifications. For example target_names[0] is the name of the target[0] class.
descr (str, optional) – Description of the dataset (the content of descr_file). Only returned if descr_file is not None.
- watex.datasets.io.description_loader(descr_file, *, descr_module='watex.datasets.descr', encoding='utf8')
Load descr_file from descr_module with importlib.resources.
- Parameters
descr_file (str) – Name of rst file to be loaded from descr_module/descr_file. For example ‘bagoue.rst’.
descr_module (str or module, default='watex.datasets.descr') – Module where descr_file lives. The default is ‘watex.datasets.descr’.
- Returns
fdescr – Content of descr_file.
- Return type
str
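Reading a text resource from a package with importlib.resources can be sketched as below. This is an illustration under stated assumptions, not the watex implementation: the helper read_description and the throwaway package demo_descr_pkg are hypothetical, and importlib.resources.files requires Python 3.9 or later.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def read_description(descr_file, *, descr_module, encoding="utf-8"):
    """Return the text of descr_file stored inside the package descr_module."""
    resource = resources.files(descr_module).joinpath(descr_file)
    return resource.read_text(encoding=encoding)


# Build a throwaway package (hypothetical name) holding one .rst file,
# so the example does not depend on watex being installed.
tmp_root = tempfile.mkdtemp()
pkg_dir = Path(tmp_root) / "demo_descr_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("", encoding="utf-8")
(pkg_dir / "demo.rst").write_text("Demo dataset\n============\n", encoding="utf-8")

# Make the new package importable.
sys.path.insert(0, tmp_root)
importlib.invalidate_caches()

fdescr = read_description("demo.rst", descr_module="demo_descr_pkg")
print(fdescr)
```

With watex installed, the documented call would presumably look like description_loader('bagoue.rst'), using the file name given as an example above.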
- watex.datasets.io.get_data(data=None)
Return the path of the watex data directory.
By default the data directory is set to a folder named ‘watex_data’ in the user home folder. Alternatively, it can be set by the ‘WATEX_DATA’ environment variable or programmatically by giving an explicit folder path. The ‘~’ symbol is expanded to the user home folder. If the folder does not already exist, it is automatically created.
- Parameters
data (str, default=None) – The path to watex data directory. If None, the default path is ~/watex_data.
- Returns
data – The path to watex data directory.
- Return type
str
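The resolution rules described above (explicit path, environment variable, default folder, ‘~’ expansion, automatic creation) can be sketched as follows. The helper name resolve_data_dir is hypothetical, and the precedence of an explicit path over ‘WATEX_DATA’ is an assumption, since the description does not state the order.

```python
import os
import tempfile


def resolve_data_dir(data=None):
    """Resolve the data directory, create it if needed, and return its path."""
    if data is None:
        # Assumed order: environment variable first, then ~/watex_data.
        data = os.environ.get("WATEX_DATA", os.path.join("~", "watex_data"))
    data = os.path.expanduser(data)   # expand a leading '~'
    os.makedirs(data, exist_ok=True)  # create the folder if it is missing
    return data


# Work inside a temporary folder so the real home directory is untouched.
base = tempfile.mkdtemp()

# Explicit path given programmatically.
explicit = resolve_data_dir(os.path.join(base, "explicit_dir"))

# Path taken from the environment variable.
os.environ["WATEX_DATA"] = os.path.join(base, "env_dir")
from_env = resolve_data_dir()

print(explicit)
print(from_env)
```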
- watex.datasets.io.remove_data(data=None)
Delete all the content of the data home cache.
- Parameters
data (str, default=None) – The path to watex data directory. If None, the default path is ~/watex_data.
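One common way to clear such a cache is shutil.rmtree. The sketch below assumes the directory itself is removed along with its content, which the description above does not confirm; clear_data_dir is a hypothetical helper, not the watex function.

```python
import os
import shutil
import tempfile


def clear_data_dir(data):
    """Delete the directory at `data` and everything inside it."""
    shutil.rmtree(os.path.expanduser(data))


# Create a temporary cache folder holding one file.
cache_dir = tempfile.mkdtemp()
with open(os.path.join(cache_dir, "cached.csv"), "w", encoding="utf-8") as fh:
    fh.write("a,b\n1,2\n")

clear_data_dir(cache_dir)
print(os.path.exists(cache_dir))
```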
- watex.datasets.io.text_files_loader(container_path, *, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=42, allowed_extensions=None)
Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level folder structure such as the following:
container_folder/
    category_1_folder/
        file1.txt
        file2.txt
        ...
        file30.txt
    category_2_folder/
        file31.txt
        file32.txt
        ...
The folder names are used as supervised signal label names. The individual file names are not important.
In addition, if load_content is false it does not try to load the files in memory. If you set load_content=True, you should also specify the encoding of the text using the ‘encoding’ parameter. For many modern text files, ‘utf-8’ will be the correct encoding. If you want files with a specific file extension (e.g. .txt) then you can pass a list of those file extensions to allowed_extensions.
- Parameters
container_path (str) – Path to the main folder holding one subfolder per category.
description (str, default=None) – A paragraph describing the characteristic of the dataset: its source, reference, etc.
categories (list of str, default=None) – If None (default), load all the categories. If not None, list of category names to load (other categories ignored).
load_content (bool, default=True) – Whether to load the content of the files. If True, a ‘data’ attribute containing the text information is present in the returned data structure. If False, a ‘filenames’ attribute gives the paths to the files.
shuffle (bool, default=True) – Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
encoding (str, default=None) – If None, do not try to decode the content of the files (e.g. for images or other non-text content). If not None, encoding to use to decode text files to Unicode if load_content is True.
decode_error ({'strict', 'ignore', 'replace'}, default='strict') – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. Passed as keyword argument ‘errors’ to bytes.decode.
random_state (int, RandomState instance or None, default=42) – Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.
allowed_extensions (list of str, default=None) – List of desired file extensions to filter the files to be loaded.
- Returns
data – Dictionary-like object, with the following attributes.
- data: list of str
Only present when load_content=True. The raw text data to learn.
- target: ndarray
The target labels (integer index).
- target_names: list
The names of target classes.
- DESCR: str
The full description of the dataset.
- filenames: ndarray
The filenames holding the dataset.
- Return type
Boxspace
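The mapping from folder names to integer labels can be sketched with the standard library alone. This illustrates the two-level layout described above and is not the watex implementation: scan_text_folders is a hypothetical helper, it returns plain lists instead of a Boxspace, and it skips shuffling and content loading.

```python
import os
import tempfile


def scan_text_folders(container_path, categories=None, allowed_extensions=None):
    """Return (filenames, target, target_names) for a two-level folder tree."""
    # Each subfolder of container_path is one category.
    folders = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    if categories is not None:
        folders = [name for name in folders if name in categories]

    filenames, target = [], []
    for label, folder in enumerate(folders):
        folder_path = os.path.join(container_path, folder)
        for fname in sorted(os.listdir(folder_path)):
            extension = os.path.splitext(fname)[1]
            if allowed_extensions is not None and extension not in allowed_extensions:
                continue  # skip files with an unwanted extension
            filenames.append(os.path.join(folder_path, fname))
            target.append(label)  # integer index of the category
    return filenames, target, folders


# Build a tiny hypothetical dataset in a temporary folder.
root = tempfile.mkdtemp()
layout = {"cat_a": ["1.txt", "2.md"], "cat_b": ["3.txt"]}
for category, files in layout.items():
    os.makedirs(os.path.join(root, category))
    for fname in files:
        with open(os.path.join(root, category, fname), "w", encoding="utf-8") as fh:
            fh.write("sample text")

filenames, target, target_names = scan_text_folders(root, allowed_extensions=[".txt"])
print(target_names)
print(target)
```

Here the file 2.md is filtered out by allowed_extensions, so only the two .txt files are kept, one per category.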