watex.base.Data

class watex.base.Data(verbose=0)

Data base class

Typically, we train a model with a matrix of data. The pandas DataFrame is the most commonly used container because it provides column labels, although NumPy arrays work as well.

For supervised learning, such as regression or classification, our intent is to have a function that transforms features into a label. Written as an algebraic formula, it would look like:

\[y = f(X)\]

X is a matrix. Each row represents a sample of data or information about an individual. Every column in X is a feature. The output of our function, y, is a vector that contains labels (for classification) or values (for regression).

In Python, by convention, we use the variable name X to hold the sample data, even though capitalizing a variable name violates the standard naming convention (see PEP8).
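As a minimal sketch of the X and y convention described above, the snippet below builds a small pandas DataFrame and splits it into a feature matrix and a label vector. The column names and values are hypothetical and serve only as an illustration; nothing here relies on watex itself.

```python
import pandas as pd

# Hypothetical toy data: three samples (rows), two features plus a label column.
df = pd.DataFrame(
    {
        "feature_a": [0.5, 1.2, 3.3],
        "feature_b": [10, 20, 30],
        "label": [0, 1, 1],
    }
)

# X holds the features as a matrix: one row per sample, one column per feature.
X = df[["feature_a", "feature_b"]]
# y holds the target as a vector: one label per sample.
y = df["label"]

# The same data also works as plain NumPy arrays, without the column labels.
X_array = X.to_numpy()
y_array = y.to_numpy()

print(X.shape, y.shape)
```

Keeping X as a DataFrame preserves the column labels, which is the convenience mentioned above; converting to NumPy arrays drops them but keeps the same row and column layout.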

Parameters:

data (str, filepath_or_buffer or pandas.core.DataFrame) – Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
columns (str or list of str) – Columns containing missing data to replace. Can be used with axis equal to 1.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine whether rows or columns that contain missing values are removed:
- 0, or ‘index’: drop rows that contain missing values.
- 1, or ‘columns’: drop columns that contain missing values.
Changed in version 1.0.0: Passing a tuple or list to drop on multiple axes is not supported; only a single axis is allowed.
sample (int, Optional) – Number of rows to visualize, or the limit on the number of samples needed to see the patterns. This is useful when the data is composed of many rows. Shrinking the data to keep a subset of samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) –
Type of visualization. Can be dendrogram, mbar, bar or corr, for a dendrogram, an msno bar plot, a plt bar plot and an msno correlation visualization respectively:
- bar plot counts the non-missing data using pandas.
- mbar uses the msno package to count the number
  of non-missing data.
- dendrogram shows the clusterings of where the data is missing.
  Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms indicate how different the clusters are. Short arms mean that the branches are similar.
- corr creates a heat map showing whether there are correlations
  where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- None is the default visualization. It is useful for viewing
  contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
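The data, axis, kind and sample parameters above can be illustrated with plain pandas. The sketch below is not the watex implementation: it only shows, under the assumption of a small hypothetical CSV, how a file-like object is read, what dropping along each axis does, and the non-missing counts that a bar plot would display.

```python
import io

import pandas as pd

# Hypothetical CSV content; empty fields are parsed as missing values (NaN).
csv_text = "east,north,resistivity\n1.0,2.0,\n3.0,,50.0\n5.0,6.0,70.0\n"

# A StringIO object has a read() method, so pandas reads it like a file path.
df = pd.read_csv(io.StringIO(csv_text))

# axis=0 (or 'index'): drop the rows that contain missing values.
rows_dropped = df.dropna(axis=0)
# axis=1 (or 'columns'): drop the columns that contain missing values.
cols_dropped = df.dropna(axis=1)

# Count the non-missing entries per column, the quantity a bar plot would show.
non_missing = df.notna().sum()

# Keep only the first rows, one simple way to limit what gets visualized.
n_rows = 2  # hypothetical limit, playing the role of the sample parameter
preview = df.head(n_rows)

print(non_missing)
```

With this toy table, only one row and one column are free of missing values, which shows why the choice of axis matters before any cleaning step.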

Returns:

self – returns self for easy method chaining.

Return type:

Baseclass instance
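Returning self is what makes method chaining possible. The following sketch uses a hypothetical class, named Pipeline here purely for illustration and not part of watex, to show the pattern: each method does its work and then returns the instance.

```python
class Pipeline:
    """Hypothetical class, not part of watex, showing the return-self pattern."""

    def __init__(self, verbose=0):
        self.verbose = verbose
        self.steps_ = []

    def fit(self, data):
        # Store the data, record the step, and return the instance.
        self.data_ = data
        self.steps_.append("fit")
        return self

    def summarize(self):
        # Record the step and return the instance so further calls can follow.
        self.steps_.append("summarize")
        return self


# Because each method returns self, the calls can be written on one line.
obj = Pipeline().fit([1, 2, 3]).summarize()
print(obj.steps_)
```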

Examples