watex.base.Data
- class watex.base.Data(verbose=0)
Data base class
Typically, we train a model with a matrix of data. A pandas DataFrame is the most commonly used container because labelled columns are convenient, although NumPy arrays work as well.
In supervised learning, such as regression or classification, the intent is to have a function that transforms features into a label. Written as an algebraic formula, it looks like:
\[y = f(X)\]
X is a matrix. Each row represents a sample of data, or information about an individual. Every column in X is a feature. The output of the function, y, is a vector that contains labels (for classification) or values (for regression). In Python, by convention, we use the variable name X to hold the sample data, even though capitalizing a variable name violates the standard naming convention (see PEP8).
- Parameters:
data (str, filepath_or_buffer or pandas.core.DataFrame) – Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
columns (str or list of str) – Columns to replace which contain the missing data. Can use the axis equal to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing values.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
sample (int, Optional) – Number of rows to visualize, or the limit on the number of samples, so that patterns remain visible. This is useful when the data is composed of many rows. Shrinking the data to keep only some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) –
Type of visualization. Can be ``dendrogramm``, ``mbar``, ``bar`` or ``corr``, for dendrogram, msno bar, plt and msno correlation visualization respectively:
- ``bar`` plot counts the nonmissing data using pandas.
- ``mbar`` uses the msno package to count the number of nonmissing data.
- ``dendrogram`` shows the clustering of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms indicate how different the clusters are; short arms mean that the branches are similar.
- ``corr`` creates a heat map showing whether there are correlations in where the data is missing. In this case, it does look like the locations where data are missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
- Returns:
self – returns self for easy method chaining.
- Return type:
Baseclass instance
Examples
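The sketch below is not taken from the watex documentation and does not call the Data class itself. It only uses pandas and NumPy to illustrate the ideas described above: the X / y convention, the meaning of axis when dropping missing values, the per-column count of non-missing values that a bar-style plot summarizes, and limiting the number of rows before visualizing. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three feature columns and one label column,
# with a few missing values (NaN) scattered in the features.
df = pd.DataFrame(
    {
        "depth": [10.0, 12.5, np.nan, 20.0, 25.0],
        "resistivity": [110.0, np.nan, 95.0, 80.0, 60.0],
        "porosity": [0.12, 0.15, 0.11, np.nan, 0.20],
        "flow": [1, 0, 1, 1, 0],
    }
)

# Convention y = f(X): X holds the features, y holds the labels.
X = df.drop(columns="flow")
y = df["flow"]

# axis=0 (or 'index') drops rows that contain missing values;
# axis=1 (or 'columns') drops columns that contain missing values.
rows_kept = df.dropna(axis=0)
cols_kept = df.dropna(axis=1)

# Per-column count of non-missing values, i.e. the quantity that a
# bar plot of data completeness would display.
non_missing = df.notna().sum()

# Keep only a few rows before plotting when the data has many rows.
subset = df.sample(n=3, random_state=0)

print(X.shape, y.shape)
print(rows_kept.index.tolist(), cols_kept.columns.tolist())
print(non_missing.to_dict())
```

With this toy frame, only rows 0 and 4 are complete, and only the label column has no gaps, so dropping along either axis discards most of the data. That is the usual reason to inspect the pattern of missing values before choosing an axis.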