watex.base.Missing

class watex.base.Missing(in_percent=False, sample=None, kind=None, drop_columns=None, **kws)

Deal with missing values in data.

Most algorithms will not work with missing data. Notable exceptions are the recent boosting libraries such as XGBoost (see watex.documentation.xgboost.__doc__), CatBoost and LightGBM. As with many things in machine learning, there are no hard answers for how to treat missing data. Also, missing data can represent different situations. There are several ways to handle missing data:

* Remove any row with missing data
* Remove any column with missing data
* Impute missing values
* Create an indicator column to flag that data was missing
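
The strategies listed above can be sketched with plain pandas. This is a minimal illustration and not part of watex; the toy frame `df` and its column names are made up for the example, and median imputation is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "depth": [10.0, np.nan, 30.0, 40.0],
    "resistivity": [1.5, 2.5, np.nan, 4.5],
    "site": ["a", "b", "c", "d"],
})

# 1. Remove any row that contains a missing value.
rows_dropped = df.dropna(axis=0)

# 2. Remove any column that contains a missing value.
cols_dropped = df.dropna(axis=1)

# 3. Impute missing values, here with the median of each numeric column.
numeric_cols = df.select_dtypes(include="number").columns
imputed = df.copy()
imputed[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 4. Add an indicator column recording where a value was missing.
flagged = df.copy()
flagged["depth_missing"] = df["depth"].isna()
```

Dropping rows or columns is the simplest option but discards information; imputation combined with an indicator column keeps every row while still recording where values were absent.
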
Parameters:
  • in_percent (bool) – Give the statistics of missing data as percentages if set to True.

  • sample (int, Optional) – Number of rows to visualize, i.e. the limit on the number of samples so that patterns remain visible. This is useful when the data is composed of many rows. Shrinking the data to keep a subset of samples for visualization is recommended. None plots all the samples (or examples) in the data.

  • kind (str, Optional) –

    Type of visualization. Can be dendrogram, mbar, bar or corr, giving the msno dendrogram, the msno bar plot, the plt (pandas) bar plot and the msno correlation heat map respectively:

    • bar plot counts the non-missing data using pandas.

    • mbar uses the msno package to count the number of non-missing data.

    • dendrogram shows the clustering of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms indicate how different the clusters are: short arms mean that the branches are similar.

    • corr creates a heat map showing whether there are correlations in where the data is missing, that is, whether the locations of missing data are correlated.

    • None is the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side; patterns there would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.

    Any other value will raise an error.
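
As a rough guide to what these options report, the quantities behind the plots can be computed with plain pandas. The sketch below is an assumption about the underlying statistics, not the watex implementation: the frame `df` is made up, random sampling is only one way to reduce the number of rows, and the plots themselves are omitted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Missing-value statistics as counts and as percentages
# (the idea behind in_percent=True).
missing_count = df.isna().sum()
missing_percent = df.isna().mean() * 100

# Number of non-missing values per column (what a bar-style plot displays).
non_missing = df.notna().sum()

# Keep only a subset of rows for plotting (the idea behind `sample`).
# Assumption: random sampling; the class may select rows differently.
subset = df.sample(n=3, random_state=0)

# Nullity correlation: correlation between the missing-value indicators.
# Columns without any missing value are dropped because a constant
# indicator has no defined correlation.
indicators = df.isna()
indicators = indicators.loc[:, indicators.any()]
nullity_corr = indicators.astype(int).corr()
```

In this toy frame, column `a` is missing exactly where `b` is present, so their nullity correlation is -1; a value near 1 would instead mean that two columns tend to be missing together.
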

Examples

>>> from watex.base import Missing
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> ms = Missing().fit(data)
>>> ms.plot_.fig_size = (12, 4)
>>> ms.plot()