watex.utils.remove_outliers

watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1)

Efficient strategy to remove outliers in the data.

An outlier is a data point in a sample, observation, or distribution that lies outside the overall pattern. A commonly used rule treats a data point as an outlier if it lies more than 1.5 IQR below the first quartile or above the third quartile.

Two approaches are available to remove outliers.

  • Inter Quartile Range (IQR): IQR is a widely used and trusted approach in research. Low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to compute the quartiles Q1 and Q3 and, from them, the IQR.

    \[ \begin{align}\begin{aligned}Q1 = 1/4 (n + 1)\text{-th term}\\Q3 = 3/4 (n + 1)\text{-th term}\\IQR = Q3 - Q1\end{aligned}\end{align} \]

    Outliers are defined relative to the upper and lower bounds of the dataset's normal range. Both bounds use a margin of 1.5*IQR:

    \[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]

    In the formulas above, the factor 1.5 (that is, the IQR scaled up by 0.5: \(IQR_{new} = IQR + 0.5*IQR\)) is chosen so that, for a Gaussian distribution, the bounds enclose the data lying within about 2.7 standard deviations of the mean.

  • Z-score: also called the standard score. It measures how far a data point lies from the mean, in units of standard deviation. Once a threshold is set, the z-scores of the data points can be used to identify outliers.

    \[Zscore = (\text{data point} - \text{mean}) / \text{std. deviation}\]
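The IQR rule from the first bullet can be sketched with the Python standard library alone. This is an illustrative sketch, not the code watex runs: the helper name `iqr_fences` and the sample values are assumptions made for the example. `statistics.quantiles` with `method="exclusive"` is used because it places the quartiles at the (n + 1)/4-th and 3(n + 1)/4-th sorted terms; other libraries may interpolate quartiles differently and give slightly different bounds. The last lines check the "about 2.7 standard deviations" statement for a standard normal distribution.

```python
from statistics import NormalDist, quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    # method="exclusive" places Q1 and Q3 at the (n + 1)/4-th and
    # 3(n + 1)/4-th sorted terms, interpolating between neighbours.
    q1, _median, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [2, 4, 5, 7, 8, 10, 40]       # illustrative data, n = 7
lower, upper = iqr_fences(sample)      # Q1 = 4, Q3 = 10, IQR = 6
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)          # -5.0 19.0 [40]

# Gaussian check: where does the upper bound fall in units of sigma?
q3_z = NormalDist().inv_cdf(0.75)      # Q3 of N(0, 1), about 0.674
fence_z = q3_z + 1.5 * (2 * q3_z)      # IQR of N(0, 1) equals 2 * Q3
print(round(fence_z, 2))               # 2.7
```

With seven points, Q1 and Q3 are exactly the 2nd and 6th sorted values, so the bounds can be verified by hand.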

To define outliers, a threshold is chosen, generally 3.0, since about 99.7% of the data points of a Gaussian distribution lie within +/- 3 standard deviations of the mean.
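A minimal NumPy sketch of the z-score rule follows. The helper name `zscore_mask`, the sample values, and the use of the population standard deviation (NumPy's default, `ddof=0`) are assumptions made for illustration; watex's own implementation may differ in these details.

```python
import numpy as np


def zscore_mask(values, threshold=3.0):
    """Boolean mask, True where |z-score| exceeds ``threshold``."""
    values = np.asarray(values, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold


# Fifteen values close to 10 plus one extreme value at index 15.
sample = np.array([9.0, 10.0, 11.0] * 5 + [30.0])

print(np.flatnonzero(zscore_mask(sample)).tolist())   # [15]
print(zscore_mask(sample, threshold=4.0).any())       # False
```

The value 30.0 has a z-score of about 3.8, so it is flagged at the usual threshold of 3.0 but not at 4.0, which shows how strongly the threshold determines what counts as an outlier.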

Parameters:
  • ar (array-like, pd.DataFrame) –

    Arraylike containing outliers to remove.

    New in version 0.2.7: Accepts dataframe and can remove outliers using the z_score.

  • method (str, default='IQR') – The approach used to remove the outliers. It can be ['IQR'|'Z-score']. See above for an explanation of each approach. Note that with "z-score", the threshold value strongly influences which data points are considered outliers.

  • threshold (float, default=3) – Threshold used by the "z-score" method: data points whose z-score exceeds this value are considered outliers.

  • fill_value (float, optional) – Value used to replace the outliers. If not given, outliers are removed from the array.

  • axis (int, default=1) – Axis along which to remove values. This is useful when a two-dimensional array is supplied. By default, outliers are deleted from the rows.

Returns:

arr – New array with outliers removed.

Return type:

Array_like

Examples

>>> import numpy as np
>>> np.random.seed(42)
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn(7, 3)
>>> data_r = remove_outliers(data)
>>> data.shape, data_r.shape
((7, 3), (5, 3))
>>> remove_outliers(data, fill_value=np.nan)
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for a one-dimensional array
>>> remove_outliers(data[:, 0], fill_value=np.nan)
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])
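
The difference between filling and removing, as seen in the examples above, can be mimicked in plain NumPy once a boolean outlier mask is available. The sketch below is one reading of the documented behaviour, not watex's implementation: the helper `apply_outlier_mask`, the toy array, and the way the mask is built are all hypothetical.

```python
import numpy as np


def apply_outlier_mask(arr, mask, fill_value=None):
    """Fill flagged entries, or drop every row that contains one."""
    arr = np.asarray(arr, dtype=float)
    if fill_value is not None:
        # Shape is preserved; only the flagged entries change.
        return np.where(mask, fill_value, arr)
    if arr.ndim == 1:
        return arr[~mask]
    # Assumption: a 2D row is dropped as soon as one entry is flagged.
    return arr[~mask.any(axis=1)]


data = np.array([[1.0, 2.0], [3.0, 99.0], [5.0, 6.0]])
mask = data > 50                       # hypothetical mask flagging 99.0

print(apply_outlier_mask(data, mask).shape)            # (2, 2)
print(apply_outlier_mask(data, mask, fill_value=np.nan).shape)  # (3, 2)
```

Filling keeps the array shape, which helps when rows must stay aligned with other data, whereas removal shortens the array, as in the (7, 3) to (5, 3) example.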