watex.utils.remove_outliers

watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1)

Efficient strategy to remove outliers in the data.

An outlier is a data point in a sample, observation, or distribution that lies outside the overall pattern. A commonly used rule treats a data point as an outlier if it lies more than 1.5 IQR below the first quartile or above the third quartile.

Two approaches are used to remove the outliers.

Inter Quartile Range (IQR): IQR is a commonly used and widely trusted approach in research. Low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to compute the first quartile Q1 and the third quartile Q3 (located at the ranks below in the sorted data of n points) and the IQR.

\[ \begin{align}\begin{aligned}Q1 = 1/4 (n + 1)\\Q3 = 3/4 (n + 1)\\IQR = Q3 - Q1\end{aligned}\end{align} \]

Outliers are defined relative to an upper and a lower bound placed above and below the normal range of the dataset. The bounds are offset from the quartiles by 1.5*IQR:

\[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]
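The bounds above can be sketched with plain NumPy. This is only an illustration and not the internals of remove_outliers: the helper name `iqr_bounds` is hypothetical, and `np.percentile` uses linear interpolation by default, so its quartiles can differ slightly from the (n + 1) rank rule given earlier.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR).

    Hypothetical helper for illustration; quartiles come from
    np.percentile with its default linear interpolation.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Eight ordinary readings and one obvious spike.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
lower, upper = iqr_bounds(sample)

# True where a value falls outside the [lower, upper] range.
outlier_mask = (sample < lower) | (sample > upper)
```

Here Q1 = 3 and Q3 = 7, so the bounds are -3 and 13, and only the value 100 is flagged.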

In the formula above, the IQR is scaled up by 0.5 (\(new\_IQR = IQR + 0.5*IQR\)) so that, for a Gaussian distribution, the bounds enclose all the data within about +/- 2.7 standard deviations of the mean.
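The 2.7 standard deviation figure can be checked numerically for a standard normal distribution. The sketch below uses only the Python standard library; the variable names are illustrative and nothing here is taken from the watex code.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)   # first quartile, about -0.674
q3 = std_normal.inv_cdf(0.75)   # third quartile, about +0.674
iqr = q3 - q1                   # about 1.349

upper = q3 + 1.5 * iqr          # upper bound in units of sigma
lower = q1 - 1.5 * iqr          # lower bound in units of sigma

# Fraction of a Gaussian population that falls inside the bounds.
coverage = std_normal.cdf(upper) - std_normal.cdf(lower)
```

The upper bound comes out at roughly 2.7 standard deviations, which leaves a little under 1% of Gaussian data outside the bounds.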

Z-score: Also called the standard score. It measures how far a data point is from the mean, in units of standard deviation. After setting a threshold, the z-scores of the data points can be used to identify the outliers.

\[Zscore = (\text{data point} - \text{mean}) / \text{std. deviation}\]

The outlier threshold is generally set to 3.0, since about 99.7% of the data points lie within +/- 3 standard deviations of the mean under a Gaussian distribution.
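As a minimal sketch of the z-score rule, the snippet below flags values whose absolute z-score exceeds the threshold. The helper name `zscore_mask`, the use of the population standard deviation, and the use of the absolute value are assumptions made for illustration; remove_outliers may differ in these details.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Return a boolean mask, True where |z-score| > threshold.

    Hypothetical helper; uses the population standard deviation
    (NumPy's default, ddof=0).
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Twenty identical readings followed by a single spike.
sample = np.concatenate([np.full(20, 5.0), [50.0]])
mask = zscore_mask(sample, threshold=3.0)
cleaned = sample[~mask]  # keep only the values that are not flagged
```

Note that on very small samples a single extreme value inflates the standard deviation, so its z-score may never reach 3.0. This is one reason the threshold choice matters.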

Parameters:

ar (Arraylike, pd.DataFrame) –
Array-like containing the outliers to remove.

New in version 0.2.7: Accepts a DataFrame and can remove outliers using the z-score.
method (str, default='IQR') – The approach used to remove the outliers. It can be [‘IQR’|’Z-score’]. See above for an explanation of each. Note that with "z-score", the threshold value strongly influences which data points are considered outliers.
threshold (float, default=3) – Threshold used with "z-score": data points with a z-score above this value are considered outliers.
fill_value (float, optional) – Value used to replace the outliers. If not given, outliers are removed from the array.
axis (int, default=1) – Axis along which to remove values. This is useful when a two-dimensional array is supplied. By default, outliers are deleted from the rows.
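The difference between replacing and removing outliers can be illustrated with plain NumPy. This sketch assumes that an outlier mask is already available (the mask below is hand-written and hypothetical) and that removal drops every row containing at least one outlier. That matches the shapes shown in the examples on this page, but it is not a description of the actual implementation.

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, 99.0],
                 [5.0, 6.0]])

# Hypothetical mask marking 99.0 as the only outlier.
mask = np.array([[False, False],
                 [False, True],
                 [False, False]])

# Replacing: the shape is preserved and outliers become NaN,
# comparable to passing fill_value=np.nan.
filled = np.where(mask, np.nan, data)

# Removing: drop every row that contains at least one outlier,
# so the number of rows shrinks.
dropped = data[~mask.any(axis=1)]
```

Replacement keeps rows aligned with other arrays of the same length, whereas removal yields a smaller array without missing values.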

Returns:

arr – New array with the outliers removed.

Return type:

Array_like

Examples

>>> import numpy as np
>>> np.random.seed(42)
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn(7, 3)
>>> data_r = remove_outliers(data)
>>> data.shape, data_r.shape
((7, 3), (5, 3))
>>> remove_outliers(data, fill_value=np.nan)
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for a one-dimensional array
>>> remove_outliers(data[:, 0], fill_value=np.nan)
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])