watex.utils.remove_outliers
- watex.utils.remove_outliers(ar, method='IQR', threshold=3.0, fill_value=None, axis=1)
Efficient strategy to remove outliers in the data.
An outlier is a data point of a given sample, observation, or distribution that lies outside the overall pattern. A commonly used rule considers a data point an outlier if it lies more than 1.5 IQR below the first quartile or above the third quartile.
Two approaches are used to remove the outliers.
Inter Quartile Range (IQR)
IQR is the most commonly used and most trusted approach in the research field. Low outliers lie below Q1 - 1.5 IQR, and high outliers lie above Q3 + 1.5 IQR. One needs to calculate the median and the quartiles Q1 and Q3, from which the IQR follows. For n sorted data points, the quartile positions and the IQR are:
\[ \begin{align}\begin{aligned}Q1 = 1/4 (n + 1)\\Q3 = 3/4 (n + 1)\\IQR = Q3 - Q1\end{aligned}\end{align} \]
The outlier base values are defined above and below the dataset's normal range, namely the upper and lower bounds (a value of 1.5*IQR is considered):
\[ \begin{align}\begin{aligned}upper = Q3 + 1.5*IQR\\lower = Q1 - 1.5*IQR\end{aligned}\end{align} \]
In the above formula, the IQR is scaled up by 0.5 (\(IQR_{new} = IQR + 0.5*IQR\)) so that the bounds cover all the data within about 2.7 standard deviations of the mean in a Gaussian distribution.
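The IQR rule described above can be sketched with plain NumPy. This is only an illustration of the bounds, not the code used by watex; the helper name `iqr_bounds` and the sample values are assumptions made for the example. Note that `np.percentile` interpolates linearly by default, so its quartiles can differ slightly from the (n + 1) position rule.

```python
import numpy as np


def iqr_bounds(values, scale=1.5):
    """Return (lower, upper) outlier bounds from the IQR rule.

    Hypothetical helper for illustration; not part of watex.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr


# Made-up sample with one obviously extreme value.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

lower, upper = iqr_bounds(sample)
mask = (sample < lower) | (sample > upper)  # True where a value is an outlier

outliers = sample[mask]
cleaned = sample[~mask]

print(lower, upper)  # -3.5 14.5
print(outliers)      # [100.]
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and only the value 100 falls outside the bounds.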
Z-score
Also called a standard score. This score indicates how far a data point is from the mean. After setting a threshold value, one can use the Z-scores of the data points to define the outliers.
\[Zscore = (\text{data point} - \text{mean}) / \text{std. deviation}\]
To define an outlier, a threshold value is chosen, generally 3.0, since 99.7% of the data points lie within +/- 3 standard deviations of the mean (assuming a Gaussian distribution).
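The Z-score rule can be sketched in the same way. Again, this is a minimal illustration rather than watex's implementation: the helper name `zscore_mask`, the use of the absolute Z-score, the population standard deviation (NumPy's default `ddof=0`), and the sample values are all assumptions of this example.

```python
import numpy as np


def zscore_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |Z-score| > threshold.

    Hypothetical helper for illustration; not part of watex.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population std (ddof=0)
    return np.abs(z) > threshold


# Made-up sample: nineteen identical values and one extreme value.
sample = np.array([10.0] * 19 + [50.0])

mask = zscore_mask(sample, threshold=3.0)

# Either drop the flagged points or replace them with a fill value.
cleaned = sample[~mask]
filled = np.where(mask, np.nan, sample)

print(sample[mask])  # [50.]
```

In a small sample the largest attainable Z-score is limited (here it is about 4.36), so a threshold of 3.0 may flag nothing when only a handful of points are available.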
- Parameters:
ar (array-like) – Array containing outliers to remove.
method (str, default='IQR') – The approach used to remove the outliers. It can be ['IQR'|'Z-score']. See above for an explanation of each approach. Note that when selecting "z-score", the threshold value greatly influences which data points are considered outliers.
threshold (float, default=3) – Threshold used by the "z-score" method; data above this value are considered outliers.
fill_value (float, optional) – Value to replace the outliers. If not given, outliers are removed from the array.
axis (int, default=1) – Axis along which to remove values. This is useful when a two-dimensional array is supplied. By default, outliers are deleted from the rows.
- Returns:
arr – New array with outliers removed.
- Return type:
Array_like
Examples
>>> import numpy as np
>>> np.random.seed(42)
>>> from watex.utils.funcutils import remove_outliers
>>> data = np.random.randn(7, 3)
>>> data_r = remove_outliers(data)
>>> data.shape, data_r.shape
(7, 3) (5, 3)
>>> remove_outliers(data, fill_value=np.nan)
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473, -0.46947439],
       [ 0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227,         nan,         nan],
       [-0.56228753, -1.01283112,  0.31424733],
       [-0.90802408,         nan,  1.46564877]])
>>> # for one dimensional
>>> remove_outliers(data[:, 0], fill_value=np.nan)
array([ 0.49671415,  1.52302986,  1.57921282,  0.54256004,  0.24196227,
       -0.56228753,         nan])