Outlier Function (Python)

Dan Kirk
3 min read · May 25, 2022

Whenever I start a new data project, I find the process of checking for outliers a little tedious. I found many articles on the internet that explain how to detect outliers and how to deal with them, but no good functions that automate the check column by column, which is what I almost always want in my work. The few I did find worked, but I didn’t find them practically useful.

So, I decided to make my own. The function you see below takes a dataframe and a z-score threshold as input, loops through all columns, and flags as an outlier any value whose absolute z-score (its distance from the column mean, in standard deviations) exceeds the threshold. For each column that contains outliers, the following is printed:

  • A line of text that states which column the following information refers to
  • A dataframe containing the entire row for every data point that is an outlier in that column
  • A list of the outlier values themselves (this is convenient if your dataframe has many features)
  • A list of the index locations of the outliers in the original dataframe
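To make the threshold check concrete before the full function, here is a minimal sketch on a made-up column. The series `s`, its values, and the variable names are illustrative assumptions rather than part of the function itself; the standard deviation uses ddof=0, which is NumPy's default.

```python
import pandas as pd

# Hypothetical toy column: nineteen ordinary values and one extreme one
s = pd.Series([10] * 19 + [100])

# Column mean and population standard deviation (ddof=0)
u = s.mean()
sd = s.std(ddof=0)

# Absolute z-score of every value
z_scores = (s - u).abs() / sd

# Boolean mask of the values whose z-score exceeds the threshold
threshold = 3
is_outlier = z_scores > threshold

# Only the extreme value (index 19, value 100) is selected
print(s[is_outlier])
```

Here the extreme value gets a z-score of about 4.36 and is the only one flagged. One thing to keep in mind on small samples: with the population standard deviation, the largest possible absolute z-score among n values is sqrt(n - 1), so a threshold of 3 can only ever be exceeded when a column has at least 11 rows.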

Here is the code with an example:

import pandas as pd
import numpy as np


def IDoutliers(df, zscorethreshold):
    '''
    Parameters
    ----------
    df : DataFrame with numerical columns only
    zscorethreshold : absolute Z score above which a value
        counts as an outlier

    Returns
    -------
    None. For each col with outliers (Z score exceeds threshold),
    a df with these datapoints is printed.

    A list of the values is also printed for readability

    For convenience, a list of the index values is also
    provided
    '''
    # Establish lists to extend while collecting outliers per column
    outliersdf = []
    indexvals = []
    colswithoutliers = []
    for col in df.columns:
        # Mean and population sd (ddof=0) per col
        u = df[col].mean()
        sd = df[col].std(ddof=0)
        # Isolate all rows whose value in this col is an outlier
        outliers = df.loc[(df[col] - u).abs() / sd > zscorethreshold]
        # Only keep those columns with outliers
        if len(outliers) > 0:
            outliersdf.append(outliers.values)
            indexvals.append(outliers.index.values)
            colswithoutliers.append(col)

    # Print df of outliers per column
    for col, rows, idx in zip(colswithoutliers, outliersdf, indexvals):
        print('\nData points with outliers in column {}\n'.format(col))

        # Use a new name here so the input df is not overwritten
        outlierrows = pd.DataFrame(rows, index=idx, columns=df.columns)
        print(outlierrows)
        print('\nValues of outliers: {}'.format(outlierrows[col].values))
        print('\nIndex values of outliers as a list:{}'.format(idx))
        print('---------------------------------------------------------------------------')

    # The function only prints a report; it returns None
    print('Outlier report complete')

Let’s generate some random data and then add some outliers in different columns:

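df = pd.DataFrame(np.random.randint(1,20,size=(100, 5)), columns=list('abcde'))
# Overwrite a few cells with extreme values
# (df.loc[row, col] avoids chained assignment such as df['a'].loc[0])
df.loc[0, 'a'] = 150
df.loc[1, 'a'] = 120
df.loc[2, 'a'] = 200
df.loc[2, 'b'] = 190
df.loc[99, 'e'] = 150
print(IDoutliers(df, 3))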

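The results of the print statement can be seen below. The final None appears because the function itself returns nothing, so wrapping the call in print() also prints its return value: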

Data points with outliers in column a

     a    b   c  d   e
0  150    7  16  6   4
1  120    4  14  2  17
2  200  190  16  1  18

Values of outliers: [150 120 200]

Index values of outliers as a list:[0 1 2]
---------------------------------------------------------------------------

Data points with outliers in column b

     a    b   c  d   e
2  200  190  16  1  18

Values of outliers: [190]

Index values of outliers as a list:[2]
---------------------------------------------------------------------------

Data points with outliers in column e

    a  b   c   d    e
99  4  8  11  18  150

Values of outliers: [150]

Index values of outliers as a list:[99]
---------------------------------------------------------------------------
Outlier report complete
None

Some thoughts:

  • I haven’t used the function extensively, so there may be situations where it breaks down or is impractical, but so far it has served me well and has saved me a lot of time.
  • At times, I have had to increase the number of lines shown in my console (Spyder) to be able to see all of the results. You might have to do the same.
  • If you have a large dataset with many outliers, you may find the function less useful, because the printed report still has to be worked through by hand afterwards.
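For the last two points, one possible workaround is a variant that returns the outlier rows instead of printing them, so they can be counted or filtered in code. The sketch below is only an illustration: `collect_outliers`, the seeded toy data, and the `counts` summary are hypothetical names and values, not part of the function above. It also uses pandas' `option_context` to lift the display limits temporarily, which may help when the console truncates long output.

```python
import numpy as np
import pandas as pd


def collect_outliers(df, zscorethreshold):
    """Return {column name: DataFrame of rows that are outliers in that column}."""
    found = {}
    for col in df.columns:
        u = df[col].mean()
        sd = df[col].std(ddof=0)  # population standard deviation
        if sd == 0:
            continue  # constant column: z-scores are undefined, nothing to flag
        mask = (df[col] - u).abs() / sd > zscorethreshold
        if mask.any():
            found[col] = df.loc[mask]
    return found


# Small reproducible example (seeded generator, arbitrary values)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 20, size=(100, 3)), columns=list("abc"))
df.loc[0, "a"] = 150
df.loc[2, "b"] = 190

found = collect_outliers(df, 3)

# One number per column instead of a full printed report
counts = pd.Series({col: len(rows) for col, rows in found.items()}, name="n_outliers")
print(counts)

# Print all outlier rows for one column without pandas truncating the output
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    print(found["a"])
```

Because the result is a plain dictionary of dataframes, follow-up steps such as dropping the flagged rows or exporting them can be scripted rather than read off the console.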

Dan Kirk

Researcher at Wageningen University Research; MSc Nutrition & Health and BSc Biochemistry; practicing data science; and lifetime natural bodybuilder