Outlier Function (Python)

Dan Kirk
3 min read · May 25, 2022

Whenever I start a new data project, I find checking for outliers a little tedious. Many articles on the internet explain how to detect outliers and how to deal with them, but I couldn’t find a good function that automates the check column by column, which is almost always what I want in my work. The few I did find worked, but I didn’t find them very practical.

So, I decided to make my own. The function you see below takes a dataframe and a z-score threshold as input, loops through all columns, and flags as outliers the values whose absolute z-score exceeds the threshold. For each column that contains outliers, the following is printed:

  • A line of text that states which column the following information refers to
  • A dataframe that consists of the entire row of values for which an outlier exists in the column
  • A list that states what the values of the outliers are (this is convenient if your dataframe has many features)
  • A list that states the index locations of the outliers in the original dataframe
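Before the full function, the underlying check for a single column can be sketched as follows. This is only an illustration: the column `values` is made up, and using the population standard deviation (`ddof=0`) is an assumption of the sketch.

```python
import pandas as pd

# Hypothetical example column: mostly small values plus one extreme value
values = pd.Series([5, 7, 6, 8, 5, 6, 7, 6, 5, 7, 6, 8, 100])

# z-score: distance from the column mean in units of standard deviation
# (population standard deviation, ddof=0, is an assumption of this sketch)
z_scores = (values - values.mean()) / values.std(ddof=0)

# A value counts as an outlier when its absolute z-score exceeds the threshold
threshold = 3
outlier_mask = z_scores.abs() > threshold

# Show the outlying values together with their index locations
print(values[outlier_mask])
```

The boolean mask is the useful part: the same mask can select whole rows from a dataframe, which is what makes the per-column report possible.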

Here is the code with an example:

import pandas as pd
import numpy as np

def IDoutliers(df, zscorethreshold):
    """
    Parameters
    ----------
    df : DataFrame with numerical columns
    zscorethreshold : Z score above which a value counts as an outlier

    For each col with outliers (Z score exceeds threshold),
    a df with these datapoints is printed.

    A list of the values is also printed for readability.

    For convenience, a list of the index values is also printed.
    """
    #Establish lists to extend to print df with outliers
    outliersdf = []
    indexvals = []
    colswithoutliers = []
    for col in df.columns:
        #Mean and sd per col
        u = np.mean(df[col])
        sd = np.std(df[col])
        #Z-number threshold
        z = zscorethreshold
        #Isolate all outliers per col
        outliers = df.loc[((abs(df[col]-u))/sd) > z]
        #Only select those columns with outliers
        if len(outliers) > 0:
            #Extend lists
            outliersdf.append(outliers)
            indexvals.append(outliers.index.values)
            colswithoutliers.append(col)

    #Print df of outliers per column
    for number, item in enumerate(outliersdf):
        print('\nData points with outliers in column {}\n'.format(colswithoutliers[number]))
        print(item)
        print('\nValues of outliers: {}'.format(item[colswithoutliers[number]].values))
        print('\nIndex values of outliers as a list:{}'.format(indexvals[number]))

    print('Outlier report complete')

Let’s generate some random data and then add some outliers in different columns:

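df = pd.DataFrame(np.random.randint(1,20,size=(100, 5)), columns=list('abcde'))
#Plant outliers with a single .loc call per cell (avoids chained assignment)
df.loc[0, 'a'] = 150
df.loc[1, 'a'] = 120
df.loc[2, 'a'] = 200
df.loc[2, 'b'] = 190
df.loc[99, 'e'] = 150
#The function prints its own report, so no print() wrapper is needed
IDoutliers(df, 3)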

The results of the print statement can be seen below

Data points with outliers in column a

     a    b   c  d   e
0  150    7  16  6   4
1  120    4  14  2  17
2  200  190  16  1  18

Values of outliers: [150 120 200]

Index values of outliers as a list:[0 1 2]

Data points with outliers in column b

     a    b   c  d   e
2  200  190  16  1  18

Values of outliers: [190]

Index values of outliers as a list:[2]

Data points with outliers in column e

    a  b   c   d    e
99  4  8  11  18  150

Values of outliers: [150]

Index values of outliers as a list:[99]
Outlier report complete

Some thoughts:

  • I haven’t used the function extensively, so there may be some situations where it breaks down or is impractical, but so far it has served me well and has saved me a lot of time
  • At times, I have had to change the number of lines in my console (Spyder) to be able to see all of the results. You might have to do the same
  • If you have a large dataset with many outliers, you may find the function to be less useful because there is still some manual labor required after running the function
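One possible way to ease the last two points is to collect the results in a dataframe instead of printing them, so the report can be filtered or saved rather than read from the console. The sketch below is a hypothetical variant, not the function above: the name `summarize_outliers`, its default threshold, and the choice to skip non-numeric and constant columns are all assumptions.

```python
import numpy as np
import pandas as pd


def summarize_outliers(df, z_threshold=3.0):
    """Return one summary row per numeric column that contains outliers.

    Hypothetical helper for illustration; not part of the original function.
    """
    rows = []
    # Only numeric columns can be z-scored, so other dtypes are ignored
    for col in df.select_dtypes(include="number").columns:
        sd = df[col].std(ddof=0)
        # A constant (or all-missing) column has no meaningful z-scores
        if sd == 0 or np.isnan(sd):
            continue
        z = (df[col] - df[col].mean()).abs() / sd
        mask = (z > z_threshold).to_numpy()
        if mask.any():
            rows.append({
                "column": col,
                "n_outliers": int(mask.sum()),
                "index": df.index[mask].tolist(),
                "values": df.loc[mask, col].tolist(),
            })
    return pd.DataFrame(rows, columns=["column", "n_outliers", "index", "values"])


# Small demonstration with made-up data
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(1, 20, size=(100, 3)), columns=list("abc"))
demo["label"] = "x"       # non-numeric column, ignored
demo["constant"] = 1      # zero variance, skipped
demo.loc[5, "b"] = 500    # planted outlier

summary = summarize_outliers(demo)
print(summary)
```

Because the result is an ordinary dataframe, it can be sorted by `n_outliers` or written to a file, which may cut down on the manual work for larger datasets.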



Dan Kirk

Researcher at Wageningen University Research; MSc Nutrition & Health and BSc Biochemistry; practicing data science; and lifetime natural bodybuilder