Outlier Function (Python)

3 min read · May 25, 2022

Whenever I start a new data project, I find checking for outliers a little tedious. Plenty of articles explain how to detect outliers and how to deal with them, but I couldn’t find a good function that automates the check column by column, which is what I almost always want in my work. The few I did find worked, but weren’t very useful in practice.

So, I decided to make my own. The function you see below takes a dataframe and a z-score threshold as input, loops through all columns, and flags as outliers the values whose absolute z-score exceeds that threshold. For each column that contains outliers, the following is printed:

A line of text that states which column the following information refers to
A dataframe containing every full row in which that column holds an outlier
A list that states what the values of the outliers are (this is convenient if your dataframe has many features)
A list that states the index locations of the outliers in the original dataframe
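
The z-score check behind the function can be sketched for a single column. This is a minimal illustration, assuming a made-up toy column and a threshold of 3; neither the values nor the variable names come from the function itself.

```python
import numpy as np
import pandas as pd

# Toy column: 19 ordinary values and one extreme value (made up for illustration)
values = pd.Series([10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
                    10, 12, 11, 9, 10, 11, 12, 10, 9, 100])

u = np.mean(values)              # column mean
sd = np.std(values)              # column standard deviation (population sd)
zscores = abs(values - u) / sd   # absolute z-score of every value

outliers = values[zscores > 3]   # keep values more than 3 sd from the mean
print(outliers)
```

Only the value 100 at index 19 is kept here. One thing to keep in mind: with the population standard deviation, a column of ten or fewer values can never produce a z-score above 3, which is why the toy column is longer than that.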

Here is the code with an example:

import pandas as pd
import numpy as np


def IDoutliers(df, zscorethreshold):
    '''
    Parameters
    ----------
    df : dataframe with numerical columns
    zscorethreshold : z-score above which a value counts as an outlier

    Returns
    -------
    None. For each col with outliers (absolute z-score exceeds threshold),
    a df with these data points is printed.

    A list of the values is also printed for readability.

    For convenience, a list of the index values is also
    provided.
    '''
    # Establish lists to extend, used to build the report per column
    outliersdf = []
    indexvals = []
    colswithoutliers = []
    for col in df.columns:
        # Mean and sd per col
        u = np.mean(df[col])
        sd = np.std(df[col])
        # Isolate all rows whose absolute z-score in this col exceeds the threshold
        outliers = df.loc[(abs(df[col] - u) / sd) > zscorethreshold]
        # Only keep those columns with outliers
        if len(outliers) > 0:
            outliersdf.append(outliers.values)
            indexvals.append(outliers.index.values)
            colswithoutliers.append(col)

    # Print df of outliers per column
    for number, col in enumerate(colswithoutliers):
        print('\nData points with outliers in column {}\n'.format(col))

        # Use a new name here so the input df is not overwritten
        outlierrows = pd.DataFrame(outliersdf[number],
                                   index=indexvals[number],
                                   columns=df.columns)
        print(outlierrows)
        print('\nValues of outliers: {}'.format(outlierrows[col].values))
        print('\nIndex values of outliers as a list:{}'.format(indexvals[number]))
        print('---------------------------------------------------------------------------')

    # The report is printed, so the function itself returns None
    print('Outlier report complete')

Let’s generate some random data and then add some outliers in different columns:

df = pd.DataFrame(np.random.randint(1,20,size=(100, 5)), columns=list('abcde'))

# Set values with df.loc[row, col] to avoid chained assignment
df.loc[0, 'a'] = 150
df.loc[1, 'a'] = 120
df.loc[2, 'a'] = 200
df.loc[2, 'b'] = 190
df.loc[99, 'e'] = 150

IDoutliers(df, 3)

The printed report looks like this:


Data points with outliers in column a

     a    b   c  d   e
0  150    7  16  6   4
1  120    4  14  2  17
2  200  190  16  1  18

Values of outliers: [150 120 200]

Index values of outliers as a list:[0 1 2]
---------------------------------------------------------------------------

Data points with outliers in column b

     a    b   c  d   e
2  200  190  16  1  18

Values of outliers: [190]

Index values of outliers as a list:[2]
---------------------------------------------------------------------------

Data points with outliers in column e

    a  b   c   d    e
99  4  8  11  18  150

Values of outliers: [150]

Index values of outliers as a list:[99]
---------------------------------------------------------------------------
Outlier report complete
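
Because the random data is unseeded, the background values differ on every run. A minimal sketch, assuming NumPy's default_rng with an arbitrary seed of 0 (an assumption, not part of the example above), makes the data reproducible and computes the absolute z-scores for all columns at once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed 0 is an arbitrary, assumed choice
df = pd.DataFrame(rng.integers(1, 20, size=(100, 5)), columns=list("abcde"))

# Inject the same outliers as in the example
df.loc[0, "a"] = 150
df.loc[1, "a"] = 120
df.loc[2, "a"] = 200
df.loc[2, "b"] = 190
df.loc[99, "e"] = 150

# Absolute z-scores for every column at once (ddof=0 matches np.std's default)
zscores = (df - df.mean()).abs() / df.std(ddof=0)

# Index labels of the flagged rows, per column that has any
flagged = {col: df.index[zscores[col] > 3].tolist()
           for col in df.columns if (zscores[col] > 3).any()}
print(flagged)
```

With the injected values this far from the 1 to 19 background range, only rows 0, 1 and 2 in column a, row 2 in column b and row 99 in column e should be flagged.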

Some thoughts:

I haven’t used the function extensively, so there may be situations where it breaks down or is impractical, but so far it has served me well and saved me a lot of time
At times, I have had to change the number of lines in my console (Spyder) to be able to see all of the results. You might have to do the same
If you have a large dataset with many outliers, you may find the function to be less useful because there is still some manual labor required after running the function
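
For large datasets, one way to cut down on the manual work might be to collect the outlier rows instead of printing them. The sketch below is a hypothetical variant, not the function above: the name collect_outliers, the skip for constant columns and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def collect_outliers(df, zscorethreshold):
    """Hypothetical variant: return {column: rows with an outlier in that column}."""
    found = {}
    for col in df.columns:
        sd = np.std(df[col])
        if sd == 0:
            continue  # constant column: z-scores are undefined, so skip it
        zscores = (df[col] - np.mean(df[col])).abs() / sd
        rows = df.loc[zscores > zscorethreshold]
        if not rows.empty:
            found[col] = rows
    return found


# Deterministic toy data: values 1..19 repeated, plus two injected outliers
base = np.tile(np.arange(1, 20), 5)  # 95 values per column
df = pd.DataFrame({"a": base.copy(), "b": base[::-1].copy()})
df.loc[0, "a"] = 150
df.loc[94, "b"] = 190

found = collect_outliers(df, 3)

# One line per column instead of a full printed report
summary = pd.Series({col: len(rows) for col, rows in found.items()},
                    name="n_outliers")
print(summary)
```

Returning a dictionary keeps the full rows available for further filtering, while the summary gives a quick count per column before deciding which ones to inspect.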


Written by Dan Kirk