In this first post, we’re going to look at some techniques to determine if your data has outliers present. We’re going to go through this exercise in R (for the stats folks out there)!

rm(list = ls())
library(outliers)
library(ggplot2)
library(mosaic)
library(stats)
set.seed(1)
data(package = .packages(all.available = TRUE))
data(gold, package='forecast')
plot.ts(gold, main = "Gold Prices Data")
favstats(gold)
A data.frame: 1 × 9
minQ1medianQ3maxmeansdnmissing
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
285337.6625403.225443.675593.7392.533356.60597107434

png

Hmm, I guess this looks semi-normal with some skewness to the right. Just to be double sure, let’s see some normal data…

# A box-and-whisker plot should show any outliers clearly
df <- data.frame(Data=as.vector(gold))
ggplot(df, aes(y=Data)) + geom_boxplot()

png

That upper whisker is not indicating a clear outlier, which would be a dot past the end of the whisker, but my gut is still telling me that the spike to nearly 600 in the above timeseries plot is an outlier. One great method of finding outliers is using a grubbs test! The only problem is that the grubbs test expects the data to be normally distributed. Let’s check out if our data is close to a normal distribution

As a quick asside, my use of the ECDF function is inspired by a post from eric-mjl. Join his mailing list if you can!

# What the heck does normal data look like
random_normal_data <- rnorm(100)

favstats(random_normal_data)
plot.ecdf(random_normal_data, main = 'ecdf(x) of Normally Distributed Data')
qqnorm(random_normal_data, main = 'Quantile-Quantile (QQ) plot of Normally Distributed Data')
qqline(random_normal_data, col = "red", lwd = 5)
legend("bottomright", c("Data Points", "Theoretical Normal"), fill=c("black", "red"))
A data.frame: 1 × 9
minQ1medianQ3maxmeansdnmissing
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
-2.786015-0.5789328-0.0034050820.70911812.6748710.031643661.0813411000

png

png

favstats(gold)

# Does our data look normally distributed?
plot.ecdf(gold, main = 'ecdf(x) of Gold Prices Data')
qqnorm(gold, main = 'Quantile-Quantile (QQ) plot of Gold Prices Data')
qqline(gold, col = "red", lwd = 5)
legend("bottomright", c("Data Points", "Theoretical Normal"), fill=c("black", "red"))
A data.frame: 1 × 9
minQ1medianQ3maxmeansdnmissing
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
285337.6625403.225443.675593.7392.533356.60597107434

png

png

It’s important to note that a grubbs test expects normality, and as the data isn’t strictly normal there can be some concerns about the validity of the results.

grubbs.test(gold, type = 10)
	Grubbs test for one outlier

data:  gold
G = 3.55381, U = 0.98822, p-value = 0.1965
alternative hypothesis: highest value 593.7 is an outlier

Grubbs Results

So it looks like we do have an outlier, with a p-value that is approaching significance of 0.1965. The important thing here is to ask ourselves if that outlier is an error in the data, or if it is valuable data that needs to be included in our models to make them more realistic to real-world. For the purposes of this example, we’re going to simply assume that there was a one-day run on gold that is not likely to occur again and is not representative of our dataset as a whole.

# Let's go ahead and remove that max value we think is the outlier
new_gold <- gold[-which.max(gold)]

# Let's compare the old summary to the new summary
favstats(gold)
favstats(new_gold)

# Lastly, let's take a look at the timeseries plot again
plot.ts(new_gold, main = "Gold Prices w/o Outlier Data")
A data.frame: 1 × 9
minQ1medianQ3maxmeansdnmissing
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
285337.6625403.225443.675593.7392.533356.60597107434
A data.frame: 1 × 9
minQ1medianQ3maxmeansdnmissing
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
285337.6403.1443.6502.75392.345956.29777107334

png

grubbs.test(new_gold, type = 10)
	Grubbs test for one outlier

data:  new_gold
G = 1.96107, U = 0.99641, p-value = 1
alternative hypothesis: highest value 502.75 is an outlier

Findings

With the max value of the data set removed, we can see that the timeseries plot no longer includes a spike right at the top. The results from the grubbs test on the new dataset omiting the max value n olonger shows a suspected outlier at the upper threashold of the data. I’d say our work here is done!