<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://www.kuhl.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.kuhl.dev/" rel="alternate" type="text/html" /><updated>2024-04-02T02:11:38+00:00</updated><id>https://www.kuhl.dev/feed.xml</id><title type="html">Ryan Kuhl’s Dev Blog</title><subtitle>Dev blog and tutorials for data science, web programming, python, and plenty else!</subtitle><author><name>Ryan Kuhl</name></author><entry><title type="html">Finding Outliers in Your Data</title><link href="https://www.kuhl.dev/2020/09/14/finding-outliers-in-your-data.html" rel="alternate" type="text/html" title="Finding Outliers in Your Data" /><published>2020-09-14T09:28:57+00:00</published><updated>2020-09-14T09:28:57+00:00</updated><id>https://www.kuhl.dev/2020/09/14/finding-outliers-in-your-data</id><content type="html" xml:base="https://www.kuhl.dev/2020/09/14/finding-outliers-in-your-data.html"><![CDATA[<p>In this first post, we’re going to look at some techniques to determine if your
data has outliers present. We’re going to go through this exercise in R
(for the stats folks out there)!</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rm</span><span class="p">(</span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ls</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">outliers</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">mosaic</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stats</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.packages</span><span class="p">(</span><span class="n">all.available</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="o">=</span><span class="s1">'forecast'</span><span class="p">)</span><span class="w">
</span><span class="n">plot.ts</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gold Prices Data"</span><span class="p">)</span><span class="w">
</span><span class="n">favstats</span><span class="p">(</span><span class="n">gold</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
<caption>A data.frame: 1 × 9</caption>
<thead>
	<tr><th></th><th scope="col">min</th><th scope="col">Q1</th><th scope="col">median</th><th scope="col">Q3</th><th scope="col">max</th><th scope="col">mean</th><th scope="col">sd</th><th scope="col">n</th><th scope="col">missing</th></tr>
	<tr><th></th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;int&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
	<tr><th scope="row"></th><td>285</td><td>337.6625</td><td>403.225</td><td>443.675</td><td>593.7</td><td>392.5333</td><td>56.60597</td><td>1074</td><td>34</td></tr>
</tbody>
</table>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_2_1.png" alt="png" /></p>

<p>Hmm, I guess this looks semi-normal with some skewness to the right. Before checking that properly, let’s see whether a box-and-whisker plot flags anything.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># A box-and-whisker plot should show any outliers clearly</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Data</span><span class="o">=</span><span class="n">as.vector</span><span class="p">(</span><span class="n">gold</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Data</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_boxplot</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_4_1.png" alt="png" /></p>
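
<p>For reference, <code>geom_boxplot()</code> by default extends each whisker to the most extreme point within 1.5 times the interquartile range (IQR) of the box, and draws anything beyond that as a dot. Here is a minimal sketch of that rule, using the quartiles from the <code>favstats</code> output above (the variable names are just for illustration):</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Quartiles and max copied from the favstats(gold) output above
q1 &lt;- 337.6625
q3 &lt;- 443.675
gold_max &lt;- 593.7

# Upper fence: points above this would be drawn as outlier dots
iqr &lt;- q3 - q1
upper_fence &lt;- q3 + 1.5 * iqr

upper_fence              # about 602.7
gold_max &gt; upper_fence   # FALSE: the max sits inside the fence
</code></pre></div></div>

<p>So the spike to 593.7 falls just short of the fence, which is worth keeping in mind when reading the plot.</p>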

<p>The upper whisker shows no clear outlier, which would appear as a dot past the end of the whisker, but my gut is still telling me that the spike to nearly 600 in the time series plot above is an outlier. One great method for finding outliers is Grubbs’ test! The only problem is that Grubbs’ test expects the data to be normally distributed. Let’s check whether our data is close to a normal distribution.</p>

<p>As a quick aside, my use of the
<a href="https://www.statisticshowto.com/empirical-distribution-function/">ECDF function</a> is
inspired by a post from <a href="http://ericmjl.com/blog/2018/7/14/ecdfs/">eric-mjl</a>. Join his
mailing list if you can!</p>
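
<p>If the ECDF is new to you: for any value <code>x</code>, it gives the fraction of observations less than or equal to <code>x</code>. Here is a tiny sketch with a made-up vector (the numbers and names are illustrative only, not from the gold data):</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>toy &lt;- c(1, 2, 2, 3, 5)
toy_ecdf &lt;- ecdf(toy)   # ecdf() returns a function

toy_ecdf(2)   # 0.6: three of the five values are &lt;= 2
toy_ecdf(4)   # 0.8: four of the five values are &lt;= 4
</code></pre></div></div>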

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># What the heck does normal data look like</span><span class="w">
</span><span class="n">random_normal_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">

</span><span class="n">favstats</span><span class="p">(</span><span class="n">random_normal_data</span><span class="p">)</span><span class="w">
</span><span class="n">plot.ecdf</span><span class="p">(</span><span class="n">random_normal_data</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'ecdf(x) of Normally Distributed Data'</span><span class="p">)</span><span class="w">
</span><span class="n">qqnorm</span><span class="p">(</span><span class="n">random_normal_data</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Quantile-Quantile (QQ) plot of Normally Distributed Data'</span><span class="p">)</span><span class="w">
</span><span class="n">qqline</span><span class="p">(</span><span class="n">random_normal_data</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Data Points"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Theoretical Normal"</span><span class="p">),</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<table>
<caption>A data.frame: 1 × 9</caption>
<thead>
	<tr><th></th><th scope="col">min</th><th scope="col">Q1</th><th scope="col">median</th><th scope="col">Q3</th><th scope="col">max</th><th scope="col">mean</th><th scope="col">sd</th><th scope="col">n</th><th scope="col">missing</th></tr>
	<tr><th></th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;int&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
	<tr><th scope="row"></th><td>-2.786015</td><td>-0.5789328</td><td>-0.003405082</td><td>0.7091181</td><td>2.674871</td><td>0.03164366</td><td>1.081341</td><td>100</td><td>0</td></tr>
</tbody>
</table>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_6_1.png" alt="png" /></p>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_6_2.png" alt="png" /></p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">favstats</span><span class="p">(</span><span class="n">gold</span><span class="p">)</span><span class="w">

</span><span class="c1"># Does our data look normally distributed?</span><span class="w">
</span><span class="n">plot.ecdf</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'ecdf(x) of Gold Prices Data'</span><span class="p">)</span><span class="w">
</span><span class="n">qqnorm</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Quantile-Quantile (QQ) plot of Gold Prices Data'</span><span class="p">)</span><span class="w">
</span><span class="n">qqline</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Data Points"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Theoretical Normal"</span><span class="p">),</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<table>
<caption>A data.frame: 1 × 9</caption>
<thead>
	<tr><th></th><th scope="col">min</th><th scope="col">Q1</th><th scope="col">median</th><th scope="col">Q3</th><th scope="col">max</th><th scope="col">mean</th><th scope="col">sd</th><th scope="col">n</th><th scope="col">missing</th></tr>
	<tr><th></th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;int&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
	<tr><th scope="row"></th><td>285</td><td>337.6625</td><td>403.225</td><td>443.675</td><td>593.7</td><td>392.5333</td><td>56.60597</td><td>1074</td><td>34</td></tr>
</tbody>
</table>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_7_1.png" alt="png" /></p>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_7_2.png" alt="png" /></p>
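
<p>If you want a number to go alongside the plots, one option is a Shapiro-Wilk test via <code>shapiro.test()</code> from the <code>stats</code> package. This is not part of the analysis above, just a sketch on simulated data with made-up variable names; a small p-value suggests the sample is not normally distributed.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
normal_sample &lt;- rnorm(200)   # drawn from a normal distribution
skewed_sample &lt;- rexp(200)    # drawn from a right-skewed distribution

normal_p &lt;- shapiro.test(normal_sample)$p.value
skewed_p &lt;- shapiro.test(skewed_sample)$p.value

# Expect a much smaller p-value for the skewed sample
c(normal = normal_p, skewed = skewed_p)
</code></pre></div></div>

<p><code>shapiro.test()</code> allows missing values and accepts between 3 and 5000 non-missing observations, so it could be pointed at <code>gold</code> as well. With over a thousand points, though, even mild departures from normality tend to produce small p-values, so treat it as a complement to the QQ plot rather than a replacement.</p>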

<p>It’s important to note that Grubbs’ test expects normality, and since the data isn’t strictly normal, there can be some concerns about the validity of the results.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grubbs.test</span><span class="p">(</span><span class="n">gold</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	Grubbs test for one outlier

data:  gold
G = 3.55381, U = 0.98822, p-value = 0.1965
alternative hypothesis: highest value 593.7 is an outlier
</code></pre></div></div>
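
<p>To demystify the <code>G</code> value: it is the distance between the most extreme value and the sample mean, measured in standard deviations. Here is a quick sketch using the summary statistics from the <code>favstats</code> output (the variable names are illustrative):</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Values copied from the favstats(gold) output
gold_max  &lt;- 593.7
gold_mean &lt;- 392.5333
gold_sd   &lt;- 56.60597

# Grubbs statistic for the highest value
G &lt;- (gold_max - gold_mean) / gold_sd
round(G, 2)   # 3.55, in line with the G reported by grubbs.test
</code></pre></div></div>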

<h3 id="grubbs-results">Grubbs Results</h3>

<p>Grubbs’ test picks out the highest value, 593.7, as the candidate outlier, but with a p-value of 0.1965 the result is not statistically significant at the usual 0.05 level, so the evidence is suggestive at best. The important thing here is to ask ourselves whether that value is an error in the data, or valuable data that needs to stay in our models to keep them representative of the real world. For the purposes of this example, we’re going to simply assume that there was a one-day run on gold that is not likely to occur again and is not representative of our dataset as a whole.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's go ahead and remove that max value we think is the outlier</span><span class="w">
</span><span class="n">new_gold</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gold</span><span class="p">[</span><span class="o">-</span><span class="n">which.max</span><span class="p">(</span><span class="n">gold</span><span class="p">)]</span><span class="w">

</span><span class="c1"># Let's compare the old summary to the new summary</span><span class="w">
</span><span class="n">favstats</span><span class="p">(</span><span class="n">gold</span><span class="p">)</span><span class="w">
</span><span class="n">favstats</span><span class="p">(</span><span class="n">new_gold</span><span class="p">)</span><span class="w">

</span><span class="c1"># Lastly, let's take a look at the timeseries plot again</span><span class="w">
</span><span class="n">plot.ts</span><span class="p">(</span><span class="n">new_gold</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gold Prices w/o Outlier Data"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
<caption>A data.frame: 1 × 9</caption>
<thead>
	<tr><th></th><th scope="col">min</th><th scope="col">Q1</th><th scope="col">median</th><th scope="col">Q3</th><th scope="col">max</th><th scope="col">mean</th><th scope="col">sd</th><th scope="col">n</th><th scope="col">missing</th></tr>
	<tr><th></th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;int&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
	<tr><th scope="row"></th><td>285</td><td>337.6625</td><td>403.225</td><td>443.675</td><td>593.7</td><td>392.5333</td><td>56.60597</td><td>1074</td><td>34</td></tr>
</tbody>
</table>

<table>
<caption>A data.frame: 1 × 9</caption>
<thead>
	<tr><th></th><th scope="col">min</th><th scope="col">Q1</th><th scope="col">median</th><th scope="col">Q3</th><th scope="col">max</th><th scope="col">mean</th><th scope="col">sd</th><th scope="col">n</th><th scope="col">missing</th></tr>
	<tr><th></th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;dbl&gt;</th><th scope="col">&lt;int&gt;</th><th scope="col">&lt;int&gt;</th></tr>
</thead>
<tbody>
	<tr><th scope="row"></th><td>285</td><td>337.6</td><td>403.1</td><td>443.6</td><td>502.75</td><td>392.3459</td><td>56.29777</td><td>1073</td><td>34</td></tr>
</tbody>
</table>

<p><img src="/assets/2020-09-13-finding-outliers-in-your-data/output_11_2.png" alt="png" /></p>
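
<p>One thing to be aware of: negative indexing on a <code>ts</code> object returns a plain numeric vector, so the time index is lost and later points shift one step to the left. If you would rather keep the series aligned, an alternative sketch is to blank the value out with <code>NA</code> instead (the toy data and names below are made up for illustration):</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prices &lt;- ts(c(10, 12, 11, 50, 13, 12))

# Approach used above: drop the max, which also drops the ts attributes
dropped &lt;- prices[-which.max(prices)]

# Alternative: keep the slot but mark it as missing
masked &lt;- prices
masked[which.max(masked)] &lt;- NA

is.ts(dropped)   # FALSE
is.ts(masked)    # TRUE
</code></pre></div></div>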

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grubbs.test</span><span class="p">(</span><span class="n">new_gold</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	Grubbs test for one outlier

data:  new_gold
G = 1.96107, U = 0.99641, p-value = 1
alternative hypothesis: highest value 502.75 is an outlier
</code></pre></div></div>

<h3 id="findings">Findings</h3>

<p>With the max value of the data set removed, we can see that the time series plot no longer includes a spike right at the top. Grubbs’ test on the new dataset, with the max value omitted, no longer shows a suspected outlier at the upper end of the data (p-value = 1). I’d say our work here is done!</p>]]></content><author><name>Ryan Kuhl</name></author><category term="data_insights" /><category term="R programming" /><category term="R" /><category term="ourliers" /><summary type="html"><![CDATA[In this first post, we’re going to look at some techniques to determine if your data has outliers present. We’re going to go through this exercise in R (for the stats folks out there)!]]></summary></entry></feed>