Feb 8, 2016

What happens when you have outliers in your data?


In this post I am going to talk briefly about outliers and the effect they can have on your data, with an example of course. Let's start by defining the word "outlier": what is an outlier in math/statistics?

An outlier is basically a number (or data point) in a set of data that is either much smaller or much bigger than most of the other data points.

Let's go through a practical example in order to understand the implications of having an outlier within your data set.


Say we have a sample data set like the following:



For this data set I can easily calculate the mean which is 4.3:



I can also find the median, which represents the middle value of the sorted data. In our case there is an even number of data points, so there are two middle values; I can average them and get a median of 4.5.
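Here is a minimal Python sketch of that median step. The original data listing is not reproduced in the text, so the ten values below are a hypothetical data set, picked only because they give the mean of 4.3 and the median of 4.5 quoted here. The `median` helper is my own illustration, not a library function.

```python
# Hypothetical sample: NOT the post's original values, only chosen so that
# the mean is 4.3 and the median is 4.5, as quoted in the text.
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]


def median(values):
    """Return the middle value; average the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(data))  # 4.5, the average of the two middle values 4 and 5
```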



And I can also figure out the mode, which is 5, since this is the most frequent value in the distribution.



Finally, let's calculate the standard deviation, which tells me how much my data are spread out around the mean (remember that the standard deviation is the square root of the variance).
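To make the "square root of the variance" part concrete, here is a small Python sketch. Again, the data set is a hypothetical one chosen to match the mean of 4.3 quoted above, not the original values. The post does not say whether the population or the sample formula was used, so the hypothetical `std_dev` helper supports both (dividing by n or by n - 1).

```python
import math

# Hypothetical sample: NOT the post's original values, only chosen to
# match the quoted mean of 4.3.
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]


def std_dev(values, sample=False):
    """Standard deviation as the square root of the variance."""
    n = len(values)
    mean = sum(values) / n
    # Variance: the average squared distance from the mean.
    squared_diffs = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n  # sample vs. population convention
    variance = sum(squared_diffs) / divisor
    return math.sqrt(variance)


print(round(std_dev(data), 2))               # population standard deviation
print(round(std_dev(data, sample=True), 2))  # sample standard deviation
```

Whichever convention you pick, it is worth stating it next to the number, since the two give slightly different results on small data sets.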



Cool, we now know the mean, median, mode and standard deviation for our sample data set:



All right, let's now make a change to our data set. Imagine removing the last data point, 6, and replacing it with a much bigger value like 600... yep, an outlier.


Now see what happens when we calculate the mean, median, mode and standard deviation again. The new mean is much higher, 63.7! As expected, the standard deviation is much higher too. On the other hand, the median and mode remain exactly the same.
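The jump in the mean is easy to check by hand: swapping 6 for 600 adds 594 to the sum, and the quoted means (4.3 before, 63.7 after) differ by 59.4, which is 594 divided by ten data points. Below is a rough Python sketch of the before/after comparison using the standard `statistics` module. The ten values are a hypothetical data set chosen to reproduce the quoted mean, median and mode, not the original listing, and `summarize` is just a helper name I made up. I use the population standard deviation (`pstdev`) here as an assumption.

```python
import statistics

# Hypothetical sample: NOT the post's original values, only chosen to give
# the quoted mean (4.3), median (4.5) and mode (5).
original = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]

# Replace the last data point (6) with the outlier 600.
with_outlier = original[:-1] + [600]


def summarize(values):
    """Collect the four statistics discussed in the post."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.pstdev(values),  # population convention (assumption)
    }


for name, values in [("original", original), ("with outlier", with_outlier)]:
    s = summarize(values)
    print(f"{name:>12}: mean={s['mean']:.1f} median={s['median']} "
          f"mode={s['mode']} stdev={s['stdev']:.1f}")
```

The mean and standard deviation move a lot between the two rows, while the median and mode do not move at all.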



So, this is what happens if you have outliers: they pull the mean, and anything built on it such as the standard deviation, toward themselves. What can you do then if you need to get a measure of central tendency?

How to deal with outliers really depends on the specific situation. What is certain, though, is that many statistical measures, like means, standard deviations and correlations, can be strongly influenced by outliers, and you might end up with an incorrect analysis. Generally you can follow two different strategies:

  1. Remove the outliers and analyse your data set without them. In that case the mean is no longer distorted, and you can use it as a measure of central tendency.
  2. Do not use the mean. In this case you keep the outliers, but since they would shift the mean a lot, you can instead use other measures of central tendency like the median or the mode.

In either case, I think it's important to report in your analysis that you identified outliers and what you decided to do with them. Why did you drop them? Why did those values end up out there? Was it likely a data entry mistake? What were your assumptions?
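As a rough illustration of the two strategies, here is a Python sketch. The post does not say how to decide what counts as an outlier, so for strategy 1 I am assuming one common convention, the 1.5 x IQR rule (values more than 1.5 interquartile ranges outside the quartiles are dropped). The data set is hypothetical, and `remove_outliers_iqr` is a made-up helper name; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Hypothetical sample with the outlier 600 in place of the last value.
# These are NOT the post's original values.
with_outlier = [2, 3, 3, 4, 4, 5, 5, 5, 6, 600]


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the quartiles (assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if low <= x <= high]


# Strategy 1: drop the outliers, then the mean is usable again.
cleaned = remove_outliers_iqr(with_outlier)
dropped = [x for x in with_outlier if x not in cleaned]
print("dropped:", dropped)
print("mean without outliers:", round(statistics.mean(cleaned), 2))

# Strategy 2: keep everything, but report the median instead of the mean.
print("median with outliers:", statistics.median(with_outlier))
```

Printing the dropped values is a cheap way to keep a record of what was removed, which helps when you report the decision in your analysis.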
