Feb 8, 2016

What happens when you have outliers in your data?


In this post I am going to talk briefly about outliers and the effect they might have on your data, with an example of course. Let's start by defining the word "outlier": what is an outlier in math/statistics?

An outlier is basically a number (or data point) in a set of data that is either much smaller or much bigger than most of the other data points.

Let's go through a practical example in order to understand the implications of having an outlier within your data set.


Say we have a sample data set like the following:



For this data set I can easily calculate the mean which is 4.3:



I can also find the median which represents the middle value of the distribution. In our case, since there are two middle values I can average them and get a median of 4.5.



And I can also figure out the mode, which is 5, since this is the most frequent value in the distribution.



Finally, let's calculate the standard deviation, which tells me how much my data are spread out around the mean (remember that the standard deviation is the square root of the variance).
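If you want to follow along in code, here is a minimal Python sketch of these four calculations. Note that the list below is a hypothetical stand-in, not necessarily the original sample: I picked ten values that give the same mean (4.3), median (4.5) and mode (5) quoted above. I also assume the population standard deviation here; the sample version (`statistics.stdev`) would give a slightly larger number.

```python
import math
import statistics

# Hypothetical stand-in data set (an assumption): ten values chosen so that
# the mean, median and mode match the numbers quoted in the post.
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]

mean = statistics.mean(data)      # 4.3
median = statistics.median(data)  # 4.5, the average of the two middle values 4 and 5
mode = statistics.mode(data)      # 5, the most frequent value

# Population variance and standard deviation (assumption: population, not sample).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", round(std_dev, 2))

# The standard deviation is the square root of the variance.
print(math.isclose(std_dev, math.sqrt(variance)))
```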



Cool, we now know the mean, median, mode and standard deviation for our sample data set:



All right, let's now make a change to our data set. Imagine that we remove the last data point, 6, and replace it with a much bigger value like 600... yep, an outlier.


Now see what happens when we calculate the mean, median, mode and standard deviation again. The new mean is much higher: 63.7! As expected, the standard deviation is much higher too. On the other hand, the median and mode remain exactly the same.
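The before/after comparison can be sketched in a few lines of Python. Again, the ten values are a hypothetical stand-in for the sample data (chosen to reproduce the quoted mean, median and mode), the `summarize` helper is just a name I made up for this example, and I assume the population standard deviation.

```python
import statistics

# Hypothetical stand-in for the sample data (an assumption, see above).
original = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]

# Replace the last data point (6) with the outlier 600.
with_outlier = original[:-1] + [600]


def summarize(values):
    """Return the four summary statistics discussed in the post."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


before = summarize(original)
after = summarize(with_outlier)

# Print each statistic before and after introducing the outlier.
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean jumps from 4.3 to 63.7 because a single value adds 594 to the total, which is then spread over only ten points. The median and mode only depend on the middle and most frequent values, so they do not move.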



So, this is what happens if you have outliers. Outliers pull the mean (and the standard deviation) toward them, while the median and the mode barely move. What can you do then if you need to get a measure of central tendency?

How to deal with outliers really depends on each specific situation. What is certain, anyway, is that many statistical measures like means, standard deviations, correlations, etc. can be strongly influenced by outliers, and you might end up with an incorrect analysis. Generally you can follow two different strategies:

  1. Remove the outliers, and analyse your data set without them. In this case the mean is no longer affected, and you can use it as a measure of central tendency.
  2. Do not use the mean. In this case you keep the outliers, but since they would change the mean a lot, you use other measures of central tendency instead, like the median or the mode.

In either case, I think it's important to report in your analysis that you identified outliers and what you decided to do with them. Why did you drop them? Why did those values happen to be out there? Was it likely to be a data entry mistake? What were your assumptions?
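Here is a rough Python sketch of both strategies. A couple of assumptions to flag: the data set is a hypothetical stand-in for the modified sample (ten values ending in 600), and the post does not say how to detect outliers, so I use the common 1.5 × IQR rule purely as one possible choice. Other rules (z-scores, domain knowledge, etc.) may fit your situation better.

```python
import statistics

# Hypothetical stand-in for the data set containing the outlier (an assumption).
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 600]

# Strategy 1: remove the outliers, then use the mean.
# Assumption: flag outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles; requires Python 3.8+
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < low or x > high]
kept = [x for x in data if low <= x <= high]

print("outliers found:", outliers)
print("mean without outliers:", round(statistics.mean(kept), 2))

# Strategy 2: keep the outliers, but report the median or mode instead of the mean.
print("mean with outliers:", statistics.mean(data))
print("median with outliers:", statistics.median(data))
print("mode with outliers:", statistics.mode(data))
```

Whichever route you take, printing the `outliers` list is a handy way to document exactly which values you flagged and why.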
