In this post I am going to talk briefly about outliers and the effect they might have on your data, with an example of course. Let's start by defining the word "outlier": what is an outlier in math/statistics?
An outlier is basically a number (or data point) in a set of data that is either way smaller or way bigger than most of the other data points.
Let's go through a practical example in order to understand the implications of having an outlier within your data set.
Say we have a sample data set like the following:
For this data set I can easily calculate the mean which is 4.3:
I can also find the median which represents the middle value of the distribution. In our case, since there are two middle values I can average them and get a median of 4.5.
And I can also figure out the mode, which is 5 since this is the most frequent value in the distribution.
Finally, let's calculate the standard deviation, which tells me how much my data are spread out around the mean (remember that the standard deviation is the square root of the variance).
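As a reminder of how the two quantities relate, here is the formula written out. The symbols are my own notation, not something from the data above: N is the number of data points, x_i is each value and \mu is the mean. This is the population form; the sample version divides by N - 1 instead of N.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```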
Cool, we now know the mean, median, mode and standard deviation for our sample data set:
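If you want to reproduce these numbers yourself, here is a minimal sketch using Python's standard `statistics` module. The list `data` is a hypothetical data set I made up so that its mean, median and mode match the values quoted above; it is not necessarily the original one. I also assume the population standard deviation (`pstdev`); use `stdev` if you want the sample version.

```python
import statistics

# Hypothetical data set (an assumption, not the original data),
# chosen so that mean, median and mode match the values quoted in the text.
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]

mean = statistics.mean(data)       # arithmetic mean
median = statistics.median(data)   # average of the two middle values (even count)
mode = statistics.mode(data)       # most frequent value
std_dev = statistics.pstdev(data)  # population standard deviation (assumed variant)

print(f"mean={mean}, median={median}, mode={mode}, std={std_dev:.2f}")
```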
All right, let's now make a change to our data set. Imagine removing the last data point, 6, and replacing it with a much bigger value like 600... yep, an outlier.
Now see what happens when we calculate the mean, median, mode and standard deviation again. The new mean is much higher, 63.7! As expected, the standard deviation is much higher too. On the other hand, the median and the mode remain exactly the same.
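Here is a small sketch of that before/after comparison. Again, the list `original` is a hypothetical data set picked to match the statistics in this post, and `summarize` is just a helper name I made up; the population standard deviation is an assumption too.

```python
import statistics

# Hypothetical data (an assumption): ten values whose last point is 6.
original = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6]
# Replace the last data point with 600 to create an outlier.
with_outlier = original[:-1] + [600]

def summarize(values):
    """Return the four summary statistics discussed in the post."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.pstdev(values),  # population std (assumed variant)
    }

for label, values in (("original", original), ("with outlier", with_outlier)):
    print(label, summarize(values))
```

The mean and the standard deviation jump, while the median and the mode stay put, because only one value changed and it sits at the far end of the sorted data.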
So, this is what happens when you have outliers: they pull the mean (and the standard deviation) toward the extreme values, while the median and the mode barely move. What can you do then if you need a measure of central tendency?
How to deal with outliers really depends on each specific situation. What is certain is that many statistical measures, like means, standard deviations and correlations, can be strongly influenced by outliers, and you might end up with an incorrect analysis. Generally you can follow two different strategies:
- Remove the outliers and analyse your data set without them. In that case the mean is no longer distorted and you can use it as a measure of central tendency.
- Do not use the mean. In this case you keep the outliers, but since they would change the mean a lot, you might instead use other measures of central tendency like the median or the mode.
In either case, I think it's important to report in your analysis that you identified outliers and what you decided to do with them. Why did you drop them? Why did those values end up out there? Was it likely a data entry mistake? What were your assumptions?
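The two strategies can be sketched roughly like this. Note that this post does not prescribe any particular way of flagging outliers; the 1.5 x IQR fence used here is just one common convention that I am assuming for illustration, and the data set and the helper name `remove_outliers_iqr` are hypothetical as well.

```python
import statistics

# Hypothetical data with one extreme value (an assumption for illustration).
data = [2, 3, 3, 4, 4, 5, 5, 5, 6, 600]

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Strategy 1: drop the outliers, then use the mean.
cleaned = remove_outliers_iqr(data)
mean_without_outliers = statistics.mean(cleaned)

# Strategy 2: keep every value, but report the median instead of the mean.
median_with_outliers = statistics.median(data)

print("removed:", [v for v in data if v not in cleaned])
print("mean without outliers:", round(mean_without_outliers, 2))
print("median with outliers:", median_with_outliers)
```

Whichever route you take, printing what was removed (as above) makes it easy to report the decision in your analysis.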