Sep 18, 2016

Analyzing Stack Overflow questions and tags with the StackLite dataset

The folks at Stack Overflow have recently released a very interesting dataset containing the entire history of questions asked by users since the beginning of the site, back in 2008. It's called StackLite, and it contains the following data for each Stack Overflow question:
  • Question ID
  • Creation Date
  • Closed Date (when applicable)
  • Deletion Date (when applicable)
  • Score
  • Owner user ID
  • Number of answers
  • Tags 

As David Robinson explains in his introductory post, the StackLite dataset is designed to be easy to read and analyze with any programming language or statistical tool. A fantastic resource if you are a data analyst/scientist and want to crunch some real data! 

I thought I'd give it a go and perform some exploratory analysis using R. More specifically, I am going to answer the following questions:
  • What are the most popular tags?
  • How many questions have more than one tag?
  • What is the overall closure rate for the site and which tags present higher values?
  • How long does it take, on average, to close a question?
  • Which tags tend to have higher/lower score?
  • And in particular: how do data science languages perform on the above questions?
Analyzing the StackLite dataset is a great opportunity to practice with the dplyr library for data manipulation, and to get familiar with the pipe operator %>%, which lets you express multiple complex operations in a way that is clear to read and understand. As I mentioned, the dataset includes questions since 2008 and is pretty huge; for memory reasons I have answered most questions using a subset of the data containing only this year's questions (Jan-Aug 2016), but of course you could replicate the analysis on the whole dataset.
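If you have never used the pipe, here is a quick refresher on R's built-in mtcars data (a toy example, not part of the StackLite analysis): `x %>% f(y)` is just `f(x, y)`, so a chain of dplyr verbs reads top to bottom instead of inside out.

```r
library(dplyr)

# Without the pipe: nested calls must be read inside out
by_cyl_nested <- arrange(summarize(group_by(mtcars, cyl), avg_mpg = mean(mpg)),
                         desc(avg_mpg))

# With the pipe: the same operations read top to bottom
by_cyl_piped <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

identical(by_cyl_nested, by_cyl_piped)  # TRUE: the two forms are equivalent
```

The piped version is what you will see throughout this post.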

Enough with the introduction. Let's show some code and results.


Read the data into R and prepare it for analysis

You can download the data from the StackLite repo here. The data is available as two csv.gz files:
  • "questions.csv.gz": contains all the info about each question except for its tags.
  • "question_tags.csv.gz": associates each question ID with its corresponding tag(s).
Once you have downloaded the two files and placed them in your working directory, you can read them into R as follows:

library(readr)

questions <- read_csv("questions.csv.gz")
question_tags <- read_csv("question_tags.csv.gz")

head(questions)
Source: local data frame [6 x 7]

     Id        CreationDate          ClosedDate        DeletionDate Score OwnerUserId AnswerCount
  (int)              (time)              (time)              (time) (int)       (int)       (int)
1     1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1          NA           0
2     4 2008-07-31 21:42:52                <NA>                <NA>   418           8          13
3     6 2008-07-31 22:08:08                <NA>                <NA>   188           9           5
4     8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42          NA           8
5     9 2008-07-31 23:40:59                <NA>                <NA>  1306           1          57
6    11 2008-07-31 23:55:37                <NA>                <NA>  1062           1          33


head(question_tags)
Source: local data frame [6 x 2]

     Id             Tag
  (int)           (chr)
1     1            data
2     4              c#
3     4        winforms
4     4 type-conversion
5     4         decimal
6     4         opacity


Both datasets are pretty huge: 15 million rows in the first and 46 million in the second. I am going to create a subset containing only questions created in 2016, which I will use for most of the analysis. Then, very importantly, I will merge it with the "question_tags" dataset using the Id variable, so that tags can be related to the other question variables. Let's also load the dplyr library, and lubridate to work with dates.

library(dplyr)
library(lubridate)

questions_2016 <- filter(questions, year(CreationDate) == 2016)

# Merge questions dataset with question_tags (this duplicates questions rows where there are multiple tags)
merged_df <- left_join(questions_2016, question_tags, by = "Id")


What are the most popular tags?

OK, let's start enjoying the dplyr package. Here I am going to sort tags by the number of questions they were used in, and also calculate their % share of the total number of questions asked.


pop_tags_2016 <- merged_df %>%
  select(Id, Tag) %>%
  count(Tag, sort = TRUE) %>%
  mutate(freq = paste0(round(100 * n / sum(n), 2), "%"))

View(pop_tags_2016)


Notice in the image above that I've also generated a data frame with popular tags for 2014 and 2015 to be able to make a comparison.
  • JavaScript is the tag with the most questions asked by Stack Overflow users in 2016, followed by Java, Android and PHP. The most popular tags have stayed pretty much the same since 2014.
  • On the other hand, we can notice that the tag share is quite fragmented: the top 10 tags account for just over 20% of total questions. This is to be expected, given that users can place any tag they like.
  • R is in 17th place in 2016 with 41050 questions. Notice that in 2014 it was at position 22. A pretty good result!
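By the way, the fragmentation claim above is easy to verify yourself by summing the counts of the top tags in pop_tags_2016. Here is a minimal sketch on a made-up tag-count data frame (the tag names and counts are invented; with the real data you would pass pop_tags_2016 instead of the toy one):

```r
# Toy stand-in for pop_tags_2016: counts for a few top tags,
# plus an aggregated "other" bucket at the end (made-up numbers)
toy_tags <- data.frame(
  Tag = c("javascript", "java", "php", "r", "other"),
  n   = c(300, 250, 200, 50, 1200),
  stringsAsFactors = FALSE
)

# Share of all question-tag pairs captured by the top k tags
top_share <- function(df, k) sum(head(df$n, k)) / sum(df$n)

top_share(toy_tags, 3)  # 0.375 on this toy data
```

On the real pop_tags_2016, top_share(pop_tags_2016, 10) would give the top-10 share directly.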
It's interesting to see how data science languages perform. In the code below I create a vector containing some of the most used languages in data science, and then use this vector to filter the data frame of popular tags. Using pipes, I also add some ggplot code to create a bar chart.

ds_tags<-c("r","python","sas","sql","pandas","excel","matlab")

library(ggplot2)
pop_tags_2016 %>%
  filter(Tag %in% ds_tags) %>%
  ggplot(aes(x = Tag, y = n, fill = Tag)) + geom_bar(stat = "identity")

Most used tags for data science
Python stands out with a much higher % share than the other languages, though of course it is a broader language, also used for server-side web applications.


How many questions have more than one tag?

Here I first need to count the number of distinct tags for each question ID. With the resulting data frame I can analyze the tag distribution with a simple histogram.

distrib_tags_2016 <- merged_df %>%
  select(Id, Tag) %>%
  group_by(Id) %>%
  summarize(n_tags = n()) %>%
  arrange(desc(n_tags))

Let me do a sanity check by searching for the first question ID of the resulting data frame on the Stack Overflow website. Cool, last January Wayne asked a question about the flashlight in Android, and he did use 5 tags.

Stack Overflow question tags

By plotting a histogram you can see that most questions in 2016 had 3 or 2 tags. Of course at least one tag is mandatory, and I guess there is a maximum of 5 tags allowed per question. There are more questions with 5 tags than questions with only 1 tag.

library(ggplot2)
qplot(distrib_tags_2016$n_tags,
      geom="histogram",
      binwidth = 0.5,  
      main = "Tags per question", 
      xlab = "Number of tags", 
      ylab = "Number of questions",
      fill=I("blue"))

Number of tags per question at Stack Overflow
Finally, to answer my original question: how many questions had more than one tag? We can see below that, of the more than 2 million questions asked in 2016, about 87% were categorized with more than one tag.


nrow(distrib_tags_2016)
[1] 2105443

sum(distrib_tags_2016$n_tags!=1)/nrow(distrib_tags_2016)
[1] 0.8718802


What is the overall closure rate for the site and which tags have the highest rate?

According to the Stack Overflow documentation, these are the categories of questions that may be closed by community users:
  • duplicate
  • off topic
  • unclear
  • too broad
  • primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question; users need a certain amount of reputation, expressed in points (more details here).

Calculating the overall website closure rate is easy: just use the original "questions_2016" dataset and count how many questions have the ClosedDate field populated. Over 10% of questions asked in 2016 have been closed so far.

sum(!is.na(questions_2016$ClosedDate))/nrow(questions_2016)

[1] 0.1056053

With a few dplyr commands and the code above, we can get the closure rate by tag. Note that I keep the tags sorted by number of questions.

closed_tags_2016<- merged_df %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(!is.na(.)) / length(.)*100, n_questions=n())) %>%
  arrange(desc(n_questions))


View(closed_tags_2016)

OK, what about the closure rate for data science questions? Which data science language questions are "more likely" to be closed? To answer this, I need both the "questions_2016" and "question_tags" datasets again. That is, I need the merged dataset, filtered by the data science languages vector.

# Subset the merged dataset (questions + tags) to include only data science tags
merged_df_ds <- merged_df %>%
  filter(Tag %in% ds_tags)
# Check that it contains only ds tags
unique(merged_df_ds$Tag)

merged_df_ds %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(!is.na(.)) / length(.)*100, n_questions=n())) %>%
  arrange(desc(n_questions)) %>%
  ggplot( aes(x = Tag, y = close_rate,fill=Tag))+ geom_bar(stat="identity") + ggtitle("Closure rate for data science questions")

Closure rate by tags for Stack Overflow questions

Apparently Matlab questions have the highest closure rate among data science languages. R follows, with nearly 15% of questions closed this year. Note the good performance (I assume a low closure rate is an indicator of relevant, good-quality questions) of Excel, Pandas and SAS, and also of SQL, given the high number of questions asked.

We can also measure the speed at which data science questions are closed. In the following code I compute the average number of hours to closure for each tag and plot it in a bar chart.


merged_df_ds %>%
  filter(!is.na(ClosedDate)) %>%
  # difftime() with explicit units is safer than dividing the raw
  # date difference by 3600, since subtraction picks units automatically
  mutate(difference = as.numeric(difftime(ClosedDate, CreationDate, units = "hours"))) %>%
  select(difference, Tag) %>%
  group_by(Tag) %>%
  summarize(avg_hours = round(mean(difference), 2)) %>%
  arrange(desc(avg_hours)) %>%
  ggplot(aes(x = Tag, y = avg_hours, label = avg_hours, fill = Tag)) + geom_bar(stat = "identity") + ggtitle("Speed (in hours) at which data science questions are closed")

Speed to close questions at Stack Overflow

Which tags tend to have higher/lower score?

Users can either upvote or downvote questions, which means that questions can have a positive or negative score. We can see this clearly by summarizing the variable Score in the questions dataset:

summary(questions_2016$Score)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 -65.0000    0.0000    0.0000    0.0266    1.0000 1067.0000 

With this in mind, let's calculate the average score for each tag (as always I am ordering tags by number of questions made):

score_tags_2016 <- merged_df %>%
  select(Score, Tag) %>%
  group_by(Tag) %>%
  summarize(score_avg = mean(Score), n_questions = n()) %>%
  arrange(desc(n_questions))

# Let's do the same for data science tags
score_tags_2016_ds <- merged_df_ds %>%
  select(Score, Tag) %>%
  group_by(Tag) %>%
  summarize(score_avg = mean(Score), n_questions = n()) %>%
  arrange(desc(n_questions))

Score by tags

Personally, I expected higher average values for the score variable; it looks as though there is a general tendency not to score questions (and, to a lesser extent, to downvote). R questions, however, show a positive score on average.

I am just curious: which was the R question with the highest score in 2016 (so far)?

# Let's actually find the top 5 scored questions
merged_df_ds %>%
  filter(Tag == "r") %>%
  arrange(desc(Score)) %>%
  head(5)

R highest score question Stack Overflow

Quite a technical question... curious what it was about? Find out here.


What's next

This post was not intended to be a comprehensive analysis of Stack Overflow questions, but rather an introduction to what you can explore, and how easily you can manipulate real data, with the dplyr library. To take the StackLite analysis further, it would be interesting to understand:
  • how much the above indicators (number of questions, tags, closure rate, score, etc.) changed over time, ideally yearly/monthly since 2008.
  • the number of answers or the response rate by tag.
  • tags association: which tags are most likely to be placed together in the same question? Some kind of basket analysis.
  • which tags tend to be asked on working days vs weekends or on working hours vs night, etc.
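For the tag-association idea above, a natural starting point would be a self-join of "question_tags" on Id, counting how often two tags appear on the same question. Here is a minimal sketch on made-up data (the toy data frame just mimics the two columns of the real question_tags; with the full dataset you would join question_tags with itself the same way):

```r
library(dplyr)

# Toy stand-in for question_tags: one row per (question, tag) pair
toy_question_tags <- data.frame(
  Id  = c(1, 2, 2, 3, 3, 3),
  Tag = c("r", "r", "ggplot2", "r", "ggplot2", "dplyr"),
  stringsAsFactors = FALSE
)

# Self-join on Id to pair up tags from the same question,
# keep each unordered pair once, then count co-occurrences
tag_pairs <- toy_question_tags %>%
  inner_join(toy_question_tags, by = "Id") %>%
  filter(Tag.x < Tag.y) %>%
  count(Tag.x, Tag.y, sort = TRUE)

tag_pairs  # ggplot2/r co-occurs twice here; the dplyr pairs once each
```

On the real 46-million-row table the self-join would be heavy, so filtering to a subset of tags (or a single year) first would be wise.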