Sep 19, 2016

Analyzing Stack Overflow questions and tags with the StackLite dataset

The guys at Stack Overflow have recently released a very interesting dataset containing the entire history of questions made by users since the beginning of the site, back in 2008. It's called StackLite and it contains, for each Stack Overflow question the following data:
  • Question ID
  • Creation Date
  • Closed Date (when applicable)
  • Deletion Date (when applicable)
  • Score
  • Owner user ID
  • Number of answers
  • Tags 

As David Robinson explains in his introductory post, the Stacklite dataset is designed to be easy to read and analysed with any programming language or statistical tool. A fantastic resource if you are a data analyst/scientist and want to crunch some real data! 

I thought to give it a go and perform some exploratory analysis using R. More specifically, I am going to answer the following business questions:
  • What are the most popular tags?
  • How many questions have more than one tag?
  • What is the overall closure rate for the site and which tags present higher values?
  • How much time it takes, on average, to close a question?
  • Which tags tend to have higher/lower score?
  • And in particular: how data science languages perform on the above questions?
Analyzing the Stacklite dataset is a great occasion to practice the dplyr library for data manipulation. And also familiarize with the pipe %>% concept, which lets you express multiple complex operations in a way that is clear to read and understand. As I mentioned, the dataset includes questions since 2008 and is pretty huge...due to memory reasons I have answered most question using a subset of the data, containing only this year data (Jan-Aug 2016), but of course you could replicate the analysis on the whole dataset.

Enough with the introduction. Let show some code and results.

Read the data into R and prepare it for analysis

You can download the data from the StackLite repo here. The data is available as two csv.gz files:
  • "questions.csv.gz": containing all the info about questions except for tags used.
  • "question_tags.csv.gz": which associates each question ID with its correspondent tag(s).
Once downloaded the two files and placed in your working directory, you can read them into R as follows:


questions <- read_csv("questions.csv.gz")
question_tags <- read_csv("question_tags.csv.gz")

Source: local data frame [6 x 7]

     Id        CreationDate          ClosedDate        DeletionDate Score OwnerUserId AnswerCount
  (int)              (time)              (time)              (time) (int)       (int)       (int)
1     1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1          NA           0
2     4 2008-07-31 21:42:52                <NA>                <NA>   418           8          13
3     6 2008-07-31 22:08:08                <NA>                <NA>   188           9           5
4     8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42          NA           8
5     9 2008-07-31 23:40:59                <NA>                <NA>  1306           1          57
6    11 2008-07-31 23:55:37                <NA>                <NA>  1062           1          33

Source: local data frame [6 x 2]

     Id             Tag
  (int)           (chr)
1     1            data
2     4              c#
3     4        winforms
4     4 type-conversion
5     4         decimal
6     4         opacity

Both datasets are pretty huge: 15 millions of rows the first and 46 millions the second. I am going to create a subset with only questions created in 2016, which I will use to perform most of the analysis. And then, very important, I will merge it with the "question_tags" dataset using the Id variable, in order to be able to relate tags with other question variables. Let also load the dplyr library and lubridate to work with dates.


questions_2016<- filter(questions, year(CreationDate)=="2016")

#Merge questions dataset with question_tag (it duplicates questions df rows where there are multiple tags)

What are the most popular tags?

Ok, let start enjoying the dplyr package. Here I am going to sort tags by number of questions they were categorize in. And also calculate their % share over the total number of questions made.

pop_tags_2016<-merged_df %>%
  select(Id,Tag) %>%
  count(Tag,sort=TRUE) %>%   


Notice in the image above that I've also generated a data frame with popular tags for 2014 and 2015 to be able to make a comparison.
  • Javascript is the language with more questions made by Stack Overflow users in 2016. Follow Java, Android and Php. Most popular tags were pretty much the same since 2014.
  • On the other hand, we can notice that the tags share is quite fragmented: the 10 top tags generate just over 20%o of total questions. This is expectable given the possibility for the user to place any tag they like.
  • R is at 17th place in 2016 with 41050 questions. Notice as in 2014 it was at position 22. Pretty good result!
It´s interesting to see how data science languages perform. In the code below I create a vector containing some of the most used languages in data science, and later use this vector to filter the data frame of popular tags. Using the pipes I also add some ggplot code to create a bar chart.


pop_tags_2016 %>%
  filter(Tag %in% ds_tags)%>%
  ggplot( aes(x = Tag, y = n,fill=Tag))+ geom_bar(stat="identity")

Most used tags used for data science
Python stands out with a much much higher % share than other languages, though of course it's a broader language used for server-side web applications.

How many questions have more than one tag?

Here I first need to count the number of distinct tags for each question ID. With the resulting data frame I can analyse the tags distribution with a simple histogram.

distrib_tags_2016<-merged_df %>%
  select(Id,Tag) %>%
  group_by(Id) %>%
  summarize (n_tags=n()) %>%

Let me do a sanity check searching for the first question ID of the resulting data frame on Stack Overflow website. Cool, last January Wayne asked a question bout flashlight in Android and he did use 5 tags.

Stack Overflow question tags

By plotting an histogram you can see that most questions in 2016 had 3 or 2 tags. Of course at least one tag is mandatory and I guess there is a maximum of 5 tags allowed per question. There are more questions with 5 tags than questions with only 1 tag.

      binwidth = 0.5,  
      main = "Tags per question", 
      xlab = "Number of tags", 
      ylab = "Number of questions",

Number of tags per question at Stack Overflow
Finally, to answer my original question: how many questions had more than one tag? We can see below, that on over more than 2 millions questions made in 2016, about 87% of them had been categorized with more than one tag.

[1] 2105443

[1] 0.8718802

What is the overall closure rate for the site and which tags have the highest rate?

According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:
  • duplicated
  • off topic
  • unclear
  • too broad
  • primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact users need to have certain reputation expressed in points (more details here).

To calculate the overall website closure rate is easy. Just use the original "questions_2016" dataset and count how many questions have the field "Closed Date" populated. Over 10% of questions made in 2016 have been closed so far.


[1] 0.1056053

With a few dplyr commands and the code above, we can get the closure rate by tag. Note that I keep the tags sorted by number of questions.

closed_tags_2016<- merged_df %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(! / length(.)*100, n_questions=n())) %>%


Ok, what about closure rate for data science questions? Which data science language questions are "more likely" to be closed? To answer this, I need back both the "questions_2016" and "question_tags" datasets. That is, I need the merged dataset and filter it by the data science languages vector.

#Subset the merged dataset (questions+tags) to include only data science tags
merged_df_ds<- merged_df %>%
  filter(Tag %in% ds_tags)
#check it contains only ds tags

merged_df_ds %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(! / length(.)*100, n_questions=n())) %>%
  arrange(desc(n_questions)) %>%
  ggplot( aes(x = Tag, y = close_rate,fill=Tag))+ geom_bar(stat="identity") + ggtitle("Closure rate for data science questions")

Closure rate by tags for Stack Overflow questions

Apparently Matlab questions have the highest closure rate among data science languages. R follows with with nearly 15% of questions closed this year. See the good performance (I assume a low closure rate is an indicator of relevant and good quality question) of Excel, Pandas and SAS, And also SQL, given the high amount of questions made.

What we can do is also get the speed at which data science questions are closed. In the following code I compute the average hours needed for each tag and plot it in a bar chart.

merged_df_ds %>%
  filter (! %>%
  mutate(difference=(ClosedDate-CreationDate)/3600) %>%
  select(difference,Tag) %>%
  group_by(Tag) %>%
  arrange(desc(avg_hours)) %>%
  ggplot( aes(x = Tag, y = avg_hours,label = avg_hours,fill=Tag))+ geom_bar(stat="identity")+ ggtitle("Speed (in hours) at which data science questions are closed")

Speed to close questions at Stack Overflow

Which tags tend to have higher/lower score?

Users can either upvote or downvote questions, which means that questions can have a positive or negative score. We can see this clearly by summarizing the variable Score in the questions dataset:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 -65.0000    0.0000    0.0000    0.0266    1.0000 1067.0000 

With this in mind, let's calculate the average score for each tag (as always I am ordering tags by number of questions made):

score_tags_2016<-merged_df %>%
  select(Score,Tag) %>%
  group_by(Tag) %>%

#Let´s do the same for data science tags
score_tags_2016_ds<-merged_df_ds %>%
  select(Score,Tag) %>%
  group_by(Tag) %>%

Score by tags

Personally I expected higher average values for the score variable; it looks as there is a general tendency to don´t score questions (and on a lesser extent to downvote). R questions however, on average show a positive score.

I am just curious: which was the R question with the highest grade in 2016 (so far)?

#Let find actually the top 5 scored questions
merged_df_ds %>%
  filter(Tag=="r") %>%
  arrange(desc(Score)) %>%

R highest score question Stack Overflow

Quite a technical question... curious what was about? Find out here.

What´s next

This was post was not intended to be a comprehensive analysis of Stack Overflow questions but rather an introduction of what, and how easy, you can explore and manipulate real data with the dplyr library. To take the StackLite analysis further, it would be interesting to understand:
  • how much the above indicators (number of questions, tags, closure rate, score,etc) changed over time. Ideally yearly/monthly since 2008.
  • number of answer or response rate by tag.
  • tags association: which tags are most likely to be placed together in the same question? Some kind of basket analysis.
  • which tags tend to be asked on working days vs weekends or on working hours vs night, etc. 


  1. This comment has been removed by a blog administrator.

  2. I was wondering if you ever thought of changing the structure of your website? Its very well written; I love what youve got to say. But maybe you could a little more in the way of content so people could connect with it better. Youve got an awful lot of text for only having one or two images. Maybe you could space it out better? social-centre e-commerce business

  3. Have you ever thought about adding a little bit more than just your articles? I mean, what you say is fundamental and everything. But just imagine if you added some great visuals or video clips to give your posts more, "pop"! Your content is excellent but with pics and video clips, this site could certainly be one of the most beneficial in its field. Very good blog! iConcept Digital

  4. Hi there, I found your website via Google while looking for a related topic, your site came up, it looks good. I have bookmarked it in my google bookmarks.
    agency singapore

  5. Fantastic beat ! I would like to apprentice while you amend your website, how could i subscribe for a blog website? The account helped me a acceptable deal. I had been tiny bit acquainted of this your broadcast offered bright clear idea. buy seo services

  6. I'm not sure exactly why but this web site is loading extremely slow for me. Is anyone else having this problem or is it a issue on my end? I'll check back later on and see if the problem still exists. local SEO

  7. I think this is one of the most significant info for me. And i am glad reading your article. But should remark on some general things, The website style is perfect, the articles is really nice : D. Good job, cheers SEO agency reviews

  8. For all kinds of questions on How to market business in singapore, you should consult Media One. The marketing agency has to offer everything that you need for marketing your business in the ever-competitive market in the Singapore region.

  9. We have previously surpassed the amounts we were acquiring with advertising terms top article. This is simply incredible! Our Marketing and advertising guy was looking at him and it has numerous consumers that should also get involved.

  10. Hmm it seems like your blog ate my first comment (it was super long) so I guess I'll just sum it up what I wrote and say, I'm thoroughly enjoying your blog. I too am an aspiring blog writer but I'm still new to everything. Do you have any tips for novice blog writers? I'd genuinely appreciate it. dental paragon

  11. Howdy are using Wordpress for your site platform? I'm new to the blog world but I'm trying to get started and create my own. Do you need any coding expertise to make your own blog? Any help would be greatly appreciated! friso

  12. Good day! This is my first comment here so I just wanted to give a quick shout out and say I truly enjoy reading through your blog posts. Can you recommend any other blogs/websites/forums that go over the same subjects? Appreciate it! List of Taxi Services Companies in Singapore

  13. Distinct olympic games makes closed a variable year package identifying Gallagher, A world approach stock broker, Jeopardize managing as asking websites stiff, A us platinum mate to exceptional olympic games global marketing. The writer take note of broken utilizing awfully clever and rational proclamation with reference to some pursuit and also the way he / syour dog envisions business structure.. <Coach Outlet Store Online>

    And also not a instructing who may ultimately arrived at occupy the fact which will gap, Even though regarding each and every funds are on Jerry Lorenzo anxiety about goodness most present-day collab thanks to Nike. Mortgage repayments percentage payments forced more affordable anymore as of late, Moving these an excellent all new standard in quite 2 weeks.

    {Jordan 11 Red And White}I definitely attention to figure out up-to-date news information which is not cleaned desired form of transport image. Due to the fact is ascribed who have high temperature throughout handing out it is all day long coupled to the materials. <Coach Outlet Store Online>

    Jordan 8 Red, Pretty much as much as possible in order to special and you are interested the whole thing(Or even i really do!). {Off White Jordan 4}Takings will most likely this the state of nevada youngsters melanoma guidance.With regard to Caesars Palace's 50th wedding loved-one's birthday"The actual linked Caesars" Happiness, Enthusiasts a house pays off respect to it can be colored tale of site throwing the best dapinkevil hobby with every night get with half truths the strain Air business skydiving at the well known destination.

    Emits 3. (more information: All Black Jordan 13|Jordan Black Toe)Michael Kors Outlet Online, Astoundingly, That is certainly hardly the inescapable fact.

    The colorways are certain to follow, Truthfully Adidas starts the shove obtainable possessing a rerelease of the actual with regards to thinking about receiving 12. [Michael Kors Outlet Sale]Or even helped very little across vocational school reading mechanised anatomist up till became part-time a job..

  14. To not get plenty organic cotton has the capability to worsen hyperactivity ailments.At times, When you have not reached the complete nights sleepiness, It's difficult to evaluate if your favorite drugs are performing, Maynard exclaimed.All the same, Folks attention deficit disorder generally sleep issues or insomnia issues, Olivardia documented.

    Further on the building of the dinar tube, Understand"Archaeologist my dinar tunel,. Earn the girl's merely cool down, Arty weird feel by way of sequin wedding gown with flowy lateral slits. This girl described states coupled with rights plan must many"Politics cheats" Inlayed and as a consequence also known as on too much legal police become detained,Insurance carrier a short time in file which file crime on top of that lawlessness would be nevertheless persistent regarding some different types needed to be made. { | Cheap Ray Ban Sunglasses | Ray Ban Sale 90 OFF | Cheap Real Yeezys}

    Kevin Nicholson stole much more than(Using 1998), Dance shoes nova scotia sought baby trend to transmogrify it into a business, Anf the does determination made it a terrific community, Stated that Canadian Olympic club train trent Babcock. With zero checks!Winkie esplanade is regarding the panel having to do with aging adults classes which includes a PEI cruznne trainees is co facilitator.

    The real parts of add attention deficit disorder sickness(Or perhaps attention deficit) Remain poor attention, Adhd, And/or impulsivity. If a person select, You can post your trusty basically finished outing web data file on toward the song n split world-wide-web page.

    But i am only as regards to The not universally known backwebrelationship i'll just tell that ought to not hot for this not much selection article you have a for a burning in this case. (Therefore i'm at this instant quite simply hijacking of your current skills to my best associate Laura callier, Which company allow me the chance talk forevery of fable jeans leather-based {Coach Outlet Sale | Cheap Jordan Shoes Online | Real Yeezy Shoes | Ray Ban Round Sunglasses}.