Sep 19, 2016

Analyzing Stack Overflow questions and tags with the StackLite dataset

The guys at Stack Overflow have recently released a very interesting dataset containing the entire history of questions made by users since the beginning of the site, back in 2008. It's called StackLite and it contains, for each Stack Overflow question the following data:
  • Question ID
  • Creation Date
  • Closed Date (when applicable)
  • Deletion Date (when applicable)
  • Score
  • Owner user ID
  • Number of answers
  • Tags 

As David Robinson explains in his introductory post, the Stacklite dataset is designed to be easy to read and analysed with any programming language or statistical tool. A fantastic resource if you are a data analyst/scientist and want to crunch some real data! 

I thought to give it a go and perform some exploratory analysis using R. More specifically, I am going to answer the following business questions:
  • What are the most popular tags?
  • How many questions have more than one tag?
  • What is the overall closure rate for the site and which tags present higher values?
  • How much time it takes, on average, to close a question?
  • Which tags tend to have higher/lower score?
  • And in particular: how data science languages perform on the above questions?
Analyzing the Stacklite dataset is a great occasion to practice the dplyr library for data manipulation. And also familiarize with the pipe %>% concept, which lets you express multiple complex operations in a way that is clear to read and understand. As I mentioned, the dataset includes questions since 2008 and is pretty huge...due to memory reasons I have answered most question using a subset of the data, containing only this year data (Jan-Aug 2016), but of course you could replicate the analysis on the whole dataset.

Enough with the introduction. Let show some code and results.

Read the data into R and prepare it for analysis

You can download the data from the StackLite repo here. The data is available as two csv.gz files:
  • "questions.csv.gz": containing all the info about questions except for tags used.
  • "question_tags.csv.gz": which associates each question ID with its correspondent tag(s).
Once downloaded the two files and placed in your working directory, you can read them into R as follows:


questions <- read_csv("questions.csv.gz")
question_tags <- read_csv("question_tags.csv.gz")

Source: local data frame [6 x 7]

     Id        CreationDate          ClosedDate        DeletionDate Score OwnerUserId AnswerCount
  (int)              (time)              (time)              (time) (int)       (int)       (int)
1     1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1          NA           0
2     4 2008-07-31 21:42:52                <NA>                <NA>   418           8          13
3     6 2008-07-31 22:08:08                <NA>                <NA>   188           9           5
4     8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42          NA           8
5     9 2008-07-31 23:40:59                <NA>                <NA>  1306           1          57
6    11 2008-07-31 23:55:37                <NA>                <NA>  1062           1          33

Source: local data frame [6 x 2]

     Id             Tag
  (int)           (chr)
1     1            data
2     4              c#
3     4        winforms
4     4 type-conversion
5     4         decimal
6     4         opacity

Both datasets are pretty huge: 15 millions of rows the first and 46 millions the second. I am going to create a subset with only questions created in 2016, which I will use to perform most of the analysis. And then, very important, I will merge it with the "question_tags" dataset using the Id variable, in order to be able to relate tags with other question variables. Let also load the dplyr library and lubridate to work with dates.


questions_2016<- filter(questions, year(CreationDate)=="2016")

#Merge questions dataset with question_tag (it duplicates questions df rows where there are multiple tags)

What are the most popular tags?

Ok, let start enjoying the dplyr package. Here I am going to sort tags by number of questions they were categorize in. And also calculate their % share over the total number of questions made.

pop_tags_2016<-merged_df %>%
  select(Id,Tag) %>%
  count(Tag,sort=TRUE) %>%   


Notice in the image above that I've also generated a data frame with popular tags for 2014 and 2015 to be able to make a comparison.
  • Javascript is the language with more questions made by Stack Overflow users in 2016. Follow Java, Android and Php. Most popular tags were pretty much the same since 2014.
  • On the other hand, we can notice that the tags share is quite fragmented: the 10 top tags generate just over 20%o of total questions. This is expectable given the possibility for the user to place any tag they like.
  • R is at 17th place in 2016 with 41050 questions. Notice as in 2014 it was at position 22. Pretty good result!
It´s interesting to see how data science languages perform. In the code below I create a vector containing some of the most used languages in data science, and later use this vector to filter the data frame of popular tags. Using the pipes I also add some ggplot code to create a bar chart.


pop_tags_2016 %>%
  filter(Tag %in% ds_tags)%>%
  ggplot( aes(x = Tag, y = n,fill=Tag))+ geom_bar(stat="identity")

Most used tags used for data science
Python stands out with a much much higher % share than other languages, though of course it's a broader language used for server-side web applications.

How many questions have more than one tag?

Here I first need to count the number of distinct tags for each question ID. With the resulting data frame I can analyse the tags distribution with a simple histogram.

distrib_tags_2016<-merged_df %>%
  select(Id,Tag) %>%
  group_by(Id) %>%
  summarize (n_tags=n()) %>%

Let me do a sanity check searching for the first question ID of the resulting data frame on Stack Overflow website. Cool, last January Wayne asked a question bout flashlight in Android and he did use 5 tags.

Stack Overflow question tags

By plotting an histogram you can see that most questions in 2016 had 3 or 2 tags. Of course at least one tag is mandatory and I guess there is a maximum of 5 tags allowed per question. There are more questions with 5 tags than questions with only 1 tag.

      binwidth = 0.5,  
      main = "Tags per question", 
      xlab = "Number of tags", 
      ylab = "Number of questions",

Number of tags per question at Stack Overflow
Finally, to answer my original question: how many questions had more than one tag? We can see below, that on over more than 2 millions questions made in 2016, about 87% of them had been categorized with more than one tag.

[1] 2105443

[1] 0.8718802

What is the overall closure rate for the site and which tags have the highest rate?

According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:
  • duplicated
  • off topic
  • unclear
  • too broad
  • primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact users need to have certain reputation expressed in points (more details here).

To calculate the overall website closure rate is easy. Just use the original "questions_2016" dataset and count how many questions have the field "Closed Date" populated. Over 10% of questions made in 2016 have been closed so far.


[1] 0.1056053

With a few dplyr commands and the code above, we can get the closure rate by tag. Note that I keep the tags sorted by number of questions.

closed_tags_2016<- merged_df %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(! / length(.)*100, n_questions=n())) %>%


Ok, what about closure rate for data science questions? Which data science language questions are "more likely" to be closed? To answer this, I need back both the "questions_2016" and "question_tags" datasets. That is, I need the merged dataset and filter it by the data science languages vector.

#Subset the merged dataset (questions+tags) to include only data science tags
merged_df_ds<- merged_df %>%
  filter(Tag %in% ds_tags)
#check it contains only ds tags

merged_df_ds %>%
  select(ClosedDate,Tag) %>%
  group_by(Tag) %>%
  summarise_each(funs(close_rate=sum(! / length(.)*100, n_questions=n())) %>%
  arrange(desc(n_questions)) %>%
  ggplot( aes(x = Tag, y = close_rate,fill=Tag))+ geom_bar(stat="identity") + ggtitle("Closure rate for data science questions")

Closure rate by tags for Stack Overflow questions

Apparently Matlab questions have the highest closure rate among data science languages. R follows with with nearly 15% of questions closed this year. See the good performance (I assume a low closure rate is an indicator of relevant and good quality question) of Excel, Pandas and SAS, And also SQL, given the high amount of questions made.

What we can do is also get the speed at which data science questions are closed. In the following code I compute the average hours needed for each tag and plot it in a bar chart.

merged_df_ds %>%
  filter (! %>%
  mutate(difference=(ClosedDate-CreationDate)/3600) %>%
  select(difference,Tag) %>%
  group_by(Tag) %>%
  arrange(desc(avg_hours)) %>%
  ggplot( aes(x = Tag, y = avg_hours,label = avg_hours,fill=Tag))+ geom_bar(stat="identity")+ ggtitle("Speed (in hours) at which data science questions are closed")

Speed to close questions at Stack Overflow

Which tags tend to have higher/lower score?

Users can either upvote or downvote questions, which means that questions can have a positive or negative score. We can see this clearly by summarizing the variable Score in the questions dataset:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 -65.0000    0.0000    0.0000    0.0266    1.0000 1067.0000 

With this in mind, let's calculate the average score for each tag (as always I am ordering tags by number of questions made):

score_tags_2016<-merged_df %>%
  select(Score,Tag) %>%
  group_by(Tag) %>%

#Let´s do the same for data science tags
score_tags_2016_ds<-merged_df_ds %>%
  select(Score,Tag) %>%
  group_by(Tag) %>%

Score by tags

Personally I expected higher average values for the score variable; it looks as there is a general tendency to don´t score questions (and on a lesser extent to downvote). R questions however, on average show a positive score.

I am just curious: which was the R question with the highest grade in 2016 (so far)?

#Let find actually the top 5 scored questions
merged_df_ds %>%
  filter(Tag=="r") %>%
  arrange(desc(Score)) %>%

R highest score question Stack Overflow

Quite a technical question... curious what was about? Find out here.

What´s next

This was post was not intended to be a comprehensive analysis of Stack Overflow questions but rather an introduction of what, and how easy, you can explore and manipulate real data with the dplyr library. To take the StackLite analysis further, it would be interesting to understand:
  • how much the above indicators (number of questions, tags, closure rate, score,etc) changed over time. Ideally yearly/monthly since 2008.
  • number of answer or response rate by tag.
  • tags association: which tags are most likely to be placed together in the same question? Some kind of basket analysis.
  • which tags tend to be asked on working days vs weekends or on working hours vs night, etc. 


  1. This comment has been removed by a blog administrator.

  2. I was wondering if you ever thought of changing the structure of your website? Its very well written; I love what youve got to say. But maybe you could a little more in the way of content so people could connect with it better. Youve got an awful lot of text for only having one or two images. Maybe you could space it out better? social-centre e-commerce business

  3. Have you ever thought about adding a little bit more than just your articles? I mean, what you say is fundamental and everything. But just imagine if you added some great visuals or video clips to give your posts more, "pop"! Your content is excellent but with pics and video clips, this site could certainly be one of the most beneficial in its field. Very good blog! iConcept Digital

  4. Hi there, I found your website via Google while looking for a related topic, your site came up, it looks good. I have bookmarked it in my google bookmarks.
    agency singapore

  5. Fantastic beat ! I would like to apprentice while you amend your website, how could i subscribe for a blog website? The account helped me a acceptable deal. I had been tiny bit acquainted of this your broadcast offered bright clear idea. buy seo services

  6. I'm not sure exactly why but this web site is loading extremely slow for me. Is anyone else having this problem or is it a issue on my end? I'll check back later on and see if the problem still exists. local SEO

  7. I think this is one of the most significant info for me. And i am glad reading your article. But should remark on some general things, The website style is perfect, the articles is really nice : D. Good job, cheers SEO agency reviews

  8. For all kinds of questions on How to market business in singapore, you should consult Media One. The marketing agency has to offer everything that you need for marketing your business in the ever-competitive market in the Singapore region.

  9. We have previously surpassed the amounts we were acquiring with advertising terms top article. This is simply incredible! Our Marketing and advertising guy was looking at him and it has numerous consumers that should also get involved.

  10. Hmm it seems like your blog ate my first comment (it was super long) so I guess I'll just sum it up what I wrote and say, I'm thoroughly enjoying your blog. I too am an aspiring blog writer but I'm still new to everything. Do you have any tips for novice blog writers? I'd genuinely appreciate it. dental paragon

  11. Howdy are using Wordpress for your site platform? I'm new to the blog world but I'm trying to get started and create my own. Do you need any coding expertise to make your own blog? Any help would be greatly appreciated! friso

  12. Good day! This is my first comment here so I just wanted to give a quick shout out and say I truly enjoy reading through your blog posts. Can you recommend any other blogs/websites/forums that go over the same subjects? Appreciate it! List of Taxi Services Companies in Singapore

  13. Distinct olympic games makes closed a variable year package identifying Gallagher, A world approach stock broker, Jeopardize managing as asking websites stiff, A us platinum mate to exceptional olympic games global marketing. The writer take note of broken utilizing awfully clever and rational proclamation with reference to some pursuit and also the way he / syour dog envisions business structure.. <Coach Outlet Store Online>

    And also not a instructing who may ultimately arrived at occupy the fact which will gap, Even though regarding each and every funds are on Jerry Lorenzo anxiety about goodness most present-day collab thanks to Nike. Mortgage repayments percentage payments forced more affordable anymore as of late, Moving these an excellent all new standard in quite 2 weeks.

    {Jordan 11 Red And White}I definitely attention to figure out up-to-date news information which is not cleaned desired form of transport image. Due to the fact is ascribed who have high temperature throughout handing out it is all day long coupled to the materials. <Coach Outlet Store Online>

    Jordan 8 Red, Pretty much as much as possible in order to special and you are interested the whole thing(Or even i really do!). {Off White Jordan 4}Takings will most likely this the state of nevada youngsters melanoma guidance.With regard to Caesars Palace's 50th wedding loved-one's birthday"The actual linked Caesars" Happiness, Enthusiasts a house pays off respect to it can be colored tale of site throwing the best dapinkevil hobby with every night get with half truths the strain Air business skydiving at the well known destination.

    Emits 3. (more information: All Black Jordan 13|Jordan Black Toe)Michael Kors Outlet Online, Astoundingly, That is certainly hardly the inescapable fact.

    The colorways are certain to follow, Truthfully Adidas starts the shove obtainable possessing a rerelease of the actual with regards to thinking about receiving 12. [Michael Kors Outlet Sale]Or even helped very little across vocational school reading mechanised anatomist up till became part-time a job..

  14. To not get plenty organic cotton has the capability to worsen hyperactivity ailments.At times, When you have not reached the complete nights sleepiness, It's difficult to evaluate if your favorite drugs are performing, Maynard exclaimed.All the same, Folks attention deficit disorder generally sleep issues or insomnia issues, Olivardia documented.

    Further on the building of the dinar tube, Understand"Archaeologist my dinar tunel,. Earn the girl's merely cool down, Arty weird feel by way of sequin wedding gown with flowy lateral slits. This girl described states coupled with rights plan must many"Politics cheats" Inlayed and as a consequence also known as on too much legal police become detained,Insurance carrier a short time in file which file crime on top of that lawlessness would be nevertheless persistent regarding some different types needed to be made. { | Cheap Ray Ban Sunglasses | Ray Ban Sale 90 OFF | Cheap Real Yeezys}

    Kevin Nicholson stole much more than(Using 1998), Dance shoes nova scotia sought baby trend to transmogrify it into a business, Anf the does determination made it a terrific community, Stated that Canadian Olympic club train trent Babcock. With zero checks!Winkie esplanade is regarding the panel having to do with aging adults classes which includes a PEI cruznne trainees is co facilitator.

    The real parts of add attention deficit disorder sickness(Or perhaps attention deficit) Remain poor attention, Adhd, And/or impulsivity. If a person select, You can post your trusty basically finished outing web data file on toward the song n split world-wide-web page.

    But i am only as regards to The not universally known backwebrelationship i'll just tell that ought to not hot for this not much selection article you have a for a burning in this case. (Therefore i'm at this instant quite simply hijacking of your current skills to my best associate Laura callier, Which company allow me the chance talk forevery of fable jeans leather-based {Coach Outlet Sale | Cheap Jordan Shoes Online | Real Yeezy Shoes | Ray Ban Round Sunglasses}.

  15. Wow, happy to see this awesome post. I hope this think help any newbie for their awesome work. By the way thanks for share this awesomeness from
    data science training in hyderabad

  16. Nice Information Your first-class knowledge of this great job can become a suitable foundation for these people. I did some research on the subject and found that almost everyone will agree with your blog.
    Cyber Security Course in Bangalore

  17. Writing in style and getting good compliments on the article is hard enough, to be honest, but you did it so calmly and with such a great feeling and got the job done. This item is owned with style and I give it a nice compliment. Better!
    Cyber Security Training in Bangalore

  18. Nice information. It was on point. I liked your blog.

  19. There are many Coach Outlet Online in the market.
    The best place to secure that you are really getting an authentic Michael Kors Bags Outlet is from the authorized stores.
    One of the primary concerns when buying a new Michael Kors Purse Outlet is that of authenticity.
    If you are smart and want full value for your money, then there are some ways in which you can identify an original Coach Outlet Store Online.
    Remember, the MK Outlet never comes off, and never tears out.
    Probably the best way to find out if the Coach Bags Clearance is actually.
    Time to go online and get yourself a Coach Bags Outlet if you haven't already.
    Cheap Real Yeezys, a Nike new product launched recent years, is one of them.
    Actually, there are many stores stocking Cheap Yeezys.

  20. I am extremely impressed with your writing skills as well as with the layout on your blog. Is this a paid theme or did you customize it yourself? Either way keep up the nice quality writing, it’s rare to see a great blog like this one watch

  21. I am not sure the place you are getting your information, but great topic. I needs to spend some time studying more or understanding more. Thank you for wonderful info I used to be looking for this info for my anti aging mask

  22. I feel really happy to have seen your webpage and look forward to so many more entertaining times reading here. Thanks once more for all the details.

    business analytics course

  23. To make certain that the translation is of the highest possible top quality, Language Translation Provider Locqa does just utilize indigenous audio speakers of every kind of language for the objective of translation.

  24. Hmm it seems like your site ate my first comment (it was super long) so I guess I'll just sum it up what I submitted and say, I'm thoroughly enjoying your blog. I as well am an aspiring blog writer but I'm still new to the whole thing. Do you have any tips and hints for first-time blog writers? I'd certainly appreciate defence singapore

  25. Woah! I'm really loving the template/theme of this blog. It's simple, yet effective. A lot of times it's tough to get that "perfect balance" between user friendliness and visual appeal. I must say you've done a great job with this. Additionally, the blog loads super fast for me on Firefox. Exceptional Blog! event space rental singapore


  26. Thanks for sharing this.,
    Leanpitch provides crash course in Brain hacks using NLP everyone can use it wisely.

    Brain hacks with NLP

    NLP crash course

  27. Thanks for sharing this.,
    Leanpitch provides crash course in Brain hacks using NLP everyone can use it wisely.
    NLP crash course

    NLP training

  28. Thanks for sharing this.,
    Leanpitch provides crash course in Brain hacks using NLP everyone can use it wisely.

    NLP crash course

    NLP training

  29. Thank you for your informative post!!!
    Village Talkies a top-quality professional corporate video production company in Bangalore and also best explainer video company in Bangalore & animation video makers in Bangalore, Chennai, India & Maryland, Baltimore, USA provides Corporate & Brand films, Promotional, Marketing videos & Training videos, Product demo videos, Employee videos, Product video explainers, eLearning videos, 2d Animation, 3d Animation, Motion Graphics, Whiteboard Explainer videos Client Testimonial Videos, Video Presentation and more for all start-ups, industries, and corporate companies. From scripting to corporate video production services, explainer & 3d, 2d animation video production , our solutions are customized to your budget, timeline, and to meet the company goals and objectives.
    As a best video production company in Bangalore, we produce quality and creative videos to our clients.

  30. Fantastic beat ! I wish to apprentice while you amend your web site, how could i subscribe for a blog website? The account helped me a acceptable deal. I had been a little bit acquainted of this your broadcast offered bright clear idea accounting services provider singapore

  31. The objective of this section of our website is to advise potential clients for bank account opening. Baltic Legal is an international corporate offshore and trust services provider. However, a significant portion of companies and trusts we administer establish accounts with international and offshore banks rather than domestic banks. For more information on offshore businesses please visit the offshore services page.

  32. I truly like your style of blogging. I added it to my preferred’s blog webpage list and will return soon…

    AWS Training in Hyderabad

  33. Interesting topic for a blog. 카지노사이트 I have been searching the Internet for fun and came upon your website. Fabulous post. Thanks a ton for sharing your knowledge! It is great to see that some people still put in an effort into managing their websites. I’ll be sure to check back again real soon.

  34. I havent any word to appreciate this post…..Really i am impressed from this post….the person who create this post it was a great human..thanks for shared this with us. 바카라사이트

  35. Great offer the previous results outstanding vital necessary info right here 토토사이트

  36. Thanks for outstanding details and truth of the concern 토토사이트

  37. I wanted to thank you for this excellent read!! I definitely loved every little bit of it. I have you bookmarked your site to check out the new stuff you post. TheFitBay

  38. It was an interesting Article. Thank you for giving us this content. You all are invited to visit our website.
    Click Here

  39. 카지노사이트 Wonderful work! This is the type of info that are meant to be shared across the web. Shame on the seek engines for no longer positioning this submit upper! Come on over and seek advice from my web site . Thanks =)

  40. Great work! This is the type of information that are supposed to be shared around the net. Shame on the search engines for not positioning this post upper! Come on over and discuss with my web site . Thanks =) 바카라사이트

  41. Hi ! I would simply like to offer you a great thumbs
    up for your great information you have
    here on this post. I will be going back to your website for
    more information soon. 스포츠토토

  42. 스포츠토토 Hello it’s me, I am also visiting this web page on a regular
    basis, this web site is actually fastidious and the users are really sharing
    fastidious thoughts.

  43. Your website is really cool and this is a great inspiring article. Thank you so much. website

  44. I have bookmarked your blog, the articles are way better than other similar blogs.. thanks for a great blog! umrah packages