- Question ID
- Creation Date
- Closed Date (when applicable)
- Deletion Date (when applicable)
- Score
- Owner user ID
- Number of answers
- Tags
As David Robinson explains in his introductory post, the Stacklite dataset is designed to be easy to read and analysed with any programming language or statistical tool. A fantastic resource if you are a data analyst/scientist and want to crunch some real data!
I thought to give it a go and perform some exploratory analysis using R. More specifically, I am going to answer the following business questions:
- What are the most popular tags?
- How many questions have more than one tag?
- What is the overall closure rate for the site and which tags present higher values?
- How much time it takes, on average, to close a question?
- Which tags tend to have higher/lower score?
- And in particular: how data science languages perform on the above questions?
Analyzing the Stacklite dataset is a great occasion to practice the dplyr library for data manipulation. And also familiarize with the pipe %>% concept, which lets you express multiple complex operations in a way that is clear to read and understand. As I mentioned, the dataset includes questions since 2008 and is pretty huge...due to memory reasons I have answered most question using a subset of the data, containing only this year data (Jan-Aug 2016), but of course you could replicate the analysis on the whole dataset.
Enough with the introduction. Let show some code and results.
Read the data into R and prepare it for analysis
You can download the data from the StackLite repo here. The data is available as two csv.gz files:
Both datasets are pretty huge: 15 millions of rows the first and 46 millions the second. I am going to create a subset with only questions created in 2016, which I will use to perform most of the analysis. And then, very important, I will merge it with the "question_tags" dataset using the Id variable, in order to be able to relate tags with other question variables. Let also load the dplyr library and lubridate to work with dates.
- "questions.csv.gz": containing all the info about questions except for tags used.
- "question_tags.csv.gz": which associates each question ID with its correspondent tag(s).
Once downloaded the two files and placed in your working directory, you can read them into R as follows:
library(readr) questions <- read_csv("questions.csv.gz") question_tags <- read_csv("question_tags.csv.gz")
head(questions) Source: local data frame [6 x 7] Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount (int) (time) (time) (time) (int) (int) (int) 1 1 2008-07-31 21:26:37 <NA> 2011-03-28 00:53:47 1 NA 0 2 4 2008-07-31 21:42:52 <NA> <NA> 418 8 13 3 6 2008-07-31 22:08:08 <NA> <NA> 188 9 5 4 8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40 42 NA 8 5 9 2008-07-31 23:40:59 <NA> <NA> 1306 1 57 6 11 2008-07-31 23:55:37 <NA> <NA> 1062 1 33
head(question_tags) Source: local data frame [6 x 2] Id Tag (int) (chr) 1 1 data 2 4 c# 3 4 winforms 4 4 type-conversion 5 4 decimal 6 4 opacity
Both datasets are pretty huge: 15 millions of rows the first and 46 millions the second. I am going to create a subset with only questions created in 2016, which I will use to perform most of the analysis. And then, very important, I will merge it with the "question_tags" dataset using the Id variable, in order to be able to relate tags with other question variables. Let also load the dplyr library and lubridate to work with dates.
library(dplyr) library(lubridate) questions_2016<- filter(questions, year(CreationDate)=="2016") #Merge questions dataset with question_tag (it duplicates questions df rows where there are multiple tags) merged_df<-left_join(questions_2016,question_tags,by="Id")
What are the most popular tags?
Ok, let start enjoying the dplyr package. Here I am going to sort tags by number of questions they were categorize in. And also calculate their % share over the total number of questions made.
pop_tags_2016<-merged_df %>% select(Id,Tag) %>% count(Tag,sort=TRUE) %>% mutate(freq=paste0(round(100*n/sum(n),2),"%")) View(pop_tags_2016)
Notice in the image above that I've also generated a data frame with popular tags for 2014 and 2015 to be able to make a comparison.
- Javascript is the language with more questions made by Stack Overflow users in 2016. Follow Java, Android and Php. Most popular tags were pretty much the same since 2014.
- On the other hand, we can notice that the tags share is quite fragmented: the 10 top tags generate just over 20%o of total questions. This is expectable given the possibility for the user to place any tag they like.
- R is at 17th place in 2016 with 41050 questions. Notice as in 2014 it was at position 22. Pretty good result!
It´s interesting to see how data science languages perform. In the code below I create a vector containing some of the most used languages in data science, and later use this vector to filter the data frame of popular tags. Using the pipes I also add some ggplot code to create a bar chart.
Python stands out with a much much higher % share than other languages, though of course it's a broader language used for server-side web applications.
ds_tags<-c("r","python","sas","sql","pandas","excel","matlab") library(ggplot2) pop_tags_2016 %>% filter(Tag %in% ds_tags)%>% ggplot( aes(x = Tag, y = n,fill=Tag))+ geom_bar(stat="identity")
How many questions have more than one tag?
Here I first need to count the number of distinct tags for each question ID. With the resulting data frame I can analyse the tags distribution with a simple histogram.distrib_tags_2016<-merged_df %>% select(Id,Tag) %>% group_by(Id) %>% summarize (n_tags=n()) %>% arrange(desc(n_tags))
Let me do a sanity check searching for the first question ID of the resulting data frame on Stack Overflow website. Cool, last January Wayne asked a question bout flashlight in Android and he did use 5 tags.
library(ggplot2) qplot(distrib_tags_2016$n_tags, geom="histogram", binwidth = 0.5, main = "Tags per question", xlab = "Number of tags", ylab = "Number of questions", fill=I("blue"))
Finally, to answer my original question: how many questions had more than one tag? We can see below, that on over more than 2 millions questions made in 2016, about 87% of them had been categorized with more than one tag.
nrow(distrib_tags_2016) [1] 2105443 sum(distrib_tags_2016$n_tags!=1)/nrow(distrib_tags_2016) [1] 0.8718802
What is the overall closure rate for the site and which tags have the highest rate?
According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:
To calculate the overall website closure rate is easy. Just use the original "questions_2016" dataset and count how many questions have the field "Closed Date" populated. Over 10% of questions made in 2016 have been closed so far.
- duplicated
- off topic
- unclear
- too broad
- primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact users need to have certain reputation expressed in points (more details here).
To calculate the overall website closure rate is easy. Just use the original "questions_2016" dataset and count how many questions have the field "Closed Date" populated. Over 10% of questions made in 2016 have been closed so far.
sum(!is.na(questions_2016$ClosedDate))/nrow(questions_2016) [1] 0.1056053
With a few dplyr commands and the code above, we can get the closure rate by tag. Note that I keep the tags sorted by number of questions.
closed_tags_2016<- merged_df %>% select(ClosedDate,Tag) %>% group_by(Tag) %>% summarise_each(funs(close_rate=sum(!is.na(.)) / length(.)*100, n_questions=n())) %>% arrange(desc(n_questions)) View(closed_tags_2016)
Ok, what about closure rate for data science questions? Which data science language questions are "more likely" to be closed? To answer this, I need back both the "questions_2016" and "question_tags" datasets. That is, I need the merged dataset and filter it by the data science languages vector.
#Subset the merged dataset (questions+tags) to include only data science tags merged_df_ds<- merged_df %>% filter(Tag %in% ds_tags) #check it contains only ds tags unique(merged_df_ds_tags$Tag)
merged_df_ds %>% select(ClosedDate,Tag) %>% group_by(Tag) %>% summarise_each(funs(close_rate=sum(!is.na(.)) / length(.)*100, n_questions=n())) %>% arrange(desc(n_questions)) %>% ggplot( aes(x = Tag, y = close_rate,fill=Tag))+ geom_bar(stat="identity") + ggtitle("Closure rate for data science questions")
Apparently Matlab questions have the highest closure rate among data science languages. R follows with with nearly 15% of questions closed this year. See the good performance (I assume a low closure rate is an indicator of relevant and good quality question) of Excel, Pandas and SAS, And also SQL, given the high amount of questions made.
What we can do is also get the speed at which data science questions are closed. In the following code I compute the average hours needed for each tag and plot it in a bar chart.
What we can do is also get the speed at which data science questions are closed. In the following code I compute the average hours needed for each tag and plot it in a bar chart.
merged_df_ds %>% filter (!is.na(ClosedDate)) %>% mutate(difference=(ClosedDate-CreationDate)/3600) %>% select(difference,Tag) %>% group_by(Tag) %>% summarize(avg_hours=round(mean(difference),2))%>% arrange(desc(avg_hours)) %>% ggplot( aes(x = Tag, y = avg_hours,label = avg_hours,fill=Tag))+ geom_bar(stat="identity")+ ggtitle("Speed (in hours) at which data science questions are closed")
Which tags tend to have higher/lower score?
Users can either upvote or downvote questions, which means that questions can have a positive or negative score. We can see this clearly by summarizing the variable Score in the questions dataset:
summary(questions_2016$Score) Min. 1st Qu. Median Mean 3rd Qu. Max. -65.0000 0.0000 0.0000 0.0266 1.0000 1067.0000
With this in mind, let's calculate the average score for each tag (as always I am ordering tags by number of questions made):
score_tags_2016<-merged_df %>% select(Score,Tag) %>% group_by(Tag) %>% summarize(score_avg=mean(Score),n_questions=n())%>% arrange(desc(n_questions)) #Let´s do the same for data science tags score_tags_2016_ds<-merged_df_ds %>% select(Score,Tag) %>% group_by(Tag) %>% summarize(score_avg=mean(Score),n_questions=n())%>% arrange(desc(n_questions))
Personally I expected higher average values for the score variable; it looks as there is a general tendency to don´t score questions (and on a lesser extent to downvote). R questions however, on average show a positive score.
I am just curious: which was the R question with the highest grade in 2016 (so far)?
Quite a technical question... curious what was about? Find out here.
I am just curious: which was the R question with the highest grade in 2016 (so far)?
#Let find actually the top 5 scored questions merged_df_ds %>% filter(Tag=="r") %>% arrange(desc(Score)) %>% head(5)
Quite a technical question... curious what was about? Find out here.
What´s next
This was post was not intended to be a comprehensive analysis of Stack Overflow questions but rather an introduction of what, and how easy, you can explore and manipulate real data with the dplyr library. To take the StackLite analysis further, it would be interesting to understand:
- how much the above indicators (number of questions, tags, closure rate, score,etc) changed over time. Ideally yearly/monthly since 2008.
- number of answer or response rate by tag.
- tags association: which tags are most likely to be placed together in the same question? Some kind of basket analysis.
- which tags tend to be asked on working days vs weekends or on working hours vs night, etc.
This comment has been removed by a blog administrator.
ReplyDeleteI was wondering if you ever thought of changing the structure of your website? Its very well written; I love what youve got to say. But maybe you could a little more in the way of content so people could connect with it better. Youve got an awful lot of text for only having one or two images. Maybe you could space it out better? social-centre e-commerce business
ReplyDeleteGreat Article
DeleteData Mining Projects IEEE for CSE
JavaScript Training in Chennai
Project Centers in Chennai
JavaScript Training in Chennai
Have you ever thought about adding a little bit more than just your articles? I mean, what you say is fundamental and everything. But just imagine if you added some great visuals or video clips to give your posts more, "pop"! Your content is excellent but with pics and video clips, this site could certainly be one of the most beneficial in its field. Very good blog! iConcept Digital
ReplyDeleteHi there, I found your website via Google while looking for a related topic, your site came up, it looks good. I have bookmarked it in my google bookmarks.
ReplyDeleteagency singapore
Fantastic beat ! I would like to apprentice while you amend your website, how could i subscribe for a blog website? The account helped me a acceptable deal. I had been tiny bit acquainted of this your broadcast offered bright clear idea. buy seo services
ReplyDeleteI'm not sure exactly why but this web site is loading extremely slow for me. Is anyone else having this problem or is it a issue on my end? I'll check back later on and see if the problem still exists. local SEO
ReplyDeleteI think this is one of the most significant info for me. And i am glad reading your article. But should remark on some general things, The website style is perfect, the articles is really nice : D. Good job, cheers SEO agency reviews
ReplyDeleteFor all kinds of questions on How to market business in singapore, you should consult Media One. The marketing agency has to offer everything that you need for marketing your business in the ever-competitive market in the Singapore region.
ReplyDeleteWe have previously surpassed the amounts we were acquiring with advertising terms top article. This is simply incredible! http://www.techscrunch.com/2019/04/02/using-experts-for-seo-digital-marketing-in-boston/ Our Marketing and advertising guy was looking at him and it has numerous consumers that should also get involved.
ReplyDeleteYour articles and blogs are inspirational.
ReplyDeletehttp://listleveragereview.net
Hmm it seems like your blog ate my first comment (it was super long) so I guess I'll just sum it up what I wrote and say, I'm thoroughly enjoying your blog. I too am an aspiring blog writer but I'm still new to everything. Do you have any tips for novice blog writers? I'd genuinely appreciate it. dental paragon
ReplyDeleteHowdy are using Wordpress for your site platform? I'm new to the blog world but I'm trying to get started and create my own. Do you need any coding expertise to make your own blog? Any help would be greatly appreciated! friso
ReplyDeleteGood day! This is my first comment here so I just wanted to give a quick shout out and say I truly enjoy reading through your blog posts. Can you recommend any other blogs/websites/forums that go over the same subjects? Appreciate it! List of Taxi Services Companies in Singapore
ReplyDeleteDistinct olympic games makes closed a variable year package identifying Gallagher, A world approach stock broker, Jeopardize managing as asking websites stiff, A us platinum mate to exceptional olympic games global marketing. The writer take note of broken utilizing awfully clever and rational proclamation with reference to some pursuit and also the way he / syour dog envisions business structure.. <Coach Outlet Store Online>
ReplyDeleteAnd also not a instructing who may ultimately arrived at occupy the fact which will gap, Even though regarding each and every funds are on Jerry Lorenzo anxiety about goodness most present-day collab thanks to Nike. Mortgage repayments percentage payments forced more affordable anymore as of late, Moving these an excellent all new standard in quite 2 weeks.
{Jordan 11 Red And White}I definitely attention to figure out up-to-date news information which is not cleaned desired form of transport image. Due to the fact is ascribed who have high temperature throughout handing out it is all day long coupled to the materials. <Coach Outlet Store Online>
Jordan 8 Red, Pretty much as much as possible in order to special and you are interested the whole thing(Or even i really do!). {Off White Jordan 4}Takings will most likely this the state of nevada youngsters melanoma guidance.With regard to Caesars Palace's 50th wedding loved-one's birthday"The actual linked Caesars" Happiness, Enthusiasts a house pays off respect to it can be colored tale of site throwing the best dapinkevil hobby with every night get with half truths the strain Air business skydiving at the well known destination.
Emits 3. (more information: All Black Jordan 13|Jordan Black Toe)Michael Kors Outlet Online, Astoundingly, That is certainly hardly the inescapable fact.
The colorways are certain to follow, Truthfully Adidas starts the shove obtainable possessing a rerelease of the actual with regards to thinking about receiving 12. [Michael Kors Outlet Sale]Or even helped very little across vocational school reading mechanised anatomist up till became part-time a job..
To not get plenty organic cotton has the capability to worsen hyperactivity ailments.At times, When you have not reached the complete nights sleepiness, It's difficult to evaluate if your favorite drugs are performing, Maynard exclaimed.All the same, Folks attention deficit disorder generally sleep issues or insomnia issues, Olivardia documented.
ReplyDeleteFurther on the building of the dinar tube, Understand"Archaeologist my dinar tunel,. Earn the girl's merely cool down, Arty weird feel by way of sequin wedding gown with flowy lateral slits. This girl described states coupled with rights plan must many"Politics cheats" Inlayed and as a consequence also known as on too much legal police become detained,Insurance carrier a short time in file which file crime on top of that lawlessness would be nevertheless persistent regarding some different types needed to be made. {facebook.com | Cheap Ray Ban Sunglasses | Ray Ban Sale 90 OFF | Cheap Real Yeezys}
Kevin Nicholson stole much more than(Using 1998), Dance shoes nova scotia sought baby trend to transmogrify it into a business, Anf the does determination made it a terrific community, Stated that Canadian Olympic club train trent Babcock. With zero checks!Winkie esplanade is regarding the panel having to do with aging adults classes which includes a PEI cruznne trainees is co facilitator.
The real parts of add attention deficit disorder sickness(Or perhaps attention deficit) Remain poor attention, Adhd, And/or impulsivity. If a person select, You can post your trusty basically finished outing web data file on toward the song n split world-wide-web page.
But i am only as regards to The not universally known backwebrelationship i'll just tell that ought to not hot for this not much selection article you have a for a burning in this case. (Therefore i'm at this instant quite simply hijacking of your current skills to my best associate Laura callier, Which company allow me the chance talk forevery of fable jeans leather-based {Coach Outlet Sale | Cheap Jordan Shoes Online | Real Yeezy Shoes | Ray Ban Round Sunglasses}.
Wow, happy to see this awesome post. I hope this think help any newbie for their awesome work. By the way thanks for share this awesomeness from
ReplyDeletedata science training in hyderabad
Nice Information Your first-class knowledge of this great job can become a suitable foundation for these people. I did some research on the subject and found that almost everyone will agree with your blog.
ReplyDeleteCyber Security Course in Bangalore
Writing in style and getting good compliments on the article is hard enough, to be honest, but you did it so calmly and with such a great feeling and got the job done. This item is owned with style and I give it a nice compliment. Better!
ReplyDeleteCyber Security Training in Bangalore
Nice information. It was on point. I liked your blog.
ReplyDeleteThere are many Coach Outlet Online in the market.
ReplyDeleteThe best place to secure that you are really getting an authentic Michael Kors Bags Outlet is from the authorized stores.
One of the primary concerns when buying a new Michael Kors Purse Outlet is that of authenticity.
If you are smart and want full value for your money, then there are some ways in which you can identify an original Coach Outlet Store Online.
Remember, the MK Outlet never comes off, and never tears out.
Probably the best way to find out if the Coach Bags Clearance is actually.
Time to go online and get yourself a Coach Bags Outlet if you haven't already.
Cheap Real Yeezys, a Nike new product launched recent years, is one of them.
Actually, there are many stores stocking Cheap Yeezys.
I am extremely impressed with your writing skills as well as with the layout on your blog. Is this a paid theme or did you customize it yourself? Either way keep up the nice quality writing, it’s rare to see a great blog like this one nowadays..coach watch
ReplyDeleteI am not sure the place you are getting your information, but great topic. I needs to spend some time studying more or understanding more. Thank you for wonderful info I used to be looking for this info for my mission.best anti aging mask
ReplyDeleteMua vé rẻ Aivivu, tham khảo
ReplyDeletemua ve may bay di my
các chuyến bay từ mỹ về việt nam hiện nay
đăng ký bay từ canada về Việt Nam
Lịch bay từ Hàn Quốc về Việt Nam tháng 7
I feel really happy to have seen your webpage and look forward to so many more entertaining times reading here. Thanks once more for all the details.
ReplyDeletebusiness analytics course
To make certain that the translation is of the highest possible top quality, Language Translation Provider Locqa does just utilize indigenous audio speakers of every kind of language for the objective of translation.
ReplyDelete