analytics for fun<br />
<br />
Live Earthquakes App<br />
Posted by marqui on 2017-12-08<br />
<br />
It's awesome when you are asked to build a product demo and you end up building something you actually use yourself.<br />
<br />
That is what happened to me with the Live Earthquake Shiny App. A few months ago, as part of the JHU Data Science Specialization course, I was tasked with building a data product demo using the shiny package in R. I already had some experience with shiny, but this time I wanted to build an app showing real-time data: something people would want to check regularly to see whether anything special had happened during the last couple of days.<br />
<br />
I am not at all an expert on earthquakes, but I thought this would make a great use case for a real-time data visualization. And now, every time I hear about a new earthquake in the news, I double-check it in my app and see what else is going on.<br />
<br />
The app does the following:<br />
<br />
<ul>
<li>Retrieves the latest data available from the <a href="https://earthquake.usgs.gov/">USGS</a> website. The data comes as a .csv file and reports quakes for the past 7 days (check the exact URL in the R code).</li>
<li>Subsets the dataset if the user chooses to see only data from yesterday.</li>
<li>Plots the earthquake data on a world map using the <a href="https://rstudio.github.io/leaflet/">leaflet library</a>.</li>
<li>Calculates a few basic metrics, such as the maximum and the number of occurrences.</li>
<li>Forces a manual refresh of the data if the user presses the “Update Data” button.</li>
</ul>
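<div>
As a rough illustration of the subsetting and metrics steps above, here is a minimal R sketch. It is not the app's actual code: the helper names <i>recent_quakes()</i> and <i>quake_metrics()</i> are hypothetical, the small data frame is made up to stand in for the USGS file, the column names <i>time</i>, <i>mag</i> and <i>place</i> are assumed to match the USGS CSV, and "yesterday" is approximated as the last 24 hours.</div>

```r
# Toy stand-in for the USGS 7-day CSV (assumed columns: time, mag, place).
quakes <- data.frame(
  time  = as.POSIXct(c("2017-12-02 10:00:00", "2017-12-07 08:30:00",
                       "2017-12-08 01:15:00"), tz = "UTC"),
  mag   = c(2.1, 5.4, 4.0),
  place = c("Alaska", "Chile", "Japan"),
  stringsAsFactors = FALSE
)

# Keep only quakes from the `hours` before `now` (hypothetical helper).
recent_quakes <- function(df, now, hours = 24) {
  df[df$time >= now - hours * 3600 & df$time <= now, , drop = FALSE]
}

# Basic metrics: number of quakes and maximum magnitude (hypothetical helper).
quake_metrics <- function(df) {
  list(count   = nrow(df),
       max_mag = if (nrow(df) > 0) max(df$mag) else NA_real_)
}

now <- as.POSIXct("2017-12-08 12:00:00", tz = "UTC")
last_day <- recent_quakes(quakes, now)
quake_metrics(last_day)
```

<div>
In a real app the data frame would come from reading the USGS file (for instance with <i>read.csv()</i>) rather than being typed in by hand.</div>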
<div>
Here is a snapshot of the app. You can <a href="https://mcpasincoursera.shinyapps.io/live_earthquakes/">use the app here</a> on the shinyapps.io server.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSHsFvs6tAcfj2oOZFdg2r_0PflxYokdmHONO_bD4CdOnrBfuoBxudSdfYcEQMRepyRRmtPJUZy9sfPQhvhwUoEdIWGBaFQ2c4ai50GlOImtspeO9yxmJ919BymBc7HFrBJffZwblKi5k/s1600/live+earthquakes+app.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="live earthquakes map" border="0" data-original-height="622" data-original-width="1332" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSHsFvs6tAcfj2oOZFdg2r_0PflxYokdmHONO_bD4CdOnrBfuoBxudSdfYcEQMRepyRRmtPJUZy9sfPQhvhwUoEdIWGBaFQ2c4ai50GlOImtspeO9yxmJ919BymBc7HFrBJffZwblKi5k/s640/live+earthquakes+app.JPG" title="" width="640" /></a></div>
<div>
<br /></div>
<br />
You can also check out <a href="https://github.com/mcpasin/datasciencecoursera/blob/master/Developing%20Data%20Products/week4/live_quakes_app/app.R">the code here</a>.<br />
<br />
If you click on a circle, some basic info about the quake is shown: place, time, magnitude and depth. Circle colors are based on the magnitude (the darker, the stronger). If you wonder how I classified quakes from minor to strong, below is the scale I used:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhMNBuLsRTYMOoTCxSw2IP4BEo7vVd1LtjpRN5BxO8vn9rNGlFDjOqSAP1MTaYN4F4HWEvWNh8HvzQvlur2unCRjtWnoP30Db1Vhx9R7L5GTD_hNBSZpN5BvL8FV0s_Qm3vbne1E6I6Rk/s1600/magnitude+scale.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="magnitude categorization" border="0" data-original-height="370" data-original-width="715" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhMNBuLsRTYMOoTCxSw2IP4BEo7vVd1LtjpRN5BxO8vn9rNGlFDjOqSAP1MTaYN4F4HWEvWNh8HvzQvlur2unCRjtWnoP30Db1Vhx9R7L5GTD_hNBSZpN5BvL8FV0s_Qm3vbne1E6I6Rk/s400/magnitude+scale.JPG" title="" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Hope you'll have a chance to explore it! Enjoy.</div>
<br />
<br />
Actionable Data Analysis for Ecommerce Products<br />
Posted by marqui on 2017-03-24<br />
<br />
Managing an ecommerce shop backed by a proper transactional/warehouse database and a digital analytics collection platform (e.g. Google Analytics) means having access to lots and lots of data. The types of analysis you can do are countless. It depends, of course, on the business question you need to answer.<br />
<br />
However, the unique added value of your analysis very often lies in how actionable its results are for the business. In this post I am going to demonstrate a few examples of actionable analysis you can do with your ecommerce business data.<br />
<br />
I will take some data from the Google Merchandise Site (there is a <a href="http://www.analyticsforfun.com/2016/08/google-analytics-makes-demo-account.html">free GA demo account</a>) and use Tableau to create the visualizations.<br />
<a name='more'></a><br />
<h2>
<span style="font-size: large;">Product performance on the site</span></h2>
<div>
In this first example I am building on some ideas suggested in <a href="http://www.tatvic.com/blog/enhanced-ecommerce-analysis/">this article</a> by Tatvic. The question is: how can we measure the performance of each product offered on the website?<br />
<br />
One way to do it is by crossing two variables:<br />
<ul>
<li><b>Pageviews</b>: how many times a specific product page was viewed. This can be thought of as a simple proxy variable for the demand for that product (people browse a product because they might be interested in buying it).</li>
<li><b>Transactions</b>: how many times the same product was actually bought. Here we are talking about sales (not only was the product viewed, but it was added to checkout and eventually purchased).</li>
</ul>
<div>
If you have the <a href="https://developers.google.com/analytics/devguides/collection/analyticsjs/enhanced-ecommerce">Enhanced Ecommerce module</a> properly implemented in GA, you can find these data under <i>Conversions>Ecommerce>Product Performance>Shopping Behavior tab</i>. The variables we are interested in are called "<i>Product Detail Views</i>" and "<i>Unique Purchases</i>".</div>
<div>
<br /></div>
<div>
In the Tableau visualization below I've plotted data from the <a href="http://www.analyticsforfun.com/2016/08/google-analytics-makes-demo-account.html">Google Merchandise Site</a> for the month of February 2017. Each circle represents a single product and is placed in the plot according to its views and transactions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4NjlD1ruAjzYh9B06MIeLjhy6Sc-dRacesZv6aqPrVnHRAx0nVDMyZbuFXdgKd5VNZM8vgCLS9pkUvImLDiazCrYh8fq1zZ8BSo9u6106XKPE6ocCZlfDyDe57fhkOiNhlARcJc06JSE/s1600/pageviews+vs+transactions.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="pageviews vs transactions" border="0" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4NjlD1ruAjzYh9B06MIeLjhy6Sc-dRacesZv6aqPrVnHRAx0nVDMyZbuFXdgKd5VNZM8vgCLS9pkUvImLDiazCrYh8fq1zZ8BSo9u6106XKPE6ocCZlfDyDe57fhkOiNhlARcJc06JSE/s400/pageviews+vs+transactions.JPG" title="Google Analytics Ecommerce" width="400" /></a></div>
<br /></div>
<div>
There are more than 1,000 products in the plot, which makes the visualization hard to read. On top of that, as expected, both variable distributions are right-skewed (most products have few views and sales), which makes the visualization even less clear (and not actionable). One way to cope with that is to switch both axes to a logarithmic scale.</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU_RJ2csoDfQ1dnls8meJHrKuzvTzZarl7MFvjGFh1CQn8vM9QMMesvECYQLi2IflnLnIpAWKJBvxP0cCpv02khG9Z6NUgFKNxZMQJ52r_zAV7T-9Y126LBZSYFCvPSBbno2du8Y2NvLc/s1600/pageviews+vs+transactions+log.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="pageviews vs transactions logarithmic axis" border="0" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU_RJ2csoDfQ1dnls8meJHrKuzvTzZarl7MFvjGFh1CQn8vM9QMMesvECYQLi2IflnLnIpAWKJBvxP0cCpv02khG9Z6NUgFKNxZMQJ52r_zAV7T-9Y126LBZSYFCvPSBbno2du8Y2NvLc/s400/pageviews+vs+transactions+log.JPG" title="Google Analytics products performance" width="400" /></a></div>
<br /></div>
<div>
<br /></div>
<div>
Looks nicer. Still not actionable though.<br />
<br />
Let's say we want to identify the top segment of products that generate both a high number of views and a high number of transactions. Or better, let's generate 4 segments of products as follows:<br />
<ol>
<li>High Pageviews/High Sales</li>
<li>High Pageviews/Low Sales</li>
<li>Low Pageviews/High Sales</li>
<li>Low Pageviews/Low Sales</li>
</ol>
In other words, we would like to split the plot area into 4 quadrants. A very simple way to do it would be to just draw a reference line in the middle of each axis.<br />
<br />
A more solid approach would rely on statistical measures, which means considering the distribution of each variable. What we can do is calculate <a href="https://en.wikipedia.org/wiki/Quartile">quartiles</a> for each variable. Given a sequence of data points sorted from lowest to highest, the quartiles are the cut points that divide it into 4 groups of equal size. So, for the pageviews variable for example:<br />
<ol>
<li>the 1st quartile (also called lower quartile) is the value below which the lowest 25% of products fall.</li>
<li>the 2nd quartile (the median) has 50% of products below it.</li>
<li>the 3rd quartile (upper quartile) has 75% of products below it.</li>
<li>Finally, the top 25% of products are those that fall above the 3rd quartile.</li>
</ol>
After calculating and plotting quartiles for both variables, we should obtain a visualization similar to the one below (if you are browsing from a mobile device you might not see it properly; in that case please visit this <a href="https://public.tableau.com/profile/marco.pasin#!/vizhome/EcommerceProductsAnalysis1/matrixprodperf">Tableau Public page</a>):<br />
<br />
<div class="tableauPlaceholder" id="viz1490316668981" style="position: relative;">
<noscript><a href='#'><img alt='Product Performance Matrix ' src='https://public.tableau.com/static/images/Ec/EcommerceProductsAnalysis1/matrixprodperf/1_rss.png' style='border: none' /></a></noscript><object class="tableauViz" style="display: none;"><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='EcommerceProductsAnalysis1/matrixprodperf' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ec/EcommerceProductsAnalysis1/matrixprodperf/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>
<script type="text/javascript"> var divElement = document.getElementById('viz1490316668981'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='90%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
<br />
<br />
where:<br />
<ul>
<li>The top-right quadrant includes the <b>top 25%</b> both in terms of sales and pageviews. It looks like there is high demand for these products (they are the most viewed) and high sales are eventually generated. This is probably an ideal situation for the business, and a possible action could be increasing the price of those products in order to improve margins.</li>
<li>In the bottom-right quadrant, products are highly viewed/demanded but do not generate as many sales as those in the previous quadrant. It looks like we could improve the commercial offer in order to push more sales.</li>
<li>In the top-left quadrant it looks like <b>products convert very well</b> (GA has a metric called Buy to Detail Rate which should be high for these products), since they generate high sales with fewer pageviews than others. Perhaps some actions aimed at giving them more exposure within the website could produce even more sales.</li>
<li>Finally, we should take some action on the bottom-left products. The idea would be to gradually shift these products towards the top-right, or alternatively drop some of them.</li>
</ul>
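<div>
The quadrant split described above can also be sketched outside Tableau. Below is a minimal R example on made-up numbers; the data frame, its column names and the choice of cutting each variable at its upper quartile (so that "high" means the top 25%) are assumptions for illustration, not the exact calculation behind the Tableau view.</div>

```r
# Made-up product data; in GA these would be Product Detail Views
# and Unique Purchases.
products <- data.frame(
  product   = paste0("P", 1:8),
  views     = c(10, 20, 30, 40, 50, 60, 70, 800),
  purchases = c(1, 9, 2, 8, 3, 7, 4, 90)
)

# Cut each variable at its upper quartile
# (assumption: "high" = above the 75th percentile).
views_cut     <- quantile(products$views, 0.75)
purchases_cut <- quantile(products$purchases, 0.75)

# Label every product with the quadrant it falls into.
products$segment <- paste0(
  ifelse(products$views > views_cut, "High", "Low"), " Pageviews/",
  ifelse(products$purchases > purchases_cut, "High", "Low"), " Sales"
)

# Number of products per quadrant.
table(products$segment)
```

<div>
Note that <i>quantile()</i> uses R's default interpolation method, so on small samples the cut points may differ slightly from those computed by other tools.</div>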
</div>
<div>
<br />
Lastly, we can add further dimensions to the visualization. Here I've added the corresponding business <b>category</b> for each product. The drop-down filter lets you select just one category. And the cool thing is that the quartiles will be recalculated based on the distribution of that particular category. This means being able to do more granular analysis and meaningful comparisons based on a specific context.</div>
</div>
<br />
<div class="tableauPlaceholder" id="viz1490316933811" style="position: relative;">
<noscript><a href='#'><img alt='Product Performance Matrix ' src='https://public.tableau.com/static/images/Ec/EcommerceProductsAnalysis/matrixprodperf-cat/1_rss.png' style='border: none' /></a></noscript><object class="tableauViz" style="display: none;"><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='EcommerceProductsAnalysis/matrixprodperf-cat' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ec/EcommerceProductsAnalysis/matrixprodperf-cat/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>
<script type="text/javascript"> var divElement = document.getElementById('viz1490316933811'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='90%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
<br />
<div>
<h2>
<span style="font-size: large;"><br /></span></h2>
<h2>
<span style="font-size: large;">Products sold vs stock</span></h2>
</div>
Let's apply the same logic to solve another practical problem for any ecommerce business selling goods: do we have enough stock? Do we need to supply new stock to the warehouse, or would we be better off discounting a particular product?<br />
<br />
The final visualization would be the following (note that for the stock variable I've generated some random data myself since I've no access to real data):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWTe_qnD7XSqoWnAQfk67FcoOfC1xbjKye7wyql3weWUfEwwz-Qk4qxKhM89S8aO8UdNYIpqSaq9vxLpEpWfA4x-quHcOwRJgJ5o5x1xqqTfAw0qCQZ1VU8_XFVGwfgSDeu9Njx0i8tLc/s1600/matrix+stock.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="sales vs stock" border="0" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWTe_qnD7XSqoWnAQfk67FcoOfC1xbjKye7wyql3weWUfEwwz-Qk4qxKhM89S8aO8UdNYIpqSaq9vxLpEpWfA4x-quHcOwRJgJ5o5x1xqqTfAw0qCQZ1VU8_XFVGwfgSDeu9Njx0i8tLc/s640/matrix+stock.jpg" title="Stock matrix" width="640" /></a></div>
<br />
where:<br />
<br />
<ul>
<li>the top-left quadrant contains products that generated high sales but are currently low on stock. Should we buy more stock and make it available in the warehouse?</li>
<li>the bottom-right quadrant identifies products with high stock availability but low sales performance. In this case we might need to apply some discount to stimulate sales.</li>
</ul>
<br />
<br />
<h2>
<span style="font-size: large;">Conclusions: how to be actionable?</span></h2>
In this post I've presented only a few examples. Beyond the data and the technical steps I followed to build the visualizations (you could certainly be more rigorous), I do think they all provide actionable results for the business, and stakeholders can make decisions based on them.<br />
<br />
So, how can we make sure that results will be actionable? Below is a list of steps I recommend considering in your data analysis task:<br />
<ol>
<li>Solve a practical problem for the business.</li>
<li>Cross relevant variables.</li>
<li>Provide a comparison/benchmark to give data a proper context.</li>
<li>Make sure the results are ready to be used by people.</li>
<li>Take advantage of visualizations to explain results and simplify interpretation for decision makers.</li>
<li>Provide not only data but also words to describe insights. If you understood the business problem properly from the very beginning, you should be able to talk the same language used by final users of your analysis.</li>
</ol>
<br />
<br />
How to Upgrade R version in Windows. The easy way recommended on CRAN<br />
Posted by marqui on 2016-10-04<br />
<br />
Today I found myself needing to upgrade R. The main reason was that my current version, R-3.2.1, did not support some new graphics packages. To install these new packages I needed at least version R-3.3.<br />
<br />
After a bit of initial hesitation (will I lose my packages during the new installation? etc. etc.) I finally found the courage and decided to follow the <a href="https://cran.r-project.org/bin/windows/base/rw-FAQ.html#What_0027s-the-best-way-to-upgrade_003f">official documentation on CRAN</a>. Everything worked just fine and I have now installed the latest R version available on CRAN: at the time I write this post it's R-3.3.1.<br />
<br />
The upgrade process was really easy, so I thought I'd share it step by step. Enjoy :)<br />
<a name='more'></a><br />
<br />
<h2>
<span style="font-size: large;">
1. Check your current R version</span></h2>
<div>
To find out your current version, open R and it will be shown in the console. If you are using RStudio you can check your R version by clicking on Tools>Global Options... yep, my current version is now R-3.3.1.</div>
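<div>
If you prefer to check from code rather than from the menus, base R exposes the version directly. A minimal sketch (the "3.3.0" threshold is just an example value):</div>

```r
# Full version string of the running R installation.
R.version.string

# Version as a comparable object: is this installation at least R 3.3.0?
getRversion() >= "3.3.0"
```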
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaK83tbdjgAml84zHjVtAh84aotDoESYKTkuoHdwKcCWK9spM69K1Jrt6HTIO90ZRQJOveiMI_sqxoHfVBbN_56UfGm1r8Z_LHGYWjqOqwJ-729i0yZTBA22KrzgtRHa8_aOgO3Y2Q3Xo/s1600/R+version.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Find your R version" border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaK83tbdjgAml84zHjVtAh84aotDoESYKTkuoHdwKcCWK9spM69K1Jrt6HTIO90ZRQJOveiMI_sqxoHfVBbN_56UfGm1r8Z_LHGYWjqOqwJ-729i0yZTBA22KrzgtRHa8_aOgO3Y2Q3Xo/s640/R+version.png" title="" width="640" /></a></div>
<div>
<br /></div>
<h2>
<span style="font-size: large;">2. </span><span style="font-size: large;">Locate your current R folder </span></h2>
<div>
If you did not change the default path in your previous installation, you should be able to find the main R folder under C:\Program Files\R.</div>
<div>
<br /></div>
<div>
Make sure to locate your library folder too (the one containing all the packages you installed so far). It can be that either:</div>
<div>
<ol>
<li>all of your packages (the ones you have installed yourself plus the ones coming by default with the R version you originally installed) are in the library folder, under the main R folder (e.g. "C:\Program Files\R\R-3.1")</li>
<li>or, like in my case, the packages I have manually installed are in a different folder. In my previous version they were located under this path: C:\Users\Marco\Documents\R\win-library\3.1</li>
</ol>
</div>
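<div>
If you are not sure which of the two cases applies to you, R itself can tell you where its folders are. A short sketch using base R functions only; run it in your current version before uninstalling anything:</div>

```r
# Folder where R itself is installed (the "main R folder").
R.home()

# All library folders R searches for packages; a personal library,
# if you have one, is listed here as well.
.libPaths()

# Library folder of each installed package (first few rows only).
head(installed.packages()[, c("Package", "LibPath")])
```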
<div>
<span style="font-size: large;"></span><br />
<h2>
<span style="font-size: large;">
<span style="font-size: large;">3. </span><span style="font-size: large;">Download the latest version from CRAN</span></span></h2>
<div>
Go to the CRAN website and download the latest R version for Windows. At the time I am writing this post it's R-3.3.1.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggSmWltkX38aNcx-UFwg18tKw2QE18vVLZVPyv7kQhqqSidYNt1EwJD9H0F4zxK22TA1wY6695kNEi2KrzzPDBvHWZloved8e8ehAk2z80-5CG3X_cvTRzWSQzof8lV4-G2XMF2dvBVeo/s1600/download+R.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Download R-3.3.1 Windows" border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggSmWltkX38aNcx-UFwg18tKw2QE18vVLZVPyv7kQhqqSidYNt1EwJD9H0F4zxK22TA1wY6695kNEi2KrzzPDBvHWZloved8e8ehAk2z80-5CG3X_cvTRzWSQzof8lV4-G2XMF2dvBVeo/s640/download+R.png" title="" width="640" /></a></div>
<br /></div>
<div>
<h2>
<span style="font-size: large;">
<span style="font-size: large;">4. Uninstall your current R version </span></span></h2>
As you would uninstall any other program on your Windows machine, go to Control Panel>Applications and uninstall your current R version.<br />
<div>
<br />
<h2>
<span style="font-size: large;">
<span style="font-size: large;">5. Install the latest version on your machine</span></span></h2>
<div>
I suggest keeping the default path, which should be "C:\Program Files\".<br />
<br />
<h2>
<span style="font-size: large;"><span style="font-size: large;">
<span style="font-size: large;">
<span style="font-size: large;">6. Copy your old packages to your new R folder</span></span></span></span></h2>
<div>
With the newer version only the basic R packages will be installed. To avoid reinstalling all your previous packages, go to your old library folder (see step 2) and copy your packages. Then paste them into your new library folder.</div>
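<div>
The copy can of course be done by hand in Windows Explorer. If you would rather script it, the sketch below shows one possible way in base R. The function name <i>copy_old_packages()</i> is hypothetical, and the demo runs on throwaway temporary folders with fake package folders, so nothing on your machine is touched.</div>

```r
# Copy package folders from an old library to a new one, skipping any
# package that already exists in the new library (e.g. base packages).
# `copy_old_packages` is a hypothetical helper, not part of R.
copy_old_packages <- function(old_lib, new_lib) {
  old_pkgs <- list.dirs(old_lib, full.names = FALSE, recursive = FALSE)
  new_pkgs <- list.dirs(new_lib, full.names = FALSE, recursive = FALSE)
  to_copy  <- setdiff(old_pkgs, new_pkgs)
  for (pkg in to_copy) {
    file.copy(file.path(old_lib, pkg), new_lib, recursive = TRUE)
  }
  invisible(to_copy)
}

# Demo on temporary folders with fake "packages".
old_lib <- file.path(tempdir(), "old-library")
new_lib <- file.path(tempdir(), "new-library")
dir.create(file.path(old_lib, "ggplot2"), recursive = TRUE)
dir.create(file.path(old_lib, "stats"), recursive = TRUE)
dir.create(file.path(new_lib, "stats"), recursive = TRUE)
file.create(file.path(old_lib, "ggplot2", "DESCRIPTION"))

copied <- copy_old_packages(old_lib, new_lib)
copied
```

<div>
On a real machine you would pass the old and new library paths located in step 2 instead of the temporary folders.</div>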
<span style="font-size: large;"><span style="font-size: large;">
</span></span>
<br />
<h2>
<span style="font-size: large;"><span style="font-size: large;">
<span style="font-size: large;">7. Make sure your packages are updated</span></span></span></h2>
<div>
<span style="font-size: medium;"><span style="font-size: medium;"><span style="font-size: medium;">To make sure your packages are updated to the latest version available on CRAN, you can run the following command from the R console: <i>update.packages(checkBuilt=TRUE, ask=FALSE).</i></span></span></span></div>
<div>
<span style="font-size: medium;"><span style="font-size: medium;"><span style="font-size: medium;"><i><br /></i></span></span></span></div>
<div>
<span style="font-size: medium;"><span style="font-size: medium;"><span style="font-size: medium;"><i><br /></i></span></span></span></div>
<span style="font-size: large;"><span style="font-size: large;">
</span></span></div>
<span style="font-size: large;">
</span></div>
</div>
</div>
<br />
<br />
Analyzing Stack Overflow questions and tags with the StackLite dataset<br />
Posted by marqui on 2016-09-19<br />
<br />
The guys at Stack Overflow have recently released a very interesting dataset containing the entire history of questions asked by users since the beginning of the site, back in 2008. It's called <a href="https://github.com/dgrtwo/StackLite">StackLite</a> and it contains the following data for each Stack Overflow question:<br />
<div>
<ul>
<li>Question ID</li>
<li>Creation Date</li>
<li>Closed Date (when applicable)</li>
<li>Deletion Date (when applicable)</li>
<li>Score</li>
<li>Owner user ID</li>
<li>Number of answers</li>
<li>Tags </li>
</ul>
<div>
<br /></div>
<div>
As David Robinson explains in <a href="http://varianceexplained.org/r/stack-lite/">his introductory post</a>, the StackLite dataset is designed to be easy to read and analyze with any programming language or statistical tool. A fantastic resource if you are a data analyst/scientist and want to crunch some real data! </div>
<div>
<br /></div>
<div>
I decided to give it a go and perform some exploratory analysis using R. More specifically, I am going to answer the following business questions:</div>
<div>
<ul>
<li>What are the most popular tags?</li>
<li>How many questions have more than one tag?</li>
<li>What is the overall closure rate for the site and which tags present higher values?</li>
<li>How long does it take, on average, to close a question?</li>
<li>Which tags tend to have higher/lower scores?</li>
<li>And in particular: how do data science languages perform on the above questions?<a name='more'></a></li>
</ul>
<div>
Analyzing the StackLite dataset is a great opportunity to practice the <a href="https://cran.r-project.org/web/packages/dplyr/index.html"><i>dplyr</i> library</a> for data manipulation, and also to get familiar with the <i><a href="http://r4ds.had.co.nz/pipes.html">pipe %>%</a></i> concept, which lets you express multiple complex operations in a way that is clear to read and understand. As I mentioned, the dataset includes questions since 2008 and is pretty huge. For memory reasons I have answered most questions using a subset containing only this year's data (Jan-Aug 2016), but of course you could replicate the analysis on the whole dataset.</div>
</div>
<div>
<br /></div>
<div>
Enough with the introduction. Let's see some code and results.</div>
<div>
<br />
<br />
<h2>
<span style="font-size: large;">
Read the data into R and prepare it for analysis</span></h2>
<div>
You can download the data from the <a href="https://github.com/dgrtwo/StackLite">StackLite repo here</a>. The data is available as two csv.gz files:<br />
<ul>
<li>"<i>questions.csv.gz</i>": containing all the info about questions except for tags used.</li>
<li>"<i>question_tags.csv.gz</i>": which associates each question ID with its correspondent tag(s).</li>
</ul>
<div>
Once you have downloaded the two files and placed them in your working directory, you can read them into R as follows:</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(readr)</span>
<span style="color: #888888;">questions <- read_csv("questions.csv.gz")</span>
<span style="color: #888888;">question_tags <- read_csv("question_tags.csv.gz")</span>
</pre>
</div>
<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">head(questions)</span>
<span style="color: #888888;">Source: local data frame [6 x 7]</span>
<span style="color: #888888;"> Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount</span>
<span style="color: #888888;"> (int) (time) (time) (time) (int) (int) (int)</span>
<span style="color: #888888;">1 1 2008-07-31 21:26:37 <NA> 2011-03-28 00:53:47 1 NA 0</span>
<span style="color: #888888;">2 4 2008-07-31 21:42:52 <NA> <NA> 418 8 13</span>
<span style="color: #888888;">3 6 2008-07-31 22:08:08 <NA> <NA> 188 9 5</span>
<span style="color: #888888;">4 8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40 42 NA 8</span>
<span style="color: #888888;">5 9 2008-07-31 23:40:59 <NA> <NA> 1306 1 57</span>
<span style="color: #888888;">6 11 2008-07-31 23:55:37 <NA> <NA> 1062 1 33</span>
</pre>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">head(question_tags)</span>
<span style="color: #888888;">Source: local data frame [6 x 2]</span>
<span style="color: #888888;"> Id Tag</span>
<span style="color: #888888;"> (int) (chr)</span>
<span style="color: #888888;">1 1 data</span>
<span style="color: #888888;">2 4 c#</span>
<span style="color: #888888;">3 4 winforms</span>
<span style="color: #888888;">4 4 type-conversion</span>
<span style="color: #888888;">5 4 decimal</span>
<span style="color: #888888;">6 4 opacity</span>
</pre>
</div>
<br />
<br />
Both datasets are pretty huge: about 15 million rows in the first and 46 million in the second. I am going to create a subset with only the questions created in 2016, which I will use for most of the analysis. Then, very importantly, I will merge it with the "question_tags" dataset using the Id variable, in order to relate tags to the other question variables. Let's also load the <i>dplyr</i> library, plus <i>lubridate</i> to work with dates.</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(dplyr)</span>
<span style="color: #888888;">library(lubridate)</span>
<span style="color: #888888;">questions_2016 <- filter(questions, year(CreationDate) == 2016)</span>
<span style="color: #888888;">#Merge questions dataset with question_tag (it duplicates questions df rows where there are multiple tags)</span>
<span style="color: #888888;">merged_df<-left_join(questions_2016,question_tags,by="Id")</span>
</pre>
</div>
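<div>
The comment in the code above about duplicated rows is worth a quick illustration, because it affects every count that follows. The toy example below uses made-up values and base R's <i>merge()</i> with <i>all.x = TRUE</i>, which behaves like the <i>left_join()</i> call here, so that it runs without any extra packages.</div>

```r
# Toy versions of the two tables (values are made up).
questions_toy <- data.frame(Id = c(1, 4, 6), Score = c(1, 418, 188))
tags_toy <- data.frame(
  Id  = c(1, 4, 4, 6),
  Tag = c("data", "c#", "winforms", "flex"),
  stringsAsFactors = FALSE
)

# Base R equivalent of left_join(questions_toy, tags_toy, by = "Id").
merged_toy <- merge(questions_toy, tags_toy, by = "Id", all.x = TRUE)

nrow(questions_toy)             # 3 questions
nrow(merged_toy)                # 4 rows: question 4 appears once per tag
length(unique(merged_toy$Id))   # still 3 distinct questions
```

<div>
So after the merge, the number of rows counts question-tag pairs, not questions. To count questions, count distinct Id values instead.</div>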
<div>
<br /></div>
<div>
<br /></div>
<h2>
<span style="font-size: large;">
What are the most popular tags?</span></h2>
<div>
Ok, let's start enjoying the <i>dplyr</i> package. Here I am going to sort tags by the number of questions they were used on, and also calculate their % share of the total (strictly speaking, of all question-tag pairs, since a question can carry several tags).</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">pop_tags_2016<-merged_df %>%</span>
<span style="color: #888888;"> select(Id,Tag) %>%</span>
<span style="color: #888888;"> count(Tag,sort=TRUE) %>% </span>
<span style="color: #888888;"> mutate(freq=paste0(round(100*n/sum(n),2),"%")) </span>
<span style="color: #888888;">View(pop_tags_2016)</span>
</pre>
</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5-ja06DB_g2JwgqRPnrthBpEolkrtB5t8ABDsm8isAQRe05E4Z1A0l_4V5ppIxUIY95qg8Pu8wXJ6fEMzip_LoSa8LWUCKTH1OT1yKmr8TuU96gH5s_wsCkGbIY0T88rfRxdWoIVDbDo/s1600/pop_tags_3years.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5-ja06DB_g2JwgqRPnrthBpEolkrtB5t8ABDsm8isAQRe05E4Z1A0l_4V5ppIxUIY95qg8Pu8wXJ6fEMzip_LoSa8LWUCKTH1OT1yKmr8TuU96gH5s_wsCkGbIY0T88rfRxdWoIVDbDo/s640/pop_tags_3years.png" width="640" /></a></div>
<br />
Notice in the image above that I've also generated a data frame with popular tags for 2014 and 2015 to be able to make a comparison.<br />
<ul>
<li>JavaScript is the language with the most questions asked by Stack Overflow users in 2016, followed by Java, Android and PHP. The most popular tags have been pretty much the same since 2014.</li>
<li>On the other hand, we can notice that the tag share is quite fragmented: the top 10 tags generate just over 20% of total questions. This is to be expected, given that users can apply any tag they like.</li>
<li>R is in 17th place in 2016 with 41050 questions. Notice that in 2014 it was in 22nd place. Pretty good result!</li>
</ul>
</div>
<div>
It's interesting to see how data science languages perform. In the code below I create a vector containing some of the <a href="http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html">most used languages in data science</a>, and then use this vector to filter the data frame of popular tags. Using pipes, I also add some ggplot code to create a bar chart.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">ds_tags<-c("r","python","sas","sql","pandas","excel","matlab")</span>
<span style="color: #888888;">library(ggplot2)</span>
<span style="color: #888888;">pop_tags_2016 %>%</span>
<span style="color: #888888;"> filter(Tag %in% ds_tags)%>%</span>
<span style="color: #888888;"> ggplot( aes(x = Tag, y = n,fill=Tag))+ geom_bar(stat="identity")</span>
</pre>
</div>
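By default the bars come out in alphabetical order of Tag. A minimal sketch, on made-up counts, of how the same kind of chart could be sorted by number of questions using reorder(); toy_counts and its numbers are invented for the example.<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(ggplot2)</span>
<span style="color: #888888;">#toy counts (made-up numbers, for illustration only)</span>
<span style="color: #888888;">toy_counts<-data.frame(Tag=c("r","python","sql"),n=c(40,200,90))</span>
<span style="color: #888888;">p<-ggplot(toy_counts,aes(x=reorder(Tag,-n),y=n,fill=Tag))+</span>
<span style="color: #888888;"> geom_col()+ #equivalent to geom_bar(stat="identity")</span>
<span style="color: #888888;"> labs(x="Tag",y="Number of questions")</span>
<span style="color: #888888;">p</span>
</pre>
</div>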
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFygsvvkmmLVTX9mGWa7LcEiY1JIicp13NC0nDSCVMO-jPDDMsEqkVT6jql4Pg0gXbMuuTxLCsrBRjb0sHCSLYUDEKjgZrfSOPs9tKdLo-rWmIPIeIcfmFRmsQ4hjp46YLGJzRBZu6GBs/s1600/pop_data_science_tags.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Most used tags used for data science" border="0" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFygsvvkmmLVTX9mGWa7LcEiY1JIicp13NC0nDSCVMO-jPDDMsEqkVT6jql4Pg0gXbMuuTxLCsrBRjb0sHCSLYUDEKjgZrfSOPs9tKdLo-rWmIPIeIcfmFRmsQ4hjp46YLGJzRBZu6GBs/s640/pop_data_science_tags.jpg" title="Stack Overflow tags 2016" width="640" /></a></div>
Python stands out with far more questions than the other languages, though of course it's a broader language that is also used for server-side web applications.<br />
<br /></div>
<br />
<h2>
<span style="font-size: large;">
How many questions have more than one tag?</span></h2>
Here I first need to count the number of distinct tags for each question ID. With the resulting data frame I can analyse the tags distribution with a simple histogram.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">distrib_tags_2016<-merged_df %>%</span>
<span style="color: #888888;"> select(Id,Tag) %>%</span>
<span style="color: #888888;"> group_by(Id) %>%</span>
<span style="color: #888888;"> summarize (n_tags=n()) %>%</span>
<span style="color: #888888;"> arrange(desc(n_tags))</span>
</pre>
</div>
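One detail worth knowing: n() counts rows per question, which equals the number of distinct tags only if no question-tag pair is repeated. A quick sketch on made-up data (toy_tags is invented, with one deliberate duplicate) showing how n_distinct() would differ in that case:<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(dplyr)</span>
<span style="color: #888888;">#toy data with one duplicated question-tag row (made up)</span>
<span style="color: #888888;">toy_tags<-data.frame(Id=c(1,1,1,2),Tag=c("r","dplyr","dplyr","python"))</span>
<span style="color: #888888;">toy_distrib<-toy_tags %>%</span>
<span style="color: #888888;"> group_by(Id) %>%</span>
<span style="color: #888888;"> summarize(n_rows=n(), #counts rows, duplicates included</span>
<span style="color: #888888;"> n_tags=n_distinct(Tag)) #counts distinct tags only</span>
<span style="color: #888888;">toy_distrib</span>
</pre>
</div>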
<br />
Let me do a sanity check by searching for the first question ID of the resulting data frame on the Stack Overflow website. Cool, last January Wayne asked a question about the flashlight in Android and he did use 5 tags.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju01wyoCt_LAfYQ6AKxN9sehe1BtS3dR9ABy0vqd6OFjUV4quc_izEMyx3h601JBru_RkgFMy6J3EnNLqLE02jj8Fbal23RP2WqPw0t26HgaUm7Xsb-otiM8N1sVNcCxnK9LcIMEWZEtY/s1600/distribution_sanity_check.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Stack Overflow question tags" border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju01wyoCt_LAfYQ6AKxN9sehe1BtS3dR9ABy0vqd6OFjUV4quc_izEMyx3h601JBru_RkgFMy6J3EnNLqLE02jj8Fbal23RP2WqPw0t26HgaUm7Xsb-otiM8N1sVNcCxnK9LcIMEWZEtY/s640/distribution_sanity_check.jpg" title="" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
By plotting a histogram you can see that <b>most questions in 2016 had 3 or 2 tags</b>. Of course at least one tag is mandatory, and I guess there is a maximum of 5 tags allowed per question. There are more questions with 5 tags than questions with only 1 tag.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(ggplot2)</span>
<span style="color: #888888;">qplot(distrib_tags_2016$n_tags,</span>
<span style="color: #888888;"> geom="histogram",</span>
<span style="color: #888888;"> binwidth = 0.5, </span>
<span style="color: #888888;"> main = "Tags per question", </span>
<span style="color: #888888;"> xlab = "Number of tags", </span>
<span style="color: #888888;"> ylab = "Number of questions",</span>
<span style="color: #888888;"> fill=I("blue"))</span>
</pre>
</div>
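As far as I know, qplot() is deprecated in recent ggplot2 releases, and since the number of tags is a whole number a bar chart fits as well as a histogram. A sketch of an alternative on made-up values (toy_distrib here is an invented stand-in, not the real data frame):<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(ggplot2)</span>
<span style="color: #888888;">#toy tag counts per question (made-up values)</span>
<span style="color: #888888;">toy_distrib<-data.frame(n_tags=c(1,2,2,3,3,3,4,5))</span>
<span style="color: #888888;">p<-ggplot(toy_distrib,aes(x=factor(n_tags)))+</span>
<span style="color: #888888;"> geom_bar(fill="blue")+ #one bar per distinct tag count</span>
<span style="color: #888888;"> labs(title="Tags per question",x="Number of tags",y="Number of questions")</span>
<span style="color: #888888;">p</span>
</pre>
</div>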
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU1Magd46Io7v9AhjlKZAR9xjzK6h09S-3g9oNb1LYxqco-2_hUzj7jLtjkKKoE-glFZujFmDZMIjMPe5_UEUNHCgq9OgM9LSifUG8buzbjvRnIHcruRsrUwlUEjvin0CLK8CtyZIRbk/s1600/tags_distribution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Number of tags per question at Stack Overflow" border="0" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU1Magd46Io7v9AhjlKZAR9xjzK6h09S-3g9oNb1LYxqco-2_hUzj7jLtjkKKoE-glFZujFmDZMIjMPe5_UEUNHCgq9OgM9LSifUG8buzbjvRnIHcruRsrUwlUEjvin0CLK8CtyZIRbk/s640/tags_distribution.png" title="Stack Overfow tags 2016" width="640" /></a></div>
Finally, to answer my original question: how many questions had more than one tag? As we can see below, of the more than 2 million questions asked in 2016, about <b>87% were tagged with more than one tag</b>.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">nrow(distrib_tags_2016)</span>
<span style="color: #888888;">[1] 2105443</span>
<span style="color: #888888;">sum(distrib_tags_2016$n_tags!=1)/nrow(distrib_tags_2016)</span>
<span style="color: #888888;">[1] 0.8718802</span>
</pre>
</div>
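The same idea extends to the whole distribution. A small sketch with base R on made-up values (the n_tags vector below is invented) that gives the % of questions for each tag count, plus the share with more than one tag:<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">#toy tag counts per question (made-up values)</span>
<span style="color: #888888;">n_tags<-c(1,2,2,3,3,3,4,5)</span>
<span style="color: #888888;">#% of questions for each number of tags</span>
<span style="color: #888888;">share_by_count<-round(100*prop.table(table(n_tags)),1)</span>
<span style="color: #888888;">share_by_count</span>
<span style="color: #888888;">#share of questions with more than one tag (mean of a logical vector)</span>
<span style="color: #888888;">mean(n_tags>1)</span>
</pre>
</div>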
<br />
<br />
<h2>
<span style="font-size: large;">
What is the overall closure rate for the site and which tags have the highest rate?</span></h2>
<div>
According to the Stack Overflow documentation, these are the <a href="http://stackoverflow.com/help/closed-questions">categories of questions that may be closed</a> by community users:<br />
<ul>
<li>duplicate</li>
<li>off topic</li>
<li>unclear</li>
<li>too broad</li>
<li>primarily opinion-based</li>
</ul>
<div>
Not everyone in the Stack Overflow community is able to close a question: users need a certain amount of reputation, expressed in points (more details <a href="http://stackoverflow.com/help/closed-questions">here</a>).</div>
<br />
Calculating the overall closure rate for the site is easy: just take the original "questions_2016" dataset and count how many questions have the "ClosedDate" field populated. <b>Over 10% of questions asked in 2016 have been closed</b> so far.</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">sum(!is.na(questions_2016$ClosedDate))/nrow(questions_2016)</span>
<span style="color: #888888;">[1] 0.1056053</span>
</pre>
</div>
<div>
<br />
With a few <i>dplyr</i> commands and the code above, we can get the closure rate by tag. Note that I keep the tags sorted by number of questions.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">closed_tags_2016<- merged_df %>%</span>
<span style="color: #888888;"> select(ClosedDate,Tag) %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarise(close_rate=sum(!is.na(ClosedDate))/n()*100, n_questions=n()) %>%</span>
<span style="color: #888888;"> arrange(desc(n_questions))</span>
<span style="color: #888888;">View(closed_tags_2016)</span>
</pre>
</div>
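Sorting by number of questions shows the big tags first, but it does not directly answer which tags have the highest rate. A hedged sketch on a made-up summary table (toy_closed and the min_questions threshold are invented): keep only tags with enough questions, then sort by closure rate, so that tiny tags with one or two closed questions do not dominate.<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(dplyr)</span>
<span style="color: #888888;">#toy per-tag summary (made-up numbers)</span>
<span style="color: #888888;">toy_closed<-data.frame(Tag=c("a","b","c","d"),</span>
<span style="color: #888888;"> close_rate=c(50,12,30,8),</span>
<span style="color: #888888;"> n_questions=c(2,5000,800,12000))</span>
<span style="color: #888888;">min_questions<-500 #arbitrary threshold, an assumption</span>
<span style="color: #888888;">top_closed<-toy_closed %>%</span>
<span style="color: #888888;"> filter(n_questions>=min_questions) %>% #drop very small tags</span>
<span style="color: #888888;"> arrange(desc(close_rate)) #highest closure rate first</span>
<span style="color: #888888;">top_closed</span>
</pre>
</div>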
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8w8ZFWqe9pPIPCmThDVsA4DFmxuyWHO-AYdmH2xPiYdk8PAu58NeNfVokUHQf0Sr1eBvjDLJXhCrzsL2ktAUUq65S9YXcPVH5IhPYCXvngSe1r6ALV8968QPGsn3SeY5Z5s7um2AcwlA/s1600/closure_rate_tags.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8w8ZFWqe9pPIPCmThDVsA4DFmxuyWHO-AYdmH2xPiYdk8PAu58NeNfVokUHQf0Sr1eBvjDLJXhCrzsL2ktAUUq65S9YXcPVH5IhPYCXvngSe1r6ALV8968QPGsn3SeY5Z5s7um2AcwlA/s400/closure_rate_tags.png" width="237" /></a></div>
<br />
<br /></div>
<div>
<br />
OK, what about the closure rate for data science questions? Which data science language questions are "more likely" to be closed? To answer this, I go back to the merged dataset (questions_2016 joined with question_tags) and filter it by the data science languages vector.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">#Subset the merged dataset (questions+tags) to include only data science tags</span>
<span style="color: #888888;">merged_df_ds<- merged_df %>%</span>
<span style="color: #888888;"> filter(Tag %in% ds_tags)</span>
<span style="color: #888888;">#check it contains only ds tags</span>
<span style="color: #888888;">unique(merged_df_ds$Tag)</span>
</pre>
</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">merged_df_ds %>%</span>
<span style="color: #888888;"> select(ClosedDate,Tag) %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarise(close_rate=sum(!is.na(ClosedDate))/n()*100, n_questions=n()) %>%</span>
<span style="color: #888888;"> arrange(desc(n_questions)) %>%</span>
<span style="color: #888888;"> ggplot( aes(x = Tag, y = close_rate,fill=Tag))+ geom_bar(stat="identity") + ggtitle("Closure rate for data science questions")</span>
</pre>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaXvc4R72NZUwOxMbef2Cp2P9DGswM-0zT2CCnjn_cWRjVIcKWtrYJoUut3ZK1vLywj53stbzxSIOtiHJduz4ZJkpCv3GjVOx8dF6FJTRIPm4q31LuFb7SCveLP53YpSD8gjepogT8Ko0/s1600/close_rate_data_science_questions.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Closure rate by tags for Stack Overflow questions" border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaXvc4R72NZUwOxMbef2Cp2P9DGswM-0zT2CCnjn_cWRjVIcKWtrYJoUut3ZK1vLywj53stbzxSIOtiHJduz4ZJkpCv3GjVOx8dF6FJTRIPm4q31LuFb7SCveLP53YpSD8gjepogT8Ko0/s640/close_rate_data_science_questions.png" title="Stack Overflow questions closure rate 2016" width="640" /></a></div>
<br /></div>
<div>
Apparently Matlab questions have the highest closure rate among data science languages. R follows with <b>nearly 15% of questions closed</b> this year. Note the good performance of Excel, Pandas and SAS (I assume a low closure rate is an indicator of relevant, good-quality questions), and also of SQL, given its high number of questions.<br />
<br />
We can also measure how quickly data science questions are closed. In the following code I compute the average number of hours between creation and closure for each tag and plot it in a bar chart.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">merged_df_ds %>%</span>
<span style="color: #888888;"> filter (!is.na(ClosedDate)) %>%</span>
<span style="color: #888888;"> mutate(difference=(ClosedDate-CreationDate)/3600) %>%</span>
<span style="color: #888888;"> select(difference,Tag) %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarize(avg_hours=round(mean(difference),2))%>%</span>
<span style="color: #888888;"> arrange(desc(avg_hours)) %>%</span>
<span style="color: #888888;"> ggplot( aes(x = Tag, y = avg_hours,label = avg_hours,fill=Tag))+ geom_bar(stat="identity")+ ggtitle("Speed (in hours) at which data science questions are closed")</span>
</pre>
</div>
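A caveat on the /3600 step, as a sketch under an assumption: if ClosedDate and CreationDate were read in as date-times (POSIXct), their difference is a difftime whose units R picks automatically, so dividing by 3600 only gives hours when those units happen to be seconds. Asking difftime() for hours explicitly avoids the ambiguity. The two timestamps below are made up.<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">#made-up timestamps, parsed as POSIXct in UTC</span>
<span style="color: #888888;">created<-as.POSIXct(c("2016-01-01 10:00:00","2016-01-02 08:00:00"),tz="UTC")</span>
<span style="color: #888888;">closed<-as.POSIXct(c("2016-01-01 13:30:00","2016-01-04 08:00:00"),tz="UTC")</span>
<span style="color: #888888;">#plain subtraction: a difftime, units chosen automatically by R</span>
<span style="color: #888888;">closed-created</span>
<span style="color: #888888;">#explicit units: always hours, as a plain number</span>
<span style="color: #888888;">hours_open<-as.numeric(difftime(closed,created,units="hours"))</span>
<span style="color: #888888;">hours_open</span>
</pre>
</div>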
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjipOKTlhrqcTI9qqCULMQiWkUFb1ZZXzNY0JynjukX4E4ruc2PDJ48MoEQdNpqgRwn4CwIAa709s7qWIaWYNlyZcJvZ3YgncxAAkfTQZBwTXYT_t-lqJOltV9ZI8EKz95KC1zNKzB40Ww/s1600/avg.hours_close_data_science_tags.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Speed to close questions at Stack Overflow" border="0" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjipOKTlhrqcTI9qqCULMQiWkUFb1ZZXzNY0JynjukX4E4ruc2PDJ48MoEQdNpqgRwn4CwIAa709s7qWIaWYNlyZcJvZ3YgncxAAkfTQZBwTXYT_t-lqJOltV9ZI8EKz95KC1zNKzB40Ww/s400/avg.hours_close_data_science_tags.png" title="Stack Overflow time to close questions 2016" width="400" /></a></div>
<br /></div>
<h2>
<span style="font-size: large;">
Which tags tend to have higher/lower score?</span></h2>
<div>
Users can either upvote or downvote questions, which means that questions can have a positive or negative score. We can see this clearly by summarizing the variable Score in the questions dataset:</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">summary(questions_2016$Score)</span>
<span style="color: #888888;"> Min. 1st Qu. Median Mean 3rd Qu. Max. </span>
<span style="color: #888888;"> -65.0000 0.0000 0.0000 0.0266 1.0000 1067.0000 </span>
</pre>
</div>
<br />
<div>
With this in mind, let's calculate the average score for each tag (as always I am ordering tags by number of questions made):<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">score_tags_2016<-merged_df %>%</span>
<span style="color: #888888;"> select(Score,Tag) %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarize(score_avg=mean(Score),n_questions=n())%>%</span>
<span style="color: #888888;"> arrange(desc(n_questions))</span>
<span style="color: #888888;">#Let's do the same for data science tags</span>
<span style="color: #888888;">score_tags_2016_ds<-merged_df_ds %>%</span>
<span style="color: #888888;"> select(Score,Tag) %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarize(score_avg=mean(Score),n_questions=n())%>%</span>
<span style="color: #888888;"> arrange(desc(n_questions))</span>
</pre>
</div>
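Because most scores sit at zero and a few are very large, the mean alone can hide what is going on. A sketch on made-up scores (toy_scores is invented) that adds the share of negative, zero and positive scores per tag, plus the median:<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(dplyr)</span>
<span style="color: #888888;">#toy scores (made-up data)</span>
<span style="color: #888888;">toy_scores<-data.frame(Tag=c("r","r","r","r","sql","sql"),</span>
<span style="color: #888888;"> Score=c(-1,0,0,3,0,2))</span>
<span style="color: #888888;">toy_sign<-toy_scores %>%</span>
<span style="color: #888888;"> group_by(Tag) %>%</span>
<span style="color: #888888;"> summarize(pct_negative=100*mean(Score < 0), #downvoted overall</span>
<span style="color: #888888;"> pct_zero=100*mean(Score==0), #no net votes</span>
<span style="color: #888888;"> pct_positive=100*mean(Score > 0), #upvoted overall</span>
<span style="color: #888888;"> score_median=median(Score))</span>
<span style="color: #888888;">toy_sign</span>
</pre>
</div>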
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG4rMUi6ozasWx3HdusE7bKlqbPYK6CphIUF93W93ukc1zchwTDLPb1mFlly0loDV6vyo-V_XpEEJGUx_kr_469Tr1TG-cYmEt8LHfdhuBGk_Ng5cHzgwK_T29HDBEmcC60WgGjHHIXyE/s1600/score_tags.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Score by tags" border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG4rMUi6ozasWx3HdusE7bKlqbPYK6CphIUF93W93ukc1zchwTDLPb1mFlly0loDV6vyo-V_XpEEJGUx_kr_469Tr1TG-cYmEt8LHfdhuBGk_Ng5cHzgwK_T29HDBEmcC60WgGjHHIXyE/s640/score_tags.png" title="Stack Overflow questions score 2016" width="640" /></a></div>
<br /></div>
<div>
Personally I expected higher average values for the score variable; it looks as if there is a general tendency not to score questions (and, to a lesser extent, to downvote them). R questions, however, show a positive score on average.<br />
<br />
I am just curious: which was the R question with the highest score in 2016 (so far)?<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">#Let's actually find the top 5 scored questions</span>
<span style="color: #888888;">merged_df_ds %>%</span>
<span style="color: #888888;"> filter(Tag=="r") %>%</span>
<span style="color: #888888;"> arrange(desc(Score)) %>%</span>
<span style="color: #888888;"> head(5)</span>
</pre>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrmo7nciQzWLK9hlr0QNxcT9u_iQKObhOszMlkKbfgax0MyNRZn3VQYxET53Dmf7DDISvE7twp6uLzH1LArUklHUVqpbzIZZ78LWU4aa7gg3Aa8P3tDMUBxK69XCgwYZ-UzwFw8i1NmkI/s1600/top5_R_scored_questions.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="R highest score question Stack Overflow" border="0" height="121" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrmo7nciQzWLK9hlr0QNxcT9u_iQKObhOszMlkKbfgax0MyNRZn3VQYxET53Dmf7DDISvE7twp6uLzH1LArUklHUVqpbzIZZ78LWU4aa7gg3Aa8P3tDMUBxK69XCgwYZ-UzwFw8i1NmkI/s640/top5_R_scored_questions.png" title="R question with highest score 2016" width="640" /></a></div>
<br />
Quite a technical question... curious what it was about? Find out <a href="http://stackoverflow.com/questions/34599027/how-exactly-does-r-parse-the-right-assignment-operator">here</a>.<br />
<br />
<br />
<h2>
<span style="font-size: large;">
What's next</span></h2>
<div>
This post was not intended to be a comprehensive analysis of Stack Overflow questions, but rather an introduction to what you can do with real data, and how easily you can explore and manipulate it, with the dplyr library. To take the StackLite analysis further, it would be interesting to understand:</div>
<div>
<ul>
<li>how much the above indicators (number of questions, tags, closure rate, score, etc.) changed over time, ideally yearly/monthly since 2008.</li>
<li>number of answers or response rate by tag.</li>
<li>tag association: which tags are most likely to be placed together in the same question? Some kind of basket analysis.</li>
<li>which tags tend to be asked about on working days vs weekends, or during working hours vs at night, etc.</li>
</ul>
</div>
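As a taste of the tag association idea, here is a rough sketch on made-up data, not a full basket analysis: joining the question-tag table to itself by Id gives every pair of tags used on the same question, which can then be counted. The names toy_tags, toy_pairs, Tag_a and Tag_b are invented, and recent dplyr versions may warn about a many-to-many join, which is expected here.<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">library(dplyr)</span>
<span style="color: #888888;">#toy question-tag pairs (made-up data)</span>
<span style="color: #888888;">toy_tags<-data.frame(Id=c(1,1,2,2,3,3,3),</span>
<span style="color: #888888;"> Tag=c("r","dplyr","r","dplyr","r","ggplot2","dplyr"),</span>
<span style="color: #888888;"> stringsAsFactors=FALSE)</span>
<span style="color: #888888;">toy_pairs<-toy_tags %>%</span>
<span style="color: #888888;"> inner_join(toy_tags,by="Id",suffix=c("_a","_b")) %>% #all tag pairs within a question</span>
<span style="color: #888888;"> filter(Tag_a < Tag_b) %>% #keep each unordered pair once, drop self-pairs</span>
<span style="color: #888888;"> count(Tag_a,Tag_b,sort=TRUE) #most frequent pairs first</span>
<span style="color: #888888;">toy_pairs</span>
</pre>
</div>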
<br />
<br /></div>
</div>
</div>
marqui, http://www.blogger.com/profile/00191739365763861374, noreply@blogger.com, 0<br />
<br />
tag:blogger.com,1999:blog-8693029506171309303.post-3350255680786490431<br />
2016-08-12T02:24:00.000+01:00, 2016-09-28T21:24:45.015+01:00<br />
Google Analytics makes Demo Account available to all<br />
<br />
Playing with GA data is much much easier now.<br />
<br />
Last week's biggest news was definitely Google making a Demo Google Analytics Account available to everyone. As the word "demo" says, its main purpose is to demonstrate all the features and reports GA offers and to serve as a learning platform for analysts. But it's actually real numbers! All the data comes from the <a href="https://www.googlemerchandisestore.com/">Google Merchandise Store</a> (which sells Google-branded merchandise), so you can apply your favorite algorithm, find valuable insights in the data and show off your analytics skills to others.<br />
<br />
Click on this link to access the <a href="https://analytics.google.com/analytics/web/demoAccount">GA Demo Account</a>.<br />
<br />
<ul>
<li>If you already have a Google Analytics account, Google will add the demo account to it (then you can access it via the Home tab in Google Analytics).</li>
<li>If you do not have a Google Analytics account, it will create one for you in association with your Google account (yes you need a Google account first) and add the demo account to it.</li>
</ul>
<br />
<br />
<h2>
<span style="font-size: large;">
What can you do with the GA Demo Account?<a name='more'></a></span></h2>
<div>
As I said, it's real data from an e-commerce site. So you will be able to see standard reports such as audience, traffic acquisition and behavior, as well as transaction data and shopping behavior throughout the visitor journey. Most advanced GA features are already implemented, including:</div>
<div>
<ul>
<li>Enhanced Ecommerce</li>
<li>Goals (there are a couple set up) and Funnels</li>
<li>Filters</li>
<li>Demographics & Interests reports</li>
<li>AdWords integrated reports</li>
<li>Search Console reports</li>
<li>Site Search data</li>
<li>Content Groupings</li>
<li>Calculated metrics</li>
</ul>
<div>
<br /></div>
<div>
As an analyst (whether new or more experienced), here are a few things you might like to do:</div>
<div>
<ul>
<li>familiarize yourself with the Admin interface and all the account/property/view features available (remember you will have just "Read & Analyze" rights, so you won't be able to make any changes).</li>
<li>dive into all the standard reports and study visitor flow throughout the website. Create your own segments, custom reports and dashboards.</li>
<li>analyse conversions and shopping behavior (Enhanced Ecommerce section).</li>
<li>if you are an educator, GA Partner or university teaching digital analytics, the GA Demo account will be your best friend in class.</li>
<li>if you are a blogger like me, you might want real e-commerce data to build proofs of concept and dashboards, or to perform powerful analyses.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1ULve26CKR5j-beP20eea3hQwAADIHozJbx99xbEvRuaPh0AfyWvspeeA_NVv9S5xxxOeo33jcgLJE-5g6dKxm4RGqPr1GbLOuOWBEC5MREbwA07JZFp6_m1YNqMm1q6-u-bpm8RvQKE/s1600/shopping_behavior_GA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="GA demo account shopping behavior " border="0" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1ULve26CKR5j-beP20eea3hQwAADIHozJbx99xbEvRuaPh0AfyWvspeeA_NVv9S5xxxOeo33jcgLJE-5g6dKxm4RGqPr1GbLOuOWBEC5MREbwA07JZFp6_m1YNqMm1q6-u-bpm8RvQKE/s640/shopping_behavior_GA.jpg" title="GA demo account available" width="640" /></a></div>
<div>
<br />
<br /></div>
</div>
<div>
<h2>
<span style="font-size: large;">
...and if you are an R user?</span></h2>
</div>
<div>
Of course you can use R to analyse the GA Demo data. It's real data from the Google Merchandise Store, so you might be interested in applying machine learning algorithms or creating beautiful visualizations and dashboards. </div>
<div>
<br /></div>
<div>
On more than one occasion in this blog I have shared examples of <a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html">GA dashboards made with R and Shiny</a>. Some readers asked me for the original dataset in order to reproduce the code, because they did not have access to any GA account. With the demo account available, it's now easy to export the data and import it into R, let's say in .csv format.</div>
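<div>
A minimal sketch of the import step, assuming the exported .csv starts with a few comment lines prefixed by "#" and then a regular header row. The file written below is a made-up stand-in, and the column names Date, Sessions and Users are only illustrative; check them against your own export.<br />
<div style="background: #f8f8f8; border: none gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">#write a tiny made-up file that mimics an export with "#" comment lines on top</span>
<span style="color: #888888;">csv_path<-tempfile(fileext=".csv")</span>
<span style="color: #888888;">writeLines(c("# ----------------------------------------",</span>
<span style="color: #888888;"> "# made-up export, for illustration only",</span>
<span style="color: #888888;"> "# ----------------------------------------",</span>
<span style="color: #888888;"> "Date,Sessions,Users",</span>
<span style="color: #888888;"> "20160801,120,100",</span>
<span style="color: #888888;"> "20160802,150,110"),csv_path)</span>
<span style="color: #888888;">#comment.char="#" makes read.csv skip the comment lines</span>
<span style="color: #888888;">ga_data<-read.csv(csv_path,comment.char="#",stringsAsFactors=FALSE)</span>
<span style="color: #888888;">#convert the yyyymmdd number into a proper Date</span>
<span style="color: #888888;">ga_data$Date<-as.Date(as.character(ga_data$Date),format="%Y%m%d")</span>
<span style="color: #888888;">str(ga_data)</span>
</pre>
</div>
</div>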
<div>
<br /></div>
<div>
As far as I have seen, due to the limited rights granted (only "Read & Analyze"), it's currently not possible to access and extract the data via the API. That would be very handy with one of the many available <a href="http://www.analyticsforfun.com/2015/10/query-your-google-analytics-data-with.html">R packages to connect to Google Analytics</a>.</div>
<div>
</div>
<div>
<br /></div>
</div>
<div>
<br /></div>
marqui, http://www.blogger.com/profile/00191739365763861374, noreply@blogger.com, 0<br />
<br />
tag:blogger.com,1999:blog-8693029506171309303.post-6444066939711163482<br />
2016-06-18T00:53:00.000+01:00, 2016-07-24T16:05:12.240+01:00<br />
Where to Live in Barcelona in a Dashboard<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8oaJK0tkv-OL9F7Og0mY99U22Y4OLo1PwCyOI8cahWd_HuTMqEUCW25nL9-xHTY6eU90L0ewvqdpK8d96ixyETvxMCqurbxz8lx8o5-U5lX6JJsNxUIj75-yY5U9HcVJGCeMPHuY7lF4/s1600/Barcelona+dashboard+Tableau+Public.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Barcelona best barrio visualization" border="0" height="465" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8oaJK0tkv-OL9F7Og0mY99U22Y4OLo1PwCyOI8cahWd_HuTMqEUCW25nL9-xHTY6eU90L0ewvqdpK8d96ixyETvxMCqurbxz8lx8o5-U5lX6JJsNxUIj75-yY5U9HcVJGCeMPHuY7lF4/s640/Barcelona+dashboard+Tableau+Public.jpg" title="Dashboard of Barcelona to choose where to live in" width="640" /></a></div>
<br />
<br />
Sometimes data can tell a story much faster and more effectively than many words. That's why I've decided to start sharing more data stories via this blog, hoping to both:<br />
<br />
<ol>
<li>address specific topics readers want to dive into (often these will not be data people; they will be new to my blog, probably arriving after googling a specific question, e.g. "which are the best boroughs to live in Barcelona?").</li>
<li>showcase data visualization tools and best practices for presenting your data (these are data people, yes, you, my regular readers, who might like to see a tool in action).<a name='more'></a></li>
</ol>
I will mainly make use of <a href="https://public.tableau.com/s/">Tableau Public</a> to build and share the data visualizations. Tableau Public is basically the free version of Tableau Software. You can take advantage of the great analytics power offered by Tableau, make amazing visualizations quickly and share them on the web (remember that once published they are all freely accessible to everybody). There is a huge community of people sharing stories via Tableau Public (take a look at the <a href="https://public.tableau.com/s/gallery">Tableau Public Gallery here</a>) and it is very popular in data journalism.<br />
<br />
Back to this post, this quick data viz story is about <b>the city of Barcelona and its boroughs</b>. As the title says, the dashboard aims to help users explore and choose the best borough to live in, given a set of variables. These are:<br />
<br />
<ul>
<li>rent price (avg. rent price for a 50 square meter apartment)</li>
<li>sales price (avg. sales price for a 50 square meter apartment) </li>
<li>safety </li>
<li>green areas </li>
<li>a short tourism description</li>
</ul>
The dashboard is composed of three viz:<br />
<br />
<ol>
<li>A map showing how Barcelona is divided into its 10 main boroughs ("distritos"). The colour scale indicates the average rent price: the darker the colour, the more expensive it is to rent in that area. If you click on any of the boroughs, the corresponding tourism description will show up in the section below.</li>
<li>A vertical bar chart which lets you compare boroughs by different variables using a drop-down filter. </li>
<li>A bottom section containing the tourism description of the borough selected in the map.</li>
</ol>
<div>
The sources of the data are listed under "data sources". </div>
<br />
<br />
Below is the viz; I hope you enjoy it. If you have any issues, you can view it <a href="http://public.tableau.com/profile/marco.pasin#!/vizhome/BarcelonaDashboard/DASHBOARD">here in my Tableau Public profile</a>.<br />
<br />
<script src="https://public.tableau.com/javascripts/api/viz_v1.js" type="text/javascript"></script><br />
<div class="tableauPlaceholder" style="height: 895px; width: 1004px;">
<noscript><a href='http://www.analyticsforfun.com/2016/06/where-to-live-in-barcelona-in-dashboard.html'><img alt=' ' src='https://public.tableau.com/static/images/Ba/BarcelonaDashboard/DASHBOARD/1_rss.png' style='border: none' /></a></noscript><object class="tableauViz" height="895" style="display: none;" width="1004"><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='BarcelonaDashboard/DASHBOARD' /><param name='tabs' value='yes' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ba/BarcelonaDashboard/DASHBOARD/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='showTabs' value='y' /></object></div>
<br />
<br />
I know it's quite a simplified model in terms of the variables considered. Choosing where to live is a complex decision and you'll probably take into account many more factors, like proximity to your job/school, public transport, nightlife, etc. <br />
<br />
But, as in <a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html">previous posts about dashboards</a>, I like proofs of concept (POC), and this is meant to be mostly a proof of concept. <i>Can an interactive data visualization help you choose the best place to live in a city?</i> I think so. At least it can narrow down your focus in your initial search for a house, especially when you are new to the city. Often you need to check out many websites before you can get an idea of how a city is structured. A dashboard equipped with relevant information can help you explore a city much more quickly.<br />
<br />
We need more dashboards about cities!<br />
<br />
<b>Why Barcelona?</b> It's one of my favourite cities. I have been to Barcelona 5 times, always as a tourist, and if you ask me which city I would like to move to in the future... guess which one?<br />
<br />
<br />
PS: would you like to enrich this dashboard with more relevant info? Feel free to suggest other variables and sources of data and I will try to include them.<br />
<br />
marqui, http://www.blogger.com/profile/00191739365763861374, noreply@blogger.com, 0<br />
<br />
tag:blogger.com,1999:blog-8693029506171309303.post-3624141186405531671<br />
2016-03-28T03:19:00.000+01:00, 2016-04-22T18:09:46.226+01:00<br />
Enhance your Blog Measurement with these Google Analytics Calculated Metrics<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2T8LSpckkpG7z-o2x325KMRQeh8JmuSg_nGyWN_DlxNoi3VmrcN39131X7_ZI0Pa4-JyJR5ez6aLvHLRThnuP4DVDQyQoVvzWp6XKzwjJLbw8I58pT_AOskOAaz0K9oGl1lFsMPMOG8s/s1600/Calculated+Metrics+GA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Calculated Metrics in GA" border="0" height="236" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2T8LSpckkpG7z-o2x325KMRQeh8JmuSg_nGyWN_DlxNoi3VmrcN39131X7_ZI0Pa4-JyJR5ez6aLvHLRThnuP4DVDQyQoVvzWp6XKzwjJLbw8I58pT_AOskOAaz0K9oGl1lFsMPMOG8s/s400/Calculated+Metrics+GA.jpg" title="" width="400" /></a></div>
<br />
Google Analytics has recently incorporated a new powerful feature that offers more flexibility for measuring your own business objectives. I am talking about <a href="https://support.google.com/analytics/answer/6121409?hl=en">calculated metrics</a>.<br />
<br />
In this post I am going to suggest a list of calculated metrics that you can easily configure in Google Analytics to better measure your blog performance.<br />
<br />
As a blogger, when it comes to measuring the performance of my content, I am very focused on <b>reader engagement</b> with the content I publish. Also, I am constantly looking to <b>increase my reader base</b>, giving my blog more exposure and acquiring new subscribers. Here is an outline of <a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html">my measurement plan using Google Analytics</a> (I highly recommend this read if you are new to the concept of a digital measurement plan).<br />
<br />
The new calculated metrics feature gives me the opportunity to<b> customize my own measurement plan</b>. How?<br />
<a name='more'></a><br />
<br />
One of the immediate advantages of calculated metrics is that you can (finally) start switching your focus from sessions (most current GA reports are session-based) to users. Yes, users: the actual people who read your blog. I think this is a fundamental mindset change in your blog marketing strategy, which will let you focus on long-term objectives (people) rather than one-off visits (sessions).<br />
<br />
Think about this: will a user land on a post and sign up straight away to your email list? It might happen if your article is really interesting and unique. But in many cases the person needs to know you better: he might enter the site multiple times and check a couple more articles before deciding that it's really worth leaving you his email address. That's where user-based metrics can give you a better understanding of your audience. <br />
<br />
Ok, with these introductory concepts in mind, let's list a couple of ideas of metrics you could incorporate into your blog measurement plan.<br />
<br />
<br />
<h2>
<span style="font-size: large;">Sessions per User</span></h2>
<br />
With this metric you will be able to figure out how many sessions your users have on average. It might be interesting to segment this metric by traffic channel, device, etc. to spot differences in user behavior.<br />
<br />
Formula: {{Sessions}} / {{Users}}<br />
<br />
<br />
<h2>
<span style="font-size: large;">User Conversion Rate</span></h2>
<br />
The classic conversion rate we are all familiar with is based on sessions: CR = Goal Completions / Sessions. Here is where we really start to shift our mindset to a user-centered approach as opposed to session-based analytics. The formula is basically the same; we only need to replace the denominator with the total number of users. The number format will be percentage in this case.<br />
<br />
Formula: {{Goal Completions}}/{{Users}}<br />
<br />
You can use any of your configured Goals in the formula. Or you can even sum more than one goal (if it makes sense) and calculate the overall conversion rate. In my case I am using a single goal which is tracking subscriptions to the blog email newsletter.<br />
<br />
And again, start by running a simple traffic acquisition report and compare the "old" conversion rate with your brand new user conversion rate. As expected, the latter will be higher, which reinforces the idea that not all users convert the first time. For some traffic sources, you will notice that the difference between the two conversion rates is quite large. What insight can you get from that?<br />
<br />
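To make the comparison concrete, here is a minimal R sketch. The channel names and all the counts are made up for illustration (they are not taken from my account); the two formulas are the ones described above.<br />

```r
# Hypothetical traffic data: every value here is invented for illustration.
traffic <- data.frame(
  channel          = c("Organic Search", "Referral", "Social"),
  sessions         = c(1200, 400, 300),
  users            = c(800, 350, 150),
  goal_completions = c(24, 7, 6)
)

# Classic session-based conversion rate: {{Goal Completions}} / {{Sessions}}
traffic$session_cr <- traffic$goal_completions / traffic$sessions

# User-based conversion rate: {{Goal Completions}} / {{Users}}
traffic$user_cr <- traffic$goal_completions / traffic$users

# Sessions per user: {{Sessions}} / {{Users}}
traffic$sessions_per_user <- traffic$sessions / traffic$users

print(traffic)
```

Note that the user conversion rate equals the session conversion rate multiplied by sessions per user, so channels where people come back more often show the biggest gap between the two rates.<br />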
<br />
<h2>
<span style="font-size: large;">Conversion Rate for users who have not bounced</span></h2>
<br />
Another idea to better understand your readers is calculating the conversion rate based only on sessions where your readers did not bounce. In this way we filter out users who did not engage with the content (they bounced for some reason). In other words, this calculated metric focuses exclusively on engaged users.<br />
<br />
To achieve that, you just need to remove the number of bounces from the total number of sessions in the denominator.<br />
<br />
Formula: {{Goal Completions}}/({{Sessions}}-{{Bounces}})<br />
<br />
<br />
<h2>
<span style="font-size: large;">Non-Bounce Rate </span></h2>
<br />
Tired of the classic bounce rate metric? You might want a metric that focuses on engaged users rather than on visitors who did not perform any action on your site. <br />
<br />
Formula: ( {{Sessions}} - {{Bounces}} ) / {{Sessions}}<br />
<br />
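The two bounce-related formulas above can be checked quickly in R. This is only a sketch with invented totals; the variable names are my own.<br />

```r
# Hypothetical totals for one reporting period (invented for illustration).
sessions         <- 1000
bounces          <- 650
goal_completions <- 14

# Non-Bounce Rate: ( {{Sessions}} - {{Bounces}} ) / {{Sessions}}
non_bounce_rate <- (sessions - bounces) / sessions

# Conversion rate for non-bounced sessions:
# {{Goal Completions}} / ( {{Sessions}} - {{Bounces}} )
engaged_cr <- goal_completions / (sessions - bounces)

# The non-bounce rate is simply the complement of the classic bounce rate
bounce_rate <- bounces / sessions

cat("Non-bounce rate:", non_bounce_rate, "\n")
cat("Conversion rate (non-bounced sessions):", engaged_cr, "\n")
```

With these numbers only 350 sessions are "engaged", so the same 14 goal completions look very different than when they are divided by all 1000 sessions.<br />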
<br />
<h2>
<span style="font-size: large;">Percent of Users who didn't Complete a Goal </span></h2>
<br />
Let's say you have a goal set up in GA that tells you whether a visitor engaged with the content or not (e.g. he spent more than 2 minutes on the site or he visited at least 3 pages). It might be useful to rank landing pages by the percent of users who did not complete that engagement goal in order to start optimizing your blog.<br />
<br />
Formula: ( {{Users}} - {{Goal Completions}} ) / {{Users}}<br />
<br />
<br />
<h2>
<span style="font-size: large;">Non-Landing Page Popularity</span></h2>
<br />
I love this metric because it allows you to focus on understanding and improving the page navigation paths of your blog. What this calculated metric does is remove the pageviews generated when users land on your website, in order to show you your most popular "non-landing pages". The formula explains the concept much better.<br />
<br />
Formula: {{Unique Pageviews}} – {{Entrances}}<br />
<br />
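As a rough illustration of how this metric re-ranks pages, here is a small R sketch. The page paths and counts are hypothetical.<br />

```r
# Hypothetical page-level data (invented for illustration).
pages <- data.frame(
  page             = c("/post-a", "/post-b", "/about"),
  unique_pageviews = c(500, 300, 120),
  entrances        = c(450, 100, 20),
  stringsAsFactors = FALSE
)

# Non-Landing Page Popularity: {{Unique Pageviews}} - {{Entrances}}
pages$non_landing_popularity <- pages$unique_pageviews - pages$entrances

# Rank pages by how often they are reached from inside the blog
ranked <- pages[order(-pages$non_landing_popularity), ]
print(ranked)
```

In this made-up example the page with the most unique pageviews ends up last, because almost all of its views are entrances.<br />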
If you want to go deeper into navigation paths, I recommend using the "Navigation Summary" tab within the Behavior>Site Content>All Pages report. Here you will be able to study paths from specific landing pages.<br />
<br />
<br />
<br />
<h1>
<span style="font-size: x-large;">How to Configure Calculated Metrics in Google Analytics</span></h1>
<br />
The best part of this new feature is that Google Analytics makes it super easy to create new calculated metrics.<br />
<br />
First of all, make sure you have Edit rights on your Google Analytics account (if you own the blog you should not have any issue). Then enter the Admin section and click on the "Calculated metrics" option within the View column of your account. <br />
<br />
Click on the New Calculated Metrics button and name your metric (this is how it will appear within your GA reports). Note that the "calcMetric" field below will populate automatically: this is the GA API metric name, and you will use it in case you need to query the calculated metric via the API.<br />
<br />
Finally, choose the appropriate formatting type (Float, Integer, Currency, Time, Percent) and enter the actual formula for your calculated metric. As you start typing, GA will suggest the corresponding metrics. Note that each metric will be enclosed in double curly brackets (e.g. {{Users}}).<br />
<br />
In the below picture you can see as an example how I created the "Non-Landing-Page Popularity" metric.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6ejvCHZwRV4DQ_Tm4NuGVOjkB-ts82hSIAFCEB_ZBk5PG-SqzamdTo5ewfGRbSEIgsQgMIbiJEZxYNaaTat1jjlcno7Z7VOd-n7JAivPxQVxBKCjhLi3q8duVMMEMCmU_dWcVb1gvKG4/s1600/Non-Landing-Page+Popularity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Google Analytics Calculated Metrics for Blogs" border="0" height="321" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6ejvCHZwRV4DQ_Tm4NuGVOjkB-ts82hSIAFCEB_ZBk5PG-SqzamdTo5ewfGRbSEIgsQgMIbiJEZxYNaaTat1jjlcno7Z7VOd-n7JAivPxQVxBKCjhLi3q8duVMMEMCmU_dWcVb1gvKG4/s640/Non-Landing-Page+Popularity.jpg" title="Non-Landing-Page Popularity" width="640" /></a></div>
<br />
<br />
As you have seen, creating a calculated metric in Google Analytics is just a matter of minutes. Let me conclude this post with a couple of final notes that should clarify any doubts you might have about calculated metrics:<br />
<br />
<ul>
<li>Calculated metrics are custom calculations that can be made from either standard metrics that Google Analytics already provides (e.g. Sessions, Pageviews, etc.), or custom ones that you implemented on your own (Goals or even Custom Metrics you set up previously).</li>
<li>Calculated metrics are retroactive since they do not modify the underlying view data you have already set up. With a standard GA account (not Premium), you are allowed to set up a maximum of 5 metrics per view. </li>
<li>If you don't like your calculated metric, no worries you can modify it or delete it and replace it with another one.</li>
<li>In order to see calculated metrics and combine them with your available GA dimensions, you will need to create a Custom Report. Building custom reports (and custom segments) is a fundamental skill that will allow you to take your analysis in GA to a more powerful level. So, if you have not done it before, I recommend you start playing with them right now. Here is a great <a href="https://www.youtube.com/watch?v=A7bD_Lbgu7U">video tutorial on how to build custom reports in GA</a>.</li>
</ul>
<br />
<br />
Can you suggest any other metrics for measuring content engagement and blog post performance? Feel free to add yours; I aim to update this post as I get new ideas. Thank you for reading.<br />
<div>
<br /></div>
marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-85560146983504764932016-02-08T23:17:00.000+00:002016-02-08T23:25:31.281+00:00What happens when you have outliers in your data?<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwgRGWw4G1U7dR8zPybrjnQn8CuaY3OYZVojlsUEJ4BOGoVVfWSS2XtJqaSRzf3aDTrdGUBHPg5kNM62oYUlHptgXzhyphenhyphenmj4aO2RivMCQkHvpYCVXutoqvzcLeMKFl9X1wmsOpnMFbIcCU/s1600/Outlier_sheep.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="282" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwgRGWw4G1U7dR8zPybrjnQn8CuaY3OYZVojlsUEJ4BOGoVVfWSS2XtJqaSRzf3aDTrdGUBHPg5kNM62oYUlHptgXzhyphenhyphenmj4aO2RivMCQkHvpYCVXutoqvzcLeMKFl9X1wmsOpnMFbIcCU/s320/Outlier_sheep.jpg" width="320" /></a></div>
<br />
In this post I am going to talk briefly about outliers and the effect they might have on your data. With an example of course. Let's start with defining the word "outlier": <i>what is an outlier in math/statistics?</i><br />
<br />
<blockquote class="tr_bq">
An outlier is basically a number (or data point) in a set of data that is either way smaller or way bigger than most of the other data points.</blockquote>
<br />
Let's go through a practical example in order to understand <b>the implications of having an outlier within your data set</b>.<br />
<a name='more'></a><br />
<br />
Say we have a sample data set like the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFVi5dK9TpCPTyG2nRQVADb7P7UeTPKms5gBprCx9uSDOwhwb9KiQWS5JPAGP0pNcI4egGazGtx-WUZEmJf-WBj1U2tyDuT9GYCKJa7cgTmZr8goKxqnCbt6Qb3H3_IvEdQHrDtcRVa2o/s1600/sample_data_set.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="65" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFVi5dK9TpCPTyG2nRQVADb7P7UeTPKms5gBprCx9uSDOwhwb9KiQWS5JPAGP0pNcI4egGazGtx-WUZEmJf-WBj1U2tyDuT9GYCKJa7cgTmZr8goKxqnCbt6Qb3H3_IvEdQHrDtcRVa2o/s400/sample_data_set.jpg" width="400" /></a></div>
<br />
<br />
For this data set I can easily calculate the mean which is 4.3:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieQhVI_1EKiNc3ULCLQbLTAleLc4XK2einDfEq0jEF-BY-QdvWzNyP2JZkXE9ZNby1WKhumeQ3iVeydpspkqv4ntSCMaYXbsuIBRWv1NPBtOIvz_CtUwNrita8iBCSMAbke0uikTuQ-V4/s1600/mean_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="167" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieQhVI_1EKiNc3ULCLQbLTAleLc4XK2einDfEq0jEF-BY-QdvWzNyP2JZkXE9ZNby1WKhumeQ3iVeydpspkqv4ntSCMaYXbsuIBRWv1NPBtOIvz_CtUwNrita8iBCSMAbke0uikTuQ-V4/s400/mean_data.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
I can also find the median which represents the middle value of the distribution. In our case, since there are two middle values I can average them and get a median of 4.5.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD1Y1SthmpiJZnVnB6ILR20_WkHtB2QsPwubYI3kZ-37xxQNwLmHiLd5SMi3kGnRvuZLBKBhRSOX6lhevb_vLLr_7OPdRmlQclKkVFWV2LHjt_fWMP6moT4C43uDrTkopxsyiVWYxZRxY/s1600/median_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD1Y1SthmpiJZnVnB6ILR20_WkHtB2QsPwubYI3kZ-37xxQNwLmHiLd5SMi3kGnRvuZLBKBhRSOX6lhevb_vLLr_7OPdRmlQclKkVFWV2LHjt_fWMP6moT4C43uDrTkopxsyiVWYxZRxY/s400/median_data.jpg" width="400" /></a></div>
<br />
<br />
And I can also figure out the mode, which is 5 since this is the most frequent value in the distribution.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjoOg2YmU19o6-F9bf9fBSloaVo9lj-ZhgFuEwtcdV1KFIpo4MPGEC4g6W8FSrjyD9Novn2MAZGCCpckHy7Mc84xUo2QmzZp8wBcp8PUptkS4Vt-MwajcdEk7E8pzqubkcJnl3DTvJ0Hs/s1600/mode_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="102" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjoOg2YmU19o6-F9bf9fBSloaVo9lj-ZhgFuEwtcdV1KFIpo4MPGEC4g6W8FSrjyD9Novn2MAZGCCpckHy7Mc84xUo2QmzZp8wBcp8PUptkS4Vt-MwajcdEk7E8pzqubkcJnl3DTvJ0Hs/s400/mode_data.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
Finally, let's calculate the standard deviation, which shows how much my data are spread out around the mean (remember that the standard deviation is the square root of the variance).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAyfuk0P-rnftiEMujk2FohJvG3cTL0gqrYnOKK4kmK2aVfoJ1JXn7H7jraGFLhJdI1nh-fH_82enKnUIPJBFdrdFGoOq0njBg0a4hn3EhHf5pt3LwdbigoMrubxZiDCT7pSXwdyy3Yh4/s1600/standard_deviation_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAyfuk0P-rnftiEMujk2FohJvG3cTL0gqrYnOKK4kmK2aVfoJ1JXn7H7jraGFLhJdI1nh-fH_82enKnUIPJBFdrdFGoOq0njBg0a4hn3EhHf5pt3LwdbigoMrubxZiDCT7pSXwdyy3Yh4/s400/standard_deviation_data.jpg" width="400" /></a></div>
<br />
<br />
Cool, we now know the mean, median, mode and standard deviation for our sample data set:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxush__vGYbZl5YKiR7OTZbxydKO5xoz9uvlKhtPEIZKqiORjLCFf6Urzj_Wnm-1f-9JECPJw8dP8raagC-2CXSFUUhQ6G9MzEceqBiR3k6IGR5mgSOr4K9Cjou_O_Epgm1fFDnp1q19s/s1600/statistics_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxush__vGYbZl5YKiR7OTZbxydKO5xoz9uvlKhtPEIZKqiORjLCFf6Urzj_Wnm-1f-9JECPJw8dP8raagC-2CXSFUUhQ6G9MzEceqBiR3k6IGR5mgSOr4K9Cjou_O_Epgm1fFDnp1q19s/s400/statistics_data.jpg" width="400" /></a></div>
<br />
<br />
All right, let's now make a change to our data set. Imagine removing the last data point, 6, and replacing it with a much bigger value like 600... yep, <b>an outlier</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqAgWuPzDvU2h6sW7cCYaSZoNb3ttKE5o5_eXCDEwJijs5LcD4EFODWVOo05RyPqSzaW1tsRVNw-d7IRuVfEv9tHxan-nh8XpMBWf8hFBPJUkTYPvMYq2Bo5yunD2_hfQedzyqi22qcVo/s1600/sample_data_outlier.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqAgWuPzDvU2h6sW7cCYaSZoNb3ttKE5o5_eXCDEwJijs5LcD4EFODWVOo05RyPqSzaW1tsRVNw-d7IRuVfEv9tHxan-nh8XpMBWf8hFBPJUkTYPvMYq2Bo5yunD2_hfQedzyqi22qcVo/s400/sample_data_outlier.jpg" width="400" /></a></div>
<br />
See now what happens when we calculate the mean, median, mode and standard deviation again. <b>The new mean is much higher</b>, 63.7! As expected, the standard deviation is much higher too. On the other hand, <b>median and mode remain exactly the same</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic840BGA9VGsDqjPouANWryAcqHZZyjPWrzqAU33HSQy2LYM2WVtAJgPSDvcIEnOLhw15-RgYcogtpghdTt6OMZaUlPVzBUDbqcZO21VS76ZipDEETU3uNXpNRy9pff3D6SOBOwi6Eqoo/s1600/mean_outlier.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEic840BGA9VGsDqjPouANWryAcqHZZyjPWrzqAU33HSQy2LYM2WVtAJgPSDvcIEnOLhw15-RgYcogtpghdTt6OMZaUlPVzBUDbqcZO21VS76ZipDEETU3uNXpNRy9pff3D6SOBOwi6Eqoo/s400/mean_outlier.jpg" width="367" /></a></div>
<br />
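If you prefer to check this in R, here is a minimal sketch. The original data set only appears in the images, so the ten values below are hypothetical: I picked them to be consistent with the statistics reported in this post (mean 4.3, median 4.5, mode 5, last data point 6). The helper names <i>stat_mode</i> and <i>describe</i> are my own.<br />

```r
# Hypothetical data chosen to match the statistics reported in the post.
x <- c(2, 3, 3, 4, 4, 5, 5, 5, 6, 6)

# Base R has no built-in statistical mode, so define a small helper
# that returns the most frequent value.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

describe <- function(v) {
  # Note: sd() uses the sample (n - 1) formula, so it may differ slightly
  # from a population standard deviation computed by hand.
  c(mean = mean(v), median = median(v), mode = stat_mode(v), sd = sd(v))
}

# Replace the last data point (6) with an outlier (600)
x_outlier <- x
x_outlier[length(x_outlier)] <- 600

print(round(rbind(original = describe(x), with_outlier = describe(x_outlier)), 1))
```

The mean jumps from 4.3 to 63.7 and the standard deviation grows with it, while the median (4.5) and the mode (5) do not move.<br />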
<br />
So, this is what happens if you have outliers. <b>Outliers skew the data when you are trying to do any type of average</b>. <i>What can you do then if you need to get a measure of central tendency?</i><br />
<br />
How to deal with outliers really depends on each specific situation. What is certain, anyway, is that most statistical measures like means, standard deviations, correlations, etc. can be strongly influenced by outliers, and you might end up with an incorrect analysis. Generally you can follow two different strategies:<br />
<br />
<ol>
<li><b>Remove the outliers</b>, and analyse your data set without them. In that case the mean is no longer distorted and you can use it as a measure of central tendency.</li>
<li><b>Do not use the mean</b>. In this case you keep the outliers, but since they would change the mean a lot, you can instead use other measures of central tendency like the median or the mode.</li>
</ol>
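Both strategies can be sketched in a few lines of R. The data are hypothetical, and the 1.5 × IQR rule used here to flag outliers is just one common rule of thumb that I am assuming for the example; it is not the only way to identify them.<br />

```r
# Hypothetical data with one obvious outlier (invented for illustration).
x <- c(2, 3, 3, 4, 4, 5, 5, 5, 6, 600)

# Flag points lying more than 1.5 * IQR beyond the quartiles
# (a common rule of thumb, assumed here for the example).
q <- unname(quantile(x, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- x < (q[1] - fence) | x > (q[2] + fence)

# Strategy 1: remove the flagged points, then use the mean
mean_without_outliers <- mean(x[!is_outlier])

# Strategy 2: keep every point, but report the median instead
median_all <- median(x)

cat("Flagged as outliers:", x[is_outlier], "\n")
cat("Mean without outliers:", round(mean_without_outliers, 2), "\n")
cat("Median of all points:", median_all, "\n")
```

Whichever route you take, keeping the <i>is_outlier</i> flags around makes it easy to document which points were treated as outliers.<br />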
<div>
In either case, I think it's important to <b>report in your analysis</b> that you identified outliers and what you decided to do with them. <i>Why did you drop them? Why did those values happen to be out there? Was it likely to be a data entry mistake? What were your assumptions?</i> </div>
marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-56011977855379107022016-01-18T02:20:00.000+00:002016-04-24T01:06:47.504+01:00Scheduling R Markdown Reports via Email<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8ZHNLe3UAErwcmZ2B8mU-LSDmT7fztOc84yTdxK-RNSIu7xxMx6aFvyOO_b0K3GEFuVeKnFrQ_Xn62PR6Wj4qsjy7r5SZZUpWGc0ajYWwsOdoXOGMG1_9FBieDCuNdxBKvzCBujm3pVM/s1600/GA+hml+report+-+post+cover_1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="GA markdown report using R" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8ZHNLe3UAErwcmZ2B8mU-LSDmT7fztOc84yTdxK-RNSIu7xxMx6aFvyOO_b0K3GEFuVeKnFrQ_Xn62PR6Wj4qsjy7r5SZZUpWGc0ajYWwsOdoXOGMG1_9FBieDCuNdxBKvzCBujm3pVM/s320/GA+hml+report+-+post+cover_1.jpg" title="Scheduling R Markdown Reports via Email" width="303" /></a></div>
<a href="http://rmarkdown.rstudio.com/">R Markdown</a> is an amazing tool that allows you to blend bits of R code with ordinary text and produce well-formatted data analysis reports very quickly. You can export the final report in many formats like HTML, PDF or MS Word, which makes it easy to share with others. And of course, you can modify or update it with fresh data very easily.<br />
<br />
I have recently been using R Markdown to pull data from various data sources such as the Google Analytics API and a MySQL database, perform several operations on it (merging, for example) and present the outputs with tables, visualizations and insights (text).<br />
<br />
<i>But what about automating the whole report generation and emailing the final report as an attached document every month at a specific time?</i> <br />
<a name='more'></a>In this post I am going to explain <b>how to do it in Windows</b>. If you search on Google, you will find several threads on Stack Overflow and a few good posts on the topic. However, it took me some time to get it working, and I had to try different options first. That's why I am writing this quick tutorial, including screenshots, hoping it helps you get your report automated faster!<br />
<br />
<h2>
<span style="font-size: large;">
1. Create your Rmarkdown report</span></h2>
In RStudio create a new Rmarkdown document where you will enter your R code and text. Mine is called "Schedule_Report.Rmd" and here is what it does:<br />
<br />
<ul>
<li>retrieve some data from Google Analytics API using the <a href="https://github.com/Tatvic/RGoogleAnalytics">RGoogleAnalytics</a> library</li>
<li>turn dates into a more friendly format</li>
<li>create a trend line chart of sessions using the ggplot2 package </li>
</ul>
<br />
A very basic report. Remember that in Rmarkdown you can decide whether to show each chunk of code or not. I showed just the final outputs that are the table and the bar chart.<br />
<br />
<h2>
<span style="font-size: large;">
2. Create an R script that executes and emails your Rmarkdown report </span></h2>
Create a new R script which will:<br />
<br />
<ul>
<li>locate your Rmarkdown document (set the working directory to where your report is located)</li>
<li>generate an HTML file (or pdf, MS Word) from your Rmarkdown document</li>
<li>send the HTML file via email</li>
</ul>
<br />
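As a rough text sketch of just the rendering step (this is not the exact script from my screenshot), something like the following should work. It assumes the <i>rmarkdown</i> package and pandoc are installed; the folder path is a placeholder and <i>render_report</i> is a name I made up for the example.<br />

```r
# Sketch only: the folder below is a placeholder, not a real path.
report_dir  <- "C:/path/to/report"      # hypothetical location of the .Rmd
report_file <- "Schedule_Report.Rmd"

render_report <- function(dir, file) {
  # rmarkdown::render() knits the document and returns the output file path
  rmarkdown::render(
    input         = file.path(dir, file),
    output_format = "html_document"
  )
}

# Uncomment to run (requires the rmarkdown package and pandoc):
# html_file <- render_report(report_dir, report_file)
# ...then pass html_file as the attachment to your mail package.
```

Using a full path built with <i>file.path()</i> is an alternative to setting the working directory, which can be handy when the script is started by a scheduler rather than from RStudio.<br />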
To email the report I have used the <i>gmailR</i> library, which allows you to generate and send emails directly from R. To make sure the <i>gmailR</i> library will work, you might first need to enable the "Less secure apps" option in your Google account. Open your personal Google account, go to the Sign-in and Security section, scroll down to the bottom of the page and switch on "Allow less secure apps".<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5Q0vVQzmT97pBs76IF4GNpnj8O2kenaBMoLBLmkoubcG6RKJdzsX0Nj3yZ8x2qAHu0lw6REi-0dbBKHP1C-mAn4mZSqetSXNpL-adPML2Zt1q3qxUuLz1dS1xalA4umSatpjKpxh_WeI/s1600/less+secure+apps+-+google.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5Q0vVQzmT97pBs76IF4GNpnj8O2kenaBMoLBLmkoubcG6RKJdzsX0Nj3yZ8x2qAHu0lw6REi-0dbBKHP1C-mAn4mZSqetSXNpL-adPML2Zt1q3qxUuLz1dS1xalA4umSatpjKpxh_WeI/s320/less+secure+apps+-+google.jpg" width="320" /></a></div>
<br />
I also made a few attempts with the mailR package, but without success. I guess this was because of security issues with my Google account (I use Gmail). Anyway, the <i>gmailR</i> package worked perfectly, so I stuck with it! Here is the code contained in my R script, which is named "Script.R".<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsVkBLEH0VcJWbhl1XlVhT9vGDRN7p521PKNGNg1ImcjgFJ6FINlfZ-FjjBNiOIwuT1rKX2tFSr377OX-2DbyA3CZFswKou_BQ5_doN6iuiSqrIpy1SV-B4vcFz6rNfsJxA1Sw4SvLGBI/s1600/schedule_script.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="R script scheduling reports" border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsVkBLEH0VcJWbhl1XlVhT9vGDRN7p521PKNGNg1ImcjgFJ6FINlfZ-FjjBNiOIwuT1rKX2tFSr377OX-2DbyA3CZFswKou_BQ5_doN6iuiSqrIpy1SV-B4vcFz6rNfsJxA1Sw4SvLGBI/s400/schedule_script.jpg" title="" width="400" /></a></div>
<br />
<br />
<h2>
<span style="font-size: large;">
3. Schedule a task in Windows</span></h2>
From the main Windows menu, go to Programs>Accessories>System Tools>Task Scheduler (at least this is the path in my Windows edition). The task scheduler will open up:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoZMVIJqJO4rl8hUBvfAlZd_TQRsRecmLDGUpnniQbHRBbydEwT_JyxtfVa7SeTYBOcEQ2jN-eEJKnUrMd4fvpLUzt1-YhxVfDQUB557S54vCEh6gWOIGXNAf860X2aObuz29gqAuj7Ds/s1600/task+scheduler+windows.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoZMVIJqJO4rl8hUBvfAlZd_TQRsRecmLDGUpnniQbHRBbydEwT_JyxtfVa7SeTYBOcEQ2jN-eEJKnUrMd4fvpLUzt1-YhxVfDQUB557S54vCEh6gWOIGXNAf860X2aObuz29gqAuj7Ds/s320/task+scheduler+windows.jpg" width="320" /></a></div>
<br />
<br />
Click on Action>Create basic task. Type a name for your task and add a short description if you like. Now select the trigger which means how often you want the task to be executed (to try it first I recommend choosing "One time"). Select the date and time and on the action field choose "Start a program".<br />
<br />
In the "Start a Program" step, complete the fields as follows:<br />
<br />
><i>Program/Script</i>: the directory path to the executable file for R<br />
<br />
><i>Add arguments</i>: CMD BATCH followed by the path of the R script you created in step 2. Remember to put the directory paths between quotation marks "" as in the image below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLuOf9nskASCX-zGVp9HqDnkx_5lRVG4a0pxRm_pshu8TwbP8SW0yrcXof4lCoxQC_60ZILdV3jyXMNZzK1My5TpcFVmFvnj1AQsQbvJQMokpiFfC2cc0QKM8M-u52JIBQSJwabg26yIM/s1600/1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Schedule R reports via email " border="0" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLuOf9nskASCX-zGVp9HqDnkx_5lRVG4a0pxRm_pshu8TwbP8SW0yrcXof4lCoxQC_60ZILdV3jyXMNZzK1My5TpcFVmFvnj1AQsQbvJQMokpiFfC2cc0QKM8M-u52JIBQSJwabg26yIM/s400/1.jpg" title="" width="400" /></a></div>
<br />
<br />
Click on Next and you should reach the last step and see a confirmation window. Press Finish and voilà, your task is created and it should execute at the time you set.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjyARy3zJ1rHktKTHwA5om8JiBw8_DwfJYucLGB_hGaWnHIQhPtuqJJ_eLu_6QiQO1OHDxoYpExnOnTV8o62JnzmDIyGHK0xdUlW72fI-2k0poMSAP2egwj3rriuvt9H13oXvnOBKb44Y/s1600/2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjyARy3zJ1rHktKTHwA5om8JiBw8_DwfJYucLGB_hGaWnHIQhPtuqJJ_eLu_6QiQO1OHDxoYpExnOnTV8o62JnzmDIyGHK0xdUlW72fI-2k0poMSAP2egwj3rriuvt9H13oXvnOBKb44Y/s320/2.jpg" width="320" /></a></div>
<br />
<h2>
<span style="font-size: large;">
4. Check your mail</span></h2>
At the time you set for the task, you should see the "taskeng" window popping up and disappearing after a few seconds (depending on the workload you placed on your R files). Now open the mail account where you sent the report. <i>Did you get the email with your report attached?</i><br />
<i><br /></i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcxzVvEXvfgtMCZ7VOaqYnBW5_lgRfRynMPkuINuw6OnT6w6hu6sDFLfIXpU_lI3n_4ka0IHBeQO-tDrHvWej2BRBkbUqFxG3zU5Jo9wUQe6MuGmll9dRYvERpt_BU-upKiaQioKkd0qo/s1600/check+mail.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcxzVvEXvfgtMCZ7VOaqYnBW5_lgRfRynMPkuINuw6OnT6w6hu6sDFLfIXpU_lI3n_4ka0IHBeQO-tDrHvWej2BRBkbUqFxG3zU5Jo9wUQe6MuGmll9dRYvERpt_BU-upKiaQioKkd0qo/s320/check+mail.jpg" width="320" /></a></div>
<br />
<br />
In case you did not receive the email, I recommend that you:<br />
<br />
<ul>
<li>Check whether the task was executed in Windows. Open the task scheduler and you will see the list of tasks. Look for your task name. Make sure the status says "Success" and not "Failed";</li>
<li>If the status says "Failed", double check that you set up the task correctly as per step 3. An alternative is to create a separate .bat file and enter the path of the .bat file in the task scheduler;</li>
<li><b>As a general troubleshooting method</b>, I also suggest opening your R console (double click on R.exe) and executing the code of your R script from step 2 line by line. This way you can find out whether there is an error inside your R code, that is, whether Windows executes the task correctly but no data is generated or sent by R.</li>
</ul>
<br />
Here below are a few issues that might prevent R from executing the code contained in your R script properly:<br />
<br />
*To be able to send mails via the gmailR package, make sure you enable the "less secure apps" option in your Google account.<br />
<br />
**To be able to create an HTML document from an Rmarkdown file, make sure you have installed the latest version of <i>pandoc</i>. To do that, you should, in order:<br />
> install.packages("installr")<br />
> library(installr)<br />
> install.pandoc()<br />
> Restart your machine<br />
<br />
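A quick way to check whether R can see pandoc outside RStudio is to look for it on the system PATH. This is a minimal sketch using only base R; it assumes pandoc is expected on the PATH, which is typically the case when the script is run by R.exe from the task scheduler rather than from RStudio.<br />

```r
# Look for the pandoc executable on the system PATH (base R only).
pandoc_path <- Sys.which("pandoc")

if (nzchar(pandoc_path)) {
  message("pandoc found at: ", pandoc_path)
} else {
  message("pandoc not found on the PATH; rendering to HTML will likely fail")
}
```

Run it from the same R console you use for troubleshooting (R.exe), not from RStudio, since RStudio may find its own copy of pandoc.<br />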
<br />
To recap, the process you have just automated will work as follows:<br />
<ol>
<li>Windows will start a task at the day/time you specified in the task scheduler</li>
<li>the taskeng will open and execute, through R, the R script you created in step 2</li>
<li>the Rmarkdown report will be converted into an HTML file and sent by email</li>
</ol>
<div>
If you would like to reproduce the whole process using my files, you can find both the <b>Rmarkdown report</b> and the <b>R script</b> at <a href="https://github.com/mcpasin/schedule-Rmarkdown-report-via-email.git">this github repository</a>. I hope the post was helpful and will push you to use R for generating business reports.</div>
<div>
<br /></div>
marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-61798882259896413022016-01-02T15:38:00.001+00:002016-01-02T15:38:35.830+00:00Happy New Year! Most Popular Posts in 2015<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxkH8fiPdscSZFXfViqS03Jb-F0wcShgHLTkcmUZ665XHJMLgn4sjyORMkSjTf5ZmApvlrEUwbb-7FJIWDuijA00CCKz5yuuep74b0PcktEDmhSxP6FytOhwqFUNPrrA1l0kQ9FmgfPx4/s1600/Happy+2016.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxkH8fiPdscSZFXfViqS03Jb-F0wcShgHLTkcmUZ665XHJMLgn4sjyORMkSjTf5ZmApvlrEUwbb-7FJIWDuijA00CCKz5yuuep74b0PcktEDmhSxP6FytOhwqFUNPrrA1l0kQ9FmgfPx4/s400/Happy+2016.jpg" width="400" /></a></div>
<br />
2015 has been my 2nd year blogging and I wanted to thank everyone who has taken the time to read, share and comment on my posts. Some of you left such precious feedback, which gave me ideas for new posts and the strength to keep blogging. <b>Thank you everyone!</b><br />
<br />
On a personal level, 2015 has been a very productive year in terms of learning. I've been playing quite a lot with the R language and Google Analytics, often combining both and trying to explore new uses and applications for daily job tasks. R has become an irreplaceable tool in my daily job. And blogging about it gave me the confidence to use it and recommend it to other colleagues in my team.<br />
<br />
Looking quickly at some web analytics metrics, last year was a positive one too. Sessions almost quadrupled compared to 2014 (a big jump, though bear in mind this is a very young blog). Organic traffic increased by over 900% (yes, this is very good news!) and referral traffic saw a big increase as well, mainly thanks to my R posts being included in <a href="http://www.r-bloggers.com/">R-bloggers.com</a> (being accepted there was more good news).<br />
<br />
Moving to a more meaningful KPI, there has been a 115% increase in subscribers compared to 2014. Thank you guys! My major challenge for 2016 will definitely be producing more content, more frequently, while keeping the posts interesting and valuable for the audience.<br />
<br />
Here below are my 3 most popular posts in 2015. That is, the content that you, the readers, found most interesting. Check them out if you have not seen them yet.<br />
<br />
1. <a href="http://www.analyticsforfun.com/2015/01/google-analytics-dashboards-with-r-shiny.html" target="_blank">Google Analytics Dashboards with R & Shiny</a><br />
<br />
2. <a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html" target="_blank">Playing with R, Shiny Dashboard and Google Analytics Data</a><br />
<br />
3. <a href="http://www.analyticsforfun.com/2015/03/r-stats-digital-analytics-8-blogs-you.html" target="_blank">R Statistics for Digital Analytics: 8 blogs you should follow</a><br />
<br />
<b>I wish you a great 2016</b> and thanks again for following my blog! I will be back soon with more content.<br />
<br />
Query your Google Analytics Data with the GAR package<br />
Posted by marqui on 2015-10-12 (updated 2016-04-24)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggjsv-96HezRdDuUQ828A3aS37fj5jJ4v_7xOWnNrhMvWJY3rCbEtVTGWJvVTahw8SqfvQk9Cb8_OTFdyxEqz7vA4YETCpdUeAiJpZLXmjaoEx6ffJpnXVQaVAX6cYitI32AyZbLHZDSY/s1600/Google-Analytics-Logo.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="Google Analytics API connection with R" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggjsv-96HezRdDuUQ828A3aS37fj5jJ4v_7xOWnNrhMvWJY3rCbEtVTGWJvVTahw8SqfvQk9Cb8_OTFdyxEqz7vA4YETCpdUeAiJpZLXmjaoEx6ffJpnXVQaVAX6cYitI32AyZbLHZDSY/s320/Google-Analytics-Logo.png" title="Query Google Analytics Data with GAR package" width="320" /></a></div>
Recently my friend <a href="https://www.linkedin.com/in/andrewjgeisler" target="_blank">Andrew Geisler</a> released a new version of the GAR package. Like other similar packages, the GAR package is designed to help you retrieve data from Google Analytics using R, but with some new features.<br />
<br />
I have been playing a bit with the package and the feature I enjoy the most is the ability to <b>query multiple Google Analytics View IDs</b> in the same query. To do that, you simply pass a vector of View IDs to the <i>gaRequest()</i> function, and you get back a data frame with each view/profile clearly identified, along with all the corresponding metrics/dimensions you included in the query.<br />
<a name='more'></a>Pretty simple, no?<br />
<br />
I think this is a very useful feature which makes the GAR package stand out from other similar packages out there (as far as I know there are currently 4 Google Analytics packages available: RGoogleAnalytics, RGA, ganalytics and GAR of course).<br />
<br />
You could also <a href="http://www.analyticsforfun.com/2015/05/query-multiple-google-analytics-view.html" target="_blank">build a loop in R to query multiple View IDs</a> at once, and this is actually what I did previously using the RGoogleAnalytics package. But having this feature built into a package just makes your life easier!<br />
<br />
The GAR package is <a href="https://cran.r-project.org/web/packages/GAR/GAR.pdf" target="_blank">available on the CRAN repository</a> (v1.1 was released on 17 Sep 2015) and you can install and load it with the following commands:<br />
<br />
<i>install.packages('GAR', type='source')</i><br />
<i>library(GAR)</i><br />
<br />
<br />
<h2>
<span style="font-size: large;">
Getting the data from Google Analytics</span></h2>
<br />
Getting data from Google Analytics is easy and works much as it does in other packages.<br />
<br />
First of all you need to:<br />
<br />
<ol>
<li>Create a new project in the Google Developers API Console, if you have not done so before.</li>
<li>Authenticate using your project credentials. </li>
</ol>
<br />
You can find a detailed explanation for these two steps on the <a href="https://github.com/andrewgeisler/GAR" target="_blank">GAR github tutorial here</a>.<br />
<br />
So, assuming you got the authentication right and obtained a token, you now need to make sure your token is refreshed (GA access tokens expire) every time you need to retrieve data, and finally execute your query from R.<br />
<br />
To refresh the token you use the <i>tokenRefresh()</i> function. The resulting access token will be stored as an environment variable accessible by the GAR package.<br />
<br />
<i>tokenRefresh(GAR_CLIENT_ID, GAR_CLIENT_SECRET, GAR_REFRESH_TOKEN)</i><br />
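<br />
A minimal sketch of one way to keep the credentials out of the script itself, assuming you saved them beforehand as environment variables under the same three names (that naming, and the two helper functions, are my assumptions for illustration, not something the package requires):<br />

```r
# Read one credential from an environment variable (variable names are assumed)
read_credential <- function(name) {
  value <- Sys.getenv(name)
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Hypothetical wrapper: same argument order as the tokenRefresh() call above
refresh_ga_token <- function() {
  GAR::tokenRefresh(
    read_credential("GAR_CLIENT_ID"),
    read_credential("GAR_CLIENT_SECRET"),
    read_credential("GAR_REFRESH_TOKEN")
  )
}

# refresh_ga_token()  # call this right before each gaRequest()
```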
<br />
<br />
To get the data, you will use the <i>gaRequest()</i> function.<br />
<br />
<i>df <- gaRequest(</i><br />
<i>id=c('ga:123456789','ga:987654321'),</i><br />
<i>dimensions='ga:date,ga:month',</i><br />
<i>metrics='ga:sessions, ga:users, ga:pageviews',</i><br />
<i>start='YYYY-MM-DD',</i><br />
<i>end='YYYY-MM-DD',</i><br />
<i>sort='-ga:sessions,ga:users'</i><br />
<i>)</i><br />
<br />
The arguments of this function are based on the structure of the typical API call to Google Analytics. So, it's here that you will specify all the parameters of your query (metrics, dimensions, period, etc.). And it is here in particular that you <b>specify the Google Analytics View IDs</b> you would like to get the data from.<br />
<br />
Of course the <i>gaRequest()</i> function will authenticate using the access token previously stored as an environment variable.<br />
<br />
<br />
Let's run an example. In the query below I am asking the Google Analytics API to retrieve data about sessions and pageviews between 10 Oct 2015 and 11 Oct 2015, from five distinct View IDs.<br />
<br />
<i>df <- gaRequest(</i><br />
<i>id=c('ga:83424646','ga:77989457','ga:82857332','ga:65743580','ga:65743194'), dimensions='ga:date,ga:month',</i><br />
<i>metrics='ga:sessions, ga:pageviews',</i><br />
<i>start='2015-10-10', end='2015-10-11',</i><br />
<i>sort='-ga:sessions,ga:pageviews')</i><br />
<br />
As expected, the resulting dataset has a total of 10 rows (5 View IDs x 2 days).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCSxfEAT6yemzfxfCfnkFB2_deveFIhydRhg_aajaqx8xjQ97kg71e1tqiAYF7KIXB8B6eCTaMLfwaGHxJmbODa0ze-zZReyRcpYFF6Y1QlaYUsQnKz6XZfZF_cKQfAvpF26s0YoJQ6JQ/s1600/resultQuery.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="GAR package Query Output" border="0" height="94" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCSxfEAT6yemzfxfCfnkFB2_deveFIhydRhg_aajaqx8xjQ97kg71e1tqiAYF7KIXB8B6eCTaMLfwaGHxJmbODa0ze-zZReyRcpYFF6Y1QlaYUsQnKz6XZfZF_cKQfAvpF26s0YoJQ6JQ/s400/resultQuery.jpg" title="Output Data Frame of a Google Analytics Query" width="400" /></a></div>
<br />
<br />
As you can see in the screenshot, in addition to the metrics and dimensions you requested, the resulting data frame also contains details about your request, such as:<br />
<br />
<ul>
<li>profile ID (or View ID)</li>
<li>accountId</li>
<li>webPropertyId</li>
<li>internalWebPropertyId</li>
<li>profileName (or View name)</li>
<li>tableId</li>
<li>start-date</li>
<li>end-date</li>
</ul>
<br />
Now that you have got your output data frame, you might want to <a href="http://www.analyticsforfun.com/2015/05/query-multiple-google-analytics-view.html" target="_blank">categorize different websites</a> or Views according to specific criteria and apply any aggregate functions (sum, average). It's up to you and to your internal business reporting needs. The key thing is that <b>all the data you requested are included in a single table</b> and ready to be analysed with R.<br />
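<br />
As an illustration of that categorize-then-aggregate step, here is a small base R sketch. The data frame is mock data standing in for a query result, and the column names (profileId, date, sessions, pageviews) and the categories are assumptions for the example:<br />

```r
# Mock data standing in for a gaRequest() result (column names are assumed)
df <- data.frame(
  profileId = rep(c("111", "222", "333"), each = 2),
  date      = rep(c("2015-10-10", "2015-10-11"), times = 3),
  sessions  = c(120, 130, 40, 55, 300, 280),
  pageviews = c(400, 390, 90, 120, 950, 900),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table assigning each View ID to a business category
categories <- data.frame(
  profileId = c("111", "222", "333"),
  category  = c("blogs", "blogs", "shops"),
  stringsAsFactors = FALSE
)

# Attach the category to every row, then sum the metrics per category and day
df_cat <- merge(df, categories, by = "profileId")
by_category <- aggregate(cbind(sessions, pageviews) ~ category + date,
                         data = df_cat, FUN = sum)
by_category
```

Swapping <i>sum</i> for <i>mean</i> in the last call would give averages instead of totals.<br />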
<br />
Happy analysis!<br />
<div>
<br /></div>
<br />
Playing with R, Shiny Dashboard and Google Analytics Data<br />
Posted by marqui on 2015-08-17 (updated 2016-04-24)<br />
<br />
In this post, I want to share some examples of data visualization I have been playing with recently. As on many other occasions, my field of application is digital analytics data; more precisely, data from Google Analytics.<br />
<br />
You might remember a previous post where I built a tentative <a href="http://www.analyticsforfun.com/2015/01/google-analytics-dashboards-with-r-shiny.html" target="_blank">dashboard using R, Shiny and Google Charts</a>. The final result was not too bad; however, the layout was somewhat rigid since I was using the "merge" command to merge the charts and create the final dashboard.<br />
<br />
So I decided to spend some time improving my previous dashboard and adding a couple of new visualizations, which will hopefully be inspiring. Of course, I am still using R and Shiny, and in particular <a href="http://rstudio.github.io/shinydashboard/index.html" target="_blank">shinydashboard</a>: a package made specifically for building dashboards with R.<br />
<a name='more'></a><br />
<br />
The dashboard I've made makes use of the following visualizations:<br />
<br />
<ul>
<li>Value boxes</li>
<li>Interactive Time Series (dygraphs)</li>
<li>Bubble charts</li>
<li>Streamgraphs</li>
<li>Treemaps </li>
</ul>
<br />
You can see the final dashboard at <a href="https://mcpasin.shinyapps.io/PlayingGoogleAnalyticsDataViz">shinyapps.io</a> (though, because of basic plan current limits, it might be temporarily unavailable), or better you can check <a href="https://github.com/mcpasin/PlayingGoogleAnalyticsDataViz" target="_blank">the code at github</a>. Here is a screenshot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZvcG6-jR5nCzU_tOJY8UvHTTMZzoErc9wA0SxTCFuvi6k9i7nh9UhFqHvIhrq6H2V8pLBmi2Br70Dl-AjcU6T25hkedox6IWbHLDWDcGAwiLceD1S_D4T7VB0864EVl23TOvjenYD_jU/s1600/Playing_with_R_Shiny_Dashboard_and_Google_Analytics_Data.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZvcG6-jR5nCzU_tOJY8UvHTTMZzoErc9wA0SxTCFuvi6k9i7nh9UhFqHvIhrq6H2V8pLBmi2Br70Dl-AjcU6T25hkedox6IWbHLDWDcGAwiLceD1S_D4T7VB0864EVl23TOvjenYD_jU/s640/Playing_with_R_Shiny_Dashboard_and_Google_Analytics_Data.gif" width="640" /></a></div>
<br />
<br />
Let's go quickly through each visualization to see what Google Analytics dimension/metrics it shows.<br />
<br />
<h2>
<span style="font-size: large;">
Value Boxes</span></h2>
When you build a dashboard, boxes are probably the main building blocks since they let you organize the information you want to show within the page. When I build a dashboard, I normally start by sketching the layout, and this means placing the main boxes.<br />
<br />
A particular type of box available in the Shiny Dashboard package is the <a href="http://rstudio.github.io/shinydashboard/structure.html#valuebox" target="_blank">valueBox</a>, which lets you display numeric or text values, and also add an icon. Value boxes are great components to place at the top of a dashboard to display the main KPIs or % changes, or to add a description to the rest of the dashboard.<br />
<br />
In my dashboard I placed 3 boxes at the top, showing the values of my 3 main KPIs: sessions to the website, transactions (conversions) and conversion rate. The code to build a value box with shiny dashboard is very simple, and if you want to have dynamic values, like in my case, you have to define it in both the <i>server.R</i> and <i>ui.R</i> files of your Shiny app:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqC5vqMfV5IVxNqbmxax3kLrAOnP3f45UkAdFNzAifXYHNm1UAWKmv1d8xRndZY0WRogPWVAulwx5COMoldiL2d7IpI9KItOdosGtNEPutl0slnByTlqAWRKQqfjYkV1F6iSyIMvQZdJ4/s1600/valueBox.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Value Boxes with Shiny Dashbard" border="0" height="72" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqC5vqMfV5IVxNqbmxax3kLrAOnP3f45UkAdFNzAifXYHNm1UAWKmv1d8xRndZY0WRogPWVAulwx5COMoldiL2d7IpI9KItOdosGtNEPutl0slnByTlqAWRKQqfjYkV1F6iSyIMvQZdJ4/s640/valueBox.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="640" /></a></div>
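<br />
A minimal sketch of that two-sided setup, with a hard-coded placeholder where the Google Analytics value would go (the output ID sessionsBox and the number are made up for illustration, and both sides are shown in one script for brevity):<br />

```r
library(shiny)
library(shinydashboard)

# ui.R side: reserve a slot for the value box
ui <- dashboardPage(
  dashboardHeader(title = "KPI demo"),
  dashboardSidebar(),
  dashboardBody(
    fluidRow(valueBoxOutput("sessionsBox"))
  )
)

# server.R side: fill the slot with a value computed at run time
server <- function(input, output) {
  output$sessionsBox <- renderValueBox({
    total_sessions <- 12345  # placeholder; would come from the GA data
    valueBox(
      value = format(total_sessions, big.mark = ","),
      subtitle = "Sessions",
      icon = icon("users"),
      color = "blue"
    )
  })
}

# shinyApp(ui, server)  # uncomment to launch the app
```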
<br />
<br />
<br />
<h2>
<span style="font-size: large;">
Interactive Time Series (dygraphs)</span></h2>
Time series charts can get chaotic and fail to provide clear insights when filled with too many data points and series (you might end up with the so-called "spaghetti effect").<br />
<br />
But if time series are interactive, users can easily explore and make sense of complex datasets.<br />
<br />
For example, users could highlight specific data points, include/exclude time series, zoom in on specific time intervals, enrich the graph with shaded regions or annotations, etc. All of these features are offered by the dygraphs JavaScript charting library.<br />
<br />
I used the <a href="https://rstudio.github.io/dygraphs/" target="_blank">R dygraphs package</a> (which provides an interface to the JavaScript dygraphs charting library) to make an interactive time series with my Google Analytics dataset. The simple chart I made shows 3 metrics: sessions, transactions and conversion rate (of those transactions) over the period selected by the user. Both sessions and transactions use the left axis, while conversion rate uses the right one. I included a <i>dyRangeSelector</i> at the bottom of the chart that lets you narrow down the time interval.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuSGLTSDUm4GqFrLd9y-uxwhkjMnfXOS3DpQGkGKbPAv-YbdyXia3mnMRHGvRhoF4TTfmC9RkfdyweaspJyu17R2P_HBJa5vVsQIVZntBpcPkNYZTVafRbhBkYArtlx-jdazEC26IwqEk/s1600/dygraphs.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Dygraphs with R Shiny" border="0" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuSGLTSDUm4GqFrLd9y-uxwhkjMnfXOS3DpQGkGKbPAv-YbdyXia3mnMRHGvRhoF4TTfmC9RkfdyweaspJyu17R2P_HBJa5vVsQIVZntBpcPkNYZTVafRbhBkYArtlx-jdazEC26IwqEk/s400/dygraphs.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="400" /></a></div>
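<br />
A sketch of how a chart like this can be put together, using random numbers in place of the Google Analytics data (the column names and figures are invented for the example):<br />

```r
library(xts)
library(dygraphs)

# Synthetic daily data standing in for the Google Analytics metrics
set.seed(1)
dates <- seq(as.Date("2015-07-01"), by = "day", length.out = 30)
sessions <- round(runif(30, 800, 1200))
transactions <- round(runif(30, 10, 40))
ga <- xts(
  cbind(sessions = sessions,
        transactions = transactions,
        convRate = round(100 * transactions / sessions, 2)),
  order.by = dates
)

chart <- dygraph(ga, main = "Sessions, transactions and conversion rate")
# Send the conversion rate to the secondary (right) axis
chart <- dySeries(chart, "convRate", axis = "y2", label = "Conversion rate (%)")
chart <- dyAxis(chart, "y", label = "Sessions / transactions")
chart <- dyAxis(chart, "y2", label = "Conversion rate (%)", independentTicks = TRUE)
# Range selector at the bottom to narrow down the time interval
chart <- dyRangeSelector(chart)
chart
```

In a Shiny app the same object would typically be returned from <i>renderDygraph()</i> and displayed with <i>dygraphOutput()</i>.<br />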
<br />
<br />
<br />
<h2>
<span style="font-size: large;">
Bubble charts</span></h2>
With bubble charts you can show three dimensions of data. I used a bubble chart to visualize the performance of traffic channels: the x axis represents the number of sessions, the y axis the avg. pages per session, and finally transactions (the ultimate objective of many websites) are proportional to the size of the bubble. The larger the bubble, the higher the number of transactions produced by that traffic channel.<br />
<br />
To make this chart I used the <a href="https://cran.r-project.org/web/packages/googleVis/index.html" target="_blank">googleVis package</a>.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwtRJtvlpj1H0DV2vnHCoG4F1LQHPvCnxuRQQ0YF467axIuyiQY0lj8QgIZkEirAqNCxfg6shzNsNO4XqVhIm6varxOO0XjAmo_KaF5-XDvHt2ueQo1ACV_T5HO4AqBHD54KtVvWzc6DA/s1600/bubbleChart.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Bubble Charts to visualize Traffic Channel Performance" border="0" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwtRJtvlpj1H0DV2vnHCoG4F1LQHPvCnxuRQQ0YF467axIuyiQY0lj8QgIZkEirAqNCxfg6shzNsNO4XqVhIm6varxOO0XjAmo_KaF5-XDvHt2ueQo1ACV_T5HO4AqBHD54KtVvWzc6DA/s400/bubbleChart.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="400" /></a></div>
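<br />
A sketch of the corresponding googleVis call, with made-up channel figures (the data frame and its column names are assumptions for the example):<br />

```r
library(googleVis)

# Invented channel performance data
channels <- data.frame(
  channel = c("Organic", "Direct", "Referral", "Paid", "Social"),
  sessions = c(5200, 3100, 1800, 2400, 900),
  pagesPerSession = c(3.4, 2.9, 4.1, 2.2, 1.8),
  transactions = c(140, 95, 60, 70, 12)
)

bubble <- gvisBubbleChart(
  channels,
  idvar = "channel",            # label of each bubble
  xvar = "sessions",            # x axis
  yvar = "pagesPerSession",     # y axis
  colorvar = "channel",         # one colour per channel
  sizevar = "transactions",     # bubble size
  options = list(
    hAxis = "{title: 'Sessions'}",
    vAxis = "{title: 'Avg. pages per session'}"
  )
)
# plot(bubble)  # opens the chart in a browser
```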
<br />
<br />
In the dashboard I've also included a one-dimensional bubble chart using the <a href="https://github.com/jcheng5/bubbles" target="_blank">bubbles library</a>. This type of chart works similarly to a bar chart, though the latter conveys the real values you are showing more accurately.<br />
<br />
On the other hand, this bubble chart might look more attractive than a bar chart and it lets you display lots of values in a small area. I used this chart to show screen resolution data from Google Analytics mobile reports.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip20b5COcEB4wuRzUzQPCeAKS22X8Fk7JkiNYp_6hMPj6CtrXE-fEqNg_kvRRGhbB3U_EG1i-7NUDc8W93inAttXy415J0qYbRE_PQJFflaqNjwo9sXqr99BqwF5U2xRYX9kG_gn-trec/s1600/bubbles.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Bubbles showing Sessions by Screen Resolution" border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip20b5COcEB4wuRzUzQPCeAKS22X8Fk7JkiNYp_6hMPj6CtrXE-fEqNg_kvRRGhbB3U_EG1i-7NUDc8W93inAttXy415J0qYbRE_PQJFflaqNjwo9sXqr99BqwF5U2xRYX9kG_gn-trec/s320/bubbles.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="320" /></a></div>
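<br />
A small sketch of the call, with invented screen resolution figures (the package is installed from GitHub, and the data below is only an example):<br />

```r
# install with: devtools::install_github("jcheng5/bubbles")
library(bubbles)

# Invented sessions by screen resolution
resolutions <- data.frame(
  resolution = c("360x640", "375x667", "768x1024", "320x568", "414x736"),
  sessions = c(4200, 3100, 1500, 1200, 900),
  stringsAsFactors = FALSE
)

# Bubble area reflects the number of sessions; the label is the resolution
bubble_chart <- bubbles(
  value = resolutions$sessions,
  label = resolutions$resolution,
  color = "#4682B4"
)
bubble_chart
```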
<br />
<br />
<h2>
<span style="font-size: large;">
Streamgraphs </span></h2>
Streamgraphs are a type of stacked area chart displaced around a central horizontal axis. Streamgraphs are very effective for visualizing data series that vary over time, especially if you need to show many categories.<br />
<br />
The result is a flowing, organic shape, with strong aesthetic appeal, which is why <a href="http://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/" target="_blank">streamgraphs are becoming more and more popular</a>.<br />
<br />
In the dashboard I made a streamgraph to visualize the evolution of sessions among devices (desktop, mobile, tablet) over the past years. To do it in R, I had to play a bit with the <a href="https://github.com/hrbrmstr/streamgraph" target="_blank">streamgraph package</a>.<br />
<br />
Here below is the final data viz (I am not completely happy with this visualization, as for some reason when I mouse over a series the value shown is always the total for the period, not the one for the specific date I am pointing at. Any help?).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyP8hEa8iilxpdqwwzkLitk7GOnwW-PeD2NSETSUAR7gogexAJDsUUE2mlIGeWcUZ9WRsXbUvTfmvqNx1wO7WY2kJba_uB3RTZeATwguUxie7YBJwkq8iDqCZEXv9UtNjipxPGuCvP1pQ/s1600/streamgraph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Streamgraph to show Devices Share of Traffic" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyP8hEa8iilxpdqwwzkLitk7GOnwW-PeD2NSETSUAR7gogexAJDsUUE2mlIGeWcUZ9WRsXbUvTfmvqNx1wO7WY2kJba_uB3RTZeATwguUxie7YBJwkq8iDqCZEXv9UtNjipxPGuCvP1pQ/s320/streamgraph.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="310" /></a></div>
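<br />
A sketch of a streamgraph built on random monthly data in long format, one row per date and device (the numbers and column names are invented, and this is not the exact code behind the chart above):<br />

```r
# install with: devtools::install_github("hrbrmstr/streamgraph")
library(streamgraph)

# Long-format synthetic data: one row per month and device
months <- seq(as.Date("2013-01-01"), by = "month", length.out = 24)
devices <- expand.grid(date = months,
                       device = c("desktop", "mobile", "tablet"),
                       stringsAsFactors = FALSE)
set.seed(42)
devices$sessions <- round(runif(nrow(devices), 500, 3000))

# key = category column, value = metric column, date = time column
sg <- streamgraph(devices, key = "device", value = "sessions", date = "date")
sg <- sg_axis_x(sg, tick_interval = 6, tick_units = "month", tick_format = "%b %Y")
sg
```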
<br />
<br />
Another interesting application to web analytics data would be using streamgraphs to analyse each channel's share of traffic over time (direct vs organic vs paid vs referral, etc.).<br />
<br />
<br />
<h2>
<span style="font-size: large;">
Treemaps</span></h2>
Treemap visualizations are very effective in showing hierarchical (tree-structured) data in a compact way. They can display a lot of information within a limited space and at the same time allow users to drill down into the represented segments.<br />
<br />
An example of hierarchical data in Google Analytics reports is devices as the principal segment (main rectangles) and browser as the sub-segment (nested rectangles). The area of each rectangle is proportional to the number of sessions produced by its corresponding segment/sub-segment.<br />
<br />
To make it in R, I used the <a href="https://github.com/mtennekes/treemap" target="_blank">treemap library</a> (unfortunately the visualization is not interactive, but you can have a try with the <a href="https://github.com/timelyportfolio/d3treeR" target="_blank">d3treeR library</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeKAxWmr4lDmi891WOrcZn1ZFwrJJ99zals9HzzcxB_O6h2p8PzGYbKqtrNufYLNKOPgYNlLYMOijdm1h6ZVeoD7CBcXqFcX1ELBbxxIMdxXfwAxlolP404PJqIirTnfXNyTvv-vEw3SY/s1600/treemap.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Treemap to show Devices and OS Share of Sessions." border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeKAxWmr4lDmi891WOrcZn1ZFwrJJ99zals9HzzcxB_O6h2p8PzGYbKqtrNufYLNKOPgYNlLYMOijdm1h6ZVeoD7CBcXqFcX1ELBbxxIMdxXfwAxlolP404PJqIirTnfXNyTvv-vEw3SY/s320/treemap.jpg" title="Playing with R, Shiny Dashboard and Google Analytics Data" width="310" /></a></div>
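<br />
A sketch of the treemap call for that kind of two-level hierarchy, with invented sessions by device and browser (the data frame is an assumption for the example):<br />

```r
library(treemap)

# Invented sessions by device (main segment) and browser (sub-segment)
traffic <- data.frame(
  device = c("desktop", "desktop", "desktop", "mobile", "mobile", "tablet"),
  browser = c("Chrome", "Firefox", "Internet Explorer", "Chrome", "Safari", "Safari"),
  sessions = c(5200, 1800, 1300, 2900, 2100, 800)
)

# index sets the hierarchy (outer to inner); vSize sets the rectangle areas
treemap(
  traffic,
  index = c("device", "browser"),
  vSize = "sessions",
  type = "index",
  title = "Sessions by device and browser"
)
```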
<br />
I hope you can get inspiration from these visualizations and include some of them in your digital analytics dashboard or reports. My plan is to <b>keep adding more interesting visualizations</b> (that are not currently offered in Google Analytics reports) to this dashboard, to better show digital data. If you have suggestions please leave a comment here or share it via <a href="https://github.com/mcpasin/PlayingGoogleAnalyticsDataViz" target="_blank">github repo</a>.<br />
<br />
<br />
Query Multiple Google Analytics View IDs with R<br />
Posted by marqui on 2015-05-19 (updated 2016-05-28)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCyFUAMAyd2QYH7a7pNqUAooN1pxst7qUoR2_uG554paALLpti2qX-LYzScxpBr-9rk0ztb39XOUqD2nq7WMkb-dsXLA3cpciEp4aILfQyUWDgrJua1mPd-urHjdzrTHQ_fRAJQHC409c/s1600/GA+homepage+multiple+Views.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Query Multiple View IDs with R" border="0" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCyFUAMAyd2QYH7a7pNqUAooN1pxst7qUoR2_uG554paALLpti2qX-LYzScxpBr-9rk0ztb39XOUqD2nq7WMkb-dsXLA3cpciEp4aILfQyUWDgrJua1mPd-urHjdzrTHQ_fRAJQHC409c/s400/GA+homepage+multiple+Views.jpg" title="Google Analytics Home Multiple Views" width="400" /></a></div>
<br />
Extracting Google Analytics data from one website is pretty easy, and there are several options to do it quickly. But what if you need to extract data from multiple websites or, to be more precise, from multiple Views? And perhaps you also need to summarize it within a single data frame?<br />
<br />
Not long ago I was working on a reporting project where the client owned over 60 distinct websites, all of them tracked using Google Analytics.<br />
<a name='more'></a><br />
<br />
Given the high number of sites managed and the nature of their business, it did not make sense for them to report and analyse data for each single website. It was much more effective to <b>group those websites into categories</b> (let's say category 1, category 2, category 3, etc.) and <b>report/analyse data at a category level</b> rather than at a website level.<br />
<br />
In other words, they needed to:<br />
<br />
<ol>
<li>Collect data from each website</li>
<li>Categorize websites data according to specific internal business criteria</li>
<li>Report and visualize data at a category level (through an internal reporting system)</li>
</ol>
<br />
<br />
Very soon I realized that steps 1 & 2 were critical both in terms of time needed for extracting data and of the risk of copy/paste errors, especially if the extraction process was executed directly from Google Analytics platform.<br />
<br />
But luckily that's where R and the <a href="https://developers.google.com/analytics/solutions/r-google-analytics#query">RGoogleAnalytics package</a> came in handy, allowing me to <b>automate the extraction process</b> <b>with a simple <i>for</i> loop</b>.<br />
<br />
Let's quickly go through the different options I had to tackle points 1. and 2.<br />
<br />
<h4>
<span style="color: orange;"><b>a) Download data from Google Analytics platform as Excel format</b></span></h4>
This would have meant doing the same operation for each one of the 60 sites! Too long. Plus a subsequent manual copy/paste work to group sites data into different categories. Boring and too risky! Moreover, given the segmentation required by the client, I could not find the info directly from Google Analytics standard reports.<br />
<br />
<h4>
<span style="color: orange;"><b>b) Google Analytics Query Explorer</b></span></h4>
<a href="https://ga-dev-tools.appspot.com/query-explorer/" target="_blank">Google Analytics Query Explorer</a> is very very handy and I use it a lot. You can connect to Google Analytics API and build complex queries quickly thanks to their easy to use interface. So I could obtain the required segmentation of data quite fast.<br />
<br />
However, the current Query Explorer version <b>allows you to query only one View ID at a time</b>. Despite its plural nomenclature (ids), the <a href="https://developers.google.com/analytics/devguides/reporting/core/v3/reference#ids" target="_blank">View ID is a unique value as explained in Core Reporting API documentation</a>, and you will have to run your request several times in order to query multiple websites.<br />
<br />
Hence, even if you use Query Explorer, you will have to query one website/view at a time, then download the data and merge it into your "websites category".<br />
<br />
<h4>
<span style="color: orange;"><b>c) Google Spreadsheet Add-on</b></span></h4>
Thanks to the <a href="http://analytics.blogspot.com.ar/2015/01/simplify-your-google-analytics.html" target="_blank">Google Analytics Spreadsheet Add-on</a>, it's easy to run a query via the Google Analytics API and obtain your web data. You can also run more than one query at a time, which means you can query more than one View ID at a time.<br />
<br />
I love the Google Sheets Add-on, though in this particular case (query and categorize over 60 websites), you would still have some manual copy/paste work to do once you extracted the data into the spreadsheet.<br />
<br />
<h4>
<span style="color: blue;"><b>d) Automate the extraction process with R (the solution I cover in this post)</b></span></h4>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEg0xslAQHALm8Wz6JqmYhuxUISDpieOosbeHheqURBwVLZu7daVNygR8_mQNyCSBpi_4Vyj1yE_7k16e2lQDDBYgrcqluCPylNdl0vhB5OurVE1s67mr4ST9vU86mVQ0xekIvOYttgNo/s1600/Rlogo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="R to extract Google analytics data" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEg0xslAQHALm8Wz6JqmYhuxUISDpieOosbeHheqURBwVLZu7daVNygR8_mQNyCSBpi_4Vyj1yE_7k16e2lQDDBYgrcqluCPylNdl0vhB5OurVE1s67mr4ST9vU86mVQ0xekIvOYttgNo/s1600/Rlogo.png" title="R language" /></a></div>
<br />
There are a few packages in R that let you connect to Google Analytics API. One of them is <a href="https://code.google.com/p/r-google-analytics/" target="_blank">RGoogleAnalytics</a>. But R is also a powerful programming language which allows you to automate complex operations.<br />
<br />
So, I thought that <b>combining the RGoogleAnalytics package with a simple <a href="http://www.statmethods.net/management/controlstructures.html">R control structure</a></b> like a <i>for</i> loop could do the job quickly and with a low margin of error.
<blockquote class="tr_bq" style="text-align: center;">
<i><span style="font-size: large;">for (var in seq) expr</span></i></blockquote>
<br />
Here below I provide a bit more detail on how I ran multiple queries in R, and obviously, the code!<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">
For loop to query multiple Google Analytics View IDs with R</span></h2>
<br />
What I did was run a simple <i>for</i> loop that iterates over each View ID in my category and retrieves the corresponding data with the query, each time appending the new data to a data frame that eventually becomes the final category data frame.<br />
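<br />
To show just the shape of that loop before going through the steps, here is a base R sketch in which a made-up function, fetch_view_data(), stands in for the real Google Analytics query and returns fake numbers:<br />

```r
# Stand-in for the real API call: returns one row of fake metrics per View ID.
# In practice this would be replaced by the RGoogleAnalytics query.
fetch_view_data <- function(view_id) {
  data.frame(viewId = view_id,
             sessions = nchar(view_id) * 100,  # fake metric
             stringsAsFactors = FALSE)
}

category1 <- c("79242136", "892421", "242136")  # example View IDs

# Start with an empty data frame and append one View at a time
category1_df <- data.frame()
for (v in category1) {
  view_df <- fetch_view_data(v)
  category1_df <- rbind(category1_df, view_df)
}
category1_df
```

Growing a data frame with <i>rbind</i> inside a loop is fine for a few dozen Views; for much longer lists it is usually faster to collect the pieces in a list and bind them once at the end.<br />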
<br />
Let's break it down in a few steps to make it clearer.<br />
<br />
<br />
<h3>
Step 1: Authenticate to Google Analytics via RGoogleAnalytics package</h3>
<br />
I assume you are familiar with the RGoogleAnalytics package. If not, please check out this brilliant post which explains in detail <a href="https://code.google.com/p/r-google-analytics/" target="_blank">how to connect Google Analytics with R</a>.<br />
<br />
What you have to do first is create a new project using the Google Developers Console. Once it is created, grab your credentials (the "client.id" and "client.secret" variables in the code) and use them to create and validate your token.<br />
<br />
Of course you need to have the RGoogleAnalytics library loaded to do all of this.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>RGoogleAnalytics<span style="color: #009900;">)</span>
client.id <- <span style="color: blue;">"yourClientID"</span>
client.secret <- <span style="color: blue;">"yourClientSecret"</span>
<span style="color: #666666; font-style: italic;"># if no token is found within your working directory, a new token will be created. Otherwise the existing one will be loaded</span>
<span style="color: black; font-weight: bold;">if</span> <span style="color: #009900;">(</span>!<a href="http://inside-r.org/r-doc/base/file.exists"><span style="color: #003399; font-weight: bold;">file.exists</span></a><span style="color: #009900;">(</span><span style="color: blue;">"./oauth_token"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span> <span style="color: #009900;">{</span>
token <- Auth<span style="color: #009900;">(</span>client.id<span style="color: #339933;">,</span>client.secret<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/save"><span style="color: #003399; font-weight: bold;">save</span></a><span style="color: #009900;">(</span>token<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/file"><span style="color: #003399; font-weight: bold;">file</span></a>=<span style="color: blue;">"./oauth_token"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span> <span style="color: black; font-weight: bold;">else</span> <span style="color: #009900;">{</span>
<a href="http://inside-r.org/r-doc/base/load"><span style="color: #003399; font-weight: bold;">load</span></a><span style="color: #009900;">(</span><span style="color: blue;">"./oauth_token"</span><span style="color: #009900;">)</span>
<span style="color: #009900;">}</span>
ValidateToken<span style="color: #009900;">(</span>token<span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<br />
<h3>
Step 2: Create the View IDs category</h3>
<br />
Using the "GetProfiles" command, you can get a list of all the Views (or profiles) you have access to with your token, together with the corresponding View IDs, which are the parameters you need to build your query.<br />
<br />
From that list you can easily select the ones you need to build your category. Or otherwise you can create your category directly by entering the IDs manually. As an example, below I create 3 categories, each containing a certain number of IDs. <br />
<br />
Each category will be a character vector.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">viewID<-GetProfiles<span style="color: #009900;">(</span>token<span style="color: #009900;">)</span>
viewID
category1<- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"79242136"</span><span style="color: #339933;">,</span> <span style="color: blue;">"89242136"</span><span style="color: #339933;">,</span> <span style="color: blue;">"892421"</span><span style="color: #339933;">,</span><span style="color: blue;">"242136"</span><span style="color: #339933;">,</span><span style="color: blue;">"242138"</span><span style="color: #339933;">,</span><span style="color: blue;">"242140"</span><span style="color: #339933;">,</span><span style="color: blue;">"242141"</span><span style="color: #009900;">)</span>
category2<- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"54120"</span><span style="color: #339933;">,</span> <span style="color: blue;">"54121"</span><span style="color: #339933;">,</span> <span style="color: blue;">"54125"</span><span style="color: #339933;">,</span><span style="color: blue;">"54126"</span><span style="color: #009900;">)</span>
category3<- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"60123"</span><span style="color: #339933;">,</span> <span style="color: blue;">"60124"</span><span style="color: #339933;">,</span> <span style="color: blue;">"60125"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<h3>
Step 3: Initialize an empty data frame</h3>
<br />
Before executing the loop, I create an empty data frame named "df". I will need this to store the data extracted through the multiple queries.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;">df<-<a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
As you will see in the next step, each time a new query is run for a specific View ID, the resulting data will be appended below the last row of the previous data frame using the function rbind.<br />
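<br />
To see how this accumulation behaves, here is a minimal sketch using two small made-up data frames in place of real query results (the column names and numbers are illustrative assumptions, not Google Analytics output):<br />

```r
# Illustrative data only: two small data frames standing in for query results
df <- data.frame()   # empty accumulator: 0 rows, 0 columns

first  <- data.frame(date = "20150401", sessions = 120, stringsAsFactors = FALSE)
second <- data.frame(date = "20150402", sessions = 95,  stringsAsFactors = FALSE)

df <- rbind(df, first)    # binding onto the empty data frame just returns the new rows
df <- rbind(df, second)   # later results are appended below the existing rows

nrow(df)    # number of accumulated rows
names(df)   # columns come from the appended data frames
```

Because the empty data frame has no columns, the first rbind takes its columns from the first result; every later result must have the same columns.<br />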
<br />
<br />
<h3>
Step 4: Run the for loop over each category</h3>
<br />
Now that we have the website categories set up and a data frame ready to store data, we can finally run the loop. Here I use a variable called "v" and iterate it over a specific category, let's say "category1". In other words, the Google Analytics query is run for every View ID included in the category.<br />
<br />
The result of each query is a data frame called "ga.data". To collect the results of all queries in the same data frame, each pass of the loop appends "ga.data" to the "df" data frame created previously, using the "rbind" function.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: black; font-weight: bold;">for</span> <span style="color: #009900;">(</span>v <span style="color: black; font-weight: bold;">in</span> category1<span style="color: #009900;">)</span><span style="color: #009900;">{</span>
start.date <- <span style="color: blue;">"2015-04-01"</span>
end.date <- <span style="color: blue;">"2015-04-30"</span>
view.id <- <a href="http://inside-r.org/r-doc/base/paste"><span style="color: #003399; font-weight: bold;">paste</span></a><span style="color: #009900;">(</span><span style="color: blue;">"ga:"</span><span style="color: #339933;">,</span>v<span style="color: #339933;">,</span>sep=<span style="color: blue;">""</span><span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;">#the View ID parameter needs "ga:" in front of the ID</span>
query.list <- Init<span style="color: #009900;">(</span>start.date = start.date<span style="color: #339933;">,</span> end.date = end.date<span style="color: #339933;">,</span> dimensions = <span style="color: blue;">"ga:date,ga:deviceCategory,ga:channelGrouping"</span><span style="color: #339933;">,</span> metrics = <span style="color: blue;">"ga:sessions,ga:users,ga:bounceRate,ga:goal1Completions"</span><span style="color: #339933;">,</span> table.id = view.id<span style="color: #009900;">)</span>
ga.query <- QueryBuilder<span style="color: #009900;">(</span>query.list<span style="color: #009900;">)</span>
ga.data <- GetReportData<span style="color: #009900;">(</span>ga.query<span style="color: #339933;">,</span> token<span style="color: #339933;">,</span> paginate_query = FALSE<span style="color: #009900;">)</span>
df<-<a href="http://inside-r.org/r-doc/base/rbind"><span style="color: #003399; font-weight: bold;">rbind</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/df"><span style="color: #003399; font-weight: bold;">df</span></a><span style="color: #339933;">,</span>ga.data<span style="color: #009900;">)</span>
<span style="color: #009900;">}</span></pre>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYNTWHMyHWgWTTA4ab9HgMYhDz3m5XfjFJXnPDOyIVM6II-IQjFIqh7unK1PIGgAm4nKW7XODix0KF0e_sCZQX8uv_AEM2dbQwtZYvI3ae2tCqc8pZj4zhntNUwvcyY6rqh8f-DE1Em8A/s1600/for+loop+output.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Query Multiple Google Analytics View IDs output" border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYNTWHMyHWgWTTA4ab9HgMYhDz3m5XfjFJXnPDOyIVM6II-IQjFIqh7unK1PIGgAm4nKW7XODix0KF0e_sCZQX8uv_AEM2dbQwtZYvI3ae2tCqc8pZj4zhntNUwvcyY6rqh8f-DE1Em8A/s400/for+loop+output.jpg" title="Query Multiple Google Analytics View IDS with R" width="400" /></a></div>
<br />
<br />
This for loop queries data only for category 1. To query websites belonging to category 2, you would need to run the same loop again, this time iterating over category 2. Remember to re-initialize the "df" data frame when you change category, otherwise all new results will be appended below your previous data frame.<br />
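<br />
If you would rather not re-run the loop by hand for each category, one option is to keep the categories in a named list and loop over all of them, labelling each row with its category. The sketch below is only an illustration: "fetch_view" is a hypothetical stand-in for the Init / QueryBuilder / GetReportData calls, and it returns made-up numbers so that the code runs without access to Google Analytics.<br />

```r
# Hypothetical stand-in for the real Google Analytics query;
# it returns made-up numbers so the sketch runs without API access.
fetch_view <- function(view.id) {
  data.frame(viewId = view.id,
             sessions = nchar(view.id) * 10,
             stringsAsFactors = FALSE)
}

# Categories kept in a named list (IDs shortened from the examples above)
categories <- list(
  category1 = c("79242136", "89242136"),
  category2 = c("54120", "54121", "54125"),
  category3 = c("60123")
)

results <- list()
for (cat.name in names(categories)) {
  for (v in categories[[cat.name]]) {
    ga.data <- fetch_view(paste("ga:", v, sep = ""))
    ga.data$category <- cat.name               # label each row with its category
    results[[length(results) + 1]] <- ga.data  # collect results in a list
  }
}

all.data <- do.call(rbind, results)  # stack everything into one data frame
table(all.data$category)             # rows per category
```

Collecting the pieces in a list and binding them once at the end avoids having to re-initialize a data frame between categories, and the "category" column lets you split or filter the combined data afterwards.<br />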
<br />
<h3>
Step 5: Do whatever you want with your data frame!</h3>
<br />
At this point, you should have all the Google Analytics data available in your R workspace. And most importantly, categorized!<br />
<br />
You might need now to perform some cleaning on your data, visualize it or export it into another format. Fortunately R offers you so many functions and packages that you can do basically whatever you want with those data.<br />
<br />
If, for example, you need to export your data frame to a .csv file, you can do it very quickly using the write.csv command:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/utils/write.csv"><span style="color: #003399; font-weight: bold;">write.csv</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/df"><span style="color: #003399; font-weight: bold;">df</span></a><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/file"><span style="color: #003399; font-weight: bold;">file</span></a>=<span style="color: blue;">"category1.csv"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
Another data munging operation you might want to perform on your Google Analytics data is converting dates into a friendlier format. In fact, the dates you extract from Google Analytics come into R as character data, in the "yyyyMMdd" format. You can convert them with the following code:<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/class"><span style="color: #003399; font-weight: bold;">class</span></a><span style="color: #009900;">(</span>ga.data$date<span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;"># dates come as character</span>
newDate<-<a href="http://inside-r.org/r-doc/base/as.Date"><span style="color: #003399; font-weight: bold;">as.Date</span></a><span style="color: #009900;">(</span>ga.data$date<span style="color: #339933;">,</span><span style="color: blue;">"%Y%m%d"</span><span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;">#convert into date data type (%m is the month; %M would mean minutes)</span>
newFormat<- <a href="http://inside-r.org/r-doc/base/format"><span style="color: #003399; font-weight: bold;">format</span></a><span style="color: #009900;">(</span>newDate<span style="color: #339933;">,</span><span style="color: blue;">"%m/%d/%y"</span><span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;">#to change format, but it converts it back to character class</span>
newFormat<- <a href="http://inside-r.org/r-doc/base/as.Date"><span style="color: #003399; font-weight: bold;">as.Date</span></a><span style="color: #009900;">(</span>newFormat<span style="color: #339933;">,</span><span style="color: blue;">"%m/%d/%y"</span><span style="color: #009900;">)</span> <span style="color: #666666; font-style: italic;">#convert it back to date data type, using the same format string</span></pre>
</div>
</div>
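<br />
As a quick check of the conversion, here is a small self-contained sketch on two hard-coded strings (illustrative values, not real query output):<br />

```r
# Two character dates in the yyyyMMdd form returned by Google Analytics
raw <- c("20150401", "20150430")

dates <- as.Date(raw, format = "%Y%m%d")   # %m = month; %M would mean minutes
class(dates)                               # "Date"

# format() is for display only and returns character strings again
pretty <- format(dates, "%d/%m/%Y")
pretty                                     # "01/04/2015" "30/04/2015"

# Date objects support arithmetic, which plain strings do not
as.numeric(dates[2] - dates[1])            # 29 (days)
```

Keeping the column as a Date and formatting it only when displaying lets you sort, filter and plot by date reliably.<br />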
<br />
In general I suggest you use the <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html" target="_blank">dplyr package</a> for any data manipulation operation you might need to perform on your data frame.<br />
<br />
And of course, you could include all the data cleaning/manipulation commands inside the above <i>for</i> loop if you like. By doing that, you would automate your process even more, and end up with a data frame ready to be reported or visualized for your audience. <br />
<div>
<br /></div>
<b>R Statistics for Digital Analytics: 8 Blogs you should Follow</b> (posted by marqui, 2015-03-30; updated 2016-04-15)<br />
Are you interested in using R for your digital analytics projects? Do you need to perform predictive modelling and visualizations on your digital data, and Excel just can't do the job the way you want?<br />
<br />
Or do you simply have no idea how R could help with your digital analytics problems, and would like to see some real working examples first?<br />
<br />
Well, there are two pieces of good news for you.<br />
<br />
The first is that you are not alone. There is quite a vibrant community out there, sharing more and more examples of how to get real value from using R in digital analytics. They often post/tweet around the <a href="https://twitter.com/search?q=%23rstats&src=typd" target="_blank"><b>#rstats</b> hashtag</a>.<br />
<br />
The second is that I decided to write a post about it. I am going to list here the main <b>blogs (and people) that might be useful to add to your "R Stats + Digital Analytics" reading list</b>.<br />
<a name='more'></a><br />
I came up with a list of 8 top contributors for now (please suggest more!). A few of them don't actually have a blog, but it still made sense to include them here. What these people have in common is that:<br />
<br />
<ul>
<li>they are <b>promoting R</b> as a powerful tool for digital analytics and are encouraging analysts to move away from traditional tools like Excel spreadsheets.</li>
<li>they are <b>helping those who want to learn R</b> and apply it to digital analytics to get started, by sharing examples of real case analyses.</li>
<li>most of them have been <b>nominated by the Digital Analytics Association</b> this year for the "Most Influential Vendor/Agency" award, because of the effort they are making to help digital analytics practitioners skill up and smarten up when it comes to R Stats.</li>
</ul>
<br />
<br />
For each blog or person I have included:<br />
<br />
<ul>
<li>a brief introduction and links to their main works.</li>
<li>Twitter account</li>
<li>Github account (if they have one)</li>
<li>Blog (if they publish on a blog)</li>
</ul>
<br />
<br />
<h3>
1. Tatvic</h3>
<br />
Tatvic is a Google Analytics Certified Partner offering Web Analytics Consulting Services for Google Analytics, Omniture, etc.<br />
<br />
First of all, one of its team members, Kushan Shah, is the maintainer of the popular RGoogleAnalytics package, which lets you connect to the GA API through R (this package was initially built by a team at Google).<br />
<br />
Secondly, they are actively promoting the use of R for analysing web analytics data and building predictive models based on it. They run practical <a href="http://www.tatvic.com/webinar/r-google-analytics/" target="_blank">webinars</a> and have a blog where they publish real case scenarios of mining Google Analytics data and generating insights through R.<br />
<br />
A couple of very interesting applications they produced are:<br />
<br />
<a href="http://www.tatvic.com/blog/product-revenue-prediction-with-r/" target="_blank">Predicting product revenue with R</a><br />
<br />
<a href="http://www.tatvic.com/blog/web-analytics-visualization-through-ggplot2/" target="_blank">Web Analytics Visualization through ggplot</a><br />
<br />
<a href="http://www.tatvic.com/blog/predict-bounce-rate-based-on-page-load-time-in-google-analytics/" target="_blank">Predict Bounce Rate based on Page Load Time in Google Analytics</a><br />
<br />
<br />
If you want to see real examples of using R to explore Google Analytics data, I recommend you follow Tatvic.<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/tatvic">@tatvic</a><br />
<br />
<b>Github</b>: <a href="https://github.com/Tatvic/RGoogleAnalytics">https://github.com/Tatvic/RGoogleAnalytics</a><br />
<br />
<b>Blog</b>: <a href="http://www.tatvic.com/blog/">http://www.tatvic.com/blog/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs3dFgC0OApIbFRasGSup5qu4z9S0mIK53JGE1xbPFz37pG2mKLw37hfTENqapnIDIQ9UCxR2AEotGCd1SKvyG5EN5nUksgysqETbtsxDaOX3wnVfsjrwdKclgK77fkbvzuShXKGNdim8/s1600/Tatvic.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Tatvic R stats Digital Analytics" border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs3dFgC0OApIbFRasGSup5qu4z9S0mIK53JGE1xbPFz37pG2mKLw37hfTENqapnIDIQ9UCxR2AEotGCd1SKvyG5EN5nUksgysqETbtsxDaOX3wnVfsjrwdKclgK77fkbvzuShXKGNdim8/s1600/Tatvic.jpg" title="Tatvic blog" width="400" /></a></div>
<br />
<br />
<br />
<h3>
2. Online Behaviour</h3>
<br />
Online Behaviour is a blog that focuses on Web Analytics, Usability, Testing and Digital Marketing techniques. Its founder, Daniel Waisberg, who works as Analytics Advocate at Google, is an R user and is actively promoting the use of R within the digital analytics community.<br />
<br />
His blog is also a great place for other experts in the digital analytics field to share their work. So, if you have built something interesting with R + Google Analytics, you might want to let him know!<br />
<br />
Have a read of these great articles published on Online Behaviour:<br />
<br />
<a href="http://online-behavior.com/analytics/r" target="_blank">Visualizing Google Analytics Data With R [Tutorial]</a> (by Daniel Waisberg)<br />
<br />
<a href="http://online-behavior.com/analytics/big-data" target="_blank">Big Data – What It Means For The Digital Analyst</a> (by Daniel Waisberg)<br />
<br />
<a href="http://online-behavior.com/analytics/shiny" target="_blank">Building A Google Analytics App With Shiny & R</a> (guest post by Chris Beeley)<br />
<br />
<a href="http://online-behavior.com/analytics/statistical-significance" target="_blank">Testing Statistical Significance On Google Analytics Data</a> (guest post by Mark Edmondson)<br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/onbehavior">@onbehavior</a><br />
<br />
<b>Github</b>: check out each single post author<br />
<br />
<b>Blog</b>: <a href="http://online-behavior.com/">http://online-behavior.com/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgecvPdXqJ68Uyxn0aDjvaxqLs_0VXerqenW68h4bgTR9y1rsiVINEr9bj5I4DyE8Ut-jfB6_Cblf7sA7DTS2tLNN59RmtMYG5RofmkmL0FcIO15raTMWjl1fmj7-Zo4qtpQ6S5jMeFbps/s1600/OnlineBehaviour.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Online Behaviour R stats Digital Analytics" border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgecvPdXqJ68Uyxn0aDjvaxqLs_0VXerqenW68h4bgTR9y1rsiVINEr9bj5I4DyE8Ut-jfB6_Cblf7sA7DTS2tLNN59RmtMYG5RofmkmL0FcIO15raTMWjl1fmj7-Zo4qtpQ6S5jMeFbps/s1600/OnlineBehaviour.jpg" title="Online Behaviour blog" width="400" /></a></div>
<br />
<br />
<br />
<h3>
3. Mark Edmondson</h3>
<br />
Mark works as a Digital Analyst at Wunderman and has been sharing some very interesting web applications built with R and the <a href="http://shiny.rstudio.com/" target="_blank">Shiny package</a>. A great feature he includes in his apps is the automation of the authentication process, which allows you to authenticate with your Google Analytics account/profile and run the app using your own data. Amazing.<br />
<br />
He has his own blog and he is also a guest editor at the Online Behaviour blog. I do recommend you put him on your reading list, and check out the posts below. By the way, he has also made an amazing presentation available at RPubs about how and why to use R in digital analytics. After reading it, you will be more than tempted to close your Excel spreadsheet and <a href="http://www.rstudio.com/products/rstudio/download/" target="_blank">download RStudio</a>! <br />
<br />
<a href="http://rpubs.com/MarkeD/r-in-digital-analytics-workflow" target="_blank">R in a Digital Analytics Workflow</a> (RPubs presentation)<br />
<br />
<a href="http://markedmondson.me/my-google-analytics-time-series-shiny-app-alpha" target="_blank">My Google Analytics Time Series Shiny App (Alpha)</a><br />
<br />
<a href="http://markedmondson.me/finding-the-roi-of-title-tag-changes-using-googles-causalimpact-r-package" target="_blank">Finding the ROI of Title tag changes using Google's CausalImpact R package</a><br />
<br />
<a href="http://markedmondson.me/how-i-made-ga-effect-creating-an-online-statistics-dashboard-using-reais" target="_blank">How I made GA Effect - creating an online statistics dashboard using R</a><br />
<br />
<a href="http://online-behavior.com/analytics/statistical-significance" target="_blank">Testing Statistical Significance On Google Analytics Data</a> (guest post at Online Behaviour)<br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/HoloMarkeD">@HoloMarkeD</a><br />
<br />
<b>Github</b>: <a href="https://github.com/MarkEdmondson1234">https://github.com/MarkEdmondson1234</a><br />
<br />
<b>Blog</b>: <a href="http://markedmondson.me/">http://markedmondson.me/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHeo9A6T_s2CuaySGqs4efn-sdL3HI5miEcWOUWPtMwPV4IaafnRuRvXH3GGiiMqcxKak5IuHo5dWOgpT3LxBSUaHzafGL3Glmw3yoo23_IM82itI-bgf0hFeO2hw8hEmOM5RAjipCUCw/s1600/MarkEdmondson.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Mark Edmondson R stats Digital Analytics" border="0" height="147" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHeo9A6T_s2CuaySGqs4efn-sdL3HI5miEcWOUWPtMwPV4IaafnRuRvXH3GGiiMqcxKak5IuHo5dWOgpT3LxBSUaHzafGL3Glmw3yoo23_IM82itI-bgf0hFeO2hw8hEmOM5RAjipCUCw/s1600/MarkEdmondson.jpg" title="Mark Edmondson blog" width="400" /></a></div>
<br />
<br />
<h3>
4. Randy Zwitch</h3>
<br />
Randy Zwitch is a Data Scientist, and he is the Lead developer for <a href="https://github.com/randyzwitch/RSiteCatalyst" target="_blank">RSiteCatalyst</a>, an R package for accessing the <a href="http://www.adobe.com/solutions/digital-analytics/marketing-reports-analytics.html" target="_blank">Adobe SiteCatalyst</a> (Omniture) Reporting API. So if you are a SiteCatalyst user, you must try this package.<br />
<br />
Randy was nominated for the 2015 DAA Practitioner of The Year because of his innovative work in the areas of data science, big data and his ability to create real products for the digital analytics community.<br />
<br />
He shares his work via his personal blog, where you can find a specific section about digital analytics.<br />
<br />
Here are a few posts from his blog you might like to check out:<br />
<br />
<a href="http://randyzwitch.com/r-google-analytics-api/" target="_blank">Analysing the percentage of Google organic search terms that are listed as "(not provided)"</a><br />
<br />
<a href="http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts/" target="_blank">Visualizing Website Pathing With Sankey Charts</a><br />
<br />
<a href="http://randyzwitch.com/rsitecatalyst-k-means-clustering/" target="_blank">Clustering Search Keywords Using K-Means Clustering</a><br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/randyzwitch">@randyzwitch</a><br />
<br />
<b>Github</b>: <a href="https://github.com/randyzwitch">https://github.com/randyzwitch</a><br />
<br />
<b>Blog</b>: <a href="http://randyzwitch.com/category/digital-analytics/">http://randyzwitch.com/category/digital-analytics/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx15qmKFXlWmU4yp6qYaRHlQ1xSgyKafHd-qk3PlyL1OdYJ7l3WDTlJQEHuj-elnDevYJQIFqWtFENKpSJJQN62xXPnUMy1wZFA-mis5X9T8YDAUWim4Lvqkqb4Jz0deoXaOa83QQ8i-g/s1600/RandyZwitch.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Randy Zwitch R stats Digital Analytics" border="0" height="182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx15qmKFXlWmU4yp6qYaRHlQ1xSgyKafHd-qk3PlyL1OdYJ7l3WDTlJQEHuj-elnDevYJQIFqWtFENKpSJJQN62xXPnUMy1wZFA-mis5X9T8YDAUWim4Lvqkqb4Jz0deoXaOa83QQ8i-g/s1600/RandyZwitch.jpg" title="Randy Zwitch blog" width="400" /></a></div>
<br />
<br />
<br />
<h3>
5. Johann Deboer</h3>
<br />
Johann is the author of the <a href="https://github.com/jdeboer/ganalytics" target="_blank">ganalytics</a> R package, another package that lets you query Google Analytics data through R.<br />
<br />
He works at <a href="http://www.lovesdata.com/blog/" target="_blank">Loves Data</a> (I recommend you put their blog on your Digital Analytics reading list) and he is actively encouraging the use of R for Web Analytics. Check out his presentation on <a href="http://www.slideshare.net/johanndeboer/web-analytics-with-r-melb-urn" target="_blank">doing Web Analytics with R</a>, where, among other things, he explains why you would use R instead of traditional spreadsheets like Excel.<br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/johannux">@johannux</a><br />
<br />
<b>Github</b>: <a href="https://github.com/jdeboer">https://github.com/jdeboer</a><br />
<br />
<br />
<br />
<h3>
6. Bror Skardhamar</h3>
<br />
He is the author of the <a href="https://github.com/skardhamar/rga" target="_blank">RGA package</a>, another package designed to extract data from the Google Analytics API to R.<br />
<br />
Check out his github account and follow him on Twitter for more info. <br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/skardhamar">@skardhamar</a><br />
<br />
<b>Github</b>: <a href="https://github.com/skardhamar">https://github.com/skardhamar</a><br />
<br />
<br />
<br />
<h3>
7. Lunametrics</h3>
<br />
Lunametrics is a very well known name within the digital analytics community. They have in-depth knowledge and experience in Google Analytics thanks to their close relationship with Google, and you can soon realize it by reading some of the technical posts published on their blog.<br />
<br />
They have also made use of R scripts on a few occasions, and I guess they will be publishing more R content in 2015. Have a read of the <a href="http://www.lunametrics.com/blog/2014/06/25/google-analytics-data-mining-bigquery-r/" target="_blank">Google Analytics Data Mining with Big Query and R</a> post, where the author, Noah Haibach, provides an R script for generating an E-commerce report with visualizations that are not currently possible inside the Google Analytics platform.<br />
<br />
To be sure not to miss anything, I would include them in your "R stats + Digital Analytics" reading list too.<br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/LunaMetrics">@LunaMetrics</a><br />
<br />
<b>Github</b>: <a href="https://github.com/lunametrics">https://github.com/lunametrics</a><br />
<br />
<b>Blog</b>: <a href="http://www.lunametrics.com/blog/">http://www.lunametrics.com/blog/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge_3KYMTSTcVDr_9nlYdqrj73WDKhWmUjg21ENnBY3gDQIRsIr6e6f8EIuytw8ckszdLehMmZkDE84PVgja8UCHodJF1waGmvJGHE9zZ32opU7YrSFBWxIfORfbP1SJXB7WihAE_L6-fY/s1600/Lunametrics.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Lunametrics R stats Digital Analytics" border="0" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge_3KYMTSTcVDr_9nlYdqrj73WDKhWmUjg21ENnBY3gDQIRsIr6e6f8EIuytw8ckszdLehMmZkDE84PVgja8UCHodJF1waGmvJGHE9zZ32opU7YrSFBWxIfORfbP1SJXB7WihAE_L6-fY/s1600/Lunametrics.jpg" title="Lunametrics blog" width="400" /></a></div>
<br />
<br />
<br />
<h3>
8. R-bloggers</h3>
<br />
Last but not least, I include in this list the site R-Bloggers. What can I say about R-Bloggers? If you decide to invest in R, then you must follow R-Bloggers, since it's currently the main aggregator of content collected from other R bloggers.<br />
<br />
R-Bloggers is not a blog about digital analytics (most of the posts published there are not related to it). However, it's very likely that any new post about R and digital analytics will be published there too. So I strongly recommend you follow R-Bloggers:<br />
<br />
<ul>
<li>to learn about R and keep up to date with new packages and developments in the R community</li>
<li>to make sure you don't miss any digital analytics related post</li>
</ul>
<br />
<br />
<br />
<b>Twitter</b>: <a href="https://twitter.com/Rbloggers">@Rbloggers</a><br />
<br />
<b>Github</b>: check out each single R blogger<br />
<br />
<b>Blog</b>: <a href="http://www.r-bloggers.com/">http://www.r-bloggers.com/</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRx-azm3tPML7-zPjgDq79zrdGr3fwE3EqVWH51y-texr3VTWA1kH5SBzgzhHbdJnFVEbs62mMIazpMX5ci31XN1Tn1dwlQla6YORwa6NvElgorUw2SpJfCgNFMBzSA3ga1vmQPhDCBh0/s1600/R-bloggers.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="R-bloggers R stats + Digital Analytics" border="0" height="197" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRx-azm3tPML7-zPjgDq79zrdGr3fwE3EqVWH51y-texr3VTWA1kH5SBzgzhHbdJnFVEbs62mMIazpMX5ci31XN1Tn1dwlQla6YORwa6NvElgorUw2SpJfCgNFMBzSA3ga1vmQPhDCBh0/s1600/R-bloggers.jpg" title="R-bloggers blog" width="400" /></a></div>
<br />
<br />
<br />
<br />
I hope that this list will be useful for you. Whether you are new to R or already have hands-on experience and now want to apply it to your digital analytics data, you should follow the people I mentioned above.<br />
<br />
I've also created <b>a list on Twitter</b> with all of them. It's called "<a href="https://twitter.com/mcpasin/lists/r-for-digital-analytics" target="_blank"><b>R for Digital Analytics</b></a>" and you can click on the link to subscribe if you like.<br />
<br />
I also tweet quite often myself around the <b>#rstats</b> hashtag (follow me on Twitter at <a href="https://twitter.com/mcpasin" target="_blank">@mcpasin</a>) and have recently written two posts about creating dashboards with R. Have a read:<br />
<br />
<ul>
<li><a href="http://www.analyticsforfun.com/2015/01/google-analytics-dashboards-with-r-shiny.html">Google Analytics Dashboards with R and Shiny</a> (a simple approach using Google Charts)</li>
<li><a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html">Playing with R, Shiny Dashboard and Google Analytics Data</a> (a more sophisticated and nicer dashboard using the <i>shinydashboard</i> package).</li>
</ul>
<br />
And please please please, feel free to suggest other people to include in this list.<br />
<br />
Happy reading!!<br />
<div>
<br /></div>
<b>Google Analytics Dashboards with R & Shiny</b> (posted by marqui, 2015-01-28; updated 2016-04-24)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX-a7B2jnvMW_IWes8n6dHq6G2JoDMov6pkL5tUt6GlZJCkkHFmJbpuMhQc5nokNlbxokakiP3RMCvoB_zBJPOy9BJzmfgyA96zKJi9HrWyDDNreVBggo7pKWE6uRPPzOwZSIRw_4YD6Y/s1600/DashboardScreenshot.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="Google Analytics Dashboards with R & Shiny" border="0" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX-a7B2jnvMW_IWes8n6dHq6G2JoDMov6pkL5tUt6GlZJCkkHFmJbpuMhQc5nokNlbxokakiP3RMCvoB_zBJPOy9BJzmfgyA96zKJi9HrWyDDNreVBggo7pKWE6uRPPzOwZSIRw_4YD6Y/s1600/DashboardScreenshot.jpg" title="Google Analytics Dashboard with R & Shiny" width="320" /></a></div>
One of the key activities of any web or digital analyst is to design and create dashboards. The main objective of a web analytics dashboard is to display the current status of your key web metrics and arrange them on a single view, so that information can be monitored at a glance. Great dashboards should allow you/your boss or client to take action quickly and spot trends in data.<br />
<br />
There are plenty of tools for creating dashboard out there. You can decide to create your dashboard directly in Google Analytics, using a spreadsheets (e.g. Excel or <a href="http://analytics.blogspot.com.ar/2015/01/simplify-your-google-analytics.html" target="_blank">Google Sheets</a>) or you might decide to go for an ad hoc dashboarding solution such as <a href="http://www.tableau.com/" target="_blank">Tableau</a>, or <a href="http://www.klipfolio.com/" target="_blank">Klipfolio</a> (I am a heavy user of the latter).<br />
<br />
In this blog post I aim to move away a bit from traditional dashboarding tools, and I will show you <b>an example of a Google Analytics dashboard I've built using the R programming language and the Shiny package</b>. Finally, I will also summarize the <b>main benefits</b> of using such tools for creating dashboards and performing data analysis in a digital analytics context.<br />
<br />
[UPDATE: I've recently built a more sophisticated and better looking dashboard using the <i>shinydashboard</i> package. <a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html">Click here</a> to see it.]<br />
<a name='more'></a><br />
<h2>
<span style="font-size: large;">
R and Shiny introduction </span></h2>
<br />
<a href="http://www.revolutionanalytics.com/what-r" target="_blank">R is a very powerful platform for data analysis</a>. R is very good at many things, including statistical modelling and data visualization, and it relies on a very large and enthusiastic community of users and developers who keep the product growing and improving. For all these reasons, R is widely used today by scientists, researchers, and statisticians. Many companies also use R routinely for data analysis: Google, Facebook, The New York Times, Twitter, and Coursera, to name a few. As Dave Smith wrote in a recent paper, "<a href="http://www.revolutionanalytics.com/sites/default/files/r-is-still-hot.pdf" target="_blank">R is still hot and getting hotter</a>". <br />
<br />
<a href="http://shiny.rstudio.com/" target="_blank">Shiny</a>, on the other hand, is an R package developed by the team at RStudio that allows you to build interactive web applications using R code. So let's say you have performed some data analysis with R: you can now wrap it into an app and share it with other people, who do not need to be R users.<br />
<br />
With the development of Shiny, R is gaining more interactivity and is becoming quite an attractive option for analysts (<a href="http://www.analyticsforfun.com/2015/03/r-stats-digital-analytics-8-blogs-you.html" target="_blank">learn from those who are already using R for Digital Analytics</a>) who want to build interactive dashboards and share data with their boss, clients or co-workers. Pretty cool, isn't it?<br />
<br />
Let's show you an example...<br />
<br />
<h2>
<span style="font-size: large;">
A simple dashboard scenario: segment traffic by device</span></h2>
The Shiny application I created simulates a simple dashboard scenario where users can segment data by traffic device (desktop vs mobile vs tablet) through a radio button.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://makeagif.com/JSR0HD" style="margin-left: 1em; margin-right: 1em;" title="GAdashboard on Make A Gif"><img alt="GAdashboard on Make A Gif" src="http://cdn.makeagif.com/media/1-26-2015/JSR0HD.gif" /></a></div>
<br />
<br />
<br />
The dashboard is composed of 4 visualizations:<br />
<br />
<ol>
<li>A line chart showing sessions and sign-ups daily trend. Sessions are measured on primary axis while sign-ups on secondary axis.</li>
<li>A bubble chart plotting for each traffic channel three metrics: number of sessions, avg. pages per session and revenue. This visualization can be quite interesting to analyse channels performance with respect to the website objective (e.g. revenue), and currently it is not available in Google Analytics acquisition reports.</li>
<li>A line chart showing bounce rate daily evolution.</li>
<li>A world map visualizing the number of new users: the darker the country, the more new users visited the site from that country. </li>
</ol>
<br />
The app is hosted at <a href="http://shinyapps.io/">Shinyapps.io</a>, a dedicated website where you can deploy and share your Shiny applications online. (Sorry: because of the current limits of the free Shinyapps plan, the app is temporarily unavailable. You can still <a href="https://github.com/mcpasin/web-analytics-dashboard" target="_blank">get the code at GitHub here</a> and run it on your own machine.)<br />
<br />
As you can see, it's a very simple scenario, both in terms of user interface and of the calculations running in the background. Nothing complex and no statistical modelling involved, though that would definitely be a very powerful feature to include in an R-coded dashboard.<br />
<br />
What I did was play a little with <a href="https://developers.google.com/chart/" target="_blank">Google Charts visualizations</a> through the GoogleVis package. <a href="http://cran.r-project.org/web/packages/googleVis/googleVis.pdf" target="_blank">GoogleVis is an R package</a> that provides an interface to the Google Vis API and <b>makes creating interactive plots quite easy</b>. Interactive means that users can manipulate the data and look for the info they need.<br />
<br />
Except for the bubble chart, all the charts I used to create the dashboard are available in Google Analytics reports. But if you like, you can do much more. Among the charts available in the GoogleVis package there are scatter charts, histograms, stepped area charts, org charts, tree maps, gauge charts and boxplots. Here is the <a href="https://google-developers.appspot.com/chart/interactive/docs/gallery" target="_blank">complete list of visualizations you can do with Google Charts</a>.<br />
<br />
Like all Shiny applications, this dashboard app is made of <b>two code files</b> (<i>ui.R</i> and <i>server.R</i>) which must be placed in the same directory:<br />
<br />
<ul>
<li><i>ui.R</i> = it defines how the web application looks to users. The calls you make in this file generate HTML code.</li>
<li><i>server.R</i> = this is normal R code where you perform your data analysis.</li>
</ul>
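<br />
To make the split between the two files concrete, here is a minimal sketch of a single-file app with one radio button and one chart. It is not the code of the dashboard above: the sample data, the helper <i>filter_by_device</i> and the output name <i>trend</i> are hypothetical, and it assumes the shiny and googleVis packages are installed.<br />

```r
library(shiny)
library(googleVis)

# Hypothetical sample data standing in for the Google Analytics export.
traffic <- data.frame(
  date     = rep(as.character(as.Date("2015-01-01") + 0:2), times = 3),
  device   = rep(c("desktop", "mobile", "tablet"), each = 3),
  sessions = c(120, 135, 128, 60, 72, 66, 15, 18, 12),
  stringsAsFactors = FALSE
)

# Plain helper so the filtering logic can be checked outside Shiny.
filter_by_device <- function(data, device) {
  data[data$device == device, , drop = FALSE]
}

# What would live in ui.R: page layout and input controls.
ui <- fluidPage(
  titlePanel("Sessions by device"),
  radioButtons("device", "Device:", choices = c("desktop", "mobile", "tablet")),
  htmlOutput("trend")
)

# What would live in server.R: ordinary R code that reacts to the input.
server <- function(input, output) {
  output$trend <- renderGvis({
    gvisLineChart(filter_by_device(traffic, input$device),
                  xvar = "date", yvar = "sessions")
  })
}

# Uncomment to launch the app locally.
# shinyApp(ui, server)
```

Every time the radio button changes, Shiny re-runs the <i>renderGvis</i> block, so the chart always reflects the selected device.<br />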
<br />
To build the actual 4-chart dashboard, I first created each chart object separately and then merged them in pairs as follows:<br />
<br />
<span style="background-color: #9fc5e8;">library(googleVis)</span><br />
<span style="background-color: #9fc5e8;">D1 <- gvisLineChart(dataDevice, "date", c("sessions","signup"))</span><br />
<span style="background-color: #9fc5e8;">D2 <- gvisBubbleChart(channelsDevice, idvar="channel", xvar="sessions", yvar="pages.sessions")</span><br />
<span style="background-color: #9fc5e8;">D3 <- gvisLineChart(dataDevice, xvar="date", yvar="bounce.rate")</span><br />
<span style="background-color: #9fc5e8;">D4 <- gvisGeoChart(countriesDevice, "country", "new.users")</span><br />
<span style="background-color: #9fc5e8;">D12 <- gvisMerge(D1,D2, horizontal=TRUE)</span><br />
<span style="background-color: #9fc5e8;">D34 <- gvisMerge(D3,D4, horizontal=TRUE)</span><br />
<span style="background-color: #9fc5e8;">D1234 <- gvisMerge(D12,D34, horizontal=FALSE)</span><br />
<span style="background-color: #9fc5e8;">plot(D1234)</span><br />
<br />
All of the code for this dashboard application lives on this <a href="https://github.com/mcpasin/web-analytics-dashboard" target="_blank">GitHub Repo here</a>. Raw data was downloaded manually from Google Analytics in .csv format, though this operation can be automated by connecting directly with Google Analytics API (see RGoogleAnalytics package).<br />
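<br />
As a rough illustration of the step that comes before the charts, the sketch below reads a small .csv extract and subsets it by device using base R only. The column names and figures are made up for the example and are not taken from the real export; the name <i>dataDevice</i> simply mirrors the data frame used in the chart code.<br />

```r
# Hypothetical excerpt of a Google Analytics .csv export; column names are assumptions.
csv_text <- "date,device,sessions,signup
2015-01-01,desktop,120,6
2015-01-01,mobile,60,2
2015-01-02,desktop,135,7
2015-01-02,mobile,72,3"

# For a real export, pass the file path instead of the text argument.
ga_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
ga_data$date <- as.Date(ga_data$date)

# Keep only the rows for the device picked in the radio button.
dataDevice <- ga_data[ga_data$device == "mobile", ]

# Total sessions per device, handy as a quick sanity check.
sessions_by_device <- tapply(ga_data$sessions, ga_data$device, sum)
```

In the Shiny app, the hard-coded "mobile" would be replaced by the value coming from the radio button.<br />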
<br />
<br />
<h2>
<span style="font-size: large;">
Benefits of using R & Shiny to create a Google Analytics Dashboard</span></h2>
<br />
So, what are the main benefits of using R & Shiny to create a Google Analytics dashboard? And to answer a broader question: why should you use R for web analytics?<br />
<br />
<div style="text-align: center;">
<i><span style="color: #666666;">Bonus:</span> <a href="http://www.analyticsforfun.com/2015/03/r-stats-digital-analytics-8-blogs-you.html" target="_blank">Check out who else is using R for Web Analytics and follow them!</a></i></div>
<br />
With the development of a package such as Shiny, R definitely becomes a more attractive option for analysts who build dashboards. Below I have put together a list of 12 main benefits you would gain by using R to create a Google Analytics dashboard:<br />
<br />
<ol>
<li><b>Advanced statistics capabilities & prediction models</b>. R was born as a statistical language and remains the reference language for statisticians. It has packages for many specialized functions and is kept up to date thanks to its open source nature. Using R for web analytics would allow you to incorporate sophisticated prediction models easily in your dashboard and, more importantly, let your boss/client explore and interact with the models you have built (e-commerce is a very interesting field in which to apply prediction models).</li>
<li><b>State-of-the-Art Visualizations</b>. R has very advanced graphics capabilities which let you create beautiful and interactive dashboards. R offers several powerful packages like GoogleVis (the one I used in the above dashboard), ggplot2, ggvis or dygraphs.</li>
<li><b>Connect directly to Google Analytics API</b>. In my dashboard example I manually downloaded the data in .csv format (mainly for privacy reasons), but you can automate the retrieval of data through dedicated R packages. Check out this recent post that explains <a href="https://github.com/mcpasin/web-analytics-dashboard" target="_blank">how to connect Google Analytics to R</a> using the RGoogleAnalytics package. And <a href="http://www.analyticsforfun.com/2015/05/query-multiple-google-analytics-view.html" target="_blank">learn how to query multiple Views using R</a>.</li>
<li><b>No web development knowledge is required</b>, although if you know some HTML/CSS/JavaScript you can fully customize the user interface and make it suitable for you and your final users.</li>
<li><b>Attractive default UI theme</b>, based on <a href="http://getbootstrap.com/" target="_blank">Twitter bootstrap</a>.</li>
<li><b>Shiny can integrate JavaScript libraries</b> like <a href="http://en.wikipedia.org/wiki/D3.js" target="_blank">d3.js for visualizations</a>.</li>
<li><b>Shiny uses a reactive programming model</b> like modern web applications do, which means that when the user changes a value in a <i>ui</i> control (e.g. the radio button), the R code in the background is recalculated and the output that is bound to the <i>ui</i> (e.g. the 4 charts in the dashboard) is re-rendered.</li>
<li><b>Reproducibility</b>. This is a very important concept at the basis of R (and other programming languages too), and it means being able to <a href="http://www.analyticsforfun.com/2013/04/global-distribution-of-breast-cancer.html" target="_blank">capture each step of your data analysis</a> so that you or other people can reproduce it. In a business scenario, reproducibility means being able to reuse complex functions and dashboards for more than one client.</li>
<li><b>Scalability</b>. R is much more powerful and solid than tools like Excel when it comes to processing large amounts of data.</li>
<li><b>Integrate different data sources</b>. R can read almost any type of data (.txt, .csv, etc.). There are R packages specifically designed to read Excel, JSON, XML, etc., and you can even scrape data from websites and execute SQL queries. This means you could potentially integrate different sources of data all in the same dashboard. Once you have imported and cleaned the data, you can build a data frame on which you can use all R functions. Very powerful.</li>
<li>R is <b>an open source project</b>, which means it is continually improved, upgraded, enhanced, and expanded by a global <a href="http://www.analyticsforfun.com/2015/03/r-stats-digital-analytics-8-blogs-you.html" target="_blank">community of incredibly passionate developers</a> and users. Currently R has over 5,000 add-on packages.</li>
<li><b>It's free!</b></li>
</ol>
<div>
<b><br /></b></div>
<div>
What do you think about implementing a dashboard with R & Shiny? Which are the main obstacles you might encounter moving from traditional dashboarding tools to R?<br />
<br />
Do you see R & Shiny playing an important role in digital analytics in the near future?</div>
<div>
<br /></div>
<div>
Share your thoughts and be social!</div>
<br />
<i><span style="color: #666666;">Other articles you will find useful:</span></i><br />
<a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html" target="_blank"><i><b>Playing with R, Shiny Dashboard and Google Analytics Data</b></i></a><br />
<a href="http://www.analyticsforfun.com/2015/03/r-stats-digital-analytics-8-blogs-you.html" target="_blank"><i><b>R for Digital Analytics: 8 Blogs you should Follow</b></i></a><br />
<br />
<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-38477850910299446892014-11-23T23:41:00.001+00:002016-04-24T16:10:21.145+01:00Drawbacks of Using Time Metrics to Measure BlogsWhen it comes to blogging, we all know that CONTENT is king. We also understand that social interactions and readers engagement play a primary role for making the blog successful.<br />
<br />
So far, so good.<br />
<br />
But then it's time to analyse data and make decisions. And that's where we often fail.<br />
<br />
We usually take a web analytics tool like Google Analytics, install basic tracking code on pages, and analyze the blog like any other website. We look at most common metrics and take them as standard references to evaluate future performance. But we forget about the unique features that differentiate blogs from other digital properties: content consumption and social interactions.<br />
<br />
This post will help you understand <b>one of the most misused metrics for measuring blog performance</b>: I am talking about <b>time on page and time on site</b>. Most bloggers don't understand what time metrics actually measure. So, first of all I will try to explain how they are calculated in a typical web analytics tool (it might be different from what you think!).<br />
<br />
I will then discuss some of the <b>drawbacks of using time metrics</b> to measure blog performance and finally suggest a couple of more solid KPI's to better measure content engagement.<br />
<br />
After reading this post, I am sure you will start looking at time metrics with a bit more critical thinking than before. And perhaps shift your blog analytics focus to other <a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html" target="_blank">more powerful metrics</a>. <br />
<br />
Let's go!<br />
<a name='more'></a><br />
<br />
<h2>
<span style="color: blue; font-size: large;">How are time metrics calculated?</span></h2>
<div>
<span style="color: blue;"><br /></span></div>
<div>
Let's clarify here how Google Analytics calculates time on page and session duration:<br />
<blockquote class="tr_bq">
<i>The time on page for a particular page is calculated by subtracting the initial view time for a particular page, from the initial view time for the subsequent page.</i></blockquote>
From the above definition we get that, to calculate time on a specific page, it's necessary to have a <i>subsequent page</i> too.<br />
<br />
But not always actually.<br />
<br />
To get a more accurate measure, Google Analytics also uses the <a href="http://cutroni.com/blog/2012/02/29/understanding-google-analytics-time-calculations/" target="_blank">"<i>engagement hits</i>" concept</a>. <br />
<blockquote class="tr_bq">
<i>An engagement hit is a hit that results from an event that does not have the opt_noninteraction parameter applied.</i> </blockquote>
<br />
Let's walk through 3 simple scenarios to understand the calculation.<br />
<br />
<br />
<h4>
<b>1st Scenario: A Visit with More than One Pageview</b> (and no engagement hits on the last page)</h4>
<div>
<br /></div>
In a visit with multiple pageviews and no engagement hits on the last page, the session duration metric is calculated as follows:<br />
<br />
Session duration = Time of the first hit on the last page - time of the first hit on the first page<br />
<br />
Here is a visual example of a 3 pageviews visit:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioLOXYk0tULH7VxnYZV_uKiAE0c3LD6g2xFT16DrdR2KeC_3woAzB90PHONe0uVXiAPjXy029SnzvczQ_73f2zSi5uSbTa7FtNyYhEPHxG0pp-QBLwNFSS7Ut_3ZQjlSx0kk955eC3mfE/s1600/calculation1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Time Calculation Google Analytics" border="0" height="330" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioLOXYk0tULH7VxnYZV_uKiAE0c3LD6g2xFT16DrdR2KeC_3woAzB90PHONe0uVXiAPjXy029SnzvczQ_73f2zSi5uSbTa7FtNyYhEPHxG0pp-QBLwNFSS7Ut_3ZQjlSx0kk955eC3mfE/s1600/calculation1.png" title="Time Calculation in Google Analytics" width="400" /></a></div>
<br />
<br />
So, when there is more than one pageview, Google Analytics measures the time between the first hit on the first page (11:00) and the first hit on the last page (11:10) to calculate session duration.<br />
<br />
Time on page is calculated the same way: the time of the first hit on the subsequent page minus the time of the first hit on the current page.<br />
<br />
But what about Page 3 which is not followed by any other page (visitor exits the site on page 3)? Here is where we have trouble measuring time. Since there is no pageview after Page 3 (and no engagement hits), Google Analytics can't figure out the time spent on page. The user actually stayed a total of 15 min on Page 3, but Google Analytics can't calculate it. So, time on Page 3 is unknown and will be reported as 0 in Google Analytics reports.<br />
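<br />
The arithmetic of this first scenario can be sketched in a few lines of R. This only illustrates the subtraction rule described above and is not how Google Analytics is implemented internally; the function name <i>time_on_page</i> is hypothetical, and the 11:05 time for Page 2 is an assumed value, since only the first and last page times are quoted in the text.<br />

```r
# Compute time on page (in minutes) from the first-hit time of each pageview.
time_on_page <- function(hit_times) {
  times <- as.POSIXct(hit_times, format = "%H:%M", tz = "UTC")
  # Each page gets the gap to the next pageview; the last page has no successor,
  # so its time is reported as 0.
  c(as.numeric(diff(times), units = "mins"), 0)
}

# Page 1 and Page 3 times come from the example; Page 2 at 11:05 is assumed.
pageviews <- c(page1 = "11:00", page2 = "11:05", page3 = "11:10")

per_page <- setNames(time_on_page(pageviews), names(pageviews))
session_duration <- sum(per_page)
```

With these inputs Page 3 gets 0 minutes and the session duration comes out at 10 minutes, however long the visitor really stayed on the last page. Calling the same function with a single pageview returns 0, which is the situation of a one-page visit.<br />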
<br />
<i>Does it make sense so far?</i> Let's move to second scenario.<br />
<b><br /></b>
<b><br /></b>
<br />
<h4>
<b>2nd Scenario: A Visit with Only One Pageview </b>(and no engagement hits)</h4>
<div>
<br /></div>
If there is only one page viewed in a visit and no engagement hit on that page, then session duration is equal to zero, and so is time on page. Google Analytics does not find a subsequent page, so it can't figure out how much time the user actually spent on the page. Content reports (for that specific visit) will say "00:00:00".<br />
<br />
THIS IS A COMMON SCENARIO FOR BLOGS, where users often land just to check a specific post (maybe the latest one published) and leave the blog without visiting other pages.<br />
<br />
More on this later.<br />
<br />
<br />
<h4>
<b>3rd Scenario: A Visit with Engagement Hits</b></h4>
<div>
<b><br /></b></div>
In our last scenario, we assume that there is an engagement hit on the last page visited. Then the session duration metric is calculated as follows:<br />
<br />
Session duration = Time of the last engagement hit on the last page - time of the first hit on the first page<br />
<br />
Let's rearrange our previous 3 pageviews example with an engagement hit on Page 3:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wVVwrTK-Jp0KM9ZfpjYmwpjzBFrls_bdOfyhkeO24dcPXxQx3-VZ19q16stb3u6MhQiYJSkZKKf5NTTEnGHA6NAXIK4bh7ApEUTxJUHST6gmLKfhwYOL0R8zB-_n67PBYOTMXyp6ZMY/s1600/calculation3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Time Calculation with Engagement Hit Google Analytics " border="0" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wVVwrTK-Jp0KM9ZfpjYmwpjzBFrls_bdOfyhkeO24dcPXxQx3-VZ19q16stb3u6MhQiYJSkZKKf5NTTEnGHA6NAXIK4bh7ApEUTxJUHST6gmLKfhwYOL0R8zB-_n67PBYOTMXyp6ZMY/s1600/calculation3.png" title="Time Calculation in Google Analytics - engagement hit" width="400" /></a></div>
<br />
<br />
In this scenario, we can get a more accurate measure of Time on Page 3 as well as of Session Duration, since there is an engagement hit happening at 11:20 AM. Let's say, for example, you have configured an event (with no <i>opt_noninteraction</i> parameter applied) that fires when the user scrolls 100% of the page. Google Analytics can take this piece of data and use it to better calculate time metrics.<br />
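<br />
Here is a small sketch of how the engagement hit changes the result, again only as an illustration of the rule and not of the actual Google Analytics implementation. The function name <i>session_duration</i> is hypothetical and the 11:05 time for Page 2 is an assumed value.<br />

```r
# Session duration in minutes: last recorded hit minus first hit of the session.
session_duration <- function(pageview_times, engagement_times = character(0)) {
  to_time <- function(x) as.POSIXct(x, format = "%H:%M", tz = "UTC")
  first_hit <- min(to_time(pageview_times))
  # Engagement hits count as recorded hits, so they can extend the session.
  last_hit <- max(to_time(c(pageview_times, engagement_times)))
  as.numeric(difftime(last_hit, first_hit, units = "mins"))
}

# Page 2 at 11:05 is an assumed value; the other times follow the example.
pageviews <- c("11:00", "11:05", "11:10")

without_event <- session_duration(pageviews)
with_event    <- session_duration(pageviews, engagement_times = "11:20")
```

Without the event the session lasts 10 minutes; with the scroll event at 11:20 it lasts 20 minutes, and Page 3 is credited with the 10 minutes between its first hit and the event.<br />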
<br />
<i>Got the logic behind the calculation?</i> I hope this makes interpreting time metrics a bit clearer.<br />
<br />
<br /></div>
<h2>
<span style="color: blue; font-size: large;">Drawbacks of using time metrics</span></h2>
<div>
<span style="color: blue;"><br /></span></div>
When it comes to measuring blogs, it's tempting to use time metrics as main KPI's or even set them as Goals in Google Analytics (e.g. count a Goal when a user session reaches 120 sec).<br />
<br />
But now that we have clarified how time metrics are calculated, you should have no problem understanding their impact on blog analytics. Especially if you decide to configure them as Goals in Google Analytics.<br />
<br />
<div>
So, <i>what's wrong with using time metrics to measure your blog performance?</i> Below I list what are, in my opinion, the 4 main drawbacks.<br />
<br /></div>
<div>
<h3>
1. The Way they are Calculated (subsequent page rule)</h3>
<i>What about those users who just visit one page and leave the site?</i> This is quite a common behaviour for blog readers. They land on your blog to read your latest post (perhaps from an email subscription) and exit from the same page without interacting with any element.<br />
<br />
No subsequent page and no engagement hits. One pageview, and time on page unknown (actually 0:00:00). <i>Then what? Did they really spend no time on the page, or did they actually read the post?</i><br />
<br />
Google Analytics can't calculate time on page for a single-pageview visit, nor session duration. This also skews average time measures calculated over multiple sessions, which is the next drawback in this list.<br />
<br />
<h3>
2. They Average Potentially Very Different Values</h3>
Google Analytics offers average measures for various metrics, including time. While average values can be useful for understanding trends in blog traffic, you should also be aware that any outlying value (good or bad) can badly skew your average time metrics.<br />
<br />
I am referring in particular to relying on the Avg. Session Duration metric as the main KPI to measure your blog performance. Setting Avg. Session Duration as a Goal in Google Analytics and analysing it at an aggregate level (no segmentation) could lead you to wrong decisions.<br />
<br />
Average metrics can be dramatically skewed. However, they could be a good starting point for your analysis, if you find some way to "clean" them first. A good idea could be applying segments to the data in order to <a href="http://www.analyticsforfun.com/2016/02/what-happens-when-you-have-outliers-in.html">eliminate outliers</a> and get more accurate data. For example analysing Avg. Session Duration for each individual traffic source.<br />
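<br />
A tiny numeric example shows how strong this effect can be. The figures below are invented for illustration (they are not real Google Analytics data), and the two traffic sources are just placeholders.<br />

```r
# Hypothetical session durations in seconds; single-page visits are recorded as 0.
sessions <- data.frame(
  source   = c("email", "email", "email", "organic", "organic", "organic"),
  duration = c(0, 0, 30, 120, 180, 3600),
  stringsAsFactors = FALSE
)

# The aggregate mean is pulled up by one very long session.
overall_mean   <- mean(sessions$duration)
overall_median <- median(sessions$duration)

# Segmenting by traffic source shows how different the groups really are.
mean_by_source   <- tapply(sessions$duration, sessions$source, mean)
median_by_source <- tapply(sessions$duration, sessions$source, median)
```

Here the overall average is 655 seconds while the median is only 75, and the per-source averages (10 seconds for email, 1300 for organic) tell a very different story from the aggregate figure.<br />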
<br /></div>
<div>
<h3>
3. They should measure engagement but we don't really know if it is good or bad engagement </h3>
We generally consider more time spent on the site/blog as a sign of success. More time = higher visitor engagement.<br />
<br />
But time metrics do not actually say anything about engagement!<br />
<br />
The user spent more than 10 min on my blog and visited 5 pages. <i>Is it good??</i> Maybe. Or maybe not. What if the user landed on my site from an organic search and looked all over the blog to find something, without success? He wasted 10 min of his time navigating through my site and eventually left disappointed. <i>Wouldn't it be better to call it bad engagement?</i><br />
<br />
With this doubt in mind, I would not rely exclusively on time metrics to measure engagement. So, be cautious about setting time metrics as goals in Google Analytics and basing your analysis only on them.<br />
<br />
Nevertheless, there are cases where you can plan in advance what is good or bad engagement for you. It depends on your site business model. For support sites for example, you want visitors to find information quickly, then a short visit can be considered successful. On the other hand, if you have an advertising business and sell ad space on your site, you might want visitors to see as many pages as possible since you're paid by impressions.<br />
<br />
<h3>
4. They can be Biased by your Web Analytics Tool Implementation</h3>
<div>
As I mentioned in the previous section, Google Analytics also uses the concept of "engagement hits" to calculate time. This is certainly a great feature that can make time measurement more accurate in many cases, but it can also bias your data if wrongly applied. </div>
<div>
<br /></div>
<div>
Depending on whether a particular user interaction (e.g. a click to open a submenu) is tracked as an event or not (and whether <i>opt_noninteraction</i> is applied to it), time metrics can be dramatically biased. User behaviour is actually the same, but the data can tell two different stories.</div>
<div>
<br /></div>
<div>
Bottom line: make sure you know your implementation and think of potential consequences on metrics before configuring your web analytics tool.<br />
<br /></div>
<br />
<h2>
<span style="color: blue; font-size: large;">More solid metrics to measure your blog performance</span></h2>
</div>
<div>
<span style="color: blue;"><br /></span></div>
<div>
I am not going into detail in this section, since you can find a detailed <a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html" target="_blank">list of KPIs to use for measuring blogs</a> in my previous post.<br />
<br />
To mention a few, you can measure your blog performance with more solid metrics such as:<br />
<br />
<ul>
<li># of subscriptions to your blog</li>
</ul>
<ul>
<li># of social actions divided by # of posts published</li>
</ul>
<ul>
<li># of comments divided by # of posts published</li>
</ul>
<ul>
<li># of downloads (e.g. a post available to download in pdf format)</li>
</ul>
<ul>
<li>% of scrolling on posts</li>
</ul>
<ul>
<li>Depth of pages</li>
</ul>
<ul>
<li>Length of visit</li>
</ul>
<ul>
<li>Video play (completion rate)</li>
</ul>
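<br />
The ratio-style KPIs in this list are simple to compute once you have the monthly totals. Below is a minimal sketch; the data frame and every figure in it are hypothetical.<br />

```r
# Hypothetical monthly totals for a blog; every figure here is made up.
blog <- data.frame(
  month          = c("2014-09", "2014-10"),
  posts          = c(4, 5),
  comments       = c(12, 25),
  social_actions = c(40, 90),
  stringsAsFactors = FALSE
)

# Normalise by posts published so that months with more posts stay comparable.
blog$comments_per_post <- blog$comments / blog$posts
blog$social_per_post   <- blog$social_actions / blog$posts
```

Dividing by the number of posts is what makes a busy month and a quiet month comparable: raw totals would mostly reflect how much was published.<br />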
<div>
<br /></div>
<div>
The most important thing is to choose metrics that have a direct connection with your ultimate blog/website objectives. To make sure your metrics are solid enough, try to answer these questions:<br />
<br />
> <i>Do the metrics you chose, support your goals?</i><br />
<br />
> <i>Can you take actions on your blog using those metrics?</i><br />
<br />
If your answer is YES, your metrics are solid enough to build a measurement plan from them. The metrics I mentioned in the list above are good examples of KPIs for blogs, because they focus on the essential aspects of blogging: content engagement and social interaction. </div>
<div>
<br />
In conclusion, you need to <b>be careful and thorough when using time metrics to make decisions</b>. You must be aware of how they are calculated and what their limitations are when it comes to understanding user behaviour. With those limitations clearly in mind, time metrics can still be a good starting point for a web analysis. And perhaps, with some advanced segmentation, you can get interesting insights.<br />
<br />
<br />
<i>What's your experience with time metrics? Were you aware of the logic behind their calculation? Do you agree with the drawbacks I discussed in this post?</i><br />
<br />
Feel free to comment, criticize, share it!<br />
<br />
Thanks<br />
<br /></div>
</div>
marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-81803170640136363062014-10-28T02:29:00.000+00:002016-08-16T00:45:47.586+01:00How to Measure Content and set great KPIs for your BlogIf you are serious about blogging, then you must have a measurement plan. No matter if you have just started and have only a dozen of visitors, or you already have a very popular blog whose primary purpose is making revenue from advertising. As long as you have some objectives for your blog, then you must decide what you need to measure.<br />
<br />
Why? Because this is the only way to understand your blog performance and whether you are successful or not for your readers (I assume you are not writing only for yourself!).<br />
<br />
<blockquote class="tr_bq">
<i>Developing a measurement plan is the only way to understand whether you are successful or not for your readers</i>.</blockquote>
<br />
In this post I am going to draft <b>a measurement plan for my blog </b> and use it as a learning exercise to discuss critical aspects like choosing KPI's (Key Performance Indicators) and segments of analysis. Google Analytics will be my reference platform for implementing the measurement plan.<br />
<a name='more'></a><br />
<br />
By nature, <b>blogs have some unique peculiarities</b> that differentiate themselves from other digital properties like websites. Two main peculiarities are:<br />
<ol>
<li>the importance of CONTENT and the different way in which it is consumed by readers</li>
<li>the emphasis put on SOCIAL CONVERSATION</li>
</ol>
<br />
Hence, developing a measurement plan for a blog requires you to <b>think more expansively</b> and take these peculiarities into account. You should understand what really makes an effective KPI for measuring your performance, and consider any technical limitation you might encounter when implementing it in a web analytics tool.<br />
<br />
As an example, the way time on site is calculated in Google Analytics (and in web analytics tools in general) should make you conclude that "Avg. Session Time" might not be the best metric to measure the engagement of your blog readers. I will touch on this issue later and will probably reserve an entire new post to discuss other blog measurement issues in detail.<br />
<br />
[Update: check my new post on <a href="http://www.analyticsforfun.com/2014/11/drawbacks-of-using-time-metrics-to.html" target="_blank">Time Metrics and Drawbacks to measure Blogs & Content Consumption</a>]<br />
<br />
Let's go back to the objective of this post now. <br />
<br />
First of all, I am going to give a brief theoretical background about digital measurement, by illustrating the model developed by Avinash Kaushik. If you have not done it yet, read his inspiring post on developing a <a href="http://www.kaushik.net/avinash/digital-marketing-and-measurement-model/" target="_blank">digital marketing measurement model</a>.<br />
<br />
Once I have reviewed Avinash's model, I will apply it to my blog and also discuss some specific aspects to take into account when building a measurement plan for a blog.<br />
<br />
Let's go!<br />
<br />
<br />
Quick jump links:<br />
<span style="color: blue;"><a href="https://www.blogger.com/blogger.g?blogID=8693029506171309303#5steps">1. The 5 steps digital measurement model</a></span><br />
<span style="color: blue;"><a href="https://www.blogger.com/blogger.g?blogID=8693029506171309303#framework">2. A tentative framework to measure my blog performance</a></span><br />
<span style="color: blue;"><a href="https://www.blogger.com/blogger.g?blogID=8693029506171309303#conclusions">3. Conclusions</a></span><br />
<b><br /></b>
<br />
<br />
<h2>
<b><a href="https://www.blogger.com/null" name="5steps"><span style="font-size: large;">The 5 steps digital measurement model</span></a></b></h2>
<br />
What Avinash recommends in his inspiring framework is a HIERARCHICAL model, with business objectives standing at the top and leading the other components of the model.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIMpe0rhG7m0EEdvKx3l-2VolE2ODJGEkHy3xwCstXzsJw2yTk5qg_XEtXrUjvpAHnehrxmI5H54bEONuHM0qObBET2sVkYnoHeM26oluJuh0ex1tQd3gCmUUSc1udyi3LkDo2RENg0gw/s1600/hierarchical.png" imageanchor="1"><img alt="Web Analytics Measurement Model" border="0" height="132" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIMpe0rhG7m0EEdvKx3l-2VolE2ODJGEkHy3xwCstXzsJw2yTk5qg_XEtXrUjvpAHnehrxmI5H54bEONuHM0qObBET2sVkYnoHeM26oluJuh0ex1tQd3gCmUUSc1udyi3LkDo2RENg0gw/s1600/hierarchical.png" title="" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<h3>
<b><span style="color: orange;">#Step1: Document your business objectives</span></b></h3>
<b><span style="color: orange;"><br /></span></b>
The first step to create your digital measurement plan is to define your business objectives. You should ask yourself questions like: <b>Why does my blog/website exist? What are the 3 most important priorities for my blog?</b><br />
<br />
Everything starts from declaring business objectives, or in other words, the mission of your blog. Hence, this is a very critical stage and you should dig really deep into your "digital motivations" in order to get it right.<br />
<br />
And remember, your business objectives should be SMART (Specific, Measurable, Actionable, Realistic, and Timely).
<h3>
<br /><b><span style="color: orange;">#Step2: Identify strategies & tactics to support your objectives</span></b></h3>
<b><span style="color: orange;"><br /></span></b>
I think many analysts find this step a bit confusing. What Avinash means here is identifying the specific strategies that you will leverage to accomplish your business objectives.<br />
<br />
Let's say your business objective in #Step1 was "Selling more products". #Step2 will be the specific strategies you will use to increase your sales, like: increasing the number of subscribers to your monthly promotion newsletter, reducing the abandonment rate in your checkout process, or increasing the online engagement with the content related to the products you aim to sell.
<h3>
<br /><b><span style="color: orange;">#Step3: Choose KPIs</span></b></h3>
<b><span style="color: orange;"><br /></span></b>
Once you have defined your business objectives and how you intend to achieve them, the next step is to choose Key Performance Indicators, also referred to as KPIs.<br />
<br />
A KPI is <b>a metric that helps you understand how you are doing against your objectives</b>. In other words, KPI's are the numbers (METRIC == NUMBER) that you'll analyze daily to understand how your business is performing.<br />
<br />
For an objective such as selling products, great KPI's could be revenue, average order value, or the average number of days/visits needed to make a purchase.<br />
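<br />
As a rough illustration, two of the sales KPI's mentioned above (revenue and average order value) can be computed from a list of orders. This is only a sketch: the orders array and its amount field are made-up sample data, not something exported from a real analytics tool.<br />

```javascript
// Hypothetical sample data: one entry per completed order (amounts in USD).
const orders = [
  { id: "A-1", amount: 40 },
  { id: "A-2", amount: 25 },
  { id: "A-3", amount: 55 },
];

// Revenue is the sum of all order amounts.
function revenue(orderList) {
  return orderList.reduce((sum, order) => sum + order.amount, 0);
}

// Average order value = revenue / number of orders (0 when there are no orders).
function averageOrderValue(orderList) {
  return orderList.length === 0 ? 0 : revenue(orderList) / orderList.length;
}

console.log(revenue(orders));           // 120
console.log(averageOrderValue(orders)); // 40
```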
<br />
Avinash stresses the importance of choosing a few (3 to 5 ideally) meaningful and actionable KPI's. To do so, a KPI should follow 4 rules: it should be uncomplex, relevant, timely, and instantly useful.<br />
<b><span style="color: orange;"><br /></span></b>
<b><span style="color: orange;"><br /></span></b>
<br />
<h3>
<b><span style="color: orange;">#Step4: Set targets</span></b></h3>
<br />
<b><span style="color: orange;"><br /></span></b>
Next, you have to set targets for each of your KPI's. Targets are numerical values that you have pre-determined (with your business leadership, or simply by yourself if you are the only owner of the blog/website) as indicators of success or failure.<br />
<br />
You can create yearly, quarterly, or monthly targets; it's up to you. In our sales example, it could be "Target: Revenue = $125 per month".<br />
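<br />
To make the idea of a target concrete, here is a minimal sketch that compares a KPI's actual value with its target. The function name checkTarget and the actual value of 100 are assumptions made up for this example; only the $125 monthly revenue target comes from the text.<br />

```javascript
// Compare a KPI's actual value with its target.
// Returns the share of the target reached and whether the target was met.
function checkTarget(actual, target) {
  if (target <= 0) {
    throw new Error("target must be a positive number");
  }
  return {
    attainment: actual / target, // 1.0 means exactly on target
    met: actual >= target,
  };
}

// Using the example target from the text: Revenue = $125 per month.
// The actual value of 100 is an invented figure for illustration.
const result = checkTarget(100, 125);
console.log(result.attainment); // 0.8
console.log(result.met);        // false
```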
<br />
<h3>
<br /><b><span style="color: orange;">#Step5: Choose segments of analysis</span></b></h3>
<b><span style="color: orange;"><br /></span></b>
Finally, you should document which segments of data are important to measure for your business. You might want to analyse your KPI's by traffic channel (Organic vs Paid vs Direct, etc.), by returning users, or by users who converted on a specific goal you set in Google Analytics. It depends on you and on the nature of your business.<br />
<br />
What you have to remember is that segmentation provides context to your data and allows you to take action on it. As Avinash says: "Segment or die!".<br />
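<br />
A small sketch of what segmenting a KPI by traffic channel means in practice. The session records and the helper name conversionRateBy are hypothetical; a real analytics tool would do this grouping for you.<br />

```javascript
// Hypothetical sessions, each tagged with its traffic channel and
// whether the visitor converted on a goal.
const sessions = [
  { channel: "Organic", converted: true },
  { channel: "Organic", converted: false },
  { channel: "Paid", converted: false },
  { channel: "Direct", converted: true },
  { channel: "Organic", converted: true },
  { channel: "Paid", converted: false },
];

// Conversion rate per segment: conversions / sessions within each segment.
function conversionRateBy(sessionList, key) {
  const totals = {};
  for (const session of sessionList) {
    const segment = session[key];
    if (!totals[segment]) {
      totals[segment] = { sessions: 0, conversions: 0 };
    }
    totals[segment].sessions += 1;
    if (session.converted) totals[segment].conversions += 1;
  }
  const rates = {};
  for (const segment of Object.keys(totals)) {
    rates[segment] = totals[segment].conversions / totals[segment].sessions;
  }
  return rates;
}

// The overall rate here is 3 / 6 = 0.5, but the segments tell different stories.
const byChannel = conversionRateBy(sessions, "channel");
console.log(byChannel);
```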
<br />
<br />
<br />
<h2>
<b><a href="https://www.blogger.com/null" name="framework"><span style="font-size: large;">A digital analytics framework to measure my blog performance</span></a></b></h2>
<div>
<br />
Let's walk through the measurement plan for my personal blog (if you want to familiarize yourself with my blog objectives, here you can find a bit of <a href="http://www.analyticsforfun.com/p/about.html" target="_blank">background on how I started analyticsforfun.com</a>).</div>
<div>
<br /></div>
<div>
<b>#Step1: Business Objectives</b><br />
<span style="color: orange;"><br /></span>
Why does my blog exist? I would summarize my blog mission with 3 business objectives:</div>
<div>
<br /></div>
<div>
<ol>
<li>Delight my readers with helpful data analysis tutorials and tips</li>
<li>Raise my professional profile within the digital analytics community</li>
<li>Encourage me to learn and stay up to date with the latest trends in digital analytics</li>
</ol>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU6b_gIYx4hDHCwpIzrHMIvJHQHk3MkNG7BSrlUp1NO4gPCnlgBFdWZxm8QO_n7p4-dKQmLlaebGmyqmrwq8KlP4oGwlMnodKNLeGOGExW_Qhqoq1LhVsWyETrWdB3A0SbzMwIlZa78JQ/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Measure a Blog Business Objectives" border="0" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU6b_gIYx4hDHCwpIzrHMIvJHQHk3MkNG7BSrlUp1NO4gPCnlgBFdWZxm8QO_n7p4-dKQmLlaebGmyqmrwq8KlP4oGwlMnodKNLeGOGExW_Qhqoq1LhVsWyETrWdB3A0SbzMwIlZa78JQ/s1600/Picture1.png" title="" width="640" /></a></div>
<br />
<br />
<b>#Step2: Strategies to accomplish my objectives</b><br />
<br />
To accomplish objective 1, "Delight my Readers", I will need to create valuable content for my readers and engage them.<br />
<br />
To raise my professional profile (objective 2), I plan to execute 3 strategies: increase the number of people who subscribe to my newsletter, develop business connections, and increase the number of posts shared over social networks.<br />
<br />
Does it make sense so far?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFbhUNF8UXA_OeGuh7DPXv_vXAcWUNKzSr1NoKBXPoYg5zTJAYXKRbpXGsXreBSjLDgOhACoOeYnsGotyxvHsIi6_U68FdBzdkCA7yKdJxvYmMHu5P21TnLxLjHEzHA9kjMcv2fBNsjeQ/s1600/%232+strategies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="How to Measure Blogs - Strategies " border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFbhUNF8UXA_OeGuh7DPXv_vXAcWUNKzSr1NoKBXPoYg5zTJAYXKRbpXGsXreBSjLDgOhACoOeYnsGotyxvHsIi6_U68FdBzdkCA7yKdJxvYmMHu5P21TnLxLjHEzHA9kjMcv2fBNsjeQ/s1600/%232+strategies.png" title="" width="640" /></a></div>
<br />
<br />
<br />
Finally, another important business objective of my blog is to encourage me to learn and gain more expertise. I am going to spend a few words on this to make it clear.<br />
<br />
An important reason why I started this blog, and why I am still investing in it, is to "speed up" my learning process in the analytics field. Behind each post I write, there is a research activity which includes reading many other bloggers' posts, articles, etc. until I come up with an idea. Other times the idea comes directly from a project I am currently working on, and again I start doing research to gain more knowledge about the subject.<br />
<br />
Planning all my blog activities, including writing, also plays a very important role in blogging.<br />
<br />
These are all activities that enrich my professional experience in the long term. Hence, having a blog works for me like an incentive to keep learning and gain more expertise in the field. Sure, it also steals some of my free time! ;)<br />
<br />
So to deliver on my last blog objective, I need to: first of all, publish new posts on a regular basis; then increase my reader base (subscriptions are a great incentive to keep working on my blog); and finally receive valuable feedback (comments work as another great incentive).<br />
<br />
I hope my argument makes sense. I will provide a few more examples in the next section.<br />
<br />
<br />
<b>#Step3: KPI's to measure my blog performance</b> </div>
<div>
<br />
Below is a list of potential KPI's I could use to measure my blog objectives. I've identified a total of 8 unique KPI's (note that some of them are repeated across different objectives and strategies). I am a bit over the 3 to 5 recommended earlier, but this is just meant to be an example from which you can get some ideas. <br />
<br />
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1aqfBRPEFyUHWoZ1O6Vos1GcaqZ_BswNXuuprPWwaP3pQrWvPAMURB6KyDiYfUM34XiC5WDybLl7f_sJmL5Us7H8llVuINKTCTp0TiBDaSSYQ40qAIXlpAM1L7Td3E2oTz_0EI_5Dp20/s1600/%233+kpi.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="How to Measure Blogs - KPI" border="0" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1aqfBRPEFyUHWoZ1O6Vos1GcaqZ_BswNXuuprPWwaP3pQrWvPAMURB6KyDiYfUM34XiC5WDybLl7f_sJmL5Us7H8llVuINKTCTp0TiBDaSSYQ40qAIXlpAM1L7Td3E2oTz_0EI_5Dp20/s1600/%233+kpi.png" title="" width="640" /></a></div>
<br />
<br />
<b><span style="color: orange;"><span style="color: #f6b26b;"></span></span></b><span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">To measure my ability to create valuable content, first of all I need to measure my effort in terms of <b>number of posts published</b>. To obtain success you first need to deserve it, baby... Here's the bad news: success for bloggers comes with lots of dedication and perseverance, and this means publishing great content regularly. So, before looking at any other awesome metric offered by Google Analytics, you should look at your contribution as a blogger: are you publishing enough posts each month/week? (I am not, but wait to answer until you have set a target!).</span></span></span><br />
<br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">The next question I'd like to answer is whether people actually read my posts. Or, for some reason, do they just land on a page and leave without even giving me a chance to show them what I have to say? </span></span></span><br />
<br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">This user behaviour is very tricky to measure with Google Analytics, as with most web analytics tools. Especially for BLOGS, where readers might land on your blog just to consume your latest article without interacting with any element of your page. Only 1 page consumed, no interactions with content, and a Bounce Rate equal to 100%. Then what?</span></span></span><br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">How can you distinguish users who landed on your blog and left immediately (without reading any content) from readers who actually consumed some content? Tricky! </span></span></span><br />
<br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">The idea is to use some metrics that allow you to capture <b>readers who actually consumed some content</b>. Then you might say: "All right, people bounce on this post but most of them actually read the article (or some of it)" [Note: this does not mean we should not try to optimize the page. Bounce rate is still an important measure and ideally the user should continue his/her journey through the blog by visiting other pages].</span></span></span><br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">To measure content consumption, you will need a bit of coding to implement it in Google Analytics. I recommend you check out the <a href="http://www.optimizesmart.com/implementing-scroll-tracking-via-google-tag-manager/" target="_blank">scroll tracking solutions suggested by Optimize Smart</a> or the <a href="http://cutroni.com/blog/2014/02/12/advanced-content-tracking-with-universal-analytics/" target="_blank">advanced content tracking post by Justin Cutroni</a>. Both implementations allow you to measure content consumption by tracking how far users scroll down your page (e.g.: just opened the page, 25%, 50%, 75% or 100% scroll, which means the user has got to the end of your page, hurrah!). </span></span></span><br />
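<br />
To give a feel for the logic behind scroll tracking, here is a minimal sketch of the milestone calculation. It is not the code from either of the two posts linked above: the function names and the pixel values are assumptions for illustration, and the browser wiring (a scroll listener that sends one analytics event per new milestone) is deliberately left out.<br />

```javascript
// Scroll milestones to report, as percentages of the page height.
const MILESTONES = [25, 50, 75, 100];

// How far down the page the reader has got, as a percentage (0-100).
// scrollTop: pixels scrolled, viewportHeight: visible height,
// pageHeight: full document height.
function scrollPercent(scrollTop, viewportHeight, pageHeight) {
  if (pageHeight <= 0) return 0;
  const seen = Math.min(scrollTop + viewportHeight, pageHeight);
  return (seen / pageHeight) * 100;
}

// Milestones reached at this scroll position that were not reported before.
// alreadySent is a Set that remembers what has already been reported.
function newMilestones(percent, alreadySent) {
  const reached = MILESTONES
    .filter((m) => percent >= m)
    .filter((m) => !alreadySent.has(m));
  reached.forEach((m) => alreadySent.add(m));
  return reached;
}

// Simulated reader on a 3200px page with an 800px viewport.
const sent = new Set();
console.log(newMilestones(scrollPercent(0, 800, 3200), sent));    // [ 25 ]
console.log(newMilestones(scrollPercent(1200, 800, 3200), sent)); // [ 50 ]
console.log(newMilestones(scrollPercent(2400, 800, 3200), sent)); // [ 75, 100 ]
```

Remembering which milestones were already sent matters: without it, every scroll movement would report the same milestone again and inflate the numbers.<br />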
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">The second strategy is engaging my readers. To measure it, the main KPI I use is <b>the number of users who decided to subscribe to my posts </b>and receive them directly in their inbox. I think that this is a superb measure of engagement, and I actually consider it a macro-conversion for my blog. To me, a subscription means something like "Cool, I loved your blog...I don't want to miss any of your future posts, please drop me an email every time you publish new content!"</span></span></span><br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">That's real engagement. And also a lot of joy for any blogger! </span></span></span><br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">Another solid metric I use to measure engagement is <b>page depth. </b>Page depth measures the distribution of the number of pages in each visit to my blog, during a given reporting period. In Google Analytics you can find this metric under Audience > Behaviour > Engagement. </span></span></span><br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;">In the same Google Analytics report, you can also find its "sister metric": session duration. This is another good measure of engagement for websites, though for blogs it is a bit less reliable, since people often read only one post and do not click through to a second page. Page depth is a more solid metric in this sense.</span></span></span><br />
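<br />
As a sketch of what the page depth report summarizes, the snippet below counts how many sessions viewed 1 page, 2 pages, and so on. The pagesPerSession numbers are invented sample data and the function names are my own, not anything provided by Google Analytics.<br />

```javascript
// Hypothetical number of pages viewed in each session.
const pagesPerSession = [1, 1, 3, 2, 1, 5, 2, 1];

// Count how many sessions fall into each page-depth bucket.
function pageDepthDistribution(depths) {
  const distribution = {};
  for (const depth of depths) {
    distribution[depth] = (distribution[depth] || 0) + 1;
  }
  return distribution;
}

// Share of sessions that viewed at least minPages pages.
function shareWithDepthAtLeast(depths, minPages) {
  if (depths.length === 0) return 0;
  return depths.filter((d) => d >= minPages).length / depths.length;
}

console.log(pageDepthDistribution(pagesPerSession));
console.log(shareWithDepthAtLeast(pagesPerSession, 3)); // 0.25
```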
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
<br />
<span style="color: orange;"><span style="color: #f6b26b;"><span style="color: black;"><br /></span></span></span>
Let's move to the second objective: raise my professional profile within the digital analytics community.<br />
<br />
First of all I need to increase the number of subscribers. Subscribers metric, here we go again!<br />
<br />
Secondly, I aim to develop business connections through my blog. In particular, I hope that people look at <a href="http://www.linkedin.com/in/marcopasin" target="_blank">my LinkedIn profile</a> or perhaps learn more about my professional experience by visiting the <a href="http://www.analyticsforfun.com/p/about.html" target="_blank">"About this Blog" page</a>. I consider both of them signs of interest in me as a professional in the digital analytics community.<br />
<br />
A good KPI for this is <b>the percentage of people looking at my LinkedIn page (or About page) over the total visits</b>. In Google Analytics, you can easily set this KPI up as a GOAL, and automatically get a conversion rate for this type of reader. [Note: visiting my LinkedIn profile does not imply building a new connection. Nevertheless, I consider it a demonstration of "professional" interest]. <br />
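<br />
The arithmetic behind such a goal conversion rate can be sketched as follows. The visits data is invented (only the /p/about.html path comes from this blog), and the function name is hypothetical. Note also that a click out to LinkedIn is not a pageview on the blog, so in practice it would need to be tracked separately; this sketch covers only the About page case.<br />

```javascript
// Hypothetical sessions, each listing the page paths it viewed.
const visits = [
  ["/2014/10/measure-blog.html"],
  ["/2014/10/measure-blog.html", "/p/about.html"],
  ["/", "/search/label/R"],
  ["/p/about.html"],
];

// Goal conversion rate: sessions that viewed the goal page / all sessions.
function goalConversionRate(visitList, goalPath) {
  if (visitList.length === 0) return 0;
  const converted = visitList.filter((pages) => pages.includes(goalPath));
  return converted.length / visitList.length;
}

console.log(goalConversionRate(visits, "/p/about.html")); // 0.5
```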
<br />
Another strategy I use to accomplish objective 2 is getting people to share my posts through social networks. My KPI here would be the <b>percentage of posts shared over the total number of posts published at that time on the blog</b>. This KPI is also referred to as "amplification rate".<br />
<br />
You could calculate this KPI separately for each social network or you might want to use an aggregate measure.<br />
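<br />
Here is a minimal sketch of the amplification rate as defined above (posts shared at least once, over posts published), computed either per social network or as an aggregate. The posts, the share counts and the function name are all made up for the example.<br />

```javascript
// Hypothetical share counts per post, broken down by social network.
const posts = [
  { title: "Post A", shares: { twitter: 4, linkedin: 1 } },
  { title: "Post B", shares: { twitter: 0, linkedin: 0 } },
  { title: "Post C", shares: { twitter: 2, linkedin: 0 } },
  { title: "Post D", shares: { twitter: 0, linkedin: 0 } },
];

// Amplification rate: posts shared at least once / posts published.
// Pass a network name to compute it for one network, or omit it for the aggregate.
function amplificationRate(postList, network) {
  if (postList.length === 0) return 0;
  const shared = postList.filter((post) => {
    const counts = network ? [post.shares[network] || 0] : Object.values(post.shares);
    return counts.some((count) => count > 0);
  });
  return shared.length / postList.length;
}

console.log(amplificationRate(posts));             // 0.5
console.log(amplificationRate(posts, "twitter"));  // 0.5
console.log(amplificationRate(posts, "linkedin")); // 0.25
```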
<br />
<br />
All right, we are almost done.<br />
<br />
My last - but not least - objective is to encourage me to learn.<br />
<br />
As I already said, having a blog for me works like an incentive to keep learning, come up with new ideas and ultimately gain more expertise in the analytics field. How can I accomplish this?<br />
<br />
Easy: first of all, by publishing more and more posts. KPI: number of posts published per month/week.<br />
<br />
But wait...am I just writing for myself? No. I need readers. I need to see them participating in my conversations. Readers will encourage me to write new content. It's kind of a VIRTUOUS CIRCLE.<br />
<br />
The more readers engage with (subscribe to) and participate in (leave comments on) my blog, the more I will be motivated to write new content for them.<br />
<br />
My KPIs will be the number of subscribers and the <b>percentage of people commenting over the total number of posts published. </b>The second KPI is also referred to as conversation rate.<br />
<br />
An alternative way to measure conversation rate might be the total number of words written in comments, divided by the total number of words I wrote in the posts.<br />
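<br />
Both variants of the conversation rate can be sketched in a few lines. I read the first variant as comments per post published; the sample posts, comments and function names below are assumptions made up for illustration.<br />

```javascript
// Hypothetical posts with their word counts and the comments they received.
const posts = [
  { words: 1000, comments: ["Great post, thanks!", "How did you set the goal?"] },
  { words: 500, comments: [] },
];

// Count words in a string by splitting on whitespace.
function wordCount(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Variant 1: comments per post published.
function conversationRate(postList) {
  if (postList.length === 0) return 0;
  const totalComments = postList.reduce((sum, post) => sum + post.comments.length, 0);
  return totalComments / postList.length;
}

// Variant 2: words written in comments / words written in posts.
function wordConversationRate(postList) {
  const postWords = postList.reduce((sum, post) => sum + post.words, 0);
  if (postWords === 0) return 0;
  const commentWords = postList.reduce(
    (sum, post) => sum + post.comments.reduce((n, c) => n + wordCount(c), 0),
    0
  );
  return commentWords / postWords;
}

console.log(conversationRate(posts));     // 1
console.log(wordConversationRate(posts)); // 0.006
```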
<br />
<br />
<b>#Step4: Targets</b></div>
<div>
<br />
If you have made it to this point, great. Your measurement plan is almost complete: you have resolved the most "conceptual" part of the model.<br />
<br />
Now it's time to set a target <b>for each of your KPI's</b>, so that you will be able to decide whether you were successful or not at blogging.<br />
<br />
Every company/organization sets some targets at the beginning of the year/quarter/etc. and decides where it aims to get by the end of the period. Why shouldn't you do the same for your blog?<br />
<br />
Targets add context to your data and allow you to understand where you are at the moment versus where you need to be.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFIY0AI5l6yp8UuOy758QoaYfn3al3iLtYjr5pC0_WssAJ7JM0UCzfwpRDUw8FwRRF2WI4V8h56tb4CBVMakIBZN-TxNe2V5i-Nqg4bg0P_ZxR6Kv2bD-K-6hyRuw6YcX7hIaU-_UBbh8/s1600/%234-+targets.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="How to Measure Blogs - Targets" border="0" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFIY0AI5l6yp8UuOy758QoaYfn3al3iLtYjr5pC0_WssAJ7JM0UCzfwpRDUw8FwRRF2WI4V8h56tb4CBVMakIBZN-TxNe2V5i-Nqg4bg0P_ZxR6Kv2bD-K-6hyRuw6YcX7hIaU-_UBbh8/s1600/%234-+targets.png" title="" width="640" /></a></div>
<br />
<br />
I have not included here my personal targets.<br />
<br />
What is important though, is to understand <b>how to set targets</b>. For your blog, you could decide to set targets based on one of these 3 methods:<br />
<br />
<ol>
<li>use your historical data (straight from your web analytics tool)</li>
<li>use competitor data (requires much more effort to get) </li>
<li>use your web analytics tool's benchmarking data, if it offers any (Google Analytics recently released a very interesting <a href="http://analytics.blogspot.com.ar/2014/09/new-benchmarking-reports-help-twiddy.html" target="_blank">Benchmarking Report</a> where you can compare your data with sites of similar industry/performance. There is also a category called "Blogging Resources and Services").</li>
</ol>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMVIVVrhxjviaARTybS7HosqUPFuaEoINu6dEKAb0Fbi3OR3m738uXNveGSw8JE2Vn-RZlZ4TKBBX4iW38X5DJTybZs5KUlMmCJs2J9qekbfvoHn0cUFgIlKcFahIlppzMBVDwXCdWsNk/s1600/benchmarking+report.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMVIVVrhxjviaARTybS7HosqUPFuaEoINu6dEKAb0Fbi3OR3m738uXNveGSw8JE2Vn-RZlZ4TKBBX4iW38X5DJTybZs5KUlMmCJs2J9qekbfvoHn0cUFgIlKcFahIlppzMBVDwXCdWsNk/s1600/benchmarking+report.png" width="400" /></a></div>
<div>
<br /></div>
<div>
If you want to learn more on how to set targets for your web analytics data, there is a fresh new post written by <a href="http://www.kaushik.net/avinash/benchmarking-digital-analytics-performance-metrics/" target="_blank">Avinash Kaushik on benchmarking performance</a>. </div>
<br />
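The first method in the list (historical data) can be sketched as a simple calculation: take the average of past values as a baseline and add the improvement you want to achieve. The history values, the 25% uplift and the function names are assumptions for this example, not a recommended rule.<br />

```javascript
// Hypothetical monthly values of one KPI (e.g. new subscribers per month).
const history = [10, 12, 14, 12];

// Baseline = average of the historical values.
function average(values) {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Target = baseline plus an uplift you choose (0.25 means +25%).
function targetFromHistory(values, uplift) {
  return average(values) * (1 + uplift);
}

console.log(average(history));                 // 12
console.log(targetFromHistory(history, 0.25)); // 15
```
<br />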
<br />
<b>#Step5: Segments</b></div>
<div>
<br /></div>
<div>
A segment can be defined as a group of people who have some attributes in common. Segments are a key part of your measurement plan because they allow you to take action on your data.<br />
<br />
For example, it's interesting to understand where most of my subscribers come from in terms of traffic channel. Or whether they tend to land on my blog to read a specific type of content (e.g. Google Analytics related posts) or not.<br />
<br />
To analyse my blog, I would segment my KPI's something like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOgCM2bnpthb3gjaL20Hn0OaCEFhaXwMtlWzrftjXA7Mx7OoTcBpJfdBn0sZkzOLOsTAd64Lq2GhSKbu0Pjuhhus2JvGKRsyoYaR87ydEIWcEU2gMTziDz9wkzIMm03RckW5QhccKh1hU/s1600/%235+segments.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="How to Measure Blogs - Segments" border="0" height="371" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOgCM2bnpthb3gjaL20Hn0OaCEFhaXwMtlWzrftjXA7Mx7OoTcBpJfdBn0sZkzOLOsTAd64Lq2GhSKbu0Pjuhhus2JvGKRsyoYaR87ydEIWcEU2gMTziDz9wkzIMm03RckW5QhccKh1hU/s1600/%235+segments.png" title="" width="640" /></a></div>
<br />
<br />
So for "Delight my Readers", my segments of analysis are:<br />
<br />
- <b>Traffic Sources</b>: which traffic sources bring me quality traffic? Do engaged readers and subscribers tend to arrive from a specific source or channel?<br />
<br />
- <b>Content Type</b>: my posts can be divided into 2 main categories of content, posts about <a href="http://www.analyticsforfun.com/search/label/web%20analytics" target="_blank">web analytics</a> and posts about <a href="http://www.analyticsforfun.com/search/label/R" target="_blank">data analysis with R</a>. So I want to analyse which type of content engages the most (I can see from my data that R readers tend to spend more time on the blog and visit more pages, though this could be related to the fact that R posts might contain some coding)<br />
<br />
- <b>Geographic locations</b>: the large majority of my readers come from the US, but other countries (still generating relevant traffic) show stronger engagement on the KPI's above. What can I do with this data?<br />
<br />
For the second objective, my list of segments looks like this:<br />
<br />
- Traffic Sources: again I want to understand where subscribers come from, and whether there is a pattern. But I might also want to know the traffic sources for those readers who share my content over social networks.<br />
<br />
- <b>Engaged readers</b>: here the idea is to analyse those readers who turned out to be engaged under my first objective. For example, I could create a GOAL in Google Analytics for readers who explore a minimum of 3 pages within their visit, and classify them as CONVERTED. I will then use this segment to analyse subscribers (are readers who convert also more likely to subscribe to my posts?) and to analyse readers who demonstrate interest in building a business connection (do they tend to convert on my engagement goal first?).<br />
<br />
- <b>Subscribed readers</b>: are subscribed readers interested in building a business connection with me?<br />
<br />
- <b>Returning vs new readers</b>: do users need to come back to demonstrate professional interest? How long does it take to mature a "lead"?<br />
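<br />
The "Engaged readers" question above (are readers who convert on the engagement goal also more likely to subscribe?) boils down to comparing a rate between two segments. Below is a minimal sketch of that comparison; the session data and function name are invented, and only the 3-page threshold comes from the text.<br />

```javascript
// Hypothetical sessions: pages viewed and whether the reader subscribed.
const sessions = [
  { pages: 4, subscribed: true },
  { pages: 1, subscribed: false },
  { pages: 3, subscribed: false },
  { pages: 2, subscribed: false },
  { pages: 5, subscribed: true },
  { pages: 1, subscribed: true },
];

// Engagement goal from the text: at least 3 pages within a visit.
const MIN_PAGES = 3;

// Share of sessions in which the reader subscribed.
function subscriptionRate(sessionList) {
  if (sessionList.length === 0) return 0;
  return sessionList.filter((s) => s.subscribed).length / sessionList.length;
}

// Split sessions into the two segments and compare the same KPI in each.
const engaged = sessions.filter((s) => s.pages >= MIN_PAGES);
const notEngaged = sessions.filter((s) => s.pages < MIN_PAGES);

console.log(subscriptionRate(engaged));    // 2 of 3 engaged sessions subscribed
console.log(subscriptionRate(notEngaged)); // 1 of 3 other sessions subscribed
```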
<br />
Last, I want to analyse my 3rd objective through these segments:<br />
<br />
- Content Type: how many posts am I publishing for each type of content? Am I receiving more comments when I write about web analytics or when I use R language to analyse data?<br />
<br />
- Traffic Sources: again, I would like to know where subscribers come from, as well as readers who left comments on my posts.<br />
<br />
- Engaged readers: this could be another way to analyse people who converted to my prior engagement goal. Do people need to engage (spend some significant time/visit a min number of pages) before leaving some comments? If so, can I take specific actions to improve engagement rate and encourage their participation?<br />
<br /></div>
<br />
<h3>
<b><a href="https://www.blogger.com/null" name="conclusions">Conclusions</a></b></h3>
<br />
Ooh, that was quite a long post. I hope you found it of value and could use some of these ideas for developing your own blog measurement strategy.<br />
<br />
At this point, the importance of establishing objectives, and how to create a measurement plan from there, should be clear. You should also have got a sense of some critical measurement issues due to the unique nature of blogs. Measuring a blog requires you to think more extensively and consider important aspects, such as content consumption and social sharing, that make blogs different from other websites.<br />
<br />
A big benefit of setting up a measurement plan is that <b>reporting and data analysis activities will become much easier</b>. You will now know which KPI's you need to look at, and you will make decisions faster.<br />
<br />
Most of the KPI's I suggested in this post can be easily measured with a basic implementation of Google Analytics (pageview tracking). Others require a bit of coding (like in the case of content consumption via page scrolling, social share buttons or comments). Others might require data from an external tool, for example to measure the people who subscribed to your posts.<br />
<br />
The idea is to first produce a measurement plan, and next assess implementation on your web analytics tool. Pretty much everything can be implemented with Google Analytics. You might need some help from a technical person, but you are the owner of your plan and you know what is important for you.<br />
<br />
With Google Analytics you can also set goals for some of your KPI's. And if you like, you can also create a custom dashboard and monitor your KPI's/goals all in one place.<br />
<br />
That's about it. Please share your thoughts, critiques and feel free to suggest additional ideas, KPI's, segments we can use to develop a better plan for measuring blogs.<br />
<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-7831456424580800732014-09-06T16:24:00.000+01:002015-03-14T19:09:14.584+00:00All Data Journalism Graduates in a Map<!-- GeoChart generated in R 2.15.1 by googleVis 0.4.5 package -->
<!-- Sat Sep 06 11:30:10 2014 -->
<!-- jsHeader -->
<script type="text/javascript">
// jsData
function gvisDataGeoChartID14f46bdb42a7 () {
var data = new google.visualization.DataTable();
var datajson =
[
[
"Algeria",
1
],
[
"Argentina",
13
],
[
"Armenia",
1
],
[
"Australia",
28
],
[
"Austria",
10
],
[
"Azerbaijan",
1
],
[
"Bangladesh",
2
],
[
"Belarus",
3
],
[
"Belgium",
23
],
[
"Bolivia",
2
],
[
"Bosnia and Herzegovina",
3
],
[
"Botswana",
1
],
[
"Brazil",
50
],
[
"Bulgaria",
3
],
[
"Burundi",
1
],
[
"Cameroon",
4
],
[
"Canada",
43
],
[
"Chile",
1
],
[
"China",
10
],
[
"Colombia",
6
],
[
"Costa Rica",
1
],
[
"Country",
1
],
[
"Croatia",
3
],
[
"Cyprus",
1
],
[
"Czech Republic",
2
],
[
"Democratic Republic of the Congo",
1
],
[
"Denmark",
8
],
[
"Egypt",
7
],
[
"Estonia",
3
],
[
"Ethiopia",
1
],
[
"Finland",
8
],
[
"France",
37
],
[
"Georgia",
12
],
[
"Germany",
67
],
[
"Greece",
21
],
[
"Guatemala",
3
],
[
"Guyana",
1
],
[
"Hong Kong",
3
],
[
"Hungary",
9
],
[
"Iceland",
1
],
[
"India",
20
],
[
"Indonesia",
2
],
[
"Iran",
1
],
[
"Iraq",
1
],
[
"Ireland",
14
],
[
"Israel",
3
],
[
"Italy",
78
],
[
"Japan",
9
],
[
"Jordan",
1
],
[
"Kazakhstan",
1
],
[
"Kenya",
19
],
[
"Latvia",
1
],
[
"Lebanon",
1
],
[
"Lithuania",
3
],
[
"Luxembourg",
2
],
[
"Macedonia",
1
],
[
"Malaysia",
6
],
[
"Mauritius",
2
],
[
"Mexico",
5
],
[
"Morocco",
1
],
[
"Nepal",
4
],
[
"Netherlands",
34
],
[
"New Zealand",
8
],
[
"Nigeria",
7
],
[
"North Korea",
1
],
[
"Norway",
7
],
[
"Pakistan",
7
],
[
"Papua New Guinea",
1
],
[
"Peru",
3
],
[
"Philippines",
5
],
[
"Poland",
8
],
[
"Portugal",
13
],
[
"Puerto Rico",
1
],
[
"Romania",
5
],
[
"Russia",
10
],
[
"Saudi Arabia",
1
],
[
"Senegal",
2
],
[
"Serbia",
4
],
[
"Singapore",
2
],
[
"Slovakia",
3
],
[
"Slovenia",
1
],
[
"South Africa",
10
],
[
"South Korea",
19
],
[
"Spain",
75
],
[
"Sri Lanka",
2
],
[
"Sweden",
11
],
[
"Switzerland",
6
],
[
"Taiwan",
2
],
[
"Thailand",
1
],
[
"Trinidad and Tobago",
1
],
[
"Turkey",
2
],
[
"Uganda",
4
],
[
"Ukraine",
5
],
[
"United Arab Emirates",
1
],
[
"United Kingdom",
96
],
[
"United States",
160
],
[
"United States Minor Outlying Islands",
1
],
[
"Unspecified",
24
],
[
"Uruguay",
1
],
[
"Venezuela",
2
],
[
"Virgin Islands, U.S.",
1
],
[
"Zimbabwe",
1
]
];
data.addColumn('string','country');
data.addColumn('number','graduates');
data.addRows(datajson);
return(data);
}
// jsData
function gvisDataPieChartID14f41a276e23 () {
var data = new google.visualization.DataTable();
var datajson =
[
[
"Journalist",
262
],
[
"unknown",
184
],
[
"Student",
68
],
[
"Freelance Journalist",
36
],
[
"Reporter",
27
],
[
"Editor",
16
],
[
"Data Analyst",
11
],
[
"Graphic Designer",
10
],
[
"Researcher",
9
],
[
"Associate Professor",
8
]
];
data.addColumn('string','title');
data.addColumn('number','graduates');
data.addRows(datajson);
return(data);
}
// jsDrawChart
function drawChartGeoChartID14f46bdb42a7() {
var data = gvisDataGeoChartID14f46bdb42a7();
var options = {};
options["width"] = 500;
options["height"] = 400;
var chart = new google.visualization.GeoChart(
document.getElementById('GeoChartID14f46bdb42a7')
);
chart.draw(data,options);
}
// jsDrawChart
function drawChartPieChartID14f41a276e23() {
var data = gvisDataPieChartID14f41a276e23();
var options = {};
options["allowHtml"] = true;
options["width"] = 500;
options["height"] = 300;
var chart = new google.visualization.PieChart(
document.getElementById('PieChartID14f41a276e23')
);
chart.draw(data,options);
}
// jsDisplayChart
(function() {
var pkgs = window.__gvisPackages = window.__gvisPackages || [];
var callbacks = window.__gvisCallbacks = window.__gvisCallbacks || [];
var chartid = "geochart";
// Manually see if chartid is in pkgs (not all browsers support Array.indexOf)
var i, newPackage = true;
for (i = 0; newPackage && i < pkgs.length; i++) {
if (pkgs[i] === chartid)
newPackage = false;
}
if (newPackage)
pkgs.push(chartid);
// Add the drawChart function to the global list of callbacks
callbacks.push(drawChartGeoChartID14f46bdb42a7);
})();
function displayChartGeoChartID14f46bdb42a7() {
var pkgs = window.__gvisPackages = window.__gvisPackages || [];
var callbacks = window.__gvisCallbacks = window.__gvisCallbacks || [];
window.clearTimeout(window.__gvisLoad);
// The timeout is set to 100 because otherwise the container div we are
// targeting might not be part of the document yet
window.__gvisLoad = setTimeout(function() {
var pkgCount = pkgs.length;
google.load("visualization", "1", { packages:pkgs, callback: function() {
if (pkgCount != pkgs.length) {
// Race condition where another setTimeout call snuck in after us; if
// that call added a package, we must not shift its callback
return;
}
while (callbacks.length > 0)
callbacks.shift()();
} });
}, 100);
}
// jsDisplayChart
(function() {
var pkgs = window.__gvisPackages = window.__gvisPackages || [];
var callbacks = window.__gvisCallbacks = window.__gvisCallbacks || [];
var chartid = "corechart";
// Manually see if chartid is in pkgs (not all browsers support Array.indexOf)
var i, newPackage = true;
for (i = 0; newPackage && i < pkgs.length; i++) {
if (pkgs[i] === chartid)
newPackage = false;
}
if (newPackage)
pkgs.push(chartid);
// Add the drawChart function to the global list of callbacks
callbacks.push(drawChartPieChartID14f41a276e23);
})();
function displayChartPieChartID14f41a276e23() {
var pkgs = window.__gvisPackages = window.__gvisPackages || [];
var callbacks = window.__gvisCallbacks = window.__gvisCallbacks || [];
window.clearTimeout(window.__gvisLoad);
// The timeout is set to 100 because otherwise the container div we are
// targeting might not be part of the document yet
window.__gvisLoad = setTimeout(function() {
var pkgCount = pkgs.length;
google.load("visualization", "1", { packages:pkgs, callback: function() {
if (pkgCount != pkgs.length) {
// Race condition where another setTimeout call snuck in after us; if
// that call added a package, we must not shift its callback
return;
}
while (callbacks.length > 0)
callbacks.shift()();
} });
}, 100);
}
// jsFooter
</script>
<!-- jsChart -->
<script src="https://www.google.com/jsapi?callback=displayChartGeoChartID14f46bdb42a7" type="text/javascript"></script>
<!-- jsChart -->
<script src="https://www.google.com/jsapi?callback=displayChartPieChartID14f41a276e23" type="text/javascript"></script>
<br />
<table border="0">
<tbody>
<tr>
<td><!-- divChart -->
<br />
<div id="GeoChartID14f46bdb42a7" style="height: 400px; width: 500px;">
</div>
</td>
</tr>
<tr>
<td><!-- divChart -->
<br />
<div id="PieChartID14f41a276e23" style="height: 300px; width: 500px;">
</div>
</td>
</tr>
</tbody></table>
<br />
This week I got my certificate of completion from the course "<a href="https://www.canvas.net/courses/doing-journalism-with-data" target="_blank">Doing Journalism with Data: First Steps, Skills and Tools</a>" (if you would like to know more about data journalism, check out my post "<a href="http://www.analyticsforfun.com/2014/07/3-great-examples-of-data-journalism.html" target="_blank">3 Great Examples of Data Journalism Stories</a>"). I enjoyed the course a lot, and I am proud to be one of the 1250 people who successfully completed it. I was a bit surprised that there were <a href="http://datajournalismcourse.net/graduates.php" target="_blank">only 1250 graduates</a>!<br />
<a name='more'></a><br />
So, where did we come from and who are we? Above is <b>a map I built using the R programming language, and in particular the googleVis package</b>. googleVis is a great package that provides an interface to the Google Visualization API and makes creating interactive plots quite easy. Interactive means that users can manipulate the data and look for the info they need. Here is <a href="https://developers.google.com/chart/interactive/docs/gallery" target="_blank">a list of visualizations you can do with Google Charts</a>.<br />
<br />
The other great thing about this visualization is that you can make it available as HTML, like I did above (you can edit the HTML if you like). No more static charts on your desktop, then, but beautiful, interactive visualizations shared on the web!<br />
<br />
Below is the simple R code I used to prepare the data and plot the charts. To plot the data about graduates' titles (the title people indicated when they enrolled in the course) I used Google Refine and some of its clustering methods to clean and group the data (e.g. "journalist", "journalists" or "periodista" all fell into the general category "Journalist"). Then I loaded the result into R as a .csv file.<br />
<br />
<br />
<span style="background-color: #6fa8dc; font-family: inherit;">ddj<- read.csv("ddjCleaned.csv")</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">summary(ddj)</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">studCountry<- as.data.frame(table(ddj$country))</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">names(studCountry)<- c("country","graduates")</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;"><br /></span>
<span style="background-color: #6fa8dc; font-family: inherit;">studTitle<- as.data.frame(table(ddj$title))</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">names(studTitle)<-c("title","graduates")</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;"><br /></span>
<span style="background-color: #6fa8dc; font-family: inherit;">install.packages("googleVis")</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">library(googleVis)</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;"><br /></span>
<span style="background-color: #6fa8dc; font-family: inherit;">C<- gvisGeoChart(studCountry, locationvar = "country", colorvar = "graduates", options = list(width = 500, height = 400))</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">plot(C)</span><br />
<br />
<span style="background-color: #6fa8dc; font-family: inherit;">T<- gvisPieChart(head(studTitle[order(studTitle$graduates, decreasing =TRUE),],10), labelvar = "title", numvar="graduates",options = list(width = 500, height = 300))</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">plot(T)</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;"><br /></span>
<span style="background-color: #6fa8dc; font-family: inherit;">CT <- gvisMerge(C, T, horizontal=FALSE)</span><br />
<span style="background-color: #6fa8dc; font-family: inherit;">plot(CT)</span><br />
<br />
# to get the HTML code of your visualization you can either execute the following command:<br />
<br />
print(CT) #print the Object you have just created<br />
<br />
# or you can click on the Chart ID link below your visualization.<br />
<br />
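For readers who do not use R, the data preparation step (<code>table()</code> followed by <code>head(order(...), 10)</code>) can be sketched in JavaScript. This is only an illustration: the helper name <code>topCounts</code> and the sample values are made up for this example and are not part of googleVis. The output has the same <code>[label, count]</code> row shape as the <code>datajson</code> arrays in the generated HTML above.<br />

```javascript
// Hypothetical helper (not part of googleVis): counts how often each label
// occurs and returns the n most frequent ones as [label, count] rows,
// the same shape the generated datajson arrays use.
function topCounts(labels, n) {
  const counts = new Map();
  for (const label of labels) {
    counts.set(label, (counts.get(label) || 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // count desc, then label
    .slice(0, n);
}

// Made-up sample data, for illustration only.
const sampleTitles = ["Journalist", "Student", "Journalist", "Editor", "Student", "Journalist"];
const rows = topCounts(sampleTitles, 2);
console.log(rows);
```

Rows in this shape could then be passed to <code>data.addRows()</code>, as the generated code above does.<br />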
<br />
How to Test Universal Analytics Before Upgrading: via Google Tag Manager (posted by marqui, 2014-08-13; updated 2015-08-27)<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicVRLBt8o8R3i5MiZlcI2TzZCmzjrLfbNjSW_9tLM37uIWSZshcZyDd2AGa4o0t1bqhSlv1-AWOtSoWrgE0uo48aWUbozdUPMY_oWmroqTmxGOLYhlAukQpYZnmKztLFfif4grjIfpxwY/s1600/Picture1TestUAdebugger.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Test Universal Analytics with Google Tag Manager" border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicVRLBt8o8R3i5MiZlcI2TzZCmzjrLfbNjSW_9tLM37uIWSZshcZyDd2AGa4o0t1bqhSlv1-AWOtSoWrgE0uo48aWUbozdUPMY_oWmroqTmxGOLYhlAukQpYZnmKztLFfif4grjIfpxwY/s1600/Picture1TestUAdebugger.jpg" title="Test Universal Analytics Before Upgrading: via Google Tag Manager" width="400" /></a></div>
<br />
Since Universal Analytics came out of beta last April, more and more users have been starting the upgrade process from classic Google Analytics. Although Google strongly encourages users to upgrade, and reassures them that the migration will not cause any loss of data (perhaps just a few seconds of traffic), some of us still remain a bit worried about the change. This is especially true for big websites with a large number of tags already implemented through classic Google Analytics.<br />
<br />
Will there be any significant difference in data after the complete migration? Will Universal Analytics inflate or reduce some metrics compared to the classic tracking code? These questions should motivate you to perform some <b>testing</b> before moving completely to a new standard.<br />
<br />
In this post I am going to suggest <b>a step-by-step process to conduct your upgrade to Universal Analytics, with the help of Google Tag Manager</b>. Yes, this post is also about Google Tag Manager. It's actually about taking the opportunity of the transition to a new standard (Universal Analytics) and making it in the most efficient and safest way (Google Tag Manager).<br />
<br />
The main idea of this step by step process is to <b>keep the upgrade "under control"</b> and make sure you are going to get the same quality of data as before.<br />
<a name='more'></a><br />
To do so, you will implement the Universal tracking code through Google Tag Manager, in a new "test" property. On this new property you will replicate and test the current tag configuration. You will then compare the new data with the classic Google Analytics tracking data and evaluate possible discrepancies. Yes, this means having old hard-coded tags coexisting with new ones served through Google Tag Manager. At least for some time: the time necessary to confirm that everything is working properly. That's the aim of the test.<br />
<br />
Once you are happy with the results (Universal Analytics data look right!), you will finally upgrade your "old" Google Analytics property to Universal Analytics. Via Google Tag Manager, obviously.<br />
<br />
Explained this quickly, the upgrade process can sound a bit confusing. That's why I've decided to write step-by-step instructions. You can go straight to the <b><span style="color: orange;">10-step upgrade process</span></b> below if you prefer, though I thought that first you might want me to remind you WHY YOU ARE DOING THIS UPGRADE.<br />
<br />
What are the benefits of upgrading to Universal Analytics? And what are the advantages of using a tagging platform such as Google Tag Manager? Below is a short reminder to convince you further that the upgrade is the right thing to do.<br />
<br />
<br />
<h3>
Why would you upgrade to Universal Analytics?</h3>
<br />
Universal Analytics is the new version of the Google Analytics tracking code. It has <a href="https://cutroni.com/blog/2014/04/02/universal-analytics-now-beta/" target="_blank">100% feature compatibility with classic Google Analytics</a> (everything you do in GA, you can do with UA), but it also offers several improvements in data collection and data processing, as well as in the way the data can be analysed.<br />
<br />
As most of you already know, all "classic" Google Analytics properties are required to upgrade in the future, since Universal Analytics will become the operating standard for Google Analytics. You can find here more details about the <a href="https://developers.google.com/analytics/devguides/collection/upgrade/#benefits" target="_blank">Universal Analytics Upgrade Timeline</a> (currently we are in phase 3).<br />
<br />
So, what are the improvements made by Universal Analytics? According to Google Developers, here are the main benefits you will get once you complete the upgrade to Universal:<br />
<ol>
<li><b>Cross-device measurement</b>. Thanks to a new <a href="https://support.google.com/analytics/answer/3123662" target="_blank">User-ID feature</a>, Universal Analytics will allow you to understand how users interact with your website across multiple devices such as desktop, tablet, mobile or any other digital device. That means analysing users during their entire journey. To get this benefit working, you will probably need a developer to set it up and assign unique IDs to users.</li>
<li><b>Create customized metrics and dimensions</b>. If you need to measure something very specific to your business, which is not covered by the current Google Analytics settings or metrics, now you can do it with Universal. For example, you might want to track product details, or levels in games if you are a videogame company.</li>
<li><b>Track any digital device. </b>Universal Analytics introduces 3 new developer-friendly data collection methods so that you can customize your analytics implementation. These are the analytics.js JavaScript library for websites, the Mobile SDKs v.2.x for Android and iOS, and the Measurement Protocol for other digital devices such as game consoles or information kiosks.</li>
<li><b>Configure account options more easily</b>. From the Admin section of your account, you can now adjust settings such as the session timeout (30 minutes by default), manage the list of recognized search engines (categorize some of them as referrals instead of organic), exclude specific domains from being recognized as referrals (e.g. traffic from third-party shopping cart links in a cross-domain tracking setup), or exclude specific search terms from organic traffic (you might want to stop your own business name or domain being counted as search traffic, so that it appears within the direct traffic channel from now on).</li>
<li><b>Enhanced Ecommerce reports. </b>These new reports will be available once you have upgraded to Universal Analytics, and they will help you understand your users' shopping and purchasing behaviour more effectively. A good place to start playing with these reports is the Ecommerce Analytics course from Analytics Academy.</li>
<li><b>Future feature updates. </b>From now on, all new features and product updates will only be available to Universal Analytics properties (that is, properties receiving data from the Universal Analytics tracking code). In simple words, if you want to stay up to date with the latest innovations in Google Analytics, upgrade to Universal soon!</li>
</ol>
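To make the Measurement Protocol point more concrete, here is a minimal sketch of how the body of a pageview hit could be assembled. It assumes the standard v1 parameters (<code>v</code>, <code>tid</code>, <code>cid</code>, <code>t</code>, <code>dp</code>) as I understand them from the Measurement Protocol documentation; the function name <code>buildPageviewHit</code>, the tracking ID and the client ID are placeholders invented for this example.<br />

```javascript
// Builds the body of a Measurement Protocol (v1) pageview hit.
// "UA-XXXXX-Y" and the client ID below are placeholders, not real values.
function buildPageviewHit(trackingId, clientId, pagePath) {
  const params = new URLSearchParams();
  params.append("v", "1");          // protocol version
  params.append("tid", trackingId); // property tracking ID
  params.append("cid", clientId);   // anonymous client ID
  params.append("t", "pageview");   // hit type
  params.append("dp", pagePath);    // document path
  return params.toString();
}

const hit = buildPageviewHit("UA-XXXXX-Y", "555", "/home");
console.log(hit);
```

A device that cannot run JavaScript would send a string like this in an HTTP POST to the Google Analytics collection endpoint; the sketch only builds the payload and does not send anything.<br />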
<br />
<br />
<h3>
Why should you also implement Google Tag Manager?</h3>
<br />
Google Tag Manager is a free tool that lets you implement and manage all your website (or mobile app) tags in one place. This means you no longer need to manually add JavaScript tags to the source code of your site.<br />
<br />
How is that possible? Simple: <b>Google Tag Manager works via a single tag called the "container snippet"</b> that you place on all your web pages. Once the container is installed, you will be able to add, update and administer tags directly from your Google Tag Manager account, without touching the site code.<br />
<br />
If you have had the chance to work with large or complex websites, you will quickly realize what that means in terms of tag implementation: no more tedious code editing, less dependence on your IT department, and so on. Switching to a tag management solution is a really big step forward for your web analytics projects. Below is a list of the main benefits of using a tag solution like Google Tag Manager, as reported by <a href="http://www.lunametrics.com/blog/2014/04/08/8-reasons-start-google-tag-manager/#sr=g&m=o&cp=or&ct=-tmc&st=(opu%20qspwjefe)&ts=1406645331" target="_blank">Lunametrics</a>:<br />
<ol>
<li><b>All in one place</b>. As already mentioned, once Google Tag Manager is installed you no longer need to go through each web page and modify the source code. Every modification will be managed directly from your Google Tag Manager platform. This means efficiency, speed and less dependence on IT developers for marketing people. Of course, Google Tag Manager has its own learning curve. </li>
<li><b>Tag Testing</b>. Thanks to the built-in debug console and preview mode, you can test your tags and make sure they work properly BEFORE publishing them to the live site. This gives more autonomy to marketing people, who will be able to quickly test and publish tags without involving developers (at least for less advanced implementations).</li>
<li><b>Version Control</b>. Every time you make a change to your tags and publish it live, you create a new version. Previous versions are archived and you can go back to them if you need to. This makes your tags much more organized. </li>
<li><b>Built-in Tags</b>. Google Tag Manager includes built-in tags like the classic Google Analytics and Universal Analytics tracking codes, AdWords conversion, retargeting, etc. It also includes some great "Event Listener" tags which make tagging events faster. </li>
<li><b>Multi-Platform</b>. Google Tag Manager works for websites, mobile sites and mobile apps, and it's planned to receive support for other platforms as well.</li>
<li><b>Multi-Account & User Support</b>. This is really good news, especially for agencies, who will be able to manage their clients' tags through Google Tag Manager. Important: the client is the owner of the websites and therefore of the tags! This means it's the client who should set up a Google Tag Manager account and then grant permission to agencies to manage tags (the same best practice used for Google Analytics accounts).</li>
</ol>
<br />
<h2>
<b><span style="font-size: x-large;">
How to Upgrade to Universal Analytics using Google Tag Manager, in 10 Steps</span></b></h2>
<br />
<h4>
<span style="color: orange;">
1) Create a New Property (directly in Universal Analytics)</span></h4>
<span style="font-weight: normal;"><br /></span>
<span style="font-weight: normal;">As mentioned above, the idea is to first test Universal Analytics tags in a test property, without touching the current Google Analytics configuration.</span><br />
<br />
To create a new property in Google Analytics, open the Property menu in the Admin panel and click on "Create New Property". Name the property, enter the URL of the website you are upgrading, set the reporting time zone and click on the "Get Tracking ID" button.<br />
<div>
<br /></div>
<div>
You now have a new property available in your Google Analytics account, and also the related Universal Analytics tracking code (all new properties from now on are created directly in Universal Analytics). Don't do anything with that tracking code for now. You don't need to paste it into your website pages, as you will use Google Tag Manager for that later.<br />
<br /></div>
<h4>
<span style="color: orange;">
2) Create a Google Tag Manager Account</span></h4>
<br />
Sign in to the <a href="http://www.google.com.ar/tagmanager/" target="_blank">Google Tag Manager homepage</a> with your Gmail account, and click on "New Account". Name your account; in my case the account is called "Marco personal". The account structure in Google Tag Manager is very similar to Google Analytics, so if you are familiar with that it should be very easy.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhon7JqG5ris2Xl4DnaI4pNiQzzs8IXxK_H0XrTwTcL_jznQHaniGsaHOTLzlgj-hc8EjhzDxm710bgaykG0NFzqUjontNI2CP_o1NzP6xP2vLpb1I9u0hQSB0TBldqTeoEhdJYxLl7D2A/s1600/gtm+account+created.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Google Tag Manager Create Account" border="0" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhon7JqG5ris2Xl4DnaI4pNiQzzs8IXxK_H0XrTwTcL_jznQHaniGsaHOTLzlgj-hc8EjhzDxm710bgaykG0NFzqUjontNI2CP_o1NzP6xP2vLpb1I9u0hQSB0TBldqTeoEhdJYxLl7D2A/s1600/gtm+account+created.jpg" title="New Account in Google Tag Manager" width="320" /> </a> </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4>
<span style="color: orange;">
3) Create a New Container into Google Tag Manager</span></h4>
<span style="font-weight: normal;"><br /></span>
<span style="font-weight: normal;">Enter the account you have just created, click on "New" -> "Container" and name it. A good practice is to name the container after the website (or mobile app) you are going to manage tags for. I've named my container simply "analytics for fun".</span><br />
<div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<span style="font-weight: normal;">You now have a container tag, which is a snippet of JavaScript code that you will add to your website pages. That will be next step.</span></div>
<div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<span style="font-weight: normal;">Again, in terms of admin structure, a container is similar to a property in Google Analytics. Within the same account you can have one or more containers (each container assigned to a single website or mobile property), but a container can belong to only one account. Of course, Google Tag Manager also has a <a href="https://support.google.com/tagmanager/answer/2695756?hl=en" target="_blank">User Permission configuration</a>, but I am not discussing it here. </span></div>
<div>
<span style="font-weight: normal;"><br /></span>
<span style="font-weight: normal;"><br /></span></div>
<h4>
<span style="color: orange;">
4) Paste Google Tag Manager Container Code into your Site</span></h4>
<span style="font-weight: normal;"><br /></span>
<span style="font-weight: normal;">Copy the container snippet you obtained in the previous step, and paste it immediately after the opening &lt;body&gt; tag on every page of your site. Remember that we are not going to remove the old tags from the site yet: Tag Manager tags and old ones will coexist for some time.</span><br />
<div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<span style="font-weight: normal;">[In my case, as my site is hosted on Blogger, I had to paste the container snippet with a small variation in the code to make it work properly. Details about <a href="https://productforums.google.com/forum/#!msg/tag-manager/Jk9i2AWfDSg/FZ3xqWuSgrAJ" target="_blank">how to adjust the GTM container snippet for Blogger are available here</a>]</span></div>
<div>
<span style="font-weight: normal;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid6enJlzRmu846_uAbOM4GlI-3jLUQ9vbT60F52QoTGjqeG8e7DgqHoutTdwkyDCtGaDIFmp98FM4C9vWCjtZKMoM3lkUFE9yyOR4AHZfZ8PwRKBClP6_nQadXckwZLxLDMoYpvF8YKts/s1600/gtm+pasted+into+html.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="GTM container snippet in Blogger" border="0" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid6enJlzRmu846_uAbOM4GlI-3jLUQ9vbT60F52QoTGjqeG8e7DgqHoutTdwkyDCtGaDIFmp98FM4C9vWCjtZKMoM3lkUFE9yyOR4AHZfZ8PwRKBClP6_nQadXckwZLxLDMoYpvF8YKts/s1600/gtm+pasted+into+html.jpg" title="GTM container snippet installed in Blogger" width="400" /></a></div>
<h4>
<span style="font-weight: normal;">*Wherever your website or blog is hosted, a savvy thing to do would be to </span><span style="font-weight: normal;">save a copy of your current blog template as a backup before adding the snippet code. Just to be sure.</span></h4>
<div>
<span style="font-weight: normal;"><br /></span></div>
<h4>
</h4>
<h4>
<span style="color: orange;">
5) Now you can Start Adding Tags with Google Tag Manager</span></h4>
<br />
Go back to Google Tag Manager, select your account and click on your site container. The tags page will appear: to add a tag, click on New -> Tag, select the tag type and specify the rules for when the tag should fire. The idea is to replicate your existing hard-coded tags, and of course add new ones if you like.<br />
<br />
To add tags, you definitely need some knowledge of how Google Tag Manager works: rules, macros, the data layer, etc. To be honest, the learning curve is not that easy and I am not going to delve into it in this post (I might write a post in the near future with specific examples of tag implementation).<br />
<br />
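As a small taste of the data layer mentioned above, here is a minimal sketch of pushing a message onto it. The <code>dataLayer</code> array and its <code>push</code> pattern are the documented way to pass information to Google Tag Manager; everything else here (the helper <code>pushToDataLayer</code>, the event name, the <code>formPosition</code> key and the stand-in for <code>window</code>) is made up for illustration.<br />

```javascript
// In a browser this would be window.dataLayer; a plain object stands in for
// window here so the sketch also runs outside a browser.
const fakeWindow = {};

// Create the data layer if it does not exist yet, then push a message.
function pushToDataLayer(win, message) {
  win.dataLayer = win.dataLayer || [];
  win.dataLayer.push(message);
  return win.dataLayer.length;
}

// "newsletterSignup" and "formPosition" are made-up names for illustration.
pushToDataLayer(fakeWindow, { event: "newsletterSignup", formPosition: "sidebar" });
console.log(fakeWindow.dataLayer);
```

In Google Tag Manager, a rule could then be set up to fire a tag when the <code>event</code> value equals the name you pushed.<br />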
To make things easier, below I broke this step into two sub-steps: Universal Analytics basic tracking tag and other additional tags.<br />
<br />
<h4>
5a. Add Universal Analytics basic tracking tag</h4>
First of all, you will want to add the basic Universal Analytics tracking code, so that you can see your site data in Google Analytics. This step is very easy and you should be able to do it even with little knowledge of Google Tag Manager.<br />
<br />
In Google Tag Manager, select your account and enter your site container. Now click on New> Tag. Fill in the highlighted sections as in the image below.<br />
<br />
Tag name: I named it "Universal Analytics" (you can name it differently if you like)<br />
<br />
Tag Type: from the pre-defined tags in the menu select Google Analytics>Universal Analytics<br />
<br />
Tracking ID: here you have to enter the ID of the property you want to track with this tag. Enter the ID of the new TEST PROPERTY YOU CREATED IN STEP 1.<br />
<br />
Track Type: select Page View<br />
<br />
Firing Rules: add a rule like the one in the picture below, which will fire the Universal Analytics tag on all pages of your website. Note that this is a pre-defined rule in Google Tag Manager, so you will find it among the existing rules. <br />
<br />
Finally click on Save. You now have a Universal Analytics tag available in your Google Tag Manager. But it has not been implemented yet.<br />
<br />
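If the relationship between a tag and its firing rules feels abstract, the following sketch models it in a few lines. It is a conceptual model only and not how Google Tag Manager is implemented: the object shapes, the <code>shouldFire</code> function, the "Thank-you page" rule and the tracking ID are all assumptions made for this example.<br />

```javascript
// Conceptual model only: this is not how Google Tag Manager is implemented.
// A rule is a pattern tested against the page URL; a tag fires when any of
// its rules match.
const allPagesRule = { name: "All pages", urlPattern: /.*/ };
const thankYouRule = { name: "Thank-you page", urlPattern: /\/thank-you$/ }; // hypothetical

function shouldFire(tag, pageUrl) {
  return tag.firingRules.some((rule) => rule.urlPattern.test(pageUrl));
}

const universalAnalyticsTag = {
  name: "Universal Analytics",
  trackingId: "UA-XXXXX-Y", // placeholder for the test property ID
  trackType: "Page View",
  firingRules: [allPagesRule],
};
const conversionTag = { name: "Conversion", firingRules: [thankYouRule] }; // hypothetical

console.log(shouldFire(universalAnalyticsTag, "http://www.example.com/any/page"));
console.log(shouldFire(conversionTag, "http://www.example.com/any/page"));
```

The point of the model is that the "all pages" rule matches every URL, which is why the basic pageview tag fires everywhere, while a narrower rule restricts a tag to specific pages.<br />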
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitTz2N7blV3tBL1EyBiIOWeEyJP4nbiMY9x_53Js0Fh9kDfSXqYm6-dxovsBiUnj_Tv0wtfAAvKlkKggKWWxPR3f0H1pcJ2iEP1NXzvsdYct58mpZJVU-dVIR6vg3TCk5f_lFE7RyZmjs/s1600/GTM+creae+UA+basic+tag.jpg" imageanchor="1"><img alt="Universal Analytics Tag in Google Tag Manager" border="0" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitTz2N7blV3tBL1EyBiIOWeEyJP4nbiMY9x_53Js0Fh9kDfSXqYm6-dxovsBiUnj_Tv0wtfAAvKlkKggKWWxPR3f0H1pcJ2iEP1NXzvsdYct58mpZJVU-dVIR6vg3TCk5f_lFE7RyZmjs/s1600/GTM+creae+UA+basic+tag.jpg" title="How add Universal Analytics Tracking Tag in GTM" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg966Hi46RdlmRrfs79YqVL6Bh-B9cEnNmq8_O1iCQinlVa4czOoJNdwbanrWSI8UM_4w8JI9Lz0Qit8gxKDo6smSOZiPk5gsY_K2bYPCmcfvzPsUoVn6gslvCVs77PDauF1SgRdbOgH-E/s1600/create+rule+all+pages.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Universal Analytics Tag Rule" border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg966Hi46RdlmRrfs79YqVL6Bh-B9cEnNmq8_O1iCQinlVa4czOoJNdwbanrWSI8UM_4w8JI9Lz0Qit8gxKDo6smSOZiPk5gsY_K2bYPCmcfvzPsUoVn6gslvCVs77PDauF1SgRdbOgH-E/s1600/create+rule+all+pages.jpg" title="Google tag Manager Rule Universal Analytics" width="320" /></a></div>
<br />
<br />
<h4>
5b. Add other Tags (recreate old tags with GTM) - optional step</h4>
<div>
As I mentioned above, this will require more in-depth knowledge of how the tagging tool works, and I am not going to cover it in this post. But yes, the idea here is to (at least) recreate all the hard-coded tags you have implemented on your site, such as events, tags on button clicks, forms, clicks on links, etc.<br />
<br /></div>
<h4>
<span style="color: orange;">
6) Debug your Tags and Publish them</span></h4>
<br />
This is a very useful feature provided by Google Tag Manager: you can test your tags and make sure they work properly BEFORE publishing them to the live site.<br />
<h4>
<span style="font-weight: normal;">In your site container main menu, click on Preview>Debug to enable Debug mode. Now visit your website with your current browser session. When debug is enabled, a console window will appear at the bottom of your browser, showing detailed information about your tags, including firing status and what data is being processed. Check out the <a href="https://support.google.com/tagmanager/answer/2695660" target="_blank">Preview and Debugging documentation</a> for more details. </span></h4>
<div>
The picture below is a screenshot of my debugging session to test the Universal Analytics tag. You can see at the bottom of the page that the Universal basic tag is fired on pageviews properly.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjQGdUxJwmdoidTti2VBP94GmcENIpybbe4Vjx2n67gH4TvPuWNYjgTIth-MtSGlXJAWEhVFRq8qCurqkHIzV0e9Qc6mYqwm40BhTdz0qIDtla-pPDAQJvzW_S-1Sqc11uo8HJpMH59Ns/s1600/analytics+for+fun+debugged.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Universal Analytics Debug" border="0" height="145" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjQGdUxJwmdoidTti2VBP94GmcENIpybbe4Vjx2n67gH4TvPuWNYjgTIth-MtSGlXJAWEhVFRq8qCurqkHIzV0e9Qc6mYqwm40BhTdz0qIDtla-pPDAQJvzW_S-1Sqc11uo8HJpMH59Ns/s1600/analytics+for+fun+debugged.png" title="Google Tag Manager Tags Debugging" width="320" /></a></div>
<br />
Now that you have checked that the Universal Analytics tag (and any additional tags/events you implemented) works fine, the next thing to do is CREATE A VERSION OF THE CONTAINER and then PUBLISH IT LIVE. From your container main menu, click on Create Version first, and then click on Publish (remember to click on Publish, otherwise the tags will not be live!).<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX788Q3CrenAPkytoI_PXAXo7F-ay4X6KGKuZxuRdybmmTa8Zkz9-nz-pCtIcmIMnqA-lRVXaKN8-jnsRAr6J8IceUHzT3aHM7_NQXZVjLyNwAyXG_ZAcVA6KkbmPzrXEpsbi2kiSzAj4/s1600/publish+version2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Google Tag Manager Container Version" border="0" height="104" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX788Q3CrenAPkytoI_PXAXo7F-ay4X6KGKuZxuRdybmmTa8Zkz9-nz-pCtIcmIMnqA-lRVXaKN8-jnsRAr6J8IceUHzT3aHM7_NQXZVjLyNwAyXG_ZAcVA6KkbmPzrXEpsbi2kiSzAj4/s1600/publish+version2.jpg" title="Publish new Tag in GTM" width="320" /></a></div>
<br />
Great! <br />
<br />
From now on, <b>the Universal Analytics code is live and your site is properly tracked by Google Analytics within your test property</b>. You can see the data by opening any GA report inside your test property (you may notice that GA also collected the data of the preview visit you made during the debugging stage). <br />
<br />
You can easily double-check that GA is tracking ok, by opening a Real-Time report while you navigate into the site.<br />
<br />
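Another quick check is to look at which trackers are present on the page. The sketch below assumes the analytics.js API as I know it (<code>ga.getAll()</code> and <code>tracker.get('trackingId')</code>); the function name <code>listTrackingIds</code> and the stub object are made up so that the example can also run outside a browser.<br />

```javascript
// Returns the tracking IDs of all analytics.js trackers on the page.
// In a browser console you would pass window.ga once analytics.js has loaded.
function listTrackingIds(gaObject) {
  if (typeof gaObject !== "function" || typeof gaObject.getAll !== "function") {
    return []; // analytics.js not loaded (yet)
  }
  return gaObject.getAll().map((tracker) => tracker.get("trackingId"));
}

// Stand-in for window.ga so the sketch runs outside a browser (assumption:
// a single tracker for the test property, with a placeholder ID).
function stubGa() {}
stubGa.getAll = () => [{ get: (field) => (field === "trackingId" ? "UA-XXXXX-Y" : undefined) }];

console.log(listTrackingIds(stubGa));
console.log(listTrackingIds(undefined));
```

On the live site, you would expect the returned list to contain the ID of your test property. Note that the old hard-coded classic tracking code (ga.js) does not show up here, since it uses a different library.<br />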
<h4>
<span style="color: orange;">7) Assess Data Discrepancies with Old Tags Implementation </span></h4>
<br />
This is a crucial step of the upgrade process I suggest in this post. Basically, we want to make sure we are going to get the same quality of data as before. So, compare the data you are receiving in your test property through the Tag Manager implementation with the data you are receiving through the hard-coded implementation.<br />
<br />
I suggest collecting enough data to make a meaningful comparison. Keep the Universal Analytics tags coexisting with the old hard-coded ones for some time (a few days or weeks, depending on your current implementation and the size of the site). Collect the data, analyze it and evaluate possible discrepancies.<br />
<br />
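One simple way to evaluate discrepancies is to compute the percentage difference for each metric over the same date range. The sketch below is only an illustration: the function name <code>compareMetrics</code>, the metric names and all the numbers are made up, and what counts as an acceptable difference is up to you.<br />

```javascript
// Compares the same metrics from two properties and reports the relative
// difference of the test property against the classic one, in percent
// (rounded to one decimal place). Returns null when the baseline is zero.
function compareMetrics(classic, universal) {
  const report = {};
  for (const metric of Object.keys(classic)) {
    const base = classic[metric];
    const diff = universal[metric] - base;
    report[metric] = base === 0 ? null : Math.round((diff / base) * 1000) / 10;
  }
  return report;
}

// Made-up numbers for illustration only.
const classicData = { sessions: 1000, pageviews: 2500, bounces: 400 };
const universalData = { sessions: 1010, pageviews: 2450, bounces: 400 };
const report = compareMetrics(classicData, universalData);
console.log(report);
```

With these sample numbers, sessions come out 1% higher and pageviews 2% lower in the test property. Small gaps like these are probably nothing to worry about, while a large gap on a single metric would point to a tag that was not replicated correctly.<br />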
<br />
<h4>
<span style="color: orange;">8) Upgrade the old property to Universal Analytics</span></h4>
<br />
Once you have compared the data and are happy with the results (the Universal Analytics data look right), you can finally upgrade your "old" Google Analytics property to Universal Analytics. Via Google Tag Manager, obviously. This means completely replacing the old tags with the new Universal tags created through Google Tag Manager.<br />
<br />
Go back to your old Google Analytics property and transfer it to UA. In the Administration panel, click on Transfer to Universal Analytics. The transfer should take 24 hours. Once it is completed, you will see "Transfer Complete" displayed in green, as below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_Zx0rAQa39bnocFhfRlshauoWGk_8DkKS2tKm0AU9us4_0h1sw5-s73WrkDLrXKR9UdSf6pjOyrwbb_AFSYaWj25yqpai_9cfOeGtUkpYEGMzAkUEGsos1RU5LWaM6QCCdrwQrsTl_Hg/s1600/UA+transer+complete.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Universal Analytics Transfer Complete" border="0" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_Zx0rAQa39bnocFhfRlshauoWGk_8DkKS2tKm0AU9us4_0h1sw5-s73WrkDLrXKR9UdSf6pjOyrwbb_AFSYaWj25yqpai_9cfOeGtUkpYEGMzAkUEGsos1RU5LWaM6QCCdrwQrsTl_Hg/s1600/UA+transer+complete.jpg" title="" width="320" /></a></div>
<br />
<br />
<h4>
<span style="color: orange;">9) In Google Tag Manager, Replace Property Test Tracking ID with Old Property ID</span></h4>
<br />
Go back to Google Tag Manager and replace the test property's Tracking ID with the old property's ID.<br />
With this trick you will be able to re-use the container you have created (with all the tags you added) with the old property ID.<br />
<b><br /></b>
<b>This means continuity of data</b>: in the same old property you will be able to see both historical data (measured so far with the old GA tracking code) and new data (measured from now on with the Universal Analytics tracking code implemented through Google Tag Manager). <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJI4XY0pOJvQEYpVX7ByyanCaJH2Ig0PjQ3pJKkhEb58sAl5PtKcOjT2oIz1PBJ2xrebLW3Wvt_qHznK6zxsnmMXreHSg4I_cDKsZw8qc5doZjMPxHDr2TlabiU7f7z7h3CBUUvDkELgY/s1600/Replace+Property+ID+in+GTM.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Google Tag Manager Tracking ID Replacement" border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJI4XY0pOJvQEYpVX7ByyanCaJH2Ig0PjQ3pJKkhEb58sAl5PtKcOjT2oIz1PBJ2xrebLW3Wvt_qHznK6zxsnmMXreHSg4I_cDKsZw8qc5doZjMPxHDr2TlabiU7f7z7h3CBUUvDkELgY/s1600/Replace+Property+ID+in+GTM.jpg" title="Testing Tags in Google Tag Manager" width="320" /></a></div>
<br />
<br />
<h4>
<b><span style="color: orange;">10) Remove hard-coded tags from your site</span></b></h4>
<br />
Immediately after step 9, remember to remove the old GA tags that were manually added to the source code. This avoids having duplicated code on the site, which would cause measurement inaccuracies (e.g. a very low bounce rate and distorted page quality metrics).<br />
<br />
If you decided to keep some tags hard-coded (that is, you did not replicate them in Google Tag Manager), I think you could do that. However, for an optimal implementation Google recommends moving all tags to Google Tag Manager (and of course remember that you are now using Universal Analytics).<br />
<br />
<div>
<br /></div>
<br />
I hope you found the above steps clear. By following this upgrade process, you will be able to test Universal Analytics before implementing it completely and replacing the hard-coded tags. And, very importantly, you will also move to the Google Tag Manager platform at the same time, which will make managing tags a much easier task in the future.<br />
<br />
What was your experience with the Universal Analytics upgrade process? Do you feel the quality of your data has been affected by the new implementation? Or have you followed a process similar to the one in this post?<i><br /></i><br />
<br />
Your thoughts now.<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-2049647823192074722014-07-15T14:42:00.000+01:002016-04-24T01:27:18.595+01:003 Great Examples of Data Journalism Stories Over the last month I've been spending part of my free time learning about an emerging discipline in the areas of data analytics: Data Journalism. I am doing it, firstly because I find the combination "data analysis + journalism" very fascinating, but also because, <b>as a Web Analyst I believe that there are some very important skills I can absorb from Data Journalists </b>(here is a post where I talk about <a href="http://www.analyticsforfun.com/2014/03/free-analytics-education-my-personal.html" target="_blank">Web Analyst skills</a>).<br />
<div>
<br /></div>
<div>
The aim of this post is to introduce you to this emerging discipline, and show you a couple of practical examples of data journalism. To do so, <b>I've selected 3 published data journalism stories</b> <b>and analysed</b> each of them by answering four key questions:<br />
<ol>
<li>What does the story do?</li>
<li>How was it created (methodology)? </li>
<li>How was it illustrated?</li>
<li>What technologies were used to create and present the story to readers? </li>
</ol>
</div>
<div>
<br />
<a name='more'></a>If, after reading this post, you feel you would like to know more about data journalism, I encourage you to sign up for this MOOC: <a href="https://www.canvas.net/courses/doing-journalism-with-data" target="_blank">Doing Journalism with Data</a>, offered by the European Journalism Centre through the Canvas Network platform. The course started in May and the material will be available until 30 July, so you still have some time.</div>
<div>
<br /></div>
<div>
But first, of course, keep scrolling down this post and get started here with Data Journalism!</div>
<div>
<br />
<br /></div>
<h2>
<span style="font-size: large;">
What is Data Journalism?</span></h2>
<div>
<br />
To put it very simply, <b>Data Journalism is Journalism done with Data</b>.<br />
<br />
I know this is not a very helpful definition, and the reality is that if we asked different journalists for a definition of data journalism, we would probably get several different answers.<br />
<br />
To make things a bit clearer, Simon Rogers (Data Editor @ Twitter) looks at some key aspects of doing journalism with data. To define it, he suggests that data journalism is about:<br />
<ul>
<li>telling stories with numbers</li>
<li>finding the best way to tell this story</li>
<li>the techniques with which you tell the story (which keep changing all the time)</li>
</ul>
But, hold on. Is data journalism a totally new discipline? Not really. Actually, data have always been at the base of stories. In some areas, such as sports, data have always been an essential part of the piece of work delivered to the reader.<br />
<br />
So what has happened over the last few years to make data journalism emerge so rapidly?<br />
<br />
1) First of all, <b>more and more data are becoming accessible </b>to everyone, as never before. The media can now look for data to create their stories in several different sources, such as public databases about government spending, leaked documents published by Wikileaks, or "big data" generated by social networks such as Facebook or Twitter;<br />
<br />
2) Secondly, <b>data analysis tools are available</b> to most of us. Who does not have a copy of Excel (or equivalent spreadsheet software) installed on their computer? Some very powerful tools are even free or open source. Think of the <a href="http://www.r-project.org/" target="_blank">R statistical programming language</a>.<br />
<br />
3) Last, <b>data analysis tools are much more powerful and easier to use</b> than in the past. In many cases you can produce beautiful visualizations or maps without having technical experience or coding knowledge.</div>
<div>
<br />
<br /></div>
<div>
Because of the above changes, the journalism field is being transformed, and so are the <b>skills needed by "new journalists"</b>. Indeed, journalists will have to become knowledgeable in searching, cleaning, processing, analysing and visualizing data. They will have to mine the data, make sense of it and turn it into something interesting for the reader.<br />
<br />
Finally, we can try to group the main activities of a data journalist (or the set of skills needed) into 4 categories:<br />
<ol>
<li><b>Find data to support stories</b></li>
<li><b>Analyse the data to discover potential stories</b></li>
<li><b>Clean the data</b></li>
<li><b>Tell stories through visualizations</b></li>
</ol>
</div>
<div>
<br />
Some pioneering journalists and newspapers are already demonstrating how data can deliver unique insights into what is happening around us. And they are creating very interesting stories from that. Below I am going to show you a few examples of published pieces of data journalism, and I will try to analyse each of them through the four questions listed above. </div>
<div>
<br />
<br /></div>
<h1>
<span style="font-size: x-large;">
Three Great Data Journalism Stories</span></h1>
<div>
<span style="font-size: x-large;"><br /></span></div>
<h2>
<span style="font-size: large;">
Story #1. Afghanistan War Logs: a Selection of Significant Accidents</span></h2>
<div>
<i>Published by:</i> <a href="http://www.theguardian.com/news/datablog" target="_blank">The Guardian Data Blog</a></div>
<div>
<i>Link to the article:</i> <a href="http://www.theguardian.com/world/datablog/interactive/2010/jul/25/afghanistan-war-logs-events?guni=Graphic:in%20body%20link">http://www.theguardian.com/world/datablog/interactive/2010/jul/25/afghanistan-war-logs-events?guni=Graphic:in%20body%20link</a></div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeDlb_QR75ZmAqu5U8ldmibii825Vf9upauJwEEiwVCNBLrC8IAHuYNKAWp57iU3M5o8vUPMDB1LgKfH0waW6XP71srEnvbRC2CGd_0vqo_IBcg5YHTzAARBEdOR7uK4GVCrRiPhJTmI/s1600/afghanWarLogs.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="afghan accidents data journalism" border="0" height="431" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeDlb_QR75ZmAqu5U8ldmibii825Vf9upauJwEEiwVCNBLrC8IAHuYNKAWp57iU3M5o8vUPMDB1LgKfH0waW6XP71srEnvbRC2CGd_0vqo_IBcg5YHTzAARBEdOR7uK4GVCrRiPhJTmI/s1600/afghanWarLogs.jpg" title="Afghan War Logs: A Selection of Accidents" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>What does the story do?</b></div>
<div>
This great data story, created and published by The Guardian, shows a selection of key events that happened during the <a href="http://en.wikipedia.org/wiki/War_in_Afghanistan_(2001%E2%80%93present)" target="_blank">war in Afghanistan</a>, such as coalition forces' attacks on civilians, friendly fire incidents (coalition troops mistakenly firing on each other), and Afghan forces attacking each other. To do so, The Guardian uses data made available through <a href="https://wikileaks.org/About.html" target="_blank">Wikileaks</a>, which disclosed previously confidential military records about the war in Afghanistan.<br />
<br /></div>
<div>
<b>How was it created?</b></div>
<div>
What we are talking about here is one of the biggest leaks in intelligence history. The Guardian got a huge Excel file from Wikileaks, logging the history of the war in Afghanistan. The Excel file contained over 90,000 rows of data, some of which were empty or poorly formatted.<br />
<br />
The data obviously <u>needed some cleaning</u>. And Excel showed its limits when processing such a huge amount of data. Reporters could not access the data easily, so it was hard to extract meaningful stories. What the Guardian data team did was <u>build an internal database to store and access these data</u>, so that reporters could look for stories by searching on keywords and events. One of the key stories found in the war data was the rise in IED (improvised explosive device) attacks.<br />
<br />
They then mapped the latitude and longitude coordinates of every event, made a selection of key events to include in the story, and finally <u>created a graphic visualization with the help of Google Maps</u>.<br />
<br /></div>
<div>
<b>How was it illustrated?</b></div>
<div>
The story was visualized by plotting points (key events) of different colours on a geographical map, in this case Google Maps. As you can see in the picture above, colours identify the category of event (Afghan friendly fire vs Coalition friendly fire, etc.).<br />
<br />
If you click on the categories within the map legend, you can hide or show them on the map. Also, if you click on any data point within the map, a small window will open showing a brief description of the event, its category, and the date and time it occurred. By clicking on "Read the full log entry" you will be able to see the complete log of that event.<br />
<br />
I recommend you explore the <a href="http://www.theguardian.com/world/datablog/interactive/2010/jul/25/afghanistan-war-logs-events" target="_blank">Afghan War Logs map</a> yourself and play a bit with the data.<br />
<br /></div>
<div>
<b>What technologies were used to create and present the story to readers?</b><br />
The Wikileaks data came as a spreadsheet of about 92,000 records; the team then built a simple database and queried it with SQL. Finally, I guess they used the Google Maps API to produce the map.</div>
<div>
<br /></div>
<br />
<h2>
<span style="font-size: large;">
Story #2. Weapons from Croatia Spread through the Conflict in Syria</span></h2>
<div>
<i>Published by:</i> The final article was published by The New York Times, though the story was originally created by <a href="http://en.wikipedia.org/wiki/Eliot_Higgins" target="_blank">Eliot Higgins</a>, a.k.a. <a href="http://brown-moses.blogspot.com.ar/" target="_blank">Brown Moses</a>. </div>
<div>
<i>Article link:</i> <a href="http://www.nytimes.com/2013/02/26/world/middleeast/in-shift-saudis-are-said-to-arm-rebels-in-syria.html">http://www.nytimes.com/2013/02/26/world/middleeast/in-shift-saudis-are-said-to-arm-rebels-in-syria.html</a></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu43RwZwoOGf_F1WXk2m7YdJJJwkPNSfQXWWvhYLKhRd0G7-gqh5rD2FaBnlJ7AuADWl_WJ-hIoCeWxu21d8Z0qMQZ4NTGdZV2lpI_TFL_Mi6_U9qdzXWqSgoHBKns0DBv5Zulx89nnXo/s1600/Syria+Weapons.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Weapons Smuggled into Syria - Data Journalism Story" border="0" height="489" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu43RwZwoOGf_F1WXk2m7YdJJJwkPNSfQXWWvhYLKhRd0G7-gqh5rD2FaBnlJ7AuADWl_WJ-hIoCeWxu21d8Z0qMQZ4NTGdZV2lpI_TFL_Mi6_U9qdzXWqSgoHBKns0DBv5Zulx89nnXo/s1600/Syria+Weapons.jpg" title="Weapons from Croatia Spread through the Conflict in Syria" width="640" /></a></div>
<div>
</div>
<div>
<b><br /></b>
<b>What does the story do?</b></div>
<div>
The story reveals how, at some point during the <a href="http://en.wikipedia.org/wiki/Syrian_Civil_War" target="_blank">war in Syria</a>, some very unusual weapons appeared in the conflict, apparently all coming from the former Yugoslavia. These insights led the journalist to draw a very important conclusion from the data: the Saudis had purchased those weapons from Croatia, shipped them to Jordan, and started smuggling them into Syria to support the Free Syrian Army in its fight against President Assad. All of this was probably happening with the knowledge of the US Government.</div>
<div>
</div>
<div>
<b>How was it created?</b></div>
<div>
It all began in 2012 when <a href="http://en.wikipedia.org/wiki/Eliot_Higgins" target="_blank">an unemployed finance worker named Eliot Higgins</a> started a blog about the Syrian civil war. As he said, his early posts were rather unorganised collections of videos he had seen on Facebook and Twitter. But after a couple of months, he started adopting <u>a more systematic approach to collecting</u> and examining videos coming from Syria.</div>
<div>
<br />
What he did was gather a list of all the channels that were posting from each specific area of Syria. He ended up <u>monitoring a list of over 500 YouTube channels daily</u>, searching for images of weapons and tracking when new types of weapons appeared in the conflict, where, and with which armed group. To identify the types of weapons, he mainly relied on Google. He eventually <u>collected all this data into a spreadsheet</u> and analysed it.<br />
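<br />
To give an idea of what this kind of tracking could look like once the sightings are in a table, here is a minimal R sketch. It is not Higgins' actual spreadsheet or method: the dates, weapon labels and region names are placeholders I made up, and the only point is to show how to find the first appearance of each weapon type in each region.<br />

```r
# Made-up sightings log; dates, weapon labels and regions are placeholders
sightings <- data.frame(
  date   = as.Date(c("2013-01-05", "2013-01-02", "2013-01-20",
                     "2013-01-10", "2013-01-15")),
  weapon = c("type_X", "type_X", "type_Y", "type_X", "type_Y"),
  region = c("south", "south", "south", "north", "north"),
  stringsAsFactors = FALSE
)

# Sort chronologically, then keep the first row of each weapon/region pair
sightings  <- sightings[order(sightings$date), ]
first_seen <- sightings[!duplicated(sightings[, c("weapon", "region")]), ]

print(first_seen)
```

Sorting before removing duplicates is what guarantees that the row kept for each weapon/region pair is the earliest one.<br />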
<br />
Once he noticed there might be an important story behind those data (unusual weapons mainly appearing in the southern region of Syria, near the border with Jordan), he first published his findings on his blog, and later went to the New York Times with an article summarising what he had found.<br />
<br />
The New York Times did further investigations and eventually published the article.<br />
<br />
A very interesting point made by Eliot Higgins is about <u>the effectiveness of his data collection methodology</u>. By monitoring social media he was able to track the arrival of those unusual weapons in Syria, which is something he might not have picked up by staying on the ground. For this type of analysis, he had a much better picture of what was going on in Syria than a journalist based locally. If you are interested in this fascinating story, you must read <a href="http://schoolofdata.org/2013/08/23/social-media-syria/" target="_blank">Higgins' story on School of Data</a> or just google it. <br />
<br /></div>
<div>
<b>How was it illustrated?</b></div>
<div>
I have not found any published visualizations of the findings. The story published in The New York Times is text only, though it refers to the Brown Moses blog. There you can find various pictures and videos reporting the conflict in Syria and the weapons discovered.</div>
<div>
<br /></div>
<div>
As a personal thought, it would be very interesting to summarize and show the story through some type of visualization. One idea could be a geographical map, with points plotted over it indicating the different types of arms used in the conflict, and also a timeline showing when Croatian weapons started spreading through the conflict. Any other ideas?</div>
<div>
<b><br /></b></div>
<div>
<b>What technologies were used to create and present the story?</b></div>
<div>
He regularly monitored images and videos from social media such as YouTube, Facebook and Twitter, and collected all the data into a spreadsheet (Excel, I guess).<br />
<br />
<br />
<h2>
<span style="font-size: large;">
Story #3. The Cholera Map of John Snow (1854)</span></h2>
<div>
<i>Published by:</i> <a href="http://en.wikipedia.org/wiki/John_Snow_(physician)" target="_blank">John Snow</a> in 1854</div>
<div>
<i>Article link:</i> I am not sure if the original publication is available on the internet; however, his amazing work has been widely discussed by many experts in the fields of data visualization and journalism. I recommend reading the interesting article from The Guardian, where they also <a href="http://www.theguardian.com/news/datablog/interactive/2013/mar/15/cholera-map-john-snow-recreated" target="_blank">recreate John Snow's story with an interactive map</a>. Below is the original cholera map.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht3adaQWWGji1ZAHOOUbR0cAYeDqwqYEXfccTi9i87sc2-y827c1jHp82_2ABisqX5Wlb_hpiYwNJMxS4mwG6afzBE08Zjt7qvpAIAGHRFY_CjBQ53szu3YwXrmqB8vQjWyQlg5nz4K_0/s1600/Snow+CholeraMap.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="John Snow Cholera Map - Data Journalism Story" border="0" height="601" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht3adaQWWGji1ZAHOOUbR0cAYeDqwqYEXfccTi9i87sc2-y827c1jHp82_2ABisqX5Wlb_hpiYwNJMxS4mwG6afzBE08Zjt7qvpAIAGHRFY_CjBQ53szu3YwXrmqB8vQjWyQlg5nz4K_0/s1600/Snow+CholeraMap.jpg" title="The Cholera Map of John Snow (1854)" width="640" /></a></div>
<br />
<br /></div>
</div>
<div>
<b>What does the story do?</b><br />
Until the 1870s, most scientists believed that cholera, like many other diseases, was caught by breathing "polluted" air. In 1854, a severe <a href="http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak" target="_blank">outbreak of cholera occurred in the London district of Soho</a>. It gave the physician John Snow the occasion to study the phenomenon closely and defy the dominant theory by hypothesizing that cholera was a waterborne disease caused by germs.<br />
<br />
He concluded that the source of the London outbreak was the public water pump located on Broad Street (now called Broadwick Street). So cholera was spread by contaminated water and not by "polluted" air, as most scientists believed.<br />
<br />
<b>How was it created?</b><br />
John Snow talked to local residents, found out where the cholera cases had happened and collected all the data. He then <u>made a map of all the cases, representing each death as a bar and placing it on the map at the exact point where it happened</u>. Thanks to the map, Snow could show the story behind the data: most of the cholera deaths clustered around the Broad Street pump, which (as was later discovered) had been contaminated by fecal matter from a sick baby.<br />
<b><br /></b>
<b>How was it illustrated?</b><br />
With bars on a map, representing the number of cholera cases in each area of Soho, and circles showing the locations of the water pumps.<br />
<br />
This type of data visualization allowed him to relate the two variables: pump locations and number of cholera cases.<br />
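<br />
Just to illustrate the idea of relating these two variables, here is a small R sketch that assigns each death to its nearest pump and counts the deaths per pump. The coordinates are toy values in arbitrary units, not Snow's real data, and straight-line distance is a simplification (people walk along streets).<br />

```r
# Toy coordinates in arbitrary units, NOT Snow's real data
pumps <- data.frame(
  name = c("Pump A", "Pump B", "Pump C"),
  x    = c(0, 10, 5),
  y    = c(0, 0, 8),
  stringsAsFactors = FALSE
)
deaths <- data.frame(
  x = c(1, 2, 0.5, 9, 1.5, 4, 0, 2.5),
  y = c(1, 0, 2,   1, 1.5, 7, 1, 0.5)
)

# For each death, find the index of the closest pump (Euclidean distance)
nearest <- sapply(seq_len(nrow(deaths)), function(i) {
  d <- sqrt((pumps$x - deaths$x[i])^2 + (pumps$y - deaths$y[i])^2)
  which.min(d)
})

# Count deaths per nearest pump (pumps with no deaths are kept with a 0)
deaths_by_pump <- table(factor(pumps$name[nearest], levels = pumps$name))
print(deaths_by_pump)
```

With these toy numbers most deaths fall closest to "Pump A", which is the kind of clustering Snow's map made visible at a glance.<br />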
<b><br /></b>
<b>What technologies were used to create and show the story?</b></div>
<div>
At that time, I guess Snow produced the map with just a pen and paper. <br />
<br />
As I mentioned earlier, his data story has been <a href="http://www.theguardian.com/news/datablog/interactive/2013/mar/15/cholera-map-john-snow-recreated" target="_blank">revisited recently by The Guardian</a>, who recreated the cholera map using modern mapping tools such as <a href="http://cartodb.com/" target="_blank">CartoDB</a> and <a href="http://maps.stamen.com/#toner/12/51.5117/-0.1328" target="_blank">Stamen style</a> maps.<br />
<br />
For more about the impact of Snow's story on current data journalism and data visualization, you can check the very interesting posts from <a href="http://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map" target="_blank">Simon Rogers</a> and <a href="http://www.peachpit.com/articles/article.aspx?p=2048358" target="_blank">Alberto Cairo</a>. </div>
<br />
<br />
<h2>
Conclusions: What can we Learn from Data Journalism?</h2>
<div>
<br />
In this post I introduced the emerging field of data journalism and showed, with 3 examples, how some pioneering journalists are analysing data to find insightful stories.<br />
<br />
The technologies used to create and present stories were more or less sophisticated. Still, <b>all journalists followed a common process to build the story</b>: they had to find data, analyze it, clean it, and finally communicate the story possibly using some graphic visualization.<br />
<br />
As a Web Analyst, I think that looking at how journalists create stories with data <b>is a great learning exercise</b>. It encourages us to be more creative and curious, and to develop a good critical attitude, which will help us in our daily job. And, very importantly, it should make us realise that we share many aspects of our job with data journalists. Indeed, both of us follow the same process in building the story (find, analyse, clean, visualize) and have to communicate the final results to someone (who is sometimes new to the subject).<br />
<br />
Okay, I guess in many cases our job (web analysis) is more "standardized" than a journalist's, in the sense that we tend to stick to the same data sources, tools and techniques to collect, analyse and visualize data. But still, this should not stop us <b>thinking outside the box</b> and seeing if there is a better way to solve our data problems. For example, we might start asking ourselves questions like:<br />
<br />
How could we collect more interesting data for our web analysis? Is there any other available data that we could combine with our website clickstream data, so that we can make better decisions? Think of the Syria story and how Eliot Higgins started monitoring social media to get insights about the conflict going on in Syria. Would he have been able to get the same data by staying on the ground? Probably not.<br />
<br />
And finally, what would be the best graphic format for our monthly/weekly report? Could we replace the current tables/charts with more insightful graphic visualizations? Sometimes it can be a great idea to just grab pen and paper and sketch what we want to show with the data. John Snow's cholera map is an excellent example of an insightful data visualization.<br />
<br />
See you next post! Thanks for reading it.<br />
<br />
<br /></div>
marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-42989040145716682482014-06-23T15:43:00.002+01:002016-04-24T01:00:56.279+01:00Performing ANOVA Test in R: Results and Interpretation<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzCi-M3JpvQ_Ak6vmeg0tw-5TeeKdcncANOkrF7u2ApAaSAmc4GunzAdY_q71JaDzgFLZ44eWL6Wa4CEGeBbQhiTe43D-voGVMLUNdQ454ZYcp7nhJd58bbcgPlHMJbqX4Kn3ZsXINpRs/s1600/boxplots.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="ANOVA test with R" border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzCi-M3JpvQ_Ak6vmeg0tw-5TeeKdcncANOkrF7u2ApAaSAmc4GunzAdY_q71JaDzgFLZ44eWL6Wa4CEGeBbQhiTe43D-voGVMLUNdQ454ZYcp7nhJd58bbcgPlHMJbqX4Kn3ZsXINpRs/s1600/boxplots.jpg" title="How to Perform and ANOVA test" width="320" /></a></div>
<br />
When testing a hypothesis with a categorical explanatory variable and a quantitative response variable, the tool normally used in statistics is <b>Analysis of Variance</b>, also called <a href="http://en.wikipedia.org/wiki/Anova" target="_blank">ANOVA</a>.<br />
<br />
In this post I perform an ANOVA test, using the R programming language, on a dataset of new breast cancer cases across continents.<br />
<b><br /></b>
<b>The objective of the ANOVA test is to analyse whether there is a (statistically) significant difference in breast cancer between different continents.</b> In other words, I am interested in seeing whether new cases of breast cancer are more likely to occur in some regions than in others.<br />
<br />
Beyond analysing this specific breast cancer dataset, I hope with this post to create a short <b>tutorial</b> about ANOVA and <b>how to do simple linear models in R</b>.<br />
<br />
<a name='more'></a><div style="text-align: center;">
<a href="http://www.analyticsforfun.com/2015/08/playing-with-r-shiny-dashboard-and.html" style="background-color: #4caf50; border: none; color: white; cursor: pointer; display: inline-block; font-size: 16px; margin: 4px 2px; padding: 15px 32px; text-align: center; text-decoration: none;" target="_blank"><i>Learn also</i> <br /> <b>How to build beautiful visualizations with R</b></a></div>
<br />
Some time ago I took a <a href="http://www.analyticsforfun.com/2013/03/ready-for-new-statistics-course-my.html" target="_blank">statistics course</a> and this was actually part of an assignment; I hope there are no major errors in the methodology I am going to follow, and of course any feedback/critique is very welcome.<br />
<br />
<h2>
<span style="font-size: large;">
The Dataset</span></h2>
<div>
<br />
My dataset has breast cancer data for 173 countries, originally collected by IARC (International Agency for Research on Cancer) in 2002. The dataset also includes several other socio-economic variables about the countries, though I am not going to explore them on this occasion. To obtain the final dataset, I did some minor cleaning and added the "continent" variable through a merge operation. To see how I did this, you can check a previous <a href="http://www.analyticsforfun.com/2013/04/plotting-data-over-map-with-r.html" target="_blank">post about merging datasets with R</a>. </div>
<div>
<br /></div>
<div>
If you would like to get the final dataset, you can <a href="https://www.dropbox.com/s/or0rbrkoc861w15/gapC.csv?dl=0" target="_blank">download it here in .csv format</a>. Once imported into R, I stored it in a variable called "gapCleaned".</div>
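<div>
<br />
For reference, the import step could look roughly like the sketch below. The helper name load_gap and the local file name "gapC.csv" are my assumptions for the example (the file name is taken from the download link); the only things it relies on are the two columns used later in this post, "breastcancer" and "continent".</div>

```r
# Hypothetical helper: read the CSV and make sure the grouping variable
# is a factor, which is what aov() expects for a categorical predictor
load_gap <- function(path) {
  gap <- read.csv(path, stringsAsFactors = FALSE)
  gap$continent <- factor(gap$continent)
  gap
}

# Assumed local path of the downloaded file
csv_path <- "gapC.csv"

if (file.exists(csv_path)) {
  gapCleaned <- load_gap(csv_path)
  # Quick sanity checks: column types and number of countries per continent
  str(gapCleaned[, c("breastcancer", "continent")])
  print(table(gapCleaned$continent))
}
```

<div>
Checking the counts per continent right after the import is a cheap way to spot groups with very few countries before running the test.</div>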
<br />
<h2>
<span style="font-size: large;">
Define the ANOVA model mathematically</span></h2>
<br />
As already mentioned above, I am going to examine the relationship between:<br />
<br />
<ul>
<li>Continents, which is my <u>explanatory variable</u> --> let’s call it X</li>
<li>and New Cases of Breast Cancer, which is my <u>response variable</u> --> let’s call it Y</li>
</ul>
<br />
Mathematically, the relationship can be written like this:<br />
<br />
<div style="text-align: left;">
Y ~ X </div>
<br />
<br />
<b>ANOVA is going to compare the means</b> of breast cancer among the seven continents, and <b>check if the differences are statistically significant</b>. Here are my null and alternative hypotheses:<br />
<br />
<ul>
<li><b>Null Hypothesis</b>: all seven continent means are equal --> there is no relationship between continents and new cases of breast cancer, which we can write as follows:</li>
</ul>
H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> = μ<sub>3</sub> = μ<sub>4</sub> = μ<sub>5</sub> = μ<sub>6</sub> = μ<sub>7</sub><br />
<ul>
<li><b>Alternative Hypothesis</b>: not all seven continent means are equal --> there is a relationship between continents and new cases of breast cancer:</li>
</ul>
H<sub>1</sub>: not all μ<sub>i</sub> are equal<br />
<br />
<br />
<h2>
<span style="font-size: large;">
Perform the ANOVA test with R</span></h2>
<br />
So, how do we go about testing the means? First of all we can calculate and plot the means for each continent, which is pretty easy to do with R (remember, my breast cancer dataset is called "gapCleaned" in R):<br />
<br />
> means<- round(tapply(gapCleaned$breastcancer, gapCleaned$continent, mean), digits=2) # note that I round values to just 2 decimal places<br />
<br />
> means<br />
<br />
<br />
AF AS EE LATAM NORAM OC WE<br />
<br />
24.02 24.51 49.44 36.70 71.73 45.80 74.80<br />
<br />
> library(gplots) #I load the "gplots" package to plot means<br />
<br />
> plotmeans(gapCleaned$breastcancer~gapCleaned$continent, digits=2, ccol="red", mean.labels=TRUE, main="Plot of breast cancer means by continent")<br />
<br />
<img alt="image" src="http://media.tumblr.com/ac7db6d49545371aea5b5f82ea3fd0a0/tumblr_inline_mlmno5UHeu1qz4rgp.jpg" /><br />
<br />
The above graph shows how breast cancer means change between continents, as well as the number of countries taken into account for calculating the mean of each continent. Cool, it looks like means differ among continents, with Africa presenting the lowest value and West Europe the highest. But… hang on, <b>is that enough to provide evidence against my null hypothesis?</b> Not really and we can understand why, through a lovely boxplot:<br />
<br />
> boxplot(gapCleaned$breastcancer ~ gapCleaned$continent, main="Breast cancer by continent (mean is black dot)", xlab="continents", ylab="new cases per 100,000 residents", col=rainbow(7))<br />
<br />
> points(means, col="black", pch=18)<br />
<br />
<img alt="image" src="http://media.tumblr.com/12e1f0a948adcef0bdcd53d5ff7dd9cf/tumblr_inline_mlmoj7B1jq1qz4rgp.jpg" /><br />
(* the blue boxplot with the missing label refers to North America).<br />
<br />
The boxplot shows that the means are different (some less, others more). But it also shows that each continent presents a different amount of variation/spread in breast cancer, so that there is a lot of overlap in values between some continents (e.g. Africa & Asia or North America & West Europe). Hence, the differences in means could have come about by chance (in which case we shouldn’t reject the null hypothesis). <b>That is where ANOVA comes to help us</b>.<br />
<br />
<b>The question we are answering with ANOVA is</b>: is the variation between the continent means due to true differences between the population means, or just to sampling variability? To answer this question, ANOVA calculates a statistic called the <a href="http://en.wikipedia.org/wiki/F-statistics" target="_blank">F statistic</a>, which compares the variation among sample means (among different continents in our case) to the variation within groups (within continents).<br />
<br />
F statistic = Variation among sample means / Variation within groups<br />
<br />
Through the F statistic we can see whether the variation among sample means dominates the variation within groups or not. In the first case we have strong evidence against the null hypothesis (all means are equal), while in the second case we have little evidence against it.<br />
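<br />
To make the ratio above concrete, here is a minimal sketch that computes the F statistic by hand and compares it with what aov returns. The data frame "toy" and its values are invented for illustration only; they are not my breast cancer data.<br />

```r
# Toy data (invented for illustration): three groups, four observations each
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 12, 11, 13,  20, 22, 19, 21,  15, 14, 16, 17)
)

grand_mean  <- mean(toy$value)
group_means <- tapply(toy$value, toy$group, mean)
group_sizes <- tapply(toy$value, toy$group, length)

k <- nlevels(toy$group)   # number of groups
n <- nrow(toy)            # total number of observations

# Variation among sample means: between-group mean square
ms_between <- sum(group_sizes * (group_means - grand_mean)^2) / (k - 1)

# Variation within groups: residual mean square
ms_within <- sum((toy$value - group_means[toy$group])^2) / (n - k)

f_by_hand <- ms_between / ms_within

# The F value reported by aov() for the same data
f_from_aov <- summary(aov(value ~ group, data = toy))[[1]][["F value"]][1]
print(c(by_hand = f_by_hand, aov = f_from_aov))
```

The two numbers printed should agree, which is just a way of checking that the "among" and "within" pieces are what the formula says they are.<br />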
<br />
All right, after this theoretical excursus, it’s time to perform ANOVA on my data and try to interpret results. To call ANOVA with R, I am using the “aov” function:<br />
<br />
> aov_cont<- aov(gapCleaned$breastcancer ~ gapCleaned$continent)<br />
<br />
> summary(aov_cont) # here I see results for my ANOVA test<br />
<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
<br />
gapCleaned$continent 6 52531 8755 <b>40.28 <2e-16 ***</b><br />
<br />
Residuals 166 36083 217 <br />
<br />
---<br />
<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
<br />
Good, my F value is 40.28 and the p-value is very low too. In other words, the variation of breast cancer means among continents (numerator) is much larger than the variation of breast cancer within each continent (denominator), and our p-value is well below 0.05 (the conventional significance threshold). Hence, at the 5% significance level, <b>we reject the null hypothesis in favour of the alternative hypothesis H1</b>: there is a significant relationship between continent and breast cancer.<br />
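<br />
As a sanity check, the F value and p-value can be recomputed from the sums of squares and degrees of freedom printed in the table above. This is only a sketch using the rounded figures from the output, so the result matches to about two decimals; the variable names are my own.<br />

```r
# Values copied from the summary(aov_cont) output above
ss_between <- 52531; df_between <- 6
ss_within  <- 36083; df_within  <- 166

ms_between <- ss_between / df_between   # "Mean Sq" for continent, about 8755
ms_within  <- ss_within / df_within     # "Mean Sq" for residuals, about 217
f_value    <- ms_between / ms_within    # about 40.28

# Upper-tail probability of the F distribution: the p-value of the test
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
print(c(F = f_value, p = p_value))
```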
<br />
But we are not done yet… :(<br />
<br />
What I know at this point (thanks to ANOVA) is that NOT ALL THE MEANS ARE EQUAL. However, my categorical variable “continents” has more than two levels (7, actually), and it might be that just one continent differs from the others. <u>ANOVA doesn’t tell me which groups (continents) are different from the others.</u> To find out, we have to compare each pair of continents and look for significant differences.<br />
<br />
To determine which groups differ from the others, <b>I need to conduct a POST HOC TEST</b>, or post hoc pairwise comparison, which is designed to evaluate pairs of means. Note that we can’t simply run a separate ANOVA for each pair, as this would inflate our error rate (see <a href="http://en.wikipedia.org/wiki/Familywise_error_rate" target="_blank">family wise error rate</a> for more details). There are many post hoc tests available for analysis of variance; in my case I will use the Tukey post hoc test, calling the R function “TukeyHSD” as follows:<br />
<br />
> tuk<- TukeyHSD(aov_cont)<br />
<br />
> tuk<br />
<br />
Tukey multiple comparisons of means<br />
<br />
95% family-wise confidence level<br />
<br />
Fit: aov(formula = gapCleaned$breastcancer ~ gapCleaned$continent)<br />
<br />
$`gapCleaned$continent`<br />
<br />
diff lwr upr p adj<br />
<br />
AS-AF 0.4953571 -8.986848 9.9775626 0.9999987<br />
<br />
EE-AF 25.4248377 14.352007 36.4976680 0.0000000<br />
<br />
LATAM-AF 12.6875000 2.501977 22.8730225 0.0050172<br />
<br />
NORAM-AF 47.7172619 21.638434 73.7960896 0.0000035<br />
<br />
OC-AF 21.7839286 5.151040 38.4168172 0.0025337<br />
<br />
WE-AF 50.7886905 39.528172 62.0492093 0.0000000<br />
<br />
EE-AS 24.9294805 12.956321 36.9026399 0.0000001<br />
<br />
LATAM-AS 12.1921429 1.034462 23.3498237 0.0223712<br />
<br />
NORAM-AS 47.2219048 20.748253 73.6955563 0.0000067<br />
<br />
OC-AS 21.2885714 4.043225 38.5339174 0.0056343<br />
<br />
WE-AS 50.2933333 38.146389 62.4402777 0.0000000<br />
<br />
LATAM-EE -12.7373377 -25.274849 -0.1998261 0.0437993<br />
<br />
NORAM-EE 22.2924242 -4.791696 49.3765447 0.1822328<br />
<br />
OC-EE -3.6409091 -21.809489 14.5276712 0.9967979<br />
<br />
WE-EE 25.3638528 11.938369 38.7893364 0.0000015<br />
<br />
NORAM-LATAM 35.0297619 8.296134 61.7633901 0.0025162<br />
<br />
OC-LATAM 9.0964286 -8.545414 26.7382711 0.7208506<br />
<br />
WE-LATAM 38.1011905 25.397612 50.8047690 0.0000000<br />
<br />
OC-NORAM -25.9333333 -55.725866 3.8591991 0.1332198<br />
<br />
WE-NORAM 3.0714286 -24.089965 30.2328219 0.9998787<br />
<br />
WE-OC 29.0047619 10.721189 47.2883344 0.0000943<br />
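<br />
A quick aside on why I did not just run one separate test per pair. With 7 continents there are 21 pairs (the 21 rows of the table above). The sketch below estimates the chance of at least one false positive if each pair were tested on its own at the 0.05 level; it assumes the tests are independent, which is a simplification.<br />

```r
k       <- 7               # continents
n_pairs <- choose(k, 2)    # 21 pairwise comparisons, as in the Tukey table
alpha   <- 0.05

# Chance of at least one false positive across all pairs if each were tested
# separately at alpha (assumes independent tests, which is a simplification)
fwer <- 1 - (1 - alpha)^n_pairs
print(round(fwer, 2))
```

Under that assumption the family-wise error rate comes out at roughly two in three, which is why a correction like Tukey's is needed.<br />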
<br />
<br />
<h2>
<b><span style="font-size: large;">Results & Interpretations</span></b></h2>
<br />
From the table above (looking at “diff” and “p adj” columns) I can see which continents have significant differences in breast cancer from others. For example I can conclude that:<br />
<br />
<ul>
<li><b>there is no significant difference</b> in new breast cancer cases between Asia and Africa (p = 0.99 > 0.05), as well as between West Europe and North America (p = 0.99) or Oceania and Latin America (p = 0.72), etc. </li>
</ul>
<ul>
<li><b>THERE IS A SIGNIFICANT DIFFERENCE</b> in new breast cancer cases between East Europe and Africa (p &lt; 0.001), as well as between Latin America and Africa (p = 0.005) or West Europe and Oceania (p &lt; 0.001)</li>
</ul>
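<br />
Reading 21 rows by eye gets tedious, so the significant pairs can also be filtered out of the Tukey result directly. The sketch below uses R's built-in PlantGrowth dataset so that it runs on its own; "tuk_demo" is a hypothetical name. With my own object, the matrix would be taken from tuk$`gapCleaned$continent` instead, as shown in the output above.<br />

```r
# Built-in example data: plant weights under three conditions
fit <- aov(weight ~ group, data = PlantGrowth)
tuk_demo <- TukeyHSD(fit)

# One matrix per factor, with columns diff, lwr, upr and "p adj"
pairs <- tuk_demo$group

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
print(round(significant, 4))
```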
<br />
Finally, I can also visualize the continent pairs and their significant differences by plotting the “tuk” object in R (sorry, the y axis is not displayed properly). Significant differences are the ones whose confidence interval does not cross zero.<br />
<br />
> plot (tuk)<br />
<br />
<img alt="image" src="http://media.tumblr.com/7cf23c3cd6058e4d25b73fc6a2f373c2/tumblr_inline_mlqho2ycIb1qz4rgp.jpg" /><br />
<br />
<h3>
Conclusions</h3>
<div>
<br />
Despite the interesting findings obtained from the ANOVA test, which show a potential relationship between some continents/countries (the most developed ones in particular) and breast cancer incidence, I am not going to draw any concrete conclusion from the data. This is because the model I've built (Y~X) does not account for some potential <a href="http://en.wikipedia.org/wiki/Confounding" target="_blank">confounding variables</a>, such as:</div>
<div>
<ul>
<li>access to health care and breast cancer screenings: Africa and Asia might have many women with breast cancer who remain undiagnosed due to lack of access to diagnostic and treatment services. On the other hand, it looks like there are more women with breast cancer in developed countries, but that may just be because these countries offer better access to health services; </li>
</ul>
<ul>
<li>life expectancy: age at diagnosis is another variable to take into consideration, since life expectancy is far lower in less developed regions such as Africa and Asia. Age is an important factor in breast cancer (women over 50 are more likely to get breast cancer), and it might be that, because of their higher life expectancy, more developed countries present more cases than less developed ones.</li>
</ul>
<div>
While it is impossible with such a "poor" model to draw concrete results from my data analysis, I guess we should take this post as a <b>"learning exercise"</b> that shows the main steps for performing an ANOVA test with R, and the logic behind it. I hope you found it helpful and please add your own considerations, critiques, comments below.<br />
<br />
<i><span style="color: #666666;">Other R articles you will find useful:</span></i><br />
<a href="http://www.analyticsforfun.com/2013/04/global-distribution-of-breast-cancer.html"><i><b>Global Distribution of Breast Cancer: some initial considerations</b></i></a><br />
<a href="http://www.analyticsforfun.com/2013/04/plotting-data-over-map-with-r.html"><i><b>Plotting Data over a Map with R</b></i></a><br />
<a href="http://www.analyticsforfun.com/2013/09/my-first-r-shiny-web-application-using.html"><i><b>My first R Shiny Web Application using Breast Cancer Data</b></i></a><br />
<br />
<br /></div>
</div>
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-79009143426125460752014-05-14T03:41:00.001+01:002016-08-17T20:17:03.339+01:00
Web Scraping for Non-Programmers: 3 easy Tools to Extract Data from Websites
If you work with data and use the web as your main source for datasets, then you might have heard the words "web scraping". If you have not come across it yet, you have surely found some interesting data on the web with no download option available. No csv file or Excel download. Nothing. Nada. Niente. And even your desperate copy-and-paste attempt has failed you. <b>This is where web scraping comes in handy.</b><br />
<b><br /></b>
This post introduces web scraping, and I am going to present <b>3 tools any of us can use to "scrape" the web</b>. Two of them can be used directly from your browser, while the third is available through Google Spreadsheets. But, most importantly, they are all free, very quick and easy to use, and do not require programming skills.<br />
<br />
All right, let's define the topic of this post first. What the heck is Web Scraping?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiknMcMSkmwpnS7M6rufGd2HGSauMmqKBKMWZGq67UHzWvFC3s5UXS3zIBv2n5iKhz24uiNBNUEoZqdb6pFr7vscpQ5bV8MWtgyi6P-3wvaK56mubWlcyvanlM3w7Pnw6897eMcuxCTRrM/s1600/web+scraping+jime.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="web scraping" border="0" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiknMcMSkmwpnS7M6rufGd2HGSauMmqKBKMWZGq67UHzWvFC3s5UXS3zIBv2n5iKhz24uiNBNUEoZqdb6pFr7vscpQ5bV8MWtgyi6P-3wvaK56mubWlcyvanlM3w7Pnw6897eMcuxCTRrM/s1600/web+scraping+jime.jpg" title="web scraping" width="400" /></a></div>
<br />
<br />
<a name='more'></a><br />
<br />
<h2>
<span style="font-size: large;">
What is Web Scraping?</span></h2>
<div>
<br /></div>
Web Scraping refers to <b>the software technique of extracting information from websites</b>. The information extracted can be either text or graphics. And, once gathered, it can be used for various purposes: from business to academic research or any other personal purpose.<br />
<br />
An important aspect of web scraping, which differentiates it from web crawling (the process of indexing info on the web, like Google and other search engines do), is that <b>web scraping focuses on the</b> <b>transformation of unstructured data</b>, typically in HTML format on the web,<b> into structured data </b>that can be stored and analyzed in a central local database or spreadsheet.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjehU7r4CQtY-unCznPgukp0I6pJJ3sZA9BhZEqLNquyrYNSs35X-dXk5K3j5gzTQ1wOdiRIhrCC5gcW1JfLs8xMTfAqTDK8oEPNek31DZ6aKceKzuap5XX3pHjqBEweSirYnF5zbkmqXM/s1600/structured+data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="web scraping " border="0" height="107" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjehU7r4CQtY-unCznPgukp0I6pJJ3sZA9BhZEqLNquyrYNSs35X-dXk5K3j5gzTQ1wOdiRIhrCC5gcW1JfLs8xMTfAqTDK8oEPNek31DZ6aKceKzuap5XX3pHjqBEweSirYnF5zbkmqXM/s1600/structured+data.jpg" title="from unstructured to structured data" width="400" /></a></div>
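<br />
For readers who are curious what "from unstructured to structured" looks like in code, here is a minimal sketch in base R that turns a small HTML table into a data frame. The HTML snippet and its numbers are made up for illustration, and the regular expressions only cope with simple, attribute-free markup like this; real pages usually need a proper HTML parser.<br />

```r
# A tiny HTML table (made-up values, for illustration only)
html <- '<table>
<tr><th>Index</th><th>Last</th></tr>
<tr><td>Example A</td><td>6850.5</td></tr>
<tr><td>Example B</td><td>9700.2</td></tr>
</table>'

# Split the markup into rows, then pull the cell text out of each row
rows <- regmatches(html, gregexpr("<tr>.*?</tr>", html, perl = TRUE))[[1]]
cells <- lapply(rows, function(r) {
  m <- regmatches(r, gregexpr("<t[hd]>.*?</t[hd]>", r, perl = TRUE))[[1]]
  gsub("<[^>]+>", "", m)   # strip the tags, keep the text
})

# First row is the header, the rest is data
header <- cells[[1]]
body   <- do.call(rbind, cells[-1])

quotes <- data.frame(body, stringsAsFactors = FALSE)
names(quotes) <- header
quotes$Last <- as.numeric(quotes$Last)   # text becomes a number we can analyse
print(quotes)
```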
<br />
Web scraping can be performed with several different techniques and technologies, each offering a different level of automation for finding and extracting data from the web. I am not going into depth on web scraping technologies since I am not an expert. However, to get an idea, we can think of <b>different levels of web scraping automation</b> ranging from:<br />
<br />
<br />
<ul>
<li>on one extreme, basic human copy-and-paste, which is a long and tedious operation if you need to scrape lots of datasets. Nevertheless, sometimes copy-and-paste is the only workable solution and even very advanced technology cannot replace it. This happens, for example, when a website has built barriers that prevent automated programs from scraping its content.</li>
</ul>
<div>
<br /></div>
<ul>
<li>on the other extreme, web scraping software that interacts with websites in a similar way to a web browser. But instead of displaying the HTML document on screen, the software quickly extracts the desired content (for example only some specified fields like product, sku, price) from the HTML and saves it in a local file on your machine or in an external database.</li>
</ul>
<br />
<br />
<br />
<h3>
Use Cases: Why Should You Want to Scrape the Web?</h3>
<div>
<br /></div>
The practical uses of web scraping are potentially endless. Each person or business has its own specific needs for extracting data from the web. While it's impossible to create a complete list of web scraping uses, below I provide a few popular reasons for scraping the web.<br />
<br />
<ul>
<li><b>Research</b>: finding the right data on the web is a very important activity for <b>academic, scientific, marketing researchers or financial analysts</b>. Whatever the field of research, they all have to answer some specific questions. And to do that, they need to find appropriate data, possibly from several different websites, combine it in a single spreadsheet and analyse it. Having some handy web scraping tool will make their work much more effective.</li>
</ul>
<div>
<br /></div>
<ul>
<li><b>Competition Analysis</b>: a key activity for marketers and sales people is researching the competition, which often means visiting competitor websites, industry directories, etc. The data they look for, such as prices and product features, is probably displayed in HTML tables. Besides marketers, every one of us is a potential customer on the web: we look for products and services, and often compare prices before making a decision. Why not save the data we find on different web pages into a spreadsheet, and make a decision from there? </li>
</ul>
<div>
<br /></div>
<ul>
<li><b>Lead Generation</b>: again, this is a fundamental task for marketers, which involves visiting company websites, industry directories/exhibitions, yellow pages or social networks like LinkedIn in order to find potential buyers. The data they look for are customer names, addresses, phone numbers, emails, etc. </li>
</ul>
<div>
<br /></div>
<ul>
<li><b>Journalism</b>: as the digital era offers an incredible quantity of free data online, the field of journalism is changing the way it tells stories to readers. <b>A new field called Data Journalism is rapidly emerging </b>(<a href="http://www.analyticsforfun.com/2014/07/3-great-examples-of-data-journalism.html" target="_blank">learn about Data Journalism here</a>), and it's based on collecting, filtering and <a href="http://en.wikipedia.org/wiki/Data-driven_journalism" target="_blank">analysing large datasets for the purpose of creating news stories</a>. With this new emphasis on data, the job of journalists shifts its main focus from being the first to report a piece of news, to <a href="http://datajournalismhandbook.org/1.0/en/introduction_1.html" target="_blank">being the ones telling readers what a certain fact might actually mean</a>. In this sense, <b>scraping data on the web is going to become a primary activity for journalists</b>, a new opportunity to build unique stories, less likely to be picked up by others. </li>
</ul>
<div style="text-align: center;">
<a href="http://www.analyticsforfun.com/2014/07/3-great-examples-of-data-journalism.html" style="background-color: #4caf50; border-radius: 4px; border: none; border: solid 1px; color: white; cursor: pointer; display: inline-block; font-size: 16px; margin: 4px 2px; padding: 15px 32px; text-align: center; text-decoration: none;" target="_blank"><b>Discover Data Journalism</b></a> </div>
<br />
<br />
<h2>
<span style="font-size: large;">
3 Web Scraping Tools for Non-Programmers</span></h2>
<div>
<br /></div>
<div>
<h3>
1. Table Capture (Chrome Extension)</h3>
<div>
<br /></div>
<div>
<a href="https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop" target="_blank">Table Capture is an extension</a> that you can add to your Google Chrome browser and use while you navigate through web pages. It gives you the ability to quickly <b>copy HTML tables to the clipboard and use them in a spreadsheet</b>, like Microsoft Excel, Google Docs or Open Office.<br />
<br />
<u>Installation </u><br />
<u><br /></u>
I assume you already have Google Chrome installed on your machine. If so, go to the <a href="https://chrome.google.com/webstore/category/apps" target="_blank">Google Chrome Extensions page</a>, search for "<a href="https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop" target="_blank">Table Capture</a>" and add it to your browser. Make sure the extension is active (you can disable it whenever you like, by going to Settings-->Extensions from the Chrome main menu).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8O5GXXGuut36zk2BPxjM5F8vpDzO9c0ew8fpq7xO_5iUXAE0ZTbrOCBta_PU1hjklnjjw6XTUQpQSidXG0eotLj47JbRbwrSejhWB6Ad9IBRZNZ7Ku2yJ1wKQedFN7dmDdmDhMOJ_N6w/s1600/ChromeExtTable+capture.png" imageanchor="1"><img alt="Google Chrome Table Capture Extension" border="0" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8O5GXXGuut36zk2BPxjM5F8vpDzO9c0ew8fpq7xO_5iUXAE0ZTbrOCBta_PU1hjklnjjw6XTUQpQSidXG0eotLj47JbRbwrSejhWB6Ad9IBRZNZ7Ku2yJ1wKQedFN7dmDdmDhMOJ_N6w/s1600/ChromeExtTable+capture.png" title="" width="400" /></a></div>
<br />
<br />
<u>How to Scrape Data with Table Capture</u><br />
<br />
Let's say we are looking for some <a href="http://markets.ft.com/research/Markets/Overview" target="_blank">markets data from the Financial Times webpage</a>. There are various tables on this page; however, getting this data into a spreadsheet is not really easy. Without this extension we would probably try to select the first table, copy it and paste it into a spreadsheet. But we would soon realize that Excel pastes the data in the wrong format, so the result is not very useful.<br />
<br />
Using the Table Capture extension, the data scraping process is easy. While navigating through web pages that contain tables, you will see <b>a red icon appearing at the top of your browser</b>. If you click on the icon, it will bring up a list of all the tables found on the webpage. If there is a small number of tables, you can quickly scan the list displayed by Table Capture and identify the one you would like to export (look at the size). Otherwise, if you find it difficult to identify the right table from the menu, I recommend clicking on "display inline" and <b>a copy-to-clipboard menu will appear every time you mouse over a table in the web page</b>.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjzndEIEnh6Q9hDg9k9h8JdI-trUChoBLX0TYz3BfIWrzbFKkPKvO5j6N2gfDkXT2Y3STRmJatyTOu8nY8fBcHCCbK42bXagjbfpd1VeED4eTDOeHWMUcregJ29q15Gho7ikOsougDfzE/s1600/financial+times+table+capture.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Scraping Data with Table Capture " border="0" height="247" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjzndEIEnh6Q9hDg9k9h8JdI-trUChoBLX0TYz3BfIWrzbFKkPKvO5j6N2gfDkXT2Y3STRmJatyTOu8nY8fBcHCCbK42bXagjbfpd1VeED4eTDOeHWMUcregJ29q15Gho7ikOsougDfzE/s1600/financial+times+table+capture.jpg" title="" width="400" /></a></div>
<br />
<br />
Once Table Capture detects a table, you can either:<br />
<br />
a) copy it to the clipboard and then paste it (ctrl+v) into an Excel spreadsheet, or<br />
b) extract it directly into a Google Docs spreadsheet (you must be logged in with a Google account)<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiumOf4Mes8Pn0Q-cWSGZdKnRBRoNvIQ3vSi5-6gXYMKEfKbXVRrmqxjpNN0uMrS10UaB2t4Fc9nTK7kHCnPPGNXT6mSCQg2Izjh4jLzXr2Bq3_WE1XybPsMdyB4hm3SdcN-aC9Qh0Sh_0/s1600/excel+table+capture.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiumOf4Mes8Pn0Q-cWSGZdKnRBRoNvIQ3vSi5-6gXYMKEfKbXVRrmqxjpNN0uMrS10UaB2t4Fc9nTK7kHCnPPGNXT6mSCQg2Izjh4jLzXr2Bq3_WE1XybPsMdyB4hm3SdcN-aC9Qh0Sh_0/s1600/excel+table+capture.jpg" width="400" /></a></div>
<br />
<br />
I found that this Table Capture extension can be very useful especially if you work on projects where you research a lot on the web and need to answer questions quickly, based on data. This tool will allow you to get data out of webpages quickly, and import it into your favourite spreadsheet tool where you will process it further (cleaning, etc.) or directly perform some analysis.<br />
<br />
<br /></div>
</div>
<h3>
2. Clipboard to Table (Firefox Extension)</h3>
<div>
<br /></div>
If you prefer using Firefox to browse the web, luckily there is a web scraping add-in for it too. It works pretty much the same as the Chrome extension, with the difference that it also <b>allows selecting only certain rows/columns of an HTML table</b>.<br />
<br />
<u>Installation</u>:<br />
<br />
Assuming you have already installed Mozilla Firefox, you can <a href="https://addons.mozilla.org/en-US/firefox/addon/dafizilla-table2clipboard/" target="_blank">download Clipboard to Table from the Mozilla Add-ons page</a>. Make sure the add-in is enabled in your browser settings.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit8H7DyxTFl_3PomaYIyI4ktdPOCW3e7vfIa-R8f2uPALZNzfTgl-pQF9mi6X4Xr4efPSOIo-BEcGI__whG74fxGZjnBIIRXef2glZviaQJGxPKdozsARCkOKwAipe652ai1XoVUqFXbM/s1600/mozilla+addon.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Firefox Clipboard to Table add-on" border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEit8H7DyxTFl_3PomaYIyI4ktdPOCW3e7vfIa-R8f2uPALZNzfTgl-pQF9mi6X4Xr4efPSOIo-BEcGI__whG74fxGZjnBIIRXef2glZviaQJGxPKdozsARCkOKwAipe652ai1XoVUqFXbM/s1600/mozilla+addon.jpg" title="" width="400" /></a></div>
<br />
<br />
<u>How to Scrape Data with Clipboard to Table</u>:<br />
<br />
Scraping web data with Clipboard to Table is even easier than with the previous tool. Just place your mouse cursor over a table, <b>right click</b> and among the various options you will see one named "Table2Clipboard". From there you can choose to <b>copy the whole table or only a specific row/column</b>, like in the image below. That's it. The table is saved in your clipboard and ready to be pasted into your favourite spreadsheet.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifjJVmXc5krmdCt9P4DpbRX5Q9zaoxuiT8hfchAz1mc-5doHWcdyve7tXTz-Mf_gSyRzxT8Z5M2nFBCC8CyYMcCOI_kiY9Pk9CrZkTZvEhLR-oyBHl3ud3ziM8DNSPOQy32bfnlP1Jcfw/s1600/financial+clipboard2table.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Scrape Data with Clipboard to Table" border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifjJVmXc5krmdCt9P4DpbRX5Q9zaoxuiT8hfchAz1mc-5doHWcdyve7tXTz-Mf_gSyRzxT8Z5M2nFBCC8CyYMcCOI_kiY9Pk9CrZkTZvEhLR-oyBHl3ud3ziM8DNSPOQy32bfnlP1Jcfw/s1600/financial+clipboard2table.jpg" title="" width="400" /></a></div>
<br />
<br />
<h3>
3. Google Docs Spreadsheets</h3>
<div>
<br /></div>
<div>
Very few people maximize the potential of Google docs tools. <a href="https://www.youtube.com/watch?v=9AyoRkr4I3U&feature=youtu.be" target="_blank">Google docs Spreadsheet has been through many improvements</a> over the last year, and among the many features offered, a very interesting one is the possibility to <b>extract data from HTML tables and import it directly in the spreadsheet</b>.<br />
<br />
<u>Installation</u><br />
<br />
You must have a Google account to access Google docs. Once logged in Google, go to <a href="https://drive.google.com/" target="_blank">Google Drive page</a> and click on Create--> Spreadsheets.<br />
<br />
<br />
<u>How to Scrape Data with Google Spreadsheets </u><br />
<u><br /></u>Go to any blank cell within Google Spreadsheet and type in the following formula:<br />
<br />
<i>= importHTML("URL","table",N)</i><br />
<br />
which has 3 arguments:<br />
- the first one is the URL of the webpage containing the table; it needs to be placed between double quotes<br />
- the second argument indicates that it is a table. Just leave "table" in this case (it depends on the type of query you like to do, for example you could also request a "list" of elements within the web page)<br />
- and the third argument N indicates the number of the table within the web page; counting starts from 1. My recommendation, in order to find the right table number, is to <b>start trying from 1 and increment the number until you get the correct table</b>.<br />
<br />
As an example, let's try to extract the same Financial Times table as above, this time through Google Spreadsheet. The formula would be:<br />
<br />
<i>= importHTML("</i><i>http://markets.ft.com/research/Markets/Overview","table",1)</i><br />
<br />
If everything goes well, the data table should be extracted from the web
page and appear directly into your spreadsheet, as below. Amazing!<i> </i><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEDquszIFevwm5TtosV_mI8oH1cRD1kSlVbTdvshTqAsDlOv95Zb7uuCiV7vKY-U9nA8ZBsHWNvey7mq2WM1J-sOEK36fp-_5fI55dY6J2sIDHbWt5nYaTA6ww8bJXPL4511xINazV5Bg/s1600/financial+Google+Sheets.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Scrape Data with Google Docs Spreadsheet" border="0" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEDquszIFevwm5TtosV_mI8oH1cRD1kSlVbTdvshTqAsDlOv95Zb7uuCiV7vKY-U9nA8ZBsHWNvey7mq2WM1J-sOEK36fp-_5fI55dY6J2sIDHbWt5nYaTA6ww8bJXPL4511xINazV5Bg/s1600/financial+Google+Sheets.png" title="" width="400" /></a></div>
<br />
<br />
The very cool thing about this data scraping option is that if the HTML table is updated on the website, <b>the data in your spreadsheet will be updated too</b> when you refresh the Google Docs spreadsheet.<br />
<br />
<br />
In this post I wanted to give a brief introduction to web scraping and present 3 simple tools every one of us can use to extract data without coding. Of course there are much more sophisticated scraping tools on the market, and if you have programming skills you can write your own script to extract data from web pages (<a href="http://www.analyticsforfun.com/search/label/R" target="_blank">R</a> is a good option).<br />
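<br />
If you do want to try scripting it, the sketch below mimics in base R what the third argument of importHTML does: it picks the N-th table on a page, counting from 1. The page content and the helper "nth_table" are hypothetical, written just for this example, and the pattern only matches plain table tags without attributes.<br />

```r
# A made-up page with two tables: a navigation menu and the data we want
page <- paste0(
  "<table><tr><td>menu</td></tr></table>",
  "<p>Some text</p>",
  "<table><tr><th>Index</th><th>Last</th></tr>",
  "<tr><td>Example A</td><td>100.5</td></tr></table>"
)

# Hypothetical helper: return the markup of the n-th <table>, counting from 1
nth_table <- function(html, n) {
  tables <- regmatches(html, gregexpr("<table>.*?</table>", html, perl = TRUE))[[1]]
  if (n < 1 || n > length(tables)) {
    stop("page has only ", length(tables), " table(s)")
  }
  tables[n]
}

# Try 1, 2, ... until the table you want shows up
for (n in 1:2) cat(n, ":", nth_table(page, n), "\n")
```

Here table 1 is just the menu and table 2 holds the data, which is exactly the situation where incrementing N in the spreadsheet formula pays off.<br />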
<br />
Please share comments and any other interesting web scraping tool we can add to the ones presented here. Thanks! </div>
<div>
<br /></div>
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-37660494629489968822014-04-29T19:48:00.000+01:002015-08-27T01:24:48.346+01:00
How to Move your Blog from Tumblr to Blogger in 10 Steps
<i>Premise: <u>this post is not about <a href="http://www.analyticsforfun.com/search/label/web%20analytics" target="_blank">web analytics</a></u>. Still, I've decided to include it here as this is the place where I am present online. At the end of the day I am a blogger too. And like all bloggers, sooner or later we have to get to know different blogging platforms, choose the one most suitable for our own objectives, and focus on writing about the things we like. Hope you find it useful!</i><br />
<i><br /></i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_W2mG0cFole4JkcOVFTFDlisSGzo-u45DTVJfYEv121sQAwGVeX7sSmEA7iCorB-lOzKNKCDoJlR3pOk8-huABMKc4onplX108wFy0QY2v8wB9UAtv1dziTNCxMaZK3T0EZAq6XlJrko/s1600/if_tumblr_then_blogger.png++1200%C3%97630+.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="move from tumblr to blogger" border="0" height="160" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_W2mG0cFole4JkcOVFTFDlisSGzo-u45DTVJfYEv121sQAwGVeX7sSmEA7iCorB-lOzKNKCDoJlR3pOk8-huABMKc4onplX108wFy0QY2v8wB9UAtv1dziTNCxMaZK3T0EZAq6XlJrko/s1600/if_tumblr_then_blogger.png++1200%C3%97630+.png" title="" width="400" /></a></div>
<i><br /></i>
Less than two months ago I decided to move all my blog posts from Tumblr to Blogger. Why? There were actually things I really enjoyed about being part of the Tumblr community. But I clearly had more reasons (I am not going into details here) to make the move to a simple and popular platform like Blogger. <br />
<br />
Below I report the main steps I took (successfully, thank God) to move my blog posts from Tumblr to Blogger. I will try to do it in a very quick and simple way, as I hope your migration will be. For each step I provide the link to the tool/tutorial you can refer to for more info. Ok, let's start. Good luck.<br />
<br />
<h3>
10 STEPS TO MOVE YOUR BLOG FROM TUMBLR TO BLOGGER & OPTIMIZE IT FOR SEO</h3>
<br />
<a name='more'></a><br />
<br />
<h3>
1. Export tumblr posts in HTML (use ad hoc tool)</h3>
The first step is to export all your Tumblr posts into an HTML file, so that you can later import it into other blogging platforms. I didn't find any tool inside Tumblr offering this service, but fortunately there are a couple of tools available on the internet, doing the right job for you. I suggest the "<a href="http://tumblr2wordpress.benapps.net/" target="_blank">Tumblr2Wordpress tool</a>": once in the page, select HTML as Exported Content Format, and export for "Self-Hosted Wordpress Installation".<br />
<br />
<h3>
2. Turn the HTML file created into an XML file (use ad hoc tool)</h3>
Once you have completed Step 1, you need to convert the HTML file into an XML file so that it can be imported into the Blogger platform. Again, I guess there are various tools available on the web; I personally used the "<a href="http://wordpress2blogger.appspot.com/" target="_blank">Wordpress2Blogger Application</a>", uploaded my HTML file and got back an XML file. Very easy.<br />
<br />
For steps 1 and 2 a great explanation is provided also by <a href="http://www.jumbodumbothoughts.com/2013/05/on-migrating-from-tumblr-to-blogger.html" target="_blank">JumboDumboThoughts</a>.<br />
<br />
<br />
<h3>
3. Log into Blogger and create a new user, if you don't have one yet. Create your new blog.</h3>
<br />
<br />
<h3>
4. Import the XML file into Blogger</h3>
From your Blogger dashboard, click on <b>Settings--> Other --> Import Blog</b> and upload the XML file that you have hopefully saved somewhere on your local machine. I recommend keeping a backup copy of both your HTML and XML files on your local machine or, better, on the web. You might need them in the future.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi632XyS_OY6tGGEkBspsVCTo2fsgF8Tmogmt4q9jhfqraWEQRHiNTevLemGYlL6pvY2eeRhVBerSkDUWDYbpnVlO0w_-Jj2NDNfVMQ4pjYhdEx-A2hkGll4TK3WvsUZxOkYYAgMjtFayU/s1600/Blogger++import+xml.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="import your XML file in Blogger" border="0" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi632XyS_OY6tGGEkBspsVCTo2fsgF8Tmogmt4q9jhfqraWEQRHiNTevLemGYlL6pvY2eeRhVBerSkDUWDYbpnVlO0w_-Jj2NDNfVMQ4pjYhdEx-A2hkGll4TK3WvsUZxOkYYAgMjtFayU/s1600/Blogger++import+xml.png" title="" width="400" /></a></div>
<br />
<br />
In my case, all posts were imported fine except for some videos, for which I had to review links and titles (some Tumblr post types did not have titles). My recommendation is to manually check each post for errors or formatting problems introduced during the migration.<br />
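<br />
Part of this check can be scripted. Below is a minimal Python sketch, assuming the XML file you import is in Atom format and that a post without a title shows up as an entry with an empty or missing title element. The function name and the tiny sample feed are hypothetical, made up only for illustration; a real file may also contain entries that are not posts.<br />

```python
import xml.etree.ElementTree as ET

# Atom namespace (assumed to be the format of the file you import).
ATOM = "{http://www.w3.org/2005/Atom}"

def find_untitled_entries(xml_text):
    """Return the ids of entries whose title is missing or blank."""
    root = ET.fromstring(xml_text)
    untitled = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        if not title.strip():
            untitled.append(entry.findtext(ATOM + "id", default="(no id)"))
    return untitled

# Hypothetical sample: one post with a title, one without.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>post-1</id><title>My first post</title></entry>
  <entry><id>post-2</id><title></title></entry>
</feed>"""

print(find_untitled_entries(SAMPLE))
```

The script only tells you which posts to look at first; it does not replace opening each post and checking links and layout by eye.<br />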
<br />
Unfortunately, I have not been able to export Tumblr comments, which were managed <a href="http://help.disqus.com/customer/portal/articles/286778-migration-tools" target="_blank">through Disqus</a>.<br />
<br />
<br />
<h3>
5. Set up your new Design on Blogger</h3>
At this point, hopefully all your posts have been correctly imported into Blogger and are ready to be published. Almost ready, actually. This is the moment to review your blog design, template and the look of your posts.<br />
<br />
You can do all of this from your main dashboard. Go to Layout and play a bit with the gadgets. Remember that, in addition to the default gadgets available from Blogger, you can create/add your own gadget. Just click on <b>Add Gadget--> HTML/JavaScript</b> and paste in the corresponding code. You can find plenty of nice gadgets on the web, such as "About me" or "Twitter" widgets.<br />
<br />
<br />
<h3>
6. (Optional) Set up your Custom Domain on Blogger, if you have one</h3>
If you have a custom domain and want to use it for your new blog on Blogger, you can. Blogger offers two publishing options for your blog: hosting on Blogspot
(example.blogspot.com) and <a href="https://support.google.com/blogger/troubleshooter/1233381?rd=1" target="_blank">hosting on your own custom domain</a>
(www.example.com or for me www.analyticsforfun.com).<br />
<br />
In my case I wanted to use a custom domain I had previously registered. What I did was:<br />
<br />
a) Set my custom domain as the default domain on Blogger: from the dashboard, go to <b>Settings-->Basic-->Publishing</b> and enter the domain address you would like to use;<br />
<br />
b) Configured DNS on my domain provider's website (I registered my domain with <a href="https://www.hover.com/" target="_blank">Hover</a>).<br />
<br />
<br />
<h3>
7. Fix duplicated content</h3>
I believe this is an important aspect to keep in mind when migrating your content from one platform to another. Google doesn't like <a href="https://support.google.com/webmasters/answer/66359?hl=en" target="_blank">duplicated content</a> and may penalize it. To avoid your blog being penalized (I mean, losing ranking in search engine results), I think a savvy solution is to use <a href="https://support.google.com/webmasters/answer/93633" target="_blank">301 redirects</a> on your old blog platform: these redirects send visitors and search engine crawlers to your new domain and make it very clear which URL should be indexed.<br />
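<br />
To make the idea concrete, here is a minimal Python sketch of what a 301 redirect table does. It is only an illustration under assumptions: the paths, the domain names and the function name are all hypothetical, and whether you can configure real server-side 301s depends on what your old platform allows.<br />

```python
from urllib.parse import urlsplit

# Hypothetical mapping from old post paths to new post paths.
PATH_MAP = {
    "/post/123/hello-world": "/2013/05/hello-world.html",
}

def redirect_for(old_url, path_map, new_base):
    """Return the (status, Location) pair a 301 redirect would answer with.

    Paths that are not in the map fall back to the new home page.
    """
    old_path = urlsplit(old_url).path or "/"
    new_path = path_map.get(old_path, "/")
    return 301, new_base.rstrip("/") + new_path

print(redirect_for("http://example.tumblr.com/post/123/hello-world",
                   PATH_MAP, "http://www.example.com"))
```

The point of the one-to-one mapping is that each old URL tells crawlers exactly which new URL replaces it, rather than sending everything to the home page.<br />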
<br />
Another option for managing the duplicated content issue could be to hide your old Tumblr blog from search engines. I am not sure if this is a valid solution, but it is certainly faster. From your Tumblr dashboard, go to Settings--> click on your blog avatar--> scroll down to Directory and make sure the option "Allow search engines to index your blog" is not active.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii9hHmzP-FgLuB9XInRRDL4Pq33hEApjmeKdHjxFAWH8gN7DDjblE-PKo-YCQvS-G02OfKZ1sQZwSG93MEnJfvAD1v7mY9L8DBKsGzmm2eNTlH7PQsDNUX-ibQ-XwtkVHvmvIE_UH1txQ/s1600/Tumblelog+Settings+++Tumblr.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="deactivate search engines indexing your tumblr blog" border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii9hHmzP-FgLuB9XInRRDL4Pq33hEApjmeKdHjxFAWH8gN7DDjblE-PKo-YCQvS-G02OfKZ1sQZwSG93MEnJfvAD1v7mY9L8DBKsGzmm2eNTlH7PQsDNUX-ibQ-XwtkVHvmvIE_UH1txQ/s1600/Tumblelog+Settings+++Tumblr.png" title="" width="400" /></a></div>
<br />
<br />
Whichever option you choose, you should be able to keep your old Tumblr blog and do not need to delete it. Of course, from now on, your new blog will live on Blogger.<br />
<br />
<br />
<h3>
8. Publish your Posts</h3>
Once you have double-checked everything and decided to go live with all your posts on Blogger, simply click "Publish" on the Posts menu. The posts will be published with their original dates (post dates are imported too, within the XML file).<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnJsk4mi-AB8nuQzSsJWCsViw5ArLr4N9reydZ2km7ynDqBaNBxTmfb9-mvulr2AOUxNeuxkJ2kS6SfBPzzfqUX_zVuJERhjWnhONFdn665uHDIzC9PiEyuHR-sNBS29k9Mj2y4vFuTd0/s1600/Blogger++publish.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Publish your Posts in Blogger" border="0" height="85" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnJsk4mi-AB8nuQzSsJWCsViw5ArLr4N9reydZ2km7ynDqBaNBxTmfb9-mvulr2AOUxNeuxkJ2kS6SfBPzzfqUX_zVuJERhjWnhONFdn665uHDIzC9PiEyuHR-sNBS29k9Mj2y4vFuTd0/s1600/Blogger++publish.png" title="" width="400" /></a></div>
<br />
<br />
<h3>
9. (Optional) Submit the new Sitemap to Google and other Search Engines</h3>
In simple words, a <a href="https://support.google.com/webmasters/answer/156184?hl=en" target="_blank">Sitemap</a> is a list of the pages on your website, like a table of contents showing the structure of your blog. Creating and submitting a Sitemap helps make sure that search engines (Google, Bing, etc.) crawl and index your blog/website properly.<br />
<br />
Creating a Sitemap for Blogger and submitting it to Google is a very simple process. Here are the steps I took:<br />
<br />
1) Append this string "<span style="color: orange;"><span style="background-color: white; font-family: 'Droid Sans'; line-height: 22.5px;">atom.xml?redirect=false&start-index=1&max-results=500</span></span><span style="background-color: white; color: #666666; font-family: 'Droid Sans'; line-height: 22.5px;">" </span><span style="background-color: white; font-family: 'Droid Sans'; line-height: 22.5px;">to your blog URL, as below:</span><span style="background-color: white; font-family: 'Droid Sans'; line-height: 22.5px;"><br /></span><br />
<span style="font-size: small;"><br /></span>
<span style="background-color: white; font-size: small; line-height: 22.5px;"><span style="font-family: Droid Sans;">http://www.analyticsforfun.com/</span></span><span style="background-color: white; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;"><span style="color: orange;">atom.xml?redirect=false&start-index=1&max-results=500</span></span><br />
<span style="background-color: white; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;"><span style="color: orange;"><br /></span></span>
<br />
By doing this, you have just created a Sitemap for your new Blogger blog. What you need to do next is tell search engines about your Sitemap so that they know the structure of your blog. For Google, you can submit the Sitemap through Google Webmaster Tools.<br />
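<br />
If you prefer to build the address programmatically, here is a minimal Python sketch that produces the same string as above. The function name is hypothetical. The pagination part is my assumption: the string above covers posts 1 to 500, so for a bigger blog the sketch simply emits further addresses with a higher start-index.<br />

```python
from urllib.parse import urlencode

def blogger_sitemap_urls(blog_url, total_posts, page_size=500):
    """Build one feed address per block of `page_size` posts."""
    base = blog_url.rstrip("/") + "/atom.xml"
    urls = []
    # Start indexes are 1, 501, 1001, ... (at least one address is returned).
    for start in range(1, max(total_posts, 1) + 1, page_size):
        query = urlencode({
            "redirect": "false",
            "start-index": start,
            "max-results": page_size,
        })
        urls.append(base + "?" + query)
    return urls

for url in blogger_sitemap_urls("http://www.analyticsforfun.com/", 36):
    print(url)
```

For a blog with fewer than 500 posts this yields exactly one address, the one shown above.<br />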
<br />
2) Log into Webmaster Tools and click on your blog/website (add it if you can't see it). Then go to <b>Sitemaps--> Add/Test Sitemaps</b> and submit it. In the blank field, you just need to enter this string: <span style="background-color: white; color: orange; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;">atom.xml?redirect=false&start-index=1&max-results=500 </span><br />
<span style="background-color: white; color: orange; font-family: 'Droid Sans'; font-size: 15px; line-height: 22.5px;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJZndCdWJEc_Cm9euz0F5Uib4k4_Nac53CzH84vv8XX2_2PL591Kx2iTHuL6pe10AjtKpjZnkluqrlKHmpS6NDGoXCWeVeloeCOh-_ADHDgp8PO6zTwwUHSyxWxEQFX_mRHgj3E6Q-UeU/s1600/sitemap+submit.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="google sitemap submit for blogger" border="0" height="143" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJZndCdWJEc_Cm9euz0F5Uib4k4_Nac53CzH84vv8XX2_2PL591Kx2iTHuL6pe10AjtKpjZnkluqrlKHmpS6NDGoXCWeVeloeCOh-_ADHDgp8PO6zTwwUHSyxWxEQFX_mRHgj3E6Q-UeU/s1600/sitemap+submit.jpg" title="" width="320" /></a></div>
<span style="background-color: white; color: orange; font-family: 'Droid Sans'; font-size: 15px; line-height: 22.5px;"><br /></span>
<span style="background-color: white; color: orange; font-family: 'Droid Sans'; font-size: 15px; line-height: 22.5px;"><br /></span><br />
<br />
Done. The sitemap has been submitted and, unless you receive errors, your blog is ready to be crawled and indexed by Google.<br />
<br />
Remember, this Sitemap submission is valid only for Google. If you want to <a href="http://blogtimenow.com/how-to/submit-url-google-bing-yahoo-add-website/" target="_blank">submit it to other search engines</a>, you should do it through their own tools. For example, for Bing you submit it through <a href="http://www.bing.com/toolbox/webmaster" target="_blank">Bing Webmaster tools</a>.<br />
<span style="background-color: white; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="background-color: white; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3qhFfJvQ63sWEuKnZOuplW7C2zu4PNwEblZodAgF0YN62FL6_09er4_PQoKasRDaru_5RmdnbK5V-uo48Swyfg1NgIobSsyjjs3UpOnkt65xga9aOcBH_HEUWOf4Ldjm7i1dtBBYh88M/s1600/sitemap+submit+bing.jpg" imageanchor="1"><img alt="bing sitemap submit" border="0" height="199" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3qhFfJvQ63sWEuKnZOuplW7C2zu4PNwEblZodAgF0YN62FL6_09er4_PQoKasRDaru_5RmdnbK5V-uo48Swyfg1NgIobSsyjjs3UpOnkt65xga9aOcBH_HEUWOf4Ldjm7i1dtBBYh88M/s1600/sitemap+submit+bing.jpg" title="" width="320" /></a></span></div>
<span style="background-color: white; font-family: 'Droid Sans'; font-size: small; line-height: 22.5px;">
<span style="background-color: white; color: #666666; font-family: 'Droid Sans'; font-size: 15px; line-height: 22.5px;"><br /></span>
<br />
</span><br />
<h3>
10. (Optional) Install Google Analytics </h3>
Now that your new Blogger blog is live, it's time to <a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html" target="_blank">measure its performance</a>. Is it receiving traffic? Where does it come from? Which posts engage your readers the most? These are all very important questions you should start asking yourself in order to optimize your blog. A very popular tool for tracking your Blogger traffic and answering the above questions is Google Analytics (from now on all websites will actually have <a href="http://www.analyticsforfun.com/2014/08/how-to-test-universal-analytics-before.html" target="_blank">Universal Analytics</a> installed, but no worries, the basic implementation process is the same). Installing it is very easy:<br />
<br />
1) <a href="https://support.google.com/analytics/answer/1008015?hl=en" target="_blank">Create a Google Analytics account</a>, if you don't have one yet.<br />
2) <a href="https://support.google.com/analytics/answer/1009618" target="_blank">Create a new property</a> for your blog: you will get a tracking code. To install it in Blogger, you actually don't need the entire JavaScript code, but just the Web Property ID (the UA code).<br />
3) From your Blogger dashboard, go to <b>Settings-->Other--> Google Analytics</b> and enter your Web Property ID there.<br />
<br />
Once you have completed the above steps, you should see data in your Analytics account within 24 hours.<br />
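<br />
A common slip is pasting the whole tracking snippet, or only the account number, instead of the Web Property ID. Here is a minimal Python sketch of a sanity check, assuming the ID has the usual shape UA-, then an account number, a dash, and a property number. The function name and the sample IDs are hypothetical.<br />

```python
import re

# Assumed shape of a Universal Analytics Web Property ID: UA-<account>-<property>.
UA_ID = re.compile(r"UA-\d+-\d+")

def is_web_property_id(text):
    """Return True if `text` looks like a UA Web Property ID."""
    return UA_ID.fullmatch(text.strip()) is not None

print(is_web_property_id("UA-12345678-1"))  # made-up ID with the expected shape
print(is_web_property_id("12345678"))       # account number only, not a full ID
```

The check only looks at the shape of the string; it cannot tell whether the ID actually belongs to your property.<br />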
<br />
<br />
***Bonus: Check out <a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html" target="_blank"><b>this post to learn how to get the most from Google Analytics for your blog</b></a>.<br />
<br />
<h4>
Conclusions</h4>
The process for moving from Tumblr to Blogger is quite simple and won't take you long if you follow the 10 steps I suggest above. The aim of this post was to centralize and summarize all the info I researched on the web when I moved my own blog, and to combine it with some <a href="http://blog.kissmetrics.com/get-google-to-index/" target="_blank">important SEO tips</a> to help optimize your content for search engines.<br />
<br />
Hopefully this guide will save you time. Please share your feedback if you think the migration process can be improved further.<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-16522992894189753042014-03-27T01:17:00.001+00:002016-07-24T21:23:05.524+01:00Gathering Business Requirements for a Google Analytics Project. A quick Template.<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTAiE-LoQ6wJPsvWwoHjrUlracsi-CXPuUG3cPbbeWIovWAR5InA4VS1_zS02s8mObOrvMj2ffT8a7swMxOfx9QvAbZYchdHmmB9U4psmEqsmTyAHGbqO4l5fLrDSVhYPMfzq491yTFRc/s1600/GArequirements.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Google Analytics Audit" border="0" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTAiE-LoQ6wJPsvWwoHjrUlracsi-CXPuUG3cPbbeWIovWAR5InA4VS1_zS02s8mObOrvMj2ffT8a7swMxOfx9QvAbZYchdHmmB9U4psmEqsmTyAHGbqO4l5fLrDSVhYPMfzq491yTFRc/s400/GArequirements.jpg" title="Capturing Business Requirements for a Google Analytics Project" width="400" /></a></div>
<br />
Every business has <b>unique objectives and data needs</b>. Because of that, a good Web Analytics project should always start with an audit aimed at understanding and gathering these unique business requirements.<br />
<br />
In fact, it's only by capturing the objectives of the website/online business that we will be able to create and implement an effective <b><a href="http://www.analyticsforfun.com/2014/10/how-i-measure-success-for-my-blog.html" target="_blank">measurement plan</a></b> and eventually take action on data. In <a href="http://www.kaushik.net/avinash/digital-marketing-and-measurement-model/" target="_blank">Avinash Kaushik's words</a>:<br />
<blockquote class="tr_bq">
<i>There is one difference between winners and losers when it comes to web analytics. Winners, well before they think data or tool, have a well structured Digital Marketing & Measurement Model. Losers don't.</i></blockquote>
<b>In this post I suggest a simple and quick template</b> that will help you capture business and technical requirements for a web analytics project, so that you can later develop an implementation plan. I am thinking in particular of digital analysts/agencies who offer <a href="http://www.google.com.ar/analytics/partners/" target="_blank">Google Analytics consulting services</a> to their clients.<br />
<a name='more'></a><br />
The following template can be useful when you sit down for the first time with a client (or a prospective one) and you know very little or nothing about their business on the web. Of course you should complement this initial audit with:<br />
<ol>
<li>A visit to the website (do it before meeting the client!): see how it looks, go to product pages, navigate as if you were a potential customer. And try to identify their business objectives and potential technical requirements yourself.</li>
<li>A more detailed analysis of their Web Analytics tool (if they are currently using one and, of course, if you have been given access beforehand). This step will let you audit the type of implementation they currently have and get a first picture of the data.</li>
</ol>
Scroll down the form to keep reading.<br />
<br />
<br />
<iframe frameborder="0" height="800" marginheight="0" marginwidth="0" src="https://docs.google.com/forms/d/16HyHlnnGzkPm-34V4lmrqqY7fa7bHkdD4wXudW-5Jzw/viewform?embedded=true" width="660">Loading...</iframe>
<br />
<br />
<br />
A couple of notes about the above template, before closing this post:<br />
<br />
<br />
<ul>
<li>the above questions were formulated with a <a href="http://www.google.com.ar/analytics/" target="_blank">Google Analytics</a> audit in mind (especially questions about GA implementation features). However, I believe most of them could be replicated for any Web Analytics tool;</li>
<li>I created the template using <a href="http://www.google.com/google-d-s/createforms.html" target="_blank">Google Forms</a>, though it is not intended to collect answers (Google Forms is normally used to create and send surveys, for example). I simply used it because I liked the format and I could embed it easily in my post;</li>
<li>the template could be sent to clients by email, instead of being filled out during a face-to-face meeting;</li>
<li>some of the questions will need more detailed answers than the ones I put in the template. I didn't leave a blank space in the form, but I encourage you to do so. The more detailed the info you can get from the client, the better your measurement and implementation plans will be.</li>
</ul>
<br />
<br />
I hope this simple template will be of use in your daily work. Please feel free to share advice, critique or any other questions you reckon I could add to the template.<br />
<br />
Thank you. <br />
<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0tag:blogger.com,1999:blog-8693029506171309303.post-33253967705393223602014-03-12T01:39:00.000+00:002014-10-18T15:18:55.714+01:00Free Analytics Education? My Personal Development Plan for 2014 <div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglnak9awggm4EbU9t_TXe6covLZLDQgVTAg8iA1egwJQ-ldR_tUXfnuU5buoxgLTFFsyBNQr1KjT3w2435n13WtAcFrtMkoncl6kXnWfYus74xddJ4pRiQt09zk8vEBpIm0s4kqVZ_qJc/s1600/dataScientiestSexy.png" imageanchor="1"><img alt="analytics education skills " border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglnak9awggm4EbU9t_TXe6covLZLDQgVTAg8iA1egwJQ-ldR_tUXfnuU5buoxgLTFFsyBNQr1KjT3w2435n13WtAcFrtMkoncl6kXnWfYus74xddJ4pRiQt09zk8vEBpIm0s4kqVZ_qJc/s1600/dataScientiestSexy.png" height="181" title="Free Analytics Education" width="400" /></a></div>
<br />
<br />
It seems that 2014 will be a great year for analytics professionals. It's not me saying that, but several authoritative sources in the business world. To mention a few of them:<br />
<br />
<ul>
<li>according to research company Gartner, by 2015 there will be <a href="http://www.gartner.com/newsroom/id/2207915" target="_blank">4.4 million big data jobs available</a>, and just a third of them will be successfully filled.</li>
<li>data scientist is gonna be <a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1" target="_blank">the sexiest job of the century</a>, according to Harvard Business Review. Earlier, in 2009, Hal Varian (chief economist at Google) said pretty much the same about statisticians during an interview.</li>
<li>Forbes states that <a href="http://www.forbes.com/sites/jaysondemers/2014/02/10/2014-is-the-year-of-digital-marketing-analytics-what-it-means-for-your-company/" target="_blank">2014 is gonna be the year of Digital Marketing Analytics</a> and companies need to get prepared for that (if they want to win more market share).</li>
<li>while many other sources emphasize the importance of relevant education <a href="http://thefuturebuzz.com/2013/12/04/analytics-most-desirable-skill-and-largest-talent-gap-for-2014/" target="_blank">to fill the talent gap in analytics</a>.</li>
</ul>
<br />
For us digital analysts, I guess this is all great news. But the other EVEN NICER NEWS - and that's the topic of this post - is that <b>a lot of this analytics education is freely available</b> on the internet. Ever heard about <a href="http://en.wikipedia.org/wiki/Massive_open_online_course" target="_blank">MOOCs</a>? I have taken a couple of those courses and, besides the zero cost to enroll, the QUALITY was what impressed me the most. A couple of my <a href="http://www.analyticsforfun.com/2013/03/ready-for-new-statistics-course-my.html" target="_blank">previous posts</a> actually came from some fantastic assignments I was given as part of the courses.<br />
<br />
Going back to 2014, I'd like to suggest 3 courses starting right this month. I have enrolled in all of them; they look very promising, and hopefully I will be able to complete them all.<br />
<br />
1) Google Analytics Platform Principles<br />
2) Making Sense of Data<br />
3) Doing Journalism with Data <br />
<br />
<a name='more'></a><br />
<br />
<br />
<h4>
1) <a href="https://analyticsacademy.withgoogle.com/course02/preview" target="_blank">GOOGLE ANALYTICS PLATFORM PRINCIPLES</a></h4>
<br />
<span style="color: #274e13;"><u>Offered by</u>:</span> Google Analytics Academy<br />
<br />
<span style="color: #274e13;"><u>Starts</u>:</span> March 11 - March 27, 2014 (content available after end of the course)<br />
<br />
<u><span style="color: #274e13;">Content</span></u>: this is the second course launched by Google Analytics Academy, and comes after the successful <a href="https://analyticsacademy.withgoogle.com/explorer" target="_blank">Digital Analytics Fundamentals</a>, which saw more than 145,000 students signing up last October, and over 30,000 of them earning a certificate of completion. Google Analytics Platform Principles aims to go deeper into <b>how the Google Analytics platform collects, transforms and organizes the data</b> you see in the reports. Compared to the previous one, which focused more on how to plan a web analytics project (from a very business-oriented point of view), this course looks to be quite technical and specific about how the GA platform works.<br />
<br />
<span style="color: #274e13;"><u>Reasons why I am taking it</u>:</span> I am currently working for a digital marketing agency, and deal with Google Analytics on a daily basis. Given the size of the agency, I am in charge not only of data reporting & analysis, but also of the actual technical implementation of the platform for the client. I often find myself in a client-facing role, identifying requirements and turning them into actions. So, understanding how GA works in the "backend" is becoming more and more important in order to refine the implementation and offer unique and valuable insights for the client.<br />
<br />
I also hope to gain a good understanding of the main differences between GA for websites and GA for mobile applications, the latter being an area where I see more and more interest from clients.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/vQFJ4fP1E7o?feature=player_embedded' frameborder='0'></iframe></div>
<br />
<br />
<h4>
2) <a href="https://datasense.withgoogle.com/preview" target="_blank">MAKING SENSE OF DATA</a></h4>
<br />
<span style="color: #274e13;"><u>Offered by</u>:</span> Google <br />
<br />
<u><span style="color: #274e13;">Starts</span></u>: March 18 - April 4, 2014 (content available after end of the course)<br />
<br />
<span style="color: #274e13;"><u>Content</u>:</span> the course appears to be an introductory data analysis course targeting a general public that deals frequently, even if in different ways, with data. It covers the main data analysis activities such as <b>structuring, cleaning, analysing and visualizing data</b>, and topics like exploratory and predictive analysis, at least at a basic level I guess. In terms of tools, the course will introduce <a href="https://support.google.com/fusiontables/answer/2571232?hl=en" target="_blank">Google Fusion Tables</a>, an experimental web application aimed at gathering, visualizing and sharing large data tables.<br />
<br />
<span style="color: #274e13;"><u>Reasons why I am taking it</u>:</span> I am very interested in using Fusion Tables, understanding what type of things I can do with it (compared to a classic Excel spreadsheet, for example) and incorporating it into my daily work. I recently saw an amazing piece of work <a href="http://online-behavior.com/analytics/visualizations" target="_blank">visualizing Google Analytics data with Fusion Tables</a> and was very impressed.<br />
<br />
Of course I look forward to applying my knowledge in the final assignment and hopefully sharing it on this blog.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/ZVotO69oFMA?feature=player_embedded' frameborder='0'></iframe></div>
<br />
<br />
<h4>
3) <a href="http://www.datajournalismcourse.net/" target="_blank">DOING JOURNALISM WITH DATA: FIRST STEPS, SKILLS AND TOOLS</a></h4>
<br />
<span style="color: #274e13;"><u>Offered by</u>:</span> European Journalism Centre & Data Driven Journalism <br />
<br />
<span style="color: #274e13;"><u>Starts</u>:</span> from May 19, 2014 <br />
<br />
<span style="color: #274e13;"><u>Content</u>:</span> this course is about how data is changing the way journalism is done. It covers various topics such as <b>where to find relevant data, how to explore it and find stories</b>, and eventually how to transform it into <b>attractive visualizations</b>. In other words, it will teach the essential concepts, techniques and skills necessary to do data journalism and produce compelling data stories.<br />
<br />
<span style="color: #274e13;"><u>Reasons why I am taking it</u>:</span> I am not a journalist. I am a digital analyst. Then why this course? I do think TELLING A STORY is a very important part of any web/digital analyst's job. We need to find data, select the relevant subset (garbage in/garbage out), clean it, analyse it, turn it into attractive visualizations. But most importantly, think critically. And that is where this data journalism course can be of help.<br />
<br />
If you are interested in reading some great pieces of data journalism, I recommend following the <a href="http://www.theguardian.com/data" target="_blank">data section of The Guardian</a>.<br />
<br />
<br />
I hope you found this post useful. I have enrolled in all three courses, and hopefully will complete them. I plan to post some feedback soon, ideally some of the work I will have to do as part of the classes.<br />
<br />
Are you thinking of enrolling in any of these courses? Do you have any other suggestions? Please share your thoughts here, I will be happy to get in touch.<br />
<br />
<br />
<br />marquihttp://www.blogger.com/profile/00191739365763861374noreply@blogger.com0