May 13, 2014

Web Scraping for Non-Programmers: 3 Easy Tools to Extract Data from Websites

If you work with data and use the web as your main source for datasets, then you might have heard the term "web scraping". If you have not come across it yet, surely you have found some interesting data on the web with no download option available. No CSV file or Excel download. Nothing. Nada. Niente. And even your desperate copy-and-paste attempt has failed you. This is where web scraping comes in handy.

This post introduces web scraping, and I am going to present 3 tools any of us can use to "scrape" the web. Two of them can be used directly from your browser, while the third is available through Google Spreadsheets. But, most importantly, they are all free, very quick and easy to use, and do not require programming skills.

All right, let's define the topic of this post first. What the heck is Web Scraping?





What is Web Scraping?


Web Scraping refers to the software technique of extracting information from websites. The information extracted can be text or graphics. And, once gathered, it can be used for various purposes: from business to academic research or any other personal purpose.

An important aspect of web scraping, which differentiates it from web crawling (the process of indexing info on the web, like Google and other search engines do), is that web scraping focuses on the transformation of unstructured data, typically in HTML format on the web, into structured data that can be stored and analyzed in a central local database or spreadsheet.


As I mentioned above, web scraping can be performed with several different techniques and technologies, each of them offering a different level of automation for finding and extracting data from the web. I am not going to go in depth on web scraping technologies since I am not an expert. However, to get an idea, we can think of different levels of web scraping automation ranging from:


  • on one extreme, the very basic human copy-and-paste, which is a very long and tedious operation if you need to scrape lots of datasets. Nevertheless, sometimes copy-and-paste is the only workable solution, and even very advanced technology cannot replace it. This happens in cases where the website has built barriers preventing automated programs from scraping its content.

  • on the other extreme, web scraping software that interacts with websites in a similar way to a web browser. But instead of displaying the HTML document on screen, the web scraping software quickly extracts the desired content (for example, only some specified fields like product, SKU, price) from the HTML syntax and saves it in a local file on your machine or in an external database.
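To give a rough idea of that second extreme, here is a minimal Python sketch (standard library only; the product table HTML is invented for the example) showing how a scraper can turn the unstructured HTML of a table into structured rows:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []          # structured output: one list per table row
        self._row = None        # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Invented product table, standing in for a real page's HTML
html = """
<table>
  <tr><th>product</th><th>sku</th><th>price</th></tr>
  <tr><td>Widget</td><td>W-1</td><td>9.99</td></tr>
</table>
"""
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['product', 'sku', 'price'], ['Widget', 'W-1', '9.99']]
```

Real scraping software does much more than this (fetching pages, handling messy markup), but the core job is the same: HTML in, rows and columns out.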



Use Cases: Why Would You Want to Scrape the Web?


The practical uses of web scraping are potentially endless. Each person or business has its own specific needs for extracting data from the web. While it's impossible to create a complete list of web scraping uses, below are a few popular reasons for scraping the web.

  • Research: finding the right data on the web is a very important activity for academic, scientific and marketing researchers, as well as financial analysts. Whatever the field of research, they all have to answer some specific questions. And to do that, they need to find appropriate data, possibly from several different websites, combine it in a single spreadsheet and analyse it. Having a handy web scraping tool will make their work much more effective.

  • Competition Analysis: a key activity for marketers and sales people is researching the competition, which often means visiting competitor websites, industry directories, etc. The data they look for can be prices or product features, which are probably displayed in HTML tables. Besides marketers, every one of us is a potential customer on the web: we look for products and services and often do price comparisons before making a decision. Why not save the data we found on different web pages into a spreadsheet, and make a decision from there?

  • Lead Generation: again, this is a fundamental task for marketers, which involves visiting company websites, industry directories/exhibitions, yellow pages or social networks like LinkedIn in order to find potential buyers. The data they look for includes customer names, addresses, phone numbers, email addresses, etc.



3 Web Scraping Tools for Non-Programmers


1. Table Capture (Chrome Extension)


Table Capture is an extension that you can add to your Google Chrome browser and use while you navigate through web pages. What this extension does is give you the ability to quickly copy HTML tables to the clipboard and use them in a spreadsheet, like Microsoft Excel, Google Docs or OpenOffice.

Installation

I assume you already have Google Chrome installed on your machine. Go to the Google Chrome Extensions page, search for "Table Capture" and add it to your browser. Make sure the extension is active (you can disable it whenever you like by going to Settings --> Extensions from the Chrome main menu).

Google Chrome Table Capture Extension


How to Scrape Data with Table Capture

Let's say we are looking for some market data on the Financial Times webpage. There are various tables on this page, but getting this data into a spreadsheet is not really easy. Without this extension we would probably try to select the first table, copy it and paste it into a spreadsheet. But we would realize that Excel puts all the data in the wrong format, which is not very useful.

Using the Table Capture extension, the data scraping process is easy. While navigating through web pages that contain tables, you will see a red icon appear at the top of your browser. If you click on the icon, it will bring up a list of all the tables that it found on the webpage. If there is a small number of tables, you can quickly scan the list displayed by Table Capture and identify the one you would like to export (look at the size). Otherwise, if you find it difficult to identify the right table from the menu, I recommend clicking on "display inline", and a copy-to-clipboard menu will appear every time you mouse over a table in the web page.


Scraping Data with Table Capture


Once Table Capture detects a table, you can either:

 a) copy it to the clipboard and then paste it (Ctrl+V) into an Excel spreadsheet, or
 b) extract it directly into a Google Docs spreadsheet (you must be logged in with a Google account)




I found that the Table Capture extension can be very useful, especially if you work on projects where you research a lot on the web and need to answer questions quickly, based on data. This tool will allow you to get data out of webpages quickly and import it into your favourite spreadsheet tool, where you can process it further (cleaning, etc.) or directly perform some analysis.


2. Clipboard to Table (Firefox Extension)


If you prefer using Firefox to browse the web, luckily there is a web scraping add-on for it too. It works pretty much the same as the Chrome extension, with the difference that it also allows selecting only certain rows/columns of an HTML table.

Installation:

Assuming you have already installed Mozilla Firefox, you can download Clipboard to Table from the Mozilla Add-ons page. Make sure the add-on is enabled in your browser settings.


Firefox Clipboard to Table add-on


How to Scrape Data with Clipboard to Table:

Scraping web data with Clipboard to Table is even easier than with the previous tool. Just place your mouse cursor over a table, right click, and among the various options you will see one named "Table2Clipboard". From there you can choose to copy the whole table or only a specific row/column, like in the image below. That's it. The table is saved in your clipboard and ready to be pasted into your favourite spreadsheet.


Scrape Data with Clipboard to Table


3. Google Docs Spreadsheets


Very few people maximize the potential of the Google Docs tools. Google Docs Spreadsheet has been through many improvements over the last year, and among the many features offered, a very interesting one is the possibility to extract data from HTML tables and import it directly into the spreadsheet.

Installation

You must have a Google account to access Google Docs. Once logged in, go to the Google Drive page and click on Create --> Spreadsheet.


How to Scrape Data with Google Spreadsheets 

Go to any blank cell within your Google Spreadsheet and type in the following formula:

= importHTML("","table",N)

which has 3 arguments:
- the first one is the URL of the webpage containing the table; it needs to be placed between double quotes
- the second argument indicates that it is a table. Just leave "table" in this case (it depends on the type of query you like to do, for example you could also request a "list" of elements within the web page)
- and the third argument N indicates the number of the table within the web page; counting starts from 1. My recommendation for finding the right table number is to start from 1 and increment it until you get the correct table.

As an example, let's try to extract the same table as above from the Financial Times, this time through Google Spreadsheet. The formula would be:

= importHTML("http://markets.ft.com/research/Markets/Overview","table",1)

If everything goes well, the data table should be extracted from the web page and appear directly in your spreadsheet, as below. Amazing!


Scrape Data with Google Docs Spreadsheet


The very cool thing about this data scraping option is that if the HTML table is updated on the website, the data in your spreadsheet will be updated too when you refresh the Google Docs spreadsheet.


In this post I wanted to give a brief introduction to web scraping and present 3 simple tools every one of us can use to extract data without coding. Of course there are many more sophisticated scraping tools on the market, and if you have programming skills you can write your own script to extract data from web pages (R is a good option).
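To give a flavour of the scripting route, here is a minimal sketch in Python (another good option; standard library only, with invented helper names and sample HTML) that mimics the importHTML formula: it parses every table on a page, returns the N-th one, and saves it as a CSV file:

```python
import csv
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects every <table> on a page as a list of row lists."""
    def __init__(self):
        super().__init__()
        self.tables = []        # one entry per <table> found
        self._rows = None       # rows of the table currently being read
        self._row = None        # cells of the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

def import_html(html, n):
    """Like =importHTML(url, "table", N): return the N-th table, counting from 1."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[n - 1]

# To scrape a live page you would first fetch its HTML, e.g.:
#   import urllib.request
#   html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
# Here we use a small invented page with two tables instead.
html = ("<table><tr><td>ignore</td></tr></table>"
        "<table><tr><th>product</th><th>price</th></tr>"
        "<tr><td>Widget</td><td>9.99</td></tr></table>")

table = import_html(html, 2)          # second table, like N=2 in the formula
with open("table.csv", "w", newline="") as f:
    csv.writer(f).writerows(table)    # save the structured data locally
print(table)                          # [['product', 'price'], ['Widget', '9.99']]
```

This is only a sketch (it would not cope with nested tables, for instance), but it shows how little code is needed once you do take the programming route.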

Please share comments and any other interesting web scraping tool we can add to the ones presented here. Thanks!