We’ve talked all about how the Internet and the Web work. But how do we put this into practice to obtain data from websites?
There are many interesting things to extract out of Web pages.
Sometimes websites make it easy to get these things; sometimes we have to do it the hard way.
These days, lots of websites provide ways for programs to easily interact with them to extract data. Rather than forcing you to scrape the pages the hard way, a site might provide a simple interface to request data and perform common operations. These are usually referred to as APIs (Application Programming Interfaces), and work using ordinary HTTP requests.
For example, Twitter has APIs to search tweets or integrate with Direct Messages (e.g. so you can make a customer service robot). GitHub’s API lets you extract information about public repositories, receive notifications about events on repositories you have access to (we used this in Stat Computing to record pull request approvals for grading), make comments, open issues, and so on. Wikipedia’s API lets you fetch pages, make edits, upload files, and so on (it’s often used by robots that correct common formatting mistakes and detect vandalism). The arXiv API lets you search papers, download abstracts, and fetch PDFs.
Many of these APIs use the idea of REST, or Representational State Transfer. REST APIs often use HTTP requests that send and receive data in XML or JSON formats.
The basic idea is this: each resource – a user, a repository, an issue – is identified by its own URL, and you act on it with ordinary HTTP methods: GET to fetch it, POST to create it, PATCH to modify it, DELETE to remove it. For example, a single GitHub issue lives at a URL like
https://api.github.com/repos/36-750/documents/issues/1
and a GET request to that URL returns the issue's data as JSON.
Because REST APIs use simple, well-defined data formats and HTTP requests, it’s often easy to make packages that wrap the API in the programming language of your choice. PyGitHub, for example, provides Python functions and classes that automatically do the necessary REST API calls to GitHub. If you want to use a well-known website’s API, check if there’s a package for your language.
Let’s try using GitHub’s API without using a package.
library(httr)
## This API endpoint fetches all of my repositories
## https://developer.github.com/v3/repos/#list-user-repositories
r <- GET("https://api.github.com/users/capnrefsmmat/repos")
status_code(r) # 200
headers(r)[["content-type"]] # "application/json; charset=utf-8"
str(content(r))
## An enormous list, starting with
## List of 14
## $ :List of 72
## ..$ id : int 51103677
## ..$ node_id : chr "MDEwOlJlcG9zaXRvcnk1MTEwMzY3Nw=="
## ..$ name : chr "confidence-hacking"
## ..$ full_name : chr "capnrefsmmat/confidence-hacking"
## ..$ private : logi FALSE
## ..$ owner :List of 18
## .. ..$ login : chr "capnrefsmmat"
## .. ..$ id : int 711629
## .. ..$ node_id : chr "MDQ6VXNlcjcxMTYyOQ=="
## .. ..$ avatar_url : chr "https://avatars3.githubusercontent.com/u/711629?v=4"
## .. ..$ gravatar_id : chr ""
## .. ..$ url : chr "https://api.github.com/users/capnrefsmmat"
## ...
Notice that httr helpfully parses the JSON returned by GitHub into nested lists for us, using the jsonlite package. It did that by inspecting the Content-Type header.
jsonlite can also read from websites directly, so if you just need to do GET requests that return JSON, you can just do something like
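library(jsonlite)
## A sketch: fromJSON() accepts a URL and fetches it for you
repos <- fromJSON("https://api.github.com/users/capnrefsmmat/repos")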
jsonlite, being clever, notices that the data is a list of repositories, each with the same attribute names, and makes a data frame out of the result instead of lists of lists of lists. (Some of the columns of the data frame, like owner, are data frames themselves…)
I can also use a POST request to create a repository, though this requires me to prove to GitHub who I am, which I’ll leave out for simplicity:
library(httr)
r <- POST("https://api.github.com/user/repos",
          body = list(name = "new-repository",
                      description = "My cool repo"),
          encode = "json")
APIs often do authentication with “tokens” – basically a secret password you send along with each request, usually in a header – or with OAuth, which is somewhat complicated and best left to packages that handle all its details.
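For example, here’s a sketch of token authentication with httr. It assumes you’ve generated a GitHub personal access token and stored it in an environment variable named GITHUB_TOKEN (that name is just for illustration); GitHub expects the token in an Authorization header:
library(httr)
## Assumption: a personal access token is stored in this environment variable
token <- Sys.getenv("GITHUB_TOKEN")
r <- POST("https://api.github.com/user/repos",
          add_headers(Authorization = paste("token", token)),
          body = list(name = "new-repository"),
          encode = "json")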
Sometimes the data you want isn’t available through a convenient API, but it’s right there in the HTML, so surely you can get it out somehow!
This is scraping. It requires a few steps: making HTTP requests to fetch the pages, parsing the HTML, and extracting the data you want from the parsed page.
Last time we talked about packages like Requests and httr that make HTTP requests, so let’s skip that step and talk about parsing HTML and extracting data.
HTML, the HyperText Markup Language, is a way of marking up text with various attributes and features to define a structure – a structure of paragraphs, boxes, headers, and so on, which can be given colors and styles and sizes with another language, CSS (Cascading Style Sheets).
For our purposes we don’t need to worry about CSS, since we don’t care what a page looks like, just what’s included in its HTML.
HTML defines a hierarchical tree structure. Let’s look at a minimal example:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>An Awesome Website</title>
</head>
<body>
<h1 id="awesome">Awesome Website</h1>
<p id="intro">
This is a paragraph of text linking
to <a href="http://example.com/page">another Web page.</a>
<p>This is another paragraph to introduce data &amp; stuff:
<table class="datatable" id="bigdata">
<thead>
<tr><th>Key</th><th>Value</th></tr>
</thead>
<tbody>
<tr id="importantrow"><td>Foo</td><td>Bar</td></tr>
<tr><td>Baz</td><td>Bam</td></tr>
</tbody>
</table>
<!-- This is a comment and is ignored -->
<p>Notice the use of &lt;thead&gt; and &lt;tbody&gt; tags, which are
actually optional.
</body>
</html>
Notice some features of HTML:
- Tags are written in angle brackets, like <h1>. There is a set of tags with predefined meanings, like <p> for paragraphs and <table> for tables.
- Tags open with <p> and close with </p>, and everything in between is enclosed in those tags. All tags have to be closed, except those that don’t. (Since people writing web pages were very bad at remembering to close tags, browsers now have standard rules for inferring when you meant to close a tag; notice the paragraphs above aren’t closed.)
- Tags can have attributes: id is used for unique identifiers for elements, which can be used in JavaScript or CSS to modify those elements, and a class can be assigned to many elements which should somehow behave in the same way.
- The characters <, >, and & have specific meanings in HTML. If you want to write < without it starting a new tag, you have to escape it by writing &lt;. There are many escapes, like &copy; for the copyright symbol, and numeric escapes for specifying arbitrary Unicode characters. These are called “HTML entities”.
HTML’s complex structure makes it difficult to parse; the HTML standard chapter on syntax has headings going as deep as “12.2.5.80 Numeric character reference end state”. Do not attempt to parse HTML with regular expressions.
If you use an HTML parsing package – rvest uses the libxml2 parser underneath, while Beautiful Soup lets you pick a parser such as lxml or html5lib – it will handle all the complexity for you.
For instance, applied to the example HTML file above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("example.html", "r").read(), "lxml")
ps = soup.find_all("p") # get a list of the 3 <p> tags
ps[0]["id"] # "intro"
ps[-1].string # 'Notice the use of <thead> and <tbody> tags, which are\n actually optional.\n '
Or, in R,
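a sketch with rvest might look like this (html_nodes, used here, is the selector function discussed below):
library(rvest)
page <- read_html("example.html")
ps <- html_nodes(page, "p")   # the three <p> tags
html_attr(ps[1], "id")        # "intro"
html_text(ps[3])              # text of the last paragraph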
Often web pages are made of huge complicated HTML documents, and you only need the contents of a few specific tags. How do you extract them from the page?
There are several ways to do this, but they come down to needing a selector: some specification of the type or name of the tag we want.
There are several common types of selector. The simplest is the CSS selector, used when making Cascading Style Sheets. A CSS selector might look like this:
.datatable tr#importantrow td
That means:
- .datatable selects any element with the class “datatable”.
- tr#importantrow selects, inside that, a tr element with the ID “importantrow”.
- td selects, inside that, any td element.
These are interpreted hierarchically, so put together in one selector, this identifies all td elements inside a tr whose ID is “importantrow” inside some element with class “datatable”. This will match two td elements in the example above. (Note that the tbody is not in the selector, but that is not a problem; any td inside a tr#importantrow matches, even if there are enclosing tags in between.)
There are various other syntaxes. We can write p > a to find a tags immediately inside p without any enclosing tags, so specifically excluding a situation like
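this hypothetical fragment, where the a is wrapped in an em rather than sitting directly inside the p:
<p>Some text with <em><a href="http://example.com">a link inside an em</a></em>.</p>
Here p > a does not match the a, but the plain descendant selector p a does.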
We can use .class and #id on tags or without a tag name, depending on how specific we want to be. There are other kinds of selectors, like selectors for tags with specific attributes; the MDN selector tutorial is a good starting point to learn more.
rvest uses CSS selectors by default. The html_nodes function I used above takes a CSS selector and returns a list of HTML tags matching that selector, then lets you do things to them.
Beautiful Soup supports CSS selectors too, via its select method. Continuing the earlier example, something like this gets a list of the two tags matching the selector:
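soup.select(".datatable tr#importantrow td")  # [<td>Foo</td>, <td>Bar</td>]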
Another common syntax is XPath, although people often don’t like this because it’s very complicated. rvest supports XPath if you want it. An XPath selector like
//table[@class='datatable']//tr[@id='importantrow']//td
does the same thing as the CSS selector above. XPath can express arbitrarily complicated queries with all kinds of conditions.
(I did not actually try this XPath selector to make sure it works.)
You can also try using your browser’s developer tools to find the HTML tags and selectors you need; let’s do a live demo of that.
Typical scraping might involve something like this: start with a few known URLs, fetch and parse each page, extract the data you want, find links to further pages of interest, and add those to the queue of pages to fetch next.
This is quite a common pattern because we usually don’t have all the URLs we want to scrape in advance. If I’m scraping Wikipedia, rather than downloading a list of all the Wikipedia pages in advance, I’d rather start with a few pages and follow links to find others.
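As a rough sketch of that loop in Python (the starting URL is just an example, and a real scraper would also limit itself to relevant pages, respect robots.txt, and pause between requests, as discussed below):
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

queue = deque(["http://quotes.toscrape.com/"])   # example starting page
seen = set(queue)

while queue:
    url = queue.popleft()
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # ... extract and store whatever data we want from soup here ...
    for link in soup.find_all("a", href=True):
        target = urljoin(url, link["href"])
        if target not in seen:
            seen.add(target)
            queue.append(target)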
Web site owners, however, often don’t like this pattern. If you don’t restrain your scraping script, it might send dozens of requests per second to fetch new pages, and it may scrape parts of the website that are not intended to be accessible to robots – things like dynamically generated pages that are slow to make, or private user profiles, or copyrighted images, or other things they don’t want to be downloaded en masse.
To prevent this, Web site owners can use robots.txt, a standard file that specifies what robots should be allowed to do on a website. The standard format is quite simple and easy to read.
A robots.txt file is placed in the root directory of a website, like http://www.example.com/robots.txt. It is a plain text file (no formatting, not RTF, just text) with contents like
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
User-agent: *
Crawl-delay: 5
This asks Googlebot to ignore certain directories entirely and asks all other robots to wait 5 seconds between requests. (The Crawl-delay directive is unofficial and not respected by all robots.)
You should respect robots.txt if possible. The R package robotstxt can parse robots.txt files and tell you what pages you’re allowed to scrape, and Python’s urllib has a robotparser module as well.
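For example, a minimal sketch with Python’s urllib.robotparser, assuming the example file above were actually served at www.example.com:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

rp.can_fetch("Googlebot", "http://www.example.com/cgi-bin/script")  # False
rp.can_fetch("mybot", "http://www.example.com/some-page.html")      # True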
Python users may prefer to use Scrapy, a package that automatically handles everything: processing robots.txt, maintaining a queue of pages to visit, extracting data from pages, and storing data in an output file. Here’s an example from the documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
This starts at a specific URL, selects elements from the requested page (using both CSS and XPath selectors), and yields a dictionary of values selected from the page, as well as yielding subsequent pages to visit. Scrapy handles the scheduling, outputs the data to a JSON file (you run Scrapy at the command line to specify the output file location), and automatically skips requests forbidden by robots.txt (provided you set the option to do so).
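To run it, you’d save the spider to a file and use something like scrapy runspider quotes_spider.py -o quotes.json at the command line (the file names here are just examples).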
Sometimes it’s not enough to scrape a website by sending it HTTP requests directly, or to use its API. Maybe the website involves a bunch of JavaScript to be run by a Web browser or it doesn’t like being accessed by robots.
In that case, you need a real browser.
Tools like Selenium let you automate a web browser. The Selenium WebDriver lets you start a Web browser – like Chrome or Firefox – and control it from a program, then reach in and inspect the contents of the web pages being displayed. Selenium can be used from within Python and R, and from many other languages. Here’s a Python example from the documentation:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
The element with name q is the search box, so we are literally typing the text pycon into that box, hitting Enter, and checking that the string “No results found” is not in the resulting page.
Selenium can find elements by name, but there are also methods like driver.find_element_by_css_selector and find_element_by_xpath, among others. You can even get screenshots of the page if you want to do some kind of image analysis or interact with some graphical thing, using driver.save_screenshot.
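For example, continuing the session above (a sketch; the file name is arbitrary):
driver.save_screenshot("search-results.png")        # write a PNG of the current page
links = driver.find_elements_by_css_selector("a")   # every link on the results page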
In the example above, Selenium uses Firefox. (You need to have Firefox installed separately.) It also supports Chrome and Internet Explorer.
Just remember that you usually don’t need this. If you just need the contents of Web pages, use a package for Web scraping or for HTTP requests; you don’t need the massive complexity of Web browsers unless you’re depending on the browser to do things like play videos and run JavaScript.