As statisticians and data scientists, we often get data from the Web. We write scripts that download CSVs from websites and sometimes we even write scrapers that browse websites to extract data of interest the hard way.
There is a sea of acronyms and technologies involved. To connect to a website, your browser uses DNS to translate a hostname into an IP address, sends an HTTP GET request to that host over the Internet, perhaps verifying the remote host with TLS, and receives an HTTP response encapsulating an HTML body containing references to numerous other resources that also have to be fetched with HTTP GET requests. Sometimes you make POST requests to submit forms. Sometimes websites have REST API endpoints that provide JSON replies to URL-encoded queries, but only if you provide the right Accept header and pass the right cookies.
Wat?
Let’s break this down.
The Internet is the collective name for all the computers and routers connected together in one big system that we use. Email, Web sites, video calls, your Internet-connected doorbell – they all use the Internet to communicate with other systems, even if they do not use HTTP or Web pages as we know them.
A machine connected to the Internet is a host. Each host has a hostname. On a Mac or Linux machine, you can run the hostname command to see your computer’s hostname; mine is currently Leibniz.local. That hostname means I’m on a network called local and my computer is called Leibniz, apparently because I had a fondness for calculus when I first named it.
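If you’d rather do this from code, Python’s standard socket module exposes the same information; a minimal sketch:

import socket

print(socket.gethostname())  # this machine's hostname, e.g. Leibniz.local
print(socket.getfqdn())      # the fully qualified version, if one is available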
Hostnames can be assembled into domain names, which identify hosts on the Internet. For example, andrew.cmu.edu is a fully qualified domain name. Domain names are hierarchical, separated by dots, and are read right-to-left:
edu, the top-level domain
cmu, Carnegie Mellon’s name registered under .edu
andrew, a name CMU created beneath cmu.edu
Top-level domains are created under ICANN’s authority, granting specific organizations authority to operate specific TLDs. The operators of those TLDs (like .edu or .org) then sell name registrations to organizations like CMU. CMU can then create its own domains underneath its cmu.edu registration, acting as its own registration authority for names like stat.cmu.edu and www.cmu.edu.
(Incidentally, it is quite important that domain names are hierarchical. This is how you know that securepayments.yourbank.com is run by the same people who run yourbank.com, whereas securepayments.yourbank.com.tech.mafia.ru is run by the Russian Mafia, despite containing the same substring. Phishing sites will often use this trick to try to confuse you; remember to read right-to-left!)
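If you ever need to make this check programmatically, comparing names label by label from the right is the safe approach. Here’s a minimal sketch in Python, using a hypothetical helper (not from any library):

def is_subdomain_of(hostname, domain):
    """Return True if hostname is domain itself or a subdomain of it."""
    host_labels = hostname.lower().rstrip(".").split(".")
    domain_labels = domain.lower().rstrip(".").split(".")
    # Read right-to-left: the hostname must end with the domain's labels.
    return host_labels[-len(domain_labels):] == domain_labels

print(is_subdomain_of("securepayments.yourbank.com", "yourbank.com"))                # True
print(is_subdomain_of("securepayments.yourbank.com.tech.mafia.ru", "yourbank.com"))  # False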
Not every computer has a publicly accessible fully qualified domain name. My laptop currently does not, for example; outside of my local network, the hostname Leibniz.local means nothing to anyone.
Domain names have to follow certain rules: for example, they do not contain spaces or underscore characters. Note that http://www.stat.cmu.edu or andrew.cmu.edu/foo/bar are not domain names; they are Uniform Resource Locators and refer to specific services hosted on specific domain names.
Now, a domain name doesn’t get you much. If I want my computer to send something to the host at andrew.cmu.edu, how am I supposed to do that? I need a way to know which physical machine to deliver to.
Internet routers and switches don’t know how to find domain names, but they do understand IP addresses. An IP address is a numerical address; every machine connected to the public Internet has an IP address. (We’ll skip NAT for now.)
IPv4 uses addresses like 172.16.254.1, with four parts each containing an 8-bit (0 to 255) number; IPv6, the successor, uses addresses like 2001:db8:0:1234:0:567:8:1, with eight parts each containing a 16-bit number encoded in hexadecimal. (The encoding is just for humans to look at; computers just look at the full 32-bit or 128-bit numbers.)
Crucially, IP addresses are hierarchical as well. For example, Carnegie Mellon owns the entire block of addresses from 128.2.0.0 to 128.2.255.255, which includes every address beginning with 128.2. The details are a bit out of scope here, but using the Border Gateway Protocol (BGP), CMU’s routers announce to the other routers they’re connected to, “Hey, I know how to deliver to any address starting with 128.2”, and those routers advertise to their neighbors “Hey, I have a way to get to 128.2,” and so on, and so every router on the Internet knows someone who knows someone who knows someone who can deliver the message.
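If you want to play with addresses and blocks yourself, Python’s standard ipaddress module understands this notation; 128.2.0.0/16 is the usual shorthand for “every address beginning with 128.2”. A minimal sketch:

import ipaddress

addr = ipaddress.ip_address("128.2.12.64")
cmu_block = ipaddress.ip_network("128.2.0.0/16")  # 128.2.0.0 through 128.2.255.255

print(addr.version)       # 4
print(addr in cmu_block)  # True: this address is inside CMU's block
print(ipaddress.ip_address("2001:db8:0:1234:0:567:8:1").version)  # 6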
The Internet is, basically, a big extended family where if you get a parking ticket, your sister says “oh, talk to our aunt, she knows the sister of a guy who was roommates with the cousin of the county clerk”, and the message gets passed from step to step until it gets to the right person.
But I don’t want to type in 128.2.12.64 to get the website at stat.cmu.edu. I want to just type in stat.cmu.edu. How do I do that?
The answer: the Domain Name System. When I type in stat.cmu.edu, my computer sends a DNS query to its friendly neighborhood DNS resolver (usually run by your Internet Service Provider). The DNS resolver follows several steps:
First, it asks a root DNS server which DNS server is responsible for .edu. A master list of root DNS servers and their IP addresses has to be distributed manually; several organizations maintain root zone servers on behalf of ICANN. The root zone server responds with the IP address of a DNS server for .edu, such as 2001:501:b1f9::30.
Next, it asks that .edu DNS server where to find the DNS server for cmu.edu, and receives a response like 128.237.148.168.
Finally, it asks the cmu.edu DNS server where to find stat.cmu.edu, and receives a response like 128.2.12.64.
Obviously this involves a lot of back-and-forth communication, so resolvers usually have a cache. Every DNS server gives a time-to-live for its responses, indicating how long they can be relied upon; the resolver saves all responses it has received for that time period, so subsequent requests can be given the same answer.
Typical TTLs are in the range of hours to a day or two, which is why sometimes after website maintenance it can take a while for your access to be restored – your resolver might have an old invalid DNS response cached.
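Your own programs go through this machinery every time they look up a name. For example, Python’s standard socket module will ask the resolver for you; a minimal sketch:

import socket

# Ask the resolver (and, through it, the DNS) for an IPv4 address.
print(socket.gethostbyname("stat.cmu.edu"))  # e.g. an address starting with 128.2

# getaddrinfo returns both IPv4 and IPv6 results, plus port information.
for family, _, _, _, sockaddr in socket.getaddrinfo("stat.cmu.edu", 443):
    print(family, sockaddr)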
When you see error messages like “example.com could not be found”, your computer could not find a record for it in the Domain Name System.
Some systems distribute data via DNS. Some organizations run DNSBLs or RBLs (DNS Blackhole Lists or Real-time Blackhole Lists), which provide ways to query whether a domain name or host is known to be involved in spam mail. If your email server receives a message from 192.168.42.23, it might do a DNS query for 23.42.168.192.dnsbl.example.net; if the example.net RBL indicates that this IP is known to send spam, it will return an IP address; otherwise it will return a message that no such record is known.
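The query name is just the IP address with its four numbers reversed, stuck in front of the blocklist’s domain. A minimal sketch in Python, keeping the placeholder blocklist dnsbl.example.net from above:

ip = "192.168.42.23"
# Reverse the four numbers and prepend them to the blocklist's domain name.
query_name = ".".join(reversed(ip.split("."))) + ".dnsbl.example.net"
print(query_name)  # 23.42.168.192.dnsbl.example.net

# A real check would then do a DNS lookup on query_name: an answer means the
# IP is listed, while a "no such record" error means it is not.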
Once you have received the IP address corresponding to a domain name, how do you communicate with it?
There are layers here; seven layers, specifically. Your computer needs a physical connection to other computers, either by cable or via electromagnetic field; communication over that connection has to be coordinated; messages have to be sent over that connection; those messages need to be reassembled into larger pieces with meaning; and those pieces need to be given meaning.
We’ll skip the physical layers. Suffice to say that Ethernet cables or Wi-Fi connections are involved.
Let’s talk about how we send messages and reassemble them. There are several ways to do this, but the one most often used is the Transmission Control Protocol.
TCP is a way of delivering messages between machines. It does not discriminate on the content of those messages; they may involve emails, video calls, Web pages, DNS queries, World of Warcraft game data, or anything else. TCP does not care or know anything about the specific uses of the messages.
TCP merely provides a way to deliver messages. It does so in a few steps:
First, your computer sends a connection request to the destination, and the two machines perform a handshake to establish the connection.
The message to be sent is then split into small numbered packets.
The packets are sent across the network to the destination, which acknowledges them as they arrive.
Finally, the destination reassembles the packets, in order, back into the original message.
TCP is meant to be robust: if packets are lost (which happens surprisingly often!) they are re-sent, and packets can arrive in any order and be reassembled.
When you see error messages like “Could not connect to host” or “Connection refused”, there was a problem at step 1: the server failed to reply to your TCP connection request. Maybe there is no server actually listening at that IP address, or that computer is temporarily disconnected.
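If you’re curious what step 1 looks like from a program’s point of view, here’s a minimal sketch using Python’s standard socket module; it assumes stat.cmu.edu is reachable and listening on port 443 (the usual HTTPS port):

import socket

# Ask the operating system to open a TCP connection. If nothing is listening
# at the other end, this is where you'd see "Connection refused".
with socket.create_connection(("stat.cmu.edu", 443), timeout=10) as sock:
    print("Connected to", sock.getpeername())  # the IP address and port we reached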
The World Wide Web refers to the vast collection of documents and applications interconnected with hyperlinks and available through the Internet. The documents are usually formatted with HTML and might contain pictures and video, though other files can be delivered as well.
The WWW is not the only thing on the Internet; email is transmitted via separate protocols over the Internet, even if you can access your mailbox via a Web page. The Web is just one massively influential use for the Internet.
To fetch a Web page, you need a Uniform Resource Locator to identify which page you want. URLs come in parts; here’s the syntax as given by Wikipedia:
URI = scheme:[//authority]path[?query][#fragment]
authority = [userinfo@]host[:port]
Let’s break that down:
The scheme: http and https for Web pages, mailto for email links, file for files on the local file system, and so on.
The userinfo (e.g. username:password). Optional. Sometimes used for FTP links so the link specifies the username to use.
The host and optional port: a domain name or IP address identifying the server, plus a port number if the scheme’s default isn’t wanted.
The path, made of parts separated by forward slashes (/ – no, these are not backslashes). The path specifies the specific resource or file being requested.
The query: a set of key-value pairs like foo=bar&baz=bin, introduced by a ?.
The fragment, following a #, pointing to a specific part of the document. For example, in an HTML page, the fragment may refer to a specific section or heading on the page.
Some URL examples:
https://en.wikipedia.org/wiki/URL?thing=baz
mailto:foo@example.com
ftp://alex:hunter2@ftp.example.com/home/alex/passwords
file:///Users/myuser/Documents/foo.txt
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==
Never try to parse URLs (e.g. to extract the domain name or a specific path component) with regular expressions. You will get it wrong and weird things will happen, since URLs can have very weird forms. Use a library or package that does it properly.
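In Python, for example, the standard library’s urllib.parse module does this properly; a quick sketch using the Wikipedia URL from above:

from urllib.parse import urlparse, parse_qs

parts = urlparse("https://en.wikipedia.org/wiki/URL?thing=baz")
print(parts.scheme)           # 'https'
print(parts.hostname)         # 'en.wikipedia.org'
print(parts.path)             # '/wiki/URL'
print(parse_qs(parts.query))  # {'thing': ['baz']}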
We use Web browsers – and various other programs – to view Web pages. All that communication between computers and servers is done via TCP, but there’s an extra layer on top of TCP to provide a standardized way to request specific Web pages, define the formats you are willing to accept in response, and identify yourself.
That extra layer is the Hypertext Transfer Protocol, or HTTP. It defines specific types of messages to be sent via TCP to make requests.
For example, here’s a request I sent to Wikipedia for the page https://en.wikipedia.org/wiki/World_Wide_Web:
GET /wiki/World_Wide_Web HTTP/2.0
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Cookie: enwikiUserName=Cap%27n+Refsmmat; [more redacted]
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
If-Modified-Since: Sun, 27 Jan 2019 03:20:05 GMT
There’s a lot going on here. Let’s break it down.
GET /wiki/World_Wide_Web HTTP/2.0
Host: en.wikipedia.org
This is a GET request using HTTP version 2.0. The first line specifies the type of request and the specific resource I’m requesting, /wiki/World_Wide_Web. The second line is the beginning of the headers, providing metadata about the request. I specify the domain name I am making the request for – it’s possible for multiple different domain names to have the same IP address, so I have to tell the receiving server which domain name I’m interested in.
More headers follow. A few interesting ones:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Cookie: enwikiUserName=Cap%27n+Refsmmat; [more redacted]
The User-Agent tells the server what kind of program I am. Servers sometimes use this to tell what version of a web page to serve – maybe a page uses some features that only work in Chrome or in specific versions of Internet Explorer – and servers often do this in stupid ways, so browser makers have been forced to make User-Agent headers progressively stupider.
For example, back in The Day™, only Netscape supported “advanced” HTML features like frames, so servers checked for “Mozilla” in the User-Agent before sending the fancy versions of pages; after Internet Explorer and others added support for frames, they had to add Mozilla to their User-Agent so websites would actually send them frames. So now every User-Agent starts with Mozilla even if it’s not made by Mozilla.
User-Agent headers are often used to identify robots; some websites categorically ban the default user-agents used by Python’s standard library, for example, because poorly written web scrapers often cause them problems. It’s considered good form to provide your own User-Agent when you write a scraper, and include enough identifying information in the header that it could be used to contact you to complain if your program causes problems.
The Accept header specifies what media type my browser is willing to accept in response to the request: HTML, XHTML, or XML pages, or anything else (*/*) if needed. Accept-Encoding says that my browser would prefer the response to be compressed with gzip, DEFLATE, or Brotli, to save bandwidth.
The Cookie header provides the server any cookies my browser has stored for this domain name. I’ve redacted several cookies containing authentication tokens that prove to Wikipedia that I’m a specific logged-in user. If I don’t send the Cookie header, Wikipedia will have no idea who I am.
There are many other types of headers, but these are the ones you may need in practice.
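When you make your own requests, you can set any of these headers yourself. Here’s a minimal sketch using the Python Requests library (introduced properly below); the User-Agent string and cookie value are made-up placeholders, not anything Wikipedia requires:

import requests

headers = {
    # A descriptive, made-up User-Agent identifying the scraper and its operator.
    "User-Agent": "stats-course-scraper/1.0 (contact: student@example.edu)",
    "Accept": "text/html",
}
cookies = {"enwikiUserName": "ExampleUser"}  # placeholder standing in for a real login cookie

r = requests.get("https://en.wikipedia.org/wiki/World_Wide_Web",
                 headers=headers, cookies=cookies)
print(r.status_code)  # 200 if all went well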
After a server receives an HTTP request, it (usually) sends a response. Responses have a similar format to requests:
HTTP/2.0 200 OK
date: Wed, 30 Jan 2019 19:14:42 GMT
content-type: text/html; charset=UTF-8
server: mw1272.eqiad.wmnet
x-powered-by: HHVM/3.18.6-dev
expires: Thu, 01 Jan 1970 00:00:00 GMT
content-language: en
set-cookie: enwikiSession=[redacted]; path=/; secure; httponly
set-cookie: enwikiUserID=92555; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; secure; httponly
set-cookie: enwikiUserName=Cap%27n+Refsmmat; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; secure; httponly
set-cookie: forceHTTPS=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; httponly
set-cookie: forceHTTPS=true; expires=Fri, 01-Mar-2019 19:14:42 GMT; Max-Age=2592000; path=/; domain=.wikipedia.org; httponly
set-cookie: centralauth_User=Cap%27n+Refsmmat; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; domain=.wikipedia.org; secure; httponly
set-cookie: centralauth_Token=[redacted]; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; domain=.wikipedia.org; secure; httponly
set-cookie: centralauth_Session=[redacted]; path=/; domain=.wikipedia.org; secure; httponly
last-modified: Sun, 27 Jan 2019 04:28:44 GMT
vary: Accept-Encoding,Cookie,Authorization,X-Seven
content-encoding: gzip
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
(some boring headers redacted)
After the headers is the body. In the above case, since Content-Encoding is gzip, the response body was the gzip-compressed HTML of the web page.
Notice the Set-Cookie headers asking my browser to store various cookies, the Content-Type specifying that this is a text/html page, and the various date and modification headers to tell the browser how long it may cache this page.
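Libraries take care of decompressing and parsing all of this. With Python’s Requests (again, introduced below), you can inspect the pieces directly; a minimal sketch using the same Wikipedia page:

import requests

r = requests.get("https://en.wikipedia.org/wiki/World_Wide_Web")
print(r.status_code)              # e.g. 200
print(r.headers["Content-Type"])  # e.g. 'text/html; charset=UTF-8'
print(r.text[:100])               # the start of the decompressed HTML body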
In this example, I sent a GET request to Wikipedia. GET is one of several methods that can be used:
GET asks the server to send back a resource, and is not supposed to change anything on the server.
POST submits data to the server, such as the contents of a form.
There are several others, like HEAD, PUT, and DELETE, which you will mostly meet when working with APIs.
(In 2005, Google released the Google Web Accelerator, which would preemptively make GET requests for links on the page you’re currently viewing, so those pages would be downloaded by the time you click a link. Unfortunately, some websites used GET requests for things like “delete this post”, and so users with Google Web Accelerator unintentionally deleted everything they had access to delete.)
When you just fetch a web page, you’re probably using GET.
Sometimes, when you’re scraping a site or using an API, you’ll need to use POST. Some APIs use it if you want to specify search terms and restrictions in a request; it’s also useful if you want to submit forms to a site. GET request URLs are also usually logged, so POST is used for things like login forms to keep your password out of the logs.
A POST request can contain arbitrary amounts of data, typically as key-value pairs.
Here’s an example POST request adapted from MDN:
POST /submit HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 13
say=Hi&to=Mom
We’re sending the keys say and to to the server at example.com.
There are other ways of encoding the contents of a POST request; this method is called urlencoded (because the key-value pairs resemble those in the query component of a URL), but we could send JSON or other data formats if the server knows how to read them.
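Jumping ahead to the Python Requests library introduced below, here’s a sketch that sends the same key-value pairs both ways; httpbin.org (also used in the later examples) is a testing service that echoes back what it receives:

import requests

# Send the same data urlencoded (data=) and as JSON (json=).
r1 = requests.post("https://httpbin.org/post", data={"say": "Hi", "to": "Mom"})
r2 = requests.post("https://httpbin.org/post", json={"say": "Hi", "to": "Mom"})
print(r1.json()["form"])  # {'say': 'Hi', 'to': 'Mom'}
print(r2.json()["json"])  # {'say': 'Hi', 'to': 'Mom'}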
We do a lot of sensitive things on the Web these days: manage our bank accounts, organize political protests, view medical records, search WebMD for embarrassing diseases, and view enough How It’s Made videos to make others doubt our sanity. We do these things even when we do not trust the people running our network: we’ll gladly use our credit card to buy stuff on Amazon while sitting in a coffee shop using their Wi-Fi, for example, even when they could make good money by stealing credit card information from all the people who sit for four hours in the coffee shop nursing one Americano and doing homework.
We frequently need three specific properties for our websites:
Confidentiality: nobody who can watch our network traffic can read the contents of our requests and responses.
Integrity: nobody can tamper with the requests and responses as they travel across the network.
Authentication: when we connect to large-evil-bank.com, we are receiving responses from the genuine Large Evil Bank Corporation, and not from an impostor trying to lure us into sharing bank details.
This is what HTTPS is meant to enable. HTTPS extends HTTP to use the Transport Layer Security protocol (TLS) to connect to the host server. TLS lets the server cryptographically prove its identity, and uses strong encryption to prevent anyone from eavesdropping on the connection or from manipulating its contents in any way.
Cryptographic authentication is tricky and relies on public-key cryptography. I’ll skip the math. One part worth knowing, though, is the idea of a “digital certificate” provided by a Web server to prove its identity; certificates are signed by certificate authorities, who vouch that they have verified the website’s identity and are trusted by Web browser manufacturers to do so. Certificates cannot be forged (unless the encryption system is broken).
You sometimes see certificate errors when you try to visit Web sites. These may happen because the certificate has expired (certificates last for a limited period of time), it vouches for a different website, or it has been issued by a certificate authority your browser does not recognize.
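If you want to see what a certificate looks like, Python’s standard ssl module can fetch and verify one for you; a minimal sketch, using the Wikipedia server from the earlier examples:

import socket
import ssl

hostname = "en.wikipedia.org"
context = ssl.create_default_context()  # trusts the certificate authorities your system trusts

# Open a TCP connection, perform the TLS handshake, and verify the certificate.
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()

print(cert["subject"])   # who the certificate vouches for
print(cert["issuer"])    # the certificate authority that signed it
print(cert["notAfter"])  # when it expires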
In the olden days, obtaining a certificate for your website cost money, so only businesses handling credit cards or sensitive data bothered to do so. That’s no longer true, and any website that can enable HTTPS should. It prevents common attacks, like coffee shops injecting ads into websites and hackers adding malicious code to the software packages you download.
HTTPS also blocks the Upside-Down-Ternet, which is perhaps its only downside.
So, this all looks like a confusing tangled mess. To send an HTTP request, you need to know how to form the right HTTP headers, urlencode your POST data, set Accept headers and who knows what, and then parse the headers you receive in return.
This is why packages exist.
In Python, just use Requests, branded as “HTTP for Humans™”. It “allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.” It handles encodings and POST and everything else.
For example:
import requests
r = requests.get('https://api.github.com/events')
print(r.text) # response body
## build a query string with key=val&key2=val2 syntax
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
## automatically format a POST request
r = requests.post('https://httpbin.org/post', data = {'key':'value'})
In R, you have httr:
library(httr)
r <- GET("http://httpbin.org/get")
str(content(r)) # response body
headers(r) # list of headers
## a request with a query string
r <- GET("http://httpbin.org/get",
query = list(key1 = "value1", key2 = "value2")
)
## a POST
r <- POST("http://httpbin.org/post", body = list(a = 1, b = 2, c = 3))
You will undoubtedly find a nice library for any programming language you want, because everyone wants to make HTTP requests.
However, you usually don’t want to send HTTP requests yourself. Often you’re interested in scraping a website; there are integrated packages like rvest that handle both HTTP requests and parsing HTML to extract desired content.
Next time we’ll talk about scraping more directly, including how to extract specific HTML elements using rvest or Beautiful Soup.