36-651/751: The Internet, the Web, and HTTP

– Spring 2019, mini 3 (last updated February 5, 2019)

As statisticians and data scientists, we often get data from the Web. We write scripts that download CSVs from websites and sometimes we even write scrapers that browse websites to extract data of interest the hard way.

There is a sea of acronyms and technologies involved. To connect to a website, your browser uses DNS to find a hostname to send an HTTP GET request to over the Internet, perhaps verifying the remote host with TLS, and receives an HTTP response encapsulating an HTML body containing references to numerous other resources that also have to be fetched with HTTP GET requests. Sometimes you make POST requests to submit forms. Sometimes websites have REST API endpoints that provide JSON replies to URL-encoded queries, but only if you provide the right Accept header and pass the right cookies.

Wat?

Let’s break this down.

The Internet

The Internet is the collective name for all the computers and routers connected together in one big system that we use. Email, Web sites, video calls, your Internet-connected doorbell – they all use the Internet to communicate with other systems, even if they do not use HTTP or Web pages as we know them.

Hosts, domain names, and IP addresses

A machine connected to the Internet is a host. Each host has a hostname. On a Mac or Linux machine, you can run the hostname command to see your computer’s hostname; mine is currently Leibniz.local.

That hostname means I’m on a network called local and my computer is called Leibniz, apparently because I had a fondness for calculus when I first named it.

Hostnames can be assembled into domain names, which identify hosts on the Internet. For example, andrew.cmu.edu is a fully qualified domain name. Domain names are hierarchical, separated by dots, and are read right-to-left:

[root zone]
Conceptually, every fully qualified domain name is part of the root zone, which is controlled by the Internet Corporation for Assigned Names and Numbers (ICANN). (ICANN was originally run under contract with the US Department of Commerce, but since 2016 it has been independent, operating with input from 111 countries.)
edu
A top-level domain (TLD) controlled by the organization Educause.
cmu
Carnegie Mellon’s domain name.
andrew
A specific host in Carnegie Mellon’s network.

Top-level domains are created under ICANN’s authority, granting specific organizations authority to operate specific TLDs. Those TLDs (like .edu or .org) then sell name registrations to organizations like CMU. CMU then can create its own domains underneath its cmu.edu registration, acting as its own registration authority for names like stat.cmu.edu and www.cmu.edu.

(Incidentally, it is quite important that domain names are hierarchical. This is how you know that securepayments.yourbank.com is run by the same people who run yourbank.com, whereas securepayments.yourbank.com.tech.mafia.ru is run by the Russian Mafia, despite containing the same substring. Phishing sites will often use this trick to try to confuse you; remember to read right-to-left!)
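Here is a minimal Python sketch of the "read right-to-left" rule. The helper names labels_right_to_left and is_under are made up for illustration; they are not from any library.

```python
def labels_right_to_left(domain):
    """Split a domain name into its labels, most significant first."""
    return list(reversed(domain.lower().rstrip(".").split(".")))


def is_under(domain, parent):
    """True if `domain` is `parent` itself or a name beneath it in the hierarchy."""
    d = labels_right_to_left(domain)
    p = labels_right_to_left(parent)
    return d[:len(p)] == p


good = "securepayments.yourbank.com"
bad = "securepayments.yourbank.com.tech.mafia.ru"

print(labels_right_to_left(good))        # begins with 'com', then 'yourbank'
print(is_under(good, "yourbank.com"))    # the bank's own subdomain
print(is_under(bad, "yourbank.com"))     # merely contains the same substring
print(is_under(bad, "mafia.ru"))         # who actually controls the name
```

Comparing whole labels from the right, rather than searching for a substring, is the entire point: a substring check would accept both names.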

Not every computer has a publicly accessible fully qualified domain name. My laptop currently does not, for example; outside of my local network, the hostname Leibniz.local means nothing to anyone.

Domain names have to follow certain rules: for example, they do not contain spaces or underscore characters. Note that http://www.stat.cmu.edu or andrew.cmu.edu/foo/bar are not domain names; they are Uniform Resource Locators and refer to specific services hosted on specific domain names.

Now, a domain name doesn’t get you much. If I want my computer to send something to the host at andrew.cmu.edu, how am I supposed to do that? I need a way to know which physical machine to deliver to.

Internet routers and switches don’t know how to find domain names, but they do understand IP addresses. An IP address is a numerical address; every machine connected to the public Internet has an IP address. (We’ll skip NAT for now.)

IPv4 uses addresses like 172.16.254.1, with four parts each containing an 8-bit (0 to 255) number; IPv6, the successor, uses addresses like 2001:db8:0:1234:0:567:8:1, with eight parts each containing a 16-bit number encoded in hexadecimal. (The encoding is just for humans to look at; computers just look at the full 32-bit or 128-bit numbers.)
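To see that the dotted and colon-separated forms are only notation, here is a small sketch using Python's standard ipaddress module with the two example addresses above.

```python
import ipaddress

v4 = ipaddress.ip_address("172.16.254.1")
v6 = ipaddress.ip_address("2001:db8:0:1234:0:567:8:1")

# The dotted form is just four 8-bit numbers packed into one 32-bit integer.
by_hand = (172 << 24) | (16 << 16) | (254 << 8) | 1
print(int(v4) == by_hand)

print(v4.version, v4.max_prefixlen)   # 4 32
print(v6.version, v6.max_prefixlen)   # 6 128
print(v6.exploded)                    # every group padded to four hex digits
```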

Crucially, IP addresses are hierarchical as well. For example, Carnegie Mellon owns the entire block of addresses from 128.2.0.0 to 128.2.255.255, which includes every address beginning with 128.2. The details are a bit out of scope here, but using the Border Gateway Protocol (BGP), CMU’s routers announce to other routers they’re connected to, “Hey, I know how to deliver to any address starting with 128.2”, and those routers advertise to their neighbors “Hey, I have a way to get to 128.2,” and so on, and so every router on the Internet knows someone who knows someone who knows someone who can deliver the message.
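A block like "everything beginning with 128.2" is usually written in CIDR notation as 128.2.0.0/16, meaning the first 16 bits are fixed. That notation is my addition, not something introduced above. A short sketch with the standard ipaddress module:

```python
import ipaddress

cmu = ipaddress.ip_network("128.2.0.0/16")

print(cmu[0], cmu[-1])        # first and last address in the block
print(cmu.num_addresses)      # 2**16 addresses

# Membership is a prefix comparison, which is what routers do when forwarding.
print(ipaddress.ip_address("128.2.12.64") in cmu)
print(ipaddress.ip_address("172.16.254.1") in cmu)
```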

The Internet is, basically, a big extended family where if you get a parking ticket, your sister says “oh, talk to our aunt, she knows the sister of a guy who was roommates with the cousin of the county clerk”, and the message gets passed from step to step until it gets to the right person.

But I don’t want to type in 128.2.12.64 to get the website at stat.cmu.edu. I want to just type in stat.cmu.edu. How do I do that?

The Domain Name System

The answer: the Domain Name System. When I type in stat.cmu.edu, my computer sends a DNS query to its friendly neighborhood DNS resolver (usually run by your Internet Service Provider). The DNS resolver follows several steps:

[root zone]
It asks DNS servers run by the root zone where to find the DNS servers for .edu. A master list of root DNS servers and their IP addresses has to be distributed manually; several organizations maintain root zone servers on behalf of ICANN. The root zone server responds with the IP address of a DNS server for .edu, such as 2001:501:b1f9::30.
edu
It asks the DNS server for .edu where to find the DNS server for cmu.edu, and receives a response like 128.237.148.168.
cmu
It asks the DNS server for cmu.edu where to find stat.cmu.edu, and receives a response like 128.2.12.64.

Obviously this involves a lot of back-and-forth communication, so resolvers usually have a cache. Every DNS server gives a time-to-live for its responses, indicating how long they can be relied upon; the resolver saves all responses it has received for that time period, so subsequent requests can be given the same answer.
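The walk from the root down, plus the cache, can be sketched as a toy in Python. Nothing here touches the network: the SERVERS table just hard-codes the example answers from the steps above, the one-hour TTL is an assumed value, and the class name ToyResolver is invented for illustration. A real resolver also caches the intermediate answers, which this sketch skips for brevity.

```python
import time

# Toy delegation data copied from the walkthrough above. Each "server" is a
# dict mapping a name to the address it would answer with.
SERVERS = {
    "root": {"edu": "2001:501:b1f9::30"},
    "2001:501:b1f9::30": {"cmu.edu": "128.237.148.168"},
    "128.237.148.168": {"stat.cmu.edu": "128.2.12.64"},
}
TTL_SECONDS = 3600  # assumed; real servers choose their own TTLs


class ToyResolver:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.cache = {}     # name -> (address, expiry time)
        self.queries = 0    # how many servers we had to ask

    def resolve(self, name):
        cached = self.cache.get(name)
        if cached and cached[1] > self.clock():
            return cached[0]            # still within its time-to-live

        labels = name.split(".")
        server = "root"
        answer = None
        # Walk right to left: edu, then cmu.edu, then stat.cmu.edu.
        for i in range(len(labels) - 1, -1, -1):
            suffix = ".".join(labels[i:])
            answer = SERVERS[server][suffix]
            self.queries += 1
            server = answer             # ask the server we were pointed at next
        self.cache[name] = (answer, self.clock() + TTL_SECONDS)
        return answer


resolver = ToyResolver()
print(resolver.resolve("stat.cmu.edu"), resolver.queries)
print(resolver.resolve("stat.cmu.edu"), resolver.queries)  # answered from cache
```

The clock is passed in as a parameter only so the expiry behavior can be exercised without actually waiting an hour.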

Typical TTLs are in the range of hours to a day or two, which is why sometimes after website maintenance it can take a while for your access to be restored – your resolver might have an old invalid DNS response cached.

When you see error messages like “example.com could not be found”, your computer could not find a record for it in the Domain Name System.

Some systems distribute data via DNS. Some organizations run DNSBLs or RBLs (DNS Blackhole List or Real-time Blackhole List), which provide ways to query if a domain name or host is known to be involved in spam mail. If your email server receives a message from 192.168.42.23, it might do a DNS query for 23.42.168.192.dnsbl.example.net; if the example.net RBL indicates that this IP is known to send spam, it will return an IP address, otherwise it will return a message that no such record is known.
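Building the query name is just a matter of reversing the octets and appending the list's zone. A small sketch follows; the function name dnsbl_query_name is made up, and dnsbl.example.net is the placeholder zone from the paragraph above, not a real blocklist.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the name to look up for an IPv4 address in a DNSBL zone."""
    # Round-tripping through IPv4Address rejects malformed input early.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(dnsbl_query_name("192.168.42.23", "dnsbl.example.net"))
# 23.42.168.192.dnsbl.example.net
```

With a real list, you could hand the resulting name to socket.gethostbyname and treat a socket.gaierror as "not listed".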

Sending data with TCP

Once you have received the IP address corresponding to a domain name, how do you communicate with it?

There are layers here; seven layers, specifically. Your computer needs a physical connection to other computers, either by cable or via electromagnetic field; communication over that connection has to be coordinated; messages have to be sent over that connection; those messages need to be reassembled into larger pieces with meaning; and those pieces need to be given meaning.

We’ll skip the physical layers. Suffice to say that Ethernet cables or Wi-Fi connections are involved.

Let’s talk about how we send messages and reassemble them. There are several ways to do this, but the one most often used is the Transmission Control Protocol.

TCP is a way of delivering messages between machines. It does not discriminate on the content of those messages; they may involve emails, video calls, Web pages, DNS queries, World of Warcraft game data, or anything else. TCP does not care or know anything about the specific uses of the messages.

TCP merely provides a way to deliver messages. It does so in steps:

  1. Your computer sends a TCP message asking to connect to a particular server. That server replies with a message saying “Sounds good to me”, and your computer replies with a message saying “Great, let’s get to it”. This is the connection handshake. It’s what’s happening when you see “connecting to host…” messages.
  2. Your computer takes the content of the message it wants to send and breaks it up into small pieces, called packets. Each packet is stamped with its destination IP address and a number indicating the order they’re supposed to be reassembled in. Your computer sends these to be delivered to the server.
  3. As the server receives the packets, it sends acknowledgements (“ACK”). If a packet is lost, your computer can re-send it.
  4. The server may reply with messages of its own, delivered in the same way, and messages can flow back and forth until one computer decides to close the connection and sends a “FIN” message (which has to be ACKed and matched with another FIN from the other computer).

TCP is meant to be robust: if packets are lost (which happens surprisingly often!) they are re-sent, and packets can arrive in any order and be reassembled.
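The numbering-and-reassembly idea in steps 2 and 3 can be imitated in a few lines of Python. This is a toy, not TCP: real TCP numbers bytes rather than whole packets, and the operating system does all of this for you. The function names are invented for illustration.

```python
import random


def to_packets(message, size):
    """Break a bytes message into (sequence number, chunk) pairs."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Put chunks back in sequence order, ignoring duplicate deliveries."""
    unique = dict(packets)      # a re-sent packet carries the same number
    return b"".join(unique[seq] for seq in sorted(unique))


message = b"GET /wiki/World_Wide_Web and some more bytes to pad this out"
packets = to_packets(message, size=8)

# Simulate an unreliable network: packets arrive shuffled, and one arrives twice.
arrived = packets + [packets[2]]
random.shuffle(arrived)

print(reassemble(arrived) == message)
```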

When you see error messages like “Could not connect to host” or “Connection refused”, there was a problem at step 1: the connection handshake never completed. Maybe there is no machine at that IP address, or that computer is temporarily disconnected, or (in the case of “Connection refused”) the machine is reachable but no server program is listening for your connection.

The World Wide Web

The World Wide Web refers to the vast collection of documents and applications interconnected with hyperlinks and available through the Internet. The documents are usually formatted with HTML and might contain pictures and video, though other files can be delivered as well.

The WWW is not the only thing on the Internet; email is transmitted via separate protocols over the Internet, even if you can access your mailbox via a Web page. The Web is just one massively influential use for the Internet.

URLs

To fetch a Web page, you need a Uniform Resource Locator to identify which page you want. URLs come in parts; here’s the syntax as given by Wikipedia:

URI = scheme:[//authority]path[?query][#fragment]
authority = [userinfo@]host[:port]

Let’s break that down:

Scheme
The type of thing we want. Common schemes are http and https for Web pages, mailto for email links, file for files on the local file system, and so on.
userinfo
A username and potentially a password (as username:password). Optional. Sometimes used for FTP links so the link specifies the username to use.
host
A domain name or IP address for the server that owns this content.
port
The port number to use when connecting.
path
A sequence of parts separated by slashes (/ – no, these are not backslashes). The path specifies the specific resource or file being requested.
query
Optionally, a query, starting with a question mark. Queries are typically key-value pairs separated by ampersands, as in foo=bar&baz=bin.
fragment
A name, preceded by the hash sign #, pointing to a specific part of the document. For example, in an HTML page, the fragment may refer to a specific section or heading on the page.

Some URL examples:

https://en.wikipedia.org/wiki/URL?thing=baz

mailto:foo@example.com

ftp://alex:hunter2@ftp.example.com/home/alex/passwords

file:///Users/myuser/Documents/foo.txt



Never try to parse URLs (e.g. to extract the domain name or a specific path component) with regular expressions. You will get it wrong and weird things will happen, since URLs can have very weird forms. Use a library or package that does it properly.
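In Python, for instance, the standard library's urllib.parse does this properly. Here is a short sketch using two of the example URLs above; the #History fragment on the second one is my addition, to show that component too.

```python
from urllib.parse import urlsplit, parse_qs

u = urlsplit("ftp://alex:hunter2@ftp.example.com/home/alex/passwords")
print(u.scheme)                  # ftp
print(u.username, u.password)    # alex hunter2
print(u.hostname, u.port)        # ftp.example.com None (no port was given)
print(u.path)                    # /home/alex/passwords

w = urlsplit("https://en.wikipedia.org/wiki/URL?thing=baz#History")
print(w.hostname, w.path)
print(parse_qs(w.query))         # {'thing': ['baz']}
print(w.fragment)                # History
```

Note that parse_qs returns a list for each key, since a key is allowed to appear more than once in a query.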

The HyperText Transfer Protocol

We use Web browsers – and various other programs – to view Web pages. All that communication between computers and servers is done via TCP, but there’s an extra layer on top of TCP to provide a standardized way to request specific Web pages, define the formats you are willing to accept in response, and identify yourself.

That extra layer is the HyperText Transfer Protocol, or HTTP. It defines specific types of messages to be sent via TCP to make requests.

For example, here’s a request I sent to Wikipedia for the page https://en.wikipedia.org/wiki/World_Wide_Web:

GET /wiki/World_Wide_Web HTTP/2.0
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Cookie: enwikiUserName=Cap%27n+Refsmmat; [more redacted]
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
If-Modified-Since: Sun, 27 Jan 2019 03:20:05 GMT

There’s a lot going on here. Let’s break it down.

GET /wiki/World_Wide_Web HTTP/2.0
Host: en.wikipedia.org

This is a GET request using HTTP version 2.0. The first line specifies the type of request and the specific resource I’m requesting, /wiki/World_Wide_Web. The second line is the beginning of the headers, providing metadata about the request. I specify the domain name I am making the request for – it’s possible for multiple different domain names to have the same IP address, so I have to tell the receiving server which domain name I’m interested in.

More headers follow. A few interesting ones:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Cookie: enwikiUserName=Cap%27n+Refsmmat; [more redacted]

The User-Agent tells the server what kind of program I am. Servers sometimes use this to tell what version of a web page to serve – maybe a page uses some features that only work in Chrome or in specific versions of Internet Explorer – and servers often do this in stupid ways, so browser makers have been forced to make User-Agent headers progressively stupider.

For example, back in The Day™, only Netscape supported “advanced” HTML features like frames, so servers checked for “Mozilla” in the User-Agent before sending the fancy versions of pages; after Internet Explorer and others added support for frames, they had to add Mozilla to their User-Agent so websites would actually send them frames. So now every User-Agent starts with Mozilla even if it’s not made by Mozilla.

User-Agent headers are often used to identify robots; some websites categorically ban the default user-agents used by Python’s standard library, for example, because poorly written web scrapers often cause them problems. It’s considered good form to provide your own User-Agent when you write a scraper, and include enough identifying information in the header that it could be used to contact you to complain if your program causes problems.
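As a sketch of what a polite User-Agent might look like, here is a request built (but not sent) with Python's standard urllib.request. The scraper name and contact address are hypothetical placeholders; substitute your own.

```python
import urllib.request

# Hypothetical project name and contact address, for illustration only.
USER_AGENT = "stat-course-scraper/0.1 (contact: you@example.com)"

req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/World_Wide_Web",
    headers={"User-Agent": USER_AGENT, "Accept": "text/html"},
)

print(req.get_method())      # GET, since no data was attached
print(req.host)              # becomes the Host header
print(req.header_items())    # the headers we set ourselves

# Nothing has been sent yet. To actually send it:
# with urllib.request.urlopen(req) as response:
#     html = response.read()
```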

The Accept header specifies what media type my browser is willing to accept in response to the request: HTML, XHTML, or XML pages, or anything else (*/*) if needed. Accept-Encoding says that my browser would prefer the response to be compressed with gzip, DEFLATE, or Brotli, to save bandwidth.

The Cookie header provides the server any cookies my browser has stored for this domain name. I’ve redacted several cookies containing authentication tokens that prove to Wikipedia that I’m a specific logged-in user. If I don’t send the Cookie header, Wikipedia will have no idea who I am.

There are many other types of headers, but these are the ones you may need in practice.

Request types and responses

After a server receives an HTTP request, it (usually) sends a response. Responses have a similar format to requests:

HTTP/2.0 200 OK
date: Wed, 30 Jan 2019 19:14:42 GMT
content-type: text/html; charset=UTF-8
server: mw1272.eqiad.wmnet
x-powered-by: HHVM/3.18.6-dev
expires: Thu, 01 Jan 1970 00:00:00 GMT
content-language: en
set-cookie: enwikiSession=[redacted]; path=/; secure; httponly
set-cookie: enwikiUserID=92555; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; secure; httponly
set-cookie: enwikiUserName=Cap%27n+Refsmmat; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; secure; httponly
set-cookie: forceHTTPS=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; httponly
set-cookie: forceHTTPS=true; expires=Fri, 01-Mar-2019 19:14:42 GMT; Max-Age=2592000; path=/; domain=.wikipedia.org; httponly
set-cookie: centralauth_User=Cap%27n+Refsmmat; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; domain=.wikipedia.org; secure; httponly
set-cookie: centralauth_Token=[redacted]; expires=Thu, 30-Jan-2020 19:14:42 GMT; Max-Age=31536000; path=/; domain=.wikipedia.org; secure; httponly
set-cookie: centralauth_Session=[redacted]; path=/; domain=.wikipedia.org; secure; httponly
last-modified: Sun, 27 Jan 2019 04:28:44 GMT
vary: Accept-Encoding,Cookie,Authorization,X-Seven
content-encoding: gzip
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)

(some boring headers redacted)

After the headers is the body. In the above case, since Content-Encoding is gzip, the response body was the gzip-compressed HTML of the web page.
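HTTP libraries normally undo this compression for you, but a rough sketch with Python's standard gzip module shows what happens. The HTML here is a made-up stand-in, not Wikipedia's actual page.

```python
import gzip

# A stand-in page; repetitive markup compresses well, as real HTML tends to.
html = b"<html><body>" + b"<p>World Wide Web</p>" * 200 + b"</body></html>"

body_on_the_wire = gzip.compress(html)    # what the server would send
print(len(html), len(body_on_the_wire))

# The client checks the header to decide how to decode the body.
headers = {"content-encoding": "gzip"}
if headers.get("content-encoding") == "gzip":
    decoded = gzip.decompress(body_on_the_wire)
else:
    decoded = body_on_the_wire
print(decoded == html)
```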

Notice the Set-Cookie headers asking my browser to store various cookies, the Content-Type specifying that this is a text/html page, and the various date and modification headers to tell the browser how long it may cache this page.

In this example, I sent a GET request to Wikipedia. GET is one of several methods that can be used:

GET
Retrieve data from a specific resource (i.e. fetch a specific web page). GET requests should not have side effects, like deleting things.
HEAD
Same as GET, except without the body of the response – only the response headers will be sent.
POST
Send data to the server so it can do something. Submitting a form to a website often sends a POST request containing the contents of the form. POST requests often have side effects, and automated scrapers should be very careful about submitting POST requests.
PUT, PATCH, …
Other methods are less commonly used. PUT sends a file to a server and asks it to put it at a specific URL, for example, while PATCH asks for modifications to be made to a file.

(In 2005, Google released the Google Web Accelerator, which would preemptively make GET requests for links on the page you’re currently viewing, so those pages would be downloaded by the time you click a link. Unfortunately, some websites used GET requests for things like “delete this post”, and so users with Google Web Accelerator unintentionally deleted everything they had access to delete.)

When you just fetch a web page, you’re probably using GET.

Sometimes, when you’re scraping a site or using an API, you’ll need to use POST. Some APIs use it if you want to specify search terms and restrictions in a request; it’s also useful if you want to submit forms to a site. GET request URLs are also usually logged, so POST is used for things like login forms, so your password does not end up in the logs.

A POST request can contain arbitrary amounts of data, typically as key-value pairs.

Here’s an example POST request adapted from MDN:

POST /submit HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 13

say=Hi&to=Mom

We’re sending the keys say and to to the server at example.com.

There are other ways of encoding the contents of a POST request; this method is called urlencoded (because the key-value pairs resemble those in the query component of a URL), but we could send JSON or other data formats if the server knows how to read them.
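For a sense of what the encoding step does, here is a short sketch with Python's standard library that builds the same body as the example above, plus a JSON alternative. The second form field is invented to show escaping.

```python
import json
from urllib.parse import urlencode

form = {"say": "Hi", "to": "Mom"}

body = urlencode(form)
print(body)                          # say=Hi&to=Mom
print(len(body.encode("ascii")))     # 13, the Content-Length in the example

# Special characters are escaped so they cannot be confused with & or =.
print(urlencode({"q": "R&D = fun"}))     # q=R%26D+%3D+fun

# The same data as JSON, which would need Content-Type: application/json.
json_body = json.dumps(form)
print(json_body)                     # {"say": "Hi", "to": "Mom"}
```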

TLS and HTTPS

We do a lot of sensitive things on the Web these days: manage our bank accounts, organize political protests, view medical records, search WebMD for embarrassing diseases, and view enough How It’s Made videos to make others doubt our sanity. We do these things even when we do not trust the people running our network: we’ll gladly use our credit card to buy stuff on Amazon while sitting in a coffee shop using their Wi-Fi, for example, even when they could make good money by stealing credit card information from all the people who sit for four hours in the coffee shop nursing one Americano and doing homework.

We frequently need three specific properties for our websites:

Confidentiality
We don’t want others to know what we’re looking at or what data we’re submitting.
Integrity
We don’t want anyone to be able to manipulate the messages we send and receive, e.g. by inserting extra contents into Web pages we visit. (A malicious coffee shop could add code to Web pages to send them everything you enter into forms, for example.)
Authentication
We want to be confident that when we connect to large-evil-bank.com, we are receiving responses from the genuine Large Evil Bank Corporation, and not from an impostor trying to lure us into sharing bank details.

This is what HTTPS is meant to enable. HTTPS extends HTTP to use the Transport Layer Security protocol (TLS) to connect to the host server. TLS lets the server cryptographically prove its identity, and uses strong encryption to prevent anyone from eavesdropping on the connection or from manipulating its contents in any way.

Cryptographic authentication is tricky and relies on public-key cryptography. I’ll skip the math. One part worth knowing, though, is the idea of a “digital certificate” provided by a Web server to prove its identity; certificates are signed by certificate authorities, who vouch that they have verified the website’s identity and are trusted by Web browser manufacturers to do so. Certificates cannot be forged (unless the encryption system is broken).

You sometimes see certificate errors when you try to visit Web sites. These may happen because the certificate has expired (they last for a limited period of time), it vouches for a different website, or it has been issued by a certificate authority your browser does not recognize.
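Programs perform the same checks as browsers. As a sketch, Python's standard ssl module builds a context that requires a valid certificate and a matching hostname by default. The commented-out connection uses example.com purely as a placeholder host.

```python
import ssl

context = ssl.create_default_context()

# Both checks are on by default: the certificate must chain to a trusted
# certificate authority, and it must vouch for the hostname we asked for.
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

# To use it, wrap an ordinary TCP socket (this part needs network access):
# import socket
# with socket.create_connection(("example.com", 443)) as sock:
#     with context.wrap_socket(sock, server_hostname="example.com") as tls:
#         print(tls.version())
```

If either check fails, wrap_socket raises an ssl.SSLCertVerificationError instead of connecting, which is the programmatic equivalent of the browser's certificate warning.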

In the olden days, obtaining a certificate for your website cost money, so only businesses handling credit cards or sensitive data bothered to do so. That’s no longer true, and any website that can enable HTTPS should. It prevents common attacks, like coffee shops injecting ads into websites and hackers adding malicious code to the software packages you download.

HTTPS also blocks the Upside-Down-Ternet, which is perhaps its only downside.

Sending HTTP requests with code

So, this all looks like a confusing tangled mess. To send an HTTP request, you need to know how to form the right HTTP headers, urlencode your POST data, set Accept headers and who knows what, and then parse the headers you receive in return.

This is why packages exist.

In Python, just use Requests, branded as “HTTP for Humans™”. It “allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.” It handles encodings and POST and everything else.

For example:

import requests

r = requests.get('https://api.github.com/events')

print(r.text) # response body

## build a query string with key=val&key2=val2 syntax
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

## automatically format a POST request
r = requests.post('https://httpbin.org/post', data={'key': 'value'})

In R, you have httr:

library(httr)
r <- GET("http://httpbin.org/get")

str(content(r))  # response body
headers(r)       # list of headers

## a request with a query string
r <- GET("http://httpbin.org/get",
  query = list(key1 = "value1", key2 = "value2")
)

## a POST
r <- POST("http://httpbin.org/post", body = list(a = 1, b = 2, c = 3))

You will undoubtedly find a nice library for any programming language you want, because everyone wants to make HTTP requests.

However, you usually don’t want to send HTTP requests yourself. Often you’re interested in scraping a website; there are integrated packages like rvest that handle both HTTP requests and parsing HTML to extract desired content.

Next time we’ll talk about scraping more directly, including how to extract specific HTML elements using rvest or Beautiful Soup.