We’ve talked about parallelism as a way to solve a problem of scale: the amount of computation we want to do is very large, so we divide it up to run on multiple processors or machines.
But there’s another kind of problem of scale: the amount of data we want to process is very large, so we divide it up to process on multiple machines. This is Big Data.
Think of it this way: if you have one server with loads of RAM, a big hard drive, and a top-of-the-line CPU with many cores, and your data fits easily on that server, you’re probably best off writing parallel code and using that server. But if the data is going to be too large – or you’re receiving new data all the time – instead of buying an even bigger server, it might make more sense to buy a fleet of many smaller servers.
Computing on many independent machines is distributed computing. It brings many challenges over just using a single machine:
Each of these is a difficult problem. There is an ecosystem of software built to solve these problems, from the Hadoop Distributed File System to Spark to Hive to Mahout. But before we learn what the buzzwords mean, let’s talk about the structure of the problems.
Statisticians tend to think of datasets as simple things: maybe a few CSV files, a folder full of text files, or a big pile of images. Usually our biggest storage challenge is that we have to pay for extra Dropbox storage space to share the dataset with our collaborators. If datasets change, we usually just get some new data files or a few corrections.
But this is not what happens in industry. Consider a few examples:
There are a few common features here:
Saving to a big CSV file simply is not going to scale. Large companies may have terabytes or petabytes of data stored, with gigabytes more arriving each day. We’re also fundamentally limited by how quickly we can write data to a single hard drive (and afraid of what would happen if that drive fails).
This is where a distributed file system is useful.
In a distributed file system, data is
To achieve this, distributed file systems usually have several parts:
A common distributed file system is the Hadoop Distributed File System, HDFS, though there are many others. Amazon S3 (Simple Storage Service) is a distributed file system as a service, letting you send arbitrary objects to be stored on Amazon’s servers and retrieved whenever you want.
Spark can load data from HDFS, from S3, or just ordinary files off your hard drive.
Now suppose we have a distributed file system. How do we divide up computing tasks to be run on multiple machines, loading their input data from a distributed file system?
There are several conceptual models we could use. The first is MapReduce.
MapReduce dates to 2004, when Google engineers published a paper advertising the MapReduce concept as “Simplified Data Processing on Large Clusters”. By 2006 the Hadoop project existed to make an open-source implementation of the MapReduce ideas, and Hadoop soon exploded: it was the single largest buzzword of the early 2010s, being the canonical Big Data system.
Let’s talk about the conceptual ideas before we discuss the implementation details.
MapReduce uses the Map and Reduce operations we’ve discussed before, but with an extra twist: Shuffle. A MapReduce system has many “nodes” – different servers running the software – that are connected to some kind of distributed file system.
The output of all the reduction functions is aggregated into one big list.
The Map and Reduce steps are parallelized: each node processes the Map function simultaneously, and each node processes the Reduce function on each key simultaneously.
The Shuffle step lets us do reductions that aren’t completely associative.
The standard MapReduce example is counting words in a huge set of documents. Stealing from Wikipedia,
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
Here emit (w, 1) means w is the key for the value 1.
That doesn’t really give you a sense of the scale of MapReduce, so let’s consider a more statistical example: parallel K-means clustering.
First, a brief review of K-means. We have n points (in some d-dimensional space) we want to cluster into k clusters. We randomly select k points and use their locations as cluster centers. Then:
The distance calculations are the scaling problem here: each iteration requires calculating the distance between all n points and all k cluster centers.
Q: How could we do K-means as a sequence of MapReduce operations?
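One possible structure, sketched below in the same emit-style pseudocode as the word-count example (the function names here are purely illustrative, and this is not the only way to do it): each Map call assigns a point to its nearest center, each Reduce call averages the points assigned to one center, and the driver repeats this MapReduce pass with the new centers until they stop moving.

## Hypothetical sketch of one K-means iteration as a MapReduce pass.
## `centers` is distributed to every node; `emit` is the pseudocode
## operation from the word-count example above.

def kmeans_map(point, centers):
    ## Key each point by the index of its nearest center; the value
    ## carries the point's coordinates and a count of 1.
    distances = [sum((p - c) ** 2 for p, c in zip(point, center))
                 for center in centers]
    nearest = distances.index(min(distances))
    emit(nearest, (point, 1))

def kmeans_reduce(center_index, values):
    ## Sum the coordinates and counts for one center, then average
    ## to get the center's new location.
    total, count = None, 0
    for point, c in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += c
    emit(center_index, [t / count for t in total])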
Apache Hadoop is an implementation of the MapReduce idea, in the same way that PostgreSQL or SQLite are implementations of the SQL idea. It’s built in Java.
Hadoop handles the details: it splits up the data, assigns data chunks to compute nodes, calls the Map function with the right data chunks, collects together output with the same key, provides output to the Reduce function, and handles all the communication between nodes. You need only write the Map and Reduce functions and let Hadoop do the hard work.
It’s easiest to write map and reduce functions in Java, but Hadoop also provides ways to specify “run this arbitrary program to do the Map”, so you can have your favorite R or Python code do the analysis.
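For instance, in Hadoop's streaming mode the Map step can be any executable that reads records on standard input and writes tab-separated key-value pairs on standard output. A word-count mapper in Python might look roughly like this (a sketch; the matching reducer would read the grouped pairs and sum the counts):

#!/usr/bin/env python
## Streaming-style word-count mapper: Hadoop pipes input lines to stdin
## and collects the "key<TAB>value" pairs printed to stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")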
Hadoop can be used to coordinate clusters of hundreds or thousands of servers running the Hadoop software, to which MapReduce tasks will be distributed.
Hadoop was incredibly popular for quite a while, but it has its limitations.
Apache Spark builds on MapReduce ideas in a more flexible way.
Here’s the idea. Your dataset (a great big data frame, for example) is immutable. There are many ways you can operate upon it, all of which produce a new dataset. For example, you could
and so on. Spark has a bunch of built-in transformation functions, including these and many more.
When you load a dataset into Spark, different parts of it are loaded into memory on each machine in the cluster. When you apply a transformation to the data, each machine transforms its own chunk of the data.
Well, hang on. Each machine doesn’t transform its data; Spark just makes a note of the transformation you made. You can do a whole sequence of transformations and Spark will only start the distributed calculations when you perform an action, like
and so on.
You can chain together these operations: you can map, then filter, then sample, then group, then reduce, with whatever operations you want.
Let’s illustrate with an example taken from the documentation. I’ll work in Python, but there’s an R version of all of this.
First we make a SparkSession – basically a connection to the Spark cluster. This is a bit like connecting to Postgres.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
Now spark is an object representing a connection to Spark. We can ask Spark to create a DataFrame object from a data file; it understands a whole bunch of data formats, including JSON.
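For example, loading a JSON file might look like this (the path here is just a placeholder; the Spark documentation uses a small file of people and ages):

## Load a JSON file into a distributed DataFrame
df = spark.read.json("people.json")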
Now df is a data frame. It's just a table with columns age and name for some people.
Our Python code can do transformations, like filter.
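For instance, a filter on the age column might look like this (a sketch; the exact expression in the original example may differ):

## Keep only the rows where age is over 21; nothing is computed yet
filtered = df.filter(df.age > 21)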
This creates a new DataFrame of the results.
Notice the magic happening here: if the data file were huge, this transformation would automatically be distributed to all the nodes in the cluster. But not now – later, when we try to use filtered, for example by writing filtered.show().
Since every operation returns a new DataFrame, we can keep writing operations on the results.
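For instance, a group-and-count (a sketch reconstructing the example described just below):

## Count how many people have each age; again, nothing is computed yet
counts = df.groupBy("age").count()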
This returns a new DataFrame with columns age and count – a lazy DataFrame, which is only calculated when we ask for it.
You can think of this as a way of building up a MapReduce job. You write the sequence of transformations you need – even for loops and whatever else you want – and at the end, when you request the result, Spark rearranges and distributes the work so it can be calculated even on a massive dataset.
You still have to figure out how to structure your operation as a series of transformations and reductions, but it’s a bigger set of transformations and reductions, and you can compose many small ones together instead of cramming the operation into one big Map and Reduce.
With the release of Spark 2.0, Spark began a major change. Previously, the core data object was the RDD, the Resilient Distributed Dataset, and common operations built RDDs and operated on RDDs. Eventually, the Spark developers decided that they hadn't designed RDDs very well, so they made a new object called a DataFrame that is still distributed and resilient, but is accessed through a different set of functions.
Basically, they decided the original API wasn’t very good and made a new one. The new design lets you query data with SQL, adds fancy optimizers that figure out how to restructure your calculations to make them faster, and is supposed to be generally nicer.
The problem is that there's a lot of Spark code out there, and much of it uses RDDs. So Spark now supports both. You're supposed to use DataFrame now, but RDD code still works, so you can easily be confused by having two versions of everything.
Fortunately, there are methods for converting an RDD to a DataFrame, so you can switch as needed.
Spark has a directory full of examples in Java, Scala, Python, and R. The R examples are a bit limited and the Python examples all still use RDDs instead of DataFrame, which is annoying. But we can see the idea of using Spark anyway.
Let's try doing K-means clustering. Suppose we have a function closestPoint(p, centers) that takes a point p and a list of cluster centers and returns the index of the center closest to p. Then suppose we have a data file where every line is a point, with coordinates separated by spaces.
I’ll simplify a bit from the original example code.
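For completeness, here is roughly what those two helpers could look like (a NumPy-based sketch; it also provides the np used in the loop below):

import numpy as np

def parseVector(line):
    ## Turn "1.2 3.4 5.6" into a NumPy array of coordinates.
    return np.array([float(x) for x in line.split(" ")])

def closestPoint(p, centers):
    ## Index of the cluster center with the smallest squared distance to p.
    return min(range(len(centers)),
               key=lambda i: np.sum((p - centers[i]) ** 2))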
To initialize and read the data, we start with
spark = SparkSession \
    .builder \
    .appName("PythonKMeans") \
    .getOrCreate()

lines = spark.read.text("data.txt").rdd.map(lambda r: r[0])
## parseVector just reads the space-separated coordinates
data = lines.map(parseVector).cache()

K = 10
convergeDist = 0.1
kPoints = data.takeSample(False, K, 1)
Next we have a simple loop. Find the closest center to each point, group together all points for each center, and make the new center location the average of those points. Track how much change happened and use that to check for convergence.
tempDist = 1.0

while tempDist > convergeDist:
    ## Find closest center to each point. Use that as the key; return (point, 1)
    ## tuple attached to that key.
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))

    ## Sum up the points and the counts. That is, take the (point, 1) tuples and
    ## merge them to get a (sum_of_coordinates, num_points) tuple for each key.
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))

    ## Take those (sum_of_coordinates, num_points) tuples and divide the
    ## coordinates by the number of points, to get the new cluster centers.
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    ## Calculate how much the cluster centers have changed.
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    ## Set the new cluster centers.
    for (iK, p) in newPoints:
        kPoints[iK] = p
This is much nicer than using an old-fashioned MapReduce, since you can write code with ordinary loops and such, but you still get the benefits of distributed computing. Actual calculation only occurs once per loop iteration, when we call .collect(), forcing Spark to send out the accumulated calculations and retrieve the results so we can calculate tempDist and see if we've reached convergence.
This would look pretty similar with DataFrame instead of RDDs, since the APIs are quite similar; most changes would be to how the data is loaded (notice the .rdd when we define lines).
DataFrame
You might have noticed that the basic Spark operations, like maps, grouping, and aggregation, sound a lot like the kinds of things you can do in SQL. That’s not a coincidence: the SQL model of operations on tables still holds up after all these years.
As a bonus, Spark understands SQL syntax. If you have a DataFrame called df, you can run SQL operations on it:
## Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
sqlDF is just another DataFrame, and so .show() prints it out.
You can use all the usual SQL functions like count(), avg(), and max(), as well as arithmetic and so on. You can even use JOIN to join multiple data frames together.
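For instance, with a second (hypothetical) purchases table registered the same way, an aggregation and join could look like this:

## purchases is another DataFrame with name and amount columns (hypothetical)
purchases.createOrReplaceTempView("purchases")

spark.sql("""
    SELECT p.name, COUNT(*) AS num_purchases, AVG(pur.amount) AS avg_amount
    FROM people p
    JOIN purchases pur ON p.name = pur.name
    GROUP BY p.name
""").show()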
It's even possible to have Spark behave like a SQL server so that any language that supports SQL can connect and send SQL queries.
I mentioned that Spark supports a whole bunch of data formats – a confusingly large number, actually. You can read text files, CSVs, JSON, Parquet, ORC, Avro, HBase, and various other things with connectors and libraries.
So what should you choose? When you’re scraping your data and preparing it for Spark, what file format is best?
For perfectly ordinary data files, CSVs are fine. But as the data file gets large – tens or hundreds of megabytes – reading the CSV can become slow. To access a single column or a small subset of data you usually have to parse the entire file. On huge datasets that defeats the purpose of having a distributed system.
Parquet is designed as a columnar data store. Instead of storing one row at a time, it stores one column at a time, with metadata indicating where to find each column in the file. It’s very efficient to load a subset of columns. Parquet files can also be partitioned, meaning the data is split into multiple files according to the value of a certain column (e.g. the year of the entry). This is nice for huge datasets, and on a distributed file system lets different chunks exist on different servers.
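In Spark, writing and reading partitioned Parquet looks roughly like this (a sketch; the year column and file names are made up):

## Write df to Parquet, split into one directory per value of the year column
df.write.partitionBy("year").parquet("events.parquet")

## Reading back only needs to touch the requested columns and partitions
events = spark.read.parquet("events.parquet")
events.filter(events.year == 2018).select("name", "age").show()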
Arrow is a companion to Parquet that standardizes the in-memory format, i.e. how the data is laid out in memory once it is loaded. Because this layout is standardized, you can load Parquet data into Arrow and pass it between libraries and programming languages without any conversion.
There is a Python library for Parquet and Arrow, and SparkR has a read.parquet function.
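Outside of Spark, reading a Parquet file into memory with the pyarrow package looks roughly like this (the file name is a placeholder):

import pyarrow.parquet as pq

## Read the Parquet file into an Arrow table, then convert to pandas
table = pq.read_table("events.parquet")
pandas_df = table.to_pandas()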
For enormous datasets, people often use HBase, a columnar database engine that supports splitting data across machines, querying arbitrary subsets of data, and updating and modifying datasets live as new data arrives and systems are querying the database.
An obvious thing to do with Spark is machine learning with large datasets.
Spark has MLlib, a library of machine learning algorithms, built in. It's a bit confusing because it has two versions of everything, one using RDDs and one for DataFrame, but it supports
These operations are distributed as much as is possible.
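For instance, the DataFrame-based K-means looks roughly like this (a sketch, assuming a data frame with numeric columns x and y):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

## MLlib expects the coordinates packed into a single "features" vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
points = assembler.transform(df)

model = KMeans(k=10, seed=1).fit(points)
clustered = model.transform(points)  # adds a "prediction" column of cluster labels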
The MLlib documentation gives examples in Java, Scala, and Python; for R, check the SparkR documentation.
Spark comes with GraphX, a framework for implementing graph algorithms and analyses on top of Spark. The graph data is split up across nodes just like any other dataset, and graph operations can be handled in parallel.
Normal Spark datasets look like data frames. A GraphX graph looks like two data frames:
You don't need GraphX to define a graph like this – you can just make two normal Spark data tables. GraphX just provides extra operators and features, like a single Graph class encapsulating the two tables, operators like inDegrees and numEdges, and operations like mapEdges and mapVertices that apply functions to the graph.
GraphX also provides functions for Pregel, “a system for large-scale graph processing”. Pregel answers the question “How do you implement a graph algorithm like PageRank or graph search in parallel on a huge graph split across multiple machines?”
Pregel’s answer is to imagine every graph algorithm as an iterative algorithm involving multiple steps:
This is easy to distribute: each node’s calculation depends only on its own attributes and the messages delivered to it. After each step, Spark transfers the messages to the right servers, then starts another step.
The Pregel paper gives example algorithms for PageRank, shortest path algorithms, semi-clustering (mixed membership clustering), and bipartite matching. Many other algorithms can be written this way.
Unfortunately, there is no Python or R API for GraphX. Its methods can only be accessed via Scala or Java. Apparently, nobody really uses GraphX; most companies using Spark treat graph data as a pair of data frames and don’t think of graph problems in terms of graph algorithms, just summary statistics and calculations applied to the data frames. So nobody has invested time in Python or R APIs for GraphX.
You can instead use GraphFrames, an extension to Spark that has a Python API (but not R). It may possibly replace GraphX in the future, and has many of its features, including a message-passing API. The documentation has an example implementation of belief propagation using its message passing features.
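A minimal GraphFrames example looks roughly like this (a sketch, assuming the graphframes package is installed and available to your Spark session; the vertices and edges are made up):

from graphframes import GraphFrame

## Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                             # in-degree of each vertex
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()  # PageRank scores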
Most of the time, your analysis is run in batches: you have a big chunk of data, you run your analysis on it. A while later, the data has been updated, so you run everything again.
But what if you have an analysis that needs to be updated live as data arrives, within seconds of it arriving? This is called streaming.
Spark has a system called Structured Streaming for this kind of analysis. (DStreams, or Spark Streaming, were the previous method, to be replaced with Structured Streaming.) It is pretty simple to deal with; here’s an example from the documentation:
## Create DataFrame representing the stream of input lines from connection to
## localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
## Split the lines into words (explode and split come from pyspark.sql.functions)
from pyspark.sql.functions import explode, split

words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)
## Generate running word count
wordCounts = words.groupBy("word").count()
Rather than making a DataFrame from a file, we use readStream to make one from a network connection (to localhost, in this case). We then use ordinary DataFrame operations to get the word counts. wordCounts now represents a streaming operation – no calculation has been done yet, but when we tell Spark to start calculating, it will open the connection and update the calculation periodically as data arrives:
## Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
Streaming essentially treats streaming data as an ordinary DataFrame, just one that's always being added to. Spark is smart enough to avoid recalculating everything when new data arrives; since transformation functions apply to individual rows at a time, it can apply those operations to the new data, then use the results to update the aggregate functions (like groupBy).
In fact, Spark does not ever keep around the entire table in memory – only the output and whatever state is needed to allow it to update the results.
Spark guarantees that this will work even if some servers crash halfway through calculations and parts have to be started over.
The documentation is full of a bunch of buzzwords that can be a bit hard to parse. Here’s a brief glossary for you: