We’ve talked about parallelism as a way to solve a problem of scale: the amount of computation we want to do is very large, so we divide it up to run on multiple processors or machines.
But there’s another kind of problem of scale: the amount of data we want to process is very large, so we divide it up to process on multiple machines. This is Big Data.
Think of it this way: if you have one server with loads of RAM, a big hard drive, and a top-of-the-line CPU with many cores, and your data fits easily on that server, you’re probably best off writing parallel code and using that server. But if the data is going to be too large – or you’re receiving new data all the time – instead of buying an even bigger server, it might make more sense to buy a fleet of many smaller servers.
Computing on many independent machines is distributed computing. It brings many challenges over just using a single machine:
Each of these is a difficult problem. There is an ecosystem of software built to solve these problems, from the Hadoop Distributed File System to Spark to Hive to Mahout. But before we learn what the buzzwords mean, let’s talk about the structure of the problems.
Statisticians tend to think of datasets as simple things: maybe a few CSV files, a folder full of text files, or a big pile of images. Usually our biggest storage challenge is that we have to pay for extra Dropbox storage space to share the dataset with our collaborators. If datasets change, we usually just get some new data files or a few corrections.
But this is not what happens in industry. Consider a few examples:
There are a few common features here:
Saving to a big CSV file simply is not going to scale. Large companies may have terabytes or petabytes of data stored, with gigabytes more arriving each day. We’re also fundamentally limited by how quickly we can write data to a single hard drive (and afraid of what would happen if that drive fails).
This is where a distributed file system is useful.
In a distributed file system, data is
To achieve this, distributed file systems usually have several parts:
A common distributed file system is the Hadoop Distributed File System, HDFS, though there are many others. Amazon S3 (Simple Storage Service) is a distributed file system as a service, letting you send arbitrary objects to be stored on Amazon’s servers and retrieved whenever you want.
Spark can load data from HDFS, from S3, or just ordinary files off your hard drive.
Now suppose we have a distributed file system. How do we divide up computing tasks to be run on multiple machines, loading their input data from a distributed file system?
There are several conceptual models we could use. The first is MapReduce.
MapReduce dates to 2004, when Google engineers published a paper advertising the MapReduce concept as “Simplified Data Processing on Large Clusters”. By 2006 the Hadoop project existed to make an open-source implementation of the MapReduce ideas, and Hadoop soon exploded: it was the single largest buzzword of the early 2010s, being the canonical Big Data system.
Let’s talk about the conceptual ideas before we discuss the implementation details.
MapReduce uses the Map and Reduce operations we’ve discussed before, but with an extra twist: Shuffle. A MapReduce system has many “nodes” – different servers running the software – that are connected to some kind of distributed file system.
The output of all the reduction functions is aggregated into one big list.
The Map and Reduce steps are parallelized: each node processes the Map function simultaneously, and each node processes the Reduce function on each key simultaneously.
The Shuffle step lets us do reductions that aren’t completely associative.
The standard MapReduce example is counting words in a huge set of documents. Stealing from Wikipedia,
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
Here emit (w, 1) means w is the key for the value 1.
That doesn’t really give you a sense of the scale of MapReduce, so let’s consider a more statistical example: parallel K-means clustering.
First, a brief review of K-means. We have n points (in some d-dimensional space) we want to cluster into k clusters. We randomly select k points and use their locations as cluster centers. Then:
The distance calculations are the scaling problem here: each iteration requires calculating the distance between all n points and all k cluster centers.
Q: How could we do K-means as a sequence of MapReduce operations?
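One possible structure, sketched below in the same emit-style pseudocode as the word-count example (the function names here are purely illustrative, and this is not the only way to do it): each Map call assigns a point to its nearest center, each Reduce call averages the points assigned to one center, and the driver repeats this MapReduce pass with the new centers until they stop moving.

## Hypothetical sketch of one K-means iteration as a MapReduce pass.
## `centers` is distributed to every node; `emit` is the pseudocode
## operation from the word-count example above.

def kmeans_map(point, centers):
    ## Key each point by the index of its nearest center; the value
    ## carries the point's coordinates and a count of 1.
    distances = [sum((p - c) ** 2 for p, c in zip(point, center))
                 for center in centers]
    nearest = distances.index(min(distances))
    emit(nearest, (point, 1))

def kmeans_reduce(center_index, values):
    ## Sum the coordinates and counts for one center, then average
    ## to get the center's new location.
    total, count = None, 0
    for point, c in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += c
    emit(center_index, [t / count for t in total])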
Apache Hadoop is an implementation of the MapReduce idea, in the same way that PostgreSQL or SQLite are implementations of the SQL idea. It’s built in Java.
Hadoop handles the details: it splits up the data, assigns data chunks to compute nodes, calls the Map function with the right data chunks, collects together output with the same key, provides output to the Reduce function, and handles all the communication between nodes. You need only write the Map and Reduce functions and let Hadoop do the hard work.
It’s easiest to write map and reduce functions in Java, but Hadoop also provides ways to specify “run this arbitrary program to do the Map”, so you can have your favorite R or Python code do the analysis.
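For instance, in Hadoop's streaming mode the Map step can be any executable that reads records on standard input and writes tab-separated key-value pairs on standard output. A word-count mapper in Python might look roughly like this (a sketch; the matching reducer would read the grouped pairs and sum the counts):

#!/usr/bin/env python
## Streaming-style word-count mapper: Hadoop pipes input lines to stdin
## and collects the "key<TAB>value" pairs printed to stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")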
Hadoop can be used to coordinate clusters of hundreds or thousands of servers running the Hadoop software, to which MapReduce tasks will be distributed.
Hadoop was incredibly popular for quite a while, but it has its limitations.
Apache Spark builds on MapReduce ideas in a more flexible way.
Here’s the idea. Your dataset (a great big data frame, for example) is immutable. There are many ways you can operate upon it, all of which produce a new dataset. For example, you could
and so on. Spark has a bunch of built-in transformation functions, including these and many more.
When you load a dataset into Spark, different parts of it are loaded into memory on each machine in the cluster. When you apply a transformation to the data, each machine transforms its own chunk of the data.
Well, hang on. Each machine doesn’t transform its data; Spark just makes a note of the transformation you made. You can do a whole sequence of transformations and Spark will only start the distributed calculations when you perform an action, like
and so on.
You can chain together these operations: you can map, then filter, then sample, then group, then reduce, with whatever operations you want.
Let’s illustrate with an example taken from the documentation. I’ll work in Python, but there’s an R version of all of this.
First we make a SparkSession – basically a connection to the Spark cluster. This is a bit like connecting to Postgres.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
Now spark is an object representing a connection to Spark. We can ask Spark to create a DataFrame object from a data file; it understands a whole bunch of data formats, including JSON.
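For example, loading a JSON file might look like this (the path here is just a placeholder; the Spark documentation uses a small file of people and ages):

## Load a JSON file into a distributed DataFrame
df = spark.read.json("people.json")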
Now df is a data frame. It's just a table with columns age and name for some people.
Our Python code can do transformations, like filter.
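For instance, a filter on the age column might look like this (a sketch; the exact expression in the original example may differ):

## Keep only the rows where age is over 21; nothing is computed yet
filtered = df.filter(df.age > 21)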
This creates a new DataFrame of the results.
Notice the magic happening here: if the data file were huge, this transformation would automatically be distributed to all the nodes in the cluster. But not now – later, when we try to use filtered, for example by writing filtered.show().
Since every operation returns a new DataFrame, we can keep writing operations on the results.
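For instance, a group-and-count (a sketch reconstructing the example described just below):

## Count how many people have each age; again, nothing is computed yet
counts = df.groupBy("age").count()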
This returns a new DataFrame with columns age and count – a lazy DataFrame, which is only calculated when we ask for it.
You can think of this as a way of building up a MapReduce job. You write the sequence of transformations you need – even for loops and whatever else you want – and at the end, when you request the result, Spark rearranges and distributes the work so it can be calculated even on a massive dataset.
You still have to figure out how to structure your operation as a series of transformations and reductions, but it’s a bigger set of transformations and reductions, and you can compose many small ones together instead of cramming the operation into one big Map and Reduce.
With the release of Spark 2.0, Spark began a major change. Previously, the core data object was the RDD, the Resilient Distributed Dataset, and common operations built RDDs and operated on RDDs. Eventually, the Spark developers decided that they hadn't designed RDDs very well, so they made a new object called a DataFrame that is still distributed and resilient, but is accessed through a different set of functions.
Basically, they decided the original API wasn’t very good and made a new one. The new design lets you query data with SQL, adds fancy optimizers that figure out how to restructure your calculations to make them faster, and is supposed to be generally nicer.
The problem is that there's a lot of Spark code out there, and much of it uses RDDs. So Spark now supports both. You're supposed to use DataFrame now, but RDD code still works, so you can easily be confused by having two versions of everything.
Fortunately, there are methods for converting an RDD to a DataFrame, so you can switch as needed.
Spark has a directory full of examples in Java, Scala, Python, and R. The R examples are a bit limited and the Python examples all still use RDDs instead of DataFrame, which is annoying. But we can see the idea of using Spark anyway.
Let's try doing K-means clustering. Suppose we have a function closestPoint(p, centers) that takes a point p and a list of cluster centers and returns the index of the center closest to p. Then suppose we have a data file where every line is a point, with coordinates separated by spaces.
I’ll simplify a bit from the original example code.
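For completeness, here is roughly what those two helpers could look like (a NumPy-based sketch; it also provides the np used in the loop below):

import numpy as np

def parseVector(line):
    ## Turn "1.2 3.4 5.6" into a NumPy array of coordinates.
    return np.array([float(x) for x in line.split(" ")])

def closestPoint(p, centers):
    ## Index of the cluster center with the smallest squared distance to p.
    return min(range(len(centers)),
               key=lambda i: np.sum((p - centers[i]) ** 2))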
To initialize and read the data, we start with
spark = SparkSession \
    .builder \
    .appName("PythonKMeans") \
    .getOrCreate()

lines = spark.read.text("data.txt").rdd.map(lambda r: r[0])
## parseVector just reads the space-separated coordinates
data = lines.map(parseVector).cache()

K = 10
convergeDist = 0.1
kPoints = data.takeSample(False, K, 1)
Next we have a simple loop. Find the closest center to each point, group together all points for each center, and make the new center location the average of those points. Track how much change happened and use that to check for convergence.
tempDist = 1.0

while tempDist > convergeDist:
    ## Find closest center to each point. Use that as the key; return (point, 1)
    ## tuple attached to that key.
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))

    ## Sum up the points and the counts. That is, take the (point, 1) tuples and
    ## merge them to get a (sum_of_coordinates, num_points) tuple for each key.
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))

    ## Take those (sum_of_coordinates, num_points) tuples and divide the
    ## coordinates by the number of points, to get the new cluster centers.
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    ## Calculate how much the cluster centers have changed.
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    ## Set the new cluster centers.
    for (iK, p) in newPoints:
        kPoints[iK] = p
This is much nicer than using an old-fashioned MapReduce, since you can write code with ordinary loops and such, but you still get the benefits of distributed computing. Actual calculation only occurs once per loop iteration, when we call .collect(), forcing Spark to send out the accumulated calculations and retrieve the results so we can calculate tempDist and see if we've reached convergence.
This would look pretty similar with DataFrame instead of RDDs, since the APIs are quite similar; most changes would be to how the data is loaded (notice the .rdd when we define lines).
DataFrame
You might have noticed that the basic Spark operations, like maps, grouping, and aggregation, sound a lot like the kinds of things you can do in SQL. That’s not a coincidence: the SQL model of operations on tables still holds up after all these years.
As a bonus, Spark understands SQL syntax. If you have a DataFrame called df, you can run SQL operations on it:
## Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
sqlDF is just another DataFrame, and so .show() prints it out.
You can use all the usual SQL functions like count(), avg(), and max(), as well as arithmetic and so on. You can even use JOIN to join multiple data frames together.
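For instance, with a second (hypothetical) purchases table registered the same way, an aggregation and join could look like this:

## purchases is another DataFrame with name and amount columns (hypothetical)
purchases.createOrReplaceTempView("purchases")

spark.sql("""
    SELECT p.name, COUNT(*) AS num_purchases, AVG(pur.amount) AS avg_amount
    FROM people p
    JOIN purchases pur ON p.name = pur.name
    GROUP BY p.name
""").show()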
It's even possible to have Spark behave like a SQL server so that any language that supports SQL can connect and send SQL queries.
I mentioned that Spark supports a whole bunch of data formats – a confusingly large number, actually. You can read text files, CSVs, JSON, Parquet, ORC, Avro, HBase, and various other things with connectors and libraries.
So what should you choose? When you’re scraping your data and preparing it for Spark, what file format is best?
For perfectly ordinary data files, CSVs are fine. But as the data file gets large – tens or hundreds of megabytes – reading the CSV can become slow. To access a single column or a small subset of data you usually have to parse the entire file. On huge datasets that defeats the purpose of having a distributed system.
Parquet is designed as a columnar data store. Instead of storing one row at a time, it stores one column at a time, with metadata indicating where to find each column in the file. It’s very efficient to load a subset of columns. Parquet files can also be partitioned, meaning the data is split into multiple files according to the value of a certain column (e.g. the year of the entry). This is nice for huge datasets, and on a distributed file system lets different chunks exist on different servers.
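In Spark, writing and reading partitioned Parquet looks roughly like this (a sketch; the year column and file names are made up):

## Write df to Parquet, split into one directory per value of the year column
df.write.partitionBy("year").parquet("events.parquet")

## Reading back only needs to touch the requested columns and partitions
events = spark.read.parquet("events.parquet")
events.filter(events.year == 2018).select("name", "age").show()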
Arrow is a companion to Parquet that standardizes the in-memory format, i.e. how the data is laid out in memory once it is loaded. Because this layout is standardized, you can load Parquet data into Arrow and pass it between libraries and programming languages without any conversion.
There is a Python library for Parquet and Arrow, and SparkR has a read.parquet function.
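Outside of Spark, reading a Parquet file into memory with the pyarrow package looks roughly like this (the file name is a placeholder):

import pyarrow.parquet as pq

## Read the Parquet file into an Arrow table, then convert to pandas
table = pq.read_table("events.parquet")
pandas_df = table.to_pandas()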
For enormous datasets, people often use HBase, a columnar database engine that supports splitting data across machines, querying arbitrary subsets of data, and updating and modifying datasets live as new data arrives and systems are querying the database.
An obvious thing to do with Spark is machine learning with large datasets.
Spark has MLlib, a library of machine learning algorithms, built in. It's a bit confusing because it has two versions of everything, one using RDDs and one for DataFrame, but it supports
These operations are distributed as much as is possible.
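For instance, the DataFrame-based K-means looks roughly like this (a sketch, assuming a data frame with numeric columns x and y):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

## MLlib expects the coordinates packed into a single "features" vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
points = assembler.transform(df)

model = KMeans(k=10, seed=1).fit(points)
clustered = model.transform(points)  # adds a "prediction" column of cluster labels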
The MLlib documentation gives examples in Java, Scala, and Python; for R, check the SparkR documentation.
Spark comes with GraphX, a framework for implementing graph algorithms and analyses on top of Spark. The graph data is split up across nodes just like any other dataset, and graph operations can be handled in parallel.
Normal Spark datasets look like data frames. A GraphX graph looks like two data frames:
You don't need GraphX to define a graph like this – you can just make two normal Spark data tables. GraphX just provides extra operators and features, like a single Graph class encapsulating the two tables, operators like inDegrees and numEdges, and operations like mapEdges and mapVertices that apply functions to the graph.
GraphX also provides functions for Pregel, “a system for large-scale graph processing”. Pregel answers the question “How do you implement a graph algorithm like PageRank or graph search in parallel on a huge graph split across multiple machines?”
Pregel’s answer is to imagine every graph algorithm as an iterative algorithm involving multiple steps:
This is easy to distribute: each node’s calculation depends only on its own attributes and the messages delivered to it. After each step, Spark transfers the messages to the right servers, then starts another step.
The Pregel paper gives example algorithms for PageRank, shortest path algorithms, semi-clustering (mixed membership clustering), and bipartite matching. Many other algorithms can be written this way.
Unfortunately, there is no Python or R API for GraphX. Its methods can only be accessed via Scala or Java. Apparently, nobody really uses GraphX; most companies using Spark treat graph data as a pair of data frames and don’t think of graph problems in terms of graph algorithms, just summary statistics and calculations applied to the data frames. So nobody has invested time in Python or R APIs for GraphX.
You can instead use GraphFrames, an extension to Spark that has a Python API (but not R). It may possibly replace GraphX in the future, and has many of its features, including a message-passing API. The documentation has an example implementation of belief propagation using its message passing features.
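A minimal GraphFrames example looks roughly like this (a sketch, assuming the graphframes package is installed and available to your Spark session; the vertices and edges are made up):

from graphframes import GraphFrame

## Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                             # in-degree of each vertex
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()  # PageRank scores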
Most of the time, your analysis is run in batches: you have a big chunk of data, you run your analysis on it. A while later, the data has been updated, so you run everything again.
But what if you have an analysis that needs to be updated live as data arrives, within seconds of it arriving? This is called streaming.
Spark has a system called Structured Streaming for this kind of analysis. (DStreams, or Spark Streaming, were the previous method, to be replaced with Structured Streaming.) It is pretty simple to deal with; here’s an example from the documentation:
## Create DataFrame representing the stream of input lines from connection to
## localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
## Split the lines into words (explode and split come from pyspark.sql.functions)
from pyspark.sql.functions import explode, split

words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)
## Generate running word count
wordCounts = words.groupBy("word").count()
Rather than making a DataFrame from a file, we use readStream to make one from a network connection (to localhost, in this case). We then use ordinary DataFrame operations to get the word counts. wordCounts now represents a streaming operation – no calculation has been done yet, but when we tell Spark to start calculating, it will open the connection and update the calculation periodically as data arrives:
## Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
Streaming essentially treats streaming data as an ordinary DataFrame, just one that's always being added to. Spark is smart enough to avoid recalculating everything when new data arrives; since transformation functions apply to individual rows at a time, it can apply those operations to the new data, then use the results to update the aggregate functions (like groupBy).
In fact, Spark does not ever keep around the entire table in memory – only the output and whatever state is needed to allow it to update the results.
Spark guarantees that this will work even if some servers crash halfway through calculations and parts have to be started over.
The documentation is full of a bunch of buzzwords that can be a bit hard to parse. Here’s a brief glossary for you: