(Based on notes originally developed by Chris Genovese.)
Moore’s Law roughly captures the rapid growth in processing power, especially in the increasingly dense packing of transistors on a single chip.
But with current technologies, there are decreasing returns from increasing this density.
And as researchers, we are always pushing the edge of what is feasible with whatever technology we are using: larger problems, bigger data sets, more intensive methods.
The result is that we need more power.
As we push the limitations of a single processor, it makes sense to start thinking about how to use multiple processors effectively, especially as ordinary computers – even phones and tablets – routinely ship with multiple processors that can work in parallel.
Indeed, there is a significant current trend toward distributed computing: computation run on a network of processors that may be in different locations and may share various resources.
There are two related but distinct concepts:
Concurrency is common in applications like web servers, which must be able to deal with thousands of people trying to visit a website at the same time. Even if the server only has one processor and can only run one set of instructions at a time, it can handle the requests concurrently: for example, while waiting for a file to be read from the hard drive to serve to one user, it might do some processing for several other users, returning to serve the first user once the file is ready.
Parallelism involves having several processors or CPU cores doing work simultaneously.
There are clear benefits to exploiting concurrency and parallelism in our algorithms and computations:
But concurrency and parallelism come with serious challenges:
The complexity is a particular problem, often leading to infuriating Heisenbugs: code that changes its behavior depending on the order in which processes or tasks happen to run, so that even minor or apparently unrelated changes can drastically alter what it does. It can be quite easy to accidentally write incorrect code.
Concurrency involves managing multiple tasks or processes that share resources, like data in memory or files on the hard drive. The processes may run successively, asynchronously, or simultaneously.
The challenge of concurrent programming centers on controlling access to the shared resources. This can get complicated: concurrent execution of even simple tasks can lead to nondeterminism.
Operating systems have some basic features designed to allow for concurrency, such as processes and threads.
Even with these features, there are three main concurrency challenges:
In early computers, there was just one set of instructions running on the processor – one chunk of memory defining what the processor should do. Those instructions had to handle everything. Early operating systems, like DOS, did very little: DOS only allows one program to run at a time, and provides very basic services (like access to the hard drive) to that program.
Modern operating systems are much more sophisticated. The fundamental unit of abstraction is the process. You can think of a process as a set of instructions – a computer program – combined with the current activity of that process, including things like the data it stores in memory and the files it currently has open. Processes are usually isolated from each other: without special privileges from the operating system, they cannot access each other’s memory or interfere with the execution of each other.
Modern multitasking operating systems allow users to start multiple processes at the same time, and you likely have hundreds running on your computer right now. Processes may run in parallel if there are several processors, but they are always concurrent.
Beneath the process is the thread. Sometimes, a single program may want to run several things simultaneously, all having access to its memory and resources. You can hence create multiple threads inside a single process, all sharing the process's memory. For example, your web browser may have a thread to respond to user input (typing, clicks on menus and buttons, and so on) and separate threads for tasks such as decoding compressed images or parsing HTML, so the browser does not appear unresponsive when viewing a large image or webpage.
Operations that can be executed properly as part of multiple concurrent threads are said to be thread safe. Beware the use of non-thread-safe operations in a concurrent context.
Note that processes and threads are not free. Spawning them (yes, that’s the term) requires your program to make a request to the operating system to create a new process or thread, and it takes some time for the operating system to do the necessary bookkeeping and fulfill the request. Creating new threads every time you perform a matrix operation, say, could be quite slow; concurrency and parallelism frameworks often create a thread pool of pre-made threads ready to run whatever tasks are needed.
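For instance, Python's concurrent.futures module in the standard library provides ready-made thread pools. A minimal sketch, with a placeholder analyze function and arbitrary chunks of data:

# The pool's worker threads are created once, up front; map() hands tasks
# to whichever worker is free, so the thread-creation cost is paid only once.
from concurrent.futures import ThreadPoolExecutor

def analyze(chunk):                    # placeholder for real per-chunk work
    return sum(x * x for x in chunk)

chunks = [range(1000), range(1000, 2000), range(2000, 3000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, chunks))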
Q: How does a single-processor computer seemingly execute multiple processes and threads at the same time?
For instance, you are running your data analysis while reading Twitter in your web browser.
Basic strategy: Slice time into small intervals and allow each process to run exclusively during any single interval. If the interval is small enough, then our perception is that they are running simultaneously. Your operating system does this automatically, dozens or hundreds of times each second.
When a process is operating during an interval it is said to be running; when it loses that privilege it is said to be blocking. We sometimes say things like “reading a file is a blocking operation”, meaning that execution of the process is suspended (and other programs run) until the hard drive fetches the requested file and has it ready.
To switch processes or threads requires a context switch that saves the current register values and instruction state (i.e. which instruction is next to run) to main memory. This is handled by the operating system.
Context switching is expensive, since it requires main memory access, and we have to make sure that every process gets the time it needs to run. Operating system designers spend a great deal of effort on scheduling algorithms to make sure that, say, your web browser can respond immediately when you click the Retweet button, but your data analysis still gets plenty of time to run in the background.
Concurrency need not involve parallelism: there may be just one processor doing many tasks concurrently. It’s obviously helpful for a machine shared between many processes, so they all appear to run. But is there any reason to make one process concurrent? Would it make it faster?
Let’s think of a few examples.
Q: How could concurrency without parallelism make these faster or more efficient?
One key feature that helps solve these concurrency problems is atomic operations. Here, “atomic” means indivisible: either the entire operation completes, or none of it does – it is impossible for only part of the operation to take effect.
In the deadlock example discussed below, if each process could acquire input.txt and output.txt atomically, deadlock would be impossible: the operating system would never grant a process access to one file but not the other.
But how do we get atomic operations? They are provided in several ways.
Computer processors frequently support atomic instructions. x86 has a fetch-and-add instruction, for example, that reads a number and adds a value to it atomically; this would solve the problems with Counter in the previous section.
But how do we solve the problem in general, without an atomic CPU instruction for everything we need to do?
One common atomic instruction is test-and-set, which tells the processor to read the value currently stored at a memory location, set that location to 1, and return the value that was read – all as a single indivisible operation.
Crucially, only one test-and-set can operate on a memory location at a time – even if several processors try to run test-and-set on the same memory location at the same time, they will coordinate to ensure that one is run before the other, so the sequence is uninterrupted.
How is this useful? We can use it to implement a lock, a way for two threads or processes to cooperate by declaring who has access to a resource when.
Consider another example with threads A and B:
Data = [10.2, 11.4, -17, 1.0, 0.0, 7, ...]
Lock = 0

A: Analyze Data[1:100]
   While test_and_set(Lock) == 1, wait
   Write results to file foo.out
   Set Lock = 0

B: Analyze Data[101:200]
   While test_and_set(Lock) == 1, wait
   Write results to file foo.out
   Set Lock = 0
Each thread can analyze its half of the data simultaneously with the other or in any order, but as soon as one thread begins to write to foo.out, the other is forced to wait until the first thread releases the lock by setting Lock to 0.
This type of lock is called a spinlock, because the thread stuck waiting for the other to complete simply checks the lock in a loop repeatedly (“spins”) until the lock is released. This can be inefficient if the operating system keeps running this thread, wasting processor time checking on a lock that hasn’t yet changed. More advanced locking mechanisms coordinate with the operating system to inform it that the thread is waiting on a lock to be released, so it can switch to executing another thread that has real work to do.
You won’t usually write your own locks with test-and-set. (There are other atomic instructions, like compare-and-swap, that are also used.) Your programming language libraries usually will provide locking mechanisms.
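For example, Python's threading module provides a Lock. A minimal sketch, using a shared counter as the protected resource (the counter and loop sizes are arbitrary):

import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100_000):
        with lock:               # acquire the lock; it is released automatically
            counter += 1         # the read-modify-write is now protected

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # always 200000; without the lock, updates could be lost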
Returning briefly to thread safety: you may have heard that Python and R are not thread-safe. What does this mean? The internal data structures – the variables where the R and Python interpreters keep track of your variables, handle memory allocation, and so on – are not protected with locks or atomic operations. If you run two Python threads at the same time, they can experience race conditions in these internal structures and behave unpredictably; hence Python has a Global Interpreter Lock (GIL) that locks the entire interpreter, so only one thread can run at a time. The multiprocessing library gets around this by creating separate processes, which do not share memory and data structures.
Q: Python has a threading module for creating multiple threads. If they can’t run simultaneously because of the GIL, how are they useful? Think of an example case.
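One example case, sketched with a few arbitrary URLs: CPython releases the GIL while a thread is blocked on network or disk I/O, so several downloads can be in flight at once even though only one thread runs Python code at a time.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [                                   # any list of pages will do
    "https://www.python.org/",
    "https://www.r-project.org/",
    "https://www.wikipedia.org/",
]

def fetch(url):
    with urlopen(url, timeout=10) as response:   # GIL is released while waiting
        return len(response.read())

with ThreadPoolExecutor(max_workers=3) as pool:
    sizes = list(pool.map(fetch, urls))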
Deadlock occurs when multiple processes block waiting for resources that will never be released.
Example: Processes A and B need mutually exclusive access to both input.txt and output.txt, and block until they have them. Because of concurrent execution, A gets access to input.txt but, before it can acquire output.txt, B gets access to output.txt first. Both processes then block forever.
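A minimal sketch of the same pattern in Python, using two threading locks to stand in for exclusive access to the two files:

import threading

lock_input = threading.Lock()        # stands in for access to input.txt
lock_output = threading.Lock()       # stands in for access to output.txt

def process_a():
    with lock_input:
        with lock_output:            # waits forever if B already holds it
            pass                     # ...read input.txt, write output.txt...

def process_b():
    with lock_output:
        with lock_input:             # waits forever if A already holds it
            pass

# Acquiring the locks in the same order in both threads (or acquiring both
# atomically) removes the possibility of deadlock.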
This is illustrated by the famous Dining Philosophers problem: several philosophers sit around a circular table, with a fork on the table between each pair of neighbors. The philosophers randomly alternate between thinking and eating spaghetti, but they will only eat if they hold a fork in each hand. When they are done eating, they lay down their forks; when they are done thinking, they randomly look left and right, picking up a fork if it is available.
Q: Can the philosophers all starve?
Concurrent systems involving locks and shared resources usually require quite careful design to avoid deadlock; tools like TLA+ exist to allow formal proofs of correctness of concurrent algorithms.
One common problem: you would like to have several threads work concurrently. They need to share and modify a data structure to do so. How do you prevent race conditions? Threads share address space, so they have access to the same memory, but accessing the data structure without care can easily cause problems.
Example: Your algorithm involves traversing a large graph using a priority queue. If the priority queue could be shared between threads, multiple threads could process nodes concurrently. This might be useful if every node requires a lot of computation to process.
Option 1: Use one big lock to control access to the data structure, preventing two threads from modifying it at the same time. For example, put locks around enqueue and dequeue so they may only be called from one thread at a time.
Problem: If threads use this data structure a lot, they may spend much of their time waiting for the lock to open.
Option 2: Use a concurrent priority queue. Instead of a single lock, such a queue uses atomic instructions for critical portions of the code, to ensure no thread ever has to wait for a lock. Atomic operations are slower than normal operations, but avoid accidental deadlock and can be faster than naively locking everything.
Many programming languages have concurrent data structure libraries. Python’s queue module, in the standard library, uses locks to share queues between threads; libcds has many lock-free data structures for C++; Java has a collection of concurrent collections; and so on.
A common communication system between threads is message passing, in which each thread has its own queue that other threads can write messages to. Each thread periodically dequeues messages from its queue to process.
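A minimal sketch of message passing in Python, using the thread-safe queue.Queue from the standard library as a worker's inbox:

import queue
import threading

inbox = queue.Queue()

def worker():
    while True:
        message = inbox.get()        # blocks until a message arrives
        if message is None:          # a conventional "stop" message
            break
        print("processing", message)

t = threading.Thread(target=worker)
t.start()

for i in range(5):
    inbox.put(("task", i))           # other threads communicate only via the queue
inbox.put(None)                      # tell the worker to stop
t.join()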
Computer architectures have introduced parallelism at several levels:
Bit-level Parallelism
What’s the advantage of 64-bit architecture over 32-bit architecture?
One answer: faster basic computations through bit-level parallelism, e.g. faster calculations with 64-bit integers and floats (also: addressable memory increases past 4 GB).
Instruction-level Parallelism
Modern CPUs use techniques like pipelining, out-of-order execution, and speculative execution, so they can work on some instructions while waiting for others to finish.
This is mostly invisible to the programmer at higher levels, unless you’re a computer security researcher. You don’t need to do anything to take advantage of this, since your CPU does it automatically, and you have very little control over it.
Data Parallelism
A data-parallel architecture (sometimes called Single Instruction, Multiple Data, or SIMD) can perform operations on a large quantity of data in parallel.
Example: Adding arrays elementwise. Modern x86 processors with SSE have instructions to add multiple pairs of numbers. High-level languages like R and Python usually don’t provide ways to use these instructions directly, since you have little control over what CPU instructions are used to run your code.
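In practice, you usually benefit from data parallelism indirectly, through compiled libraries. For example, NumPy's elementwise operations run in compiled loops that can use vector instructions like SSE when the hardware supports them:

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

c = a + b        # one vectorized call instead of a Python-level loop over elements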
Task Parallelism
Task parallelism divides the computational work of a task into components that are performed simultaneously.
Important cases:
In addition, data and processing capabilities are increasingly distributed, meaning they live and run on many different computers, possibly in different locations. This raises a variety of challenging problems in data storage and management.
(cf. Flynn’s taxonomy: single (S) or multiple (M) instruction (I) or data (D) streams, yielding SISD, SIMD, MISD, and MIMD.)
With all these types and levels of parallelism, the first question to ask is: what kinds of problems are suitable for parallelization?
The simplest case is problems that are embarrassingly parallel: problems featuring many tasks that can be trivially run simultaneously. Examples:
Q: Can you think of other examples?
Many of these problems are maps, in the functional programming sense: they involve having many chunks of data and one operation that must be applied (mapped) to each chunk separately. The results are not interdependent and order does not matter.
Programming languages often provide support for this pattern, since it’s simple to handle with multiple threads or processes each running the same code:
Python's multiprocessing automatically starts multiple Python processes, splits up the data, and runs the function on different chunks in different processes.
R's parallel package provides parallel apply functions, such as parLapply. furrr provides parallel versions of purrr's map functions.
Packages like R's foreach can turn for loops into parallel loops run on multiple processes/cores.
OpenMP (in C, C++, and Fortran) turns for loops into parallel loops running on multiple threads simultaneously.
The R and Python packages also support remote processes, meaning the code and data can be sent to several machines to run on all of them in parallel.
Note a caveat: as mentioned above, spawning threads or processes has a cost. Parallelizing an operation that already takes only a few milliseconds likely won’t make it appreciably faster. If you do ten million of those operations, the parallelism will only be worthwhile if you create the threads once, not ten million times. This is why Python’s multiprocessing supports creating a “pool” of processes and then using it repeatedly for different tasks.
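A minimal sketch of that pool pattern, with a placeholder simulate function standing in for any independent per-item work:

from multiprocessing import Pool

def simulate(seed):
    return seed * seed               # placeholder: e.g. one independent simulation

if __name__ == "__main__":
    with Pool(processes=4) as pool:                      # pay the startup cost once
        results = pool.map(simulate, range(100))
        more = pool.map(simulate, range(100, 200))       # reuse the same pool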
Not every task is embarrassingly parallel, and even when some parts of your code are easily parallelized, other parts may not have obvious parallel versions.
One common departure from embarrassing parallelism is when you want to combine results somehow. Transforming every element in an array is easily parallelized, but adding every element is not: the sum of the first 1000 elements depends on the sum of the first 500, and so on. Similarly, the result of the 47th iteration of an iterative algorithm depends on the 46 previous steps.
Functional programming again saves the day. Some of these problems can be written as reductions, and if the reduction operation is associative, then steps can be run in parallel.
Consider taking the sum of a large array. Addition is associative, so we can sum the first 100 elements on one CPU and the second 100 on another CPU, then add the two final results. Associativity means it does not matter how we split up the elements or which order we add them – our result will be the same.
You can implement this manually by splitting your data into chunks and giving each chunk to a separate process or thread.
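A minimal sketch of this manual approach in Python, summing chunks in separate processes and then combining the partial sums:

from multiprocessing import Pool

def chunk_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = 4
    chunks = [data[i::n] for i in range(n)]      # any split works, since + is associative

    with Pool(processes=n) as pool:
        partial_sums = pool.map(chunk_sum, chunks)

    total = sum(partial_sums)                    # combine the partial results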
Alternately, OpenMP automatically supports parallel reductions, in limited forms. If each iteration of a for loop updates a variable with an operator OpenMP knows is associative (like + or *), OpenMP can automatically transform the loop into a parallel reduction.
MapReduce – implemented by Hadoop, among others – is a framework for splitting up a large dataset, applying a map function to each element, and then applying a reduction, with all operations automatically parallelized to multiple cores or even multiple machines. It automatically distributes the needed data files and ensures that if one computer crashes before it finishes its part of the work, that part is finished by another machine.
Sometimes one adds “keys”, so that all data with the same key must be reduced on the same core.
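A toy sketch of the pattern in plain Python (no Hadoop), counting words with a map step, a group-by-key step, and a reduce step:

from collections import defaultdict

documents = ["the cat sat", "the dog sat"]        # placeholder data

# Map: emit (key, value) pairs from each input element.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key; in a real framework, all pairs with the same
# key end up at the same reducer, possibly on another core or machine.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: combine each group's values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)        # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}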
Examples of parallel algorithms using reductions:
Amdahl’s Law is useful to keep in mind: parallelizing your program can only speed up the parallel portions, and won’t make your program any faster than the slowest non-parallel parts.
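One way to quantify this: if a fraction p of the running time can be parallelized across n workers, the overall speedup is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p) no matter how large n gets. A small illustration:

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.90, 8))        # about 4.7x with 8 workers
print(amdahl_speedup(0.90, 1000))     # still under 10x: the serial 10% dominates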