As our software systems have gotten more complex, it’s become more difficult to manage all the pieces. Your package depends on certain versions of certain libraries, and breaks when some of those libraries are updated; it’s built with a certain compiler version and doesn’t work with others; you’ve set various environment variables and modified a bunch of configuration files; notebook servers have to be connected to the Internet on the right ports with the right keys; and so on and so on. You sometimes find that if you have to set up your software on a different computer, it can take days of work just to get everything working right again.
The tech industry – all the companies using The Cloud to run their complicated software – has hit similar problems. Their website uses PostgreSQL with an ElasticSearch index, and user events go through RabbitMQ to Celery workers running tasks with PyPy. Orchestrating all this complexity is hard work. Assigning services (like PostgreSQL and ElasticSearch) to different physical computers is also work: if we suddenly need more ElasticSearch capacity, we need to install it on a bunch more servers quickly, then uninstall it when it’s no longer needed and reuse those servers for something else.
Cloud server providers need to solve all these problems and more. They rent servers to customers. Back in the 2000s, this mostly meant renting out whole physical servers, but that was wasteful; why not let customers share servers and pay for however much RAM and CPU time they want allocated to them? But if customers share servers, each one wants to be able to install their own software without any other customer being able to see or touch it.
There are two related solutions to these problems: virtualization and containers. Virtualization allows us to set up independent virtual machines on one physical computer, with their own operating systems, file systems, and resources (like RAM and CPU time). We can install anything we want on a virtual machine and run many virtual machines on one server. Containers isolate individual software or services with the configuration, libraries, and files they need; multiple containers can run on one computer and are isolated from each other, though they are all managed by the same operating system kernel.
Virtualization and containers are related ideas, so let’s start with virtualization and see how containers differ later.
Suppose you’ve bought a great big expensive server with 32 CPU cores and 128GB of RAM. You’d like to sell your customers pieces of this: a customer could, say, buy access to 2 cores and 4GB of RAM, plus some hard drive space, and use this to install their own software and run their own stuff. But how do you split up a system like this?
Think back to our discussion of computer architecture. Your operating system must interact with the physical hardware in various ways, from getting input from the keyboard to coordinating with the Memory Management Unit to map virtual addresses to physical locations in memory. This all requires executing privileged CPU instructions.
If I try to run two different operating systems simultaneously on one computer, they’ll interfere with each other. One will try to set up the MMU one way and the other another way; they’ll fight over who gets keyboard input and who controls the hard drive.
The solution is virtualization. Virtualization software, like VirtualBox, operates as a host. You can install VirtualBox on Windows, Mac, or Linux – the host operating system – and ask VirtualBox to start up a virtual machine. A virtual machine has its own operating system, the guest operating system, plus a file system and USB ports and everything, and VirtualBox intercepts its privileged instructions – say, pretending to provide a keyboard but only providing its input when the user has opened the VM window, or pretending to allow MMU operations but rewriting them to coordinate between the different operating systems. (Modern CPUs let the host ask the CPU to handle this, making the guest run faster.)
Virtual machine hosts often let you allocate specific amounts of memory and CPU to the guest operating systems, and can control whether they get Internet access and at what speed. The host can even intercept their network traffic and act as a firewall.
Another kind of virtual machine software has no host operating system: Xen, for example, runs directly on the CPU and hosts all the guest systems, managing all the hardware. You don’t install Xen on Windows or Linux – it runs directly on the system.
Since the host controls all access to the hardware, it can access the file systems of the guests. Those file systems can be saved to a file on the host operating system called an image: one big file representing the entire file system contents of the guest. Images can be duplicated and used to start new guests on different computers.
Some virtualization use cases:
You have a Mac but have video games that only run on Windows. You have Windows but want to try out Linux without partitioning your hard drive. You have Linux but want to try out a different version of Linux. With software like VirtualBox, you can run another operating system as a guest, install software on it, and often share files back and forth.
There are websites like OSBoxes with premade images of many operating systems ready for download. Just be careful that someone hasn’t sneaked malicious software into the images they distribute.

If I want to run classic GameBoy games on my laptop, I have a problem: those games were compiled for an ancient 8-bit CPU with an instruction set completely different from the instructions used by modern CPUs. Some virtual machine software, like QEMU, can translate instructions for one CPU to a different CPU, letting you run virtual machines with ancient software or software meant for a different system. Apple, for example, introduced Rosetta so old software written for PowerPC-based Macs could work on Intel-based Macs after the transition in 2006.
Q: What use cases can you imagine for virtual machines in statistics and data science?
We can start to imagine a virtual machine workflow: install VirtualBox, create a new VM, install the operating system plus all the packages and configuration our project needs, and then save the image so it can be copied to any other computer (or handed to collaborators) to get an identical environment.
But we are software nerds. Why should we do manually what we can do with code? And we’re also statisticians – why should we do manually what we can do reproducibly with code?
This is where tools like Vagrant come in. Vagrant’s intended users are software developers working on complicated projects involving lots of dependencies and packages and libraries and configuration. These developers don’t want to have to install all sorts of crud on their computer just to work on the software; they want that part to be automated.
Suppose you install Vagrant on your computer and have your complicated software project in ~/project. You can create ~/project/Vagrantfile and fill it with something like
$configscript = <<SCRIPT
yum install -y important-packages
# other shell commands that set up the VM go here
SCRIPT

Vagrant.configure(2) do |config|
  # choose the base operating system:
  config.vm.box = "centos/7"
  config.vm.hostname = "myhost"
  config.vm.network "private_network", ip: "192.168.50.10"

  # share folders between the VM and the host:
  config.vm.synced_folder "data/", "/data"

  config.vm.provision "shell", inline: $configscript

  config.vm.provider "virtualbox" do |vb|
    vb.memory = "1024"
    vb.cpus = "2"
  end
end
(Example adapted from here. If this looks suspiciously like Ruby, that’s because it’s Ruby.)
Then, at the shell, we can just run vagrant up and a new virtual machine will automatically be created and booted following our specification. Run vagrant ssh and we can SSH into it and do stuff; run vagrant halt and we turn it off.
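Concretely, a session from the project directory might look like this (vagrant destroy is the one command not mentioned above; it deletes the VM, which can always be rebuilt from the Vagrantfile):

cd ~/project
vagrant up       # create and boot the VM, provisioning it on first boot
vagrant ssh      # open a shell inside the running VM
vagrant halt     # shut the VM down
vagrant destroy  # delete the VM; the Vagrantfile lets you recreate it at any time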
If you need other packages or software configuration, add it to the Vagrantfile and build a brand-new VM with the right configuration. The Vagrantfile is, essentially, a reproducible description of the software environment you need.
Instead of distributing large virtual machine images to people, you can just give them the Vagrantfile and let Vagrant create a new VM with all the right software. Working on multiple projects? You can switch between their VMs without worrying about the software conflicting or packages becoming incompatible.
Virtual machines are cool, but they’re sometimes overkill. If we have ten virtual machines running on a computer, that means ten operating system kernels plus the host system: ten kernels each trying to manage their own filesystems (which are then managed by the host), handle their own networking, run cronjobs and system services, and everything.
If I want to run one operating system within a different one, this is unavoidable. I need a VM.
But what if all I want to do is isolate different software packages from each other? To install complex software but not have it conflict with other software? To distribute my setup to different servers and have it run on them regardless of what else is running on them?
This is when containers are useful. Containers take advantage of isolation features provided by operating systems. Linux has several: control groups (cgroups), which limit how much CPU and memory a group of processes may use; namespaces, which give a group of processes their own view of process IDs, users, network interfaces, and mount points; and union (overlay) file systems, which let a process see its own files layered on top of a base file system.
Container software combines all these features together into one easy-to-use system. A program running in a container is still managed by the host operating system kernel, and so it isn’t a full virtual machine, but it is isolated from the rest of the system.
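To see how thin this layer is, here’s a small illustration – nothing Docker-specific, just the kernel features themselves – using the standard Linux unshare tool to start a shell in its own process-ID namespace:

sudo unshare --pid --fork --mount-proc /bin/bash   # start a shell in a new PID namespace with its own /proc
ps aux                                             # inside, only this shell (as PID 1) and ps are visible
exit                                               # leave the namespace

Container software does this sort of thing for every resource at once, plus the file system and networking.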
Note one consequence of this: the container is managed by the host operating system, so you can’t run a Linux container on Windows, for example. You couldn’t make a container that runs on the department Linux servers and then use it unmodified on your Mac or Windows laptops. (Microsoft is working on making this possible on Windows.) You could use a virtual machine to run the containers, though that starts to become painful.
The most popular container software is Docker, which is open source.
Rather like Vagrant’s Vagrantfile, Docker has the Dockerfile. Here’s an example Dockerfile from the official documentation:
# Use an official Python runtime as a parent image
FROM python:3.7-slim
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variable
ENV NAME World
# Run app.py when the container launches
CMD ["python", "app.py"]
(If this doesn’t look like Ruby, that’s because it’s not Ruby.)
Suppose we have a Python file named app.py in the same directory as the Dockerfile, as well as requirements.txt. (You should be using requirements.txt for your Python projects even if you don’t use Docker; check out the documentation for examples.)
Next we run a command like docker build -t myapp . (where myapp is whatever name we choose for the image). That builds a Docker image, an overlay (union) file system recording the changes each step made on top of the parent image. All the Python packages were installed by the RUN step and all the files in the current directory were copied into the image’s /app directory.
If we then run docker run myapp, the container starts up and app.py runs. The output of app.py is printed to our console.
Suppose app.py starts up a webserver, like a Jupyter Notebook. That webserver is isolated inside the container, so we can’t access it – unless we use a command like docker run -p 4000:80 myapp to tell Docker to forward localhost:4000 on the host to port 80 in the container. Then we can access the notebook at http://localhost:4000.
Similarly, we can use a command like docker run -v ~/data/bigdata.txt:/data/bigdata.txt myapp to share ~/data/bigdata.txt with the Docker container as a volume (the path after the colon is where it appears inside the container). You can share individual files or whole directories.
Warning: In the Dockerfile, commands run with CMD run as root inside the container. But the container is completely isolated from the host, so that’s no problem, right? Wrong-o, buffalo chips breath! A process running as root inside the container has access to any files in the shared volumes, regardless of their permissions; if you share too large a volume, or there are other programs inside the volume, the container may be able to use them to do bad things. This matters if your container is exposed to the Internet or processes untrusted data from users. Best practice is to create a new unprivileged user inside the Dockerfile (with a RUN useradd step, for example) and use the USER option to run scripts in the container as that user.
Anyway, in the above Dockerfile we saw a FROM command declaring a “parent” image – you can build images atop other images. Docker Hub has a registry of public images you can use, including basic Linux distributions like Debian and images with specific software preinstalled. An instructive example is to view the Dockerfile of an image on the Hub, like this R image.
For a tech company or anyone building a computing infrastructure, containers are a great way to deploy services. For each service you can specify exactly what files it needs access to, how it should be set up, and how it should be run. When you need to add new Hadoop nodes or set up a new webserver to replace one whose hard disk crashed, just run Docker to build the container and bam! the service is running.
Cloud computing services often build tools to make it easy to use containers in the cloud. Amazon has the Elastic Container Service that allows you to make your own registry of Docker images and then automatically start new servers running different containers. DigitalOcean lets you upload a Dockerfile and get a running server in a minute or two. Kubernetes is “container orchestration” software that you can install on your own server cluster to make it easy to automatically run containers, control how many are running, decide how much RAM and CPU they get, and monitor how all the containers are running.
One interesting service is Travis CI. CI stands for Continuous Integration, the idea that as you develop a software project, the tests should run automatically at every commit, every pull request should be tested automatically, and developers should see reports on the results. Travis is a hosted CI service – free for open source projects and paid for commercial projects – that lets you provide a Docker image with the setup your software needs, then runs your tests inside that image and reports the results.
Scientists and statisticians can benefit from containers in several ways.
In the Dockerfile we specify all the dependencies and packages needed by our analysis. We can also specify their versions if we want, installing the specific versions of the R and Python packages we used on our own computer. We can distribute the Dockerfile to other people who want to replicate our analysis, so they don’t struggle to run our code.
Reproducibility in particular can be achieved with containers and several useful tools. Python’s pip, for example, lets you specify version numbers in requirements.txt; the pip freeze command writes out the exact versions currently installed, which helps ensure that anyone else who runs your code uses the same version of everything.
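For example, a minimal version of that workflow is:

pip freeze > requirements.txt     # record the exact versions installed in the current environment
pip install -r requirements.txt   # later, or on another machine (or in a Dockerfile): install those same versions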
R doesn’t have something similar, but Microsoft does host the CRAN Time Machine, and rocker archives Docker images with specific versions of R. You can put these together to get a reproducible R setup in a Dockerfile:
FROM rocker/r-ver:3.4.4
RUN mkdir /home/analysis
RUN R -e "options(repos = \
list(CRAN = 'https://mran.revolutionanalytics.com/snapshot/2019-01-06/')); \
install.packages('tidystringdist')"
COPY myscript.R /home/analysis/myscript.R
CMD Rscript /home/analysis/myscript.R
(Be sure to add a Docker volume if you want a way for the container to save output that remains on the host after the container is shut down.)
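For instance – with r-analysis as a stand-in image name, and assuming myscript.R writes its results into /home/analysis/results – you might build and run this image with a volume like so:

docker build -t r-analysis .
docker run -v "$PWD/results":/home/analysis/results r-analysis   # anything written to /home/analysis/results lands in ./results on the host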
For Jupyter users, there are automated tools to build Docker images for your notebooks: Binder will scan your repository for requirements.txt and other files declaring dependencies, build a Docker image, and deploy it to a JupyterHub server so other people can use the notebook.