There are several kinds of computing resources available for completing your projects:
Amazon Web Services (AWS) offers cloud computing services that you can rent. There are a bunch of products, but the basic ones are the Elastic Compute Cloud (EC2: rent as many servers as you need) and the Simple Storage Service (S3: rent as much storage as you need). EC2 includes general-purpose servers (the T3 instances) as well as ones with powerful GPUs (the P3 instances); check the EC2 instance types page for details.
The class has a shared Amazon EMR cluster with Spark installed; see below.
There is a free tier of services you can get when starting out; use it while you figure out how to get things installed and set up. When you need more computing power, the AWS Educate program offers $40 in AWS credit to students.
Note that AWS bills for services by the hour. If you work on your project a few days a week, you can dramatically reduce your bill just by shutting off your servers while you’re not using them. I recommend starting small until you’ve gotten your code working, and only then scaling up to many machines.
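If you prefer the command line to the AWS web console, stopping and starting instances can be scripted with the AWS CLI; a minimal sketch (the instance ID is a placeholder):

# Stop a running EC2 instance; stopped instances don't accrue hourly compute charges
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# Start it again when you're ready to work
aws ec2 start-instances --instance-ids i-0123456789abcdef0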
If you’re likely to use up the $40 credit, ask well in advance about getting more resources; we can likely cover some additional expenses, but only if you ask ahead of time, and definitely before you spend any of your own money.
I have set up a shared Amazon EMR cluster for everyone using Spark in this course.
These instructions will expand as I learn more things about Spark.
I need your SSH public key to grant you access. If you have used SSH before, look in the ~/.ssh hidden directory for an id_rsa.pub or id_ed25519.pub file. The contents should look something like this:
ssh-ed25519 LOTSOFAPPARENTLYRANDOMCHARACTERS you@yourcomputer
Paste the entire thing into an email and send it to me.
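To display the key so you can copy it, something like this works (assuming an ed25519 key; use id_rsa.pub if that’s what you have):

# Print the public key so you can copy and paste it
cat ~/.ssh/id_ed25519.pub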
If you have no SSH public key, follow these instructions for generating one (using your GitHub email address isn’t important for us, though) and send me the public key as above. (You don’t need to follow their ssh-agent steps.)
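The key-generation step amounts to a command along these lines (the email address is a placeholder; it only labels the key):

# Generate a new ed25519 key pair; accept the default file location when prompted
ssh-keygen -t ed25519 -C "you@example.com"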
Do not send me any file beginning with -----BEGIN OPENSSH PRIVATE KEY-----. It says PRIVATE KEY for a reason.
I will then set up your account with remote access and send you the name of the server you need to SSH into.
Once your SSH access is set up, you can use the server like any SSH server. Use scp to copy files, or configure your text editor to use SFTP to edit files directly on the server.
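For example, copying a data file up to your home directory on the server looks something like this (the username and hostname are placeholders; use the ones I send you):

# Copy a local file to your home directory on the login server
scp somedatafile.txt yourusername@server.example.com: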
The server you log in to is separate from the servers doing the calculation. That means that if you have somedatafile.txt in your home directory on the server, you will not be able to load it in Spark with spark.read.text("somedatafile.txt"). The servers run the Hadoop Distributed File System, and so you need to add the file to HDFS. You can run the command shown in the example below to copy the file to your HDFS user directory. There are HDFS commands corresponding to the familiar commands like cp, mv, ls, rm, and so on; run hadoop fs to be shown a list of them all.
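A sketch of the copy step, using hadoop fs -put (relative paths in HDFS resolve to your HDFS home directory, /user/yourusername):

# Copy somedatafile.txt from the login server's filesystem into your HDFS home directory
hadoop fs -put somedatafile.txt .
# Check that it arrived
hadoop fs -ls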
The EMR servers have Python 2.7 and R 3.4 installed. You can run ordinary Python or R with the usual commands, or you can run the Spark-integrated versions:
sparkR # R with the SparkR package auto-loaded and connected
pyspark # Python with pyspark loaded and connected
Both automatically connect to the Spark server and store the connection in the spark variable for you to use.
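For instance, in the pyspark console you can immediately read the file you copied into HDFS earlier (the file name is the one from the HDFS example above):

# spark is pre-defined in the pyspark console; relative paths are read from your HDFS home directory
df = spark.read.text("somedatafile.txt")
df.show(5)          # peek at the first few lines
print(df.count())   # number of lines in the file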
GraphFrames works as usual:
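Here is a minimal sketch, assuming you started pyspark with the GraphFrames package loaded (see below) and using made-up toy data:

from graphframes import GraphFrame

# Toy data: vertices need an "id" column, edges need "src" and "dst" columns
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.degrees.show()   # degree of each vertex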
Note that there is not a convenient way to load GraphFrames in the EMR Notebooks feature; the notebooks don’t provide any way to load Spark packages. You may have to stick with the pyspark console or the spark-submit command, passing the --packages argument each time.
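That looks something like this (the GraphFrames version string is only a guess; it has to match the cluster’s Spark and Scala versions, so check what resolves on our cluster):

# Start the console with the GraphFrames Spark package loaded
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
# Or run your own script (myscript.py is a placeholder) the same way
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 myscript.py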
Installing Python or R packages poses problems: computation is distributed to all nodes in the cluster, so all of them need your packages installed. Unfortunately there is no easy way to do this automatically. You can send individual Python files and Zip files of Python files, so if you have some_module.py and want to import some_module in your script, use sc.addPyFile("some_module.py") to have it distributed to the worker nodes.
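A sketch of that pattern in the pyspark console (some_module and its transform function are hypothetical; sc is the SparkContext, already defined in the console):

# Ship some_module.py to the driver's path and to every worker node
sc.addPyFile("some_module.py")
import some_module

# Functions from the module can now be used inside distributed operations
rdd = sc.parallelize(range(10))
print(rdd.map(some_module.transform).collect())   # transform is a hypothetical function in some_module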
You may want to separate your Spark code from the code that uses other packages, if possible: one file runs the Spark data processing, and a separate file uses the other packages to work with the processed data.