How to Connect R + EMR in 6 Short Steps w/ Segue

Want the power of your own Hadoop cluster but lack the budget, time, human resources, and/or capital? No problem!

Segue makes using Amazon's Elastic MapReduce (EMR) in R not only possible, but downright simple. The best thing about this Big Data trick is that it brings you scalability without the out-of-pocket or opportunity costs normally associated with setting up a Hadoop cluster, thanks to the elastic nature of EMR. The term "elastic" emphasizes that Amazon Web Services (AWS) extended the concept of scalability to their pricing model: clusters of virtual machines can be spun up and down at will, using whatever amount of RAM and compute power you deem necessary. As soon as you're finished with your virtual cluster, the billing clock stops.

It gets better. AWS provides a free tier of service for the community, so you can learn what you're doing and test out EMR on your use case without having to worry about money (last time I checked it was about 750 free machine hours per month; check for changes). I've boiled down a 30-45 minute presentation that I used to give on the topic to the following 6 simple steps:

1. Download the tar file from Google Code

2. Install it (can be done in bash or R):
R CMD build segue
R CMD INSTALL segue_0.05.tar.gz
or, from within R:
install.packages("segue_0.05.tar.gz", repos = NULL, type = "source")
3. Load the package in R:
library(segue)
or require(segue), depending on where you stand in the library() vs. require() debate
4. Enter your AWS credentials (your access key ID and secret access key, not your console login):
setCredentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
5. Create your cluster!
emr.handle <- createCluster(numInstances=6)
6. Run your code using emrlapply() instead of regular lapply()
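To make step 6 concrete, here is a minimal sketch of the workflow using the classic embarrassingly parallel demo: estimating pi by Monte Carlo. The helper name estimatePi and the seed list are illustrative, not part of Segue; the emrlapply() call is shown commented out because it needs a live cluster handle from createCluster().

```r
# Hypothetical worker function: each call is fully independent,
# which is exactly the shape of problem emrlapply() wants.
estimatePi <- function(seed) {
  set.seed(seed)
  n <- 1e5
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)  # fraction of points inside the quarter circle
}

seeds <- as.list(1:8)

# Locally, the work runs through plain lapply():
local <- lapply(seeds, estimatePi)
mean(unlist(local))  # roughly 3.14

# On EMR, the same call swaps in emrlapply() with the cluster handle:
# results <- emrlapply(emr.handle, seeds, estimatePi)
# When you are done, shut the billing clock off:
# stopCluster(emr.handle)
```

Because emrlapply() mirrors lapply()'s interface, you can develop and debug locally with lapply() and switch to the cluster only when the job is too big for your laptop.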

Pretty cool, huh? Just remember that the performance improvements you experience when using this or any other HPC/cluster solution depend heavily on how well your code lends itself to parallelism, which is why we must always remember to write our Big Data scripts so that they are massively parallel.
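As a quick illustration of what "lends itself to parallelism" means in practice (the examples below are mine, not from Segue's docs): per-element work with no shared state splits cleanly across workers, while a computation where each step depends on the previous one does not.

```r
# Independent per-element tasks: each i^2 can be computed on a
# different worker, so this maps directly onto emrlapply().
squares <- lapply(1:10, function(i) i^2)

# A running total is inherently sequential: step k needs the result
# of step k-1, so it cannot be split across workers the same way.
running_total <- Reduce(`+`, 1:10, accumulate = TRUE)
```

When you structure a job as an lapply() over a list of self-contained tasks, moving it to the cluster is a one-line change; when the iterations are chained, no cluster will save you.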