Fun with Neutrix Cloud – Running Spark Tasks on Amazon Elastic MapReduce with NFS Backend
Amazon Web Services has done a great job of simplifying the provisioning, scaling, and deprovisioning of Hadoop and Hadoop-like environments with its Elastic MapReduce (EMR) service. A fully configured cluster with all required components can be brought online with a single command (or a few mouse clicks in the console) and used to process data right away. The EMR provisioning process allows you to specify bootstrap actions that run on every new cluster instance. With this mechanism, you can easily mount a Neutrix NFS file system on every EMR cluster node.
A typical bootstrap action for this would be:
$ cat neutrix-emr-bootstrap.sh
set -e
sudo yum -y -q update
sudo /usr/bin/yum -y -q install nfs-utils rpcbind
sudo /bin/mkdir -p /mnt/neutrix
sudo /bin/mount $NFSIP:$NFSEXPORT /mnt/neutrix
This script has to be stored in AWS S3 so that it can be retrieved by every EMR cluster node:
$ aws s3 cp neutrix-emr-bootstrap.sh s3://neutrix-emr-config/
upload: ./neutrix-emr-bootstrap.sh to s3://neutrix-emr-config/neutrix-emr-bootstrap.sh
To launch an EMR cluster, specify a path to this bootstrap action script:
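A launch command along these lines would do the job; the release label, instance type, instance count, and key-pair name below are placeholders to adapt to your environment:

```shell
$ aws emr create-cluster \
    --name "Neutrix EMR Cluster" \
    --release-label emr-5.20.0 \
    --applications Name=Hadoop Name=Spark Name=Hive Name=Pig \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair \
    --bootstrap-actions Path=s3://neutrix-emr-config/neutrix-emr-bootstrap.sh
```

The `--bootstrap-actions` flag points at the script uploaded to S3 above, so every node mounts the Neutrix NFS export before any Hadoop services start.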
This command brings up a fully configured cluster with Hadoop, Spark, Hive, and Pig, using a shared Neutrix Cloud NFS file system as the backend storage.
Once this is completed, a user may submit a Spark task (“step”) to the cluster, using Neutrix Cloud to store the input, the output, and the application itself:
$ aws emr add-steps --cluster-id j-3HB834POEIRB \
    --steps Type=Spark,Name="Neutrix Spark Program",Args="[--deploy-mode,cluster,--class,FlightSample,file:///mnt/neutrix/flightdata/flight-project_2.10-1.0.jar,file:///mnt/neutrix/flightdata/output]"
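Step progress can be tracked from the CLI as well, and since the output lands on the shared NFS mount, the results can be inspected directly from any node. A quick check might look like this (the step ID below is a placeholder):

```shell
# List all steps and their states (PENDING, RUNNING, COMPLETED, FAILED)
$ aws emr list-steps --cluster-id j-3HB834POEIRB

# Inspect one step in detail
$ aws emr describe-step --cluster-id j-3HB834POEIRB --step-id s-XXXXXXXXXXXX

# Once the step completes, read the results straight off the NFS mount
$ ls /mnt/neutrix/flightdata/output
```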
As with the previous example, this file system can also be replicated between on-premises InfiniBox storage and Neutrix Cloud, and shared among multiple clouds as needed to better control costs.
This is the second of three posts covering Neutrix Cloud use cases. You can find the first post here. In the next and final post, I’ll cover spawning multiple DB instances using Neutrix Cloud snapshots.