Fun with Neutrix Cloud – Running Spark Tasks on Amazon Elastic MapReduce with NFS Backend

Amazon Web Services has done a great job of simplifying the provisioning, scaling and deprovisioning of Hadoop and Hadoop-like environments with its Elastic MapReduce (EMR) service. A fully configured cluster with all required components can be brought online with a single command (or a few mouse clicks in the console) and used to process data right away. The EMR provisioning process also lets you specify bootstrap actions that run on every new cluster instance. With this mechanism, one can easily mount a Neutrix NFS file system on every EMR cluster node.

A typical bootstrap action for this would be:

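$ cat neutrix-emr-bootstrap.sh
#!/bin/bash
# Exit on the first failing command
set -e
sudo yum -y -q update
sudo /usr/bin/yum -y -q install nfs-utils rpcbind
sudo /bin/mkdir -p /mnt/neutrix
# NFSIP and NFSEXPORT must hold the Neutrix NFS server IP and export path
sudo /bin/mount "$NFSIP:$NFSEXPORT" /mnt/neutrix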
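
The script above expects NFSIP and NFSEXPORT to be set already. One way to supply them, sketched here under the assumption that they arrive as the first and second bootstrap action arguments, is to validate them before mounting. The function names nfs_source and mount_neutrix and the DRY_RUN switch are illustrative and not part of the original script.

```shell
#!/bin/bash
# Sketch only: nfs_source, mount_neutrix and DRY_RUN are hypothetical names,
# not part of the original bootstrap script.

# Print "<server>:<export>" for mount, or fail if either part is missing
# or the export path is not absolute.
nfs_source() {
    local server="$1"
    local export_path="$2"
    if [ -z "$server" ] || [ -z "$export_path" ]; then
        echo "usage: neutrix-emr-bootstrap.sh <nfs-server> <export-path>" >&2
        return 1
    fi
    case "$export_path" in
        /*) ;;
        *)  echo "export path must start with /: $export_path" >&2
            return 1 ;;
    esac
    printf '%s:%s\n' "$server" "$export_path"
}

# Mount the export on /mnt/neutrix; with DRY_RUN=1 only print the commands.
mount_neutrix() {
    local source_spec
    source_spec=$(nfs_source "$1" "$2") || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo /bin/mkdir -p /mnt/neutrix"
        echo "sudo /bin/mount $source_spec /mnt/neutrix"
    else
        sudo /bin/mkdir -p /mnt/neutrix
        sudo /bin/mount "$source_spec" /mnt/neutrix
    fi
}

# In the bootstrap script itself, the last line would be: mount_neutrix "$@"
```

On EMR, such arguments can be attached to the bootstrap action itself, for example Path=s3://neutrix-emr-config/neutrix-emr-bootstrap.sh,Args=[10.0.0.5,/export/emr], where the address and export path are placeholders.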

 

This script has to be stored in AWS S3, so it can be easily retrieved by every EMR cluster node:

$ aws s3 cp neutrix-emr-bootstrap.sh s3://neutrix-emr-config/
upload: ./neutrix-emr-bootstrap.sh to s3://neutrix-emr-config/neutrix-emr-bootstrap.sh

 

To launch an EMR cluster, pass the S3 path of this bootstrap action script to the aws emr create-cluster command through its --bootstrap-actions option.
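
The exact launch command depends on the environment. A minimal sketch of what it might look like follows. The cluster name, release label, instance type and instance count are placeholder assumptions, and only the bootstrap script path is taken from the upload step above. The helper print_create_cluster_cmd is hypothetical and only prints the command so it can be reviewed before anything is started.

```shell
#!/bin/bash
# Sketch only: the cluster name, release label, instance type and count are
# placeholder assumptions; only the bootstrap script path comes from the text.

BOOTSTRAP_URI="s3://neutrix-emr-config/neutrix-emr-bootstrap.sh"

# Print the create-cluster command, one option per line, for review.
# RELEASE_LABEL, INSTANCE_TYPE and INSTANCE_COUNT may be overridden
# from the environment.
print_create_cluster_cmd() {
    cat <<EOF
aws emr create-cluster \\
  --name "Neutrix EMR cluster" \\
  --release-label ${RELEASE_LABEL:-emr-5.36.0} \\
  --applications Name=Hadoop Name=Spark Name=Hive Name=Pig \\
  --instance-type ${INSTANCE_TYPE:-m5.xlarge} \\
  --instance-count ${INSTANCE_COUNT:-3} \\
  --use-default-roles \\
  --bootstrap-actions Path=${BOOTSTRAP_URI},Name="Mount Neutrix NFS"
EOF
}

print_create_cluster_cmd
```

Once the printed command looks right, it can be pasted into a shell or piped to bash to launch the cluster.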

Launching the cluster with this bootstrap action brings up a fully configured environment with Hadoop, Spark, Hive and Pig, using a shared Neutrix Cloud NFS file system as the backend storage.

Once this is completed, a user may submit a Spark task (“step”) to the cluster, using Neutrix Cloud to store the input, output and the script itself:

$ aws emr add-steps \
--cluster-id j-3HB834POEIRB \
--steps Type=SPARK,\
Name="Neutrix Spark Program",\
Args="[--deploy-mode,cluster,--class,FlightSample,\
file:///mnt/neutrix/flightdata/flight-project_2.10-1.0.jar,\
file:///mnt/neutrix/flightdata/output]"
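
Because the output lands on the shared NFS mount, any machine with /mnt/neutrix mounted can inspect the result once the step finishes. The sketch below assumes the job writes through Spark's standard file output, which leaves a _SUCCESS marker and part-* files in the output directory. Whether FlightSample does so is not stated here, and the function name check_spark_output is illustrative.

```shell
#!/bin/bash
# Sketch only: assumes the job writes with Spark's standard file output,
# which leaves a _SUCCESS marker and part-* files in the output directory.
# check_spark_output is a hypothetical helper name.

# Return 0 and list the part files if the output directory looks complete;
# return 1 if the _SUCCESS marker is missing.
check_spark_output() {
    local out_dir="$1"
    if [ ! -f "$out_dir/_SUCCESS" ]; then
        echo "no _SUCCESS marker in $out_dir" >&2
        return 1
    fi
    # List the result files in a stable order
    find "$out_dir" -maxdepth 1 -type f -name 'part-*' | sort
}
```

A typical call would be check_spark_output /mnt/neutrix/flightdata/output, run after aws emr wait step-complete has returned for the cluster ID and the step ID reported by add-steps.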

 

As with the previous example, this file system could also be replicated between on-premises InfiniBox storage and Neutrix Cloud as well as shared among multiple clouds as needed to better control costs.

This is the second of three posts covering Neutrix Cloud use cases. You can find the first post here. In the next and final post, I’ll cover spawning multiple DB instances using Neutrix Cloud snapshots.

About Gregory:
Gregory Touretsky (@gregnsk) is a Senior Director, Product Management at INFINIDAT. He drives the company’s roadmap for NAS, cloud and container topics. Before that, Gregory was a Solutions Architect with Intel, focused on distributed computing and storage solutions, data sharing and the cloud. He has over twenty years of practical experience with distributed computing and storage. Gregory has an M.S. in Computer Science from Novosibirsk State Technical University and an MBA from Tel-Aviv University.
