In this tutorial we will set up Apache Spark on a cluster of Tizen development devices, which is very easy to do. Make sure you have root SSH access to each of the devices (the Wifi network is configured, the SSH server is up and running, and you have Internet access).
The cluster will be composed of one RD-PQ device acting as the master node and two RD-210 devices acting as slave nodes; the RD-PQ is powered by a quad core ARMv7 CPU while the RD-210 is a dual core device. Both types have 1 GB of RAM, which should be more than enough for testing and understanding how a cluster works.
Apache Spark is a fast and general engine for large-scale data processing.
First, you will need Java; download Java SE Embedded from the Oracle website (you will need to create an account or log in to an existing one before downloading the archive). Make sure you download the ARMv5/ARMv6/ARMv7 Linux - SoftFP ABI, Little Endian version. After the download, extract the archive and install Java somewhere on each of the devices.
$ tar -xzf ejdk-8u65-linux-arm-sflt.tar.gz
$ ejdk1.8.0_65/bin/jrecreate.sh --dest /usr/lib/java
Make sure you add JAVA_HOME to your /etc/profile file, like this:
JAVA_HOME=/usr/lib/java
_JAVA_OPTIONS=-Djava.io.tmpdir=/opt/usr/tmp
export JAVA_HOME _JAVA_OPTIONS
On Tizen, the /tmp directory is configured as noexec, so Apache Spark won’t be able to load the required libraries from it (more precisely, the snappy library). We therefore need to create another temporary directory and give the spark user access to it. The _JAVA_OPTIONS setting in the /etc/profile file specifies the new temporary directory location.
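To confirm that /tmp really is mounted noexec on your device, you can look at the mount table. The following is a minimal sketch, assuming a Linux-style /proc/mounts is available; `is_noexec` is a hypothetical helper name, not something shipped with Tizen or Spark.

```shell
# Succeeds if the mount point $1 is listed with the "noexec" option in the
# mount table $2 (defaults to /proc/mounts). Hypothetical helper.
is_noexec() {
  mount_point=$1
  mounts_file=${2:-/proc/mounts}
  # Field 2 is the mount point, field 4 the comma-separated option list.
  awk -v mp="$mount_point" '
    $2 == mp && ("," $4 ",") ~ /,noexec,/ { found = 1 }
    END { exit(found ? 0 : 1) }
  ' "$mounts_file"
}
```

For example, `is_noexec /tmp && echo "/tmp is noexec"` prints the message only when the option is set.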
$ mkdir /opt/usr/tmp
$ chmod 0777 /opt/usr/tmp
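Before moving on, it can help to check that the Java install and the new temporary directory are in place. This is only a sketch: `check_java_env` is a hypothetical helper, and it assumes the JRE created by jrecreate.sh has its launcher at bin/java under the destination directory.

```shell
# Checks that $1 contains an executable bin/java and that $2 is a writable
# directory. Hypothetical helper; returns non-zero on the first problem found.
check_java_env() {
  java_home=$1
  tmp_dir=$2
  if [ ! -x "$java_home/bin/java" ]; then
    echo "no executable java binary under $java_home" >&2
    return 1
  fi
  if [ ! -d "$tmp_dir" ] || [ ! -w "$tmp_dir" ]; then
    echo "$tmp_dir is not a writable directory" >&2
    return 1
  fi
  return 0
}
```

With the paths used in this tutorial, the call would be `check_java_env /usr/lib/java /opt/usr/tmp`.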
Now that Java is installed, we can install and configure Apache Spark.
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar xzf spark-1.6.0-bin-hadoop2.6.tgz -C /usr/lib
$ groupadd spark
$ useradd -s /bin/bash -d /usr/lib/spark-1.6.0-bin-hadoop2.6 -g spark spark
$ chown -R spark:spark /usr/lib/spark-1.6.0-bin-hadoop2.6
The above steps will set up a spark group and add a spark user to it, setting the user's shell to /bin/bash and its home folder to the directory you extracted Apache Spark into. Let's configure Apache Spark.
$ cd /usr/lib/spark-1.6.0-bin-hadoop2.6/
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ cp conf/log4j.properties.template conf/log4j.properties
Edit the conf/spark-env.sh file and replace its contents with this:
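As a rough sketch of what a spark-env.sh for this hardware could contain: the variable names below are standard Spark 1.6 standalone settings, but every value is an assumption for illustration (the master IP placeholder, the core count matching the dual core RD-210, and a memory limit chosen to leave headroom on a 1 GB device). This is not the author's actual configuration.

```shell
# Illustrative spark-env.sh sketch; all values are assumptions.

# Java location, matching the /etc/profile setting in this tutorial.
export JAVA_HOME=/usr/lib/java

# Address the master binds to; replace x.x.x.x with the master's real IP.
export SPARK_MASTER_IP=x.x.x.x

# Resources each worker offers to applications (assumed for a dual core,
# 1 GB RD-210; tune to your own devices).
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=512m
```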
In the conf/log4j.properties file, find the line that says log4j.rootCategory=INFO, console and replace it with:
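The replacement is simply the same line with a different log level. As a sketch, assuming you want WARN (a common choice to quiet the console; the level is an assumption, and `set_root_log_level` is a hypothetical helper), the edit can be scripted like this:

```shell
# Rewrites the log4j.rootCategory line in properties file $1 so that the
# root logger uses level $2 with the console appender. Hypothetical helper.
set_root_log_level() {
  props_file=$1
  level=$2
  tmp_file="$props_file.tmp"
  # Write to a temporary file first, then move it over the original.
  sed "s/^log4j\.rootCategory=.*\$/log4j.rootCategory=$level, console/" \
    "$props_file" > "$tmp_file" &&
    mv "$tmp_file" "$props_file"
}
```

Running `set_root_log_level conf/log4j.properties WARN` from the Spark directory would leave the line as `log4j.rootCategory=WARN, console`.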
Everything should be set up on the master node now, so we can switch to the spark user and play with an Apache Spark example:
$ su - spark
$ bin/run-example SparkPi 10
If you start the Spark shell, you can count the lines in a file.
$ su - spark
$ bin/spark-shell --master local
> sc.textFile("CHANGES.txt").count
After you launch the Spark shell, you can point your browser to port 4040 on the device’s IP address to access the Spark UI, which is very useful for debugging.
Now we need to repeat the steps above for each RD-210 device that will be added as a slave to the Apache Spark cluster. Make sure the paths are identical on each of the devices; otherwise, use soft links to mimic the structure. In addition, you will need to create a conf/slaves file in the Spark directory on the master node, containing the hostname of each slave node. For example, my file looks like this:
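The conf/slaves file is plain text with one hostname per line. A small sketch for generating it, assuming the two hostnames that appear in the ssh-copy-id commands of this tutorial (first-srv-1 and first-srv-2) are the slave nodes; `write_slaves_file` is a hypothetical helper.

```shell
# Writes one hostname per line to the file $1; the remaining arguments are
# the slave hostnames. Hypothetical helper.
write_slaves_file() {
  if [ "$#" -lt 2 ]; then
    echo "usage: write_slaves_file FILE HOST [HOST...]" >&2
    return 1
  fi
  slaves_file=$1
  shift
  printf '%s\n' "$@" > "$slaves_file"
}
```

Called as `write_slaves_file conf/slaves first-srv-1 first-srv-2` from the Spark directory, it produces a two-line file with those hostnames.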
Log in as the spark user on the master, create a new SSH key and copy it to each of the slaves. We do this because the master needs to be able to SSH into each of the slave nodes without providing a password.
$ ssh-keygen
$ ssh-copy-id spark@first-srv-1
$ ssh-copy-id spark@first-srv-2
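It is worth verifying that the key-based login works before starting the cluster. The sketch below uses the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password. `check_passwordless_ssh` and the SSH_CMD override are hypothetical names introduced here for illustration.

```shell
# Tries a non-interactive login as the spark user on every host given as an
# argument and reports which ones fail. SSH_CMD can override the ssh binary.
check_passwordless_ssh() {
  ssh_cmd=${SSH_CMD:-ssh}
  failed=0
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so a missing key is an error.
    if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=5 "spark@$host" true; then
      echo "ok: $host"
    else
      echo "FAILED: $host" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Run as the spark user on the master, `check_passwordless_ssh first-srv-1 first-srv-2` should report both hosts as ok.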
You can start Apache Spark on the master node and each of the slaves by logging in as the spark user on the master and issuing this command:
$ su - spark
$ sbin/start-all.sh
After Apache Spark is started, start the Spark shell using this command:
$ bin/spark-shell --master spark://x.x.x.x:7077
Make sure you replace x.x.x.x with the actual IP of the Spark master.
You can perform the line count on a specific file, this time using the whole cluster, by typing this in your Spark shell:
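The expression is presumably the same one used in the local example. As a sketch, assuming the CHANGES.txt file from earlier and assuming the Spark shell accepts Scala on standard input, the count can also be run non-interactively; `line_count_expr` is a hypothetical helper that only builds the Scala expression.

```shell
# Prints the Scala expression that counts the lines of file $1 in the
# Spark shell. Hypothetical helper; it does not start Spark itself.
line_count_expr() {
  printf 'sc.textFile("%s").count\n' "$1"
}

# Usage on a running cluster (replace x.x.x.x with the master's IP):
#   line_count_expr CHANGES.txt | bin/spark-shell --master spark://x.x.x.x:7077
```

Because the workers read the file themselves, it has to exist at the same path on every node, which is one reason the paths need to be identical across devices.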
Stopping the Spark cluster is very easy; while logged in as the spark user on the master node, issue the command:
$ sbin/stop-all.sh
That’s it. You should now have a small cluster that you can use to test cluster computing and perform some intensive processing, keeping in mind that you will be limited by the amount of RAM available on each node (and by the Wifi interface, which is not optimal for cluster computing).