Apache Spark cluster on Tizen OS
In this tutorial we will be setting up Apache Spark on a cluster of Tizen development devices, which is very easy to do. Make sure you have root SSH access to each of the devices (the Wifi network is configured, the SSH server is up and running and you have Internet access).
The cluster will be composed of one RD-PQ device acting as the master node and two RD-210 devices acting as slave nodes; the RD-PQ is powered by a quad-core ARMv7 CPU, while the RD-210 is a dual-core device. Both types have 1 GB of RAM, which should be more than enough for testing and understanding how a cluster works.
Apache Spark is a fast and general engine for large-scale data processing.
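Just to give you a taste of what that means, here is the kind of one-liner we will be typing into the Spark shell once everything is installed: sc is the SparkContext the shell provides, and the snippet simply spreads a range of numbers over the cluster, filters and transforms it, then sums the result (the numbers are arbitrary, it is only a sketch).
> val data = sc.parallelize(1 to 1000)                 // distribute a small range of numbers
> data.filter(_ % 3 == 0).map(_ * 2).reduce(_ + _)     // keep multiples of 3, double them, sum them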
First, you will need Java; download Java SE Embedded from the Oracle website (you will need to create an account or log into an existing one before downloading the archive). Make sure you download the ARMv5/ARMv6/ARMv7 Linux - SoftFP ABI, Little Endian version. After the download, extract the archive and install Java somewhere on each of the devices.
$ tar -xzf ejdk-8u65-linux-arm-sflt.tar.gz
$ ejdk1.8.0_65/bin/jrecreate.sh --dest /usr/lib/java
Make sure you add JAVA_HOME to your /etc/profile file, like this:
JAVA_HOME=/usr/lib/java
_JAVA_OPTIONS=-Djava.io.tmpdir=/opt/usr/tmp
export JAVA_HOME _JAVA_OPTIONS
On Tizen, the /tmp directory is mounted with the noexec option, so Apache Spark won't be able to load the required native libraries from it (more precisely, the snappy library). We therefore need to create another temporary directory and give the spark user access to it. The _JAVA_OPTIONS setting in the /etc/profile file shown above specifies the new temporary directory location.
$ mkdir /opt/usr/tmp
$ chmod 0777 /opt/usr/tmp
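Later, once the Spark shell is running (see below), you can sanity-check that the JVM actually picked up the new temporary directory; this is optional, just a way to confirm the _JAVA_OPTIONS override took effect.
> System.getProperty("java.io.tmpdir")   // should print /opt/usr/tmp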
Now that Java is installed, we can install and configure Apache Spark.
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar xzf spark-1.6.0-bin-hadoop2.6.tgz -C /usr/lib
$ groupadd spark
$ useradd -s /bin/bash -d /usr/lib/spark-1.6.0-bin-hadoop2.6 -g spark spark
$ chown -R spark:spark /usr/lib/spark-1.6.0-bin-hadoop2.6
The above steps set up a spark group, add a spark user to it and set the user's shell and home folder to the directory you extracted Apache Spark into. Let's configure Apache Spark.
$ cd /usr/lib/spark-1.6.0-bin-hadoop2.6/
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ cp conf/log4j.properties.template conf/log4j.properties
Open the conf/spark-env.sh file and replace its contents with this:
SPARK_WORKER_MEMORY=512m
Open the conf/log4j.properties file, find the line that says log4j.rootCategory=INFO, console and replace it with log4j.rootCategory=WARN, console.
Everything should be set up on the master node now, so we can switch to the spark user and play with an Apache Spark example:
$ su - spark
$ bin/run-example SparkPi 10
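If you are curious what SparkPi actually does, it is a Monte Carlo estimate of π: it throws random points into a square and counts how many land inside the unit circle, which approximates π/4. A rough, hand-rolled sketch of the same idea (not the exact code of the bundled example) that you can paste into the Spark shell started in the next step looks like this:
> val n = 100000                                                // number of random samples
> val hits = sc.parallelize(1 to n).map { _ => val x = math.random * 2 - 1; val y = math.random * 2 - 1; if (x * x + y * y <= 1) 1 else 0 }.reduce(_ + _)
> println("Pi is roughly " + 4.0 * hits / n)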
If you start the Spark shell, you can perform a line count on a file.
$ su - spark
$ bin/spark-shell --master local[4]
> sc.textFile("CHANGES.txt").count
After you launch the Spark shell, you can point your browser to the device's IP on port 4040 to access the Spark UI, which is very useful for debugging purposes.
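If you want something a bit more interesting to look at in the Spark UI, a small word count on the same file produces a couple of stages and a shuffle, which show up nicely there; this is just a rough sketch, not part of the required setup.
> val words = sc.textFile("CHANGES.txt").flatMap(_.split("\\s+"))   // split every line into words
> val counts = words.map(w => (w, 1)).reduceByKey(_ + _)            // count occurrences per word
> counts.sortBy(_._2, ascending = false).take(10)                   // ten most frequent words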
Now we need to repeat the steps above for each RD-210 device that will be added as a slave to the Apache Spark cluster. Make sure the paths are identical on each of the devices; otherwise, use soft links to mimic the structure. In addition, you will need to create a conf/slaves file in the Spark directory on the master node, containing the hostnames of the slave nodes. For example, my file looks like this:
first-srv-1
first-srv-2
Log in as the spark user on the master, create a new SSH key and copy it onto each of the slaves. We do this because the master needs to be able to SSH into each of the slave nodes without being prompted for a password.
$ ssh-keygen
$ ssh-copy-id spark@first-srv-1
$ ssh-copy-id spark@first-srv-2
You can start Apache Spark on the master node and on each of the slaves by logging in as the spark user on the master and issuing these commands:
$ su - spark
$ sbin/start-all.sh
After Apache Spark is started, start the Spark shell using this command:
$ bin/spark-shell --master spark://x.x.x.x:7077
Make sure you replace x.x.x.x with the actual IP of the Spark master.
You can perform the line count on a specific file, this time using the whole cluster, by typing this in your Spark shell:
> sc.textFile("CHANGES.txt").count
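Here is another small experiment that exercises the workers without reading any file; it spreads one million numbers over eight partitions and sums their squares (the numbers are arbitrary, just a sketch). While it runs, you can watch the tasks being distributed to the slave nodes in the Spark UI.
> val data = sc.parallelize(1 to 1000000, 8)     // 8 partitions, spread across the workers
> data.map(x => x.toLong * x).reduce(_ + _)      // sum of squares, Long to avoid overflow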
Stopping the Spark cluster is very easy; while logged in as the spark user on the master node, issue the command:
$ sbin/stop-all.sh
That's it: you now have a small cluster that you can use to experiment with cluster computing and perform some intensive processing, keeping in mind that you will be limited by the amount of RAM available on each node (and by the Wifi interface, which is not optimal for cluster computing).