Hadoop Pi

Steps taken to produce a 4-node Hadoop 2 cluster on Raspberry Pis.

I’m not going to pretend that I know what I’m doing here – this is heavily copy/pasted from this tutorial and the subsequent follow-up.

Purpose of project

The whole reason I’m going through the motions with this basic “Hello World” type of exercise is that I believe distributed computing is becoming more and more a distinguishing mark of any data engineer’s trade. I already felt it in my last job search – the lack of Hadoop / Spark / distributed computing experience. My journey began in simple single-node RDBMS environments (MySQL), then moved to a shared-nothing MPP solution (PostgreSQL-flavored) – that last step started to open my eyes to the benefits of distributed databases. The next step for me, I think, is to dip my toe into the Hadoop / distributed computing world. Even if I’m wrong, and I can have a solid career without Hadoop experience, I still think it’s a good challenge and good exposure to current trends in data engineering technology – even if I’m more than a few years behind the times.

Components

Download and image the SD – condensed from RaspberryPi.org

Repeat the following steps for each SD card:
1. Download the latest version of Raspbian Jessie and unzip the file.
2. From terminal:
diskutil list
* Identify the disk (not partition) of your SD card
* Unmount your SD card by using the disk identifier, to prepare for copying data to it: diskutil unmountDisk /dev/disk<disk# from diskutil>
* Copy data to SD card, on my MacBook this took almost 7 minutes:
sudo dd bs=1m if=PATH/TO/IMG of=/dev/rdisk<disk# from diskutil>
* PATH/TO/IMG: path where I placed the Raspbian Jessie .img file
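
For reference, here’s the whole SD-imaging sequence strung together the way I’d run it on the Mac, using the same placeholders as above (the disk number and image path come from diskutil and wherever the .img was unzipped):

diskutil list # find the SD card's disk identifier
diskutil unmountDisk /dev/disk<disk# from diskutil>
sudo dd bs=1m if=PATH/TO/IMG of=/dev/rdisk<disk# from diskutil> # rdisk is the raw device, noticeably faster than disk
sync # flush writes before pulling the card
diskutil eject /dev/disk<disk# from diskutil>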

Put together Raspberry Pi units – enclose motherboard in case, insert OS SD Card

Set up each Raspberry Pi unit

I’m going to set up my NameNode first and then ssh into all the other boxes from there. As such, I’ll need SSH enabled on each unit.
The Pi boots to the desktop OS initially; I open up a terminal session and run raspi-config:
* Expand Filesystem
* Boot options -> Desktop/CLI -> Console Autologin
* Advanced Options -> SSH
* Advanced Options -> Hostname -> “pi-hdp01”

At this point I also identified the IP address of this particular unit.
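
I didn’t note the command here, but either of these should do it from the Pi’s own terminal (eth0 is an assumption about the wired interface name on this image):

hostname -I # prints the assigned IP address(es)
ip addr show eth0 # more detail on the wired interface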

Update system and install Java

  • sudo apt-get update && sudo apt-get install oracle-java7-jdk
  • sudo apt-get install pdsh
  • sudo update-alternatives --config java
  • Ensure jdk-8-oracle* is selected (despite the java7 package in the install line above, Java 8 is what the rest of this setup uses)
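
A quick sanity check on which JDK actually ended up active (exact version strings will vary):

java -version # should report a 1.8.x Oracle JVM if jdk-8-oracle* was selected
readlink -f /usr/bin/java # this path is what the JAVA_HOME export later on is derived from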

Configure Hadoop user / group

  • sudo addgroup hadoop
  • sudo adduser --ingroup hadoop hduser
  • Use whatever password
  • sudo adduser hduser sudo
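
A quick check that the user and its group memberships came out right:

id hduser # expect the groups to include hadoop and sudo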

Set up SSH for hduser

su hduser
mkdir -p ~/.ssh
ssh-keygen -t rsa -P "" # press enter when prompted for file location, we want this in the default ~/.ssh location
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys # sshd can refuse keys if these are too permissive
ssh localhost # trust the connection
exit
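
Eventually the same key will need to reach the other three Pis so the NameNode can ssh into them; presumably something like this, where pi-hdp02 is a hypothetical hostname for the second node:

ssh-copy-id hduser@pi-hdp02 # append this node's public key to the remote authorized_keys
ssh hduser@pi-hdp02 # should now log in without a password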

Install protobuf 2.5.0

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make # I don’t typically run these make commands, so this seemed like it took forever (10 minutes?) and I kept getting the same weird messages over and over
make check
sudo make install

I think the above was unnecessary, since I ended up downloading Hadoop’s prebuilt binaries instead of building from source – protoc is only needed for the source build.

Install Hadoop

At this point, I broke away from the tutorial I was following, which tries to build Hadoop from source with Make and all that craziness. I’m just downloading the binary tarball and copying it to /opt, as per another tutorial.

wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
sudo ln --symbolic hadoop-2.7.3/ hadoop
sudo chown -R hduser:hadoop hadoop
sudo chown -R hduser:hadoop hadoop-2.7.3/
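
A quick check that the symlink and ownership came out the way intended:

ls -ld /opt/hadoop /opt/hadoop-2.7.3 # symlink should point at hadoop-2.7.3, both owned by hduser:hadoop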

Add environment variables

Edit /etc/bash.bashrc – add the following lines to the end of the file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin

Reload the updated file (the edits went into /etc/bash.bashrc, so source that rather than ~/.bashrc).
source /etc/bash.bashrc

Test to make sure it’s installed correctly

pi@pi-hdp01:/opt $ su hduser
Password:
hduser@pi-hdp01:/opt $ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
hduser@pi-hdp01:/opt $

Success!!!

Test Hadoop in single node, non-distributed mode

At this point, I’m following Apache’s Documentation on “Single Node Setup”, forgoing any other tutorials. I may go back to other blog posts to revisit specific examples or tests others are doing.

From the /opt/hadoop directory …

sudo mkdir input
sudo cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
# my own example – results are written to a directory "output2"
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output2 'Zoo.*'

hduser@pi-hdp01:/opt/hadoop $ cat output2/*
2 Zookeeper.
1 Zookeeper connection string, a list of hostnames and port comma
1 Zookeeper authentication type, 'none' or 'sasl' (Kerberos).
1 Zookeeper ZNode path where the KMS instances will store and retrieve

Test Hadoop in pseudo-distributed mode

Update ./etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Update ./etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Format filesystem:
bin/hdfs namenode -format
Start name node daemon and data node daemon:
sbin/start-dfs.sh
Update JAVA_HOME in ./etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/
Make the HDFS directories required to execute MapReduce jobs:
bin/hdfs dfs -mkdir /user/
bin/hdfs dfs -mkdir /user/hduser

At this point, I tried doing an ls for /user and it wasn’t there … it turns out this /user directory lives inside HDFS, not on the local filesystem, which is not what I was assuming.
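
To see the directories that were just created, HDFS has to be queried directly rather than the local filesystem:

bin/hdfs dfs -ls / # lists the root of HDFS, should show /user
bin/hdfs dfs -ls /user # should show /user/hduser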

Continuing on …
Copy the input files into the distributed filesystem:
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input

Run an example like earlier:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

Examine the output files: copy them from the distributed filesystem to the local filesystem and take a look.
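
I didn’t capture the exact copy command; the Apache guide does it with -get, so it would have looked roughly like this (the local destination name is arbitrary):

bin/hdfs dfs -get output output # pull the HDFS output directory down to the local filesystem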

hduser@pi-hdp01:/opt/hadoop $ cat output/output/*
1 dfsadmin
1 dfs.replication

I had an existing output directory in /opt/hadoop, so the copy dropped the HDFS output directory inside the existing local one – hence the nested output/output path.

Note: I was still curious as to where this distributed file system was, and apparently it is in /tmp:

hduser@pi-hdp01:/opt/hadoop $ grep -r dfs.data.dir ./*
...
./share/doc/hadoop/hadoop-streaming/HadoopStreaming.html:<pre> -D dfs.data.dir=/tmp
./share/doc/hadoop/common/CHANGES.txt: 8. Change dfs.data.dir and mapred.local.dir to be comma-separated

hduser@pi-hdp01:/opt/hadoop $ ls /tmp/
hadoop-hduser hadoop-hduser-secondarynamenode.pid hsperfdata_root Jetty_localhost_46113_datanode____.zi7wi0
hadoop-hduser-datanode.pid hsperfdata_hduser Jetty_0_0_0_0_50070_hdfs____w2cu08 pulse-2L9K88eMlGn7
hadoop-hduser-namenode.pid hsperfdata_pi Jetty_0_0_0_0_50090_secondary____y6aanv

Stop the HDFS daemons (NameNode, DataNode, SecondaryNameNode):
sbin/stop-dfs.sh

As an aside, the contents of /tmp remain the same, but the .pid files are removed.

Running YARN on a single node

Create etc/hadoop/mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
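
One wrinkle: as far as I recall, the 2.7.3 tarball only ships a template for this file, so it has to be created first, e.g.:

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml # then add the property above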

Update etc/hadoop/yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Start YARN:

hduser@pi-hdp01:/opt/hadoop $ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.7.3/logs/yarn-hduser-resourcemanager-pi-hdp01.out
localhost: starting nodemanager, logging to /opt/hadoop-2.7.3/logs/yarn-hduser-nodemanager-pi-hdp01.out

SUCCESS – the ResourceManager web UI comes up at http://localhost:8088 (reached through an SSH tunnel: ssh -L 8088:localhost:8088 hduser@pi-hdp01).

Now I’m going to try running the same MapReduce job they’ve provided in the examples, counting the occurrences of that “dfs” string:
First, bring the distributed file store back up:
sbin/start-dfs.sh
I’ve been following this blog on and off, and found this “jps” command super helpful:

hduser@pi-hdp01:/opt/hadoop $ jps
25520 Jps
25210 DataNode
25114 NameNode
24571 NodeManager
25372 SecondaryNameNode
24478 ResourceManager
hduser@pi-hdp01:/opt/hadoop $

Before, I hadn’t run start-dfs.sh so only Jps, NodeManager, and ResourceManager had shown up.

Start the MapReduce job:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

There were a lot of “failures” and the job took a lot longer than the earlier standalone and pseudo-distributed runs. I’m not sure why, but this is also a pretty trivial example and not a typical use case, I’m sure.
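
I haven’t dug into the failures yet; when I do, these are presumably the places to start (the application ID is whatever the job prints, and yarn logs only returns anything if log aggregation is enabled):

bin/yarn application -list -appStates ALL # list applications YARN has run, with their final status
bin/yarn logs -applicationId <application id> # dump container logs; needs yarn.log-aggregation-enable=true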
