How Do You Install Hadoop with a Script for Any Type of Linux Server?

Updated on 1/6/21

Problem scenario

You want to install open source Hadoop.  You may want a single-node or multi-node deployment on a CentOS/RedHat/Fedora, Debian/Ubuntu, or SUSE Linux distribution.  You want most of the installation scripted, with the same script working on any variety of Linux.  How do you install Hadoop quickly with a script that works on almost any type of Linux?

Solution
1.  Log into Linux and verify that you can ping the server by its hostname: the command "ping -c 2 $(hostname)" should work.  If the ping is not working, update /etc/hosts (or /etc/hostname).  This command prints the IP address you will likely need for such an update:

a=$(ip addr show | grep 192.168 | awk '{print $2}') && echo "${a%/*}"  # strip the CIDR suffix (e.g., /24) and print the address
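
If the ping fails, here is a minimal sketch for adding the hostname mapping to /etc/hosts (this assumes the server's address is the 192.168.x.x one printed above):

a=$(ip addr show | grep 192.168 | awk '{print $2}')
echo "${a%/*} $(hostname)" | sudo tee -a /etc/hosts  # append an "IP hostname" entry
ping -c 2 $(hostname)  # verify the fix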

2.a.  This step involves manual commands that depend on your type of Linux.  To find your distribution type, use the command:  cat /etc/*-release

Do one of 2.b., 2.c., or 2.d. depending on the output of the above command.
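
For example, on Ubuntu the output includes a line such as NAME="Ubuntu", and on CentOS a line such as NAME="CentOS Linux".  To see just the NAME line:

cat /etc/*-release | grep ^NAME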

2.b.  If you are running a RedHat distribution (e.g., CentOS or Fedora), run these two commands:

sudo adduser hduser
sudo passwd hduser  #respond with the password of your choice
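
Optionally, confirm the account exists before moving on:

id hduser  # should print the new user's uid and gid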

2.c.i.  If you are running Ubuntu or a Debian distribution, do the following:
sudo adduser hduser  # Answer how you wish to the next prompts.
# For example, you could enter a password twice according to the prompts, then just press enter five times and press "Y" (with no quotes).

2.c.ii.  (This only applies if you are running Ubuntu or Debian).  Open /home/hduser/.bashrc (e.g., with the vi command).  Add four new lines at the very bottom with this content:

PATH="$PATH":/usr/local/hadoop/bin/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native/"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
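
Alternatively, if you prefer not to edit the file interactively, here is a sketch that appends the same four lines (the quoted heredoc delimiter keeps the variables from being expanded now):

sudo tee -a /home/hduser/.bashrc > /dev/null <<'EOF'
PATH="$PATH":/usr/local/hadoop/bin/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native/"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
EOF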

2.d.i.  If you are running SUSE Linux, run the following commands:

sudo useradd hduser
sudo passwd hduser
sudo mkdir -p /home/hduser/.ssh
sudo groupadd hadoop; sudo usermod -g hadoop hduser  
sudo chown -R hduser:hadoop /home/hduser/.ssh

2.d.ii.  (This only applies if you are running SUSE Linux.)  Create /home/hduser/.bashrc with this line (or, if the file already exists, add this line at the very bottom):
PATH="$PATH":/usr/local/hadoop/bin/
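
A one-line alternative to editing the file by hand:

echo 'PATH="$PATH":/usr/local/hadoop/bin/' | sudo tee -a /home/hduser/.bashrc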

3.  The rest of these directions should work regardless of your Linux distribution type.  Run these commands:

sudo groupadd hadoop; sudo usermod -g hadoop hduser  # skip this step if running SUSE
su hduser # respond with the password entered earlier
ssh-keygen -t rsa -P "" # press enter at the prompt to accept the default file location
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh 127.0.0.1   # respond with 'yes' (with no quote marks) to accept the fingerprint
exit # end the SSH session
exit # return to your regular user account
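
Before proceeding, you can optionally verify that passwordless SSH works for hduser (BatchMode makes ssh fail instead of prompting for a password):

su hduser -c 'ssh -o BatchMode=yes 127.0.0.1 true && echo "passwordless SSH is working"'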

4.  Create a script (e.g., /tmp/install.sh) with the content below from "#!/bin/bash" to the final "echo ..." statement. 

#!/bin/bash
# Written by continualintegration.com
# Warning: this is intended for a server that has no Hadoop files on it. It uses logic that relies on having no other core-site.xml, mapred-site.xml or yarn-site.xml files.
# It is a basic script to use open source Hadoop on any distribution of Linux. It is not perfect.

version=3.3.0 #change here if there is a higher version of Hadoop than 3.3.0 available

distro=$(cat /etc/*-release | grep NAME)

debflag=$(echo $distro | grep -i "ubuntu")
if [ -z "$debflag" ]
then # If it is not Ubuntu, test if it is Debian.
debflag=$(echo $distro | grep -i "debian")
echo "determining Linux distribution..."
else
echo "You have Ubuntu Linux!"
fi

rhflag=$(echo $distro | grep -iE "red ?hat")  # matches "Red Hat" (as in /etc/*-release on RHEL) as well as "RedHat"
if [ -z "$rhflag" ]
then #If it is not RedHat, see if it is CentOS or Fedora.
rhflag=$(echo $distro | grep -i "centos")
if [ -z "$rhflag" ]
then #If it is neither RedHat nor CentOS, see if it is Fedora.
echo "It does not appear to be CentOS or RHEL..."
rhflag=$(echo $distro | grep -i "fedora")
fi
fi

if [ -z "$rhflag" ]
then
echo "...still determining Linux distribution..."
else
echo "You have a RedHat distribution (e.g., CentOS, RHEL, or Fedora)"
yum install -y java-1.8.0-openjdk-devel
fi

if [ -z "$debflag" ]
then
echo "...still determining Linux distribution..."
else
echo "You are using either Ubuntu Linux or Debian Linux."
apt-get -y update # This is necessary on new AWS Ubuntu servers.
apt-get -y install openjdk-8-jdk-headless default-jre curl
echo "It may be necessary to try to install different openjdk versions"
apt-get -y install openjdk-11-jdk-headless
ln -s python3 /bin/python
fi

suseflag=$(echo $distro | grep -i "suse")
if [ -z "$suseflag" ]
then
if [ -z "$debflag" ]
then
if [ -z "$rhflag" ]
then
echo "*******************************************"
echo "Could not determine the Linux distribution!"
echo "Installation aborted. Nothing was done."
echo "******************************************"
exit
fi
fi
else
zypper -n install java java-1_8_0-openjdk-devel
ln -s python3 /bin/python
fi

cd /tmp
curl https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-$version/hadoop-$version.tar.gz > hadoop-$version.tar.gz
tar xzf hadoop-$version.tar.gz
mv hadoop-$version /usr/local/hadoop
echo '
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin' >> ~/.bashrc
source ~/.bashrc
sed -i '/export JAVA_HOME/c\export JAVA_HOME=/usr/.' $(find / -name hadoop-env.sh)  # /usr/. works because $JAVA_HOME/bin/java then resolves to /usr/bin/java
echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri has a scheme that determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri has authority that is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
' > $(find / -name core-site.xml)


echo '<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

</configuration>' > $(find / -name mapred-site.xml)

echo '<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

</configuration>' > $(find / -name yarn-site.xml | grep -v sample-conf)


chmod 644 $(find / -name mapred-site.xml)
# Above command may or may not be necessary
chown -R hduser:hadoop /usr/local/hadoop/

mkdir -p /app/hadoop/
chown -R hduser:hadoop /app/hadoop/

echo 'PATH="$PATH":/usr/local/hadoop/bin/' >> /etc/profile.d/hadoop.sh

echo "Hadoop should now be in /usr/local/hadoop/ directory. You may want to double check."
hv=$(/usr/local/hadoop/bin/hdfs version)
echo "*****Hadoop version below*****"
echo $hv
echo "*****Hadoop version above*****"
echo "Proceed with the manual steps. hdfs commands should be run as hduser."
echo "You are done installing Hadoop on this server."

if [ -z "$debflag" ]
then
echo "You are done installing Hadoop on this server."
else
rm /etc/profile.d/hadoop.sh # Not needed for Ubuntu/Debian
echo "Use 'su hduser' you will need to run this command:"
fi

echo ""
echo "********************"
echo "You will have to log out and log back in for the environment variables to work. Moreover this was configured for the hduser account."
echo "You may want to run 'su hduser' to run the 'hdfs version' command as a test."

5.  Run this command to do the installation:
sudo bash /tmp/install.sh  # You must run the script above as a sudoer or the root user.
# You can ignore an error about /usr/local/hadoop/logs not being found if you are using Ubuntu/Debian Linux.
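
After the script finishes, a few optional sanity checks (the paths assume the default locations used by the script):

ls -ld /usr/local/hadoop /app/hadoop  # both should exist and be owned by hduser:hadoop
grep 'export JAVA_HOME' /usr/local/hadoop/etc/hadoop/hadoop-env.sh  # confirm the sed edit took effect
/usr/local/hadoop/bin/hadoop version  # print the installed version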

# If you are following the multi-node Hadoop deployment directions, skip step 6.  Go to this posting if you want to set up a multi-node Hadoop cluster.

6.  For a single-node deployment, log in as hduser (e.g., su hduser).  Run these commands:

hdfs namenode -format   
bash /usr/local/hadoop/sbin/start-dfs.sh
bash /usr/local/hadoop/sbin/start-yarn.sh
jps # To see the node services that are running
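
Once the daemons are up, here is a short smoke test (run as hduser; in Hadoop 3.x the NameNode web UI listens on port 9870 by default):

hdfs dfs -mkdir -p /user/hduser  # create a home directory in HDFS
hdfs dfs -ls /  # list the HDFS root
curl -s http://localhost:9870/ > /dev/null && echo "NameNode web UI is up"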

FYI
The above directions were tested on Ubuntu 20.x, SUSE 15.x, and CentOS 8.x Linux servers (as well as other types, including different versions of RHEL).

With Hadoop 3.0.0, the master and slaves files no longer need to be configured; instead, a workers file needs to be configured.  DataNodes do not need to be able to SSH to the NameNode, and similarly the DataNodes do not need to be able to SSH to each other.
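
For reference, here is a sketch of populating the workers file for a multi-node cluster (datanode1 and datanode2 are hypothetical hostnames; substitute your own):

printf 'datanode1\ndatanode2\n' > /usr/local/hadoop/etc/hadoop/workers  # one DataNode hostname per line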

Do you want to configure a multi-node Hadoop cluster?  See this link for directions.
