This article explains how to deploy the latest version of Apache Hadoop (stable release 3.2.0, January 16, 2019) on a multi-node cluster to store unstructured data in a distributed manner. In essence, it is a procedure for expanding a single-node cluster into a multi-node cluster. The resulting cluster cannot be considered an HA (High Availability) cluster, because no Standby NameNode is installed or configured. We can, however, restore the exact state of the active NameNode if it fails or crashes after data has been ingested into HDFS. To make that possible, we have to manually copy the fsimage and edits files to a separate external location after every data ingestion into HDFS. However, the following conditions must be met to bring the cluster back to the state it was in before the NameNode crashed.
Of course, manual intervention is required to put those files back into their exact locations (directories) and then restart the NameNode. This approach can be followed when hardware resources are limited. It can be beneficial for small business units that manage different types of data in an HDFS-based Data Lake and avoid a cloud-based Data Lake to save costs. It is recommended for those who want to maintain a cluster of three or at most four nodes, and for learning or R&D purposes.
For larger clusters, it is advisable to configure the Standby NameNode that is available with the Apache Hadoop-3.2.0 binary. The Standby NameNode was introduced in the Apache Hadoop 2.x releases to overcome the single point of failure. It runs on a separate system and constantly maintains an in-memory, up-to-date copy of the file system namespace of the active NameNode, with which it stays in sync through a shared directory. If the active NameNode goes down or crashes, the Standby NameNode takes over immediately and starts functioning without manual intervention. We can upgrade this multi-node cluster to HA (High Availability) by following the procedures provided by Apache Hadoop-3.2.0, using either the Quorum Journal Manager or conventional shared storage.
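The manual backup and restore described above could be sketched roughly as follows. This is a minimal sketch, not the author's exact procedure: it assumes the fsimage and edits files live in the `current` subdirectory of the directory configured as `dfs.namenode.name.dir`, and the function names, archive naming and paths are hypothetical.

```shell
# Archive the NameNode metadata (fsimage and edits) into a timestamped tarball.
# $1 = NameNode metadata directory (the dfs.namenode.name.dir value; assumption)
# $2 = external backup directory (hypothetical location)
backup_nn_metadata() {
    name_dir="$1"
    backup_dir="$2"
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="${backup_dir}/nn-meta-${stamp}.tar.gz"
    mkdir -p "$backup_dir" &&
        tar -czf "$archive" -C "$name_dir" current &&
        echo "$archive"          # print the archive path for later use
}

# Unpack a saved archive back into the NameNode metadata directory.
# Run this only while the NameNode is stopped, then restart it.
restore_nn_metadata() {
    archive="$1"
    name_dir="$2"
    mkdir -p "$name_dir" && tar -xzf "$archive" -C "$name_dir"
}
```

A typical run after a data ingestion might be `backup_nn_metadata /path/to/namenode/dir /mnt/external/nn-backup`, where both paths are placeholders. Copying while the NameNode is writing may capture a partially written edits file, so taking the copy during a quiet period is the safer choice.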
This article presupposes that Apache Hadoop-3.2.0 is already installed and running successfully as a single-node cluster on Ubuntu-14.04 LTS, with OpenJDK 11 as the Java environment; the same setup is used for the multi-node cluster. If you have not yet created a single-node cluster, here is the link where I describe the procedure step by step. This article extends that single-node cluster to a multi-node cluster with three DataNodes. The system on which the single-node cluster is installed and configured can serve as the NameNode if its configuration is high enough (at least 16 GB RAM and a 1 TB hard disk). I have used the system where Apache Hadoop-3.2.0 was already installed and running as a single-node cluster as the NameNode/Master Node. Before beginning, we need to make sure of the following:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1
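The three settings above disable IPv6. One possible way to apply them on each node is sketched below. The article does not name the file they belong in, so `/etc/sysctl.conf` is an assumption, and the function name is hypothetical.

```shell
# Append the IPv6-disabling settings to a sysctl config file,
# skipping any line that is already present (safe to run twice).
# $1 = target file; defaults to /etc/sysctl.conf (assumption)
disable_ipv6_settings() {
    sysctl_file="${1:-/etc/sysctl.conf}"
    for scope in all default eth0; do
        setting="net.ipv6.conf.${scope}.disable_ipv6 = 1"
        # -x matches the whole line, -F treats the pattern as a fixed string
        grep -qxF "$setting" "$sysctl_file" 2>/dev/null ||
            printf '%s\n' "$setting" >> "$sysctl_file"
    done
}

# As root on each node:
#   disable_ipv6_settings /etc/sysctl.conf && sysctl -p
```

`sysctl -p` reloads the file so the settings take effect without a reboot. The `eth0` entry only makes sense if the network interface really has that name.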
Step 1:- Ensure SSH Passwordless Login from NameNode to all the DataNodes
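A minimal sketch of this step, run on the NameNode, might look like the following. The DataNode hostnames and the user name are hypothetical placeholders, and the key type and size are my own choice rather than something the article prescribes.

```shell
# Hypothetical hostnames and user; replace with your own.
DATANODES="datanode1 datanode2 datanode3"
HADOOP_USER="hadoop"

# Create a key pair with an empty passphrase unless one already exists.
# $1 = private key path; defaults to ~/.ssh/id_rsa
ensure_ssh_key() {
    key_path="${1:-$HOME/.ssh/id_rsa}"
    if [ ! -f "$key_path" ]; then
        mkdir -p "$(dirname "$key_path")"
        ssh-keygen -q -t rsa -b 4096 -N "" -f "$key_path"
    fi
}

# Copy the public key to every DataNode (asks for each password once).
push_ssh_key() {
    key_path="${1:-$HOME/.ssh/id_rsa}"
    for node in $DATANODES; do
        ssh-copy-id -i "${key_path}.pub" "${HADOOP_USER}@${node}"
    done
}
```

After `ensure_ssh_key && push_ssh_key`, a command such as `ssh hadoop@datanode1 hostname` should return without asking for a password.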
Step 2:- OpenJDK 11 installation on each node
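Once OpenJDK 11 is installed, it can help to confirm that every node really runs Java 11 before going further. The check below is only a sketch with hypothetical function names. The usual Debian/Ubuntu package name is `openjdk-11-jdk`, but whether it is available on a given Ubuntu release without an extra repository is something to verify rather than assume.

```shell
# Extract the major version from the first line of `java -version` output,
# e.g. 'openjdk version "11.0.2" 2019-01-15' gives 11.
java_major_version() {
    printf '%s\n' "$1" | sed -n '1s/.*version "\([0-9][0-9]*\).*/\1/p'
}

# Succeed only if the java on the PATH reports major version 11.
check_java_11() {
    # `java -version` writes to stderr, so merge it into stdout.
    version_text="$(java -version 2>&1)"
    [ "$(java_major_version "$version_text")" = "11" ]
}

# Possible install command (package name is an assumption for your release):
#   sudo apt-get install -y openjdk-11-jdk
```

Running `check_java_11 && echo "Java 11 OK"` on each node gives a quick yes/no answer.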
Step 3:- NameNode Configuration
We modified the single-node cluster on which Apache Hadoop-3.2.0 is running so that it acts as the NameNode of the cluster. Here are the steps:
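The individual changes are not reproduced here. One change that is generally needed on the NameNode in Hadoop 3.x is to list the DataNode hostnames in the `workers` file under `etc/hadoop`. The sketch below shows one way to do that; the hostnames and the function name are hypothetical.

```shell
# Write one DataNode hostname per line into Hadoop's workers file.
# $1 = Hadoop configuration directory, remaining arguments = hostnames
write_workers() {
    conf_dir="$1"
    shift
    printf '%s\n' "$@" > "${conf_dir}/workers"
}

# Example with placeholder hostnames:
#   write_workers "$HADOOP_HOME/etc/hadoop" datanode1 datanode2 datanode3
```

The start scripts read this file to decide on which hosts to launch the DataNode daemons, so each name must resolve from the NameNode (for example through `/etc/hosts`).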
Step 4:- Extract hadoop-3.2.0.tar.gz on each DataNode
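A minimal sketch of the extraction is shown below. It assumes the tarball has already been copied to the DataNode; the destination directory and the function name are placeholders.

```shell
# Unpack the Hadoop tarball into a destination directory.
# $1 = path to hadoop-3.2.0.tar.gz, $2 = directory to extract into
unpack_hadoop() {
    tarball="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" && tar -xzf "$tarball" -C "$dest_dir"
}

# Example (destination is a placeholder):
#   unpack_hadoop ~/hadoop-3.2.0.tar.gz ~
```

Using the same destination path on every node keeps the environment variables and config files identical across the cluster.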
Step 5:- Set up Hadoop environment variables by editing the ~/.bashrc file on each node
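The exact variables are not listed here, so the following is only a sketch of a typical set. The `JAVA_HOME` path and the Hadoop install location are assumptions that must match your own nodes, and the function name is hypothetical.

```shell
# Append Hadoop-related exports to a shell rc file, once only.
# $1 = rc file (default ~/.bashrc), $2 = Hadoop install dir (assumption)
append_hadoop_env() {
    rc_file="${1:-$HOME/.bashrc}"
    hadoop_home="${2:-$HOME/hadoop-3.2.0}"
    # Do nothing if HADOOP_HOME is already exported in the file.
    grep -q '^export HADOOP_HOME=' "$rc_file" 2>/dev/null && return 0
    cat >> "$rc_file" <<EOF
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=${hadoop_home}
export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
}
```

Calling `append_hadoop_env` with no arguments edits `~/.bashrc` in place; the new variables become visible in the current shell once the file is reloaded.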
$ source ~/.bashrc
Step 6:- Update config files on each DataNode
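The individual config changes are not reproduced here. As one illustration, each DataNode needs to know where the NameNode is, which is normally set through the `fs.defaultFS` property in `core-site.xml`. The sketch below writes such a file; the hostname, the port `9000` and the function name are assumptions chosen for the example.

```shell
# Write a minimal core-site.xml that points at the NameNode.
# $1 = Hadoop config dir, $2 = NameNode hostname, $3 = RPC port (assumed 9000)
write_core_site() {
    conf_dir="$1"
    nn_host="$2"
    nn_port="${3:-9000}"
    cat > "${conf_dir}/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${nn_host}:${nn_port}</value>
  </property>
</configuration>
EOF
}

# Example with a placeholder hostname:
#   write_core_site "$HADOOP_HOME/etc/hadoop" namenode
```

Note that this overwrites any existing `core-site.xml`, so it suits a fresh DataNode rather than one that already carries custom settings. The value must be the same on the NameNode and on every DataNode.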
Step 7:- Add activation-1.1.jar
Download activation-1.1.jar, or copy it from an older version of Hadoop or the JDK, and place it inside the Hadoop lib directories (hadoop-3.2.0/share/hadoop/common/lib/, hadoop-3.2.0/share/hadoop/yarn/lib/, hadoop-3.2.0/share/hadoop/yarn/activation-1.1.jar). OpenJDK 11 does not include this jar, because the java.activation module was removed from the JDK in Java 11.
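Copying the jar into the locations listed above can be scripted roughly as follows. This is a sketch with a hypothetical function name; it assumes you already have the jar file on the node.

```shell
# Copy the activation jar into the Hadoop directories named in this step.
# $1 = path to the jar, $2 = Hadoop install directory (e.g. .../hadoop-3.2.0)
install_activation_jar() {
    jar_file="$1"
    hadoop_dir="$2"
    for target in common/lib yarn/lib yarn; do
        cp "$jar_file" "${hadoop_dir}/share/hadoop/${target}/" || return 1
    done
}

# Example (the source path is a placeholder):
#   install_activation_jar ~/activation-1.1.jar "$HADOOP_HOME"
```

The function stops at the first failed copy, so a missing directory is noticed immediately instead of leaving the node half updated.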
Step 8:- Format NameNode or Master node
We need to format the NameNode before starting the multi-node cluster, using the following command:
$ hdfs namenode -format
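Formatting wipes any existing HDFS metadata, so on a machine that already ran as a single-node cluster it may be worth guarding the command. The sketch below refuses to format when the metadata directory already contains a `current/VERSION` file. The function names are hypothetical, and the directory you pass in must be the one configured as `dfs.namenode.name.dir`.

```shell
# Succeed if the given NameNode metadata directory has been formatted before.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Format only when no earlier metadata is found.
# $1 = NameNode metadata directory (the dfs.namenode.name.dir value)
safe_format() {
    name_dir="$1"
    if namenode_is_formatted "$name_dir"; then
        echo "NameNode directory $name_dir is already formatted; skipping." >&2
        return 1
    fi
    hdfs namenode -format
}
```

If a reformat is really intended, back up or clear the old metadata directory first and then run the plain command.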
After a successful format of the NameNode or Master node, the output should include a message stating that the storage directory has been successfully formatted.
Step 9:- Starting the multi-node cluster
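The commands for this step are not reproduced here. A minimal sketch, assuming the standard `start-dfs.sh` and `start-yarn.sh` scripts from hadoop-3.2.0/sbin are on the PATH of the NameNode, is given below together with a small helper that checks `jps` output for the expected daemons. The function names are hypothetical.

```shell
# Start HDFS first, then YARN; run on the NameNode.
start_cluster() {
    start-dfs.sh && start-yarn.sh
}

# Print every expected daemon that is absent from the given `jps` output.
# $1 = text printed by jps, remaining arguments = daemon names to look for
missing_daemons() {
    jps_text="$1"
    shift
    for daemon in "$@"; do
        # -w matches whole words, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$jps_text" | grep -qw "$daemon" || echo "$daemon"
    done
}

# On the NameNode:
#   missing_daemons "$(jps)" NameNode SecondaryNameNode ResourceManager
# On each DataNode:
#   missing_daemons "$(jps)" DataNode NodeManager
```

Empty output from `missing_daemons` means all the listed daemons are running on that node. The daemon lists in the comments assume YARN's ResourceManager runs on the same machine as the NameNode.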
Step 10:- Verify and access the Hadoop services in Browser
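In Hadoop 3.x the NameNode web UI listens on port 9870 by default and the YARN ResourceManager UI on port 8088. The sketch below prints both URLs and probes them with `curl`. It assumes the default ports are unchanged and that the ResourceManager runs on the NameNode host; the function names are hypothetical.

```shell
# Print the default web UI addresses for a given master hostname.
ui_urls() {
    master_host="$1"
    echo "http://${master_host}:9870"   # NameNode web UI (Hadoop 3.x default)
    echo "http://${master_host}:8088"   # YARN ResourceManager web UI (default)
}

# Report whether each UI answers within five seconds.
check_uis() {
    for url in $(ui_urls "$1"); do
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "UP   $url"
        else
            echo "DOWN $url"
        fi
    done
}

# Example with a placeholder hostname:
#   check_uis namenode
```

The same URLs can be opened directly in a browser; the NameNode page lists the live DataNodes, which is the quickest way to confirm that all three have joined.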
The following commands can be used to stop the cluster:
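The original commands are not reproduced here. A minimal sketch, assuming the standard `stop-yarn.sh` and `stop-dfs.sh` scripts from hadoop-3.2.0/sbin are on the PATH of the NameNode, would be the following; the wrapper function name is hypothetical.

```shell
# Stop YARN first, then HDFS (the reverse of the start order); run on the NameNode.
stop_cluster() {
    stop-yarn.sh
    stop-dfs.sh
}
```

Running `jps` on each node afterwards should show no remaining Hadoop daemons.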
The author can be contacted for real-time POC development and hands-on technical training, as well as to develop or support any Hadoop-related project. Email:- [email protected], [email protected]. Gautam is a consultant as well as an educator. Prior to that, he worked as a Sr. Technical Architect across multiple technologies and business domains in many countries. Currently, he specializes in Big Data processing and analysis, data lake creation, architecture, etc. using HDFS. He is also involved in HDFS maintenance, in loading multiple types of data from different sources, and in designing and developing real-time use cases on client/customer demand to demonstrate how data can be leveraged for business transformation, profitability, etc. He is passionate about sharing knowledge through blogs, training, seminars, and presentations on various Big Data technologies, methodologies, real-time projects with their architecture/design, multiple procedures for ingesting huge volumes of data, basic data lake creation, etc.