Hadoop 2.7.2 cluster under CentOS 7.3

This article explains how to set up a Hadoop cluster on a CentOS Linux system. Before you read on, I assume you already have a basic understanding of Hadoop and the Linux operating system.

(Optional) Rename the network interface eno16777736 back to the classic eth0: move its ifcfg file under /etc/sysconfig/network-scripts/, then adjust the udev rule so the NIC's MAC address maps to the new name.

mv ifcfg-eno16777736 ifcfg-eth0
vi /etc/udev/rules.d/90-eno-fix.rules

# Before:
# This file was automatically generated on systemd update
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:9e:8f:95", NAME="eno16777736"

# After:
# This file was automatically generated on systemd update
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:9e:8f:95", NAME="eth0"

 

vi /etc/hosts

192.168.1.80 master.rmohan.com master
192.168.1.81 lab1.rmohan.com lab1
192.168.1.82 lab2.rmohan.com lab2

Architecture

IP Address Hostname Role
192.168.1.80 master NameNode, ResourceManager
192.168.1.81 slave1 SecondaryNameNode, DataNode, NodeManager
192.168.1.82 slave2 DataNode, NodeManager

 

Before we start, let's make sure we understand the meaning of the following terms:

DataNode

A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with the data replicated across them.

NameNode

The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

NodeManager

The NodeManager (NM) is YARN's per-node agent and takes care of an individual compute node in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing the life-cycle management of containers, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and providing auxiliary services that may be exploited by different YARN applications.

ResourceManager

ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).

Secondary Namenode

The Secondary NameNode's whole purpose is to perform checkpoints of the HDFS namespace. It is just a helper node for the NameNode, not a standby.

 

 

2, HDFS (Hadoop Distributed File System)

HDFS derives from Google's GFS paper, published in October 2003; HDFS is essentially an open-source clone of GFS.

HDFS is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system designed to detect and respond to hardware failures while running on low-cost commodity hardware. HDFS simplifies the file consistency model and, through streaming data access, provides high-throughput access to application data for applications with large data sets.
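
As a quick illustration, the HDFS command line is the easiest way to see this in action. A minimal sketch, assuming the cluster built later in this post is already running as the hadoop user (the paths are arbitrary examples):

# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put /etc/hosts /user/hadoop/demo/

# List the directory and read the file back
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/hosts

# Show how the file is split into blocks and where the replicas live
hdfs fsck /user/hadoop/demo/hosts -files -blocks -locations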

HDFS consists of the following main components:

(1) Client: splits files into blocks, accesses HDFS, interacts with the NameNode to obtain file location information, and interacts with DataNodes to read and write data.

(2) NameNode: the master node (only one in Hadoop 1.x). It manages the HDFS namespace and block mapping information, configures the replication policy, and handles client requests. For large clusters, Hadoop 1.x has two major flaws: 1) the NameNode's memory becomes a bottleneck, which limits scalability; 2) the NameNode is a single point of failure.

Hadoop 2.x resolves both issues. Defect 1) is addressed by HDFS Federation, which runs multiple NameNodes, each serving its own part of the namespace, so the NameNode layer can scale horizontally and the memory pressure on any single NameNode is relieved.

Defect 2) is addressed by the NameNode HA solution, which runs two NameNodes as a hot-standby pair: one in the active state and one in the standby state.

(3) DataNode: a slave node that stores the actual data blocks and reports the stored block information to the NameNode.

(4) Secondary NameNode: assists the NameNode and shares part of its workload; it regularly merges the fsimage and edits files and pushes the result back to the NameNode. In an emergency it can help restore the NameNode, but it is not a hot standby for the NameNode.

As long as the checkpoint data on disk is intact, the Secondary NameNode's copy can be used to recover the NameNode, as sketched below.
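
A minimal sketch of that recovery path, assuming the checkpoint directory (dfs.namenode.checkpoint.dir, configured later in this post) has been copied from the Secondary NameNode to the new NameNode host and the name directory there is empty; this is not a substitute for a real HA setup:

# Start the NameNode once with -importCheckpoint: it loads the fsimage from the
# checkpoint directory and saves it into the (empty) dfs.namenode.name.dir
hdfs namenode -importCheckpoint

# After restarting the daemons normally, verify that the namespace came back
hdfs dfsadmin -report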

3, MapReduce (distributed computing framework)

MapReduce originates from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google's MapReduce. MapReduce is a computational model for processing large amounts of data. Map applies an operation to each independent element of the dataset, generating key-value pairs as intermediate results. Reduce then aggregates all the values that share the same key in the intermediate results to obtain the final result. This functional division of labor is very well suited to data processing in a distributed, parallel environment made up of a large number of machines.
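
As a concrete example, the classic WordCount job ships with the Hadoop release installed later in this post. A hedged sketch of running it once the cluster is up (the input and output paths are arbitrary):

# Put some text into HDFS as job input
hdfs dfs -mkdir -p /user/hadoop/wc-in
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/wc-in

# Run the bundled example: map emits (word, 1) pairs, reduce sums the counts per word
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
  wordcount /user/hadoop/wc-in /user/hadoop/wc-out

# Inspect the result
hdfs dfs -cat /user/hadoop/wc-out/part-r-00000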

The MapReduce computing framework has by now evolved two versions of the MapReduce API. MR1 consists of the following main components:

(1) JobTracker: the master node (only one). Its main tasks are resource allocation, job scheduling and supervision: it manages all jobs, monitors jobs and tasks, handles errors, breaks each job down into a series of tasks, and assigns them to TaskTrackers.

(2) TaskTracker: a slave node that runs Map Tasks and Reduce Tasks and interacts with the JobTracker to report task status.

(3) Map Task: parses each data record, passes it to the user-written map() function, and writes the map output to the local disk.

(4) Reduce Task: remotely reads its input data from the Map Task outputs, sorts the data, and passes it to the user-written reduce() function.

In between sits the shuffle, which is the key to understanding the MapReduce framework. The shuffle covers everything that happens between the map function's output and the reduce function's input, and it can be divided into a map side and a reduce side.

Map side:

1) The input data is split; the split size depends on the size of the original file and the HDFS block size. Each split corresponds to one map task.

2) While a map task runs, its results are buffered in memory. When the buffer fills up to a configurable threshold, the intermediate results are spilled to the local disk as a temporary file. This process is called spilling.

3) During a spill, the output is partitioned according to the configured number of reduce tasks; each partition corresponds to one reduce task, and records are sorted as they are written. A combiner can also be applied during the spill; its result must be consistent with what the reducer would produce, so its applicability is limited and it needs to be used with caution.

4) At the end of each map task there must be only one output file to serve as reduce input, so the multiple spill files written to disk are merged into a single, internally partitioned temporary file. The buffer size and spill threshold can be tuned per job, as sketched below.
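
The buffer size and spill threshold mentioned above correspond to real per-job configuration knobs. A hedged sketch using the Hadoop 2.x property names (the values are illustrative only; the bundled WordCount example already registers its reducer as a combiner):

# Enlarge the in-memory sort buffer to 256 MB and spill when it is 90% full
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
  wordcount \
  -D mapreduce.task.io.sort.mb=256 \
  -D mapreduce.map.sort.spill.percent=0.90 \
  /user/hadoop/wc-in /user/hadoop/wc-out-tuned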

Reduce side:

1) First the data is localized: the map outputs sitting on remote nodes are copied to the local node.

2) Merge: the map outputs copied from the different nodes are merged together.

3) Copying and merging continue until a single input file is formed. The reduce phase then stores the final results on HDFS.

MR2 is the new generation of the MapReduce API. It mainly runs on top of YARN's resource management framework.

4, YARN (resource management framework)

YARN appeared in Hadoop 2.x as a rework of the JobTracker/TaskTracker model used in Hadoop 1.x: it separates the JobTracker's resource allocation from job scheduling and monitoring. The framework consists mainly of the ResourceManager, the ApplicationMaster and the NodeManager. The workflow is as follows: the ResourceManager is responsible for resource allocation across all applications; the ApplicationMaster is responsible for task scheduling within a single job, i.e. each job gets its own ApplicationMaster; the NodeManager receives commands from the ResourceManager and the ApplicationMaster and carries out the actual allocation of resources.

After receiving a client's job submission request, the ResourceManager allocates a Container. Note that the ResourceManager allocates resources in units of Containers. The first allocated Container starts the ApplicationMaster, which is primarily responsible for job scheduling. Once the ApplicationMaster has started, it communicates directly with the NodeManagers.

In YARN, resource management is shared by the ResourceManager and the NodeManagers: the scheduler inside the ResourceManager is responsible for resource allocation, while the NodeManagers are responsible for resource provisioning and isolation. When the ResourceManager assigns resources on a NodeManager to a task (so-called resource scheduling), the NodeManager must provide the task with the required resources and may even guarantee that they are exclusive to it (so-called resource isolation).

Multiple computing frameworks can run on the YARN platform, such as MapReduce, Tez, Storm and Spark.
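
Once the cluster is up (Section 5), a few built-in yarn commands make these roles visible; a small sketch (the application ID is a placeholder):

# List the NodeManagers registered with the ResourceManager
yarn node -list

# List running applications and their ApplicationMasters' tracking URLs
yarn application -list

# Show the state and resource usage of one application
yarn application -status application_1491234567890_0001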

5, Sqoop (data synchronization tool)

Sqoop is an abbreviation of SQL-to-Hadoop, and it is used primarily for transferring data between a traditional relational database and Hadoop. Import and export are essentially MapReduce programs, taking full advantage of MR's parallelism and fault tolerance; in particular, map-only jobs are used to perform the import and export in parallel. Sqoop development has produced two lines: the Sqoop 1.x series and the Sqoop 1.99.x series. The Sqoop 1 series is operated mainly through the command line.

Sqoop 1 import principle: obtain metadata information (schema, table, fields, field types) from the traditional database, then turn the import into a map-only MapReduce job with many map tasks, each map reading a slice of the data, so the copy completes in parallel.

Sqoop 1 export principle: obtain the schema and metadata of the table to be exported and match it against the fields in Hadoop; multiple map-only tasks then run at the same time, exporting the data in HDFS to the relational database.
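
A hedged sketch of what those two directions look like on the Sqoop 1 command line; the MySQL host, database, table and credentials here are placeholders, not part of this cluster:

# Import: copy table rows from MySQL into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  -m 4

# Export: push the files under an HDFS directory back into a MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username dbuser -P \
  --table orders_copy \
  --export-dir /user/hadoop/orders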

Sqoop 1.99.x belongs to the Sqoop 2 line; it is not yet a fully functional product and is still in a testing phase, so it is generally not used in production.

One known caveat with Sqoop: if a map task fails during an import or export, the ApplicationMaster reschedules another attempt of that task, but the data already written by the failed attempt may end up duplicated by the rescheduled attempt.

6, Mahout (data mining algorithm library)

Mahout originated in 2008 as a subproject of Apache Lucene; it developed rapidly and is now a top-level Apache project. Implementing machine learning algorithms directly with MapReduce often takes a lot of development time and a long cycle, so Mahout's main goal is to provide scalable implementations of classic machine learning algorithms, letting developers create intelligent applications more conveniently and quickly.

Mahout now includes a wide range of data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. Beyond the algorithms, Mahout also includes data input/output tools and integration with other storage systems such as relational databases, MongoDB or Cassandra.

Each Mahout component is built into a corresponding jar package. At this point the natural question is: how is Mahout actually used?

In fact, Mahout is just a library of machine learning algorithms. It contains the corresponding machine learning algorithms, such as recommender systems (both user-based and item-based), clustering and classification. Some of these algorithms have MapReduce or Spark implementations that run on the Hadoop platform; in actual development you only need to pull in the corresponding jar packages.

7, HBase (distributed column-oriented database)

HBase derives from Google's Bigtable paper, published in November 2006. Traditional relational databases are row-oriented; HBase is a Bigtable clone: a scalable, highly reliable, high-performance, distributed, column-oriented database with a dynamic schema for structured data. Unlike traditional relational databases, HBase uses the Bigtable data model: an enhanced sparse sorted map (key/value), where the key consists of a row key, a column key and a timestamp. HBase provides random, real-time read and write access to large-scale data, and the data stored in HBase can be processed with MapReduce, which combines data storage with parallel computing.

HBase table characteristics

1) Large: a table can have billions of rows and millions of columns.

2) Schema-free: each row has a sortable row key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have different columns.

3) Column-oriented: storage and permission control are per column (family), and column families are retrieved independently.

4) Sparse: null columns take up no storage space, so tables can be designed to be very sparse.

5) Multi-version data: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted.

6) Single data type: all data in HBase is stored as uninterpreted byte strings; there are no column types.

HBase physical model

Each column family is stored in its own set of files on HDFS, and null values are not saved.

The row key and version number are stored once in each column family.

HBase maintains a multi-level index for each value; its physical storage works as follows:

1. All rows in a table are sorted by row key.

2. A table is divided in the row direction into multiple Regions.

3. Regions split by size: each table starts with only one region; as data grows the region grows, and when it exceeds a threshold it splits into two new regions, so over time there are more and more regions.

4. The Region is the smallest unit of distributed storage and load balancing in HBase; different regions are distributed across different RegionServers.

5. The Region is the smallest unit of distribution, but not the smallest unit of storage. A region consists of one or more Stores, and each Store holds one column family. Each Store is made up of one MemStore and zero or more StoreFiles; a StoreFile wraps an HFile. The MemStore lives in memory, while StoreFiles are stored on HDFS.
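
A minimal sketch of this model from the HBase shell (assuming an HBase installation on top of this cluster, which this post does not cover): rows are addressed by row key, values by column family:qualifier, and every cell is versioned by its timestamp.

hbase shell <<'EOF'
create 'weblog', 'cf'                      # table with one column family
put 'weblog', 'row1', 'cf:url', '/index'   # cell = (row key, cf:qualifier, timestamp) -> value
put 'weblog', 'row1', 'cf:status', '200'
get 'weblog', 'row1'
scan 'weblog'                              # rows come back sorted by row key
EOF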

8, ZooKeeper (distributed coordination service)

ZooKeeper derives from Google's Chubby paper, published in November 2006; ZooKeeper is a Chubby clone. It mainly solves data management problems in a distributed environment: unified naming, state synchronization, cluster management and configuration synchronization.

ZooKeeper's operation boils down to two steps: 1) electing a leader, and 2) synchronizing data. This component is required when implementing HA (high availability) for the NameNode.
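
A small sketch with the ZooKeeper command-line client, assuming a ZooKeeper ensemble is already running (it is not installed in this post); znodes form the unified namespace that the naming and configuration use cases above rely on:

# Connect to one ensemble member (host and port are placeholders)
zkCli.sh -server master:2181

# Inside the client, a znode can be created, read and listed:
#   create /cluster/config "v1"
#   get /cluster/config
#   ls /cluster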

9, Pig (Hadoop-based data flow system)

Open-sourced by Yahoo!, its design motivation is to provide a MapReduce-based tool for ad-hoc (computed at query time) data analysis.

Pig defines a data flow language, Pig Latin, and converts scripts written in it into MapReduce jobs on Hadoop. It is usually used for offline analysis.
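
A hedged sketch of a Pig Latin script run in MapReduce mode, assuming Pig is installed on top of this cluster (the input and output paths are arbitrary examples):

# wordcount.pig -- each relation below becomes part of a MapReduce plan
cat > wordcount.pig <<'EOF'
lines   = LOAD '/user/hadoop/wc-in' USING TextLoader() AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/hadoop/pig-wc-out';
EOF

pig -x mapreduce wordcount.pig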

10, Hive (Hadoop-based data warehouse)

Open-sourced by Facebook, Hive was initially used for statistics over massive amounts of structured log data.

Hive defines an SQL-like query language (HQL) that transforms SQL-style queries into MapReduce jobs on Hadoop. It is usually used for offline analysis.
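
A hedged sketch of the same idea with HQL, assuming Hive is installed and configured against this cluster (the table and input path are illustrative):

hive -e "
CREATE TABLE IF NOT EXISTS access_log (ip STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/hadoop/access.tsv' INTO TABLE access_log;

-- this SELECT is compiled into one or more MapReduce jobs
SELECT status, COUNT(*) FROM access_log GROUP BY status;
"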

11, Flume (log collection tool)

Flume is Cloudera's open-source log collection system; it is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

Flume abstracts the process of generating, transmitting, processing and finally writing data to a target path as a data stream. The data source supports custom data senders, so data arriving over various protocols can be collected. Flume data streams also provide simple processing of log data, such as filtering and format conversion, and Flume can write logs to a variety of (customizable) data targets. In general, Flume is a scalable log collection system suited to complex environments and massive volumes of logs.
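
A minimal sketch of a Flume 1.x agent that tails a log file and writes it to HDFS, assuming Flume is installed on a node of this cluster (the agent name, file paths and HDFS URL are placeholders):

# flume-hdfs.conf -- one exec source, one memory channel, one HDFS sink
cat > flume-hdfs.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/messages
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

flume-ng agent --conf conf --conf-file flume-hdfs.conf --name a1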

3.1. install necessary packages for OS

We pick the CentOS minimal ISO as our installation base; once the system is installed, we need a few more basic packages:

yum install -y net-tools
yum install -y openssh-server
yum install -y wget
yum install -y ntpd
systemctl enable ntpd ; systemctl start ntpd
ntpdate -u 0.centos.pool.ntp.org
The first package provides ifconfig (net-tools), the second allows SSH logins from remote peers, and ntpd keeps the clocks of all nodes synchronized.

3.2. setup hostname for all nodes

This step is optional, but it makes it much easier to tell which node you are on when you walk through different nodes with the same username.

hostnamectl set-hostname master

e.g. on the master node; set the appropriate hostname on each of the other nodes as well.

Re-login to check the effect.

3.3. setup jdk for all nodes

Install the JDK from the Oracle official website:

wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u121-b13/e9e7ea248e2c4826b92b3f075a80e441/jdk-8u121-linux-x64.rpm
yum localinstall -y jdk-8u121-linux-x64.rpm
rm jdk-8u121-linux-x64.rpm
add java.sh under /etc/profile.d/

[root@master ~]# yum localinstall -y jdk-8u121-linux-x64.rpm
Loaded plugins: fastestmirror
Examining jdk-8u121-linux-x64.rpm: 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64
Marking jdk-8u121-linux-x64.rpm to be installed
Resolving Dependencies
–> Running transaction check
—> Package jdk1.8.0_121.x86_64 2000:1.8.0_121-fcs will be installed
–> Finished Dependency Resolution

Dependencies Resolved

=========================================================================================================================================================================
Package Arch Version Repository Size
=========================================================================================================================================================================
Installing:
jdk1.8.0_121 x86_64 2000:1.8.0_121-fcs /jdk-8u121-linux-x64 263 M

Transaction Summary
=========================================================================================================================================================================
Install 1 Package

Total size: 263 M
Installed size: 263 M
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64 1/1
Unpacking JAR files…
tools.jar…
plugin.jar…
javaws.jar…
deploy.jar…
rt.jar…
jsse.jar…
charsets.jar…
localedata.jar…
Verifying : 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64 1/1

Installed:
jdk1.8.0_121.x86_64 2000:1.8.0_121-fcs

Complete!
[root@master ~]#
java.sh content:

export JAVA_HOME=/usr/java/jdk1.8.0_121
export JRE_HOME=/usr/java/jdk1.8.0_121/jre
export CLASSPATH=$JAVA_HOME/lib:.
export PATH=$PATH:$JAVA_HOME/bin
Re-login, and you will find the environment variables in place and Java properly installed.

Approach to verification:

java -version
ls $JAVA_HOME
echo $PATH
If the reported java version is wrong, you can switch it with:

[root@master ~]# update-alternatives --config java

There is 1 program that provides 'java'.

Selection Command
-----------------------------------------------
*+ 1 /usr/java/jdk1.8.0_121/jre/bin/java

3.4. setup user and user group on all nodes

groupadd hadoop
useradd -d /home/hadoop -g hadoop hadoop
passwd hadoop
3.5. modify hosts file for network inter-recognition on all nodes

echo '192.168.1.80 master.rmohan.com master' >> /etc/hosts
echo '192.168.1.81 lab1.rmohan.com lab1 slave1' >> /etc/hosts
echo '192.168.1.82 lab2.rmohan.com lab2 slave2' >> /etc/hosts
check the recognition works:

ping master
ping slave1
ping slave2

3.6. setup ssh no password login on all nodes

su – hadoop
ssh-keygen -t rsa
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2
Now you can SSH into all three nodes without a password; please give it a try to check it out.

3.7. stop & disable firewall

systemctl stop firewalld.service
systemctl disable firewalld.service
4. Hadoop Setup

Note: all of the Step 4 operations happen on a single node, say master. In addition, we log in as the hadoop user to perform all of them.

su – hadoop
4.1. Download and untar on the file system.
[hadoop@master ~]$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
--2017-04-12 22:49:01--  http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
Resolving mirrors.sonic.net (mirrors.sonic.net)... 69.12.162.27
Connecting to mirrors.sonic.net (mirrors.sonic.net)|69.12.162.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 212046774 (202M) [application/x-gzip]
Saving to: 'hadoop-2.7.2.tar.gz'

 1% [>                                      ] 2,932,536    811KB/s  eta 4m 24s

tar -zxvf hadoop-2.7.2.tar.gz
rm hadoop-2.7.2.tar.gz
chmod 775 hadoop-2.7.2
4.2. Add environment variables for hadoop

Append the following content to ~/.bashrc:

export HADOOP_HOME=/home/hadoop/hadoop-2.7.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Then make these variables take effect:

source ~/.bashrc
4.3. Modify configuration files for hadoop

Add slave node hostnames into $HADOOP_HOME/etc/hadoop/slaves file
echo slave1 > $HADOOP_HOME/etc/hadoop/slaves
echo slave2 >> $HADOOP_HOME/etc/hadoop/slaves
Add secondary node hostname into $HADOOP_HOME/etc/hadoop/masters file
echo slave1 > $HADOOP_HOME/etc/hadoop/masters
Modify $HADOOP_HOME/etc/hadoop/core-site.xml as follows:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000/</value>
<description>namenode settings</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-2.7.2/tmp/hadoop-${user.name}</value>
<description> temp folder </description>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml as follows:
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>master:50070</value>
<description> fetch NameNode images and edits </description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave1:50090</value>
<description> fetch SecondNameNode fsimage </description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description> replica count </description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoop-2.7.2/hdfs/name</value>
<description> namenode </description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoop-2.7.2/hdfs/data</value>
<description> DataNode </description>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///home/hadoop/hadoop-2.7.2/hdfs/namesecondary</value>
<description> check point </description>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.stream-buffer-size</name>
<value>131072</value>
<description> buffer </description>
</property>
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
<description> duration </description>
</property>
</configuration>
Modify $HADOOP_HOME/etc/hadoop/mapred-site.xml as follows (if the file does not exist, copy it from mapred-site.xml.template first):

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>hdfs://master:9001</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
<description>MapReduce JobHistory Server host:port, default port is 10020.</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
<description>MapReduce JobHistory Server Web UI host:port, default port is 19888.</description>
</property>
</configuration>

Modify $HADOOP_HOME/etc/hadoop/yarn-site.xml as follows:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
</configuration>
4.4. Create necessary folders

mkdir -p $HADOOP_HOME/tmp
mkdir -p $HADOOP_HOME/hdfs/name
mkdir -p $HADOOP_HOME/hdfs/data
4.5. Copy hadoop folders and environment settings to slaves

scp ~/.bashrc slave1:~/
scp ~/.bashrc slave2:~/

scp -r ~/hadoop-2.7.2 slave1:~/
scp -r ~/hadoop-2.7.2 slave2:~/
5. Launch hadoop cluster service

Format the NameNode before the first launch:

hdfs namenode -format
Launch the HDFS distributed file system:
start-dfs.sh

[hadoop@master ~]$ start-dfs.sh
17/04/12 23:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Starting namenodes on [master]
The authenticity of host 'master (192.168.1.80)' can't be established.
ECDSA key fingerprint is d8:55:ea:50:b3:bb:8a:fc:90:a2:0e:54:3e:79:60:bc.
Are you sure you want to continue connecting (yes/no)? yes
master: Warning: Permanently added 'master,192.168.1.80' (ECDSA) to the list of known hosts.
hadoop@master's password:
master: starting namenode, logging to /home/hadoop/hadoop-2.7.2/logs/hadoop-hadoop-namenode-master.rmohan.com.out
slave2: starting datanode, logging to /home/hadoop/hadoop-2.7.2/logs/hadoop-hadoop-datanode-lab2.rmohan.com.out
slave1: starting datanode, logging to /home/hadoop/hadoop-2.7.2/logs/hadoop-hadoop-datanode-lab1.rmohan.com.out
Starting secondary namenodes [slave1]
slave1: starting secondarynamenode, logging to /home/hadoop/hadoop-2.7.2/logs/hadoop-hadoop-secondarynamenode-lab1.rmohan.com.out
17/04/12 23:27:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

Launch the YARN distributed computing system:
start-yarn.sh

[hadoop@master ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.7.2/logs/yarn-hadoop-resourcemanager-master.rmohan.com.out
slave2: starting nodemanager, logging to /home/hadoop/hadoop-2.7.2/logs/yarn-hadoop-nodemanager-lab2.rmohan.com.out
slave1: starting nodemanager, logging to /home/hadoop/hadoop-2.7.2/logs/yarn-hadoop-nodemanager-lab1.rmohan.com.out
[hadoop@master ~]$

Shut down the Hadoop cluster:

stop-yarn.sh
stop-dfs.sh
6. Verify the hadoop cluster is up and healthy

6.1. Verify with jps processes

Check jps on each node, and view results.

[hadoop@master ~]$ jps
3184 ResourceManager
3441 Jps
2893 NameNode
[hadoop@master ~]$

On slave1 node:
[hadoop@lab1 ~]$ jps
3026 NodeManager
3127 Jps
2811 DataNode
2907 SecondaryNameNode
[hadoop@lab1 ~]$

On slave2 node:
[hadoop@lab2 ~]$ jps
2722 DataNode
2835 NodeManager
2934 Jps
[hadoop@lab2 ~]$

6.2. Verify on Web interface

Open 192.168.1.80:50070 to view HDFS storage status.
Open 192.168.1.80:8088 to view YARN computing resources and application status.
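
As a final sanity check, it is worth running one small job end to end. A hedged sketch using the examples jar shipped with this Hadoop release:

# Estimate pi with 2 map tasks of 10 samples each -- exercises HDFS and YARN together
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10

# Confirm that both DataNodes have registered and report their capacity
hdfs dfsadmin -report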

7. End

This is all for the basic three-node Hadoop cluster. The high-availability version and the related Hadoop ecosystem components will be covered in other posts.

Thanks for contacting me if you spot a typo, have any suggestions, or find anything unclear.

 

 

How do I add a new node to a Hadoop cluster?

1. Install Hadoop on the new node, or copy the installation over from an existing node.

2. Copy the NameNode's configuration files to the new node.

3. Add the new node to the masters and slaves files; this change has to be made on all nodes.

4. Set up passwordless SSH access to the new node.

5. Start the DataNode and the NodeManager (the TaskTracker in Hadoop 1.x) individually on the new node, as sketched below.

6. Run start-balancer.sh for data load balancing.
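
For the Hadoop 2.7.2 cluster built in this post, the per-daemon start commands in step 5 look roughly like this (run as the hadoop user; the YARN NodeManager takes the place of the old TaskTracker):

# On the new node: start the HDFS and YARN worker daemons individually
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

# On the master (or any node): rebalance existing blocks onto the new DataNode
$HADOOP_HOME/sbin/start-balancer.sh -threshold 10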
