{"id":6634,"date":"2017-04-12T23:56:18","date_gmt":"2017-04-12T15:56:18","guid":{"rendered":"http:\/\/rmohan.com\/?p=6634"},"modified":"2017-04-13T00:05:17","modified_gmt":"2017-04-12T16:05:17","slug":"centos-7-3-under-the-hadoop-2-7-2-cluster","status":"publish","type":"post","link":"https:\/\/mohan.sg\/?p=6634","title":{"rendered":"CentOS 7.3 under the Hadoop 2.7.2 cluster"},"content":{"rendered":"<p>CentOS 7.3 under the Hadoop 2.7.2 cluster<\/p>\n<p><strong>how to setup a Hadoop cluster on CentOS linux system. Before you read this article, I assume you already have all basic conceptions about Hadoop and Linux operating system.<\/strong><\/p>\n<p>mv ifcfg-eno16777736 ifcfg-eth0<br \/>\nvi \/etc\/udev\/rules.d\/90-eno-fix.rules<br \/>\n# This file was automatically generated on systemd update<br \/>\nSUBSYSTEM==&#8221;net&#8221;, ACTION==&#8221;add&#8221;, DRIVERS==&#8221;?*&#8221;, ATTR{address}==&#8221;00:0c:29:9e:8f:95&#8243;, NAME=&#8221;eno16777736&#8243;<\/p>\n<p># This file was automatically generated on systemd update<br \/>\nSUBSYSTEM==&#8221;net&#8221;, ACTION==&#8221;add&#8221;, DRIVERS==&#8221;?*&#8221;, ATTR{address}==&#8221;00:0c:29:9e:8f:95&#8243;, NAME=&#8221;eth0&#8243;<\/p>\n<p>&nbsp;<\/p>\n<p>vi \/etc\/hosts<\/p>\n<p>192.168.1.80 master.rmohan.com master<br \/>\n192.168.1.81 lab1.rmohan.com lab1<br \/>\n192.168.1.82 lab2.rmohan.com lab2<\/p>\n<p>Architecture<\/p>\n<p>IP Address Hostname Role<br \/>\n192.168.1.80 master NameNode, ResourceManager<br \/>\n192.168.1.81 slave1 SecondaryNameNode, DataNode, NodeManager<br \/>\n192.168.1.82 slave2 DataNode, NodeManager<\/p>\n<p>&nbsp;<\/p>\n<p>Before we start, we will understand the meaning of the following:<\/p>\n<p><strong>DataNode<\/strong><\/p>\n<p>A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with the data replicated across them.<\/p>\n<p><strong>NameNode<\/strong><\/p>\n<p>The NameNode is the centrepiece of an HDFS file system. 
It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.<\/p>\n<p><strong>NodeManager<\/strong><\/p>\n<p>The NodeManager (NM) is YARN's per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers' life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications.<\/p>\n<p><strong>ResourceManager<\/strong><\/p>\n<p>The ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).<\/p>\n<p><strong>Secondary Namenode<\/strong><\/p>\n<p>The Secondary NameNode's whole purpose is to create checkpoints in HDFS. It is just a helper node for the NameNode.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>HDFS (Hadoop distributed file system)<\/p>\n<p>HDFS is a clone of GFS, described in the GFS paper from Google published in October 2003.<\/p>\n<p>It is the basis of data storage management in the Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures, designed to run on low-cost general-purpose hardware. 
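As a quick illustration of the storage layer described here, a few basic HDFS shell operations can be run once the cluster built later in this article is up (the file names and paths below are illustrative, not from this article):

```shell
# Round-trip a small file through HDFS (requires a running cluster
# and the hadoop bin/ directory on PATH; paths are illustrative).
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -mkdir -p /user/hadoop/demo              # create a directory in the namespace
hdfs dfs -put /tmp/hello.txt /user/hadoop/demo/   # blocks are written to the DataNodes
hdfs dfs -cat /user/hadoop/demo/hello.txt         # prints: hello hdfs
hdfs dfs -rm -r /user/hadoop/demo                 # clean up
```

The NameNode only records where the blocks live; the bytes themselves travel between the client and the DataNodes.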
HDFS simplifies the file consistency model and, through streaming data access, provides high-throughput access to application data, making it well suited to applications with large data sets.<\/p>\n<p>HDFS is composed of the following main parts:<\/p>\n<p>(1), Client: splits files; accesses HDFS; interacts with the NameNode to get file location information; interacts with DataNodes to read and write data.<\/p>\n<p>(2), NameNode: the master node (only one in Hadoop 1.x). It manages the HDFS namespace and block mapping information, applies the replica policy, and handles client requests. For large clusters, Hadoop 1.x has two major flaws: 1) NameNode memory becomes a bottleneck, limiting the NameNode's scalability; 2) the NameNode is a single point of failure.<\/p>\n<p>Hadoop 2.x later resolved both issues. For flaw 1), HDFS Federation was proposed: multiple NameNodes serve multiple namespaces, scaling the NameNode horizontally and relieving the memory pressure on any single NameNode.<\/p>\n<p>For flaw 2), Hadoop 2.x proposed an HA solution with two NameNodes in hot standby: one is in the standby state while the other is active.<\/p>\n<p>(3), DataNode: a slave node that stores the actual data and reports what it stores to the NameNode.<\/p>\n<p>(4), Secondary NameNode: assists the NameNode and shares part of its workload; it periodically merges the fsimage and edits files and pushes the result back to the NameNode. In an emergency it can help restore the NameNode, but it is not a hot standby for the NameNode.<\/p>\n<p>As long as the disk is intact, the Secondary NameNode's checkpoint can be used to recover the NameNode.<\/p>\n<p>3, MapReduce (distributed computing framework)<\/p>\n<p>Derived from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google MapReduce. 
MapReduce is a computational model for processing large amounts of data. Map applies an operation to independent elements of the dataset to generate intermediate key-value pairs; Reduce aggregates all the values that share the same key to produce the final result. This functional division makes MapReduce very well suited to data processing across large numbers of machines in a distributed, parallel environment.<\/p>\n<p>The MapReduce framework has evolved through two versions of the MapReduce API. MR1 has the following main components:<\/p>\n<p>(1), JobTracker: the master node (only one). Its main tasks are resource allocation, job scheduling, and supervision: it manages all jobs, monitors jobs and tasks, handles errors, breaks each job down into a series of tasks, and assigns them to TaskTrackers.<\/p>\n<p>(2), TaskTracker: a slave node that runs Map Tasks and Reduce Tasks, interacts with the JobTracker, and reports task status.<\/p>\n<p>(3), Map Task: parses each data record, passes it to the user-written map() function, and writes the output to local disk.<\/p>\n<p>(4), Reduce Task: remotely reads the Map Task results as its input, sorts the data, and passes it to the user-written reduce() function.<\/p>\n<p>Within this flow there is a shuffle phase, which is the key to understanding the MapReduce framework. The shuffle covers everything that happens between the output of the map function and the input of the reduce function, and it can be divided into a map side and a reduce side.<\/p>\n<p>Map side:<\/p>\n<p>1) The input data is split into slices; the slice size depends on the size of the original file and the file block size. 
Each slice corresponds to one map task.<\/p>\n<p>2) While a map task runs, its results are buffered in memory. When memory usage reaches a configurable threshold, the map writes the intermediate results to local disk, forming a temporary file. This process is called spilling.<\/p>\n<p>3) During the spill, records are assigned to partitions according to the configured number of reduce tasks; this is the partitioning step, and each partition corresponds to one reduce task. The data is also sorted as it is written. A combiner can optionally run during the spill; its result must be consistent with what the reducer would produce, so its applicability is limited and it should be used with care.<\/p>\n<p>4) Each map must end with only one temporary file as input to reduce, so a merge operation combines the multiple spill files on disk into a single, internally partitioned temporary file.<\/p>\n<p>Reduce side:<\/p>\n<p>1) First the data is localized: the map outputs on remote nodes are copied to the local node.<\/p>\n<p>2) A merge step combines the map outputs fetched from the different nodes.<\/p>\n<p>3) Copying and merging continue until a single input file is formed. Reduce stores the final results on HDFS.<\/p>\n<p>MR2 is the new generation of the MR API. It runs mainly on YARN's resource management framework.<\/p>\n<p>4, Yarn (resource management framework)<\/p>\n<p>This framework came out of Hadoop 2.x's rework of the JobTracker and TaskTracker model used in Hadoop 1.x, separating the JobTracker's resource allocation from its job scheduling and supervision. The framework consists mainly of the ResourceManager, ApplicationMaster, and NodeManager. 
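Assuming the cluster built later in this article is running, these YARN daemons and the nodes registered with them can be inspected from the command line; a brief sketch:

```shell
# Show the NodeManagers registered with the ResourceManager
yarn node -list
# Show running applications (each one has its own ApplicationMaster)
yarn application -list
# jps lists the Hadoop JVM daemons on the local node,
# e.g. ResourceManager on master, NodeManager on the slaves
jps
```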
The main workflow is as follows: the ResourceManager is responsible for allocating resources to all applications; the ApplicationMaster is responsible for task scheduling within a single job, that is, each job has its own ApplicationMaster; and the NodeManager receives commands from the ResourceManager and ApplicationMaster to carry out the actual resource allocation.<\/p>\n<p>After receiving a client's job submission request, the ResourceManager allocates a Container. Note that the ResourceManager allocates resources in units of Containers. The first allocated Container starts the ApplicationMaster, which is primarily responsible for job scheduling. Once the ApplicationMaster is up, it communicates directly with the NodeManagers.<\/p>\n<p>In YARN, resource management is shared between the ResourceManager and the NodeManager: the scheduler inside the ResourceManager is responsible for resource allocation, while the NodeManager is responsible for resource provisioning and isolation. The ResourceManager assigns resources on a NodeManager to a task (so-called &#8220;resource scheduling&#8221;); the NodeManager then has to supply the task with the required resources, and even guarantee their exclusivity as a basis for reliable task execution (so-called &#8220;resource isolation&#8221;).<\/p>\n<p>Multiple computing frameworks can run on the YARN platform, such as MR, Tez, Storm, and Spark.<\/p>\n<p>5, Sqoop (data synchronization tool)<\/p>\n<p>Sqoop is an abbreviation of SQL-to-Hadoop and is used primarily for transferring data between a traditional relational database and Hadoop. An import or export is essentially a MapReduce program that takes full advantage of MR's parallelism and fault tolerance, mainly using map-only tasks for parallel import and export. 
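A sketch of such a map-only transfer (the JDBC URL, credentials, and table names here are hypothetical):

```shell
# Import a table from MySQL into HDFS with 4 parallel map tasks,
# then export it back; connection details are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table orders --target-dir /user/hadoop/orders -m 4

sqoop export \
  --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table orders_copy --export-dir /user/hadoop/orders -m 4
```

Each of the `-m 4` map tasks reads or writes one slice of the table in parallel.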
Sqoop has two lines of development: the sqoop1.x series and the sqoop1.99.x series. The sqoop1 series is operated mainly through the command line.<\/p>\n<p>Sqoop1 import principle: obtain metadata (schema, table, fields, field types) from the traditional database, then convert the import into a map-only MapReduce job; many maps run in parallel, each reading one slice of the data, so the copy completes in parallel.<\/p>\n<p>Sqoop1 export principle: obtain the schema and metadata of the table to be exported and match its fields against the data in Hadoop; multiple map-only jobs run at the same time, exporting the data in HDFS to the relational database.<\/p>\n<p>Sqoop1.99.x is the sqoop2 line. It is not yet fully functional and is still in a testing phase, so it is generally not used in production.<\/p>\n<p>One caveat with Sqoop as currently understood: if a map task fails during an import or export, the ApplicationMaster will re-schedule another task to rerun it, but the data already written by the failed map task and the output of the re-scheduled task may then be duplicated.<\/p>\n<p>6, Mahout (data mining algorithm library)<\/p>\n<p>Mahout originated in 2008 as a subproject of Apache Lucene and, after very rapid development, is now an Apache top-level project. Implementing machine learning algorithms directly with traditional MapReduce programming often takes a lot of development time and has a long cycle; Mahout's main goal is to provide scalable implementations of classic machine learning algorithms so that developers can create intelligent applications more conveniently and quickly.<\/p>\n<p>Mahout now includes a wide range of data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. 
In addition to the algorithms, Mahout includes data input\/output tools and data mining support infrastructure, such as integration with other storage systems like relational databases, MongoDB, or Cassandra.<\/p>\n<p>Each Mahout component generates a corresponding jar package. At this point we need to answer a question: how is Mahout actually used?<\/p>\n<p>In fact, Mahout is just a machine learning algorithm library containing the corresponding machine learning algorithms, such as recommender systems (including user-based and item-based recommendations) and clustering and classification algorithms. Some of these algorithms are implemented on MapReduce or Spark and can run on the Hadoop platform; in actual development you only need to include the corresponding jar package.<\/p>\n<p>7, Hbase (distributed column-oriented database)<\/p>\n<p>Based on the Bigtable paper from Google, published in November 2006. Whereas a traditional relational database is row-oriented, HBase is a Google Bigtable clone: a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data. Unlike traditional relational databases, HBase uses the BigTable data model: an enhanced sparse sorted map (Key\/Value), where keys consist of a row key, a column key, and a timestamp. 
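This data model can be explored interactively; a minimal sketch using the standard HBase shell (the table and column family names are made up):

```shell
# Pipe a short session into the HBase shell; 't1' and 'cf' are example names.
echo "
create 't1', 'cf'
put 't1', 'row1', 'cf:a', 'value1'
get 't1', 'row1'
scan 't1'
disable 't1'
drop 't1'
" | hbase shell
```

Every `put` writes one versioned cell addressed by (row key, column family:qualifier, timestamp).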
HBase provides random, real-time read and write access to large-scale data, and data stored in HBase can be processed with MapReduce, combining data storage with parallel computing.<\/p>\n<p>Hbase table features<\/p>\n<p>1), large: a table can have billions of rows and millions of columns;<\/p>\n<p>2), schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows of the same table can have different columns;<\/p>\n<p>3), column-oriented: storage and permission control are per column (family), and column families are retrieved independently;<\/p>\n<p>4), sparse: null columns take up no storage space, so tables can be designed to be very sparse;<\/p>\n<p>5), multi-version data: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;<\/p>\n<p>6), single data type: all data in HBase are strings; there are no other types.<\/p>\n<p>Hbase physical model<\/p>\n<p>Each column family is stored in a separate file on HDFS, and null values are not saved.<\/p>\n<p>Each column family has its own key and version number;<\/p>\n<p>HBase maintains a multi-level index for each value, i.e. its physical storage:<\/p>\n<p>1, all the rows in a table are sorted by row key;<\/p>\n<p>2, the table is divided row-wise into multiple Regions;<\/p>\n<p>3, Regions split by size: each table starts with only one Region; as data grows, Regions grow, and when a Region reaches a threshold it is split into two new Regions, so there are more and more Regions;<\/p>\n<p>4, the Region is the smallest unit of distributed storage and load balancing in HBase, and different Regions are distributed to different RegionServers;<\/p>\n<p>5, the Region is the smallest unit of distribution, but not the smallest unit of storage. 
A Region consists of one or more Stores, and each Store holds one column family. Each Store is made up of one MemStore and zero or more StoreFiles; a StoreFile contains an HFile. The MemStore lives in memory, while StoreFiles are stored on HDFS.<\/p>\n<p>8, Zookeeper (distributed coordination service)<\/p>\n<p>Based on the Google Chubby paper, published in November 2006, ZooKeeper is a Chubby clone. It mainly solves data management issues in distributed environments: unified naming, state synchronization, cluster management, and configuration synchronization.<\/p>\n<p>ZooKeeper's operation has two main steps: 1), leader election; 2), data synchronization. This component is required when implementing HA (high availability) for the NameNode.<\/p>\n<p>9, Pig (Hadoop-based data flow system)<\/p>\n<p>Open-sourced by Yahoo!, its design motivation is to provide a MapReduce-based tool for ad-hoc (computed at query time) data analysis.<\/p>\n<p>It defines a data flow language, Pig Latin, and converts Pig Latin scripts into MapReduce jobs on Hadoop. Usually used for off-line analysis.<\/p>\n<p>10, Hive (Hadoop-based data warehouse)<\/p>\n<p>Open-sourced by Facebook, Hive was initially used for statistics over massive amounts of structured log data.<\/p>\n<p>Hive defines a SQL-like query language (HQL) that transforms queries into MapReduce jobs on Hadoop. 
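For example, a minimal HQL session can be driven from the command line (the table and columns are illustrative, not from this article):

```shell
# Each query is compiled into one or more MapReduce jobs on the cluster.
hive -e "CREATE TABLE IF NOT EXISTS access_log (host STRING, status INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
hive -e "SELECT status, COUNT(*) FROM access_log GROUP BY status;"
```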
Usually used for off-line analysis.<\/p>\n<p>11, Flume (log collection tool)<\/p>\n<p>Cloudera open source log collection system, with distributed, highly reliable, high fault tolerance, easy to customize and expand the characteristics.<\/p>\n<p><a href=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/Hadoop.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6642\" src=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/Hadoop.png\" alt=\"\" width=\"695\" height=\"433\" srcset=\"https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/Hadoop.png 695w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/Hadoop-300x187.png 300w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/Hadoop-150x93.png 150w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/Hadoop-400x249.png 400w\" sizes=\"(max-width: 695px) 100vw, 695px\" \/><\/a> <a href=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hdfsarchitecture.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6643\" src=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hdfsarchitecture.png\" alt=\"\" width=\"874\" height=\"604\" srcset=\"https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hdfsarchitecture.png 874w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hdfsarchitecture-300x207.png 300w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hdfsarchitecture-768x531.png 768w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hdfsarchitecture-150x104.png 150w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hdfsarchitecture-400x276.png 400w\" sizes=\"(max-width: 874px) 100vw, 874px\" \/><\/a>It abstracts the data from the process of generating, transmitting, processing, and ultimately writing to the target path as a data stream. The data source supports the custom data sender in Flume to support the collection of various protocol data. 
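A minimal single-agent configuration gives the flavour of how Flume sources, channels, and sinks are wired together (the agent and component names a1, r1, c1, k1 are conventional examples, not from this article):

```shell
# Write a minimal Flume agent definition: a netcat source feeding
# a memory channel drained by a logger sink.
cat > flume-example.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF
```

The agent would then be started with `flume-ng agent --conf-file flume-example.conf --name a1`, after which events sent to port 44444 appear on the logger sink.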
At the same time, a Flume data stream provides the ability to process log data on the fly, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets. In general, Flume is a scalable log-collection system suited to complex environments and massive log volumes.<\/p>\n<p>3.1. install necessary packages for OS<\/p>\n<p>We pick the CentOS minimal ISO as our installation base; once the system is installed, we need several more basic packages:<\/p>\n<p>yum install -y net-tools<br \/>\nyum install -y openssh-server<br \/>\nyum install -y wget<br \/>\nyum install -y ntp<br \/>\nsystemctl enable ntpd ; systemctl start ntpd<br \/>\nntpdate -u 0.centos.pool.ntp.org<br \/>\nThe first line installs ifconfig, while the second lets remote peers log in over ssh.<\/p>\n<p>3.2. setup hostname for all nodes<\/p>\n<p>This step is optional, but it helps you identify which node you are on while walking through different nodes with the same username.<\/p>\n<p>hostnamectl set-hostname master<\/p>\n<p>ex: at master node<\/p>\n<p>re-login to check the effect<\/p>\n<p>3.3. 
setup jdk for all nodes<\/p>\n<p>install the jdk from the oracle official website<\/p>\n<p>wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http:\/\/download.oracle.com\/otn-pub\/java\/jdk\/8u121-b13\/e9e7ea248e2c4826b92b3f075a80e441\/jdk-8u121-linux-x64.rpm<br \/>\nyum localinstall -y jdk-8u121-linux-x64.rpm<br \/>\nrm jdk-8u121-linux-x64.rpm<br \/>\nthen add java.sh under \/etc\/profile.d\/<\/p>\n<p>[root@master ~]# yum localinstall -y jdk-8u121-linux-x64.rpm<br \/>\nLoaded plugins: fastestmirror<br \/>\nExamining jdk-8u121-linux-x64.rpm: 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64<br \/>\nMarking jdk-8u121-linux-x64.rpm to be installed<br \/>\nResolving Dependencies<br \/>\n--&gt; Running transaction check<br \/>\n---&gt; Package jdk1.8.0_121.x86_64 2000:1.8.0_121-fcs will be installed<br \/>\n--&gt; Finished Dependency Resolution<\/p>\n<p>Dependencies Resolved<\/p>\n<p>============================================================<br \/>\nPackage Arch Version Repository Size<br \/>\n============================================================<br \/>\nInstalling:<br \/>\njdk1.8.0_121 x86_64 2000:1.8.0_121-fcs \/jdk-8u121-linux-x64 263 M<\/p>\n<p>Transaction Summary<br \/>\n============================================================<br \/>\nInstall 1 Package<\/p>\n<p>Total size: 263 M<br \/>\nInstalled size: 263 M<br \/>\nDownloading packages:<br \/>\nRunning transaction check<br \/>\nRunning transaction test<br \/>\nTransaction test succeeded<br \/>\nRunning transaction<br \/>\nInstalling : 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64 1\/1<br \/>\nUnpacking JAR files&#8230;<br \/>\ntools.jar&#8230;<br 
\/>\nplugin.jar&#8230;<br \/>\njavaws.jar&#8230;<br \/>\ndeploy.jar&#8230;<br \/>\nrt.jar&#8230;<br \/>\njsse.jar&#8230;<br \/>\ncharsets.jar&#8230;<br \/>\nlocaledata.jar&#8230;<br \/>\nVerifying : 2000:jdk1.8.0_121-1.8.0_121-fcs.x86_64 1\/1<\/p>\n<p>Installed:<br \/>\njdk1.8.0_121.x86_64 2000:1.8.0_121-fcs<\/p>\n<p>Complete!<br \/>\n[root@master ~]#<br \/>\njava.sh content:<\/p>\n<p>export JAVA_HOME=\/usr\/java\/jdk1.8.0_121<br \/>\nexport JRE_HOME=\/usr\/java\/jdk1.8.0_121\/jre<br \/>\nexport CLASSPATH=$JAVA_HOME\/lib:.<br \/>\nexport PATH=$PATH:$JAVA_HOME\/bin<br \/>\nRe-login, and you will find all the environment variables in place and java properly installed.<\/p>\n<p>Approach to verification:<\/p>\n<p>java -version<br \/>\nls $JAVA_HOME<br \/>\necho $PATH<br \/>\nif the java version is wrong, you can run<\/p>\n<p>[root@master ~]# update-alternatives --config java<\/p>\n<p>There is 1 program that provides 'java'.<\/p>\n<p>Selection Command<br \/>\n------------------------------------------------<br \/>\n*+ 1 \/usr\/java\/jdk1.8.0_121\/jre\/bin\/java<\/p>\n<p>3.4. setup user and user group on all nodes<\/p>\n<p>groupadd hadoop<br \/>\nuseradd -d \/home\/hadoop -g hadoop hadoop<br \/>\npasswd hadoop<br \/>\n3.5. modify hosts file for network inter-recognition on all nodes<\/p>\n<p>echo '192.168.1.80 master.rmohan.com master' &gt;&gt; \/etc\/hosts<br \/>\necho '192.168.1.81 lab1.rmohan.com lab1 slave1' &gt;&gt; \/etc\/hosts<br \/>\necho '192.168.1.82 lab2.rmohan.com lab2 slave2' &gt;&gt; \/etc\/hosts<br \/>\ncheck that name resolution works:<\/p>\n<p>ping master<br \/>\nping slave1<br \/>\nping slave2<\/p>\n<p>3.6. 
setup ssh no-password login on all nodes<\/p>\n<p>su - hadoop<br \/>\nssh-keygen -t rsa<br \/>\nssh-copy-id master<br \/>\nssh-copy-id slave1<br \/>\nssh-copy-id slave2<br \/>\nNow you can ssh to all 3 nodes without a password; give it a try to check it out.<\/p>\n<p>3.7. stop &amp; disable firewall<\/p>\n<p>systemctl stop firewalld.service<br \/>\nsystemctl disable firewalld.service<br \/>\n4. Hadoop Setup<\/p>\n<p>P.S. all of the Step 4 operations happen on a single node, let's say master. In addition, we'll log in as user hadoop to perform all operations.<\/p>\n<p>su - hadoop<br \/>\n4.1. Download and untar onto the file system.<br \/>\n[hadoop@master ~]$ wget http:\/\/mirrors.sonic.net\/apache\/hadoop\/common\/hadoop-2.7.2\/hadoop-2.7.2.tar.gz<br \/>\n--2017-04-12 22:49:01-- http:\/\/mirrors.sonic.net\/apache\/hadoop\/common\/hadoop-2.7.2\/hadoop-2.7.2.tar.gz<br \/>\nResolving mirrors.sonic.net (mirrors.sonic.net)&#8230; 69.12.162.27<br \/>\nConnecting to mirrors.sonic.net (mirrors.sonic.net)|69.12.162.27|:80&#8230; connected.<br \/>\nHTTP request sent, awaiting response&#8230; 200 OK<br \/>\nLength: 212046774 (202M) [application\/x-gzip]<br \/>\nSaving to: 'hadoop-2.7.2.tar.gz'<\/p>\n<p>1% [&gt; ] 2,932,536 811KB\/s eta 4m 24s<\/p>\n<p>tar -zxvf hadoop-2.7.2.tar.gz<br \/>\nrm hadoop-2.7.2.tar.gz<br \/>\nchmod 775 hadoop-2.7.2<br \/>\n4.2. Add environment variables for hadoop<\/p>\n<p>append the following content to ~\/.bashrc<\/p>\n<p>export HADOOP_HOME=\/home\/hadoop\/hadoop-2.7.2<br \/>\nexport HADOOP_INSTALL=$HADOOP_HOME<br \/>\nexport HADOOP_MAPRED_HOME=$HADOOP_HOME<br \/>\nexport HADOOP_COMMON_HOME=$HADOOP_HOME<br \/>\nexport HADOOP_HDFS_HOME=$HADOOP_HOME<br \/>\nexport YARN_HOME=$HADOOP_HOME<br \/>\nexport HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME\/lib\/native<br \/>\nexport PATH=$PATH:$HADOOP_HOME\/sbin:$HADOOP_HOME\/bin<br \/>\nthen make these variables take effect:<\/p>\n<p>source ~\/.bashrc<br \/>\n4.3. 
Modify configuration files for hadoop<\/p>\n<p>Add slave node hostnames into $HADOOP_HOME\/etc\/hadoop\/slaves file<br \/>\necho slave1 &gt; $HADOOP_HOME\/etc\/hadoop\/slaves<br \/>\necho slave2 &gt;&gt; $HADOOP_HOME\/etc\/hadoop\/slaves<br \/>\nAdd secondary node hostname into $HADOOP_HOME\/etc\/hadoop\/masters file<br \/>\necho slave1 &gt; $HADOOP_HOME\/etc\/hadoop\/masters<br \/>\nModify $HADOOP_HOME\/etc\/hadoop\/core-site.xml as following<br \/>\n&lt;configuration&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;fs.defaultFS&lt;\/name&gt;<br \/>\n&lt;value&gt;hdfs:\/\/master:9000\/&lt;\/value&gt;<br \/>\n&lt;description&gt;namenode settings&lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;hadoop.tmp.dir&lt;\/name&gt;<br \/>\n&lt;value&gt;\/home\/hadoop\/hadoop-2.7.2\/tmp\/hadoop-${user.name}&lt;\/value&gt;<br \/>\n&lt;description&gt; temp folder &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;hadoop.proxyuser.hadoop.hosts&lt;\/name&gt;<br \/>\n&lt;value&gt;*&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;hadoop.proxyuser.hadoop.groups&lt;\/name&gt;<br \/>\n&lt;value&gt;*&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;\/configuration&gt;<br \/>\nModify $HADOOP_HOME\/etc\/hadoop\/hdfs-site.xml as following<br \/>\n&lt;configuration&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.namenode.http-address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:50070&lt;\/value&gt;<br \/>\n&lt;description&gt; fetch NameNode images and edits &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.namenode.secondary.http-address&lt;\/name&gt;<br \/>\n&lt;value&gt;slave1:50090&lt;\/value&gt;<br \/>\n&lt;description&gt; fetch SecondNameNode fsimage &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.replication&lt;\/name&gt;<br 
\/>\n&lt;value&gt;2&lt;\/value&gt;<br \/>\n&lt;description&gt; replica count &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.namenode.name.dir&lt;\/name&gt;<br \/>\n&lt;value&gt;file:\/\/\/home\/hadoop\/hadoop-2.7.2\/hdfs\/name&lt;\/value&gt;<br \/>\n&lt;description&gt; namenode &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.datanode.data.dir&lt;\/name&gt;<br \/>\n&lt;value&gt;file:\/\/\/home\/hadoop\/hadoop-2.7.2\/hdfs\/data&lt;\/value&gt;<br \/>\n&lt;description&gt; DataNode &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.namenode.checkpoint.dir&lt;\/name&gt;<br \/>\n&lt;value&gt;file:\/\/\/home\/hadoop\/hadoop-2.7.2\/hdfs\/namesecondary&lt;\/value&gt;<br \/>\n&lt;description&gt; check point &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.webhdfs.enabled&lt;\/name&gt;<br \/>\n&lt;value&gt;true&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.stream-buffer-size&lt;\/name&gt;<br \/>\n&lt;value&gt;131072&lt;\/value&gt;<br \/>\n&lt;description&gt; buffer &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;dfs.namenode.checkpoint.period&lt;\/name&gt;<br \/>\n&lt;value&gt;3600&lt;\/value&gt;<br \/>\n&lt;description&gt; duration &lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;\/configuration&gt;<br \/>\nModify $HADOOP_HOME\/etc\/hadoop\/mapred-site.xml as following<\/p>\n<p>&lt;configuration&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;mapreduce.framework.name&lt;\/name&gt;<br \/>\n&lt;value&gt;yarn&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;mapreduce.jobtracker.address&lt;\/name&gt;<br \/>\n&lt;value&gt;hdfs:\/\/trucy:9001&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br 
\/>\n&lt;name&gt;mapreduce.jobhistory.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:10020&lt;\/value&gt;<br \/>\n&lt;description&gt;MapReduce JobHistory Server host:port, default port is 10020.&lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;mapreduce.jobhistory.webapp.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:19888&lt;\/value&gt;<br \/>\n&lt;description&gt;MapReduce JobHistory Server Web UI host:port, default port is 19888.&lt;\/description&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;\/configuration&gt;<\/p>\n<p>Modify $HADOOP_HOME\/etc\/hadoop\/yarn-site.xml as following<br \/>\n&lt;configuration&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.hostname&lt;\/name&gt;<br \/>\n&lt;value&gt;master&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.nodemanager.aux-services&lt;\/name&gt;<br \/>\n&lt;value&gt;mapreduce_shuffle&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.nodemanager.aux-services.mapreduce.shuffle.class&lt;\/name&gt;<br \/>\n&lt;value&gt;org.apache.hadoop.mapred.ShuffleHandler&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:8032&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.scheduler.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:8030&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.resource-tracker.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:8031&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.admin.address&lt;\/name&gt;<br \/>\n&lt;value&gt;master:8033&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;property&gt;<br \/>\n&lt;name&gt;yarn.resourcemanager.webapp.address&lt;\/name&gt;<br 
\/>\n&lt;value&gt;master:8088&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n&lt;\/configuration&gt;<br \/>\n4.4. Create the necessary folders<\/p>\n<p>mkdir -p $HADOOP_HOME\/tmp<br \/>\nmkdir -p $HADOOP_HOME\/hdfs\/name<br \/>\nmkdir -p $HADOOP_HOME\/hdfs\/data<br \/>\n4.5. Copy the Hadoop folder and environment settings to the slaves<\/p>\n<p>scp ~\/.bashrc slave1:~\/<br \/>\nscp ~\/.bashrc slave2:~\/<\/p>\n<p>scp -r ~\/hadoop-2.7.2 slave1:~\/<br \/>\nscp -r ~\/hadoop-2.7.2 slave2:~\/<br \/>\n5. Launch the Hadoop cluster services<\/p>\n<p>Format the NameNode before the first launch<\/p>\n<p>hdfs namenode -format<br \/>\nStart the HDFS distributed file system<br \/>\nstart-dfs.sh<\/p>\n<p>[hadoop@master ~]$ start-dfs.sh<br \/>\n17\/04\/12 23:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform&#8230; using builtin-java classes where applicable<br \/>\nStarting namenodes on [master]<br \/>\nThe authenticity of host &#8216;master (192.168.1.80)&#8217; can&#8217;t be established.<br \/>\nECDSA key fingerprint is d8:55:ea:50:b3:bb:8a:fc:90:a2:0e:54:3e:79:60:bc.<br \/>\nAre you sure you want to continue connecting (yes\/no)? 
yes<br \/>\nmaster: Warning: Permanently added &#8216;master,192.168.1.80&#8217; (ECDSA) to the list of known hosts.<br \/>\nhadoop@master&#8217;s password:<br \/>\nmaster: starting namenode, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/hadoop-hadoop-namenode-master.rmohan.com.out<br \/>\nslave2: starting datanode, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/hadoop-hadoop-datanode-lab2.rmohan.com.out<br \/>\nslave1: starting datanode, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/hadoop-hadoop-datanode-lab1.rmohan.com.out<br \/>\nStarting secondary namenodes [slave1]<br \/>\nslave1: starting secondarynamenode, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/hadoop-hadoop-secondarynamenode-lab1.rmohan.com.out<br \/>\n17\/04\/12 23:27:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform&#8230; using builtin-java classes where applicable<\/p>\n<p>Start the YARN distributed computing system<br \/>\nstart-yarn.sh<\/p>\n<p>[hadoop@master ~]$ start-yarn.sh<br \/>\nstarting yarn daemons<br \/>\nstarting resourcemanager, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/yarn-hadoop-resourcemanager-master.rmohan.com.out<br \/>\nslave2: starting nodemanager, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/yarn-hadoop-nodemanager-lab2.rmohan.com.out<br \/>\nslave1: starting nodemanager, logging to \/home\/hadoop\/hadoop-2.7.2\/logs\/yarn-hadoop-nodemanager-lab1.rmohan.com.out<br \/>\n[hadoop@master ~]$<\/p>\n<p>Shut down the Hadoop cluster<\/p>\n<p>stop-yarn.sh<br \/>\nstop-dfs.sh<br \/>\n6. Verify that the Hadoop cluster is up and healthy<\/p>\n<p>6.1. 
Verify with jps<\/p>\n<p>Run jps on each node and check the output.<\/p>\n<p>[hadoop@master ~]$ jps<br \/>\n3184 ResourceManager<br \/>\n3441 Jps<br \/>\n2893 NameNode<br \/>\n[hadoop@master ~]$<\/p>\n<p>On slave1 node:<br \/>\n[hadoop@lab1 ~]$ jps<br \/>\n3026 NodeManager<br \/>\n3127 Jps<br \/>\n2811 DataNode<br \/>\n2907 SecondaryNameNode<br \/>\n[hadoop@lab1 ~]$<\/p>\n<p>On slave2 node:<br \/>\n[hadoop@lab2 ~]$ jps<br \/>\n2722 DataNode<br \/>\n2835 NodeManager<br \/>\n2934 Jps<br \/>\n[hadoop@lab2 ~]$<\/p>\n<p>6.2. Verify on the web interface<\/p>\n<p>Open 192.168.1.80:50070 to view HDFS storage status.<br \/>\nOpen 192.168.1.80:8088 to view YARN cluster resources and application status.<\/p>\n<p>7. End<\/p>\n<p>This covers a basic 3-node Hadoop cluster. High-availability setups and the wider Hadoop ecosystem will be covered in other posts.<\/p>\n<p>Please contact me if you spot any mistakes, have suggestions, or need clarification.<\/p>\n<p><a href=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-001.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6635\" src=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-001.jpg\" alt=\"\" width=\"1853\" height=\"513\" srcset=\"https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001.jpg 1853w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001-300x83.jpg 300w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001-768x213.jpg 768w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001-1024x283.jpg 1024w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001-150x42.jpg 150w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-001-400x111.jpg 400w\" sizes=\"(max-width: 1853px) 100vw, 1853px\" \/><\/a> <a href=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-002.jpg\"><img loading=\"lazy\" decoding=\"async\" 
class=\"aligncenter size-full wp-image-6636\" src=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-002.jpg\" alt=\"\" width=\"1842\" height=\"830\" srcset=\"https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002.jpg 1842w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002-300x135.jpg 300w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002-768x346.jpg 768w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002-1024x461.jpg 1024w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002-150x68.jpg 150w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-002-400x180.jpg 400w\" sizes=\"(max-width: 1842px) 100vw, 1842px\" \/><\/a> <a href=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-003.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6637\" src=\"http:\/\/rmohan.com\/wp-content\/uploads\/2017\/04\/hadoop-003.jpg\" alt=\"\" width=\"1919\" height=\"807\" srcset=\"https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003.jpg 1919w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003-300x126.jpg 300w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003-768x323.jpg 768w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003-1024x431.jpg 1024w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003-150x63.jpg 150w, https:\/\/mohan.sg\/wp-content\/uploads\/2017\/04\/hadoop-003-400x168.jpg 400w\" sizes=\"(max-width: 1919px) 100vw, 1919px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><strong>How do I add a new node to a Hadoop cluster?<\/strong><\/p>\n<p>1. Install a new node at hadoop or copy it from another node<\/p>\n<p>2. Copy the namingode configuration file to the node<\/p>\n<p>3. Modify the masters and slaves files, increase the node, all nodes have to modify<\/p>\n<p>4. Set the ssh password to access the node<\/p>\n<p>5. 
Start the DataNode and NodeManager daemons on the new node (hadoop-daemon.sh start datanode and yarn-daemon.sh start nodemanager)<\/p>\n<p>6. Run start-balancer.sh to rebalance data across the cluster<\/p>\n","protected":false},"excerpt":{"rendered":"<p>CentOS 7.3 under the Hadoop 2.7.2 cluster<\/p>\n<p>how to setup a Hadoop cluster on CentOS linux system. Before you read this article, I assume you already have all basic conceptions about Hadoop and Linux operating system.<\/p>\n<p>mv ifcfg-eno16777736 ifcfg-eth0 vi \/etc\/udev\/rules.d\/90-eno-fix.rules # This file was automatically generated on systemd update SUBSYSTEM==&#8221;net&#8221;, ACTION==&#8221;add&#8221;, DRIVERS==&#8221;?*&#8221;, ATTR{address}==&#8221;00:0c:29:9e:8f:95&#8243;, NAME=&#8221;eno16777736&#8243;<\/p>\n<p> [&#8230;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[73,60],"tags":[],"_links":{"self":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/6634"}],"collection":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6634"}],"version-history":[{"count":6,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/6634\/revisions"}],"predecessor-version":[{"id":6645,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/6634\/revisions\/6645"}],"wp:attachment":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6634"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6634"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mohan.sg\/index.p
hp?rest_route=%2Fwp%2Fv2%2Ftags&post=6634"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}