Prerequisites
- Three virtual machines
- Passwordless SSH login configured between them
- Java installed
- Hadoop distribution tarball downloaded
- The tools helper scripts downloaded
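The tools directory provides runRemoteCmd.sh, which this guide uses to run the same command on every node over SSH. The actual script is not reproduced here; a minimal sketch of such a helper (the hostnames and the "all" group are inferred from the output shown later) might be:
#!/bin/bash
# runRemoteCmd.sh "<command>" <group>
# Runs <command> over ssh on every host in <group>; this sketch only
# implements the "all" group used throughout this guide.
cmd=$1
group=$2

hosts="NN01.HadoopVM DN01.HadoopVM DN02.HadoopVM"   # members of the "all" group

for host in $hosts; do
    echo "**************${host}************"
    ssh "$host" "$cmd"
done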
Configuration
Directory layout:
Directory | Purpose |
--- | --- |
hdfs | HDFS-related files; typically holds the namenode, datanode, and log files |
logs | Daemon startup logs |
tmp | HDFS temporary files |
hdfs/name | NameNode metadata |
hdfs/logs | HDFS logs |
hdfs/data | DataNode block data |
All of the following runs as the hadoop user, with version 2.7.3 as the example:
1 Create the directories above
[hadoop@NN01 hadoop]$ cd ~/hadoop-2.7.3/
[hadoop@NN01 hadoop-2.7.3]$ ls
bin etc include lib libexec LICENSE.txt NOTICE.txt README.txt sbin share
[hadoop@NN01 hadoop-2.7.3]$ ~/tools/runRemoteCmd.sh "mkdir ~/hadoop-2.7.3/hdfs ~/hadoop-2.7.3/logs ~/hadoop-2.7.3/tmp" all
**************NN01.HadoopVM************
**************DN01.HadoopVM************
**************DN02.HadoopVM************
[hadoop@NN01 hadoop-2.7.3]$ ~/tools/runRemoteCmd.sh "mkdir -p ~/hadoop-2.7.3/hdfs/data hadoop-2.7.3/hdfs/logs hadoop-2.7.3/hdfs/name" all
**************NN01.HadoopVM************
**************DN01.HadoopVM************
**************DN02.HadoopVM************
[hadoop@NN01 hadoop-2.7.3]$ ls
bin etc hdfs include lib libexec LICENSE.txt logs NOTICE.txt README.txt sbin share tmp
[hadoop@NN01 hadoop-2.7.3]$ ls hdfs
data logs name
2 Edit the configuration files
Hadoop’s Java configuration is driven by two types of important configuration files:
- Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
- Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.
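Later, once the environment in section 2.7 is in place, hdfs getconf shows which value actually wins for a key (site override vs. built-in default); for example, with the core-site.xml below:
[hadoop@NN01 hadoop]$ hdfs getconf -confKey fs.defaultFS
hdfs://NN01.HadoopVM:9000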
Note: before editing the configuration, symlink hadoop-2.7.3 to hadoop on every node:
[hadoop@NN01 ~]$ ./tools/runRemoteCmd.sh "ln -s hadoop-2.7.3 hadoop" all
2.1 Edit etc/hadoop/core-site.xml
Parameter | Value | Notes |
--- | --- | --- |
fs.defaultFS | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | 131072 | Size of read/write buffer used in SequenceFiles. |
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://NN01.HadoopVM:9000</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///home/hadoop/hadoop/tmp</value>
</property>
<property>
<name>ds.default.name</name>
<value>hdfs://NN01.HadoopVM:54310</value>
<final>true</final>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>NN01.HadoopVM:2181,DN01.HadoopVM:2181,DN02.HadoopVM:2181</value>
</property>
</configuration>
2.2 Edit etc/hadoop/hdfs-site.xml
Configurations for NameNode:
Parameter | Value | Notes |
--- | --- | --- |
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
dfs.hosts / dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable datanodes. |
dfs.blocksize | 268435456 | HDFS blocksize of 256MB for large file-systems. |
dfs.namenode.handler.count | 100 | More NameNode server threads to handle RPCs from large number of DataNodes. |
Configurations for DataNode:
Parameter | Value | Notes |
--- | --- | --- |
dfs.datanode.data.dir | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
2.3 Edit etc/hadoop/yarn-site.xml
Configurations for ResourceManager and NodeManager:
Parameter | Value | Notes |
--- | --- | --- |
yarn.acl.enable | true / false | Enable ACLs? Defaults to false. |
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just space means no one has access. |
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation. |
Configurations for ResourceManager:
Parameter | Value | Notes |
--- | --- | --- |
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.hostname | ResourceManager host. | host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |
Configurations for NodeManager:
Parameter | Value | Notes |
--- | --- | --- |
yarn.nodemanager.resource.memory-mb | Resource, i.e. available physical memory in MB, for a given NodeManager | Defines total available resources on the NodeManager to be made available to running containers |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |
<!-- Site specific YARN configuration properties -->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>NN01.HadoopVM</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>NN01.HadoopVM:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>NN01.HadoopVM:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>NN01.HadoopVM:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>NN01.HadoopVM:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>NN01.HadoopVM:8088</value>
</property>
</configuration>
2.4 Edit etc/hadoop/mapred-site.xml
Configurations for MapReduce Applications:
Parameter | Value | Notes |
--- | --- | --- |
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory-limit while sorting data for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
Configurations for MapReduce JobHistory Server:
Parameter | Value | Notes |
--- | --- | --- |
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
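Note that the 2.7.3 distribution ships only a template for this file, so create it before editing:
[hadoop@NN01 hadoop]$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml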
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>NN01.HadoopVM:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>NN01.HadoopVM:19888</value>
</property>
</configuration>
2.5 Edit masters and slaves
[hadoop@NN01 hadoop]$ cat masters
NN01.HadoopVM
[hadoop@NN01 hadoop]$ cat slaves
NN01.HadoopVM
DN01.HadoopVM
DN02.HadoopVM
2.6 Add execute permission (if the scripts lost their execute bits during transfer)
[hadoop@NN01 hadoop]$ chmod +x sbin/*.sh
2.7 Environment variables
Append the following to the hadoop user's shell profile (e.g. ~/.bashrc) on every node:
export JAVA_HOME=/opt/tool/jdk1.8
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_LOG_DIR=/home/hadoop/hadoop/logs
export YARN_LOG_DIR=$HADOOP_LOG_DIR
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
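After reloading the profile, a quick check that the variables took effect (this assumes the hadoop symlink from step 2 is already in place):
[hadoop@NN01 ~]$ source ~/.bashrc
[hadoop@NN01 ~]$ hadoop version | head -1
Hadoop 2.7.3
[hadoop@NN01 ~]$ echo $HADOOP_CONF_DIR
/home/hadoop/hadoop/etc/hadoop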
2.8 Edit yarn-env.sh
export JAVA_HOME=/opt/tool/jdk1.8
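2.9 Distribute the configuration
Every node must see identical files under etc/hadoop. If the edits were made only on NN01, push them out before starting anything; a simple sketch using scp (paths per the layout above):
[hadoop@NN01 ~]$ scp ~/hadoop/etc/hadoop/*-site.xml ~/hadoop/etc/hadoop/*-env.sh ~/hadoop/etc/hadoop/masters ~/hadoop/etc/hadoop/slaves DN01.HadoopVM:~/hadoop/etc/hadoop/
[hadoop@NN01 ~]$ scp ~/hadoop/etc/hadoop/*-site.xml ~/hadoop/etc/hadoop/*-env.sh ~/hadoop/etc/hadoop/masters ~/hadoop/etc/hadoop/slaves DN02.HadoopVM:~/hadoop/etc/hadoop/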
3 Hadoop Startup
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
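On this cluster the format is run on NN01 as the hadoop user (the cluster name is arbitrary; HadoopVM is used here). After a successful format, the name directory should contain a fresh image:
[hadoop@NN01 ~]$ hdfs namenode -format HadoopVM
[hadoop@NN01 ~]$ ls ~/hadoop/hdfs/name/current
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION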
Start HDFS and YARN
[hadoop@NN01 sbin]$ ./start-dfs.sh
Starting namenodes on [NN01.HadoopVM]
NN01.HadoopVM: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-NN01.HadoopVM.out
DN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN01.HadoopVM.out
NN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-NN01.HadoopVM.out
DN02.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN02.HadoopVM.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-NN01.HadoopVM.out
[hadoop@NN01 sbin]$ ./start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-NN01.HadoopVM.out
NN01.HadoopVM: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-NN01.HadoopVM.out
DN01.HadoopVM: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-DN01.HadoopVM.out
DN02.HadoopVM: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-DN02.HadoopVM.out
Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[hadoop@NN01 sbin]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-NN01.HadoopVM.out
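At this point every daemon should show up in jps. The runRemoteCmd.sh helper checks all nodes at once; expect NameNode, ResourceManager, and JobHistoryServer on NN01 (plus DataNode and NodeManager, since NN01 is also listed in slaves), and DataNode and NodeManager on DN01/DN02:
[hadoop@NN01 sbin]$ ~/tools/runRemoteCmd.sh "jps" all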
Web Interfaces
Once the Hadoop cluster is up and running, check the web UI of each component as described below:
Daemon | Web Interface | Notes |
--- | --- | --- |
NameNode | http://nn_host:port/ | Default HTTP port is 50070. |
ResourceManager | http://rm_host:port/ | Default HTTP port is 8088. |
MapReduce JobHistory Server | http://jhs_host:port/ | Default HTTP port is 19888. |
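With all daemons up, a quick smoke test using the examples jar that ships with the 2.7.3 distribution; the job should finish by printing an estimated value of Pi:
[hadoop@NN01 ~]$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 10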
4 Hadoop Shutdown
Stop HDFS and YARN
[hadoop@NN01 sbin]$ ./stop-dfs.sh
Stopping namenodes on [NN01.HadoopVM]
NN01.HadoopVM: stopping namenode
NN01.HadoopVM: stopping datanode
DN01.HadoopVM: stopping datanode
DN02.HadoopVM: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
[hadoop@NN01 sbin]$ ./stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
NN01.HadoopVM: stopping nodemanager
DN01.HadoopVM: stopping nodemanager
DN02.HadoopVM: stopping nodemanager
Stop the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[hadoop@NN01 sbin]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver
stopping historyserver
5 Switching Hadoop versions
Following the same steps as above, install a second Hadoop environment (2.6.4 here) on the same nodes.
Check the current version state
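For example (the version reported will match whatever the symlink points at):
[hadoop@NN01 ~]$ ls -ld ~/hadoop        # the symlink should point at hadoop-2.7.3
[hadoop@NN01 ~]$ hadoop version | head -1
Hadoop 2.7.3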
Switching
That is, simply repoint the hadoop symlink at the 2.6.4 installation:
[hadoop@NN01 ~]$ ./tools/runRemoteCmd.sh "rm ~/hadoop" all
**************NN01.HadoopVM************
**************DN01.HadoopVM************
**************DN02.HadoopVM************
[hadoop@NN01 ~]$ ./tools/runRemoteCmd.sh "ln -s ~/hadoop2.6 ~/hadoop" all
**************NN01.HadoopVM************
**************DN01.HadoopVM************
**************DN02.HadoopVM************
Check the version state again with the same commands as above; the symlink should now point at the 2.6.4 installation.
6 Other
Configuring the SecondaryNameNode
[hadoop@NN01 hadoop]$ tail etc/hadoop/hdfs-site.xml
...
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>DN01.HadoopVM:50090</value>
</property>
[hadoop@NN01 hadoop]$ ./sbin/start-dfs.sh
Starting namenodes on [NN01.HadoopVM]
NN01.HadoopVM: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-NN01.HadoopVM.out
NN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-NN01.HadoopVM.out
DN02.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN02.HadoopVM.out
DN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN01.HadoopVM.out
Starting secondary namenodes [DN01.HadoopVM]
DN01.HadoopVM: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-DN01.HadoopVM.out
[hadoop@DN01 current]$ jps
4504 DataNode
4570 SecondaryNameNode
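By default the SecondaryNameNode checkpoints once per hour (dfs.namenode.checkpoint.period defaults to 3600 seconds). To exercise the recovery procedure below without waiting that long, the period can be lowered in hdfs-site.xml; 120 seconds here is an arbitrary test value:
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>120</value>
</property>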
Failure recovery
The latest checkpoint can be imported to the NameNode if all other copies of the image and the edits files are lost. In order to do that one should:
- Create an empty directory specified in the dfs.namenode.name.dir configuration variable;
- Specify the location of the checkpoint directory in the configuration variable dfs.namenode.checkpoint.dir;
- and start the NameNode with the -importCheckpoint option.
The NameNode will upload the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the NameNode directory(s) set in dfs.namenode.name.dir. The NameNode will fail if a legal image is contained in dfs.namenode.name.dir. The NameNode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but does not modify it in any way.
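For this cluster, preparing an empty name directory on NN01 might look like the following sketch (destructive; the old directory is only moved aside). The checkpoint itself lives on the SecondaryNameNode host (DN01 above), so it must first be copied to the path named by dfs.namenode.checkpoint.dir on NN01:
[hadoop@NN01 ~]$ mv ~/hadoop/hdfs/name ~/hadoop/hdfs/name.broken
[hadoop@NN01 ~]$ mkdir ~/hadoop/hdfs/name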
[hadoop@NN01 hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode -importCheckpoint
starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-NN01.HadoopVM.out
[hadoop@NN01 hdfs]$ jps
11571 Jps
11530 NameNode
[hadoop@NN01 hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
DN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN01.HadoopVM.out
DN02.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-DN02.HadoopVM.out
NN01.HadoopVM: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-NN01.HadoopVM.out
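Once the DataNodes have re-registered, verify that the imported namespace is healthy; the NameNode may stay in safe mode until enough block reports have arrived:
[hadoop@NN01 hdfs]$ hdfs dfsadmin -safemode get
Safe mode is OFF
[hadoop@NN01 hdfs]$ hdfs dfsadmin -report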