Problem Statement
Sqoop is a tool for transferring data between the Hadoop ecosystem and an RDBMS. Before learning Sqoop, we first need to set up a learning environment. For convenience, a single-machine deployment is used here.
Sqoop was originally a subproject of Hadoop and was designed to run only on Linux.
Prerequisites
The essential prerequisites for installing Sqoop are:
- A Linux operating system (CentOS 7)
- A Java environment (JDK 1.8)
- A Hadoop environment (Hadoop 3.1.4)
In addition, to explore most of Sqoop's features you will also need:
- ZooKeeper 3.5.9
- HBase 2.2.3
- MySQL 5.7
- Hive 3.1.2
Solution
Prepare the Linux Operating System
Configure hostname mapping
Set up the network:
When installing CentOS 7 in a virtual machine:
- Make the virtual disk reasonably large, e.g. 40 GB or more.
- Do not pre-allocate the disk space.
- Set the network adapter to NAT mode.

Open the Virtual Network Editor

Note the subnet IP, subnet mask, and gateway IP of the virtual network adapter used by NAT mode.
Then open a terminal in the virtual machine and configure a static IP:
$ vi /etc/sysconfig/network-scripts/ifcfg-ens33
NAME="ens33"
TYPE="Ethernet"
DEVICE="ens33"
BROWSER_ONLY="no"
DEFROUTE="yes"
PROXY_METHOD="none"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
IPV6_PRIVACY="no"
UUID="b00b9ac0-60c2-4d34-ab88-2413055463cf"
ONBOOT="yes"
BOOTPROTO="static"
IPADDR="192.168.186.100"
PREFIX="24"
GATEWAY="192.168.186.2"
DNS1="223.5.5.5"
DNS2="8.8.8.8"
The entries that need to be modified are:
- ONBOOT="yes" brings the interface up automatically at boot.
- BOOTPROTO="static" sets a static IP so the address does not change.
- The first three octets of IPADDR must match the subnet IP of the NAT virtual adapter; the fourth octet can be chosen between 0 and 254, but it must not clash with the NAT gateway IP or with any other host on the same network.
- PREFIX=24 is the subnet mask length in bits, which is 255.255.255.0 in decimal notation, so PREFIX=24 can also be replaced with NETMASK="255.255.255.0".
- DNS1 is set to Alibaba's public DNS "223.5.5.5", and DNS2 is set to Google's public DNS "8.8.8.8".
After saving the file, restart the network service:
$ sudo service network restart
Restarting network (via systemctl): [ OK ]
$ sudo service network status
Configured devices:
lo ens33
Currently active devices:
lo ens33
Check the local IP address:
$ ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:b2:5b:75 brd ff:ff:ff:ff:ff:ff
inet 192.168.186.100/24 brd 192.168.186.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fe80::40db:ee8d:77c1:fbf7/64 scope link noprefixroute
valid_lft forever preferred_lft forever
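To confirm the static network configuration works end to end, a quick check is to ping the NAT gateway and one of the configured DNS servers (the addresses below are the ones used in this example; substitute your own if they differ):
$ ping -c 3 192.168.186.2   # NAT gateway
$ ping -c 3 223.5.5.5       # Alibaba public DNS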
Set a static hostname:
# hostnamectl --static set-hostname hadoop100
Configure the mapping between the hostname and the local IP in the hosts file
(comment out the localhost entries):
vi /etc/hosts
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.186.100 hadoop100
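To verify that the hostname and the mapping took effect, the following quick checks should print hadoop100 and resolve it to 192.168.186.100 (open a new shell first if hostnamectl was just run):
$ hostname
$ ping -c 2 hadoop100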
Create a Dedicated Account
It is not recommended to work directly as root, so we create a hadoop account for all of the following operations.
Create the hadoop user:
# useradd hadoop
Set the hadoop user's password:
# passwd hadoop
Grant the hadoop account sudo privileges:
# vi /etc/sudoers
Find the line root ALL=(ALL) ALL and add a new line below it:
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
hadoop  ALL=(ALL)       NOPASSWD:ALL
After saving the change, switch to the hadoop user:
# su hadoop
From now on, whenever a command needs root privileges, prefix it with sudo to elevate privileges without entering the hadoop password, e.g.:
$ sudo ls /root
Unless otherwise noted, all of the following operations are performed as the hadoop user.
Configure Passwordless SSH Login
SSH communication between machines in a Hadoop cluster requires a password by default. When the cluster is running we cannot type a password for every connection, so passwordless SSH must be configured between the machines. Even a single-machine pseudo-distributed Hadoop environment needs passwordless SSH from the local machine to itself. The steps are as follows:
1) Generate an RSA key pair (public and private key) with ssh-keygen:
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:yQYChs4eniVLeICaI2bCB9HopbXUBE9v0lpLjBACUnM hadoop@hadoop100
The key's randomart image is:
+---[RSA 2048]----+
|==O+Eo           |
|=+.O+.=          |
|Bo* o+.B         |
|B%.+ .*o..       |
|OoB . .S         |
| = .             |
|                 |
+----[SHA256]-----+
2) Append the generated public key to the authorized_keys file under ~/.ssh and set the file's permissions to 600:
$ cd ~/.ssh/
$ cat id_rsa.pub >> authorized_keys
$ chmod 600 authorized_keys
The three commands above can be replaced by a single command:
$ ssh-copy-id hadoop100
3) Use ssh to connect to the local machine; if no password is requested, the local passwordless SSH configuration is successful:
$ ssh hadoop@hadoop100
The authenticity of host 'hadoop100 (192.168.186.100)' can't be established.
ECDSA key fingerprint is SHA256:aGLhdt3bIuqtPgrFWnhgrfTKUbDh4CWVTfIgr5E5oV0.
ECDSA key fingerprint is MD5:b8:bd:b3:65:fe:77:2c:06:2d:ec:58:3a:97:51:dd:ca.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoop100,192.168.186.100' (ECDSA) to the list of known hosts.
Last login: Sat Jan 9 10:16:53 2021 from 192.168.186.1
4) Log out:
$ exit
Connection to hadoop100 closed.
Configure Time Synchronization
Communication and file transfer within a cluster generally rely on the system clock. If the clocks of the cluster machines are inconsistent, various problems can occur, such as excessively long access times or even outright failures, so configuring time synchronization between machines is very important. Since our learning environment is a single-machine deployment, however, no time synchronization needs to be configured here.
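For reference only, on a real multi-node CentOS 7 cluster one common option (an assumption about your setup, not required for this single-machine environment) is to run chrony on every node:
$ sudo yum install -y chrony
$ sudo systemctl enable chronyd
$ sudo systemctl start chronyd
$ chronyc tracking   # check the synchronization status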
Standardize the Directory Layout
The planned directory layout is as follows:
/opt/
├── bin      # scripts and commands
├── data     # data used by programs
├── download # downloaded installation packages
├── pkg      # software installed by unpacking archives
└── tmp      # temporary files generated by programs
Create the directories with the hadoop account:
$ sudo mkdir /opt/download
$ sudo mkdir /opt/data
$ sudo mkdir /opt/bin
$ sudo mkdir /opt/tmp
$ sudo mkdir /opt/pkg
For convenience, change the owner and group of the directories under /opt to hadoop:
$ sudo chown hadoop:hadoop /opt/*
$ ls
bin data download pkg tmp
$ ll
total 0
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 bin
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 data
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 download
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:37 pkg
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:36 tmp
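The same layout can also be created and re-owned in one pass; a minimal equivalent sketch (assuming /opt already exists, as it does on a stock CentOS 7 install):
$ sudo mkdir -p /opt/{bin,data,download,pkg,tmp}
$ sudo chown -R hadoop:hadoop /opt/bin /opt/data /opt/download /opt/pkg /opt/tmp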
Install the Java Environment
Check whether a Java JDK is already installed:
$ rpm -qa | grep java
or
$ yum list installed | grep java
If Java is not installed, install it; JDK 1.8+ is recommended.
Download and unpack the Java JDK
Download the JDK 1.8 package from the official website, upload it to /opt/download/, then unpack it and move it to /opt/pkg:
$ tar -zxvf jdk-8u261-linux-x64.tar.gz
$ mv jdk1.8.0_261 /opt/pkg/java
Configure the Java environment variables
Confirm that the current directory is the JDK installation path:
$ pwd
/opt/pkg/java
Edit the configuration file /etc/profile.d/hadoop.env.sh (create it if it does not exist):
$ sudo vim /etc/profile.d/hadoop.env.sh
Add the new environment variables:
# JAVA_HOME
export JAVA_HOME=/opt/pkg/java
PATH=$JAVA_HOME/bin:$PATH
export PATH
Make the new environment variables take effect immediately:
$ source /etc/profile.d/hadoop.env.sh
Verify the environment variables:
$ java -version
$ java
$ javac
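As an optional extra sanity check, confirm that the variable points at the new JDK and that the shell resolves java from it; both commands below should print paths under /opt/pkg/java:
$ echo $JAVA_HOME
$ which java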
Install Hadoop
Download and unpack
Download hadoop-3.1.4.tar.gz from the official website, upload it to the server, unpack it to the target directory, and rename it to match the planned HADOOP_HOME:
$ tar -zxvf hadoop-3.1.4.tar.gz -C /opt/pkg/
$ mv /opt/pkg/hadoop-3.1.4 /opt/pkg/hadoop
Edit the /etc/profile.d/hadoop.env.sh configuration file and add the environment variables:
# HADOOP_HOME
export HADOOP_HOME=/opt/pkg/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Make the new environment variables take effect immediately:
$ source /etc/profile.d/hadoop.env.sh
Verify:
$ hadoop version
Set the execution environment for the Hadoop scripts
Open hadoop/etc/hadoop/hadoop-env.sh under the Hadoop installation directory, find the JAVA_HOME line, and set it to the real JDK path:
# The java implementation to use.
export JAVA_HOME=/opt/pkg/java
Make the same change in hadoop/etc/hadoop/yarn-env.sh:
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java
And in hadoop/etc/hadoop/mapred-env.sh:
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java
Modify the Hadoop Configuration
Go to hadoop/etc/hadoop/ and modify the following configuration files.
1) hadoop/etc/hadoop/core-site.xml – core Hadoop configuration
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/pkg/hadoop/data/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>
Note: replace the hostname with the actual hostname of your machine.
hadoop.tmp.dir is very important: all NameNode and DataNode data of the Hadoop cluster is stored under this directory.
2) hadoop/etc/hadoop/hdfs-site.xml – HDFS configuration
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop100:9868</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
dfs.replication defaults to 3; it is set to 1 here to save virtual machine resources.
In a fully distributed deployment, the SecondaryNameNode and the NameNode should run on different machines.
dfs.namenode.secondary.http-address defaults to the local machine, so it can be omitted in a pseudo-distributed setup.
3) hadoop/etc/hadoop/mapred-site.xml – MapReduce configuration
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop100:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop100:19888</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
</configuration>
The mapreduce.jobhistory settings are optional; they are used to view the history logs of MR jobs.
Be very careful to use the correct hostname here, otherwise job execution will fail and the cause is hard to track down.
The MapReduce JobHistory server must be started manually before the history logs can be opened from the YARN web page.
4) hadoop/etc/hadoop/yarn-site.xml – YARN configuration
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop100</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
5) workers – DataNode node list
$ vi workers
hadoop100
In a single-machine pseudo-distributed environment this file can be left unchanged (the default is localhost), or it can be set to the local hostname.
In a fully distributed setup, list one DataNode hostname per line, as sketched below.
Make sure there are no extra spaces or blank lines around the hostnames, because other scripts read this file to obtain the host list.
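For reference, a fully distributed workers file simply lists every DataNode host, one per line (the hostnames below are hypothetical):
hadoop101
hadoop102
hadoop103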
Format the NameNode
$ hdfs namenode -format
21/01/09 19:27:21 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop100/192.168.186.100
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.3
************************************************************/
21/01/09 19:27:21 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
21/01/09 19:27:21 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-08318e9e-e202-48f3-bcb1-548ca50310c9
21/01/09 19:27:22 INFO util.GSet: Computing capacity for map BlocksMap
21/01/09 19:27:22 INFO util.GSet: VM type = 64-bit
21/01/09 19:27:22 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
21/01/09 19:27:22 INFO util.GSet: capacity = 2^21 = 2097152 entries
21/01/09 19:27:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: defaultReplication = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplication = 512
21/01/09 19:27:22 INFO blockmanagement.BlockManager: minReplication = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
21/01/09 19:27:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
21/01/09 19:27:22 INFO blockmanagement.BlockManager: encryptDataTransfer = false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
21/01/09 19:27:22 INFO namenode.FSNamesystem: fsOwner = hadoop (auth:SIMPLE)
21/01/09 19:27:22 INFO namenode.FSNamesystem: supergroup = supergroup
21/01/09 19:27:22 INFO namenode.FSNamesystem: isPermissionEnabled = false
21/01/09 19:27:22 INFO namenode.FSNamesystem: HA Enabled: false
21/01/09 19:27:22 INFO namenode.FSNamesystem: Append Enabled: true
21/01/09 19:27:23 INFO common.Storage: Storage directory /opt/pkg/hadoop/data/tmp/dfs/name has been successfully formatted.
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop100/192.168.186.100
************************************************************/
Run and Test
Start the Hadoop environment. Right after HDFS starts it remains in safe mode for a few seconds, during which no data can be processed. This is also why it is not recommended to start the DFS and YARN processes all at once with start-all.sh; instead, start DFS first and start the YARN processes about 30 seconds later.
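If you want to confirm that HDFS has left safe mode before starting YARN, the safe mode state can be queried once the DFS processes are up:
$ hdfs dfsadmin -safemode get   # should eventually report: Safe mode is OFF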
1) Start all DFS processes:
$ start-dfs.sh
Starting namenodes on [hadoop100]
hadoop100: starting namenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-namenode-hadoop100.out
hadoop100: starting datanode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-datanode-hadoop100.out
Starting secondary namenodes [hadoop100]
hadoop100: starting secondarynamenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop100.out
2) Start all YARN processes:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-resourcemanager-hadoop100.out
hadoop100: starting nodemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-nodemanager-hadoop100.out
Start the MapReduce JobHistory server, which is used to view the history logs of completed MR jobs:
$ mr-jobhistory-daemon.sh start historyserver
3) Check that the expected processes have started
Run the jps command and check for the following processes:
$ jps
14608 NodeManager
14361 SecondaryNameNode
14203 DataNode
14510 ResourceManager
14079 NameNode
4) Managing individual daemons
# On the master node, start the HDFS NameNode:
hdfs --daemon start namenode
# On the master node, start the HDFS SecondaryNameNode:
hdfs --daemon start secondarynamenode
# On each worker node, start the HDFS DataNode:
hdfs --daemon start datanode
# On the master node, start the YARN ResourceManager:
yarn --daemon start resourcemanager
# On each worker node, start the YARN NodeManager:
yarn --daemon start nodemanager
The start-dfs.sh and start-yarn.sh scripts are located under $HADOOP_HOME/sbin/, while the hdfs and yarn launchers are under $HADOOP_HOME/bin/. To stop a daemon on a node, simply replace start with stop in the command.
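For example, to stop just the DataNode on the local node and confirm it is gone (a quick check, assuming jps from the JDK is on the PATH):
$ hdfs --daemon stop datanode
$ jps | grep -i datanode   # should print nothing once the daemon has stopped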
Verify via the Web UIs
Visit http://hadoop100:9870 to check the HDFS status

Visit http://hadoop100:8088 to check the YARN status

Test the Hadoop Cluster
Use the bundled example program to test the Hadoop cluster.
Start the DFS and YARN processes, then locate the example jar:
$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount
Usage: wordcount <in> [<in>...] <out>
Prepare an input file and upload it to HDFS:
$ cat /opt/data/mapred/input/wc.txt
hadoop hadoop hadoop
hi hi hi hello hadoop
hello world hadoop
$ hadoop fs -mkdir -p /input/wc
$ hadoop fs -put /opt/data/mapred/input/wc.txt /input/wc/
$ hadoop fs -ls /input/wc/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 62 2021-01-09 20:15 /input/wc/wc.txt
$ hadoop fs -cat /input/wc/wc.txt
hadoop hadoop hadoop
hi hi hi hello hadoop
hello world hadoop
Run the bundled wordcount example and write the results to /output/wc:
$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount /input/wc/ /output/wc/
Console output:
2020-01-23 18:38:45,914 INFO client.RMProxy: Connecting to ResourceManager at hadoop100/192.168.186.100:8032
2020-01-23 18:38:47,204 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1642908422458_0001
2020-01-23 18:38:47,988 INFO input.FileInputFormat: Total input files to process : 1
2020-01-23 18:38:49,033 INFO mapreduce.JobSubmitter: number of splits:1
2020-01-23 18:38:49,788 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1642908422458_0001
2020-01-23 18:38:49,790 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-01-23 18:38:50,108 INFO conf.Configuration: resource-types.xml not found
2020-01-23 18:38:50,108 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-01-23 18:38:50,740 INFO impl.YarnClientImpl: Submitted application application_1642908422458_0001
2020-01-23 18:38:50,796 INFO mapreduce.Job: The url to track the job: http://hadoop100:8088/proxy/application_1642908422458_0001/
2020-01-23 18:38:50,797 INFO mapreduce.Job: Running job: job_1642908422458_0001
2020-01-23 18:39:09,424 INFO mapreduce.Job: Job job_1642908422458_0001 running in uber mode : false
2020-01-23 18:39:09,425 INFO mapreduce.Job: map 0% reduce 0%
2020-01-23 18:39:19,633 INFO mapreduce.Job: map 100% reduce 0%
2020-01-23 18:39:28,781 INFO mapreduce.Job: map 100% reduce 100%
2020-01-23 18:39:30,812 INFO mapreduce.Job: Job job_1642908422458_0001 completed successfully
2020-01-23 18:39:30,973 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=52
FILE: Number of bytes written=444077
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=165
HDFS: Number of bytes written=30
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8195
Total time spent by all reduces in occupied slots (ms)=6333
Total time spent by all map tasks (ms)=8195
Total time spent by all reduce tasks (ms)=6333
Total vcore-milliseconds taken by all map tasks=8195
Total vcore-milliseconds taken by all reduce tasks=6333
Total megabyte-milliseconds taken by all map tasks=8391680
Total megabyte-milliseconds taken by all reduce tasks=6484992
Map-Reduce Framework
Map input records=4
Map output records=11
Map output bytes=106
Map output materialized bytes=52
Input split bytes=102
Combine input records=11
Combine output records=4
Reduce input groups=4
Reduce shuffle bytes=52
Reduce input records=4
Reduce output records=4
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=235
CPU time spent (ms)=3340
Physical memory (bytes) snapshot=366059520
Virtual memory (bytes) snapshot=5470892032
Total committed heap usage (bytes)=291639296
Peak Map Physical memory (bytes)=233541632
Peak Map Virtual memory (bytes)=2732072960
Peak Reduce Physical memory (bytes)=132517888
Peak Reduce Virtual memory (bytes)=2738819072
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=63
File Output Format Counters
Bytes Written=30
Note that the input is a directory (more than one can be specified), and the output must be a directory path that does not yet exist.
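If you need to rerun the job with the same output path, the existing output directory has to be removed first; one way to do that (drop -skipTrash if you prefer to keep the data in the trash):
$ hadoop fs -rm -r -skipTrash /output/wc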
View the results saved on HDFS:
$ hadoop fs -ls /output/wc/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2021-01-09 20:23 /output/wc/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 30 2021-01-09 20:23 /output/wc/part-r-00000
$ hadoop fs -cat /output/wc/part-r-00000
hadoop 5
hello 2
hi 3
world 1
While the MR job is running, its progress can be followed in the YARN web UI:

After the job finishes, click History in the Tracking UI column of the MapReduce application to view its history logs

If that page returns a 404, the JobHistoryServer service has not been started.
The results can also be viewed in the HDFS web UI.
Location: Utilities > Browse the file system > /output/wc/ > part-r-00000

Shut Down the Cluster
$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [hadoop100]
hadoop100: stopping namenode
hadoop100: stopping datanode
Stopping secondary namenodes [hadoop100]
hadoop100: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
hadoop100: stopping nodemanager
no proxyserver to stop
Install HBase
HBase ships with a built-in ZooKeeper, but for better control it is recommended to use a separately installed ZooKeeper.
Install ZooKeeper
Download apache-zookeeper-3.5.9-bin.tar.gz from the official website and install it to /opt/pkg/zookeeper in standalone mode. The configuration file $ZOOKEEPER_HOME/conf/zoo.cfg:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/tmp/zookeeper/data
dataLogDir=/opt/tmp/zookeeper/dataLog
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
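Since zoo.cfg points dataDir and dataLogDir at /opt/tmp/zookeeper, create those directories before the first start, then start ZooKeeper and check that it is serving (this assumes $ZOOKEEPER_HOME/bin is on the PATH, as the later zkServer.sh commands in this guide also assume):
$ mkdir -p /opt/tmp/zookeeper/data /opt/tmp/zookeeper/dataLog
$ zkServer.sh start
$ zkServer.sh status   # should report Mode: standalone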
Download the HBase package
Download URL: https://www.apache.org/dyn/closer.lua/hbase/2.2.3/hbase-2.2.3-bin.tar.gz
Upload the package to /opt/download on the hadoop100 server and unpack it:
$ cd /opt/download
$ tar -zxvf hbase-2.2.3-bin.tar.gz -C /opt/pkg/
Configure HBase
Modify the HBase configuration file hbase-env.sh:
$ cd /opt/pkg/hbase-2.2.3/conf
$ vim hbase-env.sh
Change the following two entries to the values below:
export JAVA_HOME=/opt/pkg/java
export HBASE_MANAGES_ZK=false
Modify hbase-site.xml:
$ vim hbase-site.xml
The content is as follows:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop100:8020/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop100:2181</value>
    </property>
    <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
    </property>
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>
Modify the regionservers configuration file to specify the hostnames of the HBase worker nodes:
$ vim regionservers
hadoop100
Add the HBase environment variables:
export HBASE_HOME=/opt/pkg/hbase-2.2.3
export PATH=$PATH:$HBASE_HOME/bin
Re-source /etc/profile so the environment variables take effect:
$ source /etc/profile
Starting and stopping HBase:
HDFS and ZooKeeper must already be running before HBase is started.
If HDFS is not running, start it with:
$ start-dfs.sh
If ZooKeeper is not running, start it with:
$ zkServer.sh start conf/zoo.cfg
Run the following command to start the HBase cluster:
$ start-hbase.sh
After it starts, use jps to check that the HBase processes are running:
$ jps
7601 QuorumPeerMain
1670 NameNode
1975 SecondaryNameNode
6695 HMaster
6841 HRegionServer
1787 DataNode
2491 JobHistoryServer
2236 ResourceManager
6958 Jps
2351 NodeManager
Access HBase with the bundled shell client:
$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3, r6a830d87542b766bd3dc4cfdee28655f62de3974, Fri Jan 10 18:27:51 CST 2020
Took 0.0025 seconds
hbase(main):001:0> status
Access the HBase web UI:
Open http://hadoop100:16010 in a browser

Command to stop the HBase processes:
$ stop-hbase.sh
Install MySQL
Download the rpm-bundle package
$ wget http://mirrors.163.com/mysql/Downloads/MySQL-5.7/mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ tar xvf mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ mkdir mysql-jars
$ mv mysql-comm*.rpm mysql-jars/
$ cd mysql-jars/
$ ls
mysql-community-client-5.7.33-1.el7.x86_64.rpm
mysql-community-common-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
mysql-community-server-5.7.33-1.el7.x86_64.rpm
Install the packages manually, one by one:
sudo rpm -ivh mysql-community-common-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-client-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-server-5.7.33-1.el7.x86_64.rpm
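On a stock CentOS 7 image the preinstalled mariadb-libs package can conflict with mysql-community-libs. If the rpm commands above fail with a conflict error, one common workaround (an assumption about your image, not always needed) is to remove it first and then rerun the installs:
$ rpm -qa | grep mariadb
$ sudo rpm -e --nodeps mariadb-libs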
Start the MySQL Service
Start the MySQL service:
$ sudo systemctl start mysqld.service
or
$ sudo service mysqld start
Enable it at boot:
$ sudo systemctl enable mysqld.service
Check the running status:
$ systemctl status mysqld.service
mysql 2574 1 1 23:49 ? 00:00:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
Check the process and connection information:
$ sudo netstat -anpl | grep mysql
tcp6 0 0 :::3306 :::* LISTEN 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45824 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45818 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45838 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45808 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45816 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45820 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45826 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45822 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45814 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45846 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45828 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45832 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45842 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45830 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45844 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45812 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45836 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45834 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45840 ESTABLISHED 946/mysqld
tcp6 0 0 192.168.186.100:3306 192.168.186.100:45810 ESTABLISHED 946/mysqld
unix 2 [ ACC ] STREAM LISTENING 20523 946/mysqld /var/lib/mysql/mysql.sock
Check the port information:
You can see that the mysqld process of the MySQL server listens on the default port 3306:
$ sudo netstat -anpl | grep tcp
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1412/sshd
tcp 0 52 192.168.186.103:22 192.168.186.1:54058 ESTABLISHED 2287/sshd: hadoop
tcp6 0 0 :::3306 :::* LISTEN 2574/mysqld
tcp6 0 0 192.168.186.103:3888 :::* LISTEN 2060/java
tcp6 0 0 :::22 :::* LISTEN 1412/sshd
tcp6 0 0 :::37791 :::* LISTEN 2060/java
tcp6 0 0 :::2181 :::* LISTEN 2060/java
Change the Password
Find the temporary password:
The first login to MySQL requires root's temporary password, which is randomly generated during installation and written to the MySQL server log:
$ grep "temporary password" /var/log/mysqld.log
2020-02-26T17:05:45.104999Z 1 [Note] A temporary password is generated for root@localhost: bl/!6qaU.wuX
$ mysql -u root -p
Enter the temporary password to log in to the MySQL client.
Relax the password policy:
After the first login, MySQL will soon prompt you to change the default password:
mysql> show databases;
ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.
mysql> alter user 'root'@'localhost' identified by 'niit1234';
ERROR 1819 (HY000): Your password does not satisfy the current policy requirements
Under the default password policy, the new password must contain uppercase and lowercase letters, digits, and special characters. If security is not a concern, the policy can be relaxed:
mysql> set global validate_password_policy=0;
mysql> set global validate_password_length=1;
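To confirm the relaxed settings took effect, the current values can be inspected with the following statement (the exact list of variables depends on the server version):
mysql> SHOW VARIABLES LIKE 'validate_password%';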
Set the new root password and enable remote access:
For security reasons, remote access to MySQL is disabled by default. Here we extend root's access so that connections are allowed from any IP:
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'niit1234' WITH GRANT OPTION;
The change to the remote access permissions does not take effect until the privileges are flushed:
mysql> FLUSH PRIVILEGES;
Change the Default Character Set
Edit the MySQL configuration file:
$ vi /etc/my.cnf
Append the following at the end of the file:
# default character set for server-internal operations
character-set-server=utf8mb4
# default collation for server-internal operations
collation-server=utf8mb4_general_ci
# default storage engine
default-storage-engine=InnoDB
# set the following character sets when a connection is initialized:
# character_set_client
# character_set_results
# character_set_connection
init_connect='set names utf8mb4'
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
Restart the MySQL service:
$ sudo service mysqld restart
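After the restart, the new defaults can be verified from the client; the utf8mb4 values should show up for most of the character_set variables:
$ mysql -uroot -pniit1234 -e "SHOW VARIABLES LIKE 'character_set%';"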
Install Hive
Download the package
Download the Hive package from the official website: apache-hive-3.1.2-bin.tar.gz
Planned installation directory: /opt/pkg/hive
Upload the package to the hadoop100 server
Unpack to the installation path
Unpack the package into the planned directory /opt/pkg/:
$ cd /opt/download/
$ tar -xzvf apache-hive-3.1.2-bin.tar.gz -C /opt/pkg/
Modify the configuration files
Go to the Hive installation directory:
$ cd /opt/pkg
Rename the Hive directory:
$ mv apache-hive-3.1.2-bin/ hive
Create hive-site.xml in the /opt/pkg/hive/conf directory (the file does not exist by default, so it must be created manually):
$ cd /opt/pkg/hive/conf/
$ vim hive-site.xml
Enter edit mode; the file content is as follows:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>niit1234</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop100</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
</configuration>
Create the Hive log directory:
$ mkdir /opt/pkg/hive/logs/
Rename the log configuration template to hive-log4j2.properties:
$ pwd
/opt/pkg/hive/conf
$ mv hive-log4j2.properties.template hive-log4j2.properties
$ vim hive-log4j2.properties # edit the file
Change the value of the hive.log.dir property in this file:
# change the following entry to set the directory where Hive log files are stored, which makes troubleshooting easier
hive.log.dir=/opt/pkg/hive/logs/
Copy the MySQL driver jar
Hive reads and writes its metadata in the MySQL database at runtime, so the MySQL driver jar must be placed in Hive's lib directory.
Upload the MySQL driver jar, e.g. mysql-connector-java-5.1.38.jar, to /opt/download/, then copy it:
$ cp mysql-connector-java-5.1.38.jar /opt/pkg/hive/lib/
Resolve the logging jar conflict
# go to the lib directory
$ cd /opt/pkg/hive/lib/
# rename the jar (or simply delete it)
$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar.bak
Configure the Hive environment variables:
# HIVE
export HIVE_HOME=/opt/pkg/hive
export PATH=$PATH:$HIVE_HOME/bin
# HCATLOG
export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$PATH:$HCAT_HOME/bin
Afterwards, remember to source the file so the environment variable changes take effect.
Initialize the metastore database
Open a MySQL client connection (user root, password niit1234) and create the Hive metastore database; the database name must match the one configured in hive-site.xml:
$ mysql -uroot -pniit1234
create database metastore;
show databases;
Exit MySQL:
exit;
Initialize the metastore schema:
$ schematool -initSchema -dbType mysql -verbose
The message schemaTool completed indicates that the initialization succeeded.
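Optionally, you can double-check from MySQL that the schema tables were created (the exact table count depends on the Hive version):
$ mysql -uroot -pniit1234 -e "USE metastore; SHOW TABLES;"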
Start the Hive Services
Prerequisite: the Hadoop cluster and the MySQL service are both running. Then execute:
nohup hive --service metastore >/tmp/metastore.log 2>&1 &
nohup hive --service hiveserver2 >/tmp/hiveServer2.log 2>&1 &
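It can take a little while for HiveServer2 to come up; before connecting with beeline it is worth checking that the background processes are present and that port 10000 is listening:
$ jps | grep RunJar
$ sudo netstat -anp | grep 10000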
Verify the Hive Installation
From any directory on hadoop100, start Hive's command-line client beeline:
$ beeline
Beeline version 3.1.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hadoop
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (1.93 seconds)
When prompted for the password you can simply press Enter. If you see output like the above, Hive is installed successfully.
Exit the client:
0: jdbc:hive2://localhost:10000> !quit
Closing: 0: jdbc:hive2://localhost:10000
Troubleshooting beeline connection failures
If hiveserver2 is running normally on port 10000 of the local machine but beeline fails to connect with WARN jdbc.HiveConnection: Failed to connect to localhost:10000,
the most likely cause is Hadoop's proxy-user (impersonation) mechanism, which does not allow an upper-layer system to act as the real user directly.
Solution: add the following to Hadoop's core configuration file core-site.xml:
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
The XXX in hadoop.proxyuser.XXX is the proxy user name; here it is set to hadoop.
Then restart Hadoop, otherwise the change does not take effect:
$ stop-all.sh
$ start-dfs.sh
$ start-yarn.sh
After that, the following command should connect through beeline normally:
$ beeline -u jdbc:hive2://localhost:10000
Install Sqoop
Download and install Sqoop
Download URL: http://archive.apache.org/dist/sqoop/1.4.7/
$ wget http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Unpack sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the target directory:
$ tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/pkg/
Rename the Sqoop installation directory:
$ cd /opt/pkg/
$ mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop
Configure the Sqoop environment variables:
$ vi ~/.bash_profile
#sqoop
export SQOOP_HOME=/opt/pkg/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
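Reload the profile so the new variables take effect in the current shell:
$ source ~/.bash_profile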
Add the environment variables that Sqoop depends on by editing sqoop-env.sh:
$ mv conf/sqoop-env-template.sh conf/sqoop-env.sh
$ vi conf/sqoop-env.sh
# Set Hadoop-specific environment variables here.
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/pkg/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/pkg/hadoop
#set the path to where bin/hbase is available
export HBASE_HOME=/opt/pkg/hbase-2.2.3
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/pkg/hive
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/pkg/zookeeper/conf
Modify sqoop-site.xml as shown below:
<configuration>
    <property>
        <name>sqoop.metastore.client.enable.autoconnect</name>
        <value>true</value>
        <description>If true, Sqoop will connect to a local metastore
            for job management when no other metastore arguments are
            provided.</description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.url</name>
        <value>jdbc:hsqldb:file:/tmp/sqoop-meta/meta.db;shutdown=true</value>
        <description>The connect string to use when connecting to a
            job-management metastore. If unspecified, uses ~/.sqoop/.
            You can specify a different path here.</description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.username</name>
        <value>SA</value>
        <description>The username to bind to the metastore.</description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.password</name>
        <value></value>
        <description>The password to bind to the metastore.</description>
    </property>
    <property>
        <name>sqoop.metastore.client.record.password</name>
        <value>true</value>
        <description>If true, allow saved passwords in the metastore.</description>
    </property>
    <property>
        <name>sqoop.metastore.server.location</name>
        <value>/tmp/sqoop-metastore/shared.db</value>
        <description>Path to the shared metastore database files.
            If this is not set, it will be placed in ~/.sqoop/.</description>
    </property>
    <property>
        <name>sqoop.metastore.server.port</name>
        <value>16000</value>
        <description>Port that this metastore should listen on.</description>
    </property>
</configuration>
Modify configure-sqoop:
$ vi bin/configure-sqoop
If Accumulo is not installed, comment out the logic that checks ACCUMULO_HOME:
#if [ -z "${ACCUMULO_HOME}" ]; then
#  if [ -d "/usr/lib/accumulo" ]; then
#    ACCUMULO_HOME=/usr/lib/accumulo
#  else
#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
#  fi
#fi
#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
This avoids warnings like the following at run time:
Warning: /opt/pkg/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Upload the MySQL driver (for MySQL 5.7 the driver should be a 5.x version) to the lib directory of the Sqoop installation:
$ cp mysql-connector-java-5.1.44-bin.jar /opt/pkg/sqoop/lib/
Copy or symlink $HIVE_HOME/lib/hive-common-3.1.2.jar into $SQOOP_HOME/lib:
$ ln -s /opt/pkg/hive/lib/hive-common-3.1.2.jar /opt/pkg/sqoop/lib
If you need to parse JSON, download java-json.jar and place it in Sqoop's lib directory.
Download URL: http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm
$ cp java-json.jar /opt/pkg/sqoop/lib/
If you need Avro serialization, copy or symlink Hadoop's Avro jar into Sqoop's lib directory:
$ ln -s /opt/pkg/hadoop/share/hadoop/common/lib/avro-1.7.7.jar /opt/pkg/sqoop/lib/
Exercises
Complete the following exercises:
Exercise 1:
After setting up the Sqoop learning environment, cd into the Sqoop installation directory, run the following command, and observe the output:
$ sqoop version
2020-01-23 23:43:20,287 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
If you see similar output, the basic Sqoop installation is complete.
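As a further end-to-end check, Sqoop can be asked to list the databases on the MySQL server configured earlier (this assumes MySQL is running and uses the root/niit1234 credentials from this guide; adjust them for your own setup):
$ sqoop list-databases --connect jdbc:mysql://hadoop100:3306/ --username root --password niit1234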