Problem Statement
Sqoop is a tool for transferring data between the Hadoop ecosystem and relational databases (RDBMS). Before learning Sqoop, you first need to set up a learning environment. For convenience, this guide uses a single-machine deployment.
Sqoop started out as a Hadoop subproject and was designed to run only on Linux.
Prerequisites
Installing Sqoop requires the following:
- A Linux operating system (CentOS 7)
- A Java environment (JDK 1.8)
- A Hadoop environment (Hadoop 3.1.4)
In addition, to explore most of Sqoop's features you will also need:
- ZooKeeper 3.5.9
- HBase 2.2.3
- MySQL 5.7
- Hive 3.1.2
Solution
Prepare the Linux operating system
Configure the hostname mapping
Set up the network:
When installing CentOS 7 in the virtual machine:
Give it a reasonably large disk, e.g. 40 GB or more.
Do not pre-allocate the disk space.
Set the network adapter to NAT mode.
Open the Virtual Network Editor.
Note the subnet IP, subnet mask, and gateway IP of the virtual NIC used by NAT mode.
Then open a terminal in the virtual machine and configure a static IP:
$ vi /etc/sysconfig/network-scripts/ifcfg-ens33
NAME="ens33"
TYPE="Ethernet"
DEVICE="ens33"
BROWSER_ONLY="no"
DEFROUTE="yes"
PROXY_METHOD="none"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
IPV6_PRIVACY="no"
UUID="b00b9ac0-60c2-4d34-ab88-2413055463cf"
ONBOOT="yes"
BOOTPROTO="static"
IPADDR="192.168.186.100"
PREFIX="24"
GATEWAY="192.168.186.2"
DNS1="223.5.5.5"
DNS2="8.8.8.8"
The entries that need to be changed are:
- ONBOOT="yes" brings the interface up automatically at boot.
- BOOTPROTO="static" uses a static IP so the address does not change.
- The first three octets of IPADDR must match the subnet of the NAT virtual NIC; choose the last octet between 0 and 254, making sure it does not collide with the NAT gateway or with any other host on the same network.
- PREFIX=24 is the subnet mask length in bits, which is 255.255.255.0 in dotted decimal, so PREFIX=24 can also be replaced with NETMASK="255.255.255.0".
- DNS1 is set to Alibaba's public DNS server "223.5.5.5", and DNS2 to Google's public DNS server "8.8.8.8".
After saving the file, restart the network service:
$ sudo service network restart
Restarting network (via systemctl):                        [  OK  ]
$ sudo service network status
Configured devices:
lo ens33
Currently active devices:
lo ens33
Check the machine's IP address:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:b2:5b:75 brd ff:ff:ff:ff:ff:ff
    inet 192.168.186.100/24 brd 192.168.186.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::40db:ee8d:77c1:fbf7/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
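To confirm that the static IP actually works, a quick hedged check is to ping the NAT gateway and an external host (the addresses below are the example values used above):

$ ping -c 3 192.168.186.2     # the NAT gateway from this example
$ ping -c 3 www.baidu.com     # verifies DNS resolution and outbound access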
Set a static hostname:
# hostnamectl --static set-hostname hadoop100
Map the hostname to the machine's IP address in the hosts file
(comment out the localhost entries):
vi /etc/hosts
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.186.100 hadoop100
Create a dedicated account
Working directly as root is not recommended, so create a hadoop account for all of the following operations.
Create the hadoop user:
# useradd hadoop
Set the hadoop user's password:
# passwd hadoop
Grant the hadoop account sudo privileges:
vi /etc/sudoers
Find the line root ALL=(ALL) ALL
and add a line below it:
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
hadoop  ALL=(ALL)       NOPASSWD:ALL
After saving the change, switch to the hadoop user:
# su hadoop
From now on, any command that requires root privileges can be prefixed with sudo, and the hadoop user will not be asked for a password, e.g.:
$ sudo ls /root
Unless stated otherwise, all of the following operations are performed as the hadoop user.
Configure passwordless SSH login
SSH connections between the machines in a Hadoop cluster require a password by default, and it is impractical to enter a password manually for every connection while the cluster is running, so passwordless SSH must be configured between the machines. A single-machine pseudo-distributed Hadoop environment likewise needs passwordless SSH from the local host to itself. The steps are:
- First, generate an RSA key pair (public and private key) with the ssh-keygen command.
  $ ssh-keygen -t rsa
  Generating public/private rsa key pair.
  Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
  Created directory '/home/hadoop/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/hadoop/.ssh/id_rsa.
  Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
  The key fingerprint is:
  SHA256:yQYChs4eniVLeICaI2bCB9HopbXUBE9v0lpLjBACUnM hadoop@hadoop100
  The key's randomart image is:
  +---[RSA 2048]----+
  |==O+Eo           |
  |=+.O+.=          |
  |Bo* o+.B         |
  |B%.+ .*o..       |
  |OoB . .S         |
  | = .             |
  |                 |
  +----[SHA256]-----+
- Append the generated public key to the authorized_keys file under ~/.ssh and set the file's permissions to 600.
  $ cd ~/.ssh/
  $ cat id_rsa.pub >> authorized_keys
  $ chmod 600 authorized_keys
  The three commands above can be replaced with a single command:
  $ ssh-copy-id hadoop100
- Connect to the local host with the ssh command. If no password is required, the local passwordless SSH configuration succeeded.
  $ ssh hadoop@hadoop100
  The authenticity of host 'hadoop100 (192.168.186.100)' can't be established.
  ECDSA key fingerprint is SHA256:aGLhdt3bIuqtPgrFWnhgrfTKUbDh4CWVTfIgr5E5oV0.
  ECDSA key fingerprint is MD5:b8:bd:b3:65:fe:77:2c:06:2d:ec:58:3a:97:51:dd:ca.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'hadoop100,192.168.186.100' (ECDSA) to the list of known hosts.
  Last login: Sat Jan  9 10:16:53 2021 from 192.168.186.1
- Log out:
  $ exit
  Connection to hadoop100 closed.
Configure time synchronization
Communication and file transfers within a cluster generally rely on the system clock. If the clocks of the machines in a cluster drift apart, all kinds of problems can occur, from very slow requests to outright failures, so keeping the machines' time in sync is very important. Since the learning environment here is a single-machine deployment, no time synchronization needs to be configured; a sketch for a multi-node cluster is shown below for reference.
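For a real multi-node cluster, one common approach on CentOS 7 is chrony. This is only a minimal hedged sketch and is not required for the single-machine setup in this guide:

$ sudo yum install -y chrony            # install the chrony time service
$ sudo systemctl enable --now chronyd   # start it and enable it at boot
$ chronyc sources                       # verify that upstream NTP sources are reachable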
Standardize the directory layout
The planned directory layout is:
/opt/
├── bin       # scripts and commands
├── data      # data used by programs
├── download  # downloaded installation packages
├── pkg       # software installed by unpacking archives
└── tmp       # temporary files produced by programs
Create the directories as the hadoop user:
$ sudo mkdir /opt/download
$ sudo mkdir /opt/data
$ sudo mkdir /opt/bin
$ sudo mkdir /opt/tmp
$ sudo mkdir /opt/pkg
For convenience, change the owner and group of the directories under /opt to hadoop:
$ sudo chown hadoop:hadoop /opt/*
$ ls
bin  data  download  pkg  tmp
$ ll
total 0
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 bin
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 data
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:29 download
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:37 pkg
drwxr-xr-x. 2 hadoop hadoop 6 Jan  9 14:36 tmp
Install the Java environment
Check whether a Java JDK is already installed:
$ rpm -qa | grep java
or
$ yum list installed | grep java
If Java is not installed, install it; JDK 1.8 or later is recommended.
Download and extract the Java JDK
Download the JDK 1.8 package from the official website, upload it to /opt/download/, then extract it to /opt/pkg:
$ tar -zxvf jdk-8u261-linux-x64.tar.gz
$ mv jdk1.8.0_261 /opt/pkg/java
Configure the Java environment variables
Confirm that the current directory is the JDK installation path:
$ pwd
/opt/pkg/java
Edit the configuration file /etc/profile.d/hadoop.env.sh (create it if it does not exist):
$ sudo vim /etc/profile.d/hadoop.env.sh
Add the new environment variables:
# JAVA_HOME
export JAVA_HOME=/opt/pkg/java
PATH=$JAVA_HOME/bin:$PATH
export PATH
Make the new environment variables take effect immediately:
$ source /etc/profile.d/hadoop.env.sh
Verify the environment variables:
$ java -version
$ java
$ javac
Install Hadoop
Download and extract
Download hadoop-3.1.4.tar.gz from the official website, upload it to the server, and extract it to the target directory, renaming it so that it matches the paths used below:
$ tar -zxvf hadoop-3.1.4.tar.gz -C /opt/pkg/
$ mv /opt/pkg/hadoop-3.1.4 /opt/pkg/hadoop
Edit the configuration file /etc/profile.d/hadoop.env.sh and add the environment variables:
# HADOOP_HOME
export HADOOP_HOME=/opt/pkg/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Make the new environment variables take effect immediately:
$ source /etc/profile.d/hadoop.env.sh
Verify:
$ hadoop version
Set the Java environment for Hadoop's commands
Open hadoop/etc/hadoop/hadoop-env.sh under the Hadoop installation directory, find the line below, and set JAVA_HOME to the real JDK path:
# The java implementation to use.
export JAVA_HOME=/opt/pkg/java
Make the same change in hadoop/etc/hadoop/yarn-env.sh:
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java
And likewise in hadoop/etc/hadoop/mapred-env.sh:
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java
Edit the Hadoop configuration
Go to hadoop/etc/hadoop/ and modify the following configuration files.
1) hadoop/etc/hadoop/core-site.xml – Hadoop core configuration file
<configuration>
    <!-- NameNode address and port. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>
    <!-- Directory for files generated by HDFS at runtime. -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/pkg/hadoop/data/tmp</value>
    </property>
    <!-- Buffer size; tune it to the server's capacity in production. Default: 4096. -->
    <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
    </property>
    <!-- Enable the HDFS trash so deleted data can be recovered; the value is in minutes. Default: 0. -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>
Note: replace the hostname with your machine's actual hostname.
hadoop.tmp.dir is very important: this directory holds all of the NameNode and DataNode data of the Hadoop cluster.
2) hadoop/etc/hadoop/hdfs-site.xml – HDFS configuration
<configuration>
    <!-- Number of data replicas in HDFS. -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Host and port of the Secondary NameNode. -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop100:9868</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>
    <!-- Whether to check user permissions on HDFS operations. -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
dfs.replication defaults to 3; it is set to 1 here to save virtual machine resources.
In a fully distributed setup, the SecondaryNameNode and NameNode should be deployed on separate machines.
dfs.namenode.secondary.http-address defaults to the local host, so it can be left out in a pseudo-distributed setup.
3) hadoop/etc/hadoop/mapred-site.xml – MapReduce configuration
<configuration>
    <!-- Schedule MapReduce jobs with YARN. -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Host and port of the MapReduce JobHistory server. -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop100:10020</value>
    </property>
    <!-- Web address of the MapReduce JobHistory server. -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop100:19888</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
    </property>
</configuration>
The mapreduce.jobhistory settings are optional; they are used to view the history logs of MapReduce jobs.
Be careful not to get the hostname wrong here, otherwise jobs will fail and the cause is hard to track down.
The MapReduce JobHistory service must be started manually before the history logs can be opened from the YARN web page.
4) Configure yarn-site.xml
<configuration>
    <!-- Hostname of the YARN ResourceManager. -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop100</value>
    </property>
    <!-- How mappers pass data to reducers. -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Enable log aggregation. -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Log retention time (7 days). -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <!-- If vmem/pmem resources are insufficient, jobs fail; disable the resource checks here. -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
5) workers – DataNode node list
$ vi workers
hadoop100
In a single-machine pseudo-distributed environment this file can be left unchanged (the default is localhost), or it can be set to the machine's own hostname.
In a fully distributed setup, enter one DataNode hostname per line.
Make sure there are no stray spaces or blank lines among the DataNode hostnames, because other scripts read the hostnames from this file.
Format the NameNode
$ hdfs namenode -format
21/01/09 19:27:21 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop100/192.168.186.100
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.3
************************************************************/
21/01/09 19:27:21 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
21/01/09 19:27:21 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-08318e9e-e202-48f3-bcb1-548ca50310c9
21/01/09 19:27:22 INFO util.GSet: Computing capacity for map BlocksMap
21/01/09 19:27:22 INFO util.GSet: VM type       = 64-bit
21/01/09 19:27:22 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
21/01/09 19:27:22 INFO util.GSet: capacity      = 2^21 = 2097152 entries
21/01/09 19:27:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: defaultReplication         = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplication             = 512
21/01/09 19:27:22 INFO blockmanagement.BlockManager: minReplication             = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
21/01/09 19:27:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
21/01/09 19:27:22 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
21/01/09 19:27:22 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
21/01/09 19:27:22 INFO namenode.FSNamesystem: supergroup          = supergroup
21/01/09 19:27:22 INFO namenode.FSNamesystem: isPermissionEnabled = false
21/01/09 19:27:22 INFO namenode.FSNamesystem: HA Enabled: false
21/01/09 19:27:22 INFO namenode.FSNamesystem: Append Enabled: true
21/01/09 19:27:23 INFO common.Storage: Storage directory /opt/pkg/hadoop/data/tmp/dfs/name has been successfully formatted.
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop100/192.168.186.100
************************************************************/
Run and Test
Start the Hadoop environment. Right after HDFS starts, it stays in safe mode for a few seconds, during which no data can be processed. This is why it is not recommended to start the DFS and YARN daemons all at once with the start-all.sh script; instead, start DFS first and start the YARN daemons about 30 seconds later (a helper sketch is shown below).
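A minimal sketch of this startup order as a script (the script name is just an example; the 30-second wait is replaced here with hdfs dfsadmin -safemode wait, which blocks until the NameNode leaves safe mode):

#!/bin/bash
# start-hadoop.sh (example name): start HDFS first, wait for safe mode to clear, then start YARN
start-dfs.sh
hdfs dfsadmin -safemode wait                   # block until the NameNode leaves safe mode
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver    # optional: MapReduce job history service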
1) Start all DFS daemons:
$ start-dfs.sh
Starting namenodes on [hadoop100]
hadoop100: starting namenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-namenode-hadoop100.out
hadoop100: starting datanode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-datanode-hadoop100.out
Starting secondary namenodes [hadoop100]
hadoop100: starting secondarynamenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop100.out
2) Start all YARN daemons:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-resourcemanager-hadoop100.out
hadoop100: starting nodemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-nodemanager-hadoop100.out
Start the MapReduce JobHistory service, which is used to view the history logs of MapReduce jobs:
$ mr-jobhistory-daemon.sh start historyserver
3) Check that all the daemons started successfully
Run the jps command and check for the following processes:
$ jps
14608 NodeManager
14361 SecondaryNameNode
14203 DataNode
14510 ResourceManager
14079 NameNode
4) Managing individual daemons
# Start the HDFS NameNode on the master node:
hdfs --daemon start namenode

# Start the HDFS SecondaryNameNode on the master node:
hdfs --daemon start secondarynamenode

# Start an HDFS DataNode on a worker node:
hdfs --daemon start datanode

# Start the YARN ResourceManager on the master node:
yarn --daemon start resourcemanager

# Start a YARN NodeManager on a worker node:
yarn --daemon start nodemanager
The scripts above live in the $HADOOP_HOME/sbin/ directory. To stop a particular daemon on a node, simply replace start with stop in the command.
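For example, to stop just the DataNode on a worker node:

$ hdfs --daemon stop datanode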
Verify through the web interfaces
Visit http://hadoop100:9870 to check HDFS.
Visit http://hadoop100:8088 to check YARN.
Test the Hadoop Cluster
Use the bundled example programs to test the Hadoop cluster.
With the DFS and YARN daemons running, locate the example program:
$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount
Usage: wordcount <in> [<in>...] <out>
Prepare an input file and upload it to HDFS:
$ cat /opt/data/mapred/input/wc.txt
hadoop hadoop hadoop
hi hi hi
hello hadoop hello world
hadoop
$ hadoop fs -mkdir -p /input/wc
$ hadoop fs -put /opt/data/mapred/input/wc.txt /input/wc/
$ hadoop fs -ls /input/wc/
Found 1 items
-rw-r--r--   1 hadoop supergroup         62 2021-01-09 20:15 /input/wc/wc.txt
$ hadoop fs -cat /input/wc/wc.txt
hadoop hadoop hadoop
hi hi hi
hello hadoop hello world
hadoop
Run the bundled wordcount example and write the result to /output/wc:
$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount /input/wc/ /output/wc/
Console output:
2020-01-23 18:38:45,914 INFO client.RMProxy: Connecting to ResourceManager at hadoop100/192.168.186.100:8032
2020-01-23 18:38:47,204 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1642908422458_0001
2020-01-23 18:38:47,988 INFO input.FileInputFormat: Total input files to process : 1
2020-01-23 18:38:49,033 INFO mapreduce.JobSubmitter: number of splits:1
2020-01-23 18:38:49,788 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1642908422458_0001
2020-01-23 18:38:49,790 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-01-23 18:38:50,108 INFO conf.Configuration: resource-types.xml not found
2020-01-23 18:38:50,108 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-01-23 18:38:50,740 INFO impl.YarnClientImpl: Submitted application application_1642908422458_0001
2020-01-23 18:38:50,796 INFO mapreduce.Job: The url to track the job: http://hadoop100:8088/proxy/application_1642908422458_0001/
2020-01-23 18:38:50,797 INFO mapreduce.Job: Running job: job_1642908422458_0001
2020-01-23 18:39:09,424 INFO mapreduce.Job: Job job_1642908422458_0001 running in uber mode : false
2020-01-23 18:39:09,425 INFO mapreduce.Job:  map 0% reduce 0%
2020-01-23 18:39:19,633 INFO mapreduce.Job:  map 100% reduce 0%
2020-01-23 18:39:28,781 INFO mapreduce.Job:  map 100% reduce 100%
2020-01-23 18:39:30,812 INFO mapreduce.Job: Job job_1642908422458_0001 completed successfully
2020-01-23 18:39:30,973 INFO mapreduce.Job: Counters: 53
        File System Counters
                FILE: Number of bytes read=52
                FILE: Number of bytes written=444077
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=165
                HDFS: Number of bytes written=30
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=8195
                Total time spent by all reduces in occupied slots (ms)=6333
                Total time spent by all map tasks (ms)=8195
                Total time spent by all reduce tasks (ms)=6333
                Total vcore-milliseconds taken by all map tasks=8195
                Total vcore-milliseconds taken by all reduce tasks=6333
                Total megabyte-milliseconds taken by all map tasks=8391680
                Total megabyte-milliseconds taken by all reduce tasks=6484992
        Map-Reduce Framework
                Map input records=4
                Map output records=11
                Map output bytes=106
                Map output materialized bytes=52
                Input split bytes=102
                Combine input records=11
                Combine output records=4
                Reduce input groups=4
                Reduce shuffle bytes=52
                Reduce input records=4
                Reduce output records=4
                Spilled Records=8
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=235
                CPU time spent (ms)=3340
                Physical memory (bytes) snapshot=366059520
                Virtual memory (bytes) snapshot=5470892032
                Total committed heap usage (bytes)=291639296
                Peak Map Physical memory (bytes)=233541632
                Peak Map Virtual memory (bytes)=2732072960
                Peak Reduce Physical memory (bytes)=132517888
                Peak Reduce Virtual memory (bytes)=2738819072
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=63
        File Output Format Counters
                Bytes Written=30
Note: the input can be one or more directories; the output must be a directory path that does not yet exist.
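If you rerun the job, the previous output directory has to be removed first, for example:

$ hadoop fs -rm -r /output/wc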
View the result stored on HDFS:
$ hadoop fs -ls /output/wc/
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2021-01-09 20:23 /output/wc/_SUCCESS
-rw-r--r--   1 hadoop supergroup         30 2021-01-09 20:23 /output/wc/part-r-00000
$ hadoop fs -cat /output/wc/part-r-00000
hadoop  5
hello   2
hi      3
world   1
While the MapReduce job is running, its progress can be followed in the YARN web UI.
After it finishes, click History in the Tracking UI column of the MapReduce application to view the history logs.
If that page returns a 404, the JobHistoryServer service has not been started.
The result can also be viewed in the HDFS web UI.
Location: Utilities > HDFS browser > /output/wc/ > part-r-00000
Shut down the cluster
$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [hadoop100]
hadoop100: stopping namenode
hadoop100: stopping datanode
Stopping secondary namenodes [hadoop100]
hadoop100: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
hadoop100: stopping nodemanager
no proxyserver to stop
Install HBase
HBase ships with an embedded ZooKeeper, but for better control it is recommended to use a separately installed ZooKeeper.
Install ZooKeeper
Download apache-zookeeper-3.5.9-bin.tar.gz from the official website and install it to /opt/pkg/zookeeper in standalone mode. The configuration file $ZOOKEEPER_HOME/conf/zoo.cfg:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/tmp/zookeeper/data
dataLogDir=/opt/tmp/zookeeper/dataLog
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
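Since zoo.cfg points dataDir and dataLogDir at /opt/tmp/zookeeper, those directories have to exist before the first start. A hedged sketch, assuming $ZOOKEEPER_HOME/bin is on the PATH:

$ mkdir -p /opt/tmp/zookeeper/data /opt/tmp/zookeeper/dataLog
$ zkServer.sh start      # start the standalone ZooKeeper server
$ zkServer.sh status     # should report "Mode: standalone"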
Download the HBase package
Download link: https://www.apache.org/dyn/closer.lua/hbase/2.2.3/hbase-2.2.3-bin.tar.gz
Upload the package to the /opt/download path on the hadoop100 server and extract it:
$ cd /opt/download
$ tar -zxvf hbase-2.2.3-bin.tar.gz -C /opt/pkg/
Configure HBase
Edit the HBase configuration file hbase-env.sh:
$ cd /opt/pkg/hbase-2.2.3/conf
$ vim hbase-env.sh
Change the following two entries to these values:
export JAVA_HOME=/opt/pkg/java
export HBASE_MANAGES_ZK=false
Edit hbase-site.xml:
$ vim hbase-site.xml
with the following content:
<configuration>
    <!-- HBase storage path on HDFS -->
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop100:8020/hbase</value>
    </property>
    <!-- Whether HBase runs in distributed mode -->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <!-- ZooKeeper addresses, separated by "," -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop100:2181</value>
    </property>
    <!-- HBase web UI port -->
    <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
    </property>
    <!-- Must be set in a distributed setup, otherwise the HMaster often fails to start -->
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>
Edit the regionservers file and list the HBase worker hostnames:
$ vim regionservers
hadoop100
Add the HBase environment variables:
export HBASE_HOME=/opt/pkg/hbase-2.2.3
export PATH=$PATH:$HBASE_HOME/bin
Re-source /etc/profile so the environment variables take effect:
source /etc/profile
Starting and stopping HBase:
HDFS and ZooKeeper must be running before HBase is started.
If HDFS is not running, start it with:
$ start-dfs.sh
If ZooKeeper is not running, start it with:
$ zkServer.sh start conf/zoo.cfg
Start the HBase cluster with:
$ start-hbase.sh
After it starts, run jps and check that the HBase processes are running normally:
$ jps
7601 QuorumPeerMain
1670 NameNode
1975 SecondaryNameNode
6695 HMaster
6841 HRegionServer
1787 DataNode
2491 JobHistoryServer
2236 ResourceManager
6958 Jps
2351 NodeManager
Connect with the shell client that ships with HBase:
$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3, r6a830d87542b766bd3dc4cfdee28655f62de3974, Fri Jan 10 18:27:51 CST 2020
Took 0.0025 seconds
hbase(main):001:0> status
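As an optional smoke test (the table name 'test' is just an example, not part of the original setup), you can create and drop a small table from the shell:

hbase(main):002:0> create 'test', 'cf'            # table 'test' with one column family 'cf'
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):004:0> scan 'test'                    # should show the row just inserted
hbase(main):005:0> disable 'test'
hbase(main):006:0> drop 'test'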
Open the HBase web UI:
Visit http://hadoop100:16010 in a browser.
Stop the HBase processes with:
stop-hbase.sh
Install MySQL
Download the RPM bundle
$ wget http://mirrors.163.com/mysql/Downloads/MySQL-5.7/mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ tar xvf mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ mkdir mysql-jars
$ mv mysql-comm*.rpm mysql-jars/
$ cd mysql-jars/
$ ls
mysql-community-client-5.7.33-1.el7.x86_64.rpm
mysql-community-common-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
mysql-community-server-5.7.33-1.el7.x86_64.rpm
Install the packages manually, in this order:
sudo rpm -ivh mysql-community-common-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-client-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-server-5.7.33-1.el7.x86_64.rpm
Start the MySQL service
Start the service:
$ systemctl start mysqld.service
or
$ service mysqld start
Enable it at boot:
$ systemctl enable mysqld.service
Check its status:
$ systemctl status mysqld.service
mysql 2574 1 1 23:49 ? 00:00:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
View the process information:
$ sudo netstat -anpl | grep mysql
tcp6       0      0 :::3306                 :::*                    LISTEN      946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45824   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45818   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45838   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45808   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45816   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45820   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45826   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45822   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45814   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45846   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45828   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45832   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45842   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45830   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45844   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45812   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45836   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45834   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45840   ESTABLISHED 946/mysqld
tcp6       0      0 192.168.186.100:3306    192.168.186.100:45810   ESTABLISHED 946/mysqld
unix  2      [ ACC ]     STREAM     LISTENING     20523    946/mysqld          /var/lib/mysql/mysql.sock
Check the port information. As shown below, the MySQL server process mysqld listens on the default port 3306:
$ sudo netstat -anpl | grep tcp
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1412/sshd
tcp        0     52 192.168.186.103:22      192.168.186.1:54058     ESTABLISHED 2287/sshd: hadoop
tcp6       0      0 :::3306                 :::*                    LISTEN      2574/mysqld
tcp6       0      0 192.168.186.103:3888    :::*                    LISTEN      2060/java
tcp6       0      0 :::22                   :::*                    LISTEN      1412/sshd
tcp6       0      0 :::37791                :::*                    LISTEN      2060/java
tcp6       0      0 :::2181                 :::*                    LISTEN      2060/java
Change the password
Find the temporary password:
The first login to MySQL requires root's temporary password, which is generated at install time and written to the MySQL server log:
$ grep "temporary password" /var/log/mysqld.log
2020-02-26T17:05:45.104999Z 1 [Note] A temporary password is generated for root@localhost: bl/!6qaU.wuX
$ mysql -u root -p
Enter the temporary password to log in to the MySQL client.
Relax the password policy:
Right after the first login, MySQL asks you to replace the default password:
mysql> show databases;
ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.
mysql> alter user 'root'@'localhost' identified by 'niit1234';
ERROR 1819 (HY000): Your password does not satisfy the current policy requirements
Under the default password policy, the new password must contain upper- and lower-case letters, digits, and special characters. If security is not a concern, the policy can be relaxed:
mysql> set global validate_password_policy=0;
mysql> set global validate_password_length=1;
Set the new root password and enable remote access:
For security reasons, remote access to the MySQL server is disabled by default. Broaden root's access so it can connect from any IP:
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'niit1234' WITH GRANT OPTION;
The change in remote access permissions does not take effect until the privileges are flushed:
mysql> FLUSH PRIVILEGES;
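A quick hedged check that remote access works is to connect over TCP using the hostname instead of the local socket (credentials as configured above):

$ mysql -h hadoop100 -P 3306 -uroot -pniit1234 -e "select version();"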
Change the default character set
Edit the MySQL configuration file:
vi /etc/my.cnf
Append the following to the end of the file:
# default character set for server-side operations
character-set-server=utf8mb4
# default collation for server-side operations
collation-server=utf8mb4_general_ci
# default storage engine
default-storage-engine=InnoDB
# set the following character sets when a connection is initialized:
#   character_set_client
#   character_set_results
#   character_set_connection
init_connect='set names utf8mb4'

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4
Restart the MySQL service:
$ sudo service mysqld restart
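To confirm that the new character set is in effect, a hedged check is:

$ mysql -uroot -pniit1234 -e "show variables like 'character_set%';"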
Install Hive
Download the package
Download the Hive package apache-hive-3.1.2-bin.tar.gz from the official website.
Planned installation directory: /opt/pkg/hive
Upload the package to the hadoop100 server.
Extract to the installation path
Extract the package to the planned directory /opt/pkg/:
$ cd /opt/download/
$ tar -xzvf apache-hive-3.1.2-bin.tar.gz -C /opt/pkg/
Edit the configuration files
Go to the Hive installation directory:
$ cd /opt/pkg
Rename the Hive directory:
$ mv apache-hive-3.1.2-bin/ hive
Edit hive-site.xml in the /opt/pkg/hive/conf directory. The file does not exist by default and has to be created by hand:
$ cd /opt/pkg/hive/conf/
$ vim hive-site.xml
Enter edit mode and give the file the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>niit1234</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop100</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
</configuration>
Create the Hive log directory:
$ mkdir /opt/pkg/hive/logs/
Rename the log configuration template to hive-log4j2.properties:
$ pwd
/opt/pkg/hive/conf
$ mv hive-log4j2.properties.template hive-log4j2.properties
$ vim hive-log4j2.properties   # edit the file
Change the value of the hive.log.dir property in this file:
# set the path where Hive's log files are stored, to make troubleshooting easier
hive.log.dir=/opt/pkg/hive/logs/
Copy the MySQL driver
Hive reads and writes its metadata in the MySQL database at runtime, so the MySQL driver jar must be placed in Hive's lib directory.
Upload the MySQL driver, e.g. mysql-connector-java-5.1.38.jar, to the /opt/download/ directory, then:
$ cp mysql-connector-java-5.1.38.jar /opt/pkg/hive/lib/
Resolve the logging jar conflict:
# go to the lib directory
$ cd /opt/pkg/hive/lib/
# rename the jar (or simply delete it)
$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar.bak
Configure the Hive environment variables:
# HIVE
export HIVE_HOME=/opt/pkg/hive
export PATH=$PATH:$HIVE_HOME/bin

# HCATALOG
export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$PATH:$HCAT_HOME/bin
Afterwards, do not forget to source the profile file so the changes take effect (a sketch is shown below).
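The original does not say which file these exports go into; assuming they are added to the same /etc/profile.d/hadoop.env.sh used earlier, the reload would look like this:

$ sudo vim /etc/profile.d/hadoop.env.sh   # append the HIVE/HCAT exports above
$ source /etc/profile.d/hadoop.env.sh
$ hive --version                          # quick sanity check that hive is on the PATH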
Initialize the metastore database
Connect to MySQL with the MySQL client (user root, password niit1234) and create the Hive metastore database. The database name must match the one configured in hive-site.xml:
$ mysql -uroot -pniit1234
create database metastore;
show databases;
Quit MySQL:
exit;
Initialize the metastore schema:
$ schematool -initSchema -dbType mysql -verbose
When you see "schemaTool completed", the initialization succeeded.
Start the Hive services
Prerequisite: the Hadoop cluster and the MySQL service are both running. Then execute:
nohup hive --service metastore >/tmp/metastore.log 2>&1 &
nohup hive --service hiveserver2 >/tmp/hiveServer2.log 2>&1 &
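HiveServer2 can take a little while to start listening. A hedged check before connecting (the metastore defaults to port 9083 and HiveServer2 to 10000):

$ sudo netstat -anpl | grep -E '9083|10000'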
Verify the Hive installation
From any directory on hadoop100, start Hive's command-line client beeline:
$ beeline
Beeline version 3.1.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hadoop
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
2 rows selected (1.93 seconds)
You can simply press Enter at the password prompt. If you see output like the above, Hive is installed successfully.
Exit the client:
0: jdbc:hive2://localhost:10000> !quit
Closing: 0: jdbc:hive2://localhost:10000
Fixing beeline connection failures
If hiveserver2 is running normally on port 10000 of this machine but beeline fails to connect with "WARN jdbc.HiveConnection: Failed to connect to localhost:10000",
the most likely cause is Hadoop's proxy-user mechanism, which does not allow higher-level systems to act as the real user directly.
Solution: add the following to Hadoop's core configuration file core-site.xml:
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
The XXX in hadoop.proxyuser.XXX is the proxy user name; here it is set to hadoop.
Then restart Hadoop, otherwise the change does not take effect:
$ stop-all.sh
$ start-dfs.sh
$ start-yarn.sh
After that, running the following command should connect successfully:
beeline -u jdbc:hive2://localhost:10000
Install Sqoop
Download and install Sqoop
Download link: http://archive.apache.org/dist/sqoop/1.4.7/
wget http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Extract sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz to the target directory:
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/pkg/
Rename the Sqoop installation directory:
$ cd /opt/pkg/
$ mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop
Configure the Sqoop environment variables
$ vi ~/.bash_profile
#sqoop
export SQOOP_HOME=/opt/pkg/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Point Sqoop at all the environments it depends on by editing sqoop-env.sh:
$ mv conf/sqoop-env-template.sh conf/sqoop-env.sh
$ vi conf/sqoop-env.sh

# Set Hadoop-specific environment variables here.

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/pkg/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/pkg/hadoop

#set the path to where bin/hbase is available
export HBASE_HOME=/opt/pkg/hbase-2.2.3

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/pkg/hive

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/pkg/zookeeper/conf
Edit sqoop-site.xml as shown below:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>sqoop.metastore.client.enable.autoconnect</name>
        <value>true</value>
        <description>If true, Sqoop will connect to a local metastore for job management when no other metastore arguments are provided.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.url</name>
        <value>jdbc:hsqldb:file:/tmp/sqoop-meta/meta.db;shutdown=true</value>
        <description>The connect string to use when connecting to a job-management metastore. If unspecified, uses ~/.sqoop/. You can specify a different path here.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.username</name>
        <value>SA</value>
        <description>The username to bind to the metastore.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.client.autoconnect.password</name>
        <value></value>
        <description>The password to bind to the metastore.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.client.record.password</name>
        <value>true</value>
        <description>If true, allow saved passwords in the metastore.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.server.location</name>
        <value>/tmp/sqoop-metastore/shared.db</value>
        <description>Path to the shared metastore database files. If this is not set, it will be placed in ~/.sqoop/.
        </description>
    </property>
    <property>
        <name>sqoop.metastore.server.port</name>
        <value>16000</value>
        <description>Port that this metastore should listen on.
        </description>
    </property>
</configuration>
Edit configure-sqoop:
$ vi bin/configure-sqoop
If Accumulo is not installed, comment out the logic that checks ACCUMULO_HOME:
94  #if [ -z "${ACCUMULO_HOME}" ]; then
95  #  if [ -d "/usr/lib/accumulo" ]; then
96  #    ACCUMULO_HOME=/usr/lib/accumulo
97  #  else
98  #    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
99  #  fi
100 #fi

140 #if [ ! -d "${ACCUMULO_HOME}" ]; then
141 #  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
142 #  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
143 #fi
This avoids warnings like the following at run time:
Warning: /opt/pkg/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Copy the MySQL driver (for MySQL 5.7 the driver should be a 5.x version) into the lib directory under the Sqoop installation directory:
$ cp mysql-connector-java-5.1.44-bin.jar /opt/pkg/sqoop/lib/
Copy or symlink $HIVE_HOME/lib/hive-common-3.1.2.jar into $SQOOP_HOME/lib:
$ ln -s /opt/pkg/hive/lib/hive-common-3.1.2.jar /opt/pkg/sqoop/lib
If JSON parsing is needed, download java-json.jar and put it into Sqoop's lib directory.
Download link: http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm
$ cp java-json.jar /opt/pkg/sqoop/lib/
If Avro serialization is needed, copy or symlink Hadoop's Avro jar into Sqoop's lib directory:
$ ln -s /opt/pkg/hadoop/share/hadoop/common/lib/avro-1.7.7.jar /opt/pkg/sqoop/lib/
Exercises
Complete the following exercise:
Exercise 1:
After setting up the Sqoop learning environment, cd into the Sqoop installation directory, run the following command, and observe the output:
$ sqoop version
2020-01-23 23:43:20,287 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
If you see output like this, the basic Sqoop installation is complete.
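As an extra hedged check that Sqoop can reach the MySQL instance installed above (credentials as configured earlier in this guide), you can list its databases:

$ sqoop list-databases \
    --connect jdbc:mysql://hadoop100:3306/ \
    --username root \
    --password niit1234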