Sqoop Environment Setup

Problem Statement

Sqoop is a tool for transferring data between the Hadoop ecosystem and relational databases (RDBMS). Before learning Sqoop, we first need to set up a learning environment. For convenience, a single-node deployment is used here.

Sqoop was originally a subproject of Hadoop and was designed to run only on Linux operating systems.
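To give a sense of what Sqoop usage looks like, the sketch below imports a MySQL table into HDFS; the database name testdb, table name users, and target directory are placeholders for illustration only (the connection details match the environment built later in this guide):

$ sqoop import \
    --connect jdbc:mysql://hadoop100:3306/testdb \
    --username root \
    --password niit1234 \
    --table users \
    --target-dir /sqoop/users \
    --num-mappers 1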

Prerequisites

The prerequisites for installing Sqoop are:

  • A Linux operating system (CentOS 7)
  • A Java environment (JDK 1.8)
  • A Hadoop environment (Hadoop 3.1.4)

In addition, to use most of Sqoop's features, the following also need to be installed:

  • Zookeeper 3.5.9
  • HBase 2.2.3
  • MySQL 5.7
  • Hive 3.1.2

Solution

Prepare the Linux Operating System

Configure the Hostname Mapping

Configure the network:

When installing CentOS 7 in the virtual machine:
  • Make the virtual disk a bit larger, e.g. 40 GB or more
  • Do not pre-allocate the disk space
  • Set the network adapter to NAT mode

Open the Virtual Network Editor.

Note the subnet IP, subnet mask, and gateway IP of the virtual network adapter used for NAT mode.

Then open a terminal in the virtual machine and configure a static IP:

$ vi /etc/sysconfig/network-scripts/ifcfg-ens33
NAME="ens33"
TYPE="Ethernet"
DEVICE="ens33"
BROWSER_ONLY="no"
DEFROUTE="yes"
PROXY_METHOD="none"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
IPV6_PRIVACY="no"
UUID="b00b9ac0-60c2-4d34-ab88-2413055463cf"

ONBOOT="yes"
BOOTPROTO="static"
IPADDR="192.168.186.100"
PREFIX="24"
GATEWAY="192.168.186.2"
DNS1="223.5.5.5"
DNS2="8.8.8.8"

The entries that need to be modified are:

  • ONBOOT="yes" brings the interface up automatically at boot
  • BOOTPROTO="static" uses a static IP so that the address does not change
  • The first three octets of IPADDR must match the subnet of the NAT virtual adapter, and the last octet can be any value between 0 and 254 that does not conflict with the NAT gateway or with other hosts on the same network
  • PREFIX=24 sets the subnet mask length in bits, which is 255.255.255.0 in dotted decimal, so PREFIX=24 can also be replaced with NETMASK="255.255.255.0"
  • DNS1 is set to Alibaba's public DNS server "223.5.5.5", and DNS2 to Google's public DNS server "8.8.8.8"

After saving the changes, restart the network service:

$ sudo service network restart
Restarting network (via systemctl):                        [  确定  ]
$ sudo service network status
已配置设备:
lo ens33
当前活跃设备:
lo ens33

Check the local IP address:

$ ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
  inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
2: ens33:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
  link/ether 00:0c:29:b2:5b:75 brd ff:ff:ff:ff:ff:ff
  inet 192.168.186.100/24 brd 192.168.186.255 scope global noprefixroute ens33
    valid_lft forever preferred_lft forever
  inet6 fe80::40db:ee8d:77c1:fbf7/64 scope link noprefixroute
    valid_lft forever preferred_lft forever

Set a static hostname:

# hostnamectl --static set-hostname hadoop100

In the hosts file, map the hostname to the local IP address
(comment out the localhost entries):

vi /etc/hosts
#127.0.0.1  localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1     localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.186.100 hadoop100

Create a Dedicated Account

Using the root account directly is not recommended, so we create a hadoop account for all of the following operations.

Create the hadoop user:

# useradd hadoop    

Set the hadoop user's password:

# passwd hadoop

Grant sudo privileges to the hadoop account:

vi /etc/sudoers

Find the line root ALL=(ALL) ALL and add a new line below it:

## Allow root to run any commands anywhere
root    ALL=(ALL)   ALL
hadoop  ALL=(ALL)   NOPASSWD:ALL

After saving the change, switch to the hadoop user:

# su hadoop

From now on, commands that require root privileges can be prefixed with sudo and will not ask for a password, for example:

$ sudo ls /root

Unless otherwise noted, all of the following operations are performed as the hadoop user.

Configure Passwordless SSH Login

Machines in a Hadoop cluster communicate with each other over SSH, which asks for a password by default. Since we cannot type a password for every connection while the cluster is running, passwordless SSH must be configured between the machines. A single-node pseudo-distributed Hadoop environment likewise needs passwordless SSH from the local host to itself. The steps are as follows:

  1. First, generate an RSA key pair (a public key and a private key) with the ssh-keygen command.

    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:yQYChs4eniVLeICaI2bCB9HopbXUBE9v0lpLjBACUnM hadoop@hadoop100
    The key's randomart image is:
    +---[RSA 2048]----+
    |==O+Eo      |
    |=+.O+.=     |
    |Bo* o+.B     |
    |B%.+ .*o..    |
    |OoB . .S    |
    | =   .     |
    |         |
    +----[SHA256]-----+
  2. Append the generated public key to the authorized_keys file in the ~/.ssh directory and set the file's permissions to 600.

    $ cd ~/.ssh/
    $ cat id_rsa.pub >> authorized_keys
    $ chmod 600 authorized_keys

    The three commands above can be replaced with a single command:

    $ ssh-copy-id hadoop100
  3. Connect to the local host with the ssh command. If no password is requested, the local passwordless SSH configuration works.

    $ ssh hadoop@hadoop100
    The authenticity of host 'hadoop100 (192.168.186.100)' can't be established.
    ECDSA key fingerprint is SHA256:aGLhdt3bIuqtPgrFWnhgrfTKUbDh4CWVTfIgr5E5oV0.
    ECDSA key fingerprint is MD5:b8:bd:b3:65:fe:77:2c:06:2d:ec:58:3a:97:51:dd:ca.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'hadoop100,192.168.186.100' (ECDSA) to the list of known hosts.
    Last login: Sat Jan 9 10:16:53 2021 from 192.168.186.1
    
  4. Log out:

    $ exit
    Connection to hadoop100 closed.

Configure Time Synchronization

Communication and file transfers within a cluster generally rely on the system clock, so if the system times of the cluster machines are inconsistent, all kinds of problems can occur, such as long response times or even outright failures. Synchronizing the clocks of the machines is therefore very important. Since our learning environment is a single-node deployment, however, no time synchronization needs to be configured here.
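For reference, on a real multi-node cluster the usual approach is to point every node at the same NTP source. A minimal sketch using chrony on CentOS 7 (stock package and service names assumed; not needed for the single-node setup here):

$ sudo yum install -y chrony            # install the chrony NTP client
$ sudo systemctl enable --now chronyd   # start the service and enable it at boot
$ chronyc sources -v                    # confirm that a time source is reachable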

Standardize the Directory Layout

The planned directory layout is as follows:

/opt/
  ├── bin         # scripts and commands
  ├── data        # data used by programs
  ├── download    # downloaded installation packages
  ├── pkg         # software installed by unpacking archives
  └── tmp         # temporary files generated by programs

Create the directories as the hadoop user:

$ sudo mkdir /opt/download
$ sudo mkdir /opt/data
$ sudo mkdir /opt/bin
$ sudo mkdir /opt/tmp
$ sudo mkdir /opt/pkg

For convenience, change the owner and group of the directories under /opt to hadoop:

$ sudo chown hadoop:hadoop /opt/*
$ ls
bin data download pkg tmp

$ ll
总用量 0
drwxr-xr-x. 2 hadoop hadoop 6 1月  9 14:29 bin
drwxr-xr-x. 2 hadoop hadoop 6 1月  9 14:29 data
drwxr-xr-x. 2 hadoop hadoop 6 1月  9 14:29 download
drwxr-xr-x. 2 hadoop hadoop 6 1月  9 14:37 pkg
drwxr-xr-x. 2 hadoop hadoop 6 1月  9 14:36 tmp

Install the Java Environment

Check whether a Java JDK is already installed:

$ rpm -qa | grep java

$ yum list installed | grep java

If Java is not installed yet, install it; JDK 1.8 or later is recommended.

Download and Unpack the Java JDK

Download the JDK 1.8 package from the official website, upload it to /opt/download/, and unpack it into /opt/pkg:

$ tar -zxvf jdk-8u261-linux-x64.tar.gz 
$ mv jdk1.8.0_261 /opt/pkg/java

Configure the Java Environment Variables

Confirm that the current directory is the JDK's unpacked path:

$ pwd
/opt/pkg/java

Edit the /etc/profile.d/hadoop.env.sh configuration file (create it if it does not exist):

$ sudo vim /etc/profile.d/hadoop.env.sh

Add the new environment variables:

# JAVA_HOME
export JAVA_HOME=/opt/pkg/java
PATH=$JAVA_HOME/bin:$PATH

export PATH

Make the new environment variables take effect immediately:

$ source /etc/profile.d/hadoop.env.sh

Verify the environment variables:

$ java -version
$ java
$ javac

Install Hadoop

Download and Unpack

Download hadoop-3.1.4.tar.gz from the official website, upload it to the server, unpack it into the target directory, and rename it so that the path matches the configuration used below:

$ tar -zxvf hadoop-3.1.4.tar.gz -C /opt/pkg/
$ mv /opt/pkg/hadoop-3.1.4 /opt/pkg/hadoop

Edit the /etc/profile.d/hadoop.env.sh configuration file and add the environment variables:

# HADOOP_HOME
export HADOOP_HOME=/opt/pkg/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Make the new environment variables take effect immediately:

$ source /etc/profile.d/hadoop.env.sh

Verify:

$ hadoop version

Adjust the Execution Environment of the Hadoop Commands

Find hadoop/etc/hadoop/hadoop-env.sh under the installation directory, locate the JAVA_HOME line, and set it to the actual JDK path:

# The java implementation to use.
export JAVA_HOME=/opt/pkg/java

Make the same change in hadoop/etc/hadoop/yarn-env.sh:

# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java

Make the same change in hadoop/etc/hadoop/mapred-env.sh:

# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/pkg/java

Modify the Hadoop Configuration

Go to etc/hadoop/ under the Hadoop installation directory and modify the following configuration files.

1) hadoop/etc/hadoop/core-site.xml – the Hadoop core configuration file

    <configuration>
      <!-- Default file system: the NameNode address -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
      </property>

      <!-- Base directory for HDFS data and metadata -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/pkg/hadoop/data/tmp</value>
      </property>

      <!-- I/O buffer size in bytes -->
      <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
      </property>

      <!-- Keep deleted files in the trash for 7 days (value in minutes) -->
      <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
      </property>
    </configuration>

Note: change the hostname to the actual hostname of your machine.

hadoop.tmp.dir is very important: all of the NameNode and DataNode data of the Hadoop cluster is stored under this directory.
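The directory is created automatically when the NameNode is formatted, but it does no harm to pre-create it and confirm that the hadoop user owns it; an optional quick check (path as configured above):

$ mkdir -p /opt/pkg/hadoop/data/tmp     # directory referenced by hadoop.tmp.dir
$ ls -ld /opt/pkg/hadoop/data/tmp       # should be owned by hadoop:hadoop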

2) hadoop/etc/hadoop/hdfs-site.xml – HDFS-related configuration

    <configuration>
      <!-- Number of block replicas; 1 is enough for a single node -->
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>

      <!-- HTTP address of the SecondaryNameNode -->
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop100:9868</value>
      </property>

      <!-- HTTP address of the NameNode web UI -->
      <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
      </property>

      <!-- Disable HDFS permission checking -->
      <property>
        <name>dfs.permissions</name>
        <value>false</value>
      </property>
    </configuration>

dfs.replication defaults to 3; to save virtual machine resources it is set to 1 here.

In a fully distributed setup, the SecondaryNameNode and the NameNode should be deployed on separate hosts.

dfs.namenode.secondary.http-address defaults to the local host, so it can be left unset in a pseudo-distributed deployment.

3) hadoop/etc/hadoop/mapred-site.xml – MapReduce-related configuration

   <configuration>
     <!-- Run MapReduce jobs on YARN -->
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>

     <!-- JobHistory server RPC address -->
     <property>
       <name>mapreduce.jobhistory.address</name>
       <value>hadoop100:10020</value>
     </property>

     <!-- JobHistory server web UI address -->
     <property>
       <name>mapreduce.jobhistory.webapp.address</name>
       <value>hadoop100:19888</value>
     </property>

     <property>
       <name>yarn.app.mapreduce.am.env</name>
       <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
     </property>

     <property>
       <name>mapreduce.map.env</name>
       <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
     </property>

     <property>
       <name>mapreduce.reduce.env</name>
       <value>HADOOP_MAPRED_HOME=/opt/pkg/hadoop</value>
     </property>
   </configuration>

The mapreduce.jobhistory settings are optional; they are used to view the history logs of MapReduce jobs.

Be very careful that the hostname here is correct, otherwise job execution will fail and the cause is hard to track down.

The MapReduce JobHistory service must be started manually before the history logs can be opened from the YARN web page.

4) Configure yarn-site.xml

    <configuration>
      <!-- Host running the ResourceManager -->
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop100</value>
      </property>

      <!-- Auxiliary service required by the MapReduce shuffle -->
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>

      <!-- Enable log aggregation -->
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>

      <!-- Keep aggregated logs for 7 days (value in seconds) -->
      <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
      </property>

      <!-- Disable the virtual and physical memory checks on the small VM -->
      <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
      </property>

      <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
      </property>
    </configuration>

5) workers – the DataNode node list

   $ vi workers
   hadoop100

In a single-node pseudo-distributed environment this file can be left at its default value, localhost, or changed to the local hostname.

In a fully distributed setup, enter one DataNode hostname per line.

Make sure there are no spaces or blank lines around the DataNode hostnames, because other scripts read this file to obtain the hostnames.

Format the NameNode

$ hdfs namenode -format
21/01/09 19:27:21 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:  host = hadoop100/192.168.186.100
STARTUP_MSG:  args = [-format]
STARTUP_MSG:  version = 2.7.3
************************************************************/
21/01/09 19:27:21 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
21/01/09 19:27:21 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-08318e9e-e202-48f3-bcb1-548ca50310c9
21/01/09 19:27:22 INFO util.GSet: Computing capacity for map BlocksMap
21/01/09 19:27:22 INFO util.GSet: VM type    = 64-bit
21/01/09 19:27:22 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
21/01/09 19:27:22 INFO util.GSet: capacity   = 2^21 = 2097152 entries
21/01/09 19:27:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: defaultReplication     = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplication       = 512
21/01/09 19:27:22 INFO blockmanagement.BlockManager: minReplication       = 1
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxReplicationStreams   = 2
21/01/09 19:27:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
21/01/09 19:27:22 INFO blockmanagement.BlockManager: encryptDataTransfer    = false
21/01/09 19:27:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog     = 1000
21/01/09 19:27:22 INFO namenode.FSNamesystem: fsOwner       = hadoop (auth:SIMPLE)
21/01/09 19:27:22 INFO namenode.FSNamesystem: supergroup     = supergroup
21/01/09 19:27:22 INFO namenode.FSNamesystem: isPermissionEnabled = false
21/01/09 19:27:22 INFO namenode.FSNamesystem: HA Enabled: false
21/01/09 19:27:22 INFO namenode.FSNamesystem: Append Enabled: true
21/01/09 19:27:23 INFO common.Storage: Storage directory /opt/pkg/hadoop/data/tmp/dfs/name has been successfully formatted.
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop100/192.168.186.100
************************************************************/

Run and Test

Start the Hadoop environment. Right after the HDFS daemons start, HDFS stays in safe mode for a few seconds, during which no data operations are possible. This is also why using the start-all.sh script to start the DFS and YARN daemons all at once is not recommended; instead, start DFS first and wait about 30 seconds before starting the YARN daemons.
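Before submitting any jobs you can also ask HDFS explicitly whether it is still in safe mode; a quick check after start-dfs.sh:

$ hdfs dfsadmin -safemode get     # prints "Safe mode is ON" or "Safe mode is OFF"
$ hdfs dfsadmin -safemode wait    # blocks until safe mode is switched off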

1) Start all DFS daemons:

   $ start-dfs.sh
   Starting namenodes on [hadoop100]
   hadoop100: starting namenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-namenode-hadoop100.out
   hadoop100: starting datanode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-datanode-hadoop100.out
   Starting secondary namenodes [hadoop100]
   hadoop100: starting secondarynamenode, logging to /opt/pkg/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop100.out

2) Start all YARN daemons:

   $ start-yarn.sh
   starting yarn daemons
   starting resourcemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-resourcemanager-hadoop100.out
   hadoop100: starting nodemanager, logging to /opt/pkg/hadoop/logs/yarn-hadoop-nodemanager-hadoop100.out

Start the MapReduce JobHistory service, which is used to view the history logs of MapReduce jobs:

   $ mr-jobhistory-daemon.sh start historyserver

3) Check that all the daemons started successfully

Run the jps command and check that the following processes are present:

   $ jps
   14608 NodeManager
   14361 SecondaryNameNode
   14203 DataNode
   14510 ResourceManager
   14079 NameNode

4) Managing individual daemons

   # Start the HDFS NameNode on the master node:
   hdfs --daemon start namenode

   # Start the HDFS SecondaryNameNode on the master node:
   hdfs --daemon start secondarynamenode

   # Start an HDFS DataNode on each worker node:
   hdfs --daemon start datanode

   # Start the YARN ResourceManager on the master node:
   yarn --daemon start resourcemanager

   # Start a YARN NodeManager on each worker node:
   yarn --daemon start nodemanager

The hdfs and yarn commands above are located in $HADOOP_HOME/bin/, and the start-*.sh/stop-*.sh scripts in $HADOOP_HOME/sbin/. To stop a particular daemon on a node, simply replace start with stop in the command.

Verify via the Web UI

Visit http://hadoop100:9870 to check the HDFS status.


Visit http://hadoop100:8088 to check the YARN status.


Test the Hadoop Cluster

Use the bundled official example programs to test the Hadoop cluster.

With the DFS and YARN daemons running, locate the example program:

$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount
Usage: wordcount <in> [<in>...] <out>

Prepare an input file and upload it to HDFS:

$ cat /opt/data/mapred/input/wc.txt
hadoop hadoop hadoop
hi hi hi hello hadoop
hello world hadoop

$ hadoop fs -mkdir -p /input/wc

$ hadoop fs -put wc.txt /input/wc/

$ hadoop fs -ls /input/wc/
Found 1 items
-rw-r--r--  1 hadoop supergroup     62 2021-01-09 20:15 /input/wc/wc.txt

$ hadoop fs -cat /input/wc/wc.txt
hadoop hadoop hadoop
hi hi hi hello hadoop
hello world hadoop

Run the official wordcount example and write the results to /output/wc:

$ cd /opt/pkg/hadoop/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount /input/wc/ /output/wc/

Console output:

2020-01-23 18:38:45,914 INFO client.RMProxy: Connecting to ResourceManager at hadoop100/192.168.186.100:8032
2020-01-23 18:38:47,204 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1642908422458_0001
2020-01-23 18:38:47,988 INFO input.FileInputFormat: Total input files to process : 1
2020-01-23 18:38:49,033 INFO mapreduce.JobSubmitter: number of splits:1
2020-01-23 18:38:49,788 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1642908422458_0001
2020-01-23 18:38:49,790 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-01-23 18:38:50,108 INFO conf.Configuration: resource-types.xml not found
2020-01-23 18:38:50,108 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-01-23 18:38:50,740 INFO impl.YarnClientImpl: Submitted application application_1642908422458_0001
2020-01-23 18:38:50,796 INFO mapreduce.Job: The url to track the job: http://hadoop100:8088/proxy/application_1642908422458_0001/
2020-01-23 18:38:50,797 INFO mapreduce.Job: Running job: job_1642908422458_0001
2020-01-23 18:39:09,424 INFO mapreduce.Job: Job job_1642908422458_0001 running in uber mode : false
2020-01-23 18:39:09,425 INFO mapreduce.Job: map 0% reduce 0%
2020-01-23 18:39:19,633 INFO mapreduce.Job: map 100% reduce 0%
2020-01-23 18:39:28,781 INFO mapreduce.Job: map 100% reduce 100%
2020-01-23 18:39:30,812 INFO mapreduce.Job: Job job_1642908422458_0001 completed successfully
2020-01-23 18:39:30,973 INFO mapreduce.Job: Counters: 53
    File System Counters
        FILE: Number of bytes read=52
        FILE: Number of bytes written=444077
         FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=165
        HDFS: Number of bytes written=30
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
     Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=8195
        Total time spent by all reduces in occupied slots (ms)=6333
        Total time spent by all map tasks (ms)=8195
        Total time spent by all reduce tasks (ms)=6333
        Total vcore-milliseconds taken by all map tasks=8195
        Total vcore-milliseconds taken by all reduce tasks=6333
        Total megabyte-milliseconds taken by all map tasks=8391680
        Total megabyte-milliseconds taken by all reduce tasks=6484992
    Map-Reduce Framework
        Map input records=4
         Map output records=11
        Map output bytes=106
        Map output materialized bytes=52
        Input split bytes=102
        Combine input records=11
        Combine output records=4
        Reduce input groups=4
        Reduce shuffle bytes=52
        Reduce input records=4
        Reduce output records=4
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=235
        CPU time spent (ms)=3340
        Physical memory (bytes) snapshot=366059520
        Virtual memory (bytes) snapshot=5470892032
        Total committed heap usage (bytes)=291639296
        Peak Map Physical memory (bytes)=233541632
        Peak Map Virtual memory (bytes)=2732072960
        Peak Reduce Physical memory (bytes)=132517888
        Peak Reduce Virtual memory (bytes)=2738819072
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=63
    File Output Format Counters
        Bytes Written=30

Note that the input is one or more directories (multiple can be specified), while the output must be a directory path that does not yet exist.
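If you rerun the job with the same output path, it will fail because the directory already exists; one simple workaround is to delete the old output first (paths taken from the example above):

$ hadoop fs -rm -r -skipTrash /output/wc
$ hadoop jar hadoop-mapreduce-examples-3.1.4.jar wordcount /input/wc/ /output/wc/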

View the results saved on HDFS:

$ hadoop fs -ls /output/wc/
Found 2 items
-rw-r--r--  1 hadoop supergroup     0 2021-01-09 20:23 /output/wc/_SUCCESS
-rw-r--r--  1 hadoop supergroup     30 2021-01-09 20:23 /output/wc/part-r-00000

$ hadoop fs -cat /output/wc/part-r-00000
hadoop 5
hello  2
hi 3
world  1

While the MapReduce job is running, you can track its progress in the YARN web UI.


After the job finishes, click History in the Tracking UI column of the MapReduce application to view its history logs.


If the page you are redirected to returns a 404 error, the JobHistoryServer service has not been started.
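In that case start the service and refresh the page. On Hadoop 3 the daemon can be started either with the legacy mr-jobhistory-daemon.sh script shown earlier or with the mapred command; a small sketch:

$ mapred --daemon start historyserver   # Hadoop 3 style
$ jps | grep JobHistoryServer           # confirm the daemon is running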

The results can also be viewed in the HDFS web UI.

Location: Utilities > Browse the file system > /output/wc/ > part-r-00000


Shut Down the Cluster

$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [hadoop100]
hadoop100: stopping namenode
hadoop100: stopping datanode
Stopping secondary namenodes [hadoop100]
hadoop100: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
hadoop100: stopping nodemanager
no proxyserver to stop

Install HBase

HBase ships with an embedded ZooKeeper, but it is recommended to use a separately installed ZooKeeper instead.

Install ZooKeeper

Download apache-zookeeper-3.5.9-bin.tar.gz from the official website and install it into /opt/pkg/zookeeper as a standalone deployment. The configuration file is $ZOOKEEPER_HOME/conf/zoo.cfg:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/tmp/zookeeper/data
dataLogDir=/opt/tmp/zookeeper/dataLog
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
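The dataDir and dataLogDir configured above do not exist yet, so create them before the first start; a minimal sketch, assuming $ZOOKEEPER_HOME/bin has been added to PATH:

$ mkdir -p /opt/tmp/zookeeper/data /opt/tmp/zookeeper/dataLog   # directories referenced by zoo.cfg
$ zkServer.sh start      # start ZooKeeper in standalone mode
$ zkServer.sh status     # should report Mode: standalone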

Download the Installation Package

Download URL: https://www.apache.org/dyn/closer.lua/hbase/2.2.6/hbase-2.2.3-bin.tar.gz

Upload the package to the /opt/download directory on the hadoop100 server and unpack it:

$ cd /opt/download
$ tar -zxvf hbase-2.2.3-bin.tar.gz -C /opt/pkg/

Configure HBase

Modify the HBase configuration file hbase-env.sh:

$ cd /opt/pkg/hbase-2.2.3/conf
$ vim hbase-env.sh

Change the following two entries to the values shown below:

export JAVA_HOME=/opt/pkg/java
export HBASE_MANAGES_ZK=false  

Modify the hbase-site.xml file:

$ vim hbase-site.xml

The content is as follows:

<configuration>
    <!-- HBase data directory on HDFS -->
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop100:8020/hbase</value>
    </property>

    <!-- Run HBase in distributed mode (with an external ZooKeeper) -->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>

    <!-- ZooKeeper quorum used by HBase -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop100:2181</value>
    </property>

    <!-- Port of the HMaster web UI -->
    <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
    </property>

    <!-- Relax the stream capability check for non-HA HDFS -->
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>

Modify the regionservers file to specify the hostnames of the HBase worker nodes:

$ vim regionservers
hadoop100

Add the HBase environment variables:

export HBASE_HOME=/opt/pkg/hbase-2.2.3
export PATH=$PATH:$HBASE_HOME/bin

Re-run /etc/profile to make the environment variables take effect:

source /etc/profile 

Starting and stopping HBase:

HDFS and ZooKeeper must already be running before HBase is started.

If HDFS is not running yet, start it first:

$ start-dfs.sh

If ZooKeeper is not running yet, start it with:

$ zkServer.sh start conf/zoo.cfg

Start the HBase cluster with the following command:

$ start-hbase.sh

After startup, run jps to check that the HBase-related processes are running:

$ jps
7601 QuorumPeerMain
1670 NameNode
1975 SecondaryNameNode
6695 HMaster
6841 HRegionServer
1787 DataNode
2491 JobHistoryServer
2236 ResourceManager
6958 Jps
2351 NodeManager

Access HBase with the shell client it provides:

$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.3, r6a830d87542b766bd3dc4cfdee28655f62de3974, 2020年 01月 10日 星期五 18:27:51 CST
Took 0.0025 seconds
hbase(main):001:0> status
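As a further smoke test you can create a throwaway table, write one cell, and read it back; a minimal sketch run in the same hbase shell session (the table name t1 and column family cf are arbitrary examples):

hbase(main):002:0> create 't1', 'cf'
hbase(main):003:0> put 't1', 'row1', 'cf:msg', 'hello'
hbase(main):004:0> scan 't1'
hbase(main):005:0> disable 't1'
hbase(main):006:0> drop 't1'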

Access HBase's web UI:

Open http://hadoop100:16010 in a browser.


Command to stop the HBase processes:

stop-hbase.sh

Install MySQL

Download the RPM Bundle

$ wget http://mirrors.163.com/mysql/Downloads/MySQL-5.7/mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ tar xvf mysql-5.7.33-1.el7.x86_64.rpm-bundle.tar
$ mkdir mysql-jars
$ mv mysql-comm*.rpm mysql-jars/
$ cd mysql-jars/
$ ls
mysql-community-client-5.7.33-1.el7.x86_64.rpm
mysql-community-common-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-5.7.33-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
mysql-community-server-5.7.33-1.el7.x86_64.rpm

Install the packages manually, one by one:

sudo rpm -ivh mysql-community-common-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-compat-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-client-5.7.33-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-server-5.7.33-1.el7.x86_64.rpm

Start the MySQL Service

Start the MySQL service (either of the following commands works):

$ systemctl start mysqld.service

$ service mysqld start

Configure it to start at boot:

$ systemctl enable mysqld.service

Check the running status:

$ systemctl status mysqld.service

mysql   2574   1 1 23:49 ?    00:00:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid

View the process and socket information:

$ sudo netstat -anpl | grep mysql
tcp6    0   0 :::3306         :::*          LISTEN   946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45824  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45818  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45838  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45808  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45816  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45820  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45826  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45822  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45814  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45846  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45828  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45832  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45842  ESTABLISHED 946/mysqld
tcp6    0    0 192.168.186.100:3306  192.168.186.100:45830  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45844  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45812  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45836  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306   192.168.186.100:45834  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45840  ESTABLISHED 946/mysqld
tcp6    0   0 192.168.186.100:3306  192.168.186.100:45810  ESTABLISHED 946/mysqld
unix 2   [ ACC ]   STREAM   LISTENING   20523  946/mysqld      /var/lib/mysql/mysql.sock

Check the port information:

You can see that the MySQL server process mysqld listens on the default port, 3306.

$ sudo netstat -anpl | grep tcp
tcp    0   0 0.0.0.0:22       0.0.0.0:*        LISTEN   1412/sshd     
tcp    0   52 192.168.186.103:22   192.168.186.1:54058   ESTABLISHED 2287/sshd: hadoop 
tcp6    0   0 :::3306         :::*          LISTEN   2574/mysqld    
tcp6    0   0 192.168.186.103:3888  :::*          LISTEN   2060/java     
tcp6    0   0 :::22          :::*          LISTEN   1412/sshd     
tcp6    0   0 :::37791        :::*          LISTEN   2060/java     
tcp6    0    0 :::2181         :::*          LISTEN   2060/java 

Change the Password

Find the temporary password:

The first login to MySQL requires root's temporary password, which is randomly generated during installation and written to the MySQL server log:

$ grep "temporary password" /var/log/mysqld.log
2020-02-26T17:05:45.104999Z 1 [Note] A temporary password is generated for root@localhost: bl/!6qaU.wuX
$ mysql -u root -p

Enter the temporary password at the prompt to log in to the MySQL client.

Adjust the password security policy:

On the first login, the MySQL client soon prompts you to change the default password:

mysql> show databases;
ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.

mysql> alter user 'root'@'localhost' identified by 'niit1234';
ERROR 1819 (HY000): Your password does not satisfy the current policy requirements

Under the default password policy, the new password must contain uppercase and lowercase letters, digits, and special characters. If security is not a concern in this learning environment, the policy can be relaxed:

mysql> set global validate_password_policy=0;
mysql> set global validate_password_length=1;

Set a new root password and enable remote access:

Remote access to the MySQL server is disabled by default for security reasons. We broaden the root account's privileges so that it can connect from any IP:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'niit1234' WITH GRANT OPTION;

The change has not taken effect yet, so flush the privileges:

mysql> FLUSH PRIVILEGES;

Change the Default Character Set

Edit the MySQL configuration file:

vi /etc/my.cnf

Append the following at the end of the file:

# Default character set for server-side operations
character-set-server=utf8mb4
# Default collation for server-side operations
collation-server=utf8mb4_general_ci
# Default storage engine
default-storage-engine=InnoDB
# On each new connection, set the following character sets:
# character_set_client
# character_set_results
# character_set_connection
init_connect='set names utf8mb4'

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

Restart the MySQL service:

$ sudo service mysqld restart
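To confirm that the new character set is in effect after the restart, you can query the server variables; a quick check (root password as set earlier):

$ mysql -uroot -pniit1234 -e "show variables like 'character_set_%';"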

Install Hive

Download the Installation Package

Download the Hive installation package apache-hive-3.1.2-bin.tar.gz from the official website.

Planned installation directory: /opt/pkg/hive

Upload the package to the hadoop100 server.

Unpack to the Installation Path

Unpack the package into the planned directory /opt/pkg/:

$ cd /opt/download/
$ tar -xzvf apache-hive-3.1.2-bin.tar.gz -C /opt/pkg/

Modify the Configuration Files

Go to the Hive installation directory:

$ cd /opt/pkg

Rename the Hive directory:

$ mv apache-hive-3.1.2-bin/ hive

Edit hive-site.xml in the /opt/pkg/hive/conf directory. The file does not exist by default and must be created manually:

$ cd /opt/pkg/hive/conf/
$ vim hive-site.xml

Enter insert mode and add the following content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <!-- JDBC connection URL of the metastore database -->
   <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false</value>
   </property>

   <!-- JDBC driver class -->
   <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
       <value>com.mysql.jdbc.Driver</value>
   </property>

   <!-- MySQL user for the metastore -->
   <property>
       <name>javax.jdo.option.ConnectionUserName</name>
       <value>root</value>
   </property>

   <!-- Default warehouse location on HDFS -->
   <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>/user/hive/warehouse</value>
   </property>

   <!-- MySQL password for the metastore -->
   <property>
       <name>javax.jdo.option.ConnectionPassword</name>
       <value>niit1234</value>
   </property>

   <property>
       <name>hive.metastore.schema.verification</name>
       <value>false</value>
   </property>

   <property>
       <name>hive.metastore.event.db.notification.api.auth</name>
       <value>false</value>
   </property>

   <!-- Show the current database in the CLI prompt -->
   <property>
       <name>hive.cli.print.current.db</name>
       <value>true</value>
   </property>

   <!-- Print column headers in query results -->
   <property>
       <name>hive.cli.print.header</name>
       <value>true</value>
   </property>

   <!-- Host and port of the HiveServer2 Thrift service -->
   <property>
       <name>hive.server2.thrift.bind.host</name>
       <value>hadoop100</value>
   </property>

   <property>
       <name>hive.server2.thrift.port</name>
       <value>10000</value>
   </property>
</configuration>

Create the Hive log directory:

$ mkdir /opt/pkg/hive/logs/

Rename the log configuration template to hive-log4j2.properties:

$ pwd
/opt/pkg/hive/conf

$ mv hive-log4j2.properties.template hive-log4j2.properties
$ vim hive-log4j2.properties   # edit the file

Change the value of the hive.log.dir property in this file:

# Set the path where Hive's log files are stored, which makes troubleshooting easier

hive.log.dir=/opt/pkg/hive/logs/

Copy the MySQL Driver

Because Hive reads and writes its metadata in MySQL, the MySQL JDBC driver must be placed in Hive's lib directory.

Upload the MySQL driver, e.g. mysql-connector-java-5.1.38.jar, to the /opt/download/ directory, then copy it:

$ cp mysql-connector-java-5.1.38.jar /opt/pkg/hive/lib/

Resolve the Logging Jar Conflict

# Go to the lib directory
$ cd /opt/pkg/hive/lib/

# Rename the jar (or simply delete it)
$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar.bak

Configure the Hive Environment Variables

# HIVE
export HIVE_HOME=/opt/pkg/hive
export PATH=$PATH:$HIVE_HOME/bin

# HCATLOG
export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$PATH:$HCAT_HOME/bin

Afterwards, do not forget to source the file so that the changes to the environment variables take effect.
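For example, if you keep these variables in ~/.bash_profile (the same file used later for the Sqoop variables; adjust the path if you put them in /etc/profile.d/ instead), the change can be applied like this:

$ vi ~/.bash_profile       # append the HIVE_HOME / HCAT_HOME lines shown above
$ source ~/.bash_profile   # reload the profile in the current shell
$ hive --version           # verify that the hive command is now found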

Initialize the Metastore Database

Open a MySQL client and connect to the MySQL service with user root and password niit1234, then create the Hive metastore database. The database name must match the one configured in hive-site.xml:

$ mysql -uroot -pniit1234
create database metastore;
show databases;

Exit MySQL:

exit;

Initialize the metastore schema:

$ schematool -initSchema -dbType mysql -verbose

Seeing "schemaTool completed" means the initialization succeeded.
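If you want to double-check, the metastore database should now contain Hive's internal tables; an optional peek (credentials as configured above):

$ mysql -uroot -pniit1234 -e "USE metastore; SHOW TABLES;" | head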

Start the Hive Services

Prerequisite: the Hadoop cluster and the MySQL service are already running. Then execute:

nohup hive --service metastore >/tmp/metastore.log 2>&1 &
nohup hive --service hiveserver2 >/tmp/hiveServer2.log 2>&1 &
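HiveServer2 can take a minute or so before it starts listening; before trying to connect, you can check the processes and the port (10000, as configured in hive-site.xml):

$ jps | grep RunJar                  # the metastore and hiveserver2 both show up as RunJar
$ sudo netstat -anpl | grep 10000    # the port should appear in LISTEN state once ready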

Verify the Hive Installation

From any directory on hadoop100, start Hive's command-line client beeline:

$ beeline
Beeline version 3.1.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hadoop
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name |
+----------------+
| default    |
+----------------+
2 rows selected (1.93 seconds)

When prompted for the password, just press Enter. If you see output like the above, Hive has been installed successfully.

Exit the client:

0: jdbc:hive2://localhost:10000> !quit
Closing: 0: jdbc:hive2://localhost:10000 quit;

Troubleshooting a Failed beeline Connection

If HiveServer2 is running normally on local port 10000 but connecting with beeline fails with WARN jdbc.HiveConnection: Failed to connect to localhost:10000,
the most likely cause is Hadoop's proxy-user mechanism, which does not allow an upper-layer system to act directly as the real user.

Solution: add the following to Hadoop's core configuration file core-site.xml:

<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
The XXX in hadoop.proxyuser.XXX is the proxy user name, which is set to hadoop here.
Then restart Hadoop, otherwise the change above will not take effect:

$ stop-all.sh

$ start-dfs.sh
$ start-yarn.sh

Running the following command again should now connect successfully via beeline:

beeline -u jdbc:hive2://localhost:10000

Install Sqoop

Download and Install Sqoop

Download URL: http://archive.apache.org/dist/sqoop/1.4.7/

wget http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Unpack sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz into the target directory:

tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt/pkg/

Rename the Sqoop installation directory:

$ cd /opt/pkg/
$ mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop

Configure the Sqoop Environment Variables

$ vi ~/.bash_profile

#sqoop
export SQOOP_HOME=/opt/pkg/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

Configure the environment variables of the components Sqoop depends on by editing sqoop-env.sh:

$ mv conf/sqoop-env-template.sh conf/sqoop-env.sh
$ vi conf/sqoop-env.sh 

# Set Hadoop-specific environment variables here.

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/pkg/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/pkg/hadoop

#set the path to where bin/hbase is available
export HBASE_HOME=/opt/pkg/hbase-2.2.3

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/pkg/hive

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/pkg/zookeeper/conf

Edit sqoop-site.xml with the configuration shown below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

 <property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>true</value>
  <description>If true, Sqoop will connect to a local metastore
   for job management when no other metastore arguments are
   provided.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:file:/tmp/sqoop-meta/meta.db;shutdown=true</value>
  <description>The connect string to use when connecting to a
   job-management metastore. If unspecified, uses ~/.sqoop/.
   You can specify a different path here.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.client.autoconnect.username</name>
  <value>SA</value>
  <description>The username to bind to the metastore.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.client.autoconnect.password</name>
  <value></value>
  <description>The password to bind to the metastore.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
  <description>If true, allow saved passwords in the metastore.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.server.location</name>
  <value>/tmp/sqoop-metastore/shared.db</value>
  <description>Path to the shared metastore database files.
   If this is not set, it will be placed in ~/.sqoop/.
  </description>
 </property>

 <property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
  <description>Port that this metastore should listen on.
  </description>
 </property>

</configuration>

Modify configure-sqoop

If Accumulo is not installed, comment out the checks related to ACCUMULO_HOME:

$ vi bin/configure-sqoop
 94 #if [ -z "${ACCUMULO_HOME}" ]; then
 95 # if [ -d "/usr/lib/accumulo" ]; then
 96 #  ACCUMULO_HOME=/usr/lib/accumulo
 97 # else
 98 #  ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
 99 # fi
100 #fi

140 #if [ ! -d "${ACCUMULO_HOME}" ]; then
141 # echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
142 # echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
143 #fi

This avoids warnings like the following at runtime:

Warning: /opt/pkg/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.

Upload the MySQL driver (for MySQL 5.7, the driver should be a 5.x version) into the lib directory of the Sqoop installation:

$ cp mysql-connector-java-5.1.44-bin.jar /opt/pkg/sqoop/lib/

Copy or symlink $HIVE_HOME/lib/hive-common-3.1.2.jar into $SQOOP_HOME/lib:

$ ln -s /opt/pkg/hive/lib/hive-common-3.1.2.jar /opt/pkg/sqoop/lib

If JSON parsing is needed, download java-json.jar and place it in Sqoop's lib directory.

Download URL: http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm

$ cp java-json.jar /opt/pkg/sqoop/lib/

If Avro serialization is needed, copy or symlink the Avro jar shipped with Hadoop into Sqoop's lib directory:

$ ln -s /opt/pkg/hadoop/share/hadoop/common/lib/avro-1.7.7.jar /opt/pkg/sqoop/lib/

Exercises

Complete the following exercise:

Exercise 1:

After the Sqoop learning environment is set up, cd into the Sqoop installation directory, run the following command, and observe the output:

$ sqoop version
2020-01-23 23:43:20,287 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

If you see similar output, the basic Sqoop installation is complete.
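As an optional further check, you can verify that Sqoop can reach the MySQL instance installed earlier; a minimal sketch using the root/niit1234 credentials configured above:

$ sqoop list-databases \
    --connect jdbc:mysql://hadoop100:3306/ \
    --username root \
    --password niit1234
# should list the databases, including metastore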
