Setting Up and Testing a Hive Environment

This article is a repost. Posted 2017-12-21 12:09, under Hadoop.

Prerequisites: the following software is already installed

Eclipse4.5  hadoop-2.7.3  jdk1.7.0_79

This article builds on the previous one: setting up a highly available ZooKeeper cluster

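What is Hive?

1. Hive is a data warehouse infrastructure built on top of the Hadoop file system. It provides many facilities for managing a data warehouse: ETL (extract, transform, load) tooling, data storage management, and the ability to query and analyze large data sets.

2. Hive also defines a SQL-like language. It maps structured data files onto database tables and offers simple SQL querying. Developers can still plug in their own Mapper and Reducer logic, and Hive translates SQL statements into MapReduce jobs, which makes it a powerful companion to the MapReduce framework.

3. Hive's strength is processing large data sets; it has no advantage on small ones, because its execution latency is high, and most of that latency is incurred while the job is being launched.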

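I. Installing Hive (on CloudDeskTop only)

Upload the installation archive to the /software directory.

Download location: http://mirrors.shuosc.org/apache/hive/

Extract it under /software and rename the extracted directory: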

[hadoop@CloudDeskTop software]$ mv apache-hive-1.2.2-bin/ hive-1.2.2

Configure Hive:

[hadoop@CloudDeskTop software]$ cd /software/hive-1.2.2/conf/
[hadoop@CloudDeskTop conf]$ cp hive-default.xml.template hive-site.xml
[hadoop@CloudDeskTop conf]$ vi hive-site.xml

Line 334: the HDFS location where Hive stores the warehouse (table) data

 333   <property>
 334     <name>hive.metastore.warehouse.dir</name>
 335     <value>/user/hive/warehouse</value>
 336     <description>location of default database for the warehouse</description>
 337   </property>

Line 46

  45   <property>
  46     <name>hive.exec.scratchdir</name>
  47     <value>/tmp/hive</value>
  48     <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  49   </property>

Line 2911

2910   <property>
2911     <name>hive.server2.logging.operation.log.location</name>
2912     <value>/tmp/hive/operation_logs</value>
2913     <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
2914   </property>

Line 51

  50   <property>
  51     <name>hive.exec.local.scratchdir</name>
  52     <value>/tmp/hive</value>
  53     <description>Local scratch space for Hive jobs</description>
  54   </property>

Line 56

  55   <property>
  56     <name>hive.downloaded.resources.dir</name>
  57     <value>/tmp/hive/resources</value>
  58     <description>Temporary local directory for added resources in the remote file system.</description>
  59   </property>

[hadoop@CloudDeskTop conf]$ cp -a hive-log4j.properties.template hive-log4j.properties

[hadoop@CloudDeskTop conf]$ vi hive-log4j.properties

Use [tail -f hive.log] to follow the latest log output in real time.

 17 # Define some default values that can be overridden by system properties
 18 hive.log.threshold=ALL
 19 hive.root.logger=INFO,DRFA
    # The logs directory must be created manually; it holds the logs produced while you work with Hive
 20 hive.log.dir=/software/hive-1.2.2/logs
 21 hive.log.file=hive.log

 72 #log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter
 73 log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
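
Because the logs directory has to exist before Hive writes to it, the step can be scripted. The following is a minimal sketch, assuming the properties file contains an uncommented hive.log.dir= line as shown above; ensure_hive_log_dir is a hypothetical helper name, not part of Hive.

```shell
# Create the directory named by hive.log.dir in a hive-log4j.properties file.
# ensure_hive_log_dir is a hypothetical helper, not something Hive ships.
ensure_hive_log_dir() {
    props="$1"
    # Take the last uncommented hive.log.dir= assignment in the file.
    log_dir=$(sed -n 's/^hive\.log\.dir=//p' "$props" | tail -n 1)
    if [ -z "$log_dir" ]; then
        echo "hive.log.dir not set in $props" >&2
        return 1
    fi
    # mkdir -p is a no-op when the directory already exists.
    mkdir -p -- "$log_dir" && echo "$log_dir"
}
```

With the paths used in this article it would be called as: ensure_hive_log_dir /software/hive-1.2.2/conf/hive-log4j.properties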

[hadoop@CloudDeskTop software]$ vi /software/hive-1.2.2/bin/hive-config.sh

Append at the end of the file:

export JAVA_HOME=/software/jdk1.7.0_79
export HADOOP_HOME=/software/hadoop-2.7.3
export HIVE_HOME=/software/hive-1.2.2

II. Preparation before starting Hive:

【1. Start the ZooKeeper cluster on the slave nodes (the nodes elect one leader; the others become followers)】

  cd /software/zookeeper-3.4.10/bin/ && ./zkServer.sh start && cd - && jps
  cd /software/zookeeper-3.4.10/bin/ && ./zkServer.sh status && cd -

【2. Start the HDFS cluster on master01】cd /software/ && start-dfs.sh && jps

【3. Start the YARN cluster on master01】cd /software/ && start-yarn.sh && jps

【Starting YARN does not bring up the ResourceManager on the standby master, so run the following on master02:】

cd /software/ && yarn-daemon.sh start resourcemanager && jps

【4. Check which of the two masters holds the active NameNode:】
[hadoop@master01 software]$ hdfs haadmin -getServiceState nn1
active (active node)
[hadoop@master01 software]$ hdfs haadmin -getServiceState nn2
standby (standby node)

【5. Check which of the two ResourceManagers is active:】
[hadoop@master01 hadoop]$ yarn rmadmin -getServiceState rm1
active (active)
[hadoop@master01 hadoop]$ yarn rmadmin -getServiceState rm2
standby (standby)

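Start Hive:

【At this point the following are generated in the current directory:】

【If a problem occurs, delete the following two files】

【If the problem persists, try deleting the metastore_db directory outright】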
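
The cleanup described above can be sketched as a small helper. This is an assumption-laden sketch: the original screenshots are not available, so it assumes the two generated entries are derby.log and metastore_db, which is what Hive's embedded Derby metastore creates in the directory Hive is started from. reset_embedded_metastore is a hypothetical name.

```shell
# Remove what an embedded-Derby Hive metastore leaves in a directory.
# ASSUMPTION: the two generated entries are derby.log and metastore_db.
# WARNING: deleting metastore_db discards all table metadata.
reset_embedded_metastore() {
    dir="${1:-.}"
    for entry in derby.log metastore_db; do
        if [ -e "$dir/$entry" ]; then
            rm -rf -- "$dir/$entry"
            echo "removed $dir/$entry"
        fi
    done
}
```

Since the metadata is lost along with metastore_db, this only makes sense on a throwaway test installation such as the one in this article.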


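III. Testing Hive

【The tests below follow the directory layout shown in the following two figures:】

【1. First, cd /home/hadoop/test/hive/src】

Create the t_user data file and upload it to the HDFS cluster: [hdfs dfs -put t_user /user/hive/warehouse/mmzs.db/t_user/]

Create the myuser data file and upload it to the HDFS cluster: [hdfs dfs -put myuser /user/hive/warehouse/mmzs.db/t_user/]

Create the myuser02 data file but do not upload it yet; it is loaded later by a different method.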
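
The contents of these data files are not shown in the article. As a rough sketch, a tab-delimited file matching the four columns of mmzs.t_user (userId, username, userage, userheight) could be generated like this; the rows are invented for illustration and make_t_user_sample is a hypothetical helper name.

```shell
# Write a small tab-delimited sample file for mmzs.t_user.
# The rows are invented; columns: userId, username, userage, userheight.
make_t_user_sample() {
    out="$1"
    # printf reuses the format for each group of four arguments,
    # so this emits one line per row.
    printf '%s\t%s\t%s\t%s\n' \
        1 ligang 25 1.70 \
        2 zhangsan 31 1.82 \
        3 wangwu 28 1.65 > "$out"
}
```

For example: make_t_user_sample /home/hadoop/test/hive/src/t_user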

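【2. Then, cd /software/hive-1.2.2/bin】

【Create the test database:】create database mmzs;

【Run a SQL statement without entering the ./hive shell】 echo -e "select * from mmzs.t_user;"|./hive -S

【3. Write the SQL into a test.sql file, then execute it and write the result to a file in a chosen directory】

【Run SQL statements on the fly without entering the ./hive shell】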
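
The command for item 3 is not preserved in the article. One plausible way to do it, sketched here under that caveat, is the Hive CLI's -f option (execute a SQL file) combined with -S (silent mode) and shell redirection. HIVE_BIN and run_sql_file are hypothetical names introduced for this sketch; HIVE_BIN defaults to ./hive as used throughout this article.

```shell
# Run a SQL file through the Hive CLI in silent mode and save the result.
# HIVE_BIN is a hypothetical override; it defaults to ./hive.
run_sql_file() {
    sql_file="$1"
    out_file="$2"
    # -S suppresses progress messages, -f executes the statements in the file.
    "${HIVE_BIN:-./hive}" -S -f "$sql_file" > "$out_file"
}
```

For example, from /software/hive-1.2.2/bin: run_sql_file /home/hadoop/test/hive/src/test.sql /home/hadoop/test/hive/src/result.txt (both paths are illustrative).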

【4. Create a table in the Hive database; [row format delimited fields terminated by '\t'] specifies that the fields in each row are separated by a tab character】

  create table mmzs.t_user(userId int, username string, userage int, userheight double) row format delimited fields terminated by '\t';

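【Another way to load a data file into a Hive table: load data is the loading keyword. With local, the local file is copied into the Hive table; without local, the data is imported from the HDFS cluster and the file is moved (cut) rather than copied】

【Load a local file without overwriting】./hive -S -e "load data local inpath '/home/hadoop/test/hive/src/myuser02' into table mmzs.t_user;"

【Load a local file with overwrite】./hive -S -e "load data local inpath '/home/hadoop/test/hive/src/myuser02' overwrite into table mmzs.t_user;"

【Load an HDFS file without overwriting】./hive -S -e "load data inpath '/data/myuser' into table mmzs.t_user;"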

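【Summary:】Notes:

Uploading with HDFS and importing a local file with load data are both, in essence, file transfers.

If the file comes from the local file system it is copied; if it comes from HDFS it is moved.

With the overwrite keyword, load data first empties the Hive table and then transfers the data, that is, the table is overwritten. Without overwrite the data file is appended: if a file of the same name already exists, the new file is stored under that name with the suffix "_copy_N", where N is a number.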

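【5. Insert one row into the t_user table in Hive: this launches a map task】

  insert into mmzs.t_user(userid,username,userage,userheight) values (9,'zhaoyuan',34,1.78);

Execution is fairly slow; you can follow the job's progress at 【http://<IP address of the active master>:8088】.

【6. Hive supports most SQL statements, but some are not supported, for example:】

  insert into mmzs.t_user select 1,'zhaoyun02',78,1.90;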

【7. Aggregate functions involve a counting pass, so they launch both a MapTask and a ReduceTask】

  select count(userid) from t_user ;

  select avg(userage) from t_user;

【8. Non-aggregate functions do not launch a Job】

  select username,reverse(username),userage from mmzs.t_user;

  select username,length(username),userage from mmzs.t_user;

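【9. Downloading Hive query results】

Download to the local file system: (1 Job)

Download to the HDFS cluster: (3 Jobs are reported; the author's guess was that this is because there are 3 slaves and therefore 3 replicas)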

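【10. Create a table and populate it by means of a subquery: 3 Jobs are displayed (Total jobs is only the planned number of jobs), but only 1 actually ran; check in the browser, or note that only "Launching Job 1 out of 3" appears】

  create table mmzs.t_user_new as select * from mmzs.t_user where 1=2; (the copied table reverts to the default field delimiter when reading data)

  insert into mmzs.t_user_new select * from mmzs.t_user where userid=1;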

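【11. Building on item 10, add data to the table through the HDFS cluster】

Through HDFS, copy the data file of the original table directly into the new table:

Because the myuser02 data file was created with the Tab key as its delimiter, the following happens:

So we uploaded a new data file that uses ^A as the delimiter (Hive's default delimiter, typed by pressing "ctrl+V" and then "ctrl+A"):

Now the newly added data shows up in the table: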
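
Instead of typing the ^A control character by hand, an existing tab-delimited file can be converted. The sketch below uses tr with the octal escape \001, which is the ^A character; tab_to_ctrl_a is a hypothetical helper name.

```shell
# Convert a tab-delimited file to Hive's default field delimiter ^A (octal 001).
tab_to_ctrl_a() {
    # $1 = tab-delimited input file, $2 = output file using ^A
    tr '\t' '\001' < "$1" > "$2"
}
```

For example: tab_to_ctrl_a myuser02 myuser02_ctrl_a (the output file name is illustrative), after which the converted file can be uploaded to the new table's directory.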

【12. The first round of tests is complete; drop all the test tables】

  hive> show databases;
  hive> use mmzs;
  hive> show tables;
  hive> drop table t_user;
  hive> drop table if exists t_user_new;

IV. Advanced Hive Tests (1)

【Every hive> prompt below comes from running ./hive, or ./hive --hiveconf hive.root.logger=ERROR,console, in the /software/hive-1.2.2/bin directory】

【1. Create the tables needed for the tests】

hive> use mmzs;
hive> create table if not exists emp(eno int,ename string,eage int,bithday date,sal double,com double,gender string,dno int) row format delimited fields terminated by '\t';
hive> create table if not exists dept(dno int,dname string,loc string) row format delimited fields terminated by '\t';

【2. Create the data for the test tables】

cd /home/hadoop/test/hive/src

hdfs dfs -put emp01 /user/hive/warehouse/mmzs.db/emp
hdfs dfs -put dept01 /user/hive/warehouse/mmzs.db/dept

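【3. Start testing】

hive> use mmzs;

【create and insert operations launch a Job (Map only)】

【Statements containing group by, order by, or an aggregate function launch a Job (with both Map and Reduce)】

【A statement containing both group by and order by launches two Jobs (with both Map and Reduce):】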

select dno,gender,count(1) renshu from emp where eage>25 group by gender,dno order by renshu desc;

【Launches one Job (Map only, no Reduce):】

select e.*,d.* from emp e,dept d where e.dno=d.dno;

select eno,ename,e.dno,d.dname from emp e inner join dept d on e.dno=d.dno;

【Sorting on multiple columns: still only one Job (with both Map and Reduce)】

select dno,eno,ename,sal,com from emp order by sal desc,com desc;

【Launches two Jobs, because of the additional group by (with both Map and Reduce)】

select dno,eno,ename,sal,com from emp group by dno,eno,ename,sal,com order by sal desc,com desc;

select d.dno,avg(sal) avgsal from emp e inner join dept d on e.dno=d.dno where eage>20 group by d.dno order by avgsal;

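【Launches three Jobs (with both Map and Reduce); in these tests a subquery is only written in the FROM clause】

select d.dname,avgsal from (select d.dno,avg(sal) avgsal from emp e inner join dept d on e.dno=d.dno where eage>20 group by d.dno order by avgsal)mid,dept d where mid.dno=d.dno;

When a subquery cannot be handled directly, join against the subquery's result instead.

The following statement fails (this version of Hive does not support a scalar subquery in a WHERE comparison):

select e.* from emp e where e.sal>(select avg(sal) from emp where dno=e.dno);

【limit only returns the first N rows; it cannot be used for paging the way MySQL's limit can. It does not launch a Job】

select * from emp limit 6;  //does not launch a Job
select * from emp limit 2,5;//raises an error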

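【Paging in Hive: launches only one Job (with both Map and Reduce)】

select row_number() over() seq,e.* from emp e;
//row_number() is the window function that adds a row number; rows are ordered by the clause inside over() and then numbered in the seq column.
//With an empty over(), the author observed the numbers being assigned to the rows sorted by the first column in descending order; without an explicit order by, that order is not guaranteed.

select * from (select row_number() over(order by sal desc) seq,e.* from emp e) mid where mid.seq>5 and mid.seq<11;  //paged query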
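
The bounds 5 and 11 in the paged query correspond to page 2 with a page size of 5. As a small sketch of that arithmetic, the helper below builds the same query for any page; page_query is a hypothetical name, and the query text is the one used above.

```shell
# Build the row_number() paging query for emp from a 1-based page number
# and a page size. page_query is a hypothetical helper name.
page_query() {
    page="$1"
    size="$2"
    lower=$(( (page - 1) * size ))   # seq of the last row on the previous page
    upper=$(( page * size + 1 ))     # seq of the first row on the next page
    printf 'select * from (select row_number() over(order by sal desc) seq,e.* from emp e) mid where mid.seq>%d and mid.seq<%d;\n' "$lower" "$upper"
}
```

The generated statement can be passed to the CLI, for example: ./hive -S -e "use mmzs; $(page_query 2 5)"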

V. Advanced Hive Tests (2)

【Indexes are not used in these tests: Hive's index support is limited (and was removed in Hive 3.0)】

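【Hive supports partitions: static and dynamic partitions use the same create table syntax】

create table t_user(userid int,username string,userage int) partitioned by(dno int) row format delimited fields terminated by '\t';//single-level partition; dno is the partition column name, and the same name must be used when adding partitions
create table t_user(userid int,username string,userage int) partitioned by(dno int,gender string) row format delimited fields terminated by '\t';//multi-level partition; dno and gender are the partition column names, and the same names must be used when adding partitions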

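【Test data:】

  Change to the directory: cd /home/hadoop/test/hive/src

【Static partitions:】

alter table t_user add partition(dno=1);//add a partition; the partition name is "dno=1"
alter table t_user drop partition(dno=1);//drop the partition named "dno=1"

Add four partitions in total: 1, 2, 3, and 4.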
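
Adding the four partitions one by one is repetitive, so the statements can be generated in a loop. This is a minimal sketch that only prints the alter table statements shown above; add_partition_sql is a hypothetical helper name.

```shell
# Print the alter table statements that add partitions dno=1..4 to t_user.
add_partition_sql() {
    for dno in 1 2 3 4; do
        printf 'alter table t_user add partition(dno=%d);\n' "$dno"
    done
}
```

The output can then be executed in one CLI call, for example: ./hive -S -e "use mmzs; $(add_partition_sql)"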

【Adding data to a statically partitioned table: 4 ways】

// run on the cluster
hdfs dfs -put bak /user/hive/warehouse/mmzs.db/t_user/dno=1//①
// run inside ./hive
load data local inpath '/home/hadoop/test/hive/src/bak' into table t_user partition(dno=1);//②
load data local inpath '/home/hadoop/test/hive/src/bak' overwrite into table t_user partition(dno=1);//② (overwrite variant)
insert into t_user partition(dno=3) select eno,ename,eage from emp where eno<4;//③ all columns must be supplied; the t_user(userid,username,userage) form cannot be used
insert overwrite table t_user partition(dno=4) select eno,ename,eage from emp where eno>4 and eno<9;//④ selects a range of rows

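【Dynamic partitions:】

【Adding data to a dynamically partitioned table: only the insert approach is supported】

//Temporary change that enables dynamic partitioning for this session; it can also be set in hive-site.xml to take effect permanently
set hive.exec.dynamic.partition.mode=nonstrict;//hive-site.xml

insert into t_user partition(dno) values(1,"liganggang",89,1);//insert data; the partition is determined by the last value
insert into t_user partition(dno=5) select eno,ename,eage from emp where eno<4;//a partition that does not exist yet is created automatically

//multi-level dynamic partitions; multi-level static partitions work like single-level static ones
insert into t_user partition(dno,gender) select eno,ename,eage,dno,gender from emp;//the order of dno,gender in the select must match the order in partitioned by(dno,gender) at table creation; otherwise the wrong partitions are created
//changing into to overwrite table overwrites existing data on insert

truncate table t_user;//empty the table
drop table t_user;//drop the table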

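Data migration summary:
A. From local to a Hive table:
  Upload the data file with the HDFS put command
  Import the data file with Hive's load data local inpath syntax
B. From Hive table to Hive table:
  Copy the data files with the HDFS cp command
  Insert a single record with a plain insert into statement
  Copy data in bulk, with conditions, using insert....select...from...
  Copy data with insert overwrite table....select....
C. From a Hive table to local:
  Download the data files with the HDFS get command
  Export Hive table data in bulk, with conditions, using insert overwrite local directory
  Use an output redirection operator (> or >>) to write the select result from standard output directly to a local file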

VI. Advanced Hive Tests (3)

1. Build the data

[hadoop@CloudDeskTop src]$ pwd
/home/hadoop/test/hive/src

 //after the tables have been created
 hdfs dfs -put testarray /user/hive/warehouse/mmzs.db/tuser01
 hdfs dfs -put testmap /user/hive/warehouse/mmzs.db/tuser02
 hdfs dfs -put testarraymap /user/hive/warehouse/mmzs.db/tuser03
 hdfs dfs -put teststruct /user/hive/warehouse/mmzs.db/tuser04

2. Run the tests

#First
[hadoop@CloudDeskTop ~]$ cd /software/hive-1.2.2/bin
[hadoop@CloudDeskTop bin]$ ./hive

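#Dropping an external table does not delete its data;
#An external table does not create a folder under the default warehouse location; the data only ever lives in the specified directory

//The table's data is stored in the specified HDFS directory "/test":
hive> create external table tuser00(userid int,username string,userage int) row format delimited fields terminated by '\t' location '/test';

//To create the external table inside a database, add the database prefix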
hive> use mmzs;
hive> create external table mmzs.tuser00(userid int,username string,userage int) row format delimited fields terminated by '\t' location '/test';

//Create a table with an array column:
hive> create table if not exists mmzs.tuser01(id bigint,name string,loves array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';

hive> select id,name,loves[1] from mmzs.tuser01;

//Create a table with a map column:
hive> create table if not exists mmzs.tuser02(id bigint,name string,info map<string,double>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
//Note: in map keys terminated by ':', the colon separates each key from its value, while the entries themselves are separated by the collection items delimiter ','. Any character can be chosen as long as the data file uses it.

hive> select id,name,info["age"],info['height'] from mmzs.tuser02;//(a key that does not exist returns NULL)

//Create a table with both an array and a map
hive> create table mmzs.tuser03(id bigint,name string,loves array<string>,info map<String,double>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';

//Struct type (info is a STRUCT, so info.height returns the user's height)
hive> create table mmzs.tuser04(id bigint,name string,info struct<age:int,height:double>) row format delimited fields terminated by '\t' collection items terminated by ',';

//(querying a member that does not exist raises an exception)
hive> select id,name,info.age,info.height from mmzs.tuser04;
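
The contents of testarray, testmap, testarraymap and teststruct are not shown in the article. As a hedged sketch, files matching the delimiters declared in the create table statements above (tab between fields, ',' between collection items, ':' between a map key and its value) could be generated as follows. All rows are invented for illustration and make_complex_samples is a hypothetical helper name.

```shell
# Generate sample input files that match the delimiters of tuser01..tuser04.
# All rows are invented for illustration.
make_complex_samples() {
    dir="$1"
    # tuser01: id, name, loves array<string> (items separated by ',')
    printf '1\tligang\treading,running\n2\tzhaoyun\tchess,hiking,music\n' > "$dir/testarray"
    # tuser02: id, name, info map<string,double> (entries by ',', key:value by ':')
    printf '1\tligang\tage:25,height:1.70\n2\tzhaoyun\tage:31,height:1.82\n' > "$dir/testmap"
    # tuser03: id, name, loves array<string>, info map<string,double>
    printf '1\tligang\treading,running\tage:25,height:1.70\n' > "$dir/testarraymap"
    # tuser04: id, name, info struct<age:int,height:double> (members by ',')
    printf '1\tligang\t25,1.70\n2\tzhaoyun\t31,1.82\n' > "$dir/teststruct"
}
```

For example: make_complex_samples /home/hadoop/test/hive/src, followed by the hdfs dfs -put commands from step 1.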
