Using Hive (built on MapReduce)

Reposted article, 2020-03-02 20:10, category: Distributed Systems

1: Creating Tables

(1) Creating a database

hive> create database hadoop;
hive> use hadoop;

The database is stored under the directory hdfs://ns1/user/hive/warehouse/hadoop.db

(2) Creating a table

hive> create table t_order(id int,name string,container string,price double)
    > row format delimited                                                  
    > fields terminated by '\t';

(3) Creating a table with array columns

-- array
create table tab_array(a array<int>,b array<string>)
row format delimited
fields terminated by '\t'             -- fields are separated by \t
collection items terminated by ',';   -- array elements are separated by ,

select a[0] from tab_array;
select * from tab_array where array_contains(b,'word');
insert into table tab_array select array(0),array(name,ip) from tab_ext t;
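
To make the two delimiters concrete, here is a minimal Python sketch (not Hive code) of how one text line would be read for tab_array under the row format above. The function name parse_tab_array_row and the sample line are made up for illustration.

```python
def parse_tab_array_row(line):
    """Split one text line the way tab_array's row format describes it."""
    # Fields are separated by a tab; elements inside each array by a comma.
    a_field, b_field = line.rstrip("\n").split("\t")
    a = [int(x) for x in a_field.split(",")]   # column a: array<int>
    b = b_field.split(",")                     # column b: array<string>
    return a, b


# Hypothetical sample line: "1,2,3<TAB>hello,word"
a, b = parse_tab_array_row("1,2,3\thello,word")
print(a[0])          # corresponds to: select a[0] from tab_array
print("word" in b)   # corresponds to: array_contains(b,'word')
```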

(4) Creating a table with a map column

-- map
create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by '\t'             -- fields are separated by \t
collection items terminated by ','    -- map entries are separated by ,
map keys terminated by ':';           -- each key is separated from its value by :

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name,map('name',name,'ip',ip) from tab_ext;
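
The layout of a line in tab_map.txt follows from the three delimiters. The Python sketch below is only an illustration of that layout, not how Hive itself is implemented. The helper names and the sample line are assumptions.

```python
def parse_map_field(field, item_sep=",", kv_sep=":"):
    """Turn 'k1:v1,k2:v2' into a dict, mirroring a map<string,string> column."""
    info = {}
    for item in field.split(item_sep):
        key, value = item.split(kv_sep, 1)
        info[key] = value
    return info


def parse_tab_map_row(line):
    """Split one text line into (name, info) using \\t between fields."""
    name, info_field = line.rstrip("\n").split("\t")
    return name, parse_map_field(info_field)


# Hypothetical sample line: "tom<TAB>age:18,city:beijing"
name, info = parse_tab_map_row("tom\tage:18,city:beijing")
print(name, info["city"])
```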

(5) Creating a table with a struct column

create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name,named_struct('age',id,'tel',name,'addr',country) from tab_ext;
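
Unlike a map, a struct has no keys in the data file: its members appear in declaration order, separated by the collection-items delimiter. A small Python sketch of that positional layout, with an invented sample line and helper name:

```python
from collections import namedtuple

# Mirrors struct<age:int,tel:string,addr:string>; members are positional.
Info = namedtuple("Info", ["age", "tel", "addr"])


def parse_tab_struct_row(line):
    """Split one text line into (name, Info) for the tab_struct layout."""
    name, info_field = line.rstrip("\n").split("\t")
    age, tel, addr = info_field.split(",")
    return name, Info(int(age), tel, addr)


# Hypothetical sample line: "tom<TAB>18,13800000000,beijing"
name, info = parse_tab_struct_row("tom\t18,13800000000,beijing")
print(info.age, info.addr)   # in HQL the members would be read as info.age, info.addr
```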

2: Importing Data Files

File contents:

[hadoop@hadoopH1 ~]$ cat order.txt 
00001001        iphone5 32G     4999
00001002        iphone6S        128G    9999
00001003        xiaomi6x        32G     2999
00001004        honor   32G     3999

(1) Importing with HQL in Hive

1. Load data from the local filesystem into a Hive table (in effect, the file is uploaded to the Hive-managed directory on HDFS)

load data local inpath '/home/hadoop/order.txt' into table t_order;

The data is actually copied into the HDFS filesystem.

(2) Uploading data directly into the table's HDFS directory

[hadoop@hadoopH1 ~]$ cat order_1.txt 
00002001        redmi   32G     3999
00002002        geli    128G    1999
00002003        xiami6x 32G     999
00002004        huawei  32G     3999

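Use the Hadoop command: hadoop fs -put order_1.txt /user/hive/warehouse/hadoop.db/t_order/

(3) Importing with load data

1. Load data from the local filesystem into a Hive table (in effect, the file is uploaded to the Hive-managed directory on HDFS)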

load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

2. Load data from HDFS into a Hive table (in effect, the file is moved from its original directory into the Hive-managed directory)

load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;
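
The difference between the two forms is copy versus move. The Python sketch below imitates it on an ordinary scratch directory, so it only models the behaviour and does not touch HDFS. All paths and function names in it are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def load_local(src, table_dir):
    """Mimic `load data local inpath`: copy the file, leaving the source in place."""
    return Path(shutil.copy(str(src), str(table_dir)))


def load_from_hdfs(src, table_dir):
    """Mimic `load data inpath`: move the file, so the source path disappears."""
    return Path(shutil.move(str(src), str(table_dir)))


# Hypothetical layout inside a scratch directory.
root = Path(tempfile.mkdtemp())
table_dir = root / "warehouse" / "tab_demo"
table_dir.mkdir(parents=True)

local_file = root / "ip.txt"
local_file.write_text("1,tom,10.0.0.1,cn\n")
staged_file = root / "data.log"
staged_file.write_text("2,ann,10.0.0.2,us\n")

copied = load_local(local_file, table_dir)
moved = load_from_hdfs(staged_file, table_dir)

print(local_file.exists(), copied.exists())    # source kept, copy present
print(staged_file.exists(), moved.exists())    # source gone, moved file present
```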

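(4) Importing with insert (older Hive versions do not support inserting a single row, and insert ... values is only available from Hive 0.14 on; overwrite with select is the usual way to copy data between tables)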

insert overwrite table tab_ip_seq select * from tab_ext;

3: Querying Data

(1) select queries

select * from t_order;

(2) A select query that launches a MapReduce job

 select count(*) from t_order;

This query launches a MapReduce job to process the data.
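
Conceptually, a count(*) can be expressed as a map step that emits a 1 per row and a reduce step that sums them. The Python sketch below shows only that idea. It is not the plan Hive actually generates, and the function names are made up.

```python
from functools import reduce


def map_phase(rows):
    """Each input row contributes a single 1, whatever its content."""
    return [1 for _ in rows]


def reduce_phase(counts):
    """Sum the partial counts into the final count(*) value."""
    return reduce(lambda left, right: left + right, counts, 0)


# The four rows of order.txt shown earlier, tab-separated.
rows = [
    "00001001\tiphone5\t32G\t4999",
    "00001002\tiphone6S\t128G\t9999",
    "00001003\txiaomi6x\t32G\t2999",
    "00001004\thonor\t32G\t3999",
]
total = reduce_phase(map_phase(rows))
print(total)
```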

4: Other Ways to Create Tables

(1) External tables

Create a table over data that already sits in another HDFS directory

CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING)
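 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE           -- store the table as plain text
 LOCATION '/external/user';   -- read the data from /external/user on HDFS

The data does not need to be moved into Hive's own warehouse directory.

(2) Creating a table with AS and select (useful for temporary tables that hold intermediate results)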

CREATE TABLE t_order_sel
   AS SELECT id new_id, name new_name, price new_price
FROM t_order
SORT BY new_id;

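This launches a MapReduce job and stores the result in the new table's directory as its data.

(3) insert into ... select ... can be used to append intermediate data to an existing temporary table

5: Partitions

Without partitions, a query with group by or where scans the whole dataset. If partitions are declared when the table is created, later queries can either restrict themselves to specific partitions or still run over all the data; restricting to a partition reads less data and is faster.

The partition column is declared in partitioned by, separately from the table's regular columns, and must not repeat one of them (here year is not in the column list).

A table created with partitions also needs the target partition to be specified when data is loaded.

Create the table: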

create table tab_ip_part(id int,name string,ip string,country string) 
    partitioned by (year string)
    row format delimited fields terminated by ',';

Load the data:

load data local inpath '/home/hadoop/data.log' overwrite into table tab_ip_part
     partition(year='1990');
    
load data local inpath '/home/hadoop/data2.log' overwrite into table tab_ip_part
     partition(year='2000');

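In practice, loading data into a partition creates a partition directory under the table directory and places the data in it.

Show the partitions: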
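
Partition directories are named column=value. The Python sketch below builds the expected paths for the two loads above. It assumes tab_ip_part was created in the hadoop database from section 1, and the helper partition_dir is hypothetical.

```python
import posixpath


def partition_dir(table_dir, **partition):
    """Build the directory for a partition spec such as year='1990'."""
    parts = [f"{column}={value}" for column, value in partition.items()]
    return posixpath.join(table_dir, *parts)


# Assumption: the table lives in the hadoop database created in section 1.
table_dir = "/user/hive/warehouse/hadoop.db/tab_ip_part"
for year in ("1990", "2000"):
    print(partition_dir(table_dir, year=year))
```

A query with where year='1990' then only needs to read the first of these directories.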

show partitions tab_ip_part;

6: Modifying Table Metadata with alter

alter table tab_ip change id id_alter string;   -- change a column
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';   -- add a partition

7: Bucketing with clustered by

Create the table:

create table tab_ip_cluster(id int,name string,ip string,country string)
clustered by(id) into 3 buckets;

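Load the data:

-- load data only copies the file as-is and does not split it into buckets
load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
-- needed on older Hive versions so that insert distributes rows into buckets
set hive.enforce.bucketing=true;
insert into table tab_ip_cluster select * from tab_ip;

Query the data: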

select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id);
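
The Python sketch below models how rows are assigned to 3 buckets and what bucket 2 out of 3 selects. It assumes the older behaviour in which an int column hashes to its own value. Newer Hive versions may use a different hash function, so treat the exact assignment as illustrative. The function names and sample rows are made up.

```python
NUM_BUCKETS = 3


def bucket_index(row_id, num_buckets=NUM_BUCKETS):
    """Zero-based bucket for an int key (assumption: the hash is the value itself)."""
    return (row_id & 0x7FFFFFFF) % num_buckets


def tablesample(rows, bucket, out_of, key=lambda row: row[0]):
    """Keep rows falling into the 1-based `bucket` when hashing into `out_of` buckets."""
    return [row for row in rows if bucket_index(key(row), out_of) == bucket - 1]


# Hypothetical rows: (id, name)
rows = [(i, f"user{i}") for i in range(1, 10)]
sample = tablesample(rows, bucket=2, out_of=3)
print([row[0] for row in sample])
```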

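8: Running Hive Statements from the Shell

From the shell, scripting languages such as shell or Python can be used to run HQL statements in batches

hive -S -e 'select country,count(*) from tab_ext group by country' > /home/hadoop/hivetemp/e.txt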
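
For batch execution from Python, one possible wrapper around the same hive -S -e invocation is sketched below. It assumes the hive executable is on the PATH. The function names are hypothetical, and nothing is executed until run_hql is called.

```python
import subprocess


def build_hive_command(query):
    """Return the argument list for a silent (-S), one-shot (-e) hive call."""
    return ["hive", "-S", "-e", query]


def run_hql(query, output_path):
    """Run one HQL statement and redirect its standard output to a file."""
    with open(output_path, "w") as out:
        subprocess.run(build_hive_command(query), stdout=out, check=True)


def run_batch(jobs):
    """Run several (query, output_path) pairs one after another."""
    for query, output_path in jobs:
        run_hql(query, output_path)
```

A call such as run_batch([("select count(*) from t_order", "/home/hadoop/hivetemp/count.txt")]) would then run each statement in turn; the output path here is only an example.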

9: User-Defined Functions (the same example as before: deriving the region from a phone number)

select getarea(phoneNB),upflow,download from t_flow;

(1) Raw data

1389990045    239    300
1385566005    229    435
1385566005    192    256
1389990045    23    84
1390876045    682    432
1385566005    134    300
1390876045    378    656
1390876045    346    123
1389990045    78    352

(2) Expected result format

1389990045    beijing    239    300
1385566005    nanjin    229    435
1385566005    nanjin    192    256
1389990045    beijing    23    84
1390876045    shenyang    682    432
1385566005    nanjin    134    300
1390876045    shenyang    378    656
1390876045    shenyang    346    123
1389990045    beijing    78    352

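(3) Implementing the function

1. Write a Java class that implements the logic above, package it as a jar, and upload it to Hive's lib directory
2. Create a function named getarea in Hive and bind it to the custom Java class in the jar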

package cn.hadoop.hive;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class phoneNBToArea extends UDF{
    public static HashMap<String,String> areamap = new HashMap<>();
    
    static {
        areamap.put("1389", "beijing");
        areamap.put("1385", "nanjin");
        areamap.put("1390", "shenyang");
    }
    
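    // Hive resolves evaluate() by reflection; define one overload per argument signature
    public String evaluate(String phoneNB) {
        if (phoneNB == null) return null;  // pass SQL NULL through
        String prefix = phoneNB.length() >= 4 ? phoneNB.substring(0, 4) : phoneNB;
        String area = areamap.get(prefix);
        return phoneNB + "    " + (area == null ? "nowhere" : area);
    }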
}
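
To check the lookup logic against the expected result in (2) without building the jar, the same prefix mapping can be mirrored in a few lines of Python. This is only a test aid under that assumption, not part of the UDF, and the name get_area is made up.

```python
AREA_BY_PREFIX = {
    "1389": "beijing",
    "1385": "nanjin",
    "1390": "shenyang",
}


def get_area(phone_nb):
    """Mirror the UDF: phone number, four spaces, then the area or 'nowhere'."""
    area = AREA_BY_PREFIX.get(phone_nb[:4], "nowhere")
    return phone_nb + "    " + area


for phone in ("1389990045", "1385566005", "1390876045", "1370000000"):
    print(get_area(phone))
```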

(4) Adding the jar to Hive

hive> add jar /home/hadoop/hive.jar;
hive> create temporary function getarea as 'cn.hadoop.hive.phoneNBToArea';

(5) Creating the table

create table t_flow(phoneNB string,upflow int,download int)
row format delimited
fields terminated by '\t';

(6) Loading the data

load data local inpath '/home/hadoop/flow.txt' into table t_flow;

(7) Showing the result

select getarea(phoneNB),upflow,download from t_flow;
