3 HQL syntax and user-defined functions (with array and map explained) + Hive's Java API

Reposted article, originally published 2016-10-11 15:54. Categories: weekend110 (Hadoop, MapReduce, Zookeeper, Hive, HBase, flume, sqoop, kafka) / Hadoop Hive programming API introductory series / Hadoop Hive concepts study series

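This post covers the following:

　　.Hive's detailed official manual

　　.Data types supported by Hive

　　.Hive Shell

　　.Jars a Hive project depends on

　　.Hive user-defined functions

　　.Bucketing

　　.Attached PPT

Hive's detailed official manual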

http://hive.apache.org/

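https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive supports most standard SQL.

That breadth is why Hive holds such a large share of the market today. Spark SQL, on the Spark side, keeps improving as well.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

There are a great many data types; study them on your own, I won't go through them here.

Writing query results to a local file or to a file in HDFS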

//write to hdfs

insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';    -- the target can be on the local Linux file system; despite the .txt name, Hive treats it as a directory

insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';  -- the target can also be in HDFS

Not demonstrated here.

//array

create table tab_array(a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Sample data

tobenbrone, laihama,woshishui     13866987898,13287654321
abc,iloveyou,itcast     13866987898,13287654321

select a[0] from tab_array;
select * from tab_array where array_contains(b,'word');
insert into table tab_array select array(0),array(name,ip) from tab_ext t;
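
To make the delimiters in the tab_array DDL concrete, here is a minimal Java sketch of how one text line is cut up: columns at the tab, collection items at the comma. It only illustrates the text layout and is not Hive's actual SerDe code; the class and method names (ArrayRowSketch, parseRow) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArrayRowSketch {
    // Splits one line into columns at '\t', then each column into items at ','.
    // Hypothetical helper for illustration only; Hive does this internally.
    public static List<List<String>> parseRow(String line) {
        List<List<String>> columns = new ArrayList<>();
        for (String field : line.split("\t", -1)) {
            columns.add(Arrays.asList(field.split(",", -1)));
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> row = parseRow("abc,iloveyou,itcast\t13866987898,13287654321");
        System.out.println(row.get(0).get(0)); // item 0 of the first column, like a[0]
        System.out.println(row.get(1));        // all items of the second column
    }
}
```

Indexing the first column with get(0) plays the same role as a[0] in the query above.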

//map

create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ';'
map keys terminated by ':';

Sample data:

fengjie                         age:18;size:36A;addr:usa
furong          age:28;size:39C;addr:beijing;weight:180KG

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name,map('name',name,'ip',ip) from tab_ext;

I won't go into more detail here.
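
The map column uses two levels of delimiters: ';' between entries and ':' between key and value. The following Java sketch parses the text form of the info column the same way. It is an illustration under those assumptions, not Hive's own parser, and MapColumnSketch and parseInfo are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapColumnSketch {
    // Parses the text form of a map<string,string> column:
    // entries separated by ';', key and value separated by ':'.
    public static Map<String, String> parseInfo(String text) {
        Map<String, String> info = new LinkedHashMap<>();
        for (String entry : text.split(";")) {
            int colon = entry.indexOf(':');
            if (colon < 0) {
                continue; // skip malformed entries (a choice made for this sketch)
            }
            info.put(entry.substring(0, colon), entry.substring(colon + 1));
        }
        return info;
    }

    public static void main(String[] args) {
        Map<String, String> info = parseInfo("age:28;size:39C;addr:beijing;weight:180KG");
        System.out.println(info.get("addr")); // corresponds to info['addr'] in HQL
    }
}
```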

//struct

create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name,named_struct('age',id,'tel',name,'addr',country) from tab_ext;

I won't go into more detail here.

Hive Shell

//cli shell

hive -S -e 'select country,count(*) from tab_ext group by country' > /home/hadoop/hivetemp/e.txt

This execution mechanism lets us use a scripting language (bash shell, python) to run HQL statements in batches.

select * from tab_ext sort by id desc limit 5;

 
select a.ip,b.book from tab_ext a join tab_ip_book b on(a.name=b.name);

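Consider this: in a real business scenario, the analysis and statistics you write in SQL can rarely be done in a single statement. The SQL processes some fields, built-in functions or user-defined functions and produces intermediate results. These are stored in intermediate tables, and only then can the next step run. You may need to write many SQL statements and run them as a batch, following a workflow. In relational databases, this kind of step-by-step processing used to be done with stored procedures.

HQL has no stored-procedure syntax. So if a model consists of a dozen or more SQL statements, running them one by one is tedious. How can they be organized to run as a batch? By relying on an external tool, for example a shell script that executes the dozen statements.

For reference, see

Sqoop script development conventions (a hands-on walkthrough of writing sqoop export and sqoop import)

The result is received on the shell side.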


bash shell and python are the two most commonly used scripting languages.
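
The same batching idea can be sketched in Java, in keeping with the Java code used later in this post. The sketch below only assembles the argument list for a single `hive -S -e` call that runs several statements in order. HiveBatchSketch and buildCommand are hypothetical names, and the launch itself is left commented out because it needs a working hive on the PATH.

```java
import java.util.Arrays;
import java.util.List;

public class HiveBatchSketch {
    // Builds the argument list for one `hive -S -e` call that runs all statements in order.
    public static List<String> buildCommand(List<String> statements) {
        String script = String.join("; ", statements) + ";";
        return Arrays.asList("hive", "-S", "-e", script);
    }

    public static void main(String[] args) {
        List<String> statements = Arrays.asList(
                "select country,count(*) from tab_ext group by country",
                "select * from tab_ext sort by id desc limit 5");
        List<String> command = buildCommand(statements);
        System.out.println(command);
        // Launching is left commented out because it needs hive on the PATH:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

In practice a bash or python script is the more common choice for this job; the Java form is shown only to stay in one language.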

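Create a new package, cn.itcast.bigdata.

Create a new class file, PhoneNbrToArea.java.

Write the code.

Extract the Hive archive.

For convenience, import every jar under D:\SoftWare\hive-0.12.0\lib, and additionally import hadoop-core-***.jar. (As a beginner, it is better to do this by hand!)

I looked through some references. For a Hive project, the following 12 jars are generally enough as dependencies.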

http://xiaofengge315.blog.51cto.com/405835/1408512

http://blog.csdn.net/haison_first/article/details/41051143

commons-lang-***.jar

commons-logging-***.jar

commons-logging-api-***.jar

hadoop-core-***.jar

hive-exec-***.jar

hive-jdbc-***.jar

hive-metastore-***.jar

hive-service-***.jar

libfb***.jar

log4j-***.jar

slf4j-api-***.jar

slf4j-log4j12-***.jar

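Note: from hadoop-2.X onward there is no hadoop-core-***.jar any more; it has been split into several separate jars. Previously it sat under the share directory of the Hadoop archive.

http://zhidao.baidu.com/link?url=KI6ZkudqskDjAthYc2PtTlmB_3FhR3OaMzm4Wcrl_oCkaJfBhaTd7mHSHsy1lkPYO8xa0EGhpD8RSnYdnpkDkGiZX04qff3ul3-xX-cOi07

The 2.x series no longer ships a hadoop-core jar; it has been broken up into individual jars.

Given this, a Hive project depends on the Hive jars and on the logging jars.

Because many Hive operations run as MapReduce programs, the Hive project also needs the Hadoop jars.

The list above is about the minimal set of jars needed for UDFs and for connecting to Hive over JDBC.

Opinions differ on this step, but you really do not need to import everything. If you just want convenience, though, you can import them all.

Here my Hadoop version is hadoop-2.4.1 and my Hive version is hive-0.12.0. (Because this is the one that came bundled.)

Related post: hive-1.0.0 and hive-1.2.1 revisited: details easily overlooked in JDBC programming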

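Jars a Hive project depends on

Summary: add hadoop-common-2.4.1.jar from D:\SoftWare\hadoop-2.4.1\share\hadoop\common, plus everything under D:\SoftWare\hive-0.12.0\lib\. That is all. (For convenience!)

In production, of course, this is not recommended.

Some blog posts I consulted also say that far fewer jars are needed. In addition, the program may pull in some indirect references; download and add those one by one later as they come up, copying them into hive-0.12.0\lib.

Get them from http://mvnrepository.com/ .

See my other posts:

Creating a Maven project in Eclipse and packaging dependency jars automatically

2 weekend110: writing the HDFS Java client + summary of the filesystem design ideas

weekend110-hive -> Build Path -> Configure Build Path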


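package cn.itcast.bigdata;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PhoneNbrToArea extends UDF {
    // Lookup table: first four digits of the phone number -> area
    private static final HashMap<String, String> areaMap = new HashMap<>();
    static {
        areaMap.put("1388", "beijing");
        areaMap.put("1399", "tianjin");
        areaMap.put("1366", "nanjing");
    }

    // evaluate must be declared public, otherwise Hive cannot call it
    public String evaluate(String pnb) {
        // NULL or too-short input has no four-digit prefix to look up
        if (pnb == null || pnb.length() < 4) {
            return pnb;
        }
        String area = areaMap.get(pnb.substring(0, 4));
        // Unknown prefixes fall back to "huoxing"
        return area == null ? (pnb + "    huoxing") : (pnb + "  " + area);
    }
}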
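
Before packaging the jar, the lookup rule can be checked locally without any Hive jars on the classpath. The sketch below repeats the rule of evaluate() in a plain class. AreaLookupSketch and lookup are hypothetical names, and passing null or short input through unchanged is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class AreaLookupSketch {
    private static final Map<String, String> AREA_BY_PREFIX = new HashMap<>();
    static {
        AREA_BY_PREFIX.put("1388", "beijing");
        AREA_BY_PREFIX.put("1399", "tianjin");
        AREA_BY_PREFIX.put("1366", "nanjing");
    }

    // Same rule as evaluate(): look up the first four digits, fall back to "huoxing".
    public static String lookup(String pnb) {
        if (pnb == null || pnb.length() < 4) {
            return pnb; // assumption of this sketch: pass such input through unchanged
        }
        String area = AREA_BY_PREFIX.get(pnb.substring(0, 4));
        return area == null ? pnb + "    huoxing" : pnb + "  " + area;
    }

    public static void main(String[] args) {
        System.out.println(lookup("13881234567")); // known prefix 1388
        System.out.println(lookup("15012345678")); // unknown prefix
    }
}
```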

The default location is /root/.

Here I change it to /home/hadoop/.

//UDF

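select if(id=1,'first','no-first'),name from tab_ext;

hive>add jar /home/hadoop/myudf.jar;
hive>CREATE TEMPORARY FUNCTION my_lower AS 'org.dht.Lower';
select my_lower(name) from tab_ext;

Hive user-defined functions

Next, create the Hive user-defined function and bind it to the class in the jar. Hive's built-in functions are permanent, whereas a function registered this way is TEMPORARY and exists only for the current session.

This has to be removed; otherwise later processing will run into problems.

In companies, Hive is used according to a set procedure: typically the raw data is collected, cleaned automatically by a MapReduce program, and only then handed to Hive.

Data collection -> data cleaning -> data consolidation -> hand over to Hive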

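Bucketing

Note: ordinary tables (external and managed) and partitioned tables all correspond to directories on HDFS, whereas the buckets of a bucketed table correspond to files inside the table directory.

//CLUSTER <-- relatively advanced; leave it until you have the energy to study it>

create table tab_ip_cluster(id int,name string,ip string,country string)

clustered by(id) into 3 buckets;   -- bucket by id, into 3 buckets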

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;

set hive.enforce.bucketing=true;

insert into table tab_ip_cluster select * from tab_ip;

select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id);

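Bucketing is fine-grained: the buckets are separate files.

Partitioning is coarse-grained: it amounts to creating folders under the table, so the partitions are separate folders.

For bucketing, Hive hashes the specified column and splits the rows by hash value, so that each bucket corresponds to one file.

The rows in each file were assigned to it by the hash of id.

Bucketing is generally used to deal with data skew and for data sampling. This again shows that it is fine-grained.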
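
How a row ends up in a bucket file can be sketched in a few lines of Java. This assumes the classic scheme used by older Hive releases such as the 0.12 used here: the hash of an int column is the value itself, it is masked to be non-negative, and the remainder modulo the bucket count picks the file. Newer Hive versions may hash differently, and BucketSketch and bucketFor are hypothetical names.

```java
public class BucketSketch {
    // Bucket index for an int column under the classic scheme:
    // the hash of an int is the value itself, masked to be non-negative.
    public static int bucketFor(int id, int numBuckets) {
        return (id & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        for (int id = 1; id <= 6; id++) {
            System.out.println("id=" + id + " -> bucket index " + bucketFor(id, 3));
        }
    }
}
```

Under this assumption, tablesample(bucket 2 out of 3 on id) would return the rows whose index here is 1, since buckets in the tablesample clause are numbered from 1.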

Attached PPT
