Reading Elasticsearch Data with Hive

This article is a repost. 2020-06-04 10:50 · Elasticsearch

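This article shows how to read data stored in Elasticsearch through Hive. Once that is set up, we can work with the Elasticsearch data from Hive just as we would with any ordinary Hive table, which is a great convenience for developers. The component versions used here are Hive 0.12, Hadoop 2.2.0 and Elasticsearch 2.3.4.

Let's first look at the mapping of the relevant type in Elasticsearch: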

 
    {
      "user": {
        "properties": {
          "regtime": {
            "index": "not_analyzed",
            "type": "string"
          },
          "uid": {
            "type": "integer"
          },
          "mobile": {
            "index": "not_analyzed",
            "type": "string"
          },
          "username": {
            "index": "not_analyzed",
            "type": "string"
          }
        }
      }
    }

The Elasticsearch index is named iteblog and the type is user; user has four properties: regtime, uid, mobile and username. Now let's move to the Hive side.
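Before defining the Hive table, it can help to read the mapping straight from Elasticsearch and confirm the field names and types. The following is a minimal sketch, assuming the cluster is reachable at www.iteblog.com:9200 (the host and port used later in this article), that curl is installed, and that the get-mapping endpoint has the Elasticsearch 2.x layout `/<index>/_mapping/<type>`. The helper names `mapping_url` and `show_mapping` are made up for illustration.

```shell
# Assumed connection details; replace them with your own cluster.
ES_HOST="www.iteblog.com"
ES_PORT="9200"

# Build the get-mapping URL for an index ($1) and a type ($2).
mapping_url() {
  printf 'http://%s:%s/%s/_mapping/%s?pretty' "$ES_HOST" "$ES_PORT" "$1" "$2"
}

# Fetch and print the mapping as formatted JSON.
# Nothing is sent over the network until this function is called.
show_mapping() {
  curl -s "$(mapping_url "$1" "$2")"
}
```

Calling `show_mapping iteblog user` should print a document that includes the four properties listed above.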

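For Hive to work with data in Elasticsearch, Hive needs some extra setup. Fortunately, Elasticsearch officially provides a library for exactly this purpose. We need to add the matching elasticsearch-hadoop-xxx.jar. Because our Elasticsearch version is 2.x, we need at least ES-Hadoop 2.2.x; this article uses elasticsearch-hadoop-2.3.4.jar, which can be downloaded from the Maven Central repository. There are several ways to make Hive load elasticsearch-hadoop-2.3.4.jar: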

1. Load it directly with the ADD command:

 
    hive> ADD JAR /home/iteblog/elasticsearch-hadoop-2.3.4.jar;
    Added [/home/iteblog/elasticsearch-hadoop-2.3.4.jar] to class path
    Added resources: [/home/iteblog/elasticsearch-hadoop-2.3.4.jar]

2. Set it when starting Hive:

 
    $ bin/hive --auxpath=/home/iteblog/elasticsearch-hadoop-2.3.4.jar

3. Set the hive.aux.jars.path property:

 
    $ bin/hive -hiveconf hive.aux.jars.path=/home/iteblog/elasticsearch-hadoop-2.3.4.jar

Alternatively, write the setting into hive-site.xml so that it does not have to be repeated on every start:

    <property>
      <name>hive.aux.jars.path</name>
      <value>/home/iteblog/elasticsearch-hadoop-2.3.4.jar</value>
      <description>A comma separated list (with no spaces) of the jar files</description>
    </property>

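Choose whichever option suits your environment. Once the Elasticsearch library is in place, we can create the table in Hive. For convenience, we give the Hive columns the same names and types as the fields in Elasticsearch: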

 
    hive (iteblog)> create EXTERNAL table `user`(
                  >   regtime string,
                  >   uid int,
                  >   mobile string,
                  >   username string
                  > )
                  > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
                  > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true');

At this point we can already query the Elasticsearch data from Hive:

 
    hive (iteblog)> select * from `user` limit 10;
    OK
    2016-10-24 13:08:16 1   13112121212 Tom
    2016-10-24 14:08:16 2   13112121212 Join
    2016-10-25 14:23:16 3   13112121212 iteblog
    2016-10-25 13:08:16 4   NULL        weixin
    2016-10-25 19:08:16 5   13112121212 bbs
    2016-10-25 13:14:04 6   NULL        zhangshan
    2016-10-25 13:08:16 7   13112121212 wangwu
    2016-10-25 14:56:16 8   13112121212 Joan
    2016-10-25 15:25:16 9   13112121212 White
    2016-10-25 17:24:16 0   NULL        lihhh
    Time taken: 0.072 seconds, Fetched: 10 row(s)

As shown above, we have successfully queried Elasticsearch data through Hive. If you run into the following exception while doing so:

 
    Failed with exception java.io.IOException:org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

the most likely cause is an incorrect es.nodes or es.port setting.
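A quick way to test those two settings is to request the cluster's root endpoint from the machine that runs Hive. The following is a minimal sketch, assuming the same host and port as the table definition above and that curl is installed; `es_root_url` and `check_es` are hypothetical helper names.

```shell
# Assumed to match the table properties: es.nodes and es.port.
ES_HOST="www.iteblog.com"
ES_PORT="9200"

# Build the URL of the cluster's root endpoint.
es_root_url() {
  printf 'http://%s:%s/' "$ES_HOST" "$ES_PORT"
}

# Request the root endpoint, giving up after 5 seconds.
# A reachable node answers with a small JSON document that includes its version.
check_es() {
  curl -s --max-time 5 "$(es_root_url)"
}
```

If `check_es` prints nothing or exits with an error, the host, the port, or the network path between Hive and Elasticsearch is the first thing to look at.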

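In the example above we gave the Hive columns the same names as the Elasticsearch fields for convenience. In practice, however, we may not be able to keep the Hive column names identical to those in Elasticsearch. In that case some extra configuration is needed when creating the Hive table, otherwise errors will occur. This is done with the es.mapping.names parameter: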

 
    hive (iteblog)> create EXTERNAL table `user`(
                  >   register_time string,
                  >   user_id int,
                  >   mobile string,
                  >   username string
                  > )
                  > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
                  > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true', 'es.mapping.names'='register_time:regtime,user_id:uid');

With this in place, the Hive column register_time maps to the Elasticsearch field regtime, and user_id maps to uid.

When creating the Hive table, we can also specify es.query to restrict which data is read:

 
    hive (iteblog)> create EXTERNAL table `user`(
                  >   regtime string,
                  >   uid int,
                  >   mobile string,
                  >   username string
                  > )
                  > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
                  > TBLPROPERTIES('es.resource' = 'iteblog/user', 'es.nodes'='www.iteblog.com', 'es.port'='9200', 'es.nodes.wan.only'='true', 'es.query' = '?q=uid:2');

Now let's look at the result:

 
    hive (iteblog)> select * from `user` limit 10;
    OK
    2016-10-24 14:08:16 2   13112121212 Join
    Time taken: 0.023 seconds, Fetched: 1 row(s)

As we can see, only the row whose uid is 2 is returned; all other rows are filtered out.
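The `?q=uid:2` value is a URI search query, so the same filter can be previewed against Elasticsearch directly before it is placed in es.query. The following is a minimal sketch, assuming the host and port used throughout this article and that curl is installed; `search_url` and `preview_query` are hypothetical helper names.

```shell
# Assumed connection details; replace them with your own cluster.
ES_HOST="www.iteblog.com"
ES_PORT="9200"

# Build a URI search URL for a resource ($1, e.g. index/type) and a query string ($2).
search_url() {
  printf 'http://%s:%s/%s/_search?q=%s' "$ES_HOST" "$ES_PORT" "$1" "$2"
}

# Run the search and print the raw JSON response.
preview_query() {
  curl -s "$(search_url "$1" "$2")"
}
```

Calling `preview_query iteblog/user uid:2` should return the documents that the Hive table would expose with that es.query setting.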

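For SQL statements that need to launch a MapReduce job, the number of map tasks Hive starts equals the number of shards in the Elasticsearch index; that is, each shard is processed by one map task.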
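To estimate how many map tasks to expect, the shards of the index can be counted with the cat shards API. The following is a minimal sketch, assuming that "shards" here means primary shards, that the cluster is reachable at the host and port used in this article, and that curl is installed; `shards_url`, `count_primaries` and `primary_shard_count` are hypothetical helper names.

```shell
# Assumed connection details; replace them with your own cluster.
ES_HOST="www.iteblog.com"
ES_PORT="9200"

# Build a cat-shards URL for an index ($1), asking only for the
# primary/replica column (one "p" or "r" per line).
shards_url() {
  printf 'http://%s:%s/_cat/shards/%s?h=prirep' "$ES_HOST" "$ES_PORT" "$1"
}

# Count the lines on standard input that mark a primary shard.
count_primaries() {
  grep -c '^p'
}

# Fetch the shard list for an index and print its number of primary shards.
primary_shard_count() {
  curl -s "$(shards_url "$1")" | count_primaries
}
```

Calling `primary_shard_count iteblog` prints a single number that can be compared with the map task count reported by the Hive job.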
