How to parse JSON data in Hive

Reposted article. 2022-01-29 19:31. Tag: Hive

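JSON is a common data format. User-behavior data is typically collected by event-tracking code, which packs multiple fields into a single JSON array, so a data platform has to parse the JSON before it can use the data. This article shows how Hive parses JSON.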

Hive functions for parsing JSON

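1. get_json_object

Syntax: get_json_object(json_string, '$.key')

Description: parses the JSON string json_string and returns the content at the given path. If the input is not valid JSON, it returns NULL. Each call returns only one item.

Example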

select get_json_object('{"name":"令狐沖","age":29}', '$.name') user_name;

Result:

user_name
令狐沖

Example: extracting two fields:

select get_json_object('{"name":"依琳","age":16}', '$.name') user_name,
       get_json_object('{"name":"依琳","age":16}', '$.age')  user_age;
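
Paths are not limited to top-level keys. As a small sketch (the JSON literal below is made up for illustration), the child operator . walks into nested objects and a [n] subscript selects an array element:

```sql
-- Hypothetical sample JSON, used only to illustrate the path syntax.
select get_json_object('{"user":{"name":"依琳","skills":["劍法","輕功"]}}', '$.user.name')      as user_name,
       get_json_object('{"user":{"name":"依琳","skills":["劍法","輕功"]}}', '$.user.skills[0]') as first_skill;
```

Each call still returns a single item, so this picks one element out of an array rather than expanding the array into rows.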

When many fields are needed, calling get_json_object once per field becomes tedious; json_tuple handles that case.

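2. json_tuple

Syntax: json_tuple(json_string, k1, k2 ...)
Description: parses the JSON string json_string and returns the values of the given keys k1, k2, ... in a single call. If the input is not valid JSON, it returns NULL.

Example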

select b.user_name, b.age
from (select * from temp.jc_test_coalesce_nvl where c1 = 1) i lateral view
    json_tuple('{"name":"依琳","age":18}', 'name', 'age') b as user_name, age;

A json_tuple detail: unlike get_json_object, json_tuple takes plain key names, not $ paths. Prefixing the keys with $ returns no data:

select b.user_name, b.age
from (select * from temp.jc_test_coalesce_nvl where c1 = 1) i lateral view
    json_tuple('{"name":"依琳","age":18}', '$.name', '$.age') b as user_name, age;

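Result: both user_name and age come back NULL, so keep this in mind when using json_tuple.

Summary: the advantage of json_tuple over get_json_object is that one call can extract several JSON fields. Neither function, however, can expand a JSON array into one row per element.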

Parsing JSON arrays in Hive

1. Parsing a JSON array with a nested subquery

Scenario: a Hive table has a json_str column with the following content:

json_str

[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]

The goal is to extract the following data:

title	author
笑傲鏡湖	金庸
小李飛刀	古龍

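Approach:

The explode function

Syntax: explode(Array OR Map)
Description: explode() takes an array or a map as input and emits each element as its own row. In other words, it splits a complex array or map value held in one Hive column into multiple rows.

Example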

select array('A','B','C') ;

select explode(array('A','B','C'));
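
explode also accepts a map, in which case each entry becomes a row with a key column and a value column. A short sketch with a made-up map literal; the aliases k and v are illustrative names:

```sql
-- One output row per map entry; the two columns are named by the alias list.
select explode(map('title', '笑傲鏡湖', 'author', '金庸')) as (k, v);
```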

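The regexp_replace function

Syntax: regexp_replace(string A, string B, string C)
Description: replaces every part of string A that matches the Java regular expression B with C. Some characters have to be escaped. The function is similar to regexp_replace in Oracle.

Example: replace ve_sp with @ (the result is hi@ark):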

select regexp_replace('hive_spark', 've_sp', '@');
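
As an illustration of the escaping mentioned above: [ and ] are regex metacharacters, so they have to be escaped, and the backslash itself is doubled inside a Hive string literal. A small sketch that strips square brackets from a made-up string:

```sql
-- '\\[|\\]' reaches the regex engine as \[|\] : a literal [ or a literal ]
select regexp_replace('[A,B,C]', '\\[|\\]', '');
```

The same pattern is what the JSON-array queries in the rest of this article use to drop the brackets around the array.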

Now let's parse a JSON array.

Step 1: split the JSON array into its elements, one per row

select explode(split(regexp_replace(regexp_replace('[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;'));
Result:
{"title":"笑傲鏡湖","author":"金庸"}
{"title":"小李飛刀","author":"古龍"}

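The SQL above looks long, but it is easy to follow one step at a time:

select explode(
    split(
        regexp_replace(
            regexp_replace(
                -- the JSON stays on one line: the pattern below expects },{ with no whitespace in between
                '[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]',
                '\\[|\\]', ''),          -- strip the square brackets around the JSON array
            '\\}\\,\\{', '\\}\\;\\{'),   -- replace the comma between array elements with a semicolon
        '\\;')                           -- split on the semicolon
);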

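Note: why replace the comma between array elements with a semicolon?

Because the fields inside each element are also separated by commas. If the commas between elements were left in place, a later split on commas would also cut each element apart, which is not what we want.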
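
If the data itself may contain semicolons, the intermediate replacement can be skipped. One possible variant (not part of the original approach) splits directly on the commas that sit between } and {, using Java regex lookbehind and lookahead, and strips only the outermost brackets. It is a sketch under the same assumption as above, namely that the elements are flat objects with no whitespace between them:

```sql
select explode(
    split(
        -- remove only the leading [ and the trailing ]
        regexp_replace('[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]', '^\\[|\\]$', ''),
        -- split on a comma that is preceded by } and followed by {
        '(?<=\\}),(?=\\{)'
    )
) as json;
```

Like the semicolon approach, this would still mis-split a string value that happens to contain },{.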

Step 2: the previous step turned one JSON array into several JSON strings. Next, use json_tuple to extract the fields from each string:

select
json_tuple(explode(split(regexp_replace(regexp_replace('[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]', '\\[|\\]', ''),'\\}\\,\\{', '\\}\\;\\{'), '\\;'))
, 'title', 'author') ;

Running this statement fails:

UDTF's are not supported outside the select clause, nor nested in expressions:17:16,

explode is a UDTF and cannot be nested inside another function call such as json_tuple. Use a subquery instead:

select json_tuple(json, 'title', 'author')
from (
    select explode(split(regexp_replace(regexp_replace('[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]', '\\[|\\]', ''),'\\}\\,\\{', '\\}\\;\\{'), '\\;'))
    as json
) o;
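
Written this way, the two output columns get generated names (such as c0 and c1). A parenthesized alias list after the UDTF call can name them; the sketch below is the same subquery pattern with the aliases title and author added:

```sql
select json_tuple(json, 'title', 'author') as (title, author)
from (
    -- inner query: one JSON string per array element
    select explode(split(regexp_replace(regexp_replace('[{"title":"笑傲鏡湖","author":"金庸"},{"title":"小李飛刀","author":"古龍"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;')) as json
) o;
```

This should produce the title and author table shown in the scenario at the start of this section.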

2. Parsing a JSON array with lateral view

Sample data:

goods_id	json_str
1,2,3	[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Expected result: extract goods_id together with the monthSales field from json_str.

First, split the goods_id column and turn the JSON array into several JSON strings:

select 
explode(split(goods_id,',')) as good_id,
explode(split(regexp_replace(regexp_replace(json_str , '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) 
as sale_info 
from tableName;

Running this statement fails:

FAILED: SemanticException 3:0 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sale_info'

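When a UDTF appears in the SELECT clause, it must be the only expression there. The statement above has two, hence the error.

Solving this requires one more piece of Hive syntax:

lateral view

lateral view is used together with UDTFs such as explode and json_tuple, which turn one row into multiple rows; the expanded rows can then be aggregated. lateral view first applies the UDTF to each row of the base table, the UDTF expands that row into one or more rows, and lateral view then joins the output back to the source row, producing a virtual table that can be given an alias.

Example: a table of martial-arts sects, temp.jc_t_test_json, has two columns. The first is party_name; the second, user_name, is an array holding the names of the sect's members: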

party_name	user_name
日月神教	[東方不敗,任盈盈,曲陽]
五岳劍派	[令狐沖,依琳,劉正風]

Preparing the data

create table temp.jc_t_test_json
(
    party_name string,
    user_name  array<string>
) row format delimited fields terminated by ',' -- fields are separated by ','
    collection items terminated by '_' -- elements of a collection are separated by '_'
    map keys terminated by ':' -- map keys and values are separated by ':'
    lines terminated by '\n';-- rows are separated by '\n'

insert into temp.jc_t_test_json select "日月神教", array("東方不敗", "任盈盈", "曲陽");
insert into temp.jc_t_test_json select "五岳劍派", array ("令狐沖", "依琳","劉正風" );
select * from temp.jc_t_test_json;

select party_name, user_name_
from  temp.jc_t_test_json
lateral view explode(user_name) tmp_table as user_name_;

To count per member, group by user_name_ and aggregate:

select user_name_ ,count(party_name) name_cnt
from temp.jc_t_test_json
lateral view explode(user_name) tmp_table as user_name_
group by user_name_;

Now back to the earlier problem, where SELECT allows only a single UDTF expression:

select good_id, get_json_object(sale_json, '$.monthSales') as monthsales -- JSON keys are case-sensitive: the key is monthSales
from tablename
lateral view explode(split(goods_id,','))goods as good_id
lateral view explode(split(regexp_replace(regexp_replace(json_str , '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) sales as sale_json;

good_id	monthsales
1	4900
1	2090
1	6987
2	4900
2	2090
2	6987
3	4900
3	2090
3	6987
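
When several fields of each element are needed, a third lateral view with json_tuple avoids repeating get_json_object. A sketch against the same tablename table; the output aliases sale_source, month_sales and score are illustrative names:

```sql
select good_id, s.sale_source, s.month_sales, s.score
from tablename
-- one row per id in the comma-separated goods_id column
lateral view explode(split(goods_id, ',')) goods as good_id
-- one row per element of the JSON array
lateral view explode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;')) sales as sale_json
-- several fields pulled out of each element in one call
lateral view json_tuple(sale_json, 'source', 'monthSales', 'score') s as sale_source, month_sales, score;
```

As in the result above, every good_id is paired with every array element, because the two explode calls are independent of each other.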
