inline
Recap: inline does not work on map or array(map)
About inline: as already tried in the section on converting between wide and long tables, inline cannot be applied to a map.
Here the map is wrapped in an array, and inline still fails; it appears inline only accepts the array(struct) format.
# Wrap the map in an array: lateral view inline still fails; inline only accepts array(struct)
sc.sql('''
select id
,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string))))))
from test_youhua.zongbiao
group by id
''')
# The query result is now an ARRAY
1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
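To make the nested function calls in that query easier to follow, here is a rough pure-Python model of what each step does per id. It is only a sketch: the sample rows are hypothetical, the helper names are mine, and sorting the pairs is my own choice for a stable result (collect_set itself does not promise an order).

```python
from collections import defaultdict

def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Rough model of str_to_map: split into pairs, then each pair into key and value."""
    result = {}
    for pair in text.split(pair_delim):
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result

def rows_to_array_of_map(rows):
    """rows: iterable of (id, prod_nm, bal). Returns {id: [map]}, mirroring the query."""
    pairs_by_id = defaultdict(set)                    # collect_set: distinct values per group
    for row_id, prod_nm, bal in rows:
        pairs_by_id[row_id].add(f"{prod_nm}:{bal}")   # concat_ws(':', prod_nm, cast(bal as string))
    return {
        # concat_ws(',', ...) joins the pairs, str_to_map parses them, array(...) wraps the map
        row_id: [str_to_map(",".join(sorted(pairs)))]
        for row_id, pairs in pairs_by_id.items()
    }

# Hypothetical rows standing in for test_youhua.zongbiao
sample_rows = [(1, "jijin", 1.1), (1, "baoxian", 1.2), (1, "cunkuan", 1.3)]
result = rows_to_array_of_map(sample_rows)
print(result)
```

The point to notice is that every key and value ends up as a string inside a map, so the column type is array(map(string,string)), not array(struct).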
# inline still cannot be used
sc.sql(''' select map_tmp_tbl.id,c1 from (
select id
,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string)))))) as array_map_col
from test_youhua.zongbiao
group by id
) as map_tmp_tbl lateral view inline(map_tmp_tbl.array_map_col) t1 as c1 ''').show()
# Error: inline expects array(struct); the array(map) type here does not match
AnalysisException: "cannot resolve 'inline(map_tmp_tbl.`array_map_col`)' due to data type mismatch: input to function inline should be array of struct type, not ArrayType(MapType(StringType,StringType,true),false);
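The type check behind this error can be modelled in plain Python. This is a sketch under my own assumptions, not Spark's implementation: a struct is modelled as a named tuple (fixed, ordered fields), a map as a dict (arbitrary keys), and inline refuses anything that is not a struct. The map_entries helper imitates the Spark SQL function of the same name (available from Spark 2.4 as far as I know; I have not run it against this table), which turns a map into an array of key/value structs that inline can accept.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    """Models struct<key:string, value:string>."""
    key: str
    value: str

def map_entries(m):
    """Model of map -> array<struct<key,value>>."""
    return [Entry(k, v) for k, v in m.items()]

def inline(structs):
    """Model of inline: one output row per struct, one column per struct field."""
    rows = []
    for s in structs:
        # A dict (map) has no fixed field list, so the output columns are unknown.
        if not (isinstance(s, tuple) and hasattr(s, "_fields")):
            raise TypeError("input to inline should be array of struct type")
        rows.append(tuple(s))
    return rows

array_map_col = [{"baoxian": "1.2", "cunkuan": "1.3", "jijin": "1.1"}]

try:
    inline(array_map_col)                    # array(map): rejected
except TypeError as err:
    print("rejected:", err)

print(inline(map_entries(array_map_col[0])))  # array(struct<key,value>): accepted
```

Note that this route yields one row per key with two columns (key, value), which is a different shape from the three-column result an array(struct<baoxian,cunkuan,jijin>) column gives.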
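After reading this post, it looked as if inline could be applied to array(map)?
https://blog.csdn.net/JnYoung/article/details/106169019
It is not the same case: in that example the named_struct_1 column was stored as a struct type from the start.
So next, let's simply build a real array(struct) column to work with.
(1) Data preparation, create table plus insert-select: can the array(map) data that inline rejected be stored directly as array(struct)? No, it fails with a column type mismatch.
This seemed a bit odd at first: Hive is schema on read, so does it check column types on insert? The error below is a SemanticException, so the check happens when the insert ... select statement is compiled, before any data is written.
By contrast, when I inserted date values into a Parquet table (which did not support the date type in my setup), the insert itself did not fail; the column simply showed up empty when read.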
# Create the table
create table if not exists test_youhua.test_array_struct_inline(
custom_id int comment "customer id",
all_bal array<struct<baoxian:float, cunkuan:float, jijin:float>> comment 'asset allocation'
)
comment "array_struct customer asset allocation table"
;
# Insert data
insert overwrite table test_youhua.test_array_struct_inline
select id
,array(str_to_map(concat_ws(',',collect_set(concat_ws(':',prod_nm,cast(bal as string))))))
from test_youhua.zongbiao
group by id
# Error: column types do not match
FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different 'test_array_struct_inline': Cannot convert column 1 from array<map<string,string>> to array<struct<baoxian:float,cunkuan:float,jijin:float>>.
(2) Data preparation: load a file directly into test_youhua.test_array_struct_inline
# Prepare the file test_array_struct_inline and xftp it to the Linux host
1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
# Load it into HDFS
hdfs dfs -put /opt/module/hive/my_input/test_array_struct_inline hdfs:///user/hive/warehouse/test_youhua.db/test_array_struct_inline
# Query the Hive table: the file has indeed been loaded, but the data cannot be read
sc.sql(""" select * from test_youhua.test_array_struct_inline""").show()
+---------+-------+
|custom_id|all_bal|
+---------+-------+
|     null|   null|
|     null|   null|
+---------+-------+
# Guessing that the delimiters are the cause, recreate the table with explicit delimiters
sc.sql(""" drop table test_youhua.test_array_struct_inline""")
sc.sql("""create table if not exists test_youhua.test_array_struct_inline(
custom_id int comment "customer id",
all_bal array<struct<baoxian:float, cunkuan:float, jijin:float>> comment 'asset allocation'
)
comment "array_struct customer asset allocation table"
row format delimited fields terminated by ','
collection items terminated by '_'
""")
!hdfs dfs -put /opt/module/hive/my_input/test_array_struct_inline hdfs:///user/hive/warehouse/test_youhua.db/test_array_struct_inline
sc.sql(""" select * from test_youhua.test_array_struct_inline""").show()
# No change of delimiters helps; the data still cannot be read. My first guess was that nesting a struct inside an array interferes with the delimiter settings (the real cause: the file content is JSON, which needs a JSON SerDe)
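(3) Data preparation: read the file by specifying a JSON SerDe as the row format
The data could not be read earlier because the file content is JSON, which a delimited row format cannot parse; the JsonSerDe jar has to be added.
Reference: Hive study notes (16), loading and parsing JSON files in Hive. The file was modified slightly, with one JSON object per line:
{"custom_id":"1","all_bal":[{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]}
{"custom_id":"2","all_bal":[{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]}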
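The file change above was done by hand; for more than a couple of rows it could be scripted. A minimal sketch, assuming the old file has the id and the JSON array separated by whitespace (the separator in my original file is not shown above, so that part is an assumption) and that the helper name to_json_line is just for illustration:

```python
import json

def to_json_line(line):
    """Turn '<id> <json array>' into one JSON object per line for the JSON SerDe."""
    custom_id, array_text = line.strip().split(None, 1)   # split on the first whitespace run
    record = {"custom_id": custom_id, "all_bal": json.loads(array_text)}
    # Compact separators keep the output identical in style to the hand-edited file
    return json.dumps(record, separators=(",", ":"))

old_lines = [
    '1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]',
    '2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]',
]
new_lines = [to_json_line(line) for line in old_lines]
print("\n".join(new_lines))
```

Each output line is a complete JSON object, which is the layout the JSON SerDe reads row by row.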
hive> add jar /opt/module/hive/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
hive> create table if not exists test_youhua.test_array_struct_inline(
    > custom_id string comment "customer id",
    > all_bal array<struct<baoxian:string, cunkuan:string, jijin:string>> comment 'asset allocation'
    > )
    > comment "array_struct customer asset allocation table"
    > row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';
OK
Time taken: 0.09 seconds
hive> select * from test_youhua.test_array_struct_inline;
OK
Time taken: 0.089 seconds
# After putting the file into the table directory, the data is loaded and read successfully
hive> select * from test_youhua.test_array_struct_inline;
OK
1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
Time taken: 0.107 seconds, Fetched: 2 row(s)
# Note that all column types were changed to string here. The values in the file are quoted JSON strings, so with int/float columns select fails with:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors at [Source: java.io.ByteArrayInputStream@ab327c; line: 1, column: 41]
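The "VALUE_STRING not numeric" message comes from the values being quoted in the file. A small Python sketch shows the same distinction and one way to rewrite a line with unquoted numbers. The helper name numeric_balances is hypothetical, and I have not verified that JsonSerDe then reads the rewritten file into float columns; treat that as an assumption to test.

```python
import json

def numeric_balances(json_line):
    """Rewrite the quoted balances of one JSON line as JSON numbers."""
    record = json.loads(json_line)
    record["all_bal"] = [
        {name: float(value) for name, value in struct.items()}
        for struct in record["all_bal"]
    ]
    return json.dumps(record, separators=(",", ":"))

line = '{"custom_id":"1","all_bal":[{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]}'

parsed = json.loads(line)
# The quoted value parses as a string, which is what the SerDe complains about
print(type(parsed["all_bal"][0]["baoxian"]).__name__)

converted = numeric_balances(line)
print(converted)
```

custom_id is left as a quoted string here; it would need the same treatment if the column were declared int.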
(4) inline can unpack array(struct); by comparison, explode only unpacks the array
Reference: https://blog.csdn.net/weixin_42003671/article/details/88132666
# Original data
hive> select * from test_youhua.test_array_struct_inline;
OK
1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
Time taken: 0.103 seconds, Fetched: 2 row(s)
# To read the jijin, baoxian and cunkuan balances at this point, index into the array.
# This works because each array holds a single struct. With several structs per array, looking up a key's value through array[i] gets awkward; inline and explode can then move that key's value from every struct into one column.
hive> select all_bal[0].jijin,all_bal[0].baoxian,all_bal[0].cunkuan from test_youhua.test_array_struct_inline;
OK
1.1 1.2 1.3
2.67 2.34 2.1
Time taken: 0.587 seconds, Fetched: 2 row(s)
# Use inline to move each key's value from the structs into its own column
hive> select tmp.custom_id,c1,c2,c3 from test_youhua.test_array_struct_inline as tmp lateral view inline(tmp.all_bal) t1 as c1,c2,c3;
OK
1 1.2 1.3 1.1
2 2.34 2.1 2.67
Time taken: 0.093 seconds, Fetched: 2 row(s)
# Compare with explode, which only unpacks one level, i.e. it removes the array's []
hive> select tmp.custom_id,c1 from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1;
OK
1 {"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}
2 {"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}
# With explode, struct.key is still needed to reach the value. So inline unpacks one level deeper than explode: inline yields the values directly, while explode needs a further struct.key step.
hive> select tmp.custom_id,c1.jijin from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1;
OK
1 1.1
2 2.67
Time taken: 0.122 seconds, Fetched: 2 row(s)
# The form used in the referenced link does not work here; it requires structs that have explicit fields named key and value
select tmp.custom_id,c1.value from test_youhua.test_array_struct_inline as tmp lateral view explode(tmp.all_bal) t1 as c1 where c1.key="jijin";
# Error:
RuntimeException cannot find field key(lowercase form: key) in [baoxian, cunkuan, jijin]
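The difference between the two lateral views can be modelled in a few lines of Python. This is only an illustration of the row shapes, not of how Hive executes them; the Bal type and the two function names are made up for the sketch, and the table rows copy the data above.

```python
from typing import NamedTuple

class Bal(NamedTuple):
    """Models struct<baoxian:string, cunkuan:string, jijin:string>."""
    baoxian: str
    cunkuan: str
    jijin: str

table = [
    ("1", [Bal("1.2", "1.3", "1.1")]),
    ("2", [Bal("2.34", "2.1", "2.67")]),
]

def lateral_view_explode(rows):
    """One output row per array element; the element stays a struct (column c1)."""
    return [(custom_id, struct) for custom_id, structs in rows for struct in structs]

def lateral_view_inline(rows):
    """One output row per array element; the struct fields become columns c1, c2, c3."""
    return [(custom_id, *struct) for custom_id, structs in rows for struct in structs]

exploded = lateral_view_explode(table)
inlined = lateral_view_inline(table)

print(exploded[0][1].jijin)   # explode still needs the struct.field step
print(inlined)                # inline already has the values as separate columns
```

With several structs per array both functions produce one row per struct, which is the case where indexing with all_bal[i] stops being practical.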
Tips
(1) org.apache.hive.hcatalog.data.JsonSerDe has limited support for complex types
Reference: https://www.cnblogs.com/aprilrain/p/6916359.html
(2) Inserting data into array(struct): use named_struct
See: https://blog.csdn.net/random0815/article/details/85252250
And for the error fix: https://blog.csdn.net/qq_36203774/article/details/102599260
insert into test_youhua.test_array_struct_inline select "4",array(named_struct('baoxian','1.46','cunkuan','1.46','jijin','1.46'));
# Error
ParseException line 1:124 Failed to recognize predicate '<EOF>'. Failed rule: 'regularBody' in statement
# Fix: put the values in a temporary table tmp (a CTE) so the select has a from clause; pay attention to the field order inside the struct
with tmp as (select "3",array(named_struct('baoxian','1.45','cunkuan','1.45','jijin','1.45')))
insert into test_youhua.test_array_struct_inline select * from tmp;
select * from test_youhua.test_array_struct_inline
# Data inserted successfully
3 [{"baoxian":"1.45","cunkuan":"1.45","jijin":"1.45"}]
1 [{"baoxian":"1.2","cunkuan":"1.3","jijin":"1.1"}]
2 [{"baoxian":"2.34","cunkuan":"2.1","jijin":"2.67"}]
# struct works in a plain select, but its result cannot be inserted into the target array(struct) column
with tmp as (select "6",array(struct('1.45','1.45','1.45')))
insert into test_youhua.test_array_struct_inline select * from tmp;
# struct fails: its auto-generated field names (col1, col2, col3) do not match the target field names
Cannot insert into target table because column number/types are different 'test_array_struct_inline': Cannot convert column 1 from array<struct<col1:string,col2:string,col3:string>> to array<struct<baoxian:string,cunkuan:string,jijin:string>>.
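Since both the field names and their order have to match the table definition, the named_struct expression can be generated instead of typed. A minimal sketch with hypothetical helper names (named_struct_expr, insert_statement); it only builds the same statement text as the working example above and restricts values to plain digits and decimals so that no quoting issues arise.

```python
import re

# Field order taken from the table definition: baoxian, cunkuan, jijin
TARGET_FIELDS = ("baoxian", "cunkuan", "jijin")

def named_struct_expr(values, fields=TARGET_FIELDS):
    """Build named_struct('f1','v1',...) with fields in the table's order."""
    if set(values) != set(fields):
        raise ValueError(f"expected exactly the fields {fields}, got {sorted(values)}")
    parts = []
    for name in fields:
        value = values[name]
        # Only plain numbers are accepted, so no escaping is needed in this sketch
        if not re.fullmatch(r"[0-9]+(\.[0-9]+)?", value):
            raise ValueError(f"unexpected value for {name}: {value!r}")
        parts.append(f"'{name}','{value}'")
    return "named_struct(" + ",".join(parts) + ")"

def insert_statement(custom_id, values):
    """Wrap the struct in the CTE form that avoided the ParseException."""
    if not custom_id.isdigit():
        raise ValueError(f"unexpected custom_id: {custom_id!r}")
    return (
        f'with tmp as (select "{custom_id}",array({named_struct_expr(values)})) '
        "insert into test_youhua.test_array_struct_inline select * from tmp"
    )

# The dict order differs from the table order on purpose
stmt = insert_statement("3", {"jijin": "1.45", "baoxian": "1.45", "cunkuan": "1.45"})
print(stmt)
```

Passing the values as a dict and emitting them in TARGET_FIELDS order is what guards against the field-order mistake mentioned above.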
(3) How to read values from map, array and struct
select array[1],map['xiao song'],struct.city from test