While processing an incremental transaction-log table, I got a comparison wrong.
    select c.a1, c.a2
    from (
        select a.a1,
               if(a.a2 <> b.b2, 1, 0) as diff,
               a.a2
        from a
        left join b on a.a1 = b.b1
    ) c
    where c.diff = 1;
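Because this is a left outer join, a row in a may have no matching row in b, in which case b.b2 is NULL. In if(a.a2<>b.b2,1,0) as diff, the comparison a.a2<>b.b2 then evaluates to NULL rather than true, so diff becomes 0 and the row is silently dropped. NULL can only be tested with is null / is not null.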
    0: jdbc:hive2://localhost:10000/big12> select if(1 <>null,0,1);
    +------+--+
    | _c0  |
    +------+--+
    | 1    |
    +------+--+
    1 row selected (0.13 seconds)

    0: jdbc:hive2://localhost:10000/big12> select 1<>null;
    +-------+--+
    | _c0   |
    +-------+--+
    | NULL  |
    +-------+--+
    1 row selected (0.127 seconds)
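The two results above follow from SQL's three-valued logic: a comparison against NULL yields NULL, and if() falls through to its else branch whenever the condition is not true. A minimal sketch that can be run in the same session to see this side by side; the column aliases are made up for illustration:

```sql
-- Ordinary comparisons against NULL evaluate to NULL,
-- while IS NULL / IS NOT NULL always return true or false.
select 1 = null      as eq_null,
       1 <> null     as ne_null,
       null is null  as is_null,
       1 is not null as is_not_null;

-- if() returns its third argument when the condition is NULL,
-- which is why diff came out as 0 for unmatched rows in the left join.
select if(1 <> null, 'different', 'same or unknown') as verdict;
```

The first query should show NULL for the two comparison columns and true for the two IS tests. The second should return 'same or unknown'.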
So there are two ways to handle it:
    0: jdbc:hive2://localhost:10000/big12> select if(1<>null or null is null,1,0) as diff;
    +-------+--+
    | diff  |
    +-------+--+
    | 1     |
    +-------+--+
    1 row selected (0.121 seconds)

    0: jdbc:hive2://localhost:10000/big12> select if(1<>nvl(null,0),1,0) as diff;
    +-------+--+
    | diff  |
    +-------+--+
    | 1     |
    +-------+--+
    1 row selected (0.13 seconds)
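Applied to the original query over tables a and b, the first workaround might look like the sketch below. The second statement is an alternative not used in the original post: Hive's null-safe equality operator <=>, which returns true when both operands are NULL and false when only one is.

```sql
-- Variant 1: flag the row explicitly when the right-hand side is missing.
select c.a1, c.a2
from (
    select a.a1,
           a.a2,
           if(a.a2 <> b.b2 or b.b2 is null, 1, 0) as diff
    from a
    left join b on a.a1 = b.b1
) c
where c.diff = 1;

-- Variant 2 (alternative, assuming a Hive version that supports <=>):
-- null-safe comparison, no sentinel value needed.
select a.a1, a.a2
from a
left join b on a.a1 = b.b1
where not (a.a2 <=> b.b2);
```

The variants are not identical. Variant 1 still misses rows where a.a2 is NULL but b.b2 is not, and it flags rows where both are NULL. Variant 2 treats two NULLs as equal. The nvl(b.b2, 0) form has its own caveat: it cannot flag a row whose a.a2 is genuinely 0, so the sentinel must be a value that never occurs in the data.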
Other notes:
The employee table:
    hive> desc employee;
    empid     string
    deptid    string
    salary    string
Query employee:
    hive> select * from employee;
    1    NULL    NULL
In Hive, NULL is stored in HDFS as '\N' by default (although, to be safe, we usually change the NULL storage format to '').
So the data for employee in HDFS is:
    1    \N    \N
To verify, insert '\N':
    hive> insert into table employee select '2','\\N','\\N' from employee limit 1;
The extra backslash is an escape character.
Query employee:
    hive> select * from employee;
    1    NULL    NULL
    2    NULL    NULL
At this point, Hive's NULL-related functions and predicates, such as nvl, coalesce, and is null, treat these values as NULL:
    hive> select nvl(empid,'A'),nvl(deptid,'B'),nvl(salary,'C') from employee;
    1    B    C
    2    B    C
However, the strings 'null', 'NULL', and '' are treated by Hive as ordinary strings.
    hive> insert into table employee select '3','','' from employee limit 1;
Query:
    hive> select * from employee;
    1    NULL    NULL
    2    NULL    NULL
    3
    hive> insert into table employee select '4','null','NULL' from employee limit 1;
Query:
    hive> select * from employee;
    1    NULL    NULL
    2    NULL    NULL
    3
    4    null    NULL
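Note: the NULLs in rows 1 and 2 are not the same as the NULL and null in row 4. The values in row 4 are strings.
For '' and for the strings 'null' and 'NULL', Hive's NULL-related functions and predicates (nvl, coalesce, is null, and so on) do not consider the value NULL: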
    hive> select empid ,nvl(deptid,'E'),nvl(salary,'F') from employee;
    1    E    F
    2    E    F
    3
    4    null    NULL

    hive> select * from employee where deptid='';
    3

    hive> select * from employee where deptid='null' and salary ='NULL';
    4    null    NULL

    hive> select * from employee where deptid is null;
    1    NULL    NULL
    2    NULL    NULL
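When the table's NULL format cannot be changed, one option is to normalize the three representations in the query itself. The following is only a sketch against the employee table above; the deptid_clean and salary_clean aliases are illustrative, and whether the literal strings 'null'/'NULL' should really count as missing depends on the data.

```sql
-- Treat real NULLs, empty strings, and the literal strings 'null' / 'NULL'
-- as missing, and substitute the same defaults that nvl() used above.
select empid,
       if(deptid is null or trim(deptid) = '' or lower(deptid) = 'null',
          'E', deptid) as deptid_clean,
       if(salary is null or trim(salary) = '' or lower(salary) = 'null',
          'F', salary) as salary_clean
from employee;
```

With the four rows shown above, every row should come back with E and F, because each of rows 1 to 4 matches one of the three conditions.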
The NULL descriptor can be changed with:
    ALTER TABLE table_name SET SERDEPROPERTIES('serialization.null.format' = '<null descriptor>');
For example, to define '' as NULL:
    ALTER TABLE employee SET SERDEPROPERTIES('serialization.null.format' = '');
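Query employee:
    hive> select * from employee;
    1    \N      \N
    2    \N      \N
    3    NULL    NULL
    4    null    NULL
Compared with the earlier select, '' is now read as NULL, while \N shows up as the literal string it really is. The strings NULL and null in row 4 are unchanged.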
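The ALTER statement changes only table metadata. The files in HDFS are not rewritten, which is why the previously stored \N values now surface as plain strings. To confirm which NULL descriptor a table currently uses, the SerDe properties can be inspected. A small sketch, assuming the property has been set explicitly as above:

```sql
-- The storage section of the output lists the SerDe parameters,
-- including serialization.null.format once it has been set.
describe formatted employee;

-- The generated DDL also shows the SerDe properties.
show create table employee;
```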
To verify, insert '' into employee:
    hive> insert into table employee select '5','','' from employee limit 1;
Query:
    hive> select * from employee;
    1    \N      \N
    2    \N      \N
    3    NULL    NULL
    4    null    NULL
    5    NULL    NULL
Note: the NULLs in rows 3 and 5 are not the same as the NULL and null in row 4. The values in row 4 are strings.
The data is now stored in HDFS like this:
    1    \N    \N
    2    \N    \N
    3
    4    null    NULL
    5
And now:
    hive> select empid ,nvl(deptid,'E'),nvl(salary,'F') from employee;
    1    \N      \N
    2    \N      \N
    3    E       F
    4    null    NULL
    5    E       F
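Summary: being able to define what counts as NULL in Hive matters for the following reason. When data is exported from Oracle, NULLs in the source table become '', and after import into Hive they are still ''. Hive's NULL-related functions and predicates such as nvl, coalesce, and is null then do not work on them, because by default Hive treats only \N as NULL. Running
    ALTER TABLE table_name SET SERDEPROPERTIES('serialization.null.format' = '');
in Hive makes '' represent NULL, so nvl and similar expressions do not need to be rewritten when migrating stored procedures.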
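For a table created specifically to receive such an export, the same NULL descriptor can be declared at creation time instead of being altered afterwards. A minimal sketch, assuming a Hive version that supports NULL DEFINED AS; the table name employee_import and the tab delimiter are assumptions, not part of the original setup:

```sql
-- Hypothetical landing table for the Oracle export.
-- NULL DEFINED AS '' makes empty fields read back as NULL from the start.
create table employee_import (
    empid  string,
    deptid string,
    salary string
)
row format delimited
    fields terminated by '\t'
    null defined as ''
stored as textfile;

-- After loading the exported file, empty fields should satisfy IS NULL.
select count(*) from employee_import where deptid is null;
```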