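I. Problem Description
On a CDH 6.3.2 cluster, when a table is stored as ORC and contains NULL fields, the <> conditions in the WHERE clause do not take effect.
DDL for the first table: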
CREATE TABLE dwd.DWD_PC_INT_ZM_StockPoolComponent(
ID bigint ,
JSID bigint ,
InsertTime TIMESTAMP ,
UpdateTime TIMESTAMP ,
CalDate TIMESTAMP ,
InnerCode int ,
SecuCode string ,--(30) ,
SecuAbbr string ,--(100) ,
SecuMarket int ,
RuleCode bigint ,
RuleDesc string ,--(1000),
UDATE string,
DELDATE string
)stored as orcfile;
Data inserted into the first table:
INSERT INTO dwd.dwd_pc_int_zm_stockpoolcomponent (id,jsid,inserttime,updatetime,caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc,udate,deldate) VALUES
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',152,'000405','ST 鑫 光',90,100000000100002,'退市股票:退市日期:20040319','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',91,'000047','ST 中 僑',90,100000000100002,'退市股票:退市日期:20030530','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',76,'000033','新 都 退',90,100000000100002,'退市股票:退市日期:20170707','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',61,'000024','招商地產',90,100000000100002,'退市股票:退市日期:20151230','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',51,'000018','神城A退',90,100000000100002,'退市股票:退市日期:20200107','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614','20211021','99991231');
DDL for the second table:
CREATE TABLE tmp.TMP_DWD_PC_INT_ZM_StockPoolComponent_02(
CalDate TIMESTAMP ,
InnerCode int ,
SecuCode string ,--(30) ,
SecuAbbr string ,--(100) ,
SecuMarket int ,
RuleCode bigint ,
RuleDesc string --(1000)
)stored as orcfile;
Data inserted into the second table:
INSERT INTO tmp.tmp_dwd_pc_int_zm_stockpoolcomponent_02 (caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc) VALUES
('2021-10-17 00:00:00.0',23603,'000136','民生策略A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',309519,'000090','民生信用債A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',309520,'000089','民生信用債C',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',22892,'000068','民生轉債C',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',22891,'000067','民生轉債A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022'),
('2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920'),
('2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST'),
('2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST'),
('2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614');
The query executed:
SELECT
A.CalDate,
A.InnerCode,
A.SecuCode,
A.SecuAbbr,
A.SecuMarket,
A.RuleCode,
A.RuleDesc,
2 AS DataFlag
FROM tmp.TMP_DWD_PC_INT_ZM_StockPoolComponent_02 A
JOIN dwd.DWD_PC_INT_ZM_StockPoolComponent B ON A.CalDate=B.CalDate AND A.RuleCode=B.RuleCode AND A.SecuCode=B.SecuCode AND NVL(A.SecuMarket,0)=NVL(B.SecuMarket,0)
AND B.DELDATE='99991231'
WHERE NVL(A.InnerCode,0)<>NVL(B.InnerCode,0)
OR NVL(A.SecuAbbr,'')<>NVL(B.SecuAbbr,'')
OR NVL(A.RuleDesc,'')<>NVL(B.RuleDesc,'');
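As a sanity check of what this query should return under standard SQL semantics, the sketch below loads the same rows (only the columns the query touches) into an in-memory SQLite database and runs the same join and filter. SQLite and the table names a and b are stand-ins chosen for illustration, not part of the original setup, and COALESCE replaces NVL because SQLite has no NVL function.

```python
import sqlite3

CAL = "2021-10-17 00:00:00.0"
DELISTED = 100000000100002  # RuleCode shared by the stock rows
FUNDS = 101000000100001     # RuleCode of the fund rows that exist only in A

# B stands in for DWD_PC_INT_ZM_StockPoolComponent:
# (InnerCode, SecuCode, SecuAbbr, RuleDesc); other columns are constant.
b_rows = [
    (152, "000405", "ST 鑫 光", "退市股票:退市日期:20040319"),
    (91, "000047", "ST 中 僑", "退市股票:退市日期:20030530"),
    (76, "000033", "新 都 退", "退市股票:退市日期:20170707"),
    (61, "000024", "招商地產", "退市股票:退市日期:20151230"),
    (51, "000018", "神城A退", "退市股票:退市日期:20200107"),
    (44, "000015", "PT中浩A", "退市股票:退市日期:20011022"),
    (41, "000013", "*ST石化A", "退市股票:退市日期:20040920"),
    (23, "000007", "*ST 全新", "ST、*ST"),
    (17, "000005", "ST 星 源", "ST、*ST"),
    (11, "000003", "PT金田A", "退市股票:退市日期:20020614"),
]
# A stands in for TMP_DWD_PC_INT_ZM_StockPoolComponent_02:
# (InnerCode, SecuCode, SecuAbbr, SecuMarket, RuleCode, RuleDesc)
a_rows = [
    (23603, "000136", "民生策略A", None, FUNDS, None),
    (309519, "000090", "民生信用債A", None, FUNDS, None),
    (309520, "000089", "民生信用債C", None, FUNDS, None),
    (22892, "000068", "民生轉債C", None, FUNDS, None),
    (22891, "000067", "民生轉債A", None, FUNDS, None),
    (44, "000015", "PT中浩A", 90, DELISTED, "退市股票:退市日期:20011022"),
    (41, "000013", "*ST石化A", 90, DELISTED, "退市股票:退市日期:20040920"),
    (23, "000007", "*ST 全新", 90, DELISTED, "ST、*ST"),
    (17, "000005", "ST 星 源", 90, DELISTED, "ST、*ST"),
    (11, "000003", "PT金田A", 90, DELISTED, "退市股票:退市日期:20020614"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (caldate TEXT, innercode INT, secucode TEXT, secuabbr TEXT,
                secumarket INT, rulecode INT, ruledesc TEXT);
CREATE TABLE b (caldate TEXT, innercode INT, secucode TEXT, secuabbr TEXT,
                secumarket INT, rulecode INT, ruledesc TEXT, deldate TEXT);
""")
conn.executemany(
    "INSERT INTO a VALUES (?,?,?,?,?,?,?)",
    [(CAL, ic, sc, ab, mk, rc, rd) for ic, sc, ab, mk, rc, rd in a_rows],
)
conn.executemany(
    "INSERT INTO b VALUES (?,?,?,?,?,?,?,?)",
    [(CAL, ic, sc, ab, 90, DELISTED, rd, "99991231") for ic, sc, ab, rd in b_rows],
)

# Same join condition as the original query, with NVL written as COALESCE.
JOIN = """
FROM a JOIN b
  ON a.caldate = b.caldate AND a.rulecode = b.rulecode
 AND a.secucode = b.secucode
 AND COALESCE(a.secumarket, 0) = COALESCE(b.secumarket, 0)
 AND b.deldate = '99991231'
"""
# Same WHERE clause as the original query.
FILTER = """
WHERE COALESCE(a.innercode, 0) <> COALESCE(b.innercode, 0)
   OR COALESCE(a.secuabbr, '') <> COALESCE(b.secuabbr, '')
   OR COALESCE(a.ruledesc, '') <> COALESCE(b.ruledesc, '')
"""

joined = conn.execute("SELECT a.secucode " + JOIN + " ORDER BY a.secucode").fetchall()
changed = conn.execute("SELECT a.secucode " + JOIN + FILTER).fetchall()

print("rows that join:", [r[0] for r in joined])
print("rows that pass the <> filter:", changed)
```

In this sketch five rows join (000003, 000005, 000007, 000013, 000015) and none of them differ in InnerCode, SecuAbbr, or RuleDesc, so the filter leaves nothing. An engine that returns rows for this query on this data is therefore the one not applying the <> conditions.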
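Testing showed the following. The five rows that satisfy the join are identical in InnerCode, SecuAbbr, and RuleDesc in both tables, so the correct result of this query is empty.
1. On the CDH 6.3.2 cluster
With the Hive on Spark execution engine, the query returns rows. With Hive on MR or Spark SQL, it returns no rows. After switching the table format to text or Parquet, neither MR nor Hive on Spark returns rows.
2. On a TDH 6.0.2 cluster
The query returns no rows on the TDH cluster either.
3. After switching the CDH cluster to version 5.16.4
Neither Hive on Spark nor Hive on MR returns rows.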
II. Summary
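The root cause was not pinpointed, but the tests make it clear that Hive 2.1.1 does not support Hive on Spark well and has bugs in this area.
1. On a CDH 6.3.2 cluster, the Hive on Spark execution engine can produce wrong results when the table is stored as ORC. Hive 2.1.1, the version shipped with CDH 6.3.2, does not handle ORC tables reliably.
2. We recommend using the Parquet format when creating tables.
Cloudera is now promoting CDP, and CDH will no longer be maintained. Hive on MR and Hive on Spark are also no longer supported as of CDP 7.1; Hive on Tez will be the main engine going forward.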