Implementing an Auto-Increment Sequence in Hive
When processing data in a data warehouse, a common requirement is to add an auto-increment column to a Hive table (for example, the surrogate key that links a fact table to a dimension table). Unlike an RDBMS such as MySQL, Hive does not provide auto-increment primary keys natively, but the same effect can be achieved with functions: either the row_number() window function or UDFRowSequence.
Example: table_src holds intermediate data produced by our business logic. We want to add an auto-increment column auto_increment_id to it and save the result to table_dest.
1. Using the row_number function
Scenario 1: table_dest is currently empty
insert into table table_dest select row_number() over(order by 1) as auto_increment_id, table_src.* from table_src;
Scenario 2: table_dest already contains data, with the auto-increment column populated
insert into table table_dest
select (row_number() over(order by 1) + dest.max_id) auto_increment_id, src.*
from table_src src
cross join (select max(auto_increment_id) max_id from table_dest) dest;
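The max-offset logic behind scenario 2 can be sketched outside of Hive. Below is a minimal Python simulation (table contents are hypothetical): the new rows get row numbers starting at 1, shifted by the current maximum id, just like row_number() over(...) + dest.max_id.

```python
# Simulate appending to a destination table that already has auto-increment ids,
# mirroring: row_number() over(order by 1) + max(auto_increment_id).
table_dest = [(1, "a"), (2, "b"), (3, "c")]  # (auto_increment_id, payload)
table_src = ["d", "e"]                       # new rows to append

# The cross-joined subquery: select max(auto_increment_id) from table_dest
max_id = max(row_id for row_id, _ in table_dest)

# row_number() numbers the source rows 1..n; adding max_id keeps ids contiguous
new_rows = [(rn + max_id, payload)
            for rn, payload in enumerate(table_src, start=1)]
table_dest.extend(new_rows)
print(table_dest)  # ids run 1..5 with no gaps or duplicates
```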
2. Using UDFRowSequence
First make sure the hive-contrib jar is on the Hive classpath, then run:
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Scenario 1 above can then be implemented as:
insert into table table_dest select row_sequence() auto_increment_id, table_src.* from table_src;
Scenario 2 works the same way and is not repeated here.
However, note the difference between the two approaches:
row_number() processes the entire dataset in a single ordered pass, so within that query the generated sequence is contiguous and unique.
UDFRowSequence numbers rows per task, but a single SQL statement may spawn more than one concurrent job, and each job starts counting from 1, so the sequence is not guaranteed to be globally unique. One remedy is to extend UDFRowSequence to manage the counter in an external (third-party) storage system, which yields a globally contiguous, unique sequence.
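The difference can be illustrated with a small Python simulation (the job and row counts are made up): when each concurrent job restarts its counter at 1, the combined result contains duplicate ids, whereas a single global counter over the whole dataset does not.

```python
from itertools import count

rows_per_job = [3, 2]  # two concurrent jobs, hypothetical row counts

# UDFRowSequence-style: each job keeps its own counter starting at 1
per_job_ids = [list(range(1, n + 1)) for n in rows_per_job]
flat = [i for ids in per_job_ids for i in ids]
print(sorted(flat))   # ids 1 and 2 appear twice -> not globally unique

# row_number()-style: one counter over the whole (ordered) dataset
global_counter = count(1)
global_ids = [next(global_counter) for n in rows_per_job for _ in range(n)]
print(global_ids)     # contiguous and globally unique
```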
Hive Metadata Issues
The notes below are based on Hive 2.x.
Hive starts normally, but running show databases fails with the following error:
SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
The exception message indicates that the problem lies with the metastore (metadata).
By default Hive stores its metadata in Derby, but Derby has too many drawbacks as a metastore backend, so the metadata is usually kept in MySQL instead. Make sure the MySQL settings in hive-site.xml are correct, the MySQL JDBC driver jar is available to Hive, and the configured user has the necessary MySQL privileges.
First configure the MySQL connection in hive-site.xml:
<configuration>
<property>
<!-- MySQL database that stores the Hive metadata -->
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<!-- MySQL username -->
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<!-- MySQL password -->
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
</configuration>
After that, if Hive is configured to use a standalone metastore, the service still needs to be started:
nohup hive --service metastore &
To start hiveserver2, run:
nohup hive --service hiveserver2 &
At this point, if the MySQL database you configured for the metastore has not been schema-initialized, you may see exceptions like the following:
-- Exception 1
Exception in thread "main" MetaException(message:Version information not found in metastore. )
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: MetaException(message:Version information not found in metastore. )
-- Exception 2
MetaException(message:Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables")
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
In that case, the following also needs to be configured in hive-site.xml:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures proper metastore schema migration. (Default);
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
Then run schematool -initSchema -dbType mysql to initialize the Hive metadata. Output like the following indicates that initialization completed; the generated tables can be seen in the MySQL metadata database.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/soft/apache-hive-2.3.7-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/soft/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
Finally, restart bin/hive and test with show databases, table creation, queries, and other SQL statements; everything should now run normally.
