Hands-On: Integrating PySpark with Apache Hudi


1. Preparation

Hudi supports Spark 2.x. Install Spark, then start pyspark with the Hudi and spark-avro bundles as follows:

# pyspark
export PYSPARK_PYTHON=$(which python3)
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
  • the spark-avro module must be specified explicitly via --packages
  • the spark-avro version must match your Spark version
  • this example depends on spark-avro_2.11, so it uses the hudi-spark-bundle built for Scala 2.11; if you use spark-avro_2.12, switch to hudi-spark-bundle_2.12 accordingly (a quick version check is sketched below)
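
Not sure which Scala build your Spark distribution ships with? A quick check from a running pyspark session can help you pick between the _2.11 and _2.12 bundles. This is my addition, not part of the original quickstart; the second line is a py4j call into the JVM and assumes the usual sc SparkContext handle.

# pyspark
print(spark.version)                                  # e.g. 2.4.4
print(sc._jvm.scala.util.Properties.versionString())  # e.g. version 2.11.12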

Next, initialize a few variables:

# pyspark
tableName = "hudi_trips_cow"
basePath = "file:///tmp/hudi_trips_cow"
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

The DataGenerator can generate sample inserts and updates that follow the trip schema.
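
As an optional sanity check (not part of the quickstart), you can materialize a couple of generated records and inspect the trip schema they follow. peekGen below is a throwaway generator I create so dataGen's internal key state is left untouched:

# pyspark
# Use a separate generator (peekGen) so dataGen's state is not disturbed.
peekGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
sample = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(peekGen.generateInserts(2))
spark.read.json(spark.sparkContext.parallelize(sample, 1)).printSchema()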

2. Insert Data

Generate some new trip records, load them into a DataFrame, and write the DataFrame to the Hudi table:

# pyspark
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'insert',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2
}

df.write.format("hudi"). \
  options(**hudi_options). \
  mode("overwrite"). \
  save(basePath)

mode(Overwrite) overwrites and recreates the table if it already exists. The example configures a record key (uuid in the schema), a partition path field (region/country/city), and a precombine field (ts in the schema) to ensure trip records are unique within each partition.
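
A minimal way to verify this claim (my addition, assuming the write above succeeded) is to read the table back and confirm that no record key appears more than once within a partition:

# pyspark
from pyspark.sql import functions as F

# Every (_hoodie_partition_path, _hoodie_record_key) pair should appear exactly once.
spark.read.format("hudi").load(basePath + "/*/*/*/*") \
  .groupBy("_hoodie_partition_path", "_hoodie_record_key").count() \
  .filter(F.col("count") > 1) \
  .show()  # expected: an empty result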

3. Query Data

Load the data into a DataFrame:

# pyspark
tripsSnapshotDF = spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*")

tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

This query provides a read-optimized view of the ingested data. Since our partition path has the format region/country/city, three levels below the base path (basePath), we use load(basePath + "/*/*/*/*") to load the data.
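
As a quick illustration (my addition), list the distinct partition paths to see the three-level layout that this glob pattern matches:

# pyspark
# The sample data lands in three-level partitions such as americas/united_states/san_francisco.
tripsSnapshotDF.select("_hoodie_partition_path").distinct().show(truncate=False)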

4. Update Data

Similar to inserting new data, use the DataGenerator to generate updates, then write them to the Hudi table via a DataFrame:

# pyspark
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi"). \
  options(**hudi_options). \
  mode("append"). \
  save(basePath)

Note that the save mode is now append. In general, always use append mode unless you are creating the table for the first time. Each write operation generates a new commit identified by a timestamp.
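
One way to see those commits (a sketch I've added that assumes the local-filesystem basePath from this example; paths differ on HDFS or object stores) is to list the timeline files Hudi keeps under the table's .hoodie directory:

# pyspark
import os

# Each successful write adds a <timestamp>.commit file to the table's timeline.
print(sorted(f for f in os.listdir("/tmp/hudi_trips_cow/.hoodie") if f.endswith(".commit")))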

5. Incremental Query

Hudi supports incremental pulls: you can fetch only the changes made after a specified commit time, and if no end time is given, the latest changes are pulled.

# pyspark
# reload data
spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*"). \
  createOrReplaceTempView("hudi_trips_snapshot")

commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(50).collect()))
beginTime = commits[len(commits) - 2] # commit time we are interested in

# incrementally query data
incremental_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': beginTime,
}

tripsIncrementalDF = spark.read.format("hudi"). \
  options(**incremental_read_options). \
  load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()

This queries all changes committed after the begin time. This incremental pull capability makes it possible to build streaming pipelines on top of batch data.
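
As a small illustration (my addition), compare the incremental result with the full snapshot; the incremental view should only contain rows written by commits after beginTime:

# pyspark
# The incremental view holds only records touched by commits after beginTime,
# typically a subset of the full snapshot.
print("incremental:", tripsIncrementalDF.count())
print("snapshot:   ", spark.table("hudi_trips_snapshot").count())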

6. Point-in-Time Query

To query data as of a specific point in time, set the end time to the desired commit time and the begin time to "000" (representing the earliest possible commit time).

# pyspark
beginTime = "000" # Represents all commits > this time.
endTime = commits[len(commits) - 2]

# query point in time data
point_in_time_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.end.instanttime': endTime,
  'hoodie.datasource.read.begin.instanttime': beginTime
}

tripsPointInTimeDF = spark.read.format("hudi"). \
  options(**point_in_time_read_options). \
  load(basePath)

tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()

7. Delete Data

Delete the records for the given set of HoodieKeys. Note: the delete operation only supports append mode.

# pyspark
# fetch total records count
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
# fetch two records to be deleted
ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)

# issue deletes
hudi_delete_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'delete',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2
}

from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
df.write.format("hudi"). \
  options(**hudi_delete_options). \
  mode("append"). \
  save(basePath)

# run the same read query as above.
roAfterDeleteViewDF = spark. \
  read. \
  format("hudi"). \
  load(basePath + "/*/*/*/*") 
roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()

8. Summary

This post showed how to insert, update, and delete data in a Hudi table using pyspark. If you work with pyspark and Hudi, give it a try!

