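Background:
Spark 3 adds dynamic partition pruning, so we are attempting to upgrade from Spark 2 to Spark 3.
Current version: Spark 2.4.1, Scala 2.11.12
Target version: Spark 3.1.1, Scala 2.12.13
Exception log:
- Exception 1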
java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
Offending dependency:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>2.4.1</version>
</dependency>
After the fix:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.0.0</version>
</dependency>
Cause:
In Spark 3.0, the classes that ServiceLoader loads for org.apache.spark.sql.sources.DataSourceRegister are:
org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2
org.apache.spark.sql.execution.datasources.noop.NoopDataSource
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat
For comparison, the list in Spark 2 was:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
Some of the sources have changed. Tracing further, the v2 package under org/apache/spark/sql/sources no longer exists in Spark 3, so the 2.4.1 Kafka connector, which implements StreamWriteSupport from that package, fails with NoClassDefFoundError.
KafkaSourceProvider in Spark 2:
private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with StreamWriteSupport
    with ContinuousReadSupport
    with MicroBatchReadSupport
    with Logging {
  import KafkaSourceProvider._
KafkaSourceProvider in Spark 3:
private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with SimpleTableProvider
    with Logging {
  import KafkaSourceProvider._
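A quick way to confirm this kind of mismatch is to probe the classpath for the class named in the error. The following is a minimal sketch using only the JVM standard library; the object name `ClasspathProbe` and the helper `isOnClasspath` are hypothetical names introduced for illustration.

```scala
object ClasspathProbe {
  // Returns true if the class can be loaded by this class loader.
  // The class is located but not initialized (second argument = false).
  def isOnClasspath(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      // ClassNotFoundException: the class itself is missing.
      // LinkageError (includes NoClassDefFoundError): a class it depends on is missing.
      case _: ClassNotFoundException | _: LinkageError => false
    }

  def main(args: Array[String]): Unit = {
    val probes = Seq(
      "org.apache.spark.sql.sources.v2.StreamWriteSupport", // interface named in the error
      "org.apache.spark.sql.sources.DataSourceRegister"     // service interface discussed above
    )
    probes.foreach(c => println(s"$c -> ${isOnClasspath(c)}"))
  }
}
```

Run with the application's runtime classpath: if the first probe reports false while the Kafka connector jar is still the 2.4.1 build, the connector and Spark versions are out of step.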
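- Exception 2
The Spark connector currently provided by Vertica does not yet support Spark 3.0, so that integration has to be reimplemented on top of JDBC.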
- Exception 3
java.lang.String cannot be cast to java.time.ZonedDateTime
Source of the exception:
<dependency>
  <groupId>com.github.housepower</groupId>
  <artifactId>clickhouse-integration-spark_2.12</artifactId>
  <version>2.5.4</version>
</dependency>
Table definition:
create table default.zwy_test (time DateTime,AMP Float64,NOZP Int32,value Int32,reason String ) ENGINE = MergeTree order by time
Schema of the data being written:
root
 |-- time: string (nullable = true)
 |-- AMP: double (nullable = true)
 |-- NOZP: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- reason: string (nullable = true)
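Cause:
In Spark 3.0, when a value is inserted into a table column of a different data type, type coercion follows the ANSI SQL standard. Under the standard SQL conversion rules, String to date/time is no longer an implicit conversion, whereas Spark 2 converted String to a date type automatically. When upgrading from Spark 2 to Spark 3, String columns therefore have to be converted explicitly, with functions such as to_timestamp or from_utc_timestamp.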
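The explicit conversion could look roughly like the sketch below. It assumes the `time` strings follow the pattern `yyyy-MM-dd HH:mm:ss` (the original does not state the format), uses a made-up sample row, and the object and method names are hypothetical. Whether this alone is enough for the ClickHouse connector has not been verified here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

object ExplicitTimestampCast {
  // Replaces the string column `time` with a timestamp column.
  // The pattern is an assumption about the input format.
  def withTimestamp(df: DataFrame): DataFrame =
    df.withColumn("time", to_timestamp(col("time"), "yyyy-MM-dd HH:mm:ss"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-timestamp-cast")
      .getOrCreate()
    import spark.implicits._

    // Sample row shaped like the schema above; the values are invented.
    val raw = Seq(("2021-03-01 12:00:00", 1.5, 2, 3, "demo"))
      .toDF("time", "AMP", "NOZP", "value", "reason")

    val typed = withTimestamp(raw)
    typed.printSchema() // `time` is now of type timestamp instead of string

    spark.stop()
  }
}
```

Doing the conversion in the DataFrame before the write keeps the target table definition unchanged.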
- Change 1
The JDBC data source in Spark 3 adds the keytab and principal options, so it now supports Kerberos authentication.
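As an illustration, the two options are passed like any other JDBC option. The sketch below only builds the option map, so it runs without a database; the URL, table name, keytab path, and principal are all placeholders, and `KerberosJdbcOptions.build` is a hypothetical helper, not part of Spark.

```scala
object KerberosJdbcOptions {
  // Builds the option map for Spark's JDBC data source.
  // "url" and "dbtable" are the usual JDBC options; "keytab" and "principal"
  // are the Kerberos options mentioned above.
  def build(url: String, table: String, keytab: String, principal: String): Map[String, String] =
    Map(
      "url" -> url,
      "dbtable" -> table,
      "keytab" -> keytab,
      "principal" -> principal
    )

  def main(args: Array[String]): Unit = {
    val opts = build(
      url = "jdbc:postgresql://db.example.com:5432/demo", // placeholder URL
      table = "public.events",                            // placeholder table
      keytab = "/etc/security/keytabs/etl.keytab",        // placeholder path
      principal = "etl@EXAMPLE.COM"                       // placeholder principal
    )
    opts.foreach { case (k, v) => println(s"$k=$v") }
    // With a SparkSession `spark` in scope, the map would be used as:
    //   spark.read.format("jdbc").options(opts).load()
  }
}
```

Keeping the options in a plain map makes it easy to share the same settings between reads and writes.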
Change log for Spark 2 to Spark 3: https://spark.apache.org/docs/3.0.0/core-migration-guide.html