[Spark] Upgrading from Spark 2 to Spark 3: notes on package changes in Spark 3

Reposted article, 2021-04-13 18:01. Tag: spark

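Background:

Spark 3 adds dynamic partition pruning, so we are trying to upgrade from Spark 2 to Spark 3.

Current version: Spark 2.4.1, Scala 2.11.12

Target version: Spark 3.1.1, Scala 2.12.13

Exceptions encountered:

Exception 1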

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport

The offending dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>2.4.1</version>
</dependency>

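After the fix (the connector is moved to a Spark 3 build; in general its version should match the Spark version actually deployed):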

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

Cause:

In Spark 3.0, the classes registered for org.apache.spark.sql.sources.DataSourceRegister and loaded through ServiceLoader are:

org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2
org.apache.spark.sql.execution.datasources.noop.NoopDataSource
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat

For comparison, the list in Spark 2 was:

org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider

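Some of the sources have changed. Tracing further, the v2 package under org/apache/spark/sql/sources no longer exists in Spark 3, so a Spark 2.4.1 connector that still mixes in traits from that package cannot be loaded and fails with NoClassDefFoundError.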
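
To reproduce the two lists above on your own classpath, the registration files can be read directly. The following is a minimal sketch and not part of the original notes: the object name ListDataSourceProviders and its providers helper are made up for illustration, and it relies only on the standard ServiceLoader convention that implementations are listed in META-INF/services/<interface name>. It is written for Scala 2.12, where scala.collection.JavaConverters is not deprecated.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

// Hypothetical helper (not part of Spark): lists the provider class names that
// jars on the classpath register in a given ServiceLoader service file.
object ListDataSourceProviders {
  // Service file that ServiceLoader consults for DataSourceRegister implementations.
  val ServiceFile =
    "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"

  def providers(serviceFile: String, loader: ClassLoader): List[String] =
    loader.getResources(serviceFile).asScala.toList.flatMap { url =>
      val reader = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))
      try {
        Iterator.continually(reader.readLine())
          .takeWhile(_ != null)
          .map(line => line.takeWhile(_ != '#').trim) // drop comments and padding
          .filter(_.nonEmpty)
          .toList // materialize before the reader is closed
      } finally reader.close()
    }

  def main(args: Array[String]): Unit =
    providers(ServiceFile, getClass.getClassLoader).foreach(println)
}
```

Running it once with the Spark 2 jars and once with the Spark 3 jars on the classpath, then diffing the output, should show which providers were renamed or dropped.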

KafkaSourceProvider in Spark 2:

private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with StreamWriteSupport
    with ContinuousReadSupport
    with MicroBatchReadSupport
    with Logging {
  import KafkaSourceProvider._

KafkaSourceProvider in Spark 3:

private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with SimpleTableProvider
    with Logging {
  import KafkaSourceProvider._
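
A quick way to confirm this kind of mismatch before submitting a job is to probe the runtime classpath for the class named in the exception. This is only a sketch under assumptions: ClasspathCheck and isPresent are hypothetical names, and the probe only reports whether a class can be located, not which jar should have provided it.

```scala
// Hypothetical diagnostic helper (not part of Spark).
object ClasspathCheck {
  // True if the named class can be located by this object's class loader.
  // initialize = false avoids running static initializers during the probe.
  def isPresent(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException | _: LinkageError => false
    }

  def main(args: Array[String]): Unit = {
    val probes = Seq(
      "org.apache.spark.sql.sources.v2.StreamWriteSupport", // named in Exception 1
      "org.apache.spark.sql.sources.DataSourceRegister"     // listed for both versions above
    )
    probes.foreach(name => println(s"$name -> ${isPresent(name)}"))
  }
}
```

If the first probe reports false while a connector built against Spark 2.4 is still on the classpath, the connector version is the likely culprit.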

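Exception 2

The Spark connector currently provided by Vertica does not yet support Spark 3.0, so that integration has to be re-implemented on top of JDBC.

Exception 3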

java.lang.String cannot be cast to java.time.ZonedDateTime

Source of the exception:

<dependency>
    <groupId>com.github.housepower</groupId>
    <artifactId>clickhouse-integration-spark_2.12</artifactId>
    <version>2.5.4</version>
</dependency>

Table definition:

create table default.zwy_test (time DateTime,AMP Float64,NOZP Int32,value Int32,reason String ) ENGINE = MergeTree order by time

Schema of the data being written:

root
 |-- time: string (nullable = true)
 |-- AMP: double (nullable = true)
 |-- NOZP: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- reason: string (nullable = true)

Cause:

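In Spark 3.0, when a value is inserted into a table column of a different data type, type coercion follows the ANSI SQL standard (the behaviour is controlled by spark.sql.storeAssignmentPolicy, which defaults to ANSI in 3.0). Under the conversion rules of standard SQL, String to date/time is no longer an implicit conversion, whereas Spark 2 converted String to a date type automatically. After upgrading from Spark 2 to Spark 3, String values therefore have to be converted explicitly, with functions such as from_utc_timestamp or to_timestamp.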
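
As an illustration of such an explicit conversion, the sketch below turns the string column time from the schema above into a timestamp column with to_timestamp before the write. Several things here are assumptions rather than facts from the original notes: the object and method names are made up, the pattern yyyy-MM-dd HH:mm:ss is a guess at the input format, the sample row is invented, and whether the ClickHouse connector then accepts the column has not been verified. It needs spark-sql on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical example object (not from the original post).
object ExplicitTimestampCast {
  // Replaces the string column "time" with a TimestampType column
  // parsed using the assumed pattern.
  def withTimestamp(df: DataFrame): DataFrame =
    df.withColumn("time", to_timestamp(col("time"), "yyyy-MM-dd HH:mm:ss"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-timestamp-cast")
      .getOrCreate()
    import spark.implicits._

    // Invented sample row with the same column names as the schema above.
    val raw = Seq(("2021-04-13 18:01:00", 1.5, 2, 3, "demo"))
      .toDF("time", "AMP", "NOZP", "value", "reason")

    // The "time" column is now a timestamp instead of a string.
    withTimestamp(raw).printSchema()
    spark.stop()
  }
}
```

Doing the conversion in the DataFrame keeps the intended type visible in the job itself instead of depending on coercion at insert time.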

Change 1

The JDBC data source in Spark 3 adds the keytab and principal options, so it now supports Kerberos.
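
A minimal sketch of how these two options might be passed alongside the usual JDBC settings follows. KerberosJdbcOptions and build are hypothetical names, and the URL, table, keytab path and principal are placeholders, not values from the original notes. The Spark JDBC documentation lists further requirements (for example driver support for Kerberos) that are not covered here.

```scala
// Hypothetical helper (not part of Spark): collects JDBC data source options,
// including the Kerberos-related "keytab" and "principal" keys.
object KerberosJdbcOptions {
  def build(url: String, table: String,
            keytab: String, principal: String): Map[String, String] =
    Map(
      "url" -> url,
      "dbtable" -> table,
      "keytab" -> keytab,      // location of the Kerberos keytab file
      "principal" -> principal // Kerberos principal used by the JDBC client
    )

  def main(args: Array[String]): Unit = {
    val opts = build(
      url = "jdbc:postgresql://db.example.com:5432/demo", // placeholder
      table = "public.events",                            // placeholder
      keytab = "/etc/security/keytabs/etl.keytab",        // placeholder
      principal = "etl@EXAMPLE.COM")                      // placeholder
    opts.foreach { case (k, v) => println(s"$k=$v") }
    // With a SparkSession named spark in scope, the map can be handed over as:
    //   spark.read.format("jdbc").options(opts).load()
  }
}
```

Keeping the options in a plain Map makes it easy to share them between reads and writes.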

Change notes for Spark 2 to 3: https://spark.apache.org/docs/3.0.0/core-migration-guide.html
