0: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:int, comment:null), FieldSchema(name:grid_col_id, type:int, comment:null), FieldSchema(name:google_gri, type:int, comment:null), FieldSchema(name:google_gci, type:int, comment:null), FieldSchema(name:user_lon, type:double, comment:null), FieldSchema(name:user_lat, type:double, comment:null), FieldSchema(name:grid_type, type:int, comment:null), FieldSchema(name:grid_height, type:int, comment:null), FieldSchema(name:compute_region_name, type:string, comment:null), FieldSchema(name:antenna_0, type:string, comment:null), FieldSchema(name:antenna_1, type:string, comment:null), FieldSchema(name:antenna_2, type:string, comment:null), FieldSchema(name:antenna_3, type:string, comment:null), FieldSchema(name:antenna_4, type:string, comment:null), FieldSchema(name:antenna_5, type:string, comment:null), FieldSchema(name:antenna_6, type:string, comment:null), FieldSchema(name:scene, type:string, comment:null), FieldSchema(name:base_lon, type:double, comment:null), FieldSchema(name:base_lat, type:double, comment:null), FieldSchema(name:ssb_send_power, type:double, comment:null), FieldSchema(name:base_h_angle, type:double, comment:null), FieldSchema(name:antenna_height, type:double, comment:null), FieldSchema(name:m_vertical_angle, type:double, comment:null), FieldSchema(name:h_beam_precision, type:int, comment:null), FieldSchema(name:v_beam_precision, type:int, comment:null), FieldSchema(name:simu_spectrum, type:decimal(2,1), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
0: jdbc:hive2://master01.hadoop.dtmobile.cn:1>
The data was written to Hive with Spark 2.3's saveAsTable(); when Spark SQL writes to Hive this way, it saves the data as Parquet with Snappy compression by default. After the write completed, querying the table through Hive beeline failed with the error above, while the same query through Spark ran fine.
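The write path looked roughly like the following (a minimal sketch; gridDf is a placeholder for the DataFrame holding the grid data, and only the table name comes from the error above):

import org.apache.spark.sql.SparkSession

// Spark 2.3: write a DataFrame that contains a decimal column into Hive.
// saveAsTable() stores it as parquet + snappy by default, which is what
// produced the files that beeline later failed to read.
val spark = SparkSession.builder()
  .appName("write-cell-random-grid")   // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// gridDf is assumed to already contain the columns from the schema above,
// including simu_spectrum of type decimal(2,1).
gridDf.write
  .mode("overwrite")
  .saveAsTable("capacity.cell_random_grid_tmp2")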
The same problem turned up on Stack Overflow. The root cause is as follows:
This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.
e.g., DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.
Solution: The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue.
So we tried setting spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
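One way to apply the fix, roughly (re-using the hypothetical gridDf from the sketch above; the option can also be passed as --conf spark.sql.parquet.writeLegacyFormat=true on spark-submit):

// Make Spark write Parquet decimals the way Hive (and Spark <= 1.4) expects.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Re-write the table; the resulting parquet files can now be read from beeline.
gridDf.write
  .mode("overwrite")
  .saveAsTable("capacity.cell_random_grid_tmp2")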
Searching the Spark 2.3 source code for this parameter (spark.sql.parquet.writeLegacyFormat):
In package org.apache.spark.sql.internal, the Spark SQL default configuration, SQLConf.scala, defines it as follows:
val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
  .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
    "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
  .booleanConf
  .createWithDefault(false)
As you can see, the default value is false.
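Assuming a SparkSession named spark, the effective value can also be checked at runtime:

// Prints "false" on a stock Spark 2.3 session, "true" after the override above.
println(spark.conf.get("spark.sql.parquet.writeLegacyFormat"))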
And in package org.apache.spark.sql.execution.datasources.parquet, the class comment in ParquetWriteSupport.scala reads as follows:
/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages. This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`. The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */
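To see the two modes side by side, a rough sketch like the one below writes the same decimal(2,1) value once per mode (the output paths are made up; inspecting the footers of the resulting files with a tool such as parquet-tools should show int32 for the standard files and a fixed-length byte array for the legacy ones):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// One decimal(2,1) value, mirroring the simu_spectrum column from the table above.
val df = spark.range(1)
  .select(col("id").cast(DecimalType(2, 1)).as("simu_spectrum"))

// Standard mode (the default): small-precision decimals are stored as int32.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.mode("overwrite").parquet("/tmp/decimal_standard")

// Legacy mode: decimals are stored as fixed-length byte arrays, as Hive expects.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/decimal_legacy")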