0: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:int, comment:null), FieldSchema(name:grid_col_id, type:int, comment:null), FieldSchema(name:google_gri, type:int, comment:null), FieldSchema(name:google_gci, type:int, comment:null), FieldSchema(name:user_lon, type:double, comment:null), FieldSchema(name:user_lat, type:double, comment:null), FieldSchema(name:grid_type, type:int, comment:null), FieldSchema(name:grid_height, type:int, comment:null), FieldSchema(name:compute_region_name, type:string, comment:null), FieldSchema(name:antenna_0, type:string, comment:null), FieldSchema(name:antenna_1, type:string, comment:null), FieldSchema(name:antenna_2, type:string, comment:null), FieldSchema(name:antenna_3, type:string, comment:null), FieldSchema(name:antenna_4, type:string, comment:null), FieldSchema(name:antenna_5, type:string, comment:null), FieldSchema(name:antenna_6, type:string, comment:null), FieldSchema(name:scene, type:string, comment:null), FieldSchema(name:base_lon, type:double, comment:null), FieldSchema(name:base_lat, type:double, comment:null), FieldSchema(name:ssb_send_power, type:double, comment:null), FieldSchema(name:base_h_angle, type:double, comment:null), FieldSchema(name:antenna_height, type:double, comment:null), FieldSchema(name:m_vertical_angle, type:double, comment:null), FieldSchema(name:h_beam_precision, type:int, comment:null), FieldSchema(name:v_beam_precision, type:int, comment:null), FieldSchema(name:simu_spectrum, type:decimal(2,1), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
0: jdbc:hive2://master01.hadoop.dtmobile.cn:1>
The data was written to Hive with Spark 2.3's saveAsTable(); when Spark SQL writes to Hive this way, it saves the data as Parquet with Snappy compression by default. After the write completed, querying the table through Hive beeline failed with the error above, while the same query through Spark ran fine.
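The write path looked roughly like the following (a minimal sketch; gridDf is a placeholder for the DataFrame holding the grid data, and only the table name comes from the error above):

import org.apache.spark.sql.SparkSession

// Spark 2.3: write a DataFrame that contains a decimal column into Hive.
// saveAsTable() stores it as parquet + snappy by default, which is what
// produced the files that beeline later failed to read.
val spark = SparkSession.builder()
  .appName("write-cell-random-grid")   // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// gridDf is assumed to already contain the columns from the schema above,
// including simu_spectrum of type decimal(2,1).
gridDf.write
  .mode("overwrite")
  .saveAsTable("capacity.cell_random_grid_tmp2")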
The same problem turned up on Stack Overflow. The root cause is as follows:
This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.
e.g., DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.
Solution: The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue.
So we tried setting spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
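One way to apply the fix, roughly (re-using the hypothetical gridDf from the sketch above; the option can also be passed as --conf spark.sql.parquet.writeLegacyFormat=true on spark-submit):

// Make Spark write Parquet decimals the way Hive (and Spark <= 1.4) expects.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Re-write the table; the resulting parquet files can now be read from beeline.
gridDf.write
  .mode("overwrite")
  .saveAsTable("capacity.cell_random_grid_tmp2")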
Searching the Spark 2.3 source code for this parameter (spark.sql.parquet.writeLegacyFormat):
In package org.apache.spark.sql.internal, the Spark SQL default configuration, SQLConf.scala, defines it as follows:
val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
  .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
    "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
  .booleanConf
  .createWithDefault(false)
As you can see, the default value is false.
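Assuming a SparkSession named spark, the effective value can also be checked at runtime:

// Prints "false" on a stock Spark 2.3 session, "true" after the override above.
println(spark.conf.get("spark.sql.parquet.writeLegacyFormat"))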
And in package org.apache.spark.sql.execution.datasources.parquet, the class comment in ParquetWriteSupport.scala reads as follows:
/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages. This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`. The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */
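To see the two modes side by side, a rough sketch like the one below writes the same decimal(2,1) value once per mode (the output paths are made up; inspecting the footers of the resulting files with a tool such as parquet-tools should show int32 for the standard files and a fixed-length byte array for the legacy ones):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// One decimal(2,1) value, mirroring the simu_spectrum column from the table above.
val df = spark.range(1)
  .select(col("id").cast(DecimalType(2, 1)).as("simu_spectrum"))

// Standard mode (the default): small-precision decimals are stored as int32.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.mode("overwrite").parquet("/tmp/decimal_standard")

// Legacy mode: decimals are stored as fixed-length byte arrays, as Hive expects.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/decimal_legacy")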