A new requirement came in: add instrumentation in Spark to extract the table-level lineage of each job, covering both SQL and code-based (DataFrame) jobs, but excluding intermediate temporary views created with createOrReplaceTempView.
A colleague had already covered the Hive side (https://www.cnblogs.com/wuxilc/p/9326130.html), but that approach does not work for Spark jobs, because Hive's ParseDriver cannot parse Spark SQL.
So the hook has to go into Spark itself.
Start from Spark's BaseSessionStateBuilder class:
/**
 * Build the [[SessionState]].
 */
def build(): SessionState = {
  new SessionState(
    session.sharedState,
    conf,
    experimentalMethods,
    functionRegistry,
    udfRegistration,
    () => catalog,
    sqlParser,
    () => analyzer,
    () => optimizer,
    planner,
    streamingQueryManager,
    listenerManager,
    () => resourceLoader,
    createQueryExecution,
    createClone)
}
optimizer here is Spark's logical-plan (Catalyst) optimizer, and HiveSessionStateBuilder extends BaseSessionStateBuilder.
All Spark optimizers extend the abstract class Optimizer:
/**
* Abstract class all optimizers should inherit of, contains the standard batches (extending
* Optimizers can override this.
*/
abstract class Optimizer(sessionCatalog: SessionCatalog)
  extends RuleExecutor[LogicalPlan] {
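To see what the patch plugs into: RuleExecutor drives a sequence of Batch entries, each pairing a name, a strategy (Once or FixedPoint(n)) and one or more Rule instances, and execute() applies them to a plan in order. A minimal, purely illustrative sketch against the Spark 2.x Catalyst API (LogPlanRule and DemoExecutor are made-up names):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// A rule that only inspects the plan and hands it back unchanged.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"visiting plan:\n${plan.treeString}")
    plan
  }
}

// A stand-alone executor with a single run-once batch, the same shape as the
// Batch(...) entries appended to the optimizer below; DemoExecutor.execute(plan)
// applies the batch to a logical plan.
object DemoExecutor extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Batch("Log plan", Once, LogPlanRule) :: Nil
}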
Register a new rule batch in the optimizer by overriding it in the HiveSessionStateBuilder class:
override lazy val optimizer: Optimizer = new SparkOptimizer(catalog, conf, experimentalMethods) {
  override def batches: Seq[Batch] = super.batches :+
    Batch("Determine stats of partitionedTable", Once,
      DeterminePartitionedTableStats(sparkSession)) :+
    Batch("Collect read and write tables", Once, DependencyCollect(sparkSession))
}
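As an aside: if rebuilding Spark is not an option, Spark 2.x also lets application code append optimizer rules through spark.experimental.extraOptimizations, which run in the optimizer's "User Provided Optimizers" batch. A rough sketch under that assumption (note that the DependencyCollect rule defined below touches sparkContext.listenerBus, which is private[spark], so as written it would still have to be compiled inside a Spark-internal package):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Extra rules run under a fixed-point strategy rather than Once, so a side-effecting
// rule registered this way should be idempotent.
spark.experimental.extraOptimizations ++= Seq(DependencyCollect(spark))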
Then add the rule itself, which matches INSERT and CREATE statements (plus plain table scans) and extracts the input and output tables:
import scala.collection.mutable

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoTable, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.command.CreateTableCommand
import org.apache.spark.sql.execution.datasources.CreateTable
import org.apache.spark.sql.hive.execution.InsertIntoHiveTable

case class DependencyCollect(sparkSession: SparkSession) extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = {
    // Collection can be switched off per job with spark.collectDependencies=false.
    if (sparkSession.sparkContext.conf.getBoolean("spark.collectDependencies", true)) {
      val readTables = mutable.HashSet[String]()
      val writeTables = mutable.HashSet[String]()
      // Walk the plan top-down; every case returns the node unchanged, so the rule
      // only observes the plan and never rewrites it.
      plan transformDown {
        case a @ InsertIntoHiveTable(table: CatalogTable, _, _, _, _, _) =>
          writeTables += s"${fillBlankDatabase(table)}.${table.identifier.table}"
          a
        case i @ InsertIntoTable(table: HiveTableRelation, _, _, _, _) =>
          writeTables += s"${table.tableMeta.database}.${table.tableMeta.identifier.table}"
          i
        case c @ CreateTable(table: CatalogTable, _, _) =>
          writeTables += s"${fillBlankDatabase(table)}.${table.identifier.table}"
          c
        case d @ CreateTableCommand(table: CatalogTable, _) =>
          writeTables += s"${fillBlankDatabase(table)}.${table.identifier.table}"
          d
        case p @ PhysicalOperation(_, _, table: HiveTableRelation) =>
          readTables += s"${table.tableMeta.database}.${table.tableMeta.identifier.table}"
          p
      }
      if (readTables.nonEmpty || writeTables.nonEmpty) {
        logInfo(s"src table -> ${readTables.mkString(",")} target table -> ${writeTables.mkString(",")}")
        // AsyncExecution and CallChain are in-house reporting helpers; their imports are omitted here.
        AsyncExecution.AsycnHandle(new CallChain.Event(
          s"${readTables.mkString(",")}#${writeTables.mkString(",")}",
          AsyncExecution.getSparkAppName(sparkSession.sparkContext.conf),
          "bloodlineage"))
        // listenerBus is private[spark], so this class has to be compiled inside Spark.
        sparkSession.sparkContext.listenerBus.post(DependencyEvent(readTables, writeTables))
      }
    }
    plan
  }

  // Tables in unqualified statements carry no database; fall back to the session's current one.
  private def fillBlankDatabase(table: CatalogTable): String =
    table.identifier.database.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
}
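DependencyEvent, AsyncExecution and CallChain are in-house classes that are not shown in this post. For the listener-bus half, a rough sketch of what the event and a SparkListener consuming it could look like (class and handler names here are assumptions, not the real implementation):

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Custom events posted to the listener bus must extend SparkListenerEvent.
case class DependencyEvent(readTables: mutable.HashSet[String],
                           writeTables: mutable.HashSet[String]) extends SparkListenerEvent

// Receives the event and forwards it, e.g. to a log line or an external lineage service.
class DependencyListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case DependencyEvent(reads, writes) =>
      println(s"lineage: ${reads.mkString(",")} -> ${writes.mkString(",")}")
    case _ => // not interested in other events
  }
}

// Registered once per application, e.g. right after the SparkSession is created:
// spark.sparkContext.addSparkListener(new DependencyListener)

With the rule in place, collection is on by default and can be turned off per job with --conf spark.collectDependencies=false.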