Keywords: Hive HQL job count, Hive execution plan, Hive LineageInfo
This article shows how to use Hive's API to obtain the final execution plan of an HQL statement, and from it the number of jobs the statement will launch. It also shows how to use the API to analyze which input and output tables an HQL statement involves. Both pieces of information are useful for metadata management and for lineage analysis of Hive tables.
When Hive executes an HQL statement, it goes through the following steps:
- Syntax parsing: Antlr defines the SQL grammar; lexical and syntax analysis turn the SQL into an abstract syntax tree (AST Tree);
- Semantic analysis: the AST Tree is traversed and abstracted into QueryBlocks, the basic building units of a query;
- Logical plan generation: the QueryBlocks are traversed and translated into an operator tree (OperatorTree);
- Logical plan optimization: logical optimizers transform the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data;
- Physical plan generation: the OperatorTree is traversed and translated into MapReduce tasks;
- Physical plan optimization: physical optimizers transform the MapReduce tasks to produce the final execution plan.
These steps are explained very well in an article on Meituan's tech blog: http://tech.meituan.com/hive-sql-to-mapreduce.html
Normally, each table or subquery in an HQL statement produces one job in the logical plan, but Hive then applies further optimizations (for example, MapJoin), so the final number of jobs a statement generates is hard to tell just by reading the HQL.
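To make the first stage more concrete, here is a dependency-free sketch of what the AST for `SELECT COUNT(1) FROM liuxiaowen.lxw1` roughly looks like. The `TokNode` class is a toy stand-in for Hive's real `ASTNode`; the token names (TOK_QUERY, TOK_TABREF, TOK_TABNAME, ...) are the ones Hive's parser actually uses, but the exact tree shape shown is an approximation for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal tree node to illustrate the AST shape; NOT Hive's ASTNode.
class TokNode {
    final String name;
    final List<TokNode> children = new ArrayList<TokNode>();

    TokNode(String name) { this.name = name; }

    TokNode add(TokNode... kids) {
        for (TokNode k : kids) children.add(k);
        return this;
    }

    // Render the tree in the LISP-like style Hive uses when dumping an AST
    String dump() {
        if (children.isEmpty()) return name;
        StringBuilder sb = new StringBuilder("(").append(name);
        for (TokNode c : children) sb.append(' ').append(c.dump());
        return sb.append(')').toString();
    }
}

public class AstSketch {
    public static void main(String[] args) {
        // Rough shape of the AST for: SELECT COUNT(1) FROM liuxiaowen.lxw1
        TokNode tree = new TokNode("TOK_QUERY").add(
            new TokNode("TOK_FROM").add(
                new TokNode("TOK_TABREF").add(
                    new TokNode("TOK_TABNAME").add(
                        new TokNode("liuxiaowen"), new TokNode("lxw1")))),
            new TokNode("TOK_INSERT").add(
                new TokNode("TOK_SELECT").add(
                    new TokNode("TOK_SELEXPR").add(
                        new TokNode("TOK_FUNCTION").add(
                            new TokNode("count"), new TokNode("1"))))));
        System.out.println(tree.dump());
    }
}
```

The nested token structure is what the later stages (semantic analysis, plan generation) traverse, and it is also what the lineage tool at the end of this article walks when looking for table references.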
Getting the execution plan and job count of an HQL statement
Let's go straight to the code:
```java
package com.lxw1234.test;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseUtils;
import org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory;
import org.apache.hadoop.hive.ql.session.SessionState;

/**
 * lxw的大數據田地 -- lxw1234.com
 * @author lxw1234
 */
public class HiveQueryPlan {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-site.xml"));
        conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-default.xml.template"));
        conf.set("javax.jdo.option.ConnectionURL",
                "jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf-8");
        conf.set("hive.metastore.local", "true");
        conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
        conf.set("javax.jdo.option.ConnectionUserName", "hive");
        conf.set("javax.jdo.option.ConnectionPassword", "hive");
        conf.set("hive.stats.dbclass", "jdbc:mysql");
        conf.set("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver");
        conf.set("hive.exec.dynamic.partition.mode", "nonstrict");

        String command = args[0];
        SessionState.start(conf);
        Context ctx = new Context(conf);

        // Syntax parsing: turn the HQL into an AST
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(command, ctx);
        tree = ParseUtils.findRootNonNullToken(tree);

        // Semantic analysis: build and validate the optimized plan
        BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
        sem.analyze(tree, ctx);
        sem.validate();

        // Count the MapReduce tasks in the final query plan
        QueryPlan queryPlan = new QueryPlan(command, sem, 0L);
        int jobs = Utilities.getMRTasks(queryPlan.getRootTasks()).size();
        System.out.println("Total jobs = " + jobs);
    }
}
```
Package the code above as testhive.jar. Running the class requires Hive's dependency jars; on a machine with both the Hadoop and Hive clients installed, execute the following commands:

```shell
for f in /usr/local/apache-hive-0.13.1-bin/lib/*.jar; do
    HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f
done
export HADOOP_CLASSPATH
```
Parse the following three HQL statements:
- HQL1: SELECT COUNT(1) FROM liuxiaowen.lxw1;
- HQL2: SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x;
- HQL3: SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);
Parsing HQL1:

```shell
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1"
```

The result is as follows:
Parsing HQL2:

```shell
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x"
```

The result is as follows:
Parsing HQL3:

```shell
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain)"
```

The result is as follows:
In HQL3, Hive automatically optimizes the join into a MapJoin, so joining the two tables ends up using only one job. This can be verified by running the statement in Hive:
Parsing table lineage from HQL
For metadata management, you may need to know which tables exist in Hive and how they relate to one another, for example, that table A is aggregated from tables B and C.
Hive ships with a tool for analyzing the source and target tables of an HQL statement: org.apache.hadoop.hive.ql.tools.LineageInfo.
However, that class only recognizes target tables populated with INSERT statements; it cannot detect tables created with CREATE TABLE AS.
The code below makes only a small modification to org.apache.hadoop.hive.ql.tools.LineageInfo:
```java
package com.lxw1234.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Stack;
import java.util.TreeSet;

import org.apache.hadoop.hive.ql.lib.DefaultGraphWalker;
import org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher;
import org.apache.hadoop.hive.ql.lib.Dispatcher;
import org.apache.hadoop.hive.ql.lib.GraphWalker;
import org.apache.hadoop.hive.ql.lib.Node;
import org.apache.hadoop.hive.ql.lib.NodeProcessor;
import org.apache.hadoop.hive.ql.lib.NodeProcessorCtx;
import org.apache.hadoop.hive.ql.lib.Rule;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;
import org.apache.hadoop.hive.ql.parse.SemanticException;

/**
 * lxw的大數據田地 -- lxw1234.com
 * @author lxw1234
 */
public class HiveLineageInfo implements NodeProcessor {

    /** Stores the input tables of the SQL. */
    TreeSet<String> inputTableList = new TreeSet<String>();

    /** Stores the output tables of the SQL. */
    TreeSet<String> outputTableList = new TreeSet<String>();

    public TreeSet<String> getInputTableList() {
        return inputTableList;
    }

    public TreeSet<String> getOutputTableList() {
        return outputTableList;
    }

    /**
     * Implements the process method of the NodeProcessor interface.
     */
    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
            Object... nodeOutputs) throws SemanticException {
        ASTNode pt = (ASTNode) nd;
        switch (pt.getToken().getType()) {
        // TOK_CREATETABLE covers CREATE TABLE ... AS targets, which the
        // original LineageInfo does not handle
        case HiveParser.TOK_CREATETABLE:
        case HiveParser.TOK_TAB:
            outputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode) pt.getChild(0)));
            break;
        case HiveParser.TOK_TABREF:
            ASTNode tabTree = (ASTNode) pt.getChild(0);
            String tableName = (tabTree.getChildCount() == 1)
                    ? BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0))
                    : BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0))
                            + "." + tabTree.getChild(1);
            inputTableList.add(tableName);
            break;
        }
        return null;
    }

    /**
     * Parses the given query and gets the lineage info.
     *
     * @param query
     * @throws ParseException
     */
    public void getLineageInfo(String query) throws ParseException, SemanticException {
        // Get the AST and strip any wrapping null tokens
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(query);
        while ((tree.getToken() == null) && (tree.getChildCount() > 0)) {
            tree = (ASTNode) tree.getChild(0);
        }

        inputTableList.clear();
        outputTableList.clear();

        // Create a dispatcher with no rules: it falls back to this class as
        // the default processor, so process() fires on every node
        Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();
        Dispatcher disp = new DefaultRuleDispatcher(this, rules, null);

        // Create a walker that walks the tree in DFS order while maintaining
        // the operator stack, dispatching at each node
        GraphWalker ogw = new DefaultGraphWalker(disp);
        ArrayList<Node> topNodes = new ArrayList<Node>();
        topNodes.add(tree);
        ogw.startWalking(topNodes, null);
    }

    public static void main(String[] args) throws IOException, ParseException, SemanticException {
        String query = args[0];
        HiveLineageInfo lep = new HiveLineageInfo();
        lep.getLineageInfo(query);
        System.out.println("Input tables = " + lep.getInputTableList());
        System.out.println("Output tables = " + lep.getOutputTableList());
    }
}
```
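The walker/dispatcher pattern used above (DefaultGraphWalker firing a NodeProcessor as it traverses the AST) can be illustrated without any Hive dependencies. The sketch below walks a toy tree depth-first and collects the node labels matching a predicate, which is essentially what HiveLineageInfo does with TOK_TABREF/TOK_TAB nodes. All class and token names here are illustrative, and the visit order is a simplification of Hive's actual walker:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Toy tree node standing in for Hive's ASTNode
class ToyNode {
    final String token;
    final List<ToyNode> children = new ArrayList<ToyNode>();

    ToyNode(String token) { this.token = token; }

    ToyNode add(ToyNode... kids) {
        Collections.addAll(children, kids);
        return this;
    }
}

public class ToyWalker {
    // Depth-first walk; "fires" the collector on every visited node,
    // the way the dispatcher fires a NodeProcessor
    static List<String> collect(ToyNode root, Predicate<String> matches) {
        List<String> hits = new ArrayList<String>();
        Deque<ToyNode> stack = new ArrayDeque<ToyNode>();
        stack.push(root);
        while (!stack.isEmpty()) {
            ToyNode n = stack.pop();
            if (matches.test(n.token)) hits.add(n.token);
            // push children in reverse so they are visited left-to-right
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyNode tree = new ToyNode("TOK_QUERY").add(
            new ToyNode("TOK_FROM").add(new ToyNode("TOK_TABREF")),
            new ToyNode("TOK_INSERT").add(new ToyNode("TOK_TAB")));
        // Collect every table-related token in the tree
        System.out.println(collect(tree, t -> t.startsWith("TOK_TAB")));
        // → [TOK_TABREF, TOK_TAB]
    }
}
```

The real HiveLineageInfo does the same thing with more machinery: the dispatcher decides which processor fires at each node, and the processor's switch on the token type decides whether the node is an input table, an output table, or irrelevant.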
Package the program above as testhive.jar. As before, Hive's dependency jars need to be on the classpath at runtime.
Analyze the following two HQL statements:
- HQL1: CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1;
- HQL2: INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);
Run the commands:

```shell
hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1"
hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain)"
```

The analysis results:
The input and output tables in each HQL statement are parsed out correctly.