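Hive uses ANTLR to parse its SQL.
The job of a parser is to decode an unstructured string into a structured data representation (a parser is a function accepting strings as input and returning some structure as output); see the Parser_combinator article on Wikipedia.
Parsers are built in roughly two ways: with parser combinators or with a parser generator. For the differences, see Wang Yin's article "對 Parser 的誤解" (Misunderstandings About Parsers).
1. With parser combinators you write the parser by hand: a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. One hand-written example is the parser in Thrift: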
https://github.com/apache/thrift/blob/master/compiler/cpp/src/thrift/main.cc
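2. With a parser generator you describe the grammar in a dedicated specification language, and the tool turns that description into parser code automatically. Examples include ANTLR's g4 grammar files, Calcite's ftl files (FreeMarker templates that are expanded into a JavaCC grammar), the jison files used by Hue, and flex with CUP. The drawback is that the code is generated, which makes debugging harder.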
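To make the combinator idea from item 1 concrete, here is a minimal sketch in Python. It is not taken from Thrift or any of the projects mentioned here. The helper names (`literal`, `identifier`, `seq`, `alt`, `fmap`, `parse_select`) and the tiny grammar are assumptions made up for illustration. Each parser is a function from `(text, position)` to `(value, new_position)` or `None`, and the combinators build larger parsers out of smaller ones.

```python
from typing import Any, Callable, Optional, Tuple

# A parser maps (text, position) to (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]

def _skip_ws(text: str, pos: int) -> int:
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

def literal(expected: str) -> Parser:
    """Match a fixed string case-insensitively, after skipping whitespace."""
    def parse(text: str, pos: int):
        pos = _skip_ws(text, pos)
        end = pos + len(expected)
        if text[pos:end].lower() == expected.lower():
            return text[pos:end], end
        return None
    return parse

def identifier() -> Parser:
    """Match a run of letters, digits and underscores, after skipping whitespace."""
    def parse(text: str, pos: int):
        pos = _skip_ws(text, pos)
        end = pos
        while end < len(text) and (text[end].isalnum() or text[end] == "_"):
            end += 1
        return (text[pos:end], end) if end > pos else None
    return parse

def seq(*parsers: Parser) -> Parser:
    """Combinator: run parsers in order and collect their values in a list."""
    def parse(text: str, pos: int):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers: Parser) -> Parser:
    """Combinator: try each parser at the same position; first success wins."""
    def parse(text: str, pos: int):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def fmap(parser: Parser, fn: Callable[[Any], Any]) -> Parser:
    """Combinator: transform the value produced by a parser."""
    def parse(text: str, pos: int):
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return fn(value), new_pos
    return parse

# table_name := identifier '.' identifier | identifier
qualified = fmap(seq(identifier(), literal("."), identifier()),
                 lambda v: {"db": v[0], "table": v[2]})
unqualified = fmap(identifier(), lambda v: {"db": None, "table": v})
table_name = alt(qualified, unqualified)

# select_stmt := 'select' '*' 'from' table_name
select_stmt = fmap(
    seq(literal("select"), literal("*"), literal("from"), table_name),
    lambda v: {"type": "select", "columns": "*", "from": v[3]},
)

def parse_select(sql: str):
    """Return a dict for 'select * from [db.]table', or None if it does not match."""
    result = select_stmt(sql, 0)
    return result[0] if result else None
```

Calling `parse_select("select * from db1.tb1;")` turns the string into a nested dict describing the statement. The whole grammar lives in ordinary functions, which is easy to step through in a debugger. A generator-based parser trades that for a declarative grammar file.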
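Parsers built with ANTLR include those of Hive, Presto, and Spark SQL.
For Hive, see the Meituan-Dianping article on how Hive SQL is compiled to MapReduce: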
https://tech.meituan.com/2014/02/12/hive-sql-to-mapreduce.html
and a test case from the Hive source code:
https://github.com/apache/hive/blob/branch-1.1/ql/src/test/org/apache/hadoop/hive/ql/parse/TestHiveDecimalParse.java
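Hive's grammar files are as follows.
Hive's own SQL grammar (HiveParser.g, an ANTLR 3 grammar, linked here at an older commit):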
https://github.com/apache/hive/blob/59d8665cba4fe126df026f334d35e5b9885fc42c/parser/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
Newer versions of Hive also ship an ANTLR 4 grammar, Hplsql.g4, which belongs to the HPL/SQL component:
https://github.com/apache/hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
Spark's g4 file:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Presto's g4 file:
https://github.com/prestodb/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4
The g4 file of Confluent's KSQL:
https://github.com/confluentinc/ksql/blob/master/ksqldb-parser/src/main/antlr4/io/confluent/ksql/parser/SqlBase.g4
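Projects whose SQL parsers build on Apache Calcite include Apache Flink and Apache Storm. Mybatis Mapper is listed alongside them only because its generator also uses FreeMarker (.ftl) templates.
Flink's ftl file: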
https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl
The Mybatis Mapper generator's template (a FreeMarker template for generating mapper code, not a SQL grammar):
https://github.com/abel533/Mapper/blob/master/generator/src/main/resources/generator/mapper.ftl
Storm's ftl file:
https://github.com/apache/storm/blob/master/sql/storm-sql-core/src/codegen/includes/parserImpls.ftl
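Impala, in turn, uses flex and CUP. For how to parse a query with Impala's parser, see the separate article "使用Impala parser解析SQL" (Parsing SQL with the Impala parser).
The parser's test cases: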
https://github.com/cloudera/Impala/blob/master/fe/src/test/java/com/cloudera/impala/analysis/ParserTest.java
The source files:
https://github.com/apache/impala/blob/master/fe/src/main/jflex/sql-scanner.flex
and
https://github.com/apache/impala/blob/master/fe/src/main/cup/sql-parser.cup
Impala also uses a small amount of ANTLR:
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/ToSqlUtils.java
Finally, Hue uses jison, a parser generator for JavaScript: it takes a Bison-style grammar and emits a JavaScript parser.
https://github.com/cloudera/hue/tree/master/desktop/core/src/desktop/js/parse/jison
Taking Hive's Hplsql.g4 as an example, let's parse a SQL statement. First generate and compile the parser:
antlr4 Hplsql.g4
javac Hplsql*.java
Parsing a select statement:
grun Hplsql r -tokens
Warning: TestRig moved to org.antlr.v4.gui.TestRig; calling automatically
select * from db1.tb1;
[@0,0:5='select',<T_SELECT>,1:0]
[@1,7:7='*',<'*'>,1:7]
[@2,9:12='from',<T_FROM>,1:9]
[@3,14:16='db1',<L_ID>,1:14]
[@4,17:17='.',<'.'>,1:17]
[@5,18:20='tb1',<L_ID>,1:18]
[@6,21:21=';',<';'>,1:21]
[@7,23:22='<EOF>',<EOF>,2:0]
No method for rule r or it has arguments
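The token stream is printed. Each entry has the form [@index,start:stop='text',<type>,line:column]. The final message, "No method for rule r or it has arguments", appears because the grammar defines no rule named r. The tokens are still printed because lexing does not depend on the start rule. To get a parse as well, pass the grammar's actual start rule (program in Hplsql.g4) instead of r.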
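To show what each field in that output means, the following sketch reproduces the same line format with a hand-rolled lexer. It does not use ANTLR and is not how Hplsql.g4 tokenizes input. The `KEYWORDS` table is a hypothetical two-entry stand-in for the grammar's real keyword list, and `token_stream` is a name invented here. Each output line carries the token index, the start and stop character offsets (inclusive), the lexeme, the token type, and the line and column at which the token starts.

```python
import re
from typing import List

# Hypothetical, heavily reduced keyword table; the real grammar defines many more.
KEYWORDS = {"select": "T_SELECT", "from": "T_FROM"}

TOKEN_RE = re.compile(
    r"(?P<ws>\s+)"
    r"|(?P<word>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<sym>[*.;(),=])"
)

def token_stream(text: str) -> List[str]:
    """Return lines shaped like the -tokens output: [@i,start:stop='text',<type>,line:col]."""
    out = []
    index, pos = 0, 0
    line, col = 1, 0  # lines are 1-based, columns are 0-based
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {line}:{col}")
        lexeme = m.group()
        if m.lastgroup != "ws":  # whitespace is skipped, not emitted
            if m.lastgroup == "word":
                ttype = KEYWORDS.get(lexeme.lower(), "L_ID")
            else:
                ttype = f"'{lexeme}'"  # symbols are shown as quoted literals
            out.append(f"[@{index},{pos}:{m.end() - 1}='{lexeme}',<{ttype}>,{line}:{col}]")
            index += 1
        # Track line/column so the next token reports where it starts.
        newlines = lexeme.count("\n")
        if newlines:
            line += newlines
            col = len(lexeme) - lexeme.rfind("\n") - 1
        else:
            col += len(lexeme)
        pos = m.end()
    # EOF is a zero-width token, so its stop offset is one less than its start.
    out.append(f"[@{index},{pos}:{pos - 1}='<EOF>',<EOF>,{line}:{col}]")
    return out

if __name__ == "__main__":
    print("\n".join(token_stream("select * from db1.tb1;\n")))
```

For the input `select * from db1.tb1;` followed by a newline, this prints the same eight token lines as the grun output above, including the `23:22` offsets on the zero-width EOF token.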
Parsing a CREATE TABLE statement:
grun Hplsql r -tokens
Warning: TestRig moved to org.antlr.v4.gui.TestRig; calling automatically
CREATE TABLE IF NOT EXISTS db1.tb1 (
  `f1` string,
  `f2` bigint,
  `f3` string,
  `f4` string,
  `f5` string)
partitioned by(ds string)
stored as parquet
TBLPROPERTIES ("parquet.compression"="SNAPPY");
[@0,0:5='CREATE',<T_CREATE>,1:0]
[@1,7:11='TABLE',<T_TABLE>,1:7]
[@2,13:14='IF',<T_IF>,1:13]
[@3,16:18='NOT',<T_NOT>,1:16]
[@4,20:25='EXISTS',<T_EXISTS>,1:20]
[@5,27:29='db1',<L_ID>,1:27]
[@6,30:30='.',<'.'>,1:30]
[@7,31:33='tb1',<L_ID>,1:31]
[@8,35:35='(',<'('>,1:35]
[@9,39:42='`f1`',<L_ID>,2:2]
[@10,44:49='string',<T_STRING>,2:7]
[@11,50:50=',',<','>,2:13]
[@12,54:57='`f2`',<L_ID>,3:2]
[@13,59:64='bigint',<T_BIGINT>,3:7]
[@14,65:65=',',<','>,3:13]
[@15,69:72='`f3`',<L_ID>,4:2]
[@16,74:79='string',<T_STRING>,4:7]
[@17,80:80=',',<','>,4:13]
[@18,84:87='`f4`',<L_ID>,5:2]
[@19,89:94='string',<T_STRING>,5:7]
[@20,95:95=',',<','>,5:13]
[@21,99:102='`f5`',<L_ID>,6:2]
[@22,104:109='string',<T_STRING>,6:7]
[@23,110:110=')',<')'>,6:13]
[@24,112:122='partitioned',<L_ID>,7:0]
[@25,124:125='by',<T_BY>,7:12]
[@26,126:126='(',<'('>,7:14]
[@27,127:128='ds',<L_ID>,7:15]
[@28,130:135='string',<T_STRING>,7:18]
[@29,136:136=')',<')'>,7:24]
[@30,138:143='stored',<T_STORED>,8:0]
[@31,145:146='as',<T_AS>,8:7]
[@32,148:154='parquet',<L_ID>,8:10]
[@33,156:168='TBLPROPERTIES',<L_ID>,9:0]
[@34,170:170='(',<'('>,9:14]
[@35,171:191='"parquet.compression"',<L_ID>,9:15]
[@36,192:192='=',<'='>,9:36]
[@37,193:200='"SNAPPY"',<L_ID>,9:37]
[@38,201:201=')',<')'>,9:45]
[@39,202:202=';',<';'>,9:46]
[@40,204:203='<EOF>',<EOF>,10:0]
No method for rule r or it has arguments