Hive Source Code Analysis: The Semantic Analyzer

Official blog of the Taobao Data Platform and Products Department, tbdata.org, 2011-05-03 23:34:04


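Overview

This post covers how the semantic analyzer in Hive's SQL engine is implemented. Starting from the previous post on lexical analysis, it shows how the analyzer takes the generated syntax tree and turns it into a concrete query structure.

The core of the post is the design: depending on the token found at the root of the syntax tree, one of five different analyzers is used. This is how Hive copes with the variety of its SQL syntax. Working from the source, the post shows which kind of statement each analyzer handles (EXPLAIN, function management, DDL, LOAD and ordinary queries) and how the work is divided among them.

For readers who want to understand how a SQL engine works internally, or who want to read the Hive source, this post is a clear entry point.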

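Hive's semantic analysis works on the syntax tree produced by the step covered in the previous post on lexical analysis. Depending on the token at the root of that tree, one of five concrete analyzers is used.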

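When an analyzer is requested, SemanticAnalyzerFactory returns the one that matches the statement at hand. This selection happens in the SemanticAnalyzerFactory class.

There are currently five analyzers, all of which extend BaseSemanticAnalyzer.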
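The selection can be pictured as a switch on the root token. The following is a minimal sketch, not the Hive source: the TokenType enum and the small analyzer classes are hypothetical stand-ins for the HiveParser.TOK_* constants and the real analyzers, and the choice of which tokens count as DDL is an assumption made for illustration.

```java
/** Toy model of token-driven analyzer selection; not the Hive source. */
final class AnalyzerFactorySketch {

    /** Hypothetical stand-ins for the HiveParser.TOK_* constants. */
    enum TokenType {
        EXPLAIN, LOAD, CREATE_FUNCTION, DROP_FUNCTION,
        SHOW_TABLES, DROP_TABLE, CREATE_TABLE, QUERY
    }

    /** Plays the role of BaseSemanticAnalyzer. */
    static abstract class BaseAnalyzer {
        abstract String name();
    }

    static final class ExplainAnalyzer extends BaseAnalyzer {
        String name() { return "explain"; }
    }

    static final class LoadAnalyzer extends BaseAnalyzer {
        String name() { return "load"; }
    }

    static final class FunctionAnalyzer extends BaseAnalyzer {
        String name() { return "function"; }
    }

    static final class DdlAnalyzer extends BaseAnalyzer {
        String name() { return "ddl"; }
    }

    /** Plays the role of SemanticAnalyzer, the default for queries. */
    static final class QueryAnalyzer extends BaseAnalyzer {
        String name() { return "query"; }
    }

    /** Picks an analyzer from the root token of the syntax tree. */
    static BaseAnalyzer get(TokenType rootToken) {
        if (rootToken == null) {
            throw new IllegalArgumentException("Empty syntax tree");
        }
        switch (rootToken) {
            case EXPLAIN:
                return new ExplainAnalyzer();
            case LOAD:
                return new LoadAnalyzer();
            case CREATE_FUNCTION:
            case DROP_FUNCTION:
                return new FunctionAnalyzer();
            case SHOW_TABLES:
            case DROP_TABLE:
                return new DdlAnalyzer();
            default:
                // Queries, and also CREATE_TABLE, which the query analyzer
                // pre-processes itself (see analyzeInternal later on).
                return new QueryAnalyzer();
        }
    }

    public static void main(String[] args) {
        System.out.println(get(TokenType.SHOW_TABLES).name());
        System.out.println(get(TokenType.QUERY).name());
    }
}
```

Routing everything unrecognized to the query analyzer keeps the factory small: only the special statement kinds need their own case.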

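A brief introduction to the five analyzers:

ExplainSemanticAnalyzer:

It prints the syntax tree and the execution plan. Otherwise it mostly follows the same path as SemanticAnalyzer. The key difference is that, throughout the analysis, it does not let the Context actually create the temporary files and paths that a real run would need.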

FunctionSemanticAnalyzer:

Its main job is to create and drop the metadata of a function.

For example:

CREATE TEMPORARY FUNCTION str_to_date AS 'com.taobao.hive.udf.UDFStrToDate';

After this, SQL statements can call the user-defined function.

DDLSemanticAnalyzer:

It mainly handles create, drop, alter and show operations at the table, view and partition level.

For example: show tables;

LoadSemanticAnalyzer:

It handles LOAD operations.

SemanticAnalyzer:

This is the one we care about most. It analyzes ordinary SQL statements, and it is the most central and most complex component.

In BaseSemanticAnalyzer, analysis starts here:

public void analyze(ASTNode ast, Context ctx) throws SemanticException {
  this.ctx = ctx;
  analyzeInternal(ast);
}

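All five analyzers inherit this method and implement analyzeInternal(), each in its own way. We are interested in how ordinary SQL (select ... from) is analyzed, so we go straight to SemanticAnalyzer.

(Note: ctx is an instance of the Context class. It is important and comes up again below.)

So analysis starts here, and from now on we only discuss ordinary SQL (select ... from).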
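A shared analyze() that stores the context and then delegates to an abstract analyzeInternal() is the template method pattern. Here is a minimal sketch of that shape. The Ctx class, the two subclasses and the String return value are hypothetical and exist only so the behavior can be observed. In Hive, analyze() returns void and works on an ASTNode.

```java
/** Toy model of the analyze() -> analyzeInternal() hand-off; not the Hive source. */
final class TemplateMethodSketch {

    /** Hypothetical stand-in for Hive's Context. */
    static final class Ctx {
        final boolean explain;

        Ctx(boolean explain) {
            this.explain = explain;
        }
    }

    static abstract class BaseAnalyzer {
        protected Ctx ctx;

        /** Shared entry point: remember the context, then delegate. */
        final String analyze(String ast, Ctx ctx) {
            this.ctx = ctx;
            return analyzeInternal(ast);
        }

        /** Each concrete analyzer supplies its own analysis here. */
        protected abstract String analyzeInternal(String ast);
    }

    static final class LoadAnalyzer extends BaseAnalyzer {
        @Override
        protected String analyzeInternal(String ast) {
            return "load:" + ast;
        }
    }

    static final class QueryAnalyzer extends BaseAnalyzer {
        @Override
        protected String analyzeInternal(String ast) {
            // The stored context is visible to the subclass.
            return (ctx.explain ? "explain-query:" : "query:") + ast;
        }
    }

    public static void main(String[] args) {
        BaseAnalyzer analyzer = new QueryAnalyzer();
        System.out.println(analyzer.analyze("select", new Ctx(false)));
    }
}
```

The caller only ever sees analyze(), so it does not need to know which of the analyzers it was handed.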

SemanticAnalyzer is one very large class in the Hive source.

Because it is so important, here is the code directly:

analyzeInternal() in SemanticAnalyzer

public void analyzeInternal(ASTNode ast) throws SemanticException {
  reset();

  QB qb = new QB(null, null, false);
  this.qb = qb;
  this.ast = ast;
  ASTNode child = ast;

  LOG.info("Starting Semantic Analysis");
  System.out.print("Starting Semantic Analysis");

  // analyze create table command
  // pre-processing for table or view creation, e.g. create table .. as select .. from
  if (ast.getToken().getType() == HiveParser.TOK_CREATETABLE) {
    // if it is not CTAS, we don't need to go further and just return
    if ((child = analyzeCreateTable(ast, qb)) == null) {
      return;
    }
  }

  // analyze create view command
  if (ast.getToken().getType() == HiveParser.TOK_CREATEVIEW) {
    child = analyzeCreateView(ast, qb);
    if (child == null) {
      return;
    }
    viewSelect = child;
  }

  // continue analyzing from the child ASTNode.
  doPhase1(child, qb, initPhase1Ctx()); // collect the aliases of subqueries, tables, etc.
  LOG.info("Completed phase 1 of Semantic Analysis");

  getMetaData(qb); // fetch the metadata
  LOG.info("Completed getting MetaData in Semantic Analysis");

  // Save the result schema derived from the sink operator produced
  // by genPlan. This has the correct column names, which clients
  // such as JDBC would prefer instead of the c0, c1 we'll end
  // up with later.
  Operator sinkOp = genPlan(qb); // column names are handled only at this level; operators are generated here

  resultSchema =
      convertRowSchemaToViewSchema(opParseCtx.get(sinkOp).getRR());

  if (createVwDesc != null) { // creating a view
    saveViewDefinition();
    // Since we're only creating a view (not executing it), we
    // don't need to optimize or translate the plan (and in fact, those
    // procedures can interfere with the view creation). So
    // skip the rest of this method.
    ctx.setResDir(null);
    ctx.setResFile(null);
    return;
  }

  ParseContext pCtx = new ParseContext(conf, qb, child, opToPartPruner,
      topOps, topSelOps, opParseCtx, joinContext, topToTable,
      loadTableWork, loadFileWork, ctx, idToTableNameMap, destTableId, uCtx,
      listMapJoinOpsNoReducer, groupOpToInputTables, prunedPartitions,
      opToSamplePruner);

  // run the optimizer to produce a better operator tree
  Optimizer optm = new Optimizer();
  optm.setPctx(pCtx);
  optm.initialize(conf);
  pCtx = optm.optimize();
  init(pCtx);
  qb = pCtx.getQB();

  // At this point we have the complete operator tree
  // from which we want to find the reduce operator
  genMapRedTasks(qb);

  LOG.info("Completed plan generation");

  return;
}

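Key methods:

doPhase1()

This method amounts to a first pass over the big branches of the tree. It resolves aliases and a number of mappings,

including the aliases of tables and subqueries,

and the mapping from the string form of a subtree to its AST. It does not go down to the column level.

First pass

public void doPhase1(ASTNode ast, QB qb, Phase1Ctx ctx_1)

The official comment follows.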

/**
 * Phase 1: (including, but not limited to):
 *
 * 1. Gets all the aliases for all the tables / subqueries and makes the
 *    appropriate mapping in aliasToTabs, aliasToSubq
 * 2. Gets the location of the destination and names the clause "inclause" + i
 * 3. Creates a map from a string representation of an aggregation tree to the
 *    actual aggregation AST
 * 4. Creates a mapping from the clause name to the select expression AST in
 *    destToSelExpr
 * 5. Creates a mapping from a table alias to the lateral view AST's in
 *    aliasToLateralViews
 */

The tree is traversed recursively.

As a code example, this is how TOK_FROM is handled:

case HiveParser.TOK_FROM:
  int child_count = ast.getChildCount();
  if (child_count != 1) {
    throw new SemanticException("Multiple Children " + child_count);
  }

  // Check if this is a subquery / lateral view
  // each case is handled by its own method
  ASTNode frm = (ASTNode) ast.getChild(0);
  if (frm.getToken().getType() == HiveParser.TOK_TABREF) {
    processTable(qb, frm);
  } else if (frm.getToken().getType() == HiveParser.TOK_SUBQUERY) {
    processSubQuery(qb, frm);
  } else if (frm.getToken().getType() == HiveParser.TOK_LATERAL_VIEW) {
    processLateralView(qb, frm);
  } else if (isJoinToken(frm)) {
    processJoin(qb, frm);
    qbp.setJoinExpr(frm);
  }
  break;

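The information picked off the tree is stored in a few containers such as QB and QBParseInfo. For example, when a select clause is encountered, its information is recorded in QBParseInfo.

The skipRecursion flag indicates whether the recursion stops at the current node.

Several containers are involved:

QBParseInfo is a helper container that supports the analyzer during analysis,

while QB holds the basic unit of a SQL statement, the query block, including table names and aliases.

A lot of what we want can be obtained from here.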
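To make the recursion and the skipRecursion flag concrete, here is a small sketch over a toy tree. It is not the Hive code. Node, QueryBlock and the string node types are hypothetical simplifications, and the rule used here (handled node types stop the recursion, everything else descends into its children) is an assumption about the general shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy first pass over a tiny AST; not the Hive source. */
final class Phase1Sketch {

    /** Hypothetical AST node: a type tag, optional text, and children. */
    static final class Node {
        final String type;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Hypothetical, heavily reduced QB: aliases and raw select expressions. */
    static final class QueryBlock {
        final Map<String, String> aliasToTable = new LinkedHashMap<>();
        final List<String> selectExprs = new ArrayList<>();
    }

    static void doPhase1(Node ast, QueryBlock qb) {
        boolean skipRecursion = false;
        switch (ast.type) {
            case "SELECT": {
                // Record the expressions as they are; columns are resolved later.
                for (Node expr : ast.children) {
                    qb.selectExprs.add(expr.text);
                }
                skipRecursion = true;
                break;
            }
            case "FROM": {
                if (ast.children.size() != 1) {
                    throw new IllegalStateException("Multiple children " + ast.children.size());
                }
                Node frm = ast.children.get(0);
                if (frm.type.equals("TABREF")) {
                    // The table name doubles as the alias when none is given.
                    String alias = frm.children.isEmpty() ? frm.text : frm.children.get(0).text;
                    qb.aliasToTable.put(alias, frm.text);
                }
                skipRecursion = true;
                break;
            }
            default:
                break;
        }
        if (!skipRecursion) {
            for (Node child : ast.children) {
                doPhase1(child, qb);
            }
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("QUERY", null,
            new Node("FROM", null, new Node("TABREF", "orders", new Node("ALIAS", "o"))),
            new Node("INSERT", null,
                new Node("SELECT", null, new Node("EXPR", "o.id"), new Node("EXPR", "count(1)"))));
        QueryBlock qb = new QueryBlock();
        doPhase1(tree, qb);
        System.out.println(qb.aliasToTable); // {o=orders}
        System.out.println(qb.selectExprs);  // [o.id, count(1)]
    }
}
```

Note that the pass stores o.id as plain text. Nothing checks whether orders has an id column, which matches the point above that this pass does not go down to the column level.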

QBParseInfo

Implementation of the parse information related to a query block.

It holds various mappings, for example from a string key to the ASTNode of the select, where and group by clauses.

private final boolean isSubQ;
private final String alias;
private ASTNode joinExpr;
private ASTNode hints;
private final HashMap<String, ASTNode> aliasToSrc;
private final HashMap<String, ASTNode> nameToDest;
private final HashMap<String, TableSample> nameToSample;
private final Map<String, ASTNode> destToSelExpr;
private final HashMap<String, ASTNode> destToWhereExpr;
private final HashMap<String, ASTNode> destToGroupby;
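The maps above are all keyed by a string, and item 3 of the Phase 1 comment keys aggregations by the string form of their subtree. A side effect of that choice is that two identical aggregations collapse into a single entry. The sketch below illustrates this under stated assumptions: Node, its toStringTree() and the method names are hypothetical and only loosely modeled on QBParseInfo, and the clause name follows the quoted comment ("inclause" + i).

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy parse-info container keyed by clause name; not the Hive source. */
final class ParseInfoSketch {

    /** Hypothetical AST node with a LISP-style string form. */
    static final class Node {
        final String text;
        final Node[] children;

        Node(String text, Node... children) {
            this.text = text;
            this.children = children;
        }

        String toStringTree() {
            if (children.length == 0) {
                return text;
            }
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Node child : children) {
                sb.append(' ').append(child.toStringTree());
            }
            return sb.append(')').toString();
        }
    }

    // clause name -> select expression AST
    private final Map<String, Node> destToSelExpr = new LinkedHashMap<>();
    // clause name -> (string form of aggregation -> aggregation AST)
    private final Map<String, Map<String, Node>> destToAggregationExprs = new LinkedHashMap<>();

    void setSelExprForClause(String clause, Node ast) {
        destToSelExpr.put(clause, ast);
    }

    Node getSelForClause(String clause) {
        return destToSelExpr.get(clause);
    }

    /** Identical aggregations share one key, so they are stored once. */
    void addAggregationForClause(String clause, Node agg) {
        destToAggregationExprs
            .computeIfAbsent(clause, k -> new LinkedHashMap<>())
            .put(agg.toStringTree(), agg);
    }

    Map<String, Node> getAggregationExprsForClause(String clause) {
        return destToAggregationExprs.getOrDefault(clause, Collections.<String, Node>emptyMap());
    }

    public static void main(String[] args) {
        ParseInfoSketch info = new ParseInfoSketch();
        String clause = "inclause" + 0;
        info.addAggregationForClause(clause, new Node("TOK_FUNCTION", new Node("count"), new Node("1")));
        info.addAggregationForClause(clause, new Node("TOK_FUNCTION", new Node("count"), new Node("1")));
        info.addAggregationForClause(clause, new Node("TOK_FUNCTION", new Node("sum"), new Node("x")));
        System.out.println(info.getAggregationExprsForClause(clause).keySet());
    }
}
```

With this layout, a query that mentions count(1) in several places only needs the aggregation evaluated once per clause.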

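The Context class is very important here.

Context is the context of a query.

Main functions:

1. It flags EXPLAIN: for an EXPLAIN statement the explain flag is true, and none of these files are actually created.

2. It can create temporary files (the files and paths needed while the query runs), and it generates and cleans up intermediate temporary files and paths.

So this is where we can get hold of the temporary files used during the whole process, which is useful for optimization work.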

private Path makeMRScratchDir(HiveConf conf, boolean mkdir)
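The two functions listed above can be sketched together: a context that hands out scratch paths, creates them on disk only when the query is not an EXPLAIN, and cleans them up afterwards. This is a simplified illustration on the local file system using java.nio. The class and method names are hypothetical, and Hive's own Context works with Hadoop paths and HiveConf instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy query context that tracks scratch directories; not the Hive source. */
final class ScratchContextSketch {
    private final Path root;
    private final boolean explain;
    private final List<Path> scratchDirs = new ArrayList<>();
    private int nextId = 0;

    ScratchContextSketch(Path root, boolean explain) {
        this.root = root;
        this.explain = explain;
    }

    /** Convenience factory that places the root under the system temp dir. */
    static ScratchContextSketch inTempRoot(boolean explain) {
        try {
            return new ScratchContextSketch(Files.createTempDirectory("ctx-sketch"), explain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reserves a scratch path; it is created on disk only when not explaining. */
    Path makeScratchDir() {
        Path dir = root.resolve("scratch-" + nextId++);
        if (!explain) {
            try {
                Files.createDirectories(dir);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        scratchDirs.add(dir);
        return dir;
    }

    /** All paths handed out so far, whether or not they exist on disk. */
    List<Path> getScratchDirs() {
        return new ArrayList<>(scratchDirs);
    }

    /** Deletes every scratch directory that was actually created. */
    void clear() {
        for (Path dir : scratchDirs) {
            if (!Files.exists(dir)) {
                continue;
            }
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order puts children before their parents.
                List<Path> deepestFirst =
                    walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                for (Path p : deepestFirst) {
                    Files.delete(p);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        scratchDirs.clear();
    }

    public static void main(String[] args) {
        ScratchContextSketch ctx = inTempRoot(true);
        Path planned = ctx.makeScratchDir();
        System.out.println(planned + " exists: " + Files.exists(planned));
        ctx.clear();
    }
}
```

Because every path goes through one object, listing or inspecting the temporary files of a query only requires asking the context, which is the property the text above relies on.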
