编译技术3-1

语法分析基础（Syntax Analysis Basics）

1. 语法分析在编译流程中的位置

Position in Compilation Pipeline

1
2
3

Lexical Analysis (词法分析) → Syntax Analysis (语法分析) → 
Semantic Analysis (语义分析) → IR Generation (中间代码生成) → 
Optimization (优化) → Code Generation (代码生成)

2. 什么是语法分析？

What is Syntax Analysis?

输入 (Input): 词法分析产生的 token 序列
任务 (Tasks):
- 从 token 序列中恢复程序结构 (Recover structure from token sequence)
- 检查语法错误 (Report syntax errors)
输出 (Output): 显式或隐式的语法树 (Explicit or implicit syntax tree)

🧾语法分析形式体系

1. Context-Free Grammars (CFG)

上下文无关文法

CFG 是一种形式语言的表示方法，用于定义编程语言的语法结构。 A Context-Free Grammar is a formalism used to describe the syntax of programming languages.

2. CFG and Grammars

文法与CFG

一个文法（Grammar）G = (Vₜ, Vₙ, P, S)，其中：

Vₜ（终结符）Terminals：不能再被替换的基本符号，通常是词法分析器的 token。
Vₙ（非终结符）Non-terminals：用于表示语法结构的符号。
P（产生式规则）Productions：形如 A → α 的替换规则。
S（开始符号）Start Symbol：推导的起点。

3. Derivations

推导

推导是通过应用产生式规则，从开始符号逐步构造出一个字符串的过程。 Derivation is the process of applying production rules to generate strings from the start symbol.

左推导 Leftmost Derivation：每一步总是替换最左侧的非终结符。
右推导 Rightmost Derivation：每一步总是替换最右侧的非终结符。

4. Concrete and Abstract Syntax Trees

具体与抽象语法树

具体语法树（Concrete Syntax Tree）：精确展示了程序中所有语法细节。
抽象语法树（Abstract Syntax Tree, AST）：省略不必要的符号，只保留程序结构的核心。

Concrete syntax trees show all grammar symbols; ASTs simplify the structure by removing syntactic sugar.

5. Ambiguity

歧义

一个文法是歧义的，如果存在某个字符串可以通过不同的方式推导出不同的语法树。 A grammar is ambiguous if a string has more than one distinct parse tree.

例子（Example）：

1	expr → expr + expr \| expr * expr \| num

表达式 num + num * num 既可以先加再乘，也可以先乘再加，造成歧义。

6. Parsing Algorithms

解析算法

用于将 token 序列转换为语法树的算法。 Parsing algorithms convert token sequences into syntax trees.

🧭 Top-Down Parsing

自顶向下分析

从开始符号开始尝试推导输入字符串。
常见方法：递归下降（Recursive Descent）、LL分析器。

Starts from the start symbol and tries to rewrite it to match the input.

⛏️ Bottom-Up Parsing

自底向上分析

从输入字符串出发，逐步归约回到开始符号。
常见方法：LR分析器、移进-归约（Shift-Reduce Parsing）。

Starts from the input and works backward to the start symbol.

7. Formal Languages

形式语言

字母表（Alphabet）Σ：一组符号，作为“字母”。 An alphabet is a set Σ of symbols that act as letters.
语言（Language）：由 Σ 中的符号组成的字符串的集合。 A language over Σ is a set of strings made from symbols in Σ.

在扫描（Lexing）中：

字母表：ASCII 字符（如 a-z, 0-9）
结果：token序列（标记）

在语法分析（Parsing）中：

字母表：由词法分析产生的 token 集合
分析目标：使用 CFG 判断 token 序列是否符合语法

正则语言的局限性

The Limits of Regular Languages

📌 基本定义 Definition

给定字母表： Σ = {a, b} Alphabet: Σ = {a, b}
构造的字符串集合 S：在这个字母表下，集合 S 包含一个 b，其左右有相同数量的 a。 The set of strings S over this alphabet consists of a single b surrounded by the same number of as.

举例 Examples:

1 2	S = {b, aba, aabaa, aaabaaa, …} = { aⁿ b aⁿ \| n ≥ 0, n ∈ ℤ }

❌ 正则表达式是否能描述？

Can regular expressions describe this language?

尝试的正则表达式： a*ba* ✅ 这个表达式可以匹配 但并不精确。 This regular expression matches strings with a b and any number of as on both sides.
但这不能保证左右两边的 a 的数量相等！ But it does not enforce that the number of as before and after b are the same.

❗结论 Conclusion:

这个语言不能被正则表达式完全描述。 This set cannot be described by a regular expression.

🧪 编译中的意义 Relevance in Compilation

🔍 词法分析（Lexical Analysis）

我们使用正则表达式来定义每一个 token。 During scanning, we used regular expressions to define each token.
这是可以的，因为大多数 token 是简单的结构（如标识符、数字、运算符）。 This works because most tokens have simple, regular structure (like identifiers, numbers, and operators).

🚫 正则表达式的局限 Limitations of Regular Expressions

虽然正则表达式在词法分析中很强大，但在更复杂的语法结构中不够用。 Although useful in scanning, regular expressions are too weak for syntax-level tasks.

⚠️ 无法处理的问题 Problems it cannot handle:

无法匹配所有成对括号的表达式 Cannot define a regular expression matching all expressions with properly balanced parentheses.
无法匹配所有嵌套块结构的函数体 Cannot define a regular expression matching all functions with properly nested block structures.

🚀 解决方法：需要更强的形式体系

Solution: We need a more powerful formalism.

因此，我们需要上下文无关文法（Context-Free Grammars, CFGs）来描述这些复杂结构。 So we turn to context-free grammars (CFGs) to define such structures.

📘上下文无关文法（Context-Free Grammars）

🧠 基本概念 Basic Concept

上下文无关文法（Context-Free Grammar，CFG） 是一种用于描述编程语言语法结构的形式体系。

A context-free grammar (CFG) is a specification for the syntactic structure of a programming language.

它与正则表达式类似，但 CFG 支持递归规则，因此能够描述更复杂的结构。 Similar to regular expressions, except that CFGs involve recursive rules.
CFG 是 正则语言的严格超集。 CFGs are a strict superset of the regular languages.

📌 示例 Example

一个用于整数算术表达式的上下文无关文法： A context-free grammar for integer arithmetic expressions:

exp   → exp op exp | (exp) | num
op    → + | - | * | /
num   = digit digit*      ← regular expression
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

📐 正式定义 Formal Definition

一个上下文无关文法 G 定义为： A CFG is defined as:

1	G = (Vₜ, Vₙ, P, S)

其中 (where):

Vₜ (终结符 Terminals): 用于构成字符串的基本符号，即 token。 A set of terminals, the basic symbols used to form strings (tokens).
Vₙ (非终结符 Nonterminals): 表示结构名称的一组符号，用于定义字符串集合。 A set of nonterminals, names for structures denoting sets of strings.
P (产生式 Productions): 一组规则，形式为 A → α，其中 A 是非终结符，α 是任意长度的终结符和非终结符组合。 A set of productions, or grammar rules, of the form A → α, where A ∈ Vₙ and α ∈ (Vₙ ∪ Vₜ)*.
S (开始符号 Start Symbol): 文法推导的起点，S ∈ Vₙ。 The start symbol, from which derivations begin.

Vₜ 和 Vₙ 没有重叠，即 Vₜ ∩ Vₙ = ∅ Terminals and nonterminals are disjoint: Vₜ ∩ Vₙ = ∅

🧱 结构意义 Structural Meaning

每条产生式（Production）定义一个结构。 A production defines a structure:

A → α 表示结构 A 的形式由 α 决定。 A → α means structure A is composed as described by α.
→ 并不意味着替换，而是说明结构递归定义。 The arrow → doesn’t mean literal substitution but structural derivation, due to recursion.
如果 X → Y₁ Y₂ ... Yₙ，表示 X 可以展开为 Y₁ Y₂ ... Yₙ。 If X → Y₁ ... Yₙ, then X can be replaced by Y₁ ... Yₙ.
如果 X → ε，表示 X 可以替换为空串。 If X → ε, then X can be replaced with the empty string.

✏️ BNF 表示法 (Backus-Naur Form)

这种规则表示形式被称为巴科斯-诺尔范式（BNF），是一种标准的 CFG 写法。 The format of the production rules A → α is known as Backus-Naur Form (BNF).

🧩 推导与语言定义 Derivation & Language

从起始符号 S 开始，通过反复应用产生式，可以生成一系列终结符组成的字符串。 Starting from the start symbol S, repeatedly applying productions can generate strings of terminals.
所有从 S 推导出的终结符字符串集合就是这个 CFG 定义的语言。 The set of all terminal strings derivable from S is the language defined by the grammar.

✅ 小结 Summary (中英文对照)

英文术语	中文术语	说明
Terminal (Vₜ)	终结符	最终输出符号，如 token
Nonterminal (Vₙ)	非终结符	表示结构的符号
Production (P)	产生式	定义结构的规则
Start Symbol (S)	开始符号	从这个非终结符开始推导
ε (epsilon)	空串	表示“什么也不生成”的规则
Derivation	推导	使用规则一步步展开的过程
BNF	巴科斯-诺尔范式	标准的文法书写格式

CFG 示例与表示法

这里其实就是递归写法!

🎯 示例：简单算术表达式的文法

文法 G = (Vₜ, Vₙ, P, S)

终结符 (Terminals):

1	VT = { num, +, -, *, /, (, ) }

非终结符 (Nonterminals):

1	VN = { exp, op }

开始符号 (Start Symbol):

exp

产生式 (Productions):

exp → exp op exp
exp → (exp)
exp → num
op   → +
op   → -
op   → *
op   → /

✏️ 表示约定 Notation Conventions

为使文法书写简洁，采用以下约定：

To simplify grammar notation, we adopt the following conventions:

默认第一个产生式左侧是开始符号。

The left-hand side of the first production is the start symbol, unless otherwise stated.
小写字母或具体符号（如 +, *, num）表示终结符（tokens）

Lowercase letters or literal symbols represent terminals.
大写字母或带尖括号 <...> 的名称表示非终结符。

Uppercase names or angle brackets represent nonterminals.
多条产生式拥有相同左侧时可合并写为：
1
A → α₁ | α₂ | ... | αₙ
If A → α₁, A → α₂, ..., A → αₙ share the same left-hand side, we write:
1
A → α₁ | α₂ | ... | αₙ

❌ 非法表示法 Not Notational Shorthand

在 CFG 中不能使用正则表达式的语法符号：

The syntax for regular expressions does not carry over to CFGs.

不能使用 *、|、() 表示重复、选择或分组。

You cannot use *, |, or parentheses as in regular expressions.

错误示例 (Incorrect Example):

S → a*b    ❌

正确写法 (Correct CFG Form):

1 2	S → Ab A → Aa \| ε

🎯 再看一个例子 Another Example

目标语言: { aⁿ b | n ≥ 0 } ∪ { aⁿ cᵐ | n ≥ 0, m ≥ 0 }

⚠️ 错误正则表达式尝试：

1	S → a(b\|c*) ❌

⚙️ 正确 CFG 写法：

1
2
3

S → aX
X → b | C
C → Cc | ε

解释：

X → b 匹配一个 b
X → C 表示 c 的重复（由 C → Cc | ε 实现）
aX 前缀表示 a 的前缀可以递归出现，通过多层调用 S → aX 来实现

✅ 总结 Summary

形式	意义	CFG 中支持？
`*`	重复	❌ 不支持
`\|`	或选
`()`	分组	❌ 不支持直接使用

为了描述复杂的语法结构，我们需要采用递归规则的 CFG，而不是仅依赖正则表达式。

Chomsky 文法层级（Chomsky Hierarchy of Grammars）

📚 Chomsky Hierarchy of Grammars

乔姆斯基层级（Chomsky Hierarchy）是由语言学家诺姆·乔姆斯基提出的形式语言分类方法，根据文法的表达能力将语言分为四类：

四种文法类型（Four Types of Grammars）

类型	名称（英文）	名称（中文）	描述
Type 0	Unrestricted Grammar	无限制文法	没有任何限制的产生式
Type 1	Context-Sensitive Grammar	上下文相关文法	产生式形如 αAβ → αγβ，其中γ非空
Type 2	Context-Free Grammar	上下文无关文法	产生式形如 A → β
Type 3	Regular Grammar	正则文法	产生式形如 A → aB 或 A → a

🔄 各类型产生式形式（Production Forms）

类型	产生式形式	限制说明	例子
Type 0	α → β	α, β ∈ (VT ∪ VN)* 且 α ≠ ε	没有限制：最强表达能力
Type 1	αAβ → αγβ	γ ≠ ε, 保证右侧长度 ≥ 左侧长度	S → aSBE, aB → ab
Type 2	A → β	A ∈ VN, β ∈ (VT ∪ VN)*	常用于编程语言语法
Type 3	A → aB 或 A → a	A, B ∈ VN, a ∈ VT	可对应正则表达式

🧠 各文法层级之间的关系（Hierarchy Relationship）

1	Unrestricted ⊇ Context-sensitive ⊇ Context-free ⊇ Regular

正则文法是上下文无关文法的子集，而上下文无关文法又是上下文相关文法的子集，依此类推。

📌 示例（Examples）

✅ Context-Sensitive Grammar 示例

S → aSBE
S → aBE
bB → bb
bE → be
aB → ab
eB → eE
eE → ee

✅ Context-Free Grammar 示例

1
2
3

S → AB
A → BS | 0
B → SA | 1

✅ Regular Grammar 示例

1
2
3

S → 0A | 1B | 0
A → 1B | 0
B → 0

📖 总结（Summary）

每种文法在表达能力上是递增的：
- 正则文法只能表达线性语言结构
- 上下文无关文法可以描述嵌套结构（如括号匹配）
- 上下文相关文法可用于描述更多语义限制（如变量声明后才能使用）
- 无限制文法则可表达所有形式语言（包括自然语言）
编译器前端常使用 上下文无关文法（CFG） 来描述编程语言的语法。