编译原理2007-2008

1. Fill in the blanks (18%)

a. Normally, a compiler consists of a number of phases. They are Scanner, _, , , Code Generator, and ___.

✅ 答案： Parser, Semantic Analyzer, Source Code Optimizer, Target Code Optimizer

📘 解释：

Scanner（词法分析器）：将源代码分解为记号（tokens）。
Parser（语法分析器）：根据语法规则分析 token 的结构。
Semantic Analyzer（语义分析器）：检查语义约束，如类型检查。
Source Code Optimizer（中间代码优化器）：优化中间表示以提高效率。
Code Generator（代码生成器）：生成目标代码（如汇编）。
Target Code Optimizer（目标代码优化器）：对最终代码进一步优化。

📙 中文翻译： 编译器通常由多个阶段组成，依次为：词法分析器、语法分析器、语义分析器、中间代码优化器、代码生成器和目标代码优化器。

b. The logical units the scanner generates are called Tokens. For a modern programming language, there are five types of token. They are ________, _, , , ____.

✅ 答案： Reserved Words, Identifier, Number, Operator, Special Symbol

📘 解释：

Reserved Words（保留字）：如 if, int, return，语言自身定义好的关键字。
Identifier（标识符）：如变量名、函数名，例如 s, main。
Number（数字）：如 0, 3.14, 42。
Operator（运算符）：如 +, ==, &&。
Special Symbol（特殊符号）：如 ;, {, (，用于语法结构的标识。

📙 中文翻译： 现代编程语言中的 token 类型包括：保留字、标识符、数字、运算符和特殊符号。

c. Based on the following C source code fragment

请回答： a) “int” is a ___________________ b) “printf” is a _________________ c) “s” is a (an) _________________ d) “= =” is a ____________________ e) “0” is a _____________________

✅ 答案： a) Reserved Word（保留字） b) Reserved Word（保留字）或 Identifier（标识符） c) Identifier（标识符） d) Operator（运算符） e) Number（数字）

📘 解释：

int 是 C 语言定义的数据类型，是保留字。
printf 通常是库函数的名字，在语法上是标识符，但有时当作预定义关键字处理。题目答案给为保留字也可接受。
s 是用户定义的变量名，是标识符。
== 是比较运算符，是一种运算符。
0 是整数常量，是数字。

📙 中文翻译： a）“int”是一个保留字；b）“printf”通常是标识符；c）“s”是标识符；d）“==”是运算符；e）“0”是数字。

d. A grammar G usually includes four components, they are _____, , _, and _______________.

✅ 答案： Terminal symbols（终结符）, Non-terminal symbols（非终结符）, Start symbol（开始符号）, Production rules（产生式）

📘 解释：

Terminal symbols（终结符）：不能再被替换的符号，如实际的 if, +, id。
Non-terminal symbols（非终结符）：可以被替换的语法变量，如 Expr, Stmt。
Start symbol（开始符号）：语法生成的起点符号，一般是 S 或 Program。
Production rules（产生式）：表示如何将非终结符展开成终结符和非终结符组合的规则。

📙 中文翻译： 一个文法 G 通常包括四个组成部分：终结符、非终结符、开始符号和产生式。

2. (Terms Translation, 12%)

a) Compiler ——

✅ 答案： 一种应用程序，将源代码转换为指定的目标代码。

A compiler is a software program that translates source code written in a high-level programming language into target code, such as machine code or intermediate code.

📘 解释： 编译器的作用是将高级语言（如 C、Java）翻译成机器可以执行的低级语言或中间语言，供后续执行或优化。

📙 中文翻译： 编译器是一个程序，它将源代码转换为目标代码，如机器语言或中间语言。

b) Source code ——

✅ 答案： 文本文件，其中内容是按照指定的文法规则描述特定的算法。

Source code is a text file that contains instructions written in a high-level programming language, structured according to the language’s grammar rules.

📘 解释： 源代码是程序员写的原始程序，用来描述程序的功能与算法，通常需要编译成可执行程序。

📙 中文翻译： 源代码是按照编程语言语法规则书写的程序文本文件，用于描述某个具体的算法或逻辑过程。

c) Scanner ——

✅ 答案： 将文本字符串按照词法规则，转换为特定的内部标识，供编译器后续分析。

A scanner, also called a lexical analyzer, processes the source code’s character stream and groups characters into tokens based on lexical rules.

📘 解释： Scanner 是编译器的第一个阶段，它读取字符流，将其划分为词法单元（tokens）供语法分析器使用。

📙 中文翻译： 词法分析器（Scanner）根据词法规则，把源程序的字符序列转换为记号（tokens），供后续阶段分析使用。

d) Tokens ——

✅ 答案： 源文件中最基本的信息单元。

Tokens are the smallest meaningful units of the source code, such as keywords, identifiers, literals, operators, and punctuation.

📘 解释： Token 是由词法分析器输出的基本单位，如 int, x, +, ;，是构成程序结构的最小语法单元。

📙 中文翻译： Token 是构成源代码的最小有意义单元，如关键字、变量名、数字、运算符等。

e) Terminal symbol ——

✅ 答案： 文法规则中不需要产生式定义的符号。

A terminal symbol is a basic symbol in a grammar that cannot be further broken down by production rules.

📘 解释： 终结符是语法分析时不能被替代的基本符号，常见的如实际的字符、关键词或运算符。

📙 中文翻译： 终结符是在语法规则中不能再被其他符号替代的最小单位，通常表示语言中实际存在的基本元素。

f) Ambiguous Grammar ——

✅ 答案： 使用不同推导方法，推导出不同语法树，就称该文法为二义文法。

An ambiguous grammar is one that can produce more than one distinct parse tree (syntax tree) for the same input string.

📘 解释： 二义文法的问题在于同一个字符串存在多种合法的语法结构，会导致语法分析器无法确定正确的解析方式。

📙 中文翻译： 如果一个文法可以对某个句子推导出两棵或更多不同的语法树，那么它就是一个二义文法（Ambiguous Grammar）。

Scanning（扫描；20%）

(a) (10%) Construct an NFA that recognizes the same language as defined by the following regular expression:

`(abab)ba`

✅ 问题翻译： 构建一个 NFA（非确定有限自动机），识别和正规表达式 (a*ba*b)*ba* 所定义的相同语言。

✅ 答案（简要结构解释）：

我们可以将正则表达式分为两个部分：

1	(a* b a* b)* b a*

这是一个典型的正则表达式转 NFA 的构造问题。使用 Thompson 构造法可以实现：

a* 的 NFA 是一个起点回环。
b 是单一步。
a* 再次是回环。
最外层 ( )* 使用 ε-transition 连接重复结构。
末尾再连接 b a*。

📙 中文解释： 使用 Thompson 构造法可以逐步构建这个正则表达式的非确定自动机。核心结构包括两个 a* 和两个 b 的连接，并使用 ε 转移处理星号（*）和重复。

(b) (10%) Using the subset construction, convert the NFA into a DFA.

✅ 问题翻译： 使用 子集构造法（subset construction / powerset construction） 将上述 NFA 转换为 DFA。

✅ 答案表（已提供部分 DFA 转换表）：

状态（DFA）	输入 a 后	输入 b 后
A = {0,1,2,5}	B	C
B = {2}	B	D
C = {3,6,7,8}	E	F
D = {3}	D	F
E = {3,7,8}	E	F
F = {1,2,4,5}	B	C

📘 解释 Explanation (EN):

使用 子集构造法 时，我们将 NFA 中的状态集合视为 DFA 的状态。
每个 DFA 状态是 NFA 状态的集合，例如 {0,1,2,5} 被命名为状态 A。
对每个 DFA 状态，分别考虑输入 a 和 b 后所有可能的到达状态，再取其 ε 闭包，生成新状态。
重复这个过程直到所有新状态都包含在表中。

📙 中文解释：

子集构造法：将 NFA 的多个状态集合合并视为 DFA 的一个状态。通过计算每个状态集在输入 a 或 b 后可达的状态，再求其 ε 闭包。
比如状态 A = {0,1,2,5}，输入 a 后可达 {2}，即状态 B。
状态表中的每一行表示一个新的 DFA 状态，以及在 a 或 b 输入下的状态迁移。

总结一下这道题：正则表达式换成NFA，画NFA时候注意*即可，因为涉及到对于空的应用，反复观察，利用 ε进行各种各样的跳转，即可应对第一问！

第二问，往往是子集构造法，只需要写出一开始所有能通过空到达的集合，然后不断反复枚举！

易错点：注意如果是接受态，需要画两个圆圈！

✅ 4. (LL Parsing; 25%)

🧾 题目描述（英文原文）：

Consider the following grammar G(S):
1
2
3
S → number | List  
List → ( Seq )  
Seq → Seq , S | S
Where number and “,” are terminal symbols, and the other symbols are non-terminals.

请完成以下任务：

(a) (7%)

Write the left-most derivation for the sentence: (4, (34))

(b) (8%)

Convert the grammar G into an LL(1) grammar G₁ by removing the left recursion.

(c) (10%)

Based on your LL(1) grammar G₁:

Calculate the FIRST set and FOLLOW set for each non-terminal.
Build the LL(1) parsing table.

✅ 标准答案 + 中英文解释：

🔹 (a) Left-most Derivation （最左推导）

推导目标句子：(4, (34))

S
→ List
→ (Seq)
→ (Seq , S)
→ (S , S)
→ (number , S)
→ (4 , List)
→ (4 , (Seq))
→ (4 , (S))
→ (4 , (number))
→ (4 , (34))

📘 Explanation (EN): We always replace the left-most non-terminal first. Here, Seq → Seq , S allows nesting, and S → List creates the inner expression.

📙 解释（中文）： 最左推导总是优先替换最左边的非终结符。这里 Seq → Seq , S 允许我们构造 4, (34) 的嵌套结构，S → List 则展开了内部的括号部分。

记住始终替换最左边的非终结符！

🔹 (b) Remove Left Recursion to Make LL(1) Grammar G₁

原文法中的这一条是左递归：

1	Seq → Seq , S \| S

我们需要将其转换为右递归。

✅ 转换后的 LL(1) 文法 G₁：

S     → number | List  
List  → ( Seq )  
Seq   → S Seq1  
Seq1  → , S Seq1 | ε

📘 Explanation (EN): Left recursion is removed by introducing a new non-terminal Seq1. The new form allows top-down parsing.

📙 解释（中文）： 我们引入 Seq1 来消除左递归，使文法适用于 LL(1) 分析（自顶向下预测分析）。

🔹 (c) FIRST 集、FOLLOW 集和 LL(1) 表

✅ FIRST 集（FIRST SET）

非终结符	FIRST 集
S	{number, (}
List	{(}
Seq	{number, (}
Seq1	{,, ε}

✅ FOLLOW 集（FOLLOW SET）

非终结符	FOLLOW 集
S	{), ,}
List	{$, ), ,}
Seq	{)}
Seq1	{)}

✅ LL(1) 分析表（LL(1) Parsing Table）

非终结符	`number`	`(`	`,`	`)`
S	S → number	S → List
List		List → ( Seq )
Seq	Seq → S Seq1	Seq → S Seq1
Seq1			Seq1 → , S Seq1	Seq1 → ε

📘 Explanation (EN):

FIRST set: the set of terminals that can begin a string derived from a non-terminal.
FOLLOW set: the set of terminals that can follow a non-terminal in some sentential form.
The LL(1) table is filled using FIRST and FOLLOW rules.

📙 解释（中文）：

FIRST 集表示某个非终结符最开始可能推出哪些终结符；
FOLLOW 集表示在文法推导中某个非终结符后面可能出现的终结符；
使用这两个集合构造 LL(1) 预测分析表，确保每个输入符号下只有一个唯一规则。

详细解释求 FIRST 集、FOLLOW 集，并构建 LL(1) 分析表

Calculate FIRST and FOLLOW sets and LL(1) table

1. FIRST 集（FIRST Sets）

我们按顺序列出每个非终结符的 FIRST 集：

FIRST(number) = { number } （终结符）
FIRST(( ) , ) = { ( ) , } （这些都是终结符）

FIRST(S)

1 2	S → number \| List List → ( Seq )

所以：

1	FIRST(S) = { number, ( }

FIRST(List)

1 2	List → ( Seq ) → FIRST(List) = { ( }

FIRST(Seq)

1
2
3

Seq → S Seq'
S 的 FIRST = { number, ( }
→ 所以 FIRST(Seq) = { number, ( }

FIRST(Seq')

1 2	Seq' → , S Seq' \| ε → FIRST(Seq') = { , , ε }

FIRST 集总结 (Summary):

非终结符	FIRST 集
S	{ number, ( }
List	{ ( }
Seq	{ number, ( }
Seq'	{ , , ε }

2. FOLLOW 集（FOLLOW Sets）

记住：

开始符号 S ⇒ FOLLOW(S) ⊇ { $ }
对于产生式 A → αBβ：
- 把 FIRST(β) - {ε} 加入 FOLLOW(B)
- 若 ε ∈ FIRST(β)，则 FOLLOW(A) ⊆ FOLLOW(B)

我们逐步推导：

S 是开始符号，所以 FOLLOW(S) ⊇ { $ }
在 Seq → S Seq' 中，FOLLOW(S) ⊇ FIRST(Seq') = { ,, ε }
- 加入 , 到 FOLLOW(S)
- Seq' 能推出 ε，说明 FOLLOW(Seq) ⊆ FOLLOW(S)
在 List → ( Seq ) 中：
- FOLLOW(Seq) ⊇ { ) }
在 Seq' → , S Seq' 中：
- FOLLOW(S) ⊇ FIRST(Seq') = { ,, ε }
- Seq' 能推出 ε，说明 FOLLOW(S) ⊇ FOLLOW(Seq') ⊇ FOLLOW(Seq) ⊇ { ) }

FOLLOW 集总结：

非终结符	FOLLOW 集
S	{ `,`, `)`, `$` }
List	{ `$`, `,`, `)` }（因为它是 S 的一部分）
Seq	{ `)` }
Seq'	{ `)` }

3. 构建 LL(1) 分析表（LL(1) Parsing Table）

非终结符	number	(	,	)
S	S → number	S → List
List		List → ( Seq )
Seq	Seq → S Seq'	Seq → S Seq'
Seq'			Seq' → , S Seq'	Seq' → ε

✅ LL(1) 表构建完成！

✅ 总结 Summary（中英文）

步骤	内容	Content
(a)	使用左最左推导得到 (4,(34))	Derived (4, (34)) using left-most derivation
(b)	消除了 Seq → Seq , S 的左递归	Removed left recursion and rewrote grammar
(c)	计算了 FIRST/FOLLOW 集，并构建了 LL(1) 表	Computed FIRST/FOLLOW sets and built the LL(1) parsing table

✅ 5. (LR Parsing; 25%)

🧾 题目描述（英文原文）：

Consider the following grammar G[S]:
1
2
E → (L) | a  
L → E L | E
Where a, (, ) are terminal symbols, and the other symbols are non-terminals.

任务如下：

(a) [5%]

Construct the DFA of LR(0) items of this grammar.

(b) [10%]

Construct the SLR(1) parsing table.

(c) [10%]

Using the SLR(1) parsing table, show the parsing stack and actions for the input string:

(a(a a))

✅ 标准答案 + 中英文解释：

🔹(a) LR(0) 项的 DFA 构造（5%）

📌 增广文法：

我们先给出增广文法（Augmented Grammar）：

1
2
3

S' → E
E → (L) | a
L → EL | E

📘 LR(0) 项的构造方法说明（EN）：

初始状态是包含增广项 S' → •E 的闭包。
每当 • 后面是一个符号，就将它“向前推进”形成新的状态。
每个状态是若干个项的集合（Closure）。

🔹(b) SLR(1) 语法分析表构造（10%）

我们使用 FOLLOW 集 来决定 规约项目应该填入的终结符列。

FOLLOW 集（Follow Sets）：

非终结符	FOLLOW 集
E	{a, (, ), $}
L	{)}

说明：

所有 reduce 项都只填入 FOLLOW 集中对应的终结符。
shift 动作根据 DFA 中的转移而定。

🔹(c) 分析栈与动作（Parsing Stack and Actions）（10%）

输入串为：(a(a a))$

📘 分析栈和动作表格（中英文解释）：

步骤	栈（Parsing Stack）	输入符号（Input）	动作（Action）
1	$ 0	(a(a a))$	Shift 1
2	$ 0 ( 1	a(a a))$	Shift 3
3	$ 0 (1 a 3	(a a))$	Reduce E → a
4	$ 0 (1 E 4	(a a))$	Shift 1
5	$ 0 (1 E 4 (1	a a))$	Shift 3
6	$ 0 (1 E 4 (1 a 3	a))$	Reduce E → a
7	$ 0 (1 E 4 (1 E 4	a))$	Shift 3
8	$ 0 (1 E 4 (1 E 4 a 3	))$	Reduce E → a
9	$ 0 (1 E 4 (1 E 4 E 4	))$	Reduce L → E
10	$ 0 (1 E 4 (1 E 4 L 6	))$	Reduce L → E L
11	$ 0 (1 E 4 (1 L 5	))$	Shift 7
12	$ 0 (1 E 4 (1 L 5 ) 7	)$	Reduce E → (L)
13	$ 0 (1 E 4 E 4	)$	Reduce L → E
14	$ 0 (1 E 4 L 6	)$	Reduce L → E L
15	$ 0 (1 L 5	)$	Shift 7
16	$ 0 (1 L 5 ) 7	$	Reduce E → (L)
17	$ 0 E 2	$	Accept

✅ 总结 Summary

📙 中文解释：

本题考察 LR(0) 状态机的构造、SLR(1) 表的构造、以及栈的动作模拟。
输入串 (a(a a)) 展示了嵌套结构与规约优先的处理方式。
构建语法分析表时要注意 FOLLOW 集的正确使用。

📘 English Explanation:

This grammar includes left-recursive rules (L → EL) and nested structures (E → (L)).
The parsing actions demonstrate how an SLR parser resolves ambiguity via FOLLOW sets and state transitions.
Final Accept confirms that the input string is valid under this grammar.

编译原理2007-2008

1. Fill in the blanks (18%)

a. Normally, a compiler consists of a number of phases. They are Scanner, _________________, ______________________, ______________________, Code Generator, and ___________________.

b. The logical units the scanner generates are called Tokens. For a modern programming language, there are five types of token. They are ________________, _______________, __________________, ____________________, ________________.

c. Based on the following C source code fragment

d. A grammar G usually includes four components, they are _______________, __________________, _______________, and _____________________.

2. (Terms Translation, 12%)

a) Compiler ——

b) Source code ——

c) Scanner ——

d) Tokens ——

e) Terminal symbol ——

f) Ambiguous Grammar ——

Scanning（扫描；20%）

(a) (10%) Construct an NFA that recognizes the same language as defined by the following regular expression:

(a*ba*b)*ba*

✅ 答案（简要结构解释）：

(b) (10%) Using the subset construction, convert the NFA into a DFA.

✅ 答案表（已提供部分 DFA 转换表）：

📘 解释 Explanation (EN):

📙 中文解释：

✅ 4. (LL Parsing; 25%)

🧾 题目描述（英文原文）：

请完成以下任务：

(a) (7%)

(b) (8%)

(c) (10%)

✅ 标准答案 + 中英文解释：

🔹 (a) Left-most Derivation （最左推导）

🔹 (b) Remove Left Recursion to Make LL(1) Grammar G₁

🔹 (c) FIRST 集、FOLLOW 集 和 LL(1) 表

✅ FIRST 集（FIRST SET）

✅ FOLLOW 集（FOLLOW SET）

✅ LL(1) 分析表（LL(1) Parsing Table）

详细解释求 FIRST 集、FOLLOW 集，并构建 LL(1) 分析表

Calculate FIRST and FOLLOW sets and LL(1) table

1. FIRST 集（FIRST Sets）

FIRST(S)

FIRST(List)

FIRST(Seq)

FIRST(Seq')

FIRST 集总结 (Summary):

2. FOLLOW 集（FOLLOW Sets）

FOLLOW 集总结：

3. 构建 LL(1) 分析表（LL(1) Parsing Table）

✅ 总结 Summary（中英文）

✅ 5. (LR Parsing; 25%)

🧾 题目描述（英文原文）：

任务如下：

(a) [5%]

(b) [10%]

(c) [10%]

✅ 标准答案 + 中英文解释：

🔹(a) LR(0) 项的 DFA 构造（5%）

📌 增广文法：

📘 LR(0) 项的构造方法说明（EN）：

🔹(b) SLR(1) 语法分析表构造（10%）

FOLLOW 集（Follow Sets）：

说明：

🔹(c) 分析栈与动作（Parsing Stack and Actions）（10%）

📘 分析栈和动作表格（中英文解释）：

✅ 总结 Summary

a. Normally, a compiler consists of a number of phases. They are Scanner, _, , , Code Generator, and ___.

b. The logical units the scanner generates are called Tokens. For a modern programming language, there are five types of token. They are ________, _, , , ____.

d. A grammar G usually includes four components, they are _____, , _, and _______________.

`(abab)ba`

🔹 (c) FIRST 集、FOLLOW 集和 LL(1) 表