编译技术2-4

# 编译技术2-4

From Regular Expressions (RE) to NFA

There is a beautiful inductive procedure for converting a regular expression to an NFA. 有一个优雅的归纳式方法可以将正则表达式转换为 NFA。

核心思想： 根据正则表达式的结构递归构造 NFA。每种基本构造（连接、并集、闭包）都有相应的 NFA 结构。

正则表达式：(a|b)*abb

How do we determine which lexemes are associated with each token? 我们如何确定哪些词素属于哪个词法单元（Token）？

When there are multiple ways to scan the input, how do we know which one to pick? 当输入可以有多种扫描方式时，我们该如何选择？

最长匹配优先（Longest match）
- 当多个正则表达式都匹配前缀时，应选择匹配最长的那一个。
Choose the match that consumes the most characters.
优先级选择（Rule priority）
- 如果多个表达式匹配相同长度，选择规则优先级更高的。
If lengths are equal, pick the rule that appears first.
效率问题 Efficiency concerns
- 如何快速扫描输入、匹配模式？
How to match efficiently in practice?

How do we address these concerns efficiently?

方法	中文说明	英文说明
将所有正则表达式构造成一个 NFA	所有 token 的 NFA 合并成一个超级 NFA（加 ε 转移）	Combine all token NFAs into a single "super NFA"
转换为 DFA	使用子集构造法将 NFA 转换为 DFA，提高效率	Convert NFA to DFA for fast scanning
DFA 最小化（可选）	减少状态数，提高运行效率	Optional: minimize DFA to reduce state count

步骤 Step	内容（中文）	Description (English)
正则表达式	描述词法规则	Describes patterns for tokens
构建 NFA	用归纳法逐步构建	Inductively construct from RE
合并 NFA	所有 token 合成一个超级 NFA	Merge all into one NFA
转换 DFA	更高效的自动机表示	Convert to DFA for performance
模拟输入	扫描字符串并识别 token	Simulate DFA to recognize tokens

From Regular Expressions to NFAs — Efficiency and Challenges

✅ 任何长度为 n 的正则表达式，都可以在 O(n) 的状态数内构建出一个 NFA。 ✅ Any regular expression of length *n* can be converted into an NFA with *O(n)* states.
⏱️ 判断一个长度为 m 的字符串是否匹配一个长度为 n 的正则表达式的时间复杂度为： ⏱️ Time complexity to determine if a string of length *m* matches an RE of length *n* is: 👉 O(m · n²) —— 使用 NFA 模拟时的最坏情况 👉 O(m · n²) — in the worst case when simulating with an NFA

💡 我们之后将看到如何将匹配复杂度优化为 O(m) 💡 We'll see how to reduce this to O(m) time later

✅ 这个优化是独立于正则表达式的复杂度的！ ✅ This improvement is independent of the complexity of the regular expression!

概念	中文说明	English Description
正则表达式 RE	描述模式	Pattern specification
NFA	非确定自动机，支持 ε 转移	Supports ε-moves, easy to construct
O(n) 状态	构造 RE → NFA 的状态数	Number of states in RE-to-NFA
O(m·n²) 时间	模拟时间复杂度	Simulation time
DFA	确定自动机，无 ε 转移，效率高	No ε-moves, fast matching
O(m) 时间	优化后匹配时间	Final optimized matching time