编译原理

本笔记基于上海交通大学 胡易坤老师 2024-2025 学年春季学期教学内容进行整理，部分图片来自胡老师的PPT，若有侵权请联系删除。

Quick Links

▶

考点跳转链接

Regex to NFA
NFA to DFA
DFA Simplification

Top-Down Parsing
- Recursive-Descent Parsing
- Predictive Parsing
  - LL(1) Parsing
Bottom-Up Parsing

FIRST and FOLLOW
LEADING and TRAILING
CLOSURE and GOTO
LR(1)’s CLOSURE and GOTO

Quick Check

▶

各类文法与解析方法

LL(1) Grammar
- L: 从左到右扫描输入；L：最左派生；1：提前看一个输入符号
- 无左递归和左因子
- A → α | β表示两个不同的产生式，则FIRST(α) 和 FIRST(β) 是互不相交的集合
  - 二者不会同时派生以a开头的字符串
  - 至多一个α和β可以派生空字符串
- 如果ϵ ∈ FIRST(β)，则FIRST(α)和FOLLOW(A)是互不相交的集合
LL(1) Parsing
- 提取左因子
- 消除直接左递归
- 计算FIRST和FOLLOW
- 构造预测分析表（列为终结符，行为非终结符）
- 从Start Symbol开始预测/或使用栈和输入缓冲区进行预测分析
  - 查表（最左侧非终结符，最左侧未匹配输入符号）作为派生的产生式进行派生
  - 若使用栈和输入缓冲区，则
    - 若栈顶符号为终结符且与输入符号匹配，则出栈并将输入指针后移
    - 若栈顶符号为非终结符，则查表（栈顶符号，输入指针对应符号）得到派生的产生式，将栈顶符号出栈并将产生式右侧符号逆序入栈
    - 直到栈为空，输入指针指向$，则接受输入
OG Grammar（算符文法）
- 任意生成式不含两个相邻的非终结符
- 不含空生成式
OPP Grammar（算符优先文法）
- 首先满足OG Grammar的要求
- 任意两个终结符号对（有序）之间一定满足唯一的优先级关系
OPP Parsing
- 计算LEADING和TRAILING
- 构造优先级关系表
- 使用栈和输入缓冲区进行运算符优先解析/或使用优先级爬升法
  - 若栈顶终结符号优先级 <· / =· 输入符号优先级，则移进
  - 若栈顶终结符号优先级 >· 输入符号优先级，则归约
LR(0) Grammar
- L: 从左到右扫描输入；R：最右派生，最左归约；0：提前看一个输入符号
- 任一项集的状态转移不含归约-归约或移进-归约冲突
LR(0) Parsing
- 扩展文法
- 通过GOTO和CLOSURE构造LR(0)项集族
- 构造LR(0)分析表（列为输入符号，行为状态）
- 使用栈和输入缓冲区进行LR(0)解析
  - 查表，（栈顶，输入符号）为s n，则移进并push状态n到栈顶，
  - 若为r m，则用第m条产生式进行归约，pop栈顶的符号数目等于产生式右侧符号数目，查（栈顶符号，产生式左侧符号）为n，则push状态n到栈顶
  - 若为ACCEPT，则接受输入
SLR(1) Grammar
- 任一项集的状态转移不含归约-归约或移进-归约冲突
  - LR(0)项目集中存在归约-归约冲突时，两个FOLLOW集的交集为空
  - LR(0)项目集中存在移进-归约冲突时，移进的终结符号不在归约的FOLLOW集中
SLR(1) Parsing
- 扩展文法、构造集族同LR(0)
- 构造SLR(1)分析表时：
  - 对A → β⋅ 归约时, 只对FOLLOW(A)中的终结符进行归约
  - 对S^′ → S^′ 赋值ACCEPT时, 只对$接受
- 使用栈和输入缓冲区进行SLR(1)解析
LR(1) Grammar
- L: 从左到右扫描输入；R：最右派生，最左归约；1：提前看一个输入符号
- LR(1)的项由两部分组成：LR(0)的项和一个lookahead符号
- 任一项集的状态转移不含归约-归约或移进-归约冲突
  - 无归约-归约冲突：同一状态下如果有多个归约，则前瞻符号不相交
  - 无移进-归约冲突：同一状态下如果同时有移进和归约，则移进的终结符号不在归约的前瞻符号中
LR(1) Parsing
- 扩展文法
- 通过GOTO和CLOSURE构造LR(1)项集族
- 构造LR(1)分析表时：
  - 对[A → β⋅, a] 归约时, 只对a进行归约
  - 对[S^′ → S⋅, $] 赋值ACCEPT时, 只对$接受
- 使用栈和输入缓冲区进行LR(1)解析
LALR(1) Grammar
- LA: LookAhead；L: 从左到右扫描输入；R：最右派生，最左归约；1：提前看一个输入符号
- LALR(1)的项集族是LR(1)的项集族的同心集合并
- 任一项集的状态转移不含归约-归约或移进-归约冲突
LALR(1) Parsing
- 扩展文法、构造集族同LR(0)作为心
- 对每个状态的初始项的心，用占位符#作为前瞻符号，通过闭包计算自发生成的前瞻符号和传递关系
- 对于传递关系，保留每个状态的内核项，构造传播表
- 根据传播表和自发生成的前瞻符号，通过若干轮传播得到每个心最终对应的前瞻符号，得到归约内核项的[心，前瞻符号(若干个)]对
- 构造LALR(1)分析表，与LR(1)相同
- 使用栈和输入缓冲区进行LALR(1)解析

Ch1 Intro

编译流程

解释器 vs. 编译器 (Interpreter vs. Compiler)
- 解释器方便错误诊断 (Error Diagnosis)
- 编译器得到的代码更加高效

编译器

前端：词法分析(Lexical Analysis)、语法分析(Syntax Analysis)、语义分析(Semantic Analysis)、中间代码生成(Intermediate Code Generation)
后端：代码优化(Optimization)、目标代码生成(Code Generation)

词法分析：Lexical Analysis

Lexical Analysis: Scanning （词法分析：扫描）
- Lexical Analyzer: Scanner （词法分析器：扫描器）
- Recognize Words (Lexemes) -> Tokens & Symbol Table （识别单词（词素）-> 生成标记和符号表）
- Token: <token-name, attribute-value (opt.)>
  - token-name: id, number, keywords, operators, etc. （标记名：标识符、数字、关键字、运算符等）
  - attribute-value: a pointer to the symbol table （属性值：指向符号表的指针）

语法分析：Syntax Analysis

Syntax Analysis: Parsing （语法分析：解析）
- Syntax Analyzer: Parser （语法分析器：解析器）
- Produce the Grammatical Structure （生成语法结构）
  - The Relationships among Tokens （标记之间的关系）
  - Tree-like Intermediate Representation （树状中间表示）
- Parse Tree （解析树）
  - Parser generates the Parse Tree, from which produces the Syntax Tree （解析器生成解析树，从中生成语法树）
- Syntax Tree: Simplified Parse Tree （语法树：简化的解析树）
  - Interior Node: Operation （内部节点：操作）
  - Children: Arguments of the Operation （子节点：操作的参数）

词法分析 vs. 语法分析

做着相似的事情：处理字符串
词法分析 (Scanning)：拆分字符串成Lexemes，并抽象成Tokens
语法分析 (Parsing)：整理Tokens的逻辑关系

语义分析：Semantic Analysis

Semantic Analysis （语义分析）
- computing additional info needed for compilation （计算编译所需的附加信息）
  - which is not regarded as syntax （这些信息不被视为语法）
- checking source code semantic consistency with the language definition （检查源代码与语言定义的语义一致性）
  - Type Checking （类型检查）

中间代码生成：Intermediate Code Generation

Intermediate Code Generation （中间代码生成）
- Intermediate Representations (IR), e.g., Syntax Tree, etc. （中间表示，例如语法树等）
- Low-level or Machine-like IR （低级或类似机器的中间表示）
  - LLVM-IR (LLVM, Clang), Gimple (GCC), etc. （LLVM-IR（LLVM，Clang），Gimple（GCC）等）
  - Three-address Code, with at Most Three Operands per Instruction （三地址码，每条指令最多三个操作数）
  - Static Single Assignment (SSA) （静态单赋值）
    - Every variable is only assigned (defined) once and defined before used. （每个变量只能被赋值（定义）一次，并且在使用前必须定义。）

优化：Optimization

Optimization （优化）
- Improve the IR for Better Target Code （改进中间表示以生成更好的目标代码）
- Better: Faster, Smaller, Greener （更好：更快、更小、更环保）

目标代码生成：Target Code Generation

Target Code Generation （目标代码生成）
- Instruction Selection （指令选择）
  - RISC vs. CISC
    - RISC: Reduced Instruction Set Computer （精简指令集计算机）
    - CISC: Complex Instruction Set Computer （复杂指令集计算机）
  - Intel Manual: > 6000 Pages （英特尔手册：超过6000页）
- Register Allocation （寄存器分配）
  - Graph Coloring Problem （图着色问题）
- Evaluation Order （计算顺序）
  - Arrange the Computation Order for Less Register Occupation （安排计算顺序以减少寄存器占用）
  - NPC (Non-Polynomial Complete) Problem （非多项式完全问题）

Ch2 Syntax Definition

文法的定义：Definition of Grammars

定义

Grammar: a Set of Rules to Describe a Language.
Language: a Sorted Set of Strings over some fixed Alphabet.
关系：Grammar Abstracts the Language to Cover All Its Strings. （文法是对语言的抽象，覆盖了语言的所有字符串）
∅: Empty Language (Set of Strings)
𝜺: Empty String (Set of Symbols)

组成

Grammar G[S] = (VN, VT, P, S)
- VT: A set of Terminal Symbols (终结符)
  - Atomic: 基本符号，不可再分
- VN : A set of Non-terminals (非终结符)
  - A Non-terminal: a set of strings of Terminals
- P: A set of Productions (生成式、规则)
  - A Non-terminal, an Arrow, a sequence of Terminals and/or Non-terminals
- S: A Start Symbol (开始符)
  - A Non-terminal

推导：Derivations

定义

A Grammar derives strings by beginning with the Start Symbol and repeatedly replacing a Non-terminal with Terminals via its Productions. （文法通过从开始符开始，反复用生成式将非终结符替换为终结符来派生字符串）
- 推导（Derivation）：反复根据生成规则用终结符替换非终结符
- 归约（Reduction）：推导的反过程

语法分析（Syntax Analysis, Parsing）

given a sequence of Terminals, figure out Whether it can be Derived from the Grammar and How if possible. （给定一个终结符序列，判断它是否可以由文法推导而来，如果可以，推导过程是什么）
一个语言可能有多个文法描述，而一个文法只会派生一个唯一语言

文法的二义性：Ambiguity

定义

语法树（Parse Tree）: A Graphical Representation of a Derivation without the Order of Applying Productions. （语法树是一个图形表示，表示了一个推导过程，但不考虑生成式应用的顺序）
二义性（Ambiguity）: When Parsing, given a sequence of Terminals, a Grammar is Ambiguous if there are more than one Parse Tree for the Derivation.（二义性是指一个文法可以产生多棵语法树） ### Fix the Grammar
Example:
- stmt -> if expr then stmt | if expr then stmt else stmt | other
- if E1 then if E2 then S1 else S2
Match each else with the closest unmatched then
- the statement appearing between a then and an else must be “matched”
- the interior statement must not end with an unmatched (open) then
Listing All Cases then Tidying Them Up
- if expr then matched_stmt
- if expr then open_stmt
- if expr then matched_stmt else matched_stmt
- if expr then matched_stmt else open_stmt
- so that:
  - matched_stmt -> if expr then matched_stmt else matched_stmt | other
  - open_stmt -> if expr then matched_stmt | if expr then open_stmt | if expr then matched_stmt else open_stm
- finally:
  - stmt -> matched_stmt | open_stmt
  - matched_stmt -> if expr then matched_stmt else matched_stmt | other
  - open_stmt -> if expr then stmt | if expr then matched_stmt else open_stmt

文法和语言的分类：Classes of Languages

范围由 Type 0 到 Type 3 逐渐缩小

Ch3 Scanning

词法分析：Lexical Analysis

Token, Pattern, and Lexemes

The Analysis Partitions Input String into Substrings. （分析将输入字符串划分为子字符串。）
Token: <token-name, attribute-value(opt.)>
- token-name: the role of lexical unit （词法单元的角色）
  - often refer to Token by its token-name
- attribute-value: any info associated to the Token （与Token相关的任何信息）
  - Generally, it has only ONE value: a pointer to the Symbol Table （通常只有一个值：指向符号表的指针）
    - In practice, the value of a constant can be stored as the attribute. （在实践中，常量的值可以作为属性存储。）
      - constant: strings, numbers （常量：字符串、数字）
Pattern: description of the form lexemes of a token may （描述词法单元的形式）
- regular expression: , ?, *, +, …
Lexeme: a sequence of characters matches a token’s pattern （与模式匹配的字符序列）
- Token vs. Lexeme: Class vs. Instance in C++

Specification of Tokens

Strings and Languages

String: a finite sequence of Symbols from an Alphabet （字符串：来自字母表的有限符号序列）
Language: any countable set of Strings （语言：任何可数的字符串集合）
Terms for Parts of a String s （字符串s的部分术语）:
- prefix：前缀
  - any string obtained by removing zero or more symbols from the end of s
- suffix：后缀
  - any string obtained by removing zero or more symbols from the beginning of s
- substring：子串
  - any string obtained by removing any prefix or any suffix from s （去掉前缀或后缀）
- subsequence：子序列
  - any string obtained by removing zero or more not necessarily consecutive position of s （不一定连续的位置）

Operations on Languages

Union: （并集）
- L₁ ∪ L₂ = {x|x ∈ L₁ or x ∈ L₂}
Concatenation: （连接）
- L₁L₂ = {xy|x ∈ L₁ and y ∈ L₂}
Kleene closure: （星闭包）
- $L^* = \bigcup_{1=0}^\infty L^i \\= {𝜖} ∪ L ∪ L^2 ∪ L^3 ∪ ... \\= \{x | x = x_1x_2...x_n, n ≥ 0, xi ∈ L\}$
- 𝜖 is the empty string
Positive closure: （正闭包）
- $L^+ = \bigcup_{1=1}^\infty L^i \\= LL^* \\= \{x | x = x_1x_2...x_n, n ≥ 1, xi ∈ L\}$

Regular Expressions

定义

Regular Expression (Regex): a way to describe Patterns of Tokens of a programming language. （正则表达式：描述编程语言的词法单元模式的一种方式）
- Each Regular Expression r denotes a Language L(r). （每个正则表达式r表示一个语言L(r)）
  - Regular Language, Type-3 Language
- The Regular Expressions are built recursively out of smaller ones, using the rules. （正则表达式是用规则递归构建的）

语法

BASIS （基础）
- 𝜀 is a regular expression, and L(𝜀) = {𝜀}, the empty set. （空字符）
- a is a symbol in a set Σ, then a is a regular expression, and L(a) = {a}. （集合中的字符）
INDUCTION （归纳）：Suppose r and s are expressions denoting L(r) and L(s)
- r|s : a regular expression denoting L(r) ∪ L(s)
- rs : a regular expression denoting L(r)L(s)
- r^* : a regular expression denoting (L(r))^*
- (r) : a regular expression denoting L(r)
  - We can add additional brackets around expressions. （我们可以在表达式周围添加额外的括号。）
定律：

Regular Definitions

Regular Definition: a set of productions with non-terminals derived by regular expressions. （正则定义：一组通过正则表达式派生的非终结符的产生式）

Extensions of Regex

+ : one or more instances
- r^* = r⁺ | 𝜖
- r⁺ = rr^* = r^*r
? : zero or one instance
- r? = r | 𝜖
[ ] : character classes
- [abc] = a | b | c
- [a − z] = a | b | … | z

Regular Language / Grammar, and Regex

Regular Expression r denotes a Language L(r).
- Regular Language is Type-3 Language （正则语言是类型3语言）
- Regular Language is that denoted by Regular Expressions. （正则语言是由正则表达式表示的）
Regular Grammar is the grammar describes a Regular Language.
- with the production form of A → α or A → αB

Recognition of Tokens

Input Buffer

Transition Diagrams

Transition Diagram = Nodes + Edges, a Flowchart （状态流程图 = 节点 + 边）
- Nodes: states, conditions that could occur when looking for a lexeme that matches one pattern. （节点：状态）
  - States: Circles （状态：圆圈）
  - Start State: Arrowhead, Beginning of a Pattern （起始状态：箭头，开始）
  - End State(s): Double Circles, End of a Pattern （终止状态：双圆圈，结束）
- Edge: actions, taken to transit from one State to Another. （边：动作）
  - labeled by a Symbol or a set of Symbols for matching （标记为符号或符号集以进行匹配）
- Deterministic: at most ONE edge out of a given state with a given label. （确定性：在给定状态下，最多有一条边）
Example:
- *(Retract): A Token has been accepted while another char has been read which must be unread. （回退：一个Token已经被接受，而另一个不应读取字符已经被读取，必须回退）

Reserved Words （保留字）

Keywords look like Identifiers.
- if, then, …
Add reserved words into symbol table initially. （在符号表中添加保留字）
Create separate transition diagrams for each keyword. （为每个关键字创建单独的转换图）
- thenextone

有穷自动机：Finite Automata

Finite Automata

What: an Abstract Machine that can be in exactly one of a Finite number of States at any given time. （有限自动机：在任何给定时间只能处于有限数量的状态之一的抽象机器）
- Finite Automation = Finite-state Automation (FSA, plural: automata) （有限自动机 = 有限状态自动机）
  - Finite-state Machine (FSM), or simply State Machine （有限状态机，或简单地称为状态机）
- changes from one state to another according to Inputs, called Transition （根据输入，从一个状态变化到另一个状态，称为转换）
Why: used as the Recognizer for Scanning, identifying Tokens （用于扫描的识别器，识别token）
How: answers “YES” or “NO” about each input String （如何：对每个输入字符串回答“是”或“否”）
- determines whether the String is valid for the given Grammar （确定字符串是否符合给定的语法）

DFA vs. NFA

FA: Deterministic (DFA) or Non-deterministic (NFA)
DFA: have exactly/at most one action for each input symbol （每个输入符号有一个动作）
- can be represented with a Transition Diagram
- Recognition with DFA: Faster, may take More Space （识别DFA：更快，可能占用更多空间）
- complex to represent Regex, but more Precise, widely used （复杂表示正则表达式，但更精确，广泛使用）
NFA: can have multiple actions for the same input symbol （同一输入符号可以有多个动作）
- can be represented with a Transition Graph
- Recognition with NFA: Slower, may take Less Space （识别NFA：较慢，可能占用更少的空间）
- simply represents Regex, but less Precise （简单表示正则表达式，但不够精确）
Example:
Lexical Analysis Workflow with FA:
- Regex -> NFA -> DFA
- Regex -> DFA

Nondeterministic Finite Automata (NFA)

An NFA M = (S, 𝜮, move, 𝒔₀, F) consists of:
- S: a finite set of States （有限状态集）
- 𝜮: the Input Alphabet, excluding 𝜖 （不包含𝜖的输入符号集合）
- move, a Transition Function （转换函数）
  - move(State, Symbol) = set of Next States
  - move: 𝑆 × (Σ ∪ {𝜖}) ⟶ ℙ(𝑆)
- s₀ ∈ S, the Start State (or Initial State) （起始状态）
- F ⊆ S, a set of Accepting States (or Final States) （终止状态）
An NFA accepts Input String s iff
- there exists some path in the Transition Graph from the Start State to one Accepting State, （存在一条路径从起始状态到一个接受状态）
- such that symbols along the path spell out s （路径上的符号拼写出s）
Transition Tables：rows for States, columns for Input Symbols and 𝝐
- Example:

Deterministic Finite Automata (DFA)

What: a Special Case of an NFA, where
- there are no moves on symbol 𝜖, and
- for each state s and input symbol a, there is Exactly ONE edge out of s labeled by a. （每个状态s和输入符号a，恰好有一条边出s标记为a）
  - COMPLETE: It defines from each state a transition for each input symbol. （完整：它定义了从每个状态到每个输入符号的转换）
    - Transition function is a total function.
  - Local Automation: DFA not necessarily complete (… At Most ONE edge …) （局部自动机：DFA不一定是完全图）
    - Transition function is a partial function.

Algorithm for Simulation

Conversion: NFA –> DFA

Subset Construction （子集构造）
- removing 𝜖-transitions
- combining multiple NFA’s states into ONE constructed DFA’s state （将多个NFA的状态组合成一个构造的DFA的状态，即：等势点合并）
Definitions:
- 𝜖-closure(s):
  - s: some State
  - = Set of NFA States reached by state s via 𝜖-transitions, including s itself. （NFA中可以通过若干个空变换到达的状态的集合）
- 𝜖-closure(T):
  - T: set of States
  - = ∪_s ∈ T 𝜖-closure(s)
- move(T, a):
  - T: set of States
  - a: Input Symbol
  - = NFA’s States reached by 𝑠 ∈ 𝑇 on a.

Algorithm Subset Construction

Input: the start State s0 and the Transition Diagram of NFA N.

Output: Transition Graph of DFA Dtran

add 𝝐-closure(s0) into Dstates //将初始状态s0的𝝐闭包加入Dstates
while (Dstates has unsearched state S) { //当Dstates有未搜索的状态S时
    tag S as searched //将S标记为已搜索
    foreach input symbol a { //对每个输入符号a
        U = 𝝐-closure(move(S, a)) //设U为S进行a动作后状态S'的𝝐闭包
        if (U is new to Dstates) { //如果U是Dstates中的新状态
            add U into Dstates as unsearched //将U加入Dstates并标记为未搜索
        }
        Dtran(S, a) = U //将Dtran(S, a)设为U
    }
}

最后得到的Dtran是一个DFA的转换表，Dstates是DFA的状态集合。

Algorithm 𝜖-closure(T) Computation:

上一步中U = 𝝐-closure(move(S, a))的实现逻辑：
Input: the State Set T

Output: 𝜖-closure(T)

push all states in T onto Stack //将T中的所有状态压入栈中
while (Stack is not empty) { //当栈不为空时
    s = Stack.pop() //弹出栈顶元素s
    foreach (state u reached by s via 𝜖) { //对于每个s通过𝜖能达到的状态u
        if (u is not in 𝜖-closure(T)) { //如果u不在T的𝜖闭包中
            add u into 𝜖-closure(T) // 将其加入T的𝜖闭包
            Stack.push(u) //将u压入栈中
        }
    }
}

Example:

Conversion: Regex –> NFA

McNaughton-Yamada-Thompson Algorithm
Regex’s Definition:
- BASIS:
- INDUCTION:
Example:

Workflow

The Workflow of Lexical Analyzer
- Regex –> NFA Construction
- NFA –> DFA Construction
- Simulating DFA to Recognize Tokens
Convert Regex Directly into DFA: PASS
DFA Simplification: Minimizing the Number of States

DFA Simplification

What and Why:
- no REDUNDANT states （无冗余状态）
  - REDUNDANCE: the states that NO accepted input string’s path passes through （没有路径到达终止状态的状态）
  - (in the transition graph)
- no EQUIVALENT states （无等效状态）
  - EQUIVALENCE: states with the SAME side effects （具有相同副作用的状态）
  - (making the states indistinguishable)
- Distinguish States via Input String （通过输入字符串区分状态）
  - State: s, t
  - String: x
  - x distinguishes s from t,
    - if one state can reach an accepting state via x, while the other cannot. （如果一个状态可以通过x到达终止状态，而另一个状态不能）
  - s is distinguishable from t,
    - if there is some string distinguishes them. （存在一些字符串可以区分它们）
  - Unify Indistinguishable States into One. （将不可区分的状态合并为一个）
How：
1. Start with the initial partition Π with two groups, the accepting and non-accepting states of the DFA. （将DFA的接受状态和非接受状态分为两个组）
2. Let Π_new := Π. Then, for each group G of Π: （初始时令Π_new = Π，然后对于每个组G）
  - For each input symbol a, states s, t in G are partitioned if they transit to different groups of Π via a; （对于每个输入符号a，如果状态s和t通过a转移到不同的组，则他们被划分为不同的组）
  - Replace G in Π_new by the new subgroups. （用新的子组替换G）
3. If Π_new ≠ Π, Π : = Π_new and repeat Step 2, Step 4 otherwise. （如果Π_new ≠ Π，则令Π : = Π_new并重复步骤2，否则跳到步骤4）
4. Aggregate the transitions among groups. （将组之间的转换聚合）
5. The resulting DFA is the minimized DFA. （得到的DFA是最小化的DFA）
Example:

Ch4 Parsing

词法分析：Lexical Analysis

Parser

What: Given Tokens, Parsing Verifies whether the Token Names Can Be Generated by the Grammar for the Source Language. （给定tokens，解析会验证token名称是否可以由源语言的语法生成）
Why: We expect the Parser
- to report Syntax Errors （报告语法错误）
- to recover from Errors to continue following processes. （从错误中恢复以继续后续过程）
How: Derivation or Reduction
- Top-down and Bottom-up Parsing

Compiler Errors

Lexical Errors
- The string does not match the pattern of any token. （字符串与任何token的模式不匹配）
Syntactic Errors
- The string does not meet the requirements of the grammar. （字符串不符合语法要求）
Semantic Errors: Type Mismatching （语义错误：类型不匹配）

Error-Recovery Strategies

Panic-Mode Recovery
- Discarding input symbols one at a time until meeting synchronizing tokens （丢弃输入符号，直到遇到同步tokens）
  - synchronizing tokens, e.g. “;”, “}”, etc., decided by designers
- simple, but may cause more errors
Phrase-Level Recovery
- Local Correction on Input, allowing the parser to continue （允许解析器继续）
  - e.g., “,” → “;”, delete/insert “;”
  - designers’ responsibility
- helpless if error occurs before the point of detection
Error Productions
- augmenting grammar with productions generating erroneous constructs （用产生错误构造的产生式扩充语法）
- relying on designers
Global Correction
- choosing a minimal sequence of changes for a globally least-cost correction （选择一系列最小的变化，以实现全局最低成本的修正）
- costly, yardstick? should defined by designers

Context-Free Grammar (CFG)

Definition

A Context-Free Grammar consists of:
- Terminals
- Non-terminals
- Start Symbol
- Productions: A → α
  - Header / Left Side → Body / Right Side
    - Header: A Non-terminal
    - Body: zero or more Terminals or Non-terminals
  - V_N → (V_T|V_N)^*

Derivation

What: Beginning with the Start Symbol, replace a Non-terminal by the body of one of its Production. （从起始符号开始，用其产生式的右侧替换非终结符）
Why:
- corresponding to Top-down Construction of a Parse Tree （对应于自顶向下构造解析树）
- helpful for Bottom-up Parsing
How:
- $$ : derive in one step
- $ $ : derive in zero or more steps
  - $\alpha \overset{*}{\Rightarrow} \alpha$
  - if $\alpha \overset{*}{\Rightarrow} \beta$ and $\beta \overset{*}{\Rightarrow} \gamma$, then $\alpha \overset{*}{\Rightarrow} \gamma$
- $ $ : derive in one or more steps
$\mathbf{S} \overset{*}{\Rightarrow} \alpha$
- α is a Sentential Form of S. （α是S的一个句子形式）
- α may contain Terminals, Non-terminals, or may be Empty.
- Sentence: a Sentential Form without Non-terminals. （没有非终结符的句子形式）
What is a Language?
- L(G): set of Sentences generated by the Grammar G
- A string of terminals $ L(G) $ if and only if $ $ is a Sentence of $ G $ (or $ S $).
Example:
Leftmost Derivation: the Leftmost Non-terminal is always replaced at first （最左边的非终结符总是第一个被替换）
- $ $
Rightmost Derivation: $ $
Parse Trees and Derivations
- What: A Graphical Representation of a Derivation （派生的图形表示）
  - filtering out the order in which productions applied to replace non-terminals （过滤出应用于替换非终结符的产生式的顺序）
- Example：

Top-Down Parsing

What: Create the Parse Tree from Top to Bottom （从上到下创建解析树）
- from root to leaves （从根到叶）
Why: for Parsing
How: Derive an Input String in the Leftmost Manner （以最左边的方式派生输入字符串）
- consistent with string scanning （与字符串扫描一致）
- Key: determine the production to be applied for a non-terminal （确定要应用于非终结符的产生式）
- Recursive-Descent Parsing
  - require backtracking to find right production （需要回溯以找到正确的产生式）
  - general, but inefficient
- Predictive Parsing
  - a special case of Recursive-Descent Parsing
  - no backtracking, choosing by looking ahead at input symbols （无需回溯，通过提前查看输入符号进行选择）

Recursive-Descent Parsing

void A() {
    // Choose an A-production, A --> X1 X2 ...Xk;
    for (i = 1 to k) {
        if (Xi is a non-terminal)
            call Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else ... // an error has occurred
    }
}

Example:

Left Recursion （左递归）

What: The Grammar has a non-terminal A such that there exists a derivation $\mathbf{A} \overset{+}{\Rightarrow} \mathbf{A} \alpha$ （文法有一个非终结符A，使得存在一个派生$\mathbf{A} \overset{+}{\Rightarrow} \mathbf{A} \alpha$）
Why: Recursive-Descent Parsing cannot handle Left Recursion. （递归下降解析无法处理左递归）
How: Transform the grammar to eliminate Left Recursion. （转换文法以消除左递归）
- ==Immediate Elimination: A → Aα|β ⇒ A → βA^′, A^′ → αA^′|𝝐==
- $ _1 | | _m | _1 | | _n (_1 | | _n) , (_1 | | _m) | $
Example:

Left Recursion Elimination （消除左递归）

INPUT: Grammar G without Cycles or 𝝐-productions （无循环或𝝐产生式的文法）
- Cycle: $\mathbf{A} \overset{+}{\Rightarrow} \mathbf{A}$
- 𝝐-production: A → ϵ
OUTPUT: Equivalent Grammar without Left Recursions （没有左递归的等效文法）

Steps:

supposing there are the Non-terminals with Order A1, A2..., An
for (i from 1 to n) {
    for (j from 1 to i-1) {
        replace each Ai → Aj𝛾 by Ai → 𝛿1𝛾 | 𝛿2𝛾 | ... | 𝛿𝑘𝛾, where:
            Aj  → 𝛿1 | 𝛿2 | ... | 𝛿𝑘 are all current Aj-productions
    }
    eliminate the Immediate Left Recursion among the Ai-productions
}

Example:

Predictive Parsing

What: recursive-descent parsers needing no backtracking （不需要回溯的递归下降解析器）
- can be constructed for LL(k) grammar （可以为LL(k)文法构造）
  - L: scanning input from Left to right
  - L: producing a Leftmost derivation
  - k: using k input symbols of lookahead at each step to make decision
Why: a unique production to apply, or none to use (error)
Example: LL(1)
- stmt → if (expr) stmt else stmt | while (expr) stmt | { stmt_list}
How: FIRST and FOLLOW
- assist in choosing which production to apply, based on the next input symbol （根据下一个输入符号选择应用哪个产生式）

FIRST and FOLLOW

FIRST: the set of terminals that begin strings derived from a non-terminal or a string of grammar symbols. （从非终结符或语法符号字符串派生的字符串开始的终结符集合）
- FIRST(𝜶): what the first symbol would be for 𝜶
- 𝜶: string of grammar symbols
- return: set of terminals that begin strings derived from 𝜶
  - first symbols of strings derived from 𝜶
- HOW：
  - if X is a terminal, then FIRST(X) = {X}
  - if X is a non-terminal, and X → Y₁Y₂…Y_k
    - add all non-𝝐 symbols of Y₁ to FIRST(X)
    - add all non-𝝐 symbols of Y₂ to FIRST(X), if ϵ ∈ FIRST(Y₁)
    - ···
    - add all non-𝝐 symbols of Y_k to FIRST(X), if ϵ ∈ FIRST(Y₁) and ϵ ∈ FIRST(Y₂) and ··· and ϵ ∈ FIRST(Y_k − 1)
    - add 𝝐 to FIRST(X), if ϵ ∈ FIRST(Y₁) and ϵ ∈ FIRST(Y₂) and ··· and ϵ ∈ FIRST(Y_k)
- Example:
FOLLOW: the set of terminals that can appear immediately to the right of a non-terminal in some sentential form. （在某些句子形式中，可以出现在非终结符右侧的终结符集合）
- FOLLOW(N): what is the next symbol of N
- N: a non-terminal
- return: set of terminals can appear immediately after N in a sentential form
- HOW:
  - place $ in FOLLOW(S)
    - S: start symbol
    - $: input right end-marker
  - for each production M → αNβ
    - add all symbols in FIRST(β) to FOLLOW(N), except 𝝐
    - if ϵ ∈ FIRST(β), add all symbols in FOLLOW(M) to FOLLOW(N)
  - for each production M → αN
    - add all symbols in FOLLOW(M) to FOLLOW(N)
- Example:

LL(1) Grammar

What: Any A → α | β Represents two Distinct Productions （A → α | β表示两个不同的产生式）
- FIRST(α) and FIRST(β) are disjoint sets.（二者是互不相交的集合）
  - For no terminal a, do both α and β derive strings beginning with a. （二者不会同时派生以a开头的字符串）
  - At most one of α and β can derive the empty string. （至多一个α和β可以派生空字符串）
- If ϵ ∈ FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets. （如果ϵ ∈ FIRST(β)，则FIRST(α)和FOLLOW(A)是互不相交的集合）
Why: Proper Production is Selected by Looking ONLY at the Next Input Symbol. （通过仅查看下一个输入符号来选择适当的产生式）
How: By Parsing Table, a Two-Dimensional Array
- M[A, a] = α, when deriving A, apply A → α if coming up with a. （当派生A时，如果出现a，则应用A → α）

LL(1) Parsing

Predictive Parsing Table Construction
- INPUT: Grammar G.
- OUTPUT: Parsing Table M.
- STEPS: For each production A → α:
  1. For each terminal a ∈ FIRST(α), add A → α to M[A, a].
  2. If ϵ ∈ FIRST(α), then for each terminal b ∈ FOLLOW(A), add A → α to M[A, b].
  3. If ϵ ∈ FIRST(α) and $ ∈ FOLLOW(A), add A → α to M[A, $].
- Example:
Implementation
- Stack-based Method, mimicking a leftmost derivation （基于栈的方法，模仿最左派生）

Left Factoring

When the decision is not clear, defer it until seeing enough symbols. （当决策不明确时，推迟到看到足够的符号为止）
Left-Factored:
- A → 𝜶𝜷₁|𝜶𝜷₂ by A → 𝜶A^′, A^′ → 𝜷₁|𝜷₂
How:
- For each non-terminal A, find the longest common prefix 𝜶 of its alternatives.
- If 𝜶 is not empty, replace all of the A-productions
Example:
- A → 𝜶𝜷₁|𝜶𝜷₂|...|𝜶𝜷_n|𝜸 by A → 𝜶A^′|𝜸, A^′ → 𝜷₁|𝜷₂|...|𝜷_n

Non-LL(1) Grammar

Non-LL(1) Grammars:
- grammars with Left Recursion （左递归的文法）
- grammars not Left Factored （未左因子化的文法）
- grammars with Ambiguity （歧义文法）
Thus, before Predictive Parsing,
- perform Left Factoring （左因子化）
- eliminate Left Recursion (prerequisite of the elimination?) （消除左递归）
- remove Ambiguity （消除歧义）

Bottom-Up Parsing

What: the construction of a parse tree beginning at the leaves and working up to the root （从叶子开始构建解析树，直到根）
Why: not all grammars can be made LL(1)
How: construct rightmost derivation in the reverse order
- Reduction: “Reversed Derivation”（归约：反向派生）
  - from the string to the start symbol （从字符串到起始符号）
  - The body of a production is replaced by the non-terminal at its header. （用产生式的头部替换产生式的主体）
- $S \overset{rm}{\Rightarrow} \gamma_0 \overset{rm}{\Rightarrow} \gamma_1 \overset{rm}{\Rightarrow} \dots \overset{rm}{\Rightarrow} \gamma_n \overset{rm}{\Rightarrow} \omega$
  - find the rightmost derivation in the reverse order: “leftmost reduction” （找到右侧派生的反向顺序：最左侧归约）
- “Left-to-Right, Rightmost Derivation in Reverse”: LR Parsing
Example：
- E ⇒ T ⇒ T * F ⇒ T * id ⇒ F * id ⇒ id * id

Handles

What: if $\mathbf{S} \overset{*}{\Rightarrow} \mathbf{\alpha A \omega} \overset{}{\Rightarrow} \mathbf{\alpha \beta \omega}$, then production A → β in the position following α is a handle of αβω.
- a substring that matches the body of a production, representing a step of reduction （与产生式的主体匹配的子字符串，表示归约的一步）
- a pair of values: (production, position)
Why: handle pruning for bottom-up parsing （处理底向上解析的剪枝）
- identify handles and reduce them to the appropriate leftmost non-terminals （识别句柄并将其归约到适当的最左非终结符）
How: by using a stack to keep track of the current position in the input string and the corresponding production rules （使用栈来跟踪输入字符串中的当前位置和相应的产生式规则）
Example：

Shift-Reduce Parsing （移进归约解析）

How: Stack + Input Buffer
- Stack: reduced Grammar Symbols
- Input Buffer: rest of the String to be parsed
Actions
- Shift: move the next symbol onto stack
- Reduce: replace the handle on the top of stack
- Accept: announce the success of parsing
- Error: discover syntax errors, and call for recovery
Example:

Operator-Precedence Parsing （运算符优先解析）

What: a shift-reduce parser handling operator-precedence grammar （处理运算符优先文法的移位归约解析器）
- operator-precedence grammar: a subset of LR(1) grammar （运算符优先文法：LR(1)文法的一个子集）
- for each Production:
  - no 𝝐 in the body （主体中没有空字符）
  - no two consecutive non-terminals in the body （在主体中没有两个连续的非终结符）
Why: to handle expressions with operator precedence and associativity （处理具有运算符优先级和结合性的表达式）
How: find handles according precedence
- Precedence （优先级）
  - a < ·b: a’s precedence is lower than b’s
  - a = ·b: … is equal to …
  - a > ·b: … is higher than …
- Precedence Climbing Method （优先级爬升法）
  - scan the input from Left to Right until >· is encountered
  - then, scan backward until <· is encountered
  - that between <· and >· is the handle
- Implementation with STACK (栈实现)
  - let a be the top Terminal on the STACK （栈顶终结符）
  - let b be the INPUT Symbol under processing （正在处理的输入符号）
  - if a < ·or = ·b, Shift
  - else if a > ·b, Reduce
Example:

OPP: Precedence Relation Construction （优先级关系构造）

Precedence Relation

If operator a has higher precedence than b
- a > ·b
- b < ·a
If a and b has equal precedence
- if left-associative, then a > ·b and b > ·a
- if right-associative, then a < ·b and b < ·a
For all operator a
- a < ·id, id > ·a
- $ < ·a, a > ·$
- a < ·(, ( < ·a, a > ·), ) > ·a
- ( = ·)

LEADING and TRAILING

LEADING: the set of symbols that can appear at the beginning of a string derived from a non-terminal （可以出现在从非终结符派生的字符串开头的符号集合）
- $\mathbf{LEADING(Q)} = \{\mathbf{Y}, \mathbf{N} \ | \ \mathbf{Q} \overset{+}{\Rightarrow} \mathbf{Y\delta} \ or\ \mathbf{Q} \overset{+}{\Rightarrow} \mathbf{N Y \delta}, \mathbf{N} \in \mathbf{V_n}, \mathbf{Y} \in \mathbf{V_t}\}$
- for Q → Yδ or Q → NYδ, we have:
  - Y, N ∈ LEADING(Q)
  - LEADING(N) ⊆ LEADING(Q)
TRAILING: the set of symbols that can appear at the end of a string derived from a non-terminal （可以出现在从非终结符派生的字符串末尾的符号集合）
- $\mathbf{TRAILING(P)} = \{\mathbf{X}, \mathbf{N} \ | \ \mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X} \ or \ \mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X} \mathbf{N}, \mathbf{N} \in \mathbf{V_n}, \mathbf{X} \in \mathbf{V_t}\}$
- for P → μX or P → μXN, we have:
  - X, N ∈ TRAILING(P)
  - TRAILING(N) ⊆ TRAILING(P)
Example:

Constructing Precedence Relations

X = ·Y
- if there is A → αXYβ, where X, Y ∈ V, α, β ∈ V^*
- or, there is A → αXNYβ, where N ∈ V_n ∪ {ϵ}, X, Y ∈ V_t
  - adjacent symbols have equal precedence （相邻符号具有相等的优先级）
    - at least one of them is a terminal （至少有一个是终结符）
  - e.g., E + T, T * F, E+ = T
X < ·Y
- if there is A → αXQβ, and $\mathbf{Q} \overset{+}{\Rightarrow} \mathbf{Y \delta}$, where X, Y ∈ V, Q ∈ V_n, α, β, δ ∈ V^*
- or, there is A → αXQβ, and $\mathbf{Q} \overset{+}{\Rightarrow} \mathbf{N Y \delta}$, where N ∈ V_n ∪ {ϵ}, X, Y ∈ V_t
  - $\mathbf{LEADING(Q)} = \{\mathbf{Y}, \mathbf{N} \ | \ \mathbf{Q} \overset{+}{\Rightarrow} \mathbf{Y\delta} \ or\ \mathbf{Q} \overset{+}{\Rightarrow} \mathbf{N Y \delta}\}$
    - for A → αXQβ, we have X <· Symbols in LEADING(Q)
X > ·Y
- if there is A → αPYβ, and $\mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X}$, where X, Y ∈ V, P ∈ V_n, α, β, μ ∈ V^*
- or, there is A → αPYβ, and $\mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X} \mathbf{N}$, where N ∈ V_n ∪ {ϵ}, P ∈ V_n, X, Y ∈ V_T
- $\mathbf{TRAILING(P)} = \{\mathbf{X}, \mathbf{N} \ | \ \mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X} \ \text{or} \ \mathbf{P} \overset{+}{\Rightarrow} \mu \mathbf{X} \mathbf{N}\}$
  - for A → αPYβ, we have Symbols in TRAILING(P) > ·Y

Precedence Table

Steps：
- for each production A → X₁X₂…X_k:
  - for each X
    - if X_i ∈ V_t and X_i + 1 ∈ V_t, then X_i = ·X_i + 1
    - if X_i ∈ V_t and X_i + 2 ∈ V_t and X_i + 1 ∈ V_n, then X_i = ·X_i + 2
    - if X_i ∈ V_t and X_i + 1 ∈ V_n, then X_i < ·LEADING(X_i + 1)
    - if X_i ∈ V_n and X_i + 1 ∈ V_t, then X_i = ·X_i + 1 and TRAILING(X_i) > ·X_i + 1
- for the Start Symbol S:
  - $ < ·LEADING(S)
  - TRAILING(S) > ·$
Example:

OPP: Some More

Unary Minus vs. Binary Minus （一元负号与二元负号）
Leave It to Scanners （留给扫描器）
- return two different tokens for the two （返回两个不同的tokens）
- lookahead is required （需要向前看）
OPPs are not used often in practice （运算符优先级在实践中不常用）
- limited scenarios and applications （有限的场景和应用）
- but, simple -> part of a complex parsing system （但简单，是复杂解析系统的一部分）

LR Parsing

What:
- left-to-right scanning
- rightmost derivation in reverse
Why
- can recognize virtually all programming languages of context-free grammars （几乎可以识别所有上下文无关文法的编程语言）
- is the most general non-backtracking shift-reduce parsing method known （已知的最通用的非回溯移位归约解析方法）
  - yet is still efficient （仍然高效）
- can detect a syntax error as soon as possible （尽快检测语法错误）
- is a proper superset of the predictive parsing （是预测解析的适当超集）

LR(0) Parsing

Items and the LR(0) Automaton

Problem: when to shift and when to reduce?
- how to decide whether that on the top of the stack is a handle? （如何判断栈顶的句柄？）
LR(0) Items: a production with a dot at some position in its body （LR(0) 项目：在其主体的某个位置有一个点的产生式）
- prefixes of a valid production, indicating how much we have seen at the point （有效产生式的前缀，指示我们在该点上已经看到多少）
- e.g. A → XYZ yields four items（例如A → XYZ产生四个项）：
  - A → ⋅XYZ: hope to see XYZ next on the input
  - A → X ⋅ YZ: hope to see YZ next on the input
  - A → XY ⋅ Z: hope to see Z next on the input
  - A → XYZ⋅: hope to see nothing next on the input
- kind of state + transition → automaton
Kernel and Non-Kernel Items
- Kernel Items: the Initial Item + those whose dots are not at the left
- Non-Kernel Items: Otherwise
LR(0) Automaton: CLOSURE + GOTO
- CLOSURE: set of Items （项集的闭包）
- GOTO: the Transition Function （转换函数）

CLOSURE and GOTO

CLOSURE(I):
- I: a set of Items for a grammar G（文法G的项集）
- construct by two rules:
  1. every Item in I is added in to CLOSURE(I) （I中的每个项目都添加到CLOSURE(I)中）
  2. if A → α ⋅ Bβ is in CLOSURE(I) and B → γ is a production of G, then add B → ⋅γ to CLOSURE(I) （如果A → α ⋅ Bβ在CLOSURE(I)中，并且B → γ是G的一个产生式，则将B → ⋅γ添加到CLOSURE(I)中）
- Example:
GOTO(I, X):
- I: a set of Items for a grammar G （文法G的项集）
- X: a grammar symbol （文法符号）
- the closure of the set of items [A → αX ⋅ β] such that [A → α ⋅ Xβ] is in I （GOTO(I, X)是I中所有形如[A → α ⋅ Xβ]的项所对应的项[A → αX ⋅ β]的集合的闭包）
  - the transition from the state for I under input X （在输入X下，I的状态转换）
- construct by
  1. for each Item A → α ⋅ Xβ in I
  2. then every Item in CLOSURE(A → αX ⋅ β) is added to GOTO(I, X)
- Example:

Automaton Construction

INPUT: a grammar G
OUTPUT: a LR(0) automaton
Construction:
1. augment G to G^′ by adding a new start symbol S^′ and production S^′ → S （通过添加新的起始符号S^′和产生式S^′ → S扩展文法G至G^′）
2. C:= {CLOSURE(S^′ → ⋅S)} （先求CLOSURE(S^′ → ⋅S)，作为闭包的集合C中的第一个闭包）
3. repeat:
  - for each Item I in C and each grammar symbol X in G^′ （对C中每个项目I和G^′中的每个文法符号X）
    - if GOTO(I, X) is not empty and not in C （如果GOTO(I, X)不为空且不在C中，注意GOTO(I, X)是项集I在输入符号X下的转换的闭包）
      - add GOTO(I, X) to C
4. until no new Items are added to C
Example:

Parsing Table Construction

LR(0) Parsing Table 𝕋:
- Rows: states
- Columns: grammar symbols
  - terminals for SHIFT and REDUCE actions
  - non-terminals for GOTO actions
- Construction: each edge X: (s_m, s_n) （对于每条边X，(s_m, s_n)为其起始和终止状态，X为文法符号）
  - if X is a terminal a, then 𝕋[s_m, a] = SHIFT n （记作s n）
  - if X is a non-terminal A, then 𝕋[s_m, A] = GOTO n （记作n）
  - if A → β⋅ is in s_m, then for each terminal a, 𝕋[s_m, a] = REDUCE A → β （记作r n，其中n是产生式的编号，用罗马数字表示）
  - if S^′ → S⋅ is in s_m, then for each terminal a, 𝕋[s_m, a] = ACCEPT （记作a/acc）
Example: > 注：在上面的例子中，状态1、2、9均存在移进-归约冲突，因此不属于LR(0)文法。

Alogrithm and Implementation （算法与实现）

INPUT: an input string ω and an LR-parsing table 𝕋 for a grammar G
OUTPUT: if ω ∈ L(G), the reduction steps of a bottom-up parse for ω

STEPS:

let a be the first symbol of 𝜔$ //设a是输入字符串𝜔$的第一个符号
push state 0 onto the stack //将状态0推入栈中
while (1) {
    let s be the state on top of the stack //设s是栈顶的状态
    if (ACTION[s, a] = shift t) { //如果ACTION[s, a] = shift t
        push t onto the stack //将t推入栈中
        let a be the next input symbol //设a是下一个输入符号
    }
    else if (ACTION[s, a] = reduce A→𝛽) { //如果ACTION[s, a] = reduce A→𝛽
        pop |𝛽| symbols off the stack //从栈中弹出|𝛽|（𝛽包含的符号数）个状态
        let t be the top of the stack now //设t是栈顶的状态
        push GOTO[t, A] onto the stack //将GOTO[t, A]推入栈中
    }
    else if (ACTION[s, a] = accept) break //如果ACTION[s, a] = accept，则结束
    else call error-handling routine
}

Example:

LR(0) Conflicts

Reduce-Reduce Conflicts:
- state has two reduce items
- e.g. A → α⋅ and B → β⋅
Shift-Reduce Conflicts:
- state has a reduce item and a shift item
- e.g. A → α ⋅ kγ and B → β⋅
To avoid conflicts: SLR(1) Parsing

SLR(1) Parsing

Parsing Table Construction

LR(0) Parsing Table 𝕋 ⇒ SLR(1) Parsing Table 𝕋
- Rows: states
- Columns: grammar symbols
  - terminals for SHIFT and REDUCE actions
  - non-terminals for GOTO actions
- Construction: each edge X: (s_m, s_n)
  - if X is a terminal a, then 𝕋[s_m, a] = SHIFT n
  - if X is a non-terminal A, then 𝕋[s_m, A] = GOTO n
  - if A → β⋅ is in s_m, then for each terminal a ∈ FOLLOW(A), 𝕋[s_m, a] = REDUCE A → β （只对FOLLOW(A)中的终结符进行归约）
  - if S^′ → S⋅ is in s_m, then 𝕋[s_m, $] = ACCEPT （只对$进行归约）
Example:

SLR(1) Conflicts

Reduce-Reduce Conflicts:
- state has two reduce items
- e.g. A → α⋅ and B → β⋅
  - if FOLLOW(A) ∩ FOLLOW(B) = ∅, safe for SLR(1) （如果FOLLOW(A) ∩ FOLLOW(B) = ∅，则SLR(1)安全）
Shift-Reduce Conflicts:
- state has a reduce item and a shift item
- e.g. A → α ⋅ kγ and B → β⋅
  - if k ∉ FOLLOW(B), safe for SLR(1) （如果k ∉ FOLLOW(B)，则SLR(1)安全）

LR(1) Parsing

How: check the immediate following symbols of non-terminals for reduction （检查非终结符的直接后续符号以进行归约）
LR(1) Item = [LR(0) Item, Following Symbol]
- e.g. [A → XY ⋅ Z, a]
- where a is the following terminal symbol of A （a是A的后续终结符）
- when Z is not empty, the same as LR(0) Item （当Z不为空时，与LR(0)项相同）
- otherwise, A → XY ⋅ Z is applied only when the next input symbol is a （否则，只有在下一个输入符号为a时才应用A → XY ⋅ Z）
  - instead of SLR(1)’s ALL A’s following symbols, let along LR(0)’s ALL terminal symbols （而不是SLR(1)的所有A的后续符号，甚至LR(0)的所有终结符）

LR(1)’s CLOSURE and GOTO

LR(1)’s CLOSURE(I)
- every Item in I is added in to CLOSURE(I)
- repeat:
  - if [A → α ⋅ Bβ, a] is in CLOSURE(I) and B → γ is a production of G, then for each terminal b in FIRST(βa), add [B → ⋅γ, b] into CLOSURE(I)
  - until no more new Items can be added into CLOSURE(I)
- only reduce B when it is followed by that in FIRST(βa) （仅在后面跟着FIRST(βa)中的项时才归约B）
LR(1)’s GOTO(I, B)
- if [A → α ⋅ Bβ, a] in I
- then every Item in CLOSURE({[A → αB ⋅ β, a]}) is in GOTO(I, B)
  - GOTO(I, B) ⊇ CLOSURE({[A → αB ⋅ β, a]})
  - Those of Kernel Items inherited from the previous state （从前一个状态继承的内核项）
Example:

Automaton Construction

INPUT: a grammar G
OUTPUT: a LR(1) automaton
Construction:
1. augment G to G^′ by adding a new start symbol S^′ and production S^′ → S （扩展文法）
2. C:= {CLOSURE([S^′ → ⋅S, $])} （初始化项集C）
3. repeat:
  - for each Item I in C and each grammar symbol X in G^′
    - if GOTO(I, X) is not empty and not in C
      - add GOTO(I, X) to C
4. until no new Items are added to C
Example:
1. Example:
2. Exercise:

Parsing Table Construction

LR(1) Parsing Table 𝕋:
- Rows: states
- Columns: grammar symbols
  - terminals for SHIFT and REDUCE actions
  - non-terminals for GOTO actions
- Construction: each edge X: (s_m, s_n)
  - if X is a terminal a, then 𝕋[s_m, a] = SHIFT n
  - if X is a non-terminal A, then 𝕋[s_m, A] = GOTO n
  - if [A → β⋅, a] (the kernel) is in s_m, then 𝕋[s_m, a] = REDUCE A → β
  - if [S^′ → S⋅, $] is in s_m, then 𝕋[s_m, $] = ACCEPT
Example: > 注：上表中[11, id]应为s12，即𝕋[11, id] = s12。

LALR(1) Parsing

What: Look Ahead LR(1)
Why: smaller parsing table than LR(1) for practice（比LR(1)小的解析表）
- equal to SLR(1) in state number （与SLR(1)状态数相等）
  - e.g. In C, serveral hundred for SLR(1), serverals of thousands for LR(1)
- more powerful than SLR(1) in processing more grammars （比SLR(1)更强大，能处理更多文法）
How: combine Items with the same Production Set in LR(1) （将LR(1)中具有相同产生式集的项（同心集）组合在一起）
- e.g. [A → α⋅, a] and [A → α⋅, b] are combined into one state
Compared to LR(1):

Automaton Construction

INPUT: a grammar G
OUTPUT: a LALR(1) automaton
Construction:
1. Construct LR(0) items as LALR(1) items’ cores for the grammar G^′，then remove the non-kernel items （为文法G^′构造LR(0)项作为LALR(1)项的核心，然后删除非核心项）
2. For each kernal items K in I_i and each grammar symbol X in G^′, calculate the lookahead symbols’ INIT and PROPAGATION. （对于每个项集I_i中的核心项K和文法符号X，计算前瞻符号的初始值和传播）
  - For each item (A → α ⋅ β) in K
    - J := {CLOSURE([A → α ⋅ β, #])} （初始化J）
    - if [B → γ ⋅ Xδ, a] in J and a ≠ #, then B → γX ⋅ δ in GOTO(I_i, X) has a SELF-generated lookahead symbol a （J中的项若出现不为#的前瞻符号，则它是自发生成的）
    - if [B → γ ⋅ Xδ, #] in J, then lookahead symbols are propagated from A → α ⋅ β to B → γX ⋅ δ in GOTO(I_i, X). （J中的项若出现#的前瞻符号，则该项会传播前瞻符号）
  - Specifically, $ in [S^′ → S⋅, $] is SELF-generated.
3. Construct the FROM − TO table base on GOTO for the kernel items to show the propagation of lookahead symbols （为核心项构建FROM − TO表，以显示前瞻符号的传播）
  - in step 2, we calculate propagated relationships.
4. Propagate the lookahead symbols according to the FROM − TO relationships until the fixed point achieved （根据FROM − TO关系传播前瞻符号，直到达到不变点）
  - in step 2, we calculate self-generated lookahead symbols.
Example: > 注：上右表中，前瞻符号在传播时，每一行的前瞻符号可以向右填满右侧列，最右一列是最终结果

Parsing Table Construction

the same as LR(1) method
LALR(1) Parsing Table 𝕋:
- Rows: states
- Columns: grammar symbols
  - terminals for SHIFT and REDUCE actions
  - non-terminals for GOTO actions
- Construction: each edge X: (s_m, s_n)
  - if X is a terminal a, then 𝕋[s_m, a] = SHIFT n
  - if X is a non-terminal A, then 𝕋[s_m, A] = GOTO n
  - if [A → β⋅, a] (the kernel) is in s_m, then 𝕋[s_m, a] = REDUCE A → β
  - if [S^′ → S⋅, $] is in s_m, then 𝕋[s_m, $] = ACCEPT
Example:

Capabilities vs. Conflicts

LALR never introduces new SHIFT-REDUCE conflicts:
- SHIFT does not depend on lookaheads
Example:

Summary

LR(0), SLR(1), LR(1), LALR(1)
- all working in SHIFT-REDUCE mode
- only different in Parsing Tables
Parsing Tables - Capabilities
- LR(0) < SLR(1) < LALR(1) < LR(1)
- LR(0): Items
- SLR(1): Items with FOLLOW
- LALR(1): Items Combined from SLR(1) and LR(1)
- LR(1): Items with Subset of FOLLOW

Using Ambiguous Grammars

In Theory: grammar for LR parsing tables should be unambiguous （在理论上：LR解析表的文法应该是无歧义的）
For ambiguous grammars:
- there will be conflicts
- add new information/restrictions to resolve ambiguity （添加新信息/限制以解决歧义）
  - precedence, associativity, etc.
  - get LR tables without conflicts
Why embracing ambiguous grammars?
- some are much natural, the unambiguous one can be very complex （有些是非常自然的，无歧义的可能非常复杂）
- isolate common syntactic constructs for special-case optimizations （隔离常见的语法结构以进行特殊情况优化）
Example:

Summary

SJTU Notes

#SJTU #编译原理

编译原理

https://youyeyejie.github.io/posts/编译原理/

作者

youyeyejie

发布于

2025年6月9日

更新于

2025年6月24日

马克思主义基本原理上一篇

量子力学下一篇