跳转至

Ch3 Scanning

词法分析:Lexical Analysis

Token, Pattern, and Lexemes

  • The Analysis Partitions Input String into Substrings. (分析将输入字符串划分为子字符串。)

  • Token:

    • token-name: the role of lexical unit (词法单元的角色)
      • often refer to Token by its token-name
    • attribute-value: any info associated to the Token (与 Token 相关的任何信息)
      • Generally, it has only ONE value: a pointer to the Symbol Table (通常只有一个值:指向符号表的指针)
        • In practice, the value of a constant can be stored as the attribute. (在实践中,常量的值可以作为属性存储。)
          • constant: strings, numbers (常量:字符串、数字)
  • Pattern: description of the form lexemes of a token may (描述词法单元的形式)
    • regular expression: \d, \w, ?, *, +, ...
  • Lexeme: a sequence of characters matches a token's pattern (与模式匹配的字符序列)
    • Token vs. Lexeme: Class vs. Instance in C++

Specification of Tokens

Strings and Languages

  • String: a finite sequence of Symbols from an Alphabet (字符串:来自字母表的有限符号序列)
  • Language: any countable set of Strings (语言:任何可数的字符串集合)
  • Terms for Parts of a String s (字符串 s 的部分术语):
    • prefix:前缀
      • any string obtained by removing zero or more symbols from the end of s
    • suffix:后缀
      • any string obtained by removing zero or more symbols from the beginning of s
    • substring:子串
      • any string obtained by removing any prefix or any suffix from s (去掉前缀或后缀)
    • subsequence:子序列
      • any string obtained by removing zero or more not necessarily consecutive position of s (不一定连续的位置)

Operations on Languages

  • Union: (并集)
    • \(L_1 ∪ L_2 = \{x | x ∈ L_1 \quad or \quad x ∈ L_2\}\)
  • Concatenation: (连接)
    • \(L_1L_2 = \{xy | x ∈ L_1 \quad and \quad y ∈ L_2\}\)
  • Kleene closure: (星闭包)
    • \(L^* = \bigcup_{1=0}^\infty L^i \\= {𝜖} ∪ L ∪ L^2 ∪ L^3 ∪ ... \\= \{x | x = x_1x_2...x_n, n ≥ 0, xi ∈ L\}\)
    • 𝜖 is the empty string
  • Positive closure: (正闭包)
    • \(L^+ = \bigcup_{1=1}^\infty L^i \\= LL^* \\= \{x | x = x_1x_2...x_n, n ≥ 1, xi ∈ L\}\)

Regular Expressions

定义

  • Regular Expression (Regex): a way to describe Patterns of Tokens of a programming language. (正则表达式:描述编程语言的词法单元模式的一种方式)
    • Each Regular Expression r denotes a Language L \((r)\). (每个正则表达式 r 表示一个语言 L \((r)\)
      • Regular Language, Type-3 Language
    • The Regular Expressions are built recursively out of smaller ones, using the rules. (正则表达式是用规则递归构建的)

语法

  • BASIS (基础)
    • 𝜀 is a regular expression, and L(𝜀) = {𝜀}, the empty set. (空字符)
    • a is a symbol in a set Σ, then a is a regular expression, and L(a) = {a}. (集合中的字符)
  • INDUCTION (归纳):Suppose \(r\) and \(s\) are expressions denoting \(L(r)\) and \(L(s)\)
    • \(r | s\) : a regular expression denoting \(L(r) ∪ L(s)\)
    • \(rs\) : a regular expression denoting \(L(r)L(s)\)
    • \(r^*\) : a regular expression denoting \((L(r))^*\)
    • \((r)\) : a regular expression denoting \(L(r)\)
      • We can add additional brackets around expressions. (我们可以在表达式周围添加额外的括号。)
  • 定律

Regular Definitions

  • Regular Definition: a set of productions with non-terminals derived by regular expressions. (正则定义:一组通过正则表达式派生的非终结符的产生式)

Extensions of Regex

  • \(+\) : one or more instances
    • \(r^*\) = \(r^+\) | 𝜖
    • \(r^+\) = \(rr^*\) = \(r^*r\)
  • \(?\) : zero or one instance
    • \(r?\) = \(r\) | 𝜖
  • \([\quad]\) : character classes
    • \([abc]\) = \(a\) | \(b\) | \(c\)
    • \([a-z]\) = \(a\) | \(b\) | ... | \(z\)

Regular Language / Grammar, and Regex

  • Regular Expression \(r\) denotes a Language \(L(r)\).
    • Regular Language is Type-3 Language (正则语言是类型 3 语言)
    • Regular Language is that denoted by Regular Expressions. (正则语言是由正则表达式表示的)
  • Regular Grammar is the grammar describes a Regular Language.
    • with the production form of \(\mathbf{A}→\alpha\) or \(\mathbf{A}→\alpha\mathbf{B}\)

Recognition of Tokens

Input Buffer

Transition Diagrams

  • Transition Diagram = Nodes + Edges, a Flowchart (状态流程图 = 节点 + 边)
    • Nodes: states, conditions that could occur when looking for a lexeme that matches one pattern. (节点:状态)
      • States: Circles (状态:圆圈)
      • Start State: Arrowhead, Beginning of a Pattern (起始状态:箭头,开始)
      • End State(s): Double Circles, End of a Pattern (终止状态:双圆圈,结束)
    • Edge: actions, taken to transit from one State to Another. (边:动作)
      • labeled by a Symbol or a set of Symbols for matching (标记为符号或符号集以进行匹配)
    • Deterministic: at most ONE edge out of a given state with a given label. (确定性:在给定状态下,最多有一条边)
  • Example:
    • *(Retract): A Token has been accepted while another char has been read which must be unread. (回退:一个 Token 已经被接受,而另一个不应读取字符已经被读取,必须回退)

Reserved Words (保留字)

  • Keywords look like Identifiers.
    • if, then, ...
  • Add reserved words into symbol table initially. (在符号表中添加保留字)
  • Create separate transition diagrams for each keyword. (为每个关键字创建单独的转换图)
    • thenextone

有穷自动机:Finite Automata

Finite Automata

  • What: an Abstract Machine that can be in exactly one of a Finite number of States at any given time. (有限自动机:在任何给定时间只能处于有限数量的状态之一的抽象机器)
    • Finite Automation = Finite-state Automation (FSA, plural: automata) (有限自动机 = 有限状态自动机)
      • Finite-state Machine (FSM), or simply State Machine (有限状态机,或简单地称为状态机)
    • changes from one state to another according to Inputs, called Transition (根据输入,从一个状态变化到另一个状态,称为转换)
  • Why: used as the Recognizer for Scanning, identifying Tokens (用于扫描的识别器,识别 token)
  • How: answers “YES” or “NO” about each input String (如何:对每个输入字符串回答“是”或“否”)
    • determines whether the String is valid for the given Grammar (确定字符串是否符合给定的语法)

DFA vs. NFA

  • FA: Deterministic (DFA) or Non-deterministic (NFA)
  • DFA: have exactly/at most one action for each input symbol (每个输入符号有一个动作)
    • can be represented with a Transition Diagram
    • Recognition with DFA: Faster, may take More Space (识别 DFA:更快,可能占用更多空间)
    • complex to represent Regex, but more Precise, widely used (复杂表示正则表达式,但更精确,广泛使用)
  • NFA: can have multiple actions for the same input symbol (同一输入符号可以有多个动作)
    • can be represented with a Transition Graph
    • Recognition with NFA: Slower, may take Less Space (识别 NFA:较慢,可能占用更少的空间)
    • simply represents Regex, but less Precise (简单表示正则表达式,但不够精确)
  • Example:
  • Lexical Analysis Workflow with FA:
    • Regex -> NFA -> DFA
    • Regex -> DFA

Nondeterministic Finite Automata (NFA)

  • An NFA M = \(\mathbf{(S, 𝜮, move, 𝒔_0, F)}\) consists of:
    • \(\mathbf{S}\): a finite set of States (有限状态集)
    • \(\mathbf{𝜮}\): the Input Alphabet, excluding 𝜖 (不包含𝜖的输入符号集合)
    • \(\mathbf{move}\), a Transition Function (转换函数)
      • move(State, Symbol) = set of Next States
      • move: \(𝑆×(Σ ∪ \{𝜖\}) ⟶ ℙ(𝑆)\)
    • \(\mathbf{s_0} ∈ \mathbf{S}\), the Start State (or Initial State) (起始状态)
    • \(\mathbf{F} ⊆ \mathbf{S}\), a set of Accepting States (or Final States) (终止状态)
  • An NFA accepts Input String \(s\) iff
    • there exists some path in the Transition Graph from the Start State to one Accepting State, (存在一条路径从起始状态到一个接受状态)
    • such that symbols along the path spell out \(s\) (路径上的符号拼写出 \(s\)
  • Transition Tablesrows for States, columns for Input Symbols and 𝝐
    • Example:

Deterministic Finite Automata (DFA)

  • What: a Special Case of an NFA, where
    • there are no moves on symbol 𝜖, and
    • for each state s and input symbol a, there is Exactly ONE edge out of s labeled by a. (每个状态 s 和输入符号 a,恰好有一条边出 s 标记为 a)
      • COMPLETE: It defines from each state a transition for each input symbol. (完整:它定义了从每个状态到每个输入符号的转换)
        • Transition function is a total function.
      • Local Automation: DFA not necessarily complete (... At Most ONE edge ...) (局部自动机:DFA 不一定是完全图)
        • Transition function is a partial function.

Algorithm for Simulation

Conversion: NFA → DFA

  • Subset Construction (子集构造)
    • removing 𝜖-transitions
    • combining multiple NFA’s states into ONE constructed DFA’s state (将多个 NFA 的状态组合成一个构造的 DFA 的状态,即:等势点合并)
  • Definitions:
    • 𝜖-closure(s):
      • s: some State
      • = Set of NFA States reached by state s via 𝜖-transitions, including s itself. (NFA 中可以通过若干个空变换到达的状态的集合)
    • 𝜖-closure(T):
      • T: set of States
      • = \(∪_{s∈T}\) 𝜖-closure(s)
    • move(T, a):
      • T: set of States
      • a: Input Symbol
      • = NFA’s States reached by 𝑠 ∈ 𝑇 on a.
  • Algorithm Subset Construction

    • Input: the start State s0 and the Transition Diagram of NFA N.
    • Output: Transition Graph of DFA Dtran

      add 𝝐-closure(s0) into Dstates //将初始状态s0的𝝐闭包加入Dstates
      while (Dstates has unsearched state S) { //当Dstates有未搜索的状态S时
          tag S as searched //将S标记为已搜索
          foreach input symbol a { //对每个输入符号a
              U = 𝝐-closure(move(S, a)) //设U为S进行a动作后状态S'的𝝐闭包
              if (U is new to Dstates) { //如果U是Dstates中的新状态
                  add U into Dstates as unsearched //将U加入Dstates并标记为未搜索
              }
              Dtran(S, a) = U //将Dtran(S, a)设为U
          }
      }
      

    • 最后得到的 Dtran 是一个 DFA 的转换表,Dstates 是 DFA 的状态集合。

    • Algorithm 𝜖-closure(T) Computation:
    • 上一步中 U = 𝝐-closure(move(S, a)) 的实现逻辑:
    • Input: the State Set T
    • Output: 𝜖-closure(T)
      push all states in T onto Stack //将T中的所有状态压入栈中
      while (Stack is not empty) { //当栈不为空时
          s = Stack.pop() //弹出栈顶元素s
          foreach (state u reached by s via 𝜖) { //对于每个s通过𝜖能达到的状态u
              if (u is not in 𝜖-closure(T)) { //如果u不在T的𝜖闭包中
                  add u into 𝜖-closure(T) // 将其加入T的𝜖闭包
                  Stack.push(u) //将u压入栈中
              }
          }
      }
      
  • Example:

Conversion: Regex → NFA

  • McNaughton-Yamada-Thompson Algorithm
  • Regex's Definition:

    • BASIS:

    • INDUCTION:

  • Example:

Workflow

  • The Workflow of Lexical Analyzer
    • Regex → NFA Construction
    • NFA → DFA Construction
    • Simulating DFA to Recognize Tokens
  • Convert Regex Directly into DFA: PASS
  • DFA Simplification: Minimizing the Number of States

DFA Simplification

  • What and Why:
    • no REDUNDANT states (无冗余状态)
      • REDUNDANCE: the states that NO accepted input string’s path passes through (没有路径到达终止状态的状态)
      • (in the transition graph)
    • no EQUIVALENT states (无等效状态)
      • EQUIVALENCE: states with the SAME side effects (具有相同副作用的状态)
      • (making the states indistinguishable)
    • Distinguish States via Input String (通过输入字符串区分状态)
      • State: s, t
      • String: x
      • x distinguishes s from t,
        • if one state can reach an accepting state via x, while the other cannot. (如果一个状态可以通过 x 到达终止状态,而另一个状态不能)
      • s is distinguishable from t,
        • if there is some string distinguishes them. (存在一些字符串可以区分它们)
      • Unify Indistinguishable States into One. (将不可区分的状态合并为一个)
  • How
    1. Start with the initial partition \(\mathbf{Π}\) with two groups, the accepting and non-accepting states of the DFA. (将 DFA 的接受状态和非接受状态分为两个组)
    2. Let \(\mathbf{Π_{new} := Π}\). Then, for each group \(\mathbf{G}\) of \(\mathbf{Π}\): (初始时令 \(\mathbf{Π_{new} = Π}\),然后对于每个组 G)
      • For each input symbol \(\mathbf{a}\), states \(\mathbf{s},\mathbf{t}\) in \(\mathbf{G}\) are partitioned if they transit to different groups of \(\mathbf{Π}\) via \(\mathbf{a}\); (对于每个输入符号 a,如果状态 s 和 t 通过 a 转移到不同的组,则他们被划分为不同的组)
      • Replace \(\mathbf{G}\) in \(\mathbf{Π_{new}}\) by the new subgroups. (用新的子组替换 \(\mathbf{G}\)
    3. If \(\mathbf{Π_{new}} ≠ \mathbf{Π}\), \(\mathbf{Π: = Π_{new}}\) and repeat Step 2, Step 4 otherwise. (如果 \(\mathbf{Π_{new}} ≠ \mathbf{Π}\),则令 \(\mathbf{Π: = Π_{new}}\) 并重复步骤 2,否则跳到步骤 4)
    4. Aggregate the transitions among groups. (将组之间的转换聚合)
    5. The resulting DFA is the minimized DFA. (得到的 DFA 是最小化的 DFA)
  • Example: