AC 自動機#

問題背景#

敏感詞過濾：在用戶輸入的文本中，快速找出所有敏感詞並替換為 ***。

為什麼需要 AC 自動機？#

方法	做法	問題
單模式匹配	每個敏感詞掃描一次文本	敏感詞多時，掃描次數太多
Trie	從文本每個位置開始匹配	失敗時只能後移一位
AC 自動機	Trie + 失敗指標	失敗時跳到最佳位置

核心概念#

AC 自動機 = Trie 樹 + KMP 的失敗指標

失敗指標 (Fail Pointer)#

類似 KMP 的 next 陣列，當匹配失敗時，跳到能繼續匹配的最佳位置。

模式串：c, bc, bcd, abcd
主串：abcd

Trie 結構：
    root
   / | \
  a  b  c*
  |  |
  b  c*
  |  |
  c  d*
  |
  d*

* 表示單詞結尾

失敗指標規則：

節點 p 的失敗指標指向：Trie 中能匹配 p 路徑最長後綴的節點
例：路徑 “abc” 的失敗指標指向 “bc”（若存在）或 “c”（若存在）

建構步驟#

1. 建立 Trie 樹#

將所有模式串插入 Trie。

2. 建立失敗指標（BFS）#

def build_fail_pointers(root):
    from collections import deque

    queue = deque()
    root.fail = None

    # 第一層的失敗指標都指向 root
    for child in root.children.values():
        if child:
            child.fail = root
            queue.append(child)

    while queue:
        node = queue.popleft()

        for char, child in node.children.items():
            if not child:
                continue

            # 找父節點的失敗鏈，看是否有相同字元的子節點
            fail_node = node.fail
            while fail_node and char not in fail_node.children:
                fail_node = fail_node.fail

            child.fail = fail_node.children[char] if fail_node else root
            queue.append(child)

關鍵理解：失敗指標的建立類似 KMP 的 next 陣列，用 BFS 逐層計算。

匹配過程#

def match(root, text):
    results = []
    node = root

    for i, char in enumerate(text):
        # 沿失敗指標找能匹配的節點
        while node and char not in node.children:
            node = node.fail

        node = node.children[char] if node else root

        # 檢查當前節點及其失敗鏈上的所有單詞結尾
        temp = node
        while temp and temp != root:
            if temp.is_end:
                results.append((i - temp.length + 1, temp.word))
            temp = temp.fail

    return results

複雜度分析#

階段	時間複雜度
建立 Trie	O(m × len)，m 為模式串數，len 為平均長度
建立失敗指標	O(k × len)，k 為 Trie 節點總數
匹配	O(n × len)，實際接近 O(n)

建立 AC 自動機是預處理，只需做一次。匹配階段非常高效。

與其他演算法的關係#

BF 演算法 ─────改進────→ KMP 演算法
    │                      │
    │                      │（加入失敗指標概念）
    ↓                      ↓
Trie 樹 ──────改進─────→ AC 自動機

應用場景#

場景	說明
敏感詞過濾	一次掃描找出所有敏感詞
病毒碼掃描	匹配多個病毒特徵碼
搜尋引擎	關鍵詞高亮
DNA 序列匹配	多模式基因片段搜尋

字串匹配演算法總結#

演算法	類型	時間複雜度	適用場景
BF	單模式	O(n×m)	簡單場景
RK	單模式	O(n)	字元集小
KMP	單模式	O(n+m)	通用
BM	單模式	O(n/m)~O(n)	實務最快
Trie	多模式	O(n×k)	前綴查詢
AC 自動機	多模式	O(n)	敏感詞過濾