187. Repeated DNA Sequences

Medium

Description

給定 DNA 字串 s（只含 A, C, G, T），找出所有出現超過一次的長度為 10 的子字串。

Example:

Input: s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT” Output: [“AAAAACCCCC”,“CCCCCAAAAA”]

Intuition#

核心思路：用 HashSet 記錄所有出現過的 10 字元子字串，若再次出現就加入結果集。

滑動窗口大小固定為 10，每次取子字串存入 Set
若子字串已存在於 Set，則加入結果（用另一個 Set 去重結果）
進階：可用 Rolling Hash 或 Bitmask 編碼減少記憶體使用

Approaches#

1: HashSet of Strings#

概念: 用兩個 HashSet：seen 記錄出現過的子字串，repeated 記錄重複的子字串
時間複雜度: O(n * 10) - 每次取子字串 O(10)
空間複雜度: O(n * 10) - 儲存所有子字串

class Solution {
    fun findRepeatedDnaSequences(s: String): List<String> {
        if (s.length < 10) return emptyList()

        val seen = HashSet<String>()
        val repeated = HashSet<String>()

        for (i in 0..s.length - 10) {
            val sub = s.substring(i, i + 10)
            if (!seen.add(sub)) {
                repeated.add(sub)
            }
        }

        return repeated.toList()
    }
}

⭐2: Bitmask Rolling Hash#

概念: 每個字母用 2 bits 編碼（A=0, C=1, G=2, T=3），10 個字母用 20 bits 表示，用整數代替字串作為 HashSet 的 key
時間複雜度: O(n) - 每次滑動 O(1) 更新 hash
空間複雜度: O(n) - HashSet 儲存整數而非字串，更省空間

class Solution {
    fun findRepeatedDnaSequences(s: String): List<String> {
        if (s.length < 10) return emptyList()

        val charToVal = mapOf('A' to 0, 'C' to 1, 'G' to 2, 'T' to 3)
        val mask = (1 shl 20) - 1  // 20 bits mask

        var hash = 0
        // 初始化前 9 個字元的 hash
        for (i in 0 until 9) {
            hash = (hash shl 2) or charToVal[s[i]]!!
        }

        val seen = HashSet<Int>()
        val repeated = HashSet<Int>()
        val result = mutableListOf<String>()

        for (i in 9 until s.length) {
            hash = ((hash shl 2) or charToVal[s[i]]!!) and mask
            if (!seen.add(hash)) {
                if (repeated.add(hash)) {
                    result.add(s.substring(i - 9, i + 1))
                }
            }
        }

        return result
    }
}

Bitmask 編碼的好處：整數比較比字串比較快，且記憶體使用大幅降低。4 個字母只需要 2 bits，10 個字母的子字串用 20 bits 就能唯一表示，完全可以放在一個 Int 中。

🔑 Takeaways#

Pattern: 固定長度子字串的重複偵測 -> HashSet + Sliding Window
關鍵技巧: Bitmask 編碼將字串壓縮為整數，適用於小字母表（DNA 只有 4 個字母）；Rolling Hash 避免每次重新計算子字串 hash