Elasticsearch Product Search • Architecture, Networking, and Storage

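Why MySQL Is Not Enough

Typical product search requirements:

Chinese word segmentation
Keyword weighting (title outweighs description)
Typo tolerance
Synonyms (e.g. 手機 = 行動電話, both meaning "mobile phone")
Filters (price range, brand, category) plus sorting
Autocomplete
Aggregations (number of items per brand, number per price bucket)

MySQL's FULLTEXT index: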

SELECT * FROM products WHERE MATCH(title, description) AGAINST('iPhone case');

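This works, but Chinese word segmentation, relevance ranking, and weight control are all weak, and performance degrades badly once the catalog grows large.

Elasticsearch (built on Lucene) solves these problems, at a cost:

One more system to operate
Synchronization lag behind the primary database
A learning curve

For e-commerce sites of medium size and above, these costs are worth paying.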

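The Essence of the Inverted Index

A conventional (forward) DB index maps "ID ➡️ content". An inverted index maps "term ➡️ list of IDs of the documents containing that term".

Example:

Doc 1: "iPhone 手機殼 透明"     (phone case, transparent)
Doc 2: "Samsung 手機殼 黑色"    (phone case, black)
Doc 3: "iPhone 充電線"          (charging cable)

Inverted index: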
  iPhone   → [1, 3]
  Samsung  → [2]
  手機殼   → [1, 2]
  透明     → [1]
  黑色     → [2]
  充電線   → [3]

Query "iPhone 手機殼":

iPhone   → [1, 3]
手機殼   → [1, 2]
intersect → [1]   ← documents containing both terms

BM25 or a similar relevance score is then applied on top, which leaves document 1 as the best result.
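The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores postings: whitespace splitting stands in for a real analyzer, and the function names are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace split is a stand-in for a real analyzer such as IK.
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return the doc IDs that contain every query term (postings intersection)."""
    postings = [set(index.get(term, [])) for term in query.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "iPhone 手機殼 透明",
    2: "Samsung 手機殼 黑色",
    3: "iPhone 充電線",
}
index = build_inverted_index(docs)
print(index["iPhone"])                          # [1, 3]
print(search_all_terms(index, "iPhone 手機殼"))  # [1]
```

Relevance scoring is deliberately left out; the sketch only covers the "which documents match" half of the query.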

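A real inverted index stores more per term (the values here are illustrative):

Term: iPhone
  doc_id  freq  positions     ...
  1       2     [0, 5]
  3       1     [0]

positions support phrase queries: for "iPhone case", the terms must appear next to each other and in that order.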
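To see why positions matter, here is a small sketch of a positional index and a phrase check. The sample documents and function names are assumptions made for this example, and whitespace splitting again stands in for an analyzer.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return doc IDs where the phrase terms occur consecutively and in order."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must sit at position start + i.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "iPhone case clear",
    2: "case for iPhone",
    3: "iPhone leather case",
}
pos_index = build_positional_index(docs)
print(phrase_match(pos_index, "iPhone case"))  # [1]
```

All three documents contain both words, but only document 1 has them adjacent and in order, which is exactly what the positions list lets the engine verify.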

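Core Structure of ES

Cluster
  └─ Indices ("products", "users", "logs")
      └─ Shards (an index is split into several; each shard is a Lucene index)
          └─ Segments (small immutable units, merged periodically)

A shard is the unit of distribution; a segment is the unit of I/O.

Writes go into an in-memory buffer; a periodic refresh turns the buffer into a new immutable segment, and segments are merged in the background into larger ones. (In ES terminology, refresh makes the segment searchable, while flush commits it to disk and trims the translog.) The idea resembles an LSM-tree (covered in depth in Chapter 15).

Immutable segments underpin ES performance: they need no write locks, they are cache-friendly, and merging happens in batches. The cost: a delete only marks the document as deleted, and the space is reclaimed only when segments merge.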
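The write path can be modeled with a toy class. This is a deliberately simplified sketch, not Lucene's on-disk format: `ToyIndex` and its methods are hypothetical, and deletes take effect immediately here, whereas a real index exposes them at the next refresh.

```python
class ToyIndex:
    """Toy model of buffer -> immutable segments -> merge."""

    def __init__(self):
        self.buffer = {}    # in-memory writes, not searchable yet
        self.segments = []  # each segment: {"docs": {...}, "deleted": set()}

    def index(self, doc_id, doc):
        # An update is "mark the old copy deleted, then add a new copy".
        self.delete(doc_id)
        self.buffer[doc_id] = doc

    def delete(self, doc_id):
        self.buffer.pop(doc_id, None)
        for seg in self.segments:
            if doc_id in seg["docs"]:
                # The document stays in the segment; only a tombstone is set.
                seg["deleted"].add(doc_id)

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append({"docs": dict(self.buffer), "deleted": set()})
            self.buffer = {}

    def merge(self):
        """Rewrite all segments into one, dropping tombstoned documents."""
        live = {}
        for seg in self.segments:
            for doc_id, doc in seg["docs"].items():
                if doc_id not in seg["deleted"]:
                    live[doc_id] = doc
        self.segments = [{"docs": live, "deleted": set()}] if live else []

    def search_ids(self):
        return sorted(
            doc_id
            for seg in self.segments
            for doc_id in seg["docs"]
            if doc_id not in seg["deleted"]
        )

    def stored_docs(self):
        """Documents physically held in segments, tombstoned ones included."""
        return sum(len(seg["docs"]) for seg in self.segments)


idx = ToyIndex()
idx.index(1, "a")
idx.index(2, "b")
idx.refresh()
idx.index(3, "c")
idx.refresh()
idx.delete(2)
print(idx.search_ids(), idx.stored_docs())  # [1, 3] 3
idx.merge()
print(idx.search_ids(), idx.stored_docs())  # [1, 3] 2
```

The deleted document disappears from search results right away, but it keeps occupying space until the merge rewrites the segments.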

Product Index Design

Mapping (schema)

PUT /products
{
  "mappings": {
    "properties": {
      "spu_id":      { "type": "keyword" },
      "title":       { "type": "text", "analyzer": "ik_max_word" },
      "description": { "type": "text", "analyzer": "ik_smart" },
      "category_id": { "type": "keyword" },
      "brand_id":    { "type": "keyword" },
      "price":       { "type": "double" },
      "stock":       { "type": "integer" },
      "tags":        { "type": "keyword" },     // multi-valued
      "attributes":  { "type": "object" },       // dynamic fields
      "created_at":  { "type": "date" },
      "score":       { "type": "double" }        // business weight (sales and ratings combined)
    }
  }
}

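A few key points:

text vs keyword: text is analyzed (tokenized) for full-text search; keyword is not analyzed and serves exact filtering, aggregations, and sorting
A Chinese analyzer such as IK has to be installed; the title uses ik_max_word (fine-grained), the description uses ik_smart (coarse-grained)
attributes is a dynamic object: ES automatically maps and indexes every sub-field it sees, so an unbounded set of attribute keys keeps growing the mapping

Multi-fields

Index the same field in two ways at once: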

"title": {
  "type": "text",
  "analyzer": "ik_max_word",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}

After that:

title ➡️ full-text search
title.keyword ➡️ exact match, sorting, aggregations

Product Search Query

A real product search query usually looks like this:

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "iPhone case",
            "fields": ["title^3", "description"],   // ^3 = title weighted 3x
            "type": "best_fields",
            "fuzziness": "AUTO"                      // typo tolerance
          }
        }
      ],
      "filter": [
        { "range": { "stock": { "gte": 1 } } },     // in stock (term does not accept gte)
        { "range": { "price": { "gte": 100, "lte": 5000 } } },
        { "terms": { "brand_id": ["1", "2"] } }
      ]
    }
  },
  "aggs": {
    "brands":     { "terms": { "field": "brand_id" } },
    "price_hist": { "histogram": { "field": "price", "interval": 100 } }
  },
  "sort": [
    "_score",
    { "score": "desc" }
  ],
  "from": 0,
  "size": 20
}

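Notes:

must contributes to the score; filter does not affect the score and is cache-friendly
multi_match searches across multiple fields
aggs power the facets (the filter sidebar on the left)
_score is the BM25 relevance computed by ES. The sort above orders by _score first and falls back to the business score only to break ties; blending the two into a single ranking usually takes something like a function_score query

Syncing with MySQL

ES is not the primary database; MySQL is. Synchronization strategies:

Option 1: Application-Level Dual Writes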
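In application code, the request body is usually assembled from the user's search form. The sketch below builds the same structure as a plain Python dict; `build_product_query` and its parameters are hypothetical names, and it uses a range filter for the in-stock check.

```python
import json


def build_product_query(keyword, min_price=None, max_price=None,
                        brand_ids=None, in_stock_only=True, page=0, size=20):
    """Assemble a product search body; filters are added only when requested."""
    filters = []
    if in_stock_only:
        filters.append({"range": {"stock": {"gte": 1}}})
    price = {}
    if min_price is not None:
        price["gte"] = min_price
    if max_price is not None:
        price["lte"] = max_price
    if price:
        filters.append({"range": {"price": price}})
    if brand_ids:
        filters.append({"terms": {"brand_id": list(brand_ids)}})

    return {
        "query": {
            "bool": {
                # must: scored full-text match, title weighted 3x
                "must": [{
                    "multi_match": {
                        "query": keyword,
                        "fields": ["title^3", "description"],
                        "type": "best_fields",
                        "fuzziness": "AUTO",
                    }
                }],
                # filter: yes/no conditions that do not affect the score
                "filter": filters,
            }
        },
        "aggs": {
            "brands": {"terms": {"field": "brand_id"}},
            "price_hist": {"histogram": {"field": "price", "interval": 100}},
        },
        "sort": ["_score", {"score": "desc"}],
        "from": page * size,
        "size": size,
    }


body = build_product_query("iPhone case", min_price=100, max_price=5000,
                           brand_ids=["1", "2"])
print(json.dumps(body, indent=2))
```

Keeping optional conditions in filter rather than must means a user who toggles a brand or price range changes which documents qualify without disturbing the relevance order.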

def update_product(product):
    db.update_mysql(product)
    es.index(product)

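Simple but unreliable: if either write fails, the two stores diverge.

Option 2: Decoupling via MQ

def update_product(product):
    db.update_mysql(product)
    mq.publish("product_updated", product.id)

def es_consumer():
    # db, mq, and es are placeholder clients; mq.subscribe is assumed to yield product IDs.
    for product_id in mq.subscribe("product_updated"):
        # Re-read the row so ES always receives the latest committed state.
        product = db.fetch_mysql(product_id)
        es.index(product)

After the MySQL write succeeds, an MQ message is published and a consumer syncs ES. The two steps can still split, though: if the MySQL write commits but the publish fails, ES never learns about the change (the Outbox problem from the previous chapter).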
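One common way to close that gap is the outbox pattern: the business row and the pending message are written in the same database transaction, and a separate relay publishes them afterwards. The sketch below uses SQLite from the standard library as a stand-in for MySQL; the table layout, `update_product`, and `relay_outbox` are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE outbox (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,
    topic   TEXT,
    payload TEXT,
    sent    INTEGER DEFAULT 0
);
""")


def update_product(product_id, title):
    # One transaction: the product row and the outbox row commit together or not at all.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO products (id, title) VALUES (?, ?)",
            (product_id, title),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("product_updated", json.dumps({"id": product_id})),
        )


def relay_outbox(publish):
    """Publish unsent outbox rows in order; returns how many were relayed."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        # A crash after publish but before the UPDATE causes a re-send,
        # so the consumer has to be idempotent.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


published = []
update_product(1, "iPhone case")
relayed = relay_outbox(lambda topic, payload: published.append((topic, payload)))
print(relayed, published)  # 1 [('product_updated', {'id': 1})]
```

Delivery becomes at-least-once rather than exactly-once, which is fine here because indexing a product by its ID is naturally idempotent.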

Option 3: CDC (Recommended)

MySQL → binlog → Debezium / Canal → Kafka → ES sink connector

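The binlog is the source of truth: every change is broadcast automatically to ES (and to any other consumer). Consistency here is eventual rather than immediate, and it rests on the binlog recording every committed change in order.

Latency: typically on the order of seconds, which is perfectly acceptable for search.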
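In practice the sink connector does the translation from change events to ES writes, but the mapping is simple enough to sketch by hand. The example assumes a Debezium-style envelope with `op`, `before`, and `after` fields, and emits bulk API action lines; the function name and the `id` column are assumptions.

```python
def change_event_to_bulk_lines(event, index="products", id_field="id"):
    """Translate one row-level change event into bulk API lines (as dicts)."""
    op = event["op"]
    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = event["after"]
        # Indexing by primary key makes replays idempotent.
        return [{"index": {"_index": index, "_id": str(row[id_field])}}, row]
    if op == "d":  # delete: only the "before" image carries the key
        row = event["before"]
        return [{"delete": {"_index": index, "_id": str(row[id_field])}}]
    raise ValueError(f"unsupported op: {op!r}")


update_event = {
    "op": "u",
    "before": {"id": 7, "title": "iPhone case", "price": 299.0},
    "after": {"id": 7, "title": "iPhone case", "price": 259.0},
}
delete_event = {"op": "d", "before": {"id": 8, "title": "old item"}, "after": None}

print(change_event_to_bulk_lines(update_event))
print(change_event_to_bulk_lines(delete_event))
```

Because every write is keyed by the row's primary key, replaying the same stretch of the log after a connector restart leaves ES in the same state.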

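Full Initialization and Incremental Sync

Bringing up a new ES cluster:

Record the starting point: note the current binlog position before the import begins
Full import: read all existing MySQL rows and write them to ES, using Logstash or a custom program
Incremental sync: replay every change from the recorded binlog position onward, in near real time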
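The reason the log position has to be captured before the import starts is easiest to see in a small simulation. Everything below is a toy model: the change log is a Python list, a position is just a list offset, and the function names are made up.

```python
def apply_change(es_docs, change):
    """Idempotent upsert/delete keyed by ID, so replaying a change twice is harmless."""
    if change["op"] == "delete":
        es_docs.pop(change["id"], None)
    else:
        es_docs[change["id"]] = change["row"]


def full_then_incremental(snapshot_rows, change_log, start_pos):
    es_docs = dict(snapshot_rows)           # full import of the snapshot
    for change in change_log[start_pos:]:   # replay from the recorded position
        apply_change(es_docs, change)
    return es_docs


# State before the import: one product, one entry in the log.
change_log = [{"op": "upsert", "id": 1, "row": {"title": "old item"}}]
start_pos = len(change_log)                 # recorded BEFORE the import begins
snapshot = {1: {"title": "old item"}}       # what the full import reads

# Writes that land while the import is still running; the snapshot misses them.
change_log.append({"op": "upsert", "id": 2, "row": {"title": "new item"}})
change_log.append({"op": "delete", "id": 1})

es_docs = full_then_incremental(snapshot, change_log, start_pos)
print(es_docs)  # {2: {'title': 'new item'}}
```

Starting the replay too early is harmless because the writes are idempotent; starting it too late silently loses the changes made during the import.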

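Zero-downtime procedure when a new index is needed (e.g. for a mapping change):

1. Create products_v2 (new mapping)
2. Full import → products_v2
3. Dual-write incremental changes to products_v1 + products_v2
4. Switch traffic to products_v2 (via the alias)
5. Observe for a while
6. Delete products_v1

An alias is the mechanism ES provides to abstract away the index name: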

POST /_aliases
{
  "actions": [
    { "add": { "index": "products_v2", "alias": "products" } },
    { "remove": { "index": "products_v1", "alias": "products" } }
  ]
}

The application always queries the products alias, so operations can switch the underlying index painlessly.

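Performance Tuning Essentials

Shard count

The number of primary shards is fixed when the index is created; changing it means reindexing (or using the split/shrink APIs, which come with their own constraints).

Too few shards: each shard is large, queries are slow, and the index cannot scale horizontally
Too many shards: high coordination overhead

Rule of thumb: 30~50 GB per shard, with the shard count an integer multiple of the node count.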
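The rule of thumb turns into a short calculation. This is only a starting estimate under the stated assumptions (a 40 GB target inside the 30~50 GB range, shard count rounded up to a multiple of the node count); `suggest_primary_shards` is a hypothetical helper, not an ES API.

```python
import math


def suggest_primary_shards(index_size_gb, node_count, target_shard_gb=40):
    """Estimate a primary shard count from expected index size and node count."""
    by_size = max(1, math.ceil(index_size_gb / target_shard_gb))
    # Round up to a multiple of the node count so shards spread evenly.
    return math.ceil(by_size / node_count) * node_count


for size_gb, nodes in [(400, 3), (90, 2), (20, 3)]:
    shards = suggest_primary_shards(size_gb, nodes)
    print(f"{size_gb} GB on {nodes} nodes -> {shards} shards, "
          f"about {size_gb / shards:.1f} GB each")
```

For a small index the rounding step produces shards far below the target size (20 GB on 3 nodes gives 3 shards of under 7 GB each), so in that case a single shard plus replicas may be the more sensible choice.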

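refresh interval

The default is 1 second: a document becomes searchable about 1 second after it is written. That suits needs such as order monitoring or product listing, where visibility within seconds matters. Where search freshness matters less, raising it to 30 seconds produces fewer segments and improves write throughput: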

PUT /products/_settings
{ "index": { "refresh_interval": "30s" } }

Bulk writes

Avoid single-document inserts. Use the bulk API to write a few hundred documents per request:

POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "..." }
{ "index": { "_index": "products", "_id": "2" } }
{ "title": "..." }

The recommended bulk size is 5~15 MB per batch.
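Batching by byte size rather than by document count keeps requests inside that range even when documents vary in length. A minimal sketch, assuming a 10 MB default and a hypothetical `bulk_bodies` helper that only builds the NDJSON payloads (sending them is left to whichever client is in use):

```python
import json


def bulk_bodies(docs, index="products", max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk request bodies, each no larger than max_bytes
    unless a single document alone exceeds the limit."""
    lines, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": str(doc_id)}})
        body = json.dumps(source, ensure_ascii=False)
        pair = action + "\n" + body + "\n"  # every line ends with a newline
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)


sample_docs = [(i, {"title": "x" * 50}) for i in range(10)]
bodies = list(bulk_bodies(sample_docs, max_bytes=300))
print(len(bodies), [len(b.encode("utf-8")) for b in bodies])
```

Each yielded string can be posted to the _bulk endpoint as one request; the tiny 300-byte limit in the usage lines is there only to make the batching visible.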

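Index tiering

Hot/cold data separation:

hot tier: SSD, recent orders, high-concurrency queries
warm tier: HDD, data from the last six months, queried occasionally
cold tier: object storage, almost never queried (searchable snapshots)

ILM (Index Lifecycle Management) in ES rolls indices over by age and migrates them between tiers automatically.

When ES Is Not a Good Fit

ES is powerful but not a silver bullet. It is a poor fit for:

Strongly consistent reads (transfers, order status)
Heavy joins (ES has no true join)
Strong transactional requirements (ES has no transactions)
Reads that must see a write immediately (refresh delay)
Update-heavy workloads (every update marks the old document as deleted and indexes a new one)

For order detail lookups, use the primary database; for search, aggregations, and reporting, use ES.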

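Non-Search Uses

ES is also commonly used for:

Log analytics (ELK: Elasticsearch + Logstash + Kibana)
Metrics monitoring (ELK early on; Prometheus + Grafana later became mainstream)
APM (application performance monitoring)
Reporting / OLAP (adequate at moderate scale; very large scale still calls for ClickHouse)

Competing Products

Tool	Highlights
Elasticsearch	Most mature, large community, Kibana visualization
OpenSearch	AWS fork of ES 7.10, more permissive license (Apache 2.0)
Solr	Veteran; still used by large enterprises (also built on Lucene)
Meilisearch	Lightweight, excellent defaults, suited to simple full-text search
Typesense	Similar to Meilisearch
Qdrant / Milvus / Weaviate	Vector databases (semantic search)

Recent trend: hybrid search that combines traditional keyword search with vector (embedding) search. ES 8 supports approximate kNN search over dense_vector fields natively. For semantic search needs (e.g. "I want clothes that keep me warm in winter" finding a down jacket), this is the direction to take.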
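How the two result lists get combined is a design choice in itself. One widely used option is reciprocal rank fusion (RRF), which needs only the rank of each document in each list, so BM25 scores and vector similarities never have to be put on the same scale. The sketch below is a generic illustration of RRF with made-up document IDs, not a description of what any particular engine does internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["p3", "p1", "p7"]  # hypothetical BM25 ranking
vector_hits = ["p9", "p3", "p4"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['p3', 'p9', 'p1', 'p4', 'p7']
```

A document that ranks well in both lists (p3 here) rises to the top, while documents found by only one method still make it into the results.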

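Summary

Inverted index = term ➡️ document list: the essence of full-text search in ES
Keep ES separate from the primary database and sync via CDC (binlog)
Mapping design: text + keyword multi-fields, plus a Chinese analyzer
Query: must (affects scoring) + filter (cacheable) + aggs (facets)
Write optimization: bulk plus a suitable refresh interval
Use aliases for zero-downtime reindexing
Do not use ES as the primary database

The next chapter covers caching strategy: no read path in e-commerce gets around Redis.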