Locate False Sharing by Using Performance Counters • Effective Debugging

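False sharing is a multithreaded performance problem rooted in how CPU caches work:

Each CPU core has its own local cache that holds frequently used blocks of memory
The smallest unit the cache manages is the cache line, typically 64 bytes
When several threads write to different variables that happen to sit on the same cache line, the CPU's cache coherency protocol forces the cores to synchronize that line
Although each thread works on a different value, the cache system still treats the line as shared, hence the name false sharing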

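Practical Impact#

An OpenMP program has 8 threads compute sum[0] through sum[7], one element per thread (fully parallel, independent tasks):

Version	Wall-clock time
8 threads (OpenMP)	2.603 s
Single thread (sequential)	2.249 s

The multithreaded version is actually slower than the single-threaded one. The sum array (8 ints, 32 bytes in total) is small enough to fit in a single cache line, which causes severe false sharing.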

Locating the Problem with Performance Counters#

Step 1: Compare LLC-loads#

Use the Linux perf tool to measure last-level cache loads (LLC-loads):

$ perf stat --event=LLC-loads ./sum-seq
  17,830    LLC-loads

$ perf stat -e LLC-loads ./sum-mp
  49,264,883    LLC-loads

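The multithreaded version issues roughly 2,763 times as many LLC-loads as the single-threaded one (49,264,883 versus 17,830), which points clearly to a cache problem.

Step 2: Record Events and Annotate the Code#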

perf record --event=LLC-loads ./sum-mp
perf annotate

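perf annotate attributes a percentage of the recorded LLC-loads to each assembly instruction. The result shows:

The instructions for the line that writes to the sum array account for 25.03% + 14.23% + 14.83% = 54.09% of all LLC-loads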

Solution#

Accumulate into a local (stack-based) variable and write the result back to the shared array only once, at the end:

#pragma omp parallel private(tid)
  {
    int local_sum = 0;
    tid = omp_get_thread_num();
    for (int i = 0; i < N; i++)
      local_sum += values[i] >> tid;
    sum[tid] = local_sum;    // the shared array is written only once
  }

Version	Wall-clock time
8 threads, after the fix	0.553 s
Single thread	2.249 s

The fixed parallel version is about 4 times faster than the sequential one.

False sharing involves no synchronization primitives, so tools that detect race conditions will not flag it. Performance counters are needed to locate it.

Available Tools#

The Linux perf command
Intel VTune Performance Analyzer
The Concurrency Visualizer extension for Visual Studio

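Key Takeaways#

False sharing occurs when multiple threads write to different variables on the same cache line, and the cache coherency protocol adds substantial overhead
Use perf stat to compare LLC-loads counts and quickly judge whether false sharing is likely
Use perf annotate to pinpoint the lines of code responsible for the cache misses
The fix is to have each thread accumulate into a local variable, which reduces frequent writes to shared memory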

What Is False Sharing#