Prometheus and PromQL

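The origins of Prometheus

Prometheus is one of the most popular monitoring and alerting systems today, used mainly to collect and process metrics. It grew out of an internal need at the music-sharing platform SoundCloud in 2012. The team was running hundreds of microservices and thousands of processes at a time when neither Docker (2013) nor Kubernetes (2014) existed yet, and the StatsD and Graphite setup they had been using could no longer keep up. SoundCloud therefore decided to build a new monitoring system of its own.

SoundCloud made Prometheus open source from the first day of the project and published a technical blog post in 2015 describing how it was used, which quickly reached the front page of Hacker News. Kubernetes later adopted Prometheus as its go-to monitoring tool, which boosted its momentum even further. In 2016 Prometheus joined the CNCF as the foundation's second project, after Kubernetes. Since then it has been the de facto standard for metrics.

The name comes from Prometheus, the Titan of Greek mythology who was punished by Zeus for bringing fire to mortals; the logo is the stolen torch. The name itself means "forethought", which fits the spirit of monitoring: using metrics to spot problems early.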

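Core components

Prometheus Client and Exporter

Prometheus owes much of its wide adoption to its ecosystem:

Prometheus client libraries exist for most mainstream languages, so applications can instrument themselves and generate metrics. For example, the Python FastAPI service in the Lab uses the Prometheus client directly, while the Java Spring Boot service goes through Actuator + Micrometer.
For targets that cannot be instrumented directly, the community has built a large number of Exporters that collect information and convert it into the Prometheus format (covered in detail in the next chapter).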

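Prometheus Server

Prometheus Server is responsible for:

Scraping Prometheus metrics according to its configuration.
Storing the data in a local time series database (TSDB). Because of this role, Prometheus is sometimes treated as a TSDB in its own right.
Serving queries through the Web UI and the HTTP API, with PromQL (Prometheus Query Language) as the query language.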

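Alerting

Nobody can watch a screen 24 hours a day, so Prometheus comes with an alerting mechanism, split into two roles:

Alerting Rule: defines a PromQL expression as the trigger condition. When the condition holds, Prometheus notifies Alertmanager.
Alertmanager: a separate component that carries out the configured notification actions, such as sending email, posting to Telegram / Slack / Discord, or calling a webhook.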

Scrape Job configuration

Prometheus Server uses scrape_configs to decide which targets to scrape. A typical configuration looks like this:

scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 15s
    metrics_path: "/metrics"
    static_configs:
      - targets: ["localhost:9090"]

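Notes:

job_name: the job name; scraped metrics are automatically given a job=<job_name> label.
scrape_interval: how often to scrape; inherits the global setting when not specified.
metrics_path: the path to scrape; defaults to /metrics.
static_configs.targets: the list of host:port targets to scrape; HTTP is used by default.

In other words, the configuration above scrapes Prometheus's own metrics from localhost:9090/metrics every 15 seconds.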
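To make the defaults concrete, here is a minimal Python sketch of how one scrape job expands into scrape targets. It is not how Prometheus is implemented; the function name scrape_targets and the dict layout are made up for illustration. It assumes the documented defaults: scheme http, path /metrics, a global interval of 1m, and an instance label set to the target address.

```python
def scrape_targets(job, global_interval="1m"):
    """Expand one scrape job (a dict shaped like the YAML above) into targets.

    Hypothetical helper for illustration only; Prometheus does this internally.
    """
    scheme = job.get("scheme", "http")              # HTTP unless told otherwise
    path = job.get("metrics_path", "/metrics")      # default metrics path
    interval = job.get("scrape_interval", global_interval)  # inherit global
    expanded = []
    for group in job.get("static_configs", []):
        for target in group["targets"]:
            expanded.append({
                "url": f"{scheme}://{target}{path}",
                "interval": interval,
                # Labels attached to every series scraped from this target.
                "labels": {"job": job["job_name"], "instance": target},
            })
    return expanded


job = {
    "job_name": "prometheus",
    "scrape_interval": "15s",
    "static_configs": [{"targets": ["localhost:9090"]}],
}
targets = scrape_targets(job)
print(targets[0]["url"], targets[0]["interval"], targets[0]["labels"])
```

Note that metrics_path is left out of the job dict on purpose, to show that the default still produces the same URL as the YAML example.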

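The Status pages in the Prometheus Web UI

The Status menu in the Web UI has two pages that are very handy when troubleshooting:

Configuration: shows the configuration actually in effect (including scrape_configs); use it to confirm that a job is configured correctly.
Targets: lists the state of every scrape target. The columns include:
- Endpoint: the URL being scraped.
- State: UP means the target is healthy; DOWN means something is wrong.
- Labels: the labels attached to metrics coming from this target.
- Last Scrape: the time of the last scrape.
- Scrape Duration: how long each scrape takes.
- Error: the error message when a scrape fails.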

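Metric Types

Prometheus metrics come in four main types:

Counter: a counter that only increases, suited to cumulative values such as request counts. Names usually end in _total. Note that it may reset to zero when the service restarts.
Gauge: an instantaneous value that can go up or down, suited to state metrics such as CPU or memory usage.
Histogram: records the distribution of values as three kinds of data: cumulative bucket counts, a sum, and a total count. For example, for request processing time:
- request_process_time_bucket{le="0.1"} 5: 5 observations took ≤ 0.1 seconds.
- request_process_time_bucket{le="0.5"} 15: 15 observations took ≤ 0.5 seconds.
- request_process_time_bucket{le="+Inf"} 30: 30 observations took ≤ infinity, that is, all of them.
- request_process_time_sum 20.5: the observations add up to 20.5 seconds.
- request_process_time_count 30: 30 observations in total, matching the +Inf bucket.
Summary: similar to Histogram, but it records precomputed quantiles. For example:
- request_process_time{quantile="0.5"} 0.2
- request_process_time{quantile="0.95"} 1.2
- request_process_time{quantile="0.99"} 2.0
- It also provides _sum and _count.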
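The cumulative nature of Histogram buckets is easy to miss, so the following Python sketch builds a tiny histogram by hand and prints it in the text format shown above. It is a simplified illustration, not the Prometheus client library; the class name TinyHistogram and its methods are assumptions made for this example.

```python
import math


class TinyHistogram:
    """Hand-rolled histogram for illustration (not the real client library)."""

    def __init__(self, bounds):
        # Upper bounds of the buckets; +Inf is always appended as the last one.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, NOT yet cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Put the observation into the first bucket whose bound covers it.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.counts[i] += 1
                break
        self.total += value
        self.count += 1

    def exposition(self, name):
        lines = []
        running = 0
        for upper, bucket_count in zip(self.bounds, self.counts):
            running += bucket_count  # exposed bucket values are cumulative
            le = "+Inf" if upper == math.inf else str(upper)
            lines.append(f'{name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {self.count}")
        return lines


h = TinyHistogram([0.1, 0.5])
for seconds in (0.0625, 0.25, 0.25, 2.0):
    h.observe(seconds)
lines = h.exposition("request_process_time")
print("\n".join(lines))
```

Each le bucket includes everything counted by the smaller buckets, which is why the +Inf bucket always equals _count.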

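Histogram vs. Summary

Counter and Gauge are easy to tell apart, but Histogram and Summary overlap in function. Key points for choosing in practice:

Histogram records bucket counts rather than precomputed quantiles, so it is more flexible; any quantile has to be computed at query time, which adds load on Prometheus, and its accuracy depends on the bucket boundaries.
Summary reports quantiles computed in advance, so you cannot switch to a different quantile on the fly; the computation cost falls on the application that produces the metrics.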

Getting started with PromQL

Basic filtering

Recall what a Prometheus metric looks like:

prometheus_http_requests_total{code="302",handler="/"} 3

prometheus_http_requests_total is the metric name; the labels are inside the curly braces; the number at the end is the value.
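The three parts can be pulled apart with a few lines of Python. This is only a rough sketch for simple lines like the one above: the function parse_sample is made up for illustration, it ignores optional timestamps and comment lines, and it does not unescape label values.

```python
import re

# metric name, optional {labels}, then the value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# one label pair: name="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'prometheus_http_requests_total{code="302",handler="/"} 3'
)
print(name, labels, value)
```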

Entering just the metric name returns the series for every label combination:

prometheus_http_requests_total

Add a label matcher to select a single handler:

prometheus_http_requests_total{handler="/"}

You can also filter by label alone without naming a metric, for example to see what the prometheus job has collected:

{job="prometheus"}

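Advanced filtering

Besides equality (=), PromQL supports != (not equal), =~ (matches regex), and !~ (does not match regex). Regex matches are fully anchored, so the 2.* below matches only codes that start with 2: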

prometheus_http_requests_total{code=~"2.*"}
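The four matchers can be mimicked in Python as a rough sketch. The function matches is hypothetical, and Python's re module stands in for the RE2 syntax Prometheus uses, so unusual patterns may behave differently. re.fullmatch reproduces the full anchoring, and a missing label is treated as an empty string, as in PromQL.

```python
import re


def matches(labels, label_name, op, pattern):
    """Evaluate one label matcher against a series' labels (illustrative)."""
    value = labels.get(label_name, "")  # a missing label acts like ""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None  # fully anchored
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = [
    {"code": "200", "handler": "/"},
    {"code": "204", "handler": "/api"},
    {"code": "302", "handler": "/"},
    {"code": "120", "handler": "/odd"},  # contains a 2 but does not start with it
]
two_xx = [s["code"] for s in series if matches(s, "code", "=~", "2.*")]
not_two_xx = [s["code"] for s in series if matches(s, "code", "!~", "2.*")]
print(two_xx, not_two_xx)
```

The "120" entry shows why anchoring matters: an unanchored search for 2.* would have matched it.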

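Instant Vector and Range Vector

Instant Vector: the value of each series at a single point in time; if no time is given, that point is "now". When the Web UI draws a graph over a time range, it is really stringing together many instant vectors.
Range Vector: all the values in a window reaching back from a given point in time. The syntax is an instant vector selector followed by square brackets, for example: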

http_server_requests_seconds_count{}[3m]

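A Range Vector returns multiple values per series for a single point in time, so it cannot be graphed directly; to graph it you usually wrap it in a function such as rate() or increase(), which collapses it back into an instant vector.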
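The difference between the two selectors can be sketched for a single series stored as (timestamp, value) pairs. Both helper functions are made up for illustration. The sketch skips details such as the lookback limit for instant selectors, and the exact handling of the window's left boundary is an assumption here.

```python
def instant_value(samples, t):
    """Latest value at or before time t, or None if there is none.

    `samples` is a list of (timestamp_seconds, value), sorted by timestamp.
    """
    candidates = [value for ts, value in samples if ts <= t]
    return candidates[-1] if candidates else None


def range_values(samples, t, window):
    """All samples in the window reaching back `window` seconds from t."""
    return [(ts, value) for ts, value in samples if t - window < ts <= t]


samples = [(0, 10.0), (60, 12.0), (120, 15.0), (180, 19.0)]

one_value = instant_value(samples, 180)        # a single value per series
many_values = range_values(samples, 180, 150)  # several values per series
print(one_value)
print(many_values)
```

One number per series can be plotted as a point; a list per series cannot, which is why a range vector needs a function such as rate() on top before it can be graphed.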

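Operators and Functions

PromQL results can be processed further with operators and functions:

Binary Operator: basic arithmetic plus comparison operators (used for filtering); they work on scalars and instant vectors, not on range vectors.
Aggregation Operator: aggregates the elements of an instant vector into a new instant vector with fewer elements, for example sum, min, max, avg, topk, quantile.
Function: more complex calculations, for example:
- increase: the increase over the range vector, roughly the last value minus the first, adjusted for counter resets.
- rate: the per-second average rate of increase over the range vector.
- predict_linear: predicts a future value using linear regression.
- histogram_quantile: computes a quantile from Histogram buckets.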
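As an illustration of the idea behind predict_linear, the sketch below fits a least-squares line through the samples of one series and evaluates it a number of seconds into the future. It shows the technique only and is not Prometheus's actual code; the function name, argument order and sample data are assumptions.

```python
def predict_linear_sketch(samples, eval_time, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, extrapolated.

    Returns the predicted value at eval_time + seconds_ahead, or None when
    a line cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(ts for ts, _ in samples) / n
    mean_v = sum(value for _, value in samples) / n
    covariance = sum((ts - mean_t) * (value - mean_v) for ts, value in samples)
    variance = sum((ts - mean_t) ** 2 for ts, _ in samples)
    if variance == 0:
        return None  # all samples share one timestamp
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + seconds_ahead)


# Hypothetical gauge: free disk space in GB, dropping 15 GB per minute.
disk_free = [(0, 100.0), (60, 85.0), (120, 70.0)]
predicted = predict_linear_sketch(disk_free, eval_time=120, seconds_ahead=240)
print(predicted)
```

A typical use of this kind of prediction is alerting before a disk fills up rather than after.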

Four must-know Operators/Functions

sum: adds up values, grouped by label.

sum(logback_events_total{application="spring-boot"}) by(application)

Division computes a ratio. The dividend and divisor must have matching labels; aggregate with sum first if necessary:

sum(logback_events_total{level="debug"}) by(application)
  / sum(logback_events_total{}) by(application)
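The label-matching requirement is easier to see with plain data. The following Python sketch imitates sum ... by(application) and the division on a hand-made instant vector; the helpers sum_by and divide and all the sample values are invented for illustration.

```python
from collections import defaultdict


def sum_by(series, by):
    """Sum values, keeping only the labels listed in `by` (like sum by(...))."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


def divide(lhs, rhs):
    """Divide elements whose label sets are identical on both sides."""
    # PromQL would yield +Inf or NaN for a zero divisor; this sketch skips it.
    return {key: lhs[key] / rhs[key] for key in lhs if rhs.get(key)}


# Made-up values for logback_events_total at one instant.
events = [
    ({"application": "spring-boot", "level": "debug"}, 20.0),
    ({"application": "spring-boot", "level": "info"}, 60.0),
    ({"application": "spring-boot", "level": "error"}, 20.0),
    ({"application": "batch", "level": "debug"}, 5.0),
    ({"application": "batch", "level": "info"}, 15.0),
]

debug_only = [(labels, v) for labels, v in events if labels["level"] == "debug"]
ratio = divide(sum_by(debug_only, ["application"]), sum_by(events, ["application"]))
print(ratio)
```

Without the sum step, the left side would still carry level="debug" while the right side carries every level, so most elements would find no partner to divide by.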

rate: computes the per-second average rate of increase over the window.

rate(http_server_requests_seconds_count{uri="/"}[3m])
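A simplified version of the calculation, written in Python, shows how a counter reset is handled. This is a sketch under stated assumptions: the real rate() also extrapolates to the edges of the window, which is left out here, and the name simple_rate is made up.

```python
def simple_rate(samples):
    """Per-second average increase of one counter series within a window.

    `samples` is a list of (timestamp_seconds, counter_value), sorted by time.
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process restarted: assume it
            # started again from zero and count everything since then.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


steady = [(0, 100.0), (60, 160.0), (120, 220.0), (180, 280.0)]
restarted = [(0, 100.0), (60, 160.0), (120, 20.0), (180, 80.0)]
steady_rate = simple_rate(steady)
restarted_rate = simple_rate(restarted)
print(steady_rate, restarted_rate)
```

The restarted series never produces a negative rate, because the drop is treated as a reset instead of a decrease.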

histogram_quantile: computes a quantile, for example the P95 API execution time:

histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket{uri="/"}[3m])) by(le)
)
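The estimate is based on linear interpolation inside the bucket that contains the requested rank. The Python sketch below applies that idea to the bucket counts from the Histogram example earlier (5, 15 and 30). It assumes positive bucket bounds and a lower bound of 0 for the first bucket, and it leaves out edge cases the real function handles; in the query above the inputs would be per-second bucket rates rather than raw counts, but the arithmetic is the same.

```python
import math


def histogram_quantile_sketch(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf or buckets[-1][1] == 0:
        return math.nan  # needs a non-empty +Inf bucket
    rank = q * buckets[-1][1]
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cumulative) in enumerate(buckets) if cumulative >= rank)
    upper, cumulative = buckets[index]
    if upper == math.inf:
        # The rank falls into the +Inf bucket: fall back to the highest
        # finite upper bound, since there is nothing to interpolate towards.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if cumulative == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


buckets = [(0.1, 5), (0.5, 15), (math.inf, 30)]
p25 = histogram_quantile_sketch(0.25, buckets)
p95 = histogram_quantile_sketch(0.95, buckets)
print(p25, p95)
```

With these buckets half of the observations are slower than 0.5 seconds, so P95 can only be reported as 0.5, the highest finite bound. This is the practical reason bucket boundaries should cover the latencies you care about.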

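rate has quite a few pitfalls. A common rule is "rate first, then sum; never sum first and then rate", because rate has to see each individual counter series to detect resets. If the results look odd in practice, consult the official Prometheus documentation and community write-ups on rate.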
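The rule can be checked with a small Python experiment on two made-up counter series, one of which restarts. The helper per_second_increase is a simplified stand-in for rate() (no extrapolation) and, like the sample data, is an assumption for illustration.

```python
def per_second_increase(samples):
    """Simplified rate(): average per-second increase with reset handling."""
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop is interpreted as a counter reset back to zero.
        increase += value if value < previous else value - previous
        previous = value
    return increase / (samples[-1][0] - samples[0][0])


# Two instances of the same counter; instance b restarts before t=120.
instance_a = [(0, 100.0), (60, 160.0), (120, 220.0)]
instance_b = [(0, 500.0), (60, 560.0), (120, 20.0)]

# Rate first, then sum: each series handles its own reset.
rate_then_sum = per_second_increase(instance_a) + per_second_increase(instance_b)

# Sum first, then rate: the summed series drops from 720 to 240, which looks
# like one big reset of a counter that had reached 720.
summed = [(ts, a + b) for (ts, a), (_, b) in zip(instance_a, instance_b)]
sum_then_rate = per_second_increase(summed)

print(rate_then_sum, sum_then_rate)
```

The true combined increase is 200 over 120 seconds, which rate-then-sum reproduces. Sum-then-rate reports 360 over the same window, because the restart of one instance is misread as a reset of the whole sum.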

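Summary

Prometheus started as a small internal project and, with help from the community and Kubernetes, has become the de facto standard for metrics. Its query language PromQL and the rich surrounding Exporter ecosystem also underpin that position. The next chapter takes a closer look at Exporters.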

Original source

Original book / iThome: https://ithelp.ithome.com.tw/articles/10322080