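This topic describes how to use a self-managed Prometheus instance to collect metrics from the control plane components of an ACK Pro cluster (APIServer, etcd, Scheduler, KCM, and CCM), and recommends alerting rules for these components.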
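Prerequisites
The self-managed Prometheus instance can access the APIServer of the ACK Pro cluster and has read permission on /metrics. The instance can run either inside or outside the ACK Pro cluster.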
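Read permission on /metrics is usually granted through Kubernetes RBAC. The following is a minimal sketch, assuming that Prometheus authenticates with a ServiceAccount. The names prometheus-metrics-reader and prometheus and the namespace monitoring are hypothetical placeholders, not values defined by ACK.

```yaml
# Hypothetical names: adjust the ClusterRole, ServiceAccount, and namespace
# to match your own Prometheus deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader
rules:
# Allows GET on the non-resource URL /metrics served by the APIServer.
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
# Needed by kubernetes_sd_configs with role: endpoints to discover targets.
- apiGroups: [""]
  resources: ["endpoints", "services", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

If Prometheus runs outside the cluster, the token of the same ServiceAccount can be supplied as the bearer token in the scrape job.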
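Background information
ACK Pro exposes monitoring metrics for the core control plane components and provides preconfigured dashboards for them based on ARMS. The covered components are APIServer, Cloud Controller Manager, etcd, Kube Controller Manager, and Scheduler. If you use ARMS monitoring, the ARMS agent collects the monitoring data automatically and displays it on the dashboards in real time. If you want to collect these metrics with a self-managed Prometheus instance and configure alerts on them, so that they integrate with your own monitoring system, follow the configuration in this topic.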
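Prometheus scrape configuration
To collect the core control plane component metrics of an ACK Pro cluster with a self-managed Prometheus instance, first define the scrape jobs in the Prometheus configuration file prometheus.yaml. The file has the following format: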
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# Scrape configurations: one job per control plane component.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......
  - job_name: ack-etcd
    ......
  - job_name: ack-scheduler
    ......
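Each core component has its own job configuration. For the specific configuration, see the metric list of the corresponding component. For how to configure prometheus.yaml in community Prometheus, see Configuration.
For information about the community Prometheus Operator solution and the ack-prometheus-operator component in the ACK marketplace, see Open source Prometheus monitoring. For custom scrape configuration, see the official Prometheus Operator community documentation.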
Prometheus alerting rule configuration
For how to configure alerts in community Prometheus, see Alerting_rules.
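Prometheus loads alerting rules from rule files that the main configuration references through rule_files. The following sketch shows how the per-component rules in this topic could be wired in. The file path /etc/prometheus/rules/ack-control-plane.yaml and the group name ack-control-plane are hypothetical; the rule itself is the kube-apiserver rule from this topic.

```yaml
# prometheus.yaml (fragment): tell Prometheus where to find rule files.
rule_files:
  - /etc/prometheus/rules/*.yaml
---
# /etc/prometheus/rules/ack-control-plane.yaml (hypothetical path)
groups:
  - name: ack-control-plane   # hypothetical group name
    rules:
      # Each "- alert:" entry from this topic goes under "rules".
      - alert: AckApiServerWarning
        annotations:
          message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
        expr: |
          (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
        for: 5m
        labels:
          severity: critical
```

A rule file can be checked before it is loaded with promtool check rules followed by the file path.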
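In-cluster monitoring of ACK Pro clusters
In-cluster monitoring means that Prometheus is deployed inside the ACK Pro cluster that it monitors.
kube-apiserver
For more information about the kube-apiserver component, see kube-apiserver component monitoring.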
Scrape job configuration:
- job_name: ack-api-server
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["apiserver"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https
    action: replace
Recommended alerting rule:
- alert: AckApiServerWarning
  annotations:
    message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
  for: 5m
  labels:
    severity: critical
For the list of metrics collected from kube-apiserver, see kube-apiserver metric list.
etcd
For more information about the etcd component, see etcd component monitoring.
Scrape job configuration:
- job_name: ack-etcd
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["etcd"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https
    action: replace
Recommended alerting rules:
- alert: AckETCDWarning
  annotations:
    message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
  expr: |
    sum_over_time(etcd_server_has_leader[5m]) == 0
  for: 5m
  labels:
    severity: critical
- alert: AckETCDWarning
  annotations:
    message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
  # Note: because of the trailing "== 1", this expression matches only when the
  # up series is absent or when the count of per-pod up series is exactly 1.
  expr: |
    (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
  for: 5m
  labels:
    severity: critical
For the list of metrics collected from etcd, see etcd metric list.
kube-scheduler
For more information about the kube-scheduler component, see kube-scheduler component monitoring.
Scrape job configuration:
- job_name: ack-scheduler
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-scheduler"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https
    action: replace
Recommended alerting rule:
- alert: AckSchedulerWarning
  annotations:
    message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
For the list of metrics collected from kube-scheduler, see kube-scheduler metric list.
kube-controller-manager
For more information about the kube-controller-manager component, see kube-controller-manager component monitoring.
Scrape job configuration:
- job_name: ack-kcm
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-kube-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https
    action: replace
Recommended alerting rule:
- alert: AckKCMWarning
  annotations:
    message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-kcm",pod!=""}) or (count(up{job="ack-kcm",pod!=""}) <= 0)) >= 1
  for: 3m
  labels:
    severity: critical
For the list of metrics collected from kube-controller-manager, see kube-controller-manager metric list.
cloud-controller-manager
For more information about the cloud-controller-manager component, see cloud-controller-manager component monitoring.
Scrape job configuration:
- job_name: ack-cloud-controller-manager
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  honor_labels: true
  honor_timestamps: true
  params:
    hosting: ["true"]
    job: ["ack-cloud-controller-manager"]
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kubernetes
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: apiserver
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_provider]
    separator: ;
    regex: kubernetes
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: https
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_service_label_component]
    separator: ;
    regex: (.+)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: https
    action: replace
Recommended alerting rule:
- alert: AckCCMWarning
  annotations:
    message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
  expr: |
    (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
  for: 3m
  labels:
    severity: critical
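For the list of metrics collected from cloud-controller-manager, see cloud-controller-manager metric list.
Out-of-cluster monitoring of ACK Pro clusters
To monitor the Kubernetes cluster with a Prometheus instance that runs outside the ACK Pro cluster, see Configuration and Monitoring kubernetes with prometheus from outside of k8s cluster. The main configuration is as follows: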
- job_name: 'out-of-k8s-scrape-job'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
  kubernetes_sd_configs:
  - api_server: 'https://<KUBERNETES URL>'
    role: node
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
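Verify the result
Log on to the console of the self-managed Prometheus instance and go to the Graph page.
Enter up and check whether all control plane components are displayed.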
up
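Important: In the expected output, up{instance="x.x.x.x:6443", job="ack-api-server"} is the status of the endpoint that acts as the proxy. Here x.x.x.x is the IP address of the kubernetes Service in the default namespace of the cluster, and it differs from cluster to cluster. up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"} is the status of a specific control plane pod. You can use this up metric for liveness checks on control plane pods.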
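A liveness check based on the per-pod up series could be expressed as an alerting rule. The following is a minimal sketch, assuming that the per-pod up series reports 0 when the pod is unhealthy, as the standard up metric does. The group name, alert name, duration, and severity are hypothetical.

```yaml
groups:
  - name: ack-control-plane-liveness   # hypothetical group name
    rules:
      - alert: AckApiServerPodDown     # hypothetical alert name
        # pod!="" selects the per-pod series and skips the proxy endpoint series.
        # Assumption: a value of 0 means the control plane pod is not healthy.
        expr: up{job="ack-api-server", pod!=""} == 0
        for: 5m                        # hypothetical duration
        labels:
          severity: warning
        annotations:
          message: "Control plane pod {{ $labels.pod }} has reported up == 0 for 5 minutes."
```

This complements the absent and count based rules, which detect missing series rather than a series that reports an unhealthy pod.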
Enter the following metric and check whether it is displayed correctly.
apiserver_request_total{job="ack-api-server"}
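Expected output: if the page displays the queried metric and its data, the self-managed Prometheus instance is collecting the core control plane component metrics correctly.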