
Automatically monitor and respond to ECS system events to implement automated O&M such as fault handling and dynamic scheduling

Alibaba Cloud provides ECS system events to record and notify you of cloud resource information, such as ECS instance starts and stops, expiration, and task execution status. In scenarios such as large-scale clusters and real-time resource scheduling, if you need to proactively monitor and respond to the ECS system events provided by Alibaba Cloud to implement automated O&M such as fault handling and dynamic scheduling, you can use the Cloud Assistant plugin ecs-tool-event.

Note
  • ECS system events are defined by Alibaba Cloud to record and notify you of cloud resource information, such as the execution status of O&M tasks, resource exceptions, and resource state changes. For the event types and detailed descriptions, see Overview of ECS system events.

  • Cloud Assistant plugins are plugin capabilities integrated into Cloud Assistant that let you complete complex configuration operations with simple commands, improving O&M efficiency. For more information, see Overview of Cloud Assistant and Use Cloud Assistant plugins.

How it works

You can monitor and respond to ECS system events through the console or by integrating with the OpenAPI. However, both approaches have limitations:

  • Monitoring or responding to system events in the console: manual intervention is required, and in multi-instance scenarios events are easily missed, so responses cannot be automated.

  • Monitoring or responding to system events through the ECS OpenAPI: you have to develop your own program, which carries development cost and technical requirements.

To address these issues, Alibaba Cloud provides the Cloud Assistant plugin ecs-tool-event. The plugin polls the metadata server (metaserver) every minute for ECS system events and writes them in log format inside the operating system. No additional development is required: you collect the system event log directly inside the operating system to monitor and respond to ECS system events. For example, users with Kubernetes-based automated O&M capabilities can stream the host_event.log file into their own O&M systems.
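As a minimal sketch of that log-collection approach (the `handle_event` action below is a placeholder you would replace with your own alerting or scheduling hook):

```shell
# Minimal sketch: watch the plugin's event log and react to each ECS
# system event. handle_event is a placeholder for your own O&M hook.

handle_event() {
  # A real system would forward the line to alerting/scheduling here.
  echo "detected ECS event: $1"
}

watch_events() {
  while IFS= read -r line; do
    case "$line" in
      *"Ecs event type is:"*) handle_event "$line" ;;
    esac
  done
}

# In production: tail -F /var/log/host_event.log | watch_events
# Demo with a sample line in the documented format:
printf '%s\n' '2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z' | watch_events
```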

Procedure

Important
  • Make sure the Cloud Assistant Agent is installed on your instance. For details, see How do I install the Cloud Assistant Agent?

  • Starting or stopping a Cloud Assistant plugin, or querying its status, requires root privileges.

  1. Log on to the ECS instance and enable the Cloud Assistant plugin ecs-tool-event.

    Once enabled, the plugin polls the metaserver every minute for ECS system events and writes them in log format inside the operating system.

    sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start
    Note

    After the plugin starts, you can run ls /var/log to confirm that the host_event.log file has been generated automatically.

    • Log location: /var/log/host_event.log

    • Log format

      %Y-%m-%d %H:%M:%S - WARNING - Ecs event type is: ${event type},event status is: ${event status}, action ISO 8601 time is ${actual execution time in ISO 8601 format}

      Example:

      2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z
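Given that fixed format, the event type and status can be pulled out of a log line with standard text tools, for example:

```shell
# Extract the event type and status from one host_event.log line.
line='2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z'

event_type=$(printf '%s\n' "$line" | sed -n 's/.*Ecs event type is: \([^,]*\),.*/\1/p')
event_status=$(printf '%s\n' "$line" | sed -n 's/.*event status is: \([^,]*\),.*/\1/p')
echo "type=$event_type status=$event_status"
```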

  2. Query the plugin status.

    sudo acs-plugin-manager --status
  3. Based on your business scenario, collect the streaming log host_event.log to integrate with your own O&M system.

    Application example: Automatically respond to ECS system events in a Kubernetes cluster

  4. (Optional) If you no longer need to respond to ECS system events proactively, stop and remove the Cloud Assistant plugin ecs-tool-event.

    sudo acs-plugin-manager --remove --plugin ecs-tool-event

Application example: Automatically respond to ECS system events in a Kubernetes cluster

Scenario

When ECS instances serve as nodes in a Kubernetes cluster, an anomaly on a single node (for example a reboot, memory exhaustion, or an operating system error) may affect the stability of your online business, so proactively monitoring and responding to node anomaly events is critical. With the Cloud Assistant plugin ecs-tool-event, you can convert ECS system events into operating system logs and combine them with the open source Kubernetes components NPD (Node Problem Detector), Draino, and Autoscaler to monitor and respond to ECS system events conveniently and efficiently, without writing any code, thereby improving cluster stability and reliability.

What are NPD, Draino, and Autoscaler?

  • NPD: an open source Kubernetes component that monitors node health and detects node faults such as hardware failures and network issues. For more information, see the official NPD documentation.

  • Draino: a Kubernetes controller that watches the nodes in a cluster and migrates the Pods on abnormal nodes to other nodes. For more information, see the official Draino documentation.

  • Autoscaler: an open source Kubernetes component that dynamically resizes a Kubernetes cluster. It monitors the Pods in the cluster to ensure that all Pods have sufficient resources to run while no idle, unneeded nodes remain. For more information, see the official Autoscaler documentation.

Architecture

The implementation principle and technical architecture of the solution are as follows:

  1. The Cloud Assistant plugin ecs-tool-event polls the metaserver every minute for ECS system events and writes them as system logs inside the operating system (stored at /var/log/host_event.log).

  2. The cluster component NPD collects the system event log and reports the problem to the APIServer.

  3. The cluster controller Draino receives the Kubernetes events (ECS system events) from the APIServer and migrates the Pods on the abnormal node to other healthy nodes.

  4. After the Pods are evicted, you can take the abnormal node offline using your existing cluster offboarding process, or use the open source Kubernetes component Autoscaler to automatically release the abnormal node and create a new instance to join the cluster.

[Architecture diagram]

Procedure

Step 1: Start the ecs-tool-event plugin on each node

Log on to the node (that is, the ECS instance) and start the ecs-tool-event plugin.

Important

In practice, the plugin must be started on every node in the cluster. You can use Cloud Assistant to run the following start command on multiple instances in a batch. For details, see Create and run a command.

sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start

After the plugin starts, it automatically writes ECS system events to a log inside the operating system.

Step 2: Configure NPD and Draino for the cluster

  1. Log on to any node in the cluster.

  2. Configure the NPD component for the cluster (this configuration applies to the entire cluster).

    1. Configure the NPD files. The following three files are required.

      Note

      For detailed configuration instructions, see the official documentation.

      • node-problem-detector-config.yaml: defines the metrics that NPD monitors, such as system logs.

      • node-problem-detector.yaml: defines how NPD runs in the cluster.

      • rbac.yaml: defines the permissions NPD requires in the Kubernetes cluster.

        If NPD is not yet configured on the instance

        Add the above three YAML files on the ECS instance.

        node-problem-detector-config.yaml

        apiVersion: v1
        data:
          kernel-monitor.json: |
            {
                "plugin": "kmsg",
                "logPath": "/dev/kmsg",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "kernel-monitor",
                "conditions": [
                    {
                        "type": "KernelDeadlock",
                        "reason": "KernelHasNoDeadlock",
                        "message": "kernel has no deadlock"
                    },
                    {
                        "type": "ReadonlyFilesystem",
                        "reason": "FilesystemIsNotReadOnly",
                        "message": "Filesystem is not read-only"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
                    },
                    {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                    },
                    {
                                "type": "temporary",
                                "reason": "MemoryReadError",
                                "pattern": "CE memory read error .*"
                    },
                    {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "permanent",
                        "condition": "ReadonlyFilesystem",
                        "reason": "FilesystemIsReadOnly",
                        "pattern": "Remounting filesystem read-only"
                    }
                ]
            }
          host_event.json: |
            {
                "plugin": "filelog",                     
                "pluginConfig": {
                    "timestamp": "^.{19}",
                    "message": "Ecs event type is: .*",
                    "timestampFormat": "2006-01-02 15:04:05"
                },
                "logPath": "/var/log/host_event.log",   
                "lookback": "5m",
                "bufferSize": 10,
                "source": "host-event",                     
                "conditions": [
                    {
                        "type": "HostEventRebootAfter48",       
                        "reason": "HostEventWillRebootAfter48",
                        "message": "The Host Is Running In Good Condition"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "HostEventRebootAfter48temporary",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    },
                    {
                        "type": "permanent",
                        "condition": "HostEventRebootAfter48", 
                        "reason": "HostEventRebootAfter48Permanent",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    }
                ]
            }
        
          docker-monitor.json: |
            {
                "plugin": "journald",
                "pluginConfig": {
                    "source": "dockerd"
                },
                "logPath": "/var/log/journal",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "docker-monitor",
                "conditions": [],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "CorruptDockerImage",
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
                    }
                ]
            }
        kind: ConfigMap
        metadata:
          name: node-problem-detector-config
          namespace: kube-system

        node-problem-detector.yaml

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: node-problem-detector
          namespace: kube-system
          labels:
            app: node-problem-detector
        spec:
          selector:
            matchLabels:
              app: node-problem-detector
          template:
            metadata:
              labels:
                app: node-problem-detector
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/os
                            operator: In
                            values:
                              - linux
              containers:
              - name: node-problem-detector
                command:
                - /node-problem-detector
                - --logtostderr
                - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
                image: cncamp/node-problem-detector:v0.8.10
                resources:
                  limits:
                    cpu: 10m
                    memory: 80Mi
                  requests:
                    cpu: 10m
                    memory: 80Mi
                imagePullPolicy: Always
                securityContext:
                  privileged: true
                env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
                volumeMounts:
                - name: log
                  mountPath: /var/log
                  readOnly: true
                - name: kmsg
                  mountPath: /dev/kmsg
                  readOnly: true
                # Make sure node problem detector is in the same timezone
                # with the host.
                - name: localtime
                  mountPath: /etc/localtime
                  readOnly: true
                - name: config
                  mountPath: /config
                  readOnly: true
              serviceAccountName: node-problem-detector
              volumes:
              - name: log
                # Config `log` to your system log directory
                hostPath:
                  path: /var/log/
              - name: kmsg
                hostPath:
                  path: /dev/kmsg
              - name: localtime
                hostPath:
                  path: /etc/localtime
              - name: config
                configMap:
                  name: node-problem-detector-config
                  items:
                  - key: kernel-monitor.json
                    path: kernel-monitor.json
                  - key: docker-monitor.json
                    path: docker-monitor.json
                  - key: host_event.json
                    path: host_event.json
              tolerations:
                - effect: NoSchedule
                  operator: Exists
                - effect: NoExecute
                  operator: Exists

        rbac.yaml

        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: node-problem-detector
          namespace: kube-system
        
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: npd-binding
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: system:node-problem-detector
        subjects:
          - kind: ServiceAccount
            name: node-problem-detector
            namespace: kube-system

        If NPD is already configured on the instance

        • In the node-problem-detector-config.yaml file, add the host_event.json log monitor, as shown below:

          ...
          
          host_event.json: |
              {
                  "plugin": "filelog",   #指定使用的日志采集插件,固定為filelog       
                  "pluginConfig": {
                      "timestamp": "^.{19}",
                      "message": "Ecs event type is: .*",
                      "timestampFormat": "2006-01-02 15:04:05"
                  },
                  "logPath": "/var/log/host_event.log",    #系統事件日志路徑,固定為/var/log/host_event.log
                  "lookback": "5m",
                  "bufferSize": 10,
                  "source": "host-event",                     
                  "conditions": [
                      {
                          "type": "HostEventRebootAfter48",    #自定義事件名稱,Draino配置中會用到
                          "reason": "HostEventWillRebootAfter48",
                          "message": "The Host Is Running In Good Condition"
                      }
                  ],
                  "rules": [
                      {
                          "type": "temporary",
                          "reason": "HostEventRebootAfter48temporary",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      },
                      {
                          "type": "permanent",
                          "condition": "HostEventRebootAfter48", 
                          "reason": "HostEventRebootAfter48Permanent",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      }
                  ]
              }
          
          ...
        • In the node-problem-detector.yaml file:

          • In the - --config.system-log-monitor line, add /config/host_event.json so that NPD monitors the system event log. As shown below:

            containers:
                  - name: node-problem-detector
                    command:
                     ...
                    - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
            
          • Under the items: list of the - name: config volume, add the lines marked in the comments below.

            ...
            - name: config
                    configMap:
                      name: node-problem-detector-config
                      items:
                      - key: kernel-monitor.json
                        path: kernel-monitor.json
                      - key: docker-monitor.json
                        path: docker-monitor.json
                  - key: host_event.json     # line to add
                    path: host_event.json    # line to add
            ...
    2. Run the following commands to apply the files.

      sudo kubectl create -f rbac.yaml
      sudo kubectl create -f node-problem-detector-config.yaml
      sudo kubectl create -f node-problem-detector.yaml
    3. Run the following command to check whether the NPD configuration has taken effect.

      sudo kubectl describe nodes -n kube-system

      As shown in the following output, the HostEventRebootAfter48 condition has been added, which means the NPD configuration is complete and in effect (if it does not appear yet, wait 3 to 5 minutes).

      [Screenshot: node conditions including HostEventRebootAfter48]
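The `pattern` fields in host_event.json are regular expressions matched against the collected log lines. Before deploying, you can sanity-check them locally (approximated here with `grep -E`; the sample lines are illustrative):

```shell
# Sanity-check the host_event.json rule pattern against sample log lines.
pattern='Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*'

scheduled='Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled, action ISO 8601 time is 2024-03-01T00:00:00Z'
unrelated='Ecs event type is: InstanceFailure.Reboot,event status is: Executed, action ISO 8601 time is 2024-03-01T00:00:00Z'

# A scheduled maintenance reboot should match; other events should not.
printf '%s\n' "$scheduled" | grep -qE "$pattern" && echo "scheduled: matched"
printf '%s\n' "$unrelated" | grep -qE "$pattern" || echo "unrelated: ignored"
```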

  3. Configure the Draino controller for the cluster (this configuration applies to the entire cluster).

    1. Configure or modify the Draino configuration based on your situation.

    2. If Draino is not yet configured on the instance: install Draino

      Add the following YAML file on the instance.

      draino.yaml

      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels: {component: draino}
        name: draino
      rules:
      - apiGroups: ['']
        resources: [events]
        verbs: [create, patch, update]
      - apiGroups: ['']
        resources: [nodes]
        verbs: [get, watch, list, update]
      - apiGroups: ['']
        resources: [nodes/status]
        verbs: [patch]
      - apiGroups: ['']
        resources: [pods]
        verbs: [get, watch, list]
      - apiGroups: ['']
        resources: [pods/eviction]
        verbs: [create]
      - apiGroups: [extensions]
        resources: [daemonsets]
        verbs: [get, watch, list]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        labels: {component: draino}
        name: draino
      roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
      subjects:
      - {kind: ServiceAccount, name: draino, namespace: kube-system}
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      spec:
        # Draino does not currently support locking/master election, so you should
        # only run one draino at a time. Draino won't start draining nodes immediately
        # so it's usually safe for multiple drainos to exist for a brief period of
        # time.
        replicas: 1
        selector:
          matchLabels: {component: draino}
        template:
          metadata:
            labels: {component: draino}
            name: draino
            namespace: kube-system
          spec:
            containers:
            - name: draino
              image: planetlabs/draino:dbadb44
              # You'll want to change these labels and conditions to suit your deployment.
              command:
              - /draino
              - --debug
              - --evict-daemonset-pods
              - --evict-emptydir-pods
              - --evict-unreplicated-pods
              - KernelDeadlock
              - OutOfDisk
              - HostEventRebootAfter48
              # - ReadonlyFilesystem
              # - MemoryPressure
              # - DiskPressure
              # - PIDPressure
              livenessProbe:
                httpGet: {path: /healthz, port: 10002}
                initialDelaySeconds: 30
            serviceAccountName: draino

      If Draino is already configured on the instance: modify the Draino configuration

      Open the Draino configuration file, find the containers: line, and add the event name defined in node-problem-detector-config.yaml in Step 2 (for example, HostEventRebootAfter48), as shown below:

      containers:
            - name: draino
              image: planetlabs/draino:dbadb44
              # You'll want to change these labels and conditions to suit your deployment.
              command:
              - /draino
              - --debug
              ......
              - KernelDeadlock
              - OutOfDisk
            - HostEventRebootAfter48  # line to add
    3. Run the following command to apply the Draino configuration.

      sudo kubectl create -f draino.yaml

Step 3: Take the abnormal node offline and add a new node

After the Pods are evicted, you can take the abnormal node offline using your existing cluster offboarding process, or use the open source Autoscaler component to automatically release the abnormal node and create a new instance to join the cluster. To use Autoscaler, see the official Autoscaler documentation.

Verify the result

  1. Log on to any node and run the following command to simulate an ECS system event log entry.

    Important

    Replace the timestamp with the current system time.

    echo '2024-02-23 12:29:29 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z' | sudo tee /var/log/host_event.log
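To avoid pasting a stale timestamp, the simulated line can be generated with the current time (the event type and status values below simply reuse the example above):

```shell
# Build a simulated event line stamped with the current time.
now=$(date '+%Y-%m-%d %H:%M:%S')
iso=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
line="$now - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is $iso"
echo "$line"
# On the node, write it to the log as root:
#   printf '%s\n' "$line" | sudo tee /var/log/host_event.log
```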
  2. Run the following command. You can see that a Kubernetes event has been generated for the detected ECS system event and that the node has been marked unschedulable.

    sudo kubectl describe nodes -n kube-system

    [Screenshot: node details showing the generated event and the unschedulable status]