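When an ECS instance running Linux encounters a kernel panic, an out-of-memory (OOM) condition, or a system hang during operation, or when you receive a system event notifying you that the instance's operating system has crashed, the instance is down. You can use the self-service diagnostic tool or the kernel logs to locate and resolve the problem.
Locate the cause of the crash
You can use any of the following methods to determine the specific cause of the crash.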
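Method 1 (recommended): Use the self-service diagnostic tool
Log on to the ECS console. In the left-side navigation pane, click Self-service Troubleshooting.
Click the Instance Troubleshooting tab.
Select the applicable troubleshooting item, select the ID of the crashed instance, and then click Start Troubleshooting. Use the returned diagnosis and suggested fix to locate and resolve the problem.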
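Method 2: Use system events
Log on to the ECS console. In the left-side navigation pane, click Events.
In the left-side navigation pane, click Unexpected O&M Events.
Find the instance that has the crash event and, on its right, click Diagnose Root Cause of Operating System Error to diagnose why the instance crashed.
Use the returned diagnosis and suggested fix to locate and resolve the problem.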
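Method 3: Inspect the kernel log captured by kdump
If kdump is installed and configured, a vmcore-dmesg.txt file is generated when the system crashes. This file holds the kernel log from the time of the crash. Use the call trace in it (it usually starts with "Call Trace:") to find where the problem occurred, analyze the cause of the crash, and then fix and debug it.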
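As a rough illustration of Method 3, the sketch below prints the call trace sections of a kdump log. The function name show_call_trace and the 30-line window are arbitrary choices made for this example, and the /var/crash location in the usage comment is an assumption (it is a common kdump default, but your instance may write dumps elsewhere).

```shell
#!/bin/sh
# Print every "Call Trace:" line together with the lines that follow it.
# $1: path to a vmcore-dmesg.txt file
# $2: number of following lines to show (default 30, an arbitrary choice)
show_call_trace() {
    grep -A "${2:-30}" 'Call Trace:' "$1"
}

# Example (the path is an assumption; adjust it to where kdump writes dumps):
# show_call_trace /var/crash/<dump-directory>/vmcore-dmesg.txt
```

The function names at the top of the trace are usually the most useful starting point, because they show what the kernel was executing when it crashed.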
Hands-on practice
To try out the content of this topic yourself, click Verify the GuestOS panic diagnosis capability.
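Common crash causes and solutions
Problem description
An ECS instance running Linux crashes during operation and logs "not syncing: Out of memory: system-wide panic_on_oom is enabled". The call trace is similar to the following: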
[3624965.306801] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled [3624965.307824] CPU: 5 PID: 8510 Comm: AliDetect Kdump: loaded Tainted: GOE ------------ T 3.10.0-1127.10.1.el7.x86_64 #1 [3624965.308923] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [3624965.309671] Call Trace: [3624965.309935] [<ffffffff8f37ffa5>] dump_stack+0x19/0x1b [3624965.310444] [<ffffffff8f379541>] panic+0xe8/0x21f [3624965.310913] [<ffffffff8edc26b5>] check_panic_on_oom+0x55/0x60 [3624965.311480] [<ffffffff8edc2aab>] out_of_memory+0x23b/0x4f0 [3624965.312027] [<ffffffff8f37b3e0>] __alloc_pages_slowpath+0x5db/0x729 [3624965.312628] [<ffffffff8edc91a6>] __alloc_pages_nodemask+0x436/0x450 [3624965.313233] [<ffffffff8ee18e78>] alloc_pages_current+0x98/0x110 [3624965.313808] [<ffffffff8edbe3d7>] __page_cache_alloc+0x97/0xb0 [3624965.314364] [<ffffffff8edc0f90>] filemap_fault+0x270/0x420 [3624965.314912] [<ffffffffc04ea7d6>] ext4_filemap_fault+0x36/0x50 [ext4] [3624965.315530] [<ffffffff8ededf4a>] __do_fault.isra.61+0x8a/0x100 [3624965.316095] [<ffffffff8edee4fc>] do_read_fault.isra.63+0x4c/0x1b0 [3624965.316680] [<ffffffff8edf5d60>] handle_mm_fault+0xa20/0xfb0 [3624965.317231] [<ffffffff8f38d653>] __do_page_fault+0x213/0x500 [3624965.317775] [<ffffffff8f38da26>] trace_do_page_fault+0x56/0x150 [3624965.318378] [<ffffffff8f38cfa2>] do_async_page_fault+0x22/0xf0 [3624965.318954] [<ffffffff8f3897a8>] async_page_fault+0x28/0x30
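Cause
The instance ran out of memory (OOM) while the kernel parameter vm.panic_on_oom was set to 1 or 2.
A value of 1 means that running out of memory may trigger a kernel panic or may start the OOM killer: the kernel normally panics, but if the shortage is confined to a cpuset or memory policy, the OOM killer terminates a process instead.
A value of 2 means that a kernel panic is always triggered when memory runs out.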
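To see which of these behaviours applies on an instance, you can read the current value from procfs. The following is a minimal sketch; the helper name describe_panic_on_oom is made up for this example, and the wording of each message is only a paraphrase of the values discussed in this topic.

```shell
#!/bin/sh
# Map a vm.panic_on_oom value to a short description of its behaviour.
describe_panic_on_oom() {
    case "$1" in
        0) echo "OOM killer runs; no panic" ;;
        1) echo "panic on OOM, unless the shortage is confined to a cpuset or memory policy" ;;
        2) echo "always panic on OOM" ;;
        *) echo "unexpected value: $1" ;;
    esac
}

# Read the live value and explain it (skipped if procfs is not readable).
if [ -r /proc/sys/vm/panic_on_oom ]; then
    describe_panic_on_oom "$(cat /proc/sys/vm/panic_on_oom)"
fi
```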
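Solution
Solution 1: Set the kernel parameter vm.panic_on_oom to 0
Setting vm.panic_on_oom to 0 makes the kernel start the OOM killer instead of panicking when memory runs out, which resolves the problem above.
Important: With vm.panic_on_oom set to 0, the system starts the OOM killer when memory runs out and terminates processes that use large amounts of memory. This can affect system stability and running applications. Before you make the change, make sure that you understand its impact and evaluate the memory management of the system and the needs of your applications.
Connect to the ECS instance remotely.
Run the following command to open the /etc/sysctl.conf file.
sudo vim /etc/sysctl.conf
Press the i key and change the setting to the following.
vm.panic_on_oom = 0
This stops the system from panicking when it runs out of memory.
Press the Esc key, enter :wq, and press Enter to save the file and exit the editor.
Run the following command to load the changes from sysctl.conf.
sudo sysctl -p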
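If you prefer to script the edit rather than use an editor, the same change can be sketched as below. The function set_sysctl_conf is a hypothetical helper written for this example, not part of any tool; it assumes GNU sed and a simple value without "|" or "&" characters.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style config file: rewrite the existing
# entry for the key if there is one, otherwise append a new line.
# $1: file, $2: key, $3: value (simple values only; see the note above)
set_sysctl_conf() {
    file=$1
    key=$2
    value=$3
    # Escape dots so that the key is matched literally by grep and sed.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i -E "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on the instance (run as root), followed by the reload step:
# set_sysctl_conf /etc/sysctl.conf vm.panic_on_oom 0
# sysctl -p
```

Rewriting an existing entry instead of always appending keeps the file free of conflicting duplicate lines for the same key.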
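Solution 2: Optimize memory usage
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.
OOM is usually caused by insufficient memory. Check whether memory usage is reasonable for your workload, and consider the following ways to increase memory capacity or reduce memory usage:
Upgrade the instance type
Upgrading the instance type gives you more memory. For details, see Change the instance type.
Optimize the application
Check the application's memory usage and optimize it, for example by fixing memory leaks or by tuning algorithms or configuration.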
Problem description
An ECS instance running Linux crashes during operation and logs "RIP: tcp_create_openreq_child". The call trace is similar to the following:
[8343753.027138] Oops: 0000 [#1] SMP PTI [8343753.027431] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.4.0-122-generic #138-Ubuntu [8343753.028127] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [8343753.028728] RIP: 0010:tcp_create_openreq_child+0x2fd/0x410 ... [8343753.036508] Call Trace: [8343753.036710] <IRQ> [8343753.036886] tcp_v4_syn_recv_sock+0x5a/0x400 [8343753.037234] tcp_get_cookie_sock+0x48/0x150 [8343753.037564] cookie_v4_check+0x581/0x6d0 [8343753.037880] tcp_v4_do_rcv+0x1a5/0x200 [8343753.038184] tcp_v4_rcv+0xc76/0xd10 [8343753.038551] ip_protocol_deliver_rcu+0x30/0x1b0 [8343753.038980] ip_local_deliver_finish+0x48/0x50 [8343753.039335] ip_local_deliver+0x73/0xf0
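Cause
A bug in the operating system kernel version causes a null pointer dereference, which triggers the system's protection mechanism and crashes the instance. For more information, see Bug details.
Solution
Upgrade the operating system kernel to 5.4.0-123.139 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.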
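Before and after an upgrade, it can help to compare the running kernel release with the fixed version. The sketch below is one way to do that, assuming GNU sort with the -V (version sort) option; version_at_least is a name invented for this example. The comparison uses 5.4.0-123 rather than 5.4.0-123.139 because uname -r on Ubuntu reports a release such as 5.4.0-122-generic, without the package upload number.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2,
# using the version ordering of GNU sort (sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the running kernel release with the fixed release.
if version_at_least "$(uname -r)" "5.4.0-123"; then
    echo "running kernel is at or above 5.4.0-123"
else
    echo "running kernel is older than 5.4.0-123"
fi
```

The same helper can be reused for the other kernel versions named in this topic by changing the second argument. Treat the result as a hint only, because release strings differ between distributions.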
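Problem description
An ECS instance running Linux crashes and restarts during operation and logs "RIP: sysrq_handle_crash". The call trace is similar to the following: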
[ 7262.769377] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_powerclamp iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper virtio_balloon shpchp cryptd parport_pc parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy [ 7262.774113] CPU: 1 PID: 3818 Comm: bash Not tainted 3.10.0-514.26.2.el7.x86_64 #1 [ 7262.774699] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [ 7262.775317] task: ffff88040d3d5e20 ti: ffff8803cb7ac000 task.ti: ffff8803cb7ac000 [ 7262.775904] RIP: 0010:[<ffffffff813ee1d6>] [<ffffffff813ee1d6>] sysrq_handle_crash+0x16/0x20 ... [ 7262.784790] Call Trace: [ 7262.784992] [<ffffffff813ee9f7>] __handle_sysrq+0x107/0x170 [ 7262.785450] [<ffffffff813eee6f>] write_sysrq_trigger+0x2f/0x40 [ 7262.785915] [<ffffffff8126be0d>] proc_reg_write+0x3d/0x80 [ 7262.786355] [<ffffffff811fe9fd>] vfs_write+0xbd/0x1e0 [ 7262.786759] [<ffffffff811ff51f>] SyS_write+0x7f/0xe0 [ 7262.787172] [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
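Cause
A user deliberately triggered the crash from inside the instance by running the following command:
echo c > /proc/sysrq-trigger
Solution
Under normal circumstances, do not run echo c > /proc/sysrq-trigger.
Important: Running echo c > /proc/sysrq-trigger crashes the kernel and the instance restarts immediately. The command is normally used only for testing, or to force a kernel crash when the system cannot be shut down in the normal way.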
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:get_target_pstate_use_performance". The call trace is similar to the following:
[ 1.076899] divide error: 0000 [#1] SMP [ 1.077669] Modules linked in: [ 1.078302] CPU: 4 PID: 9 Comm: rcu_sched Not tainted 3.10.0-1127.19.1.el7.x86_64 #1 [ 1.079519] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014 [ 1.080724] task: ffff91c8fa111070 ti: ffff91c8fa11c000 task.ti: ffff91c8fa11c000 [ 1.081919] RIP: 0010:[<ffffffff85dc3089>] [<ffffffff85dc3089>] get_target_pstate_use_performance+0x29/0xc0 [ 1.083355] RSP: 0000:ffff91c8fa11fb40 EFLAGS: 00010006 [ 1.093192] Call Trace: [ 1.093715] [<ffffffff85dc4081>] intel_pstate_update_util+0x161/0x310 [ 1.094550] [<ffffffff858e9523>] ? load_balance+0x1a3/0xa10 [ 1.095321] [<ffffffff858e4e87>] update_curr+0x127/0x1e0 [ 1.096123] [<ffffffff858e52a8>] dequeue_entity+0x28/0x5c0 [ 1.096894] [<ffffffff8586d3be>] ? kvm_sched_clock_read+0x1e/0x30 [ 1.097702] [<ffffffff858e5893>] dequeue_task_fair+0x53/0x660 [ 1.098490] [<ffffffff858debe5>] ? sched_clock_cpu+0x85/0xc0 [ 1.099266] [<ffffffff858d7a56>] deactivate_task+0x46/0xd0
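Cause
The problem may occur because the current_pstate frequency value of the Intel pstate driver is initialized to 0 while the ECS instance is starting. On a process switch, the system calls Intel pstate to adjust the performance mode to changes in system load. When Intel pstate uses a current_pstate value of 0, a divide-by-zero error can occur and crash the system.
Solution
Upgrade the operating system kernel to 4.18 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.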
Problem description
An ECS instance running Linux crashes during operation and logs "not syncing: Out of memory and no killable processes". The call trace is similar to the following:
[217894.026467] Out of memory: Kill process 17807 (php-fpm) score 4 or sacrifice child [217894.027560] Killed process 17807 (php-fpm) total-vm:386252kB, anon-rss:6972kB, file-rss:144kB, shmem-rss:9020kB [217894.910947] php-fpm invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [217894.912175] php-fpm cpuset=/ mems_allowed=0 [217894.913100] CPU: 0 PID: 18534 Comm: php-fpm Tainted: GOE ------------ 3.10.0-957.21.3.el7.x86_64 #1 [217894.914510] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [217894.915780] Call Trace: [217894.916607] [<ffffffff8ff63107>] dump_stack+0x19/0x1b [217894.917775] [<ffffffff8ff5db2a>] dump_header+0x90/0x229 [217894.918914] [<ffffffff8f901292>] ? ktime_get_ts64+0x52/0xf0 [217894.919979] [<ffffffff8f9584df>] ? delayacct_end+0x8f/0xb0 [217894.921026] [<ffffffff8f9ba834>] oom_kill_process+0x254/0x3d0 [217894.922097] [<ffffffff8f9ba2dd>] ? oom_unkillable_task+0xcd/0x120 [217894.923248] [<ffffffff8f9ba386>] ? find_lock_task_mm+0x56/0xc0 [217894.924364] [<ffffffff8f9bb076>] out_of_memory+0x4b6/0x4f0 [217894.925513] [<ffffffff8ff5e62e>] __alloc_pages_slowpath+0x5d6/0x724
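Cause
The system ran out of memory and found no process that it could terminate to free memory, so it could not keep running.
Solution
Check whether memory usage is reasonable for your workload, and consider the following ways to increase memory capacity or reduce memory usage:
Upgrade the instance type
Upgrading the instance type gives you more memory. For details, see Change the instance type.
Optimize the application
Check the processes on the ECS instance that use the most memory, decide whether that usage is reasonable, and optimize them, for example by fixing memory leaks or by tuning algorithms or configuration.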
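One way to find the processes that use the most memory is to sort the ps output by resident set size (RSS). The sketch below assumes the procps version of ps, which accepts the pid=,rss=,comm= column list; rank_by_rss is a helper name made up for this example.

```shell
#!/bin/sh
# Read "PID RSS COMMAND" lines on stdin and print the N lines with the
# largest RSS (second column, in KiB), largest first.
# $1: number of lines to print (default 10)
rank_by_rss() {
    sort -k2,2 -n -r | head -n "${1:-10}"
}

# Show the five largest processes on this machine, if ps is available.
if command -v ps >/dev/null 2>&1; then
    ps -eo pid=,rss=,comm= | rank_by_rss 5
fi
```

RSS is only one view of memory usage: shared pages are counted once per process, so the totals can overstate real consumption.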
Problem description
An ECS instance running Linux crashes during operation and logs "list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)". The call trace is similar to the following:
[1072741.548729] list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200) [1072741.549507] ------------[ cut here ]------------ [1072741.549886] kernel BUG at lib/list_debug.c:50! [1072741.550275] invalid opcode: 0000 [#1] SMP PTI [1072741.550646] CPU: 0 PID: 1583643 Comm: kworker/0:1 Tainted: G OE --------- - - 4.18.0-305.3.1.el8.x86_64 #1 [1072741.551468] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [1072741.552048] Workqueue: cgroup_destroy css_release_work_fn [1072741.552462] RIP: 0010:__list_del_entry_valid.cold.1+0x45/0x4c ... [1072741.560426] Call Trace: [1072741.560638] css_release_work_fn+0x3f/0x240 [1072741.560983] process_one_work+0x1a7/0x360 [1072741.561300] worker_thread+0x30/0x390 [1072741.561622] ? create_worker+0x1a0/0x1a0 [1072741.561933] kthread+0x116/0x130 [1072741.562195] ? kthread_flush_work_fn+0x10/0x10 [1072741.562557] ret_from_fork+0x35/0x40 [1072741.562843] Modules linked in: AliSecGuard(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c ata_generic cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_console virtio_blk [1072741.566968] Features: eBPF/event [1072741.567302] ---[ end trace 8f40bd2bf2a072e5 ]---
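Cause
A bug in the operating system kernel version: list_del hits the LIST_POISON2 (dead000000000200) error, which crashes the instance. For more information, see Bug details.
Solution
Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.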
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:module_put". The call trace is similar to the following:
[86389.969666] CPU: 2 PID: 1426 Comm: Syn-1203-Tx Tainted: GOE ------------ 3.10.0-1160.53.1.el7.x86_64 #1 [86389.970626] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [86389.971377] task: ffff983118bfc200 ti: ffff982defd58000 task.ti: ffff982defd58000 [86389.972034] RIP: 0010:[<ffffffff8c91956d>] [<ffffffff8c91956d>] module_put+0x1d/0x80 ... [86389.979170] Call Trace: [86389.979378] [<ffffffff8ca53b40>] cdev_put+0x20/0x30 [86389.979768] [<ffffffff8ca5098f>] __fput+0x1ef/0x230 [86389.980151] [<ffffffff8ca50abe>] ____fput+0xe/0x10 [86389.980526] [<ffffffff8c8c299b>] task_work_run+0xbb/0xe0 [86389.980946] [<ffffffff8c8a1954>] do_exit+0x2d4/0xa30 [86389.981375] [<ffffffff8c91358f>] ? futex_wait+0x11f/0x280
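Cause
A system process uses or accesses memory that has already been freed. This use-after-free bug triggers the operating system's protection mechanism or corrupts data, and the system crashes.
Note: Use-after-free is a common class of software vulnerability. It occurs when a program mistakenly uses or accesses memory that has already been freed, and it can lead to unpredictable behavior such as crashes, data corruption, data leaks, or execution of malicious code.
Solution
Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.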
Problem description
An ECS instance running Linux crashes during operation and logs "containerd: page allocation failure". The call trace is similar to the following:
[1558839.130515] ------------[ cut here ]------------ [1558839.131215] kernel BUG at lib/idr.c:1163! [1558839.131797] invalid opcode: 0000 [#1] SMP [1558839.132411] Modules linked in: binfmt_misc AliSecGuard(OE) AliSecProcFilter64(OE) AliSecNetFlt64(OE) xt_CT xt_multiport ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ipip tunnel4 ip_tunnel veth ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables iptable_mangle nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark xt_addrtype xt_set ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set nfnetlink dummy xt_comment iptable_nat nf_nat_ipv4 nf_nat iptable_filter tcp_diag inet_diag overlay(T) sunrpc nfit ppdev libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr [1558839.141715] ip_vs_rr ip_vs nf_conntrack libcrc32c br_netfilter bridge stp llc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio drm_panel_orientation_quirks [1558839.147553] CPU: 6 PID: 21465 Comm: kworker/6:0 Tainted: G OE ------------ T 3.10.0-957.21.3.el7.x86_64 #1 [1558839.149181] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [1558839.150656] Workqueue: events free_work [1558839.151766] task: ffff8fbc4d6e9040 ti: ffff8fb8b898c000 task.ti: ffff8fb8b898c000 [1558839.153196] RIP: 0010:[<ffffffff967774e1>] [<ffffffff967774e1>] ida_simple_remove+0x41/0x50 ... 
[1558839.171901] Call Trace: [1558839.173133] [<ffffffff966306c4>] __mem_cgroup_free+0x234/0x250 [1558839.174750] [<ffffffff966306f5>] free_work+0x15/0x20 [1558839.176259] [<ffffffff964b9ebf>] process_one_work+0x17f/0x440 [1558839.177872] [<ffffffff964baf56>] worker_thread+0x126/0x3c0 [1558839.179421] [<ffffffff964bae30>] ? manage_workers.isra.25+0x2a0/0x2a0 [1558839.181092] [<ffffffff964c1da1>] kthread+0xd1/0xe0 [1558839.182839] [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40 [1558839.184543] [<ffffffff96b75c37>] ret_from_fork_nospec_begin+0x21/0x21 [1558839.186238] [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40 ...
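Cause
A bug in the operating system kernel version: when memory control groups are enabled, the memcg_caches[] array is grown for each registered kernel memory cache. If no memory is available at that point, that is, an out-of-memory event occurs, the system may crash.
Solution
On CentOS 7.7, upgrade to kernel-3.10.0-1062.el7 or later. On CentOS 7.6, upgrade to kernel-3.10.0-957.27.2.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.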
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:blk_mq_rq_timed_out". The call trace is similar to the following:
[8837401.113325] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 [8837401.114219] IP: [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0 [8837401.114892] PGD 8000000885d08067 PUD e1beda067 PMD 0 [8837401.115471] Oops: 0000 [#1] SMP [8837401.115855] Modules linked in: AliSecNetFlt64(OE) AliSecGuard(OE) AliSecProcFilter64(OE) xt_multiport veth ipt_rpfilter ip6t_rpfilter ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_set iptable_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6table_raw ip6_tables ip_set_hash_ip ip_set_hash_net ip_set sch_htb xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 nf_tables iptable_mangle xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache overlay(T) sunrpc nfit libnvdimm iosf_mbi crc32_pclmul ppdev virtio_balloon joydev ghash_clmulni_intel parport_pc aesni_intel parport lrw gf128mul glue_helper i2c_piix4 ablk_helper pcspkr cryptd ip_vs_rr ip_vs_sh ip_vs_wrr ip_vs nf_conntrack ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_console virtio_blk failover cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy drm_panel_orientation_quirks virtio libcrc32c br_netfilter bridge stp llc [last unloaded: AliSecNetFlt64] [8837401.130281] CPU: 0 PID: 163944 Comm: kworker/0:1H Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.80.1.el7.x86_64 #1 [8837401.133029] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014 [8837401.134621] Workqueue: kblockd blk_mq_timeout_work [8837401.135916] task: ffff88258a0b6300 ti: ffff8820c2b9c000 task.ti: ffff8820c2b9c000 [8837401.137422] RIP: 0010:[<ffffffffae575638>] [<ffffffffae575638>] 
blk_mq_rq_timed_out+0x18/0xa0 [8837401.139091] RSP: 0018:ffff8820c2b9fd18 EFLAGS: 00010246 [8837401.140371] RAX: 0000000000000000 RBX: ffff8819b6ad0000 RCX: 0000000000000000 [8837401.141838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8819b6ad0000 [8837401.143314] RBP: ffff8820c2b9fd20 R08: 000000030ec11230 R09: df98ad67960c8828 [8837401.144732] R10: df98ad67960c8828 R11: ffff8822d9e17f00 R12: ffff8819b6863240 [8837401.146161] R13: 0000000000000002 R14: 0000000000000020 R15: 0000000000000002 [8837401.147605] FS: 0000000000000000(0000) GS:ffff8829bfc00000(0000) knlGS:0000000000000000 [8837401.149177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [8837401.150426] CR2: 00000000000000d0 CR3: 00000003e570a000 CR4: 00000000003606f0 [8837401.151844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [8837401.153287] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [8837401.154667] Call Trace: [8837401.155579] [<ffffffffae57572c>] blk_mq_check_expired+0x6c/0x80 [8837401.157057] [<ffffffffae578dac>] bt_iter+0x5c/0x70 [8837401.158357] [<ffffffffae57984b>] blk_mq_queue_tag_busy_iter+0x13b/0x320 [8837401.159675] [<ffffffffae2e84c9>] ? pick_next_entity+0xa9/0x190 [8837401.160968] [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0 [8837401.162414] [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0 [8837401.163748] [<ffffffffae57428b>] blk_mq_timeout_work+0x8b/0x180 [8837401.165062] [<ffffffffae2c319f>] process_one_work+0x17f/0x440 [8837401.166329] [<ffffffffae2c42e6>] worker_thread+0x126/0x3c0 [8837401.167541] [<ffffffffae2c41c0>] ? manage_workers.isra.26+0x2b0/0x2b0 [8837401.169048] [<ffffffffae2cb4d1>] kthread+0xd1/0xe0 [8837401.170311] [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40 [8837401.171514] [<ffffffffae9c51f7>] ret_from_fork_nospec_begin+0x21/0x21 [8837401.172861] [<ffffffffae2cb400>] ? 
insert_kthread_work+0x40/0x40 [8837401.174091] Code: 83 84 c6 80 00 00 00 01 e8 f6 fe ff ff 5d c3 cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 53 48 8b 57 58 48 8b 47 38 48 89 fb 83 e2 02 <48> 8b 80 d0 00 00 00 74 4c 48 83 78 10 00 74 50 48 ba 00 00 00 [8837401.178255] RIP [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0 [8837401.179436] RSP <ffff8820c2b9fd18> [8837401.180300] CR2: 00000000000000d0
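Cause
A bug in the operating system kernel version: the code dereferences a null pointer, which triggers a memory access error and crashes the instance. For more information, see Bug details.
Solution
Upgrade the operating system kernel to kernel-3.10.0-1160.88.1.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.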
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:strnlen". The call trace is similar to the following:
[86390.829326] BUG: unable to handle kernel paging request at 0000000100620100 [86390.829510] IP: [<ffffffff9ed7f2ad>] strnlen+0xd/0x40 [86390.829632] PGD 0 [86390.829685] Oops: 0000 [#1] SMP [86390.829766] Modules linked in: AliSecGuard(OE) binfmt_misc xt_conntrack iptable_filter iptable_nat nf_nat_ipv4 arc4 emp(OE) nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat nf_conntrack eudp(E) libcrc32c ppdev intel_powerclamp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc virtio_balloon parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect virtio_net virtio_console virtio_blk sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw floppy virtio_pci virtio_ring virtio drm_panel_orientation_quirks [86390.831199] CPU: 2 PID: 1311 Comm: KeepAlive Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1 [86390.831410] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014 [86390.831580] task: ffff97c77add9040 ti: ffff97c77ade0000 task.ti: ffff97c77ade0000 [86390.831742] RIP: 0010:[<ffffffff9ed7f2ad>] [<ffffffff9ed7f2ad>] strnlen+0xd/0x40 ...... [86390.833643] Call Trace: [86390.833699] [<ffffffff9ed8105b>] string.isra.7+0x3b/0xf0 [86390.833805] [<ffffffff9ed82771>] vsnprintf+0x201/0x6a0 [86390.833908] [<ffffffff9ed82c1d>] vscnprintf+0xd/0x30 [86390.834011] [<ffffffff9ea9a24b>] vprintk_emit+0x11b/0x510 [86390.834143] [<ffffffff9ea9a8a9>] ? vprintk_default+0x29/0x40 [86390.834277] [<ffffffff9ed77ef0>] ? kobject_put+0x50/0x60 [86390.834407] [<ffffffff9ea9a65f>] vprintk+0x1f/0x30 [86390.834517] [<ffffffff9ea975ef>] __warn+0x7f/0x100 [86390.834618] [<ffffffff9ea976cf>] warn_slowpath_fmt+0x5f/0x80 [86390.834746] [<ffffffffc02e2b64>] ? 
close_eudp_mmap_dev+0x1b4/0x200 [eudp] [86390.834896] [<ffffffff9ed77ef0>] kobject_put+0x50/0x60 [86390.835013] [<ffffffff9ec466f8>] cdev_put+0x18/0x30 [86390.835125] [<ffffffff9ec4350a>] __fput+0x21a/0x260 [86390.835232] [<ffffffff9ec4363e>] ____fput+0xe/0x10 [86390.835340] [<ffffffff9eabe79b>] task_work_run+0xbb/0xe0 [86390.835459] [<ffffffff9ea9dc61>] do_exit+0x2d1/0xa40 [86390.835568] [<ffffffff9ea9e44f>] do_group_exit+0x3f/0xa0 [86390.835695] [<ffffffff9eaaf24e>] get_signal_to_deliver+0x1ce/0x5e0 [86390.835830] [<ffffffff9ea2b527>] do_signal+0x57/0x6f0 [86390.835942] [<ffffffff9eac57e0>] ? hrtimer_get_res+0x50/0x50 [86390.836068] [<ffffffff9ea2bc32>] do_notify_resume+0x72/0xc0 [86390.836202] [<ffffffff9f175124>] int_signal+0x12/0x17 ...
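Cause
The third-party module eudp is installed on the system. A bug in this module (for example, passing an invalid argument to the strnlen function) crashes the instance.
Solution
Uninstall the third-party module eudp.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.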
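When a third-party module is suspected, the "Modules linked in:" line of the crash log is a quick place to look: modules that taint the kernel carry flags in parentheses, for example O for out-of-tree and E for unsigned, as in eudp(E) in the log above. The sketch below lists such modules; flagged_modules is a name invented for this example, and a flag alone does not prove that a module caused the crash.

```shell
#!/bin/sh
# Read kernel log text on stdin and print every module name that carries
# a taint flag in parentheses, such as "eudp(E)" or "emp(OE)".
flagged_modules() {
    grep -oE '[A-Za-z0-9_]+\([A-Z]+\)' | sort -u
}

# Example with a kdump log (the path is an assumption):
# grep 'Modules linked in' /var/crash/<dump-directory>/vmcore-dmesg.txt | flagged_modules
```

Cross-check the result against the call trace: a module name in square brackets next to a frame, such as [eudp], shows that the module's code was on the stack at the time of the crash.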
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:filp_close". The call trace is similar to the following:
[ 1891.552008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000036 [ 1891.552149] IP: [<ffffffff8801c67e>] filp_close+0xe/0x90 [ 1891.552239] PGD 40819b067 PUD 40819a067 PMD 0 [ 1891.552321] Oops: 0000 [#1] SMP [ 1891.552380] Modules linked in: AliSecGuard(OE) AliSecNetFlt64(OE) tampercore(OE) tampercfg(OE) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_powerclamp crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc parport i2c_piix4 shpchp virtio_balloon pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_console virtio_blk drm crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel virtio_ring i2c_core serio_raw virtio floppy [ 1891.553945] CPU: 3 PID: 2778 Comm: AliHips Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1 [ 1891.554107] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014 [ 1891.554228] task: ffff88d4cd7e4f10 ti: ffff88d4c5af8000 task.ti: ffff88d4c5af8000 [ 1891.554346] RIP: 0010:[<ffffffff8801c67e>] [<ffffffff8801c67e>] filp_close+0xe/0x90 ...... [ 1891.555727] Call Trace: [ 1891.555772] [<ffffffffc08d0d7c>] is_pathsite+0x1ac/0x400 [tampercore] [ 1891.555878] [<ffffffff88055e1a>] ? bh_lru_install+0x18a/0x1e0 [ 1891.555974] [<ffffffff880563fc>] ? __find_get_block+0xbc/0x120 [ 1891.556069] [<ffffffff8805648d>] ? __getblk+0x2d/0x300 [ 1891.556160] [<ffffffffc02d956b>] ? search_dir+0x8b/0x120 [ext4] [ 1891.556258] [<ffffffff87ebeed5>] ? 
wake_up_bit+0x25/0x30 [ 1891.556345] [<ffffffff88055b2d>] ? __brelse+0x3d/0x50 [ 1891.556432] [<ffffffffc02d9a69>] ? ext4_find_entry+0x299/0x570 [ext4] [ 1891.556536] [<ffffffff880380cd>] ? __d_instantiate+0x2d/0xe0 [ 1891.556629] [<ffffffff88037446>] ? _d_rehash+0x36/0x40 [ 1891.556712] [<ffffffff88037473>] ? d_rehash+0x23/0x40 [ 1891.556795] [<ffffffff8803866c>] ? d_splice_alias+0xdc/0x120 [ 1891.556891] [<ffffffffc02da368>] ? ext4_lookup+0x118/0x170 [ext4] [ 1891.556993] [<ffffffff8802b2b3>] ? lookup_fast+0xb3/0x230 [ 1891.557080] [<ffffffff8802ca48>] ? link_path_walk+0x238/0x8b0 [ 1891.558026] [<ffffffff8809769b>] ? proc_pid_permission+0x9b/0xc0 [ 1891.558976] [<ffffffff8802dfea>] ? path_lookupat+0x7a/0x8b0 [ 1891.559917] [<ffffffffc08d20db>] tamperhack_mkdir.part.4+0x12b/0x190 [tampercore] [ 1891.560888] [<ffffffffc08d2185>] tamperhack_mkdir+0x45/0x50 [tampercore] [ 1891.561828] [<ffffffff8852579b>] system_call_fastpath+0x22/0x27 [ 1891.562736] Code: ff 00 00 00 00 e9 d3 fe ff ff 0f 1f 00 b8 ea ff ff ff eb 9d e8 c4 7c e7 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <48> 8b 47 38 48 89 fb 48 85 c0 74 5b 48 8b 47 28 49 89 f4 48 85 [ 1891.564925] RIP [<ffffffff8801c67e>] filp_close+0xe/0x90
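Cause
The third-party module Tampercore is installed on the system. A bug in this module causes an error when the filp_close function is called, which crashes the instance.
Solution
Uninstall or upgrade the third-party module Tampercore.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.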
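Problem description
An ECS instance running Linux crashes repeatedly during startup and cannot boot into the system. It logs "VFS: Unable to mount root fs on unknown-block". The call trace is similar to the following: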
[ 1.573197] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) [ 1.574179] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 3.10.0-1160.6.1.el7.x86_64 #1 [ 1.575045] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014 [ 1.575900] Call Trace: [ 1.576246] [<ffffffff8f381400>] dump_stack+0x19/0x1b [ 1.576845] [<ffffffff8f37a958>] panic+0xe8/0x21f [ 1.577433] [<ffffffff8f98b794>] mount_block_root+0x291/0x2a0 [ 1.578122] [<ffffffff8f98b7f6>] mount_root+0x53/0x56 [ 1.578719] [<ffffffff8f98b935>] prepare_namespace+0x13c/0x174 [ 1.579425] [<ffffffff8f98b412>] kernel_init_freeable+0x222/0x249 [ 1.580150] [<ffffffff8f98ab28>] ? initcall_blacklist+0xb0/0xb0 [ 1.580838] [<ffffffff8f36fa90>] ? rest_init+0x80/0x80 [ 1.581462] [<ffffffff8f36fa9e>] kernel_init+0xe/0x100 [ 1.582073] [<ffffffff8f394df7>] ret_from_fork_nospec_begin+0x21/0x21 [ 1.582814] [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
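Cause
A kernel upgrade was interrupted or failed, which damaged the root file system (rootfs). During startup the instance cannot find the file system on the root partition, so it crashes.
Solution
Replace the system disk of the ECS instance, or roll the disk back from an existing snapshot. For details, see Replace the operating system (system disk) or Roll back a disk by using a snapshot.
Important: Before you proceed, create a snapshot of the ECS instance to back up your data and avoid data loss caused by misoperation. For details, see Create a snapshot.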
Problem description
An ECS instance running Linux crashes during operation and logs "RIP:virtio_check_driver_offered_feature". The call trace is similar to the following:
[55686.388353] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090 [55686.389223] IP: [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio] [55686.390030] PGD 229af2067 PUD 21cbac067 PMD 0 [55686.390514] Oops: 0000 [#1] SMP [55686.390867] Modules linked in: unix_diag AliSecGuard(OE) udp_diag tcp_diag inet_diag joydev binfmt_misc xfs libcrc32c dm_mod kvm_amd kvm irqbypass crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper parport_pc ablk_helper cryptd virtio_balloon pcspkr parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel virtio_pci libata serio_raw virtio_ring virtio drm_panel_orientation_quirks floppy [55686.396603] CPU: 0 PID: 19222 Comm: fdisk Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.1.2.el7.x86_64 #1 [55686.397848] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 [55686.398578] task: ffff964836e8e2a0 ti: ffff964860370000 task.ti: ffff964860370000 [55686.399303] RIP: 0010:[<ffffffffc0047450>] [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio] .... [55686.406216] Call Trace: [55686.406473] [<ffffffffc0102b4c>] virtblk_ioctl+0x3c/0x70 [virtio_blk] [55686.407098] [<ffffffff955608b5>] __blkdev_driver_ioctl+0x25/0x40 [55686.407697] [<ffffffffc03b5024>] dm_blk_ioctl+0x74/0xb0 [dm_mod] [55686.408289] [<ffffffff955612fa>] blkdev_ioctl+0x28a/0xa20 [55686.408817] [<ffffffff95488771>] block_ioctl+0x41/0x50 [55686.409319] [<ffffffff9545d9e0>] do_vfs_ioctl+0x3a0/0x5a0 [55686.409845] [<ffffffff95305a82>] ? ktime_get+0x52/0xe0 [55686.410345] [<ffffffff955024ec>] ? 
security_file_ioctl+0x1c/0x20 [55686.410930] [<ffffffff9545dc81>] SyS_ioctl+0xa1/0xc0 [55686.411429] [<ffffffff9598cede>] system_call_fastpath+0x25/0x2a [55686.411999] Code: d5 89 de 48 c7 c7 e0 93 04 c0 e8 4c 98 53 d5 5b 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a0 00 00 00 48 89 e5 <8b> 91 90 00 00 00 85 d2 74 2c 48 8b 81 88 00 00 00 39 30 74 59 [55686.414738] RIP [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
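Cause
The instance uses Logical Volume Manager (LVM), and a logical volume (LV) is associated with a device (vdc in this example) that has since been deleted. Because LVM still holds configuration for that device, running a command that touches the device (such as blkid or fdisk) crashes the instance.
Solution
Solution 1: Use LVM commands to remove the configuration of the nonexistent device so that the LVM configuration matches the actual devices.
Solution 2: Upgrade the kernel to kernel-3.10.0-1160.6.1.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.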
Problem description
An ECS instance running Linux crashes during operation and logs "Out of memory and no killable processes". The call trace is similar to the following:
[28663.625353] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [28663.625363] [ 1799] 0 1799 26512 245 56 3 0 -1000 sshd [28663.625367] [29219] 0 29219 10832 126 26 3 0 -1000 systemd-udevd [28663.625375] Kernel panic - not syncing: Out of memory and no killable processes... [28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G OE 3.10.0-1062.9.1.el7.x86_64 #1 [28663.676873] Call Trace: [28663.679312] [<ffffffff8139f342>] dump_stack+0x63/0x81 [28663.684421] [<ffffffff811b2245>] panic+0xf8/0x244 [28663.689184] [<ffffffff811b98db>] out_of_memory+0x2eb/0x550 [28663.694726] [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0 [28663.700959] [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40 [28663.707279] [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260 [28663.713599] [<ffffffff81216535>] alloc_pages_current+0x95/0x140 [28663.719573] [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40 [28663.725113] [<ffffffff81075dae>] pgd_alloc+0x1e/0x160 [28663.730225] [<ffffffff810875e4>] mm_init+0x184/0x240 [28663.735249] [<ffffffff81088102>] mm_alloc+0x52/0x60 [28663.740186] [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780 [28663.759839] [<ffffffff81257b9c>] do_execve+0x2c/0x30 [28663.764864] [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150 [28663.777246] [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
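Cause
After a memory allocation fails, the kernel tries to free memory by killing a process. No killable process exists, so the kernel deliberately panics. Possible causes:
A memory leak in the kernel leaves too little available memory.
Processes whose oom_score_adj is -1000 use too much memory. Such processes cannot be terminated, which leaves too little available memory.
Note: oom_score_adj adjusts how likely a process is to be chosen by the OOM (Out of Memory) killer. The kernel chooses the process to terminate by its OOM score (oom_score): a higher oom_score makes a process more likely to be terminated, and a lower one makes it less likely. oom_score_adj ranges from -1000 to 1000 and is factored into that score.
Solution
Check whether the kernel has a memory leak.
For details, see How do I troubleshoot high slab_unreclaimable memory usage?.
Check whether the oom_score_adj setting of each process is reasonable.
Run the following command to get the PID of a process. You can also use commands such as ps, top, or pgrep to find the PID.
ps aux | grep <process name>
Replace <process name> with the name of the process that you are looking for.
Run the following command to check the oom_score_adj setting.
cat /proc/<PID>/oom_score_adj
Replace <PID> with the actual PID that you obtained.
Based on the oom_score_adj value, evaluate whether the OOM behavior of the process is reasonable for your environment and requirements. A value of -1000 exempts the process from the OOM killer entirely. If such a process uses a lot of memory, the system can run short of available memory with nothing left to terminate.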
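Checking one PID at a time is slow when you do not yet know which process is exempt. The following sketch walks procfs and prints every process whose oom_score_adj is -1000. The function name list_oom_exempt and its optional root-directory argument are assumptions made for this example; the argument exists only so that the function can be tried against a test directory.

```shell
#!/bin/sh
# Print "PID NAME" for every process whose oom_score_adj is -1000,
# that is, every process the OOM killer will never select.
# $1: procfs root (default /proc)
list_oom_exempt() {
    root=${1:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score_adj" ] || continue
        # A process can exit while we scan, so tolerate read failures.
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        if [ "$adj" = "-1000" ]; then
            name=$(cat "$dir/comm" 2>/dev/null) || name="?"
            echo "${dir##*/} $name"
        fi
    done
}

list_oom_exempt
```

Entries such as sshd and systemd-udevd are commonly exempt by design, as the log above shows. What matters is whether an exempt process also holds a large share of memory.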
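Problem description
When you use the memory cgroup kmem feature inside an ECS instance, the kernel logs a warning similar to the following and the instance crashes. The call trace is similar to the following: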
[80569.393775] BUG kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0) (Tainted: P B W OE ------------ T): Objects remaining in kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a [80569.397756] ----------------------------------------------------------------------------- [80569.397756] [80569.400724] INFO: Slab 0xffffea0001e94a00 objects=32 used=1 fp=0xffff88007a528000 flags=0x1fffff00004080 [80569.402702] CPU: 21 PID: 26626 Comm: dockerd Tainted: P B W OE ------------ T 3.10.0-693.2.2.el7.x86_64 #1 [80569.404898] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014 [80569.406747] ffffea0001e94a00 000000004eb9a19f ffff883afee53aa0 ffffffff816a3db1 [80569.408833] ffff883afee53b78 ffffffff811dbf54 ffffffff00000020 ffff883afee53b88 [80569.410731] ffff883afee53b38 656a624f8190fff8 616d657220737463 6e6920676e696e69 [80569.412630] Call Trace: [80569.414005] [<ffffffff816a3db1>] dump_stack+0x19/0x1b [80569.415627] [<ffffffff811dbf54>] slab_err+0xb4/0xe0 [80569.417204] [<ffffffff811e0623>] ? __kmalloc+0x1e3/0x230 [80569.420419] [<ffffffff811e1939>] kmem_cache_close+0x149/0x2e0 [80569.422006] [<ffffffff811e1ae4>] __kmem_cache_shutdown+0x14/0x80 [80569.423606] [<ffffffff811a6874>] kmem_cache_destroy+0x44/0xf0 [80569.425149] [<ffffffff811f6019>] kmem_cache_destroy_memcg_children+0x89/0xb0 [80569.426800] [<ffffffff811a6849>] kmem_cache_destroy+0x19/0xf0 [80569.428309] [<ffffffff8123b18e>] bioset_free+0xce/0x110 [80569.431306] [<ffffffffc06d0b43>] dm_destroy+0x13/0x20 [dm_mod] [80569.432803] [<ffffffffc06d69be>] dev_remove+0x11e/0x180 [dm_mod] [80569.435851] [<ffffffffc06d7015>] ctl_ioctl+0x1e5/0x500 [dm_mod] [80569.437363] [<ffffffffc06d7343>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [80569.438882] [<ffffffff8121524d>] do_vfs_ioctl+0x33d/0x540 [80569.443291] [<ffffffff812154f1>] SyS_ioctl+0xa1/0xc0 [80569.446228] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
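Cause
When the memory cgroup kmem feature is in use, kmem_cache_destroy destroys a kmem_cache by first deleting its memcg cache and only then checking whether refcount is 0. Because refcount is not 0, other legitimate tasks may still be trying to allocate slabs through the memcg cache of that kmem_cache. The resulting race triggers the crash.
Solution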
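We recommend that you disable the memory cgroup kmem feature on the ECS instance. Perform the following steps:
Run the following command to open the /etc/default/grub file.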
vim /etc/default/grub
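Press i to enter edit mode, and add the following setting to GRUB_CMDLINE_LINUX.
cgroup.memory=nokmem
Press Esc to exit edit mode, type :wq, and press Enter to save and close the file.
Run the following command to update GRUB.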
grub2-mkconfig -o /boot/grub2/grub.cfg
Run the following command to restart the ECS instance.
reboot
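The manual edit above can also be scripted. The following is a minimal sketch, not an official procedure: the function names add_nokmem and nokmem_active are hypothetical, and the sketch assumes GNU sed and a GRUB_CMDLINE_LINUX value that is written on a single line in double quotes.

```shell
#!/bin/sh
# Hypothetical helper: append cgroup.memory=nokmem to GRUB_CMDLINE_LINUX
# in the given file. Does nothing if the parameter is already present.
# ASSUMPTION: the value is on one line and enclosed in double quotes.
add_nokmem() {
    file="$1"
    if grep -q '^GRUB_CMDLINE_LINUX=.*cgroup\.memory=nokmem' "$file"; then
        return 0
    fi
    # Keep a backup next to the original before editing in place.
    cp "$file" "$file.bak"
    sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup.memory=nokmem"/' "$file"
}

# Hypothetical helper: succeed if the given kernel command line
# (default: /proc/cmdline) contains the parameter.
nokmem_active() {
    grep -qw 'cgroup.memory=nokmem' "${1:-/proc/cmdline}"
}
```

For example, run add_nokmem /etc/default/grub in place of the vim steps, then update GRUB and restart as described above. After the restart, nokmem_active returns success if the running kernel was booted with the parameter.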
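If your operating system cannot disable memory cgroup kmem through the kernel command line (cmdline), we recommend that no program on the ECS instance sets a value for memory.kmem.limit_in_bytes. This ensures that the memory cgroup kmem feature is not enabled.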
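To check whether any program has already set this value, you can scan the cgroup hierarchy. The following is a rough sketch under stated assumptions: the function name find_kmem_limits is hypothetical, the memory controller is assumed to be cgroup v1 mounted at /sys/fs/cgroup/memory, and any value at or above 9223372036854771712 is assumed to mean "no limit". Verify both assumptions on your kernel before you rely on the output.

```shell
#!/bin/sh
# ASSUMPTION: values at or above this threshold mean "no limit was set".
UNLIMITED_MIN=9223372036854771712

# Hypothetical helper: print "<file> <value>" for every
# memory.kmem.limit_in_bytes file under the given root whose value
# is below the assumed "unlimited" threshold.
find_kmem_limits() {
    root="${1:-/sys/fs/cgroup/memory}"
    find "$root" -name memory.kmem.limit_in_bytes 2>/dev/null |
    while read -r f; do
        v=$(cat "$f" 2>/dev/null) || continue
        # Skip empty or non-numeric content.
        case "$v" in ''|*[!0-9]*) continue ;; esac
        # Skip values too long to compare safely as 64-bit integers.
        [ "${#v}" -gt 19 ] && continue
        if [ "$v" -lt "$UNLIMITED_MIN" ]; then
            echo "$f $v"
        fi
    done
}
```

Running find_kmem_limits with no argument scans the assumed default mount point. Each line of output names a cgroup in which a finite kmem limit was configured, which points to the program or container runtime that should be reconfigured.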
Problem description
A Linux ECS instance crashes while running and logs "unable to handle kernel NULL pointer dereference". The call stack resembles the following:
[8794845.086660] BUG: unable to handle kernel NULL pointer dereference at (null)
[8794845.088500] IP: [<ffffffff8128f89c>] kref_get+0xc/0x30
[8794845.089355] PGD 812ca2067 PUD 6dd707067 PMD 0
[8794845.090303] Oops: 0000 [#1] SMP
[8794845.091005] last sysfs file: /sys/devices/system/cpu/online
[8794845.091861] CPU 3
[8794845.092212] Modules linked in: ysec_firewall_kmod(U) tcp_diag inet_diag nf_conntrack_netlink nfnetlink nf_conntrack_ipv6 nf_defrag_ipv6 ip6_tables xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ipv6 virtio_balloon virtio_net virtio_console i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ysec_firewall_kmod]
[8794845.101913]
[8794845.102621] Pid: 21908, comm: ysec_hids_mod_l Tainted: G W --------------- 2.6.32-504.16.2.el6.x86_64 #1 Alibaba Cloud Alibaba Cloud ECS
[8794845.105481] RIP: 0010:[<ffffffff8128f89c>] [<ffffffff8128f89c>] kref_get+0xc/0x30
[8794845.107400] RSP: 0018:ffff88045f5a3e38 EFLAGS: 00010292
[8794845.108628] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffffff3
[8794845.110501] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[8794845.112371] RBP: ffff88045f5a3e48 R08: 0000000000000000 R09: ffff88050f507f00
[8794845.114133] R10: 0000000000000003 R11: 0000000000000206 R12: ffffffff8161b040
[8794845.115994] R13: 0000000000000040 R14: 00007f4b457f94d0 R15: 0000000000000000
[8794845.117865] FS: 00007f4b457fb700(0000) GS:ffff880030380000(0000) knlGS:0000000000000000
[8794845.119846] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8794845.121055] CR2: 0000000000000000 CR3: 00000006f6837000 CR4: 00000000001406e0
[8794845.122807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[8794845.124685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[8794845.126558] Process ysec_hids_mod_l (pid: 21908, threadinfo ffff88045f5a2000, task ffff8806d43acab0)
[8794845.128689] Stack:
[8794845.129414] ffff88045f5a3e68 0000000000000000 ffff88045f5a3e68 ffffffff810d6ae6
[8794845.131107] <d> ffffffff8161b040 ffff8806c03a3520 ffff88045f5a3ef8 ffffffff81203898
[8794845.133479] <d> 00007f4b457f9510 0000000000000000 ffff88045f5a3eb8 ffffffff8128c635
[8794845.136365] Call Trace:
[8794845.137127] [<ffffffff810d6ae6>] pidns_get+0x26/0x30
[8794845.138367] [<ffffffff81203898>] proc_ns_readlink+0xc8/0x180
[8794845.139665] [<ffffffff8128c635>] ? _atomic_dec_and_lock+0x55/0x80
[8794845.141008] [<ffffffff811ab151>] ? touch_atime+0x71/0x1a0
[8794845.142268] [<ffffffff81193b0e>] sys_readlinkat+0xfe/0x120
[8794845.143536] [<ffffffff81193b4b>] sys_readlink+0x1b/0x20
[8794845.144695] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
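Cause
The kernel or a driver accessed an invalid memory address.
Solution
Solution 1: Upgrade the kernel to a later version. For more information, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by a mistake. For more information, see Create a snapshot.
Solution 2: Check whether unreliable third-party software or drivers are installed on the system, and try uninstalling them. For more information, see How do I view the third-party software and drivers installed on an ECS instance?.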
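When you look for third-party drivers, the "Modules linked in:" line of the crash log is a useful starting point, because the kernel appends a flag suffix in parentheses to some module names, such as ysec_firewall_kmod(U) in the log above. The following sketch extracts those entries from a saved log. The function name flagged_modules is hypothetical. In upstream kernels, O marks an out-of-tree module and E an unsigned one. The meaning of other letters depends on the distribution, so treat the output as a list of candidates to review, not as proof of a faulty module.

```shell
#!/bin/sh
# Hypothetical helper: print modules from the "Modules linked in:" line
# that carry a flag suffix such as (OE) or (U).
# Reads the file given as $1, or standard input if no file is given.
flagged_modules() {
    grep -o 'Modules linked in:.*' "${1:--}" |
    tr ' ' '\n' |
    grep -E '^[A-Za-z0-9_]+\([A-Z]+\)$'
}
```

For example, flagged_modules vmcore-dmesg.txt prints one flagged module per line. A module that is listed there and was not shipped with your distribution is a reasonable first candidate to uninstall.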
Problem description
A Linux ECS instance crashes while running and logs "unable to handle kernel paging request at". The call stack resembles the following:
[85899.344803] BUG: unable to handle kernel paging request at ffffffffc0b0ceef
[85899.345643] IP: [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.346119] PGD 24f212067 PUD 24f214067 PMD 24e421067 PTE 0
[85899.346670] Oops: 0010 [#1] SMP
[85899.346982] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth rfkill ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink xt_addrtype br_netfilter tcp_diag inet_diag xt_set ip_set_hash_ip tampercfg(OE) overlay(T) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi ppdev virtio_balloon crc32_pclmul parport_pc ghash_clmulni_intel parport shpchp i2c_piix4 aesni_intel lrw gf128mul glue_helper joydev
[85899.354796] ablk_helper pcspkr cryptd ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci i2c_core serio_raw virtio_ring floppy virtio [last unloaded: tampercore]
[85899.358255] CPU: 2 PID: 1 Comm: systemd Tainted: G OE ------------ T 3.10.0-862.14.4.el7.x86_64 #1
[85899.359264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[85899.360050] task: ffff9880fa2c0000 ti: ffff9880fa2c8000 task.ti: ffff9880fa2c8000
[85899.360817] RIP: 0010:[<ffffffffc0b0ceef>] [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.361636] RSP: 0018:ffff9880fa2cbd30 EFLAGS: 00010246
[85899.362181] RAX: 0000000000000000 RBX: 000055a50e52e3c0 RCX: 0000000000000000
[85899.362913] RDX: 0000000180080006 RSI: fffff786c5c52800 RDI: 0000000040000000
[85899.363645] RBP: ffff9880fa2cbf48 R08: ffff9880f14a0000 R09: 0000000180080005
[85899.364372] R10: 00000000f14a3001 R11: fffff786c5c52800 R12: ffff9880fa2cbd30
[85899.365107] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[85899.365840] FS: 00007fa181b3a940(0000) GS:ffff9883bfc80000(0000) knlGS:0000000000000000
[85899.366669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85899.367257] CR2: ffffffffc0b0ceef CR3: 000000024ed44000 CR4: 00000000003606e0
[85899.367992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85899.368728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85899.369453] Call Trace:
[85899.369726] [<ffffffffa392579b>] system_call_fastpath+0x22/0x27
[85899.370339] Code: Bad RIP value.
[85899.370729] RIP [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.371292] RSP <ffff9880fa2cbd30>
[85899.373188] CR2: ffffffffc0b0ceef
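Cause
The kernel or a driver accessed an invalid memory address.
Solution
Solution 1: Upgrade the kernel to a later version. For more information, see Upgrade the kernel of a Linux ECS instance.
Important: Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by a mistake. For more information, see Create a snapshot.
Solution 2: Check whether unreliable third-party software or drivers are installed on the system, and try uninstalling them. For more information, see How do I view the third-party software and drivers installed on an ECS instance?.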
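For any of the crash logs in this topic, it often helps to reduce the log to the sequence of function names before you compare it with known issues. The following sketch does that for a saved kernel log, assuming the bracketed-address format shown in the examples above. The function name trace_symbols is hypothetical. Entries that the kernel prefixed with "?" are kept with that marker, because they are frames the kernel could not verify.

```shell
#!/bin/sh
# Hypothetical helper: print the symbol names that follow a bracketed
# address such as "[<ffffffff8128f89c>] kref_get+0xc/0x30".
# This covers IP/RIP lines and Call Trace entries alike.
# Reads the file given as $1, or standard input if no file is given.
trace_symbols() {
    grep -oE '\[<[0-9a-f]+>\] (\? )?[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' "${1:--}" |
    sed -E 's/^\[<[0-9a-f]+>\] //; s/\+0x.*$//'
}
```

Because the pattern requires a symbol name, frames that show only a raw address (such as the 0xffffffffc0b0ceef entries in the paging-request log) are not printed. An empty or very short result can therefore itself be a hint that the faulting code has no resolvable symbol.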