Page Fault exceptions cause a system crash on ECS instances running Alibaba Cloud Linux 2
Problem description
On Alibaba Cloud Linux 2 instances that meet the following conditions, the system crashes while running.
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.
Kernel: kernel-4.19.91-23.al7 and earlier kernel versions.
The system crashes and prints the following call trace.
[ 332.057218] watchdog: BUG: soft lockup - CPU#7 stuck for 11s! [split_v2:28356]
[ 332.057219] mousedev isst_if_common hid_generic usbhid
[ 332.057223] CPU: 3 PID: 28336 Comm: split_v2 Kdump: loaded Not tainted 4.19.91-19.1.al7.x86_64 #1
[ 332.057507] Kernel panic - not syncing: softlockup: hung tasks
[ 332.057508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 332.057510] CPU: 6 PID: 28355 Comm: split_v2 Kdump: loaded Tainted: G L 4.19.91-19.1.al7.x86_64 #1
[ 332.057513] cp_new_stat+0x13d/0x160
[ 332.057514] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000019
[ 332.057515] Call Trace:
[ 332.057516] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[ 332.057518] __se_sys_newfstat+0x2e/0x40
[ 332.057518] Call Trace:
[ 332.057519] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057521] RBP: 00007eff1201bf10 R08: 00007eff1201c700 R09: 00007eff1201c700
[ 332.057523] do_syscall_64+0x5b/0x1b0
[ 332.057524] <IRQ>
[ 332.057525] RSP: 0018:ffffa389886efde8 EFLAGS: 00050206
[ 332.057529] dump_stack+0x66/0x8b
[ 332.057531] R10: 00007eff1201c9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057534] panic+0xd8/0x24c
[ 332.057535] RAX: 000000c000100090 RBX: ffffa389886efea8 RCX: 0000000000000090
[ 332.057536] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1201c700
[ 332.057539] __do_page_fault+0x11d/0x470
[ 332.057540] ? 0xffffffffc0477000
[ 332.057541] RDX: 0000000000000090 RSI: ffffa389886efdf8 RDI: 000000c000100000
[ 332.057552] watchdog_timer_fn+0x253/0x260
[ 332.057555] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057556] ? softlockup_fn+0x40/0x40
[ 332.057557] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057559] __hrtimer_run_queues+0xeb/0x250
[ 332.057560] R10: ffff8bfb1690a310 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04df00
[ 332.057562] hrtimer_interrupt+0x122/0x270
[ 332.057563] RIP: 0033:0x7eff1b11e3a4
[ 332.057564] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057566] smp_apic_timer_interrupt+0x6a/0x140
[ 332.057568] do_page_fault+0x32/0x140
[ 332.057570] apic_timer_interrupt+0xf/0x20
[ 332.057572] _copy_to_user+0x22/0x30
[ 332.057573] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057574] </IRQ>
[ 332.057575] RSP: 002b:00007eff1181aed8 EFLAGS: 00000246
[ 332.057578] RIP: 0010:__do_page_fault+0x227/0x470
[ 332.057579] ORIG_RAX: 0000000000000005
[ 332.057580] Code: 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 f6 85 91 00 00 00 02 41 bf 14 00 00 00 0f 84 c5 fe ff ff fb 66 0f 1f 44 00 00 <e9> b9 fe ff ff f6 85 88 00 00 00 03 75 0d f6 85 92 00 00 00 04 0f
[ 332.057582] cp_new_stat+0x13d/0x160
[ 332.057583] RSP: 0018:ffffa389886f7ca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 332.057585] __se_sys_newfstat+0x2e/0x40
[ 332.057586] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffffffff93a00ae0
[ 332.057587] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057588] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffa389886f7d38
[ 332.057589] do_syscall_64+0x5b/0x1b0
[ 332.057590] RBP: ffffa389886f7d38 R08: 0000000000000000 R09: 0000000000000000
[ 332.057591] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000009
[ 332.057592] R10: 0000000000000000 R11: 0000000000000000 R12: 000000c000100000
[ 332.057594] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057595] R13: ffff8bfb168bd940 R14: ffff8bfaee04af80 R15: 0000000000000014
[ 332.057597] RIP: 0033:0x7eff1b11e3a4
[ 332.057599] async_page_fault+0x1e/0x30
[ 332.057601] ? restore_regs_and_return_to_kernel+0x25/0x25
[ 332.057602] RBP: 00007eff1181af10 R08: 00007eff1181b700 R09: 00007eff1181b700
[ 332.057602] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057604] do_page_fault+0x32/0x140
[ 332.057606] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057607] R10: 00007eff1181b9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057608] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057609] async_page_fault+0x1e/0x30
[ 332.057610] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1181b700
[ 332.057612] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057613] RSP: 002b:00007eff08808ed8 EFLAGS: 00000246
[ 332.057614] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057615] ORIG_RAX: 0000000000000005
[ 332.057616] RSP: 0018:ffffa389886f7de8 EFLAGS: 00050206
[ 332.057617] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057618] RAX: 000000c000100090 RBX: ffffa389886f7ea8 RCX: 0000000000000090
[ 332.057619] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000024
[ 332.057620] RDX: 0000000000000090 RSI: ffffa389886f7df8 RDI: 000000c000100000
[ 332.057621] RSP: 0018:ffffa389886ffde8 EFLAGS: 00050206
[ 332.057623] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057624] RBP: 00007eff08808f10 R08: 00007eff08809700 R09: 00007eff08809700
[ 332.057625] R10: ffff8bfb1690b810 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04af80
[ 332.057626] R10: 00007eff088099d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057627] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057628] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08809700
[ 332.057630] _copy_to_user+0x22/0x30
[ 332.057631] RAX: 000000c000100090 RBX: ffffa389886ffea8 RCX: 0000000000000090
[ 332.057632] cp_new_stat+0x13d/0x160
[ 332.057633] RDX: 0000000000000090 RSI: ffffa389886ffdf8 RDI: 000000c000100000
[ 332.057634] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057635] __se_sys_newfstat+0x2e/0x40
[ 332.057636] R10: ffff8bfb1690ad10 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee048000
[ 332.057637] do_syscall_64+0x5b/0x1b0
[ 332.057638] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057640] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057642] _copy_to_user+0x22/0x30
[ 332.057643] RIP: 0033:0x7eff1b11e3a4
[ 332.057645] cp_new_stat+0x13d/0x160
[ 332.057646] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057647] __se_sys_newfstat+0x2e/0x40
[ 332.057648] RSP: 002b:00007eff08007ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057651] do_syscall_64+0x5b/0x1b0
[ 332.057652] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057654] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057655] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000002e
[ 332.057656] RIP: 0033:0x7eff1b11e3a4
[ 332.057657] RBP: 00007eff08007f10 R08: 00007eff08008700 R09: 00007eff08008700
[ 332.057658] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057659] R10: 00007eff080089d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057660] RSP: 002b:00007eff07806ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057662] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08008700
[ 332.057663] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057663] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000001e
[ 332.057664] RBP: 00007eff07806f10 R08: 00007eff07807700 R09: 00007eff07807700
[ 332.057665] R10: 00007eff078079d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057665] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff07807700
Cause
Alibaba Cloud Linux enables THP (Transparent Huge Pages) by default. During garbage collection (GC) memory reclamation, a process calls MADV_NOHUGEPAGE to disable huge pages and then calls MADV_FREE to release some 4K pages, which makes the operating system split the THP huge pages. If other processes are handling Page Fault exceptions in the kernel and occupy the CPUs, the process that is splitting the huge pages is not scheduled and cannot finish. Because the split cannot finish, the processes in Page Fault cannot finish either. Each side waits for the other indefinitely, which eventually results in a soft lockup. If /proc/sys/kernel/softlockup_panic is enabled on the Alibaba Cloud Linux instance, the soft lockup triggers a kernel panic.
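The two settings named in this explanation can be inspected on a running instance. The following is a minimal sketch, assuming the standard Linux paths /sys/kernel/mm/transparent_hugepage/enabled and /proc/sys/kernel/softlockup_panic are present; the helper name thp_mode is hypothetical and only used for illustration.

```shell
#!/bin/sh
# thp_mode (hypothetical helper): extract the active THP mode from the
# contents of the "enabled" file. The kernel marks the selected value
# with square brackets, e.g. "[always] madvise never".
thp_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the current THP mode if the file is readable on this system.
thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "THP mode: $(thp_mode "$(cat "$thp_file")")"
fi

# Report whether a soft lockup escalates to a kernel panic (1) or not (0).
panic_file=/proc/sys/kernel/softlockup_panic
if [ -r "$panic_file" ]; then
    echo "softlockup_panic: $(cat "$panic_file")"
fi
```

A THP mode of always or madvise together with softlockup_panic set to 1 matches the conditions described above. This check only reads the settings and does not change them.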
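Solution
Before you perform risky operations such as modifying or changing instances or data, check the disaster recovery and fault tolerance capabilities of the instances to keep your data safe.
Before you modify the configuration or data of instances (including but not limited to ECS and RDS instances), we recommend that you create snapshots or enable features such as RDS log backup.
If you have granted permissions on or submitted security information such as logon accounts and passwords on the Alibaba Cloud platform, we recommend that you change them promptly.
If you encounter this problem, handle it as follows:
Log on to the ECS instance. For details, see Connection methods overview.
Run the following command to confirm that the kernel version of the system is covered by this solution.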
uname -r
Output similar to the following is displayed.
4.19.91-19.1.al7.x86_64
Choose the fix that matches the kernel version of the system:
For versions earlier than 4.19.91-19.1.al7.x86_64 (exclusive):
Run the following command to update the operating system to the latest kernel version.
yum update kernel
The updated kernel takes effect only after a restart. Run the following command to restart the server.
reboot
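If the problem persists on the latest kernel version, install the kernel hotfix as described in the following step.
For versions from 4.19.91-19.1.al7.x86_64 (inclusive) to 4.19.91-23.al7.x86_64 (inclusive), you can fix the problem by installing a kernel hotfix. Run the following command to install it.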
yum install -y kernel-hotfix-5902278-`uname -r | awk -F"-" '{print $NF}'`
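The version ranges above, and the package name that the hotfix command builds from `uname -r`, can be sketched as two small shell functions. This is an illustration under stated assumptions, not an official tool: the function names kernel_fix_action and hotfix_pkg are hypothetical, and the classification looks only at the build number after the last hyphen, so it is meaningful only for Alibaba Cloud Linux 2 4.19.91 kernels.

```shell
#!/bin/sh
# kernel_fix_action (hypothetical helper): classify a release string as
# printed by `uname -r` against the ranges in this article.
# Prints "update" (earlier than 19.1), "hotfix" (19.1 through 23),
# or "outside" (later than 23, not in the affected range stated here).
kernel_fix_action() {
    printf '%s\n' "$1" | awk -F'-' '{
        # $NF is the build part, e.g. "19.1.al7.x86_64"
        n = split($NF, part, ".")
        major = part[1] + 0
        # treat the second component as a minor number only if it is numeric
        minor = (n > 1 && part[2] ~ /^[0-9]+$/) ? part[2] + 0 : 0
        key = major * 1000 + minor
        if (key < 19001)       print "update"
        else if (key <= 23000) print "hotfix"
        else                   print "outside"
    }'
}

# hotfix_pkg (hypothetical helper): show the package name that the install
# command above expands to for a given release string.
hotfix_pkg() {
    printf 'kernel-hotfix-5902278-%s\n' \
        "$(printf '%s\n' "$1" | awk -F'-' '{print $NF}')"
}

kernel_fix_action 4.19.91-19.1.al7.x86_64   # prints "hotfix"
hotfix_pkg 4.19.91-19.1.al7.x86_64          # prints "kernel-hotfix-5902278-19.1.al7.x86_64"
```

On an instance, passing "$(uname -r)" to either function shows which branch of the procedure applies and which hotfix package yum would be asked to install.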
Applies to
Elastic Compute Service (ECS)