The Road to Cloud Computing - On Alibaba Cloud: The "Black 10 Seconds" Caused by a Linux Kernel Bug


A picture is worth a thousand words, so first look at the changelog of Linux 3.2.0-39.62 in the figure below:

Screenshot from: https://launchpad.net/ubuntu/+source/linux/3.2.0-39.62

Linux 3.2.0-39.62 was released on February 27, 2013 (we migrated to Alibaba Cloud on March 9).

For details of the "black 10 seconds" problem we encountered, see: The Road to Cloud Computing - On Alibaba Cloud: The Extremely Strange "Black 10 Seconds".

We had originally planned to dig through the kernel source to prove that the "black 10 seconds" was caused by a Xen problem; now there is no need. It is a bug in the Linux kernel's Xen paravirtualization spinlock implementation, and Linux 3.2.0-39.62 has already fixed it.

We found the answer while reading, one by one, the replies to this bug report: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types.

The symptoms described in that report are strikingly similar to what we encountered (we even hit the virtual machine clock-jump problem mentioned in the replies). The report was filed by an Amazon engineer on June 11, 2012. From the exchanges between the Amazon and Canonical engineers in the replies, you can see how seriously they took the problem; it was their persistence that finally got this Linux bug fixed.

Some friends have questioned whether we are neglecting our real work by spending time investigating Alibaba Cloud issues.

Our view is:

First, Alibaba Cloud runs on Linux + Xen, which comes from the open source community, not from Alibaba Cloud itself;

Second, only one person on our team spends time on the Alibaba Cloud issue, so our real work has not been affected;

Most importantly, Alibaba Cloud has a great many users. If we hit a problem like this and did not track down the real cause, other users might go through the same painful ordeal we did. That tormenting experience is seared into our memory, and we do not want anyone else to go through it again. That is the most important reason we kept at it!

Key excerpts about this bug

1. Reply #65: From my tests it seems that the problem in the Xen paravirt spinlock implementation is the fact that they re-enable interrupts (xen upcall event channel for that vcpu) during the hypercall to poll for the spinlock irq.

2. At the time, a code change to the xen_spin_lock_slow() part of spinlock.c resolved the problem (see the sketch after this list): https://launchpadlibrarian.net/124276305/0001-xen-pv-spinlock-Never-enable-interrupts-in-xen_spin_.patch

3. Reply #79: After finally having a breakthrough in understanding the source of the lockup and further discussions upstream, the proper [fix] turns out to be to change the way waiters are woken when a spinlock gets freed.

4. Reply #86: There is currently a Precise kernel in proposed that will contain the first approach on fixing this (which is not to enable interrupts during the hv call). This should get replaced by the upstream fix (which is to wake up all spinners and not only the first found).
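The following is a minimal, paraphrased sketch of the lock slow path in arch/x86/xen/spinlock.c from the 3.2-era kernel, showing where interrupts were re-enabled and what the interim fix (the patch linked in item 2) removes. It is simplified for illustration and is not the verbatim kernel source or patch:

/*
 * Paraphrased sketch of the pv-spinlock slow path (arch/x86/xen/spinlock.c,
 * Linux 3.2 era).  Simplified for illustration only.
 */
static noinline int xen_spin_lock_slow(struct arch_spinlock *lock,
				       bool irq_enable)
{
	int irq = __this_cpu_read(lock_kicker_irq);  /* per-vCPU "kicker" IRQ */
	unsigned long flags;

	do {
		xen_clear_irq_pending(irq);          /* arm the kicker IRQ */

		if (xen_spin_trylock(lock))          /* freed while we got here? */
			return 1;

		flags = arch_local_save_flags();
		if (irq_enable) {
			/*
			 * Problematic part: re-enabling interrupts lets a
			 * nested interrupt handler run while this vCPU is
			 * parked in the poll_irq hypercall.  If that handler
			 * busy-waits (e.g. try_to_wake_up spinning on
			 * ->on_cpu), the vCPU never returns to take and
			 * release the lock, and the unlock kick it received
			 * helps nobody.  The interim Ubuntu fix deletes this
			 * raw_local_irq_enable() so the poll always runs
			 * with interrupts off.
			 */
			raw_local_irq_enable();
		}

		xen_poll_irq(irq);                   /* block until kicked */

		raw_local_irq_restore(flags);
	} while (!xen_test_irq_pending(irq));

	return 0;
}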

Analysis of how the bug occurs

From the Patchwork submission [25/58] xen: Send spinlock IPI to all waiters (a sketch of the corresponding fix follows the steps below):

1. CPU n tries to schedule task x away and goes into a slow wait for the runq lock of CPU n-# (must be one with a lower number).

2. CPU n-#, while processing softirqs, tries to balance domains and goes into a slow wait for its own runq lock (for updating some records). Since this is a spin_lock_irqsave in softirq context, interrupts will be re-enabled for the duration of the poll_irq hypercall used by Xen.

3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives an interrupt (e.g. endio) and when processing the interrupt, tries to wake up task x. But that is in schedule and still on_cpu, so try_to_wake_up goes into a tight loop.

4. The runq lock of CPU n-# gets unlocked, but the message only gets sent to the first waiter, which is CPU n-# and that is busily stuck.

5. CPU n-# never returns from the nested interruption to take and release the lock because the scheduler uses a busy wait. And CPU n never finishes the task migration because the unlock notification only went to CPU n-#.
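Based on the commit title and the scenario above, here is a paraphrased sketch of the unlock slow path and of what the upstream fix changes. Again this is simplified from the 3.2-era arch/x86/xen/spinlock.c for illustration, not the verbatim upstream diff:

/*
 * Paraphrased sketch of the pv-spinlock unlock slow path
 * (arch/x86/xen/spinlock.c, Linux 3.2 era).  Simplified for illustration.
 */
static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
	int cpu;

	for_each_online_cpu(cpu) {
		/* lock_spinners: per-vCPU pointer to the lock that vCPU is
		 * currently slow-waiting on */
		if (per_cpu(lock_spinners, cpu) == xl) {
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
			/*
			 * Before the fix there was a "break;" here, so only
			 * the first spinner found was kicked.  If that vCPU
			 * was stuck in a nested interrupt handler (steps 3-5
			 * above), the kick was wasted and every other waiter,
			 * including CPU n, hung forever.  The upstream fix
			 * removes the break and kicks all waiters.
			 */
		}
	}
}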

Related links

Strange PVM spinlock case revisited

