Linux soft lockup and hard lockup

Reposted article. 2019-07-16 14:43. Category: Linux debugging

I. Overview

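soft lockup: detects scheduling anomalies. The usual cause is a driver that disables scheduling or spins, for example in a while(1) loop, so that no other thread can be scheduled. Note that a while(1) in a user-space application does not block scheduling: as soon as a higher-priority task becomes runnable, the scheduler tick (10 ms with HZ=100) selects it and switches to it. In a driver, that is, in kernel mode on a kernel built without kernel preemption, things are different. Even if a higher-priority task is runnable, the tick does not switch threads. It only sets the TIF_NEED_RESCHED flag on the current thread, and the driver keeps running. In short, a while(1) in a driver monopolizes the CPU, and the CPU is not given to any other process or thread. Only an interrupt can break in.

The soft lockup detector relies on exactly this: it uses an interrupt and checks, from interrupt context, whether scheduling is blocked. How? It first creates a kernel thread (watchdog/xx) with real-time SCHED_FIFO priority whose only job is to write the current timestamp into a variable (watchdog_touch_ts). This thread is at the front of the queue in every scheduling round. If even this thread cannot run, ordinary threads have no chance. The timer interrupt function watchdog_timer_fn() reads the timestamp stored in watchdog_touch_ts and compares it with the current time. If the difference is greater than 20 seconds, scheduling is blocked; otherwise scheduling is working.

It is worth stressing that "more than 20 seconds" means the watchdog/xx thread did not run at all for 20 seconds. The cause may be a single driver spinning in while(1), several tasks each blocking scheduling for a few seconds and adding up to more than 20 seconds, or even interrupt handlers doing time-consuming work. In every case, the rest of the system held the CPU for more than 20 seconds in total, so watchdog/xx was not scheduled. This does not necessarily indicate a bug. Suppose an extreme system runs 100 tasks and each holds the CPU for 0.3 seconds without yielding: one pass over all of them takes 30 seconds, and the warning is expected. In that case, raise the default 20-second threshold, for example to 40 seconds. For the vast majority of systems, 20 seconds is enough for every task to run many times over.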
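To make the timestamp comparison concrete, here is a minimal userspace sketch of the idea, not kernel code. The struct and the function names (softlockup_model, model_touch, model_check) are hypothetical and exist only for illustration; time is a plain count of seconds supplied by the caller.

```c
#include <stdint.h>

/* Userspace model of the per-CPU soft lockup state; not kernel code. */
struct softlockup_model {
    uint64_t touch_ts;  /* last time the watchdog thread ran, in seconds */
    uint64_t thresh;    /* allowed gap in seconds (20 in this article) */
};

/* What the high-priority watchdog thread does whenever it gets the CPU. */
static void model_touch(struct softlockup_model *m, uint64_t now)
{
    m->touch_ts = now;
}

/*
 * What the timer interrupt does: return the stall length in seconds if the
 * thread has not run for more than `thresh` seconds, otherwise return 0.
 */
static uint64_t model_check(const struct softlockup_model *m, uint64_t now)
{
    if (now > m->touch_ts + m->thresh)
        return now - m->touch_ts;
    return 0;
}
```

With thresh set to 20 and the last touch at second 100, model_check() stays at 0 up to second 120 and reports a 21-second stall at second 121. A later model_touch() clears the condition.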

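hard lockup: detects interrupt anomalies. The usual cause is that interrupts are disabled, or that some interrupt handler blocks, so other interrupts cannot be serviced. Interrupts are essential to keeping the system running; when they fail, the system is out of control. That raises a question: soft lockup uses an interrupt to monitor tasks, so who monitors the interrupts? The answer is an NMI (non-maskable interrupt) or an FIQ (ordinary system interrupts are usually IRQs). Either way, it must have higher priority than a normal interrupt. The mechanism is simple. The timer function watchdog_timer_fn() described above not only checks for soft lockup but also increments hrtimer_interrupts. The NMI handler watchdog_overflow_callback() compares the value saved on its previous run, hrtimer_interrupts_saved, with the current hrtimer_interrupts. If the two are equal, the timer interrupt function watchdog_timer_fn() has not run.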

II. Source code analysis

1. soft lockup:

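A kernel thread, watchdog/xx, is created for each CPU, and all it does is update the timestamp (__touch_watchdog()). The kernel runs this thread inside a big while(1) loop, and there is no need to update the timestamp constantly, so an extra check function, watchdog_should_run(), decides when the thread body runs. Changing the thread function to sleep with msleep(4000) would also work.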
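The gating can be pictured with a small userspace model. It assumes, based on my reading of 4.x-era kernel/watchdog.c, that watchdog_should_run() compares hrtimer_interrupts with a second per-CPU counter, soft_lockup_hrtimer_cnt, that records the value seen on the thread's last run. Check this against your own kernel tree. The struct and the model_* names are hypothetical.

```c
#include <stdbool.h>

/* Userspace model of one CPU's watchdog counters; not kernel code. */
struct cpu_model {
    unsigned long hrtimer_interrupts;      /* bumped by the timer interrupt */
    unsigned long soft_lockup_hrtimer_cnt; /* value seen on the thread's last run (assumed name) */
    unsigned long touch_ts;                /* timestamp written by the thread */
};

/* Timer interrupt: count one more tick. */
static void model_timer_tick(struct cpu_model *c)
{
    c->hrtimer_interrupts++;
}

/* The thread has work only if a tick happened since its last run. */
static bool model_should_run(const struct cpu_model *c)
{
    return c->hrtimer_interrupts != c->soft_lockup_hrtimer_cnt;
}

/* Thread body: remember which tick was handled, then update the timestamp. */
static void model_thread_run(struct cpu_model *c, unsigned long now)
{
    c->soft_lockup_hrtimer_cnt = c->hrtimer_interrupts;
    c->touch_ts = now;
}
```

In this model the thread does one unit of work per timer tick and sleeps otherwise, which is why the timer handler has to wake it after bumping the counter.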

How do we tell that the thread has gone too long without being scheduled?

static int is_softlockup(unsigned long touch_ts) //watchdog_touch_ts
{
    unsigned long now = get_timestamp();

    if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) {
        /* Warn about unreasonable delays. get_softlockup_thresh() = 20 seconds */
        if (time_after(now, touch_ts + get_softlockup_thresh()))
            return now - touch_ts;
    }
    return 0;
}

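now is the current time, and the parameter is the timestamp the thread last wrote. get_softlockup_thresh() is 20 seconds, so if the condition is true, the thread has gone more than 20 seconds without updating its timestamp and something is wrong. Where is this checked? In the timer interrupt function watchdog_timer_fn(). This timer is also per-CPU, so it is set up in the thread's initialization function, .setup = watchdog_enable, which also makes the thread a SCHED_FIFO real-time task.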

static void watchdog_enable(unsigned int cpu)
{
    struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

    /* kick off the timer for the hardlockup detector */
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer->function = watchdog_timer_fn;

    /* Enable the perf event */
    watchdog_nmi_enable(cpu);

    /* done here because hrtimer_start can only pin to smp_processor_id() */
    hrtimer_start(hrtimer, ns_to_ktime(sample_period),
              HRTIMER_MODE_REL_PINNED);

    /* initialize timestamp */
    watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
    __touch_watchdog();
}
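The 20-second threshold and the 4-second timer period used in this article are both derived from one tunable. The sketch below reproduces that arithmetic in userspace, assuming the 4.x-era formulas as I recall them: the soft lockup threshold is twice watchdog_thresh, and sample_period is one fifth of the threshold. The model_* names and NSEC_PER_SEC_MODEL are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC_MODEL 1000000000ULL

/* Soft lockup threshold in seconds: assumed to be 2 * watchdog_thresh. */
static unsigned int model_softlockup_thresh(unsigned int watchdog_thresh)
{
    return watchdog_thresh * 2;
}

/* hrtimer period in nanoseconds: assumed to be one fifth of the threshold. */
static uint64_t model_sample_period_ns(unsigned int watchdog_thresh)
{
    return model_softlockup_thresh(watchdog_thresh) * (NSEC_PER_SEC_MODEL / 5);
}
```

With the default watchdog_thresh of 10, this gives a 20-second threshold and a 4-second timer period. Doubling watchdog_thresh to 20 gives the 40-second threshold suggested earlier and an 8-second period.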

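On every timer interrupt, the handler checks whether the thread has timed out and, if so, prints the log "BUG: soft lockup - CPU#%d stuck for %us!". Also, the thread does not run continuously: it runs only when hrtimer_interrupts has been updated and otherwise schedules out, so the handler must wake the thread after the update.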

static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
    unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
    struct pt_regs *regs = get_irq_regs();
    int duration;
    int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;

    /* kick the hardlockup detector hrtimer_interrupts++ */
    watchdog_interrupt_count();

    /* kick the softlockup detector */
    wake_up_process(__this_cpu_read(softlockup_watchdog));

    ...............
    duration = is_softlockup(touch_ts);
}

2. hard lockup:

hard lockup is simpler: a non-maskable interrupt handler, watchdog_overflow_callback(), is registered with a period of 10 seconds, and it checks for a timeout:

static int is_hardlockup(void)
{
    unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

    if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
        return 1;

    __this_cpu_write(hrtimer_interrupts_saved, hrint);
    return 0;
}

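Because the NMI fires every 10 seconds while the timer interrupt fires every 4 seconds, hrtimer_interrupts must differ from hrtimer_interrupts_saved on every check, after which hrtimer_interrupts is copied into hrtimer_interrupts_saved. If the two are ever equal, the timer interrupt did not run at all during those 10 seconds, and "Watchdog detected hard LOCKUP on cpu %d" is printed.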
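The timing argument can be checked with a small userspace simulation. It is only a sketch: it assumes a 4-second timer and a 10-second NMI on a shared one-second clock, and the model_* names are hypothetical. model_is_hardlockup() mirrors the comparison in is_hardlockup() above.

```c
#include <stdbool.h>

/* Userspace model of the two per-CPU counters; not kernel code. */
struct hardlockup_model {
    unsigned long hrtimer_interrupts;
    unsigned long hrtimer_interrupts_saved;
};

/* Same comparison as is_hardlockup(): an unchanged counter means the timer is dead. */
static bool model_is_hardlockup(struct hardlockup_model *m)
{
    if (m->hrtimer_interrupts_saved == m->hrtimer_interrupts)
        return true;
    m->hrtimer_interrupts_saved = m->hrtimer_interrupts;
    return false;
}

/*
 * Simulate `seconds` of wall time. The timer fires every 4 s until
 * "interrupts are disabled" at second irq_off_at; the NMI fires every 10 s.
 * Returns the second at which the NMI reports a hard lockup, or -1.
 */
static int model_run(int seconds, int irq_off_at)
{
    struct hardlockup_model m = { 0, 0 };

    for (int t = 1; t <= seconds; t++) {
        if (t % 4 == 0 && t < irq_off_at)
            m.hrtimer_interrupts++;
        if (t % 10 == 0 && model_is_hardlockup(&m))
            return t;
    }
    return -1;
}
```

While the timer keeps firing, no NMI ever sees an unchanged counter. If interrupts stop at second 13, the timer still fired once after the NMI at second 10, so the NMI at second 20 passes and the lockup is reported at second 30. In this model, detection can take up to two NMI periods.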

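Whether a trigger only prints a warning or panics the system depends on the user's configuration. As noted earlier, a trigger can be normal behavior, and in that case the user should change the timeout threshold. I prefer panic() myself: find the problem early, fix it early.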
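To see how a given machine is configured, the relevant sysctl files can be read directly. This is a small userspace sketch with hypothetical helper names (read_int_file, print_lockup_knobs). It assumes the knobs appear under /proc/sys/kernel as watchdog_thresh, softlockup_panic and hardlockup_panic; which of them exist depends on the kernel version and build options.

```c
#include <stdio.h>

/* Read one integer from a sysctl-style file; returns 0 on success, -1 on failure. */
static int read_int_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the lockup-related knobs that are present on this machine. */
static void print_lockup_knobs(void)
{
    static const char *const knobs[] = {
        "/proc/sys/kernel/watchdog_thresh",
        "/proc/sys/kernel/softlockup_panic",
        "/proc/sys/kernel/hardlockup_panic",
    };

    for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
        long v;

        if (read_int_file(knobs[i], &v) == 0)
            printf("%s = %ld\n", knobs[i], v);
        else
            printf("%s: not available\n", knobs[i]);
    }
}
```

Calling print_lockup_knobs() from a small main() shows the current threshold and whether each detector is set to panic. Files that are missing on the running kernel are reported as not available rather than treated as errors.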

III. Test code

A soft lockup test case. For SMP, create several threads and bind each one to a CPU.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/smp.h>

static struct task_struct *test_task0;

/* Busy-wait in kernel mode on CPU0 so the watchdog thread there cannot run. */
static int test_thread0(void *data)
{
    int cnt = 0;

    pr_info("I will block CPU0\n");
    msleep(3000);
    pr_info("block CPU0\n");
    /* mdelay() spins without sleeping, so this loop never voluntarily yields. */
    while (!kthread_should_stop()) {
        mdelay(1000);
        pr_info("[CPU%d]block CPU %ds\n", raw_smp_processor_id(), ++cnt);
    }

    return 0;
}

static int __init test_init(void)
{
    pr_info("Vedic init.....\n");

    test_task0 = kthread_create(test_thread0, NULL, "test_thread0");
    if (IS_ERR(test_task0)) {
        pr_err("test_task0 fail\n");
        return PTR_ERR(test_task0);
    }
    /* Pin the thread to CPU0 before it starts running. */
    kthread_bind(test_task0, 0);
    wake_up_process(test_task0);

    return 0;
}

static void __exit test_exit(void)
{
    /* Stop the thread so it does not keep executing unloaded module code. */
    kthread_stop(test_task0);
    pr_info("Vedic exit......\n");
}

module_init(test_init);
module_exit(test_exit);

MODULE_LICENSE("Dual BSD/GPL");
MODULE_AUTHOR("Vedic <FZKmxcz@163.com>");

IV. Miscellaneous

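a. The test case above triggers a soft lockup panic on my single-core system but not on a dual-core one; even with two threads bound to CPU0 and CPU1 respectively, everything ran normally. I suspect this is related to the system's load balancing, that is, a thread bound to a CPU may not stay on that CPU. Note that kthread_bind() does pin the thread to the given CPU, so the kernel preemption configuration is another factor worth checking: on a preemptible kernel, the watchdog thread can preempt a busy loop that does not disable preemption.

b. Whether the soft or the hard detector fires, the thread it prints as current is not necessarily the culprit. The report only shows that the system has a problem, and further investigation is needed.

c. hard lockup depends on the processor providing an NMI; without one it cannot be implemented.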
