A Brief Analysis of the Android Watchdog Source Code (Based on Android 6.0.1)


1. Watchdog Overview

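To keep the system highly available, Android provides a Watchdog that monitors the health of key system services. If a key service deadlocks, Watchdog restarts SystemServer. It also receives reboot requests from inside the system and reboots the device.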

In short, Watchdog has two main functions:

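  1. Receive reboot requests from inside the system and reboot the device;
  2. Monitor key system services and restart SystemServer if one of them deadlocks.
    The monitored services, each of which must implement the Watchdog.Monitor interface: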
    ActivityManagerService
    InputManagerService
    MountService
    NativeDaemonConnector
    NetworkManagementService
    PowerManagerService
    WindowManagerService
    MediaRouterService
    MediaProjectionManagerService
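
To make the Monitor contract concrete, here is a minimal sketch that runs outside Android. `Monitor` stands in for Watchdog.Monitor; `DemoService` and `MonitorRegistry` are hypothetical names invented for this illustration, not framework classes. In the framework itself, a service registers with the watchdog through `Watchdog.getInstance().addMonitor(this)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for Watchdog.Monitor: a single callback that must return promptly. */
interface Monitor {
    void monitor();
}

/** Hypothetical service (not a real framework class) guarded by its own lock. */
class DemoService implements Monitor {
    private int state; // guarded by "this"

    synchronized void doWork() {
        state++;
    }

    synchronized int getState() {
        return state;
    }

    /** Returns only once the service's lock can be acquired. */
    @Override
    public void monitor() {
        synchronized (this) { }
    }
}

/** Hypothetical registry standing in for the watchdog's list of monitors. */
class MonitorRegistry {
    private final List<Monitor> monitors = new ArrayList<>();

    void addMonitor(Monitor m) {
        monitors.add(m);
    }

    /** Calls every monitor in turn and returns how many were checked. */
    int checkAll() {
        for (Monitor m : monitors) {
            m.monitor();
        }
        return monitors.size();
    }
}

class MonitorDemo {
    public static void main(String[] args) {
        MonitorRegistry registry = new MonitorRegistry();
        DemoService service = new DemoService();
        registry.addMonitor(service);
        service.doWork();
        System.out.println("monitors checked: " + registry.checkAll());
    }
}
```

The point of the empty synchronized block is that `monitor()` costs almost nothing when the service is healthy, but blocks indefinitely when some other thread never releases the service lock.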

2. Watchdog in Detail

Watchdog in one diagram
![Watchdog in one diagram](http://images2015.cnblogs.com/blog/632312/201611/632312-20161130115015006-1586848184.png)

Watchdog is started from startOtherServices while SystemServer is starting up. It is a singleton that extends Thread, so it actually runs inside the SystemServer process.
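
The "singleton that extends Thread" shape can be sketched in a few lines. `MiniWatchdog` is a hypothetical miniature written for this article, not the framework class. It only illustrates why the watchdog lives in whichever process calls `getInstance()` and `start()`.

```java
/** Hypothetical miniature of a singleton watchdog thread. */
class MiniWatchdog extends Thread {
    private static MiniWatchdog sInstance;

    private MiniWatchdog() {
        super("watchdog"); // thread name
    }

    /** Lazily creates the single instance; later calls return the same object. */
    static synchronized MiniWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new MiniWatchdog();
        }
        return sInstance;
    }

    @Override
    public void run() {
        // A real watchdog would loop over its monitors here; this sketch only
        // reports which thread it runs on.
        System.out.println("running on thread: " + Thread.currentThread().getName());
    }
}

class SingletonDemo {
    public static void main(String[] args) throws InterruptedException {
        MiniWatchdog w = MiniWatchdog.getInstance();
        w.start(); // the thread belongs to the process that started it
        w.join();
    }
}
```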

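Once started, Watchdog's run() loop checks every 30 s whether any monitored service has deadlocked. The check is triggered through hc.scheduleCheckLocked(), which in turn calls monitor() on each monitored object. We take ActivityManagerService as an example below.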

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

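Because the critical sections of the service are guarded by synchronized (this), a monitor() call that still has not acquired the lock after two consecutive 30 s checks means the process is treated as deadlocked. Watchdog then kills the SystemServer process (Watchdog runs inside SystemServer, so myPid() in Process.killProcess(Process.myPid()) is the SystemServer PID).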
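
The detection idea can be tried outside Android with a small sketch. Everything here is hypothetical (`LockChecker`, `DeadlockDemo`, the millisecond timeouts), and it simplifies the real mechanism: a single-thread executor stands in for the handler thread, and the caller polls a flag instead of using wait().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical checker: runs a lock probe on a worker thread and records completion. */
class LockChecker {
    private final Object monitoredLock;
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "checker");
        t.setDaemon(true); // a probe that blocks forever must not keep the JVM alive
        return t;
    });
    private volatile boolean completed = true;

    LockChecker(Object monitoredLock) {
        this.monitoredLock = monitoredLock;
    }

    /** Posts the probe; it completes only if the monitored lock can be acquired. */
    void scheduleCheck() {
        completed = false;
        worker.execute(() -> {
            synchronized (monitoredLock) { }
            completed = true;
        });
    }

    boolean isCompleted() {
        return completed;
    }
}

class DeadlockDemo {
    /** Schedules a probe and returns true if it finished within timeoutMs. */
    static boolean probe(LockChecker checker, long timeoutMs) {
        checker.scheduleCheck();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!checker.isCompleted() && System.nanoTime() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return checker.isCompleted();
    }

    /** Starts a daemon thread that grabs the lock and never releases it. */
    static void holdLockForever(Object lock) {
        CountDownLatch held = new CountDownLatch(1);
        Thread hog = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) { }
            }
        }, "hog");
        hog.setDaemon(true);
        hog.start();
        try {
            held.await(); // return only after the lock is really held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Object lock = new Object();
        LockChecker checker = new LockChecker(lock);
        System.out.println("lock free, probe ok: " + probe(checker, 500));
        holdLockForever(lock);
        System.out.println("lock held, probe ok: " + probe(checker, 500));
    }
}
```

The probe runs on its own thread so that the watching thread never blocks on the monitored lock itself. It only observes whether the probe came back in time.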

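After the SystemServer process is killed, Zygote dies as well (in com_android_internal_os_Zygote.cpp, the signal handler that receives SIGCHLD kills the Zygote process). Finally, the init process detects that zygote has died and restarts Zygote and SystemServer (because onrestart is configured in init.rc, the service carries the SVC_RESTARTING flag, and init.cpp reaches restart_processes()).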

Now let's walk through this flow in the code:

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // 1. Schedule a check for every monitored service
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // 2. Wait for the timeout, 30 s by default
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // 3. Evaluate the waitState after the check. COMPLETED, WAITING and WAITED_HALF end this
                // iteration and move on to the next one; OVERDUE falls through to the overdue handling
                // (log, then kill the process).
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // 4. OVERDUE: run the overdue handling (log, then kill the process).
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // 5. Dump the stack traces of the collected pids through ActivityManagerService
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // 6. Dump the kernel stack traces
            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // 7. Have the kernel write all blocked threads and the backtraces of all CPUs to the kernel log
            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // 8. Try to write the error to the dropbox on a separate thread: AMS itself may be
            // deadlocked, and a direct call would then hang the watchdog as well
            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            // 9. Ask the ActivityController, via systemNotResponding(subject), how to proceed: 1 = keep waiting, -1 = kill system
            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                // 10. Log the stack traces of the blocked checker threads
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // 11. Kill the process
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
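
The listing calls evaluateCheckerCompletionLocked() without showing how the four states are derived. The sketch below is one plausible reading, not the framework code: `CheckerState` is a hypothetical class, and the numeric constant values and the 60 s maximum wait (two 30 s intervals) are assumptions for illustration. Each checker reports a state from how long its probe has been outstanding, and the overall result is the worst state among all checkers.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of one checker's progress. */
class CheckerState {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    final boolean completed; // has the probe returned?
    final long startTimeMs;  // when the probe was scheduled
    final long waitMaxMs;    // maximum allowed wait (assumed 60 000 ms in the demo)

    CheckerState(boolean completed, long startTimeMs, long waitMaxMs) {
        this.completed = completed;
        this.startTimeMs = startTimeMs;
        this.waitMaxMs = waitMaxMs;
    }

    /** Classifies this checker by how long its probe has been outstanding. */
    int completionState(long nowMs) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMs - startTimeMs;
        if (latency < waitMaxMs / 2) {
            return WAITING;
        } else if (latency < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst state reported by any checker. */
    static int evaluate(List<CheckerState> checkers, long nowMs) {
        int state = COMPLETED;
        for (CheckerState c : checkers) {
            state = Math.max(state, c.completionState(nowMs));
        }
        return state;
    }
}

class CheckerStateDemo {
    public static void main(String[] args) {
        List<CheckerState> checkers = Arrays.asList(
                new CheckerState(true, 0, 60_000),    // already returned
                new CheckerState(false, 0, 60_000));  // still blocked
        System.out.println(CheckerState.evaluate(checkers, 10_000)); // WAITING
        System.out.println(CheckerState.evaluate(checkers, 30_000)); // WAITED_HALF
        System.out.println(CheckerState.evaluate(checkers, 60_000)); // OVERDUE
    }
}
```

Under these assumptions, the first 30 s wake-up with a blocked checker yields WAITED_HALF (stack traces are dumped once), and the second yields OVERDUE, which matches the "two consecutive 30 s checks" described earlier.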

