Android 系統奔潰觸發WatchDog分析


轉自:https://juejin.cn/post/6989592850170118158

前言

作為一個Android開發者,不管是App或者是System開發者,經常會遇到一些分析日志的場景,本文就通過一個場景介紹一下Android 系統觸發Watdog重啟時的分析思路。Watchdog俗稱看門狗,Android如果一些系統服務發生異常,會觸發Watchdog,導致系統重啟。

場景介紹

QA報了一個系統重啟的Bug,就是電視通過Hdmi連接盒子,在進入Hdmi頁面時,系統奔潰了,既然是系統奔潰一般情況都是系統服務掛掉了,系統服務掛掉一遍會觸發Watchdog.下面就是觸發WatDog時的日志,一般查看Android Log中是否觸發WatDog可以搜索Watchdog,或者GOODBYE關鍵字.

07-26 06:34:10.100  2112  2135 W Watchdog: *** Watchdog KILLING SYSTEM PROCESS: Blocked in handler on main thread (main)
07-26 06:34:10.105  2112  2135 W Watchdog: main annotated stack trace:
07-26 06:34:10.106  2112  2135 W Watchdog:     at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)
07-26 06:34:10.106  2112  2135 W Watchdog:     - waiting to lock <0x0c25fdf3> (a java.lang.Object)
07-26 06:34:10.107  2112  2135 W Watchdog:     at com.android.server.tv.TvInputManagerService$InputServiceConnection.onServiceConnected(TvInputManagerService.java:2419)
07-26 06:34:10.107  2112  2135 W Watchdog:     - locked <0x0c7f5ab0> (a java.lang.Object)
07-26 06:34:10.108  2112  2135 W Watchdog:     at android.app.LoadedApk$ServiceDispatcher.doConnected(LoadedApk.java:1954)
07-26 06:34:10.108  2112  2135 W Watchdog:     at android.app.LoadedApk$ServiceDispatcher$RunConnection.run(LoadedApk.java:1986)
07-26 06:34:10.109  2112  2135 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:938)
07-26 06:34:10.109  2112  2135 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)
07-26 06:34:10.109  2112  2135 W Watchdog:     at android.os.Looper.loop(Looper.java:223)
07-26 06:34:10.110  2112  2135 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:669)
07-26 06:34:10.110  2112  2135 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:413)
07-26 06:34:10.111  2112  2135 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
07-26 06:34:10.111  2112  2135 W Watchdog:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)
07-26 06:34:10.111  2112  2135 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1060)
07-26 06:34:10.112  2112  2135 W Watchdog: *** GOODBYE!
07-26 06:34:10.112  2112  2135 I Process : Sending signal. PID: 2112 SIG: 9
復制代碼

上面TvInputHardwareManager.getHardwareList方法被調用的時候block住了,從下一行日志得知他block的原因是等待一個鎖。

07-26 06:34:10.106  2112  2135 W Watchdog:     at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)
07-26 06:34:10.106  2112  2135 W Watchdog:     - waiting to lock <0x0c25fdf3> (a java.lang.Object)
復制代碼
   public List<TvInputHardwareInfo> getHardwareList() {
        synchronized (mLock) {
            return Collections.unmodifiableList(mHardwareList);
        }
    }
復制代碼

那這個鎖被誰持有呢?現在只看Android log已經無能為力了,需要分析trace文件,做過app開發的應該都經常遇到anr,遇到anr一般都需要分析trace文件, 這個trace文件在/data/anr/目錄下。我們知道觸發ANR,dump trace文件的場景如下:

  • Service Timeout:比如前台服務在20s內未執行完成
  • BroadcastQueue Timeout:比如前台廣播在10s內未執行完成
  • ContentProvider Timeout:內容提供者,在publish過超時10s
  • InputDispatching Timeout: 輸入事件分發超時5s,包括按鍵和觸摸事件

其實還有一種情況也會生成trace文件,那就是觸發Framework的Watchdog的時候。看一下Watchdog.java的run方法,符合條件時,會觸發dump trace的操作。

//frameworks/base/services/core/java/com/android/server/Watchdog.java
@Override
    public void run() {
      ···
if (!fdLimitTriggered) {
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            Slog.i(TAG, "WAITED_HALF");
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                          //dump trace文件
                            ActivityManagerService.dumpStackTraces(pids, null, null,
                                    getInterestingNativePids(), null);
                            waitedHalf = true;
                        }
                        continue;
                    }
  ···
}
復制代碼

觸發dump trace時,日志里會有如下的日志:

07-26 06:33:59.485  2112  2135 I ActivityManager: Dumping to /data/anr/anr_2021-07-26-06-33-59-483
復制代碼

分析trace文件

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x71684c28 self=0xe11c5410
  | sysTid=2112 nice=-2 cgrp=default sched=0/0 handle=0xe9762470
  | state=S schedstat=( 0 0 0 ) utm=377 stm=96 core=1 HZ=100
  | stack=0xff11b000-0xff11d000 stackSize=8192KB
  | held mutexes=
  at com.android.server.tv.TvInputHardwareManager.getHardwareList(TvInputHardwareManager.java:250)
  - waiting to lock <0x0c25fdf3> (a java.lang.Object) held by thread 100
  at com.android.server.tv.TvInputManagerService$InputServiceConnection.onServiceConnected(TvInputManagerService.java:2419)
  - locked <0x0c7f5ab0> (a java.lang.Object)
  at android.app.LoadedApk$ServiceDispatcher.doConnected(LoadedApk.java:1954)
  at android.app.LoadedApk$ServiceDispatcher$RunConnection.run(LoadedApk.java:1986)
  at android.os.Handler.handleCallback(Handler.java:938)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:223)
  at com.android.server.SystemServer.run(SystemServer.java:669)
  at com.android.server.SystemServer.main(SystemServer.java:413)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1060)
復制代碼

這個堆棧我們在Android log里已經看到了,block的原因是等待0x0c25fdf3這個鎖,那我們就搜一下0x0c25fdf3這個鎖現在被誰持有.

"Binder:2112_6" prio=5 tid=100 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x135c0808 self=0xb9698810
  | sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0xb478b1c0
  | state=S schedstat=( 0 0 0 ) utm=32 stm=10 core=1 HZ=100
  | stack=0xb4690000-0xb4692000 stackSize=1008KB
  | held mutexes=
  at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl.release(TvInputHardwareManager.java:847)
  - waiting to lock <0x0234b2e5> (a java.lang.Object) held by thread 11
  at com.android.server.tv.TvInputHardwareManager$Connection.resetLocked(TvInputHardwareManager.java:652)
  at com.android.server.tv.TvInputHardwareManager$Connection.binderDied(TvInputHardwareManager.java:722)
  - locked <0x0c25fdf3> (a java.lang.Object)
  at android.os.IBinder$DeathRecipient.binderDied(IBinder.java:305)
  at android.os.BinderProxy.sendDeathNotice(BinderProxy.java:654)
復制代碼

從上面的trace發現0x0c25fdf3Connection.binderDied這個方法持有的,而該方法被block原因是它調用了TvInputHardwareImpl.release方法,這個TvInputHardwareImpl.release方法被另外一個鎖0x0234b2e5block住了,那么0x0234b2e5鎖被誰持有呢?

"Binder:2112_2" prio=5 tid=11 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x131c03c8 self=0xe11c6210
  | sysTid=2127 nice=0 cgrp=default sched=0/0 handle=0xbd2401c0
  | state=S schedstat=( 0 0 0 ) utm=73 stm=20 core=3 HZ=100
  | stack=0xbd145000-0xbd147000 stackSize=1008KB
  | held mutexes=
  native: #00 pc 000715e4  /apex/com.android.runtime/lib/bionic/libc.so (__ioctl+8)
  native: #01 pc 0003f2a3  /apex/com.android.runtime/lib/bionic/libc.so (ioctl+30)
  native: #02 pc 0006d78f  /system/lib/libhidlbase.so (android::hardware::IPCThreadState::transact(int, unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int)+438)
  native: #03 pc 0006a265  /system/lib/libhidlbase.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+36)
  native: #04 pc 0000c3f9  /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::_hidl_closeStream(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, int)+228)
  native: #05 pc 0000c7c5  /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::closeStream(int, int)+36)
  native: #06 pc 00069801  /system/lib/libandroid_servers.so (android::JTvInputHal::removeStream(int, int)+440)
  at com.android.server.tv.TvInputHal.nativeRemoveStream(Native method)
  at com.android.server.tv.TvInputHal.removeStream(TvInputHal.java:111)
  - locked <0x08f771dc> (a java.lang.Object)
  at com.android.server.tv.TvInputHardwareManager$TvInputHardwareImpl.setSurface(TvInputHardwareManager.java:872)
  - locked <0x0234b2e5> (a java.lang.Object)
  at android.media.tv.ITvInputHardware$Stub.onTransact(ITvInputHardware.java:136)
  at android.os.Binder.execTransactInternal(Binder.java:1154)
  at android.os.Binder.execTransact(Binder.java:1123)
復制代碼

這個0x0234b2e5鎖被TvInputHardwareImpl.setSurface方法持有了,那么他被block的原因是:android.hardware.tv.input@1.0.so中有個_hidl_closeStream方法被block住了。

  native: #04 pc 0000c3f9  /system/lib/android.hardware.tv.input@1.0.so (android::hardware::tv::input::V1_0::BpHwTvInput::_hidl_closeStream(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, int)+228)

復制代碼

由於再往下追會涉及到公司的私有代碼了,這里就不在繼續往下分析了,但是上面的思路已經很清楚了,結合trace中的堆棧和源碼定位問題應該不成問題。

總結

不管是普通的anr問題或者是這種觸發Watchdog的重啟,分析方式都類似的,就是先查看Android log,然后在分析trace文件,一步一步的往下挖掘。在上面的介紹中提到的WatDog,其實是Android Framework中一個很重要的機制,后面將出文章專門介紹一下其實現,^v^。


作者:CodingDev
鏈接:https://juejin.cn/post/6989592850170118158
來源:稀土掘金
著作權歸作者所有。商業轉載請聯系作者獲得授權,非商業轉載請注明出處。


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM