轉自:http://blog.csdn.net/ruingman/article/details/53118202
定義
主線程在特定的時間內沒有做完特定的事情
常見的場景
A.input事件超過5S沒有處理完成
B.service executing 超時(bind,create,start,unbind等等),前台20s,后台200s
C.廣播處理超時,前台10S,后台60s
D.ContentProvider執行超時,20s
常見的原因
A.耗時操作,如復雜的layout,龐大的for循環,IO等。
B.被Binder 對端block
C.被子線程同步鎖block
D.Binder被占滿導致主線程無法和SystemServer通信
E.得不到系統資源(CPU/RAM/IO)
其中ABCD比較好分析,而E比較困難。
如何分析?
指導思想:通過各種線索,尋求主線程在過去一段時間內發生block的可能原因。
線索包括:
A.traces.txt/dropbox
AMS在ANR發生的時候,dump相關進程(ANR的進程、systemserver、mediaserver,surfaceFinger等)的當前棧到traces.txt。
traces.txt存在的幾種形式:
1. adb pull /data/anr/
2. slog/2012xxxxx/misc/anr/snapshot_xxxx.log
3. slog/dropbox/data_app_anr_xxxxx.tgz
4. slog/dropbox/system_app_anr_xxxx.tgz
需要注意的是,traces.txt是抓取的是超時后(如input超時就是5s后)的snapshot,並不一定能夠真實的反應出block的點。
也存在抓到主線程沒有block,在idle的情況。
B.Eventlog中的dvm_lock_sample.
在同步鎖發生content的時候,虛擬機會將這個競爭記錄在eventlog中:
dvm_lock_sample: [system_server,1,Binder_7, 22, ActivityManagerService.java,15921,-,9628,4]
進程名 被block線程 block時間(ms) 被block的行號 持有者行號
實現可以參考art/runtime/monitor_android.cc LogContentionEvent函數
如果主線程是被binder對端、被同步鎖block,那么eventlog中就很有可能會有dvm_lock_sample的打印。
華為項目上實現了更加強大的功能BlockMonitor,在主線程有耗時操作(如handlemsg、Binder調用耗時)的時候會打印出棧。
C.搜索主線程在發生ANR前后的main,systemlog,結合代碼查看可能block在哪里。
現在發生ANR的時候,sprdruntimeinfo在mainlog中會Dump出比較多的信息,其中比較重要的是binder、cpu。
D.ANR發生的時候,打印出的CPU的占用。
注意,並不能因為有進程cpu占用高就果斷的去懷疑他。
因為最主要的線索traces.txt的有效性並不是非常高,所以ANR問題分析是存在一定的局限性的。
案例分析
1. Settings 被對端block
首先需要去看的就是traces.txt 中ANR進程的主線程的棧。
以settting的這次ANR為例:
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500
| sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00
| state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100
| stack=0xbe283000-0xbe285000 stackSize=8MB
| held mutexes=
native: #00 pc 000410ac /system/lib/libc.so (__ioctl+8)
native: #01 pc 000477e5 /system/lib/libc.so (ioctl+14)
native: #02 pc 0001e7c5 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
native: #03 pc 0001ee17 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38)
native: #04 pc 0001efcd /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124)
native: #05 pc 00019fb7 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
native: #06 pc 00086de9 /system/lib/libandroid_runtime.so (???)
native: #07 pc 00d94629 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:503)
at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410)
at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174)
at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57)
at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361)
at android.app.Fragment.performCreate(Fragment.java:2202)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148)
at android.app.BackStackRecord.run(BackStackRecord.java:793)
at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535)
at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562)
at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084)
at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657)
at android.app.Activity.performCreate(Activity.java:6251)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477)
at android.app.ActivityThread.-wrap11(ActivityThread.java:-1)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:148)
at android.app.ActivityThread.main(ActivityThread.java:5438)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652)
主線程被對端block,而對端是在systemserver中的NetworkPolicyManager。
接下來繼續去查NetworkPolicyManager為何會被block。
通常情況下,可以搜索getNetworkPolicies,一般的對端的函數並不會修改函數名:
看到我們的對端:
"Binder_4" prio=5 tid=56 Blocked
| group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200
| sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930
| state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100
| stack=0x9fc04000-0x9fc06000 stackSize=1014KB
| held mutexes=
at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696)
- waiting to lock <0x07439315> (a java.lang.Object) held by thread 35
at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145)
at android.os.Binder.execTransact(Binder.java:453
被Tid=35的人拿住了一把鎖(0x07439315),那么繼續看tid=35是誰,有兩種方法:
1. 搜索tid=35
2. 搜索0x07439315,找到 - locked <0x07439315> (a java.lang.Object)
"NetworkPolicy" prio=5 tid=35 TimedWaiting
| group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700
| sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930
| state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100
| stack=0xa0e31000-0xa0e33000 stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x02580c1b> (a java.lang.Object)
at java.lang.Thread.parkFor$(Thread.java:1220)
- locked <0x02580c1b> (a java.lang.Object)
at sun.misc.Unsafe.park(Unsafe.java:299)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634)
at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340)
at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712)
- locked <0x0b0f91b8> (a java.lang.Object)
at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421)
at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232)
at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060)
- locked <0x07439315> (a java.lang.Object)
at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881)
at android.os.Handler.handleCallback(Handler.java:739)
at android.os.Handler.dispatchMessage(Handler.java:95)
at android.os.Looper.loop(Looper.java:148)
at android.os.HandlerThread.run(HandlerThread.java:61)
可以看到,NetworkPolicy在通過NativeDaemonConnector和netd通信(setInterfaceQuota)
我們結合log來看下是否有有用信息,按照之前的經驗,netd在執行完cmd的時候,會打印出slow operation。
在發生ANR的前后,查找netd相關的打印:
06-19 15:29:00.997 1235 1270 I am_anr : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
06-19 15:29:05.683 1235 2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms
06-19 15:29:05.723 1235 2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100]
從eventslog中還可以發現,setting出現了兩次連續的ANR,而上下文中都有類似上面的dvm_lock_sample NetworkPolicyManagerService相關的告警。
因此雖然15:28的這次ANR並沒有打出有用的棧,但是我們還是可以猜測出這兩次ANR的原因都是netd的cmd耗時太久導致的(在主線程的調用路徑上存在必然的block)
那么,在netd可能無法修改的情況下,我們應該如何去resolve這個問題呢。
將可能存在block的操作放到非UI線程中去做。
2.com.huawei.hwvplayer.youku對端block:
打開dropbox中對應的system_app_anrxxxx:
查看主線程的棧:
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
| sysTid=25390 nice=-1 cgrp=default sched=3/0 handle=0xb6f18b34
| state=S schedstat=( 0 0 0 ) utm=81 stm=12 core=2 HZ=100
| stack=0xbe78b000-0xbe78d000 stackSize=8MB
| held mutexes=
kernel: (couldn't read /proc/self/task/25390/stack)
native: #00 pc 000422d0 /system/lib/libc.so (__ioctl+8)
native: #01 pc 00047825 /system/lib/libc.so (ioctl+14)
native: #02 pc 0001e835 /system/lib/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+132)
native: #03 pc 0001ee93 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+38)
native: #04 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
native: #05 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
native: #06 pc 0008a035 /system/lib/libandroid_runtime.so (???)
native: #07 pc 00d78869 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:510)
at android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
at android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
at android.os.Environment.getExternalStorageDirectory(Environment.java:354)
at com.huawei.common.utils.PathUtils.<clinit>(PathUtils.java:51)
at com.huawei.common.utils.PathUtils.getWorkspacePath(PathUtils.java:80)
at com.huawei.common.components.log.Logger.<clinit>(Logger.java:37)
at com.huawei.common.components.log.Logger.i(Logger.java:162)
at com.huawei.hwvplayer.data.db.DbProvider.attachInfo(DbProvider.java:89)
at android.app.ActivityThread.installProvider(ActivityThread.java:5279)
at android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
at android.app.ActivityThread.access$1600(ActivityThread.java:165)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:188)
at android.app.ActivityThread.main(ActivityThread.java:5578)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
主線程嘗試去調用MountService的 getVolumeList接口,可能沒有返回。
查看system_server中相關的,搜索getVolumeList
發現systemserver中有三個Binder線程和主線程被block,我們的對端是哪個Binder線程暫時無法確認,但這並不影響我們繼續分析,因為他們被blockd的路徑是一致的:
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
| sysTid=22735 nice=-2 cgrp=default sched=0/0 handle=0xb6f18b34
| state=S schedstat=( 0 0 0 ) utm=432 stm=85 core=1 HZ=100
| stack=0xbe78b000-0xbe78d000 stackSize=8MB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:858)
at android.os.storage.StorageManager.getPrimaryVolume(StorageManager.java:906)
at com.android.server.usb.UsbDeviceManager.systemReady(UsbDeviceManager.java:327)
at com.android.server.usb.UsbService.systemReady(UsbService.java:181)
at com.android.server.usb.UsbService$Lifecycle.onBootPhase(UsbService.java:78)
at com.android.server.SystemServiceManager.startBootPhase(SystemServiceManager.java:135)
at com.android.server.SystemServer$3.run(SystemServer.java:1489)
at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:12417)
at com.android.server.am.HwActivityManagerService.systemReady(HwActivityManagerService.java:960)
at com.android.server.SystemServer.startOtherServices(SystemServer.java:1485)
at com.android.server.SystemServer.run(SystemServer.java:381)
at com.android.server.SystemServer.main(SystemServer.java:272)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
"Binder_8" prio=5 tid=76 Blocked
| group="main" sCount=1 dsCount=0 obj=0x13bd60a0 self=0x9c1abe00
| sysTid=25191 nice=-1 cgrp=default sched=0/0 handle=0x97158930
| state=S schedstat=( 0 0 0 ) utm=9 stm=7 core=2 HZ=100
| stack=0x9705c000-0x9705e000 stackSize=1014KB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
at android.os.Binder.execTransact(Binder.java:453)
"Binder_2" prio=5 tid=8 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12cac0a0 self=0xaebf0300
| sysTid=22761 nice=-1 cgrp=default sched=0/0 handle=0xaef7d930
| state=S schedstat=( 0 0 0 ) utm=41 stm=25 core=0 HZ=100
| stack=0xaee81000-0xaee83000 stackSize=1014KB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
at android.os.Binder.execTransact(Binder.java:453)
他們均是被tid=40的人block,按照上面的方法搜索tid=40或者 0x0eeb54f1得到block的人:
"MountService" prio=5 tid=40 TimedWaiting
| group="main" sCount=1 dsCount=0 obj=0x132c1160 self=0x9ce57400
| sysTid=23512 nice=0 cgrp=default sched=0/0 handle=0x9a239930
| state=S schedstat=( 0 0 0 ) utm=29 stm=2 core=2 HZ=100
| stack=0x9a137000-0x9a139000 stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x00fea1f3> (a java.lang.Object)
at java.lang.Thread.parkFor$(Thread.java:1235)
- locked <0x00fea1f3> (a java.lang.Object)
at sun.misc.Unsafe.park(Unsafe.java:299)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:777)
at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:489)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
at com.android.server.MountService.handleSystemReady(MountService.java:776)
- locked <0x0eeb54f1> (a java.lang.Object)
at com.android.server.MountService.access$500(MountService.java:152)
at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:150)
at android.os.HandlerThread.run(HandlerThread.java:61)
和上面netd類似的,mountservice也是通過ndc和vold通信,這里我們需要繼續查看是否vold存在異常。
之前提到過,類似這種同步鎖block的,dvm_lock_sample一定會有打印,於是先去找eventslog,不過這個是華為的log,是沒有eventlog的。
而華為實現了一個blockMonitor的功能,和dvm_lock_sample類似,當某個操作特別耗時的時候,會將其打印出來:
在ANR的附近找到如下:
07-19 10:17:50.739 25271 25271 W BlockMonitor: The binder calling took 55209ms.
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BlockMonitor.checkBinderTime(BlockMonitor.java:141)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BinderProxy.transact(Binder.java:511)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment.getExternalStorageDirectory(Environment.java:354)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.parsePathStrategy(FileProvider.java:583)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.getPathStrategy(FileProvider.java:534)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.attachInfo(FileProvider.java:352)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installProvider(ActivityThread.java:5279)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.access$1600(ActivityThread.java:165)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Handler.dispatchMessage(Handler.java:102)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Looper.loop(Looper.java:188)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.main(ActivityThread.java:5578)
07-19 10:17:50.739 25271 25271 W BlockMonitor: java.lang.reflect.Method.invoke(Native Method)
07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
再加上之前的:
07-19 10:17:50.729 22735 23512 E NativeDaemonConnector.ResponseQueue: Timeout waiting for response
07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response to 4 volume reset
07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response mOutputStream = android.net.LocalSocketImpl$SocketOutputStream@922644, mSocket = vold
07-19 10:17:50.731 22735 23512 W MountService: Failed to reset vold
07-19 10:17:50.731 22735 23512 W MountService: com.android.server.NativeDaemonTimeoutException: command '4 volume reset' failed with 'null'
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:501)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.handleSystemReady(MountService.java:776)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.access$500(MountService.java:152)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.Handler.dispatchMessage(Handler.java:102)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.Looper.loop(Looper.java:150)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.HandlerThread.run(HandlerThread.java:61)
我們有理由去推斷vold的狀態是不對的,但是又沒有vold的棧。
那么全局grep vold試試能不能找到線索,在kernellog中發現了vold的異常:
07-19 14:22:24.669 <3>[10772.492156] c0 Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
07-19 14:22:24.669 <6>[10772.492217] c0 vold R running 0 224 1 0x00000001
07-19 14:22:24.669 <4>[10772.492278] c0 [<c05ebecc>] (__schedule+0x38c/0x5bc) from [<c05ea478>] (schedule_timeout+0x18/0x1e8)
07-19 14:22:24.669 <4>[10772.492309] c0 [<c05ea478>] (schedule_timeout+0x18/0x1e8) from [<c05eb90c>] (wait_for_common+0x11c/0x164)
07-19 14:22:24.669 <4>[10772.492309] c0 [<c05eb90c>] (wait_for_common+0x11c/0x164) from [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4)
07-19 14:22:24.669 <4>[10772.492339] c0 [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4) from [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74)
07-19 14:22:24.669 <4>[10772.492370] c0 [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74) from [<c03d41f0>] (mmc_send_status+0x6c/0x8c)
07-19 14:22:24.670 <4>[10772.492400] c0 [<c03d41f0>] (mmc_send_status+0x6c/0x8c) from [<c03d4504>] (sd_send_status+0x14/0x44)
07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d4504>] (sd_send_status+0x14/0x44) from [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168)
07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168) from [<c03dabd8>] (mmc_lockable_store+0x594/0x75c)
07-19 14:22:24.670 <4>[10772.492461] c0 [<c03dabd8>] (mmc_lockable_store+0x594/0x75c) from [<c029d560>] (dev_attr_store+0x18/0x24)
07-19 14:22:32.070 <4>[10772.492492] c0 [<c029d560>] (dev_attr_store+0x18/0x24) from [<c013b370>] (sysfs_write_file+0x104/0x148)
07-19 14:22:32.070 <4>[10772.492522] c0 [<c013b370>] (sysfs_write_file+0x104/0x148) from [<c00eabb4>] (vfs_write+0xd0/0x180)
07-19 14:22:32.070 <4>[10772.492553] c0 [<c00eabb4>] (vfs_write+0xd0/0x180) from [<c00eb070>] (SyS_write+0x38/0x68)
07-19 14:22:32.071 <4>[10772.492583] c0 [<c00eb070>] (SyS_write+0x38/0x68) from [<c000e840>] (ret_fast_syscall+0x0/0x30)
vold一直在這個操作中沒有退出來,所以不能響應客戶端的請求,從而導致了ANR。
這個問題需要mmc的同事進一步去分析,目前懷疑是SD卡發生了錯誤。
3. setting 主線程耗時操作
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x7425caf8 self=0xb4827800
| sysTid=671 nice=0 cgrp=default sched=0/0 handle=0xb6febbec
| state=S schedstat=( 0 0 0 ) utm=1853 stm=50 core=2 HZ=100
| stack=0xbe7bd000-0xbe7bf000 stackSize=8MB
| held mutexes=
kernel: (couldn't read /proc/self/task/671/stack)
native: #00 pc 000133cc /system/lib/libc.so (syscall+28)
native: #01 pc 000a9a83 /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82)
native: #02 pc 001b16f1 /system/lib/libart.so (art::JNI::NewString(_JNIEnv*, unsigned short const*, int)+640)
native: #03 pc 00075887 /system/lib/libandroid_runtime.so (???)
native: #04 pc 008570ab /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_database_CursorWindow_nativeGetString__JII+110)
at android.database.CursorWindow.nativeGetString(Native method)
at android.database.CursorWindow.getString(CursorWindow.java:438)
at android.database.AbstractWindowedCursor.getString(AbstractWindowedCursor.java:51)
at android.database.CursorWrapper.getString(CursorWrapper.java:114)
at com.android.settings.ApnSettings.fillList(ApnSettings.java:259)
at com.android.settings.ApnSettings.onResume(ApnSettings.java:208)
at android.app.Fragment.performResume(Fragment.java:2096)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:928)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1067)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1049)
at android.app.FragmentManagerImpl.dispatchResume(FragmentManager.java:1879)
at android.app.Activity.performResume(Activity.java:6113)
at android.app.ActivityThread.performResumeActivity(ActivityThread.java:3015)
at android.app.ActivityThread.handleResumeActivity(ActivityThread.java:3061)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2415)
at android.app.ActivityThread.access$800(ActivityThread.java:151)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1313)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:135)
at android.app.ActivityThread.main(ActivityThread.java:5345)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:947)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:742)
從ApnSettings.java中可以看到fillList這個可能的耗時操作主線程處理的地方有OnResume和OnReceive。
且代碼中存在使用HandlerThread異步處理這個操作的地方,可以借鑒下將這兩處都放到HandlerThread中去做。
4. 系統資源緊張造成的ANR
通常情況下,一般這種問題的表現的可能形式是
1. traces.txt中的主線程的棧在一個非常common的操作中,如new一個變量,讀取某個文件等
2. ANR的trace中IOW相當高(IOW+CPU sys+usr=100)
3. Kswapd非常活躍
4. Low Ram 配置
5. kernel log中 D狀態的用戶進程比較多,且都block在內存相關,且內存比較緊張.
6. 連續出現ANR,slog和ams一直在dumptrace,造成比較大的IO壓力
7. 前台正在玩一個很大的游戲
8. 某個應用正在安裝做dex2oat.
出現這種情況下,我們需要做的是:
1. 針對性的調整lmk參數,減少后台的數量
2. 檢查是否有不應該常駐的進程
3. 是否需要修改偽前台的adj,減少常駐內存
4. 是否需要關閉dex2oat
此類問題比較難解決。
5. Binder被占滿導致ANR
1. traces.txt中主線程處於IDLE狀態,要么是沒有抓到現場,要么是出現了其他問題。
2. 分析log,找到ANR的時間點如下:后台服務執行超時,這個timeout是200s
07-28 03:37:33.681 1719 1732 I am_anr : [0,15625,com.android.exchange,948452933,executing service com.android.exchange/.service.EasService]
----- pid 15625 at 2016-07-28 03:37:33 -----,dumptrace的時間點。
3. 查看是否有主線程的消息:
07-28 03:37:39.373 15625 15625 W BlockMonitor: Package name: com.android.exchange
07-28 03:37:39.373 15625 15625 W BlockMonitor: The Message{ what=115 obj=ServiceArgsData{token=android.os.BinderProxy@49814e2
startId=6 args=Intent { act=com.android.email.EXCHANGE_INTENT flg=0x4 cmp=com.android.exchange/.service.EasService (has extras) }}
target=android.app.ActivityThread$H } took 5174ms.
之前有說過,華為項目是沒有dvmlocksample但是有blockMonitor的,如果有耗時的binder調用以及mesg處理,那么就會被他抓過來。
從代碼中可以看到,what=115是serviceArg消息,是startservice的一部分,而這個消息應該是34秒的時候才接受到並處理的。
那么,為什么33s的時候dumptrace並沒有抓到主線程有卡頓?且后續並沒有blockmonitor的打印,也就是說200s內主線程和systemserver並沒有出現block。
那么還有可能是什么原因導致了前200s並沒有收到這個消息?
我們來看下這部分的代碼:
private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg,
boolean oomAdjusted) throws TransactionTooLargeException {
final int N = r.pendingStarts.size();
if (N == 0) {
return;
}
while (r.pendingStarts.size() > 0) {
Exception caughtException = null;
ServiceRecord.StartItem si;
try {
si = r.pendingStarts.remove(0);
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Sending arguments to: "
+ r + " " + r.intent + " args=" + si.intent);
if (si.intent == null && N > 1) {
// If somehow we got a dummy null intent in the middle,
// then skip it. DO NOT skip a null intent when it is
// the only one in the list -- this is to support the
// onStartCommand(null) case.
continue;
}
si.deliveredTime = SystemClock.uptimeMillis();
r.deliveredStarts.add(si);
si.deliveryCount++;
if (si.neededGrants != null) {
mAm.grantUriPermissionUncheckedFromIntentLocked(si.neededGrants,
si.getUriPermissionsLocked());
}
bumpServiceExecutingLocked(r, execInFg, "start"); //超時timer開始計算
if (!oomAdjusted) {
oomAdjusted = true;
mAm.updateOomAdjLocked(r.app);
}
int flags = 0;
if (si.deliveryCount > 1) {
flags |= Service.START_FLAG_RETRY;
}
if (si.doneExecutingCount > 0) {
flags |= Service.START_FLAG_REDELIVERY;
}
r.app.thread.scheduleServiceArgs(r, si.taskRemoved, si.id, flags, si.intent);//通過binder給App發送一個消息,ActivityThread會收到並處理。
} catch (TransactionTooLargeException e) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Transaction too large: intent="
+ si.intent);
caughtException = e;
} catch (RemoteException e) {
// Remote process gone... we'll let the normal cleanup take care of this.
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while sending args: " + r);
caughtException = e;
} catch (Exception e) {
Slog.w(TAG, "Unexpected exception", e);
caughtException = e;
}
if (caughtException != null) {
// Keep nesting count correct
final boolean inDestroying = mDestroyingServices.contains(r);
serviceDoneExecutingLocked(r, inDestroying, inDestroying);
if (caughtException instanceof TransactionTooLargeException) {
throw (TransactionTooLargeException)caughtException;
}
break;
}
}
}
從上面的代碼可以看出,在啟動計時器和給app發送消息之間並不存在耗時操作,且這個是在同步鎖之內的,如果慢了一定會被blockmonitor抓出來。
那么問題的原因只有一種可能情形:
200s之前給app發送的消息,APP並沒有及時的處理。
從經驗可以猜測,一般這種情況都是app的Binder被占滿導致沒有空閑的Binder線程能夠處理這個消息導致。
而我們的traces正好dump到了這一幕:
app所有的Binder線程都全部在做一件事情,一個依賴於網絡的操作,是被另外一個人通過contentprovider.query調過來的,有點不太正常。
"Binder_10" prio=5 tid=30 Native
| group="main" sCount=1 dsCount=0 obj=0x22dba160 self=0xb2f26700
| sysTid=4613 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9eb80930
| state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3 HZ=100
| stack=0x9ea84000-0x9ea86000 stackSize=1014KB
| held mutexes=
kernel: (couldn't read /proc/self/task/4613/stack)
native: #00 pc 000423e8 /system/lib/libc.so (__pselect6+20)
native: #01 pc 0001c115 /system/lib/libc.so (select+60)
native: #02 pc 0000a279 /system/lib/libjavacrypto.so (???)
native: #03 pc 0000b29f /system/lib/libjavacrypto.so (???)
native: #04 pc 004bc3fd /data/dalvik-cache/arm/system@framework@boot.oat (Java_com_android_org_conscrypt_NativeCrypto_SSL_1read__JLjava_io_FileDescriptor_2Lcom_android_org_conscrypt_NativeCrypto_00024SSLHandshakeCallbacks_2_3BIII+192)
at com.android.org.conscrypt.NativeCrypto.SSL_read(Native method)
at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:705)
- locked <0x0508e00e> (a java.lang.Object)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:108)
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:196)
at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:88)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:179)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:185)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:240)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:264)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:284)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:126)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:596)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:517)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:495)
at com.android.exchange.EasResponse.fromHttpRequest(EasResponse.java:91)
at com.android.exchange.service.EasServerConnection.executeHttpUriRequest(EasServerConnection.java:384)
at com.android.exchange.eas.EasOperation.performOperation(EasOperation.java:382)
at com.android.exchange.service.EasService.searchGal(EasService.java:706)
at com.android.exchange.provider.ExchangeDirectoryProvider.query(ExchangeDirectoryProvider.java:350)
at android.content.ContentProvider.query(ContentProvider.java:1017)
at android.content.ContentProvider$Transport.query(ContentProvider.java:238)
at android.content.ContentProviderNative.onTransact(ContentProviderNative.java:112)
at android.os.Binder.execTransact(Binder.java:453)
sprdruntime在ANR的時候會dump BinderState,搜索當前進程的pid,查看是否能夠找到線索:
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: proc 15625
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4537: l 01
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: incoming transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4538: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4563: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4564: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4578: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4580: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4591: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4592: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4602: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4603: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4612: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4613: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15635: l 02
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15636: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624820: edc2bd80 from 30228:30240 to 15625:15636 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600714
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15658: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624824: c2c47400 from 30228:30238 to 15625:15658 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600a20
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 18343: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623794: d5132380 from 30228:30241 to 15625:18343 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6000f4
可以看到我們binder線程被30228給占滿了,而同時我們也發現30228也有點不太對:
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: proc 30228
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 1612: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 623738: e994e880 from 2332:4503 to 30228:1612 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f880009c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4535: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625317: e82f4000 from 2332:4556 to 30228:4535 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f8800d6c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4536: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625326: ed9d1c40 from 2332:4557 to 30228:4536 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f88010a0
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4558: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625733: e887e5c0 from 2332:4575 to 30228:4558 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f88013d4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4561: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625756: e8719880 from 2332:4576 to 30228:4561 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f880170c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4577: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626125: d34d6ec0 from 2332:4585 to 30228:4577 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801a44
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4579: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626131: c3f98200 from 2332:4587 to 30228:4579 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801d7c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4588: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626541: ede63540 from 2332:4598 to 30228:4588 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88020b4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4590: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626546: e9a32b80 from 2332:4599 to 30228:4590 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88023f0
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4600: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626849: ecb23e00 from 2332:4608 to 30228:4600 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f880272c
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4601: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626856: e8389f00 from 2332:4609 to 30228:4601 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802a68
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4610: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627144: c27891c0 from 2332:4618 to 30228:4610 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802da4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4611: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627151: e2b511c0 from 2332:4619 to 30228:4611 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88030e0
也就是說實際上是2332query了30228,30228才會去query16625導致anr的。
那么2332是誰?
eventslog中可以看到2332是email,30228是android.process.acore。
搜索traces.txt,看到我們在anr的時候有dump到2332的棧:
發現了異常,有很多個線程(也就是上面dump到的2332的線程)在查詢,
"Filter" prio=5 tid=42 Native
| group="main" sCount=1 dsCount=0 obj=0x130cc8e0 self=0x98637500
| sysTid=4502 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9877e930
| state=S schedstat=( 0 0 0 ) utm=1 stm=0 core=0 HZ=100
| stack=0x9867c000-0x9867e000 stackSize=1038KB
| held mutexes=
kernel: (couldn't read /proc/self/task/4502/stack)
native: #00 pc 00017618 /system/lib/libc.so (syscall+28)
native: #01 pc 000f5cbd /system/lib/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+80)
native: #02 pc 00275df1 /system/lib/libart.so (_ZN3art3JNI18CallBooleanMethodVEP7_JNIEnvP8_jobjectP10_jmethodIDSt9__va_list+592)
native: #03 pc 0008617b /system/lib/libandroid_runtime.so (???)
native: #04 pc 00089c2b /system/lib/libandroid_runtime.so (???)
native: #05 pc 000198dd /system/lib/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+60)
native: #06 pc 0001ec25 /system/lib/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+584)
native: #07 pc 0001ef73 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+262)
native: #08 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
native: #09 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
native: #10 pc 0008a035 /system/lib/libandroid_runtime.so (???)
native: #11 pc 00d78919 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:505)
at android.content.ContentProviderProxy.query(ContentProviderNative.java:419)
at android.content.ContentResolver.query(ContentResolver.java:502)
at android.content.ContentResolver.query(ContentResolver.java:438)
at com.huawei.mail.chips.BaseRecipientAdapter.doQuery(BaseRecipientAdapter.java:931)
at com.huawei.mail.chips.BaseRecipientAdapter$DirectoryFilter.performFiltering(BaseRecipientAdapter.java:400)
at android.widget.Filter$RequestHandler.handleMessage(Filter.java:234)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:150)
at android.os.HandlerThread.run(HandlerThread.java:61)
那么接下來就需要結合代碼查看這么多線程創建的原因。
解決的思路也比較簡單,是否需要采取線程池或者限制下線程的個數,防止由於此操作比較耗時(依賴於網絡)導致對端ANR。
- Bug 582473 single touch point test,操作15次以上必現工廠測試anr
從dropbox中找到Validationtools對應的traces,查看主線程的棧,發現主線程在IDLE狀態。
"main" prio=5 tid=1 NATIVE
| group="main" sCount=1 dsCount=0 obj=0x4155acc0 self=0x414943f0
| sysTid=3287 nice=-1 sched=0/0 cgrp=apps handle=1074065748
| state=S schedstat=( 0 0 0 ) utm=1241 stm=340 core=0
#00 pc 00021764 /system/lib/libc.so (epoll_wait+12)
#01 pc 000105e3 /system/lib/libutils.so (android::Looper::pollInner(int)+94)
#02 pc 00010811 /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92)
#03 pc 0006b345 /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22)
#04 pc 0001e70c /system/lib/libdvm.so (dvmPlatformInvoke+112)
#05 pc 0004dc43 /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
#06 pc 00027778 /system/lib/libdvm.so
#07 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
#08 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
#09 pc 00060259 /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+392)
#10 pc 000683eb /system/lib/libdvm.so
#11 pc 00027778 /system/lib/libdvm.so
#12 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
#13 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
#14 pc 0005ff77 /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+338)
#15 pc 0004982b /system/lib/libdvm.so
#16 pc 0004de13 /system/lib/libandroid_runtime.so
#17 pc 0004ecdd /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+532)
#18 pc 00001423 /system/bin/app_process
#19 pc 0000e403 /system/lib/libc.so (__libc_init+50)
#20 pc 00000f34 /system/bin/app_process
at android.os.MessageQueue.nativePollOnce(Native Method)
at android.os.MessageQueue.next(MessageQueue.java:138)
at android.os.Looper.loop(Looper.java:123)
at android.app.ActivityThread.main(ActivityThread.java:5294)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:515)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:932)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:748)
at dalvik.system.NativeStart.main(Native Method)
然后發現第二個線程是Binder10 且在block狀態,通常情況下,一個不是非常繁忙的需要對外提供服務的app是不許要16個binder線程的。
於是接着查看是否其他binder也存在異常,結果發現所有的binder線程均處於同樣的調用棧。
"Binder_D" prio=5 tid=21 WAIT
| group="main" sCount=1 dsCount=0 obj=0x41d6f468 self=0x555f4898
| sysTid=3344 nice=0 sched=0/0 cgrp=apps handle=1442108096
| state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3
at java.lang.Object.wait(Native Method)
- waiting on <0x41d6f4c0> (a java.lang.VMThread) held by tid=21 (Binder_D)
at java.lang.Thread.parkFor(Thread.java:1205)
at sun.misc.Unsafe.park(Unsafe.java:325)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
at android.view.IWindow$Stub.onTransact(IWindow.java:108)
at android.os.Binder.execTransact(Binder.java:427)
at dalvik.system.NativeStart.run(Native Method)
"Binder_F" prio=5 tid=23 WAIT
| group="main" sCount=1 dsCount=0 obj=0x41d4e600 self=0x55cb63f8
| sysTid=3351 nice=0 sched=0/0 cgrp=apps handle=1435714912
| state=S schedstat=( 0 0 0 ) utm=1 stm=1 core=0
at java.lang.Object.wait(Native Method)
- waiting on <0x41d4e658> (a java.lang.VMThread) held by tid=23 (Binder_F)
at java.lang.Thread.parkFor(Thread.java:1205)
at sun.misc.Unsafe.park(Unsafe.java:325)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
at android.view.IWindow$Stub.onTransact(IWindow.java:108)
at android.os.Binder.execTransact(Binder.java:427)
at dalvik.system.NativeStart.run(Native Method)
如上面一個例子,在這種Binder滿的情況下,systemserver和app之間的通信是被block的,雖然這個是4.4並沒有sprdruntimeInfo給我們dump出binder的狀態,我們還是可以推斷出
ANR的直接原因就是此處不合常理的Binder被占滿。
首先通過代碼可以看到,surfaceView中這行對應的代碼為如下:
@Override
public void resized(Rect frame, Rect overscanInsets, Rect contentInsets,
Rect visibleInsets, boolean reportDraw, Configuration newConfig) {
SurfaceView surfaceView = mSurfaceView.get();
if (surfaceView != null) {
if (DEBUG) Log.v(
"SurfaceView", surfaceView + " got resized: w=" + frame.width()
+ " h=" + frame.height() + ", cur w=" + mCurWidth + " h=" + mCurHeight);
surfaceView.mSurfaceLock.lock();
try {
if (reportDraw) {
surfaceView.mUpdateWindowNeeded = true;
surfaceView.mReportDrawNeeded = true;
surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
} else if (surfaceView.mWinFrame.width() != frame.width()
|| surfaceView.mWinFrame.height() != frame.height()) {
surfaceView.mUpdateWindowNeeded = true;
surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
}
} finally {
surfaceView.mSurfaceLock.unlock();
}
}
}
mSurfaceLocklock的路徑還有如下:
private void updateWindow(boolean force, boolean redrawNeeded)
private final Canvas internalLockCanvas(Rect dirty)
由於這個是個必現問題,那么問題就簡單了,我們可以在所有會lock和unlock的地方都加上log,再復現一次來得到真相。
於是乎發現是在SingleTouchPointTest pass的時候忘記unlockCanvs導致。
套路總結
1. 查看traces.txt,如果有block棧則通過棧來找到block原因。
2. 應用自身耗時操作嘗試修改異步,被Binder block則進一步確認下對端的情況,如果是依賴於外部狀態的可能block操作,應用修改為異步,如果是common接口,轉FW
3. 如果不存在棧,那么盡人事聽天命,查看ANR點附近是否有主線程block的可能信息,如果有則順騰摸瓜,沒有就Needinfo。
如何避免ANR
A.減少復雜的layout
B.主線程盡量不要做和顯示無關的事情。
C.如果存在可能會block、耗時的操作,不要放到主線程中,可以使用異步等方式來放到另外一個handlerthread或者asynctask中去做。
需要注意的是,使用handler.post或者sendmessage的時候需要確認清楚handler的looper,如果是主線程,依然可能會ANR。
在HandlerThread中如果需要處理和顯示相關的,還需要到主線程中處理(非UI線程不能操作UI,在ViewRoot中有檢查)