2020-09-14T15:02:14.773+0800: 262550.148: [GC (Allocation Failure) 2020-09-14T15:02:14.773+0800: 262550.148: [ParNew: 292225K->5384K(306688K), 0.0080812 secs] 1913661K->1626821K(2063104K), 0.0082840 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] CMS: abort preclean due to time 2020-09-14T15:02:15.293+0800: 262550.667: [CMS-concurrent-abortable-preclean: 4.087/5.007 secs] [Times: user=5.12 sys=0.00, real=5.01 secs] 2020-09-14T15:02:15.311+0800: 262550.685: [GC (CMS Final Remark) [YG occupancy: 20470 K (306688 K)]2020-09-14T15:02:15.311+0800: 262550.685: [Rescan (parallel) , 0.0100449 secs]2020-09-14T15:02:15.321+0800: 262550.695: [weak refs processing, 0.0126935 secs]2020-09-14T15:02:15.334+0800: 262550.708:
[class unloading, 37.4731903 secs]2020-09-14T15:02:52.807+0800: 262588.181: [scrub symbol table, 3.7824486 secs]2020-09-14T15:02:56.589+0800: 262591.964: [scrub string table, 0.0028450 secs][1 CMS-remark: 1621436K(1756416K)] 1641906K(2063104K), 41.3046642 secs] [Times: user=0.60 sys=0.00, real=41.30 secs] 2020-09-14T15:02:56.616+0800: 262591.991: [CMS-concurrent-sweep-start]
從上面的日志可以看出,unloading class 的耗時30s+。
官方文檔解釋:https://blogs.oracle.com/poonam/long-class-unloading-pauses-with-jdk8
In this case too, the problem seemed not be with the JVM or more specifically with the class unloading step but more with the GC threads getting blocked out. It turned out that the class unloading phase was causing lot of paging activity to occur at the system level which took away the CPU time from the GC threads.
這個例子也一樣,這個問題和jvm無關,更具體的說和類卸載無關。而是和GC線程阻塞有關。最后發現類卸載階段產生了很多的頁活動,導致把cpu時間從GC線程搶走。
In this particular case, the system was configured to have higher number of large pages and a limited memory resource was left for the regular small pages. And a limited number of regular pages available for the MetaSpace caused paging activity during the class unloading phase leading to long GC pauses. The solution for this problem was to use -XX:+UseLargePagesInMetaspace that enables the large pages support for the MetaSpace. This avoided the paging activity and the class-unloading times returned to normal.
這個特殊的例子,系統(os?)配置了大量的大內存頁,導致留給正常內存頁的內存減少。給元生代的正常內存頁的數量有限,導致頁活動很頻繁,進而導致GC線程中斷。解決方案:使用-XX:+UseLargePageInMetaspace, 開啟元生代使用大內存頁。
問題分析,從cpu使用時間開始入手,real的耗時是所有時間,包括user+sys+IO等待時間,paging/swapping等。 為什么real那么高?查看同一時刻的服務器運行狀況,發現大量的swapping 操作。檢查系統配置i,發現系統開啟了swap功能。至此,問題大概明白了。java應用的堆設置的很大,3天左右才會發生一次Major GC。 操作系統為了保證系統的安全,會把長期不活躍的數據(垃圾數據)移到Swap。 等到GC到來時,就需要把這些數據swap in,產生Swap洪峰,延長了GC耗時。
目前,內存容量夠大,也便宜,建議關閉swap功能。以下兩篇文章也是建議關閉swap:
https://dzone.com/articles/just-say-no-swapping
https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration-memory.html