Troubleshooting a concurrent mode failure: the investigation and the fix


 

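Background: a scheduled background job runs at 05:30 every morning and performs a batch scan of the database to carry out some business logic.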

 

GC error log:

2017-07-05T05:30:54.408+0800: 518534.458: [CMS-concurrent-mark-start]
2017-07-05T05:30:55.279+0800: 518535.329: [GC 518535.329: [ParNew: 838848K->838848K(1118464K), 0.0000270 secs]
[CMS-concurrent-mark: 1.564/1.576 secs] [Times: user=10.88 sys=0.31, real=1.57 secs]
 (concurrent mode failure): 2720535K->2719116K(2796224K), 13.3742340 secs] 
 3559383K->2719116K(3914688K), 
 [CMS Perm : 38833K->38824K(524288K)], 13.3748020 secs] [Times: user=16.19 sys=0.00, real=13.37 secs]
2017-07-05T05:31:08.659+0800: 518548.710: [GC [1 CMS-initial-mark: 2719116K(2796224K)] 2733442K(3914688K), 0.0065150 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-07-05T05:31:08.666+0800: 518548.716: [CMS-concurrent-mark-start]
2017-07-05T05:31:09.528+0800: 518549.578: 
[GC 518549.578: [ParNew: 838848K->19737K(1118464K), 0.0055800 secs] 
3557964K->2738853K(3914688K), 0.0060390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
[CMS-concurrent-mark: 1.644/1.659 secs] [Times: user=14.15 sys=0.84, real=1.66 secs]
2017-07-05T05:31:10.326+0800: 518550.376: [CMS-concurrent-preclean-start]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-abortable-preclean-start]

Reference: understanding-cms-gc-logs

According to that article, the cause of a concurrent mode failure is: "there was not enough space in the CMS generation to promote the worst case surviving young generation objects. We name this failure as 'full promotion guarantee failure'."
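To make the quoted condition concrete, the figures from the log above can be plugged into a small calculation. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and "worst case" is taken to mean that everything currently in the young generation survives.

```java
// Worked arithmetic on the figures from the GC log above (all values in KB).
// Class and method names are illustrative, not part of any JVM API.
class PromotionHeadroom {
    // Free space left in the old (CMS) generation.
    static long oldGenFreeKb(long oldUsedKb, long oldCapacityKb) {
        return oldCapacityKb - oldUsedKb;
    }

    // Occupancy of a generation as a percentage of its capacity.
    static double occupancyPercent(long usedKb, long capacityKb) {
        return 100.0 * usedKb / capacityKb;
    }

    // Worst case assumed here: every young-generation object survives and must be promoted.
    static boolean worstCasePromotionFits(long youngUsedKb, long oldFreeKb) {
        return youngUsedKb <= oldFreeKb;
    }

    public static void main(String[] args) {
        long youngUsedKb = 838_848L;      // ParNew: 838848K->838848K(1118464K)
        long oldUsedKb = 2_720_535L;      // (concurrent mode failure): 2720535K->2719116K
        long oldCapacityKb = 2_796_224L;  // (2796224K)

        long oldFreeKb = oldGenFreeKb(oldUsedKb, oldCapacityKb);
        System.out.printf("old gen occupancy: %.2f%%%n", occupancyPercent(oldUsedKb, oldCapacityKb));
        System.out.printf("old gen free: %dK, young gen used: %dK%n", oldFreeKb, youngUsedKb);
        System.out.println("worst-case promotion fits: " + worstCasePromotionFits(youngUsedKb, oldFreeKb));
    }
}
```

With these numbers the old generation has 75689K free against 838848K sitting in the young generation, so the worst-case promotion cannot fit. The full collection that followed reclaimed only 1419K (2720535K->2719116K), which suggests that nearly everything in the old generation was still reachable.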

The suggested fixes are: "The concurrent mode failure can either be avoided increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true."

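The second option needs careful thought: if CMSInitiatingOccupancyFraction is set too low, CMS cycles may run very frequently and hurt performance. (See also the advice against configuring CMS on heaps under 3 GB: why no cms under 3G.)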

 

Troubleshooting:

1. JVM options (nothing here looks obviously wrong, although -XX:+UseCMSInitiatingOccupancyOnly from the advice quoted above is not set): -Xmx4096m -Xms2048m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails

2. Judging by the alert times, the alarm fires once a day at 05:30, so the scheduled job is the likely culprit.

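This is easy to track down. The service is a script runner with almost no online business logic, so finding the scheduled job that runs at that time and reading its logic is enough to locate the problem.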

Business code:

        int batchNumber = 1;
        int realCount = 0;
        int offset = 0;
        int limit = 999;
        int totalCount = 0;
        // Create a fixed pool of 20 threads (backed by an unbounded queue)
        ExecutorService service = Executors.newFixedThreadPool(20);
        while (true) {
            LogUtils.info(logger, "{0},{1}->{2}", batchNumber, offset, (offset + limit));
            try {
                // Paged query: fetch the next batch of up to `limit` rows
                Set<String> result = query(offset, limit);
                realCount = result.size();
                // Hand the fetched batch to the thread pool for processing
                service.execute(new AAAAAAA(result, batchNumber));
            } catch (Exception e) {
                LogUtils.error(logger, e, "exception,batch:{0},offset:{1},count:{2}", batchNumber, offset, limit);
                break;
            }
            totalCount += realCount;
            // A short page means the last batch has been read
            if (realCount < limit) {
                break;
            }
            batchNumber++;
            offset += limit;
        }
        // Stop accepting new tasks; batches already queued still run
        service.shutdown();

It uses a fixed pool of 20 threads. Each loop iteration fetches 999 rows from the database and hands them to the pool to process.

Analysis

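newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, and my table holds more than 20 million rows (2kw+). The loop keeps fetching pages and queuing them, with nothing to slow it down when the workers fall behind, so every pending batch of up to 999 strings stays reachable on the heap. Frankly, it is lucky this did not exhaust memory outright.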
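The unbounded queue is easy to observe directly. The sketch below assumes that Executors.newFixedThreadPool returns a plain ThreadPoolExecutor, which is an implementation detail of the JDK rather than a documented guarantee; the class and method names are hypothetical. It holds the single worker busy, submits tasks, and counts how many pile up. It also works out roughly how many batches a 20-million-row table produces at 999 rows per page (20 million is a round figure standing in for "2kw+").

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

class UnboundedQueueDemo {
    // True if the fixed pool's work queue is a LinkedBlockingQueue with no practical capacity limit.
    static boolean usesUnboundedLinkedQueue() {
        // The cast relies on the JDK implementation returning a ThreadPoolExecutor.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
        try {
            return pool.getQueue() instanceof LinkedBlockingQueue
                    && pool.getQueue().remainingCapacity() == Integer.MAX_VALUE;
        } finally {
            pool.shutdown();
        }
    }

    // Keeps the only worker busy, submits `tasks` no-op jobs, and returns the queue backlog.
    static int backlogAfterSubmitting(int tasks) {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // This first task goes straight to the worker and blocks it.
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Every further task is accepted and queued; nothing pushes back on the producer.
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> { });
            }
            return pool.getQueue().size();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    // Number of pages needed to read `rows` rows at `limit` rows per page (rounded up).
    static long batches(long rows, int limit) {
        return (rows + limit - 1) / limit;
    }

    public static void main(String[] args) {
        System.out.println("unbounded LinkedBlockingQueue: " + usesUnboundedLinkedQueue());
        System.out.println("backlog after 1000 submissions: " + backlogAfterSubmitting(1000));
        System.out.println("batches for 20,000,000 rows: " + batches(20_000_000L, 999));
    }
}
```

If the workers are slower than the query loop, a large share of those batches can be waiting in the queue at the same time, each one pinning its result set in memory.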

In the end I switched to:
BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue, new ThreadPoolExecutor.CallerRunsPolicy());

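  This uses a fixed-length queue with the CallerRunsPolicy rejection policy: when a task can neither be executed nor added to the queue, the submitting (main) thread calls its run method directly. While the main thread is busy running that task it cannot submit more work, so in the worst case the job drifts from multi-threaded toward single-threaded execution and loses throughput.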
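The behaviour of CallerRunsPolicy can be seen in a deliberately tiny setup. The sketch below is not the production configuration: it assumes a pool of one thread and a queue of one slot purely so that the third submission is guaranteed to be rejected, and the class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class CallerRunsDemo {
    // Blocks the current thread until the latch is released.
    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the name of the thread that ended up running the task the pool had no room for.
    static String threadThatRanOverflowTask() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(1), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<String> overflowRunner = new AtomicReference<>();
        try {
            pool.execute(() -> await(release)); // occupies the single worker
            pool.execute(() -> { });            // fills the one-slot queue
            // No free worker and no queue space: the policy runs this task in the calling thread.
            pool.execute(() -> overflowRunner.set(Thread.currentThread().getName()));
            return overflowRunner.get();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller thread:        " + Thread.currentThread().getName());
        System.out.println("overflow task ran on: " + threadThatRanOverflowTask());
    }
}
```

The two names match, which is the back-pressure effect described above: the producer is forced to do a unit of work itself instead of queuing more.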

I'll see how it behaves tomorrow.

 

 

Postscript:

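  A better way to get blocking behaviour from the thread pool: write a custom rejection policy that blocks the main thread when the queue is full and lets it continue once the queue has been drained.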

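new RejectedExecutionHandler() {
	@Override
	public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
		// Only re-queue while the pool is still accepting work
		if (!executor.isShutdown()) {
			try {
				// put() blocks the submitting thread until the queue has space
				executor.getQueue().put(r);
			} catch (InterruptedException e) {
				// not expected here; restore the flag so the caller can still see the interrupt
				Thread.currentThread().interrupt();
			}
		}
	}
};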

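  The handler uses put instead of offer: put blocks while the queue is full, whereas offer returns false immediately. The thread pool's design really is quite interesting.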
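Here is a minimal sketch of how such a handler might be wired into a pool, together with a check of the offer behaviour mentioned above. The pool size (4), queue length (8) and all class, field and method names are assumptions chosen for the example, not the values used in the real job.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BlockingSubmitDemo {
    // offer() on a full bounded queue gives up at once and reports failure.
    static boolean offerOnFullQueue() {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1);
        queue.offer(() -> { });        // takes the only slot
        return queue.offer(() -> { }); // queue is full, so this returns false
    }

    // Rejection handler that makes the submitting thread wait for queue space.
    static final RejectedExecutionHandler BLOCK_WHEN_FULL = (r, executor) -> {
        if (!executor.isShutdown()) {
            try {
                executor.getQueue().put(r); // blocks until a worker frees a slot
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    };

    // Submits `tasks` jobs through a small bounded pool and returns how many completed.
    static int runAll(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(4, 4, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(8), BLOCK_WHEN_FULL);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            // When the queue is full, execute() does not return until the handler's put() succeeds.
            pool.execute(() -> completed.incrementAndGet());
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("offer on full queue: " + offerOnFullQueue());
        System.out.println("tasks completed: " + runAll(1000));
    }
}
```

Compared with CallerRunsPolicy, the producer here only waits and never runs tasks itself, so at most eight batches are ever pending while all worker threads stay busy.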

 

For details, see: 並發編程網, "支持生產阻塞的線程池" (a thread pool that supports blocking the producer)

 

