A record of troubleshooting a concurrent mode failure and the approach to fixing it

Reposted article, originally published 2017-07-10 15:11. Category: Technology / Java concurrent programming

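Background: a scheduled background script runs a batch job every day at 05:30 that scans the database and applies business logic to the rows.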

GC error log:

2017-07-05T05:30:54.408+0800: 518534.458: [CMS-concurrent-mark-start]
2017-07-05T05:30:55.279+0800: 518535.329: [GC 518535.329: [ParNew: 838848K->838848K(1118464K), 0.0000270 secs]
[CMS-concurrent-mark: 1.564/1.576 secs] [Times: user=10.88 sys=0.31, real=1.57 secs]
 (concurrent mode failure): 2720535K->2719116K(2796224K), 13.3742340 secs] 
 3559383K->2719116K(3914688K), 
 [CMS Perm : 38833K->38824K(524288K)], 13.3748020 secs] [Times: user=16.19 sys=0.00, real=13.37 secs]
2017-07-05T05:31:08.659+0800: 518548.710: [GC [1 CMS-initial-mark: 2719116K(2796224K)] 2733442K(3914688K), 0.0065150 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-07-05T05:31:08.666+0800: 518548.716: [CMS-concurrent-mark-start]
2017-07-05T05:31:09.528+0800: 518549.578: 
[GC 518549.578: [ParNew: 838848K->19737K(1118464K), 0.0055800 secs] 
3557964K->2738853K(3914688K), 0.0060390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
[CMS-concurrent-mark: 1.644/1.659 secs] [Times: user=14.15 sys=0.84, real=1.66 secs]
2017-07-05T05:31:10.326+0800: 518550.376: [CMS-concurrent-preclean-start]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-abortable-preclean-start]

Reference: understanding-cms-gc-logs

From it I learned that the cause of a concurrent mode failure is: there was not enough space in the CMS generation to promote the worst case surviving young generation objects. We name this failure as “full promotion guarantee failure”
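To make the quoted cause concrete, the figures in the 05:30:55 log entry can be checked with a little arithmetic. This is only an illustrative sketch: the class and method names are made up here, and the three constants are copied from the log (838848K used in the young generation, and 2720535K used out of 2796224K in the CMS generation).

```java
final class PromotionGuaranteeCheck {
    // Figures copied from the 05:30:55 log entry, in KB.
    static final long YOUNG_USED_KB = 838_848L;      // ParNew: 838848K->838848K
    static final long OLD_USED_KB = 2_720_535L;      // CMS generation before the full collection
    static final long OLD_CAPACITY_KB = 2_796_224L;  // CMS generation capacity

    // Space left in the CMS (old) generation.
    static long oldFreeKb() {
        return OLD_CAPACITY_KB - OLD_USED_KB;
    }

    // Worst case: everything currently in the young generation survives
    // and has to be promoted into the old generation.
    static boolean worstCasePromotionFits() {
        return YOUNG_USED_KB <= oldFreeKb();
    }

    // Old generation occupancy as a percentage of its capacity.
    static double oldOccupancyPercent() {
        return 100.0 * OLD_USED_KB / OLD_CAPACITY_KB;
    }

    public static void main(String[] args) {
        System.out.println("old free KB: " + oldFreeKb());
        System.out.println("worst-case promotion fits: " + worstCasePromotionFits());
        System.out.printf("old occupancy: %.1f%%%n", oldOccupancyPercent());
    }
}
```

With these numbers the CMS generation has only 75689K free, far less than the 838848K the young generation might need to promote in the worst case, which is consistent with the failure described in the quote.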

The suggested solutions are: The concurrent mode failure can either be avoided increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true.

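The second option needs to be weighed carefully, because setting CMSInitiatingOccupancyFraction too low can trigger CMS cycles too often and hurt performance. [See the advice against configuring CMS on heaps under 3 GB: why no cms under 3G]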

Troubleshooting:

1. JVM parameter configuration: -Xmx4096m -Xms2048m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails. Nothing here looks seriously wrong.

2. Judging by the alert times, the alarm fires once a day at 05:30, so the scheduled task is the likely culprit.

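This is easy to track down. The service is a script service with almost no online business logic, so finding the scheduled task that runs at that time and reading its business logic is enough to analyze the problem.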

Business code:

        int batchNumber = 1;
        int realCount = 0;
        int offset = 0;
        int limit = 999;
        int totalCount = 0;
        // create a fixed thread pool with 20 threads
        ExecutorService service = Executors.newFixedThreadPool(20);
        while (true) {
            LogUtils.info(logger, "{0},{1}->{2}", batchNumber, offset, (offset + limit));
            try {
                // paged query
                Set<String> result = query(offset, limit);
                realCount = result.size();
                // hand the fetched rows to the thread pool for processing
                service.execute(new AAAAAAA(result, batchNumber));
            } catch (Exception e) {
                LogUtils.error(logger, e, "exception,batch:{0},offset:{1},count:{2}", batchNumber, offset, limit);
                break;
            }
            totalCount += realCount;
            if (realCount < limit) {
                break;
            }
            batchNumber++;
            offset += limit;
        }
        service.shutdown();

The code uses a fixed thread pool of 20 threads. Each loop iteration fetches 999 rows from the database and hands them to the pool to run.

Analysis

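newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, and my data set has more than 20 million rows (2kw+). The loop keeps fetching data and pushing it into the queue far faster than the workers can drain it, so it is lucky the heap was not blown out completely.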
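The unbounded queue behind newFixedThreadPool can be observed directly. The following is a minimal sketch, not part of the original job: the class and method names are invented for illustration, and the cast to ThreadPoolExecutor is an assumption that holds for the OpenJDK implementation rather than something the Executors API guarantees.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

final class FixedPoolQueueInspection {
    // Returns how many more tasks the pool's work queue would accept.
    static int remainingQueueCapacity() {
        ExecutorService service = Executors.newFixedThreadPool(20);
        try {
            // Assumption: the returned pool is a ThreadPoolExecutor (true on OpenJDK).
            ThreadPoolExecutor pool = (ThreadPoolExecutor) service;
            if (!(pool.getQueue() instanceof LinkedBlockingQueue)) {
                throw new IllegalStateException("unexpected queue type: " + pool.getQueue().getClass());
            }
            // A LinkedBlockingQueue created without a capacity is bounded only by Integer.MAX_VALUE.
            return pool.getQueue().remainingCapacity();
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("remaining queue capacity: " + remainingQueueCapacity());
        System.out.println("Integer.MAX_VALUE:        " + Integer.MAX_VALUE);
    }
}
```

Because the queue never refuses a task, every batch of 999 rows that the loop submits stays reachable until a worker gets to it, which is how the old generation fills up.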

In the end I switched to:

BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue, new ThreadPoolExecutor.CallerRunsPolicy());

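This uses a fixed-length queue, and the rejection policy is CallerRunsPolicy. In other words, when a task can neither be executed nor added to the waiting queue, the main thread calls its run method directly. While it does so it cannot fetch or submit new batches, so the multithreaded job can degrade toward single-threaded execution and lose efficiency.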
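The caller-runs behaviour is easy to reproduce on a deliberately tiny pool. The sketch below is illustrative only: the class and method names are made up, and the pool is shrunk to one worker and one queue slot so that the third task is certain to be rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class CallerRunsDemo {
    // Returns true if the rejected task was executed by the submitting thread.
    static boolean rejectedTaskRanOnCaller() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(1), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<Thread> ranOn = new AtomicReference<>();
        try {
            // Task 1 occupies the only worker thread until it is released.
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Task 2 fills the single queue slot.
            pool.execute(() -> { });
            // Task 3 cannot be queued or given a new worker, so the policy runs it right here.
            pool.execute(() -> ranOn.set(Thread.currentThread()));
            return ranOn.get() == Thread.currentThread();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected task ran on caller: " + rejectedTaskRanOnCaller());
    }
}
```

The upside of this policy is natural back-pressure: memory stays bounded because the producer is busy running a task instead of submitting more.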

I will check tomorrow how well it works.

Postscript:

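A better way to make the thread pool block is this: write a custom rejection policy that blocks the main thread when the queue is full and lets it resume once the queue has been consumed.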

new RejectedExecutionHandler() {
	@Override
	public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
		if (!executor.isShutdown()) {
			try {
				executor.getQueue().put(r);
			} catch (InterruptedException e) {
				// restore the interrupt flag instead of swallowing it; the task is not queued
				Thread.currentThread().interrupt();
			}
		}
	}
};

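The handler uses put instead of offer: when the queue is full, put blocks until space becomes available, whereas offer returns false immediately. The design of the thread pool is quite interesting.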
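Putting the pieces together, a blocking rejection handler on a bounded queue might be exercised as follows. This is a hedged sketch rather than the production job: the class name, the pool size of 4, the queue capacity, and the counter standing in for the real batch processing are all assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BlockingSubmitDemo {
    static final int QUEUE_CAPACITY = 20; // hypothetical bound, mirrors the queue size used above

    // Submits the given number of dummy batches and returns how many were processed.
    static int runBatches(int batches) {
        // When the queue is full, block the submitting thread until a slot frees up.
        RejectedExecutionHandler blockWhenFull = (task, executor) -> {
            if (!executor.isShutdown()) {
                try {
                    executor.getQueue().put(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(4, 4, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(QUEUE_CAPACITY), blockWhenFull);
        AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < batches; i++) {
            // Stand-in for the real work on one batch of rows.
            pool.execute(processed::incrementAndGet);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pool did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed batches: " + runBatches(1000));
    }
}
```

Compared with CallerRunsPolicy, the producer here only waits and never does a worker's job, so all pool threads stay busy while at most QUEUE_CAPACITY pending batches are held in memory.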

For details, see: 並發編程網, "A thread pool that supports blocking producers" (支持生產阻塞的線程池)
