近期遇到一個堆外內存導致swap飆高的問題,這類問題比較罕見,因此將整個排查過程記錄下來了
現象描述
最近1周線上服務器時不時出現swap報警(swap超過內存10%時觸發報警,內存是4G,因此swap超過400M會觸發報警),每次都是童鞋們通過重啟tomcat解決的;
但導致的根本原因是什么呢?必須找到根本原因才行,總是這么重啟就有點low了
問題排查
於是找了1台占用了swap但還未觸發報警的服務器進行了排查
以下是當時通過top命令觀察到的結果
原因分析
到底是什么原因導致swap飆高呢?肯定是tomcat,因為每次重啟tomcat就解決了;但根本原因是?
誰占用了swap
通過以下腳本 swap.sh
#!/bin/bash # Get current swap usage for all running processes # Erik Ljungstrom 27/05/2011 do_swap () { SUM=0 OVERALL=0 for DIR in `find /proc/ -maxdepth 1 -type d | egrep "^/proc/[0-9]"` ; do PID=`echo $DIR | cut -d / -f 3` PROGNAME=`ps -p $PID -o comm --no-headers` for SWAP in `grep Swap $DIR/smaps 2>/dev/null| awk '{ print $2 }'` do let SUM=$SUM+$SWAP done echo "PID=$PID - Swap used: $SUM - ($PROGNAME )" let OVERALL=$OVERALL+$SUM SUM=0 done echo "Overall swap used: $OVERALL" } do_swap |awk -F[\ \(] '{print $5,$1,$8}' | sort -n | tail -3
可以看出PID=19911這個進程使用了324M的swap
通過grep進程號19911可以看出確實是tomcat占用swap最多
進程19911占用總的物理內存是3.1G,java占用的堆內內存大小為2.78G,剩下的320M是堆外內存占用的
Max memory = [-Xmx] + [-XX:MaxPermSize] + number_of_threads * [-Xss]
2779M=2048M+268M+463*1M
sudo -u tomcat ./jinfo -flag MaxPermSize 19911 -XX:MaxPermSize=268435456 java -XX:+PrintFlagsFinal -version | grep ThreadStackSize intx CompilerThreadStackSize = 0 {pd product} intx ThreadStackSize = 1024 {pd product} intx VMThreadStackSize = 1024 {pd product} java version "1.7.0_45" Java(TM) SE Runtime Environment (build 1.7.0_45-b18) Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode) java -XX:+PrintFlagsFinal -version | grep -i permsize uintx AdaptivePermSizeWeight = 20 {product} uintx MaxPermSize = 85983232 {pd product} uintx PermSize = 21757952 {pd product} java version "1.7.0_45" Java(TM) SE Runtime Environment (build 1.7.0_45-b18) Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
哪行代碼占用了堆外內存
堆內內存溢出可以直接通過MAT分析堆信息就可以定位到具體的代碼,但是對於堆外內存就必須通過BTrace來解決
google-perftools 定位類名和方法名
如何安裝和使用google-perftools見這里
由於要啟動google-perftools需要重啟tomcat,所以重啟tomcat后,PID從19911變成了9176
重啟tomcat后,會自動生成heap文件,文件名的命名規范是gperf_pid.xxx.heap,所以我們只需要關注gperf_9176.*即可
[xxxx@xxxx /home/xxx/logs]$ ll *.heap …... -rw-r--r-- 1 tomcat tomcat 5048 May 6 10:46 gperf_9171.0001.heap -rw-r--r-- 1 tomcat tomcat 5036 May 6 10:46 gperf_9173.0001.heap -rw-r--r-- 1 tomcat tomcat 5055 May 6 10:46 gperf_9174.0001.heap -rw-r--r-- 1 tomcat tomcat 5352 May 6 10:46 gperf_9175.0001.heap -rw-r--r-- 1 tomcat tomcat 1048563 May 6 10:46 gperf_9176.0001.heap -rw-r--r-- 1 tomcat tomcat 1048564 May 6 10:46 gperf_9176.0002.heap -rw-r--r-- 1 tomcat tomcat 1048563 May 6 10:47 gperf_9176.0003.heap -rw-r--r-- 1 tomcat tomcat 1048565 May 6 10:47 gperf_9176.0004.heap -rw-r--r-- 1 tomcat tomcat 1048574 May 6 10:49 gperf_9176.0005.heap -rw-r--r-- 1 tomcat tomcat 1048574 May 6 10:50 gperf_9176.0006.heap -rw-r--r-- 1 tomcat tomcat 1048568 May 6 10:51 gperf_9176.0007.heap -rw-r--r-- 1 tomcat tomcat 1048572 May 6 10:53 gperf_9176.0008.heap -rw-r--r-- 1 tomcat tomcat 1048564 May 6 10:55 gperf_9176.0009.heap -rw-r--r-- 1 tomcat tomcat 1048560 May 6 10:58 gperf_9176.0010.heap -rw-r--r-- 1 tomcat tomcat 1048563 May 6 11:00 gperf_9176.0011.heap -rw-r--r-- 1 tomcat tomcat 1048564 May 6 11:03 gperf_9176.0012.heap …...
分析heap文件
/home/google-perftools/bin/pprof --text /home/java /home/logs/gperf_9176.0010.heap Using local file /home/java. Using local file /home/logs/gperf_9176.0010.heap. Total: 186.4 MB 91.2 48.9% 48.9% 91.2 48.9% updatewindow 52.5 28.2% 77.1% 52.5 28.2% os::malloc 38.0 20.4% 97.4% 38.0 20.4% inflateInit2_ 3.0 1.6% 99.0% 3.0 1.6% init 0.8 0.4% 99.5% 0.8 0.4% ObjectSynchronizer::omAlloc 0.4 0.2% 99.7% 0.4 0.2% readCEN 0.3 0.2% 99.9% 38.3 20.5% Java_java_util_zip_Inflater_init 0.1 0.1% 100.0% 0.1 0.1% _dl_allocate_tls 0.0 0.0% 100.0% 0.0 0.0% _dl_new_object 0.0 0.0% 100.0% 1.1 0.6% Thread::Thread 0.0 0.0% 100.0% 0.0 0.0% CollectedHeap::CollectedHeap 0.0 0.0% 100.0% 0.0 0.0% Events::init 0.0 0.0% 100.0% 0.4 0.2% ZIP_Put_In_Cache0 0.0 0.0% 100.0% 0.0 0.0% read_alias_file 0.0 0.0% 100.0% 0.0 0.0% _nl_intern_locale_data
可以看出是java.util.zip.Inflater的init()占用了比較多的內存
通過BTrace定位代碼調用方
編寫代碼BtracerInflater.java對init方法進行攔截
import static com.sun.btrace.BTraceUtils.*; import com.sun.btrace.annotations.*; import java.nio.ByteBuffer; import java.lang.Thread; @BTrace public class BtracerInflater{ @OnMethod( clazz="java.util.zip.Inflater", method="/.*/" ) public static void traceCacheBlock(){ println("Who call java.util.zip.Inflater's methods :"); jstack(); } }
運行BTrace
[xxxx@l-xxx.xx.xx /home/xxx/btrace-bin/bin]$ sudo -u tomcat ./btrace -cp ../build 9176 BtracerInflater.java|more Who call java.util.zip.Inflater's methods : java.util.zip.Inflater.<init>(Inflater.java:102) java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:76) java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90) com.xxx.OrderDiffUtil.ungzip(OrderDiffUtil.java:54) com.xxx.OrderDiffUtil.parse(OrderDiffUtil.java:32) com.xxx.FaxOrderEventListener.takeSectionChangedInfo(FaxOrderEventListener.java:87) com.xxx.FaxOrderEventListener.onMessage(FaxOrderEventListener.java:46) .......
可以看出是OrderDiffUtil的ungzip()調用了java.util.zip.Inflater的init()
看看OrderDiffUtil.ungzip()
private static String ungzip(String encodeJson) { ByteArrayOutputStream out = new ByteArrayOutputStream(encodeJson.length() * 5); ByteArrayInputStream in = null; try { in = new ByteArrayInputStream(Base64.decode(encodeJson)); } catch (UnsupportedEncodingException e) { return "{}"; } try { GZIPInputStream gunzip = new GZIPInputStream(in); byte buffer[] = new byte[1024]; int len = 0; while ((len = gunzip.read(buffer)) != -1) { out.write(buffer, 0, len); } } catch (IOException e) { return "{}"; } try { return out.toString("ISO-8859-1"); } catch (UnsupportedEncodingException e) { } return "{}"; }
可見gunzip未被close
所以根本原因是未調用GZIPInputStream的close()關閉流導致堆外內存占用



