線上有一個消息消費服務xxx-consumer,使用spring-kafka框架,主線程批量從消費隊列(kafka)拉取交易系統生產的消息,然后提交到子線程池中挨個處理消費。
public abstract class AbstractMessageDispatchListener implements
BatchAcknowledgingMessageListener<String, Msg>, ApplicationListener<ApplicationReadyEvent> {
private ThreadPoolExecutor executor;
public abstract MessageWorker chooseWorker(ConsumerRecord<String, Msg> data);
@Override
public void onMessage(List<ConsumerRecord<String, Msg>> datas, Acknowledgment acknowledgment) {
List<Future<?>> futureList = new ArrayList<>(datas.size());
try {
CountDownLatch countDownLatch = new CountDownLatch(datas.size());
for (ConsumerRecord<String, Msg> data : datas) {
Future<?> future = executor.submit(new Worker(data, countDownLatch));
futureList.add(future);
}
countDownLatch.await(20000L - 2000, TimeUnit.MILLISECONDS);
long countDownLatchCount = countDownLatch.getCount();
if (countDownLatchCount > 0) {
return;
}
acknowledgment.acknowledge();
} catch (Exception e) {
logger.error("onMessage error ", e);
} finally {
for (Future<?> future : futureList) {
if (future.isDone() || future.isCancelled()) {
continue;
}
future.cancel(true);
}
}
}
@Override
public void onApplicationEvent(ApplicationReadyEvent event) {
ThreadFactoryBuilder builder = new ThreadFactoryBuilder();
builder.setNameFormat(this.getClass().getSimpleName() + "-pool-%d");
builder.setDaemon(false);
executor = new ThreadPoolExecutor(12,
12 * 2,
60L,
TimeUnit.SECONDS,
new ArrayBlockingQueue<>(100),
builder.build());
}
private class Worker implements Runnable {
private ConsumerRecord<String, Msg> data;
private CountDownLatch countDownLatch;
Worker(ConsumerRecord<String, Msg> data, CountDownLatch countDownLatch) {
this.data = data;
this.countDownLatch = countDownLatch;
}
@Override
public void run() {
try {
MessageWorker worker = chooseWorker(data);
worker.work(data.value());
} finally {
countDownLatch.countDown();
}
}
}
}
1. 問題背景
有一天早上xxx-consumer服務出現大量報警,人工排查發現30w+的消息未處理,業務日志正常,gc日志有大量Full gc,初步判斷因為Full gc導致消息處理慢,大量的消息積壓。

2. 堆棧分析
查看了近一個月的JVM內存信息,發現老年代內存無法被回收(9月22號的下降是因為服務有一次上線重啟),初步判斷發生了內存泄漏。

通過<jmap -dump:format=b,file=/home/work/app/xxx-consumer/logs/jmap_dump.hprof -F>命令導出內存快照,使用Memory Analyzer解析內存快照文件jmap_dump.hprof,發現有很明顯的內存泄漏提示:

進一步查看線程細節,發現創建了大量的ThreadLocalScope對象且循環引用:

同時我們也看到了分布式追蹤(dd-trace-java)jar包中的FakeSpan類,初步判斷是dd-trace-java中自研擴展的kafka插件存在內存泄漏bug。
3. 代碼分析
繼續查看dd-trace-java中kafka插件的代碼,其處理流程如下:
第一批消息
-
(SpringKafkaConsumerInstrumentation:L22)BatchAcknowledgingMessageListener.onMessage進入時,主線程會創建一個scope00=ThreadLocalScope(Type_BatchMessageListener_Value,toRestore=null)
-
(ExecutorInstrumentation:L21L47)消息被submit到線程池中處理時,子線程會創建一個scope10=ThreadLocalScope(Type_BatchMessageListener_Value,toRestore=null)
-
(SpringKafkaConsumerInstrumentation:L68)子線程處理消息時(ConsumerRecord.value),會創建一個scope11=ThreadLocalScope(Type_ConsumberRecord_Value,toRestore=scope10)
-
(ExecutorInstrumentation:L54)子線程處理完消息后,執行scope10.close(),而scopeManager.tlsScope.get()=scope11,命中ThreadLocalScope:L19,scope10和scope11均無法被GC
-
(SpringKafkaConsumerInstrumentation:L42)BatchAcknowledgingMessageListener.onMessage退出時,主線程會執行scope00.close(),scope00會被GC
第二批消息
-
(SpringKafkaConsumerInstrumentation:L22)BatchAcknowledgingMessageListener.onMessage進入時,主線程會創建一個scope01=ThreadLocalScope(Type_BatchMessageListener_Value,toRestore=null)
-
(ExecutorInstrumentation:L21L47)消息被submit到線程池中處理時,子線程會創建一個scope12=ThreadLocalScope(Type_BatchMessageListener_Value,toRestore=scope11)
-
(SpringKafkaConsumerInstrumentation:L68)子線程處理消息時(ConsumerRecord.value),會創建一個scope13=ThreadLocalScope(Type_ConsumberRecord_Value,toRestore=scope12)
-
(ExecutorInstrumentation:L54)子線程處理完消息后,執行scope12.close(),而scopeManager.tlsScope.get()=scope13,命中ThreadLocalScope:L19,scope12和scope13均無法被GC
-
(SpringKafkaConsumerInstrumentation:L42)BatchAcknowledgingMessageListener.onMessage退出時,主線程會執行scope01.close(),scope01會被GC
從上可以看到,主線程創建的ThreadLocalScope能被正確GC,而線程池中創建的ThreadLocalScope被循環引用,無法被正確GC,從而造成內存泄漏。
@AutoService(Instrumenter.class)
public final class SpringKafkaConsumerInstrumentation extends Instrumenter.Configurable {
@Override
public AgentBuilder apply(final AgentBuilder agentBuilder) {
return agentBuilder
.type(hasSuperType(named("org.springframework.kafka.listener.BatchAcknowledgingMessageListener")))
.transform(DDAdvice.create().advice(isMethod().and(isPublic()).and(named("onMessage")),
BatchMessageListenerAdvice.class.getName()))
.type(named("org.apache.kafka.clients.consumer.ConsumerRecord"))
.transform(DDAdvice.create().advice(isMethod().and(isPublic()).and(named("value")),
RecoredValueAdvice.class.getName()))
.asDecorator();
}
public static class BatchMessageListenerAdvice {
@Advice.OnMethodEnter(suppress = Throwable.class)
public static Scope before() {
FakeSpan span = new FakeSpan();
span.setKind(FakeSpan.Type_BatchMessageListener_Value);
Scope scope = GlobalTracer.get().scopeManager().activate(span, false);
return scope;
}
@Advice.OnMethodExit(suppress = Throwable.class)
public static void after(@Advice.Enter Scope scope) {
while (true) {
Span span = GlobalTracer.get().activeSpan();
if (span != null && span instanceof FakeSpan) {
FakeSpan fakeSpan = (FakeSpan) span;
if (fakeSpan.getKind().equals(FakeSpan.Type_ConsumberRecord_Value)) {
GlobalTracer.get().scopeManager().active().close();
} else {
break;
}
} else {
break;
}
}
if (scope != null) {
scope.close();
}
}
}
public static class RecoredValueAdvice {
@Advice.OnMethodEnter(suppress = Throwable.class)
public static void before(@Advice.This ConsumerRecord record) {
Span activeSpan = GlobalTracer.get().activeSpan();
if (activeSpan instanceof FakeSpan) {
FakeSpan proxy = (FakeSpan) activeSpan;
if (proxy.getKind().equals(FakeSpan.Type_ConsumberRecord_Value)) {
GlobalTracer.get().scopeManager().active().close();
activeSpan = GlobalTracer.get().activeSpan();
if (activeSpan instanceof FakeSpan) {
proxy = (FakeSpan) activeSpan;
}
}
if (proxy.getKind().equals(FakeSpan.Type_BatchMessageListener_Value)) {
final SpanContext spanContext = TracingKafkaUtils.extractSecond(record.headers(), GlobalTracer.get());
if (spanContext != null) {
FakeSpan consumerProxy = new FakeSpan();
consumerProxy.setContext(spanContext);
consumerProxy.setKind(FakeSpan.Type_ConsumberRecord_Value);
GlobalTracer.get().scopeManager().activate(consumerProxy, false);
}
}
}
}
}
}
@AutoService(Instrumenter.class)
public final class ExecutorInstrumentation extends Instrumenter.Configurable {
@Override
public AgentBuilder apply(final AgentBuilder agentBuilder) {
return agentBuilder
.type(not(isInterface()).and(hasSuperType(named(ExecutorService.class.getName()))))
.transform(DDAdvice.create().advice(named("submit").and(takesArgument(0, Runnable.class)),
SubmitTracedRunnableAdvice.class.getName()))
.asDecorator();
}
public static class SubmitTracedRunnableAdvice {
@Advice.OnMethodEnter(suppress = Throwable.class)
public static TracedRunnable wrapJob(
@Advice.This Object dis,
@Advice.Argument(value = 0, readOnly = false) Runnable task) {
if (task != null && (!dis.getClass().getName().startsWith("slick.util.AsyncExecutor"))) {
task = new TracedRunnable(task, GlobalTracer.get());
return (TracedRunnable) task;
}
return null;
}
}
public static class TracedRunnable implements Runnable {
private final Runnable delegate;
private final Span span;
private final Tracer tracer;
public TracedRunnable(Runnable delegate, Tracer tracer) {
this.delegate = delegate;
this.tracer = tracer;
if (tracer != null) {
this.span = tracer.activeSpan();
} else {
this.span = null;
}
}
@Override
public void run() {
Scope scope = null;
if (span != null && tracer != null) {
scope = tracer.scopeManager().activate(span, false);
}
try {
delegate.run();
} finally {
if (scope != null) {
scope.close();
}
}
}
}
}
public class ThreadLocalScopeManager implements ScopeManager {
final ThreadLocal<ThreadLocalScope> tlsScope = new ThreadLocal<ThreadLocalScope>();
@Override
public Scope activate(Span span, boolean finishOnClose) {
return new ThreadLocalScope(this, span, finishOnClose);
}
@Override
public Scope active() {
return tlsScope.get();
}
}
public class ThreadLocalScope implements Scope {
private final ThreadLocalScopeManager scopeManager;
private final Span wrapped;
private final boolean finishOnClose;
private final ThreadLocalScope toRestore;
ThreadLocalScope(ThreadLocalScopeManager scopeManager, Span wrapped, boolean finishOnClose) {
this.scopeManager = scopeManager;
this.wrapped = wrapped;
this.finishOnClose = finishOnClose;
this.toRestore = scopeManager.tlsScope.get();
scopeManager.tlsScope.set(this);
}
@Override
public void close() {
if (scopeManager.tlsScope.get() != this) {
// This shouldn't happen if users call methods in the expected order. Bail out.
return;
}
if (finishOnClose) {
wrapped.finish();
}
scopeManager.tlsScope.set(toRestore);
}
@Override
public Span span() {
return wrapped;
}
}
End
RecoredValueAdvice沒有銷毀自己創建的對象,而是寄希望於BatchMessageListenerAdvice去銷毀。
但(SpringKafkaConsumerInstrumentation:L27)BatchAcknowledgingMessageListener.onMessage退出時,只會close主線程創建的ThreadLocalScope,不會close線程池中創建的ThreadLocalScope,導致子線程創建的ThreadLocalScope被循環引用,無法被正確GC,從而造成內存泄漏。
