grpc報錯call already cancelled

本文轉載自查看原文 2020-09-22 11:04 1880 java

參考：https://sq.163yun.com/blog/article/180400600475590656

參考：https://stackoom.com/question/3rd87/gRPC-RPC%E8%B0%83%E7%94%A8%E4%B8%8A%E7%9A%84%E9%9A%8F%E6%9C%BACANCELED%E5%BC%82%E5%B8%B8

參考：https://blog.csdn.net/qq_14945847/article/details/102466693

GRPC是Google開源的一個高性能、跨語言的RPC框架，基於HTTP2協議，基於protobuf-3.x，基於Netty-4.x。

最近開發一個公司的游戲項目，采用分布式的服務器架構，內部服務之間均采用GRPC進行通信。但是在測試過程中發現一個重大的bug，偶現丟包。

前置信息：ServerA（作為grpc的client端）通過futureStub異步調用ServerB（作為server端）。

查看日志發現，client中有調用發送請求的日志，但server端卻沒有任何收到請求的日志。於是懷疑server在接收請求的過程中吃掉了異常，對整個過程加上try-catch，然而並沒有任何發現，於是懷疑根本沒有收到請求。通過tcpdump抓包后發現確實是client沒發包到網絡層，由此確定了是client端問題。

懷疑是grpc的futureStub有bug，於是在finally中加入future.get()打印出調用結果。結果一天下來一直沒有發現任何丟包，”完美“。懵逼了，去掉finally中的future get以后又出現丟包。這個時候，直覺判斷，由於調用futue.get()會block當前線程，所以不會出現丟包，而不block就會丟包。有種接近真相的感覺。檢查代碼后發現client端創建stub的時候沒有指定線程池，grpc會默認使用當前線程，結合偶現的表現，覺得非常像是線程在異步發送的過程中被銷毀導致。

然而，指定線程池以后，依舊發現丟包。無奈之下只有開始查看grpc源碼，並打印出grpc的日志。在日志中發現了如下錯誤：

----------------OUTBOUND--------------------
    [id: 0x7dee8b32, L:/10.82.7.14:60365 - R:/10.82.7.14:9001] RST_STREAM: streamId=4025, errorCode=8
------------------------------------

看到errorCode=8，並結合grpc源碼中對io.grpc.Status的定義：

 /**
     * Some resource has been exhausted, perhaps a per-user quota, or
     * perhaps the entire file system is out of space.
     */
RESOURCE_EXHAUSTED(8),

推斷是由於系統文件句柄資源耗盡導致，查看服務器系統文件句柄限制為65535，而進程打開的限制遠不止這么多。於是找sa將文件句柄限制改為100w，然后信心滿滿的重啟了全部服務，重新進行測試。結果很快又發現了丟包，內心崩潰！

由於當future.get()方法阻塞線程時，就不會出現丟包，無法重現。於是想到把futureStub改為stub，在保持異步的同時，通過observer回調來觀察調用結果,這次在回調的onError方法中發現了如下異常：

    CANCELLED,
    io.grpc.Status.asRuntimeException(Status.java:543), 
    io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395), 
    io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426), 
    io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:76), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:512), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:429), 
    io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:544), 
    io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52), 
    io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117), 
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142), 
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617), 
    java.lang.Thread.run(Thread.java:745

終於發現原來是這個call被grpc在底層cancel掉了。查代碼發現原來上面的errorCode=8是指http2的code定義：

    /**
     * Mapped to call cancellation when sent by a client. 
     * Mapped to CANCELLED when sent by a server. 
     * Note that servers should only use this mechanism when they need to cancel a call but the payload byte sequence is incomplete.
     */
    CANCEL(8)

知道了是由於被cancel才導致的丟包，那么就要查找為什么會被cancel，由於grpc的stub調用是通過異步的線性執行器執行io.grpc.ClientCall。所以堆棧中無法看到整個調用過程，只能通過源碼來看。不管是哪種stub的調用，最后都會走到ClientCalls.startCall()方法中，然后進ClientCallImpl.start()中，在這個方法中，可以看到與cancel有關的代碼：

    if (context.isCancelled()) {
      // Context is already cancelled so no need to create a real stream, just notify the observer
      // of cancellation via callback on the executor
      stream = NoopClientStream.INSTANCE;
      class ClosedByContext extends ContextRunnable {
        ClosedByContext() {
          super(context);
        }

        @Override
        public void runInContext() {
          closeObserver(observer, statusFromCancelled(context), new Metadata());
        }
      }

      callExecutor.execute(new ClosedByContext());
      return;
    }

可以看出來cancel的原因是由於context.isCancelled()的結果為true

  public boolean isCancelled() {
    if (parent == null || !cascadesCancellation) {
      return false;
    } else {
      return parent.isCancelled();
    }
  }

可以看出，如果parent==null，則不可能為true，所以必然是由於parent.isCancelled()為true導致的。由此我們基本可以猜測出來，當前的ClientCall被cancel的原因是由於它的parent context canceled導致。

回到ClientCallImpl中，可以看到this.context = Context.current();表明每個ClientCall被創建的時候，context都會被賦值為Context.current()

  public static Context current() {
    Context current = storage().current();
    if (current == null) {
      return ROOT;
    }
    return current;
  }

如果當前沒有context，則用ROOT作為context

  /**
   * The logical root context which is the ultimate ancestor of all contexts. This context
   * is not cancellable and so will not cascade cancellation or retain listeners.
   *
   * Never assume this is the default context for new threads, because {@link Storage} may define
   * a default context that is different from ROOT.
   */
  public static final Context ROOT = new Context(null);

在注釋中可以看出，ROOT是不可以被cancel的，所以可以判斷，當前必然有一個grpc的context存在。

最后聯想到由於業務代碼中，ClientCall是在ServerA處理grpc請求的過程中執行的，那么這個Context必然是由GrpcServer處理請求的時候生成的，並賦值給了ClientCall。而由於ClientCall是線性執行器中異步執行，當真正執行ContextRunnable的時候，GrpcServer的context可能已經被cancel，導致當前ClientCall被cancel。

解決辦法：可以對當前context進行fork，或者把每個call都封裝為Runnable，通過Context提供的run靜態方法執行。

至此，終於弄清楚了grpc丟包的原因。雖然，這個問題通過業務層代碼可以避免掉，但是我覺得grpc在底層默默的把這件事做了，並且沒有往上層拋出任何信息，這對開發者是非常不友好的。同時，這個cancel邏輯的處理上，我覺得也有待商榷。畢竟在GrpcServer中異步的去執行GrpcCall是一個非常普遍合理的模型。也許以后Google會改吧。

附：

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Exception in thread "main" java.lang.IllegalStateException: Task already scheduled or cancelled 解決vue路由跳轉報錯，Navigation cancelled ……to with a new navigation 解決報錯（Navigation cancelled from “/roleList“ to “/userlist“ with a new navigation.）問題BeanFactory not initialized or already closed - call 'refresh' before access gRPC GRPC php 報錯 Call to undefined function bcadd() there is already 'RtController' bean method 項目報錯 BeanFactory not initialized or already closed - call 'refresh' before accessing beans解決辦法 TF_Server gRPC failed, call return code:8:Received message larger than max (45129801 vs. 4194304)