stream是什么

nivdia給出的解釋是:
A sequence of operations that execute in issue-order on the GPU. 可以理解成在GPU上執行的操作序列.比如下面的這些動作.

cudaMemcpy()
kernel launch
device sync
cudaMemcpy()

不同的流操作可能是交叉執行的,可能是同事執行的.

流的API:

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );

我們可以把一個應用程序的整體對的stream的情況稱之為pipeline.優化程序以stream的角度就是優化pipeline

cuda overlap重疊

支持設備重疊的cuda GPU設備能夠在執行kernel函數時同時執行設備與主機之間的內存拷貝動作.可以用下面的代碼查看設備是否支持overlap:

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount( &dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) ...

cudaMemcpyAsync()

memcpy是以同步方式執行的,當函數返回時,復制操作已經完成.而cudaMemcpyAsync()是異步函數,它只是放置一個請求,表示在流中執行一次內存復制操作,這個復制操作是通過參數stream來指定的.當函數返回時我們無法保證函數已經執行完成,能夠保證的是復制操作肯定會在下一個放入流的操作之前執行完成.任何傳遞給cudaMemcpyAsync()的主機內存指針都必須已經通過cudaHostAlloc()分配好內存,也就是,你只能以異步方式對頁鎖定內存進行復制操作.

Vector stream add 向量流加法

優化這個pipeline,最理想的pipeline如下:

可以看到在同一時間,lanuch kernel, copy host to device, copy device back to host 三個任務同時執行. 有2個stream流,一個是copy, 一個用於執行kernel.

實際優化pipeline的時候並不是這么簡單和容易的,先看下面一段host代碼:

    for (int i=0; i<n; i+=SegSize*2) {
        cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),..., stream0);
        vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0,...);
        cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float) ,...,
                        stream1);
        vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
        cudaMemcpyAsync(d_C1, h_C+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
    }

這段代碼的pipeline的情況是: 執行kernel計算和下一塊拷貝主機內存到設備是同事進行的.

再看下面這段代碼:

for (int i=0; i<n; i+=SegSize*2) {
        cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),...,
                        stream1);
        vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
        vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
        cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),..., stream0);
        cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),...,
                        stream1);
}

這段代碼的pipeline情況是:和上一種的區別是把拷貝A和B元素與kernel並行,可以形象的理解成,下一行向左移動一下,那么整個pipeline整體是縮短了的.

strean 同步API

cudaStreamSynchronize(stream_id): 等待一個stream中的所有任務執行完成.

cudaDeviceSynchronize(): 不帶參數等待設備中所有流任務執行完成

Vector-stream-add Code

首先使用2個stream來做:

#include    <wb.h>
#define wbCheck(stmt) do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                       \
            wbLog(ERROR, "Got CUDA error ...  ", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)  

#define SegSize 256
#define StreamNum 2

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    //@@ Insert code to implement vector addition here
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;

    if(gidx< len)
    {
        out[gidx]= in1[gidx]+in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
  //  float * deviceInput1;
  //  float * deviceInput2;
  //  float * deviceOutput;
    float *h_A, *h_B, *h_C;
    
    //cudaStream_t stream0, stream1;
    //cudaStreamCreate(&stream0);
    //cudaStreamCreate(&stream1);
    float *d_A0, *d_B0, *d_C0;// device memory for stream 0
    float *d_A1, *d_B1, *d_C1;// device memory for stream 1

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");
    
    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);

    memcpy(h_A, hostInput1,inputLength*sizeof(float));
    memcpy(h_B, hostInput2,inputLength*sizeof(float));
    
    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));
    
    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for(int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));
    
    int main = inputLength/(SegSize*StreamNum);
    int left = inputLength%(SegSize*StreamNum);
    
    printf("main =%d, left=%d\n", main, left);
        int i = 0; // keep the increaing length
      for(i; i < inputLength; i+=SegSize*StreamNum)
    {
            cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
            cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]); 
            
            // block size is 256
            vecAdd<<<SegSize/256, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
            vecAdd<<<SegSize/256, SegSize, 1, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);

          //  cudaStreamSynchronize(yiming_stream0);
            cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
            //cudaStreamSynchronize(yiming_stream1);
            cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
    }
    
    // Process the remaining elements


    if(SegSize < left)
    {
        printf("AAAAAAA, left- size ==%d\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);

            
        // block size is 256
        vecAdd<<<1, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, left-SegSize, 1, streams[1]>>>(d_A0, d_B0, d_C0, left-SegSize);                                                                                                                                    

       // cudaStreamSynchronize(streams[0]);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize,cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C0, (left-SegSize)*sizeof(float),cudaMemcpyDeviceToHost, streams[1]);                                                                                                                                    
        
    //    i+=SegSize;
    //    left = left - SegSize;
    }
    else if(left > 0)
    {
        printf("BBBBBBB\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice);
        
        vecAdd<<<1, left, 1, streams[0]>>>(d_A0, d_B0, d_C0, left);
        
        //cudaDeviceSynchronize();
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost);    
    }
    
    cudaDeviceSynchronize(); 
    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);
    
    for(int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);
    return 0;
}

View Code

然后是使用4個流來做,code如下:

#include    <wb.h>
#define wbCheck(stmt) do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            wbLog(ERROR, "Failed to run stmt ", #stmt);                       \
            wbLog(ERROR, "Got CUDA error ...  ", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)  

#define SegSize 256
#define StreamNum 4

__global__ void vecAdd(float * in1, float * in2, float * out, int len) {
    //@@ Insert code to implement vector addition here
    int gidx = blockIdx.x*blockDim.x + threadIdx.x;

    if(gidx< len)
    {
        out[gidx]= in1[gidx]+in2[gidx];
    }
}

int main(int argc, char ** argv) {
    wbArg_t args;
    int inputLength, i;
    float * hostInput1;
    float * hostInput2;
    float * hostOutput;
  //  float * deviceInput1;
  //  float * deviceInput2;
  //  float * deviceOutput;
    float *h_A, *h_B, *h_C;
    
    //cudaStream_t stream0, stream1;
    //cudaStreamCreate(&stream0);
    //cudaStreamCreate(&stream1);
    float *d_A0, *d_B0, *d_C0;// device memory for stream 0
    float *d_A1, *d_B1, *d_C1;// device memory for stream 1
    float *d_A2, *d_B2, *d_C2;// device memory for stream 2
    float *d_A3, *d_B3, *d_C3;// device memory for stream 3

    args = wbArg_read(argc, argv);
    int Csize = SegSize*sizeof(float);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    printf("inputLength ==%d, SegSize =%d\n", inputLength, SegSize);
    wbTime_stop(Generic, "Importing data and creating memory on host");
    
    cudaHostAlloc((void**)&h_A, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, inputLength*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, inputLength*sizeof(float), cudaHostAllocDefault);

    memcpy(h_A, hostInput1,inputLength*sizeof(float));
    memcpy(h_B, hostInput2,inputLength*sizeof(float));
    
    wbCheck(cudaMalloc((void **)&d_A0, Csize));
    wbCheck(cudaMalloc((void **)&d_A1, Csize));
    wbCheck(cudaMalloc((void **)&d_B0, Csize));
    wbCheck(cudaMalloc((void **)&d_B1, Csize));
    wbCheck(cudaMalloc((void **)&d_C0, Csize));
    wbCheck(cudaMalloc((void **)&d_C1, Csize));
    wbCheck(cudaMalloc((void **)&d_A2, Csize));
    wbCheck(cudaMalloc((void **)&d_A3, Csize));
    wbCheck(cudaMalloc((void **)&d_B2, Csize));
    wbCheck(cudaMalloc((void **)&d_B3, Csize));
    wbCheck(cudaMalloc((void **)&d_C2, Csize));
    wbCheck(cudaMalloc((void **)&d_C3, Csize));
    
    cudaStream_t *streams = (cudaStream_t*) malloc(StreamNum * sizeof(cudaStream_t));
    for(int i = 0; i < StreamNum; i++)
        cudaStreamCreate(&(streams[i]));
    
    int main = inputLength/(SegSize*StreamNum);
    int left = inputLength%(SegSize*StreamNum);
    
    printf("main =%d, left=%d\n", main, left);
    for(i=0; i < inputLength; i+=SegSize*StreamNum)
    {
            cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
            cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]); 
            cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
            cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
            cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]);
            cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, Csize, cudaMemcpyHostToDevice, streams[3]); 
            
            // block size is 256
            vecAdd<<<SegSize/256, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
            vecAdd<<<SegSize/256, SegSize, 1, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
            vecAdd<<<SegSize/256, SegSize, 1, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
            vecAdd<<<SegSize/256, SegSize, 1, streams[3]>>>(d_A3, d_B3, d_C3, SegSize);
            
            cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
            //cudaStreamSynchronize(yiming_stream1);
            cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
            cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
            cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, Csize, cudaMemcpyDeviceToHost, streams[3]);
    }
    
    // Process the remaining elements
    if(SegSize*3 < left){
            printf("DDDDDDDD\n");
            cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
            cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]); 
            cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
            cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, Csize, cudaMemcpyHostToDevice, streams[2]);
            cudaMemcpyAsync(d_A3, hostInput1+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]);
            cudaMemcpyAsync(d_B3, hostInput2+i+SegSize*3, (left-SegSize*3)*sizeof(float), cudaMemcpyHostToDevice, streams[3]); 
            
            // block size is 256
            vecAdd<<<1, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
            vecAdd<<<1, SegSize, 1, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
            vecAdd<<<1, SegSize, 1, streams[2]>>>(d_A2, d_B2, d_C2, SegSize);
            vecAdd<<<1, (left-SegSize*3), 1, streams[3]>>>(d_A3, d_B3, d_C3, (left-SegSize*3));
            
            cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
            //cudaStreamSynchronize(yiming_stream1);
            cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
            cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, Csize, cudaMemcpyDeviceToHost, streams[2]);
            cudaMemcpyAsync(hostOutput+i+SegSize*3, d_C3, (left-SegSize*3)*sizeof(float), cudaMemcpyDeviceToHost, streams[3]);
    }
    else if(SegSize*2 < left){
            printf("CCCCCCCC\n");
            cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
            cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]);
            cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, Csize, cudaMemcpyHostToDevice, streams[1]); 
            cudaMemcpyAsync(d_A2, hostInput1+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
            cudaMemcpyAsync(d_B2, hostInput2+i+SegSize*2, (left-SegSize*2)*sizeof(float), cudaMemcpyHostToDevice, streams[2]);
            
            // block size is 256
            vecAdd<<<1, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
            vecAdd<<<1, SegSize, 1, streams[1]>>>(d_A1, d_B1, d_C1, SegSize);
            vecAdd<<<1, left-SegSize*2, 1, streams[2]>>>(d_A2, d_B2, d_C2, (left-SegSize*2));
            
            cudaMemcpyAsync(hostOutput+i, d_C0, Csize, cudaMemcpyDeviceToHost, streams[0]);
            //cudaStreamSynchronize(yiming_stream1);
            cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, Csize, cudaMemcpyDeviceToHost, streams[1]);
            cudaMemcpyAsync(hostOutput+i+SegSize*2, d_C2, (left-SegSize*2)*sizeof(float), cudaMemcpyDeviceToHost, streams[2]);
    
    }
    else if(SegSize < left)
    {
        printf("AAAAAAA, left- size ==%d\n", left-SegSize);
        cudaMemcpyAsync(d_A0, hostInput1+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_B0, hostInput2+i, Csize, cudaMemcpyHostToDevice, streams[0]);
        cudaMemcpyAsync(d_A1, hostInput1+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);
        cudaMemcpyAsync(d_B1, hostInput2+i+SegSize, (left-SegSize)*sizeof(float), cudaMemcpyHostToDevice, streams[1]);

            
        // block size is 256
        vecAdd<<<1, SegSize, 1, streams[0]>>>(d_A0, d_B0, d_C0, SegSize);
        vecAdd<<<1, left-SegSize, 1, streams[1]>>>(d_A0, d_B0, d_C0, left-
SegSize);                                                                                                                                    

       // cudaStreamSynchronize(streams[0]);
        cudaMemcpyAsync(hostOutput+i, d_C0, Csize,cudaMemcpyDeviceToHost, streams[0]);
        cudaMemcpyAsync(hostOutput+i+SegSize, d_C1, (left-SegSize)*sizeof(float),cudaMemcpyDeviceToHost, streams[1]);                                                                                                                                    
        
    //    i+=SegSize;
    //    left = left - SegSize;
    }
    else if(left > 0)
    {
        printf("BBBBBBB\n");
        cudaMemcpyAsync(d_A0, hostInput1+i, left*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(d_B0, hostInput2+i, left*sizeof(float), cudaMemcpyHostToDevice);
        
        vecAdd<<<1, left, 1, streams[0]>>>(d_A0, d_B0, d_C0, left);
        
        //cudaDeviceSynchronize();
        cudaMemcpyAsync(hostOutput+i, d_C0, left*sizeof(float), cudaMemcpyDeviceToHost);    
    }
    
    cudaDeviceSynchronize(); 
    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);
    
    for(int i = 0; i < StreamNum; i++)
        cudaStreamDestroy(streams[i]);

    cudaFree(d_A0);
    cudaFree(d_A1);
    cudaFree(d_B0);
    cudaFree(d_B1);
    cudaFree(d_C0);
    cudaFree(d_C1);
    cudaFree(d_A2);
    cudaFree(d_A3);
    cudaFree(d_B2);
    cudaFree(d_B3);
    cudaFree(d_C2);
    cudaFree(d_C3);
    return 0;
}

View Code

運行成功,但是遺留一個問題,當我把拷貝內存的代碼改成:

cudaMemcpyAsync(d_A0, h_A+i, Csize, cudaMemcpyHostToDevice, streams[0]); 即使用頁固定內存,結果就會錯誤,不明白為什么

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 JAVA 8 Streams rabbitMQ的Streams CUDA CUDA cuda Elasticsearch Data streams 淺談kafka streams Kafka Streams開發入門(9) java8的Streams Kafka Streams開發入門(2)