http://peghoty.blog.163.com/blog/static/493464092013016113254852/
http://blog.csdn.net/augusdi/article/details/12833235
CUDA memory model: http://blog.csdn.net/endlch/article/details/44538801
CUDA qualifiers: http://blog.csdn.net/shouhuxianjian/article/details/42427285
The basic idea: copy data from host memory to device (GPU) memory, run the parallel computation on the device, then copy the results from device memory back to host memory.
CUDA ships with the Thrust library, which plays a role similar to the C++ STL.
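For instance, a minimal Thrust sketch that sorts and sums an array on the GPU (the values here are arbitrary):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<int> d(4);                  // storage lives in GPU global memory
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 1;           // each assignment copies host -> device
    thrust::sort(d.begin(), d.end());                 // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction; sum == 9
    return sum == 9 ? 0 : 1;
}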
=========== Original notes below =========
A placeholder, to be filled in later.
(The listing above showed my machine's CUDA device parameters.)
Concepts to know: warp, shared memory, constant memory, texture memory (graphics-related, skipped), streams, atomic operations.
The 8 SPs belonging to one SM share a single instruction fetch/dispatch unit and a single block of shared memory.
A kernel executes in units of blocks.
A block must be assigned to a single SM; each thread of the block is dispatched to an SP for execution, but at any moment one SM can hold several active blocks waiting to run.
Threads of the same block start at the same instruction address. In theory they can follow different branches, but in practice, because the 8 SPs share one fetch/dispatch unit, the threads of a warp always execute the same instruction.
If the threads of a warp all jump into the same branch of a conditional, the actual execution time is just that branch's time;
otherwise the SM has to issue the instructions of every taken branch to every SP, and the execution time is the sum of the times of all those branches.
CUDA programs should therefore avoid branching where possible, and above all avoid divergent branches within a warp.
Warp: a warp consists of 32 consecutive threads. (Roughly speaking, a warp has 32 threads because for each issued warp instruction, the 8 SPs of the SM execute it 4 times over.) The warp is the real unit of execution. A divergence sketch follows.
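A minimal sketch of the difference (the kernel names are made up for illustration): branching on the warp index keeps every warp uniform, while branching on thread parity forces every warp to execute both paths one after the other.

__global__ void branch_by_warp(float *a) {
    int warp_id = threadIdx.x / 32;     // all 32 threads of a warp share one warp_id,
    if (warp_id % 2 == 0)               // so each warp takes exactly one path: no divergence
        a[threadIdx.x] *= 2.0f;
    else
        a[threadIdx.x] += 1.0f;
}

__global__ void branch_by_thread(float *a) {
    if (threadIdx.x % 2 == 0)           // odd and even threads coexist in every warp,
        a[threadIdx.x] *= 2.0f;         // so each warp serializes both paths: divergence
    else
        a[threadIdx.x] += 1.0f;
}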
Miscellaneous
Atomic operations. Concurrent writes to global memory can be batched: let each thread block first write (atomically) into shared memory, and only when the block has finished write the shared-memory results out to global memory.
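A sketch of this pattern, using a 256-bin histogram as a hypothetical example: the frequent atomics hit fast shared memory, and only one global atomicAdd per bin per block remains.

__global__ void hist_shared(const unsigned char *in, int n, unsigned int *bins) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;                          // clear this block's private bins
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&local[in[idx]], 1u);        // cheap shared-memory atomic
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);         // one global atomic per bin per block
}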
__syncthreads() implements synchronization among the threads of a block: when any thread reaches the barrier it pauses, and execution continues only after every thread of the block has reached the barrier.
Never place __syncthreads() inside a divergent branch.
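A minimal sketch of the hazard and of the safe pattern:

__global__ void barrier_demo(float *a) {
    // WRONG: threads with threadIdx.x >= 16 never reach this barrier, so the
    // threads that do reach it wait forever (deadlock):
    //   if (threadIdx.x < 16) { a[threadIdx.x] *= 2.0f; __syncthreads(); }

    // Safe: keep the divergent work inside the branch, hoist the barrier out
    if (threadIdx.x < 16)
        a[threadIdx.x] *= 2.0f;
    __syncthreads();                   // reached by every thread of the block
}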
Streams: nominally you can create many streams, but underneath there may only be two engines: kernel execution (GPU compute) and copy (data transfer between CPU and GPU). Each stream is a queue; operations are pushed into it and wait their turn to execute.
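A minimal two-stream pipeline sketch (all names and sizes are illustrative, and the chunk count is assumed even; for the copies to actually overlap with compute, the host buffer must be page-locked memory from cudaHostAlloc):

__global__ void work(float *p, int n);     // some kernel; the name is made up here

void pipeline(float *h, float *d, int chunks, int n, int grid, int block) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    for (int i = 0; i + 1 < chunks; i += 2) {
        // queue chunk i on stream 0 and chunk i+1 on stream 1, so the copy
        // engine and the compute engine can be busy at the same time
        cudaMemcpyAsync(d + i*n,     h + i*n,     n*sizeof(float), cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d + (i+1)*n, h + (i+1)*n, n*sizeof(float), cudaMemcpyHostToDevice, s1);
        work<<<grid, block, 0, s0>>>(d + i*n, n);
        work<<<grid, block, 0, s1>>>(d + (i+1)*n, n);
    }
    cudaStreamSynchronize(s0);             // wait until both queues drain
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}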
When splitting the iterations of a for loop across threads, there are two natural partitions (see the sketch after this list).
1. Split the range into k contiguous segments, one segment per thread.
2. Split by residue modulo k: each thread strides through the loop with an increment equal to the total thread count.
The second scheme is usually better: since the threads run in parallel, it guarantees that at every step the different threads access adjacent data locations.
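A sketch of both schemes on a trivial element-wise kernel:

__global__ void scale_chunked(float *a, int n) {
    // scheme 1: each thread owns one contiguous chunk
    int total = gridDim.x * blockDim.x;
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int len   = (n + total - 1) / total;
    int end   = min(n, (tid + 1) * len);
    for (int i = tid * len; i < end; i++)
        a[i] *= 2.0f;        // neighbouring threads touch far-apart addresses: uncoalesced
}

__global__ void scale_strided(float *a, int n) {
    // scheme 2: stride by the total thread count (a "grid-stride loop")
    int total = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += total)
        a[i] *= 2.0f;        // in each round adjacent threads read adjacent words: coalesced
}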
Parallel algorithms
Reduction: halve the range at each step. Taking summation as an example: first pass, first half + second half; second pass, first quarter + second quarter; and so on.
int i = blockDim.x/2;
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
if (cacheIndex == 0)
    c[blockIdx.x] = cache[0];
Better optimizations exist: loop unrolling and the like, e.g. the sketch below.
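Once i reaches 32 only a single warp is still active, so the last iterations can be unrolled with no barrier at all (a sketch in the style of the classic SDK reduction; volatile stops the compiler from caching the shared-memory reads in registers):

__device__ void warp_reduce(volatile float *cache, int tid) {
    cache[tid] += cache[tid + 32];   // from here down the active threads form one
    cache[tid] += cache[tid + 16];   // warp and run in lockstep, so no
    cache[tid] += cache[tid + 8];    // __syncthreads() is needed between steps
    cache[tid] += cache[tid + 4];
    cache[tid] += cache[tid + 2];
    cache[tid] += cache[tid + 1];
}
// in the loop above: stop when i == 32, then
//   if (cacheIndex < 32) warp_reduce(cache, cacheIndex);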
Prefix sum (scan):
for (d = 1; (1 << (d-1)) < n; d++)
    for all k in parallel
        if (k >= (1 << (d-1)))
            x[out][k] = x[in][k - (1 << (d-1))] + x[in][k]
        else
            x[out][k] = x[in][k]
or
for (d = 1; (1 << (d-1)) < n; d++)
    for all k in parallel
        tmp = x[k]
        if (k >= (1 << (d-1)))
            tmp = x[k - (1 << (d-1))] + x[k]
        __syncthreads()   // synchronize
        x[k] = tmp
Both algorithms above need at least twice the original storage; the underlying idea is doubling. A concrete kernel version is sketched below.
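A shared-memory, double-buffered kernel for the first variant, handling one block of n = blockDim.x elements (a sketch along the lines of the GPU Gems 3 naive scan; launch as scan_naive<<<1, n, 2*n*sizeof(float)>>>(out, in, n)):

__global__ void scan_naive(float *out, const float *in, int n) {
    extern __shared__ float tmp[];       // 2*n floats: two buffers, used ping-pong
    int k = threadIdx.x;
    int pout = 0, pin = 1;
    tmp[pout*n + k] = in[k];             // inclusive scan; shift by one for exclusive
    __syncthreads();
    for (int off = 1; off < n; off <<= 1) {
        pout = 1 - pout; pin = 1 - pin;  // swap the roles of the two buffers
        if (k >= off)
            tmp[pout*n + k] = tmp[pin*n + k - off] + tmp[pin*n + k];
        else
            tmp[pout*n + k] = tmp[pin*n + k];
        __syncthreads();
    }
    out[k] = tmp[pout*n + k];
}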
There are still more efficient scan algorithms.
for d := 0 to log2(n) - 1 do
    for k from 0 to n-1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] := x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
x[n-1] := 0
for d := log2(n) - 1 downto 0 do
    for k from 0 to n-1 by 2^(d+1) in parallel do
        t := x[k + 2^d - 1]
        x[k + 2^d - 1] := x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] := t + x[k + 2^(d+1) - 1]
The book also gives an even more efficient scan_best_kernel algorithm; omitted here.
Sorting algorithms:
Comparison-based sorting: sorting networks.
Non-comparison sorting: parallel radix sort. Prerequisite: scan.
Parallel radix sort: handle one binary digit at a time. Take the bit string 0 0 1 0 1 1 0 1 (elements numbered 1..8) as an example; after a stable sort on this bit the order should be 1 2 4 7 3 5 6 8.
Flip the bits: 1 1 0 1 0 0 1 0.
Take the prefix sums: 1 2 2 3 3 3 4 4.
For each element whose current bit is 0, its prefix sum is its output position: element 1 goes to position 1, element 2 to position 2, element 4 to position 3, element 7 to position 4.
Then write the elements whose current bit is 1, offsetting their position index by 4 (the number of 0s).
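A sketch of one pass (the "split" primitive) as host-side C++; on the GPU the std::partial_sum call would be replaced by the parallel scan above:

#include <vector>
#include <numeric>    // std::partial_sum

// one stable radix-sort pass on bit b: partition `in` into `out` by that bit
void split_by_bit(const unsigned *in, unsigned *out, int n, int b) {
    std::vector<int> flipped(n), psum(n);
    for (int i = 0; i < n; i++)
        flipped[i] = ((in[i] >> b) & 1) ^ 1;                        // 1 where the bit is 0
    std::partial_sum(flipped.begin(), flipped.end(), psum.begin()); // inclusive scan
    int zeros = psum[n - 1];                                        // total number of 0-bits
    for (int i = 0; i < n; i++) {
        if (flipped[i])
            out[psum[i] - 1] = in[i];           // 0-bits: position = prefix sum
        else
            out[zeros + i - psum[i]] = in[i];   // 1-bits: after all 0-bits, in order
    }
}

Running this pass for b = 0, 1, 2, ... (ping-ponging in and out) sorts the whole array.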
Matrix multiplication optimization:
Matrix product A*B = C, where A is m*p and B is p*n.
Optimization 1:
Tile C: each thread block computes one tile of C, say a small d*d block.
Each thread block then has to multiply a d*p strip of A by a p*d strip of B.
For efficient memory access, each block further partitions the p dimension of those strips, treating the product as a sum of products of smaller blocks.
Each small block is d*q or q*d and is kept in shared memory, which saves a great deal of global-memory bandwidth (see the kernel sketch below).
(The number of loop iterations goes up, but memory-access efficiency improves dramatically.)
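A sketch of the tiled kernel for the square case d = q = 16 (assuming m, p, n are all multiples of 16; launched with one 16*16 thread block per tile of C):

#define TILE 16
__global__ void matmul_tiled(const float *A, const float *B, float *C,
                             int m, int p, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;    // this thread's element of C
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < p / TILE; t++) {          // walk along the p dimension
        As[threadIdx.y][threadIdx.x] = A[row * p + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                          // the whole tile is now loaded
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                          // everyone is done with this tile
    }
    C[row * n + col] = acc;   // each element of A and B was read from global
}                             // memory only once per tile instead of TILE times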
Optimization 2:
Exploit the register file as well (register blocking); more efficient still, but somewhat more involved.
Matrix transpose optimization:
No optimization: copy to GPU memory, transpose, copy back to CPU memory. Drawback: on input each block reads along rows, which satisfies the coalescing condition; on output the writes are too far apart and cannot coalesce.
Optimization 1:
Tile the matrix; each tile is a small square, e.g. 16*16.
On input, each thread block handles one 16*16 tile and transposes it inside shared memory.
The big matrix is then transposed tile-by-tile when written out to global memory; within a tile no further transpose is needed, so the writes satisfy the coalescing condition.
Size the shared-memory array as 16*17 rather than 16*16: the elements of one column then live in different shared-memory banks, avoiding bank conflicts. A kernel sketch follows.
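A sketch of the tiled transpose with the 16*17 padding (assumes rows and cols are multiples of 16; launched with 16*16 thread blocks covering the input):

#define TDIM 16
__global__ void transpose_tiled(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TDIM][TDIM + 1];   // the +1 pad puts the elements of a
                                             // column in different banks: no conflict
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced row read
    __syncthreads();
    x = blockIdx.y * TDIM + threadIdx.x;     // swap the block coordinates, keep
    y = blockIdx.x * TDIM + threadIdx.y;     // threadIdx.x along the output row
    out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced row write
}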
Optimization 2:
Both the unoptimized version and optimization 1 still suffer from partition conflicts (partition camping). Optimization 2 adds a for-loop over tiles; I have not studied it in depth yet.
CUDA program optimization
Designing grid and block dimensions
A somewhat larger grid is generally better.
To keep the execution units busy, the number of threads per block should be a multiple of 32, ideally between 64 and 256.
The main consideration in choosing the block's dimensionality and per-dimension sizes is to avoid integer division and modulo operations; in practice it depends on the specific problem.
If the problem size is not sensitive to how it is partitioned, make blockDim.x 16 or a multiple of 16, which improves the efficiency of global-memory and shared-memory access. A common launch-sizing pattern is sketched below.
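(my_kernel, d_data and n are placeholders; the kernel must still bounds-check its global index):

int threads = 256;                          // multiple of 32, inside the 64..256 range
int blocks  = (n + threads - 1) / threads;  // round up so all n elements are covered
my_kernel<<<blocks, threads>>>(d_data, n);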
Memory access optimization
Use the characteristics of each kind of memory flexibly to reach the maximum usable bandwidth.
Instruction stream optimization
Etc.
CUDA assignments
Assignment 1: a simple CUDA application, matrix multiplication

#include <bits/stdc++.h>
//#include "cuda_runtime.h"
//#include "device_launch_parameters.h"

using namespace std;
#define N 2000

const int blocks  = 1 << 12;   // 4096 blocks * 1024 threads >= N*N elements
const int threads = 1 << 10;

long long a[N][N];
long long b[N][N];
long long c[N][N];

void init() {
    for(int i = 0; i < N; i++)
        for(int j = 0; j < N; j++)
            a[i][j] = i*i+j, b[i][j] = i+j*j, c[i][j] = 0;
}

__global__ void init_cuda(long long *c) {
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    if(id < N*N) c[id] = 0;
}

// one thread per element of C: thread id -> (row, col)
__global__ void mul_cuda(long long *a, long long *b, long long *c) {
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    if(id < N*N) {
        int row = id/N, col = id-row*N;
        for(int k = 0; k < N; k++)
            c[id] += a[row*N+k]*b[k*N+col];
    }
}

int main(int argc, char** argv) {
    int cstart = clock();
    init();
    if(argc > 1 && argv[1][0] == '0') {
        puts("not cuda");
        // CPU reference; this loop order keeps the innermost accesses row-contiguous
        for(int i = 0; i < N; i++)
            for(int j = 0; j < N; j++)
                for(int k = 0; k < N; k++)
                    c[i][k] += a[i][j]*b[j][k];
    }
    else {
        puts("cuda");
        long long *dev_a, *dev_b, *dev_c;
        cudaMalloc( (void**)&dev_a, sizeof a );
        cudaMemcpy(dev_a, a, sizeof a, cudaMemcpyHostToDevice);

        cudaMalloc( (void**)&dev_b, sizeof b );
        cudaMemcpy(dev_b, b, sizeof b, cudaMemcpyHostToDevice);

        cudaMalloc( (void**)&dev_c, sizeof c );

        init_cuda<<<blocks, threads>>>(dev_c);
        mul_cuda<<<blocks, threads>>>(dev_a, dev_b, dev_c);

        cudaMemcpy(c, dev_c, sizeof c, cudaMemcpyDeviceToHost);
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
    }
    printf("%lld, ", c[1233][1233]);
    printf("time: %d\n", int(clock()-cstart));
    return 0;
}
Assignment 2: convolution, constant memory

//compile command: nvcc cv.cu `pkg-config --cflags --libs opencv` -std=c++11
//execute command1: ./a.out CC.jpg 3
//execute command2: ./a.out CC.jpg 5
#include <bits/stdc++.h>

#include <opencv2/opencv.hpp>
//#include <opencv2/gpu/gpu.hpp>
using namespace cv;

Mat G3 = (Mat_<int>(3, 3) << -5,  3, -5,
                              3,  9,  3,
                             -5,  3, -5);

Mat G5 = (Mat_<int>(5, 5) <<  0, -1,  1, -1,  0,
                             -1,  1, -1,  1, -1,
                              0, -1,  8, -1,  0,
                             -1,  1, -1,  1, -1,
                              0, -1,  1, -1,  0);

void CPU_Sharpen(const Mat& myImage, Mat& Result, int ca) {
    CV_Assert(myImage.depth() == CV_8U);   // accept only uchar images

    int begin = clock();
    Result.create(myImage.size(), myImage.type());
    const int nChannels = myImage.channels();
    Mat &G = ca == 3? G3: G5;
    int half = G.rows >> 1;

    for(int row = half; row < myImage.rows-half; ++row) {
        uchar* output = Result.ptr<uchar>(row);
        for(int col = half*nChannels; col < nChannels * (myImage.cols - half); ++col) {
            int tmp = 0;
            for(int i = 0; i < G.rows; ++i)
                for(int j = 0; j < G.cols; ++j)
                    tmp += G.at<int>(i, j)*( *(myImage.ptr<uchar>(row-half+i)+(col-half*nChannels+j*nChannels)) );
            *output++ = saturate_cast<uchar>(tmp);
        }
    }
    // paint the border (never convolved) a uniform gray
    for(int i = 0; i < half; i++) {
        Result.row(i).setTo(Scalar(140));
        Result.row(Result.rows - 1 - i).setTo(Scalar(140));
        Result.col(i).setTo(Scalar(140));
        Result.col(Result.cols - 1 - i).setTo(Scalar(140));
    }
    printf("Time used %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);
}

/********************************************/

// the convolution kernels live in constant memory: cached, and broadcast
// efficiently when all threads of a warp read the same element
__constant__ int con_G3[3][3] = {
    {-5,  3, -5},
    { 3,  9,  3},
    {-5,  3, -5}
};
__constant__ int con_G5[5][5] = {
    { 0, -1,  1, -1,  0},
    {-1,  1, -1,  1, -1},
    { 0, -1,  8, -1,  0},
    {-1,  1, -1,  1, -1},
    { 0, -1,  1, -1,  0}
};

__global__ void init_cuda(uchar *c, int col_num) {
    int col_id = blockIdx.x, row_id = threadIdx.x;
    int now = (row_id*col_num+col_id)*3;
    c[now] = c[now+1] = c[now+2] = 0;
}

//GPU, sums num uchars starting from c (unused debugging kernel)
__global__ void test(uchar *c, int *sum, int num) {
    int x = 0;
    for(int i = 0; i < num; i++)
        x += c[i];
    *sum = x;
}

__global__ void con_cuda(uchar *s, uchar *t, int ca, int row_num, int col_num) {
    // indices shifted by one: -1 and the last row/column are skipped, but they
    // lie in the border, which is painted separately on the host
    int col_id = blockIdx.x-1, row_id = threadIdx.x-1;
    const int half = ca >> 1;
    if(row_id >= half && row_id < row_num-half && col_id >= half && col_id < col_num-half) {
        const int* con_mat = ca == 3? con_G3[0]: con_G5[0];
        int res[3] = {0, 0, 0};
        for(int i = 0; i < ca; i++)
            for(int j = 0; j < ca; j++) {
                //s[row_num][col_num][3];
                int pos = (row_id-half+i)*col_num*3+(col_id-half+j)*3;
                res[0] += con_mat[i*ca+j]*s[pos];
                res[1] += con_mat[i*ca+j]*s[pos+1];
                res[2] += con_mat[i*ca+j]*s[pos+2];
            }
        res[0] = res[0] < 0? 0: (res[0] > 255? 255: res[0]);   // clamp to uchar range
        res[1] = res[1] < 0? 0: (res[1] > 255? 255: res[1]);
        res[2] = res[2] < 0? 0: (res[2] > 255? 255: res[2]);
        int pos = row_id*col_num*3+col_id*3;
        t[pos] = res[0],
        t[pos+1] = res[1],
        t[pos+2] = res[2];
    }
}

/*******************************************/

void HANDLE(cudaError x) {
    if(x != cudaSuccess) {
        puts("error!");
        exit(0);
    }
}

int main(int argc, char** argv ) {
    if ( argc < 3 ) {
        printf("usage: a.out <Image_Path> <size of Mat>\n");
        return -1;
    }

    Mat src_img = imread(argv[1], 1), ans_CPU, ans_GPU;
    int ca = argv[2][0]-'0';
    printf("%d %d\n", src_img.rows, src_img.cols);

    /**********************************************************************************/

    printf("Run on CPU!\n");
    CPU_Sharpen(src_img, ans_CPU, ca);
    std::string s = std::string("CC")+std::to_string(ca)+std::string("_With_CPU.jpg");
    imwrite(s, ans_CPU);
    imshow("after operation", ans_CPU);
    waitKey();

    /**********************************************************************************/

    printf("Run on GPU!\n");
    int begin = clock();
    uchar *dev_src, *dev_result;
    int seg = src_img.cols*src_img.channels();
    HANDLE(cudaMalloc( (void**)&dev_src, src_img.rows*seg*sizeof(uchar)));
    HANDLE(cudaMalloc( (void**)&dev_result, src_img.rows*seg*sizeof(uchar)));
    /*Memcpy to dev_src, one row at a time*/
    for(int i = 0; i < src_img.rows; ++i)
        HANDLE(cudaMemcpy(dev_src+i*seg, src_img.ptr<uchar>(i), sizeof(uchar)*seg, cudaMemcpyHostToDevice));
    /*Init dev_result; one block per column, one thread per row (requires rows <= 1024)*/
    init_cuda<<<src_img.cols, src_img.rows>>>(dev_result, src_img.cols);
    /*Do convolution*/
    con_cuda<<<src_img.cols, src_img.rows>>>(dev_src, dev_result, ca, src_img.rows, src_img.cols);

    ans_GPU.create(src_img.size(), src_img.type());
    /*Memcpy to host*/
    for(int i = 0; i < ans_GPU.rows; ++i)
        cudaMemcpy(ans_GPU.ptr<uchar>(i), dev_result+i*seg, sizeof(uchar)*seg, cudaMemcpyDeviceToHost);

    for(int i = 0; i < (ca >> 1); i++) {
        ans_GPU.row(i).setTo(Scalar(140));
        ans_GPU.row(ans_GPU.rows - 1 - i).setTo(Scalar(140));
        ans_GPU.col(i).setTo(Scalar(140));
        ans_GPU.col(ans_GPU.cols - 1 - i).setTo(Scalar(140));
    }
    /*Free*/
    cudaFree(dev_src);
    cudaFree(dev_result);
    printf("Time used %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);
    imshow("after operation", ans_GPU);
    waitKey();
    return 0;
}
Assignment 3: convolution, streams

//compile command: nvcc cv.cu `pkg-config --cflags --libs opencv` -std=c++11
//execute command1: ./a.out 1.jpg 3
//execute command2: ./a.out 1.jpg 5
#include <bits/stdc++.h>

#include <opencv2/opencv.hpp>
//#include <opencv2/gpu/gpu.hpp>
using namespace cv;

/********************************************/

__constant__ int con_G3[3][3] = {
    {-5,  3, -5},
    { 3,  9,  3},
    {-5,  3, -5}
};
__constant__ int con_G5[5][5] = {
    { 0, -1,  1, -1,  0},
    {-1,  1, -1,  1, -1},
    { 0, -1,  8, -1,  0},
    {-1,  1, -1,  1, -1},
    { 0, -1,  1, -1,  0}
};

__global__ void init_cuda(uchar *c, int col_num) {
    int col_id = blockIdx.x, row_id = threadIdx.x;
    int now = (row_id*col_num+col_id)*3;
    c[now] = c[now+1] = c[now+2] = 0;
}

//GPU, sums num uchars starting from c (unused debugging kernel)
__global__ void test(uchar *c, int *sum, int num) {
    int x = 0;
    for(int i = 0; i < num; i++)
        x += c[i];
    *sum = x;
}

__global__ void con_cuda(uchar *s, uchar *t, int ca, int row_num, int col_num) {
    int col_id = blockIdx.x-1, row_id = threadIdx.x-1;
    const int half = ca >> 1;
    if(row_id >= half && row_id < row_num-half && col_id >= half && col_id < col_num-half) {
        const int* con_mat = ca == 3? con_G3[0]: con_G5[0];
        int res[3] = {0, 0, 0};
        for(int i = 0; i < ca; i++)
            for(int j = 0; j < ca; j++) {
                //s[row_num][col_num][3];
                int pos = (row_id-half+i)*col_num*3+(col_id-half+j)*3;
                res[0] += con_mat[i*ca+j]*s[pos];
                res[1] += con_mat[i*ca+j]*s[pos+1];
                res[2] += con_mat[i*ca+j]*s[pos+2];
            }
        res[0] = res[0] < 0? 0: (res[0] > 255? 255: res[0]);
        res[1] = res[1] < 0? 0: (res[1] > 255? 255: res[1]);
        res[2] = res[2] < 0? 0: (res[2] > 255? 255: res[2]);
        int pos = row_id*col_num*3+col_id*3;
        t[pos] = res[0],
        t[pos+1] = res[1],
        t[pos+2] = res[2];
    }
}

/*******************************************/

void HANDLE_ERROR(cudaError x) {
    if(x != cudaSuccess) {
        puts("error!");
        exit(0);
    }
}

int main(int argc, char** argv ) {
    if ( argc < 3 ) {
        printf("usage: a.out <Image_Path> <size of Mat>\n");
        return -1;
    }

    Mat src_img = imread(argv[1], 1), ans_CPU, ans_GPU;
    int ca = argv[2][0]-'0';
    printf("%d %d\n", src_img.rows, src_img.cols);
    /**********************************************************************************/

    printf("Run on GPU!\n");
    int begin = clock();
    /**********************************************************************************/

    uchar *dev_src, *dev_result;
    int seg = src_img.cols*src_img.channels();
    HANDLE_ERROR(cudaMalloc( (void**)&dev_src, src_img.rows*seg*sizeof(uchar)));
    HANDLE_ERROR(cudaMalloc( (void**)&dev_result, src_img.rows*seg*sizeof(uchar)));

    cudaStream_t stream0, stream1;
    HANDLE_ERROR( cudaStreamCreate( &stream0 ) );
    HANDLE_ERROR( cudaStreamCreate( &stream1 ) );
    // everything below is queued on stream0; stream1 is created but stays idle.
    // note: cudaMemcpyAsync only overlaps with compute when the host buffer is
    // page-locked (cudaHostAlloc); with pageable Mat memory it degrades to sync
    for(int i = 0; i < src_img.rows; ++i)
        HANDLE_ERROR( cudaMemcpyAsync(dev_src+i*seg, src_img.ptr<uchar>(i), sizeof(uchar)*seg, cudaMemcpyHostToDevice, stream0) );

    init_cuda<<<src_img.cols, src_img.rows, 0, stream0>>>(dev_result, src_img.cols);
    con_cuda<<<src_img.cols, src_img.rows, 0, stream0>>>(dev_src, dev_result, ca, src_img.rows, src_img.cols);

    ans_GPU.create(src_img.size(), src_img.type());

    for(int i = 0; i < ans_GPU.rows; ++i)
        HANDLE_ERROR( cudaMemcpyAsync(ans_GPU.ptr<uchar>(i), dev_result+i*seg, sizeof(uchar)*seg, cudaMemcpyDeviceToHost, stream0) );

    HANDLE_ERROR( cudaStreamSynchronize( stream0 ) );
    HANDLE_ERROR( cudaStreamSynchronize( stream1 ) );

    // paint the border only after the async copies into ans_GPU have finished
    for(int i = 0; i < (ca >> 1); i++) {
        ans_GPU.row(i).setTo(Scalar(140));
        ans_GPU.row(ans_GPU.rows - 1 - i).setTo(Scalar(140));
        ans_GPU.col(i).setTo(Scalar(140));
        ans_GPU.col(ans_GPU.cols - 1 - i).setTo(Scalar(140));
    }

    HANDLE_ERROR( cudaStreamDestroy( stream0 ) );
    HANDLE_ERROR( cudaStreamDestroy( stream1 ) );

    cudaFree(dev_src);
    cudaFree(dev_result);
    /********************without stream*********************/

    // uchar *dev_src, *dev_result;
    // int seg = src_img.cols*src_img.channels();
    // HANDLE_ERROR(cudaMalloc( (void**)&dev_src, src_img.rows*seg*sizeof(uchar)));
    // HANDLE_ERROR(cudaMalloc( (void**)&dev_result, src_img.rows*seg*sizeof(uchar)));
    // /*Memcpy to dev_src*/
    // for(int i = 0; i < src_img.rows; ++i)
    //     HANDLE_ERROR(cudaMemcpy(dev_src+i*seg, src_img.ptr<uchar>(i), sizeof(uchar)*seg, cudaMemcpyHostToDevice));
    // /*Init for dev_result*/
    // init_cuda<<<src_img.cols, src_img.rows>>>(dev_result, src_img.cols);
    // /*Do convolution*/
    // con_cuda<<<src_img.cols, src_img.rows>>>(dev_src, dev_result, ca, src_img.rows, src_img.cols);
    //
    // ans_GPU.create(src_img.size(), src_img.type());
    // /*Memcpy to host*/
    // for(int i = 0; i < ans_GPU.rows; ++i)
    //     HANDLE_ERROR( cudaMemcpy(ans_GPU.ptr<uchar>(i), dev_result+i*seg, sizeof(uchar)*seg, cudaMemcpyDeviceToHost) );
    //
    // for(int i = 0; i < (ca >> 1); i++) {
    //     ans_GPU.row(i).setTo(Scalar(140));
    //     ans_GPU.row(ans_GPU.rows - 1 - i).setTo(Scalar(140));
    //     ans_GPU.col(i).setTo(Scalar(140));
    //     ans_GPU.col(ans_GPU.cols - 1 - i).setTo(Scalar(140));
    // }

    /*Free*/
    // cudaFree(dev_src);
    // cudaFree(dev_result);
    printf("Time used %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);
    imshow("after operation", ans_GPU);
    imwrite("Tigerwith5.jpg", ans_GPU);
    waitKey();
    return 0;
}
Final project: photo wall + image blending

//compile command: nvcc final.cu `pkg-config --cflags --libs opencv` -std=c++11
#include <bits/stdc++.h>

#include <opencv2/opencv.hpp>
//#include <opencv2/gpu/gpu.hpp>
using namespace cv;

__global__ void init_cuda(uchar *c, int col_num) {
    int col_id = blockIdx.x, row_id = threadIdx.x;
    int now = (row_id*col_num+col_id)*3;
    c[now] = c[now+1] = c[now+2] = 0;
}

//GPU, sums num uchars starting from c (unused debugging kernel)
__global__ void test(uchar *c, int *sum, int num) {
    int x = 0;
    for(int i = 0; i < num; i++)
        x += c[i];
    *sum = x;
}

// blend the photo wall s into the cover image t: where t still has the cover's
// background colour (a, b, c), show the wall; elsewhere mix 35% wall + 65% cover
__global__ void solve(uchar *s, uchar *t, uchar a, uchar b, uchar c, int R, int C) {
    int x = blockIdx.x, y = threadIdx.x;   // one block per row, one thread per column
    if(x < R && y < C) {
        int g = 3*(x*C+y);
        if(t[g] == a && t[g+1] == b && t[g+2] == c) {
            t[g] = s[g];
            t[g+1] = s[g+1];
            t[g+2] = s[g+2];
        }
        else {
            t[g]   = 0.35*s[g]  +0.65*t[g];
            t[g+1] = 0.35*s[g+1]+0.65*t[g+1];
            t[g+2] = 0.35*s[g+2]+0.65*t[g+2];
        }
    }
}

// nearest-neighbour scaling of an R*C image to r*c (unused; resize() is used instead)
__global__ void zoom(uchar *s, uchar *t, int R, int C, int r, int c) {
    int x = blockIdx.x, y = threadIdx.x;
    if(x < r && y < c) {
        int row = x/(float)r*R, col = y/(float)c*C;
        //t[x][y] = s[row][col];
        t[ (x*c+y)*3 ]   = s[ (row*C+col)*3 ];
        t[ (x*c+y)*3+1 ] = s[ (row*C+col)*3+1 ];
        t[ (x*c+y)*3+2 ] = s[ (row*C+col)*3+2 ];
    }
}

/*******************************************/

void HANDLE(cudaError x) {
    if(x != cudaSuccess) {
        puts("error!");
        exit(0);
    }
}

const int N = 36;

int main(int argc, char** argv ) {
    Mat src_img[N], dst_img[N], ret1, ret2;
    int width = 1000000, height = 1000000;
    // read 1.jpg .. 36.jpg and find the smallest width/height among them
    for(int i = 0; i < N; i++) {
        src_img[i] = imread(std::to_string(i+1)+std::string(".jpg"), 1);
        int r = src_img[i].rows, c = src_img[i].cols;
        if(height > r) height = r;
        if(width > c) width = c;
    }
    height *= 0.4;
    width *= 0.17;

    //resize every photo to the common tile size
    //NB: cv::Size is (width, height), so each tile is `width` rows by `height` cols
    int begin = clock();
    for(int i = 0; i < N; i++) {
        dst_img[i].create(Size(height, width), src_img[i].type());
        resize(src_img[i], dst_img[i], Size(height, width));
    }
    printf("Time used in resizing is %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);

    int sq = sqrt(N+0.5);
    //std::cout << width << ' ' << height << std::endl;
    ret1.create(Size(height*sq, width*sq), src_img[0].type());
    //std::cout << ret1.rows << ' ' << ret1.cols << std::endl;

    //merge the 6*6 tiles into one big photo wall, row by row
    for(int i = 0; i < sq; i++)
        for(int j = 0; j < sq; j++) {
            for(int r = 0; r < width; r++)
                memcpy(ret1.ptr<uchar>(i*width+r)+j*height*3, dst_img[i*sq+j].ptr<uchar>(r), height*3);
        }

    Mat ret = imread("0.jpg", 1);   // the cover image
    resize(ret, ret2, Size(height*6, width*6));
    //std::cout << ret2.rows << ' ' << ret2.cols << std::endl;
    //imshow("", ret2);
    //waitKey();

    // take the top-left pixel of the cover as its background colour
    uchar a = *ret2.ptr<uchar>(0), b = *(ret2.ptr<uchar>(0)+1), c = *(ret2.ptr<uchar>(0)+2);
    if(ret1.rows != ret2.rows || ret1.cols != ret2.cols) puts("gg");
    int R = ret2.rows, C = ret2.cols;
    std::cout << R << ' ' << C << std::endl;
    //CPU
    begin = clock();
    for(int i = 0; i < R; i++) {
        uchar *p1 = ret1.ptr<uchar>(i), *p2 = ret2.ptr<uchar>(i);
        bool tag = true;
        double x = 0;
        for(int j = 0; j < C; j++) {
            if(*(p2+j*3) == a && *(p2+j*3+1) == b && *(p2+j*3+2) == c) {
                x = 0;
                *(p2+j*3) = *(p1+j*3);
                *(p2+j*3+1) = *(p1+j*3+1);
                *(p2+j*3+2) = *(p1+j*3+2);
                continue;
            }

            // peek 5 pixels ahead (may read past the row end on the last pixels)
            if(*(p2+j*3+15) == a && *(p2+j*3+16) == b && *(p2+j*3+17) == c) {
                x = 0;
                *(p2+j*3) = *(p1+j*3);
                *(p2+j*3+1) = *(p1+j*3+1);
                *(p2+j*3+2) = *(p1+j*3+2);
                continue;
            }

            // near the cover's edges, ramp the blend weight up and down
            x = tag? x+0.06: x-0.06;
            if(x > 1) tag = false;
            if(x < 0.6) tag = true;
            *(p2+j*3)   = (1-x)*(*(p1+j*3))  +x*(*(p2+j*3));
            *(p2+j*3+1) = (1-x)*(*(p1+j*3+1))+x*(*(p2+j*3+1));
            *(p2+j*3+2) = (1-x)*(*(p1+j*3+2))+x*(*(p2+j*3+2));
        }
    }
    printf("Time used in CPU is %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);

    imshow("", ret2);
    waitKey();
    imwrite("final.jpg", ret2);
    //GPU (note: ret2 was already blended by the CPU pass above, so this pass
    //     blends on top of that result rather than on the original cover)
    begin = clock();
    uchar *dev_src, *dev_result;
    HANDLE(cudaMalloc( (void**)&dev_src, R*C*3*sizeof(uchar)));
    HANDLE(cudaMalloc( (void**)&dev_result, R*C*3*sizeof(uchar)));
    for(int i = 0; i < R; ++i)
        cudaMemcpy(dev_src+i*C*3, ret1.ptr<uchar>(i), C*3*sizeof(uchar), cudaMemcpyHostToDevice);
    for(int i = 0; i < R; ++i)
        cudaMemcpy(dev_result+i*C*3, ret2.ptr<uchar>(i), C*3*sizeof(uchar), cudaMemcpyHostToDevice);
    solve<<<R, C>>>(dev_src, dev_result, a, b, c, R, C);   // requires C <= 1024
    for(int i = 0; i < R; ++i)
        cudaMemcpy(ret2.ptr<uchar>(i), dev_result+i*C*3, C*3*sizeof(uchar), cudaMemcpyDeviceToHost);
    printf("Time used in GPU is %.3fms\n", ((int)clock()-begin)*1000.0/CLOCKS_PER_SEC);

    imshow("", ret2);
    waitKey();
    cudaFree(dev_src);
    cudaFree(dev_result);
    imwrite("final2.jpg", ret2);

    return 0;
}
Emmmm, there may be bugs, and there is some redundant code left over.
Results: