無意中瀏覽一篇文章,中間提到了基於多尺度的圖像的細節提升算法,嘗試了一下,還是有一定的效果的,結合最近一直研究的SSE優化,把算法的步驟和優化過程分享給大家。
論文的全名是DARK IMAGE ENHANCEMENT BASED ON PAIRWISE TARGET CONTRAST AND MULTI-SCALE DETAIL BOOSTING,好像在百度上搜索不到,由於博客的空間不多了,這里就不上傳了, 我貼出論文核心的字段。
論文的核心思想類似於Retinex,使用了三個尺度的高斯模糊,再和原圖做減法,獲得不同程度的細節信息,然后通過一定的組合方式把這些細節信息融合到原圖中,從而得到加強原圖信息的能力。
值得一提的就是對D1的系數做了特殊的處理,這個是值得學習的。
這個算法的編碼實在是簡單,一個簡單的C語言代碼如下:
int IM_MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) { int Channel = Stride / Width; if ((Src == NULL) || (Dest == NULL)) return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0)) return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel != 3) && (Channel != 4)) return IM_STATUS_INVALIDPARAMETER; int Status = IM_STATUS_OK; unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 如果最后的大圖在這里內存有問題,一種解決辦法就是下面的模糊用BoxBlur unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 然后不要調用現有的BoxBlur函數,而是直接在本函數實現三種不同半徑的模糊 unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 運算速度(因為有些循環是公共的)會有點點提高,額外的內存占用則可以忽略 if ((B1 == NULL) || (B2 == NULL) || (B3 == NULL)) { if (B1 != NULL) free(B1); if (B2 != NULL) free(B2); if (B3 != NULL) free(B3); return IM_STATUS_OUTOFMEMORY; } Status = IM_ExpBlur(Src, B1, Width, Height, Stride, Radius); if (Status != IM_STATUS_OK) goto FreeMemory; Status = IM_ExpBlur(Src, B2, Width, Height, Stride, Radius * 2); if (Status != IM_STATUS_OK) goto FreeMemory; Status = IM_ExpBlur(Src, B3, Width, Height, Stride, Radius * 4); if (Status != IM_STATUS_OK) goto FreeMemory; for (int Y = 0; Y < Height * Stride; Y++) { int DiffB1 = Src[Y] - B1[Y]; int DiffB2 = B1[Y] - B2[Y]; int DiffB3 = B2[Y] - B3[Y]; Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]); } FreeMemory: free(B1); free(B2); free(B3); return Status; }
為避免浮點計算,我們將W1、W2、W3放大四倍,累加玩后在除以4就可以了,整個的書寫過程就是按照公式(13)進行的。
其中IM_Sign函數定義如下:
inline int IM_Sign(int X) { return (X >> 31) | (unsigned(-X)) >> 31; }
處理前 處理后(半徑5)
處理效果由上面兩幅圖的比較來說,還是相當明顯的。
上面的代碼中我用的ExpBlur代替了高斯模糊,關於指數模糊可以參考:SSE圖像算法優化系列五:超高速指數模糊算法的實現和優化(10000*10000在100ms左右實現) 一文,他的效果和高斯模糊差不多,速度要快不少。
當指數模糊使用SSE優化后,剩下的代碼用純C實現,對於1080P的24位彩色圖,測試時間為73毫秒,如果除去3次指數模糊,純C部分代碼耗時約40 ms,所以有很大的優化空間,我們就用SSE來處理。
第一,我們需要來處理IM_Sign這個函數,這個函數當參數大於0時,返回1,參數小於0時,返回-1,參數等於0時,返回0,SSE沒有直接這樣的函數,幸好之前收集過一個這個的自定義函數,寫的也很巧妙:
inline __m128i _mm_sgn_epi16(__m128i v) { #ifdef __SSSE3__ v = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later #else v = _mm_min_epi16(v, _mm_set1_epi16(1)); // use PMINSW/PMAXSW on SSE2/SSE3. v = _mm_max_epi16(v, _mm_set1_epi16(-1)); //_mm_set1_epi16(1) = _mm_srli_epi16(_mm_cmpeq_epi16(v, v), 15); //_mm_set1_epi16(-1) = _mm_cmpeq_epi16(v, v); #endif return v; }
如上所示,當系統只支持SSE2以及其下的版本時,直接用_mm_min_epi16和_mm_max_epi16這樣的函數硬實現,這個邏輯也很好理解。
如果系統支持SSE3及其以上的版本,系統提供了_mm_sign_epi16這個函數,關於這個函數其作用解釋如下:
// extern __m128i _mm_sign_epi16 (__m128i a, __m128i b); // Negate packed words in a if corresponding sign in b is less than zero. // Interpreting a, b, and r as arrays of signed 16-bit integers: for (i = 0; i < 8; i++) { if (b[i] < 0) { r[i] = -a[i]; } else if (b[i] == 0) { r[i] = 0; } else { r[i] = a[i]; } }
如果參數a傳值為1,不就能實現IM_Sign這個效果了嗎,真好玩。
解決了IM_Sign這個函數,其他的部分都很簡單了,考慮到數據范圍,把字節數據擴展為signed short類型處理就可以了,這樣一次性可以處理8個字節的數據,修改后的代碼如下所示:
int IM_MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) { int Channel = Stride / Width; if ((Src == NULL) || (Dest == NULL)) return IM_STATUS_NULLREFRENCE; if ((Width <= 0) || (Height <= 0)) return IM_STATUS_INVALIDPARAMETER; if ((Channel != 1) && (Channel != 3) && (Channel != 4)) return IM_STATUS_INVALIDPARAMETER; int Status = IM_STATUS_OK; unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 如果最后的大圖在這里內存有問題,一種解決辦法就是下面的模糊用BoxBlur unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 然后不要調用現有的BoxBlur函數,而是直接在本函數實現三種不同半徑的模糊 unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); // 運算速度(因為有些循環是公共的)會有點點提高,額外的內存占用則可以忽略 if ((B1 == NULL) || (B2 == NULL) || (B3 == NULL)) { if (B1 != NULL) free(B1); if (B2 != NULL) free(B2); if (B3 != NULL) free(B3); return IM_STATUS_OUTOFMEMORY; } Status = IM_ExpBlur(Src, B1, Width, Height, Stride, Radius); if (Status != IM_STATUS_OK) goto FreeMemory; Status = IM_ExpBlur(Src, B2, Width, Height, Stride, Radius * 2); if (Status != IM_STATUS_OK) goto FreeMemory; Status = IM_ExpBlur(Src, B3, Width, Height, Stride, Radius * 4); if (Status != IM_STATUS_OK) goto FreeMemory; int BlockSize = 8, Block = (Height * Stride) / BlockSize; __m128i Zero = _mm_setzero_si128(); __m128i Four = _mm_set1_epi16(4); for (int Y = 0; Y < Block * BlockSize; Y += BlockSize) { __m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero); __m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero); __m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero); __m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero); __m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1); __m128i DiffB2 = _mm_sub_epi16(SrcB1, SrcB2); __m128i DiffB3 = _mm_sub_epi16(SrcB2, SrcB3); __m128i Offset = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_sub_epi16(Four, _mm_slli_epi16(_mm_sgn_epi16(DiffB1), 1)), DiffB1), _mm_slli_epi16(DiffB2, 1)), DiffB3), 2); _mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero)); } for (int Y = Block * BlockSize; Y < Height * Stride; Y++) { int DiffB1 = Src[Y] - B1[Y]; int DiffB2 = B1[Y] - B2[Y]; int DiffB3 = B2[Y] - B3[Y]; Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]); } FreeMemory: free(B1); free(B2); free(B3); return Status; }
基本上就是按照C語言或者公式(13)所示的流程一步一步的編寫,只不過把有些乘法變成了移位。
對於1080P的彩色圖像,上述改動后處理時間變為了35ms,純C語言部分的耗時約在11ms左右,同之前的相比速度提高了4倍多,提速還是相當的明顯的。
在仔細觀察,覺得在IM_Sign這個的處理上還是有問題,雖然上述代碼能完美解決問題,但是總覺得有改進空間,當我們把IM_Sign(DiffB1) * DiffB1放在一起觀察時,就會發現這個整體不是可以直接使用_mm_sign_epi16予以實現嗎,比如 _mm_sign_epi16(DiffB1, DiffB1) 難道不是嗎? 這樣就少了一次乘法。
最后,我們把DiffB2, DiffB3展開,合並掉一些同類項,然后把乘以相同系數的變量在合並,又可以優化掉一些計算,最終的SSE部分代碼如下:
int BlockSize = 8, Block = (Height * Stride) / BlockSize; __m128i Zero = _mm_setzero_si128(); for (int Y = 0; Y < Block * BlockSize; Y += BlockSize) { __m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero); __m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero); __m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero); __m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero); __m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1); __m128i Offset = _mm_add_epi16(_mm_srai_epi16(_mm_sub_epi16(_mm_slli_epi16(_mm_sub_epi16(SrcB1, _mm_sign_epi16(DiffB1, DiffB1)), 1), _mm_add_epi16(SrcB2, SrcB3)), 2), DiffB1); _mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero)); }
雖然測試表明,速度沒有提高多少(主要是計算量太少),但這樣寫明顯合理很多。
測試工程的地址:http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar