SSE Image Algorithm Optimization Series 22: Optimizing Dr. Yuanhao Gong's Curvature Filter to About 1000 MPixels/Sec per Iteration


When Dr. Gong's curvature filter first appeared in 2015 it caused quite a stir in the image processing community, particularly because of the claimed simplicity of the algorithm and its advantages over other methods in both quality and execution efficiency. I paid some attention to it at the time, but since it is iterative, and the iteration count is fairly high, I assumed it could not be very fast and gave up on optimizing it further. Recently I stumbled across the paper and the code again, felt there was still considerable room for optimization, and took some time to implement the algorithm.

The author has fully open-sourced the project at https://github.com/YuanhaoGong/CurvatureFilter , which contains matlab, C++ and python implementations. The matlab code is fairly concise; the C++ code is based on opencv and also contains a lot of other material, so it looks somewhat cluttered to me. I only skimmed it.

Reading through the matlab code, except for the TV algorithm it does not seem to match the pseudocode given in the paper, and I do not know why. This article therefore only takes the optimization of the TV algorithm as an example. Among the GC, MC and TV variants, TV is the most complex and time-consuming to implement and best reflects the optimization limit, so once it is handled well, optimizing the other two follows naturally.

I will not introduce the principle of the algorithm (I am no expert on it). The pseudocode given in the paper is shown below:

(figure: the paper's pseudocode for the TV filter, together with the diagram of the four-subset domain decomposition)

Briefly, for each pixel we take the 3*3 neighborhood centered on it and perform the operation of Algorithm 15; the new pixel value is the original value plus dm. However, the whole image is split into four subsets, as shown in the right part of the figure above: white circles, black circles, white triangles and black triangles. These four subsets are updated independently, and the pixel values already updated in one subset are then used when updating the pixels that have not been updated yet. Once the whole image has been updated, the process is iterated again until the specified number of iterations is reached.

Searching Chinese sites, I found that CeleryChen's blog post 曲率濾波的簡單實現只有源碼 (a simple implementation of the curvature filter, source code only) mentions an implementation of this algorithm. Looking at it, its coding style is quite similar to mine (and much more pleasant to read than the original author's code), but what he wrote is not the TV variant. After changing it to the TV algorithm, the core code is as follows:

void TV_Curvature_Filter(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int StartRow, int StartCol)
{
    for (int Y = StartRow; Y < Height - 1; Y += 2)
    {
        unsigned char *LinePC = Src + Y       * Stride;
        unsigned char *LinePL = Src + (Y - 1) * Stride;
        unsigned char *LinePN = Src + (Y + 1) * Stride;
        unsigned char *LinePD = Dest + Y * Stride;
        for (int X = StartCol; X < Width - 1; X += 2)
        {
            short Dist[8] = { 0 };
            int C5 = 5 * LinePC[X];
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePC[X - 1] + LinePN[X - 1] + LinePN[X]) - C5;
            Dist[1] = (LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X] + LinePN[X + 1]) - C5;
            Dist[2] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[3] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[4] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePN[X - 1]) - C5;
            Dist[5] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - C5;
            Dist[6] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X - 1] + LinePC[X - 1]) - C5;
            Dist[7] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X + 1] + LinePC[X + 1]) - C5;

            __m128i Dist128 = _mm_loadu_si128((const __m128i*)Dist);
            __m128i AbsDist128 = _mm_abs_epi16(Dist128);
            __m128i MinPos128 = _mm_minpos_epu16(AbsDist128);

            LinePD[X] = IM_ClampToByte(LinePC[X] + Dist[_mm_extract_epi16(MinPos128, 1)] / 5) ;
        }
    }
}

void IM_CurvatureFilter(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    unsigned char *Temp = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
    memcpy(Temp, Src, Height * Stride * sizeof(unsigned char));
    memcpy(Dest, Src, Height * Stride * sizeof(unsigned char));
    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter(Temp, Dest, Width, Height, Stride, 1, 1);
        memcpy(Temp, Dest, Height * Stride * sizeof(unsigned char));
        TV_Curvature_Filter(Temp, Dest, Width, Height, Stride, 2, 2);
        memcpy(Temp, Dest, Height * Stride * sizeof(unsigned char));
        TV_Curvature_Filter(Temp, Dest, Width, Height, Stride, 1, 2);
        memcpy(Temp, Dest, Height * Stride * sizeof(unsigned char));
        TV_Curvature_Filter(Temp, Dest, Width, Height, Stride, 2, 1);
        memcpy(Temp, Dest, Height * Stride * sizeof(unsigned char));
    }
    free(Temp);
}

The code is indeed simple and short, well under 100 lines. Setting the iteration count to 50 and comparing with the corresponding matlab result:

      

(figures, left to right: original image; result of the C++ code after 50 iterations; result of the matlab code)

Looking closely, the C++ result differs noticeably from the matlab result around the hair and the hat, which explains why CeleryChen wrote in his code comments that he had planned to implement a faster and better curvature filter and open-source it, but found that by following the algorithm in the author's PhD thesis he could not reproduce the results reported in the paper.

The problem is actually not in CeleryChen's logic; he probably just overlooked the data range. In the code above he uses unsigned char for the intermediate data, and the accumulation during the update uses a statement like this:

      LinePD[X] = IM_ClampToByte(LinePC[X] + Dist[_mm_extract_epi16(MinPos128, 1)]);

The Clamp call squeezes the data back into the range 0 to 255. For a single pass this is not a big deal, but during iteration it causes significant problems; overall, with the same number of iterations the result is clearly weaker than the matlab code. All that is needed is to widen this intermediate data type, for example to float, as shown below:

void TV_Curvature_Filter(float *Src, float *Dest, int Width, int Height, int Stride, int StartRow, int StartCol)
{
    for (int Y = StartRow; Y < Height - 1; Y += 2)
    {
        float *LinePC = Src + Y       * Stride;
        float *LinePL = Src + (Y - 1) * Stride;
        float *LinePN = Src + (Y + 1) * Stride;
        float *LinePD = Dest + Y * Stride;
        for (int X = StartCol; X < Width - 1; X += 2)
        {
            float Dist[8] = { 0 };
            float C5 = 5 * LinePC[X];
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePC[X - 1] + LinePN[X - 1] + LinePN[X]) - C5;
            Dist[1] = (LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X] + LinePN[X + 1]) - C5;
            Dist[2] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[3] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[4] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePN[X - 1]) - C5;
            Dist[5] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - C5;
            Dist[6] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X - 1] + LinePC[X - 1]) - C5;
            Dist[7] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X + 1] + LinePC[X + 1]) - C5;

            float Min = abs(Dist[0]);
            int Index = 0;
            for (int Z = 1; Z < 8; Z++)
            {
                if (abs(Dist[Z]) < Min)
                {
                    Min = abs(Dist[Z]);
                    Index = Z;
                }
            }
            LinePD[X] = LinePC[X] + Dist[Index] * 0.2f;
        }
    }
}

void IM_CurvatureFilter(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    float *Temp1 = (float *)malloc(Height * Stride * sizeof(float));
    float *Temp2 = (float *)malloc(Height * Stride * sizeof(float));
        
    for (int Y = 0; Y < Height * Stride; Y++)    Temp1[Y] = Src[Y];
    memcpy(Temp2, Temp1, Height * Stride * sizeof(float));
    
    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride, 1, 1);
        memcpy(Temp1, Temp2, Height * Stride * sizeof(float));

        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride, 2, 2);
        memcpy(Temp1, Temp2, Height * Stride * sizeof(float));

        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride, 1, 2);
        memcpy(Temp1, Temp2, Height * Stride * sizeof(float));

        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride, 2, 1);
        memcpy(Temp1, Temp2, Height * Stride * sizeof(float));
    }
    for (int Y = 0; Y < Height * Stride; Y++)    Dest[Y] = IM_ClampToByte((int)(Temp2[Y] + 0.4999999f));
    free(Temp1);
    free(Temp2);
}

The result is as follows:

(figures, left to right: result of the C++ code; result of the matlab code; difference image)

The difference image is not completely black, which is related to small local differences between the m-code and the C++ code.

At this point we have completely eliminated the quality problem CeleryChen described.

The paper says the image is split into four subsets and, as in the code above, updated in four passes. This raises a question: what should the update order of the four subsets be, or, if you pick a particular order, what is the theoretical justification? Once one subset has been updated, the next subset's update depends on the already-updated data, so the order certainly affects the result. I could not find anything about this in the paper, so I simply experimented and generated the results of all the possible update orders. They do indeed differ, but only by one or two gray levels, so the order turns out not to matter much. One way to run such an ordering experiment is sketched below.
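A small sketch of that ordering experiment (my own illustration, reusing the byte version of TV_Curvature_Filter and the Temp/Dest buffers from the listing above; the helper name is hypothetical): the four sub-lattices are addressed purely by their (StartRow, StartCol) offsets, so testing a different order is just permuting a four-entry table.

// Hypothetical helper: run one full iteration with a caller-specified update order
// of the four sub-lattices; each entry of Order is a (StartRow, StartCol) pair.
void IM_CurvatureFilter_OneIteration(unsigned char *Temp, unsigned char *Dest,
    int Width, int Height, int Stride, const int Order[4][2])
{
    for (int I = 0; I < 4; I++)
    {
        TV_Curvature_Filter(Temp, Dest, Width, Height, Stride, Order[I][0], Order[I][1]);
        memcpy(Temp, Dest, Height * Stride * sizeof(unsigned char));
    }
}

// Example: int Order[4][2] = { { 2, 1 }, { 1, 2 }, { 2, 2 }, { 1, 1 } }; -- any permutation can be tried.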

But what problems arise if we do not split into subsets at all? Whether there is a theoretical issue I do not know, so let us simply measure it. Without blocking, the code is as follows:

void TV_Curvature_Filter(float *Src, float *Dest, int Width, int Height, int Stride)
{
    for (int Y = 1; Y < Height - 1; Y++)
    {
        float *LinePC = Src + Y       * Stride;
        float *LinePL = Src + (Y - 1) * Stride;
        float *LinePN = Src + (Y + 1) * Stride;
        float *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Width - 1; X++)
        {
            float Dist[8] = { 0 };
            float C5 = 5 * LinePC[X];
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePC[X - 1] + LinePN[X - 1] + LinePN[X]) - C5;
            Dist[1] = (LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X] + LinePN[X + 1]) - C5;
            Dist[2] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[3] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[4] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePN[X - 1]) - C5;
            Dist[5] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - C5;
            Dist[6] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X - 1] + LinePC[X - 1]) - C5;
            Dist[7] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X + 1] + LinePC[X + 1]) - C5;

            float Min = abs(Dist[0]);
            int Index = 0;
            for (int Z = 1; Z < 8; Z++)
            {
                if (abs(Dist[Z]) < Min)
                {
                    Min = abs(Dist[Z]);
                    Index = Z;
                }
            }
            LinePD[X] = LinePC[X] + Dist[Index] * 0.2f;
        }
    }
}
void IM_CurvatureFilter_01(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    float *Temp1 = (float *)malloc(Height * Stride * sizeof(float));
    float *Temp2 = (float *)malloc(Height * Stride * sizeof(float));

    for (int Y = 0; Y < Height * Stride; Y++)    Temp1[Y] = Src[Y];
    memcpy(Temp2, Temp1, Height * Stride * sizeof(float));

    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride);
        memcpy(Temp1, Temp2, Height * Stride * sizeof(float));
    }
    for (int Y = 0; Y < Height * Stride; Y++)    Dest[Y] = IM_ClampToByte((int)(Temp2[Y] + 0.4999999f));
    free(Temp1);
    free(Temp2);
}

Note that the increments of the loop variables in the two for loops of TV_Curvature_Filter have changed: with blocking the step is 2 pixels at a time, here it is 1. Also note that the loops stop at Width - 1 and Height - 1; this is purely for coding convenience, to prevent out-of-bounds memory accesses. In the matlab code the author pads the border instead; none of these details matter much.

Let us compare the results:

     

(figures, left to right: result without blocking; result with blocking; difference image)

There are indeed some differences, larger than the differences between the various block orders, but in my view not large enough to count as a qualitative change. The subsequent optimizations are therefore described for the non-blocked version, although with small modifications they also apply to the blocked one (the code becomes slightly more tedious and slightly slower).

Let us first record the current speed: 50 non-blocked iterations on the standard 512*512 lena grayscale image take about 390ms (VS2013 compiler, SSE optimization enabled by default).

First small optimization: each iteration contains a memcpy that copies the temporary destination data back into the temporary source. The copy is unnecessary; a simple pointer swap does the job, as shown below:

void IM_CurvatureFilter_01(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    float *Temp1 = (float *)malloc(Height * Stride * sizeof(float));
    float *Temp2 = (float *)malloc(Height * Stride * sizeof(float));
    float *Temp = NULL;
    for (int Y = 0; Y < Height * Stride; Y++)    Temp1[Y] = Src[Y];
    memcpy(Temp2, Temp1, Height * Stride * sizeof(float));

    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride);
        Temp = Temp1, Temp1 = Temp2, Temp2 = Temp;
        //memcpy(Temp1, Temp2, Height * Stride * sizeof(float));
    }
    for (int Y = 0; Y < Height * Stride; Y++)    Dest[Y] = IM_ClampToByte((int)(Temp1[Y] + 0.4999999f));
    free(Temp1);
    free(Temp2);
}

The execution time becomes about 385ms; this is just a small appetizer, nothing exciting.

Second attempt: change the logic of the algorithm. Notice that the computation of the 8 elements of Dist contains a lot of repeated work. If we do not care about the order in which the Dist values are computed, a pattern emerges:

Once the sum of the five elements of the first Dist has been computed, each subsequent Dist only needs to add one entering element and subtract one leaving element, and the subtraction of C5 only has to be done for the first one. TV_Curvature_Filter then becomes:

void TV_Curvature_Filter(float *Src, float *Dest, int Width, int Height, int Stride)
{
    for (int Y = 1; Y < Height - 1; Y++)
    {
        float *LinePC = Src + Y       * Stride;
        float *LinePL = Src + (Y - 1) * Stride;
        float *LinePN = Src + (Y + 1) * Stride;
        float *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Width - 1; X++)
        {
            float Dist[8] = { 0 };
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - 5 * LinePC[X];
            Dist[1] = Dist[0] - LinePL[X - 1] + LinePN[X];
            Dist[2] = Dist[1] - LinePL[X] + LinePN[X - 1];
            Dist[3] = Dist[2] - LinePL[X + 1] + LinePC[X - 1];
            Dist[4] = Dist[3] - LinePC[X + 1] + LinePL[X - 1];
            Dist[5] = Dist[4] - LinePN[X + 1] + LinePL[X];
            Dist[6] = Dist[5] - LinePN[X] + LinePL[X + 1];
            Dist[7] = Dist[6] - LinePN[X - 1] + LinePC[X + 1];
            float Min = abs(Dist[0]);
            int Index = 0;
            for (int Z = 1; Z < 8; Z++)
            {
                if (abs(Dist[Z]) < Min)
                {
                    Min = abs(Dist[Z]);
                    Index = Z;
                }
            }
            LinePD[X] = LinePC[X] + Dist[Index] * 0.2f;
        }
    }
}

The core computation drops from 32 additions and 8 subtractions to 11 additions and 8 subtractions, a large reduction; surely this should give a decent speedup.

Hit F5 and, damn, the runtime becomes 420ms. What a disappointment. Why does less computation take more time? Disassembling the compiler output shows that the original version does indeed emit many more instructions and the modified one considerably fewer. My own explanation is that the modified algorithm's computations are serially dependent: later instructions cannot execute until the earlier ones have finished, whereas the compiler and the CPU can exploit instruction-level parallelism in the original version; the modified scheme throws that property away.

So this path is, in practice, a failure. Dr. Gong has a similar optimization in his matlab code; after porting it to C I reached the same conclusion. Still, the idea is not completely useless, as we will mention later.

There are other routine optimizations, such as unrolling the inner loop of 8, or avoiding the array and using plain variables for the intermediate distances; in practice none of them showed any effect. So let us not waste time and go straight to SSE optimization.

Hand-writing SSE for the code above is not particularly difficult. The computation of each pixel is independent and does not interfere with neighboring pixels at all, which makes it ideal for parallelization (the same reasoning applies to SSE and to the GPU); basically we just translate each ordinary C operator into the corresponding SIMD instruction. Let us look at the few operators that have no direct SIMD counterpart.

The first is abs. SSE provides plenty of absolute-value instructions, but none for floating point, go figure. No matter; in plain C a common way to compute the absolute value of a float is:

inline float IM_AbsF(float V)
{
    int I = ((*(int *)&V) & 0x7fffffff);
    return (*(float *)&I);
}

With that, the same computation for the __m128 data type is also simple:

//    Absolute value of packed floats
inline __m128 _mm_abs_ps(__m128 v)
{
    static const int i = 0x7fffffff;
    float mask = *(float*)&i;
    return _mm_and_ps(v, _mm_set_ps1(mask));
}

Note that these short functions must be inline; sometimes we can even use __forceinline (VS compiler only).

Next, the index of the minimum. The minimum itself can be obtained directly with _mm_min_ps, but our final goal is not the minimum value; it is the index of the minimum, from which we then fetch the corresponding value, so we need another route.

SSE has a family of comparison functions, quite a few of them for floats, which produce a mask: lanes that satisfy the condition are set to 0xFFFFFFFF. We can first compute the minimum, then test whether one of the compared operands equals that minimum to obtain a mask. Conveniently, SSE also has the _mm_blendv_ps intrinsic, which uses a mask to decide which value to take as the result, giving us exactly the value we need:

    Min = _mm_min_ps(AbsDist0, AbsDist1);
    Value = _mm_blendv_ps(Dist0, Dist1, _mm_cmpeq_ps(Min, AbsDist1));
    Min = _mm_min_ps(Min, AbsDist2);
    Value = _mm_blendv_ps(Value, Dist2, _mm_cmpeq_ps(Min, AbsDist2));

Of course there is another way to achieve the same effect:

    Cmp = _mm_cmpgt_ps(AbsDist0, AbsDist1);
    Min = _mm_blendv_ps(AbsDist0, AbsDist1, Cmp);
    Value = _mm_blendv_ps(Dist0, Dist1, Cmp);

    Cmp = _mm_cmpgt_ps(Min, AbsDist2);
    Min = _mm_blendv_ps(Min, AbsDist2, Cmp);
    Value = _mm_blendv_ps(Value, Dist2, Cmp);

The two should not differ much in execution efficiency.

Here is the main body of the SSE-optimized code:

void TV_Curvature_Filter(float *Src, float *Dest, int Width, int Height, int Stride)
{
    int BlockSize = 4, Block = (Width - 2) / BlockSize;
    for (int Y = 1; Y < Height - 1; Y++)
    {
        float *LinePC = Src + Y       * Stride;
        float *LinePL = Src + (Y - 1) * Stride;
        float *LinePN = Src + (Y + 1) * Stride;
        float *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Block * BlockSize + 1; X += BlockSize)
        {
            __m128 FirstP0 = _mm_loadu_ps(LinePL + X - 1);
            __m128 FirstP1 = _mm_loadu_ps(LinePL + X);
            __m128 FirstP2 = _mm_loadu_ps(LinePL + X + 1);
            __m128 SecondP0 = _mm_loadu_ps(LinePC + X - 1);
            __m128 SecondP1 = _mm_loadu_ps(LinePC + X);
            __m128 SecondP2 = _mm_loadu_ps(LinePC + X + 1);
            __m128 ThirdP0 = _mm_loadu_ps(LinePN + X - 1);
            __m128 ThirdP1 = _mm_loadu_ps(LinePN + X);
            __m128 ThirdP2 = _mm_loadu_ps(LinePN + X + 1);

            __m128 C5 = _mm_mul_ps(SecondP1, _mm_set1_ps(5));
            __m128 Dist0 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(FirstP0, FirstP1), _mm_add_ps(FirstP2, SecondP2)), ThirdP2), C5);
            __m128 Dist1 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(FirstP1, FirstP2), _mm_add_ps(SecondP2, ThirdP2)), ThirdP1), C5);
            __m128 Dist2 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(FirstP2, SecondP2), _mm_add_ps(ThirdP2, ThirdP1)), ThirdP0), C5);
            __m128 Dist3 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(SecondP2, ThirdP2), _mm_add_ps(ThirdP1, ThirdP0)), SecondP0), C5);
            __m128 Dist4 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(ThirdP2, ThirdP1), _mm_add_ps(ThirdP0, SecondP0)), FirstP0), C5);
            __m128 Dist5 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(ThirdP1, ThirdP0), _mm_add_ps(SecondP0, FirstP0)), FirstP1), C5);
            __m128 Dist6 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(ThirdP0, SecondP0), _mm_add_ps(FirstP0, FirstP1)), FirstP2), C5);
            __m128 Dist7 = _mm_sub_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(SecondP0, FirstP0), _mm_add_ps(FirstP1, FirstP2)), SecondP2), C5);

            __m128 AbsDist0 = _mm_abs_ps(Dist0);
            __m128 AbsDist1 = _mm_abs_ps(Dist1);
            __m128 AbsDist2 = _mm_abs_ps(Dist2);
            __m128 AbsDist3 = _mm_abs_ps(Dist3);
            __m128 AbsDist4 = _mm_abs_ps(Dist4);
            __m128 AbsDist5 = _mm_abs_ps(Dist5);
            __m128 AbsDist6 = _mm_abs_ps(Dist6);
            __m128 AbsDist7 = _mm_abs_ps(Dist7);

            __m128 Min = _mm_min_ps(AbsDist0, AbsDist1);
            __m128 Value = _mm_blendv_ps(Dist0, Dist1, _mm_cmpeq_ps(Min, AbsDist1));
            Min = _mm_min_ps(Min, AbsDist2);
            Value = _mm_blendv_ps(Value, Dist2, _mm_cmpeq_ps(Min, AbsDist2));
            Min = _mm_min_ps(Min, AbsDist3);
            Value = _mm_blendv_ps(Value, Dist3, _mm_cmpeq_ps(Min, AbsDist3));
            Min = _mm_min_ps(Min, AbsDist4);
            Value = _mm_blendv_ps(Value, Dist4, _mm_cmpeq_ps(Min, AbsDist4));
            Min = _mm_min_ps(Min, AbsDist5);
            Value = _mm_blendv_ps(Value, Dist5, _mm_cmpeq_ps(Min, AbsDist5));
            Min = _mm_min_ps(Min, AbsDist6);
            Value = _mm_blendv_ps(Value, Dist6, _mm_cmpeq_ps(Min, AbsDist6));
            Min = _mm_min_ps(Min, AbsDist7);
            Value = _mm_blendv_ps(Value, Dist7, _mm_cmpeq_ps(Min, AbsDist7));

            _mm_storeu_ps(LinePD + X, _mm_add_ps(_mm_mul_ps(Value, _mm_set1_ps(0.20f)), SecondP1));
        }
        for (int X = Block * BlockSize + 1; X < Width - 1; X++)
        {
            //    fill in the scalar processing for the remaining pixels yourself
        }
    }
}

After compiling, the measured execution time is about 85-90ms, more than a 4x improvement over the previous 385ms.

If we change the computation of Dist0 through Dist7 to the serially dependent form described earlier, you will find that SSE behaves just like C: it also gets slower.

At this point the optimization could be considered done. Does that mean we have reached the limit for this algorithm? Not at all; let us keep digging.

As mentioned, CeleryChen's version fails to reach the desired result because it uses byte data as the intermediate buffer, which unreasonably clips many intermediate values; switching to float solved that. But could another data type work, say signed short? Let us analyze.

We know that short covers the range [-32768, 32767], much wider than byte, and it can represent negative numbers, which certainly occur among the intermediate results. Second, if we scale the original image data up by some factor, say 8, its dynamic range grows 8 times and the precision of the dm computation grows accordingly (since we use integers, the Dist values in the computation will of course also be integers). What we need is a reasonable scale factor such that every intermediate result of the computation stays within the range of short. In the TV algorithm at most 5 pixel values are summed; if we scale by 16, the maximum pixel value goes from 255 to 4080, and if all 5 pixels were 255 the sum would be 20400, below 32768. Scaling by 32 could already overflow, so 16 is the largest reasonable factor. A compile-time check of this bound is sketched below.
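A tiny compile-time check of the bound above (my own sketch, not part of the original code):

// With a 16x pre-scale the worst-case 5-pixel sum still fits in a signed short;
// with a 32x pre-scale it no longer would.
static_assert(5 * 255 * 16 <= 32767, "16x scaling keeps the 5-pixel sum within short");
static_assert(5 * 255 * 32 >  32767, "32x scaling could overflow a signed short");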

Of course, as the iterations proceed, intermediate values may exceed 4080, but an overflow would require several values above 4080 at once, and the iteration depends on the neighboring pixels, so the probability of this happening is very small; perhaps in theory it cannot happen at all.

Enough analysis; let us just write the code and test it:

void TV_Curvature_Filter(short *Src, short *Dest, int Width, int Height, int Stride)
{
    short Dist[8] = { 0 };
    for (int Y = 1; Y < Height - 1; Y++)
    {
        short    *LinePC = Src + Y       * Stride;
        short    *LinePL = Src + (Y - 1) * Stride;
        short    *LinePN = Src + (Y + 1) * Stride;
        short   *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Width - 1; X++)
        {
            short C5 = 5 * LinePC[X];
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePC[X - 1] + LinePN[X - 1] + LinePN[X]) - C5;
            Dist[1] = (LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X] + LinePN[X + 1]) - C5;
            Dist[2] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[3] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[4] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePN[X - 1]) - C5;
            Dist[5] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - C5;
            Dist[6] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X - 1] + LinePC[X - 1]) - C5;
            Dist[7] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X + 1] + LinePC[X + 1]) - C5;

            short Min = abs(Dist[0]);
            int Index = 0;
            for (int Z = 1; Z < 8; Z++)
            {
                if (abs(Dist[Z]) < Min)
                {
                    Min = abs(Dist[Z]);
                    Index = Z;
                }
            }
            LinePD[X] = LinePC[X] + ((Dist[Index] + 2)/ 5);
        }
    }
}

void IM_CurvatureFilter(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    short *Temp1 = (short *)malloc(Height * Stride * sizeof(short));
    short *Temp2 = (short *)malloc(Height * Stride * sizeof(short));
    short *Temp = NULL;
    for (int Y = 0; Y < Height * Stride; Y++)    Temp1[Y] = Src[Y] * 16;
    memcpy(Temp2, Temp1, Height * Stride * sizeof(unsigned short));

    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter(Temp1, Temp2, Width, Height, Stride);
        Temp = Temp1, Temp1 = Temp2, Temp2 = Temp;
    }

    for (int Y = 0; Y < Height * Stride; Y++)    Dest[Y] = IM_ClampToByte((Temp1[Y] + 8) / 16);
    free(Temp1);
    free(Temp2);
}

Quality comparison:

    

(figures, left to right: simplified short version; exact float version; difference image)

There are indeed some differences, particularly at edges, but the eye can hardly perceive them; in my opinion, if you want more speed this optimization is acceptable.

   那么上述未做優化的short版本執行時間也大概是390ms,說明現代CPU的浮點計算能力也是很強的。

In the version CeleryChen provided, we noticed that he does not perform any comparisons at all; instead he uses a function called _mm_minpos_epu16. Here is its MSDN description:

__m128i _mm_minpos_epu16(__m128i a );

Parameters

  • [in] a

    A 128-bit parameter that contains eight 16-bit unsigned integers.

Result value

         A 128-bit value. The lowest order 16 bits are the minimum value found in parameter a. The second-lowest order 16 bits are the index of the minimum value found in parameter a.

In other words, this function compares 8 unsigned 16-bit integers and returns the minimum in the first element of the result and the index of that minimum in the second. That fits our scenario remarkably well; it is a textbook use case.

Rewriting the code above in CeleryChen's style gives:

void TV_Curvature_Filter(short *Src, short *Dest, int Width, int Height, int Stride)
{
    short Dist[8] = { 0 };
    for (int Y = 1; Y < Height - 1; Y++)
    {
        short *LinePC = Src + Y       * Stride;
        short *LinePL = Src + (Y - 1) * Stride;
        short *LinePN = Src + (Y + 1) * Stride;
        short *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Width - 1; X++)
        {
            short C5 = 5 * LinePC[X];
            Dist[0] = (LinePL[X - 1] + LinePL[X] + LinePC[X - 1] + LinePN[X - 1] + LinePN[X]) - C5;
            Dist[1] = (LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X] + LinePN[X + 1]) - C5;
            Dist[2] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[3] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePC[X - 1] + LinePC[X + 1]) - C5;
            Dist[4] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X - 1] + LinePN[X - 1]) - C5;
            Dist[5] = (LinePL[X - 1] + LinePL[X] + LinePL[X + 1] + LinePC[X + 1] + LinePN[X + 1]) - C5;
            Dist[6] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X - 1] + LinePC[X - 1]) - C5;
            Dist[7] = (LinePN[X - 1] + LinePN[X] + LinePN[X + 1] + LinePL[X + 1] + LinePC[X + 1]) - C5;

            __m128i Dist128 = _mm_loadu_si128((const __m128i *)Dist);
            __m128i AbsDist128 = _mm_abs_epi16(Dist128);
            __m128i MinPos128 = _mm_minpos_epu16(AbsDist128);
            LinePD[X] = LinePC[X] + Dist[_mm_extract_epi16(MinPos128, 1)] / 5;
        }
    }
}

 

The speed immediately jumps to 260ms.

By the same logic, the computation of the 8 Dist values can also be SIMD-optimized. But notice that continuing to use _mm_minpos_epu16 now carries a cost: the 8 elements to be compared no longer sit in a single XMM register; they lie along the column direction across 8 XMM registers or memory locations. To keep using _mm_minpos_epu16, the 8 computed Dist vectors must be transposed, and the transpose itself takes quite a few operations. Whether that cost is significant is unclear, so let us code it up and measure:

static void TV_Curvature_Filter(short *Src, short *Dest, int Width, int Height, int Stride)
{
    int BlockSize = 8, Block = (Width - 2) / BlockSize;
    for (int Y = 1; Y < Height - 1; Y++)
    {
        short *LinePC = Src + Y       * Stride;
        short *LinePL = Src + (Y - 1) * Stride;
        short *LinePN = Src + (Y + 1) * Stride;
        short *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Block * BlockSize + 1; X += BlockSize)
        {
            __m128i FirstP0 = _mm_loadu_si128((__m128i *)(LinePL + X - 1));
            __m128i FirstP1 = _mm_loadu_si128((__m128i *)(LinePL + X));
            __m128i FirstP2 = _mm_loadu_si128((__m128i *)(LinePL + X + 1));
            __m128i SecondP0 = _mm_loadu_si128((__m128i *)(LinePC + X - 1));
            __m128i SecondP1 = _mm_loadu_si128((__m128i *)(LinePC + X));
            __m128i SecondP2 = _mm_loadu_si128((__m128i *)(LinePC + X + 1));
            __m128i ThirdP0 = _mm_loadu_si128((__m128i *)(LinePN + X - 1));
            __m128i ThirdP1 = _mm_loadu_si128((__m128i *)(LinePN + X));
            __m128i ThirdP2 = _mm_loadu_si128((__m128i *)(LinePN + X + 1));

            __m128i C5 = _mm_mullo_epi16(SecondP1, _mm_set1_epi16(5));
            __m128i Dist0 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP1), _mm_add_epi16(FirstP2, SecondP2)), ThirdP2), C5);
            __m128i Dist1 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP1, FirstP2), _mm_add_epi16(SecondP2, ThirdP2)), ThirdP1), C5);
            __m128i Dist2 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP2, SecondP2), _mm_add_epi16(ThirdP2, ThirdP1)), ThirdP0), C5);
            __m128i Dist3 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP2, ThirdP2), _mm_add_epi16(ThirdP1, ThirdP0)), SecondP0), C5);
            __m128i Dist4 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP2, ThirdP1), _mm_add_epi16(ThirdP0, SecondP0)), FirstP0), C5);
            __m128i Dist5 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP1, ThirdP0), _mm_add_epi16(SecondP0, FirstP0)), FirstP1), C5);
            __m128i Dist6 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP0, SecondP0), _mm_add_epi16(FirstP0, FirstP1)), FirstP2), C5);
            __m128i Dist7 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP0, FirstP0), _mm_add_epi16(FirstP1, FirstP2)), SecondP2), C5);

            __m128i S01L = _mm_unpacklo_epi16(Dist0, Dist1);                            // B3 A3 B2 A2 B1 A1 B0 A0
            __m128i S23L = _mm_unpacklo_epi16(Dist2, Dist3);                            // D3 C3 D2 C2 D1 C1 D0 C0
            __m128i S01H = _mm_unpackhi_epi16(Dist0, Dist1);                            // B7 A7 B6 A6 B5 A5 B4 A4
            __m128i S23H = _mm_unpackhi_epi16(Dist2, Dist3);                            // D7 C7 D6 C6 D5 C5 D4 C4

            __m128i S0123LL = _mm_unpacklo_epi32(S01L, S23L);                           // D1 C1 B1 A1 D0 C0 B0 A0
            __m128i S0123LH = _mm_unpackhi_epi32(S01L, S23L);                           // D3 C3 B3 A3 D2 C2 B2 A2
            __m128i S0123HL = _mm_unpacklo_epi32(S01H, S23H);                           // D5 C5 B5 A5 D4 C4 B4 A4
            __m128i S0123HH = _mm_unpackhi_epi32(S01H, S23H);                           // D7 C7 B7 A7 D6 C6 B6 A6

            __m128i S45L = _mm_unpacklo_epi16(Dist4, Dist5);                            // F3 E3 F2 E2 F1 E1 F0 E0
            __m128i S67L = _mm_unpacklo_epi16(Dist6, Dist7);                            // H3 G3 H2 G2 H1 G1 H0 G0
            __m128i S45H = _mm_unpackhi_epi16(Dist4, Dist5);                            // F7 E7 F6 E6 F5 E5 F4 E4
            __m128i S67H = _mm_unpackhi_epi16(Dist6, Dist7);                            // H7 G7 H6 G6 H5 G5 H4 G4

            __m128i S4567LL = _mm_unpacklo_epi32(S45L, S67L);                           // H1 G1 F1 E1 H0 G0 F0 E0
            __m128i S4567LH = _mm_unpackhi_epi32(S45L, S67L);                           // H3 G3 F3 E3 H2 G2 F2 E2
            __m128i S4567HL = _mm_unpacklo_epi32(S45H, S67H);                           // H5 G5 F5 E5 H4 G4 F4 E4
            __m128i S4567HH = _mm_unpackhi_epi32(S45H, S67H);                           // H7 G7 F7 E7 H6 G6 F6 E6

            Dist0 = _mm_unpacklo_epi64(S0123LL, S4567LL);                               // H0 G0 F0 E0 D0 C0 B0 A0
            Dist1 = _mm_unpackhi_epi64(S0123LL, S4567LL);                               // H1 G1 F1 E1 D1 C1 B1 A1
            Dist2 = _mm_unpacklo_epi64(S0123LH, S4567LH);                               // H2 G2 F2 E2 D2 C2 B2 A2
            Dist3 = _mm_unpackhi_epi64(S0123LH, S4567LH);                               // H3 G3 F3 E3 D3 C3 B3 A3
            Dist4 = _mm_unpacklo_epi64(S0123HL, S4567HL);                               // H4 G4 F4 E4 D4 C4 B4 A4
            Dist5 = _mm_unpackhi_epi64(S0123HL, S4567HL);                               // H5 G5 F5 E5 D5 C5 B5 A5
            Dist6 = _mm_unpacklo_epi64(S0123HH, S4567HH);                               // H6 G6 F6 E6 D6 C6 B6 A6
            Dist7 = _mm_unpackhi_epi64(S0123HH, S4567HH);                               // H7 G7 F7 E7 D7 C7 B7 A7

            LinePD[X + 0] = LinePC[X + 0] + (Dist0.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist0)), 1)] + 2) / 5;
            LinePD[X + 1] = LinePC[X + 1] + (Dist1.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist1)), 1)] + 2) / 5;
            LinePD[X + 2] = LinePC[X + 2] + (Dist2.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist2)), 1)] + 2) / 5;
            LinePD[X + 3] = LinePC[X + 3] + (Dist3.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist3)), 1)] + 2) / 5;
            LinePD[X + 4] = LinePC[X + 4] + (Dist4.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist4)), 1)] + 2) / 5;
            LinePD[X + 5] = LinePC[X + 5] + (Dist5.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist5)), 1)] + 2) / 5;
            LinePD[X + 6] = LinePC[X + 6] + (Dist6.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist6)), 1)] + 2) / 5;
            LinePD[X + 7] = LinePC[X + 7] + (Dist7.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist7)), 1)] + 2) / 5;
        }
        for (int X = Block * BlockSize + 1; X < Width - 1; X++)
        {
            //    fill in the scalar processing for the remaining pixels yourself
        }
    }
}

Note that a large fraction of the code above is only there to transpose the data. Explaining that part in detail would take another half day; I have annotated the register layout after each line in the comments, which you can work through yourself, or see my earlier article SSE圖像算法優化系列四:圖像轉置的SSE優化(支持8位、24位、32位),提速4-6倍 (Series 4: SSE optimization of image transpose, 8/24/32-bit, 4-6x faster).

We know that __m128i is internally a union, i.e. a memory variable; directly accessing its members in the program may prevent the compiler from mapping the variable to an actual register, which can slow things down to some degree (the value may have to be copied to a register before the operation). However, since the code already uses quite a few __m128i variables, the extra loss is probably limited. A portable alternative to the MSVC-only member access is sketched below.
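A small sketch of such an alternative (my own illustration, not from the original code; the helper name is hypothetical): instead of reading the MSVC-specific m128i_i16 member, store the vector to a small array once and index it, with the lane index still coming from _mm_minpos_epu16.

#include <smmintrin.h>    //    SSE4.1: _mm_minpos_epu16 (also pulls in SSSE3 _mm_abs_epi16)

// Hypothetical helper: return the signed lane whose absolute value is smallest,
// without touching the MSVC-only m128i_i16 union member.
static inline short PickLaneWithMinAbs(__m128i Dist)
{
    int Index = _mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist)), 1);
    short Buffer[8];
    _mm_storeu_si128((__m128i *)Buffer, Dist);
    return Buffer[Index];
}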

Note the +2 before the integer division at the end; this is for rounding, which matters for this kind of iterative computation — without it the result image may get darker and darker.

The execution time now settles at around 50ms.

Notice that the final part of the code above is still plain C (the +2, /5 and so on). Why not do that part in SSE as well? The problem is the division: I do not quite understand why the SIMD instruction set provides no integer division instructions — maybe it was considered unnecessary. Converting to float, using the floating-point division and converting back would probably cost more than the plain C code.

However, we know that division by a constant is usually turned by the compiler into shifts and multiplications. Since the hardware does not give us the instruction, we can write an integer division ourselves, and in fact someone has already done this work: there is a very powerful SIMD library on Github (see https://www.agner.org/optimize/) that contains code for integer division by a constant (both 16-bit and 32-bit). The only restriction is that the divisor must be a fixed value; the divisors cannot differ per lane the way they can with _mm_div_ps. That code is not reproduced here (see the attachment of this article), but the idea is sketched below.
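A minimal sketch of the idea, modeled on Agner Fog's vector class library; the names GetMSS_S and _mm_divp_epi16 follow the calls used later in this post, but the bodies below are my own assumption rather than the attachment's code, and only handle positive divisors:

#include <emmintrin.h>    //    SSE2

// Compute a (multiplier, shift) pair so that x / Divisor can be evaluated with a
// multiply-high plus shifts (Divisor >= 2 and positive in this sketch).
inline void GetMSS_S(short Divisor, short &Multiplier, int &Shift, int &Sign)
{
    int D = Divisor;
    Sign = 0;
    Shift = 0;
    while ((2 << Shift) < D)    Shift++;                                   //    Shift = ceil(log2(D)) - 1
    Multiplier = (short)((1 << (16 + Shift)) / D - 0x10000 + 1);           //    magic multiplier (stored as a negative short)
}

// Divide eight signed 16-bit lanes by the constant described by (Multiplier, Shift),
// truncating toward zero like C integer division.
inline __m128i _mm_divp_epi16(__m128i X, short Multiplier, int Shift)
{
    __m128i T1 = _mm_mulhi_epi16(X, _mm_set1_epi16(Multiplier));           //    high 16 bits of X * Multiplier
    __m128i T2 = _mm_add_epi16(T1, X);                                     //    compensate the -2^16 offset in the multiplier
    __m128i T3 = _mm_sra_epi16(T2, _mm_cvtsi32_si128(Shift));              //    arithmetic shift right by Shift
    __m128i T4 = _mm_srai_epi16(X, 15);                                    //    -1 where X < 0, else 0
    return _mm_sub_epi16(T3, T4);                                          //    round negative quotients toward zero
}

For the divisor 5 used here, this sketch yields Multiplier5 = -13107 and Shift5 = 2.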

With an integer division available we can continue optimizing the code above: the 8 values extracted from Dist0 through Dist7 are stored into a temporary array of 8 shorts, and the subsequent addition and division are then done directly with SSE, as shown below:

static void TV_Curvature_Filter(short *Src, short *Dest, int Width, int Height, int Stride)
{
    int BlockSize = 8, Block = (Width - 2) / BlockSize;
    short Multiplier5 = 0;
    int Shift5 = 0, Sign5 = 0;
    GetMSS_S(5, Multiplier5, Shift5, Sign5);
    short Dist[8];
    for (int Y = 1; Y < Height - 1; Y++)
    {
        short *LinePC = Src + Y       * Stride;
        short *LinePL = Src + (Y - 1) * Stride;
        short *LinePN = Src + (Y + 1) * Stride;
        short *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Block * BlockSize + 1; X += BlockSize)
        {
            __m128i FirstP0 = _mm_loadu_si128((__m128i *)(LinePL + X - 1));
            __m128i FirstP1 = _mm_loadu_si128((__m128i *)(LinePL + X));
            __m128i FirstP2 = _mm_loadu_si128((__m128i *)(LinePL + X + 1));
            __m128i SecondP0 = _mm_loadu_si128((__m128i *)(LinePC + X - 1));
            __m128i SecondP1 = _mm_loadu_si128((__m128i *)(LinePC + X));
            __m128i SecondP2 = _mm_loadu_si128((__m128i *)(LinePC + X + 1));
            __m128i ThirdP0 = _mm_loadu_si128((__m128i *)(LinePN + X - 1));
            __m128i ThirdP1 = _mm_loadu_si128((__m128i *)(LinePN + X));
            __m128i ThirdP2 = _mm_loadu_si128((__m128i *)(LinePN + X + 1));

            __m128i C5 = _mm_mullo_epi16(SecondP1, _mm_set1_epi16(5));
            __m128i Dist0 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP1), _mm_add_epi16(FirstP2, SecondP2)), ThirdP2), C5);
            __m128i Dist1 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP1, FirstP2), _mm_add_epi16(SecondP2, ThirdP2)), ThirdP1), C5);
            __m128i Dist2 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP2, SecondP2), _mm_add_epi16(ThirdP2, ThirdP1)), ThirdP0), C5);
            __m128i Dist3 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP2, ThirdP2), _mm_add_epi16(ThirdP1, ThirdP0)), SecondP0), C5);
            __m128i Dist4 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP2, ThirdP1), _mm_add_epi16(ThirdP0, SecondP0)), FirstP0), C5);
            __m128i Dist5 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP1, ThirdP0), _mm_add_epi16(SecondP0, FirstP0)), FirstP1), C5);
            __m128i Dist6 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP0, SecondP0), _mm_add_epi16(FirstP0, FirstP1)), FirstP2), C5);
            __m128i Dist7 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP0, FirstP0), _mm_add_epi16(FirstP1, FirstP2)), SecondP2), C5);

            __m128i S01L = _mm_unpacklo_epi16(Dist0, Dist1);                            // B3 A3 B2 A2 B1 A1 B0 A0
            __m128i S23L = _mm_unpacklo_epi16(Dist2, Dist3);                            // D3 C3 D2 C2 D1 C1 D0 C0
            __m128i S01H = _mm_unpackhi_epi16(Dist0, Dist1);                            // B7 A7 B6 A6 B5 A5 B4 A4
            __m128i S23H = _mm_unpackhi_epi16(Dist2, Dist3);                            // D7 C7 D6 C6 D5 C5 D4 C4

            __m128i S0123LL = _mm_unpacklo_epi32(S01L, S23L);                           // D1 C1 B1 A1 D0 C0 B0 A0
            __m128i S0123LH = _mm_unpackhi_epi32(S01L, S23L);                           // D3 C3 B3 A3 D2 C2 B2 A2
            __m128i S0123HL = _mm_unpacklo_epi32(S01H, S23H);                           // D5 C5 B5 A5 D4 C4 B4 A4
            __m128i S0123HH = _mm_unpackhi_epi32(S01H, S23H);                           // D7 C7 B7 A7 D6 C6 B6 A6

            __m128i S45L = _mm_unpacklo_epi16(Dist4, Dist5);                            // F3 E3 F2 E2 F1 E1 F0 E0
            __m128i S67L = _mm_unpacklo_epi16(Dist6, Dist7);                            // H3 G3 H2 G2 H1 G1 H0 G0
            __m128i S45H = _mm_unpackhi_epi16(Dist4, Dist5);                            // F7 E7 F6 E6 F5 E5 F4 E4
            __m128i S67H = _mm_unpackhi_epi16(Dist6, Dist7);                            // H7 G7 H6 G6 H5 G5 H4 G4

            __m128i S4567LL = _mm_unpacklo_epi32(S45L, S67L);                           // H1 G1 F1 E1 H0 G0 F0 E0
            __m128i S4567LH = _mm_unpackhi_epi32(S45L, S67L);                           // H3 G3 F3 E3 H2 G2 F2 E2
            __m128i S4567HL = _mm_unpacklo_epi32(S45H, S67H);                           // H5 G5 F5 E5 H4 G4 F4 E4
            __m128i S4567HH = _mm_unpackhi_epi32(S45H, S67H);                           // H7 G7 F7 E7 H6 G6 F6 E6

            Dist0 = _mm_unpacklo_epi64(S0123LL, S4567LL);                               // H0 G0 F0 E0 D0 C0 B0 A0
            Dist1 = _mm_unpackhi_epi64(S0123LL, S4567LL);                               // H1 G1 F1 E1 D1 C1 B1 A1
            Dist2 = _mm_unpacklo_epi64(S0123LH, S4567LH);                               // H2 G2 F2 E2 D2 C2 B2 A2
            Dist3 = _mm_unpackhi_epi64(S0123LH, S4567LH);                               // H3 G3 F3 E3 D3 C3 B3 A3
            Dist4 = _mm_unpacklo_epi64(S0123HL, S4567HL);                               // H4 G4 F4 E4 D4 C4 B4 A4
            Dist5 = _mm_unpackhi_epi64(S0123HL, S4567HL);                               // H5 G5 F5 E5 D5 C5 B5 A5
            Dist6 = _mm_unpacklo_epi64(S0123HH, S4567HH);                               // H6 G6 F6 E6 D6 C6 B6 A6
            Dist7 = _mm_unpackhi_epi64(S0123HH, S4567HH);                               // H7 G7 F7 E7 D7 C7 B7 A7

            Dist[0] = Dist0.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist0)), 1)];
            Dist[1] = Dist1.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist1)), 1)];
            Dist[2] = Dist2.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist2)), 1)];
            Dist[3] = Dist3.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist3)), 1)];
            Dist[4] = Dist4.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist4)), 1)];
            Dist[5] = Dist5.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist5)), 1)];
            Dist[6] = Dist6.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist6)), 1)];
            Dist[7] = Dist7.m128i_i16[_mm_extract_epi16(_mm_minpos_epu16(_mm_abs_epi16(Dist7)), 1)];

            __m128i DivY = _mm_divp_epi16(_mm_add_epi16(_mm_set1_epi16(2), _mm_loadu_si128((__m128i *)Dist)), Multiplier5, Shift5);
            _mm_storeu_si128((__m128i *)(LinePD + X), _mm_add_epi16(DivY, SecondP1));
        }
        for (int X = Block * BlockSize + 1; X < Width - 1; X++)
        {
            //    fill in the scalar processing for the remaining pixels yourself
        }
    }
}

The speed improves to 45ms, roughly another doubling compared with the float SSE version.

Let us go back to the float SSE code. In the float version there is no function like _mm_minpos_epu16, so we used the traditional comparison approach. By the same token, why not use those comparison functions for the short type as well? It should work just as well; the float and short types have matching intrinsics for almost everything, except that there is no _mm_blendv_epi16 (note: not _mm_blend_epi16, whose parameters work differently), but _mm_blendv_epi8 can stand in for it with identical behavior. Combining this with the division optimization just described gives the following code:

static void TV_Curvature_Filter_03(short *Src, short *Dest, int Width, int Height, int Stride)
{
    int BlockSize = 8, Block = (Width - 2) / BlockSize;
    short Multiplier5 = 0;
    int Shift5 = 0, Sign5 = 0;
    GetMSS_S(5, Multiplier5, Shift5, Sign5);
    for (int Y = 1; Y < Height - 1; Y++)
    {
        short *LinePC = Src + Y       * Stride;
        short *LinePL = Src + (Y - 1) * Stride;
        short *LinePN = Src + (Y + 1) * Stride;
        short *LinePD = Dest + Y * Stride;
        for (int X = 1; X < Block * BlockSize + 1; X += BlockSize)
        {
            __m128i FirstP0 = _mm_loadu_si128((__m128i *)(LinePL + X - 1));
            __m128i FirstP1 = _mm_loadu_si128((__m128i *)(LinePL + X));
            __m128i FirstP2 = _mm_loadu_si128((__m128i *)(LinePL + X + 1));
            __m128i SecondP0 = _mm_loadu_si128((__m128i *)(LinePC + X - 1));
            __m128i SecondP1 = _mm_loadu_si128((__m128i *)(LinePC + X));
            __m128i SecondP2 = _mm_loadu_si128((__m128i *)(LinePC + X + 1));
            __m128i ThirdP0 = _mm_loadu_si128((__m128i *)(LinePN + X - 1));
            __m128i ThirdP1 = _mm_loadu_si128((__m128i *)(LinePN + X));
            __m128i ThirdP2 = _mm_loadu_si128((__m128i *)(LinePN + X + 1));

            __m128i C5 = _mm_mullo_epi16(SecondP1, _mm_set1_epi16(5));
            __m128i Dist0 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP1), _mm_add_epi16(FirstP2, SecondP2)), ThirdP2), C5);
            __m128i Dist1 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP1, FirstP2), _mm_add_epi16(SecondP2, ThirdP2)), ThirdP1), C5);
            __m128i Dist2 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP2, SecondP2), _mm_add_epi16(ThirdP2, ThirdP1)), ThirdP0), C5);
            __m128i Dist3 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP2, ThirdP2), _mm_add_epi16(ThirdP1, ThirdP0)), SecondP0), C5);
            __m128i Dist4 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP2, ThirdP1), _mm_add_epi16(ThirdP0, SecondP0)), FirstP0), C5);
            __m128i Dist5 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP1, ThirdP0), _mm_add_epi16(SecondP0, FirstP0)), FirstP1), C5);
            __m128i Dist6 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP0, SecondP0), _mm_add_epi16(FirstP0, FirstP1)), FirstP2), C5);
            __m128i Dist7 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP0, FirstP0), _mm_add_epi16(FirstP1, FirstP2)), SecondP2), C5);

            __m128i AbsDist0 = _mm_abs_epi16(Dist0);
            __m128i AbsDist1 = _mm_abs_epi16(Dist1);
            __m128i AbsDist2 = _mm_abs_epi16(Dist2);
            __m128i AbsDist3 = _mm_abs_epi16(Dist3);
            __m128i AbsDist4 = _mm_abs_epi16(Dist4);
            __m128i AbsDist5 = _mm_abs_epi16(Dist5);
            __m128i AbsDist6 = _mm_abs_epi16(Dist6);
            __m128i AbsDist7 = _mm_abs_epi16(Dist7);

            __m128i Cmp = _mm_cmpgt_epi16(AbsDist0, AbsDist1);
            __m128i Min = _mm_blendv_epi8(AbsDist0, AbsDist1, Cmp);
            __m128i Value = _mm_blendv_epi8(Dist0, Dist1, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist2);
            Min = _mm_blendv_epi8(Min, AbsDist2, Cmp);
            Value = _mm_blendv_epi8(Value, Dist2, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist3);
            Min = _mm_blendv_epi8(Min, AbsDist3, Cmp);
            Value = _mm_blendv_epi8(Value, Dist3, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist4);
            Min = _mm_blendv_epi8(Min, AbsDist4, Cmp);
            Value = _mm_blendv_epi8(Value, Dist4, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist5);
            Min = _mm_blendv_epi8(Min, AbsDist5, Cmp);
            Value = _mm_blendv_epi8(Value, Dist5, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist6);
            Min = _mm_blendv_epi8(Min, AbsDist6, Cmp);
            Value = _mm_blendv_epi8(Value, Dist6, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist7);
            Min = _mm_blendv_epi8(Min, AbsDist7, Cmp);
            Value = _mm_blendv_epi8(Value, Dist7, Cmp);

            __m128i DivY = _mm_divp_epi16(_mm_add_epi16(_mm_set1_epi16(2), Value), Multiplier5, Shift5);
            _mm_storeu_si128((__m128i *)(LinePD + X), _mm_add_epi16(DivY, SecondP1));
        }
        for (int X = Block * BlockSize + 1; X < Width - 1; X++)
        {
            //    fill in the scalar processing for the remaining pixels yourself
        }
    }
}

The execution time comes down to 30ms, faster than the _mm_minpos_epu16 version, which goes to show that we should not stick blindly to convention or be boxed in by prior assumptions.

At this point the optimization is more or less complete, so let us mention a few other things. Notice that in all the code above we never process the border pixels, because their 8-neighborhood would fall outside the image. One option is to pad the image, run the algorithm and then crop; another is to write separate code for the border pixels. Both are annoying. Here we can again use a trick I described in SSE圖像算法優化系列九:靈活運用SIMD指令16倍提升Sobel邊緣檢測的速度(4000*3000的24位圖像時間由480ms降低到30ms) (Series 9: using SIMD flexibly for a 16x speedup of Sobel edge detection): keep a rolling cache of three rows so that an 8-neighborhood algorithm handles the borders cleanly and even supports in-place operation. A further advantage is that no second intermediate buffer is needed, which, especially for the float version, cuts the memory footprint substantially — another important improvement.

void TV_Curvature_Filter_Short(short *Data, int Width, int Height, int Stride)
{
    int Channel = Stride / Width;
    short Multiplier5 = 0;
    int Shift5 = 0, Sign5 = 0;
    GetMSS_S(5, Multiplier5, Shift5, Sign5);
    int BlockSize = 8, Block = (Width * Channel) / BlockSize;

    short *RowCopy = (short *)malloc((Width + 2) * 3 * Channel * sizeof(short));
    short *First = RowCopy;
    short *Second = RowCopy + (Width + 2) * Channel;
    short *Third = RowCopy + (Width + 2) * 2 * Channel;

    memcpy(Second, Data, Channel * sizeof(short));
    memcpy(Second + Channel, Data, Width * Channel * sizeof(short));                                        //    copy the current row into the middle of the padded buffer
    memcpy(Second + (Width + 1) * Channel, Data + (Width - 1) * Channel, Channel * sizeof(short));
    memcpy(First, Second, (Width + 2) * Channel * sizeof(short));                                           //    the first row is the same as the second one
    memcpy(Third, Data + Stride, Channel * sizeof(short));                                                  //    copy the second image row
    memcpy(Third + Channel, Data + Stride, Width * Channel * sizeof(short));
    memcpy(Third + (Width + 1) * Channel, Data + Stride + (Width - 1) * Channel, Channel * sizeof(short));

    for (int Y = 0; Y < Height; Y++)
    {
        short *LinePD = Data + Y * Stride;
        if (Y != 0)
        {
            short *Temp = First;
            First = Second;
            Second = Third;
            Third = Temp;
        }
        if (Y == Height - 1)
        {
            memcpy(Third, Second, (Width + 2) * Channel * sizeof(short));
        }
        else
        {
            memcpy(Third, Data + (Y + 1) * Stride, Channel * sizeof(short));
            memcpy(Third + Channel, Data + (Y + 1) * Stride, Width * Channel * sizeof(short));              //    the previous row is already backed up, so this stays correct even though we operate in-place
            memcpy(Third + (Width + 1) * Channel, Data + (Y + 1) * Stride + (Width - 1) * Channel, Channel * sizeof(short));
        }
        for (int X = 0; X < Block * BlockSize; X += BlockSize)
        {
            __m128i FirstP0 = _mm_loadu_si128((__m128i *)(First + X));
            __m128i FirstP1 = _mm_loadu_si128((__m128i *)(First + X + Channel));
            __m128i FirstP2 = _mm_loadu_si128((__m128i *)(First + X + Channel + Channel));
            __m128i SecondP0 = _mm_loadu_si128((__m128i *)(Second + X));
            __m128i SecondP1 = _mm_loadu_si128((__m128i *)(Second + X + Channel));
            __m128i SecondP2 = _mm_loadu_si128((__m128i *)(Second + X + Channel + Channel));
            __m128i ThirdP0 = _mm_loadu_si128((__m128i *)(Third + X));
            __m128i ThirdP1 = _mm_loadu_si128((__m128i *)(Third + X + Channel));
            __m128i ThirdP2 = _mm_loadu_si128((__m128i *)(Third + X + Channel + Channel));

            __m128i C5 = _mm_mullo_epi16(SecondP1, _mm_set1_epi16(5));
            __m128i Dist0 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP1), _mm_add_epi16(FirstP2, SecondP2)), ThirdP2), C5);
            __m128i Dist1 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP1, FirstP2), _mm_add_epi16(SecondP2, ThirdP2)), ThirdP1), C5);
            __m128i Dist2 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(FirstP2, SecondP2), _mm_add_epi16(ThirdP2, ThirdP1)), ThirdP0), C5);
            __m128i Dist3 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP2, ThirdP2), _mm_add_epi16(ThirdP1, ThirdP0)), SecondP0), C5);
            __m128i Dist4 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP2, ThirdP1), _mm_add_epi16(ThirdP0, SecondP0)), FirstP0), C5);
            __m128i Dist5 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP1, ThirdP0), _mm_add_epi16(SecondP0, FirstP0)), FirstP1), C5);
            __m128i Dist6 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(ThirdP0, SecondP0), _mm_add_epi16(FirstP0, FirstP1)), FirstP2), C5);
            __m128i Dist7 = _mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(_mm_add_epi16(SecondP0, FirstP0), _mm_add_epi16(FirstP1, FirstP2)), SecondP2), C5);

            __m128i AbsDist0 = _mm_abs_epi16(Dist0);
            __m128i AbsDist1 = _mm_abs_epi16(Dist1);
            __m128i AbsDist2 = _mm_abs_epi16(Dist2);
            __m128i AbsDist3 = _mm_abs_epi16(Dist3);
            __m128i AbsDist4 = _mm_abs_epi16(Dist4);
            __m128i AbsDist5 = _mm_abs_epi16(Dist5);
            __m128i AbsDist6 = _mm_abs_epi16(Dist6);
            __m128i AbsDist7 = _mm_abs_epi16(Dist7);

            __m128i Cmp = _mm_cmpgt_epi16(AbsDist0, AbsDist1);
            __m128i Min = _mm_blendv_epi8(AbsDist0, AbsDist1, Cmp);
            __m128i Value = _mm_blendv_epi8(Dist0, Dist1, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist2);
            Min = _mm_blendv_epi8(Min, AbsDist2, Cmp);
            Value = _mm_blendv_epi8(Value, Dist2, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist3);
            Min = _mm_blendv_epi8(Min, AbsDist3, Cmp);
            Value = _mm_blendv_epi8(Value, Dist3, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist4);
            Min = _mm_blendv_epi8(Min, AbsDist4, Cmp);
            Value = _mm_blendv_epi8(Value, Dist4, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist5);
            Min = _mm_blendv_epi8(Min, AbsDist5, Cmp);
            Value = _mm_blendv_epi8(Value, Dist5, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist6);
            Min = _mm_blendv_epi8(Min, AbsDist6, Cmp);
            Value = _mm_blendv_epi8(Value, Dist6, Cmp);
            Cmp = _mm_cmpgt_epi16(Min, AbsDist7);
            Min = _mm_blendv_epi8(Min, AbsDist7, Cmp);
            Value = _mm_blendv_epi8(Value, Dist7, Cmp);

            __m128i DivY = _mm_divp_epi16(_mm_add_epi16(_mm_set1_epi16(2), Value), Multiplier5, Shift5);
            _mm_storeu_si128((__m128i *)(LinePD + X), _mm_add_epi16(DivY, SecondP1));
        }
        for (int X = Block * BlockSize; X < Width; X++)
        {
            //    fill in the scalar processing for the remaining pixels yourself
        }
    }
    free(RowCopy);
}

void IM_TV_CurvatureFilter_Short(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Iteration)
{
    short *Temp = (short *)malloc(Height * Stride * sizeof(short));
    for (int Y = 0; Y < Height * Stride; Y++)    Temp[Y] = Src[Y] * 16;
    for (int Y = 0; Y < Iteration; Y++)
    {
        TV_Curvature_Filter_Short(Temp, Width, Height, Stride);
    }
    for (int Y = 0; Y < Height * Stride; Y++)    Dest[Y] = IM_ClampToByte((Temp[Y] + 8) / 16);
    free(Temp);
}

The overall code is still simple, and the efficiency does not drop at all.

There is still room for further optimization: the computation of the 8 Dist values still contains some repetition, the division by 5 can be exploited further by embedding the corresponding constants directly, and the initialization of the data as well as the final conversion back to unsigned char can also be done with SSE. I leave these to interested readers.

Another observation from testing: all versions except the final three-row-cache one show an anomaly: a 512*512 image is processed noticeably more slowly than images of other sizes (for example, changing the width to 513 makes it faster). I remember reading about this somewhere — a width of 512 is indeed slower, probably a cache-related issue — but I no longer remember the details.

  那么對於GC和MC的優化因為思路差不多,因此,本文未對他們進行分析和處理了,但是要注意作者最原始論文的公式已經不是最新的了,可以從https://xplorebcpaz.ieee.org/document/7835193/里下載,不然你可以看到論文偽代碼和matlab的不對應。

Of course, if short precision is not enough, we can also use int; for an algorithm like MC, which involves divisions by 16, int should still be faster than float.

  那么10年之后的CPU基本都支持256位的單指令多數據流AVX,他可一次性的處理8個浮點數或16個short類型數據,因此,如果使用AVX的理論上速度應該更快,而且這個代碼更換成AVX版本的代碼也毫無難度,基本上就是把__m128i改為__m256i,把_mm部分內容替換位_mm256,除了在浮點比較函數SSE又一系列_mm_cmpXX_PS函數,而AVX統一到了_mm256_cmp_ps的第三個參數中外,其他的都是一樣的,當然還有最重要的一點就是要把Blocksize由4改為8或者由8改為16。

Also note that mixing AVX and SSE code is discouraged, so an _mm256_zeroupper() instruction must be added after the AVX instructions to clear the upper halves of the YMM registers; otherwise the SSE parts of the program slow down severely. A sketch of the ported candidate-selection step is shown below.
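A minimal AVX sketch of the selection step for the float version (my own illustration of the porting rules above, not the benchmarked code; the helper name is hypothetical):

#include <immintrin.h>    //    AVX

// Hypothetical helper: given the eight candidate vectors Dist[0..7] (one candidate per
// pixel, eight pixels per __m256), return per lane the candidate whose absolute value is
// smallest -- the AVX counterpart of the SSE min/blendv chain used earlier.
static inline __m256 SelectMinAbsCandidate(const __m256 Dist[8])
{
    const __m256 SignBit = _mm256_set1_ps(-0.0f);
    __m256 Min = _mm256_andnot_ps(SignBit, Dist[0]);                       //    |Dist0|
    __m256 Value = Dist[0];
    for (int I = 1; I < 8; I++)
    {
        __m256 Abs = _mm256_andnot_ps(SignBit, Dist[I]);                   //    |DistI|
        __m256 Cmp = _mm256_cmp_ps(Min, Abs, _CMP_GT_OQ);                  //    current minimum > new candidate ?
        Min = _mm256_blendv_ps(Min, Abs, Cmp);
        Value = _mm256_blendv_ps(Value, Dist[I], Cmp);
    }
    return Value;                                                          //    the caller multiplies by 0.2f and adds the center pixel
}
// Remember to call _mm256_zeroupper() after the AVX loop, before returning to SSE or scalar code.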

Tests after the AVX conversion show that 50 iterations on the 512*512 grayscale image take about 15ms in total, i.e. the single-iteration speed reaches about 1000 MPixels/Sec.

Source code download: https://files.cnblogs.com/files/Imageshop/CurvatureFilter.rar


 

