Optimizing Android NDK Code with NEON (SIMD)


http://wenku.baidu.com/link?url=O-hWpGqRzGI00vHP4gsZI1u_8AV8xA94VTmMvtf8Rs4bmdnJdAPrYxg2WHs_1ZglnUNKIHUmnSFCCk9LP1UB3sjSsJYJI8F-9vuvRiHy_OK

http://www.tuicool.com/articles/673mIn

  http://www.arm.com/zh/products/processors/technologies/neon.php

http://blog.csdn.net/chshplp_liaoping/article/details/12752749

http://blog.csdn.net/ccjjnn19890720/article/details/7291228

http://community.arm.com/groups/android-community/blog/2015/03/27/arm-neon-programming-quick-reference

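Sometimes there is plenty of material online, yet it is hard to find the right piece yourself.

For example, I had long wanted to optimize the native code of an Android NDK project. I knew that NEON and hand-written assembly could be used, but I could not find the right way in, so I lost a lot of time.

Optimizing the C code alone paid off very little. Optimizing a single function usually has little effect on the whole system: the speed-up from such rewrites is modest to begin with, and any one function accounts for only a small share of the total run time. After several days of work I saw no obvious progress.

Finding the relevant material also took considerable effort.

The first thing I found was the header needed to use the NEON intrinsics from C source code, arm_neon.h: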

 

#include <arm_neon.h>
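// After adding this line I ran ndk-build, which failed with an error
// saying that some flag had to be added. I appended it to LOCAL_CXX_FLAGS, but the error remained.
// Searching for "NDK + NEON" later finally gave me a lead,
// which led to the summary below.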


For the contents of Android.mk you can refer to this example:

http://download.csdn.net/download/carlonelong/4153631

I made a few changes. The modified file is as follows:

 

LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)

# Source files to compile (only some of them are listed here)
LOCAL_SRC_FILES := NcHevcDecoder.cpp JNI_OnLoad.cpp TAppDecTop.cpp

# Default header search paths
LOCAL_C_INCLUDES := \
    $(LOCAL_PATH) \
    $(LOCAL_PATH)/..

# The options after -g must be added before arm_neon.h can be used
# -mfloat-abi=softfp -mfpu=neon are required for arm_neon.h
LOCAL_CFLAGS := -D__cpusplus -g -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8
LOCAL_LDLIBS := -lz -llog
TARGET_ARCH_ABI := armeabi-v7a
LOCAL_ARM_MODE := arm

ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
# Enable NEON optimization
LOCAL_ARM_NEON := true
endif

LOCAL_MODULE := avcodec
include $(BUILD_STATIC_LIBRARY)

 

Application.mk also needs to be modified. Its contents are shown below.

For reference, see: http://blog.csdn.net/gg137608987/article/details/7565843

 

APP_PROJECT_PATH := $(call my-dir)/..
APP_PLATFORM := android-10
APP_STL := stlport_static
APP_ABI := armeabi-v7a
APP_CPPFLAGS += -fexceptions

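The APP_ABI line specifies the target ABI to build for, so the build can be optimized for a particular platform.

Of course, once this is set, the target device must support the NEON instructions.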
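Because NEON is optional on ARMv7 devices, it can be worth checking for it at run time before calling any NEON code path. The following is a minimal sketch, not part of the original post. It assumes the NDK's cpufeatures helper library is available, which normally means adding `LOCAL_STATIC_LIBRARIES += cpufeatures` and `$(call import-module,android/cpufeatures)` to Android.mk. The function name `device_has_neon` is a hypothetical one chosen for illustration.

```c
#include <stdbool.h>

#if defined(__ANDROID__) && defined(__arm__)
#include <cpu-features.h>  /* NDK cpufeatures helper library (assumed to be linked) */
#endif

/* Hypothetical helper: returns true when NEON instructions may be used. */
bool device_has_neon(void)
{
#if defined(__ANDROID__) && defined(__arm__)
    /* 32-bit ARM on Android: NEON is optional, so ask the library at run time. */
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#elif defined(__aarch64__)
    /* 64-bit ARM: Android's arm64-v8a ABI always includes NEON. */
    return true;
#else
    /* Any other target (for example an x86 host build): no NEON. */
    return false;
#endif
}
```

A caller could use the result to choose between a NEON routine and a plain C fallback, so that the same library still runs on devices without NEON.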

 

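With the configuration above (NEON enabled and so on), the performance of one of my NDK applications nearly doubled: the processing latency of the system dropped from about 95 ms to 51 ms.

The NDK code can later be optimized further with the NEON intrinsics for even better results.

The NEON optimization itself will be covered later; I will update the blog as I apply it.

There is an example online that uses NEON to optimize an RGB-to-grayscale conversion:

http://hilbert-space.de/?p=22

Its optimization process is excerpted here: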

1. Original code

 

void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}
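The three weights 77, 151 and 28 add up to 256, so the weighted sum is the grey value scaled by 256, and the final shift by 8 removes that scale. The sketch below works through the arithmetic for a single pixel. The names `gray_q8`, `W_R`, `W_G` and `W_B` are hypothetical and only serve this illustration.

```c
#include <stdint.h>

/* Fixed-point weights from the article; they sum to 256 (1.0 scaled by 256). */
enum { W_R = 77, W_G = 151, W_B = 28 };

/* Grey value of one RGB pixel, using the same arithmetic as reference_convert. */
uint8_t gray_q8(uint8_t r, uint8_t g, uint8_t b)
{
    /* The largest possible sum is 255 * 256 = 65280, which fits in 16 bits. */
    unsigned y = (unsigned)(r * W_R + g * W_G + b * W_B);
    return (uint8_t)(y >> 8);   /* undo the scale by 256 */
}
```

For example, a white pixel (255, 255, 255) gives 255 * 256 >> 8 = 255, while pure red (255, 0, 0) gives 19635 >> 8 = 76. The shift truncates rather than rounds.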

2. Optimizing the code with NEON intrinsics

Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel. 

That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:

Eight 8-bit values fill one 64-bit NEON register, which leads to the following code.

 

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);   // weight for R
  uint8x8_t gfac = vdup_n_u8 (151);  // weight for G
  uint8x8_t bfac = vdup_n_u8 (28);   // weight for B
  n/=8;

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb  = vld3_u8 (src);
    uint8x8_t result;

    temp = vmull_u8 (rgb.val[0],      rfac); // multiply byte by byte; each product is 16 bits wide
    temp = vmlal_u8 (temp,rgb.val[1], gfac); // multiply byte by byte and add the products to temp
    temp = vmlal_u8 (temp,rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8); // shift every lane right by 8 bits and narrow to 8 bits
    vst1_u8 (dest, result);         // store the result
    src  += 8*3;
    dest += 8;
  }
}
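Note that `n/=8` in neon_convert silently drops up to seven trailing pixels when n is not a multiple of eight. The following sketch shows one possible way to handle any length: the vector loop takes whole groups of eight pixels and a scalar loop finishes the rest. It is an illustration under assumptions, not the original author's code. The name `convert_any_length` and the `HAVE_NEON` macro are hypothetical, and the NEON path is only compiled when the compiler defines `__ARM_NEON`, so the same file also builds on a non-ARM host.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical wrapper: converts n RGB pixels to grey, for any n. */
void convert_any_length(uint8_t *dest, const uint8_t *src, size_t n)
{
    size_t i = 0;

#if HAVE_NEON
    const uint8x8_t rfac = vdup_n_u8(77);
    const uint8x8_t gfac = vdup_n_u8(151);
    const uint8x8_t bfac = vdup_n_u8(28);

    /* Vector part: whole groups of eight pixels. */
    for (; i + 8 <= n; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);
        uint16x8_t acc = vmull_u8(rgb.val[0], rfac);
        acc = vmlal_u8(acc, rgb.val[1], gfac);
        acc = vmlal_u8(acc, rgb.val[2], bfac);
        vst1_u8(dest + i, vshrn_n_u16(acc, 8));
    }
#endif

    /* Scalar part: the 0..7 leftover pixels, or everything when NEON is absent. */
    for (; i < n; i++) {
        unsigned y = src[3 * i] * 77u + src[3 * i + 1] * 151u + src[3 * i + 2] * 28u;
        dest[i] = (uint8_t)(y >> 8);
    }
}
```

Both paths use exact integer arithmetic with a truncating shift, so they should produce identical bytes; only the speed differs.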

 

vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair.

vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.

So we end up with just three instructions for weighted average of eight pixels. Nice.

Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits. This equals to a division by 256. ARM NEON has lots of instructions to do the shift, but also a “narrow” variant exists. This one does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register. 
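To make the instruction sequence described above easier to follow, here is a plain C model of one loop iteration. It is only a sketch of what each instruction does to the eight lanes, written for illustration. The function name `model_eight_pixels` is hypothetical, and the arrays stand in for NEON registers.

```c
#include <stdint.h>

/* Plain C model of one loop iteration of neon_convert (eight pixels). */
void model_eight_pixels(uint8_t dest[8], const uint8_t src[24])
{
    uint8_t r[8], g[8], b[8];   /* stand-ins for rgb.val[0..2] */
    uint16_t acc[8];            /* stand-in for the 128-bit accumulator */
    int lane;

    /* vld3.8: read 24 interleaved bytes and split them by channel. */
    for (lane = 0; lane < 8; lane++) {
        r[lane] = src[3 * lane + 0];
        g[lane] = src[3 * lane + 1];
        b[lane] = src[3 * lane + 2];
    }

    for (lane = 0; lane < 8; lane++) {
        /* vmull.u8: widening multiply, 8 bit x 8 bit -> 16 bit. */
        acc[lane] = (uint16_t)(r[lane] * 77);
        /* vmlal.u8: widening multiply, then add to the accumulator. */
        acc[lane] = (uint16_t)(acc[lane] + g[lane] * 151);
        acc[lane] = (uint16_t)(acc[lane] + b[lane] * 28);
        /* vshrn #8: shift right by 8 and keep one byte per lane. */
        dest[lane] = (uint8_t)(acc[lane] >> 8);
    }
}
```

Because the weights sum to 256, the accumulator never exceeds 255 * 256 = 65280, so the 16-bit lanes cannot overflow even after both accumulate steps.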

 

3. Comparing the results
(1) Assembly generated from the C NEON-intrinsics version

 

/*
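Results before any hand-written assembly optimization:
C-version:           15.1 cycles per pixel.
NEON-version:         9.9 cycles per pixel.
The speed-up is not very satisfying, so the author looked at the generated assembly: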
That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. 
What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:
*/
 160: f46a040f  vld3.8 {d16-d18}, [sl]
 164: e1a0c005  mov ip, r5
 168: ecc80b06  vstmia r8, {d16-d18}
 16c: e1a04007  mov r4, r7
 170: e2866001  add r6, r6, #1 ; 0x1
 174: e28aa018  add sl, sl, #24 ; 0x18
 178: e8bc000f  ldm ip!, {r0, r1, r2, r3}
 17c: e15b0006  cmp fp, r6
 180: e1a08005  mov r8, r5
 184: e8a4000f  stmia r4!, {r0, r1, r2, r3}
 188: eddd0b06  vldr d16, [sp, #24]
 18c: e89c0003  ldm ip, {r0, r1}
 190: eddd2b08  vldr d18, [sp, #32]
 194: f3c00ca6  vmull.u8 q8, d16, d22
 198: f3c208a5  vmlal.u8 q8, d18, d21
 19c: e8840003  stm r4, {r0, r1}
 1a0: eddd3b0a  vldr d19, [sp, #40]
 1a4: f3c308a4  vmlal.u8 q8, d19, d20
 1a8: f2c80830  vshrn.i16 d16, q8, #8
 1ac: f449070f  vst1.8 {d16}, [r9]
 1b0: e2899008  add r9, r9, #8 ; 0x8
 1b4: caffffe9  bgt 160

(2) Hand-written NEON assembly

 

Since the compiler can’t generate good code I wrote the same loop in assembler. 
In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that’s all.

The generated assembly was then optimized further by hand. The optimized code is as follows:

convert_asm_neon:

      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:

      push        {r4-r5,lr}
      lsr         r2, r2, #3

      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5

  .loop:

      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!

      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5

      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!

      subs        r2, r2, #1
      bne         .loop

      pop         { r4-r5, pc }

 

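Final comparison:

C-version:      15.1 cycles per pixel.
NEON-version:    9.9 cycles per pixel.
Assembler:       2.0 cycles per pixel.


The hand-written NEON assembly is thus more than 7 times faster than the C version (15.1 / 2.0 ≈ 7.6), processing eight pixels at a time.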
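The figures above are in cycles per pixel, which requires a cycle counter. As a rougher but portable alternative, the sketch below measures average CPU time per pixel with the standard C `clock()` function. It is an assumption-laden illustration, not the method used in the quoted article. The names `convert_fn`, `scalar_convert` and `ns_per_pixel` are hypothetical, and `scalar_convert` is only a stand-in with the same signature as reference_convert and neon_convert.

```c
#include <stdint.h>
#include <time.h>

/* Signature shared by reference_convert and neon_convert in the article. */
typedef void (*convert_fn)(uint8_t *dest, uint8_t *src, int n);

/* Stand-in converter so this sketch is self-contained. */
void scalar_convert(uint8_t *dest, uint8_t *src, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        unsigned y = src[3 * i] * 77u + src[3 * i + 1] * 151u + src[3 * i + 2] * 28u;
        dest[i] = (uint8_t)(y >> 8);
    }
}

/* Average CPU time per pixel in nanoseconds over `repeats` runs. */
double ns_per_pixel(convert_fn fn, uint8_t *dest, uint8_t *src, int n, int repeats)
{
    clock_t start, stop;
    int i;

    if (n <= 0 || repeats <= 0)
        return 0.0;

    start = clock();
    for (i = 0; i < repeats; i++)
        fn(dest, src, n);
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC * 1e9 / ((double)n * repeats);
}
```

Calling `ns_per_pixel` once for each implementation on the same buffers and dividing the two results gives a speed-up factor comparable to the ratios quoted above. The resolution of `clock()` is coarse, so a large pixel count and many repeats are needed for a stable number.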

