Learning the OpenMP framework with GCC

This article is a repost (2018-12-14 21:16).

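Your first OpenMP program

Getting started: a big plus of OpenMP is that a standard GCC installation is all you need. Programs that use OpenMP must be compiled with the -fopenmp option. (See also: using OpenMP in Visual Studio.)

Let's begin with a Hello World! printing application that includes one extra pragma.

Listing 1. Hello World program that uses OpenMP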

#include <iostream>
#include <omp.h>

int main()
{
  #pragma omp parallel
  {
    std::cout << "Hello World!\n";
  }
}

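When you compile the code in Listing 1 with plain g++ and run it, a single Hello World! should appear on the console. Now recompile the code with the -fopenmp option. Listing 2 shows the output.

Listing 2. Compiling with -fopenmp and running the code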

tintin$ g++ test1.cpp -fopenmp
tintin$ ./a.out
Hello World!
Hello World!
Hello World!
Hello World!

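What happened? #pragma omp parallel takes effect only when you specify the -fopenmp compiler option. During compilation, GCC generates code that, at run time, creates as many threads as the hardware and operating system configuration allow (by default, typically one per logical core). The start routine of each thread is the code in the block that follows the pragma. This behavior is implicit parallelization, and at its core OpenMP is a set of powerful pragmas that spare you from writing a lot of boilerplate code. (For comparison, consider what it would take to implement the same program with Portable Operating System Interface (POSIX) threads [pthreads].) I ran this test in a virtual machine on a computer with an Intel® Core i5 processor; the virtual machine was assigned two physical cores, each with two logical cores, so the output in Listing 2 looks quite reasonable (4 threads = 4 logical cores).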
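To make the pthreads comparison concrete, here is a minimal sketch of roughly what the one-line pragma replaces. The helper name run_hello_threads and the per-thread flag array are assumptions made for this illustration; they are not part of the original article.

```cpp
#include <pthread.h>
#include <cstdio>
#include <vector>

// Start routine: the pthreads counterpart of the block that follows
// "#pragma omp parallel". The argument points at a per-thread flag.
static void* say_hello(void* arg)
{
  std::printf("Hello World!\n");
  *static_cast<int*>(arg) = 1;   // record that this thread ran
  return nullptr;
}

// Hypothetical helper: run say_hello on `count` threads and
// return how many of them actually ran.
int run_hello_threads(int count)
{
  std::vector<pthread_t> threads(count);
  std::vector<int> ran(count, 0);
  int started = 0;
  for (int i = 0; i < count; ++i) {
    if (pthread_create(&threads[i], nullptr, say_hello, &ran[i]) != 0)
      break;                     // creation failed: join what was started
    ++started;
  }
  int finished = 0;
  for (int i = 0; i < started; ++i) {
    pthread_join(threads[i], nullptr);
    finished += ran[i];
  }
  return finished;
}
```

Calling run_hello_threads(4) from main prints the greeting four times, like Listing 2, but the thread count, creation, and joining are all spelled out by hand. On older toolchains you may need to add -pthread when compiling.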

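Next, let's take a closer look at the parallel pragma.

Fun with parallel processing using OpenMP

Controlling the number of threads is easy with the num_threads clause of the pragma. Here is the code from Listing 1 with the number of available threads set to 5 (see Listing 3).

Listing 3. Controlling the number of threads with num_threads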

#include <iostream>
#include <omp.h>

int main()
{
  #pragma omp parallel num_threads(5)
  {
    std::cout << "Hello World!\n";
  }
}

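Instead of the num_threads clause, you can change the number of threads running the code in another way. This brings us to the first OpenMP API you will use: omp_set_num_threads. The function is declared in the omp.h header. You do not need to link any additional library to get the code in Listing 4 to work; -fopenmp is enough.

Listing 4. Fine-tuning thread creation with omp_set_num_threads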

#include <iostream>
#include<omp.h>

int main()
{
  omp_set_num_threads(5); 
  #pragma omp parallel 
  {
    std::cout << "Hello World!\n";
  }
}

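Finally, OpenMP also uses external environment variables to control its behavior. Without changing the program compiled in Listing 2, you can make it print Hello World! six times by setting the OMP_NUM_THREADS variable to 6. Listing 5 shows the run.

Listing 5. Using environment variables to tweak OpenMP behavior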

tintin$ export OMP_NUM_THREADS=6
tintin$ ./a.out
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!

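So far, we have covered three facets of OpenMP: pragmas, runtime APIs, and environment variables. What happens if you use both the environment variable and the runtime API? The runtime API takes precedence.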
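The precedence rule can be checked with a small experiment. The sketch below also exercises the num_threads clause, which in turn overrides the runtime API for the region it is attached to. The function name team_size_with_clause is hypothetical, and the _OPENMP guard is only there so that the code still builds, and reports a team of one, when -fopenmp is left out.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: request `api_threads` through the runtime API, then
// open a region that asks for `clause_threads`, and report the team size.
int team_size_with_clause(int api_threads, int clause_threads)
{
  int size = 1;                        // team size when OpenMP is disabled
#ifdef _OPENMP
  // The runtime API overrides OMP_NUM_THREADS ...
  omp_set_num_threads(api_threads);
  // ... and the clause overrides the runtime API for this region only.
  #pragma omp parallel num_threads(clause_threads)
  {
    if (omp_get_thread_num() == 0)     // only thread 0 writes: no data race
      size = omp_get_num_threads();
  }
#else
  (void)api_threads;                   // unused without OpenMP
  (void)clause_threads;
#endif
  return size;
}
```

With OMP_NUM_THREADS=6 exported, team_size_with_clause(5, 3) would normally report 3, although the runtime is allowed to provide fewer threads than requested.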

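A practical example

OpenMP uses implicit parallelization, and you can use pragmas, explicit functions, and environment variables to guide the compiler. Let's look at an example where OpenMP can be of real help. Consider the code in Listing 6.

Listing 6. Sequential processing in a for loop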

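int main( )
{
  // static storage: three 4 MB arrays can exceed the default stack limit (often 8 MB)
  static int a[1000000], b[1000000];
  // ... some initialization code for populating arrays a and b;
  static int c[1000000];
  // start at 1 and stop one short so that a[i-1] and b[i+1] stay in bounds
  for (int i = 1; i < 1000000 - 1; ++i)
    c[i] = a[i] * b[i] + a[i-1] * b[i+1];
  // ... now do some processing with array c
}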

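Clearly, you can split the for loop into parts and run them on different cores, because any c[k] is independent of the other elements of the c array. Listing 7 shows how OpenMP helps you do so.

Listing 7. Parallel processing in a for loop with the parallel for pragma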

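int main( )
{
  // static storage: three 4 MB arrays can exceed the default stack limit (often 8 MB)
  static int a[1000000], b[1000000];
  // ... some initialization code for populating arrays a and b;
  static int c[1000000];
  // start at 1 and stop one short so that a[i-1] and b[i+1] stay in bounds
  #pragma omp parallel for
  for (int i = 1; i < 1000000 - 1; ++i)
    c[i] = a[i] * b[i] + a[i-1] * b[i+1];
  // ... now do some processing with array c
}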
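Listing 7 leaves the initialization out, so it cannot be run as shown. A self-contained variant of the same loop might look like the sketch below. The function name combine, the use of std::vector, and the choice to leave the two end elements at 0 are assumptions made for this illustration.

```cpp
#include <vector>

// Computes c[i] = a[i] * b[i] + a[i-1] * b[i+1] for the interior indices.
// The two end elements stay 0 because they lack an i-1 or i+1 neighbour.
// Assumes b has at least as many elements as a.
std::vector<long> combine(const std::vector<long>& a, const std::vector<long>& b)
{
  const long n = static_cast<long>(a.size());
  std::vector<long> c(a.size(), 0);
  // Each iteration writes a different c[i], so no synchronization is needed.
  #pragma omp parallel for
  for (long i = 1; i < n - 1; ++i)
    c[i] = a[i] * b[i] + a[i - 1] * b[i + 1];
  return c;
}
```

Because every iteration touches only its own c[i], the result is the same whether the loop runs on one thread or on many.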

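The parallel for pragma divides the for loop workload among multiple threads, each of which can run on a different core, which can reduce the total computation time significantly. Listing 8 demonstrates this.

Listing 8. Understanding omp_get_wtime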

#include <omp.h>
#include <math.h>
#include <time.h>
#include <iostream>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    static double c[1000000];  // static: keeps the 8 MB array off the stack
    for (nthreads = 1; nthreads <= 8; ++nthreads) {
        clock_timer = clock();
        wall_timer = omp_get_wtime();
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 1000000; i++)
          c[i] = sqrt(i * 4 + i * 2 + i);
        std::cout << "threads: " << nthreads <<  " time on clock(): " <<
            (double) (clock() - clock_timer) / CLOCKS_PER_SEC
           << " time on wall: " <<  omp_get_wtime() - wall_timer << "\n";
    }
}

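Listing 8 measures the time taken to run the inner for loop as the number of threads increases. The omp_get_wtime API returns the elapsed wall-clock time, in seconds, measured from some arbitrary but consistent point. So omp_get_wtime() - wall_timer gives the observed wall time for the for loop. The clock() library function, by contrast, estimates the processor time used by the whole program; that is, it adds up the processor time of the individual threads and reports the total. Listing 9 shows the results reported on my Intel Core i5 computer.

Listing 9. Statistics for running the inner for loop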

threads: 1 time on clock(): 0.020236 time on wall: 0.0204209
threads: 2 time on clock(): 0.015448 time on wall: 0.00773731
threads: 3 time on clock(): 0.021125 time on wall: 0.00737248
threads: 4 time on clock(): 0.032429 time on wall: 0.00890261
threads: 5 time on clock(): 0.018109 time on wall: 0.00647498
threads: 6 time on clock(): 0.017804 time on wall: 0.00584579
threads: 7 time on clock(): 0.018956 time on wall: 0.00893618
threads: 8 time on clock(): 0.018512 time on wall: 0.00647849

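Although the processor time stays roughly the same across all runs (as it should, apart from the extra time for thread creation and context switches), the wall time generally drops as the number of threads increases, which implies that the data is being processed in parallel on the cores. The drop is not monotonic in Listing 9: the largest gain comes from going from one thread to two, and beyond that the timings fluctuate. One final note on pragma syntax: #pragma omp parallel for private(i) means that the loop variable i is treated as thread-local storage, with one copy of the variable per thread. The thread-local variable is not initialized.

Critical sections in OpenMP

You weren't really thinking that OpenMP would handle critical sections automatically, were you? Granted, you don't have to create a mutex explicitly, but you still need to mark the critical section. Here is the syntax: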

#pragma omp critical (optional section name)
{
// no 2 threads can execute this code block concurrently
}

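The code following pragma omp critical can be run by only one thread at any given time. In addition, optional section name is a global identifier, and no two threads can be inside critical sections with the same global identifier name at the same time. Consider the code in Listing 10.

Listing 10. Multiple critical sections with the same name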

#pragma omp critical (section1)
{
myhashtable.insert("key1", "value1");
}
// ... other code follows
#pragma omp critical (section1)
{
myhashtable.insert("key2", "value2");
}

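Based on this code, you can safely assume that the two hash table insertions never happen at the same time, because the critical section names are the same. This is noticeably different from the way you handle critical sections with pthreads, which is characterized by the use (or abuse) of locks.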
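The hash table in Listing 10 is not shown, so here is a self-contained sketch of the same idea with a std::vector, whose push_back is likewise unsafe to call from several threads at once. The function name collect_even and the section name evens_guard are made up for this example.

```cpp
#include <vector>

// Hypothetical example: collect the even numbers below `limit`.
// push_back is not thread-safe, so it is wrapped in a named critical section.
std::vector<int> collect_even(int limit)
{
  std::vector<int> evens;
  #pragma omp parallel for
  for (int i = 0; i < limit; ++i) {
    if (i % 2 == 0) {
      #pragma omp critical (evens_guard)
      {
        evens.push_back(i);   // at most one thread is in here at a time
      }
    }
  }
  return evens;               // element order depends on thread scheduling
}
```

Only the insertion is serialized; the test i % 2 == 0 still runs in parallel, which is why the critical section is kept as small as possible.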

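Locks and mutexes with OpenMP

Interestingly, OpenMP provides its own mutex (so it is not all about pragmas after all): omp_lock_t, defined in the omp.h header. The usual pthread-style mutex operations apply, and even the API names are similar. There are five APIs you need to know:

omp_init_lock: This API must be the first one to access an omp_lock_t, and you use it to initialize the lock. Note that after initialization the lock is considered to be in the unset state.
omp_destroy_lock: This API destroys the lock. The lock must be in the unset state when this API is called, which means you cannot call omp_set_lock and then go on to destroy the lock.
omp_set_lock: This API sets the omp_lock_t, that is, it acquires the mutex. If a thread cannot set the lock, it keeps waiting until it can.
omp_test_lock: This API tries to set the lock if it is available; it returns a nonzero value on success and 0 otherwise. It is a non-blocking API, that is, the function does not make the thread wait to set the lock.
omp_unset_lock: This API releases the lock.

Listing 11 shows a legacy single-threaded queue extended for multithreaded use with OpenMP locks. Note that this approach does not suit every scenario; it is used here mainly as a quick illustration.

Listing 11. Extending a single-threaded queue with OpenMP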

#include <omp.h>
#include "myqueue.h"

class omp_q : public myqueue<int> {
public:
   typedef myqueue<int> base;
   omp_q( ) {
      omp_init_lock(&lock);
   }
   ~omp_q() {
       omp_destroy_lock(&lock);
   }
   bool push(const int& value) {
      omp_set_lock(&lock);
      bool result = this->base::push(value);
      omp_unset_lock(&lock);
      return result;
   }
   bool trypush(const int& value)
   {
       bool result = omp_test_lock(&lock);
       if (result) {
          result = result && this->base::push(value);
          omp_unset_lock(&lock);
      }
      return result;
   }
   // likewise for pop
private:
   omp_lock_t lock;
};
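Because myqueue.h is not shown, Listing 11 cannot be compiled on its own. The sketch below applies the same locking pattern to std::queue. The names locked_queue and fill_and_count are hypothetical, and the serial stand-ins in the #else branch are an assumption added only so that the example also builds without -fopenmp.

```cpp
#include <queue>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption): allow a build without -fopenmp.
typedef int omp_lock_t;
inline void omp_init_lock(omp_lock_t* l)   { *l = 0; }
inline void omp_destroy_lock(omp_lock_t*)  {}
inline void omp_set_lock(omp_lock_t* l)    { *l = 1; }
inline void omp_unset_lock(omp_lock_t* l)  { *l = 0; }
inline int  omp_test_lock(omp_lock_t* l)   { if (*l) return 0; *l = 1; return 1; }
#endif

// Hypothetical stand-in for the omp_q class, built on std::queue.
class locked_queue {
public:
  locked_queue()  { omp_init_lock(&lock_); }
  ~locked_queue() { omp_destroy_lock(&lock_); }
  locked_queue(const locked_queue&) = delete;             // a lock must not be copied
  locked_queue& operator=(const locked_queue&) = delete;

  void push(int value) {
    omp_set_lock(&lock_);          // blocks until the lock is acquired
    items_.push(value);
    omp_unset_lock(&lock_);
  }

  // Non-blocking pop: returns false if the lock is busy or the queue is empty.
  bool try_pop(int& out) {
    if (!omp_test_lock(&lock_))
      return false;
    bool ok = !items_.empty();
    if (ok) { out = items_.front(); items_.pop(); }
    omp_unset_lock(&lock_);
    return ok;
  }

private:
  std::queue<int> items_;
  omp_lock_t lock_;
};

// Push 0..n-1 from a parallel loop, then drain serially and count the items.
int fill_and_count(int n)
{
  locked_queue q;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    q.push(i);
  int count = 0, value = 0;
  while (q.try_pop(value))
    ++count;
  return count;
}
```

If the lock calls in push were removed, concurrent pushes could corrupt the queue and fill_and_count might return fewer than n items or crash.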

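Nested locks

The other kind of lock that OpenMP provides is the omp_nest_lock_t variant. It is similar to omp_lock_t, with one added advantage: a thread that already holds the lock can lock it again, multiple times. Each time the holding thread reacquires the nested lock with omp_set_nest_lock, an internal counter is incremented by one. The lock is released when one or more calls to omp_unset_nest_lock finally bring this internal counter back to 0. Here are the APIs used with omp_nest_lock_t:

omp_init_nest_lock(omp_nest_lock_t* ): This API initializes the internal nesting count to 0.
omp_destroy_nest_lock(omp_nest_lock_t* ): This API destroys the lock. Calling it on a lock with a nonzero internal nesting count results in undefined behavior.
omp_set_nest_lock(omp_nest_lock_t* ): This API is similar to omp_set_lock, except that a thread can call it multiple times while already holding the lock.
omp_test_nest_lock(omp_nest_lock_t* ): This API is the non-blocking version of omp_set_nest_lock.
omp_unset_nest_lock(omp_nest_lock_t* ): This API decrements the internal nesting count and releases the lock when the count reaches 0.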
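A typical reason to reach for a nested lock is a locked function that calls another locked function on the same object. The sketch below shows that case. The names nested_counter and nested_total are hypothetical, and the serial stand-ins in the #else branch are an assumption added only so that the example also builds without -fopenmp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption): allow a build without -fopenmp.
typedef int omp_nest_lock_t;
inline void omp_init_nest_lock(omp_nest_lock_t* l)   { *l = 0; }
inline void omp_destroy_nest_lock(omp_nest_lock_t*)  {}
inline void omp_set_nest_lock(omp_nest_lock_t* l)    { ++*l; }
inline void omp_unset_nest_lock(omp_nest_lock_t* l)  { --*l; }
#endif

// Hypothetical counter whose public functions all take the same nested lock.
struct nested_counter {
  omp_nest_lock_t lock;
  long total;

  nested_counter() : total(0) { omp_init_nest_lock(&lock); }
  ~nested_counter()           { omp_destroy_nest_lock(&lock); }

  void add(long amount) {
    omp_set_nest_lock(&lock);     // nesting count goes up by one
    total += amount;
    omp_unset_nest_lock(&lock);   // nesting count goes down by one
  }

  void add_twice(long amount) {
    omp_set_nest_lock(&lock);     // outer acquisition
    add(amount);                  // re-acquires; a plain omp_lock_t would deadlock here
    add(amount);
    omp_unset_nest_lock(&lock);   // count reaches 0: the lock is released
  }
};

// Call add_twice(1) from n parallel iterations and return the total.
long nested_total(int n)
{
  nested_counter counter;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    counter.add_twice(1);
  return counter.total;
}
```

With n iterations the total is always 2 * n, because a thread holds the lock for the whole of add_twice, including both inner calls to add.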

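Fine-grained control over task execution

In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. The most efficient granularity depends on the algorithm and the hardware it runs on. In most cases the overhead of communication and synchronization is high relative to execution speed, so coarse granularity is advantageous; fine-grained parallelism, however, can help reduce the overhead caused by load imbalance.

You have already seen that all threads run the block of code following pragma omp parallel in parallel. You can further divide the code inside that block so that selected threads execute it. Consider the code in Listing 12.

Listing 12. Learning to use the parallel sections pragma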

#include <iostream>
#include <omp.h>
using namespace std;

int main( )
{
  #pragma omp parallel num_threads(8)
  {
    cout << "All threads run this\n";
    #pragma omp sections
    {
      #pragma omp section
      {
        cout << "This executes in parallel\n";
      }
      #pragma omp section
      {
        cout << "Sequential statement 1\n";
        cout << "This always executes after statement 1\n";
      }
      #pragma omp section
      {
        cout << "This also executes in parallel\n";
      }
    }
  }
  return 0;
}

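The code between pragma omp parallel and pragma omp sections is run in parallel by all threads. The block that follows pragma omp sections is further divided into subsections by pragma omp section. Each pragma omp section block is executed by a single thread. The statements within a section block, however, always run sequentially. Listing 13 shows the output of the code in Listing 12.

Listing 13. Output from running the code in Listing 12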

All threads run this
This executes in parallel
Sequential statement 1
This always executes after statement 1
This also executes in parallel
All threads run this
All threads run this
All threads run this
All threads run this
All threads run this
All threads run this
All threads run this

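In Listing 12, you again create 8 threads up front. Of these 8 threads, only three are needed to do the work in the pragma omp sections block. Within the second section, you specified the order in which the output statements run. That is the whole point of the sections pragma: where needed, you can fix the order of code blocks.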
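Sections are also a convenient way to run two unrelated pieces of work side by side. The sketch below uses the combined parallel sections form to search a vector for its smallest and largest element at the same time. The function name min_and_max is hypothetical.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: find the smallest and largest element of a
// non-empty vector, giving each search its own section.
std::pair<int, int> min_and_max(const std::vector<int>& values)
{
  int lo = values.front();
  int hi = values.front();
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      // Only this section writes lo.
      lo = *std::min_element(values.begin(), values.end());
    }
    #pragma omp section
    {
      // Only this section writes hi.
      hi = *std::max_element(values.begin(), values.end());
    }
  }
  return std::make_pair(lo, hi);
}
```

Each section writes a different variable and only reads the shared vector, so no critical section or lock is needed.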

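Understanding the firstprivate and lastprivate clauses used with parallel loops

Earlier, you saw how to declare thread-local storage with private. So how should you initialize thread-local variables, say by synchronizing the local variable with the value of the variable in the main thread before the run? This is where the firstprivate clause comes in handy.

The firstprivate clause

With firstprivate(variable), you can initialize the variable in each thread to whatever value it had in the main thread. Consider the code in Listing 14.

Listing 14. Thread-local variable that is not synchronized with the main thread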

#include <stdio.h>
#include <omp.h>

int main()
{
  int idx = 100;
  #pragma omp parallel private(idx)
  {
    printf("In thread %d idx = %d\n", omp_get_thread_num(), idx);
  }
}

Here is the output I got. Your results may vary.

In thread 4 idx = 0
In thread 1 idx = 32660
In thread 2 idx = 0
In thread 3 idx = 0
In thread 5 idx = 0
In thread 6 idx = 0
In thread 0 idx = 0
In thread 7 idx = 0

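Listing 15 shows the code with the firstprivate clause. As expected, the output shows idx initialized to 100 in every thread.

Listing 15. Initializing thread-local variables with the firstprivate clause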

#include <stdio.h>
#include <omp.h>

int main()
{
  int idx = 100;
  #pragma omp parallel firstprivate(idx)
  {
    printf("In thread %d idx = %d\n", omp_get_thread_num(), idx);
  }
}

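Note also that you used the omp_get_thread_num( ) function to access a thread's ID. This is not the same as the thread ID shown by the Linux® top command; it is simply the number OpenMP uses to identify a thread within its team. If you plan to use firstprivate in your C++ code, note too that a variable named in a firstprivate clause uses its copy constructor to initialize itself from the main thread's variable, so a class whose copy constructor is private will certainly cause trouble. Now let's move on to the lastprivate clause, which in many ways is the opposite of firstprivate.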
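The copy-constructor requirement is easier to see with a small class. In the sketch below, the class Tally and the function every_copy_starts_at are hypothetical names chosen for this illustration.

```cpp
// Hypothetical class with a public copy constructor, as firstprivate requires.
struct Tally {
  int value;
  explicit Tally(int v) : value(v) {}
  Tally(const Tally& other) : value(other.value) {}  // builds each thread's copy
};

// Returns true if every thread's private copy started out equal to `start`.
bool every_copy_starts_at(int start)
{
  Tally tally(start);
  bool ok = true;
  #pragma omp parallel firstprivate(tally)
  {
    bool mine = (tally.value == start);  // the copy was built from the original
    tally.value = -1;                    // scribbles on this thread's copy only
    #pragma omp critical (tally_check)
    {
      ok = ok && mine;                   // ok is shared, so guard the update
    }
  }
  return ok;
}
```

Making the copy constructor private in Tally should turn the firstprivate clause into a compile-time error when the code is built with -fopenmp, which is the trouble the paragraph above warns about.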

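The lastprivate clause

Instead of initializing a thread-local variable with the main thread's data, you now want to synchronize the main thread's data with the data generated by the last loop iteration. The code in Listing 16 runs a parallel for loop.

Listing 16. Parallel for loop with no synchronization with the main thread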

#include <stdio.h>
#include <omp.h>

int main()
{
  int idx = 100;
  int main_var = 2120;

  #pragma omp parallel for private(idx)
  for (idx = 0; idx < 12; ++idx)
  {
    main_var = idx * idx;
    printf("In thread %d idx = %d main_var = %d\n",
      omp_get_thread_num(), idx, main_var);
  }
  printf("Back in main thread with main_var = %d\n", main_var);
}

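On my development machine with 4 cores, OpenMP created 4 threads for the parallel for block, and each thread ran three loop iterations. main_var is shared by all the threads, so its final value depends on which thread happened to run last, and therefore on the value of idx in that thread. In other words, the value of main_var does not depend on the last value of idx but on the value of idx in whichever thread ran last. The output in Listing 17 illustrates this.

Listing 17. The value of main_var depends on the last thread to run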

In thread 2 idx = 6 main_var = 36
In thread 2 idx = 7 main_var = 49
In thread 2 idx = 8 main_var = 64
In thread 0 idx = 0 main_var = 0
In thread 0 idx = 1 main_var = 1
In thread 0 idx = 2 main_var = 4
In thread 3 idx = 9 main_var = 81
In thread 3 idx = 10 main_var = 100
In thread 3 idx = 11 main_var = 121
In thread 1 idx = 3 main_var = 9
In thread 1 idx = 4 main_var = 16
In thread 1 idx = 5 main_var = 25
Back in main thread with main_var = 25

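Run the code in Listing 16 several times to confirm that the value of main_var in the main thread always depends on the value of idx in the last thread to run. So what if you want to synchronize the main thread's value with the final value of idx in the loop? This is where the lastprivate clause comes in, as shown in Listing 18. If you run the code in Listing 18 several times, just as you did with Listing 16, you will find that the final value of main_var in the main thread is always 121 (computed from idx = 11, the final value of the loop counter).

Listing 18. Synchronizing with the lastprivate clause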

#include <stdio.h>
#include <omp.h>

int main()
{
  int idx = 100;
  int main_var = 2120;

  #pragma omp parallel for private(idx) lastprivate(main_var)
  for (idx = 0; idx < 12; ++idx)
  {
    main_var = idx * idx;
    printf("In thread %d idx = %d main_var = %d\n",
      omp_get_thread_num(), idx, main_var);
  }
  printf("Back in main thread with main_var = %d\n", main_var);
}

Listing 19. Output of the code in Listing 18 (note that main_var in the main thread always equals 121)

In thread 1 idx = 3 main_var = 9
In thread 1 idx = 4 main_var = 16
In thread 1 idx = 5 main_var = 25
In thread 2 idx = 6 main_var = 36
In thread 2 idx = 7 main_var = 49
In thread 2 idx = 8 main_var = 64
In thread 0 idx = 0 main_var = 0
In thread 0 idx = 1 main_var = 1
In thread 0 idx = 2 main_var = 4
In thread 3 idx = 9 main_var = 81
In thread 3 idx = 10 main_var = 100
In thread 3 idx = 11 main_var = 121
Back in main thread with main_var = 121

One last note: to support the lastprivate clause on C++ objects, the corresponding class must have a publicly accessible operator= method.
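The lastprivate behavior described in this section can be packed into a small function that is easy to call repeatedly. This is a minimal sketch, and the function name last_square is hypothetical.

```cpp
// Hypothetical helper: returns the square computed by the sequentially
// last iteration (i = n - 1). Assumes n > 0.
int last_square(int n)
{
  int square = -1;   // placeholder; replaced by the last iteration's value
  #pragma omp parallel for lastprivate(square)
  for (int i = 0; i < n; ++i)
    square = i * i;  // each thread writes its own private copy
  // After the loop, square holds the value from iteration n - 1,
  // regardless of which thread finished last.
  return square;
}
```

For n = 12 this matches the value 121 that the main thread reports in Listing 19.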

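Merge sort with OpenMP

Let's look at a real-world example where OpenMP helps you save run time. This is not a heavily optimized version of merge sort, but it is enough to show the benefit of using OpenMP in your code. Listing 20 shows the code.

Listing 20. Merge sort with OpenMP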

#include <omp.h>
#include <vector>
#include <iostream>
using namespace std;

vector<long> merge(const vector<long>& left, const vector<long>& right)
{
    vector<long> result;
    unsigned left_it = 0, right_it = 0;

    while(left_it < left.size() && right_it < right.size())
    {
        if(left[left_it] < right[right_it])
        {
            result.push_back(left[left_it]);
            left_it++;
        }
        else
        {
            result.push_back(right[right_it]);
            right_it++;
        }
    }

    // Push the remaining data from both vectors onto the resultant
    while(left_it < left.size())
    {
        result.push_back(left[left_it]);
        left_it++;
    }

    while(right_it < right.size())
    {
        result.push_back(right[right_it]);
        right_it++;
    }

    return result;
}

vector<long> mergesort(vector<long>& vec, int threads)
{
    // Termination condition: a list with at most one element is already
    // sorted (this also stops the recursion on an empty vector).
    if(vec.size() <= 1)
    {
        return vec;
    }

    // Determine the location of the middle element in the vector
    std::vector<long>::iterator middle = vec.begin() + (vec.size() / 2);

    vector<long> left(vec.begin(), middle);
    vector<long> right(middle, vec.end());

    // Perform a merge sort on the two smaller vectors

    if (threads > 1)
    {
      #pragma omp parallel sections
      {
        #pragma omp section
        {
          left = mergesort(left, threads/2);
        }
        #pragma omp section
        {
          right = mergesort(right, threads - threads/2);
        }
      }
    }
    else
    {
      left = mergesort(left, 1);
      right = mergesort(right, 1);
    }

    return merge(left, right);
}

int main()
{
  vector<long> v(1000000);
  for (long i=0; i<1000000; ++i)
    v[i] = (i * i) % 1000000;
  v = mergesort(v, 1);  // the second argument is the thread budget
  for (long i=0; i<1000000; ++i)
    cout << v[i] << "\n";
}

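Running the merge sort with 8 threads (by passing 8 instead of 1 as the second argument to mergesort in main) gave a run time of 2.1 seconds, compared with 3.7 seconds with a single thread. The only thing you need to be careful about here is the number of threads. I used 8 threads; the right number depends on your system configuration. Without an explicit limit, however, the recursion could end up creating hundreds or even thousands of threads, and system performance would very likely degrade. The sections pragma discussed earlier works well for parallelizing the merge sort code.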
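The thread-budget idea can be isolated in a more compact sketch that sorts in place. Note the swapped-in technique: std::inplace_merge stands in for the hand-written merge function, so this is a variation on Listing 20 rather than the same code. The function name sort_range is hypothetical, and whether the inner parallel regions actually get extra threads depends on the runtime's nested-parallelism settings.

```cpp
#include <algorithm>
#include <vector>

// Sorts values[first, last) with a budget of at most `threads` threads.
// The budget is halved at every level, so the recursion stops opening
// parallel regions once it reaches 1.
void sort_range(std::vector<long>& values, long first, long last, int threads)
{
  if (last - first < 2)
    return;                            // 0 or 1 element: already sorted
  const long middle = first + (last - first) / 2;
  if (threads > 1) {
    // The two sections work on disjoint halves of the same vector.
    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
      sort_range(values, first, middle, threads / 2);
      #pragma omp section
      sort_range(values, middle, last, threads - threads / 2);
    }
  } else {
    sort_range(values, first, middle, 1);
    sort_range(values, middle, last, 1);
  }
  // std::inplace_merge replaces the hand-written merge() of Listing 20.
  std::inplace_merge(values.begin() + first, values.begin() + middle,
                     values.begin() + last);
}
```

A call such as sort_range(v, 0, v.size(), 8) sorts the whole vector. Because the budget shrinks with each level of recursion, the number of parallel regions stays bounded however large the input is.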

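Conclusion

That is it for this article. We covered a lot of ground: the OpenMP parallel pragmas, the different ways to create threads, the better run-time performance, synchronization, and fine-grained control that OpenMP offers, and a practical application of OpenMP in merge sort. There is still a lot more to learn about OpenMP, though, and the best place to learn it is the OpenMP project website. Be sure to check the Related topics section for additional details.

Related topics

Be sure to visit the OpenMP project website.
For more on improving merge sort performance, see the article Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs by Atanas Radenski.

Reference link: https://www.ibm.com/developerworks/cn/aix/library/au-aix-openmp-framework/index.html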
