Amdahl's Law


Reposted from: http://www.db110.com/%E9%98%BF%E5%A7%86%E8%BE%BE%E5%B0%94%E5%AE%9A%E5%BE%8Bamdahls-law/

  Amdahl's law is a rule of thumb in computer science, named after the IBM computer architect Gene Amdahl, who presented it in a 1967 paper.

  Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is most often used in parallel computing to predict the theoretical maximum speedup achievable with multiple processors. In performance tuning, the law helps us resolve, or at least mitigate, performance bottlenecks.

  The model behind Amdahl's law captures what we see in real production systems when serial resources are contended. As the model in the figure below shows, every system has some resources that can only be accessed serially, and this limits the achievable speedup: even as we increase the level of concurrency (the horizontal axis), the gains fall short of expectations and we cannot obtain linear scalability (the straight line in the figure).

[Figure: serial resource contention model - achieved speedup versus concurrency, falling short of the linear-scaling line]

  In the discussion below, "system", "algorithm", and "program" can all be treated as the object being optimized; I do not distinguish between them. Each has a serial part and a part that can be parallelized.

  In parallel computing, the speedup of a program running on multiple processors is limited by the execution time of its serial part. For example, suppose a program takes 20 hours on a single CPU core, and one hour of that is code that can only run serially, while the remaining 19 hours of work can be parallelized. Then no matter how many CPUs are available, the minimum execution time cannot drop below one hour (the serial part), so the speedup is limited to at most 20x (20/1).
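  As a quick check of this bound, using the figures above (1 hour serial, 19 hours parallelizable) and writing T(n) for the run time on n processors:

\begin{displaymath}
T(n) = 1 + \frac{19}{n} > 1, \qquad S(n) = \frac{20}{1 + 19/n} < 20 .
\end{displaymath}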

  The higher the speedup, the more effective the optimization.

  Amdahl's law can be expressed by the following formula:

 

\begin{displaymath}
S(n) = \frac{1}{B + \frac{1-B}{n}}
\end{displaymath}

  • S(n): the theoretical speedup under a fixed workload
  • B: the fraction of the work that is serial, ranging from 0 to 1
  • n: the number of parallel threads or parallel processing nodes

  The formula above says:

speedup = execution time of the algorithm before the improvement / execution time after the improvement

  Assume the total execution time before the improvement is 1 (one unit of time). After the improvement, the time is the serial part's time (B) plus the parallel part's time; because the parallel part can run on multiple CPU cores, its actual execution time is (1 - B)/n.

  According to this formula, if the number of parallel threads (think of it as the number of CPU cores) tends to infinity, the speedup becomes the reciprocal of the serial fraction: if 50% of the code in a system runs serially, the maximum speedup of the system is 2. In other words, simply adding CPUs does not necessarily speed a system up. We must first increase the proportion of the system that can be parallelized, and only then add processors as appropriate, in order to obtain the largest speedup for the smallest investment.
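  A minimal Python sketch of this formula; the function name amdahl_speedup and the sample values are illustrative, not from the original post:

# A minimal sketch of Amdahl's law: S(n) = 1 / (B + (1 - B) / n).
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Theoretical speedup for a fixed workload with serial fraction B on n processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

if __name__ == "__main__":
    # With 50% of the work serial, the speedup approaches but never reaches 2.
    for n in (1, 2, 4, 16, 256, 1_000_000):
        print(n, round(amdahl_speedup(0.5, n), 4))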

  To elaborate on Amdahl's law: the model defines, for a fixed workload, the speedup of a parallel implementation of an algorithm relative to its serial implementation. For example, if 12% of an algorithm's operations can be executed in parallel and the remaining 88% cannot, then Amdahl's law says the maximum speedup is 1/(1 - 0.12) ≈ 1.136. As n tends to infinity in the formula above, the speedup becomes S = 1/B = 1/(1 - 0.12).

  As another example, suppose a fraction P of an algorithm can be parallelized, and that parallel part can be sped up by a factor of s (you can think of s as the number of CPU cores, i.e. the new code runs in 1/s of its original time). Say 30% of the algorithm's code can be accelerated, so P = 0.3, and that part can be sped up 2x, so s = 2. Then Amdahl's law gives the overall speedup of the algorithm as:

\begin{displaymath}
S = \frac{1}{(1 - P) + \frac{P}{s}}
\end{displaymath}

  This formula is the same as the previous one, except that the previous one writes the denominator in terms of the serial fraction B.
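  Substituting the numbers from the example (P = 0.3, s = 2) gives approximately:

\begin{displaymath}
S = \frac{1}{(1 - 0.3) + \frac{0.3}{2}} = \frac{1}{0.85} \approx 1.18 .
\end{displaymath}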

  One more example: suppose a task can be broken into four parts, P1, P2, P3, and P4, which account for 11%, 18%, 23%, and 48% of the total execution time respectively. We optimize it: P1 cannot be improved, P2 can be sped up 5x, P3 can be sped up 20x, and P4 can be sped up 1.6x. The improved execution time is then:

\begin{displaymath}
0.11 + \frac{0.18}{5} + \frac{0.23}{20} + \frac{0.48}{1.6} = 0.4575
\end{displaymath}

  The overall speedup is 1 / 0.4575 ≈ 2.186. Even though some parts are sped up 20x or 5x, the overall speedup is not high, only a little above 2, because P4, the part with the largest share of the time, is sped up only 1.6x.
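  A small Python sketch that generalizes this multi-part calculation (the function name overall_speedup and the argument layout are my own):

# Amdahl-style speedup for a task split into several independently optimized parts.
def overall_speedup(parts):
    """parts: iterable of (time_fraction, speedup_factor); the fractions should sum to 1."""
    improved_time = sum(fraction / factor for fraction, factor in parts)
    return 1.0 / improved_time

if __name__ == "__main__":
    # The four-part example from the text: P1 unchanged, P2 5x, P3 20x, P4 1.6x.
    parts = [(0.11, 1.0), (0.18, 5.0), (0.23, 20.0), (0.48, 1.6)]
    print(overall_speedup(parts))  # about 2.186 (improved time 0.4575)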

  From the figure below we can see that the speedup is limited by the fraction of the work that is serial. Even when 95% of the code can be parallelized, the theoretical maximum speedup is higher, but it never exceeds 20x, since 1/(1 - 0.95) = 20.

[Figure: theoretical speedup versus number of processors for different parallel fractions; with 95% of the code parallelizable the speedup approaches but never exceeds 20x]

  Amdahl's law also guides scalable CPU design. CPUs can evolve in two directions: faster cores or more cores. The emphasis has shifted toward core count, and as the technology advances core counts keep growing; four- and six-core database servers are now common. Yet we often find that, even with more cores, when we run several programs at once only a few threads are actually doing work while the rest sit idle. In practice, running more threads in parallel often fails to improve performance noticeably, because programs do not use the cores effectively. On multi-core processors, speedup is a key measure of parallel program performance; whether we can shrink the serial fraction of the computation and reduce interaction overhead determines whether the cores can be fully exploited. The keys are to partition the work sensibly and to minimize inter-core communication.

  To restate: the law named after computer architect Gene Amdahl is used to find the maximum improvement an overall system can expect when only part of it is improved. In other words, it explains why adding more of something does not always double capability. The law applies throughout the computer industry, for example to the relationship between CPU core count and performance; in high-performance computing it explains why adding nodes does not yield a linear performance improvement.

Proof of Amdahl's Law

Speedup(due to enhancement E) = ExecutionTime(without E) / ExecutionTime(with E) = Performance(with E) / Performance(without E)
Suppose the enhancement E accelerates a fraction P of one task by a factor S
and leaves the remainder of the task unaffected. Then: ExecutionTime(with E) = {(1-P) + P/S} * ExecutionTime(without E); Speedup(E) = 1/{(1-P) + P/S}.

  This is the proof of Amdahl's law.
  The main use of Amdahl's law is to point out that, in computer architecture design, the benefit to the whole system from optimizing any single component has an upper bound: the limit as S → ∞ is speedup(E) = 1/(1 - P). It also tells us that, when optimizing an architecture, we should choose the components that have a major impact on the whole in order to get the best result.
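  A brief Python sketch of why a component's share of the total time matters as much as how far it is accelerated; the specific numbers below are illustrative only:

# Speedup from enhancing a fraction P of the work by a factor S (Amdahl's law).
def speedup_with_enhancement(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

if __name__ == "__main__":
    print(speedup_with_enhancement(0.20, 10.0))                 # 10x on 20% of the time: only ~1.22 overall
    print(speedup_with_enhancement(0.80, 2.0))                  # 2x on 80% of the time: ~1.67 overall
    print(speedup_with_enhancement(0.20, 1e9), 1 / (1 - 0.20))  # as S grows, the bound is 1/(1 - P) = 1.25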

Amdahl's Law & Parallel Speedup

The theory of doing computational work in parallel has some fundamental laws that place limits on the benefits one can derive from parallelizing a computation (or really, any kind of work). To understand these laws, we have to first define the objective. In general, the goal in large scale computation is to get as much work done as possible in the shortest possible time within our budget. We ``win'' when we can do a big job in less time or a bigger job in the same time and not go broke doing so. The ``power'' of a computational system might thus be usefully defined to be the amount of computational work that can be done divided by the time it takes to do it, and we generally wish to optimize power per unit cost, or cost-benefit.

Physics and economics conspire to limit the raw power of individual single processor systems available to do any particular piece of work even when the dollar budget is effectively unlimited. The cost-benefit scaling of increasingly powerful single processor systems is generally nonlinear and very poor - one that is twice as fast might cost four times as much, yielding only half the cost-benefit, per dollar, of a cheaper but slower system. One way to increase the power of a computational system (for problems of the appropriate sort) past the economically feasible single processor limit is to apply more than one computational engine to the problem.

This is the motivation for beowulf design and construction; in many cases a beowulf may provide access to computational power that is available in alternative single or multiple processor designs only at a far greater cost.

In a perfect world, a computational job that is split up among $N$ processors would complete in $1/N$ time, leading to an $N$-fold increase in power. However, any given piece of parallelized work to be done will contain parts of the work that must be done serially, one task after another, by a single processor. This part does not run any faster on a parallel collection of processors (and might even run more slowly). Only the part that can be parallelized runs as much as $N$-fold faster.

The ``speedup'' of a parallel program is defined to be the ratio of the rate at which work is done (the power) when a job is run on $N$ processors to the rate at which it is done by just one. To simplify the discussion, we will now consider the ``computational work'' to be accomplished to be an arbitrary task (generally speaking, the particular problem of greatest interest to the reader). We can then define the speedup (increase in power as a function of $N$) in terms of the time required to complete this particular fixed piece of work on 1 to $N$ processors.

Let $T(N)$ be the time required to complete the task on $N$ processors. The speedup $S(N)$ is the ratio

\begin{displaymath}
S(N) = \frac{T(1)}{T(N)}.
\end{displaymath} (1)

 

In many cases the time $T(1)$ has, as noted above, both a serial portion $T_s$ and a parallelizeable portion $T_p$. The serial time does not diminish when the parallel part is split up. If one is ``optimally'' fortunate, the parallel time is decreased by a factor of $1/N$. The speedup one can expect is thus

\begin{displaymath}
S(N) = \frac{T(1)}{T(N)} = \frac{T_s + T_p}{T_s + T_p/N}.
\end{displaymath} (2)

 

This elegant expression is known as Amdahl's Law [Amdahl] and is usually expressed as an inequality. This is in almost all cases the best speedup one can achieve by doing work in parallel, so the real speed up $S(N)$ is less than or equal to this quantity.
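For reference, dividing numerator and denominator by $T(1) = T_s + T_p$ and writing the serial fraction as $B = T_s/(T_s + T_p)$ recovers the form quoted earlier in this article:

\begin{displaymath}
S(N) = \frac{T_s + T_p}{T_s + T_p/N} = \frac{1}{B + \frac{1-B}{N}}, \qquad B = \frac{T_s}{T_s + T_p}.
\end{displaymath}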

Amdahl's Law immediately eliminates many, many tasks from consideration for parallelization. If the serial fraction of the code is not much smaller than the part that could be parallelized (if we rewrote it and were fortunate in being able to split it up among nodes to complete in less time than it otherwise would), we simply won't see much speedup no matter how many nodes or how fast our communications. Even so, Amdahl's law is still far too optimistic. It ignores the overhead incurred due to parallelizing the code. We must generalize it.

A fairer (and more detailed) description of parallel speedup includes at least two more times of interest:

${\bf T_s}$ The original single-processor serial time.
${\bf T_{is}}$ The (average) additional serial time spent doing things like interprocessor communications (IPCs), setup, and so forth in all parallelized tasks. This time can depend on $N$ in a variety of ways, but the simplest assumption is that each system has to expend this much time, one after the other, so that the total additional serial time is for example $N*T_{is}$.
${\bf T_p}$ The original single-processor parallelizeable time.
${\bf T_{ip}}$ The (average) additional time spent by each processor doing just the setup and work that it does in parallel. This may well include idle time, which is often important enough to be accounted for separately.

It is worth remarking that generally, the most important element that contributes to $T_{is}$ is the time required for communication between the parallel subtasks. This communication time is always there - even in the simplest parallel models where identical jobs are farmed out and run in parallel on a cluster of networked computers, the remote jobs must be begun and controlled with messages passed over the network. In more complex jobs, partial results developed on each CPU may have to be sent to all other CPUs in the beowulf for the calculation to proceed, which can be very costly in scaled time. As we'll see below, $T_{is}$ in particular plays an extremely important role in determining the speedup scaling of a given calculation. For this (excellent!) reason many beowulf designers and programmers are obsessed with communications hardware and algorithms.

It is common to combine $T_{ip}$, $N$ and $T_{is}$ into a single expression $T_o(N)$ (the ``overhead time'') which includes any complicated $N$-scaling of the IPC, setup, idle, and other times associated with the overhead of running the calculation in parallel, as well as the scaling of these quantities with respect to the ``size'' of the task being accomplished. The description above (which we retain as it illustrates the generic form of the relevant scalings) is still a simplified description of the times - real life parallel tasks can be much more complicated, although in many cases the description above is adequate.

Using these definitions and doing a bit of algebra, it is easy to show that an improved (but still simple) estimate for the parallel speedup resulting from splitting a particular job up between $N$ nodes (assuming one processor per node) is:

\begin{displaymath}
S(N) = \frac{T_s + T_p}{T_s + N*T_{is} + T_p/N + T_{ip}}.
\end{displaymath} (3)

 

This expression will suffice to get at least a general feel for the scaling properties of a task that might be parallelized on a typical beowulf.
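A short Python sketch of expression (3), using illustrative values in the spirit of the figures below ($T_s = 10$, $T_{ip} = 1$, $T_{is} = 1$, linear $N$ scaling of the IPC time); under these assumptions the speedup peaks near $N = \sqrt{T_p/T_{is}}$, consistent with the saturation discussed later:

# Parallel speedup with serial overhead, expression (3):
#   S(N) = (Ts + Tp) / (Ts + N*Tis + Tp/N + Tip)
def speedup(n: int, t_s: float, t_p: float, t_is: float, t_ip: float) -> float:
    return (t_s + t_p) / (t_s + n * t_is + t_p / n + t_ip)

if __name__ == "__main__":
    t_s, t_p, t_is, t_ip = 10.0, 10_000.0, 1.0, 1.0
    best_n = max(range(1, 1025), key=lambda n: speedup(n, t_s, t_p, t_is, t_ip))
    print(best_n, round(speedup(best_n, t_s, t_p, t_is, t_ip), 1))
    # With Tp/Tis = 10000, the peak lands near N = 100; adding nodes beyond
    # that point lowers the speedup.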

 

 

Figure 1: $T_{is} = 0$ and $T_p =$ 10, 100, 1000, 10000, 100000 (in increasing order).
\begin{figure}\centerline{\psfig{file=bscaling.Tis0.eps,height=2.8in}}\end{figure}

 

It is useful to plot the dimensionless ``real-world speedup'' (3) for various relative values of the times. In all the figures below, $T_s$ = 10 (which sets our basic scale, if you like) and $T_p$ = 10, 100, 1000, 10000, 100000 (to show the systematic effects of parallelizing more and more work compared to $T_s$).

The primary determinant of beowulf scaling performance is the amount of (serial) work that must be done to set up jobs on the nodes and then in communications between the nodes, the time that is represented as $T_{is}$. All figures have $T_{ip} = 1$ fixed; this parameter is rather boring as it effectively adds to $T_s$ and is often very small.

Figure 1 shows the kind of scaling one sees when communication times are negligible compared to computation. This set of curves is roughly what one expects from Amdahl's Law alone, which was derived with no consideration of IPC overhead. Note that the dashed line in all figures is perfectly linear speedup, which is never obtained over the entire range of $N$ although one can often come close for small $N$ or large enough $T_p$.

In figure 2, we show a fairly typical curve for a ``real'' beowulf, with a relatively small IPC overhead of $T_{is} = 1$. In this figure one can see the advantage of cranking up the parallel fraction ($T_p$ relative to $T_s$) and can also see how even a relatively small serial communications process on each node causes the gain curves to peak well short of the saturation predicted by Amdahl's Law in the first figure. Adding processors past this point actually reduces the speedup. Increasing $T_{is}$ further (relative to everything else) causes the speedup curves to peak earlier and at smaller values.

Finally, in figure 3 we continue to set $T_{is} = 1$, but this time with a quadratic $N$ dependence $N^2*T_{is}$ of the serial IPC time. This might result if the communications required between processors is long range (so every processor must speak to every other processor) and is not efficiently managed by a suitable algorithm. There are other ways to get nonlinear dependences of the additional serial time on $N$, and as this figure clearly shows they can have a profound effect on the per-processor scaling of the speedup.

 

 

Figure 2: $T_{is} = 1$ and $T_p =$ 10, 100, 1000, 10000, 100000 (in increasing order).
\begin{figure}\centerline{\psfig{file=bscaling.Tis1.eps,height=2.8in}}\end{figure}

 

 

 

Figure 3: $T_{is} = 1$ and $T_p =$ 10, 100, 1000, 10000, 100000 (in increasing order) with $T_{is}$ contributing quadratically in $N$.
\begin{figure}\centerline{\psfig{file=bscaling.Tis1.Nsq.eps,height=2.8in}}\end{figure}

 

As one can clearly see, unless the ratio of $T_p$ to $T_{is}$ is in the ballpark of 100,000 to 1 one cannot actually benefit from having 128 processors in a ``typical'' beowulf. At only 10,000 to 1, the speedup saturates at around 100 processors and then decreases. When the ratio is even smaller, the speedup peaks with only a handful of nodes working on the problem. From this we learn some important lessons. The most important one is that for many problems simply adding processors to a beowulf design won't provide any additional speedup and could even slow a calculation down unless one also scales up the problem (increasing the $T_p$ to $T_{is}$ ratio) as well.

The scaling of a given calculation has a significant impact on beowulf engineering. Because of overhead, speedup is not a matter of just adding the speed of however many nodes one applies to a given problem. For some problems it is clearly advantageous to trade off the number of nodes one purchases (for example in a problem with small $T_s$ and $T_p/T_{is} \approx 100$) in order to purchase tenfold improved communications (and perhaps alter the $T_p/T_{is}$ ratio to 1000).

The nonlinearities prevent one from developing any simple rules of thumb in beowulf design. There are times when one obtains the greatest benefit by selecting the fastest possible processors and network (which reduce both $T_s$ and $T_p$ in absolute terms) instead of buying more nodes because we know that the rate equation above will limit the parallel speedup we might ever hope to get even with the fastest nodes. Paradoxically, there are other times that we can do better (get better speedup scaling, at any rate) by buying slower processors (when we are network bound, for example), as this can also increase $T_p/T_{is}$. In general, one should be aware of the peaks that occur at the various scales and not naively distribute small calculations (with small $T_p/T_{is}$) over more processors than they can use.

In summary, parallel performance depends primarily on certain relatively simple parameters like $T_s$, $T_p$ and $T_{is}$ (although there may well be a devil in the details that we've passed over). These parameters, in turn, are at least partially under our control in the form of programming decisions and hardware design decisions. Unfortunately, they depend on many microscopic measures of system and network performance that are inaccessible to many potential beowulf designers and users. $T_p$ clearly should depend on the ``speed'' of a node, but the single node speed itself may depend nonlinearly on the speed of the processor, the size and structure of the caches, the operating system, and more.

Because of the nonlinear complexity, there is no way to a priori estimate expected performance on the basis of any simple measure. There is still considerable benefit to be derived from having in hand a set of quantitative measures of ``microscopic'' system performance and gradually coming to understand how one's program depends on the potential bottlenecks they reveal. The remainder of this paper is dedicated to reviewing the results of applying a suite of microbenchmark tools to a pair of nodes to provide a quantitative basis for further beowulf hardware and software engineering.

 


Robert G. Brown 2000-08-28 

