14. Chapter 3 Continued: In-Depth Analysis and Implementation of the Quickselect (SELECT) Algorithm




Author: July.
Source: http://blog.csdn.net/v_JULY_v

 

 

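Preface

    The Classic Algorithms Research series has so far covered thirteen algorithms in 22 articles (for details, see: Thirteen Classic Algorithms: Research and Summary, Table of Contents + Index). I was worried that I might stop writing this series. After thinking it over for a long time, I decided that work on the Rhapsody series should not hold it up, especially since the response so far has been good.

    Chapter 3 of the Rhapsody series introduced the quickselect (SELECT) algorithm. That chapter also gave the proof that SELECT, by choosing the median of medians of the array as the pivot, runs in linear O(N) time even in the worst case.

   This article starts with an analysis of quicksort (because, as you know, quickselect and quicksort use a similar partition step), drawing on Mark Allen Weiss's Data Structures and Algorithm Analysis in C. It then analyzes quickselect (SELECT) step by step and finally gives an implementation of SELECT.

   Part of this article comes from Chapter 3 of the Rhapsody series, so it also serves as a summary of Chapter 3, Finding the k Smallest Numbers. Questions and corrections are welcome. If you find any problem or error in this article or on this blog, I will send you a free copy of the latest (6th) edition of this blog's collected-posts CHM file. Thank you.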


Section 1. Quicksort

1.1 Introduction to quicksort

      I have already written three articles on quicksort (see two of them: 1. Twelve: All Versions of Quicksort Implemented in C/C++; 2. One, Continued: In-Depth Analysis of Quicksort). Why bring it up again? Because the partition step in quickselect, the problem at hand, is the same as in quicksort. So before analyzing SELECT, let us briefly review quicksort. Mark Allen Weiss's Data Structures and Algorithm Analysis in C has a good introduction to quicksort, so for a change I quote his account directly instead of my own earlier articles:

    As its name implies, quicksort is the fastest known sorting algorithm in practice. Its average running time is O(n log n). It is very fast, mainly due to a very tight and highly optimized inner loop. It has O(n^2) worst-case performance, but this can be made exponentially unlikely with a little effort.

    The quicksort algorithm is simple to understand and prove correct, although for many years it had the reputation of being an algorithm that could in theory be highly optimized but in practice was impossible to code correctly (no doubt because of FORTRAN).

    Like mergesort, quicksort is a divide-and-conquer recursive algorithm. The basic algorithm to sort an array S consists of the following four easy steps:

1. If the number of elements in S is 0 or 1, then return.
2. Pick any element v in S. This is called the pivot.
3. Partition S - {v} (the remaining elements in S) into two disjoint groups: S1 = {x ∈ S - {v} | x <= v}, and S2 = {x ∈ S - {v} | x >= v}.
4. Return { quicksort(S1) followed by v followed by quicksort(S2) }.

Following the steps above, the first partitioning pass over the sequence 13, 81, 92, 43, 65, 31, 57, 26, 75, 0 proceeds as shown in the figure below:



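1.2 Ways to choose the pivot
1. A bad way
    A common choice is to use the first element of the array as the pivot. This is acceptable if the input is random. But if the input is presorted or in reverse order, partitioning around this pivot goes very badly, because all the elements end up either in S1 or in S2.
2. A better way
   A better choice is to pick the pivot at random. In general this strategy is quite safe.
3. Median-of-three
   Use the median of the left, right, and center elements as the pivot. For example, for the input 8, 1, 4, 9, 6, 3, 5, 2, 7, 0, the left element is 8, the right element is 0, and the element at the center position ⌊(left+right)/2⌋ is 6, so the pivot is 6. Median-of-three partitioning clearly eliminates the bad case for presorted input, and it reduces the running time of quicksort by about 5% (an empirical figure from earlier experiments; it cannot be proved exactly).

1.3 The partitioning process
   Next, we run the first partitioning pass on the sequence 8, 1, 4, 9, 6, 3, 5, 2, 7, 0. The goal is to move every element smaller than the pivot (6, chosen by median-of-three) to the left part of the array and every element larger than the pivot to the right part.

   The pivot 6 is first swapped with the last element; i starts at the first element and j at the next-to-last. The process is shown in the following diagrams: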
8  1  4  9  0  3  5  2  7  6                    
i                               j

8  1  4  9  0  3  5  2  7  6                
i                           j

      After First Swap:
----------------------------
2  1  4  9  0  3  5  8  7  6               
i                           j


      Before Second Swap:
----------------------------
2  1  4  9  0  3  5  8  7  6                
            i           j

      After Second Swap:
----------------------------
2  1  4  5  0  3  9  8  7  6              
            i           j


     Before Third Swap
----------------------------
2  1  4  5  0  3  9  8  7  6
                    j   i
   // i and j have crossed (j stops at 3, i stops at 9); swapping a[i] = 9 with the pivot 6 gives:

2  1  4  5  0  3  6  8  7  9                                
                        i         pivot

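This ends the first partitioning pass: the pivot 6 has split the sequence into a left part of smaller elements and a right part of larger elements.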
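The pass traced above can be reproduced with a short C sketch. It is a minimal illustration, not the listing from the book: the name partition_pivot_last, the int element type, and the explicit bounds checks are assumptions of this sketch, and it expects the caller to have already moved the pivot into the last slot, as in the diagrams.

```c
#include <assert.h>

/* Swap two ints in place. */
void swap_int(int *x, int *y) {
    int t = *x;
    *x = *y;
    *y = t;
}

/*
 * Partition a[0..n-1] around the pivot stored in a[n-1].
 * Assumption of this sketch: the pivot is already in the last slot.
 * Returns the final index of the pivot.
 */
int partition_pivot_last(int a[], int n) {
    assert(n >= 2);
    int pivot = a[n - 1];
    int i = 0, j = n - 2;
    for (;;) {
        while (i < n - 1 && a[i] < pivot) i++;  /* i stops at an element >= pivot */
        while (j > 0 && a[j] > pivot) j--;      /* j stops at an element <= pivot */
        if (i >= j) break;                      /* the indices met or crossed */
        swap_int(&a[i], &a[j]);
        i++;
        j--;
    }
    swap_int(&a[i], &a[n - 1]);                 /* move the pivot to its final place */
    return i;
}
```

Called on 8 1 4 9 0 3 5 2 7 6, this performs the same two swaps as the diagrams and leaves the pivot 6 at index 6.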

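1.4 Four details

Here are four details worth your attention:
    1. How should elements equal to the pivot be handled? The question is whether i should stop when it sees a key equal to the pivot, and whether j should stop when it sees an element equal to the pivot.
The answer: both i and j stop when they encounter an element equal to the pivot.
    2. For very small arrays, e.g. N <= 20, quicksort does not perform as well as insertion sort.
    3. Any sorting algorithm that sorts only by comparing elements needs Ω(N log N) comparisons in the worst case. Examples are quicksort (worst case O(N^2), best case O(N log N)) and mergesort (worst case O(N log N); its drawback is that merging two sorted sequences needs linear extra memory, and copying the data to a temporary array and back adds overhead that slows it down).
    4. The following program implements median-of-three partitioning: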

// Median-of-three partitioning.
// Both the quicksort implementation below and the median-of-three quickselect (SELECT) for finding the k smallest numbers call this median3 function.
input_type median3( input_type a[], int left, int right )
{
 int center;
 center = (left + right) / 2;
 
 if( a[left] > a[center] ) 
  swap( &a[left], &a[center] );
 if( a[left] > a[right] ) 
  swap( &a[left], &a[right] );
 if( a[center] > a[right] ) 
  swap( &a[center], &a[right] );
 
 /* invariant: a[left] <= a[center] <= a[right] */
 swap( &a[center], &a[right-1] );     /* hide pivot */
 return a[right-1];                   /* return pivot */
}

The following program is a quicksort that uses the median-of-three partitioning above:

// Quicksort, implementation 1
void q_sort( input_type a[], int left, int right )
{
 int i, j;
 input_type pivot;
 if( left + CUTOFF <= right )
 { 
  pivot = median3( a, left, right );   // call the median3 function above
  i=left; j=right-1;   // statement 8
  for(;;) 
  {  
   while( a[++i] < pivot );  
   while( a[--j] > pivot ); 
   if( i < j )  
    swap( &a[i], &a[j] );  
   else   
    break;       // statement 16
  } 
  swap( &a[i], &a[right-1] );   /*restore pivot*/   
  q_sort( a, left, i-1 );       
  q_sort( a, i+1, right );
  
  // As shown above, after the partition step quicksort recurses twice:
  // once on the left part and once on the right part. As you will see below,
  // quickselect (SELECT) always recurses on one side only. Intuitively, this is
  // why quicksort (O(N*logN)) takes more time on average than the
  // quickselect (SELECT) algorithm of Section 2 (O(N)).

 }  
}
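The listing above leaves input_type, swap, CUTOFF, and the handling of ranges shorter than the cutoff undefined. The sketch below repeats median3 and q_sort so that it compiles on its own, and fills those gaps with assumptions: input_type is int, CUTOFF is 3, and a hypothetical driver quick_sort finishes the short ranges with one insertion-sort pass. None of these choices come from the article.

```c
#include <assert.h>

typedef int input_type;   /* assumption: the article never defines input_type */
#define CUTOFF 3          /* assumption: ranges of CUTOFF or fewer elements are not partitioned */

void swap(input_type *x, input_type *y) {
    input_type t = *x;
    *x = *y;
    *y = t;
}

/* Order a[left], a[center], a[right], then hide the median at a[right-1]. */
input_type median3(input_type a[], int left, int right) {
    int center = left + (right - left) / 2;
    if (a[left] > a[center]) swap(&a[left], &a[center]);
    if (a[left] > a[right]) swap(&a[left], &a[right]);
    if (a[center] > a[right]) swap(&a[center], &a[right]);
    swap(&a[center], &a[right - 1]);
    return a[right - 1];
}

/* Same logic as the article's q_sort: short ranges are left unsorted. */
void q_sort(input_type a[], int left, int right) {
    if (left + CUTOFF <= right) {
        input_type pivot = median3(a, left, right);
        int i = left, j = right - 1;
        for (;;) {
            while (a[++i] < pivot) { }   /* a[right-1] == pivot stops i */
            while (a[--j] > pivot) { }   /* a[left] <= pivot stops j */
            if (i < j) swap(&a[i], &a[j]);
            else break;
        }
        swap(&a[i], &a[right - 1]);      /* restore pivot */
        q_sort(a, left, i - 1);
        q_sort(a, i + 1, right);
    }
}

/* One insertion-sort pass finishes the short ranges q_sort skipped. */
void insertion_sort(input_type a[], int n) {
    for (int p = 1; p < n; p++) {
        input_type tmp = a[p];
        int j = p;
        while (j > 0 && a[j - 1] > tmp) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}

/* Hypothetical driver: partition down to the cutoff, then clean up. */
void quick_sort(input_type a[], int n) {
    if (n < 2) return;
    q_sort(a, 0, n - 1);
    insertion_sort(a, n);
}
```

CUTOFF must be at least 2 here, because median3 reads a[right-1] and the inner loops rely on a[left] and a[right-1] as sentinels.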

If statements 8 through 16 above are rewritten as follows:

i=left+1; j=right-2;
for(;;)
{
 while( a[i] < pivot ) i++; 
 while( a[j] > pivot ) j--; 
 if( i < j ) 
  swap( &a[i], &a[j] ); 
 else
  break; 
}

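then when a[i] = a[j] = pivot the loop never terminates: neither index advances, so the same two elements are swapped forever. Next, we turn to the main topic, the quickselect (SELECT) algorithm.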


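Section 2. Quickselect (SELECT) in Expected Linear Time

2.1 Introduction to quickselect (SELECT)

  Quicksort can be modified to solve the selection problem, which we have seen in chapters 1 and 6. Recall that by using a priority queue, we can find the kth largest (or smallest) element in O(n + k log n). (Author's note: build a min-heap from the array in O(n), then take the first k values from this priority queue at O(log n) each, for O(n) + k*O(log n) in total. In practice, to find the k smallest numbers it is better to keep a max-heap of size k, which costs O(n log k). For details, see Chapter 3 of the Rhapsody series, Finding the k Smallest Numbers.) For the special case of finding the median, this gives an O(n log n) algorithm.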
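The max-heap variant mentioned in the note can be sketched as follows. This is a minimal illustration under assumptions: the function names sift_down and k_smallest, the int element type, and the caller-supplied output buffer are all hypothetical and do not come from the book or from Chapter 3.

```c
#include <assert.h>

/* Restore the max-heap property for the subtree rooted at index i. */
void sift_down(int heap[], int size, int i) {
    for (;;) {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] > heap[largest]) largest = l;
        if (r < size && heap[r] > heap[largest]) largest = r;
        if (largest == i) return;
        int t = heap[i];
        heap[i] = heap[largest];
        heap[largest] = t;
        i = largest;
    }
}

/*
 * Copy the k smallest values of a[0..n-1] into out[0..k-1].
 * out is left in heap order (not sorted); out[0] is the k-th smallest.
 */
void k_smallest(const int a[], int n, int k, int out[]) {
    assert(k >= 1 && k <= n);
    for (int i = 0; i < k; i++) out[i] = a[i];
    for (int i = k / 2 - 1; i >= 0; i--) sift_down(out, k, i);  /* O(k) build */
    for (int i = k; i < n; i++) {
        if (a[i] < out[0]) {        /* smaller than the largest value kept so far */
            out[0] = a[i];
            sift_down(out, k, 0);   /* O(log k) per replacement */
        }
    }
}
```

Each of the n - k remaining elements costs at most one O(log k) sift, which gives the O(n log k) bound quoted in the note.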

    Since we can sort the file in O(n log n) time, one might expect to obtain a better time bound for selection. The algorithm we present to find the kth smallest element in a set S is almost identical to quicksort. In fact, the first three steps are the same. We will call this algorithm quickselect. Let |Si| denote the number of elements in Si. The steps of quickselect are:

    1. If |S| = 1, then k = 1 and return the element in S as the answer. If a cutoff for small files is being used and |S| <= CUTOFF, then sort S and return the kth smallest element.
    2. Pick a pivot element, v ∈ S.
    3. Partition S - {v} into S1 and S2, as was done with quicksort.
    4. If k <= |S1|, then the kth smallest element must be in S1. In this case, return quickselect (S1, k). If k = 1 + |S1|, then the pivot is the kth smallest element and we can return it as the answer. Otherwise, the kth smallest element lies in S2, and it is the (k - |S1| - 1)st smallest element in S2. We make a recursive call and return quickselect (S2, k - |S1| - 1).

(Author's note: the programs in the following sections may treat k slightly differently, but that does not matter as long as you grasp the principle.)

    In contrast to quicksort, quickselect makes only one recursive call instead of two. The worst case of quickselect is identical to that of quicksort and is O(n^2). Intuitively, this is because quicksort's worst case is when one of S1 and S2 is empty; thus, quickselect is not really saving a recursive call. The average running time, however, is O(n) (author's note: pay attention to this sentence, the average complexity is O(N)). The analysis is similar to quicksort's and is left as an exercise.

    The implementation of quickselect is even simpler than the abstract description might imply. The code to do this is shown in Figure 7.16. When the algorithm terminates, the kth smallest element is in position k. This destroys the original ordering; if this is not desirable, then a copy must be made.

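2.2 Finding the k-th smallest element with median-of-three partitioning

    Section 1 already introduced median-of-three partitioning. One detail to note: array indices start at 0, so the k-th smallest element is a[i] with i = k-1. In other words, the k-th smallest element sits at index k-1 of the array. The original post followed this with two code implementations of finding the k-th smallest element by median-of-three partitioning.

    Those programs use the median of three as the pivot, which makes the worst case so unlikely that it can almost be ignored. However, as you will see shortly, a better pivot-selection method, "median-of-median-of-five" (that is, "median of medians"), guarantees linear O(N) complexity even in the worst case. See Section 2.3.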
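As a stand-in for those listings, here is a minimal sketch of quickselect with median-of-three partitioning, modeled on the q_sort listing in Section 1. It is not the author's code: the names q_select, kth_smallest, and insertion_sort_range, the int element type, and CUTOFF = 3 are assumptions. k is 1-based, and the answer ends up at index k-1, matching the indexing remark above.

```c
#include <assert.h>

#define CUTOFF 3   /* assumption: ranges of CUTOFF or fewer elements are insertion-sorted */

void swap_int(int *x, int *y) {
    int t = *x;
    *x = *y;
    *y = t;
}

/* Order a[left], a[center], a[right]; park the median at a[right-1]. */
int median3(int a[], int left, int right) {
    int center = left + (right - left) / 2;
    if (a[left] > a[center]) swap_int(&a[left], &a[center]);
    if (a[left] > a[right]) swap_int(&a[left], &a[right]);
    if (a[center] > a[right]) swap_int(&a[center], &a[right]);
    swap_int(&a[center], &a[right - 1]);
    return a[right - 1];
}

/* Insertion-sort the short range a[left..right]. */
void insertion_sort_range(int a[], int left, int right) {
    for (int p = left + 1; p <= right; p++) {
        int tmp = a[p], j = p;
        while (j > left && a[j - 1] > tmp) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}

/* On return, a[k-1] holds the k-th smallest element; k-1 must lie in [left, right]. */
void q_select(int a[], int k, int left, int right) {
    if (left + CUTOFF <= right) {
        int pivot = median3(a, left, right);
        int i = left, j = right - 1;
        for (;;) {
            while (a[++i] < pivot) { }
            while (a[--j] > pivot) { }
            if (i < j) swap_int(&a[i], &a[j]);
            else break;
        }
        swap_int(&a[i], &a[right - 1]);          /* restore pivot */
        if (k - 1 < i) q_select(a, k, left, i - 1);        /* answer is on the left */
        else if (k - 1 > i) q_select(a, k, i + 1, right);  /* answer is on the right */
        /* k - 1 == i: the pivot itself is the answer */
    } else {
        insertion_sort_range(a, left, right);
    }
}

/* Hypothetical wrapper: returns the k-th smallest (1-based) and reorders a. */
int kth_smallest(int a[], int n, int k) {
    assert(n >= 1 && k >= 1 && k <= n);
    q_select(a, k, 0, n - 1);
    return a[k - 1];
}
```

Unlike q_sort, only one of the two recursive calls is ever made, which is where the O(N) average time comes from. The array is reordered in place, so pass a copy if the original order matters.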

2.3 Median-of-median-of-five guarantees O(N)

    The selection problem requires us to find the kth smallest element in a list S of n elements. Of particular interest is the special case of finding the median. This occurs when k = ⌈n/2⌉.

    In Chapters 1, 6, 7 we have seen several solutions to the selection problem. The solution in Chapter 7 uses a variation of quicksort and runs in O(n) average time (author's note: this is the quickselect approach described in Section 2.1 above). Indeed, it is described in Hoare's original paper on quicksort. 

    Although this algorithm runs in linear average time, it has a worst case of O(n^2). Selection can easily be solved in O(n log n) worst-case time by sorting the elements, but for a long time it was unknown whether or not selection could be accomplished in O(n) worst-case time. The quickselect algorithm outlined in Section 7.7.6 is quite efficient in practice, so this was mostly a question of theoretical interest. 

    Recall that the basic algorithm is a simple recursive strategy. Assuming that n is larger than the cutoff point where elements are simply sorted, an element v, known as the pivot, is chosen. The remaining elements are placed into two sets, S1 and S2. S1 contains elements that are guaranteed to be no larger than v, and S2 contains elements that are no smaller than v. Finally, if k <= |S1|, then the kth smallest element in S can be found by recursively computing the kth smallest element in S1. If k = |S1| + 1, then the pivot is the kth smallest element. Otherwise, the kth smallest element in S is the (k - |S1| - 1)st smallest element in S2. The main difference between this algorithm and quicksort is that there is only one subproblem to solve instead of two.

    Theorem 10.9
The running time of quickselect using median-of-median-of-five partitioning is O(n).
 

    The basic idea is still useful. Indeed, we will see that we can use it to improve the expected number of comparisons that quickselect makes. To get a good worst case, however, the key idea is to use one more level of indirection. Instead of finding the median from a sample of random elements, we will find the median from a sample of medians.

The basic pivot selection algorithm is as follows:
    1. Arrange the n elements into ⌊n/5⌋ groups of 5 elements, ignoring the (at most four) extra elements.
    2. Find the median of each group. This gives a list M of ⌊n/5⌋ medians.
    3. Find the median of M. Return this as the pivot, v.

    We will use the term median-of-median-of-five partitioning to describe the quickselect algorithm that uses the pivot selection rule given above. We will now show that median-of-median-of-five partitioning guarantees that each recursive subproblem is at most roughly 70 percent as large as the original. We will also show that the pivot can be computed quickly enough to guarantee an O(n) running time for the entire selection algorithm. (Author's note: this again supports the view that the partition-based divide-and-conquer method, which resembles quicksort, runs in O(N). For a more detailed proof, see Chapter 3, Finding the k Smallest Numbers.)

2.4 Median of medians: the O(N) bound shown again

    The following is Section 9.3 (Selection in worst-case linear time) of Chapter 9 of Introduction to Algorithms; the notes in parentheses marked as the author's are my own additions:

9.3 Selection in worst-case linear time

    We now examine a selection algorithm whose running time is O(n) in the worst case. Like RANDOMIZED-SELECT, the algorithm SELECT finds the desired element by recursively partitioning the input array. The idea behind the algorithm, however, is to guarantee a good split when the array is partitioned. SELECT uses the deterministic partitioning algorithm PARTITION from quicksort (see Section 7.1), modified to take the element to partition around as an input parameter.

    The SELECT algorithm determines the ith smallest of an input array of n > 1 elements by executing the following steps. (If n = 1, then SELECT merely returns its only input value as the ith smallest.)

  1. Divide the n elements of the input array into ⌊n/5⌋ groups of 5 elements each and at most one group made up of the remaining n mod 5 elements.
  2. Find the median of each of the ⌈n/5⌉ groups by first insertion sorting the elements of each group (of which there are at most 5) and then picking the median from the sorted list of group elements.
  3. Use SELECT recursively to find the median x of the ⌈n/5⌉ medians found in step 2. (If there are an even number of medians, then by our convention, x is the lower median.)
  4. Partition the input array around the median-of-medians x using the modified version of PARTITION. Let k be one more than the number of elements on the low side of the partition, so that x is the kth smallest element and there are n-k elements on the high side of the partition.
  5. If i = k, then return x. Otherwise, use SELECT recursively to find the ith smallest element on the low side if i < k, or the (i - k)th smallest element on the high side if i > k.

(Author's note: these five steps are the "median-of-median-of-five" method described in Section 2.3 above.)

 

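    To analyze the running time of SELECT, we first determine a lower bound on the number of elements that are greater than the partitioning element x. Figure 9.1 is helpful in visualizing this bookkeeping. At least half of the medians found in step 2 are greater than the median-of-medians x. Thus, at least half of the ⌈n/5⌉ groups contribute 3 elements that are greater than x, except for the one group that has fewer than 5 elements if 5 does not divide n exactly, and the one group containing x itself. Discounting these two groups, it follows that the number of elements greater than x is at least:

$$
3\left(\left\lceil \frac{1}{2}\left\lceil \frac{n}{5} \right\rceil \right\rceil - 2\right) \ge \frac{3n}{10} - 6 .
$$

    (Figure 9.1, analysis of the algorithm SELECT: the n elements are represented by small circles, and each group occupies a column. The medians of the groups are shown in white, and the median-of-medians x is labeled. (When finding the median of an even number of elements, we use the lower median.) Arrows are drawn from larger elements to smaller ones. From them it can be seen that, to the right of x, every full group of 5 elements has 3 elements greater than x, and to the left of x, every group of 5 elements has 3 elements less than x. The elements greater than x are shown on a shaded background.)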

    Similarly, the number of elements that are less than x is at least 3n/10 - 6. Thus, in the worst case, SELECT is called recursively on at most 7n/10 + 6 elements in step 5.

    We can now develop a recurrence for the worst-case running time T(n) of the algorithm SELECT. Steps 1, 2, and 4 take O(n) time. (Step 2 consists of O(n) calls of insertion sort on sets of size O(1).) Step 3 takes time T(⌈n/5⌉), and step 5 takes time at most T(7n/10 + 6), assuming that T is monotonically increasing. We make the assumption, which seems unmotivated at first, that any input of 140 or fewer elements requires O(1) time; the origin of the magic constant 140 will be clear shortly. We can therefore obtain the recurrence:

$$
T(n) \le
\begin{cases}
O(1) & \text{if } n \le 140, \\
T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n) & \text{if } n > 140.
\end{cases}
$$

    We show that the running time is linear by substitution. More specifically, we will show that T(n) ≤ cn for some suitably large constant c and all n > 0. We begin by assuming that T(n) ≤ cn for some suitably large constant c and all n ≤ 140; this assumption holds if c is large enough. We also pick a constant a such that the function described by the O(n) term above (which describes the non-recursive component of the running time of the algorithm) is bounded above by an for all n > 0. Substituting this inductive hypothesis into the right-hand side of the recurrence yields

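$$
\begin{aligned}
T(n) &\le c\lceil n/5 \rceil + c(7n/10 + 6) + an \\
     &\le cn/5 + c + 7cn/10 + 6c + an \\
     &= 9cn/10 + 7c + an \\
     &= cn + (-cn/10 + 7c + an),
\end{aligned}
$$

which is at most cn if

$$
-cn/10 + 7c + an \le 0. \tag{9.2}
$$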

Inequality (9.2) is equivalent to the inequality c ≥ 10a(n/(n - 70)) when n > 70. Because we assume that n ≥ 140, we have n/(n - 70) ≤ 2, and so choosing c ≥ 20a will satisfy inequality (9.2). (Note that there is nothing special about the constant 140; we could replace it by any integer strictly greater than 70 and then choose c accordingly.) The worst-case running time of SELECT is therefore linear.
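The equivalence stated for inequality (9.2) follows by isolating c. The worked steps below use only the symbols a, c, and n already defined in the text:

```latex
\begin{aligned}
-\frac{cn}{10} + 7c + an \le 0
  &\iff c\left(\frac{n}{10} - 7\right) \ge an \\
  &\iff c \cdot \frac{n - 70}{10} \ge an \\
  &\iff c \ge 10a \cdot \frac{n}{n - 70} \qquad (n > 70).
\end{aligned}
\qquad
\frac{n}{n - 70} = 1 + \frac{70}{n - 70} \le 1 + \frac{70}{70} = 2 \quad \text{for } n \ge 140 .
```

So any c ≥ 20a works once n ≥ 140, which is why the base case of the recurrence is placed at 140.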

 

    As in a comparison sort (see Section 8.1), SELECT and RANDOMIZED-SELECT determine information about the relative order of elements only by comparing elements. Recall from Chapter 8 that sorting requires Ω(n lg n) time in the comparison model, even on average (see Problem 8-1). The linear-time sorting algorithms in Chapter 8 make assumptions about the input. In contrast, the linear-time selection algorithms in this chapter do not require any assumptions about the input. They are not subject to the Ω(n lg n) lower bound because they manage to solve the selection problem without sorting.

(Author's note: this last sentence states the essence of the algorithm. The partition-based SELECT solves the selection problem without sorting, so the Ω(n lg n) lower bound does not apply to it.)

    Thus, the running time is linear because these algorithms do not sort; the linear-time behavior is not a result of assumptions about the input, as was the case for the sorting algorithms in Chapter 8. Sorting requires Ω(n lg n) time in the comparison model, even on average (see Problem 8-1), and thus the method of sorting and indexing presented in the introduction to this chapter is asymptotically inefficient.

 

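Section 3. Implementing the Quickselect (SELECT) Algorithm

  In this section we implement SELECT with median-of-medians pivot selection, following the steps illustrated in the original figure.

    Before the implementation, I must remind you of one detail stated at the start of Section 2.2: array indices start at 0, so the k-th smallest element sits at index k-1 of the array. Returning the k-th smallest element of the array therefore means returning array[i] with i = k-1, that is, array[k-1]. The original post ended with the complete code of this quickselect (SELECT) algorithm (to my knowledge, nobody had previously implemented SELECT with median-of-medians pivot selection):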
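A minimal sketch of SELECT with median-of-medians pivot selection, following the five steps quoted in Section 2.4. It is an illustration, not the author's original listing: every function name here (sort_small, partition_around, select_index, select_kth) and the int element type are assumptions. The rank is 1-based, and the worst-case linear bound relies on the textbook assumption that the elements are distinct.

```c
#include <assert.h>

void swap_int(int *x, int *y) {
    int t = *x;
    *x = *y;
    *y = t;
}

/* Insertion-sort a[lo..hi]; only ever called on at most 5 elements. */
void sort_small(int a[], int lo, int hi) {
    for (int p = lo + 1; p <= hi; p++) {
        int tmp = a[p], j = p;
        while (j > lo && a[j - 1] > tmp) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}

/* Partition a[lo..hi] around the value at index p; return its final index. */
int partition_around(int a[], int lo, int hi, int p) {
    int pivot = a[p];
    swap_int(&a[p], &a[hi]);              /* park the pivot at the end */
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (a[i] < pivot) {
            swap_int(&a[i], &a[store]);
            store++;
        }
    }
    swap_int(&a[store], &a[hi]);          /* pivot sits between low and high sides */
    return store;
}

/* Return the index of the i-th smallest (1-based) element of a[lo..hi]. */
int select_index(int a[], int lo, int hi, int i) {
    for (;;) {
        int n = hi - lo + 1;
        if (n <= 5) {
            sort_small(a, lo, hi);
            return lo + i - 1;
        }
        /* Steps 1-2: sort each group and move its lower median to the front. */
        int groups = 0;
        for (int g = lo; g <= hi; g += 5) {
            int end = (g + 4 <= hi) ? g + 4 : hi;
            sort_small(a, g, end);
            swap_int(&a[lo + groups], &a[g + (end - g) / 2]);
            groups++;
        }
        /* Step 3: recursively find the lower median of the group medians. */
        int mid = select_index(a, lo, lo + groups - 1, (groups + 1) / 2);
        /* Step 4: partition around it; k is the pivot's rank within a[lo..hi]. */
        int p = partition_around(a, lo, hi, mid);
        int k = p - lo + 1;
        /* Step 5: done, or continue on one side only. */
        if (i == k) return p;
        if (i < k) {
            hi = p - 1;
        } else {
            lo = p + 1;
            i -= k;
        }
    }
}

/* Hypothetical wrapper: the k-th smallest (1-based) of a[0..n-1]; reorders a. */
int select_kth(int a[], int n, int k) {
    assert(n >= 1 && k >= 1 && k <= n);
    return a[select_index(a, 0, n - 1, k)];
}
```

The group medians are collected at the front of the current range so that step 3 can recurse in place without extra memory. Step 5 is written as a loop rather than a second recursive call; the single remaining recursion is the one on the medians.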


All rights reserved. I hold the copyright to all content on this blog. If you must repost, please credit the source with a link.

