The Reservoir Sampling Problem

Reposted article. 2013-10-01 04:18. Category: Algorithms, randomized algorithms

Google once asked a rather interesting interview question:

I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

How can I most efficiently write a function that will return k completely random numbers from the list

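The problem is simple to state: given a linked list of N elements whose length is not known in advance, write a function that efficiently picks k random elements from it.

At first I had no idea how to approach it. After looking around, I found that this is not a new problem: Problem 10 in Column 12 of Programming Pearls describes it as follows: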

　　How could you select one of n objects at random, where you see the objects sequentially but you do not know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don’t know the number of lines in advance?

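The problem can be simplified as follows: how do you pick one line uniformly at random from a file without knowing its total number of lines?

The first thing to ask is whether we have solved a similar problem before. We have: when the number of lines is known, we can use the C runtime library's rand function to pick a random line number and fetch that line. Here, however, the line count is unknown, so how do we proceed? We need a concept that lets every line be picked with equal probability, that is, uniformly at random. That concept is reservoir sampling.

Wikipedia (http://en.wikipedia.org/wiki/Reservoir_sampling) explains it in detail:

Reservoir sampling is a family of randomized algorithms for choosing k samples from a set S of n items, where n is either very large or unknown. It is especially useful when the n items cannot all be held in main memory. The best-known example is Algorithm R, described by Jeffrey Vitter in his paper^[1].

Following the O(n) algorithm given in the Dictionary of Algorithms and Data Structures^[2], the steps are as follows (assuming the array S is indexed from 0):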

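Put the first k items of S into the "reservoir".
For each item S[j] (j ≥ k):
   Generate a random integer r in the range 0 to j (inclusive).
   If r < k, replace item r of the reservoir with S[j].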

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

C++ implementation:

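#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    int S[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    const int n = sizeof(S) / sizeof(S[0]);
    const int k = 4;
    int R[k];

    // Seed once. Reseeding inside the loop with time(NULL) would make
    // rand() return the same value on every iteration.
    srand(static_cast<unsigned>(time(NULL)));

    // Fill the reservoir with the first k items.
    for (int i = 0; i < k; i++)
        R[i] = S[i];

    // S[i] is the (i+1)-th item, so draw j from [0, i] (i+1 values).
    // S[i] then enters the reservoir with probability k/(i+1).
    for (int i = k; i < n; i++)
    {
        int j = rand() % (i + 1);   // slight modulo bias, fine for a demo
        if (j < k)
            R[j] = S[i];
    }

    for (int i = 0; i < k; i++)
        cout << R[i] << ' ';
    cout << endl;
}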

Why is it called reservoir sampling? Because the array R[k] acts like a reservoir:

The algorithm creates a "reservoir" array of size k and populates it with the first k items of S. It then iterates through the remaining elements of S until S is exhausted. At the i^th element of S, the algorithm generates a random number j between 1 and i. If j is less than or equal to k, the j^th element of the reservoir array is replaced with the i^th element of S. In effect, for all i, the i^th element of S is chosen to be included in the reservoir with probability k/i. Similarly, at each iteration the j^th element of the reservoir array is chosen to be replaced with probability 1/k * k/i, which simplifies to 1/i. It can be shown that when the algorithm has finished executing, each item in S has equal probability (i.e. k/length(S)) of being chosen for the reservoir.

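With this concept in hand, return to the original question: how do you pick one random line from a file without knowing its total number of lines? The solution is as follows. Let choice denote the selected line. Take the first line as choice. At the second line, replace choice with it with probability 1/2. At the third line, replace choice with probability 1/3, and so on. In pseudocode: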

i = 0
while more input lines
    with probability 1.0/++i
        choice = this input line
print choice

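// Simulates the pseudocode above, with line numbers 1..n standing in
// for the input lines.
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    const int n = 10;      // number of lines; used only to drive the loop
    int choice = 1;        // the first line is always taken initially
    srand(static_cast<unsigned>(time(NULL)));   // seed once, outside the loop
    for (int i = 2; i <= n; i++)
    {
        // rand() % i is 0 with probability 1/i (ignoring modulo bias),
        // so line i replaces the current choice with probability 1/i.
        if (rand() % i == 0)
            choice = i;
    }
    cout << choice << endl;
}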

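The elegance of this method is that it is constructed so that every line can be shown to be selected with probability 1/n (where n is the number of lines scanned so far). In other words, all lines are equally likely to be chosen, which is exactly uniform random selection.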

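The proof is as follows. Line i becomes the choice at step i with probability 1/i, and it remains the final choice only if no later line j replaces it, which happens with probability 1 - 1/j at each step:

$$P(\text{line } i \text{ is the final choice}) = \frac{1}{i}\prod_{j=i+1}^{n}\left(1-\frac{1}{j}\right) = \frac{1}{i}\cdot\frac{i}{i+1}\cdot\frac{i+1}{i+2}\cdots\frac{n-1}{n} = \frac{1}{n}$$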

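Looking back at this problem, we can generalize it: how do we pick k items at random from a sample space that is unknown or very large?

By analogy we get the answer. First put the first k items into the reservoir. For item k+1, swap it into the reservoir with probability k/(k+1), choosing uniformly at random which reservoir entry it replaces. In general, item i (i > k) is swapped in with probability k/i. Continuing this way, for a sample space of any size n, every item is selected with probability k/n; that is, all items are equally likely to be selected.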
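As a quick sanity check of the k/n claim, here is a small worked case with assumed values n = 4 and k = 2 (the numbers are chosen only for illustration). An item already in the reservoir is evicted at step i with probability (k/i)(1/k) = 1/i, so it survives that step with probability 1 - 1/i.

```latex
\begin{align*}
P(\text{item 1 kept}) &= \left(1-\tfrac{1}{3}\right)\left(1-\tfrac{1}{4}\right)
                       = \tfrac{2}{3}\cdot\tfrac{3}{4} = \tfrac{1}{2} \\
P(\text{item 2 kept}) &= \tfrac{2}{3}\cdot\tfrac{3}{4} = \tfrac{1}{2} \\
P(\text{item 3 kept}) &= \underbrace{\tfrac{2}{3}}_{\text{enters at step 3}}
                         \cdot\left(1-\tfrac{1}{4}\right) = \tfrac{1}{2} \\
P(\text{item 4 kept}) &= \tfrac{2}{4} = \tfrac{1}{2}
\end{align*}
```

All four probabilities equal k/n = 2/4, as expected.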

Pseudocode:

Init: a reservoir of size k holding the first k items
for i = k+1 to N
    M = random(1, i)    // inclusive range
    if (M <= k)
        replace the Mth value in the reservoir with the ith value
end for

The proof is as follows; the version on Wikipedia is easier to follow:

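Inside the loop, line n is selected with probability k/n, denoted $P_n$. If the file has N lines in total, the probability that an arbitrary line n (note that n here is an index, not the total) ends up in the sample is:

$$P_n\left(1-\frac{P_{n+1}}{k}\right)\left(1-\frac{P_{n+2}}{k}\right)\cdots\left(1-\frac{P_N}{k}\right) = \frac{k}{n}\cdot\frac{n}{n+1}\cdot\frac{n+1}{n+2}\cdots\frac{N-1}{N} = \frac{k}{N}$$

Here $P_j$ is the probability that line j is selected, namely k/j.

Why divide by k? Because we want the probability for one particular element: when line j is selected, it replaces one of the k reservoir entries chosen uniformly, so a given entry is evicted with probability $P_j/k = 1/j$.

Hence $1 - P_j/k$ is the probability that the entry is not evicted at step j. For the first k lines, $P_n = 1$ and the product starts at j = k+1, which again gives k/N.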

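Reservoir sampling is really a family of problems. I am summarizing it here, and I genuinely admire how clever the method is. That said, I have not yet dug deep enough into where the idea comes from; knowing why and how someone arrived at this solution would make it even more meaningful.

References: http://www.cnblogs.com/HappyAngel/archive/2011/02/07/1949762.html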

http://zh.wikipedia.org/wiki/%E6%B0%B4%E5%A1%98%E6%8A%BD%E6%A8%A3

See also the earlier post on shuffling algorithms: http://www.cnblogs.com/youxin/p/3348626.html

http://www.cnblogs.com/youxin/p/3353024.html
