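Why put PageRank in my Hadoop study notes? Because the first week of the Hadoop course put heavy emphasis on Google's three famous papers (GFS, MapReduce and Bigtable) and on where Hadoop's ideas came from, and it also covered PageRank and the PR algorithm under a MapReduce solution. I still have not figured out the MapReduce idea of how distributed computing is applied to the PageRank of trillions of web pages. Before getting there, I went to quite some trouble just to understand the basic PageRank algorithm. A few articles explain it very clearly. (It also convinced me even more that math is where things are heading: without solid math, including linear algebra, calculus and discrete math, many paths will simply be closed.)
To be honest, the explanation of the PageRank algorithm in the course slides is far too abstract. It never explains why the matrix has to look the way it does: what the rows mean, what the columns mean, why it has to be multiplied by a 4-row, 1-column vector, where the convergence formula (PG) comes from, and why there is so much multiplying back and forth. I listened three times in a row without getting it, and finally found the answers I wanted here...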
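The blogger uses the same link structure as the slides and works through the questions I was left with after the lecture:
>> http://blog.codinglabs.org/articles/intro-to-pagerank.html (really very clearly written)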
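In addition, there are two small web apps for the PageRank algorithm where you can drag the page links around and compute the G values: https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/ , and the algorithm is explained at http://www.nowherenearithaca.com/2013/04/explorating-googles-pagerank.html . That algorithm adds a 1/6 matrix for dead ends. I am not sure whether that is necessary, since a (1-alpha) * 1/page count matrix is added afterwards anyway. (For a dead-end page the column of the hyperlink matrix is all zeros, so without that correction the corresponding column of G would sum to only 1-alpha rather than 1.)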
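People in the study group kept asking what the alpha value of 0.85 that the Googlers came up with is actually for, and how the formula below is derived.
Since the first week's assignment is to compute PageRank in code, I also experimented in code with whether this value is really needed.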
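The values 1/3, 1/3, 1/3 you see in the hyperlink matrix are the probabilities that a user on the current page follows a link to each of the linked pages. But if some page has only outbound links and no inbound links (or is completely isolated), no rank ever flows into it, and the vector entries of some pages in this matrix easily end up as 0. (Strictly speaking, the term dead end refers to the opposite case: a page with no outbound links.)
For example, I changed the matrix from the first exercise so that no page links to A: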
/* A B C D E */
/*A*/ { 0, 0, 0, 0, 0 },
/*B*/ { 1/3f, 0, 0, 0, 0 },
/*C*/ { 1/3f, 0, 0, 1/2f, 1 },
/*D*/ { 1/3f, 1/2f, 0, 0, 0 },
/*E*/ { 0, 1/2f, 1, 1/2f, 0 }
Iterating directly on the raw hyperlink matrix gives:
Starting iteration 4...
0 0*0 0*0 0*0.5 0*0 0*0.5 <0>
1 0.3333333*0 0*0 0*0.5 0*0 0*0.5 <0>
2 0.3333333*0 0*0 0*0.5 0.5*0 1*0.5 <0.5>
3 0.3333333*0 0.5*0 0*0.5 0*0 0*0.5 <0>
4 0*0 0.5*0 1*0.5 0.5*0 0*0.5 <0.5>
As you can see, this alone also drives the PR of B and D to 0, which is not a reasonable result.
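The experiment above can be reproduced with a few lines of code. This is only a minimal sketch: the class and method names (NoTeleportDemo, ModifiedMatrix, Step, Iterate) are made up for illustration, it uses double instead of float, and it starts from the uniform vector of five 0.2s.

```csharp
using System;

static class NoTeleportDemo
{
    // The modified hyperlink matrix from the text: row A is all zeros
    // because no page links to A. Entry [row, col] is the probability
    // of moving from page col to page row.
    public static double[,] ModifiedMatrix()
    {
        return new double[,]
        {
            /*        A      B      C  D      E */
            /*A*/ { 0,     0,     0, 0,     0 },
            /*B*/ { 1/3.0, 0,     0, 0,     0 },
            /*C*/ { 1/3.0, 0,     0, 1/2.0, 1 },
            /*D*/ { 1/3.0, 1/2.0, 0, 0,     0 },
            /*E*/ { 0,     1/2.0, 1, 1/2.0, 0 }
        };
    }

    // One step of the iteration: next = H * q.
    public static double[] Step(double[,] h, double[] q)
    {
        int n = q.Length;
        var next = new double[n];
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                next[row] += h[row, col] * q[col];
        return next;
    }

    // Start from the uniform vector and apply Step the given number of times.
    public static double[] Iterate(double[,] h, int iterations)
    {
        int n = h.GetLength(0);
        var q = new double[n];
        for (int i = 0; i < n; i++) q[i] = 1.0 / n;
        for (int k = 0; k < iterations; k++) q = Step(h, q);
        return q;
    }

    static void Main()
    {
        double[] q = Iterate(ModifiedMatrix(), 4);
        Console.WriteLine(string.Join(" ", q));
    }
}
```

After a few steps the entries for A, B and D are 0 and all of the rank sits on C and E (0.5 each), the same picture as in the iteration 4 output above.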
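So the Googlers introduced the possibility that a user on some page will, with a small probability (for example 1-0.85), open some other page that has nothing to do with the current one. This is called teleporting.
So we take 0.85 * hyperlink matrix and add the remainder, 0.15/page count. (Why divide by the page count? Think of it as a matrix of random jumps to any page, i.e. a matrix whose entries are all 1/page count.) This gives the pages that nothing links to a small vector value; for example, in the G matrix of the first week's exercise the "shifted" probability of such entries comes out as 0.03.
That way the problem no longer occurs.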
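Written out, the construction looks as follows. The symbols H (hyperlink matrix), N (page count) and the all-ones vector are my own notation, and the column-sum line assumes every column of H already sums to 1, i.e. every page has at least one outbound link.

```latex
\begin{align*}
G &= \alpha H + \frac{1-\alpha}{N}\,\mathbf{1}\mathbf{1}^{T},
  \qquad \alpha = 0.85,\; N = 5 \\[4pt]
\sum_{i=1}^{N} G_{ij}
  &= \alpha \sum_{i=1}^{N} H_{ij} + N \cdot \frac{1-\alpha}{N}
   = \alpha + (1-\alpha) = 1 \\[4pt]
H_{ij} = 0 &\;\Rightarrow\; G_{ij} = \frac{0.15}{5} = 0.03 \\
H_{ij} = \tfrac{1}{3} &\;\Rightarrow\; G_{ij} = \frac{0.85}{3} + 0.03 \approx 0.3133 \\
H_{ij} = \tfrac{1}{2} &\;\Rightarrow\; G_{ij} = 0.425 + 0.03 = 0.455 \\
H_{ij} = 1 &\;\Rightarrow\; G_{ij} = 0.85 + 0.03 = 0.88
\end{align*}
```

These are exactly the four distinct entries (0.03, 0.3133333, 0.455, 0.88) that show up in the printed G matrices in this post.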
After adding teleporting:
Starting iteration 4...
0 0.03*0.02999999 0.03*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.02999999>
1 0.3133333*0.02999999 0.03*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.03849999>
2 0.3133333*0.02999999 0.03*0.0385 0.03*0.4361937 0.455*0.0548625 0.88*0.4404438 <0.4361937>
3 0.3133333*0.02999999 0.455*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.05486249>
4 0.03*0.02999999 0.455*0.0385 0.88*0.4361937 0.455*0.0548625 0.03*0.4404438 <0.4404437>
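The same check can be scripted for the teleporting case. Again this is only a sketch under my own naming (TeleportDemo, BuildGoogleMatrix and so on are hypothetical helpers), using double rather than float, so the last digits will differ slightly from the output above.

```csharp
using System;

static class TeleportDemo
{
    // Modified hyperlink matrix from the text (no page links to A).
    public static double[,] ModifiedMatrix()
    {
        return new double[,]
        {
            { 0,     0,     0, 0,     0 },
            { 1/3.0, 0,     0, 0,     0 },
            { 1/3.0, 0,     0, 1/2.0, 1 },
            { 1/3.0, 1/2.0, 0, 0,     0 },
            { 0,     1/2.0, 1, 1/2.0, 0 }
        };
    }

    // G = alpha * H + (1 - alpha) / N, applied entry by entry.
    public static double[,] BuildGoogleMatrix(double[,] h, double alpha)
    {
        int n = h.GetLength(0);
        var g = new double[n, n];
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                g[row, col] = alpha * h[row, col] + (1 - alpha) / n;
        return g;
    }

    // Repeated q = G * q, starting from the uniform vector.
    public static double[] Iterate(double[,] g, int iterations)
    {
        int n = g.GetLength(0);
        var q = new double[n];
        for (int i = 0; i < n; i++) q[i] = 1.0 / n;
        for (int k = 0; k < iterations; k++)
        {
            var next = new double[n];
            for (int row = 0; row < n; row++)
                for (int col = 0; col < n; col++)
                    next[row] += g[row, col] * q[col];
            q = next;
        }
        return q;
    }

    static void Main()
    {
        double[,] g = BuildGoogleMatrix(ModifiedMatrix(), 0.85);
        double[] q = Iterate(g, 10);
        Console.WriteLine(string.Join(" ", q));
    }
}
```

With teleporting, A settles at about 0.03, B at about 0.0385 and D at about 0.0549 instead of 0, in line with the values in angle brackets above.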
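This is my understanding after reading the articles; corrections are welcome if you understand it differently.
The exercise and my solution are attached below. I used C#, but the choice of language makes no difference.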
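1. The basic procedure: set up the initial hyperlink matrix (as probabilities), then with alpha = 0.85 apply the formula G = 0.85 * hyperlink matrix + (1-0.85)/page count * ones matrix to get the G matrix.
Note that the entries of the G matrix for each page (each column) must not sum to more than 1, otherwise the result diverges; each column should sum to exactly 1 for the iteration to settle correctly.
All later computation uses this fixed G matrix: q(n+1) = G * q(n).
2. Convergence condition for ending the iteration: the Euclidean distance between successive vectors (see "Distance and Similarity Measures").
Also, experiments show that the values in the initial vector q0 really do not matter much: starting from five 1s it took 14 iterations to reach a difference of 0.0001, and starting from five 0.2s it took 13. The only thing q0 affects is the scale of the resulting vector; the relative weights of the pages are the same.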
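The two points above (the Euclidean stopping rule and the claim that q0 only changes the scale) can be checked with a short sketch. The names here (InitialVectorDemo, GoogleMatrix, Distance, PageRank) are hypothetical, the matrix is the one from the exercise below, and the threshold 0.00001 is simply the one my program uses; because this sketch works in double, iteration counts may not match the float runs exactly.

```csharp
using System;

static class InitialVectorDemo
{
    // G for the five-page exercise: alpha * H + (1 - alpha) / N.
    public static double[,] GoogleMatrix(double alpha)
    {
        double[,] h =
        {
            { 0,     1/2.0, 1/2.0, 0, 1/2.0 },
            { 1/3.0, 0,     0,     0, 0     },
            { 1/3.0, 0,     0,     1, 1/2.0 },
            { 1/3.0, 0,     0,     0, 0     },
            { 0,     1/2.0, 1/2.0, 0, 0     }
        };
        int n = h.GetLength(0);
        var g = new double[n, n];
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                g[row, col] = alpha * h[row, col] + (1 - alpha) / n;
        return g;
    }

    // Euclidean distance between two vectors of equal length.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    // Iterate q = G * q until two successive vectors are within epsilon.
    public static double[] PageRank(double[,] g, double[] q0, double epsilon)
    {
        double[] q = q0;
        while (true)
        {
            var next = new double[q.Length];
            for (int row = 0; row < q.Length; row++)
                for (int col = 0; col < q.Length; col++)
                    next[row] += g[row, col] * q[col];
            bool converged = Distance(next, q) <= epsilon;
            q = next;
            if (converged) return q;
        }
    }

    static void Main()
    {
        double[,] g = GoogleMatrix(0.85);
        double[] fromOnes = PageRank(g, new double[] { 1, 1, 1, 1, 1 }, 0.00001);
        double[] fromFifths = PageRank(g, new double[] { 0.2, 0.2, 0.2, 0.2, 0.2 }, 0.00001);
        for (int i = 0; i < 5; i++)
            Console.WriteLine("{0}: {1:F4}  {2:F4}  ratio {3:F3}",
                (char)('A' + i), fromOnes[i], fromFifths[i], fromOnes[i] / fromFifths[i]);
    }
}
```

Because every column of G sums to 1, the sum of the vector is preserved by each step, so starting from five 1s gives a result roughly five times the one obtained from five 0.2s, with the same ordering of pages.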
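Exercise:
1 Referring to the "4-page model" on slide 9, now suppose there are five web pages A, B, C, D and E, where
1) page A links to B, C, D
2) page B links to A, E
3) page C links to A, E
4) page D links to C
5) page E links to A, C
A Write down the Google matrix for this link structure. At a glance, which page do you think has the highest importance (PR value)?
B Compute the PR values of these 5 pages by hand or by program, using any programming language you are familiar with. Feel free to share your program and results on the forum. (optional)
C Explain where the main difficulty lies in computing PR when there are many pages, and how MapReduce solves it. (optional)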
using System;

namespace ConsoleApplication1
{
    class Program
    {
        // Starts out as the hyperlink matrix; getGoogleMatrix() turns it into G in place.
        // Entry [row, column] is the probability of moving from page "column" to page "row".
        static float[,] arrSrcMatrix;
        static float alpha = 0.85f;
        static float[] curPageRankMatrix;
        static int iterationTime;

        static void Main(string[] args)
        {
            arrSrcMatrix = new float[5, 5]
            {
                /*        A     B     C     D  E    */
                /*A*/ { 0,    1/2f, 1/2f, 0, 1/2f },
                /*B*/ { 1/3f, 0,    0,    0, 0    },
                /*C*/ { 1/3f, 0,    0,    1, 1/2f },
                /*D*/ { 1/3f, 0,    0,    0, 0    },
                /*E*/ { 0,    1/2f, 1/2f, 0, 0    }
            };
            getGoogleMatrix();

            curPageRankMatrix = new float[5] { 0.2f, 0.2f, 0.2f, 0.2f, 0.2f };
            iterationTime = 0;
            double endValue = 0.00001d;
            while (true)
            {
                iterationTime++;
                var nextMatrix = doIterate(curPageRankMatrix);

                // Euclidean distance between the previous and the new rank vector
                double cnt = 0.00d;
                for (var m = 0; m < curPageRankMatrix.Length; m++)
                {
                    cnt += Math.Pow(nextMatrix[m] - curPageRankMatrix[m], 2);
                }

                // Keep the newest vector so that it is also the final result after the loop
                curPageRankMatrix = nextMatrix;
                if (Math.Sqrt(cnt) <= endValue)
                {
                    break;
                }
            }
        }

        /// <summary>
        /// G = 0.85 * hyperlink matrix + 0.15/page count * ones matrix
        /// </summary>
        static void getGoogleMatrix()
        {
            int pageCount = arrSrcMatrix.GetLength(0);
            for (var m = 0; m < pageCount; m++)
            {
                Console.Write(string.Format("{0}\t", m));
                for (var n = 0; n < arrSrcMatrix.GetLength(1); n++)
                {
                    arrSrcMatrix[m, n] = arrSrcMatrix[m, n] * alpha + (1 - alpha) / pageCount;
                    Console.Write(string.Format("{0}\t", arrSrcMatrix[m, n]));
                }
                Console.WriteLine();
            }
        }

        /// <summary>
        /// One iteration: multiplies G by the current page rank vector.
        /// </summary>
        /// <param name="curPageRankMatrix">current page rank vector, one entry per page</param>
        static float[] doIterate(float[] curPageRankMatrix)
        {
            float[] tgt = new float[curPageRankMatrix.Length];
            Console.WriteLine("Starting iteration " + iterationTime + "...");
            for (var m = 0; m < arrSrcMatrix.GetLength(0); m++)
            {
                if (m >= tgt.Length) break;
                float cur = 0.0f;
                Console.Write(string.Format("{0}\t", m));
                for (var n = 0; n < arrSrcMatrix.GetLength(1); n++)
                {
                    // Row m of G times the current vector
                    cur += arrSrcMatrix[m, n] * curPageRankMatrix[n];
                    Console.Write(string.Format("{0}*{1} ", arrSrcMatrix[m, n], curPageRankMatrix[n]));
                }
                tgt[m] = cur;
                Console.Write(string.Format("<{0}>", tgt[m]));
                Console.WriteLine();
            }
            return tgt;
        }
    }
}
Output:
c:\Users\shixun\Desktop>ConsoleApplication1.exe
0 0.03 0.455 0.455 0.03 0.455
1 0.3133333 0.03 0.03 0.03 0.03
2 0.3133333 0.03 0.03 0.88 0.455
3 0.3133333 0.03 0.03 0.03 0.03
4 0.03 0.455 0.455 0.03 0.03
Starting iteration 1...
0 0.03*0.2 0.455*0.2 0.455*0.2 0.03*0.2 0.455*0.2 <0.285>
1 0.3133333*0.2 0.03*0.2 0.03*0.2 0.03*0.2 0.03*0.2 <0.08666666>
2 0.3133333*0.2 0.03*0.2 0.03*0.2 0.88*0.2 0.455*0.2 <0.3416667>
3 0.3133333*0.2 0.03*0.2 0.03*0.2 0.03*0.2 0.03*0.2 <0.08666666>
4 0.03*0.2 0.455*0.2 0.455*0.2 0.03*0.2 0.03*0.2 <0.2>
Starting iteration 2...
0 0.03*0.285 0.455*0.08666666 0.455*0.3416667 0.03*0.08666666 0.455*0.2 <0.2970417>
1 0.3133333*0.285 0.03*0.08666666 0.03*0.3416667 0.03*0.08666666 0.03*0.2 <0.11075>
2 0.3133333*0.285 0.03*0.08666666 0.03*0.3416667 0.88*0.08666666 0.455*0.2 <0.2694167>
3 0.3133333*0.285 0.03*0.08666666 0.03*0.3416667 0.03*0.08666666 0.03*0.2 <0.11075>
4 0.03*0.285 0.455*0.08666666 0.455*0.3416667 0.03*0.08666666 0.03*0.2 <0.2120417>
Starting iteration 3...
0 0.03*0.2970417 0.455*0.11075 0.455*0.2694167 0.03*0.11075 0.455*0.2120417 <0.2816885>
1 0.3133333*0.2970417 0.03*0.11075 0.03*0.2694167 0.03*0.11075 0.03*0.2120417 <0.1141618>
2 0.3133333*0.2970417 0.03*0.11075 0.03*0.2694167 0.88*0.11075 0.455*0.2120417 <0.298417>
3 0.3133333*0.2970417 0.03*0.11075 0.03*0.2694167 0.03*0.11075 0.03*0.2120417 <0.1141618>
4 0.03*0.2970417 0.455*0.11075 0.455*0.2694167 0.03*0.11075 0.03*0.2120417 <0.1915708>
Starting iteration 4...
0 0.03*0.2816885 0.455*0.1141618 0.455*0.298417 0.03*0.1141618 0.455*0.1915708 <0.2867636>
1 0.3133333*0.2816885 0.03*0.1141618 0.03*0.298417 0.03*0.1141618 0.03*0.1915708 <0.1098117>
2 0.3133333*0.2816885 0.03*0.1141618 0.03*0.298417 0.88*0.1141618 0.455*0.1915708 <0.2882669>
3 0.3133333*0.2816885 0.03*0.1141618 0.03*0.298417 0.03*0.1141618 0.03*0.1915708 <0.1098117>
4 0.03*0.2816885 0.455*0.1141618 0.455*0.298417 0.03*0.1141618 0.03*0.1915708 <0.205346>
Starting iteration 5...
0 0.03*0.2867636 0.455*0.1098117 0.455*0.2882669 0.03*0.1098117 0.455*0.205346 <0.2864555>
1 0.3133333*0.2867636 0.03*0.1098117 0.03*0.2882669 0.03*0.1098117 0.03*0.205346 <0.1112497>
2 0.3133333*0.2867636 0.03*0.1098117 0.03*0.2882669 0.88*0.1098117 0.455*0.205346 <0.2918617>
3 0.3133333*0.2867636 0.03*0.1098117 0.03*0.2882669 0.03*0.1098117 0.03*0.205346 <0.1112497>
4 0.03*0.2867636 0.455*0.1098117 0.455*0.2882669 0.03*0.1098117 0.03*0.205346 <0.1991834>
Starting iteration 6...
0 0.03*0.2864555 0.455*0.1112497 0.455*0.2918617 0.03*0.1112497 0.455*0.1991834 <0.2859753>
1 0.3133333*0.2864555 0.03*0.1112497 0.03*0.2918617 0.03*0.1112497 0.03*0.1991834 <0.1111624>
2 0.3133333*0.2864555 0.03*0.1112497 0.03*0.2918617 0.88*0.1112497 0.455*0.1991834 <0.2903775>
3 0.3133333*0.2864555 0.03*0.1112497 0.03*0.2918617 0.03*0.1112497 0.03*0.1991834 <0.1111624>
4 0.03*0.2864555 0.455*0.1112497 0.455*0.2918617 0.03*0.1112497 0.03*0.1991834 <0.2013223>
Starting iteration 7...
0 0.03*0.2859753 0.455*0.1111624 0.455*0.2903775 0.03*0.1111624 0.455*0.2013223 <0.2862164>
1 0.3133333*0.2859753 0.03*0.1111624 0.03*0.2903775 0.03*0.1111624 0.03*0.2013223 <0.1110263>
2 0.3133333*0.2859753 0.03*0.1111624 0.03*0.2903775 0.88*0.1111624 0.455*0.2013223 <0.2910763>
3 0.3133333*0.2859753 0.03*0.1111624 0.03*0.2903775 0.03*0.1111624 0.03*0.2013223 <0.1110263>
4 0.03*0.2859753 0.455*0.1111624 0.455*0.2903775 0.03*0.1111624 0.03*0.2013223 <0.2006544>
Starting iteration 8...
0 0.03*0.2862164 0.455*0.1110263 0.455*0.2910763 0.03*0.1110263 0.455*0.2006544 <0.2861718>
1 0.3133333*0.2862164 0.03*0.1110263 0.03*0.2910763 0.03*0.1110263 0.03*0.2006544 <0.1110946>
2 0.3133333*0.2862164 0.03*0.1110263 0.03*0.2910763 0.88*0.1110263 0.455*0.2006544 <0.2907452>
3 0.3133333*0.2862164 0.03*0.1110263 0.03*0.2910763 0.03*0.1110263 0.03*0.2006544 <0.1110946>
4 0.03*0.2862164 0.455*0.1110263 0.455*0.2910763 0.03*0.1110263 0.03*0.2006544 <0.2008936>
Starting iteration 9...
0 0.03*0.2861718 0.455*0.1110946 0.455*0.2907452 0.03*0.1110946 0.455*0.2008936 <0.2861617>
1 0.3133333*0.2861718 0.03*0.1110946 0.03*0.2907452 0.03*0.1110946 0.03*0.2008936 <0.111082>
2 0.3133333*0.2861718 0.03*0.1110946 0.03*0.2907452 0.88*0.1110946 0.455*0.2008936 <0.2908922>
3 0.3133333*0.2861718 0.03*0.1110946 0.03*0.2907452 0.03*0.1110946 0.03*0.2008936 <0.111082>
4 0.03*0.2861718 0.455*0.1110946 0.455*0.2907452 0.03*0.1110946 0.03*0.2008936 <0.2007819>
Starting iteration 10...
0 0.03*0.2861617 0.455*0.111082 0.455*0.2908922 0.03*0.111082 0.455*0.2007819 <0.2861714>
1 0.3133333*0.2861617 0.03*0.111082 0.03*0.2908922 0.03*0.111082 0.03*0.2007819 <0.1110791>
2 0.3133333*0.2861617 0.03*0.111082 0.03*0.2908922 0.88*0.111082 0.455*0.2007819 <0.2908311>
3 0.3133333*0.2861617 0.03*0.111082 0.03*0.2908922 0.03*0.111082 0.03*0.2007819 <0.1110791>
4 0.03*0.2861617 0.455*0.111082 0.455*0.2908922 0.03*0.111082 0.03*0.2007819 <0.200839>
Starting iteration 11...
0 0.03*0.2861714 0.455*0.1110791 0.455*0.2908311 0.03*0.1110791 0.455*0.200839 <0.2861685>
1 0.3133333*0.2861714 0.03*0.1110791 0.03*0.2908311 0.03*0.1110791 0.03*0.200839 <0.1110819>
2 0.3133333*0.2861714 0.03*0.1110791 0.03*0.2908311 0.88*0.1110791 0.455*0.200839 <0.2908558>
3 0.3133333*0.2861714 0.03*0.1110791 0.03*0.2908311 0.03*0.1110791 0.03*0.200839 <0.1110819>
4 0.03*0.2861714 0.455*0.1110791 0.455*0.2908311 0.03*0.1110791 0.03*0.200839 <0.2008119>
Starting iteration 12...
0 0.03*0.2861685 0.455*0.1110819 0.455*0.2908558 0.03*0.1110819 0.455*0.2008119 <0.2861685>
1 0.3133333*0.2861685 0.03*0.1110819 0.03*0.2908558 0.03*0.1110819 0.03*0.2008119 <0.1110811>
2 0.3133333*0.2861685 0.03*0.1110819 0.03*0.2908558 0.88*0.1110819 0.455*0.2008119 <0.2908457>
3 0.3133333*0.2861685 0.03*0.1110819 0.03*0.2908558 0.03*0.1110819 0.03*0.2008119 <0.1110811>
4 0.03*0.2861685 0.455*0.1110819 0.455*0.2908558 0.03*0.1110819 0.03*0.2008119 <0.2008235>
Starting iteration 13...
0 0.03*0.2861685 0.455*0.1110811 0.455*0.2908457 0.03*0.1110811 0.455*0.2008235 <0.2861689>
1 0.3133333*0.2861685 0.03*0.1110811 0.03*0.2908457 0.03*0.1110811 0.03*0.2008235 <0.1110811>
2 0.3133333*0.2861685 0.03*0.1110811 0.03*0.2908457 0.88*0.1110811 0.455*0.2008235 <0.29085>
3 0.3133333*0.2861685 0.03*0.1110811 0.03*0.2908457 0.03*0.1110811 0.03*0.2008235 <0.1110811>
4 0.03*0.2861685 0.455*0.1110811 0.455*0.2908457 0.03*0.1110811 0.03*0.2008235 <0.2008189>
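For part C of the exercise, one common way to think about it (as far as I understand) is that a full page-by-page matrix for a huge number of pages cannot be held in memory, whereas the computation of one iteration can be split per page: each page sends rank/out-degree to the pages it links to (map), and each page sums what it receives and applies alpha (reduce). The sketch below only mimics those two phases in a single process with LINQ. It is not Hadoop code, every name in it (MapReduceSketch, Map, Reduce, Iterate) is hypothetical, and it assumes every page has at least one outbound link.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MapReduceSketch
{
    // Map phase: a page splits its current rank evenly among its out-links
    // and emits one (target page, contribution) pair per link.
    public static IEnumerable<KeyValuePair<string, double>> Map(
        string page, string[] outLinks, double rank)
    {
        foreach (string target in outLinks)
            yield return new KeyValuePair<string, double>(target, rank / outLinks.Length);
    }

    // Reduce phase: sum everything sent to one page and apply teleporting.
    public static double Reduce(IEnumerable<double> contributions, double alpha, int pageCount)
    {
        return alpha * contributions.Sum() + (1 - alpha) / pageCount;
    }

    // One full iteration: map every page, group by target, reduce each group.
    public static Dictionary<string, double> Iterate(
        Dictionary<string, string[]> links, Dictionary<string, double> ranks, double alpha)
    {
        int n = links.Count;
        var emitted = links.SelectMany(kv => Map(kv.Key, kv.Value, ranks[kv.Key]));
        var grouped = emitted.GroupBy(kv => kv.Key, kv => kv.Value);

        // A page with no inbound links receives nothing, so start every
        // page at the teleport share and overwrite the ones that got input.
        var next = links.Keys.ToDictionary(p => p, p => (1 - alpha) / n);
        foreach (var group in grouped)
            next[group.Key] = Reduce(group, alpha, n);
        return next;
    }

    // Link structure of the five-page exercise.
    public static Dictionary<string, string[]> ExerciseLinks()
    {
        return new Dictionary<string, string[]>
        {
            { "A", new[] { "B", "C", "D" } },
            { "B", new[] { "A", "E" } },
            { "C", new[] { "A", "E" } },
            { "D", new[] { "C" } },
            { "E", new[] { "A", "C" } }
        };
    }

    static void Main()
    {
        var links = ExerciseLinks();
        var ranks = links.Keys.ToDictionary(p => p, p => 0.2);
        ranks = Iterate(links, ranks, 0.85);
        foreach (var kv in ranks.OrderBy(kv => kv.Key))
            Console.WriteLine("{0}: {1:F7}", kv.Key, kv.Value);
    }
}
```

Starting from 0.2 for every page, one pass of this sketch gives the same values as iteration 1 in the output above (A 0.285, B and D about 0.0867, C about 0.3417, E 0.2), which suggests the per-page formulation and the matrix formulation compute the same thing.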