The Basic Idea of the WAND (Weak AND) Algorithm

Reposted article, dated 2013-08-29 18:06. Category: technical discussion.

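Search queries are usually short. When the query is long, for example a whole passage of text for which similar texts must be found, the WAND algorithm is the usual choice. It has fairly mature applications in advertising systems, mainly in AdSense-style scenarios where ads similar to the content of a page have to be retrieved.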

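In brief: text relevance is normally computed by querying an inverted index. That already saves a great deal of time compared with scanning every document, but it can still be slow.
The reason is that we usually want only the top n results, yet the expensive relevance computation is also carried out for documents that are clearly poor matches. The weak AND algorithm estimates an upper bound on each document's relevance from the upper bound of each term's contribution, and uses a threshold on that estimate to prune entries of the inverted lists, which is where the speed-up comes from.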

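WAND first needs an upper bound on each term's contribution to relevance. The simplest relevance measure is TF*IDF. The TF of a term within the query is generally 1 and its IDF is fixed, so the task reduces to bounding the term's frequency (TF) in a document. TF is usually normalized, that is, divided by the total number of words in the document, so what has to be estimated is the largest share of a document that the term can occupy. This can be computed offline.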
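As a rough illustration of this offline step, the sketch below computes, for each term, the largest share of any document that the term occupies and multiplies it by an IDF. The function name `term_upper_bounds`, the toy corpus and the IDF formula `log(1 + N/df)` are assumptions made for this example; the article does not prescribe them.

```python
import math
from collections import Counter

def term_upper_bounds(docs):
    """docs maps doc_id -> list of tokens; returns term -> max normalized TF * IDF."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    max_share = {}         # largest normalized TF seen for each term
    for tokens in docs.values():
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            doc_freq[term] += 1
            share = count / len(tokens)
            if share > max_share.get(term, 0.0):
                max_share[term] = share
    # Assumed IDF variant: log(1 + N / df). Any fixed IDF works the same way.
    return {t: max_share[t] * math.log(1.0 + n_docs / doc_freq[t])
            for t in max_share}

# Hypothetical three-document corpus
toy_docs = {1: ["a", "b", "a"], 2: ["b", "c"], 3: ["a", "c", "c", "c"]}
bounds = term_upper_bounds(toy_docs)
print(bounds)
```

Because the bound depends only on the corpus, it can be recomputed whenever the index is rebuilt and stored next to each term's inverted list.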

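Once each term's upper bound is known, the upper bound on the relevance between a query and a document follows directly: it is the sum of the upper bounds of the terms they share.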

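So for a query, obtain the contribution upper bounds of all its terms; for a document, take the terms that occur in both the document and the query and sum their upper bounds; then compare the sum with a preset threshold. If the sum reaches the threshold, the document goes on to the next stage of computation; otherwise it is discarded.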
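The filter just described can be sketched in a few lines. The upper-bound values are the ones used in the worked example later in this article; the function names `score_upper_bound` and `passes_filter` are hypothetical, and the comparison uses >= to match the pivot rule of the paper's pseudocode.

```python
# Upper bounds per term, taken from the worked example later in the article
UB = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}

def score_upper_bound(query_terms, doc_terms, ub):
    """Sum the upper bounds of the terms shared by query and document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(ub[t] for t in shared)

def passes_filter(query_terms, doc_terms, ub, threshold):
    """True if the document could still reach the threshold."""
    return score_upper_bound(query_terms, doc_terms, ub) >= threshold

query = ["t0", "t1", "t2", "t3", "t4"]
doc5_terms = ["t3", "t4"]          # doc 5 contains t3 and t4 in the example index
print(score_upper_bound(query, doc5_terms, UB))   # 7
print(passes_filter(query, doc5_terms, UB, 4))    # True
print(passes_filter(query, doc5_terms, UB, 8))    # False
```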

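Finding the n most similar documents this way means fetching every document, pre-computing its bound, comparing it with the threshold, and then deciding whether it belongs in the top n. This works, but it can be optimized. The goal of the optimization is to do as little of this pre-computation as possible. The algorithm given in the WAND paper is as follows.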

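The basic idea is to implement WAND on top of the inverted index.

Initialization comes first:

Extract all terms in da (the query text) together with their inverted lists;
Initialize curDoc = 0;
Initialize the posting array so that posting[t] is the first document in term t's inverted list.

Next, find the next document whose score needs to be fully evaluated. The procedure is as follows.

Define a function next that finds the next document to evaluate in full. The paper describes next as follows: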

Function next(θ)
  repeat
    /* Sort the terms in non decreasing order of DID */
    sort(terms, posting)
    /* Find pivot term - the first one with accumulated UB ≥ θ */
    pTerm ← findPivotTerm(terms, θ)
    if (pTerm = null) return (NoMoreDocs)
    pivot ← posting[pTerm].DID
    if (pivot = lastID) return (NoMoreDocs)
    if (pivot ≤ curDoc)
      /* pivot has already been considered, advance one of the preceding terms */
      aterm ← pickTerm(terms[0..pTerm])
      posting[aterm] ← aterm.iterator.next(curDoc+1)
    else /* pivot > curDoc */
      if (posting[0].DID = pivot) /* note: this is the doc ID at the first term position after sorting */
        /* Success, all terms preceding pTerm belong to the pivot */
        curDoc ← pivot
        return (curDoc, posting)
      else
        /* not enough mass yet on pivot, advance one of the preceding terms */
        aterm ← pickTerm(terms[0..pTerm])
        posting[aterm] ← aterm.iterator.next(pivot)
  end repeat

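The functions it uses are explained below.

aterm.iterator.next(n)

This function returns a DID from aterm's inverted list, namely the first one satisfying DID >= n. DID is simply the doc ID.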
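A minimal sketch of such an iterator step, assuming the inverted list is a sorted Python list of doc IDs. The helper name `next_did` and the `LAST_ID` sentinel value are assumptions for this example (the sentinel mirrors the large number used in the appendix code), and binary search via `bisect_left` is one possible way to do the skip.

```python
from bisect import bisect_left

LAST_ID = 2000000000  # assumed sentinel meaning "list exhausted"

def next_did(posting_list, n, start=0):
    """Return (doc_id, position) of the first entry >= n, searching from start."""
    i = bisect_left(posting_list, n, start)
    if i == len(posting_list):
        return LAST_ID, i
    return posting_list[i], i

t3 = [1, 4, 5, 23, 70, 200]
print(next_did(t3, 6))      # (23, 3)
print(next_did(t3, 4))      # (4, 1)
print(next_did(t3, 201))    # (2000000000, 6)
```

Passing the current position as `start` keeps the search from revisiting entries that have already been passed.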

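sort(terms, posting)

sort orders terms in non-decreasing order of the DID that posting currently points to. For example, suppose the inverted lists of all terms in da are: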

t0: [1, 3, 26]
t1: [1, 2, 4, 10, 100]
t2: [2, 3, 6, 34, 56]
t3: [1, 4, 5, 23, 70, 200]
t4: [5, 14, 78]

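and the current posting array (0-based positions into each list) is: [2, 2, 0, 3, 0]

From these two pieces of information the current DIDs are: {t0 : 26, t1 : 4, t2 : 2, t3 : 23, t4 : 5}

so the sorted result is [t2, t1, t4, t3, t0]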
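The example can be reproduced directly. The variable names below are chosen for this sketch only; the data are the inverted lists and posting positions given above.

```python
index = {
    "t0": [1, 3, 26],
    "t1": [1, 2, 4, 10, 100],
    "t2": [2, 3, 6, 34, 56],
    "t3": [1, 4, 5, 23, 70, 200],
    "t4": [5, 14, 78],
}
# 0-based position of each term's posting pointer
posting = {"t0": 2, "t1": 2, "t2": 0, "t3": 3, "t4": 0}

# Doc ID each pointer currently refers to
current = {t: index[t][posting[t]] for t in index}
# Terms in non-decreasing order of their current doc ID
order = sorted(index, key=lambda t: current[t])

print(current)  # {'t0': 26, 't1': 4, 't2': 2, 't3': 23, 't4': 5}
print(order)    # ['t2', 't1', 't4', 't3', 't0']
```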

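findPivotTerm(terms, θ)

Following the order just obtained, return the first term at which the accumulated upper bound is ≥ θ.

Introduce the following data: [UB0, UB1, UB2, UB3, UB4] = [0.5, 1, 2, 3, 4], θ = 8, where UBi is the largest possible contribution of term ti.

Because (2 + 1 + 4) = 7 < 8 while (2 + 1 + 4 + 3) = 10 ≥ 8, the function returns t3.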
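A small sketch of this accumulation, using the sorted order and upper bounds from the example. The function name `find_pivot_term` and its return shape (term plus index) are choices made for this illustration.

```python
UB = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}

def find_pivot_term(sorted_terms, ub, theta):
    """Return (term, index) of the first term whose running UB sum reaches theta."""
    acc = 0
    for i, term in enumerate(sorted_terms):
        acc += ub[term]
        if acc >= theta:
            return term, i
    return None, len(sorted_terms)  # even the sum of all bounds stays below theta

order = ["t2", "t1", "t4", "t3", "t0"]
print(find_pivot_term(order, UB, 8))    # ('t3', 3)
print(find_pivot_term(order, UB, 11))   # (None, 5), since all bounds sum to 10.5
```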

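pickTerm(terms[0..pTerm])

Select one term from positions 0 up to pTerm (pTerm excluded).

With the same data, this means choosing one term from [t2, t1, t4] (t3 is not included).

The selection strategy should of course aim to skip as many documents as possible; the paper picks the term with the largest idf.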
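The max-idf strategy might look like the sketch below. The idf values are invented for the illustration (the article gives none), and the fallback to the first term when nothing precedes the pivot term is an assumption added so that the function always returns a term.

```python
def pick_term(sorted_terms, pivot_index, idf):
    """Pick the term with the largest idf among those preceding the pivot term."""
    # Assumption: if no term precedes the pivot term, fall back to the first term.
    candidates = sorted_terms[:pivot_index] or sorted_terms[:1]
    return max(candidates, key=lambda t: idf[t])

order = ["t2", "t1", "t4", "t3", "t0"]
# Hypothetical idf values, for illustration only
idf = {"t0": 0.3, "t1": 0.4, "t2": 1.2, "t3": 1.9, "t4": 2.5}

print(pick_term(order, 3, idf))   # t4, chosen among t2, t1, t4
```

A term with a high idf occurs in few documents, so its list is short and advancing it tends to jump over a large range of doc IDs.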

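The process above can be illustrated by a figure.

[Figure: the terms sorted by the doc ID that their posting pointer currently points to.]

For doc 2, the maximum possible score is 2 < 8 // 2: max of t2

For doc 4, the maximum possible score is 2+1=3 < 8 // 1: max of t1

For doc 5, the maximum possible score is 2+1+4=7 < 8 // 4: max of t4

For doc 23, the maximum possible score is 2+1+4+3=10 > 8 // 3: max of t3

Therefore doc 23 is the pivot we are looking for, and t3 is the pivot term.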

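The figure also explains why a pivot term is sought: doc 2, doc 4 and doc 5 cannot reach the threshold, so they can be ignored, and the posting lists of t2, t1 and t4 can skip straight to doc 23 (that is, to the first position with doc ID ≥ 23). Which list to advance first can be decided by the terms' idf, or alternatively by the distance between each term's current doc ID and the pivot's doc ID, picking the one that jumps the furthest. Here doc 23 is called the pivot; it can serve as a candidate document whose full score is then computed (evaluate), while doc 2, doc 4 and doc 5 are skipped.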
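To make the "jumps the furthest" alternative concrete, the sketch below counts how many postings each preceding term would skip when advanced to the pivot doc 23. Counting skipped postings is one possible reading of "distance", chosen for this example; the names `position` and `skipped_postings` are likewise hypothetical.

```python
from bisect import bisect_left

# Inverted lists and current pointer positions of the terms preceding the pivot term
index = {
    "t2": [2, 3, 6, 34, 56],
    "t1": [1, 2, 4, 10, 100],
    "t4": [5, 14, 78],
}
position = {"t2": 0, "t1": 2, "t4": 0}
pivot = 23

def skipped_postings(term):
    """Number of postings passed over when the term is advanced to DID >= pivot."""
    new_pos = bisect_left(index[term], pivot, position[term])
    return new_pos - position[term]

skipped = {t: skipped_postings(t) for t in index}
best = max(skipped, key=skipped.get)

print(skipped)  # {'t2': 3, 't1': 2, 't4': 2}
print(best)     # t2
```

In this state t2 lands on doc 34, t1 on doc 100 and t4 on doc 78, so advancing t2 passes over the most entries.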

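With these functions explained, the pseudocode above should be easy to follow. Understanding why this procedure never misses a highly similar document takes some thought of your own. The paper gives an explanation, but it is rather long-winded, so no proof is given here.

One last point: sort does not need to perform a full sort. Each iteration of the loop changes only one entry of posting, so it is enough to move that one entry to its correct position.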
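One way to sketch this partial re-sort is to keep the terms as a sorted list of (doc ID, term) pairs and re-insert only the entry that moved. The helper name `reposition` is hypothetical; `bisect.insort` finds the insertion point by binary search.

```python
from bisect import insort

def reposition(sorted_terms, i, new_did):
    """Move the entry at index i to its correct place after its doc ID changed."""
    _, term = sorted_terms.pop(i)
    insort(sorted_terms, (new_did, term))

# (current doc ID, term), already sorted
terms = [(2, "t2"), (4, "t1"), (5, "t4"), (23, "t3"), (26, "t0")]
reposition(terms, 0, 34)   # t2 was advanced from doc 2 to doc 34
print(terms)
# [(4, 't1'), (5, 't4'), (23, 't3'), (26, 't0'), (34, 't2')]
```

With only a handful of query terms the gain is small, but for long queries it avoids re-comparing entries that did not move.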

Appendix: Python code

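#!/usr/bin/env python
#WAND demo. UB holds the upper bound of every term's contribution.
#The threshold starts at -1 and becomes 4 after the first candidate,
#because the placeholder full evaluation below always returns 4.

import heapq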

UB = {"t0":0.5,"t1":1,"t2":2,"t3":3,"t4":4} #upper bound of term's value
MAX_RESULT_NUM = 3 #max result number 

class WAND:
    #initial index
    def __init__(self, InvertIndex, last_docid):
        self.result_list = [] #result list
        self.invert_index = InvertIndex #InvertIndex: term -> docid1, docid2, docid3 ...
        self.current_doc = 0
        self.current_invert_index = {} #posting
        self.query_terms = []
        self.threshold = -1
        self.sort_terms = []
        self.LastID = 2000000000 #big num
        self.last_docid = last_docid
    
    #get index list according to query term
    def __InitQuery(self, query_terms):
        self.current_doc = -1
        self.current_invert_index.clear()
        self.query_terms = query_terms
        self.sort_terms[:] = []
        
        for term in query_terms:
            #initial start pos from the first position of term's invert_index
            self.current_invert_index[term] = [ self.invert_index[term][0], 0 ] #[ docid, index ]
    
    #sort term according its current posting doc id
    def __SortTerms(self):
        if len(self.sort_terms) == 0:
            for term in self.query_terms:
                if term in self.current_invert_index:
                    doc_id = self.current_invert_index[term][0]
                    self.sort_terms.append([ int(doc_id), term ])
        self.sort_terms.sort()

    #select the first term in sorted term list
    def __PickTerm(self, pivot_index):
        return 0

    #find pivot term
    def __FindPivotTerm(self):
        score = 0
        #print "sort term ", self.sort_terms  #[docid, term]
        for i in range(0, len(self.sort_terms)):
            score = score + UB[self.sort_terms[i][1]]
            if score >= self.threshold:
                return [ self.sort_terms[i][1], i] #[term, index]

        return [ None, len(self.sort_terms)]

    #move to the first doc id >= docid, searching from position pos
    def __IteratorInvertIndex(self, change_term, docid, pos):
        doc_list = self.invert_index[change_term]
        for i in range(pos, len(doc_list)):
            if doc_list[i] >= docid:
                return [ doc_list[i], i ]

        #list exhausted (only possible if the LastID sentinel is missing)
        return [ self.LastID, len(doc_list) ]

    
    def __AdvanceTerm(self, change_index, docid ):
        change_term = self.sort_terms[change_index][1]
        pos = self.current_invert_index[change_term][1]
        (new_doc, new_pos) = self.__IteratorInvertIndex(change_term, docid, pos)
        
        self.current_invert_index[change_term] = [ new_doc , new_pos ]
        self.sort_terms[change_index][0] = new_doc

    def __Next(self):
        if self.last_docid == self.current_doc:
            return None
        
        while True:
            #sort terms by doc id
            self.__SortTerms()
            
            #find pivot term: first term whose accumulated UB >= threshold
            (pivot_term, pivot_index) = self.__FindPivotTerm()
            if pivot_term is None:
                #no more candidate
                return None
            
            pivot_doc_id = self.current_invert_index[pivot_term][0]
            
            if pivot_doc_id == self.LastID: #!!
                return None
            
            if pivot_doc_id <= self.current_doc:
                change_index = self.__PickTerm(pivot_index) #always returns 0
                self.__AdvanceTerm( change_index, self.current_doc + 1 )
            else:
                first_docid = self.sort_terms[0][0]
                if pivot_doc_id == first_docid:
                    self.current_doc = pivot_doc_id
                    return self.current_doc
                else:
                    #pick all preceding term,advance to pivot
                    for i in range(0, pivot_index):
                        change_index = i
                        self.__AdvanceTerm( change_index, pivot_doc_id )

    def __InsertHeap(self,doc_id,score):
        if len(self.result_list) < MAX_RESULT_NUM:
            heapq.heappush(self.result_list, (score, doc_id))
        else:
            if score>self.result_list[0][0]: #larger than the smallest item in heap
                heapq.heappop(self.result_list)
                heapq.heappush(self.result_list, (score, doc_id))
        #note: the heap minimum is returned (and used as threshold) even before the heap is full
        return self.result_list[0][0]

    #fully evaluate the document and return its score; placeholder that always returns 4
    def __FullEvaluate(self, docid):
        return 4

    def DoQuery(self, query_terms):
        self.__InitQuery(query_terms)
        while True:
            candidate_docid = self.__Next()
            if candidate_docid is None:
                break
            print("candidate_docid:" + str(candidate_docid))
            #fully evaluate the candidate and insert it into the heap
            full_doc_score = self.__FullEvaluate(candidate_docid)
            mini_item_value = self.__InsertHeap(candidate_docid, full_doc_score)
            #update threshold
            self.threshold = mini_item_value
            print("result list " + str(self.result_list))
        return self.result_list

if __name__ == "__main__":
    testIndex = {}
    testIndex["t0"] = [ 1, 3, 26, 2000000000]
    testIndex["t1"] = [ 1, 2, 4, 10, 100, 2000000000 ]
    testIndex["t2"] = [ 2, 3, 6, 34, 56, 2000000000 ]
    testIndex["t3"] = [ 1, 4, 5, 23, 70, 200, 2000000000 ]
    testIndex["t4"] = [ 5, 14, 78, 2000000000 ]
    
    last_doc_id = 100
    w = WAND(testIndex, last_doc_id)
    final_result = w.DoQuery(["t0", "t1", "t2", "t3", "t4"])
    print("final result ")
    for item in final_result:
        print("doc " + str(item[1]))

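Appendix: the execution, step by step. With the earlier setting θ = 8 there would be no results at all (the largest attainable bound is 3 + 4 = 7, for doc 5), so the threshold is set to 4 here.

Initial state:

t0: [1, 3, 26] max:0.5
t1: [1, 2, 4, 10, 100] max:1
t2: [2, 3, 6, 34, 56] max:2
t3: [1, 4, 5, 23, 70, 200] max:3
t4: [5, 14, 78] max:4

posting=[1,1,2,1,5], that is, {t0:1,t1:1,t2:2,t3:1,t4:5} (listed here as current doc IDs rather than list positions)

cur_doc = 0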

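Step 1:

Sort the terms in ascending order of their current posting doc ID, giving [t0,t1,t3,t2,t4].
Find the pivot: 0.5+1+3=4.5>=4, so the accumulated bound first reaches 4 at t3. t3 is therefore the pivot term, and the pivot is its current doc ID, 1.
The pivot is 1 and cur_doc is 0, so pivot > cur_doc. Next compare posting[0].DID, the current doc ID of t0 (which is 1), with pivot = 1. They are equal, so doc 1 is returned and cur_doc becomes 1.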

The posting pointers are unchanged:

posting=[1,1,2,1,5], that is, {t0:1,t1:1,t2:2,t3:1,t4:5}

cur_doc = 1

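Step 2:

Sort the terms in ascending order of their current posting doc ID, giving [t0,t1,t3,t2,t4] again.
Find the pivot: 0.5+1+3=4.5>=4, so t3 is again the pivot term and the pivot is its current doc ID, 1.
The pivot is 1 and cur_doc is 1, so pivot <= cur_doc. Pick one of the terms preceding the pivot term and advance it. Here the first one, t0, is picked, and its posting pointer is moved to the first doc ID >= cur_doc + 1, which is 3.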

The state is now:

posting=[3,1,2,1,5], that is, {t0:3,t1:1,t2:2,t3:1,t4:5}

cur_doc = 1

The remaining steps follow the same pattern.

Step 3: sorted order [t1,t3,t2,t0,t4], cur_doc = 1, pivot term = t3 (1+3=4>=4), pivot = 1; advance t1 (to doc 2). Now {t0:3,t1:2,t2:2,t3:1,t4:5}.

Step 4: sorted order [t3,t2,t1,t0,t4], cur_doc = 1, pivot term = t2 (3+2=5>=4), pivot = 2; posting[0].DID = 1 differs from the pivot, so advance t3 (to doc 4). Now {t0:3,t1:2,t2:2,t3:4,t4:5}.

Step 5: sorted order [t1,t2,t0,t3,t4], cur_doc = 1, pivot term = t3 (1+2+0.5+3=6.5>=4), pivot = 4; advance t1 (to doc 4). Now {t0:3,t1:4,t2:2,t3:4,t4:5}.

Step 6: sorted order [t2,t0,t1,t3,t4], cur_doc = 1, pivot term = t3, pivot = 4; advance t2 (to doc 6). Now {t0:3,t1:4,t2:6,t3:4,t4:5}.

Step 7: sorted order [t0,t1,t3,t4,t2], cur_doc = 1, pivot term = t3, pivot = 4; advance t0 (to doc 26). Now {t0:26,t1:4,t2:6,t3:4,t4:5}.

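Step 8: sorted order [t1,t3,t4,t2,t0], cur_doc = 1, pivot term = t3, pivot = 4.

Now posting[0].DID, the doc ID that t1's posting pointer points to, is 4, which equals pivot = 4.

So doc_id = 4 is returned and cur_doc becomes 4.

Two result documents have been obtained so far, 1 and 4.

The following steps proceed in the same way,

and yield the further results 5, 14 and 78.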
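The candidate list can be cross-checked by brute force: sum, for every document, the upper bounds of the terms whose lists contain it, and keep the documents whose sum reaches the threshold. This is not WAND itself, only a check of what WAND should return when the threshold stays fixed at 4; the function name `candidates_by_brute_force` is made up for this sketch.

```python
UB = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}
index = {
    "t0": [1, 3, 26],
    "t1": [1, 2, 4, 10, 100],
    "t2": [2, 3, 6, 34, 56],
    "t3": [1, 4, 5, 23, 70, 200],
    "t4": [5, 14, 78],
}

def candidates_by_brute_force(index, ub, theta):
    """Doc IDs whose summed term upper bounds reach theta, in ascending order."""
    bound = {}
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            bound[doc_id] = bound.get(doc_id, 0) + ub[term]
    return sorted(d for d, b in bound.items() if b >= theta)

print(candidates_by_brute_force(index, UB, 4))  # [1, 4, 5, 14, 78]
print(candidates_by_brute_force(index, UB, 8))  # []
```

The first call agrees with the walkthrough, and the second confirms that θ = 8 leaves no candidate at all.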
