1. Introduction

This article describes how to do simple opinion extraction and clustering on text without supervision.

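2. Opinion Extraction

Opinion extraction relies on dependency parsing: following a fixed set of dependency structures, it pulls the important structures out of the original text so that they represent the main meaning of the whole sentence.

I consider three dependency relations important: verb-complement ("動補結構"), verb-object ("動賓關系"), and preposition-object ("介賓關系"). The unimportant ones are attributive ("定中關系"), adverbial ("狀中結構"), and subject-verb ("主謂關系"). Extraction starts from the head word, ROOT.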
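
As a rough illustration of the idea, the sketch below starts at the root and keeps only the words reachable through the three important relations. It is a simplified stand-in, not the author's method: the rows are hand-written rather than produced by a parser, and the relation names, `KEEP`, and `extract_opinion` are hypothetical names chosen for this example.

```python
# Hypothetical relation labels (English stand-ins for the parser's labels).
ROOT, VOB, POB, CMP = "root", "verb-object", "prep-object", "verb-complement"
SBV, ATT, ADV = "subject-verb", "attributive", "adverbial"

# Relations considered important enough to follow.
KEEP = {VOB, POB, CMP}


def extract_opinion(rows):
    """Keep the root plus every word reachable from it through KEEP relations.

    rows: list of (index, word, head_index, relation); indices are 0-based
    and the root's head index is -1.
    """
    children = {}
    root = None
    for idx, _word, head, rel in rows:
        if rel == ROOT:
            root = idx
        else:
            children.setdefault(head, []).append((idx, rel))
    if root is None:
        return []

    kept = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, rel in children.get(node, []):
            if rel in KEEP and child not in kept:
                kept.add(child)
                stack.append(child)

    words = {idx: word for idx, word, _head, _rel in rows}
    # Restore the original word order.
    return [words[i] for i in sorted(kept)]


# Hand-written parse of "I want a refund now" (an assumption, not parser output).
rows = [
    (0, "I", 1, SBV),
    (1, "want", -1, ROOT),
    (2, "a", 3, ATT),
    (3, "refund", 1, VOB),
    (4, "now", 1, ADV),
]
print(extract_opinion(rows))  # ['want', 'refund']
```

The subject, the attributive, and the adverbial are dropped because their relations are not in `KEEP`; the method in this article additionally handles keywords and multiple roots.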

The main extraction method is shown below; for the complete code, see the GitHub repository.

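import operator  # used by the final sort

def parseSentWithKey(self, sentence, key=None):
    '''
    Keyword-driven opinion extraction. Given a keyword `key`, find the root path that contains it
    and extract the opinion under that root; the extraction is essentially the same as in parseSentence.
    Opinions can be extracted from multiple roots.
    '''
    # `key` is the keyword. If it is given, only sentences containing it are analysed; if not, no check is made.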
    if key:
        keyIndex = 0
        if key not in sentence:
            return []
    rootList = []
    parse_result = str(self.hanlp.parseDependency(sentence)).strip().split('\n')
    # Subtract 1 from the indices to correct them: the indices coming out of pyhanlp start at 1.
    for i in range(len(parse_result)):
        parse_result[i] = parse_result[i].split('\t')
        parse_result[i][0] = int(parse_result[i][0]) - 1
        parse_result[i][6] = int(parse_result[i][6]) - 1
        if key and parse_result[i][1] == key:
            keyIndex = i

    for i in range(len(parse_result)):
        self_index = int(parse_result[i][0])
        target_index = int(parse_result[i][6])
        relation = parse_result[i][7]
        if relation in self.main_relation:
            if self_index not in rootList:
                rootList.append(self_index)
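        # Find additional roots: a word in a coordinate relation ("並列關系") with a root is also a root.
        # The relation label must match the parser's output string exactly.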
        elif relation == "並列關系" and target_index in rootList:
            if self_index not in rootList:
                rootList.append(self_index)


        # Add an 11th field to each row: the indices of the words that depend on it.
        # Skip the virtual root (-1), which would otherwise alias the last row.
        if target_index != -1:
            if len(parse_result[target_index]) == 10:
                parse_result[target_index].append([])
            if not (relation == "並列關系" and target_index in rootList):
                parse_result[target_index][10].append(self_index)
    
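    # Find the root path that contains the key.
    if key:
        rootIndex = 0
        if len(rootList) > 1:
            target = keyIndex
            # Walk up the head links until a root is reached; stop at -1 so the loop cannot run forever.
            while target != -1:
                if target in rootList:
                    rootIndex = rootList.index(target)
                    break
                target = int(parse_result[target][6])
        # Guard against a sentence in which no root was found.
        loopRoot = [rootList[rootIndex]] if rootList else []
    else:
        loopRoot = rootList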

    result = {}
    related_words = set()
    for root in loopRoot:
        # Add the key and the root to result.
        if key:
            self.addToResult(parse_result, keyIndex, result, related_words)
        self.addToResult(parse_result, root, result, related_words)

    # Select opinion words via the verb-complement ('動補結構'), verb-object ('動賓關系') and preposition-object ('介賓關系') relations.
    for item in parse_result:
        relation = item[7]
        target = int(item[6])
        index = int(item[0])
        if relation in self.reverse_relation and target in result and target not in related_words:
            self.addToResult(parse_result, index, result, related_words)

    # Add the keyword itself.
    for item in parse_result:
        word = item[1]
        if word == key:
            result[int(item[0])] = word

    # Sort the words already in result by their original position in the sentence.
    sorted_keys = sorted(result.items(), key=operator.itemgetter(0))
    selected_words = [w[1] for w in sorted_keys]
    return selected_words

This method gives us the opinion for each sentence. All of these opinions are then clustered (section 3).

2.1 Extraction results

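Original sentence	Opinion
這個手機是正品嗎？ (Is this phone genuine?)	手機是正品 (phone is genuine)
禮品是一些什么東西？ (What kind of things are the gifts?)	禮品是什么東西 (gifts are what things)
現在都送什么禮品啊 (What gifts are being given now?)	都送什么禮品 (what gifts are given)
直接付款是怎么付的啊 (How does direct payment work?)	付款是怎么付 (how is payment made)
如果不滿意也可以退貨的吧 (If I'm not satisfied I can return it, right?)	不滿意可以退貨 (not satisfied, can return)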

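3. Opinion Clustering

There are several ways to cluster opinions:

Compute the similarity of two opinions directly. (This is the method I use.)
Convert the opinions to vectors and compare their cosine distance.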
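
The second option can be sketched as follows. This is only one possible vectorization, chosen for brevity: each opinion becomes a bag-of-characters count vector. `cosine_similarity` is a hypothetical helper written for this example; the article does not say which vectors would be used.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    # Dot product over the characters the two strings share.
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(n * n for n in va.values()))
    norm_b = math.sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string is similar to nothing
    return dot / (norm_a * norm_b)


# Two opinions taken from the results in section 3.2.
print(cosine_similarity("可以優惠多少", "能優惠多少"))
```

Cosine distance is one minus this value. Character counts ignore word order, so word or sentence embeddings would be a natural replacement where order matters.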

My approach uses difflib to compare every pair of opinions. Its time complexity is high, \(O(n^2)\), so I sped it up with a small trick. The code is as follows:

def extractor(self):
    de = DependencyExtraction()
    opinionList = OpinionCluster()
    for sent in self.sentences:
        # Use the first configured keyword that occurs in the sentence, or "" if there is none.
        keyword = next((word for word in (self.keyword or []) if word in sent), "")

        opinion = "".join(de.parseSentWithKey(sent, keyword))
        if self.filterOpinion(opinion):
            opinionList.addOpinion(Opinion(sent, opinion, keyword))


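    # Two thresholds are used. The small one first cuts the large data set into small blocks;
    # because it is lenient, opinions that belong to the same class mostly land in the same block.
    # Each small block is then clustered on its own, which is much faster:
    # thresholds=[0.2, 0.6] runs about 30 times faster than thresholds=[0.6].
    # The results of [0.2, 0.6] and [0.6] are not identical, however:
    # some opinions that belong together get split apart.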
    thresholds = self.json_config["thresholds"]
    clusters = [opinionList]
    for threshold in thresholds:
        newClusters = []
        for cluster in clusters:
            newClusters += self.clusterOpinion(cluster, threshold)
        clusters = newClusters

    resMaxLen = {}
    for oc in clusters:
        if len(oc.getOpinions()) >= self.json_config["minClusterLen"]:
            summaryStr = oc.getSummary(self.json_config["freqStrLen"])
            resMaxLen[summaryStr] = oc.getSentences()

    return self.sortRes(resMaxLen)
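
The article does not show `clusterOpinion`, so the sketch below is an assumption about how the staged thresholds could work, not the author's implementation. It uses a simple greedy rule (join the first cluster whose first member is similar enough) and `difflib.SequenceMatcher.ratio()` as the similarity; `similarity`, `cluster`, and `cluster_in_stages` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(items, threshold):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for item in items:
        for members in clusters:
            if similarity(members[0], item) >= threshold:
                members.append(item)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([item])
    return clusters


def cluster_in_stages(items, thresholds):
    """Apply each threshold in turn, re-clustering inside the blocks of the previous stage."""
    clusters = [list(items)]
    for threshold in thresholds:
        clusters = [sub for block in clusters for sub in cluster(block, threshold)]
    return clusters


opinions = ["abcd", "abce", "wxyz", "wxyq"]
print(cluster_in_stages(opinions, [0.2, 0.6]))
# [['abcd', 'abce'], ['wxyz', 'wxyq']]
```

After the first stage, an opinion is only ever compared with opinions in its own coarse block, which is where the time is saved. An opinion that lands in the wrong coarse block cannot rejoin its true cluster later, which is why the staged result can differ from a single pass.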

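3.1 Opinion summarization

For the opinions grouped into one cluster, pick a single opinion that represents the whole cluster well.

My method counts character frequencies over all opinions in the cluster, builds a string from the high-frequency characters, computes the similarity between that string and every opinion, and takes the most similar opinion as the summary of the whole cluster.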

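import collections

def getSummary(self, freqStrLen):
    # Opinion strings of every member of this cluster.
    opinionStrs = [op.opinion for op in self._opinions]

    # Count character frequencies over all opinions, most frequent first.
    word_counter = collections.Counter("".join(opinionStrs)).most_common()

    # Build a string from the characters that occur at least freqStrLen times
    # (despite its name, freqStrLen is a frequency threshold).
    freqStr = ""
    for item in word_counter:
        if item[1] >= freqStrLen:
            freqStr += item[0]

    # Return the opinion most similar to that string.
    # similarity() is defined elsewhere and is not shown in this article.
    maxSim = -1
    maxOpinion = ""
    for opinion in opinionStrs:
        sim = similarity(freqStr, opinion)
        if sim > maxSim:
            maxSim = sim
            maxOpinion = opinion

    return maxOpinion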
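
The same procedure can be tried outside the class with the standalone sketch below. `similarity` is not shown in the article; since the clustering step is said to use difflib, this sketch assumes `SequenceMatcher.ratio()`, which may differ from the author's function. `summarize` and `min_count` are hypothetical names.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed similarity measure: difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def summarize(opinions, min_count):
    """Return the opinion closest to the string of high-frequency characters."""
    counts = Counter("".join(opinions))
    # Characters occurring at least min_count times, most frequent first.
    freq_str = "".join(ch for ch, n in counts.most_common() if n >= min_count)
    # On a tie, max() keeps the first opinion, like the strict '>' in getSummary.
    return max(opinions, key=lambda op: similarity(freq_str, op), default="")


print(summarize(["aaa", "aaa", "zzz"], min_count=4))  # aaa
```

Only "a" reaches the threshold of 4 here, so the high-frequency string is "a" and an opinion made of that character wins.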

3.2 Summarization results

Cluster summary	All opinions
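手機是全新正品 (the phone is brand-new and genuine)	手機是全新正品手機是全新手機是不是正品保證是全新手機
能送無線充電器 (can a wireless charger be included as a gift)	能送無線充電器人家送無線充電器送無線充電器買能送無線充電器
可以優惠多少 (how much of a discount is possible)	可以優惠多少你好可優惠多少能優惠多少可以優惠多少
是不是翻新機 (is it a refurbished phone)	是不是翻新機不會是翻新機手機是還是翻新會不會是翻新機
花唄可以分期 (Huabei can pay in installments)	花唄不夠可以分期花唄分期可以可以花唄分期花唄可以分期
沒有給發票 (no invoice was given)	我沒有發票發票有開給我沒有給發票你們有給發票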

4. Conclusion

The above is some simple opinion extraction and clustering work of my own; it is suitable for some simple scenarios.
