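1. Introduction to snownlp

What is snownlp?

SnowNLP is a library written in Python that makes it easy to process Chinese text. It was inspired by TextBlob. Because most natural language processing libraries target English, the author wrote one that is convenient for Chinese. Unlike TextBlob, it does not use NLTK: all algorithms are implemented from scratch, and it ships with some pre-trained dictionaries. Note that the library works on Unicode text throughout, so decode your input to Unicode before use.

The above is the official description of snownlp. In short, snownlp is a Python library for Chinese natural language processing. The operations it supports include:

Chinese word segmentation
Part-of-speech tagging
Sentiment analysis
Text classification
Conversion to pinyin
Traditional-to-Simplified Chinese conversion
Keyword extraction
Text summarization
TF, IDF
Tokenization
Text similarity

This article focuses on sentiment analysis in snownlp.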

2. Using the snownlp sentiment analysis module

2.1 Installing the snownlp library

snownlp is installed as follows:

pip install snownlp

2.2 Sentiment analysis with snownlp

The code for running sentiment analysis with snownlp is shown below:

# coding: UTF-8
import sys
from snownlp import SnowNLP

def read_and_analysis(input_file, output_file):
    # Python 3: text read with encoding="utf-8" is already str (Unicode),
    # so no decode()/encode() calls are needed
    with open(input_file, encoding="utf-8") as f, \
         open(output_file, "w", encoding="utf-8") as fw:
        for line in f:
            # Expect at least two tab-separated fields; the text is the second one
            lines = line.strip().split("\t")
            if len(lines) < 2:
                continue

            s = SnowNLP(lines[1])
            # s.words returns the word segmentation result
            seg_words = "".join("_" + x for x in s.words)
            # s.sentiments returns the final sentiment score
            fw.write(lines[0] + "\t" + lines[1] + "\t" + seg_words + "\t" + str(s.sentiments) + "\n")

if __name__ == "__main__":
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    read_and_analysis(input_file, output_file)

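The code above reads the text on each line of the file, runs sentiment analysis on it, and writes out the result.

Note: the model shipped with the library was trained on product review data, so in practice you need to retrain the model on data from your own domain.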
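
The script expects every input line to hold at least two tab-separated fields, with the text to score in the second field. A minimal sketch of that parsing step, using only the standard library, might look like this; the helper name `parse_line` and the sample rows are hypothetical and not part of snownlp:

```python
def parse_line(line):
    """Split one input line into (record_id, text); return None if malformed."""
    fields = line.strip().split("\t")
    if len(fields) < 2:
        return None  # such lines are skipped by the script
    return fields[0], fields[1]

# Hypothetical sample rows in the form "<id>\t<review text>"
sample_lines = [
    "1\t这个东西真心很赞\n",
    "a malformed line without a tab\n",
    "2\t质量太差了\n",
]
parsed = [p for p in (parse_line(l) for l in sample_lines) if p is not None]
print(parsed)
```

Lines without a tab are dropped silently, so it may be worth counting them when cleaning real data.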

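2.3 Training the sentiment analysis model on new data

In a real project, the sentiment analysis model needs to be retrained on the project's own data. This roughly takes the following steps:

Prepare positive and negative samples and save them separately, e.g. positive samples in pos.txt and negative samples in neg.txt;
Train a new model with snownlp;
Save the new model.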
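
The first step can be sketched as follows. This assumes the training files hold one sample per line in UTF-8, which should be checked against the snownlp version in use; the labeled samples, the helper name `write_samples`, and the temporary output directory are all hypothetical.

```python
import os
import tempfile

def write_samples(labeled, out_dir):
    """Write positive samples to pos.txt and negative ones to neg.txt, one per line."""
    paths = {"pos": os.path.join(out_dir, "pos.txt"),
             "neg": os.path.join(out_dir, "neg.txt")}
    buckets = {"pos": [], "neg": []}
    for text, label in labeled:
        # A newline inside a sample would split it in two, so flatten it
        buckets[label].append(text.replace("\n", " ").strip())
    for label, path in paths.items():
        with open(path, "w", encoding="utf-8") as f:
            f.write("".join(t + "\n" for t in buckets[label]))
    return paths

# Hypothetical labeled data: (text, label)
labeled = [("物流很快，包装也好", "pos"), ("用了两天就坏了", "neg"), ("非常满意", "pos")]
out_dir = tempfile.mkdtemp()
paths = write_samples(labeled, out_dir)
print(paths["pos"], paths["neg"])
```

The two resulting paths are what would then be passed to the training call.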

The code for retraining the sentiment model is shown below:

# coding: UTF-8

from snownlp import sentiment

if __name__ == "__main__":
    # Retrain the model (negative samples first, then positive samples)
    sentiment.train('./neg.txt', './pos.txt')
    # Save the newly trained model
    sentiment.save('sentiment.marshal')

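Note: to run sentiment analysis with the newly trained model, you need to change the model path that the code loads, i.e. the following assignment: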

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),'sentiment.marshal')

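3. A walk through the snownlp sentiment analysis source code

The module that supports sentiment analysis in snownlp lives in the sentiment folder; its core code is in __init__.py.

The code of the Sentiment class is as follows: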

class Sentiment(object):

    def __init__(self):
        self.classifier = Bayes()  # the underlying model is Bayes

    def save(self, fname, iszip=True):
        self.classifier.save(fname, iszip)  # save the final model

    def load(self, fname=data_path, iszip=True):
        self.classifier.load(fname, iszip)  # load the Bayes model

    # Word segmentation and stop-word removal
    def handle(self, doc):
        words = seg.seg(doc)  # segment into words
        words = normal.filter_stop(words)  # remove stop words
        return words  # return the filtered word list

    def train(self, neg_docs, pos_docs):
        data = []
        # Read the negative samples
        for sent in neg_docs:
            data.append([self.handle(sent), 'neg'])
        # Read the positive samples
        for sent in pos_docs:
            data.append([self.handle(sent), 'pos'])
        # Delegate to the training method of the Bayes model
        self.classifier.train(data)

    def classify(self, sent):
        # 1. Call handle of the Sentiment class
        # 2. Call classify of the Bayes class
        ret, prob = self.classifier.classify(self.handle(sent))
        # Always return the probability of the positive class
        if ret == 'pos':
            return prob
        return 1-prob

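In the code above, classify and train are the two core functions: train trains a sentiment classifier, and classify makes predictions. Both call the handle function, whose main job is to:

Segment the input text into words
Remove stop words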
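
These two steps can be illustrated without snownlp itself. The sketch below is not the library's implementation: it substitutes a naive whitespace tokenizer for `seg.seg` and a tiny hand-written stop-word set for `normal.filter_stop`. The names `toy_seg`, `toy_filter_stop`, and `STOP_WORDS` are assumptions made purely for illustration.

```python
STOP_WORDS = {"的", "了", "是"}  # hypothetical stop-word list

def toy_seg(doc):
    """Stand-in tokenizer: assumes the words are already separated by spaces."""
    return doc.split()

def toy_filter_stop(words):
    """Drop every token that appears in the stop-word set."""
    return [w for w in words if w not in STOP_WORDS]

def handle(doc):
    words = toy_seg(doc)            # step 1: segmentation
    return toy_filter_stop(words)   # step 2: stop-word removal

print(handle("这 本 书 的 内容 是 很 好 的"))
```

Whatever handle returns is the feature list that both training and prediction see, so the same preprocessing has to be applied in both places.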

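The underlying model for sentiment classification is the Bayes model, Bayes; for background, see the article "Easy-to-Learn Machine Learning Algorithms: Naive Bayes". Consider a classification problem with two classes $c_{1}$ and $c_{2}$ and features $w_{1}, \dots, w_{n}$ that are assumed to be mutually independent. The Bayes model for class $c_{1}$ is:

$$P(c_{1} \mid w_{1}, \dots, w_{n}) = \frac{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})}{P(w_{1}, \dots, w_{n})}$$

where:

$$P(w_{1}, \dots, w_{n}) = P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1}) + P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})$$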
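
As a small worked example of these two formulas, the sketch below plugs made-up numbers into them. The priors and per-word likelihoods are hypothetical, and the joint likelihood is taken as the product of the per-word likelihoods, which is what the independence assumption permits.

```python
from math import prod

# Hypothetical priors P(c) and per-word likelihoods P(w_i | c) for two words
prior = {"c1": 0.5, "c2": 0.5}
likelihood = {"c1": [0.2, 0.1], "c2": [0.05, 0.1]}

# Numerator of Bayes' rule for each class: P(w_1, ..., w_n | c) * P(c)
joint = {c: prod(likelihood[c]) * prior[c] for c in prior}

# Denominator: P(w_1, ..., w_n), the sum of the numerators over both classes
evidence = sum(joint.values())

posterior = {c: joint[c] / evidence for c in prior}
print(posterior)
```

With these numbers the posterior of c1 comes out at roughly 0.8 and that of c2 at roughly 0.2, so the two always sum to 1.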

3.1 Training the Bayes model

Training the Bayes model essentially amounts to counting how often each feature occurs. The core code is:

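def train(self, data):
    # data contains both positive and negative samples
    for d in data:  # each d is a list
        # d[0]: the segmentation result, a list of words
        # d[1]: the label of the sample (positive/negative)
        c = d[1]
        if c not in self.d:
            self.d[c] = AddOneProb()  # create the counter for this class
        for word in d[0]:  # every word in the segmentation result
            self.d[c].add(word, 1)
    # Sum of the totals of the positive and the negative class
    self.total = sum(map(lambda x: self.d[x].getsum(), self.d.keys()))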

This uses the AddOneProb class, shown below:

class AddOneProb(BaseProb):

    def __init__(self):
        self.d = {}
        self.total = 0.0
        self.none = 1  # count assumed for any unseen key, 1 by default

    # If value is also 1, a key that does not exist yet is incremented by 2
    def add(self, key, value):
        self.total += value
        # Create the key if it does not exist yet
        if not self.exists(key):
            self.d[key] = 1
            self.total += 1
        self.d[key] += value

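Note:

The default value of none is 1
When a key does not exist yet, both total and the corresponding d[key] grow by 1+value; this matters later during prediction

The total in the AddOneProb class is the sum of all counts within the positive class or the negative class; the total in the train function is the sum of the totals of both classes.

Once total and the d[key] of every feature key have been counted over the training samples, training is complete.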
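
The counting behavior can be reproduced in a few lines. This is a standalone sketch of the same bookkeeping rather than the library's class: `ToyAddOneProb` and the sample tokens are hypothetical, and the `freq` fallback is an assumption about how `none` is used at prediction time.

```python
class ToyAddOneProb:
    def __init__(self):
        self.d = {}
        self.total = 0.0
        self.none = 1  # count assumed for unseen keys

    def add(self, key, value):
        self.total += value
        if key not in self.d:
            # First sighting: start the count at 1 and add that 1 to the total
            self.d[key] = 1
            self.total += 1
        self.d[key] += value

    def freq(self, key):
        # Assumption: unseen keys fall back to `none`, so the result is never 0
        return self.d.get(key, self.none) / self.total

counts = ToyAddOneProb()
for word in ["好", "好", "赞"]:
    counts.add(word, 1)

print(counts.d)      # {'好': 3, '赞': 2}
print(counts.total)  # 5.0
```

Each new key costs one extra count in both d[key] and total, which is the add-one smoothing that keeps every frequency strictly positive.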

3.2 Prediction with the Bayes model

Prediction uses the formula above, namely:

$$P(c_{1} \mid w_{1}, \dots, w_{n}) = \frac{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})}{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1}) + P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})}$$

This formula can be simplified:

$$\begin{aligned}
P(c_{1} \mid w_{1}, \dots, w_{n}) &= \frac{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})}{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1}) + P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})} \\
&= \frac{1}{1 + \dfrac{P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})}{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})}} \\
&= \frac{1}{1 + \exp\left[\log\left(\dfrac{P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})}{P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})}\right)\right]} \\
&= \frac{1}{1 + \exp\left[\log\left(P(w_{1}, \dots, w_{n} \mid c_{2}) \cdot P(c_{2})\right) - \log\left(P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})\right)\right]}
\end{aligned}$$

Here, the 1 in the denominator can be rewritten as:

$$1 = \exp\left[\log\left(P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})\right) - \log\left(P(w_{1}, \dots, w_{n} \mid c_{1}) \cdot P(c_{1})\right)\right]$$

The corresponding code is shown below:

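# requires: from math import log, exp
def classify(self, x):
    tmp = {}
    for k in self.d:  # the positive and the negative class
        # log of the class total minus log of the overall total, i.e. log P(c_k)
        tmp[k] = log(self.d[k].getsum()) - log(self.total)
        for word in x:
            # word frequency; an unseen word falls back to none (1), not 0,
            # so log() stays defined
            tmp[k] += log(self.d[k].freq(word))
    ret, prob = 0, 0
    for k in self.d:
        now = 0
        try:
            # The term with otherk == k contributes exp(0) = 1
            for otherk in self.d:
                now += exp(tmp[otherk]-tmp[k])
            now = 1/now
        except OverflowError:
            now = 0
        if now > prob:
            ret, prob = k, now
    return (ret, prob)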

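Here, in the first for loop, tmp[k] is initialized to $\log(P(c_{k}))$. Once the inner loop over the words has finished, and therefore throughout the second for loop, tmp[k] holds $\log(P(w_{1}, \dots, w_{n} \mid c_{k}) \cdot P(c_{k}))$. In the second loop, the term with otherk equal to k contributes $\exp(0) = 1$, which is the 1 in the denominator of the simplified formula.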
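
To check that the log-space computation yields the same posterior as the direct formula, the sketch below compares the two on made-up values. The priors and likelihoods are hypothetical; `tmp` plays the same role as in classify.

```python
from math import exp, log

# Hypothetical priors P(c_k) and per-word likelihoods P(w_i | c_k)
prior = {"pos": 0.6, "neg": 0.4}
word_likelihoods = {"pos": [0.2, 0.05], "neg": [0.1, 0.1]}

# Direct form: normalize P(w | c_k) * P(c_k) over both classes
joint = {}
for k in prior:
    joint[k] = prior[k]
    for p in word_likelihoods[k]:
        joint[k] *= p
direct = {k: joint[k] / sum(joint.values()) for k in prior}

# Log-space form: tmp[k] = log P(c_k) + sum_i log P(w_i | c_k)
tmp = {k: log(prior[k]) + sum(log(p) for p in word_likelihoods[k]) for k in prior}
via_logs = {k: 1 / sum(exp(tmp[other] - tmp[k]) for other in tmp) for k in tmp}

print(direct)
print(via_logs)
```

Both dictionaries agree up to floating-point error. Summing logs instead of multiplying many small probabilities avoids numeric underflow on long texts.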

References

Reposted from:
https://blog.csdn.net/google19890102/article/details/80091502
