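Sentiment Analysis of Natural Language (Chinese)

Data source: a Hong Kong financial news platform
Tooling: Python 3.5
Output: whether the language carries a positive or negative sentiment
Domain: finance / stock trading

Feel free to enjoy the show.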

Data preparation
Data cleaning
Sentiment analysis
Error handling
Results
Open issues

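No1. Data preparation

Preparation mainly means processing the dictionaries: the words are grouped by type and written into a Python file so that the other scripts can import them easily. The dictionary is also written to emotion_word.txt so that jieba can load it as a custom word list.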

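Advantages of writing the dictionary into a .py file

Easy to use: from emotion_word import *
The words are grouped by type, so after the import a name such as most_degree can be used directly, without the boilerplate needed to open and parse txt files
Python's higher-level data structures and their methods are available
(Screenshot of emotion_word.py)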
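One way to read the third point: once the word lists live in a Python module, they can be turned into sets, which makes the many `word in ...` checks during scoring constant-time on average instead of a linear scan. A minimal sketch follows. The list contents here are made-up stand-ins for what emotion_word.py would provide, and `is_positive` is a hypothetical helper name.

```python
# Stand-ins for lists that would normally come from emotion_word.py;
# these example entries are assumptions, not the real dictionary.
most_degree = ["极其", "最", "非常"]
pos_emotion = ["上涨", "利好", "增长"]

# Sets give average O(1) membership tests; lists are scanned linearly.
most_degree_set = frozenset(most_degree)
pos_emotion_set = frozenset(pos_emotion)


def is_positive(word):
    """Return True if the word is in the positive-word set."""
    return word in pos_emotion_set


print(is_positive("利好"))  # True
print(is_positive("下跌"))  # False
```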

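How to write it

The basic idea is to read the word on each line of the txt dictionary, append it to a list, and then print(List). That is fine for a small amount of data, but once there are more than a few hundred entries, copying the printed output is clearly impractical, so the list is written to a file instead.
If the words in the txt dictionary are stored one per line: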

def main():
    # Read one word per line from the txt dictionary
    word_list = []
    with open('emotion_word.txt', 'r', encoding='utf-8') as f:
        for line in f:
            word_list.append(line.strip('\n'))
    # Append the list, formatted as a Python assignment, to a scratch file
    with open('tem.txt', 'a', encoding='utf-8') as f:
        written = 'word_list = ' + str(word_list) + '\n'
        f.write(written)

if __name__ == '__main__':
    main()

After it has been written, select everything, copy it, and paste it into the corresponding .py file.
(Screenshot)
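The copy-and-paste step could probably be skipped by writing the module file directly. The sketch below is one possible way to do that, not what the article does: `txt_to_module` and its parameters are hypothetical names, and it assumes one word per line with blank lines ignored.

```python
import pprint


def txt_to_module(txt_path, py_path, var_name):
    """Write the words in txt_path (one per line) to py_path as a list literal."""
    with open(txt_path, "r", encoding="utf-8") as src:
        # Skip blank lines and strip surrounding whitespace
        words = [line.strip() for line in src if line.strip()]
    with open(py_path, "w", encoding="utf-8") as dst:
        # pformat produces a valid Python literal, wrapped over several lines
        dst.write(var_name + " = " + pprint.pformat(words) + "\n")
    return words
```

For example, `txt_to_module("emotion_word.txt", "emotion_word.py", "word_list")` would overwrite emotion_word.py with a single `word_list = [...]` assignment. A file holding several typed lists would need one call per list and append mode instead.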

No2. Data cleaning

The raw data looks like this (screenshot).

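The main tasks are converting Traditional Chinese to Simplified Chinese and stripping HTML tags and assorted odd symbols.
The Traditional/Simplified conversion uses a library written by a Chinese developer, imported here as langconv. 😃

Usage is simple: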

from langconv import Converter

# Convert Traditional Chinese to Simplified Chinese
def cht_to_chs(line):
    # Python 3 strings are already Unicode, so no encode step is needed
    return Converter('zh-hans').convert(line)

# Convert Simplified Chinese to Traditional Chinese
def chs_to_cht(line):
    return Converter('zh-hant').convert(line)

This code is wrapped into a class later on.
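The other half of the cleaning step, removing tags and stray symbols, can be sketched on its own with the standard `re` module. This uses the same set of characters as the rule in the EmotionAnalysis class; `strip_markup` is a hypothetical name, and the sample string is invented.

```python
import re

# Drop HTML tags, whitespace, and a few decorative bullet characters.
TAG_OR_NOISE = re.compile(r"<.*?>|[ \t\n○■☉]")


def strip_markup(text):
    """Remove HTML tags and the listed noise characters from text."""
    return TAG_OR_NOISE.sub("", text)


print(strip_markup("<p>■ 恒指 收市\n上升</p>"))  # 恒指收市上升
```

Note that `<.*?>` does not match a tag that spans a line break, so this is only adequate for simple markup.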

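No3. Sentiment analysis

For both the title (the headline) and the content (the body of the news item) we compute a score, of which only the sign is used, and a variance. For the score we rely more on the headline, because its keywords are explicit, few in number, and subject to fewer confounding factors. For the variance we rely more on the body: it contains many words, and the variance indicates how strongly the news is worded (assertive, uncertain, and so on). When the title score is 0 or the body variance is 0, we fall back to the body score and the title variance.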
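The fallback rule can be written down in a few lines. This is one literal reading of the paragraph above, and the article does not show its own version: `pick_signals` is a hypothetical name, and each argument is assumed to be a (score, variance) pair.

```python
def pick_signals(title, content):
    """Choose (score, variance) from two (score, variance) pairs.

    Prefers the title score and the content variance; if the title score
    or the content variance is zero, falls back to the content score and
    the title variance.
    """
    title_score, title_var = title
    content_score, content_var = content
    if title_score == 0 or content_var == 0:
        return content_score, title_var
    return title_score, content_var


print(pick_signals((0.3, 0.001), (-0.2, 0.002)))  # (0.3, 0.002)
print(pick_signals((0, 0.001), (-0.2, 0.002)))    # (-0.2, 0.001)
```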

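The polarity of the current word (positive or negative)
Whether the preceding word is a degree word or a negation word
Whether the following word or punctuation mark intensifies the degree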
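As a toy illustration of these three checks (not the author's implementation, which appears further down), the sketch below scores a list of tokens. The word lists, the multipliers in `DEGREE` and `INTENSIFYING_NEXT`, and the choice to let negation flip both positive and negative words are all assumptions made for the example. Only the base values 0.1 and -0.3 come from the article's code.

```python
# Toy word lists (assumptions for illustration, not the real dictionary).
POSITIVE = {"上涨", "看好"}
NEGATIVE = {"亏损"}
DEGREE = {"大幅": 2.0}             # preceding degree word -> multiplier
NEGATION = {"不"}
INTENSIFYING_NEXT = {"！": 1.5}    # following token -> multiplier


def score_tokens(tokens):
    """Return one score per token using the three checks listed above."""
    scores = []
    for i, word in enumerate(tokens):
        # 1. Polarity of the current word
        if word in POSITIVE:
            s = 0.1
        elif word in NEGATIVE:
            s = -0.3
        else:
            scores.append(0.0)
            continue
        prev_word = tokens[i - 1] if i > 0 else None
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        # 2. A preceding degree word scales the score
        s *= DEGREE.get(prev_word, 1.0)
        # 3. A following word or punctuation mark may intensify it
        s *= INTENSIFYING_NEXT.get(next_word, 1.0)
        # 2 (continued). A preceding negation word flips the sign
        if prev_word in NEGATION:
            s = -s
        scores.append(s)
    return scores


print(score_tokens(["股价", "大幅", "上涨"]))  # [0.0, 0.0, 0.2]
print(score_tokens(["不", "看好"]))            # [0.0, -0.1]
```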

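Dictionary characteristics

Negations appear in the dictionary as whole words such as '不好', not as '不' and '好' separately. In other words, the negation is usually fused with another word, although there are a few exceptions.
The dictionary includes punctuation marks
The dictionary has some flaws and is not a dictionary built specifically for the financial domain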

import re

import jieba
import numpy as np

from emotion_word import *  # word lists: pos_emotion, most_degree, neg_degree, ...
# cht_to_chs() is the Traditional -> Simplified helper from the data cleaning section


class EmotionAnalysis:
    def __init__(self, news=None):
        self.news = news
        self.list = []

    def __repr__(self):
        return "News:" + str(self.news)

    # Strip tags and noise characters, then convert Traditional -> Simplified
    def delete_label(self):
        rule = r'(<.*?>)| |\t|\n|○|■|☉'
        self.news = re.sub(rule, '', self.news)
        self.news = cht_to_chs(self.news)

    # Multiplier contributed by a degree word directly before position index
    def degree_weight(self, index):
        if index == 0:
            return 1  # the first token has no preceding word
        prev_word = self.list[index - 1]
        if prev_word in most_degree:
            return 3
        if prev_word in very_degree:
            return 2.5
        if prev_word in more_degree:
            return 2
        if prev_word in ish_degree:
            return 1.5
        return 1  # least_degree words and non-degree words leave the score unchanged

    # Return (score, variance of the per-token scores)
    def get_score(self):
        self.list = list(jieba.cut(self.news))
        score = 0
        mean_list = []
        for index, word in enumerate(self.list):
            tem_score = 0
            if (word in pos_emotion) or (word in pos_envalute):
                tem_score = 0.1 * self.degree_weight(index)
                # A negation word directly before or after flips the sign
                prev_neg = index > 0 and self.list[index - 1] in neg_degree
                next_neg = index < len(self.list) - 1 and self.list[index + 1] in neg_degree
                if prev_neg or next_neg:
                    tem_score = -tem_score
            elif (word in neg_emotion) or (word in neg_envalute):
                tem_score = -0.3 * self.degree_weight(index)
            mean_list.append(tem_score)
            score += tem_score
        return (score, np.var(mean_list))

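No4. Error handling

There are 231,506 news items in total. To make failures easy to trace back, error handling is added (implemented in the class that handles the database operations).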

log_file = 'error.log'
class SQL(object):
    ......
    def run(self, cmd, index):
        try:
            self.read_SQL(cmd, index)
            self.operate()

            self.write_SQL(index)
            self.w_conn.commit()

        except Exception as r:
            # Undo any partial work on both connections
            self.r_conn.rollback()
            self.w_conn.rollback()

            # Record the ID of the failing row so it can be looked up later
            error = "ID " + str(self.r_dict['id']) + " " + str(r)
            log_error(log_file=log_file, error=error)
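The `log_error` helper is called here but not shown in the article. A minimal sketch of what such a helper might look like, assuming it only needs to append one timestamped line per failure (the timestamp format is an assumption):

```python
import datetime


def log_error(log_file, error):
    """Append one timestamped error line to log_file."""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Append mode keeps earlier entries, so the whole run ends up in one file
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(stamp + " " + error + "\n")
```

Opening the file on every call is slow for very frequent errors, but it keeps the helper stateless, which seems adequate when failures are the exception among 231,506 rows.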

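No5. Results

Because var is very small, it is multiplied by 10,000 so that relative magnitudes are easier to compare and later work is simpler. Match the rows by id to read the results (for display purposes they were exported to two csv files).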
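To see why the raw variance is so small, consider a text where most tokens score 0. The sketch below uses `statistics.pvariance`, which computes the same population variance as `np.var` with its default arguments. `scaled_variance` is a hypothetical name and the sample scores are invented.

```python
import statistics


def scaled_variance(token_scores, factor=10000):
    """Population variance of the per-token scores, multiplied by factor."""
    return statistics.pvariance(token_scores) * factor


# 100 tokens, only one of which is a (positive) sentiment word
scores = [0.0] * 99 + [0.1]
print(statistics.pvariance(scores))  # roughly 0.000099
print(scaled_variance(scores))       # roughly 0.99
```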

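No6. Open issues

In the get_score function of the EmotionAnalysis class, the score values assigned to each case are hard to pin down. (When there is time, machine learning is worth a look; maybe it can improve this.) For now, the score can only be read by its sign, to decide between negative and positive. Still, for this kind of financial news, which is short and to the point, it works reasonably well.
The dictionary problem: see the dictionary characteristics in No3.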
