Segmenting Chinese documents with jieba in Python and writing the results to an Excel file

Reposted article, 2020-02-15 22:30

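Input

This article uses 2,000 positive-review txt files and 2,000 negative-review txt files for a product on JD.com (Jingdong), 4,000 txt files in total.

The content of a positive-review txt file looks like this:

1 鋼琴漆，很滑很亮。2 LED寬屏，看起來很爽3 按鍵很舒服4 活動贈品多
(Roughly: "1. Piano-lacquer finish, very smooth and shiny. 2. LED widescreen, great to look at. 3. The keys feel comfortable. 4. Lots of promotional gifts.")

The content of a negative-review txt file looks like this:

送貨上門后發現電腦顯示器的兩邊有縫隙；成型塑料表面凹凸不平。做工很差，，，，，
(Roughly: "After delivery I found gaps on both sides of the monitor; the moulded plastic surface is uneven. Very poor workmanship.")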

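Output

First, the output of jieba segmentation over the 4,000 txt files.

For the positive review shown in the Input section, the segmentation result (stop words and punctuation removed) is:

鋼琴漆很滑很亮 LED 寬屏很爽按鍵舒服活動贈品

For the negative review shown in the Input section, the segmentation result is:

送貨上門發現電腦顯示器兩邊縫隙成型塑料表面凹凸不平做工很差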
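The stop-word step that produces results like the ones above can be sketched on its own. This is a minimal illustration rather than the article's exact code: the token list and the stop-word set are assumed values chosen for the example (real tokens would come from `jieba.cut`, and real stop words from the downloaded stop-word file), and `filter_tokens` is a hypothetical helper name.

```python
def filter_tokens(tokens, stopwords):
    """Join the tokens that are not stop words with single spaces."""
    return ' '.join(tok for tok in tokens if tok not in stopwords)


# Assumed tokenization of "按鍵很舒服" and an assumed stop-word set.
tokens = ['按鍵', '很', '舒服']
stopwords = {'很', '的', '了'}

result = filter_tokens(tokens, stopwords)
print(result)  # 按鍵 舒服
```

Holding the stop words in a set keeps each membership check cheap, which matters when the same list is applied to thousands of reviews.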

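Then the segmentation results of the 2,000 positive-review and 2,000 negative-review txt files are written to Excel, each result paired with a label (1 for positive, 0 for negative), as illustrated below:

Figure: segmentation results of the positive-review txt files

Figure: segmentation results of the negative-review txt files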
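The sheet layout described above (a header row, then one row per review holding the segmented text and its label) can be sketched as follows. The sketch is written against any object that has a `write(row, col, value)` method, which is the call shape of an xlwt worksheet; `write_rows` and the `FakeSheet` recorder are hypothetical names used only so the example runs without xlwt installed, and the sample rows are assumed values.

```python
def write_rows(sheet, header, rows):
    """Write a header in row 0, then each data row below it."""
    for col, name in enumerate(header):
        sheet.write(0, col, name)
    for r, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            sheet.write(r, col, value)


class FakeSheet:
    """Stand-in for a worksheet: records every cell that is written."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


sheet = FakeSheet()
rows = [['按鍵 舒服', 1], ['做工 很差', 0]]  # assumed sample rows
write_rows(sheet, ['review', 'label'], rows)
print(sheet.cells)
```

With xlwt, the same function could be called with the worksheet returned by `Workbook().add_sheet('Sheet1')`, followed by `book.save(...)` on the workbook.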

Tools

Tools used in this article: Anaconda, PyCharm, Python, the jieba Chinese word segmentation library, and a stop-word list downloaded from the internet.

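Principle

jieba is used to segment the Chinese text in each txt file; stop words are removed from the segmentation result, and what remains is written to an Excel file.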
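As a rough sketch of this pipeline, the function below segments each text, drops stop words, and pairs the result with a label. The tokenizer is passed in as a parameter so the example runs without jieba installed; `build_rows` and `whitespace_tokenize` are hypothetical names, the stop words are assumed demo values, and the whitespace tokenizer is only a stand-in for jieba.

```python
def build_rows(texts, label, tokenize, stopwords):
    """Return one [segmented_text, label] row per input text."""
    rows = []
    for text in texts:
        tokens = [t for t in tokenize(text.strip()) if t not in stopwords]
        rows.append([' '.join(tokens), label])
    return rows


def whitespace_tokenize(text):
    """Stand-in tokenizer for the demo; real use would pass a jieba tokenizer."""
    return text.split()


stopwords = {'the', 'is'}  # assumed stop words for the demo
rows = build_rows(['the screen is bright', 'keys feel good'], 1,
                  whitespace_tokenize, stopwords)
print(rows)
```

For Chinese text, `jieba.lcut` could be passed as `tokenize`, together with a set loaded from the stop-word file.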

Python implementation

import os

import jieba
from xlwt import Workbook


# Load the stop-word file (one stop word per line) into a set
def stopwordslist():
    with open('stopwords.txt', encoding='UTF-8') as f:
        return {line.strip() for line in f}


# Load the stop words once instead of re-reading the file for every document
STOPWORDS = stopwordslist()


# Segment a document string and remove stop words
def seg_depart(sentence):
    print('sentence:{}'.format(sentence))
    # jieba.cut returns a generator of tokens
    sentence_depart = jieba.cut(sentence.strip())

    # Keep tokens that are neither stop words nor tabs, joined by spaces
    kept = [word for word in sentence_depart
            if word not in STOPWORDS and word != '\t']
    outstr = ' '.join(kept)
    print('outstr:{}'.format(outstr))
    return outstr


# Path of the txt files (switch the comment to process the negative reviews)
# mypath = 'F:\\Jingdong_4000\\neg\\'
mypath = 'F:\\Jingdong_4000\\pos\\'
myfiles = os.listdir(mypath)

# Names of the txt files in that folder
fileList = []
for f in myfiles:
    if os.path.isfile(os.path.join(mypath, f)) and os.path.splitext(f)[1] == '.txt':
        fileList.append(f)

# Rows to be written to the Excel file;
# each element is a list holding the segmentation result and its label
excellist = []
for ff in fileList:
    with open(os.path.join(mypath, ff), 'r', encoding='gb2312', errors='ignore') as f:
        sourceInLines = f.readlines()

    # Stripping after each append joins the lines without newlines
    text = ''
    for line in sourceInLines:
        text = (text + line).strip()

    # Segment the text
    text = seg_depart(text).strip()

    # Attach the label: 1 for positive, 0 for negative
    # rowList = [text, 0]
    rowList = [text, 1]
    excellist.append(rowList)

# Excel sheet layout: a header row followed by one row per review
book = Workbook()
sheet1 = book.add_sheet('Sheet1')
row0 = ['review', 'label']

for i in range(len(row0)):
    sheet1.write(0, i, row0[i])

# Outer loop: one Excel row per review; inner loop: the columns of that row
for i, li in enumerate(excellist):
    print('i:{}, li:{}'.format(i, li))
    for j, lj in enumerate(li):
        sheet1.write(i + 1, j, lj)

# Save the data to an Excel file
# book.save('neg_fenci_excel.xls')
book.save('pos_fenci_excel.xls')
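The script above handles one folder per run, and switching between the positive and negative sets means editing the commented lines. One possible variation, sketched here under assumptions, reads both folders in a single pass and attaches the label as it goes. The `pos` and `neg` subfolder names follow the paths in the script; `collect_labelled_texts` is a hypothetical helper, and the demo builds a small temporary directory with made-up reviews instead of reading the real data folder.

```python
import os
import tempfile


def collect_labelled_texts(root, labels, encoding='gb2312'):
    """Return (text, label) pairs for every .txt file under root/<subfolder>."""
    pairs = []
    for subfolder, label in labels.items():
        folder = os.path.join(root, subfolder)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isfile(path) and name.endswith('.txt'):
                with open(path, encoding=encoding, errors='ignore') as f:
                    pairs.append((f.read().strip(), label))
    return pairs


# Demo on a throwaway directory tree with one assumed review per class.
with tempfile.TemporaryDirectory() as root:
    for subfolder, text in (('pos', '按键很舒服'), ('neg', '做工很差')):
        os.makedirs(os.path.join(root, subfolder))
        with open(os.path.join(root, subfolder, '1.txt'), 'w',
                  encoding='gb2312') as f:
            f.write(text)
    pairs = collect_labelled_texts(root, {'pos': 1, 'neg': 0})

print(pairs)
```

Each returned text could then be passed through the segmentation function and written out with its label, so both classes are produced without touching the source between runs.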

Result of running the code

The script produces Excel files like those shown in the Output section.
