In the previous post I crawled data for 10 categories and saved it as plain text files.
The second step is to run word segmentation on the collected text.
Development environment:
Anaconda 3;
jieba (for word segmentation); running pip install jieba inside the Anaconda environment downloads and installs the package successfully. (conda and pip are two different package managers, and jieba is not available through conda, so it has to be installed with pip.)
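Before running the full script it is worth confirming that jieba actually works in the environment. A minimal sketch (the sample sentence is just jieba's usual demo sentence, not data from this project):

import jieba

# Precise mode (cut_all=False) is the same mode used in the script below;
# it splits the sentence into non-overlapping words.
sample = '我来到北京清华大学'
print('/ '.join(jieba.lcut(sample, cut_all=False)))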
Here is the code:
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 8 10:26:40 2018
@author: Administrator
"""
# 2017-07-04 00:13:40
# silei
# jieba word segmentation, stopword removal, data visualization, knowledge graph
# 1170 data files in total
# baby, car, food, health, legend, life, love, news, science, sexual
# 130, 130, 130, 130, 130, 130, 130, 130, 130, 130
import jieba

# Dictionary of category names and the number of text files in each category
categories = {'baby': 130, 'car': 130, 'food': 130, 'health': 130,
              'legend': 130, 'life': 130, 'love': 130, 'news': 130,
              'science': 130, 'sexual': 39}

# Read the stopword list once, as a set, instead of re-reading the file for every line
with open('F:\\test\\stopword.txt', encoding='UTF-8') as f:
    stoplist = set(line.strip() for line in f)

for world_data_name, world_data_number in categories.items():
    for data_file_number in range(world_data_number):                    # index of the file being processed
        print(world_data_name, world_data_number, data_file_number)      # progress information
        file = open('F:\\test\\' + world_data_name + '\\' + str(data_file_number) + '.txt',
                    'r', encoding='UTF-8')
        file_w = open('F:\\test\\trainTest\\' + world_data_name + '\\' + str(data_file_number) + '.txt',
                      'w', encoding='UTF-8')
        for line in file:
            seg_list = jieba.lcut(line, cut_all=False)                       # precise mode
            seg_list = [word for word in seg_list if word not in stoplist]   # remove stopwords
            print("Default Mode:", "/ ".join(seg_list))
            for word in seg_list:
                file_w.write(word + '\n')                                    # one token per line in the output file
        file_w.close()
        file.close()
Running the code produces the segmented text files, and the word segmentation step is done!
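To spot-check the result, you can read one of the segmented files back in and count word frequencies. This is only a sketch; the path below (the first file of the baby category) is just an example following the output layout used above.

from collections import Counter

# Each output file contains one token per line, so reading it back is trivial.
with open('F:\\test\\trainTest\\baby\\0.txt', 'r', encoding='UTF-8') as f:
    tokens = [line.strip() for line in f if line.strip()]

print(Counter(tokens).most_common(10))  # the ten most frequent tokens in this file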
