Creating a word cloud with pyhanlp
Last year I wrote the article "Python + wordcloud + jieba 十分鍾學會用任意中文文本生成詞雲" ("Python + wordcloud + jieba: learn to generate a word cloud from any Chinese text in ten minutes"). You may notice it looks a lot like the Chinese word cloud example in the official wordcloud documentation; no confusion there, I wrote that one too.
Now we can follow the same format and write a pyhanlp version.
For wordcloud, the natively supported English text comes with spaces built in, so for Chinese we need to segment the text and remove stopwords first, then turn it into the list format that wordcloud expects. Also, since the document may contain new words we have not seen before, we should add code for a custom dictionary where necessary. The code is very simple.
First, to add a custom dictionary, all we need to do is reference the CustomDictionary class. The code is as follows.
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
for word in userdict_list:
    CustomDictionary.add(word)
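Incidentally, CustomDictionary also accepts a nature (part-of-speech) tag with a frequency in a single string, following HanLP's "nature frequency" convention; a minimal sketch (the tag and frequency values here are just illustrative):

# plain insertion
CustomDictionary.add("孔乙己")
# insertion with a nature tag ("nr" = person name) and a frequency of 1024
CustomDictionary.add("孔乙己", "nr 1024")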
However, to discover new words more reliably, we should pick a segmenter that is good at new-word discovery, such as the default Viterbi segmenter or the CRF segmenter. Given that in earlier experiments CRF did not perform well under the default named entity recognition settings, Viterbi may be the better choice.
mywordlist = []
HanLP.Config.ShowTermNature = False  # drop the part-of-speech tag from the output
CRFnewSegment = HanLP.newSegment("viterbi")
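(The variable is still named CRFnewSegment out of habit; it actually holds the Viterbi segmenter, and you can pass "crf" to HanLP.newSegment to get the CRF model instead.) A quick sanity check of the segmenter output; the sample sentence is arbitrary:

# print the segmented words of a test sentence
term_list = CRFnewSegment.seg("孔乙己是站着喝酒而穿长衫的唯一的人")
print([term.word for term in term_list])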
Stopword removal is equally simple. We reuse the method mentioned earlier in the segmentation section, but this time we make the code a bit more Pythonic:
CoreStopWordDictionary = JClass("com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")
text_list = CRFnewSegment.seg(text)
CoreStopWordDictionary.apply(text_list)  # filter stopwords in place
finalText = [i.word for i in text_list]
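CoreStopWordDictionary uses HanLP's built-in stopword list. If the defaults miss something, the list can also be extended at runtime; a small sketch assuming the add/contains static methods from HanLP's Java API:

# extend the built-in stopword list at runtime
CoreStopWordDictionary.add("哈哈")
# membership test against the (now extended) list
print(CoreStopWordDictionary.contains("哈哈"))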
Whether to also apply the stopword file from the earlier article is up to you, and for converting the result into the format wordcloud needs we can reuse the earlier code directly. The final core code (the body of the processing function, shown in full in the example below) is as follows:
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
for word in userdict_list:
    CustomDictionary.add(word)

mywordlist = []
HanLP.Config.ShowTermNature = False
CRFnewSegment = HanLP.newSegment("viterbi")

if isUseStopwordsOfHanLP:
    CoreStopWordDictionary = JClass("com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")
    text_list = CRFnewSegment.seg(text)
    CoreStopWordDictionary.apply(text_list)
    finalText = [i.word for i in text_list]
else:
    # segment without applying HanLP's built-in stopword filter
    finalText = [i.word for i in CRFnewSegment.seg(text)]

liststr = "/ ".join(finalText)
with open(stopwords_path, encoding='utf-8') as f_stop:
    f_stop_text = f_stop.read()
    f_stop_seg_list = f_stop_text.splitlines()

for myword in liststr.split('/'):
    if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
        mywordlist.append(myword)
return ' '.join(mywordlist)
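Once this body is wrapped in a function (as in the complete example below), it can be driven like this; the paths and sample inputs here are placeholders, not part of the original script:

# hypothetical inputs for a quick test
userdict_list = ['孔乙己']
stopwords_path = 'stopwords_cn_en.txt'
text = open('CalltoArms.txt', encoding='utf-8').read()
print(pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True)[:100])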
Final example
We now have the final example, ready to run. For this part you can also refer to my fork of the wordcloud project, which carries some newer changes compared with what is shown here; it provides a combined jieba and pyhanlp version.
# -*- coding: utf-8 -*-
"""
create wordcloud with chinese
=======================

Wordcloud is a very good tool, but if you want to create a Chinese wordcloud,
wordcloud alone is not enough. This file shows how to use wordcloud with
Chinese. First, you need a Chinese word segmentation library: pyhanlp, one of
the most powerful Chinese natural language processing libraries available
today, and extremely easy to use. You can install it with 'pip install
pyhanlp'. Its named entity recognition and word segmentation are better than
jieba's, and it offers more ways to work with the results.
"""

from os import path
from scipy.misc import imread  # removed in SciPy >= 1.2; see the note below
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from pyhanlp import *

# %matplotlib inline  # uncomment when running inside a Jupyter notebook

# d = path.dirname(__file__)
d = "/home/fonttian/Github/word_cloud/examples"

stopwords_path = d + '/wc_cn/stopwords_cn_en.txt'
# Chinese fonts must be set
font_path = d + '/fonts/SourceHanSerif/SourceHanSerifK-Light.otf'

# the paths to save the wordcloud images
imgname1 = d + '/wc_cn/LuXun.jpg'
imgname2 = d + '/wc_cn/LuXun_colored.jpg'

# read the mask / color image
back_coloring = imread(path.join(d, 'wc_cn/LuXun_color.jpg'))

# Read the whole text.
text = open(path.join(d, 'wc_cn/CalltoArms.txt')).read()

userdict_list = ['孔乙己']


# The function for processing text with HanLP
def pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True):
    CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
    for word in userdict_list:
        CustomDictionary.add(word)

    mywordlist = []
    HanLP.Config.ShowTermNature = False
    CRFnewSegment = HanLP.newSegment("viterbi")

    if isUseStopwordsOfHanLP:
        CoreStopWordDictionary = JClass("com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")
        text_list = CRFnewSegment.seg(text)
        CoreStopWordDictionary.apply(text_list)
        finalText = [i.word for i in text_list]
    else:
        # segment without applying HanLP's built-in stopword filter
        finalText = [i.word for i in CRFnewSegment.seg(text)]

    liststr = "/ ".join(finalText)
    with open(stopwords_path, encoding='utf-8') as f_stop:
        f_stop_text = f_stop.read()
        f_stop_seg_list = f_stop_text.splitlines()

    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)


wc = WordCloud(font_path=font_path, background_color="white", max_words=2000,
               mask=back_coloring, max_font_size=100, random_state=42,
               width=1000, height=860, margin=2)

processed_txt = pyhanlp_processing_txt(text, isUseStopwordsOfHanLP=True)
wc.generate(processed_txt)

# create coloring from image
image_colors_default = ImageColorGenerator(back_coloring)

plt.figure()
# recolor wordcloud and show
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
# save wordcloud
wc.to_file(imgname1)

# create coloring from image
image_colors_byImg = ImageColorGenerator(back_coloring)

# show
# we could also give color_func=image_colors directly in the constructor
plt.imshow(wc.recolor(color_func=image_colors_byImg), interpolation="bilinear")
plt.axis("off")
plt.figure()
plt.imshow(back_coloring, interpolation="bilinear")
plt.axis("off")
plt.show()

# save wordcloud
wc.to_file(imgname2)
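One caveat about the environment: from scipy.misc import imread only works on older SciPy, since scipy.misc.imread was removed in SciPy 1.2. On a current setup, a drop-in replacement (my suggestion, not part of the original example) is imageio:

# scipy.misc.imread is gone in SciPy >= 1.2; imageio's v2 API offers the same call
from imageio import imread
back_coloring = imread(path.join(d, 'wc_cn/LuXun_color.jpg'))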