Python: Chinese character counting and word segmentation

Reposted article. 2014-03-19 14:21 Linux / Python

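I wanted to segment a passage of text into words, which first requires pinning down how the words relate to one another.

I downloaded a Chinese novel from the web, picked more or less at random: a plain txt file a little over 1 MB. The goal is to find out how many characters that 1 MB of Chinese contains, how many words it segments into, and what part of speech each word has.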

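Here is the approach:

1) Read the novel into memory.

2) Tokenize the novel with a regular expression to get the total number of Chinese characters.

3) POST the novel, paragraph by paragraph, to an API that provides word segmentation, and collect the segmentation results.

4) Extract the words as described in the API documentation.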
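
Step 2 can be sketched on its own. The following is a minimal Python 3 illustration, not the test script itself. It assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so punctuation and the rarer extension blocks are not counted. The name count_han is made up for this example.

```python
import re

# Assumption: a "Chinese character" is a code point in the basic
# CJK Unified Ideographs block (U+4E00..U+9FFF).
HAN = re.compile(r"[\u4e00-\u9fff]")


def count_han(text):
    """Return the number of Chinese characters in a decoded (str) text."""
    return len(HAN.findall(text))


if __name__ == "__main__":
    sample = "Python 中文分詞，一二三。"
    print(count_han(sample))  # prints 7
```

Counting on decoded text rather than on raw bytes keeps the result independent of whether the file was stored as GBK or UTF-8.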

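Materials:

1. GNU/Linux (Debian, Ubuntu 12.04, or Linux Mint 13 preferred)
2. Python (the test script uses urllib2, so Python 2)
3. A Chinese word segmentation API; here we use http://www.vapsec.com/fenci/
4. The documentation for the part-of-speech tags, downloadable from http://vdisk.weibo.com/s/qR7KSFDa9ON or http://ishare.iask.sina.com.cn/f/68191875.html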

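I have written a test script. It accesses the service from a single process only; concurrent testing has not been added yet.

I will add concurrency in later tests.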
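
As a rough idea of what a concurrent version might look like, here is a Python 3 sketch that uses a thread pool. This is only one possible approach and not something the article specifies. The names segment_all, workers, and fake_segment are hypothetical, and the HTTP call is replaced by a stand-in function so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_all(lines, segment, workers=4):
    """Call segment(text) for every non-blank line using a thread pool.

    Results come back in the same order as the non-blank input lines.
    """
    texts = [line.strip() for line in lines if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(segment, texts))


if __name__ == "__main__":
    # Stand-in for the real HTTP request, so the sketch runs offline.
    def fake_segment(text):
        return text.split()

    print(segment_all(["a b\n", "\n", "c d e\n"], fake_segment))
```

Threads suit this workload because each request spends most of its time waiting on the network. In a real test, segment would be the function that posts one line to the service.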

Below is the test script, test.py:

#!/usr/bin/env python
# coding: utf-8
# Python 2 script: it relies on urllib2 and the print statement.
import re
import urllib
import urllib2
from datetime import datetime


def url_post(word='My name is Jake Anderson', geshi="json"):
    """POST one piece of text to the segmentation API and print the raw reply."""
    url = "http://open.vapsec.com/segment/get_word"
    postDict = {
        "word": word,
        "format": geshi
    }
    postData = urllib.urlencode(postDict)
    # Supplying a data argument already makes urllib2 send a POST.
    request = urllib2.Request(url, postData)
    #request.add_header('Authorization', basic)
    response = urllib2.urlopen(request)
    r = response.readlines()
    print r


def utf8_lines(path):
    """Yield each line of the GBK-encoded file at path as a UTF-8 byte string."""
    with open(path, 'r') as f:
        for line in f:
            yield line.decode('gbk').encode('utf8')


if __name__ == "__main__":
    # One token is either an ASCII word or a 3-byte UTF-8 sequence
    # (a Chinese character or a full-width punctuation mark).
    regex = re.compile(r"(?x) (?: [\w-]+ | [\x80-\xff]{3} )")

    # First pass: count the tokens. findall returns the matches themselves;
    # split would return the text between them.
    count = 0
    for line in utf8_lines('novel2.txt'):
        count += len(regex.findall(line))

    # Second pass: send every non-blank line to the API and time the run.
    start_time = datetime.now()
    for line in utf8_lines('novel2.txt'):
        if not line.strip():
            continue
        print line
        url_post(line)
    end_time = datetime.now()
    tdelta = end_time - start_time  # elapsed time, end minus start
    print "It takes " + str(tdelta.total_seconds()) + " seconds to segment " + str(count) + " Chinese characters!"
    print "This means it can segment " + str(count/tdelta.total_seconds()) + " Chinese characters per second!"
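
For readers on Python 3, where urllib2 no longer exists, the same request could be built roughly as follows. This is a sketch only: the URL and the field names "word" and "format" come from the script above, while build_request and API_URL are names invented for the example. The request is built but not sent, so the example runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL and form field names are taken from the test script.
API_URL = "http://open.vapsec.com/segment/get_word"


def build_request(word, fmt="json"):
    """Build (but do not send) the POST request for one piece of text."""
    # urlencode percent-encodes non-ASCII text as UTF-8, so the result is ASCII.
    body = urlencode({"word": word, "format": fmt}).encode("ascii")
    return Request(API_URL, data=body)


if __name__ == "__main__":
    req = build_request("中文分詞")
    print(req.get_method())  # POST, because a data argument is present
    print(req.data)
```

Sending it would be a matter of passing the request to urllib.request.urlopen, which needs network access to the service.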

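novel2.txt is the downloaded novel. It is 1.2 MB in size and contains roughly 580,000 characters.

The novel is GBK-encoded, so after downloading it has to be converted to UTF-8.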
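
The conversion and its effect on size can be illustrated with a short Python 3 sketch (gbk_to_utf8 is a name made up for this example). GBK stores each Chinese character in two bytes, which is consistent with a 1.2 MB file holding somewhat under 600,000 characters. UTF-8 stores the same characters in three bytes each, which is why the regular expression in the script matches three high bytes at a time.

```python
def gbk_to_utf8(raw):
    """Decode GBK bytes and re-encode them as UTF-8 bytes."""
    return raw.decode("gbk").encode("utf-8")


if __name__ == "__main__":
    text = "中文分詞"
    gbk_bytes = text.encode("gbk")
    utf8_bytes = gbk_to_utf8(gbk_bytes)
    print(len(gbk_bytes))   # 8: two bytes per character in GBK
    print(len(utf8_bytes))  # 12: three bytes per character in UTF-8
```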

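When run, the script prints each line of the novel to the terminal, followed by the raw reply from the segmentation service for that line. In this way all the text in the novel is sent to the remote segmentation service.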
