A Python Implementation of a Bayes and KNN Text Classifier for newsgroup18828


Following @yangliuy's NLP work, this post is a Python implementation of his post "Data Mining: A JAVA Implementation of a Bayes and KNN Text Classifier for newsgroup18828 (Part 1)". It is meant as an introductory exercise, with little original material of my own.

1. The dataset

The Newsgroup collection contains roughly 20,000 Usenet documents, split evenly across 20 newsgroups, i.e. 20 folders. The Newsgroup18828 collection used here is a cleaned version in which every document belongs to exactly one newsgroup.
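The scripts below assume the raw collection has been unpacked into a folder named originSample, with one subfolder per newsgroup (the file names shown here are only illustrative):

originSample/
    alt.atheism/
        49960
        51060
        ...
    comp.graphics/
        ...
    (18 more newsgroup folders)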

2. Preprocessing: clean the text of every document, in preparation for building the dictionary and extracting feature words

# -*- coding: utf-8 -*-
from numpy import *
from os import listdir,mkdir,path
import re
from nltk.corpus import stopwords
import nltk
import operator
##############################################################
## 1. Create new folders to hold the preprocessed text data
##############################################################
def createFiles():
    srcFilesList = listdir('originSample')
    for i in range(len(srcFilesList)):
        if i==0: continue # skip the first directory entry
        dataFilesDir = 'originSample/' + srcFilesList[i] # path to each of the 20 source folders
        dataFilesList = listdir(dataFilesDir)
        targetDir = 'processedSample_includeNotSpecial/' + srcFilesList[i] # path to each of the 20 new folders
        if path.exists(targetDir)==False:
            mkdir(targetDir)
        else:
            print '%s exists' % targetDir
        for j in range(len(dataFilesList)):
            createProcessFile(srcFilesList[i],dataFilesList[j]) # createProcessFile() writes the processed text into the new folder
            print '%s %s' % (srcFilesList[i],dataFilesList[j])
##############################################################
## 2. Create the target folder and write the target file
## @param srcFilesName  name of a newsgroup folder, e.g. alt.atheism
## @param dataFilesName name of a data file inside that folder
##############################################################
def createProcessFile(srcFilesName,dataFilesName):
    srcFile = 'originSample/' + srcFilesName + '/' + dataFilesName
    targetFile= 'processedSample_includeNotSpecial/' + srcFilesName\
                + '/' + dataFilesName
    fw = open(targetFile,'w')
    dataList = open(srcFile).readlines()
    for line in dataList:
        resLine = lineProcess(line) # lineProcess() cleans one line of text
        for word in resLine:
            fw.write('%s\n' % word) # one word per line
    fw.close()
##############################################################
## 3. Process one line of text: strip non-letter characters, lower-case, remove stop words, and stem
## @param line the line of text to process
## @return words list of words obtained by splitting on non-letter characters
##############################################################
def lineProcess(line):
    stopwords = nltk.corpus.stopwords.words('english') # stop word list (reloaded on every call; could be hoisted out for speed)
    porter = nltk.PorterStemmer()  # Porter stemmer
    splitter = re.compile('[^a-zA-Z]')  # split on any non-letter character
    words = [porter.stem(word.lower()) for word in splitter.split(line)\
             if len(word)>0 and\
             word.lower() not in stopwords]
    return words
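As a quick sanity check of lineProcess (this assumes the NLTK stopwords corpus has been downloaded, e.g. via nltk.download('stopwords')), it should produce:

>>> lineProcess('The quick brown foxes are Jumping!!')
['quick', 'brown', 'fox', 'jump']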

 

3. Building the dictionary sortedNewWordMap

########################################################
## Count the total number of occurrences of each word
## @return sortedNewWordMap a list of (word, count) pairs sorted by word,
##         keeping only words that occur more than 4 times
#########################################################
def countWords():
    wordMap = {}
    newWordMap = {}
    fileDir = 'processedSample_includeNotSpecial'
    sampleFilesList = listdir(fileDir)
    for i in range(len(sampleFilesList)):
        sampleFilesDir = fileDir + '/' + sampleFilesList[i]
        sampleList = listdir(sampleFilesDir)
        for j in range(len(sampleList)):
            sampleDir = sampleFilesDir + '/' + sampleList[j]
            for line in open(sampleDir).readlines():
                word = line.strip('\n')
                wordMap[word] = wordMap.get(word,0.0) + 1.0
    # keep only words that occur more than 4 times
    for key, value in wordMap.items():
        if value > 4:
            newWordMap[key] = value
    sortedNewWordMap = sorted(newWordMap.iteritems())
    print 'wordMap size : %d' % len(wordMap)
    print 'newWordMap size : %d' % len(sortedNewWordMap)
    return sortedNewWordMap
############################################################
## Write the feature dictionary to a file
###########################################################
def printWordMap():
    print 'Print Word Map'
    countLine=0
    fr = open('D:\\04_Python\\Test1\\Ex2_bayesian\\docVector\\allDicWordCountMap.txt','w')
    sortedWordMap = countWords()
    for item in sortedWordMap:
        fr.write('%s %.1f\n' % (item[0],item[1]))
        countLine += 1
    fr.close()
    print 'sortedWordMap size : %d' % countLine


4. Feature word selection: generate another 20 folders; each document in them keeps only its feature words, i.e. the words in the document that also appear in the dictionary.

#####################################################
## Feature word selection
####################################################
def filterSpecialWords():
    fileDir = 'processedSample_includeNotSpecial'
    wordMapDict = {}
    sortedWordMap = countWords()
    for i in range(len(sortedWordMap)):
        wordMapDict[sortedWordMap[i][0]]=sortedWordMap[i][0] # used as a set of feature words
    sampleDir = listdir(fileDir)
    for i in range(len(sampleDir)):
        # the target folder name must match what createTestSample() reads below
        targetDir = 'processedSampleOnlySpecial' + '/' + sampleDir[i]
        srcDir = 'processedSample_includeNotSpecial' + '/' + sampleDir[i]
        if path.exists(targetDir) == False:
            mkdir(targetDir)
        sample = listdir(srcDir)
        for j in range(len(sample)):
            targetSampleFile = targetDir + '/' + sample[j]
            fr=open(targetSampleFile,'w')
            srcSampleFile = srcDir + '/' + sample[j]
            for line in open(srcSampleFile).readlines():
                word = line.strip('\n')
                if word in wordMapDict: # dict membership test is O(1); .keys() would build a list first
                    fr.write('%s\n' % word)
            fr.close()


5. For each of the ten iterations, create a training set (20 folders) and a test set (20 folders), and generate the labeled ground-truth file. For example, with m = 1000 documents in a class and trainSamplePercent = 0.9, fold k takes the documents with indices between 100k and 100(k+1) as its test set and the rest as training data.

##########################################################
## Create the training set and the test set
## @param indexOfSample index k of the experiment
## @param classifyRightCate file that stores the <doc rightCategory> pairs
##        for the test set of experiment k
## @param trainSamplePercent train/test split ratio
############################################################
def createTestSample(indexOfSample,classifyRightCate,trainSamplePercent=0.9):
    fr = open(classifyRightCate,'w')
    fileDir = 'processedSampleOnlySpecial'
    sampleFilesList=listdir(fileDir)
    for i in range(len(sampleFilesList)):
        sampleFilesDir = fileDir + '/' + sampleFilesList[i]
        sampleList = listdir(sampleFilesDir)
        m = len(sampleList)
        testBeginIndex = indexOfSample * ( m * (1-trainSamplePercent) ) 
        testEndIndex = (indexOfSample + 1) * ( m * (1-trainSamplePercent) )
        for j in range(m):
            # documents whose index falls inside the test interval become test
            # samples; record a <doc category> line for each, one line per file,
            # so the classification results can later be checked against it
            if (j > testBeginIndex) and (j < testEndIndex):
                fr.write('%s %s\n' % (sampleList[j],sampleFilesList[i])) # document name, then its folder name (= its true category)
                targetDir = 'TestSample'+str(indexOfSample)+\
                            '/'+sampleFilesList[i]
            else:
                targetDir = 'TrainSample'+str(indexOfSample)+\
                            '/'+sampleFilesList[i]
            if path.exists(targetDir) == False:
                mkdir(targetDir)
            sampleDir = sampleFilesDir + '/' + sampleList[j]
            sample = open(sampleDir).readlines()
            sampleWriter = open(targetDir+'/'+sampleList[j],'w')
            for line in sample:
                sampleWriter.write('%s\n' % line.strip('\n'))
            sampleWriter.close()
    fr.close()

# Driver: generate the ground-truth files plus the training and test sets
def test():
    for i in range(10):
        classifyRightCate = 'classifyRightCate' + str(i) + '.txt'
        createTestSample(i,classifyRightCate)


6. Naive Bayes implementation

The probability that a test document belongs to a category is proportional to the product of the class-conditional probabilities of the words it contains, times the prior probability of the category:

p(cate|doc) ∝ p(cate) * Π p(word|cate)

When it comes to computing the class-conditional and prior probabilities, naive Bayes has two models:

1) Multinomial model

Works at word granularity: it counts not only whether a feature word occurs, but how many times it occurs.

Class-conditional probability: p(word | cate) = (total occurrences of word across all documents of class cate + 1) / (total word count of class cate + number of distinct feature words in the training set)

Prior probability: p(cate) = (total word count of class cate) / (total word count of the training set)

2) Bernoulli model

Works at document granularity.

Class-conditional probability: p(word | cate) = (number of documents of class cate containing word + 1) / (number of documents of class cate + 2)

Prior probability: p(cate) = (number of documents of class cate) / (total number of documents in the training set)

Quoting @yangliuy: according to Introduction to Information Retrieval, the multinomial model gives higher classification accuracy, so the classifier here uses the multinomial model.
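As a minimal sketch of the multinomial estimates above (the counts are toy numbers, not taken from the dataset):

from math import log

# toy counts for one hypothetical class (illustrative only)
wordCountInCate = 12    # occurrences of the word in the class's documents
wordNumInCate = 5000    # total word count of the class
vocabSize = 30000       # distinct feature words in the training set
totalWordsNum = 100000  # total word count of the training set

# class-conditional probability with Laplace (+1) smoothing
pWordGivenCate = (wordCountInCate + 1.0) / (wordNumInCate + vocabSize)
# prior probability of the class
pCate = float(wordNumInCate) / totalWordsNum
# work in log space so a long product of tiny factors cannot underflow
score = log(pCate) + log(pWordGivenCate)
print 'log p(cate) + log p(word|cate) = %f' % score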

 

6.1 Counting statistics

########################################################################
## Count, over the training set, how many times each word occurs in each
## class folder, and the total word count of each class folder
## @param strDir training set directory
## @return cateWordsProb <category_word, occurrence count of the word>
## @return cateWordsNum <category, total word count>
#########################################################################
def getCateWordsProb(strDir):
    #strDir = TrainSample0 
    cateWordsNum = {}
    cateWordsProb = {}
    cateDir = listdir(strDir)
    for i in range(len(cateDir)):
        count = 0 # total word count of this folder (i.e. this class)
        sampleDir = strDir + '/' + cateDir[i]
        sample = listdir(sampleDir)
        for j in range(len(sample)):
            sampleFile = sampleDir + '/' + sample[j]
            words = open(sampleFile).readlines()
            for line in words:
                count = count + 1
                word = line.strip('\n')                
                keyName = cateDir[i] + '_' + word
                cateWordsProb[keyName] = cateWordsProb.get(keyName,0)+1 # occurrence count of this word in this class
        cateWordsNum[cateDir[i]] = count
        print 'cate %d contains %d' % (i,cateWordsNum[cateDir[i]])
    print 'cate-word size: %d' % len(cateWordsProb)
    return cateWordsProb, cateWordsNum

 


6.2 Classifying the test documents with Bayes

##########################################
## Classify the test documents with naive Bayes
## @param traindir training set directory
## @param testdir  test set directory
## @param classifyResultFileNew file that receives the classification results
##########################################
def NBprocess(traindir,testdir,classifyResultFileNew):
    crWriter = open(classifyResultFileNew,'w')
    # traindir = 'TrainSample0'
    # testdir = 'TestSample0'
    # occurrence count of each word per class, and total word count per class
    cateWordsProb, cateWordsNum = getCateWordsProb(traindir)

    # total word count of the training set
    trainTotalNum = sum(cateWordsNum.values())
    print 'trainTotalNum: %d' % trainTotalNum

    # classify each test sample
    testDirFiles = listdir(testdir)
    for i in range(len(testDirFiles)):
        testSampleDir = testdir + '/' + testDirFiles[i]
        testSample = listdir(testSampleDir)
        for j in range(len(testSample)):
            testFilesWords = []
            sampleDir = testSampleDir + '/' + testSample[j]
            lines = open(sampleDir).readlines()
            for line in lines:
                word = line.strip('\n')
                testFilesWords.append(word)

            maxP = 0.0
            trainDirFiles = listdir(traindir)
            for k in range(len(trainDirFiles)):
                p = computeCateProb(trainDirFiles[k], testFilesWords,\
                                    cateWordsNum, trainTotalNum, cateWordsProb)
                if k==0:
                    maxP = p
                    bestCate = trainDirFiles[k]
                    continue
                if p > maxP:
                    maxP = p
                    bestCate = trainDirFiles[k]
            crWriter.write('%s %s\n' % (testSample[j],bestCate))
    crWriter.close()

#################################################
## @param traindir       class k (its folder name)
## @param testFilesWords the words of one test document
## @param cateWordsNum   <category, total word count> over the training set
## @param totalWordsNum  total word count of the training set
## @param cateWordsProb  <category_word, occurrence count> over the training set
## class-conditional probability = (count of word i in class k + 0.0001) / (total words in class k + total words in the training set)
## prior probability = (total words in class k) / (total words in the training set)
## (the smoothing here follows the original post; it differs slightly from the textbook Laplace form in section 6)
#################################################
def computeCateProb(traindir,testFilesWords,cateWordsNum,\
                    totalWordsNum,cateWordsProb):
    prob = 0
    wordNumInCate = cateWordsNum[traindir]  # total word count of class k
    for i in range(len(testFilesWords)):
        keyName = traindir + '_' + testFilesWords[i]
        if cateWordsProb.has_key(keyName):
            testFileWordNumInCate = cateWordsProb[keyName] # count of word c in class k
        else: testFileWordNumInCate = 0.0
        # take logs so that the product of many tiny probabilities does not underflow
        xcProb = log((testFileWordNumInCate + 0.0001) / \
                 (wordNumInCate + totalWordsNum))
        prob = prob + xcProb
    res = prob + log(wordNumInCate) - log(totalWordsNum)
    return res


7. Computing the accuracy

def computeAccuracy(rightCate,resultCate,k):
    rightCateDict = {}
    resultCateDict = {}
    rightCount = 0.0

    for line in open(rightCate).readlines():
        (sampleFile,cate) = line.strip('\n').split(' ')
        rightCateDict[sampleFile] = cate
        
    for line in open(resultCate).readlines():
        (sampleFile,cate) = line.strip('\n').split(' ')
        resultCateDict[sampleFile] = cate
        
    for sampleFile in rightCateDict.keys():
        # .get() avoids a KeyError if a document was somehow never classified;
        # a missing entry then simply counts as a wrong classification
        if (rightCateDict[sampleFile]==resultCateDict.get(sampleFile)):
            rightCount += 1.0
    print 'rightCount : %d  rightCate: %d' % (rightCount,len(rightCateDict))
    accuracy = rightCount/len(rightCateDict)
    print 'accuracy %d : %f' % (k,accuracy)
    return accuracy

 

8. Putting it all together

#############################################################################
## generate the test sets and ground-truth files for each of the ten iterations
def step1():
    for i in range(10):
        classifyRightCate = 'classifyRightCate' + str(i) + '.txt'
        createTestSample(i,classifyRightCate)
##############################################################################
## classify the test documents with naive Bayes
def step2():
    for i in range(10):       
        traindir = 'TrainSample' + str(i)
        testdir = 'TestSample' + str(i)
        classifyResultFileNew = 'classifyResultFileNew' + str(i) + '.txt'
        NBprocess(traindir,testdir,classifyResultFileNew)
##############################################################################
## compute the per-iteration accuracy
def step3():
    accuracyOfEveryExp = []
    for i in range(10):
        rightCate = 'classifyRightCate'+str(i)+'.txt'
        resultCate = 'classifyResultFileNew'+str(i)+'.txt'
        accuracyOfEveryExp.append(computeAccuracy(rightCate,resultCate,i))
    return accuracyOfEveryExp

Output:

WordMap size: 32189
Cate_Word: 162649
cate alt.atheism  contains 130141.0
cate comp.graphics  contains 145322.0
cate comp.os.ms-windows.misc  contains 348719.0
cate comp.sys.ibm.pc.hardware  contains 96505.0
cate comp.sys.mac.hardware  contains 88902.0
cate comp.windows.x  contains 131896.0
cate misc.forsale  contains 75843.0
cate rec.autos  contains 109281.0
cate rec.motorcycles  contains 99047.0
cate rec.sport.baseball  contains 111705.0
cate rec.sport.hockey  contains 135429.0
cate sci.crypt  contains 147705.0
cate sci.electronics  contains 101945.0
cate sci.med  contains 153708.0
cate sci.space  contains 135170.0
cate soc.religion.christian  contains 174490.0
cate talk.politics.guns  contains 155503.0
cate talk.politics.mideast  contains 219330.0
cate talk.politics.misc  contains 162621.0
cate talk.religion.misc  contains 103775.0
totalWordsNum: 2827037.0
rightCount: 1513.0  resultCate: 1870
The accuracy for Naive Bayesian Classifier in 0th Exp is :0.8090909090909091

 

@yangliuy raises a few points that I find valuable as well:

(1) The JAVA version computes probabilities with the BigDecimal class to get arbitrary-precision arithmetic.

For arbitrary precision in Python, two approaches seem worth considering: working in log space, as computeCateProb() above already does, or the standard library's decimal module. A minimal sketch of the latter (toy values):
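from decimal import Decimal, getcontext

getcontext().prec = 50      # 50 significant digits, similar in spirit to BigDecimal
p = Decimal(1)
for i in range(100):
    p *= Decimal('0.0001')  # this product underflows to 0.0 with ordinary floats
print p                     # Decimal keeps the tiny value (1E-400)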

 


(2) Use cross-validation: run the classification experiment ten times and average the accuracies (a small sketch follows this list).
(3) cateWordsProb uses "category_word" as its key and that word's occurrence count in that category as its value, which avoids counting the same thing twice.

A dictionary arranged this way makes the bookkeeping very convenient.
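For point (2), a minimal sketch of the averaging step, reusing step3() defined above:

accuracyOfEveryExp = step3() # the ten per-fold accuracies
meanAccuracy = sum(accuracyOfEveryExp) / len(accuracyOfEveryExp)
print 'mean accuracy over 10 folds: %f' % meanAccuracy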

