I got stuck at this point while following the code in the book. Since other readers are likely to hit the same problems, I'm writing my notes down to share.
A quick rant first: I suspect the main reason most people get stuck here is the mighty GFW, so readers who can already get over the wall, whether by software or in person, probably have no need for this post anyway. If you'd rather not read on, feel free to apply your wall-climbing skills instead.
How do I install feedparser?
Installing feedparser straight from the URL given in the book fails with an error saying setuptools is missing. Go looking for setuptools and the official advice is that on Windows it is best installed via ez_setup.py, but I couldn't download ez_setup.py from the official site either. This post gives a workaround: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/
Copy that file into the C:\python27 folder and run from the command line: python ez_setup.py install
Then change into the folder holding the feedparser installation files and run: python setup.py install
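To confirm the install worked, a quick check from the Python prompt (the version string shown here is only an example; yours will differ):

>>> import feedparser
>>> feedparser.__version__
'5.1.3'

If the import succeeds without an ImportError, feedparser is ready to use.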
What if the RSS feed the author uses, "http://newyork.craigslist.org/stp/index.rss", is unreachable?
In the book, articles from the feed http://newyork.craigslist.org/stp/index.rss are labeled class 1 and articles from http://sfbay.craigslist.org/stp/index.rss are labeled class 0.
To get the example code running, any two working RSS feeds can stand in for them.
I used these two:
NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss
Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml
That is, if the algorithm runs correctly, every article from NASA should come out as class 1 and every Houston Rockets article from Yahoo Sports as class 0.
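Before wiring the substitute feeds into the classifier, it's worth confirming that both actually parse (a minimal sketch; entry counts vary with whatever the feeds are serving at the moment):

import feedparser

ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
# Each of these feeds gave me only 10-20 entries, not the 100 the book's
# Craigslist feed returns -- which is exactly what causes the next problem.
print len(ny['entries']), len(sf['entries'])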
With my own RSS feeds the program errors out at trainNB0(array(trainMat),array(trainClasses)). What now?
Judging by the book's example, the author's feeds carry plenty of articles: len(ny['entries']) is 100. The feeds I found have only around 10-20 entries each.
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100
Because each of the author's feeds holds 100 articles, his code can remove the 30 highest-frequency words as "stop words" and still randomly hold out 20 articles as a test set. With the substitute feeds we may have only 10 articles yet still try to hold out 20, which obviously fails. Adjusting the test-set size is enough to get the code running; and when the articles contain few words, removing fewer "stop words" improves the classifier's accuracy.
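A minimal sketch of the adjustment inside localWords() (holding out roughly 20% is my own choice, not the book's, which hard-codes 20 documents; random is already in scope there via from numpy import *):

trainingSet = range(2*minLen); testSet = []
numTestDocs = max(1, int(0.2 * len(trainingSet)))  # scale the holdout to the data instead of a fixed 20
for i in range(numTestDocs):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del(trainingSet[randIndex])

The full listing at the end of this post simply uses range(5) for the same reason.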
If I don't want to remove the 30 highest-frequency words, how else can the "stop words" be stripped?
Store the stop words in a txt file and read it in at runtime, replacing the code that removes the high-frequency words. For which words to stop, see http://www.ranks.nl/stopwords
For the code below to run, the stop words need to be saved in stopword.txt.
My txt file holds the following words, and the results were pretty good:
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
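Below is the complete bayes.py I ended up with. Compared with the book's version, stopWords() reads the list above from stopword.txt, the top-30 high-frequency-word removal in localWords() is left commented out, the test set is reduced to 5 documents, and extra print statements trace each step: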
'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)  #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                    #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)     #change to log()
    p0Vect = log(p0Num/p0Denom)     #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)   #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):   #input is big string, output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList=[]; classList=[]; fullText=[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    trainingSet = range(50); testSet=[]     #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:    #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()  # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    stopWordList = [tok.lower() for tok in listOfTokens if len(tok) > 0]    # drop empty tokens
    print 'read stop word from \'stopword.txt\':',stopWordList
    return stopWordList

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList=[]; fullText=[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]   #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:    #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V = localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
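To run it (assuming the listing is saved as bayes.py in the same folder as stopword.txt):

>>> import bayes
>>> bayes.testRSS()

testRSS() trains on the two substitute feeds and prints the per-document results plus the error rate; testTopWords() does the same and then, via getTopWords(), prints the most indicative words for each feed.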