I got stuck at this point while following the code in the book. Since other readers are likely to hit the same problems, I'm writing my notes down to share.
A quick rant first: I suspect the main reason most people get stuck here is the mighty GFW, so readers who can already get over the wall, whether by software or in person, probably have no need for this post anyway. If you'd rather not read on, feel free to apply your wall-climbing skills instead.
How do I install feedparser?
Installing feedparser straight from the URL given in the book fails with an error saying setuptools is missing. Go looking for setuptools and the official advice is that on Windows it is best installed via ez_setup.py, but I couldn't download ez_setup.py from the official site either. This post gives a workaround: http://adesquared.wordpress.com/2013/07/07/setting-up-python-and-easy_install-on-windows-7/
Copy that file into the C:\python27 folder and run from the command line: python ez_setup.py install
Then change into the folder holding the feedparser installation files and run: python setup.py install
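To confirm the install worked, a quick check from the Python prompt (the version string shown here is only an example; yours will differ):

>>> import feedparser
>>> feedparser.__version__
'5.1.3'

If the import succeeds without an ImportError, feedparser is ready to use.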
What if the RSS feed the author uses, "http://newyork.craigslist.org/stp/index.rss", is unreachable?
In the book, articles from the feed http://newyork.craigslist.org/stp/index.rss are labeled class 1 and articles from http://sfbay.craigslist.org/stp/index.rss are labeled class 0.
To get the example code running, any two working RSS feeds can stand in for them.
I used these two:
NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss
Yahoo Sports - NBA - Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml
That is, if the algorithm runs correctly, every article from NASA should come out as class 1 and every Houston Rockets article from Yahoo Sports as class 0.
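Before wiring the substitute feeds into the classifier, it's worth confirming that both actually parse (a minimal sketch; entry counts vary with whatever the feeds are serving at the moment):

import feedparser

ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
# Each of these feeds gave me only 10-20 entries, not the 100 the book's
# Craigslist feed returns -- which is exactly what causes the next problem.
print len(ny['entries']), len(sf['entries'])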
With my own RSS feeds the program errors out at trainNB0(array(trainMat),array(trainClasses)). What now?
Judging by the book's example, the author's feeds carry plenty of articles: len(ny['entries']) is 100. The feeds I found have only around 10-20 entries each.
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
>>> len(ny['entries'])
100
Because each of the author's feeds holds 100 articles, his code can remove the 30 highest-frequency words as "stop words" and still randomly hold out 20 articles as a test set. With the substitute feeds we may have only 10 articles yet still try to hold out 20, which obviously fails. Adjusting the test-set size is enough to get the code running; and when the articles contain few words, removing fewer "stop words" improves the classifier's accuracy.
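A minimal sketch of the adjustment inside localWords() (holding out roughly 20% is my own choice, not the book's, which hard-codes 20 documents; random is already in scope there via from numpy import *):

trainingSet = range(2*minLen); testSet = []
numTestDocs = max(1, int(0.2 * len(trainingSet)))  # scale the holdout to the data instead of a fixed 20
for i in range(numTestDocs):
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del(trainingSet[randIndex])

The full listing at the end of this post simply uses range(5) for the same reason.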
If I don't want to remove the 30 highest-frequency words, how else can the "stop words" be stripped?
Store the stop words in a txt file and read it in at runtime, replacing the code that removes the high-frequency words. For which words to stop, see http://www.ranks.nl/stopwords
For the code below to run, the stop words need to be saved in stopword.txt.
My txt file holds the following words, and the results were pretty good:
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
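Below is the complete bayes.py I ended up with. Compared with the book's version, stopWords() reads the list above from stopword.txt, the top-30 high-frequency-word removal in localWords() is left commented out, the test set is reduced to 5 documents, and extra print statements trace each step: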
'''
Created on Oct 19, 2010
@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'my', 'dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)  #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                    #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)     #change to log()
    p0Vect = log(p0Num/p0Denom)     #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)   #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):   #input is big string, output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList=[]; classList=[]; fullText=[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    trainingSet = range(50); testSet=[]     #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:    #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def stopWords():
    import re
    wordList = open('stopword.txt').read()  # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    stopWordList = [tok.lower() for tok in listOfTokens if len(tok) > 0]    # drop empty tokens
    print 'read stop word from \'stopword.txt\':',stopWordList
    return stopWordList

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList=[]; fullText=[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    #create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]   #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:    #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result = classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V = localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses
    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat
    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
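To run it (assuming the listing is saved as bayes.py in the same folder as stopword.txt):

>>> import bayes
>>> bayes.testRSS()

testRSS() trains on the two substitute feeds and prints the per-document results plus the error rate; testTopWords() does the same and then, via getTopWords(), prints the most indicative words for each feed.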