------------------Email Data Preprocessing------------------
I: Reading the email data
with open('emailSample1.txt', 'r') as fp:
    content = fp.read()  # read the whole file in one call
print(content)
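II: Preprocessing
(1) What the preprocessing does
Preprocessing consists of the following 9 steps:
1. Convert all text to lowercase.
2. Strip all HTML tags, keeping only their content.
3. Replace every URL with the string "httpaddr".
4. Replace every email address with "emailaddr".
5. Replace every dollar sign ($) with "dollar".
6. Replace every number with "number".
7. Reduce every word to its stem (stemming).
8. Remove all non-alphanumeric characters.
9. Drop empty strings ''.
(2) Implementing the preprocessing and reading an email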
import re
import nltk.stem as ns

def preprocessing(email):
    # 1. Convert all text to lowercase.
    email = email.lower()
    # 2. Strip HTML tags, keeping only their content. The pattern matches a
    #    <...> tag whose body contains no further < or >, so each match is
    #    minimal. Other approaches would work too.
    email = re.sub(r"<[^<>]+>", " ", email)
    # 3. Replace every URL with "httpaddr" (match up to the next whitespace).
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    # 4. Replace every email address with "emailaddr"
    #    (runs of non-whitespace on both sides of the @).
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    # 5. Replace every dollar sign ($) with "dollar".
    email = re.sub(r"[\$]+", "dollar", email)
    # 6. Replace every number with "number".
    email = re.sub(r"[0-9]+", "number", email)
    # 7. Stemming; see https://www.jb51.net/article/63732.htm
    #    First split the text into words on whitespace and the punctuation below.
    tokens = re.split(r'[\s\@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
    tokenlist = []
    s = ns.SnowballStemmer('english')  # stemmer object
    for token in tokens:
        # 8. Remove all non-alphanumeric characters.
        token = re.sub(r"[^a-zA-Z0-9]", "", token)
        # 9. Skip empty strings.
        if not len(token):
            continue
        stemmed = s.stem(token)  # e.g. "costs" -> "cost", "expecting" -> "expect"
        tokenlist.append(stemmed)
    return tokenlist
with open('emailSample1.txt', 'r') as fp:
    content = fp.read()
email = preprocessing(content)
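To see what the normalization steps do before any stemming happens, here is a minimal sketch that applies only steps 1 to 6 to a one-line message. The helper name `normalize` and the sample string are made up for illustration and are not part of the original code. Stemming is left out so that the sketch needs only the standard library.

```python
import re

def normalize(text):
    """Apply only the lowercase and regex-substitution steps (1-6)."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)  # URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)         # email addresses
    text = re.sub(r"[\$]+", "dollar", text)                    # dollar signs
    text = re.sub(r"[0-9]+", "number", text)                   # runs of digits
    return text

# Hypothetical sample message, made up for illustration.
sample = "Visit <b>http://example.com</b> or mail Bob@Example.com, only $100!"
print(normalize(sample).split())
# ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'only', 'dollarnumber!']
```

Two details are visible here: the email pattern swallows the trailing comma because it matches any run of non-whitespace, and "$100" becomes the single token "dollarnumber" because no delimiter separates the two replacements.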
(3) Converting an email to a word vector
Convert the extracted words into a feature vector:
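import numpy as np
import pandas as pd

def email2VocabIndices(email, vocab):
    """
    Return the indices in the vocab word list of the words found in the email.
    """
    tokens = set(preprocessing(email))  # words of the preprocessed email
    # Positions in the spam vocabulary of the words that occur in the email
    index_list = [i for i in range(len(vocab)) if vocab[i] in tokens]
    return index_list

def email2FeatureVector(email):
    """
    Convert an email into a word vector.
    """
    # Load the supplied word list (words regarded as indicative of spam)
    df = pd.read_table("vocab.txt", names=['words'])
    vocab = df['words'].values  # 1-D array of words
    vector = np.zeros(vocab.shape[0])
    # Check which vocabulary words occur in the email; the result holds
    # their indices in vocab
    index_list = email2VocabIndices(email, vocab)
    print(index_list)
    # Turn the word indices into a 0/1 vector
    for i in index_list:
        vector[i] = 1
    return vector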
with open('emailSample1.txt', 'r') as fp:
    content = fp.read()
vector = email2FeatureVector(content)
print(vector)
[70, 85, 88, 161, 180, 237, 369, 374, 430, 478, 529, 530, 591, 687, 789, 793, 798, 809, 882, 915, 944, 960, 991, 1001, 1061, 1076, 1119, 1161, 1170, 1181, 1236, 1363, 1439, 1476, 1509, 1546, 1662, 1698, 1757, 1821, 1830, 1892, 1894, 1895]
[0. 0. 0. ... 0. 0. 0.]
print(vector.shape)
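The index lookup and the 0/1 vector can be easier to follow on a toy example. The sketch below uses plain Python lists instead of NumPy so that it runs without dependencies. The function name `words_to_vector`, the five-word vocabulary and the token list are all assumptions made up for illustration.

```python
def words_to_vector(tokens, vocab):
    """Return (indices, vector): the positions of the vocab words that occur
    in tokens, and a 0/1 list with a 1 at each of those positions."""
    present = set(tokens)  # duplicates in the email do not matter
    indices = [i for i, word in enumerate(vocab) if word in present]
    vector = [0] * len(vocab)
    for i in indices:
        vector[i] = 1
    return indices, vector

# Hypothetical five-word vocabulary and token list, made up for illustration.
toy_vocab = ["buy", "cost", "host", "pill", "web"]
toy_tokens = ["how", "much", "it", "cost", "to", "host", "a", "web", "portal", "web"]
indices, vector = words_to_vector(toy_tokens, toy_vocab)
print(indices)  # [1, 2, 4]
print(vector)   # [0, 1, 1, 0, 1]
```

The vector only records whether a vocabulary word is present, not how often it occurs, which is why "web" appearing twice still yields a single 1.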
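------------------Spam Filtering: Choosing Parameters (Linear Kernel)------------------
Note: data/spamTrain.mat contains the vectors obtained by preprocessing the emails (natural language processing).
I: Reading the data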
import scipy.io as sio

data1 = sio.loadmat("spamTrain.mat")
X, y = data1['X'], data1['y'].flatten()
data2 = sio.loadmat("spamTest.mat")
Xtest, ytest = data2['Xtest'], data2['ytest'].flatten()
print(X.shape,y.shape)
print(X)  # each row is one email sample with 1899 features; a 1 means the corresponding word from the spam-related vocabulary was found
print(y)  # one label per email sample; 1 means spam
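The vocabulary used here to judge spam is the same vocab.txt word list used in our preprocessing.
II: Finding the best accuracy and the parameter C (the linear kernel only has C)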
from sklearn import svm

def get_best_params(X, y, Xval, yval):
    # Candidate values for C
    Cvalues = [3, 0.03, 100, 0.01, 0.1, 0.3, 1, 10, 30]  # 9 values
    best_score = 0   # best accuracy found so far
    best_params = 0  # the value of C that achieved it
    for c in Cvalues:
        clf = svm.SVC(C=c, kernel='linear')
        clf.fit(X, y)                  # fit the model on the training set
        score = clf.score(Xval, yval)  # score it on the validation set
        if score > best_score:
            best_score = score
            best_params = c
    return best_score, best_params
best_score, best_params = get_best_params(X, y, Xtest, ytest)
print(best_score, best_params)
clf = svm.SVC(C=best_params, kernel="linear")
clf.fit(X, y)
score_train = clf.score(X, y)
score_test = clf.score(Xtest, ytest)
print(score_train, score_test)
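Note that the call above passes the test set in as the validation set, so the test accuracy printed afterwards was also used to choose C. One common alternative, sketched here as an assumption rather than something this post does, is to hold out part of the training rows for validation. The helper `split_indices` and its parameters are hypothetical. It uses only the standard library and returns row indices.

```python
import random

def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle range(n_samples) and split it into (train_idx, val_idx)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # reproducible shuffle
    n_val = round(n_samples * val_fraction)    # size of the validation part
    return idx[n_val:], idx[:n_val]

train_idx, val_idx = split_indices(10, val_fraction=0.3)
print(len(train_idx), len(val_idx))  # 7 3
```

With the arrays loaded earlier, one could then call get_best_params(X[train_idx], y[train_idx], X[val_idx], y[val_idx]) and keep Xtest and ytest for the final score only.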
III: Using the trained model to find the words most likely to mark an email as spam
best_params = 0.3
clf = svm.SVC(C=best_params, kernel="linear")
clf.fit(X, y)  # train the model
# Load the supplied word list (words regarded as indicative of spam)
df = pd.read_table("vocab.txt", names=['words'])
vocab = df.values  # array
# First get the importance of each feature from the trained model.
# coef_ holds the coefficient (importance) of each feature; see
# https://www.cnblogs.com/xxtalhr/p/11166848.html
print(clf.coef_)
# argsort sorts in ascending order and returns the indices; a step of -1
# reverses them so the largest coefficients come first
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)  # print the indices
for i in range(15):  # the 15 words most likely to mark an email as spam
    print("{}---{:0.6f}".format(vocab[indices[i]], clf.coef_.flatten()[indices[i]]))
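The argsort-and-reverse idiom above can be easier to follow on a tiny example. The sketch below uses plain Python to rank features from the largest coefficient down. The function name `top_k_words` and the words and weights are made up for illustration and are not taken from the trained model.

```python
def top_k_words(words, weights, k):
    """Return the k (word, weight) pairs with the largest weights, descending."""
    # Sort the indices by weight in ascending order, then reverse them;
    # this mirrors np.argsort(...)[::-1].
    order = sorted(range(len(weights)), key=lambda i: weights[i])[::-1]
    return [(words[i], weights[i]) for i in order[:k]]

# Hypothetical words and coefficients, made up for illustration.
toy_words = ["click", "meeting", "free", "thanks"]
toy_weights = [0.42, -0.31, 0.50, -0.12]
for word, weight in top_k_words(toy_words, toy_weights, 2):
    print("{}---{:0.6f}".format(word, weight))
# free---0.500000
# click---0.420000
```

In a linear model with labels 0 and 1, a large positive coefficient pushes the decision toward label 1 (spam here), which is why the largest values are read as the most spam-indicative words.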
------------------Spam Classification------------------
I: Email data
(1) Legitimate (non-spam) emails

> Anyone knows how much it costs to host a web portal ? > Well, it depends on how many visitors you're expecting. This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big.. To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com

Folks, my first time posting - have a bit of Unix experience, but am new to Linux. Just got a new PC at home - Dell box with Windows XP. Added a second hard disk for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went fine except it didn't pick up my monitor. I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4 Ti4200 video card, both of which are probably too new to feature in Suse's default set. I downloaded a driver from the nVidia website and installed it using RPM. Then I ran Sax2 (as was recommended in some postings I found on the net), but it still doesn't feature my video card in the available list. What next? Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice, the whole machine crashes (in Linux, not Windows) - even the on/off switch is inactive, leaving me to reach for the power cable instead. If anyone can help me in any way with these probs., I'd be really grateful - I've searched the 'net but have run out of ideas. Or should I be going for a different version of Linux such as RedHat? Opinions welcome. Thanks a lot, Peter -- Irish Linux Users' Group: ilug@linux.ie http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information. List maintainer: listmaster@linux.ie
(2) Spam emails

Do You Want To Make $1000 Or More Per Week? If you are a motivated and qualified individual - I will personally demonstrate to you a system that will make you $1,000 per week or more! This is NOT mlm. Call our 24 hour pre-recorded number to get the details. 000-456-789 I need people who want to make serious money. Make the call and get the facts. Invest 2 minutes in yourself now! 000-456-789 Looking forward to your call and I will introduce you to people like yourself who are currently making $10,000 plus per week! 000-456-789 3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72

Best Buy Viagra Generic Online Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed! We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers! http://medphysitcstech.ru
II: Getting the word-vector features of each email (via preprocessing)
with open('emailSample1.txt', 'r') as fp:
    content = fp.read()
with open('emailSample2.txt', 'r') as fp:
    content2 = fp.read()
with open('spamSample1.txt', 'r') as fp:
    content3 = fp.read()
with open('spamSample2.txt', 'r') as fp:
    content4 = fp.read()
vector = email2FeatureVector(content)
vector2 = email2FeatureVector(content2)
vector3 = email2FeatureVector(content3)
vector4 = email2FeatureVector(content4)
III: Using the trained model to predict whether each email is spam
res = clf.predict(np.array([vector]))
res2 = clf.predict(np.array([vector2]))
res3 = clf.predict(np.array([vector3]))
res4 = clf.predict(np.array([vector4]))
print(res, res2, res3, res4)