Machine Learning Assignment --- Support Vector Machine (SVM), Part 2: Spam Classification


------------------ Email Data Preprocessing ------------------

1. Reading the email data

with open('emailSample1.txt', 'r') as fp:
    content = fp.read()  # read the entire file at once
    print(content)

2. Preprocessing

(1) Preprocessing steps

Preprocessing consists of the following nine steps:

1. Convert everything to lower case.
2. Strip all HTML tags, keeping only their content.
3. Replace every URL with the string "httpaddr".
4. Replace every email address with "emailaddr".
5. Replace every dollar sign ($) with "dollar".
6. Replace every number with "number".
7. Reduce every word to its stem (stemming).
8. Remove all non-alphanumeric characters.
9. Drop empty strings ('').

(2) Implementing the preprocessing

import re
import nltk.stem as ns

def preprocessing(email):
    # 1. Convert everything to lower case
    email = email.lower()

    # 2. Strip HTML tags, keeping only their content. The pattern matches a
    #    <...> pair whose interior contains no further angle brackets
    #    (minimal match); other approaches work as well.
    email = re.sub(r"<[^<>]+>", " ", email)

    # 3. Replace every URL with "httpaddr" (match until the next whitespace)
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)

    # 4. Replace every email address with "emailaddr"
    #    (non-whitespace on both sides of the @)
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)

    # 5. Replace every dollar sign ($) with "dollar"
    email = re.sub(r"[\$]+", "dollar", email)

    # 6. Replace every number with "number"
    email = re.sub(r"[0-9]+", "number", email)

    # 7. Stemming: reduce each word to its stem
    #    (see https://www.jb51.net/article/63732.htm)
    #    First split the text into tokens on the characters below
    tokens = re.split(r"[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]", email)
    tokenlist = []

    s = ns.SnowballStemmer('english')  # a stemmer object

    for token in tokens:
        # 8. Remove all non-alphanumeric characters
        token = re.sub(r"[^a-zA-Z0-9]", "", token)

        # 9. Skip empty strings
        if not len(token):
            continue

        stemmed = s.stem(token)  # get the stem: costs -> cost, expecting -> expect
        tokenlist.append(stemmed)

    return tokenlist

with open('emailSample1.txt', 'r') as fp:
    content = fp.read()

email = preprocessing(content)
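The substitutions in steps 3-6 can be checked in isolation before running the whole pipeline; a minimal sketch with a made-up sample string:

```python
import re

# made-up sample text to exercise steps 3-6
sample = "Visit http://example.com or mail me@host.com, only $10 for 5 items"

s = sample.lower()                                   # step 1: lower case
s = re.sub(r"(http|https)://[^\s]*", "httpaddr", s)  # step 3: URLs
s = re.sub(r"[^\s]+@[^\s]+", "emailaddr", s)         # step 4: email addresses
s = re.sub(r"[\$]+", "dollar", s)                    # step 5: dollar signs
s = re.sub(r"[0-9]+", "number", s)                   # step 6: numbers

print(s)  # → visit httpaddr or mail emailaddr only dollarnumber for number items
```

Note that "$10" becomes "dollarnumber" because the dollar substitution runs before the number substitution, exactly as in `preprocessing` above.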

(3) Converting the email into a feature vector

Convert the extracted words into a feature vector:

def email2VocabIndices(email, vocab):
    """
    Return the indices in the vocabulary vocab of the words that
    appear in the email.
    """
    token = preprocessing(email)  # preprocessed word list for the email

    # indices of the vocabulary words that occur in the email
    index_list = [i for i in range(len(vocab)) if vocab[i] in token]

    return index_list
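The index-extraction comprehension can be illustrated with a toy vocabulary and token list (both made up):

```python
vocab = ['anyon', 'buy', 'cost', 'dollar', 'number']  # toy stemmed vocabulary
tokens = ['anyon', 'know', 'cost', 'number']          # toy preprocessed email

# same comprehension as in email2VocabIndices
index_list = [i for i in range(len(vocab)) if vocab[i] in tokens]
print(index_list)  # → [0, 2, 4]
```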

import numpy as np
import pandas as pd

def email2FeatureVector(email):
    """
    Convert the email into a feature vector.
    """
    # read the supplied vocabulary (words considered spam indicators)
    df = pd.read_table("vocab.txt", names=['words'])
    vocab = df['words'].values  # 1-D array of words
    vector = np.zeros(vocab.shape[0])

    # check which vocabulary words occur in the email and get their
    # indices (indices into vocab above)
    index_list = email2VocabIndices(email, vocab)
    print(index_list)

    # turn the word indices into a 0/1 vector
    for i in index_list:
        vector[i] = 1

    return vector

with open('emailSample1.txt', 'r') as fp:
    content = fp.read()

vector = email2FeatureVector(content)
print(vector)
Output (the printed index list and feature vector):

[70, 85, 88, 161, 180, 237, 369, 374, 430, 478, 529, 530, 591, 687, 789, 793, 798, 809, 882, 915, 944, 960, 991, 1001, 1061, 1076, 1119, 1161, 1170, 1181, 1236, 1363, 1439, 1476, 1509, 1546, 1662, 1698, 1757, 1821, 1830, 1892, 1894, 1895]
[0. 0. 0. ... 0. 0. 0.]

print(vector.shape)
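Turning the index list into the 0/1 feature vector is just scattering ones into a zero vector; a pure-Python sketch with a made-up vocabulary size:

```python
n_vocab = 10            # toy vocabulary size (the real vocab.txt has 1899 words)
index_list = [1, 4, 7]  # toy indices of matched words

vector = [0] * n_vocab  # start from all zeros
for i in index_list:    # set a 1 at every matched index
    vector[i] = 1

print(vector)  # → [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
```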

------------------ Spam Filter Parameter Tuning (Linear Kernel) ------------------

Note: data/spamTrain.mat contains vectors obtained by preprocessing emails as above (natural-language preprocessing).

1. Loading the data

import scipy.io as sio
from sklearn import svm

data1 = sio.loadmat("spamTrain.mat")
X, y = data1['X'], data1['y'].flatten()

data2 = sio.loadmat("spamTest.mat")
Xtest, ytest = data2['Xtest'], data2['ytest'].flatten()
print(X.shape, y.shape)

print(X)  # each row is one email sample with 1899 features; a feature of 1 means
          # the corresponding word from the spam vocabulary occurs in the email

print(y)  # one entry per email sample; 1 means the email is spam

The vocabulary used to judge spam here is still vocab.txt from the preprocessing step.

2. Finding the best accuracy and the parameter C (a linear kernel has only C)

def get_best_params(X, y, Xval, yval):
    # candidate values for C (the linear kernel has no sigma)
    Cvalues = [3, 0.03, 100, 0.01, 0.1, 0.3, 1, 10, 30]  # 9 candidates

    best_score = 0   # best accuracy found so far
    best_params = 0  # the corresponding C

    for c in Cvalues:
        clf = svm.SVC(C=c, kernel='linear')
        clf.fit(X, y)                  # fit on the training set
        score = clf.score(Xval, yval)  # evaluate on the validation set
        if score > best_score:
            best_score = score
            best_params = c

    return best_score, best_params

best_score, best_params = get_best_params(X, y, Xtest, ytest)
print(best_score, best_params)

clf = svm.SVC(C=best_params, kernel="linear")
clf.fit(X, y)
score_train = clf.score(X, y)
score_test = clf.score(Xtest, ytest)
print(score_train, score_test)
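The selection loop is a plain argmax over the candidate list; here is the same logic with the SVM replaced by a made-up scoring function, so it runs without the .mat files:

```python
Cvalues = [3, 0.03, 100, 0.01, 0.1, 0.3, 1, 10, 30]

def fake_score(c):
    # hypothetical stand-in for clf.score(Xval, yval); peaks at C = 0.3
    return 1.0 - abs(c - 0.3)

best_score, best_params = 0, 0
for c in Cvalues:           # same structure as get_best_params
    score = fake_score(c)
    if score > best_score:
        best_score, best_params = score, c

print(best_params)  # → 0.3
```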

3. Using the trained model to find the words most likely to indicate spam

best_params = 0.3  # the value of C found above

clf = svm.SVC(C=best_params, kernel="linear")
clf.fit(X, y)  # train

# read the supplied vocabulary (words considered spam indicators)
df = pd.read_table("vocab.txt", names=['words'])
vocab = df['words'].values

# get the weight of each feature from the trained model ---- coef_: the
# coefficient (importance) of each feature
# (see https://www.cnblogs.com/xxtalhr/p/11166848.html)
print(clf.coef_)
# argsort sorts ascending, so reverse with [::-1] to get the indices of the
# largest coefficients first
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)  # print the indices

for i in range(15):  # the 15 words most likely to mark an email as spam
    print("{}---{:0.6f}".format(vocab[indices[i]], clf.coef_.flatten()[indices[i]]))
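np.argsort returns indices in ascending order of value, so reversing with [::-1] puts the largest coefficients first; a small check on a made-up coefficient row shaped like clf.coef_:

```python
import numpy as np

coef = np.array([[0.5, -0.2, 0.9, 0.1]])    # toy coefficient row, shape (1, 4)
indices = np.argsort(coef).flatten()[::-1]  # indices, largest coefficient first

print(indices)                     # → [2 0 3 1]
print(coef.flatten()[indices[0]])  # → 0.9, the largest coefficient
```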

------------------ Classifying Emails ------------------

1. Email data

(1) Normal emails

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com
emailSample1.txt
Folks,
 
my first time posting - have a bit of Unix experience, but am new to Linux.

 
Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
fine except it didn't pick up my monitor.
 
I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
Ti4200 video card, both of which are probably too new to feature in Suse's default
set. I downloaded a driver from the nVidia website and installed it using RPM.
Then I ran Sax2 (as was recommended in some postings I found on the net), but
it still doesn't feature my video card in the available list. What next?
 
Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
the whole machine crashes (in Linux, not Windows) - even the on/off switch is
inactive, leaving me to reach for the power cable instead.
 
If anyone can help me in any way with these probs., I'd be really grateful -
I've searched the 'net but have run out of ideas.
 
Or should I be going for a different version of Linux such as RedHat? Opinions
welcome.
 
Thanks a lot,
Peter

-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie
emailSample2.txt

(2) Spam emails

Do You Want To Make $1000 Or More Per Week?

 

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

 

Call our 24 hour pre-recorded number to get the 
details.  

 

000-456-789

 

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

 

000-456-789

 

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

 

000-456-789



3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72
spamSample1.txt
Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru
spamSample2.txt

2. Computing each email's feature vector (via the preprocessing above)

with open('emailSample1.txt','r') as fp:
    content = fp.read()

with open('emailSample2.txt','r') as fp:
    content2 = fp.read()

with open('spamSample1.txt','r') as fp:
    content3 = fp.read()

with open('spamSample2.txt','r') as fp:
    content4 = fp.read()

vector = email2FeatureVector(content)
vector2 = email2FeatureVector(content2)
vector3 = email2FeatureVector(content3)
vector4 = email2FeatureVector(content4)

3. Predicting with the trained model whether each email is spam

res = clf.predict(np.array([vector]))
res2 = clf.predict(np.array([vector2]))
res3 = clf.predict(np.array([vector3]))
res4 = clf.predict(np.array([vector4]))
print(res, res2, res3, res4)  # 1 means spam, 0 means not spam

