In this exercise we first cover some SVM basics, and then use an SVM (Support Vector Machine) to build a spam classifier.
Before starting, here is a brief introduction to the SVM.
① From the logistic regression cost function to the SVM cost function
The hypothesis function of logistic regression is:
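$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$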
hθ(x) takes values in [0, 1]. By convention, when hθ(x) >= 0.5, i.e. θT·x >= 0, we predict y = 1. For example, hθ(x) = 0.6 means we believe there is a 60% probability that y equals 1.
Clearly, to claim y = 1 we want hθ(x) to be as large as possible: the larger hθ(x) is, the higher the probability that y equals 1, and the more confident we are in that prediction. And a large hθ(x) means θT·x far greater than 0. As the course notes put it:
The larger θT·x is, the larger also is hθ(x) = p(y = 1|x; w, b), and thus also the higher our degree of “confidence”
that the label is 1
The same reasoning applies to y = 0: to claim y = 0 we want hθ(x) to be as small as possible, which means θT·x far less than 0.
The cost function of logistic regression is as follows (for ease of discussion, assume there is only one training example, i.e. m = 1):
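$$\mathrm{cost} = -\,y \log h_\theta(x) - (1 - y) \log\bigl(1 - h_\theta(x)\bigr)$$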
From this cost function we can see that when y == 0, only the right-hand term is active; when y == 1 (so 1 − y == 0), only the left-hand term is active.
For y == 1, the logistic regression cost is plotted below; note that the curve is continuous (indeed smooth) over the whole axis.
On top of this y == 1 cost curve we construct a new cost curve, denoted cost1(z) (drawn as two purple line segments, with a kink at z == 1), as shown below. At z == 1 the new cost has a corner: it is still continuous, but no longer differentiable.
Similarly, for y == 0 the logistic regression cost is plotted below; again it is continuous and smooth over the whole axis.
On top of this y == 0 cost curve we construct a new cost curve, denoted cost0(z) (again two purple line segments, with the kink at z == −1), as shown below. At z == −1 the new cost has a corner: continuous but not differentiable.
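Concretely, up to the slope k > 0 of the sloped segment (the course leaves the exact slope unspecified), the two new curves are hinge functions:
$$\mathrm{cost}_1(z) = \max\bigl(0,\; k\,(1 - z)\bigr), \qquad \mathrm{cost}_0(z) = \max\bigl(0,\; k\,(1 + z)\bigr)$$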
Using these two newly constructed curves cost0(z) and cost1(z) (with z = θT·x), we assemble the cost function of the support vector machine (SVM).
Since the number of training examples m is a constant, dividing by it does not change the minimizer, so m is dropped from the SVM cost function.
To summarize, the SVM cost function is:
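$$\min_{\theta}\; C \sum_{i=1}^{m} \Bigl[\, y^{(i)}\, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\, \mathrm{cost}_0(\theta^T x^{(i)}) \,\Bigr] \;+\; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$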
For the SVM, y == 1 requires θT·x >= 1, and y == 0 requires θT·x <= -1.
Compared with logistic regression, when the label y equals 1 the SVM requires θT·x >= 1 rather than merely >= 0. This is a stricter condition, an extra layer of safety margin.
The parameter C in the SVM cost function plays a role analogous to 1/λ in regularized logistic regression: a large C means weak regularization.
Since our goal is to minimize the cost function, when C is very large the term multiplied by C must be very close to 0 for the cost to be small. In the limit of very large C, the cost function is therefore equivalent to: min (1/2)·Σθj².
② The SVM decision boundary
Compared with logistic regression, the SVM can handle more complex nonlinear classification problems. Let us first discuss how to choose a better decision boundary in the linearly separable case.
For a given dataset there may be many different decision boundaries that separate the samples. The figure below shows three of them, but intuitively the black one separates the data best.
The advantage of the black decision boundary is that it has a larger margin. This is the defining trait of the SVM classifier: it always tries to find the decision boundary with the largest margin, which is why the SVM is also called a Large Margin Classifier.
For the data in the figure below (note the one "peculiar" red cross, an outlier), how will the SVM choose its decision boundary?
As long as the parameter C in the SVM cost function is not too large, the SVM will stick with the black decision boundary rather than switch to the purple one.
When C is very large, however, the SVM is likely to choose the purple decision boundary. In practice C is not set extremely large, so even when "outlier" samples appear, the SVM keeps the black decision boundary and thus has a degree of tolerance to noise.
③ Why is the SVM a large margin classifier? (Why Large Margin?)
Assume C is very large, so the cost function reduces to min (1/2)·Σθj², which can be written in vector form as min (1/2)·θT·θ,
because Σθj² = θ1² + θ2² + .... + θn² = θT·θ = ||θ||².
We have thus turned the cost function into the squared norm of the vector θ, illustrated in the figure below (for n = 2).
Minimizing this cost function is subject to the following constraints:
θT·x(i) >= 1 if y==1
θT·x(i) <= -1 if y==0
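Written as a single optimization problem:
$$\min_{\theta}\; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 = \frac{1}{2}\,\lVert \theta \rVert^2 \quad \text{subject to} \quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0$$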
Recall the relation between the inner product and projection: for two vectors a and b with angle φ between them, the inner-product formula gives a·b = ||a||·||b||·cos φ, and ||b||·cos φ is exactly the projection of b onto a.
Using projections, θT·x(i) = p(i)·||θ||, where p(i) is the (signed) projection of the vector x(i) onto the direction of the vector θ, as illustrated below.
The constraints of the optimization problem can therefore be rewritten in terms of projections:
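$$p^{(i)}\,\lVert \theta \rVert \ge 1 \ \text{if } y^{(i)} = 1, \qquad p^{(i)}\,\lVert \theta \rVert \le -1 \ \text{if } y^{(i)} = 0$$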
To minimize the cost function (1/2)·Σθj² we want ||θ|| to be as small as possible, while still satisfying θT·x(i) >= 1 if y == 1 and θT·x(i) <= -1 if y == 0.
Since θT·x(i) = p(i)·||θ||, for θT·x(i) to reach 1 we want p(i) to be as large as possible: the larger p(i) is, the easier it is for p(i)·||θ|| to be at least 1.
(We cannot simply make ||θ|| large instead, because a large ||θ|| makes the cost function large, and our goal is to minimize the cost function.)
So the goal is now to make p(i) as large as possible. What does p(i) represent? It is the margin, and this is exactly why the SVM is called a large margin classifier.
Why does p(i) represent the margin? See the figure below:
The red crosses and the circles are training samples, separated by a green line (the decision boundary). The vector θ is perpendicular to the decision boundary (the boundary is the set of points with θT·x = 0, all of which are orthogonal to θ).
For the red-cross sample x(1), the geometric margin is p(1); for the circle sample x(2), the geometric margin is p(2).
In the figure above, both p(1) and p(2) are fairly short. For p(i)·||θ|| to be >= 1 or <= -1, ||θ|| would then have to be large; but a large ||θ|| means a large cost, and minimizing the SVM cost means finding parameters (θ1, θ2, ..., θn) that make the cost as small as possible. Hence the SVM will not choose a decision boundary with a small projection p(1).
Now consider an example where the projections p(i) are longer:
Again the red crosses and circles are training samples separated by a green decision boundary, which this time is exactly the y-axis (a vertical line). The projection p(1) of the red-cross sample x(1) onto θ is simply the x-coordinate of x(1), and it is longer than its projection onto the slanted green decision boundary would be.
Therefore the SVM chooses this vertical green decision boundary as the classifier. Its margin is illustrated below:
In the discussion above, we first assumed that the parameter C in the SVM cost function is very large, leaving only the θ term (min (1/2)·Σθj²). Making the projections of the samples x(i) onto the vector θ as large as possible then makes the margin large, so the SVM selects the large-margin decision boundary. Next we discuss from another angle why a large margin is preferred (following cs229-notes3.pdf).
First consider the linearly separable example in the figure below:
If we can find a decision boundary such that all points are as far away from it as possible, we have stronger grounds for predicting y == 1 or y == 0.
For instance, compared with point C, we are far more justified in believing that point A is positive; that is, we are more confident predicting y == 1 for A than predicting y == 1 for C.
For simplicity, we consider only the linearly separable binary classification problem, and label the two classes with y == 1 or y == -1. As the notes put it: "we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels".
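In the notation of cs229-notes3, the classifier is hw,b(x) = g(wT·x + b), and the functional margin of a training example (x(i), y(i)) is defined as:
$$\hat{\gamma}^{(i)} = y^{(i)} \bigl( w^T x^{(i)} + b \bigr)$$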
(To be completed.)
④ Kernels (mainly the Gaussian kernel) (to be completed)
We’ll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces.
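Although this subsection is unfinished, the exercise's gaussianKernel.m computes the RBF similarity between two examples. A minimal sketch, following the interface used by svmTrain below:

function sim = gaussianKernel(x1, x2, sigma)
%GAUSSIANKERNEL returns the RBF similarity between x1 and x2:
%   sim = exp(-||x1 - x2||^2 / (2*sigma^2))
x1 = x1(:); x2 = x2(:);   % make sure both are column vectors
sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));
end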
Using an SVM to build a spam classifier
ⓐ Preprocess the input data (emails) and construct the input features
Suppose we have the following sample email:
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100.
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2
if youre running something big..
To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com
This sample email contains a URL, an email address, numbers, and a dollar amount. These features are specific to this particular email: other emails may also contain URLs, email addresses, and amounts, just with different values. We therefore "normalize" these values. That is, we no longer care which URL appears or what the amount is; we only care whether a URL appears, whether an email address appears, and so on. The preprocessing strategy is as follows:
For example, every URL is replaced with the string "httpaddr", every email address is replaced with the string "emailaddr", all upper-case letters in the email are converted to lower case, and so on.
After preprocessing, the email looks like this:
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
In this exercise we use a vocabulary of 1899 words, so the input features form a 1899-dimensional vector.
If a word of the preprocessed email appears in the vocabulary, the corresponding entry of the feature vector is 1; otherwise it is 0.
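This mapping is implemented by the exercise's emailFeatures function. A minimal sketch (assuming word_indices is the index list returned by processEmail below):

function x = emailFeatures(word_indices)
%EMAILFEATURES converts a list of vocabulary indices into a
%binary feature vector over the 1899-word vocabulary
n = 1899;              % total number of words in the vocabulary
x = zeros(n, 1);       % one entry per vocabulary word
x(word_indices) = 1;   % mark every word that occurs in the email
end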
processEmail.m performs the preprocessing described above; its code is as follows:

function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices
%   word_indices = PROCESSEMAIL(email_contents) preprocesses
%   the body of an email and returns a list of indices of the
%   words contained in the email.
%

% Load Vocabulary
vocabList = getVocabList();

% Init return value
word_indices = [];

% ========================== Preprocess Email ===========================

% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers
% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with >, does not
% contain any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');

% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try str = porterStemmer(strtrim(str));
    catch str = ''; continue;
    end;

    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end

    % Look up the word in the dictionary and add to word_indices if found
    % ====================== YOUR CODE HERE ======================
    % Instructions: Fill in this function to add the index of str to
    %               word_indices if it is in the vocabulary. At this point
    %               of the code, you have a stemmed word from the email in
    %               the variable str. You should look up str in the
    %               vocabulary list (vocabList). If a match exists, you
    %               should add the index of the word to the word_indices
    %               vector. Concretely, if str = 'action', then you should
    %               look up the vocabulary list to find where in vocabList
    %               'action' appears. For example, if vocabList{18} =
    %               'action', then, you should add 18 to the word_indices
    %               vector (e.g., word_indices = [word_indices ; 18]; ).
    %
    % Note: vocabList{idx} returns the word with index idx in the
    %       vocabulary list.
    %
    % Note: You can use strcmp(str1, str2) to compare two strings (str1 and
    %       str2). It will return 1 only if the two strings are equivalent.
    %

    % Linear scan over the vocabulary; record the index on a match
    for i = 1:length(vocabList)
        if (strcmp(str, vocabList{i}) == 1)
            word_indices = [word_indices; i];
            break;
        end
    end

    % =============================================================

    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end
The preprocessing result is as follows:
Length of feature vector: 1899
Number of non-zero entries: 45
This shows that 45 of the words in the email above appear in the vocabulary.
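These two numbers come from running the pipeline on the sample email, roughly as the exercise script does (readFile and emailFeatures are part of the starter code):

% Extract the feature vector for the sample email
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));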
ⓑ Train the SVM
spamTrain.mat contains 4000 training emails (both spam and non-spam), and spamTest.mat contains 1000 test examples. The training code is as follows:
%% =========== Part 3: Train Linear SVM for Spam Classification ========
% In this section, you will train a linear classifier to determine if an
% email is Spam or Not-Spam.
% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
svmTrain implements a (simplified) SMO algorithm; the code of svmTrain.m is as follows:

function [model] = svmTrain(X, Y, C, kernelFunction, ...
                            tol, max_passes)
%SVMTRAIN Trains an SVM classifier using a simplified version of the SMO
%algorithm.
%   [model] = SVMTRAIN(X, Y, C, kernelFunction, tol, max_passes) trains an
%   SVM classifier and returns trained model. X is the matrix of training
%   examples. Each row is a training example, and the jth column holds the
%   jth feature. Y is a column matrix containing 1 for positive examples
%   and 0 for negative examples. C is the standard SVM regularization
%   parameter. tol is a tolerance value used for determining equality of
%   floating point numbers. max_passes controls the number of iterations
%   over the dataset (without changes to alpha) before the algorithm quits.
%
% Note: This is a simplified version of the SMO algorithm for training
%       SVMs. In practice, if you want to train an SVM classifier, we
%       recommend using an optimized package such as:
%
%           LIBSVM   (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
%           SVMLight (http://svmlight.joachims.org/)
%

if ~exist('tol', 'var') || isempty(tol)
    tol = 1e-3;
end

if ~exist('max_passes', 'var') || isempty(max_passes)
    max_passes = 5;
end

% Data parameters
m = size(X, 1);
n = size(X, 2);

% Map 0 to -1
Y(Y==0) = -1;

% Variables
alphas = zeros(m, 1);
b = 0;
E = zeros(m, 1);
passes = 0;
eta = 0;
L = 0;
H = 0;

% Pre-compute the Kernel Matrix since our dataset is small
% (in practice, optimized SVM packages that handle large datasets
%  gracefully will _not_ do this)
%
% We have implemented optimized vectorized version of the Kernels here so
% that the svm training will run faster.
if strcmp(func2str(kernelFunction), 'linearKernel')
    % Vectorized computation for the Linear Kernel
    % This is equivalent to computing the kernel on every pair of examples
    K = X*X';
elseif strfind(func2str(kernelFunction), 'gaussianKernel')
    % Vectorized RBF Kernel
    % This is equivalent to computing the kernel on every pair of examples
    X2 = sum(X.^2, 2);
    K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
    K = kernelFunction(1, 0) .^ K;
else
    % Pre-compute the Kernel Matrix
    % The following can be slow due to the lack of vectorization
    K = zeros(m);
    for i = 1:m
        for j = i:m
             K(i,j) = kernelFunction(X(i,:)', X(j,:)');
             K(j,i) = K(i,j); % the matrix is symmetric
        end
    end
end

% Train
fprintf('\nTraining ...');
dots = 12;
while passes < max_passes,

    num_changed_alphas = 0;
    for i = 1:m,

        % Calculate Ei = f(x(i)) - y(i) using (2).
        % E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);
        E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);

        if ((Y(i)*E(i) < -tol && alphas(i) < C) || ...
            (Y(i)*E(i) > tol && alphas(i) > 0)),

            % In practice, there are many heuristics one can use to select
            % the i and j. In this simplified code, we select them randomly.
            j = ceil(m * rand());
            while j == i,  % Make sure i \neq j
                j = ceil(m * rand());
            end

            % Calculate Ej = f(x(j)) - y(j) using (2).
            E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);

            % Save old alphas
            alpha_i_old = alphas(i);
            alpha_j_old = alphas(j);

            % Compute L and H by (10) or (11).
            if (Y(i) == Y(j)),
                L = max(0, alphas(j) + alphas(i) - C);
                H = min(C, alphas(j) + alphas(i));
            else
                L = max(0, alphas(j) - alphas(i));
                H = min(C, C + alphas(j) - alphas(i));
            end

            if (L == H),
                % continue to next i.
                continue;
            end

            % Compute eta by (14).
            eta = 2 * K(i,j) - K(i,i) - K(j,j);
            if (eta >= 0),
                % continue to next i.
                continue;
            end

            % Compute and clip new value for alpha j using (12) and (15).
            alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;

            % Clip
            alphas(j) = min (H, alphas(j));
            alphas(j) = max (L, alphas(j));

            % Check if change in alpha is significant
            if (abs(alphas(j) - alpha_j_old) < tol),
                % continue to next i.
                % replace anyway
                alphas(j) = alpha_j_old;
                continue;
            end

            % Determine value for alpha i using (16).
            alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));

            % Compute b1 and b2 using (17) and (18) respectively.
            b1 = b - E(i) ...
                 - Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
                 - Y(j) * (alphas(j) - alpha_j_old) *  K(i,j)';
            b2 = b - E(j) ...
                 - Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
                 - Y(j) * (alphas(j) - alpha_j_old) *  K(j,j)';

            % Compute b by (19).
            if (0 < alphas(i) && alphas(i) < C),
                b = b1;
            elseif (0 < alphas(j) && alphas(j) < C),
                b = b2;
            else
                b = (b1+b2)/2;
            end

            num_changed_alphas = num_changed_alphas + 1;

        end

    end

    if (num_changed_alphas == 0),
        passes = passes + 1;
    else
        passes = 0;
    end

    fprintf('.');
    dots = dots + 1;
    if dots > 78
        dots = 0;
        fprintf('\n');
    end
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
end
fprintf(' Done! \n\n');

% Save the model
idx = alphas > 0;
model.X = X(idx,:);
model.y = Y(idx);
model.kernelFunction = kernelFunction;
model.b = b;
model.alphas = alphas(idx);
model.w = ((alphas.*Y)'*X)';

end
The training results are as follows:
Training Accuracy: 99.850000
Evaluating the trained Linear SVM on a test set ...
Test Accuracy: 98.900000
ⓒ Classify emails with the trained SVM
%% =================== Part 6: Try Your Own Emails =====================
% Now that you've trained the spam classifier, you can use it on your own
% emails! In the starter code, we have included spamSample1.txt,
% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
% The following code reads in one of these emails and then uses your
% learned SVM classifier to determine whether the email is Spam or
% Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different email types). Try your own emails as well!
filename = 'emailSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');
Original post: http://www.cnblogs.com/hapjin/p/6140646.html