This article draws on a number of web pages, chiefly:
http://blog.csdn.net/zouxy09/article/details/20319673
http://www.wbrecom.com/?p=394
http://www.cnblogs.com/EE-NovRain/p/3810737.html
Below are some materials I have collected:
http://pan.baidu.com/s/1o72JoJg
Logistic Regression
- What is logistic regression?
- Solving logistic regression
- Gradient descent
- Regularization
- Online gradient descent
- FOBOS
- RDA
- FTRL
What is logistic regression?
One day, Xiao Ming, a hapless nerd from the CS department, was walking down the street when he spotted a girl from the music school and felt his heart flutter. He asked around and found out she was single, excellent. But Xiao Ming had little self-confidence: he had no idea what kind of guy she liked, and how crushing it would be if things ended in tragedy. In the course of his inquiries, though, he learned she had had several boyfriends before, and he picked up some of their traits: all over 1.7 m tall, all fairly well-off, and one top student had once pursued her only to be turned down, and so on. So Xiao Ming set out to build a classifier to predict whether the girl would like him. Not daring to experiment on himself at first, he ran predictions for a few labmates, and found that the output was only "like" or "dislike", far too coarse. How much better it would be to get a value measuring the degree to which she would like someone. And so Xiao Ming turned to logistic regression, which turns the degree of liking into a value between 0 and 1. He then found that his own predicted score was 0.7, while none of his labmates scored above 0.5. Excellent, time to launch the offensive.
Here, the attributes of the girl's exes, height, wealth, being a top student, owning a car, and so on, can all be represented as features \(x_{1},x_{2},\dots,x_{n}\). And since the girl insists on a height above 1.7 m, the height feature is especially important; we use \(w_{i}\) to denote the weight feature \(i\) carries in her mind.
Everything above is of course just banter, a way to lead into logistic regression with an example. In general, if an event occurs with probability p and fails to occur with probability 1-p, then the odds of the event are \(\frac{p}{1-p}\), and the log odds are modeled linearly as
\(\log \frac{p}{1-p}=w_{1}x_{1}+\dots+w_{n}x_{n}=w^{T}x\)
Solving this for p, the probability \(p(y=1|x)\) that the event occurs is
\(p(y=1|x)=\frac{1}{1+e^{-w^{T}x}}=\sigma(w^{T}x)\)
This is logistic regression; the function \(\sigma(\cdot)\) is the sigmoid function.
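As a quick sanity check of the two formulas above, here is a minimal sketch in plain Python; the weights and features are made-up values for illustration, not from any real data:

```python
from math import exp

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + exp(-z))

def predict(w, x):
    # p(y = 1 | x) = sigmoid(w^T x)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# hypothetical weights/features (say: height, wealth, top student)
w = [1.5, 0.8, -0.3]
x = [1.0, 0.5, 1.0]
print(predict(w, x))  # a "degree of liking" in (0, 1)
```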
Solving logistic regression
Suppose we now have n independent training samples \(\{(\mathbf{x}_{i},y_{i})\}\), \(y_{i}\in\{0,1\}\), where \(\mathbf{x}_{i}\) is the feature vector. Under a logistic regression model, the probability \(p(\mathbf{x}_{i},y_{i})\) of this sample occurring is
\(p(\mathbf{x}_{i},y_{i})=p(y_{i}=1|\mathbf{x}_{i})^{y_{i}}\,(1-p(y_{i}=1|\mathbf{x}_{i}))^{1-y_{i}}\)
Because the training samples are independent, the probability of all n samples occurring together is the product
\(\prod_{i=1}^{n} p(\mathbf{x}_{i},y_{i})=\prod_{i=1}^{n}p(y_{i}=1|\mathbf{x}_{i})^{y_{i}}\,(1-p(y_{i}=1|\mathbf{x}_{i}))^{1-y_{i}}\)
The formula above is the likelihood function. Intuitively, the larger the likelihood, the more probable these observed samples are, and the better the model matches the data. We usually work with the log-likelihood instead:
\(L(w)=\log \prod_{i=1}^{n} p(\mathbf{x}_{i},y_{i})=\sum_{i=1}^{n}\left[y_{i}\log p(y_{i}=1|\mathbf{x}_{i})+(1-y_{i})\log (1-p(y_{i}=1|\mathbf{x}_{i}))\right]\)
Then, after a flurry of differentiation (substituting \(p(y_{i}=1|\mathbf{x}_{i})=\sigma(w^{T}x_{i})\) and using \(\sigma'(z)=\sigma(z)(1-\sigma(z))\)), we get
\(\frac{\partial L}{\partial w}=\sum_{i=1}^{n}(y_{i}-\sigma (w^{T}x_{i}))x_{i}\)
Now let \(l(w,z) = -L(w)\) denote the negative log-likelihood over the data z; the task turns into an optimization problem:
\(w = \arg \min_{w}l(w,z)\)
Setting the derivative above to 0 and solving for w would be ideal, but no closed-form solution exists here, so we approach the optimum through repeated iteration instead.
Gradient descent
The point of gradient descent is that a function's value decreases fastest along the direction opposite its gradient. Gradient descent comes in two flavors: batch gradient descent and stochastic gradient descent.
Batch gradient descent (written here as an ascent step on the log-likelihood \(L\), which is the same as descending the negative log-likelihood \(l\)):
\(w_{j}=w_{j-1} +\alpha \sum_{i=1}^{n}(y_{i}-\sigma (w^{T}x_{i}))x_{i}\)
In this formula, \(\alpha\) is the learning rate (step size). Too large, and the iterates quickly reach the neighborhood of the optimum but cannot close in on it; too small, and convergence is painfully slow.
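Here is a minimal numpy sketch of this batch update; the toy data, learning rate, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.01, iters=1000):
    # X: (n, d) feature matrix, y: (n,) labels in {0, 1}
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        # gradient of the log-likelihood over the whole training set
        grad = X.T @ (y - sigmoid(X @ w))
        w += alpha * grad
    return w

# made-up toy data: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = batch_gradient_ascent(X, y)
```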
Each iteration of batch gradient descent must sweep the whole sample set, which takes a long time when the sample count is large. If each iteration uses only a single sample, we save that time; this is stochastic gradient ascent:
\(w_{j}=w_{j-1} +\alpha(y_{i}-\sigma (w^{T}x_{i}))x_{i}\)
Comparing stochastic gradient ascent with batch gradient ascent:
- If a single optimum exists, batch gradient ascent finds it, while stochastic gradient ascent mostly ends up in its neighborhood.
- If multiple optima exist, batch gradient ascent is prone to getting stuck in a local one, while stochastic gradient ascent has a chance of jumping out of it.
- In stochastic gradient ascent the step size is gradually decreased as the iterations proceed; with smaller and smaller steps, the iterates close in on the optimum instead of forever bouncing around it because the steps are too large.
- Stochastic gradient ascent is well suited to incremental (online) computation; see the sketch after this list.
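Here is a minimal numpy sketch of stochastic gradient ascent with a decaying step size; the schedule \(\alpha_{t}=\alpha_{0}/(1+t)\) is one common choice among many:

```python
import numpy as np

def sgd_logistic(X, y, alpha0=0.5, epochs=5):
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):   # visit samples in random order
            alpha = alpha0 / (1.0 + t)       # step size decays over time
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += alpha * (y[i] - p) * X[i]   # update from a single sample
            t += 1
    return w
```

Because each update touches only one sample, the same inner loop can consume examples as they arrive in an online stream.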
Regularization
If a model fits very well during training but performs poorly when generalizing (predicting), we have overfitting. Overfitting arises easily when the data dimensionality is high but the amount of data is small. The cure is regularization; for logistic regression two kinds are common, \(l1\) regularization and \(l2\) regularization.
\(l1\) regularization: \(\varphi (w) =\left \| w \right \| _{1}= \sum_{j} \left|w_{j}\right|\)
\(l2\) regularization: \(\varphi (w) =\left \| w \right \| _{2}^{2}= \sum_{j} w_{j}^{2} = w^{T}w\)
and the optimization problem becomes
\(w = \arg \min_{w}\left[l(w,z) + \lambda \varphi(w)\right]\)
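To make the regularized objective concrete: with the \(l2\) penalty, the stochastic update just gains a shrinkage term, since the gradient of \(\lambda w^{T}w\) is \(2\lambda w\). A minimal sketch, with an illustrative \(\lambda\):

```python
import numpy as np

def sgd_logistic_l2(X, y, lam=0.01, alpha0=0.5, epochs=5):
    # descends on l(w, z) + lam * w^T w one sample at a time;
    # the penalty gradient 2 * lam * w shrinks every weight toward 0
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            alpha = alpha0 / (1.0 + t)
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += alpha * ((y[i] - p) * X[i] - 2.0 * lam * w)
            t += 1
    return w
```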
Both \(l1\) and \(l2\) regularization keep W from developing components with very large absolute values, guarding against overfitting. But they differ in some respects:
Under batch gradient descent, \(l1\) regularization yields sparse solutions: many components of W become exactly 0, and these are usually the weakly relevant dimensions, so in high-dimensional problems it doubles as feature selection.
Sparsity is an important goal in its own right when running logistic regression: in ad CTR prediction, every sample is extremely high-dimensional, and sparse solutions save a great deal of computation.
But batch gradient descent, as noted above, is slow, while stochastic gradient descent rarely produces a genuinely sparse solution even with \(l1\) regularization.
One fix is to set any component of W whose magnitude falls below some threshold straight to 0. This is called simple truncation; it is crude and brutal, and can wipe out dimensions that are actually needed. Is there a gentler way? That is the truncated gradient method [1].
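Here is a minimal sketch of the truncation operator at the heart of the truncated gradient method [1]; in the full algorithm it is applied only once every K steps, with the pull strength proportional to K, and the constants below are illustrative:

```python
import numpy as np

def truncate(w, gravity, theta):
    # T1 operator from [1]: coordinates inside [-theta, theta] are pulled
    # toward 0 by at most `gravity`; larger coordinates are left untouched,
    # so useful weights are not wiped out the way hard thresholding would
    w = w.copy()
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(0.0, np.abs(w[small]) - gravity)
    return w

# inside an SGD loop one would apply, every K steps (illustrative constants):
#     w = truncate(w, gravity=K * alpha * lam, theta=0.1)
# where alpha is the step size and lam the l1 strength
```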
FOBOS
FOBOS [2], forward-backward splitting, breaks each iteration into a forward gradient step followed by a backward proximal step, so the update depends not only on the iterate before the step but also on the iterate after it:
\(w_{t+\frac{1}{2}}=w_{t}-\eta_{t}g_{t}\)
\(w_{t+1}=\arg \min_{w}\left\{\frac{1}{2}\left\|w-w_{t+\frac{1}{2}}\right\|^{2}+\eta_{t+\frac{1}{2}}\varphi(w)\right\}\)
Here \(g_{t}\) is the gradient of the loss at \(w_{t}\). In the second formula, the left term keeps the new solution from drifting too far from the previous one, while the second term, the regularizer, is what produces sparse solutions.
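For \(\varphi(w)=\lambda \left \| w \right \|_{1}\), the backward step has a closed form known as coordinate-wise soft-thresholding. A minimal sketch of one FOBOS iteration under that assumption:

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    # forward step: a plain gradient-descent move
    v = w - eta * grad
    # backward step for phi(w) = lam * ||w||_1: soft-thresholding;
    # any coordinate with |v_j| <= eta * lam becomes exactly 0
    return np.sign(v) * np.maximum(0.0, np.abs(v) - eta * lam)
```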
RDA
RDA (Regularized Dual Averaging) iterates as follows:
\(w_{t+1}=\arg \min_{w}\left\{\frac{1}{t}\sum_{r=1}^{t}\left\langle g_{r},w\right\rangle+\varphi(w)+\frac{\beta_{t}}{t}h(w)\right\}\)
The first part of the formula is the average of all past gradients, the second part is the regularization term, and the third part is an additional regularizer, where \(h(w)\) is a strongly convex function and \(\{\beta_{t}\}\) a nonnegative, nondecreasing sequence.
Compared with FOBOS, RDA produces noticeably sparser solutions, but its accuracy tends to be somewhat lower.
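For \(\varphi(w)=\lambda \left \| w \right \|_{1}\) and \(h(w)=\frac{1}{2}\left \| w \right \|_{2}^{2}\) with \(\beta_{t}=\gamma \sqrt{t}\), the RDA update has a per-coordinate closed form; a minimal sketch with illustrative \(\gamma\) and \(\lambda\):

```python
import numpy as np

def l1_rda_update(gbar, t, lam=0.1, gamma=1.0):
    # gbar: running average of all past gradients, shape (d,)
    # a coordinate stays at exactly 0 as long as its average gradient
    # lies inside [-lam, lam], regardless of the current weight value;
    # this is where RDA's stronger sparsity comes from
    w = np.zeros_like(gbar)
    active = np.abs(gbar) > lam
    w[active] = -(np.sqrt(t) / gamma) * (gbar[active] - lam * np.sign(gbar[active]))
    return w
```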
FTRL
FTRL combines FOBOS and RDA, retaining both the accuracy of the former and the sparsity of the latter. The iteration is:
\(w_{t+1}=\arg \min_{w}\left\{g_{1:t}\cdot w+\frac{1}{2}\sum_{s=1}^{t}\sigma_{s}\left\|w-w_{s}\right\|_{2}^{2}+\lambda_{1}\left\|w\right\|_{1}+\frac{\lambda_{2}}{2}\left\|w\right\|_{2}^{2}\right\}\)
where \(g_{1:t}=\sum_{s=1}^{t}g_{s}\) and the \(\sigma_{s}\) are chosen so that \(\sum_{s=1}^{t}\sigma_{s}=1/\eta_{t}\). After some rearranging, the update reduces to a per-coordinate closed form:
\(w_{t+1,i}=\begin{cases}0 & \text{if } \left|z_{t,i}\right|\leq \lambda_{1}\\ -\left(\frac{\beta+\sqrt{n_{t,i}}}{\alpha}+\lambda_{2}\right)^{-1}\left(z_{t,i}-\lambda_{1}\operatorname{sgn}(z_{t,i})\right) & \text{otherwise}\end{cases}\)
where
\(z_{t}=g_{1:t}-\sum_{s=1}^{t}\sigma_{s}w_{s},\qquad n_{t,i}=\sum_{s=1}^{t}g_{s,i}^{2},\qquad \eta_{t,i}=\frac{\alpha}{\beta+\sqrt{n_{t,i}}}\)
The algorithm therefore only has to maintain \(z\) and \(n\) per coordinate: for each sample it lazily materializes \(w\) from \(z\) via the closed form, predicts \(p=\sigma(w\cdot x)\), computes the gradient \(g=p-y\), and updates \(z_{i}\leftarrow z_{i}+g-\sigma w_{i}\) and \(n_{i}\leftarrow n_{i}+g^{2}\), with \(\sigma=(\sqrt{n_{i}+g^{2}}-\sqrt{n_{i}})/\alpha\). This is exactly the bookkeeping the code below performs.
These sections are only rough sketches; for the details you should consult the authors' papers. FTRL works very well for CTR prediction, and Kaggle has a corresponding competition [3]. The code from that project is as follows:
```python
'''
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.
'''

from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt


# TL; DR, the main training process starts on line: 250,
# you may want to start reading the code from there


##############################################################################
# parameters #################################################################
##############################################################################

# A, paths
train = 'train_rev2'               # path to training file
test = 'test_rev2'                 # path to testing file
submission = 'submission1234.csv'  # path of to be outputted submission file

# B, model
alpha = .1  # learning rate
beta = 1.   # smoothing parameter for adaptive learning rate
L1 = 1.     # L1 regularization, larger value means more regularized
L2 = 1.     # L2 regularization, larger value means more regularized

# C, feature/hash trick
D = 2 ** 20          # number of weights to use
interaction = False  # whether to enable poly2 feature interactions

# D, training/validation
epoch = 1       # learn training data for N passes
holdafter = 9   # data after date N (exclusive) are used as validation
holdout = None  # use every N training instance for holdout validation


##############################################################################
# class, function, generator definitions #####################################
##############################################################################

class ftrl_proximal(object):
    ''' Our main algorithm: Follow the regularized leader - proximal

        In short, this is an adaptive-learning-rate sparse
        logistic-regression with efficient L1-L2-regularization

        Reference:
        http://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
    '''

    def __init__(self, alpha, beta, L1, L2, D, interaction):
        # parameters
        self.alpha = alpha
        self.beta = beta
        self.L1 = L1
        self.L2 = L2

        # feature related parameters
        self.D = D
        self.interaction = interaction

        # model
        # n: squared sum of past gradients
        # z: weights
        # w: lazy weights
        self.n = [0.] * D
        self.z = [0.] * D
        self.w = {}

    def _indices(self, x):
        ''' A helper generator that yields the indices in x

            The purpose of this generator is to make the following
            code a bit cleaner when doing feature interaction.
        '''

        # first yield index of the bias term
        yield 0

        # then yield the normal indices
        for index in x:
            yield index

        # now yield interactions (if applicable)
        if self.interaction:
            D = self.D
            L = len(x)

            x = sorted(x)
            for i in range(L):
                for j in range(i + 1, L):
                    # one-hot encode interactions with hash trick
                    yield abs(hash(str(x[i]) + '_' + str(x[j]))) % D

    def predict(self, x):
        ''' Get probability estimation on x

            INPUT:
                x: features

            OUTPUT:
                probability of p(y = 1 | x; w)
        '''

        # parameters
        alpha = self.alpha
        beta = self.beta
        L1 = self.L1
        L2 = self.L2

        # model
        n = self.n
        z = self.z
        w = {}

        # wTx is the inner product of w and x
        wTx = 0.
        for i in self._indices(x):
            sign = -1. if z[i] < 0 else 1.  # get sign of z[i]

            # build w on the fly using z and n, hence the name - lazy weights
            # we are doing this at prediction instead of update time is because
            # this allows us for not storing the complete w
            if sign * z[i] <= L1:
                # w[i] vanishes due to L1 regularization
                w[i] = 0.
            else:
                # apply prediction time L1, L2 regularization to z and get w
                w[i] = (sign * L1 - z[i]) / ((beta + sqrt(n[i])) / alpha + L2)

            wTx += w[i]

        # cache the current w for update stage
        self.w = w

        # bounded sigmoid function, this is the probability estimation
        return 1. / (1. + exp(-max(min(wTx, 35.), -35.)))

    def update(self, x, p, y):
        ''' Update model using x, p, y

            INPUT:
                x: feature, a list of indices
                p: click probability prediction of our model
                y: answer

            MODIFIES:
                self.n: increase by squared gradient
                self.z: weights
        '''

        # parameter
        alpha = self.alpha

        # model
        n = self.n
        z = self.z
        w = self.w

        # gradient under logloss
        g = p - y

        # update z and n
        for i in self._indices(x):
            sigma = (sqrt(n[i] + g * g) - sqrt(n[i])) / alpha
            z[i] += g - sigma * w[i]
            n[i] += g * g


def logloss(p, y):
    ''' FUNCTION: Bounded logloss

        INPUT:
            p: our prediction
            y: real answer

        OUTPUT:
            logarithmic loss of p given y
    '''

    p = max(min(p, 1. - 10e-15), 10e-15)
    return -log(p) if y == 1. else -log(1. - p)


def data(path, D):
    ''' GENERATOR: Apply hash-trick to the original csv row
                   and for simplicity, we one-hot-encode everything

        INPUT:
            path: path to training or testing file
            D: the max index that we can hash to

        YIELDS:
            ID: id of the instance, mainly useless
            x: a list of hashed and one-hot-encoded 'indices'
               we only need the index since all values are either 0 or 1
            y: y = 1 if we have a click, else we have y = 0
    '''

    for t, row in enumerate(DictReader(open(path))):
        # process id
        ID = row['id']
        del row['id']

        # process clicks
        y = 0.
        if 'click' in row:
            if row['click'] == '1':
                y = 1.
            del row['click']

        # extract date
        date = int(row['hour'][4:6])

        # turn hour really into hour, it was originally YYMMDDHH
        row['hour'] = row['hour'][6:]

        # build x
        x = []
        for key in row:
            value = row[key]

            # one-hot encode everything with hash trick
            index = abs(hash(key + '_' + value)) % D
            x.append(index)

        yield t, date, ID, x, y


##############################################################################
# start training #############################################################
##############################################################################

start = datetime.now()

# initialize ourselves a learner
learner = ftrl_proximal(alpha, beta, L1, L2, D, interaction)

# start training
for e in range(epoch):
    loss = 0.
    count = 0

    for t, date, ID, x, y in data(train, D):  # data is a generator
        #    t: just a instance counter
        # date: you know what this is
        #   ID: id provided in original data
        #    x: features
        #    y: label (click)

        # step 1, get prediction from learner
        p = learner.predict(x)

        if (holdafter and date > holdafter) or (holdout and t % holdout == 0):
            # step 2-1, calculate validation loss
            #           we do not train with the validation data so that our
            #           validation loss is an accurate estimation
            #
            # holdafter: train instances from day 1 to day N
            #            validate with instances from day N + 1 and after
            #
            # holdout: validate with every N instance, train with others
            loss += logloss(p, y)
            count += 1
        else:
            # step 2-2, update learner with label (click) information
            learner.update(x, p, y)

    print('Epoch %d finished, validation logloss: %f, elapsed time: %s' % (
        e, loss / count, str(datetime.now() - start)))


##############################################################################
# start testing, and build Kaggle's submission file ##########################
##############################################################################

with open(submission, 'w') as outfile:
    outfile.write('id,click\n')
    for t, date, ID, x, y in data(test, D):
        p = learner.predict(x)
        outfile.write('%s,%s\n' % (ID, str(p)))
```
[1] Langford J., Li L., Zhang T. Sparse Online Learning via Truncated Gradient. Advances in Neural Information Processing Systems, 2009: 905-912.
[2] Duchi J., Singer Y. Efficient Online and Batch Learning Using Forward Backward Splitting. Journal of Machine Learning Research, 2009.
[3]https://www.kaggle.com/c/avazu-ctr-prediction