Implementing Large Margin Softmax Loss in Caffe (Part 2)

Reposted article, dated 2016-10-08 17:29. Tags: L-Softmax, Caffe

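A few words from Xiaomiao: in the previous post we finished the preparation work for L-Softmax. In this one we start working through the forward pass.

Xiaomiao's blog: http://miaoerduo.com

Original post: http://www.miaoerduo.com/deep-learning/基於caffe的large-margin-softmax-loss的實現（中）.html

4. Forward pass

Remember the three formulas Xiaomiao gave in the previous post? It doesn't matter if you don't.

This time we implement those formulas in code, one step at a time. Xiaomiao mainly implements the forward and backward passes on the GPU: this layer is only used for training, and the GPU should be somewhat faster.

We first run the forward pass of an ordinary FC layer, because the LM_FC forward pass only modifies a few of the values an ordinary FC layer produces; most of them are left untouched.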

const Dtype* bottom_data = bottom[0]->gpu_data();
const Dtype* label_data = bottom[1]->gpu_data();
Dtype* top_data = top[0]->mutable_gpu_data();
const Dtype* weight = this->blobs_[0]->gpu_data();
// Ordinary FC layer computation
if (M_ == 1) {
  caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,
                       weight, bottom_data, (Dtype)0., top_data);
} else {
  caffe_gpu_gemm<Dtype>(CblasNoTrans,
                        transpose_ ? CblasNoTrans : CblasTrans,
                        M_, N_, K_, (Dtype)1.,
                        bottom_data, weight, (Dtype)0., top_data);
}

That completes the forward pass of an ordinary FC layer.
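
As a sanity check on what the GEMM call computes, here is a minimal CPU sketch. It assumes bottom is an M_ x K_ row-major matrix, weight is an N_ x K_ row-major matrix (transpose_ is false), and there is no bias term. `fc_forward_reference` is a hypothetical helper name, not part of Caffe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the plain FC forward pass (no bias).
// Assumed row-major layouts:
//   bottom: M x K (one feature per row)
//   weight: N x K (one class per row)
//   top:    M x N, with top[i * N + j] = W_j^T x_i
// fc_forward_reference is a hypothetical helper, not a Caffe function.
std::vector<float> fc_forward_reference(const std::vector<float>& bottom,
                                        const std::vector<float>& weight,
                                        int M, int N, int K) {
  assert(bottom.size() == static_cast<std::size_t>(M) * K);
  assert(weight.size() == static_cast<std::size_t>(N) * K);
  std::vector<float> top(static_cast<std::size_t>(M) * N, 0.0f);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += bottom[i * K + k] * weight[j * K + k];
      }
      top[i * N + j] = acc;
    }
  }
  return top;
}
```

Each output element is top[i * N + j] = W_j^T x_i, so the score of sample i for its label sits at top[i * N + label_i].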

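Next come the specifics.

1. $\cos(\theta_j)=\frac{W_j^Tx_i}{\|W_j\|\|x_i\|}$

This is the cosine between the feature and the weight vector of class $j$. The formula should be familiar from high school. It needs three pieces: $W_j^Tx_i$, $\|W_j\|$ and $\|x_i\|$. Here $i$ is the index of the feature, since a mini-batch contains many images, and $j$ is the correct label.

$W_j^Tx_i$ is trivial to compute, because it is exactly what the FC forward pass produces. We can simply copy the value at the corresponding position of the FC output.

$\|W_j\|$ and $\|x_i\|$ are plain norm computations and are easy to get with caffe_cpu_dot. (Why not caffe_gpu_dot? When Xiaomiao tried caffe_gpu_dot, Caffe reported a strange error; perhaps it is because GPU memory cannot be accessed freely.)

Plug these into the formula above and the cosine is done~

The code uses a few variables:

M_: batch size

N_: number of classes

K_: feature length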

// Labels on the host, needed by the CPU loops below
const Dtype* label_cpu_data = bottom[1]->cpu_data();

// w * x
// Copied directly from the FC forward result
Dtype *wx_data = this->wx_.mutable_gpu_data();
copy_label_score<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(M_, N_, label_data, top_data, wx_data);

// w * w (squared norm of the target-class weight row)
Dtype *abs_w_data = this->abs_w_.mutable_cpu_data();
for (int m = 0; m < M_; ++ m) {
  abs_w_data[m] = caffe_cpu_dot<Dtype>(
    K_,
    this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_,
    this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_
    );
}

// x * x (squared norm of each feature)
Dtype *abs_x_data = this->abs_x_.mutable_cpu_data();
for (int m = 0; m < M_; ++ m) {
  abs_x_data[m] = caffe_cpu_dot<Dtype>(
    K_,
    bottom[0]->cpu_data() + m * K_,
    bottom[0]->cpu_data() + m * K_
    );
}
// abs_w, abs_x (square roots of the squared norms)
caffe_gpu_powx<Dtype>(M_, this->abs_w_.mutable_gpu_data(), 0.5, this->abs_w_.mutable_gpu_data());
caffe_gpu_powx<Dtype>(M_, this->abs_x_.mutable_gpu_data(), 0.5, this->abs_x_.mutable_gpu_data());

// cos_t = wx / (|x| * |w|)
Dtype *cos_t_data = this->cos_t_.mutable_gpu_data();
caffe_gpu_div<Dtype>(M_, wx_data, this->abs_x_.gpu_data(), cos_t_data);
caffe_gpu_div<Dtype>(M_, cos_t_data, this->abs_w_.gpu_data(), cos_t_data);

Here copy_label_score is a kernel we wrote ourselves to copy the results (how to write CUDA programs is a subject of its own):

template <typename Dtype>
__global__ void copy_label_score(const int M, const int N, const Dtype *label_data, const Dtype *top_data, Dtype *wx_data) {
  CUDA_KERNEL_LOOP(index, M) {
    wx_data[index] = top_data[index * N + static_cast<int>(label_data[index])];
  }
}

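Clever readers like you will have no trouble understanding these few lines.

Here Xiaomiao would like to explain a little more.
All data in Caffe is stored in Blob structures. The bottom_data here, for example, points into a blob whose default shape is (n, c, h, w): n is the batch size, c is the number of channels, and h and w are the height and width. The memory of a blob is laid out in the same order as an ordinary C array. So when computing the feature norms, we simply take one dot product over every K_ consecutive values.
Likewise, the weight is stored in this->blobs_[0]. What shape does it have? Conveniently, if transpose is set to false in the prototxt, the weight has shape N_*K_. We can treat it as a matrix whose every row is dotted with the feature to produce one output, so each row is exactly the $W_j$ whose norm we need, and the weight norms are computed in much the same way as the feature norms. We force transpose to false here because it keeps the computation simple. If you set it to true, you will have to write your own norm function.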
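
To make the row-major indexing concrete, the sketch below computes the target-class cosine on the CPU with plain loops. The helper names (`row_squared_norm`, `target_cosines`) are hypothetical, and the code assumes transpose is false, so row j of the weight matrix is $W_j$. It does not guard against zero-length vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared L2 norm of one row of a row-major matrix with `cols` columns.
float row_squared_norm(const std::vector<float>& mat, int row, int cols) {
  float sum = 0.0f;
  for (int k = 0; k < cols; ++k) {
    const float v = mat[static_cast<std::size_t>(row) * cols + k];
    sum += v * v;
  }
  return sum;
}

// cos(theta) between each feature x_i and its target-class weight row.
// bottom is M x K, weight is N x K, labels holds M class indices.
std::vector<float> target_cosines(const std::vector<float>& bottom,
                                  const std::vector<float>& weight,
                                  const std::vector<int>& labels,
                                  int K) {
  const int M = static_cast<int>(labels.size());
  assert(bottom.size() == static_cast<std::size_t>(M) * K);
  std::vector<float> cos_t(M);
  for (int i = 0; i < M; ++i) {
    const int j = labels[i];
    float wx = 0.0f;  // W_j^T x_i
    for (int k = 0; k < K; ++k) {
      wx += bottom[i * K + k] * weight[j * K + k];
    }
    const float abs_x = std::sqrt(row_squared_norm(bottom, i, K));
    const float abs_w = std::sqrt(row_squared_norm(weight, j, K));
    cos_t[i] = wx / (abs_x * abs_w);
  }
  return cos_t;
}
```

For example, the feature (3, 4) against the weight row (1, 0) gives 3 / (5 * 1) = 0.6.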

2. $\cos(m\theta_i)=\sum_n(-1)^n{C_m^{2n}\cos^{m-2n}(\theta_i)\cdot(1-\cos(\theta_i)^2)^n}, (2n\leq m)$

In (1) we obtained $\cos(\theta)$. For a given margin, we only need to substitute it into this formula to get $\cos(m\theta)$.

template <typename Dtype>
__global__ void cal_cos_mt(const int count, const unsigned int margin, const int *C_M_N, const Dtype *cos_t_data, Dtype *cos_mt_data) {
  CUDA_KERNEL_LOOP(index, count) {
    Dtype cos_t = cos_t_data[index];
    Dtype sin_t_2 = 1 - cos_t * cos_t;  // sin^2(theta)
    Dtype cos_mt = 0.;
    int flag = -1;
    // Sum over n with 2n <= margin; flag alternates +1, -1, +1, ...
    for (int n = 0; n <= (margin / 2); ++ n) {
      flag *= -1;
      cos_mt += flag * C_M_N[2 * n] * powf(cos_t, (margin - 2 * n)) * powf(sin_t_2, n);
    }
    cos_mt_data[index] = cos_mt;
  }
}

Above is the CUDA function that computes $\cos(m\theta)$. Calling it is just as simple:

// cos(mt)
cal_cos_mt<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
  M_, this->margin, this->C_M_N_.gpu_data(), this->cos_t_.gpu_data(), this->cos_mt_.mutable_gpu_data());
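
One way to convince yourself that the expansion is right is to evaluate it on the CPU and compare it with `std::cos(m * theta)`. The sketch below is an illustration only: `binomial_row` is an assumed stand-in for however the C_M_N_ table is filled (presumably with the binomial coefficients $C_m^n$), and `cos_m_theta` mirrors the kernel's loop in double precision.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Binomial coefficients C(m, 0..m), built in place with Pascal's rule.
// Assumption: this is what the post's C_M_N_ table contains.
std::vector<int> binomial_row(int m) {
  assert(m >= 0);
  std::vector<int> c(m + 1, 0);
  c[0] = 1;
  for (int i = 1; i <= m; ++i) {
    for (int j = i; j >= 1; --j) {
      c[j] += c[j - 1];
    }
  }
  return c;
}

// cos(m * theta) from cos(theta), using
// sum_n (-1)^n C(m, 2n) cos^(m-2n)(theta) (1 - cos^2(theta))^n, 2n <= m.
double cos_m_theta(double cos_t, int m) {
  const std::vector<int> c = binomial_row(m);
  const double sin_t_2 = 1.0 - cos_t * cos_t;
  double result = 0.0;
  int sign = 1;
  for (int n = 0; 2 * n <= m; ++n) {
    result += sign * c[2 * n] * std::pow(cos_t, m - 2 * n) * std::pow(sin_t_2, n);
    sign = -sign;
  }
  return result;
}
```

For m = 2 the sum collapses to $\cos^2\theta - (1-\cos^2\theta) = 2\cos^2\theta - 1$, the familiar double-angle identity.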

3. $f_{y_{i}}=(-1)^k\cdot\|W_{y_{i}}\|\|x_{i}\|\cos(m\theta_i)-2k\cdot\|W_{y_i}\|\|x_i\|$

Strictly speaking, what we need to compute is not this expression, but:

\[f_{y_i}=\frac{\lambda\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i})+\|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})}{1+\lambda}\]

\[\varphi(\theta)=(-1)^k\cos(m\theta)-2k, \theta\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]\]

As you can see, when $\lambda$ is 0, these two expressions reduce to the one before.
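
Spelling out the two limits makes the role of $\lambda$ clearer. This is a short derivation from the two formulas above, not something taken from the original post:

\[\lambda=0:\quad f_{y_i}=\|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})=(-1)^k\|W_{y_i}\|\|x_i\|\cos(m\theta_{y_i})-2k\|W_{y_i}\|\|x_i\|\]

\[\lambda\to\infty:\quad f_{y_i}\to\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i})=W_{y_i}^Tx_i\]

So a large $\lambda$ leaves the ordinary FC score almost unchanged, and $\lambda=0$ applies the full margin.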

Finding k is simple: just compare $\cos(\theta)$ against the boundaries of each interval.

// k: index of the interval [k*pi/m, (k+1)*pi/m] that theta falls into
int *k_cpu_data = this->k_.mutable_cpu_data();
const Dtype *cos_t_cpu_data = this->cos_t_.cpu_data();
for (int m = 0; m < M_; ++ m) {
  for (int _k = 0; _k < this->cos_theta_bound_.count(); ++ _k) {
    // The first bound smaller than cos(theta) closes interval _k - 1
    if (this->cos_theta_bound_.cpu_data()[_k] < cos_t_cpu_data[m]) {
      k_cpu_data[m] = _k - 1;
      break;
    }
  }
}
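
Here is a CPU sketch of the same lookup, under the assumption that cos_theta_bound_ holds $\cos(k\pi/m)$ for $k=0,\dots,m$ in decreasing order (this post does not show how it is filled). Unlike the loop above, the hypothetical `find_k` starts from index 1 and falls back to $m-1$, so cosines that land at or slightly beyond 1 or -1 through rounding still map to a valid interval.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Interval boundaries cos(k * pi / m) for k = 0..m, decreasing from 1 to -1.
// Assumed to mirror the contents of cos_theta_bound_ in the post.
std::vector<double> cos_theta_bounds(int m) {
  assert(m >= 1);
  const double pi = std::acos(-1.0);
  std::vector<double> bounds(m + 1);
  for (int k = 0; k <= m; ++k) {
    bounds[k] = std::cos(k * pi / m);
  }
  return bounds;
}

// Returns k in [0, m - 1] such that theta lies in [k*pi/m, (k+1)*pi/m].
int find_k(double cos_t, const std::vector<double>& bounds) {
  const int m = static_cast<int>(bounds.size()) - 1;
  int k = m - 1;  // fallback when no bound is smaller, e.g. cos_t == -1
  for (int i = 1; i <= m; ++i) {
    if (bounds[i] < cos_t) {
      k = i - 1;
      break;
    }
  }
  return k;
}
```

With m = 4 the bounds are roughly 1, 0.707, 0, -0.707, -1, so a cosine of 0.5 (theta = pi/3) falls in the second interval and gives k = 1.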

The last step is to compute the actual forward value! The program follows directly from the formula:

template <typename Dtype>
__global__ void LMForward(
  const int M, const int N, const float lambda,
  const Dtype *label_data, const Dtype *cos_mt_data, const int *k_data,
  const Dtype *abs_w_data, const Dtype *abs_x_data, Dtype *top_data) {

  CUDA_KERNEL_LOOP(index, M) {
    Dtype cos_mt = cos_mt_data[index];
    int k = k_data[index];
    int label = static_cast<int>(label_data[index]);
    Dtype abs_w = abs_w_data[index];
    Dtype abs_x = abs_x_data[index];
    // Only the target-class score is overwritten; all other scores keep the FC result
    top_data[N * index + label] =  (lambda * top_data[N * index + label] + abs_w * abs_x * ( powf(-1, k) * cos_mt - 2 * k )) / (1 + lambda);
  }
}

The call is equally simple:

// y
LMForward<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
  M_, N_, this->lambda,
  label_data, this->cos_mt_.gpu_data(), this->k_.gpu_data(),
  this->abs_w_.gpu_data(), this->abs_x_.gpu_data(), top[0]->mutable_gpu_data());
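
For reference, the per-sample arithmetic of LMForward can be written as a plain C++ function and checked by hand. This is a sketch, not the layer's code: `phi` and `target_logit` are hypothetical names, and the sign is computed with a parity test instead of powf(-1, k).

```cpp
#include <cassert>
#include <cmath>

// phi(theta) = (-1)^k * cos(m * theta) - 2k, given cos(m * theta) and k.
double phi(double cos_mt, int k) {
  const double sign = (k % 2 == 0) ? 1.0 : -1.0;
  return sign * cos_mt - 2.0 * k;
}

// Blended target-class score:
// (lambda * W^T x + |W| |x| phi(theta)) / (1 + lambda).
double target_logit(double wx, double abs_w, double abs_x,
                    double cos_mt, int k, double lambda) {
  assert(lambda >= 0.0);
  return (lambda * wx + abs_w * abs_x * phi(cos_mt, k)) / (1.0 + lambda);
}
```

A worked case with m = 2: take theta = 2*pi/3, so cos(theta) = -0.5, k = 1 and cos(2*theta) = -0.5. With |W| = 2 and |x| = 3 the plain FC score is -3, phi is -1.5, and the margin score is -9 at lambda = 0 and (-3 - 9) / 2 = -6 at lambda = 1.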

Finally, here is the complete forward code (headers and the caffe namespace omitted):

template <typename Dtype>
__global__ void copy_label_score(const int M, const int N, const Dtype *label_data, const Dtype *top_data, Dtype *wx_data) {
  CUDA_KERNEL_LOOP(index, M) {
    wx_data[index] = top_data[index * N + static_cast<int>(label_data[index])];
  }
}

template <typename Dtype>
__global__ void cal_cos_mt(const int count, const unsigned int margin, const int *C_M_N, const Dtype *cos_t_data, Dtype *cos_mt_data) {
  CUDA_KERNEL_LOOP(index, count) {
    Dtype cos_t = cos_t_data[index];
    Dtype sin_t_2 = 1 - cos_t * cos_t;
    Dtype cos_mt = 0.;
    int flag = -1;
    for (int n = 0; n <= (margin / 2); ++ n) {
      flag *= -1;
      cos_mt += flag * C_M_N[2 * n] * powf(cos_t, (margin - 2 * n)) * powf(sin_t_2, n);
    }
    cos_mt_data[index] = cos_mt;
  }
}

template <typename Dtype>
__global__ void LMForward(
  const int M, const int N, const float lambda,
  const Dtype *label_data, const Dtype *cos_mt_data, const int *k_data,
  const Dtype *abs_w_data, const Dtype *abs_x_data, Dtype *top_data) {

  CUDA_KERNEL_LOOP(index, M) {
    Dtype cos_mt = cos_mt_data[index];
    int k = k_data[index];
    int label = static_cast<int>(label_data[index]);
    Dtype abs_w = abs_w_data[index];
    Dtype abs_x = abs_x_data[index];
    top_data[N * index + label] =  (lambda * top_data[N * index + label] + abs_w * abs_x * ( powf(-1, k) * cos_mt - 2 * k )) / (1 + lambda);
  }
}

template <typename Dtype>
void LargeMarginInnerProductLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->gpu_data();
  const Dtype* label_data = bottom[1]->gpu_data();
  Dtype* top_data = top[0]->mutable_gpu_data();
  const Dtype* weight = this->blobs_[0]->gpu_data();

  // Ordinary FC layer computation
  if (M_ == 1) {
    caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,
                         weight, bottom_data, (Dtype)0., top_data);
  } else {
    caffe_gpu_gemm<Dtype>(CblasNoTrans,
                          transpose_ ? CblasNoTrans : CblasTrans,
                          M_, N_, K_, (Dtype)1.,
                          bottom_data, weight, (Dtype)0., top_data);
  }

  const Dtype* label_cpu_data = bottom[1]->cpu_data();

  // w * x
  // Copied directly from the FC forward result
  Dtype *wx_data = this->wx_.mutable_gpu_data();
  copy_label_score<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(M_, N_, label_data, top_data, wx_data);

  // w * w
  Dtype *abs_w_data = this->abs_w_.mutable_cpu_data();
  for (int m = 0; m < M_; ++ m) {
    abs_w_data[m] = caffe_cpu_dot<Dtype>(
      K_,
      this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_,
      this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_
      );
  }

  // x * x
  Dtype *abs_x_data = this->abs_x_.mutable_cpu_data();
  for (int m = 0; m < M_; ++ m) {
    abs_x_data[m] = caffe_cpu_dot<Dtype>(
      K_,
      bottom[0]->cpu_data() + m * K_,
      bottom[0]->cpu_data() + m * K_
      );
  }

  // abs_w, abs_x
  caffe_gpu_powx<Dtype>(M_, this->abs_w_.mutable_gpu_data(), 0.5, this->abs_w_.mutable_gpu_data());
  caffe_gpu_powx<Dtype>(M_, this->abs_x_.mutable_gpu_data(), 0.5, this->abs_x_.mutable_gpu_data());

  // cos_t = wx / (|x| * |w|)
  Dtype *cos_t_data = this->cos_t_.mutable_gpu_data();
  caffe_gpu_div<Dtype>(M_, wx_data, this->abs_x_.gpu_data(), cos_t_data);
  caffe_gpu_div<Dtype>(M_, cos_t_data, this->abs_w_.gpu_data(), cos_t_data);

  // cos(mt)
  cal_cos_mt<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
    M_, this->margin,
    this->C_M_N_.gpu_data(),
    this->cos_t_.gpu_data(),
    this->cos_mt_.mutable_gpu_data()
    );

  // k
  int *k_cpu_data = this->k_.mutable_cpu_data();
  const Dtype *cos_t_cpu_data = this->cos_t_.cpu_data();
  for (int m = 0; m < M_; ++ m) {
    for (int _k = 0; _k < this->cos_theta_bound_.count(); ++ _k) {
      if (this->cos_theta_bound_.cpu_data()[_k] < cos_t_cpu_data[m]) {
        k_cpu_data[m] = _k - 1;
        break;
      }
    }
  }

  // y
  LMForward<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
    M_, N_, this->lambda,
    label_data, this->cos_mt_.gpu_data(), this->k_.gpu_data(),
    this->abs_w_.gpu_data(), this->abs_x_.gpu_data(), top[0]->mutable_gpu_data());
}

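And with that, the forward pass of large margin softmax loss is done. In the next post we tackle the most complicated part, the backward pass.

If you found this post helpful, please buy Xiaomiao a cup of tea~~O(∩_∩)O~~ Once again, $\LaTeX$ is wonderful.

Please credit the source when reposting~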
