Neural network compression has been a very active research topic over the past three years. I found two related blog posts whose authors generously shared their source code, but I ran into some trouble when training a mask-enabled network on the GPU, so this post documents the details.
This article builds on 基於Caffe的CNN剪枝 (CNN pruning with Caffe) [1] and Deep Compression閱讀理解及Caffe源碼修改 (Deep Compression: reading notes and Caffe source modifications) [2].
How is the mask stored?
[1] stores the mask in the blob. A blob is a block of data, and at initialization a separate chunk of memory must also be allocated for the mask on the GPU, hence the Addmask() function. Addmask() is a member method of Blob, declared in blob.hpp and implemented in blob.cpp. To use it, call Addmask() in inner_product_layer.cpp and base_conv_layer.cpp so that, during layer setup, the fc and conv layers allocate an extra SyncedMemory to hold the mask. Blob exposes a family of accessors such as cpu_data()/mutable_cpu_data(); when changing mask values during initialization, take care to go through the appropriate accessor.
inner_product_layer.cpp:
template <typename Dtype>
void InnerProductLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  ...
  // allocate the weight blob, then attach a mask SyncedMemory to it
  this->blobs_[0].reset(new Blob<Dtype>(weight_shape));
  this->blobs_[0]->Addmask();
  ...
}
base_conv_layer.cpp:
template <typename Dtype>
void BaseConvolutionLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  ...
  this->blobs_[0].reset(new Blob<Dtype>(weight_shape));
  this->blobs_[0]->Addmask();
  ...
}
Modify blob.hpp and blob.cpp to add the mask_ member and its related methods; the author of [1] has posted the source code in the comments under that article.
[2] instead defines the mask inside a layer; a layer is essentially a set of operations applied to data, or in other words a way of combining blobs.
However, to run on the GPU the mask data itself needs GPU-side storage and operations, so this post follows the approach in [1] and adds mask_ as a member of the Blob class.
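Since the full code is available in the comment section of [1], the following is only a minimal sketch of what those Blob additions might look like, written against the accessor names used by the snippets later in this post (Addmask(), cpu_mask(), mutable_cpu_mask(), gpu_mask()); the actual code from [1] may differ in detail.

// blob.hpp -- inside class Blob<Dtype> (sketch only)
shared_ptr<SyncedMemory> mask_;   // one mask element per weight, stored like data_/diff_
void Addmask();                   // allocate the mask SyncedMemory
const Dtype* cpu_mask() const;
const Dtype* gpu_mask() const;
Dtype* mutable_cpu_mask();
Dtype* mutable_gpu_mask();

// blob.cpp -- sketch only
template <typename Dtype>
void Blob<Dtype>::Addmask() {
  mask_.reset(new SyncedMemory(count_ * sizeof(Dtype)));
}

template <typename Dtype>
const Dtype* Blob<Dtype>::cpu_mask() const {
  CHECK(mask_);
  return static_cast<const Dtype*>(mask_->cpu_data());
}

template <typename Dtype>
Dtype* Blob<Dtype>::mutable_cpu_mask() {
  CHECK(mask_);
  return static_cast<Dtype*>(mask_->mutable_cpu_data());
}

template <typename Dtype>
const Dtype* Blob<Dtype>::gpu_mask() const {
  CHECK(mask_);
  return static_cast<const Dtype*>(mask_->gpu_data());
}

template <typename Dtype>
Dtype* Blob<Dtype>::mutable_gpu_mask() {
  CHECK(mask_);
  return static_cast<Dtype*>(mask_->mutable_gpu_data());
}

With this in place, a parameter blob's data, diff, and mask each live in their own SyncedMemory and are available on either the CPU or the GPU.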
How is the mask initialized?
In the Caffe framework, a network's parameters can be initialized in two ways: by calling a filler, which initializes them according to the scheme specified in the model definition, or by reading the corresponding parameter matrices from an existing caffemodel or snapshot [1].
1. The filler approach
At startup the network is initialized by Init() in net.cpp, which walks from input to output and calls each layer's LayerSetUp to build the network structure. The snippet below shows how Caffe fills a blob using the xavier method.
virtual void Fill(Blob<Dtype>* blob) {
  CHECK(blob->count());
  int fan_in = blob->count() / blob->num();
  int fan_out = blob->count() / blob->channels();
  Dtype n = fan_in;  // default to fan_in
  if (this->filler_param_.variance_norm() ==
      FillerParameter_VarianceNorm_AVERAGE) {
    n = (fan_in + fan_out) / Dtype(2);
  } else if (this->filler_param_.variance_norm() ==
      FillerParameter_VarianceNorm_FAN_OUT) {
    n = fan_out;
  }
  Dtype scale = sqrt(Dtype(3) / n);
  caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
      blob->mutable_cpu_data());
  //Filler<Dtype>::FillMask(blob);
  CHECK_EQ(this->filler_param_.sparse(), -1)
      << "Sparsity not supported by this Filler.";
}
The filler's job is to generate random initial values for the freshly built network structure.
This random fill is executed even when the parameters are subsequently loaded from a snapshot or caffemodel.
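The Filler<Dtype>::FillMask(blob) call left commented out in the snippet above hints at a third possibility: initializing the mask inside the filler itself. A hypothetical FillMask that simply switches every connection on could look like the sketch below (it assumes the mutable_cpu_mask() accessor from the Blob sketch earlier); it is not needed in this workflow, because the mask is instead derived from the loaded weights in FromProto, described next.

// Hypothetical helper, e.g. as a static member of Filler<Dtype>: start from an
// all-ones mask, i.e. every connection enabled. Not used here, where FromProto
// builds the mask from the loaded weights instead.
static void FillMask(Blob<Dtype>* blob) {
  CHECK(blob->count());
  caffe_set(blob->count(), Dtype(1), blob->mutable_cpu_mask());
}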
2. Loading parameters from a snapshot or caffemodel
In tools/caffe.cpp, the train phase can load parameters from a snapshot or caffemodel for fine-tuning, while the test phase builds the network from those stored parameters and runs prediction.
Here my network was sparsified in pycaffe, so the proto being read back describes a network with an unchanged number of connections in which some of the weights are zero. mask_ has to be initialized while these parameters are read in, so modify the FromProto function in blob.cpp:
template <typename Dtype>
void Blob<Dtype>::FromProto(const BlobProto& proto, bool reshape) {
  if (reshape) {
    vector<int> shape;
    if (proto.has_num() || proto.has_channels() ||
        proto.has_height() || proto.has_width()) {
      // Using deprecated 4D Blob dimensions --
      // shape is (num, channels, height, width).
      shape.resize(4);
      shape[0] = proto.num();
      shape[1] = proto.channels();
      shape[2] = proto.height();
      shape[3] = proto.width();
    } else {
      shape.resize(proto.shape().dim_size());
      for (int i = 0; i < proto.shape().dim_size(); ++i) {
        shape[i] = proto.shape().dim(i);
      }
    }
    Reshape(shape);
  } else {
    CHECK(ShapeEquals(proto)) << "shape mismatch (reshape not set)";
  }
  // copy data
  Dtype* data_vec = mutable_cpu_data();
  if (proto.double_data_size() > 0) {
    CHECK_EQ(count_, proto.double_data_size());
    for (int i = 0; i < count_; ++i) {
      data_vec[i] = proto.double_data(i);
    }
  } else {
    CHECK_EQ(count_, proto.data_size());
    for (int i = 0; i < count_; ++i) {
      data_vec[i] = proto.data(i);
    }
  }
  if (proto.double_diff_size() > 0) {
    CHECK_EQ(count_, proto.double_diff_size());
    Dtype* diff_vec = mutable_cpu_diff();
    for (int i = 0; i < count_; ++i) {
      diff_vec[i] = proto.double_diff(i);
    }
  } else if (proto.diff_size() > 0) {
    CHECK_EQ(count_, proto.diff_size());
    Dtype* diff_vec = mutable_cpu_diff();
    for (int i = 0; i < count_; ++i) {
      diff_vec[i] = proto.diff(i);
    }
  }
  // build the mask for 4-D (conv) and 2-D (fc) parameter blobs:
  // a nonzero weight keeps its connection, a zero weight is pruned
  if (shape_.size() == 4 || shape_.size() == 2) {
    Dtype* mask_vec = mutable_cpu_mask();
    CHECK(count_);
    for (int i = 0; i < count_; ++i)
      mask_vec[i] = data_vec[i] ? 1 : 0;
  }
}
While the proto is being read, if the blob is 4-D (a conv layer) or 2-D (an fc layer), mask_ is initialized to data_vec[i] ? 1 : 0. For 1-D blobs (e.g. pool or relu layers), no mask is initialized.
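To sanity-check this step, one can load the pruned caffemodel through the C++ API and confirm that the mask matches the zero pattern of the weights. The sketch below is only an illustration: the prototxt/caffemodel paths and the layer index are placeholders, and cpu_mask() is the accessor assumed in the Blob sketch above.

#include "caffe/caffe.hpp"

int main() {
  // placeholder paths -- substitute your own model definition and pruned weights
  caffe::Net<float> net("train_val.prototxt", caffe::TEST);
  net.CopyTrainedLayersFrom("pruned.caffemodel");   // each parameter blob goes through FromProto()

  // index 1 assumes the first learnable layer sits right after a data layer
  caffe::Blob<float>* weights = net.layers()[1]->blobs()[0].get();
  const float* w = weights->cpu_data();
  const float* m = weights->cpu_mask();             // accessor added in this post
  int pruned = 0;
  for (int i = 0; i < weights->count(); ++i) {
    if (m[i] == 0) { ++pruned; CHECK_EQ(w[i], 0); } // masked-out weights must be zero
  }
  LOG(INFO) << "pruned weights: " << pruned << " / " << weights->count();
  return 0;
}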
Changes to back-propagation?
1. Modify how a blob updates itself (Blob::Update()) and include the math_functions.hpp header.
template <typename Dtype>
void Blob<Dtype>::Update() {
  // We will perform update based on where the data is located.
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    // perform computation on CPU
    caffe_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->cpu_data()),
        static_cast<Dtype*>(data_->mutable_cpu_data()));
    // re-apply the mask so pruned weights stay at zero after the update
    // (only blobs that own a mask are touched; e.g. bias blobs have none)
    if (mask_) {
      caffe_mul<Dtype>(count_,
          static_cast<const Dtype*>(mask_->cpu_data()),
          static_cast<const Dtype*>(data_->cpu_data()),
          static_cast<Dtype*>(data_->mutable_cpu_data()));
    }
    break;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
    // perform computation on GPU
    caffe_gpu_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->gpu_data()),
        static_cast<Dtype*>(data_->mutable_gpu_data()));
    if (mask_) {
      caffe_gpu_mul<Dtype>(count_,
          static_cast<const Dtype*>(mask_->gpu_data()),
          static_cast<const Dtype*>(data_->gpu_data()),
          static_cast<Dtype*>(data_->mutable_gpu_data()));
    }
#else
    NO_GPU;
#endif
    break;
  default:
    LOG(FATAL) << "Syncedmem not initialized.";
  }
}
2. In both the CPU and the GPU backward passes, add a masking step of the form weight_diff[j] *= mask[j] to the weight gradients.
inner_product_layer.cpp:
template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (this->param_propagate_down_[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* bottom_data = bottom[0]->cpu_data();
    // Gradient with respect to weight
    Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
    vector<int> weight_shape(2);
    if (transpose_) {
      weight_shape[0] = K_;
      weight_shape[1] = N_;
    } else {
      weight_shape[0] = N_;
      weight_shape[1] = K_;
    }
    int count = weight_shape[0] * weight_shape[1];
    // apply the mask to the accumulated weight gradients
    const Dtype* mask = this->blobs_[0]->cpu_mask();
    for (int j = 0; j < count; j++)
      weight_diff[j] *= mask[j];

    if (transpose_) {
      caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
          K_, N_, M_,
          (Dtype)1., bottom_data, top_diff,
          (Dtype)1., weight_diff);
    } else {
      caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
          N_, K_, M_,
          (Dtype)1., top_diff, bottom_data,
          (Dtype)1., weight_diff);
    }
  }
  if (bias_term_ && this->param_propagate_down_[1]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    // Gradient with respect to bias
    caffe_cpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
        bias_multiplier_.cpu_data(), (Dtype)1.,
        this->blobs_[1]->mutable_cpu_diff());
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    // Gradient with respect to bottom data
    if (transpose_) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, this->blobs_[0]->cpu_data(),
          (Dtype)0., bottom[0]->mutable_cpu_diff());
    } else {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, this->blobs_[0]->cpu_data(),
          (Dtype)0., bottom[0]->mutable_cpu_diff());
    }
  }
}
inner_product_layer.cu:
template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (this->param_propagate_down_[0]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    const Dtype* bottom_data = bottom[0]->gpu_data();
    vector<int> weight_shape(2);
    if (transpose_) {
      weight_shape[0] = K_;
      weight_shape[1] = N_;
    } else {
      weight_shape[0] = N_;
      weight_shape[1] = K_;
    }
    int count = weight_shape[0] * weight_shape[1];
    // GPU memory cannot be indexed from the host with a plain loop
    // (weight_diff[j] *= mask[j]), so use the element-wise caffe_gpu_mul instead
    caffe_gpu_mul<Dtype>(count,
        static_cast<const Dtype*>(this->blobs_[0]->gpu_diff()),
        static_cast<const Dtype*>(this->blobs_[0]->gpu_mask()),
        static_cast<Dtype*>(this->blobs_[0]->mutable_gpu_diff()));
    Dtype* weight_diff = this->blobs_[0]->mutable_gpu_diff();
    // Gradient with respect to weight
    if (transpose_) {
      caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
          K_, N_, M_,
          (Dtype)1., bottom_data, top_diff,
          (Dtype)1., weight_diff);
    } else {
      caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
          N_, K_, M_,
          (Dtype)1., top_diff, bottom_data,
          (Dtype)1., weight_diff);
    }
  }
  if (bias_term_ && this->param_propagate_down_[1]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    // Gradient with respect to bias
    caffe_gpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
        bias_multiplier_.gpu_data(), (Dtype)1.,
        this->blobs_[1]->mutable_gpu_diff());
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    // Gradient with respect to bottom data
    if (transpose_) {
      caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, this->blobs_[0]->gpu_data(),
          (Dtype)0., bottom[0]->mutable_gpu_diff());
    } else {
      caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, this->blobs_[0]->gpu_data(),
          (Dtype)0., bottom[0]->mutable_gpu_diff());
    }
  }
}
This completes the modifications.
Incidentally, newer versions of Caffe have added a sparse_ parameter; see the related pull requests: https://github.com/BVLC/caffe/pulls?utf8=%E2%9C%93&q=sparse