This post covers "MobiFace: A Lightweight Deep Learning Face Recognition on Mobile Devices" (November 2018). The authors are from CMU and the University of Arkansas (UARK).
0 Introduction
With the rise of DCNNs, fields such as object detection and segmentation have made considerable progress, but the high accuracy comes at the cost of huge parameter counts and computation. AlexNet needs 61 million parameters, VGG16 needs 138 million, ResNet-50 needs 25 million, and DenseNet-190 (k=40) needs 40 million. Although these networks are no longer considered very deep, they still require from 200MB to 500MB of memory. Such models therefore usually cannot be deployed on mobile or embedded devices. Recently, many compressed models have been proposed for image classification and object detection, such as pruning [13,14,32], depthwise convolutions [18,38], binary networks [3,4,22,36], and mimic networks [31,44]. These networks accelerate inference without losing much accuracy. However, they have not been applied to face recognition. Compared with object detection and face classification, face recognition usually needs a certain number of layers to extract sufficiently robust, discriminative facial features, since all faces share the same template (two eyes, one mouth).
The authors propose a lightweight but high-performance deep neural network so that face recognition can be deployed on mobile devices. Compared with other networks, MobiNet's advantages are:
- It makes the MobileNet architecture even lighter, and the proposed MobiNet model can readily be deployed on mobile devices;
- The proposed MobiNet can be optimized end-to-end;
- MobiNet is compared against mobile-oriented networks and large-scale deep networks on face recognition datasets.
1 MobiNet
To date, there have been many designs for lightweight deep networks, such as binarized networks, quantized networks, mimicked networks, designed compact modules, and pruned networks. This post focuses on the last two.
Designed compact modules
By integrating small models or compact modules and layers, the number of weights can be reduced, which helps cut memory usage and inference time. MobileNet proposes a depthwise separable convolution module to replace the traditional convolutional layer, significantly reducing the parameter count. Depthwise convolution first appeared in Sifre's work [41] and was later used in [2,18,38]. In MobileNet [18], the spatial input is convolved with a 3x3 per-channel (depthwise) filter to produce independent features, followed by a pointwise (1x1) convolution that produces new features. Replacing traditional convolution with this strategy leaves MobileNet with only 4.2 million parameters and 569 million MAdds, achieving 70.6% on ImageNet (VGG16 achieves 71.5%). To improve MobileNet's performance across tasks and benchmarks, Sandler et al. proposed inverted residuals and linear bottlenecks, called MobileNet-v2. The inverted residual is similar to the residual bottleneck in [16], except that the intermediate features can be expanded to a given ratio of the number of input channels; the linear bottleneck is a block without the ReLU layer. MobileNet-v2 raises the accuracy to 72% while needing only 3.4 million parameters and 300 million MAdds. Although depthwise separable convolution has proven effective, [18,38] still consume considerable memory and compute on iPhone and Android. Moreover, as of this paper's publication, the authors had not found an efficient CPU implementation of depthwise convolution in the mainstream frameworks (TensorFlow, PyTorch, Caffe, MXNet). To reduce MobileNet's computation, FD-MobileNet introduces a fast downsampling strategy. Inspired by the MobileNet-v2 structure, MobileFaceNet adopts a similar architecture and reduces parameters by replacing the global average pooling layer with a global depthwise convolution layer.
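The parameter saving from swapping a standard convolution for a depthwise separable one follows from simple arithmetic. A minimal sketch (the layer sizes below are made up for illustration):

```python
def conv_params(k, c_in, c_out):
    # Standard k x k convolution: every output channel mixes all input channels.
    return k * k * c_in * c_out

def dw_sep_params(k, c_in, c_out):
    # Depthwise separable: one k x k filter per input channel,
    # then a 1x1 pointwise convolution to mix channels.
    return k * k * c_in + c_in * c_out

# Example layer: 3x3 kernels, 128 -> 256 channels.
std = conv_params(3, 128, 256)
sep = dw_sep_params(3, 128, 256)
print(std, sep, round(std / sep, 1))  # roughly a 8.7x reduction
```

For 3x3 kernels the saving approaches 9x as the channel count grows, which is why MobileNet's parameter count drops so sharply.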
Pruned networks
DNNs have long suffered from huge parameter counts and heavy memory consumption. [14] proposes a deep compression approach that prunes unimportant connections by weight magnitude, achieving 9x and 13x compression on AlexNet and VGG16 with little accuracy loss. [32] slims the network using the scaling factors in BN (rather than weight magnitudes); these scaling factors are trained to be sparse via an L1 penalty. On CIFAR, the slimmed networks [32] built from VGG16, DenseNet, and ResNet achieve better accuracy than the originals. However, the index of every pruned connection must be stored in memory, which slows down both training and testing.
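The channel-selection step in network slimming can be sketched as follows. This is a simplified illustration, not the paper's implementation; the keep ratio and the gamma values are made up:

```python
import numpy as np

def select_channels(gamma, keep_ratio=0.5):
    """Keep the channels whose BN scaling factors |gamma| are largest."""
    k = max(1, int(len(gamma) * keep_ratio))
    order = np.argsort(-np.abs(gamma))   # channels sorted by importance
    return np.sort(order[:k])            # indices of channels to keep

# L1-sparsified factors: small values mark channels to prune away.
gamma = np.array([0.9, 0.01, 0.5, 0.002, 0.7, 0.03])
kept = select_channels(gamma, 0.5)
print(kept)  # -> [0 2 4]
```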
1.1 Network design strategies
Bottleneck residual block with expansion layers
[37] introduces the bottleneck residual block, which consists of three main transformations, two linear ones and one nonlinear per-channel one:
- the nonlinear transformation learns a complex mapping function;
- the number of feature maps is increased in the inner layers;
- shortcut connections are used to learn the residual.
Given an input \(\mathbf{x}\) of size \(h\times w\times k\), a bottleneck residual block can be expressed as:
\[\mathcal{F}(\mathbf{x}) = \left[F_3 \circ F_2 \circ F_1\right](\mathbf{x})\]
where \(F_1:R^{w\times h\times k}\mapsto R^{w\times h\times tk}\) and \(F_3:R^{\frac{w}{s}\times \frac{h}{s}\times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times k_1}\) are linear functions implemented as 1x1 convolutions, with t the expansion factor. \(F_2:R^{w\times h \times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times tk}\) is a nonlinear mapping implemented as a composition of three operations: ReLU, a 3x3 depthwise convolution with stride s, and ReLU.
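The shape bookkeeping above can be checked with a small sketch; the sizes are hypothetical, and h, w, k, t, k1, s follow the definitions in the text:

```python
def bottleneck_shapes(h, w, k, t, k1, s):
    """Shape flow through the three stages of a bottleneck residual block."""
    f1 = (h, w, t * k)             # 1x1 expansion: channels grow by factor t
    f2 = (h // s, w // s, t * k)   # 3x3 depthwise conv with stride s
    f3 = (h // s, w // s, k1)      # 1x1 linear projection down to k1 channels
    return f1, f2, f3

# Example: 56x56x64 input, expansion t=2, output channels k1=128, stride 2.
print(bottleneck_shapes(56, 56, 64, 2, 128, 2))
```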
Residual learning connections are adopted in the bottleneck block to prevent manifold collapse during the transformation and to increase the representational power of the feature embedding [37].
Fast downsampling
Given limited computational resources, a compact network should maximize the information transferred from the input image into the output features while avoiding expensive computation, such as feature maps with large spatial dimensions (resolution). In large-scale deep networks, the information flow relies on a slow downsampling strategy: the spatial dimensions shrink gradually from layer to layer. A lightweight network cannot afford this.
Fast downsampling means applying consecutive downsampling steps at the very beginning of the feature embedding process, so that feature maps never have large spatial dimensions, and then adding more feature maps in the later stages to maintain the information flow. Note that although adding more feature maps raises the channel count, the extra computational cost is small because the feature-map resolution is already small by then.
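Why doubling the channels at a reduced resolution is still cheap can be seen from the multiply-adds of a single convolution. An illustrative calculation with made-up layer sizes:

```python
def conv_madds(h, w, k, c_in, c_out):
    # Multiply-adds of a k x k convolution producing an h x w x c_out map.
    return h * w * k * k * c_in * c_out

# A 3x3 conv at full 112x112 resolution with 64 channels...
early = conv_madds(112, 112, 3, 64, 64)
# ...versus one at quarter resolution with double the channels.
late = conv_madds(28, 28, 3, 128, 128)
print(early // late)  # the early layer is 4x more expensive
```

Resolution enters the cost quadratically while channels enter linearly per side, so downsampling early buys room to widen the network later.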
1.2 MobiFace
The MobiFace network takes a 112x112x3 face image as input; this lightweight network aims to maximize the information flow while reducing computation. Based on the analysis above, the residual bottleneck block with expansion layers serves as MobiFace's building block. Table 1 shows MobiFace's main structure.
Here DWConv denotes depthwise convolution. As shown in Table 1, MobiFace's main structure consists of:
- a 3x3 convolutional layer;
- a 3x3 depthwise separable convolutional layer;
- a series of bottleneck blocks and residual bottleneck blocks;
- a 1x1 convolutional layer;
- a fully connected layer.
The residual bottleneck block is very similar to the bottleneck block, except that it adds a shortcut connecting the input and output of its 1x1 convolutional layers. Furthermore, the bottleneck block uses stride 2, while every layer in the residual bottleneck block uses stride 1.
By adopting the fast downsampling strategy, MobiFace rapidly shrinks the spatial dimensions across its layers/blocks. Note that the 112x112x3 input is halved within the first two layers and reduced by a further 8x across the following seven bottleneck blocks. The expansion factor is kept at 2, while the number of channels doubles after each bottleneck block.
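The downsampling schedule can be traced with a small helper. The per-block strides below are assumed for illustration only (the exact placement of the stride-2 blocks comes from Table 1, which is not reproduced here); what matters is the overall 112 → 56 → 7 progression:

```python
def downsample_trace(size, strides):
    """Spatial size after each layer/block, given per-stage strides."""
    trace = [size]
    for s in strides:
        size //= s
        trace.append(size)
    return trace

# Assumed strides: the first conv halves the input, then three of the
# following blocks use stride 2 for the additional 8x reduction.
print(downsample_trace(112, [2, 1, 2, 1, 2, 1, 1, 2, 1]))
```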
Except for the convolutional layers marked "linear", BN and a nonlinear activation are applied after every convolutional layer; this paper mainly uses PReLU rather than ReLU. At the last stage of MobiFace, a fully connected layer is used instead of global average pooling. Global average pooling treats every neuron equally, even though neurons in the central region matter more than those at the edges; an FC layer can learn a different weight for each neuron and thus embed this spatial information into the final feature vector.
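The difference between global average pooling and a learned FC layer can be illustrated on a single channel of the final feature map. This is a toy example: the center-emphasizing weights are hypothetical, not weights from the paper:

```python
import numpy as np

# One channel of a final 7x7 feature map (arbitrary values).
feat = np.arange(49, dtype=np.float64).reshape(7, 7) ** 2

# Global average pooling: every spatial position contributes equally.
gap = feat.mean()

# An FC layer can assign a distinct weight per position,
# e.g. emphasizing the central 3x3 region (hypothetical pattern).
w = np.ones((7, 7))
w[2:5, 2:5] = 3.0
fc = (feat * w).sum() / w.sum()

print(gap, fc)  # the two summaries differ once weights are non-uniform
```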
2 Experiments
Training is first performed on the refined MS-Celeb-1M dataset (3.8 million images, 85K identities), and the results are then evaluated on the LFW and MegaFace datasets.
2.1 Implementation details
In the preprocessing stage, the MTCNN model is used for face detection and five-point landmark detection. Faces are then aligned to 112x112x3 and normalized by subtracting 127.5 and dividing by 128. Training uses SGD with a batch size of 1024 and momentum of 0.9. The learning rate is divided by 10 at 40K, 60K, and 80K iterations, for 100K iterations in total.
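The normalization step can be written out directly. A minimal sketch of the stated preprocessing; MTCNN detection and alignment themselves are omitted:

```python
import numpy as np

def preprocess(face_crop):
    """Normalize an aligned 112x112x3 uint8 face crop: (x - 127.5) / 128."""
    return (face_crop.astype(np.float32) - 127.5) / 128.0

x = preprocess(np.zeros((112, 112, 3), dtype=np.uint8))
print(x.shape, float(x.min()))  # values end up roughly in [-1, 1)
```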
2.2 Face verification accuracy
Table 2 shows the benchmark results on LFW.
Table 3 shows the verification results on the MegaFace dataset.
References:
[1] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573, 2018.
[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.
[3] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
[4] M. Courbariaux, Y. Bengio, and J. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
[5] J. Deng, W. Dong, R. Socher, L. jia Li, K. Li, and L. Fei-fei. Imagenet: A large-scale hierarchical image database. In In CVPR, 2009.
[6] C. N. Duong, K. Luu, K. Quach, and T. Bui. Beyond principal components: Deep boltzmann machines for face modeling. In CVPR, 2015.
[7] C. N. Duong, K. Luu, K. Quach, and T. Bui. Longitudinal face modeling via temporal deep restricted boltzmann machines. In CVPR, 2016.
[8] C. N. Duong, K. Luu, K. Quach, and T. Bui. Deep appearance models: A deep boltzmann machine approach for face modeling. Intl Journal of Computer Vision (IJCV), 2018.
[9] C. N. Duong, K. G. Quach, K. Luu, T. H. N. Le, and M. Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. In ICCV, 2017.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014.
[11] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[12] M. S. H. N. Le, R. Gummadi. Deep recurrent level set for segmenting brain tumors. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 646–653. Springer, 2018.
[13] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
[14] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
[21] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database forstudying face recognition in unconstrained environments. In Workshop on faces in’Real-Life’Images: detection, alignment, and recognition, 2008.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456. JMLR.org, 2015.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678. ACM, 2014.
[25] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[28] H. N. Le, C. N. Duong, K. Luu, and M. Savvides. Deep contextual recurrent residual networks for scene labeling. In Journal of Pattern Recognition, 2018.
[29] H. N. Le, K. G. Quach, K. Luu, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. In Trans. on Image Processing (TIP), 2018.
[30] H. N. Le, C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Robust hand detection in vehicles. In Intl. Conf. on Pattern Recognition (ICPR), 2016.
[31] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. 2017 IEEE Conference on CVPR, pages 7341–7349, 2017.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[35] Z. Qin, Z. Zhang, X. Chen, C. Wang, and Y. Peng. Fd-mobilenet: Improved mobilenet with a fast downsampling strategy. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1363–1367. IEEE, 2018.
[36] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[38] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[39] M. W. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, 2007.
[40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[41] L. Sifre. Rigid-motion scattering for image classification, 2014.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[43] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] Y. Wei, X. Pan, H. Qin, and J. Yan. Quantization mimic: Towards very tiny cnn for object detection. CoRR, abs/1805.02152, 2018.
[45] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[46] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[47] Y. Zheng, C. Zhu, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Towards a deep learning framework for unconstrained face detection. In BTAS, 2016.
[48] C. Zhu, Y. Ran, K. Luu, and M. Savvides. Seeing small faces from robust anchor’s perspective. In CVPR, 2018.
[49] C. Zhu, Y. Zheng, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Weakly supervised facial analysis with dense hyper-column features. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2016.
[50] C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Enhancing interior and exterior deep facial features for face detection in the wild. In Intl Conf. on Automatic Face and Gesture Recognition (FG), 2018.