Weilin Huang: [ECCV2016] Detecting Text in Natural Image with Connectionist Text Proposal Network

Authors and Related Links

Homepages: Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao
Brief author information:

Paper download: paper link
Code download: code link

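Key Ideas and Motivation

Text detection differs from generic object detection: a text line is a sequence (of characters, parts of characters, or groups of characters) rather than a single isolated object. This is both an advantage and a difficulty. The advantage is that characters on the same text line can use one another as context, so the line can be modeled with sequence methods such as an RNN. The difficulty is that a complete text line has to be detected as a whole, even though characters on the same line may look very different and lie far apart, which is harder than detecting a single object. For this reason, the authors argue that predicting the vertical position of text (the top and bottom boundaries of the bounding box) is easier than predicting its horizontal position (the left and right boundaries).
Top-down text detection (detect text regions first, then find text lines) works better than traditional bottom-up detection (detect characters first, then link them into text lines). The drawbacks of bottom-up methods (explained more clearly in another paper by the authors) can be summarized as follows: they ignore context, are not robust, need too many sub-modules, are overly complex, and accumulate errors from stage to stage, all of which limits performance.
Seamlessly combining an RNN with a CNN can improve detection accuracy. The CNN extracts deep features, and the RNN performs sequence-based classification on those features (2 classes). Combined in a single model, they perform better on detection.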

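Method Overview

The basic pipeline is shown in Fig. 1. Detection proceeds in six steps:
- First, the first five conv stages of VGG16 (up to conv5) produce a feature map (W*H*C)
- Second, at each position of the conv5 feature map, the features of a 3*3*C window are taken. These features are used to predict the class and location of the k anchors at that position (anchors are defined much as in Faster R-CNN).
- Third, the 3*3*C features of all windows in one row (W*3*3*C) are fed into an RNN (BLSTM), which produces a W*256 output
- Fourth, the W*256 RNN output is fed into a 512-dimensional fc layer
- Fifth, the fc features are fed into three classification or regression layers. The second output, the 2k scores, gives the class of each of the k anchors (text or non-text). The first output, the 2k vertical coordinates, and the third, the k side-refinement values, regress the positions of the k anchors. The 2k vertical coordinates encode the height of the bounding box and the y-coordinate of its center (which together determine the top and bottom boundaries); the k side-refinement values are horizontal offsets of the bounding box. Note that only three parameters describe each regressed box, because every anchor has a fixed width of 16 pixels (the stride of VGG16 at conv5 is 16). The regressed boxes are the thin red rectangles in Fig. 1; their width is constant.
- Sixth, a simple text-line construction algorithm merges the proposals classified as text (the thin rectangles in Fig. 1(b)) into text lines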
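
To make the six steps easier to follow, here is a minimal sketch that only tracks tensor shapes through the pipeline; it does not run any network. The function name `ctpn_shapes`, the example image size of 600 x 960, and the use of floor division for the feature-map size are assumptions made for illustration. C = 512 is the channel count of VGG16's conv5; the stride of 16, the 256-D RNN output, the 512-D fc layer, and k = 10 come from the notes above.

```python
def ctpn_shapes(img_h, img_w, k=10, channels=512, stride=16,
                rnn_dim=256, fc_dim=512):
    """Return the (H, W, depth) shape after each stage for one image.

    Illustrative only: sizes are computed arithmetically, no layers are run.
    """
    h, w = img_h // stride, img_w // stride  # conv5 feature-map size (assumed floor)
    return {
        "conv5": (h, w, channels),             # step 1: VGG16 up to conv5
        "windows": (h, w, 3 * 3 * channels),   # step 2: one 3x3xC window per position
        "rnn_out": (h, w, rnn_dim),            # step 3: each row of W windows -> W x 256
        "fc": (h, w, fc_dim),                  # step 4: 512-D fc layer
        "vertical_coords": (h, w, 2 * k),      # step 5: center-y and height per anchor
        "scores": (h, w, 2 * k),               # step 5: text / non-text per anchor
        "side_refinement": (h, w, k),          # step 5: one horizontal offset per anchor
    }


shapes = ctpn_shapes(600, 960)  # hypothetical input size
print(shapes["conv5"])          # (37, 60, 512)
```

Step 6 (text-line construction) works on the decoded boxes rather than on tensors, so it has no entry here.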

Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. Color of each box indicates the text/non-text score. Only the boxes with positive scores are presented.

Method Details

Detecting Text in Fine-scale proposals
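- Scales and aspect ratios of the k anchors: every anchor is 16 pixels wide, k = 10, and the heights range from 11 to 273 pixels (each height is the previous one divided by 0.7)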
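- The regressed height and center y-coordinate of the bounding box are defined as follows in the paper; quantities marked with * are ground truth, and those with superscript a belong to the anchor:
  $$v_c = (c_y - c_y^a) / h^a, \qquad v_h = \log(h / h^a)$$
  $$v_c^* = (c_y^* - c_y^a) / h^a, \qquad v_h^* = \log(h^* / h^a)$$
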
- Score threshold: 0.7 (followed by NMS)
- Comparison of the detections produced by a generic RPN and by the proposed method
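
The anchor heights and the vertical regression targets can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the heights come from applying the "divide by 0.7" rule literally, so intermediate values may differ from the rounded heights used in released implementations. The targets follow the CTPN paper's parameterization: a center offset normalized by the anchor height, and a log height ratio.

```python
import math

ANCHOR_WIDTH = 16  # fixed anchor width in pixels, as stated in the notes


def anchor_heights(k=10, min_height=11.0, ratio=0.7):
    """k anchor heights, each one the previous height divided by `ratio`."""
    return [min_height / ratio ** i for i in range(k)]


def encode_vertical(cy, h, cy_a, h_a):
    """Regression targets for a box (center cy, height h) relative to an anchor.

    v_c = (cy - cy_a) / h_a,  v_h = log(h / h_a)
    """
    return (cy - cy_a) / h_a, math.log(h / h_a)


def decode_vertical(v_c, v_h, cy_a, h_a):
    """Invert encode_vertical and return the (top, bottom) y-coordinates."""
    cy = v_c * h_a + cy_a
    h = math.exp(v_h) * h_a
    return cy - h / 2.0, cy + h / 2.0


heights = anchor_heights()
print(round(heights[0]), round(heights[-1]))  # 11 273
```

Only the vertical extent is regressed here. Each proposal's horizontal extent stays tied to its 16-pixel anchor column.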

Recurrent Connectionist Text Proposals
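- RNN type: BLSTM (bidirectional LSTM); each of the two LSTMs has 128 hidden units
- RNN input: the 3*3*C feature of each sliding window (which can be flattened into a vector); the features of the windows in one row form a sequence
- RNN output: a 256-dimensional feature for each window
- Comparison of results with and without the RNN; CTPN is the proposed method (Connectionist Text Proposal Network)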
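
The way the RNN input is assembled can be sketched without any deep-learning library. The function below turns an H x W x C feature map into H sequences, one per row, each holding W flattened 3x3xC window vectors. The name `row_sequences`, the nested-list representation, and the zero padding at the borders are assumptions made for illustration. The BLSTM itself is not implemented here; each returned sequence is what would be fed to it.

```python
def row_sequences(fmap):
    """Build one sequence of flattened 3x3xC windows per feature-map row.

    fmap: nested lists of shape H x W x C.
    Returns H sequences, each with W vectors of length 9 * C.
    Positions outside the map are filled with zeros (an assumed padding).
    """
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    zeros = [0.0] * C
    sequences = []
    for y in range(H):
        seq = []
        for x in range(W):
            vec = []
            for dy in (-1, 0, 1):          # window rows: above, center, below
                for dx in (-1, 0, 1):      # window columns: left, center, right
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < H and 0 <= xx < W
                    vec.extend(fmap[yy][xx] if inside else zeros)
            seq.append(vec)
        sequences.append(seq)
    return sequences


# Tiny 2 x 3 x 1 feature map whose values encode their position.
fmap = [[[float(10 * y + x)] for x in range(3)] for y in range(2)]
seqs = row_sequences(fmap)
print(len(seqs), len(seqs[0]), len(seqs[0][0]))  # 2 3 9
```

Because every row becomes its own sequence, the recurrent layer shares context only horizontally, which matches the idea that a text line is a left-to-right sequence.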

Side-refinement
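- Text-line construction algorithm (merging several thin proposals into one text line)
  - Main idea: every two neighboring proposals form a pair, and pairs are merged until no further merging is possible (no two pairs share an element)
  - Conditions for two proposals Bi and Bj to form a pair:
    1. Bj->Bi and Bi->Bj (Bj->Bi means that Bj is the best neighbor of Bi)
    2. Condition 1 for Bj->Bi: Bj is the nearest of Bi's neighbors, and the distance is less than 50 pixels
    3. Condition 2 for Bj->Bi: the vertical overlap between Bj and Bi is greater than 0.7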
- Fixing the width and horizontal position of the regressed boxes makes the predicted horizontal position inaccurate, so the authors introduce side-refinement to regress the horizontal position:
  $$o = (x_{side} - c_x^a) / w^a, \qquad o^* = (x_{side}^* - c_x^a) / w^a$$
  where x_{side} is the predicted x-coordinate of the nearest horizontal side (e.g., left or right side) to the current anchor, x_{side}^* is the ground truth (GT) side coordinate in the x-axis, which is pre-computed from the GT bounding box and anchor location, c_x^a is the center of the anchor in the x-axis, and w^a is the width of the anchor, which is fixed, w^a = 16

- Comparison of results with and without side-refinement
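
The pairing-and-merging rule for text-line construction can be sketched as follows. This is one possible reading of the rule, not the authors' implementation, and several details are assumptions: boxes are `(x1, y1, x2, y2)` tuples, horizontal distance is measured between left edges, vertical overlap is normalized by the smaller box height, and "best neighbor" is checked directionally (the nearest box to the right of Bi must have Bi as its nearest box to the left), since a direction-free mutual check would break chains of more than two proposals. All function names are hypothetical. Proposals that pair with nothing are returned as one-box lines.

```python
def vertical_overlap(a, b):
    """Overlap of the two y-ranges divided by the smaller height (assumed definition)."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, inter) / min(a[3] - a[1], b[3] - b[1])


def best_neighbor(boxes, i, direction, max_gap=50, min_overlap=0.7):
    """Index of the nearest qualifying neighbor of box i on one side, or None.

    direction = +1 searches to the right, -1 to the left.
    """
    best, best_dist = None, None
    for j, b in enumerate(boxes):
        dist = (b[0] - boxes[i][0]) * direction  # positive only on the requested side
        if j == i or dist <= 0 or dist >= max_gap:
            continue
        if vertical_overlap(boxes[i], b) <= min_overlap:
            continue
        if best is None or dist < best_dist:
            best, best_dist = j, dist
    return best


def build_text_lines(boxes, max_gap=50, min_overlap=0.7):
    """Merge mutually best-neighbor proposals and return one box per text line."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        j = best_neighbor(boxes, i, +1, max_gap, min_overlap)
        # Keep the pair only if the relation holds in both directions.
        if j is not None and best_neighbor(boxes, j, -1, max_gap, min_overlap) == i:
            parent[find(j)] = find(i)

    groups = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)
    # Each text line is the bounding rectangle of its member proposals.
    return sorted(
        (min(b[0] for b in g), min(b[1] for b in g),
         max(b[2] for b in g), max(b[3] for b in g))
        for g in groups.values()
    )


proposals = [(0, 10, 16, 40), (16, 12, 32, 42), (32, 11, 48, 41), (200, 10, 216, 40)]
print(build_text_lines(proposals))  # [(0, 10, 48, 42), (200, 10, 216, 40)]
```

Union-find is used here only as a convenient way to express "merge pairs until none share an element"; any connected-components routine would do the same job.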

Experimental Results

Runtime: 0.14 s per image with a GPU
Detection results on the ICDAR2011, ICDAR2013, and ICDAR2015 datasets

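Summary and Takeaways

The main highlight of this paper is that it brings an RNN into the detection problem (RNNs had mostly been used for recognition before). For text detection, a CNN first extracts deep features, and fixed-width anchors are then used to detect text proposals (parts of a text line). The features of the anchors in one row are chained into a sequence and fed into the RNN, a fully-connected layer then performs classification and regression, and the proposals classified as text are merged into text lines. This seamless combination of RNN and CNN improves detection accuracy.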

