CenterNet was proposed in the 2019 paper Objects as Points. Unlike detectors such as YOLO, SSD, and Faster R-CNN, which rely on large numbers of anchors, CenterNet is an anchor-free object detection network with a good balance of speed and accuracy, so it is well worth studying.
Understanding CenterNet comes down to four aspects: the network structure, heatmap generation, data augmentation, and the loss function.
1. CenterNet network structure
Besides detection, CenterNet can also be used for pose estimation, 3D object detection, and so on. The paper therefore evaluates three backbones: ResNet-18, DLA-34, and Hourglass-104. Their accuracy and speed are:
- ResNet-18 with up-convolutional layers: 28.1% COCO AP at 142 FPS
- DLA-34: 37.4% COCO AP at 52 FPS
- Hourglass-104: 45.1% COCO AP at 1.4 FPS
In my own work I mainly use CenterNet for object detection, usually with ResNet-50 as the backbone, so this post focuses on resnet50_center_net. Its network structure is shown below:

As you can see, the CenterNet network is fairly simple: ResNet-50 extracts image features, a deconvolution module Deconv (three deconvolutions) upsamples the feature map, and finally three branch convolution heads predict the heatmap, the object width/height, and the offset of the object center. The deconvolution module deserves attention: it contains three deconvolution groups, each consisting of a 3x3 convolution followed by a deconvolution, and each deconvolution doubles the feature-map size. Many implementations replace the 3x3 convolution before each deconvolution with DCNv2 (Deformable ConvNets V2) to improve the model's fitting capacity.
For DCN (Deformable ConvNets) see: https://zhuanlan.zhihu.com/p/37578271, https://zhuanlan.zhihu.com/p/53127011
The computation flow of the CenterNet model is as follows:
- The image is resized to 512x512 (the long side is scaled to 512, the short side is padded with zeros), giving a 1x3x512x512 tensor that is fed to the network.
- ResNet-50 extracts features, producing feature1 with shape 1x2048x16x16.
- feature1 goes through the deconvolution module Deconv, which upsamples three times and produces feature2 with shape 1x64x128x128.
- feature2 is fed to three prediction branches: the heatmap branch outputs 1x80x128x128 (80 classes), the size branch outputs 1x2x128x128 (width and height), and the offset branch outputs 1x2x128x128 (x and y offsets of the center), as traced in the sketch below.
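The shape bookkeeping above can be checked with a small gluon sketch. This is a hypothetical re-implementation of the head structure described here (plain 3x3 convolutions instead of DCNv2, made-up channel counts 256/128/64), not the official gluoncv model:

import mxnet as mx
from mxnet.gluon import nn

# Deconv module: three groups of (3x3 conv -> deconv), each deconv doubles H and W.
deconv = nn.HybridSequential()
for channels in (256, 128, 64):
    deconv.add(nn.Conv2D(channels, kernel_size=3, padding=1),   # often replaced by DCNv2
               nn.BatchNorm(), nn.Activation('relu'),
               nn.Conv2DTranspose(channels, kernel_size=4, strides=2, padding=1),
               nn.BatchNorm(), nn.Activation('relu'))

heatmap_head = nn.Conv2D(80, kernel_size=1)  # 80 COCO classes
wh_head = nn.Conv2D(2, kernel_size=1)        # width and height
offset_head = nn.Conv2D(2, kernel_size=1)    # x/y offset of the center

for blk in (deconv, heatmap_head, wh_head, offset_head):
    blk.initialize()

feature1 = mx.nd.random.uniform(shape=(1, 2048, 16, 16))  # ResNet-50 output for a 512x512 input
feature2 = deconv(feature1)
print(feature2.shape)                 # (1, 64, 128, 128)
print(heatmap_head(feature2).shape,   # (1, 80, 128, 128)
      wh_head(feature2).shape,        # (1, 2, 128, 128)
      offset_head(feature2).shape)    # (1, 2, 128, 128)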
I have not tried the other two backbones yet and will write about them later.
DLA-34 is Deep Layer Aggregation; for an explanation see: https://cloud.tencent.com/developer/article/1676834
The Hourglass network is mainly used for human pose estimation; see: https://zhuanlan.zhihu.com/p/45002720
2. Understanding and generating the heatmap
2.1 Heatmap generation
CenterNet detects an object as a point: the center of the object's box represents the object, and the network predicts the center-point offset (offset) and the width/height (size) to recover the actual box, while the heatmap carries the classification information. There is one heatmap per class; on each heatmap, wherever an object's center falls, a keypoint (represented by a 2D Gaussian) is placed at that coordinate, as shown below:

The steps for producing the heatmap are explained below.
In the figure below, the left image is the rescaled 512x512 image fed to the network, and the right image is the generated 128x128 heatmap (the heatmap predicted by the network is at the 128x128 scale). The steps are:
1. Scale the target box to the 128x128 scale, compute the center of the box, and round it to integer coordinates; call this point.
2. Compute the radius R of the Gaussian from the size of the target box.
3. On the heatmap, fill in Gaussian values centered at point within radius R (the value is maximal at point and decays outward along the radius according to the Gaussian).
(Note: since both objects are cats and belong to the same class, their keypoints land on the same heatmap. If there were also a dog, its keypoint would go on a different heatmap. A minimal numpy sketch of these steps follows below.)
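A minimal numpy sketch of the three steps, using made-up values for the box, the radius, and a single-class 128x128 heatmap; clipping near the image border is omitted here (the full handling is in _draw_umich_gaussian in the gluoncv excerpt of section 3):

import numpy as np

def gaussian_2d(radius, sigma):
    """(2R+1)x(2R+1) Gaussian patch with its peak (value 1) at the centre."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return np.exp(-(x * x + y * y) / (2 * sigma * sigma))

def add_keypoint(heatmap, box, radius, stride=4):
    """heatmap: (128, 128) array for one class; box: (x1, y1, x2, y2) in 512x512 input coordinates."""
    bx1, by1, bx2, by2 = [v / stride for v in box]           # step 1: map the box to the 128x128 scale
    cx, cy = int((bx1 + bx2) / 2), int((by1 + by2) / 2)      # step 1: integer centre ("point")
    g = gaussian_2d(radius, sigma=(2 * radius + 1) / 6)      # step 3: Gaussian values, max at the centre
    y0, y1, x0, x1 = cy - radius, cy + radius + 1, cx - radius, cx + radius + 1
    # keep the element-wise maximum so overlapping keypoints of the same class do not erase each other
    np.maximum(heatmap[y0:y1, x0:x1], g, out=heatmap[y0:y1, x0:x1])
    return heatmap

heatmap = np.zeros((128, 128), dtype=np.float32)             # one heatmap per class
add_keypoint(heatmap, box=(100, 120, 300, 360), radius=8)    # step 2: radius comes from section 2.2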

2.2 Determining the radius of the Gaussian
The keypoints on the heatmap are represented with a 2D Gaussian kernel because, for points near an object's center, the boxes predicted from them may still have an IoU above 0.7 with the gt_box. Such predictions should not be penalized as harshly as points far from the center, so a Gaussian kernel is used to soften the penalty. Borrowing an explanation from others, see the figure below:

The Gaussian radius is determined mainly by the width and height of the target box, and is computed as shown in the figure below. In practice IoU = 0.7 (overlap = 0.7 in the figure) is taken as the critical value; a radius is computed for each of the three cases and the minimum is used as the Gaussian radius r. A worked numerical example follows the reference link below.

Reference: https://zhuanlan.zhihu.com/p/96856635?utm_source=wechat_session
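As a hedged worked example, take a hypothetical box of width 60 and height 40 on the heatmap with overlap = 0.7. The three quadratics below mirror the _gaussian_radius helper in the gluoncv excerpt of section 3 (including its choice of the '+' root and the division by 2):

import numpy as np

h, w, overlap = 40.0, 60.0, 0.7   # hypothetical box size on the 128x128 heatmap

# case 1
b1, c1 = h + w, w * h * (1 - overlap) / (1 + overlap)
r1 = (b1 + np.sqrt(b1 ** 2 - 4 * c1)) / 2            # ~95.6

# case 2
a2, b2, c2 = 4, 2 * (h + w), (1 - overlap) * w * h
r2 = (b2 + np.sqrt(b2 ** 2 - 4 * a2 * c2)) / 2       # ~184.4

# case 3
a3, b3, c3 = 4 * overlap, -2 * overlap * (h + w), (overlap - 1) * w * h
r3 = (b3 + np.sqrt(b3 ** 2 - 4 * a3 * c3)) / 2       # ~13.2

print(min(r1, r2, r3))   # the smallest value (~13) is used as the Gaussian radius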
3. Data augmentation
Another noteworthy part of CenterNet is its data augmentation, which uses an affine transform (warpAffine). In essence, a region is cropped from the original image and then resized to 512x512 (long side scaled, short side padded with zeros). In practice a center point and a crop size are chosen first, and the affine transform is then applied. As shown in the figure below, the region inside the green box is cropped out and resized to 512x512 (the actual result is the first of the six sub-images in the second figure).

Below are samples obtained by applying the affine transform with different center points and crop sizes. Besides the center point and crop size, the affine transform can also take a rotation angle, but CenterNet does not use one (it is set to 0 in the code): adding rotation makes the gt_box inaccurate, as the two rotated samples on the far right show.
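A simplified sketch of this crop-and-resize step, assuming a square crop of side s centered at c, no rotation, and a hypothetical image file; gluoncv wraps the same idea in tbbox.get_affine_transform, as the excerpt further below shows:

import cv2
import numpy as np

def crop_resize(img, c, s, dst_size=512):
    """Map a square crop of side s centered at c onto a dst_size x dst_size input image."""
    src = np.float32([[c[0] - s / 2, c[1] - s / 2],   # top-left of the crop
                      [c[0] + s / 2, c[1] - s / 2],   # top-right
                      [c[0] - s / 2, c[1] + s / 2]])  # bottom-left
    dst = np.float32([[0, 0], [dst_size - 1, 0], [0, dst_size - 1]])
    trans = cv2.getAffineTransform(src, dst)          # 2x3 affine matrix
    out = cv2.warpAffine(img, trans, (dst_size, dst_size), flags=cv2.INTER_LINEAR)
    return out, trans                                 # regions outside the image are zero-padded

img = cv2.imread("two_cats.jpg")                      # hypothetical sample image
h, w = img.shape[:2]
c = np.array([w / 2.0, h / 2.0])                      # randomly jittered during training
s = max(h, w) * np.clip(np.random.randn() * 0.4 + 1, 0.6, 1.4)
inp, trans = crop_resize(img, c, s)
# gt boxes are mapped with the same kind of matrix (at 1/4 scale for the 128x128 targets)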

Below is an excerpt of the CenterNet source code from gluoncv, which shows its preprocessing logic:
"""Transforms described in https://arxiv.org/abs/1904.07850.""" # pylint: disable=too-many-function-args from __future__ import absolute_import import numpy as np import mxnet as mx from gluoncv.data.transforms import bbox as tbbox from gluoncv.data.transforms import image as timage from gluoncv.data.transforms import experimental from gluoncv.utils.filesystem import try_import_cv2 from mxnet import nd from mxnet import gluon class CenterNetTargetGenerator(gluon.Block): """Target generator for CenterNet. Parameters ---------- num_class : int Number of categories. output_width : int Width of the network output. output_height : int Height of the network output. """ def __init__(self, num_class, output_width, output_height): super(CenterNetTargetGenerator, self).__init__() self._num_class = num_class self._output_width = int(output_width) self._output_height = int(output_height) def forward(self, gt_boxes, gt_ids): """Target generation""" # pylint: disable=arguments-differ h_scale = 1.0 # already scaled in affinetransform w_scale = 1.0 # already scaled in affinetransform heatmap = np.zeros((self._num_class, self._output_height, self._output_width), dtype=np.float32) wh_target = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) wh_mask = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) center_reg = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) center_reg_mask = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) for bbox, cid in zip(gt_boxes, gt_ids): cid = int(cid) box_h, box_w = bbox[3] - bbox[1], bbox[2] - bbox[0] if box_h > 0 and box_w > 0: radius = _gaussian_radius((np.ceil(box_h), np.ceil(box_w))) radius = max(0, int(radius)) center = np.array( [(bbox[0] + bbox[2]) / 2 * w_scale, (bbox[1] + bbox[3]) / 2 * h_scale], dtype=np.float32) center_int = center.astype(np.int32) #浮點數變為整數 center_x, center_y = center_int assert center_x < self._output_width, \ 'center_x: {} > output_width: {}'.format(center_x, self._output_width) assert center_y < self._output_height, \ 'center_y: {} > output_height: {}'.format(center_y, self._output_height) _draw_umich_gaussian(heatmap[cid], center_int, radius) print(radius) wh_target[0, center_y, center_x] = box_w * w_scale wh_target[1, center_y, center_x] = box_h * h_scale wh_mask[:, center_y, center_x] = 1.0 center_reg[:, center_y, center_x] = center - center_int #偏移量為浮點數相對於整數坐標的偏移量 center_reg_mask[:, center_y, center_x] = 1.0 return tuple([nd.array(x) for x in \ (heatmap, wh_target, wh_mask, center_reg, center_reg_mask)]) def _gaussian_radius(det_size, min_overlap=0.7): """Calculate gaussian radius for foreground objects. Parameters ---------- det_size : tuple of int Object size (h, w). min_overlap : float Minimal overlap between objects. Returns ------- float Gaussian radius. """ height, width = det_size a1 = 1 b1 = (height + width) c1 = width * height * (1 - min_overlap) / (1 + min_overlap) sq1 = np.sqrt(b1 ** 2 - 4 * a1 * c1) r1 = (b1 + sq1) / 2 a2 = 4 b2 = 2 * (height + width) c2 = (1 - min_overlap) * width * height sq2 = np.sqrt(b2 ** 2 - 4 * a2 * c2) r2 = (b2 + sq2) / 2 a3 = 4 * min_overlap b3 = -2 * min_overlap * (height + width) c3 = (min_overlap - 1) * width * height sq3 = np.sqrt(b3 ** 2 - 4 * a3 * c3) r3 = (b3 + sq3) / 2 return min(r1, r2, r3) def _gaussian_2d(shape, sigma=1): """Generate 2d gaussian. Parameters ---------- shape : tuple of int The shape of the gaussian. sigma : float Sigma for gaussian. Returns ------- float 2D gaussian kernel. 
""" m, n = [(ss - 1.) / 2. for ss in shape] y, x = np.ogrid[-m:m+1, -n:n+1] h = np.exp(-(x * x + y * y) / (2 * sigma * sigma)) h[h < np.finfo(h.dtype).eps * h.max()] = 0 return h def _draw_umich_gaussian(heatmap, center, radius, k=1): """Draw a 2D gaussian heatmap. Parameters ---------- heatmap : numpy.ndarray Heatmap to be write inplace. center : tuple of int Center of object (h, w). radius : type The radius of gaussian. Returns ------- numpy.ndarray Drawn gaussian heatmap. """ diameter = 2 * radius + 1 gaussian = _gaussian_2d((diameter, diameter), sigma=diameter / 6) x, y = int(center[0]), int(center[1]) height, width = heatmap.shape[0:2] left, right = min(x, radius), min(width - x, radius + 1) top, bottom = min(y, radius), min(height - y, radius + 1) masked_heatmap = heatmap[y - top:y + bottom, x - left:x + right] masked_gaussian = gaussian[radius - top:radius + bottom, radius - left:radius + right] if min(masked_gaussian.shape) > 0 and min(masked_heatmap.shape) > 0: np.maximum(masked_heatmap, masked_gaussian * k, out=masked_heatmap) return heatmap def transform_test(imgs, short, max_size=1024, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)): """A util function to transform all images to tensors as network input by applying normalizations. This function support 1 NDArray or iterable of NDArrays. Parameters ---------- imgs : NDArray or iterable of NDArray Image(s) to be transformed. short : int Resize image short side to this `short` and keep aspect ratio. max_size : int, optional Maximum longer side length to fit image. This is to limit the input image shape. Aspect ratio is intact because we support arbitrary input size in our SSD implementation. mean : iterable of float Mean pixel values. std : iterable of float Standard deviations of pixel values. Returns ------- (mxnet.NDArray, numpy.ndarray) or list of such tuple A (1, 3, H, W) mxnet NDArray as input to network, and a numpy ndarray as original un-normalized color image for display. If multiple image names are supplied, return two lists. You can use `zip()`` to collapse it. """ if isinstance(imgs, mx.nd.NDArray): imgs = [imgs] for im in imgs: assert isinstance(im, mx.nd.NDArray), "Expect NDArray, got {}".format(type(im)) tensors = [] origs = [] for img in imgs: img = timage.resize_short_within(img, short, max_size) orig_img = img.asnumpy().astype('uint8') img = mx.nd.image.to_tensor(img) img = mx.nd.image.normalize(img, mean=mean, std=std) tensors.append(img.expand_dims(0)) origs.append(orig_img) if len(tensors) == 1: return tensors[0], origs[0] return tensors, origs def load_test(filenames, short, max_size=1024, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)): """A util function to load all images, transform them to tensor by applying normalizations. This function support 1 filename or iterable of filenames. Parameters ---------- filenames : str or list of str Image filename(s) to be loaded. short : int Resize image short side to this `short` and keep aspect ratio. max_size : int, optional Maximum longer side length to fit image. This is to limit the input image shape. Aspect ratio is intact because we support arbitrary input size in our SSD implementation. mean : iterable of float Mean pixel values. std : iterable of float Standard deviations of pixel values. Returns ------- (mxnet.NDArray, numpy.ndarray) or list of such tuple A (1, 3, H, W) mxnet NDArray as input to network, and a numpy ndarray as original un-normalized color image for display. If multiple image names are supplied, return two lists. 
You can use `zip()`` to collapse it. """ if isinstance(filenames, str): filenames = [filenames] imgs = [mx.image.imread(f) for f in filenames] return transform_test(imgs, short, max_size, mean, std) class CenterNetDefaultTrainTransform(object): """Default SSD training transform which includes tons of image augmentations. Parameters ---------- width : int Image width. height : int Image height. num_class : int Number of categories scale_factor : int, default is 4 The downsampling scale factor between input image and output heatmap mean : array-like of size 3 Mean pixel values to be subtracted from image tensor. Default is [0.485, 0.456, 0.406]. std : array-like of size 3 Standard deviation to be divided from image. Default is [0.229, 0.224, 0.225]. """ def __init__(self, width, height, num_class, scale_factor=4, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), **kwargs): self._kwargs = kwargs self._width = width self._height = height self._num_class = num_class self._scale_factor = scale_factor self._mean = np.array(mean, dtype=np.float32).reshape(1, 1, 3) self._std = np.array(std, dtype=np.float32).reshape(1, 1, 3) self._data_rng = np.random.RandomState(123) self._eig_val = np.array([0.2141788, 0.01817699, 0.00341571], dtype=np.float32) self._eig_vec = np.array([ [-0.58752847, -0.69563484, 0.41340352], [-0.5832747, 0.00994535, -0.81221408], [-0.56089297, 0.71832671, 0.41158938] ], dtype=np.float32) self._target_generator = CenterNetTargetGenerator( num_class, width // scale_factor, height // scale_factor) def __call__(self, src, label): """Apply transform to training image/label.""" # random color jittering img = src bbox = label # random horizontal flip h, w, _ = img.shape # img, flips = timage.random_flip(img, px=0.5) # bbox = tbbox.flip(bbox, (w, h), flip_x=flips[0]) cv2 = try_import_cv2() input_h, input_w = self._height, self._width s = max(h, w) * 1.0 c = np.array([w / 2., h / 2.], dtype=np.float32) sf = 0.4 w_border = _get_border(128, img.shape[1]) h_border = _get_border(128, img.shape[0]) c[0] = np.random.randint(low=w_border, high=img.shape[1] - w_border) # 裁剪后目標的中心點 c[1] = np.random.randint(low=h_border, high=img.shape[0] - h_border) s = s * np.clip(np.random.randn() * sf + 1, 1 - sf, 1 + sf) # 裁剪后圖片的長寬 # a = random.randint(-10, 10) #裁剪后圖片的旋轉角度 # trans_input = tbbox.get_affine_transform(c, s, a, [input_w, input_h]) trans_input = tbbox.get_affine_transform(c, s, 0, [input_w, input_h]) #根據中心點和長寬,計算變換矩陣 inp = cv2.warpAffine(img.asnumpy(), trans_input, (input_w, input_h), flags=cv2.INTER_LINEAR) #對圖片進行仿射變換(主要是裁剪和resize) output_w = input_w // self._scale_factor output_h = input_h // self._scale_factor trans_output = tbbox.get_affine_transform(c, s, 0, [output_w, output_h]) for i in range(bbox.shape[0]): bbox[i, :2] = tbbox.affine_transform(bbox[i, :2], trans_output) bbox[i, 2:4] = tbbox.affine_transform(bbox[i, 2:4], trans_output) bbox[:, :2] = np.clip(bbox[:, :2], 0, output_w - 1) bbox[:, 2:4] = np.clip(bbox[:, 2:4], 0, output_h - 1) img = inp # to tensor img = img.astype(np.float32) / 255. 
experimental.image.np_random_color_distort(img, data_rng=self._data_rng) img = (img - self._mean) / self._std img = img.transpose(2, 0, 1).astype(np.float32) img = mx.nd.array(img) # generate training target so cpu workers can help reduce the workload on gpu gt_bboxes = bbox[:, :4] gt_ids = bbox[:, 4:5] heatmap, wh_target, wh_mask, center_reg, center_reg_mask = self._target_generator( gt_bboxes, gt_ids) return img, heatmap, wh_target, wh_mask, center_reg, center_reg_mask # img=cv2.cvtColor(inp, cv2.COLOR_RGB2BGR) # for i in range(bbox.shape[0]): # bbox[i, :2] = tbbox.affine_transform(bbox[i, :2], trans_input) # bbox[i, 2:4] = tbbox.affine_transform(bbox[i, 2:4], trans_input) # bbox[:, :2] = np.clip(bbox[:, :2], 0, input_w - 1) # bbox[:, 2:4] = np.clip(bbox[:, 2:4], 0, input_h - 1) # # print(bbox, c, s, a, img.shape) # for i in range(len(bbox)): # cv2.rectangle(img, tuple(bbox[i][:2]), tuple(bbox[i][2:4]), (0, 255, 0), 2) # cv2.imshow("img", img) # cv2.waitKey(0) # cv2.destroyAllWindows() def _get_border(border, size): """Get the border size of the image""" i = 1 while size - border // i <= border // i: i *= 2 return border // i if __name__ == "__main__": img_path = r"E:\two_cats.jpg" transform = CenterNetDefaultTrainTransform( 512, 512, num_class=3, scale_factor=4) img = mx.image.imread(img_path) label = np.array([[112, 335, 1360, 980, 1], [1140, 151, 1509, 774, 1]]).reshape(-1, 5) transform(img, label)
4. Understanding the loss function
The CenterNet loss has three parts: the heatmap loss $L_K$, the width/height prediction loss $L_{size}$, and the center-point offset loss $L_{off}$. $L_K$ uses a modified focal loss, while $L_{size}$ and $L_{off}$ both use an L1 loss, and $L_{size}$ is additionally weighted by 0.1:

$$ L_{det} = L_K + \lambda_{size} L_{size} + \lambda_{off} L_{off}, \qquad \lambda_{size} = 0.1,\ \lambda_{off} = 1 $$
Heatmap loss
The heatmap loss is a modified focal loss, given below. $\alpha$ and $\beta$ are hyper-parameters that balance hard/easy and positive/negative samples, $N$ is the number of keypoints in the image (the number of positives) and normalizes the total positive focal loss to 1, and the subscript $xyc$ of the sum runs over all coordinates of all heatmaps ($c$ indexes the object class, one heatmap per class). $\hat{Y}_{xyc}$ is the predicted value and $Y_{xyc}$ the ground-truth value:

$$ L_K = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases} $$
For more on focal loss, see: https://www.cnblogs.com/silence-cho/p/12987476.html , https://zhuanlan.zhihu.com/p/66048276
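A minimal numpy sketch of this modified focal loss, assuming the predicted heatmap has already been passed through a sigmoid and using alpha = 2, beta = 4 as in the paper:

import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """pred, gt: (num_class, 128, 128) heatmaps; pred in (0, 1), gt built as in section 2."""
    pos = (gt == 1)                        # keypoint locations (positives)
    neg = ~pos
    num_pos = max(pos.sum(), 1)            # N: number of keypoints

    pos_loss = -((1 - pred[pos]) ** alpha) * np.log(pred[pos] + eps)
    neg_loss = -((1 - gt[neg]) ** beta) * (pred[neg] ** alpha) * np.log(1 - pred[neg] + eps)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos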

Center-point offset loss
The offset loss $L_{off}$ is defined below and is computed only at positive (center) locations. $\hat{O}_{\tilde{p}}$ is the predicted offset, $p$ is the center coordinate of the object in the input image, $R$ is the downsampling factor, and $\tilde{p} = \lfloor p/R \rfloor$ is the (integer) center after downsampling:

$$ L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right| $$

For example, suppose the actual center $p$ of an object is (125, 63). Since the input image is 512x512 and the downsampling factor is $R = 4$, the center at the 128x128 scale is $p/R = (31.25, 15.75)$; relative to the integer coordinates $\tilde{p} = (31, 15)$, the offset target $p/R - \tilde{p}$ is therefore (0.25, 0.75).
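The same arithmetic in a few lines of numpy (hypothetical center point):

import numpy as np

p = np.array([125.0, 63.0])      # center in the 512x512 input image
p_scaled = p / 4.0               # (31.25, 15.75) on the 128x128 output, R = 4
p_tilde = np.floor(p_scaled)     # (31, 15): integer keypoint location
offset = p_scaled - p_tilde      # (0.25, 0.75): regression target for L_off
print(p_tilde, offset)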
Width/height prediction loss
The size loss is defined below and is likewise computed only at positive locations. $\hat{S}_{p_k}$ is the size predicted at the center of object $k$, and $s_k$ is the ground-truth width and height of object $k$:

$$ L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right| $$
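Putting the three terms together, a hedged numpy sketch of the total loss, assuming the masked targets produced by CenterNetTargetGenerator from section 3 and the heatmap_focal_loss sketch above:

import numpy as np

def masked_l1(pred, target, mask, num_pos):
    """L1 loss evaluated only at positive (center) locations marked by mask."""
    return (np.abs(pred - target) * mask).sum() / max(num_pos, 1)

def centernet_loss(hm_pred, hm_gt, wh_pred, wh_target, wh_mask,
                   off_pred, off_target, off_mask):
    num_pos = (hm_gt == 1).sum()                               # N: number of keypoints
    l_k = heatmap_focal_loss(hm_pred, hm_gt)                   # heatmap term (sketch above)
    l_size = masked_l1(wh_pred, wh_target, wh_mask, num_pos)   # width/height term
    l_off = masked_l1(off_pred, off_target, off_mask, num_pos) # center offset term
    return l_k + 0.1 * l_size + 1.0 * l_off                    # lambda_size = 0.1, lambda_off = 1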

References:
https://zhuanlan.zhihu.com/p/96856635?utm_source=wechat_session
https://zhuanlan.zhihu.com/p/165313457?utm_source=wechat_session
https://zhuanlan.zhihu.com/p/66048276
https://zhuanlan.zhihu.com/p/73516696
