CenterNet was proposed in the 2019 paper Objects as Points. Unlike detectors such as YOLO, SSD, and Faster R-CNN, which rely on large numbers of anchors, CenterNet is an anchor-free object detection network with a good balance of speed and accuracy, so it is well worth studying.
Understanding CenterNet comes down to four aspects: the network structure, heatmap generation, data augmentation, and the loss function.
1. CenterNet network structure
Besides detection, CenterNet can also be used for pose estimation, 3D object detection, and so on. The paper therefore evaluates three backbones: ResNet-18, DLA-34, and Hourglass-104. Their accuracy and speed are:
- ResNet-18 with up-convolutional layers: 28.1% COCO AP at 142 FPS
- DLA-34: 37.4% COCO AP at 52 FPS
- Hourglass-104: 45.1% COCO AP at 1.4 FPS
In my own work I mainly use CenterNet for object detection, usually with ResNet-50 as the backbone, so this post focuses on resnet50_center_net. Its network structure is shown below:

As you can see, the CenterNet network is fairly simple: ResNet-50 extracts image features, a deconvolution module Deconv (three deconvolutions) upsamples the feature map, and finally three branch convolution heads predict the heatmap, the object width/height, and the offset of the object center. The deconvolution module deserves attention: it contains three deconvolution groups, each consisting of a 3x3 convolution followed by a deconvolution, and each deconvolution doubles the feature-map size. Many implementations replace the 3x3 convolution before each deconvolution with DCNv2 (Deformable ConvNets V2) to improve the model's fitting capacity.
For DCN (Deformable ConvNets) see: https://zhuanlan.zhihu.com/p/37578271, https://zhuanlan.zhihu.com/p/53127011
The computation flow of the CenterNet model is as follows:
- The image is resized to 512x512 (the long side is scaled to 512, the short side is padded with zeros), giving a 1x3x512x512 tensor that is fed to the network.
- ResNet-50 extracts features, producing feature1 with shape 1x2048x16x16.
- feature1 goes through the deconvolution module Deconv, which upsamples three times and produces feature2 with shape 1x64x128x128.
- feature2 is fed to three prediction branches: the heatmap branch outputs 1x80x128x128 (80 classes), the size branch outputs 1x2x128x128 (width and height), and the offset branch outputs 1x2x128x128 (x and y offsets of the center), as traced in the sketch below.
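The shape bookkeeping above can be checked with a small gluon sketch. This is a hypothetical re-implementation of the head structure described here (plain 3x3 convolutions instead of DCNv2, made-up channel counts 256/128/64), not the official gluoncv model:

import mxnet as mx
from mxnet.gluon import nn

# Deconv module: three groups of (3x3 conv -> deconv), each deconv doubles H and W.
deconv = nn.HybridSequential()
for channels in (256, 128, 64):
    deconv.add(nn.Conv2D(channels, kernel_size=3, padding=1),   # often replaced by DCNv2
               nn.BatchNorm(), nn.Activation('relu'),
               nn.Conv2DTranspose(channels, kernel_size=4, strides=2, padding=1),
               nn.BatchNorm(), nn.Activation('relu'))

heatmap_head = nn.Conv2D(80, kernel_size=1)  # 80 COCO classes
wh_head = nn.Conv2D(2, kernel_size=1)        # width and height
offset_head = nn.Conv2D(2, kernel_size=1)    # x/y offset of the center

for blk in (deconv, heatmap_head, wh_head, offset_head):
    blk.initialize()

feature1 = mx.nd.random.uniform(shape=(1, 2048, 16, 16))  # ResNet-50 output for a 512x512 input
feature2 = deconv(feature1)
print(feature2.shape)                 # (1, 64, 128, 128)
print(heatmap_head(feature2).shape,   # (1, 80, 128, 128)
      wh_head(feature2).shape,        # (1, 2, 128, 128)
      offset_head(feature2).shape)    # (1, 2, 128, 128)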
I have not tried the other two backbones yet and will write about them later.
DLA-34 is Deep Layer Aggregation; for an explanation see: https://cloud.tencent.com/developer/article/1676834
The Hourglass network is mainly used for human pose estimation; see: https://zhuanlan.zhihu.com/p/45002720
2. Understanding and generating the heatmap
2.1 Heatmap generation
CenterNet detects an object as a point: the center of the object's box represents the object, and the network predicts the center-point offset (offset) and the width/height (size) to recover the actual box, while the heatmap carries the classification information. There is one heatmap per class; on each heatmap, wherever an object's center falls, a keypoint (represented by a 2D Gaussian) is placed at that coordinate, as shown below:

The steps for producing the heatmap are explained below.
In the figure below, the left image is the rescaled 512x512 image fed to the network, and the right image is the generated 128x128 heatmap (the heatmap predicted by the network is at the 128x128 scale). The steps are:
1. Scale the target box to the 128x128 scale, compute the center of the box, and round it to integer coordinates; call this point.
2. Compute the radius R of the Gaussian from the size of the target box.
3. On the heatmap, fill in Gaussian values centered at point within radius R (the value is maximal at point and decays outward along the radius according to the Gaussian).
(Note: since both objects are cats and belong to the same class, their keypoints land on the same heatmap. If there were also a dog, its keypoint would go on a different heatmap. A minimal numpy sketch of these steps follows below.)
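A minimal numpy sketch of the three steps, using made-up values for the box, the radius, and a single-class 128x128 heatmap; clipping near the image border is omitted here (the full handling is in _draw_umich_gaussian in the gluoncv excerpt of section 3):

import numpy as np

def gaussian_2d(radius, sigma):
    """(2R+1)x(2R+1) Gaussian patch with its peak (value 1) at the centre."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return np.exp(-(x * x + y * y) / (2 * sigma * sigma))

def add_keypoint(heatmap, box, radius, stride=4):
    """heatmap: (128, 128) array for one class; box: (x1, y1, x2, y2) in 512x512 input coordinates."""
    bx1, by1, bx2, by2 = [v / stride for v in box]           # step 1: map the box to the 128x128 scale
    cx, cy = int((bx1 + bx2) / 2), int((by1 + by2) / 2)      # step 1: integer centre ("point")
    g = gaussian_2d(radius, sigma=(2 * radius + 1) / 6)      # step 3: Gaussian values, max at the centre
    y0, y1, x0, x1 = cy - radius, cy + radius + 1, cx - radius, cx + radius + 1
    # keep the element-wise maximum so overlapping keypoints of the same class do not erase each other
    np.maximum(heatmap[y0:y1, x0:x1], g, out=heatmap[y0:y1, x0:x1])
    return heatmap

heatmap = np.zeros((128, 128), dtype=np.float32)             # one heatmap per class
add_keypoint(heatmap, box=(100, 120, 300, 360), radius=8)    # step 2: radius comes from section 2.2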

2.2 Determining the radius of the Gaussian
The keypoints on the heatmap are represented with a 2D Gaussian kernel because, for points near an object's center, the boxes predicted from them may still have an IoU above 0.7 with the gt_box. Such predictions should not be penalized as harshly as points far from the center, so a Gaussian kernel is used to soften the penalty. Borrowing an explanation from others, see the figure below:

The Gaussian radius is determined mainly by the width and height of the target box, and is computed as shown in the figure below. In practice IoU = 0.7 (overlap = 0.7 in the figure) is taken as the critical value; a radius is computed for each of the three cases and the minimum is used as the Gaussian radius r. A worked numerical example follows the reference link below.

Reference: https://zhuanlan.zhihu.com/p/96856635?utm_source=wechat_session
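As a hedged worked example, take a hypothetical box of width 60 and height 40 on the heatmap with overlap = 0.7. The three quadratics below mirror the _gaussian_radius helper in the gluoncv excerpt of section 3 (including its choice of the '+' root and the division by 2):

import numpy as np

h, w, overlap = 40.0, 60.0, 0.7   # hypothetical box size on the 128x128 heatmap

# case 1
b1, c1 = h + w, w * h * (1 - overlap) / (1 + overlap)
r1 = (b1 + np.sqrt(b1 ** 2 - 4 * c1)) / 2            # ~95.6

# case 2
a2, b2, c2 = 4, 2 * (h + w), (1 - overlap) * w * h
r2 = (b2 + np.sqrt(b2 ** 2 - 4 * a2 * c2)) / 2       # ~184.4

# case 3
a3, b3, c3 = 4 * overlap, -2 * overlap * (h + w), (overlap - 1) * w * h
r3 = (b3 + np.sqrt(b3 ** 2 - 4 * a3 * c3)) / 2       # ~13.2

print(min(r1, r2, r3))   # the smallest value (~13) is used as the Gaussian radius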
3. Data augmentation
Another noteworthy part of CenterNet is its data augmentation, which uses an affine transform (warpAffine). In essence, a region is cropped from the original image and then resized to 512x512 (long side scaled, short side padded with zeros). In practice a center point and a crop size are chosen first, and the affine transform is then applied. As shown in the figure below, the region inside the green box is cropped out and resized to 512x512 (the actual result is the first of the six sub-images in the second figure).

Below are samples obtained by applying the affine transform with different center points and crop sizes. Besides the center point and crop size, the affine transform can also take a rotation angle, but CenterNet does not use one (it is set to 0 in the code): adding rotation makes the gt_box inaccurate, as the two rotated samples on the far right show.
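A simplified sketch of this crop-and-resize step, assuming a square crop of side s centered at c, no rotation, and a hypothetical image file; gluoncv wraps the same idea in tbbox.get_affine_transform, as the excerpt further below shows:

import cv2
import numpy as np

def crop_resize(img, c, s, dst_size=512):
    """Map a square crop of side s centered at c onto a dst_size x dst_size input image."""
    src = np.float32([[c[0] - s / 2, c[1] - s / 2],   # top-left of the crop
                      [c[0] + s / 2, c[1] - s / 2],   # top-right
                      [c[0] - s / 2, c[1] + s / 2]])  # bottom-left
    dst = np.float32([[0, 0], [dst_size - 1, 0], [0, dst_size - 1]])
    trans = cv2.getAffineTransform(src, dst)          # 2x3 affine matrix
    out = cv2.warpAffine(img, trans, (dst_size, dst_size), flags=cv2.INTER_LINEAR)
    return out, trans                                 # regions outside the image are zero-padded

img = cv2.imread("two_cats.jpg")                      # hypothetical sample image
h, w = img.shape[:2]
c = np.array([w / 2.0, h / 2.0])                      # randomly jittered during training
s = max(h, w) * np.clip(np.random.randn() * 0.4 + 1, 0.6, 1.4)
inp, trans = crop_resize(img, c, s)
# gt boxes are mapped with the same kind of matrix (at 1/4 scale for the 128x128 targets)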

Below is an excerpt of the CenterNet source code from gluoncv, which shows its preprocessing logic:
"""Transforms described in https://arxiv.org/abs/1904.07850.""" # pylint: disable=too-many-function-args from __future__ import absolute_import import numpy as np import mxnet as mx from gluoncv.data.transforms import bbox as tbbox from gluoncv.data.transforms import image as timage from gluoncv.data.transforms import experimental from gluoncv.utils.filesystem import try_import_cv2 from mxnet import nd from mxnet import gluon class CenterNetTargetGenerator(gluon.Block): """Target generator for CenterNet. Parameters ---------- num_class : int Number of categories. output_width : int Width of the network output. output_height : int Height of the network output. """ def __init__(self, num_class, output_width, output_height): super(CenterNetTargetGenerator, self).__init__() self._num_class = num_class self._output_width = int(output_width) self._output_height = int(output_height) def forward(self, gt_boxes, gt_ids): """Target generation""" # pylint: disable=arguments-differ h_scale = 1.0 # already scaled in affinetransform w_scale = 1.0 # already scaled in affinetransform heatmap = np.zeros((self._num_class, self._output_height, self._output_width), dtype=np.float32) wh_target = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) wh_mask = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) center_reg = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) center_reg_mask = np.zeros((2, self._output_height, self._output_width), dtype=np.float32) for bbox, cid in zip(gt_boxes, gt_ids): cid = int(cid) box_h, box_w = bbox[3] - bbox[1], bbox[2] - bbox[0] if box_h > 0 and box_w > 0: radius = _gaussian_radius((np.ceil(box_h), np.ceil(box_w))) radius = max(0, int(radius)) center = np.array( [(bbox[0] + bbox[2]) / 2 * w_scale, (bbox[1] + bbox[3]) / 2 * h_scale], dtype=np.float32) center_int = center.astype(np.int32) #浮點數變為整數 center_x, center_y = center_int assert center_x < self._output_width, \ 'center_x: {} > output_width: {}'.format(center_x, self._output_width) assert center_y < self._output_height, \ 'center_y: {} > output_height: {}'.format(center_y, self._output_height) _draw_umich_gaussian(heatmap[cid], center_int, radius) print(radius) wh_target[0, center_y, center_x] = box_w * w_scale wh_target[1, center_y, center_x] = box_h * h_scale wh_mask[:, center_y, center_x] = 1.0 center_reg[:, center_y, center_x] = center - center_int #偏移量為浮點數相對於整數坐標的偏移量 center_reg_mask[:, center_y, center_x] = 1.0 return tuple([nd.array(x) for x in \ (heatmap, wh_target, wh_mask, center_reg, center_reg_mask)]) def _gaussian_radius(det_size, min_overlap=0.7): """Calculate gaussian radius for foreground objects. Parameters ---------- det_size : tuple of int Object size (h, w). min_overlap : float Minimal overlap between objects. Returns ------- float Gaussian radius. """ height, width = det_size a1 = 1 b1 = (height + width) c1 = width * height * (1 - min_overlap) / (1 + min_overlap) sq1 = np.sqrt(b1 ** 2 - 4 * a1 * c1) r1 = (b1 + sq1) / 2 a2 = 4 b2 = 2 * (height + width) c2 = (1 - min_overlap) * width * height sq2 = np.sqrt(b2 ** 2 - 4 * a2 * c2) r2 = (b2 + sq2) / 2 a3 = 4 * min_overlap b3 = -2 * min_overlap * (height + width) c3 = (min_overlap - 1) * width * height sq3 = np.sqrt(b3 ** 2 - 4 * a3 * c3) r3 = (b3 + sq3) / 2 return min(r1, r2, r3) def _gaussian_2d(shape, sigma=1): """Generate 2d gaussian. Parameters ---------- shape : tuple of int The shape of the gaussian. sigma : float Sigma for gaussian. Returns ------- float 2D gaussian kernel. 
""" m, n = [(ss - 1.) / 2. for ss in shape] y, x = np.ogrid[-m:m+1, -n:n+1] h = np.exp(-(x * x + y * y) / (2 * sigma * sigma)) h[h < np.finfo(h.dtype).eps * h.max()] = 0 return h def _draw_umich_gaussian(heatmap, center, radius, k=1): """Draw a 2D gaussian heatmap. Parameters ---------- heatmap : numpy.ndarray Heatmap to be write inplace. center : tuple of int Center of object (h, w). radius : type The radius of gaussian. Returns ------- numpy.ndarray Drawn gaussian heatmap. """ diameter = 2 * radius + 1 gaussian = _gaussian_2d((diameter, diameter), sigma=diameter / 6) x, y = int(center[0]), int(center[1]) height, width = heatmap.shape[0:2] left, right = min(x, radius), min(width - x, radius + 1) top, bottom = min(y, radius), min(height - y, radius + 1) masked_heatmap = heatmap[y - top:y + bottom, x - left:x + right] masked_gaussian = gaussian[radius - top:radius + bottom, radius - left:radius + right] if min(masked_gaussian.shape) > 0 and min(masked_heatmap.shape) > 0: np.maximum(masked_heatmap, masked_gaussian * k, out=masked_heatmap) return heatmap def transform_test(imgs, short, max_size=1024, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)): """A util function to transform all images to tensors as network input by applying normalizations. This function support 1 NDArray or iterable of NDArrays. Parameters ---------- imgs : NDArray or iterable of NDArray Image(s) to be transformed. short : int Resize image short side to this `short` and keep aspect ratio. max_size : int, optional Maximum longer side length to fit image. This is to limit the input image shape. Aspect ratio is intact because we support arbitrary input size in our SSD implementation. mean : iterable of float Mean pixel values. std : iterable of float Standard deviations of pixel values. Returns ------- (mxnet.NDArray, numpy.ndarray) or list of such tuple A (1, 3, H, W) mxnet NDArray as input to network, and a numpy ndarray as original un-normalized color image for display. If multiple image names are supplied, return two lists. You can use `zip()`` to collapse it. """ if isinstance(imgs, mx.nd.NDArray): imgs = [imgs] for im in imgs: assert isinstance(im, mx.nd.NDArray), "Expect NDArray, got {}".format(type(im)) tensors = [] origs = [] for img in imgs: img = timage.resize_short_within(img, short, max_size) orig_img = img.asnumpy().astype('uint8') img = mx.nd.image.to_tensor(img) img = mx.nd.image.normalize(img, mean=mean, std=std) tensors.append(img.expand_dims(0)) origs.append(orig_img) if len(tensors) == 1: return tensors[0], origs[0] return tensors, origs def load_test(filenames, short, max_size=1024, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)): """A util function to load all images, transform them to tensor by applying normalizations. This function support 1 filename or iterable of filenames. Parameters ---------- filenames : str or list of str Image filename(s) to be loaded. short : int Resize image short side to this `short` and keep aspect ratio. max_size : int, optional Maximum longer side length to fit image. This is to limit the input image shape. Aspect ratio is intact because we support arbitrary input size in our SSD implementation. mean : iterable of float Mean pixel values. std : iterable of float Standard deviations of pixel values. Returns ------- (mxnet.NDArray, numpy.ndarray) or list of such tuple A (1, 3, H, W) mxnet NDArray as input to network, and a numpy ndarray as original un-normalized color image for display. If multiple image names are supplied, return two lists. 
You can use `zip()`` to collapse it. """ if isinstance(filenames, str): filenames = [filenames] imgs = [mx.image.imread(f) for f in filenames] return transform_test(imgs, short, max_size, mean, std) class CenterNetDefaultTrainTransform(object): """Default SSD training transform which includes tons of image augmentations. Parameters ---------- width : int Image width. height : int Image height. num_class : int Number of categories scale_factor : int, default is 4 The downsampling scale factor between input image and output heatmap mean : array-like of size 3 Mean pixel values to be subtracted from image tensor. Default is [0.485, 0.456, 0.406]. std : array-like of size 3 Standard deviation to be divided from image. Default is [0.229, 0.224, 0.225]. """ def __init__(self, width, height, num_class, scale_factor=4, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), **kwargs): self._kwargs = kwargs self._width = width self._height = height self._num_class = num_class self._scale_factor = scale_factor self._mean = np.array(mean, dtype=np.float32).reshape(1, 1, 3) self._std = np.array(std, dtype=np.float32).reshape(1, 1, 3) self._data_rng = np.random.RandomState(123) self._eig_val = np.array([0.2141788, 0.01817699, 0.00341571], dtype=np.float32) self._eig_vec = np.array([ [-0.58752847, -0.69563484, 0.41340352], [-0.5832747, 0.00994535, -0.81221408], [-0.56089297, 0.71832671, 0.41158938] ], dtype=np.float32) self._target_generator = CenterNetTargetGenerator( num_class, width // scale_factor, height // scale_factor) def __call__(self, src, label): """Apply transform to training image/label.""" # random color jittering img = src bbox = label # random horizontal flip h, w, _ = img.shape # img, flips = timage.random_flip(img, px=0.5) # bbox = tbbox.flip(bbox, (w, h), flip_x=flips[0]) cv2 = try_import_cv2() input_h, input_w = self._height, self._width s = max(h, w) * 1.0 c = np.array([w / 2., h / 2.], dtype=np.float32) sf = 0.4 w_border = _get_border(128, img.shape[1]) h_border = _get_border(128, img.shape[0]) c[0] = np.random.randint(low=w_border, high=img.shape[1] - w_border) # 裁剪后目標的中心點 c[1] = np.random.randint(low=h_border, high=img.shape[0] - h_border) s = s * np.clip(np.random.randn() * sf + 1, 1 - sf, 1 + sf) # 裁剪后圖片的長寬 # a = random.randint(-10, 10) #裁剪后圖片的旋轉角度 # trans_input = tbbox.get_affine_transform(c, s, a, [input_w, input_h]) trans_input = tbbox.get_affine_transform(c, s, 0, [input_w, input_h]) #根據中心點和長寬,計算變換矩陣 inp = cv2.warpAffine(img.asnumpy(), trans_input, (input_w, input_h), flags=cv2.INTER_LINEAR) #對圖片進行仿射變換(主要是裁剪和resize) output_w = input_w // self._scale_factor output_h = input_h // self._scale_factor trans_output = tbbox.get_affine_transform(c, s, 0, [output_w, output_h]) for i in range(bbox.shape[0]): bbox[i, :2] = tbbox.affine_transform(bbox[i, :2], trans_output) bbox[i, 2:4] = tbbox.affine_transform(bbox[i, 2:4], trans_output) bbox[:, :2] = np.clip(bbox[:, :2], 0, output_w - 1) bbox[:, 2:4] = np.clip(bbox[:, 2:4], 0, output_h - 1) img = inp # to tensor img = img.astype(np.float32) / 255. 
experimental.image.np_random_color_distort(img, data_rng=self._data_rng) img = (img - self._mean) / self._std img = img.transpose(2, 0, 1).astype(np.float32) img = mx.nd.array(img) # generate training target so cpu workers can help reduce the workload on gpu gt_bboxes = bbox[:, :4] gt_ids = bbox[:, 4:5] heatmap, wh_target, wh_mask, center_reg, center_reg_mask = self._target_generator( gt_bboxes, gt_ids) return img, heatmap, wh_target, wh_mask, center_reg, center_reg_mask # img=cv2.cvtColor(inp, cv2.COLOR_RGB2BGR) # for i in range(bbox.shape[0]): # bbox[i, :2] = tbbox.affine_transform(bbox[i, :2], trans_input) # bbox[i, 2:4] = tbbox.affine_transform(bbox[i, 2:4], trans_input) # bbox[:, :2] = np.clip(bbox[:, :2], 0, input_w - 1) # bbox[:, 2:4] = np.clip(bbox[:, 2:4], 0, input_h - 1) # # print(bbox, c, s, a, img.shape) # for i in range(len(bbox)): # cv2.rectangle(img, tuple(bbox[i][:2]), tuple(bbox[i][2:4]), (0, 255, 0), 2) # cv2.imshow("img", img) # cv2.waitKey(0) # cv2.destroyAllWindows() def _get_border(border, size): """Get the border size of the image""" i = 1 while size - border // i <= border // i: i *= 2 return border // i if __name__ == "__main__": img_path = r"E:\two_cats.jpg" transform = CenterNetDefaultTrainTransform( 512, 512, num_class=3, scale_factor=4) img = mx.image.imread(img_path) label = np.array([[112, 335, 1360, 980, 1], [1140, 151, 1509, 774, 1]]).reshape(-1, 5) transform(img, label)
4. Understanding the loss function
The CenterNet loss has three parts: the heatmap loss $L_K$, the width/height prediction loss $L_{size}$, and the center-point offset loss $L_{off}$. $L_K$ uses a modified focal loss, while $L_{size}$ and $L_{off}$ both use an L1 loss, and $L_{size}$ is additionally weighted by 0.1:

$$ L_{det} = L_K + \lambda_{size} L_{size} + \lambda_{off} L_{off}, \qquad \lambda_{size} = 0.1,\ \lambda_{off} = 1 $$
Heatmap loss
The heatmap loss is a modified focal loss, given below. $\alpha$ and $\beta$ are hyper-parameters that balance hard/easy and positive/negative samples, $N$ is the number of keypoints in the image (the number of positives) and normalizes the total positive focal loss to 1, and the subscript $xyc$ of the sum runs over all coordinates of all heatmaps ($c$ indexes the object class, one heatmap per class). $\hat{Y}_{xyc}$ is the predicted value and $Y_{xyc}$ the ground-truth value:

$$ L_K = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases} $$
For more on focal loss, see: https://www.cnblogs.com/silence-cho/p/12987476.html , https://zhuanlan.zhihu.com/p/66048276
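A minimal numpy sketch of this modified focal loss, assuming the predicted heatmap has already been passed through a sigmoid and using alpha = 2, beta = 4 as in the paper:

import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """pred, gt: (num_class, 128, 128) heatmaps; pred in (0, 1), gt built as in section 2."""
    pos = (gt == 1)                        # keypoint locations (positives)
    neg = ~pos
    num_pos = max(pos.sum(), 1)            # N: number of keypoints

    pos_loss = -((1 - pred[pos]) ** alpha) * np.log(pred[pos] + eps)
    neg_loss = -((1 - gt[neg]) ** beta) * (pred[neg] ** alpha) * np.log(1 - pred[neg] + eps)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos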

Center-point offset loss
The offset loss $L_{off}$ is defined below and is computed only at positive (center) locations. $\hat{O}_{\tilde{p}}$ is the predicted offset, $p$ is the center coordinate of the object in the input image, $R$ is the downsampling factor, and $\tilde{p} = \lfloor p/R \rfloor$ is the (integer) center after downsampling:

$$ L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right| $$

For example, suppose the actual center $p$ of an object is (125, 63). Since the input image is 512x512 and the downsampling factor is $R = 4$, the center at the 128x128 scale is $p/R = (31.25, 15.75)$; relative to the integer coordinates $\tilde{p} = (31, 15)$, the offset target $p/R - \tilde{p}$ is therefore (0.25, 0.75).
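The same arithmetic in a few lines of numpy (hypothetical center point):

import numpy as np

p = np.array([125.0, 63.0])      # center in the 512x512 input image
p_scaled = p / 4.0               # (31.25, 15.75) on the 128x128 output, R = 4
p_tilde = np.floor(p_scaled)     # (31, 15): integer keypoint location
offset = p_scaled - p_tilde      # (0.25, 0.75): regression target for L_off
print(p_tilde, offset)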
Width/height prediction loss
The size loss is defined below and is likewise computed only at positive locations. $\hat{S}_{p_k}$ is the size predicted at the center of object $k$, and $s_k$ is the ground-truth width and height of object $k$:

$$ L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right| $$
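Putting the three terms together, a hedged numpy sketch of the total loss, assuming the masked targets produced by CenterNetTargetGenerator from section 3 and the heatmap_focal_loss sketch above:

import numpy as np

def masked_l1(pred, target, mask, num_pos):
    """L1 loss evaluated only at positive (center) locations marked by mask."""
    return (np.abs(pred - target) * mask).sum() / max(num_pos, 1)

def centernet_loss(hm_pred, hm_gt, wh_pred, wh_target, wh_mask,
                   off_pred, off_target, off_mask):
    num_pos = (hm_gt == 1).sum()                               # N: number of keypoints
    l_k = heatmap_focal_loss(hm_pred, hm_gt)                   # heatmap term (sketch above)
    l_size = masked_l1(wh_pred, wh_target, wh_mask, num_pos)   # width/height term
    l_off = masked_l1(off_pred, off_target, off_mask, num_pos) # center offset term
    return l_k + 0.1 * l_size + 1.0 * l_off                    # lambda_size = 0.1, lambda_off = 1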

References:
https://zhuanlan.zhihu.com/p/96856635?utm_source=wechat_session
https://zhuanlan.zhihu.com/p/165313457?utm_source=wechat_session
https://zhuanlan.zhihu.com/p/66048276
https://zhuanlan.zhihu.com/p/73516696
