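Compared with the paper "Segmentation-based deep-learning approach for surface-defect detection", this paper is by the same authors and uses a very similar network architecture.
To preserve small details, the network uses 2x2 max-pooling rather than stride-2 convolutions. Three 2x2 max-pool layers make up the segmentation backbone, so, as the figure shows, the segmentation output is 8 times smaller than the input along each side. The segmentation labels are therefore low-resolution, coarse labels rather than fine ones, because the authors consider the classification result (defect or no defect) more important than the pixel-level loss.
Below are the four key improvements proposed in the paper.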
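The factor of 8 follows directly from halving the resolution three times. A minimal sketch of that arithmetic, assuming each 2x2 max-pool uses stride 2 and rounds down (the function name and the example image size are illustrative, not taken from the paper):

```python
def segmentation_output_size(height, width, num_pools=3, pool=2):
    """Spatial size after `num_pools` max-pool layers with kernel and stride `pool`.

    Assumes floor rounding, which is an assumption about the pooling setup.
    """
    for _ in range(num_pools):
        height //= pool
        width //= pool
    return height, width


# A hypothetical 640x232 input shrinks by 2**3 = 8 along each side.
print(segmentation_output_size(640, 232))
```

The coarse label mask has to be brought down to this same size before the segmentation loss is computed.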
1. Dynamically balanced loss
To implement end-to-end learning, we combine both losses, the segmentation loss and the classification loss, into a single unified loss, allowing for a simultaneous learning. New combined loss is defined as:
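$$L_{total} = \lambda \cdot L_{seg} + \delta \cdot (1 - \lambda) \cdot L_{cls}$$
In short: to train end to end, the segmentation loss and the classification loss are merged into a single unified loss, so that both networks are learned simultaneously.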
where Lseg and Lcls represent segmentation and classification losses, respectively, δ is an additional classification loss weight that prevents the classification loss from dominating the total loss, while λ is a mixing factor that balances the contribution of each network in the final loss. Note that λ, and δ do not replace the learning rate η in SGD, but complement it. They adequately control the learning process, as losses can be in different scales. The segmentation loss is averaged over all the pixels, and more importantly, only a handful of pixels are anomalous, resulting in a relatively small values compared to the classification loss, therefore we normally use smaller δ values to prevent the classification loss from dominating the total loss.
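In short: L_seg and L_cls are the segmentation and classification losses. δ is an extra weight on the classification loss that keeps it from dominating the total loss, and λ is a mixing factor that balances the contribution of the two networks. λ and δ do not replace the SGD learning rate η; they complement it and control the learning process when the losses are on different scales. The segmentation loss is averaged over all pixels and, more importantly, only a handful of pixels are anomalous, so its value is small relative to the classification loss. A smaller δ is therefore normally used to keep the classification loss from dominating.
Structure of the loss in the code: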
weight_loss_seg * loss_seg + weight_loss_dec * loss_dec
Learning the classification network before the segmentation network features are stable represents an additional challenge for the simultaneous learning. We address this by proposing to start learning only the segmentation network at the beginning and gradually progress towards learning only the classification part at the end. We formulate this by computing segmentation and classification mixing factors as a simple linear function:
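$$\lambda = 1 - \frac{n}{total\_epoch}$$
In short: training the classification network before the segmentation features have stabilised is a further difficulty for simultaneous learning. The authors handle it by learning only the segmentation network at the start and gradually shifting towards learning only the classification part at the end, with the mixing factors computed as a simple linear function of the epoch.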
In the code, λ is 1 - (epoch / total_epochs) and δ is self.cfg.DELTA_CLS_LOSS:

    # Method excerpt: relies on self.cfg, self._log and LVL_DEBUG from the surrounding class.
    def get_loss_weights(self, epoch):
        total_epochs = float(self.cfg.EPOCHS)

        if self.cfg.DYN_BALANCED_LOSS:
            # lambda falls linearly from 1 to 0; the classification weight rises from 0 to delta
            seg_loss_weight = 1 - (epoch / total_epochs)
            dec_loss_weight = self.cfg.DELTA_CLS_LOSS * (epoch / total_epochs)
        else:
            # fixed weights: no dynamic balancing
            seg_loss_weight = 1
            dec_loss_weight = self.cfg.DELTA_CLS_LOSS

        self._log(f"Returning seg_loss_weight {seg_loss_weight} and dec_loss_weight {dec_loss_weight}", LVL_DEBUG)

        return seg_loss_weight, dec_loss_weight
where n represents the index of the current epoch and total epoch represents the total number of training epochs. Without the gradual mixing of both losses, the learning would in some cases result in exploding gradients, thus making the model more difficult to use. We term the process of gradually including classification network and excluding segmentation network as a dynamically balanced loss. Additionally, using lower δ values further reduces the issue of learning on the noisy segmentation features early on, whereas using greater values has sometimes lead to the problem of exploding gradients.
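In short: n is the index of the current epoch and total epoch is the total number of training epochs. Without the gradual mixing of the two losses, training sometimes ends in exploding gradients, which makes the model harder to use. The authors call this process of gradually bringing in the classification network and phasing out the segmentation network a dynamically balanced loss. Lower δ values further reduce the problem of learning on noisy segmentation features early on, whereas larger values have sometimes led to exploding gradients.
2. Gradient-flow adjustments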
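The schedule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' training code: `loss_weights` and `total_loss` are hypothetical helper names, and the loss values passed in are made up.

```python
def loss_weights(epoch, total_epochs, delta, dynamic=True):
    """Return (segmentation weight, classification weight) for one epoch."""
    if dynamic:
        progress = epoch / float(total_epochs)
        return 1.0 - progress, delta * progress   # lambda and delta * (1 - lambda)
    return 1.0, delta                             # fixed weights, no balancing


def total_loss(loss_seg, loss_cls, epoch, total_epochs, delta):
    """Combine the two losses with the dynamically balanced weights."""
    w_seg, w_cls = loss_weights(epoch, total_epochs, delta)
    return w_seg * loss_seg + w_cls * loss_cls


# The segmentation weight falls from 1 to 0 while the classification weight rises to delta.
for epoch in (0, 25, 50):
    print(epoch, loss_weights(epoch, total_epochs=50, delta=0.1))
```

At epoch 0 only the segmentation loss contributes; at the final epoch only the classification loss does, scaled by δ.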
We propose to eliminate the gradient-flow from the classification network through the segmentation network, which is required to successfully learn the segmentation and the classification layers in a single end-to-end manner. First, we remove the gradient-flow through the max/avg-pooling shortcuts used by the classification network as proposed in [9]. In Figure 1, this is marked with the (a). Those shortcuts utilize the segmentation network’s output map to speed-up the classification learning. Propagating gradients back through them would add error gradients to the segmentation’s output map, however, this can be harmful since we already have error for that output in the form of pixel-level annotation.
In short: the gradient flowing from the classification network back through the segmentation network is cut, which is needed for both sets of layers to learn well in a single end-to-end run. The first cut is at the max/avg-pooling shortcuts from [9], marked (a) in Figure 1. These shortcuts feed the segmentation output map to the classifier to speed up its learning. Back-propagating through them would add error gradients to the segmentation output map, which can be harmful because that output already receives its error from the pixel-level annotation.
We also propose to limit the gradients for the segmentation that originate in the classification network. In Figure 1, this is marked with the (b). During the initial phases of the training, the segmentation net does not yet produce meaningful outputs, therefore gradients back-propagating from the classification network can negatively effect the segmentation part. We propose to completely stop those gradients, thus preventing the classification network from changing the segmentation network. This closely follows the behaviour of a two-stage learning from [9], where segmentation network is trained first, then the segmentation layers are frozen and only the classification network is trained in the end.
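In short: the gradients that originate in the classification network and reach the segmentation network are also limited, marked (b) in Figure 1. Early in training the segmentation network does not yet produce meaningful outputs, so gradients back-propagated from the classifier can harm the segmentation part. The authors stop these gradients completely, so the classification network cannot change the segmentation network. This closely mirrors the two-stage training in [9], where the segmentation network is trained first, its layers are then frozen, and only the classification network is trained at the end.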
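The code blocks back-propagation at the two places boxed in red in the figure: the gradient through the concatenated 1025-channel volume (cat), and the gradients through the global max-pool and global avg-pool of the segmentation output. It does this by multiplying the back-propagated gradient by a weight of 0, so the resulting gradient is zero.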
    import torch


    class GradientMultiplyLayer(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input, mask_bw):
            ctx.save_for_backward(mask_bw)  # keep mask_bw for the backward pass
            return input                    # forward pass is the identity

        @staticmethod
        def backward(ctx, grad_output):
            mask_bw, = ctx.saved_tensors            # retrieve the saved mask
            return grad_output.mul(mask_bw), None   # scale the gradient by the mask; no gradient for the mask itself


    # Excerpts from the model class (shown out of context).
    self.volume_lr_multiplier_layer = GradientMultiplyLayer().apply
    self.glob_max_lr_multiplier_layer = GradientMultiplyLayer().apply
    self.glob_avg_lr_multiplier_layer = GradientMultiplyLayer().apply

    cat = self.volume_lr_multiplier_layer(cat, self.volume_lr_multiplier_mask)  # gradient layer on the concatenated volume
    global_max_seg = self.glob_max_lr_multiplier_layer(global_max_seg, self.glob_max_lr_multiplier_mask)  # gradient layer on global_max_seg
    global_avg_seg = self.glob_avg_lr_multiplier_layer(global_avg_seg, self.glob_avg_lr_multiplier_mask)  # gradient layer on global_avg_seg
3. Frequency-of-use sampling
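To see what multiplying the backward gradient by a mask does, here is a hand-derived toy example in plain Python. It is a sketch with a hypothetical one-parameter "segmentation" stage and one-parameter "classifier", not the real network: the function name, the squared-error losses and all numbers are assumptions chosen to keep the chain rule readable.

```python
def seg_param_gradient(w, v, x, seg_target, cls_target, mask):
    """Gradient of L_seg + L_cls with respect to the segmentation parameter w.

    Toy model: s = w * x              (segmentation output)
               L_seg = (s - seg_target) ** 2
               c = v * s              (classifier fed by the segmentation output)
               L_cls = (c - cls_target) ** 2
    `mask` scales the gradient flowing from the classifier back into s,
    playing the role of mask_bw in the backward pass.
    """
    s = w * x
    c = v * s
    grad_seg = 2.0 * (s - seg_target) * x          # dL_seg/dw
    grad_cls_wrt_s = 2.0 * (c - cls_target) * v    # dL_cls/ds
    grad_cls = mask * grad_cls_wrt_s * x           # masked dL_cls/dw
    return grad_seg + grad_cls


# mask = 1 lets the classifier push on w; mask = 0 leaves only the segmentation error.
print(seg_param_gradient(0.5, 2.0, 1.0, 1.0, 0.0, mask=1.0))
print(seg_param_gradient(0.5, 2.0, 1.0, 1.0, 0.0, mask=0.0))
```

With the mask set to 0 the segmentation parameter is updated only from its own pixel-level error, which is the behaviour the gradient-flow adjustment aims for.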
Current implementation of the two-stage architecture employed an alternating sampling scheme [9] that provided balance between the positive and the negative training samples by alternating between a positive and a negative sample in each training step. We propose to improve the alternating sampling scheme by replacing the naive random sampling with one based on the frequency of use of each negative sample. The existing alternating sampling scheme forces a selection of negative images (Ñ ⊂ N) for every epoch in the same amount of positive images (P). However, due to P << N, the selected subset Ñ will be relatively small. Since current approach [9] employs uniform random sampling of negative images for every epoch, this leads to a significant over-use of some samples and under-use of others, as can be observed on the left side in Figure 2.
In short: the existing two-stage implementation balances positives and negatives by alternating between one positive and one negative sample per training step. Each epoch therefore uses only as many negatives as there are positives, and because P << N this subset is small. Drawing it uniformly at random every epoch leaves some negatives heavily over-used and others under-used, as the left side of Figure 2 shows. The proposed fix is to replace the naive random draw with one based on how often each negative sample has been used.
We propose to replace the random sampling of negative examples with a one based on the frequency-of-use. We sample each image with the probability inversely proportional to the frequency of use of that image. As seen in the right histogram in Figure 2, the frequency-of-use sampling significantly reduces the over-use and the under-use in all samples, and ensures even use of every negative image during the training process.
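In short: negatives are no longer drawn uniformly. Each image is sampled with a probability inversely proportional to how often it has already been used, which greatly reduces over-use and under-use and ensures that every negative image is used evenly during training.
While the iteration counter self.counter is below self.len, the number of negative (defect-free) samples used equals the number of positive (defective) samples. Items are drawn from 2 x (number of positives) indices, so the first num_pos entries of the negative permutation are the ones used.
Once self.counter reaches self.len, i.e. one epoch has passed, the negative indices are drawn again at random: negatives that have already been drawn get a lower probability, and negatives that have not been drawn get a higher one.
Advantages of this scheme: 1. In every epoch the numbers of positive and negative samples are the same.
2. Over the whole training run, every negative sample is drawn with nearly the same frequency.
My own view: this method helps when the dataset is small; when the dataset is large, plain random sampling works as well.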
    import numpy as np

    # Excerpt from the dataset class (shown out of context).
    if self.counter >= self.len:  # not entered at the start
        self.counter = 0
        if self.frequency_sampling:
            sample_probability = 1 - (self.neg_retrieval_freq / np.max(self.neg_retrieval_freq))  # most-used -> 0, unused -> 1
            sample_probability = sample_probability - np.median(sample_probability) + 1  # shift so that the median becomes 1
            sample_probability = sample_probability ** (np.log(len(sample_probability)) * 4)
            sample_probability = sample_probability / np.sum(sample_probability)

            # use replace=False to get only unique values
            # p makes negatives that were already drawn less likely to be drawn again
            self.neg_imgs_permutation = np.random.choice(range(self.num_neg),
                                                         size=self.num_negatives_per_one_positive * self.num_pos,
                                                         p=sample_probability,
                                                         replace=False)
        else:
            self.neg_imgs_permutation = np.random.permutation(self.num_neg)

    if self.kind == 'TRAIN':
        if index >= self.num_pos:  # e.g. 79
            ix = index % self.num_pos
            ix = self.neg_imgs_permutation[ix]  # e.g. selects 72
            item = self.neg_samples[ix]
            self.neg_retrieval_freq[ix] = self.neg_retrieval_freq[ix] + 1  # increment the use count of the chosen negative
        else:
            ix = index
            item = self.pos_samples[ix]
    else:
        if index < self.num_neg:
            ix = index
            item = self.neg_samples[ix]
        else:
            ix = index - self.num_neg
            item = self.pos_samples[ix]
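The probability computation above can be tried in isolation. The following is a standard-library sketch under stated assumptions, not the repository code: `sampling_probabilities` and `draw_negatives` are hypothetical names, the uniform fallback for all-zero counts is an added guard, and the draw without replacement is done with a simple loop over `random.choices` instead of `np.random.choice(..., replace=False)`.

```python
import math
import random
from statistics import median


def sampling_probabilities(use_counts):
    """Map per-image use counts to probabilities; rarely used images become likelier."""
    n = len(use_counts)
    top = max(use_counts)
    if top == 0:                                  # nothing used yet: uniform (assumed guard)
        return [1.0 / n] * n
    p = [1.0 - c / top for c in use_counts]       # most-used -> 0, unused -> 1
    med = median(p)
    p = [v - med + 1.0 for v in p]                # shift so the median sits at 1
    power = math.log(n) * 4
    p = [v ** power for v in p]                   # sharpen the differences
    total = sum(p)
    return [v / total for v in p]


def draw_negatives(use_counts, k, rng=random):
    """Draw k distinct indices, favouring indices with low use counts."""
    probs = sampling_probabilities(use_counts)
    candidates = list(range(len(use_counts)))
    chosen = []
    for _ in range(k):
        weights = [probs[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)                   # no index is drawn twice
    return chosen


counts = [3, 3, 0, 0, 1, 0]                       # made-up use counts for six negatives
print(sampling_probabilities(counts))
print(draw_negatives(counts, 3, rng=random.Random(0)))
```

With these counts the three unused images carry almost all of the probability mass, so they are the ones most likely to be selected for the next epoch.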
4. Loss weighting for positive pixels
When only approximate, region-based labels are available, such as shown in Figure 3, we propose to consider the different pixels of the annotated defected regions differently. In particular, we give more importance to the center of the annotated regions and less importance to the outer parts. This alleviates the ambiguity arising at the edges of the defect, where we can not be certain whether the defect is present or not. We implement the importance in different sections of labels by weighting the segmentation loss accordingly. We weight the influence of each pixel at positive labels in accordance with its distance to the nearest negatively labelled pixel using the distance transform algorithm.
We formulate weighting of the positive pixels as:
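$$L_{seg}(pix) = \Omega\left(\frac{D(pix)}{D(pix_{max})}\right) \cdot \hat{L}(pix)$$
In short: when only approximate, region-based labels are available, as in Figure 3, the pixels of an annotated defect region are not all treated alike. The centre of the region gets more importance and the outer parts get less, which reduces the ambiguity at the edges of the defect, where it is uncertain whether a defect is really present. This is implemented by weighting the segmentation loss: each positive pixel is weighted according to its distance to the nearest negatively labelled pixel, computed with the distance transform.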
where $\hat{L}(pix)$ is the original loss of the pixel, $D(pix)/D(pix_{max})$ is a distance to the nearest negative pixel normalized by the maximum distance value within the groundtruth region and Ω(x) is a scaling function that converts the relative distance value into the weight for the loss. In general, the scaling function Ω(x) can be defined differently depending on the defect and annotation type. However, we have found that a simple polynomial function provides enough flexibility for different defect types:
$$\Omega(x) = w_{pos} \cdot x^{p}$$
In short: the original per-pixel loss is multiplied by a weight obtained from the pixel's distance to the nearest negative pixel, normalised by the largest such distance inside the ground-truth region. The scaling function Ω(x) turns this relative distance into a loss weight and could be chosen per defect and annotation type, but a simple polynomial was found flexible enough.
where p controls the rate of decreasing the pixel importance as it gets further away from the center, while wpos is an additional scalar weight for all positive pixels. We have often found p = 1 and p = 2 as best performing, depending on the annotation type. Examples of a segmentation mask and two weight masks are depicted in Figure 3. Note, that the weights for the negatively labeled pixels remain 1.
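In short: p controls how quickly a pixel's importance drops as it moves away from the centre, and w_pos is an additional scalar weight applied to all positive pixels. p = 1 and p = 2 usually perform best, depending on the annotation type. Figure 3 shows a segmentation mask and two weight masks. The weight of negatively labelled pixels stays at 1.
In the code, the segmentation loss is weighted by the distance from each non-zero pixel to the nearest zero pixel, so that the centre of a defect region weighs more than its edges, which helps with the uncertain boundaries of the annotation.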
    import cv2
    import numpy as np
    from scipy.ndimage import distance_transform_edt

    # Method excerpt from the dataset class.
    def distance_transform(self, mask: np.ndarray, max_val: float, p: float) -> np.ndarray:
        h, w = mask.shape[:2]  # mask is the label; e.g. max_val = 1, p = 2
        dst_trf = np.zeros((h, w))

        # group connected defect pixels into one labelled component each
        num_labels, labels = cv2.connectedComponents((mask * 255.0).astype(np.uint8), connectivity=8)
        for idx in range(1, num_labels):
            mask_roi = np.zeros((h, w))
            k = labels == idx   # pixels that belong to this component
            mask_roi[k] = 255   # mark them as foreground
            # distance from each non-zero pixel to the nearest zero pixel:
            # 1 at the boundary, growing towards the centre
            dst_trf_roi = distance_transform_edt(mask_roi)
            if dst_trf_roi.max() > 0:  # the component is non-empty
                dst_trf_roi = (dst_trf_roi / dst_trf_roi.max())
                dst_trf_roi = (dst_trf_roi ** p) * max_val  # max_val is w_pos
            dst_trf += dst_trf_roi  # accumulate the per-component weights

        dst_trf[mask == 0] = 1  # negative pixels keep weight 1
        return np.array(dst_trf, dtype=np.float32)
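For a small mask the weighting can be reproduced without OpenCV or SciPy. The sketch below uses a brute-force nearest-zero search in plain Python and treats the whole mask as a single defect region, which is a simplification of the per-component loop above; the function name `positive_pixel_weights` and the 5x5 example mask are assumptions made for illustration.

```python
import math


def positive_pixel_weights(mask, w_pos=1.0, p=2.0):
    """Weights w_pos * (D / D_max) ** p for positive pixels, 1 for negative pixels."""
    h, w = len(mask), len(mask[0])
    zeros = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 0]

    # Euclidean distance from every positive pixel to its nearest negative pixel.
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 0 and zeros:
                dist[r][c] = min(math.hypot(r - zr, c - zc) for zr, zc in zeros)

    d_max = max(max(row) for row in dist)
    weights = [[1.0] * w for _ in range(h)]       # negative pixels keep weight 1
    if d_max > 0:
        for r in range(h):
            for c in range(w):
                if mask[r][c] != 0:
                    weights[r][c] = w_pos * (dist[r][c] / d_max) ** p
    return weights


# A 3x3 defect inside a 5x5 image: the centre pixel gets the full weight.
mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
for row in positive_pixel_weights(mask, w_pos=1.0, p=2.0):
    print(row)
```

In this example the border pixels of the defect are at distance 1 and the centre is at distance 2, so with p = 2 the border pixels are weighted 0.25 while the centre keeps weight 1.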
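Note on the function above: cv2.connectedComponents((mask * 255.0).astype(np.uint8), connectivity=8) groups connected pixels into one class each, with the effect shown below.
(Figures: original image and result image)
Note on the function above: distance_transform_edt computes the distance from each non-zero pixel to the nearest zero pixel.
Pixels at distance 1 from the nearest background pixel are marked in red, as shown in the figure below.
Pixels at distance sqrt(2) from the nearest background pixel.
Pixels at distance 2 from the nearest background pixel are marked in orange, as shown in the figure below.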