Long-Tailed Distributions: DECOUPLING REPRESENTATION AND CLASSIFIER FOR LONG-TAILED RECOGNITION

Reposted 2021-11-08 10:47. Tags: paper reading / deep learning / papers

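Original document: https://www.yuque.com/lart/papers/drggso

An ICLR 2020 paper.

It proposes a simple and effective strategy, built on the re-sampling paradigm, for classification under long-tailed distributions.

The proposed method splits model learning into two parts: representation learning and classification.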

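For the former, the full model is trained on the original data distribution, i.e., with instance-balanced (natural) sampling, in order to learn _the best and most generalizable representations_. Once training is done, the classifier is additionally adjusted (retraining the classifier with class-balanced sampling or by a simple, yet effective, classifier weight normalization which has only a single hyperparameter controlling the "temperature" and which does not require additional training).

In this work, the authors show that in the long-tailed setting this separation yields good recognition performance more directly, without designing sampling strategies, balanced losses, or adding memory modules.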

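As summarized at https://zhuanlan.zhihu.com/p/158638078:

Rebalancing any imbalanced classification dataset should in essence only rebalance the classifier; the class distribution should not be used to alter the distribution of image features during feature learning. In other words, the distribution of image features and the distribution of class labels are essentially not coupled.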

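Background

Representation learning for long-tailed recognition

In long-tailed recognition, the training set as a whole follows a long-tailed distribution over the classes. Some rare classes have very little data during training, and a model trained on such an imbalanced dataset tends to underfit the few-shot classes. In practice, however, what we need is a model that recognizes all classes well. Various re-sampling strategies for few-shot classes, loss re-weighting schemes, and margin regularization methods have therefore been proposed. It remains unclear, though, how they improve long-tailed recognition performance (if they do at all).
This paper systematically examines their effectiveness by separating representation learning from classifier learning, in order to identify what really matters for long-tailed recognition.
First, the notation: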

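\(X=\{x_i, y_i\}, i \in \{1, \dots, n\}\) denotes the training set, where \(y_i\) is the label of data point \(x_i\).
\(n_j\) denotes the number of training samples for class \(j\), and \(n = \Sigma^{C}_{j=1} n_j\) is the total number of training samples.
Without loss of generality, the classes are sorted in descending order of their sample counts (their cardinality), i.e., if \(i<j\) then \(n_i \ge n_j\). Because of the long-tailed setting, \(n_1 \gg n_C\): the head classes are far larger than the tail classes.
\(f(x; \theta) = z\) denotes the representation of the input, where \(f(x; \theta)\) is implemented by a CNN with parameters \(\theta\).
The final class prediction \(\tilde{y}\) is given by a classifier function \(g\), i.e., \(\tilde{y} = \text{argmax}\, g(z)\). Usually \(g\) is simply a linear classifier, \(g(z) = \mathbf{W}^\top z + \mathbf{b}\), where \(\mathbf{W}\) and \(\mathbf{b}\) are the weight matrix and the bias, respectively. The paper also discusses some other forms of \(g\).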

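Sampling strategies

These aim to balance the data distribution for representation learning and classifier learning. Most sampling strategies can be written in one unified form: the probability \(p_j\) of sampling a data point from class \(j\) is \(p_j = \frac{n^q_j}{\Sigma^C_{i=1}n_i^q}\). Note that this is a per-class expression; for each individual data point, sampling can be viewed as a two-stage process: first sample one of the \(C\) classes according to \(p_j\), then sample uniformly within that class. The expression contains a parameter \(q \in [0, 1]\) that modulates the sampling probabilities of the classes; different values give different cases: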

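Instance-balanced sampling: the most common way of sampling data, in which every training sample is selected with equal probability. Here \(q=1\), and the probability \(p^{IB}\) of sampling a data point from a given class is proportional to the cardinality of that class.
Class-balanced sampling: on imbalanced datasets, instance-balanced sampling is suboptimal because the model underfits the few-shot classes, which lowers accuracy, especially on a balanced test set. Class-balanced sampling has been used to alleviate this discrepancy. In this case each class is forced to be selected with equal probability: \(q = 0\), which removes the influence of class size entirely, and every class has \(p^{CB} = 1/C\). In practice this strategy can be viewed as a two-stage sampling process: first a class is sampled uniformly from the set of classes, then a sample is drawn uniformly from within that class.
Square-root sampling: other sampling strategies have also been explored; a commonly used variant is square-root sampling, with \(q=1/2\).
- Typically, a class-balanced loss assigns sample weights inversely proportionally to the class frequency. This simple heuristic method has been widely adopted. However, recent work on training from large-scale, real-world, long-tailed datasets reveals poor performance when using this strategy. Instead, they use a "smoothed" version of weights that are empirically set to be inversely proportional to the square root of class frequency. (from _Class-Balanced Loss Based on Effective Number of Samples_)
Progressively-balanced sampling: some recent methods combine the strategies above into mixed sampling schemes. In practice, instance-balanced sampling is used for a number of epochs and class-balanced sampling for the remaining epochs. Such mixed schemes need a switch point to be set, which introduces an extra hyperparameter. This paper uses a "softened" version, progressively-balanced sampling, which linearly interpolates the IB and CB class sampling probabilities with a coefficient that changes as the training epochs progress: \(p^{PB}_j(t) = (1 - \frac{t}{T}) p_j^{IB} + \frac{t}{T} p_j^{CB}\). Here \(t\) is the current epoch and \(T\) is the total number of epochs.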
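The unified sampling formula and its progressively-balanced interpolation can be sketched in a few lines of plain Python. This is only an illustration of the formulas above, not the authors' code: the function names and the toy class sizes in `counts` are made up for the example.

```python
def sampling_probs(class_counts, q):
    """Per-class sampling probability p_j = n_j^q / sum_i n_i^q, with q in [0, 1]."""
    powered = [n ** q for n in class_counts]
    total = sum(powered)
    return [p / total for p in powered]


def progressive_probs(class_counts, epoch, total_epochs):
    """Linear interpolation between instance-balanced (q=1) and class-balanced (q=0)."""
    ib = sampling_probs(class_counts, 1.0)
    cb = sampling_probs(class_counts, 0.0)
    alpha = epoch / total_epochs  # t / T in the formula
    return [(1 - alpha) * a + alpha * b for a, b in zip(ib, cb)]


counts = [1000, 100, 10]  # hypothetical toy class sizes (head -> tail)
ib = sampling_probs(counts, 1.0)      # instance-balanced
cb = sampling_probs(counts, 0.0)      # class-balanced, every entry is 1/C
sqrt = sampling_probs(counts, 0.5)    # square-root sampling
pb_start = progressive_probs(counts, 0, 90)   # equals ib at t = 0
pb_end = progressive_probs(counts, 90, 90)    # equals cb at t = T
```

With these toy counts the head class is drawn about 90% of the time under instance-balanced sampling and exactly one third of the time under class-balanced sampling; square-root sampling sits in between.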

The authors plot the sampling-weight proportions of each strategy on ImageNet-LT.

The following passage from _Class-Balanced Loss Based on Effective Number of Samples_ explains well the problems with re-sampling:

In the context of deep feature representation learning using CNNs, re-sampling may either introduce large amounts of duplicated samples, which slows down the training and makes the model susceptible to overfitting when oversampling, or discard valuable examples that are important for feature learning when under-sampling.

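Loss re-weighting

This topic is not closely related to the paper's discussion, so the authors do not survey it in much detail.

In addition, the authors found that some of the latest methods reporting high performance are hard to train and reproduce, and in many cases require extensive, dataset-specific hyperparameter tuning.

The paper's experiments show that a baseline equipped with a properly balanced classifier performs at least as well as, if not better than, the latest loss re-weighting methods.

Some of the recent related methods the paper compares against: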

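Focal Loss: proposed for object detection. It balances the sample-level classification loss by down-weighting easy samples. For the predicted probability \(h_i\) of sample \(x_i\) on its class \(y_i\), it adds a re-weighting factor \((1 - h_i)^{\gamma}, \gamma > 0\) to the standard cross-entropy loss: \(\mathcal{L}_{\text{focal}} := (1 - h_i)^\gamma \mathcal{L}_{\text{CE}} = -(1 - h_i)^\gamma \text{log}(h_i)\). The overall effect is a smaller weight on easy samples with high predicted probability and a larger weight on hard samples with low predicted probability.
Class-balanced variant of Focal Loss: samples from class \(j\) are weighted by a class-balanced coefficient, which can replace the alpha parameter of the original Focal Loss. The method (walkthrough: https://www.cnblogs.com/wanghui-garcia/p/12193562.html) can thus be seen as an explicit way of setting alpha in focal loss, based on the concept of the effective number of samples. (Class-Balanced Loss Based on Effective Number of Samples: https://github.com/richardaecn/class-balanced-loss)
Label-distribution-aware margin (LDAM) loss (https://arxiv.org/pdf/1906.07413.pdf): encourages larger margins for few-shot classes. Its final form can be written as a cross-entropy loss with enforced margins: \(\mathcal{L}_{\text{LDAM}} := -\log\frac{e^{\hat{y}_j - \Delta_j}}{e^{\hat{y}_j - \Delta_j} + \Sigma_{c \ne j} e^{\hat{y}_c}}\). Here \(\hat{y}\) denotes the logits and \(\Delta_j \propto \frac{1}{n_j^{1/4}}\) is the class-aware margin. (An introduction to margins in softmax losses: "Softmax理解之margin", an article by 王峰 on Zhihu, https://zhuanlan.zhihu.com/p/52108088)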
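To make the two loss formulas concrete, here is a minimal single-sample sketch in plain Python. It follows the expressions above directly; the function names are made up, and the `max_margin` rescaling of \(\Delta_j\) (so that the rarest class gets exactly that margin) is an assumption for illustration, since the notes only state \(\Delta_j \propto n_j^{-1/4}\).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one sample: -log(h), h = softmax prob of the target."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - h)^gamma * log(h)."""
    h = softmax(logits)[target]
    return -((1.0 - h) ** gamma) * math.log(h)


def ldam_loss(logits, target, class_counts, max_margin=0.5):
    """LDAM for one sample: cross-entropy after subtracting Delta_j from the target logit.

    Delta_j is proportional to n_j^(-1/4); rescaling so that the largest margin
    equals max_margin is an assumption of this sketch.
    """
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    shifted = list(logits)
    shifted[target] -= scale * raw[target]
    return cross_entropy(shifted, target)
```

With `gamma=0` the focal loss reduces to plain cross-entropy, and with equal logits the LDAM loss is larger for a tail-class target than for a head-class target, because the tail class is asked for a larger margin.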

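Classifier learning for long-tailed recognition

When a classification model is learned on a balanced dataset, the classifier is trained jointly with the representation-extracting model via the cross-entropy loss. This is also the typical baseline setup for long-tailed recognition. Although various approaches such as re-sampling, re-weighting, and transferring representations have been proposed, the general paradigm is the same: the classifier is either learned jointly with the representation end to end, or via a two-stage approach in which, in the second stage, the classifier and the representation are jointly fine-tuned with variants of class-balanced sampling.

This paper separates representation learning from classification to tackle long-tailed recognition.

The following are the classifier learning methods used in the paper. They aim to rectify the decision boundaries for head and tail classes, mainly by fine-tuning the classifier with different sampling strategies or by non-parametric methods (e.g., the nearest class mean classifier). Methods that rebalance the classifier weights without any additional retraining are also considered, and they show good accuracy.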

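Classifier Re-training (cRT). A straightforward approach that re-trains the classifier with class-balanced sampling: keeping the representation model fixed, the classifier weights and bias are randomly re-initialized and optimized for a small number of epochs using class-balanced sampling.
Nearest Class Mean classifier (NCM). Another common approach first computes the mean feature representation of each class on the training set, then performs nearest-neighbor search on the L2-normalized mean features, using either cosine similarity or Euclidean distance. Despite its simplicity, this is a strong baseline. In the paper's experiments, cosine similarity alleviates the weight imbalance problem through its built-in normalization.
\(\tau\)-normalized classifier (\(\tau\)-normalized).
- This is an efficient way to re-balance the classifier's decision boundaries, inspired by an empirical observation: after joint training with instance-balanced sampling, the weight norms \(||w_j||\) are correlated with the class cardinality, whereas after fine-tuning the classifier with class-balanced sampling the norms of the classifier weights become more similar (as the left side of Figure 2 shows, the weight norms are indeed much flatter after class-balanced fine-tuning).
- Motivated by this observation, the authors rectify the imbalance of the decision boundaries by directly adjusting the classifier weight norms through \(\tau\)-normalization. Let \(\mathbf{W} = \{w_j\} \in \mathbb{R}^{d \times C}, w_j \in \mathbb{R}^d\) denote the set of classifier weights for the classes \(j\). The weights are rescaled into the normalized form \(\tilde{\mathbf{W}} =\{\tilde{w}_j\}, \tilde{w}_j = \frac{w_j}{||w_j||^{\tau}}\), where \(\tau\) is a hyperparameter controlling the normalization "temperature" and the denominator uses the L2 norm. When \(\tau = 1\) this reduces to standard L2 normalization; when \(\tau = 0\) no normalization is applied. \(\tau \in (0, 1)\) is chosen empirically so that the weights can be smoothly rectified.
- After this normalization, the classification logits are given by \(\hat{y} = \tilde{\mathbf{W}}^\top f(x; \theta)\), i.e., the normalized linear classifier is applied to the extracted representation \(f(x; \theta)\). Note that the bias term is dropped here, since its effect on the logits and the final prediction is negligible.
- The parameter tau is grid-searched on a validation set: In our submission, tau is determined by grid search on a validation dataset. The search grid is [0.0, 0.1, 0.2, ..., 1.0]. We use overall top-1 accuracy to find the best tau on validation set and use that value for test set.
Learnable weight scaling (LWS). Another way to interpret \(\tau\)-normalization is as a rescaling of the weight magnitudes that keeps the direction of each classifier weight unchanged, which can be rewritten as \(\tilde{w}_i = f_i * w_i, f_i = \frac{1}{||w_i||^\tau}\). Although the hyperparameter of \(\tau\)-normalization can be chosen by cross-validation, the authors further try learning the scaling factors \(f_i\) on the training set with class-balanced sampling. In this case both the representation and the classifier weights are kept fixed and only the scaling factors are learned.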
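A small sketch of \(\tau\)-normalization in plain Python may help show why rescaling the weight norms moves the decision boundary. The weight vectors `W`, the feature `z`, and the function names are toy values invented for this example; weights are stored as one vector per class, and the bias is dropped as described above.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def tau_normalize(weights, tau):
    """Rescale each per-class weight vector w_j to w_j / ||w_j||^tau."""
    normalized = []
    for w in weights:
        scale = 1.0 / (l2_norm(w) ** tau)  # this is the factor f_j that LWS would learn instead
        normalized.append([scale * v for v in w])
    return normalized


def linear_logits(weights, feature):
    """Bias-free linear classifier: one dot product per class."""
    return [sum(wi * zi for wi, zi in zip(w, feature)) for w in weights]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical toy classifier: the head class has norm 5, the tail class norm 0.5.
W = [[3.0, 4.0], [0.0, 0.5]]
z = [0.2, 1.0]  # a feature that points mostly along the tail-class direction

pred_raw = argmax(linear_logits(tau_normalize(W, 0.0), z))   # tau = 0: weights unchanged
pred_norm = argmax(linear_logits(tau_normalize(W, 1.0), z))  # tau = 1: full L2 normalization
```

With the raw weights the large-norm head class wins even though `z` is closer in direction to the tail class; after full normalization the prediction flips to the tail class. Intermediate values of `tau` interpolate between these two behaviours.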

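From the appendix

Note that among the strategies above for adjusting the classifier in the second stage, only cRT and LWS involve retraining and a sampling strategy, and both use class-balanced re-sampling. NCM and \(\tau\)-normalized need no second-stage re-sampling strategy, because they require no retraining.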
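Since NCM involves no training at all, it fits in a few lines. The sketch below is a plain-Python illustration of the cosine-similarity variant, assuming features have already been extracted by a frozen backbone; the function names and the toy features are invented for the example.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def class_means(features, labels, num_classes):
    """Mean feature per class on the training set, then L2-normalized."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, labels):
        counts[label] += 1
        for k, v in enumerate(feat):
            sums[label][k] += v
    return [l2_normalize([s / max(c, 1) for s in row]) for row, c in zip(sums, counts)]


def ncm_predict(feature, means):
    """Predict the class whose normalized mean has the highest cosine similarity."""
    query = l2_normalize(feature)
    sims = [sum(q * m for q, m in zip(query, mean)) for mean in means]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical toy features: three head-class samples, one tail-class sample.
train_feats = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2], [0.1, 1.0]]
train_labels = [0, 0, 0, 1]
means = class_means(train_feats, train_labels, num_classes=2)
```

Because every class mean is normalized to unit length, the number of training samples per class affects only how well the mean is estimated, not the magnitude of the class prototype.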

Experimental details

Experimental setup

Datasets

Places-LT and ImageNet-LT are artificially truncated from their balanced versions (Places-2 (Zhou et al., 2017) and ImageNet-2012 (Deng et al., 2009)) so that the labels of the training set follow a long-tailed distribution.
- Places-LT contains images from 365 categories and the number of images per class ranges from 4980 to 5.
- ImageNet-LT has 1000 classes and the number of images per class ranges from 1280 to 5 images.
iNaturalist 2018 is a real-world, naturally long-tailed dataset, consisting of samples from 8,142 species.

Evaluation

After training on the long-tailed datasets, we evaluate the models on the corresponding balanced test/validation datasets and report the commonly used top-1 accuracy over all classes, denoted as All.
To better examine performance variations across classes with different number of examples seen during training, we follow Liu et al. (2019) and further report accuracy on three splits of the set of classes: Many-shot (more than 100 images), Medium-shot (20∼100 images) and Few-shot (less than 20 images). Accuracy is reported as a percentage.
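The split-wise reporting can be sketched as follows. This is an illustrative plain-Python version: the thresholds come from the text above, while the function names, the toy predictions, and the choice to compute sample-level accuracy within each split are assumptions of the sketch (on a balanced test set this matches averaging per-class accuracies).

```python
def shot_split(num_train_images):
    """Assign a class to Many-/Medium-/Few-shot by its training-set size."""
    if num_train_images > 100:
        return "many"
    if num_train_images < 20:
        return "few"
    return "medium"


def split_accuracy(preds, labels, train_counts):
    """Top-1 accuracy (in percent) overall and per shot split."""
    correct = {"many": 0, "medium": 0, "few": 0, "all": 0}
    total = {"many": 0, "medium": 0, "few": 0, "all": 0}
    for pred, label in zip(preds, labels):
        for key in (shot_split(train_counts[label]), "all"):
            total[key] += 1
            correct[key] += int(pred == label)
    return {k: 100.0 * correct[k] / total[k] for k in total if total[k] > 0}


# Hypothetical toy setup: three classes with 500, 50 and 5 training images.
train_counts = [500, 50, 5]
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 0, 0, 2]
report = split_accuracy(preds, labels, train_counts)
```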

Implementation details

We use the PyTorch (Paszke et al., 2017) framework for all experiments.
For Places-LT, we choose ResNet-152 as the backbone network and **pretrain it on the full ImageNet-2012 dataset**, following Liu et al. (2019).
On ImageNet-LT, we report results with ResNet-{10, 50, 101, 152} (He et al., 2016) and ResNeXt-{50, 101, 152}(32x4d) (Xie et al., 2017) but mainly use ResNeXt-50 for analysis.
Similarly, ResNet-{50, 101, 152} is also used for iNaturalist 2018.
For all experiments, if not specified, we use SGD optimizer with momentum 0.9, batch size 512, cosine learning rate schedule (Loshchilov & Hutter, 2016) gradually decaying from 0.2 to 0 and image resolution 224×224.
In the first representation learning stage, the backbone network is usually trained for 90 epochs. (Instance-balanced sampling is used for representation learning throughout.)
In the second stage, i.e., for retraining a classifier (cRT), we restart the learning rate and train it for 10 epochs while keeping the backbone network fixed.
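For reference, the cosine learning rate schedule mentioned above can be written out as a small function. This is a sketch of the standard cosine annealing formula with the quoted values (0.2 decaying to 0); updating once per epoch rather than per iteration is an assumption, and the function name is made up.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.2, final_lr=0.0):
    """Cosine annealing from base_lr (epoch 0) down to final_lr (epoch == total_epochs)."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Stage 1: 90 epochs of representation learning.
stage1 = [cosine_lr(e, 90) for e in range(90)]
# Stage 2 (cRT): the schedule is restarted and run for 10 epochs.
stage2 = [cosine_lr(e, 10) for e in range(10)]
```

Restarting the schedule for the second stage means the classifier is first trained at the full base learning rate again, even though the backbone stays frozen.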

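Experiments

Note that the sampling strategies in this figure all refer to the ones used during representation learning.

Comparison of sampling strategies under joint training

Supplementary tables and figures from the appendix
For the joint training scheme (Joint), the linear classifier and backbone for representation learning are jointly trained for 90 epochs using a standard cross-entropy loss and different sampling strategies, i.e., Instance-balanced, Class-balanced, Square-root, and Progressively-balanced.
Comparing the Joint results in Figure 1 and Table 5 shows:

A better sampling strategy gives better performance. The joint-training results across sampling strategies validate the motivation of related work that tries to design better data sampling methods.
Instance-balanced sampling performs better on the many-shot classes, because the final model is heavily biased toward them.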

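Effectiveness of the decoupled learning strategy

Comparing with Figure 1, we can tell that the cRT strategy is used here in the second stage to adjust the model.

For the decoupled learning schemes, we present results when learning the classifier in the following ways, i.e., re-initialize and re-train (cRT), Nearest Class Mean (NCM) as well as τ-normalized classifier.
Overall, Figure 1 shows:

In most cases, decoupled training outperforms joint training.
Even the non-parametric NCM strategy does not perform badly; its overall performance is pulled down mainly by its poor many-shot results.
NCM and \(\tau\)-normalized, which need no extra training or sampling strategy, both deliver highly competitive performance. Their strong results may stem from their ability to adaptively adjust the decision boundaries for many-/medium-/few-shot classes (as shown in Figure 4).
Among all decoupled methods, instance-balanced sampling gives the best results for overall performance and for every class split except many-shot. This is particularly interesting, because it implies that data imbalance may not be an issue for learning high-quality representations. Instance-balanced sampling gives the most generalizable representations.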

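For further comparison, Table 1 lists several settings: jointly fine-tuning the backbone and the linear classifier (B+C and B+C(0.1xlr)), fine-tuning only the last block of the backbone (LB+C), and re-training the classifier with the backbone fixed (C).
Table 1 shows:

Fine-tuning the whole model performs worst.
Fixing the backbone performs best (since this is a long-tailed task, the focus is on overall performance and on few-shot classes).
The decoupled training setup is well suited to long-tailed recognition.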

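Comparison of classifier balancing strategies

Figure 2 (left) empirically shows the L2 norms of the weight vectors for all classifiers, together with the training data distribution sorted in descending order of the number of instances in the training set.
We can observe:

The weight norms of the joint classifier (blue line) are positively correlated with the number of training instances of the corresponding classes.
- More-shot classes tend to learn classifiers with larger magnitudes. As shown in Figure 4, this yields a wider classification boundary in feature space, allowing the classifier to have higher accuracy on data-rich classes but hurting data-scarce classes.
The τ-normalized classifier (gold line) alleviates this issue to some extent by providing more balanced classifier weight magnitudes.
For the re-training strategy (green line), the weights are almost balanced, except that few-shot classes have slightly larger weight norms.
The NCM approach would give a horizontal line in the figure, since the mean vectors are L2-normalized before nearest-neighbor search.
Figure 2 (right) further examines how performance changes as the temperature parameter τ of the τ-normalized classifier varies. It shows that as τ increases from 0, many-shot accuracy drops sharply while few-shot accuracy rises sharply.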

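Comparison with existing methods

Additional experiments

Choosing \(\tau\): in the current setup, tau has to be determined on a validation set, which can be a drawback in practice. The authors therefore design two more adaptive strategies:
- Finding tau on the training set: Table 9 shows that the resulting test-set performance is very close.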
  - We achieve this goal by simulating a balanced testing distribution from the training set.
    - We first feed the whole training set through the network to get the top-1 accuracy for each of the classes.
    - Then, we average the class-specific accuracies and use the averaged accuracy as the metric to determine the tau value.
  - As shown in Table 9, we compare the τ found on training set and validation set for all three datasets. We can see that both the value of τ and the overall performances are very close to each other, which demonstrates the effectiveness of searching for τ on training set.
  - This strategy offers a practical way to find τ even when validation set is not available.
- Learning tau on the training set: We further investigate if we can automatically learn the τ value instead of grid search.
  - To this end, following cRT, we set τ as a learnable parameter and learn it on the training set with balanced sampling, while keeping all the other parameters fixed (including both the backbone network and classifier).
  - Also, we compare the learned τ value and the corresponding results in the Table 9 (denoted by “learn” = ✓). This further reduces the manual effort of searching best τ values and make the strategy more accessible for practical usage.
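The first strategy, searching for tau on the training set with a class-averaged metric, can be sketched as a simple grid search. The code below is a plain-Python illustration under toy assumptions: the classifier weights `W`, the training features, and all function names are invented, and features are assumed to come from a frozen backbone.

```python
import math


def tau_normalized_predict(weights, feature, tau):
    """Predict with the bias-free classifier whose weights are w_j / ||w_j||^tau."""
    scores = []
    for w in weights:
        norm = math.sqrt(sum(v * v for v in w))
        scores.append(sum(wi * zi for wi, zi in zip(w, feature)) / (norm ** tau))
    return max(range(len(scores)), key=scores.__getitem__)


def class_averaged_accuracy(preds, labels, num_classes):
    """Mean of per-class accuracies, which simulates a balanced test distribution."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for pred, label in zip(preds, labels):
        total[label] += 1
        correct[label] += int(pred == label)
    per_class = [c / t for c, t in zip(correct, total) if t > 0]
    return sum(per_class) / len(per_class)


def search_tau(weights, features, labels, grid):
    """Return the (tau, accuracy) pair with the best class-averaged accuracy on the grid."""
    best_tau, best_acc = None, -1.0
    for tau in grid:
        preds = [tau_normalized_predict(weights, f, tau) for f in features]
        acc = class_averaged_accuracy(preds, labels, len(weights))
        if acc > best_acc:  # ties keep the smaller tau
            best_tau, best_acc = tau, acc
    return best_tau, best_acc


# Hypothetical toy problem: head class with a large-norm weight, tail class with a small one.
W = [[3.0, 4.0], [0.0, 0.5]]
train_feats = [[1.0, 0.5], [1.0, 0.0], [0.8, 0.2], [0.2, 1.0]]
train_labels = [0, 0, 0, 1]
grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
best_tau, best_acc = search_tau(W, train_feats, train_labels, grid)
```

Averaging per-class accuracies rather than per-sample accuracy is what keeps the head classes from dominating the choice of tau when the search is run on the long-tailed training set.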

MLP classifier vs. linear classifier: We use ReLU as activation function, set the batch size to be 512, and train the MLP using balanced sampling on fixed representation for 10 epochs with a cosine learning rate schedule, which gradually decrease the learning rate to zero.

Replacing the linear classifier with a cosine similarity classifier: We tried to replace the linear classifier with a cosine similarity classifier with (denoted by "cos") and without (denoted by "cos(noRelu)") the last ReLU activation function, following [Dynamic few-shot visual learning without forgetting].

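Summary of experiments

Although the sampling strategy matters when representation and classifier are learned jointly, instance-balanced sampling gives more generalizable representations; after properly re-balancing the classifier, state-of-the-art performance can be reached without carefully designed losses or memory units.

References

Paper: https://arxiv.org/pdf/1910.09217.pdf
Code: https://github.com/facebookresearch/classifier-balancing
"《Decoupling Representation and Classifier》筆記", article by 千佛山彭於晏 on Zhihu: https://zhuanlan.zhihu.com/p/111518894
"Long-Tailed Classification (2) 長尾分布下分類問題的最新研究", article by 青磷不可燃 on Zhihu: https://zhuanlan.zhihu.com/p/158638078
OpenReview page: https://openreview.net/forum?id=r1gRTCVFvB&noteId=SJx9gIcsoS
"基於GRU和am-softmax的句子相似度模型", 科學空間|Scientific Spaces: https://kexue.fm/archives/5743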
