Co-Training vs Self-Training


First of all, in practical classification scenarios we often have only a small amount of labeled data, while far more of the data is unlabeled. Co-training and self-training are two algorithms designed to handle exactly this situation.

 

The two algorithms are described below.

1. Self-training:

Use the existing labeled data to build an initial classifier, then use it to estimate labels for the unlabeled data.

Next, combine the original labeled data with the newly estimated "pseudo-labeled" data and train a new classifier on the union.

Repeat the steps above until all of the unlabeled data has been classified.
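
A minimal sketch of this loop in Python, assuming scikit-learn and numpy arrays; the base learner, the confidence threshold, and the iteration cap are illustrative assumptions (the description above simply repeats until all unlabeled data is absorbed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled,
                  confidence=0.9, max_iter=20):
    """Iteratively pseudo-label confident unlabeled samples and retrain."""
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)   # illustrative base learner
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                     # train on the current labeled set
        proba = clf.predict_proba(X_u)        # estimate the unlabeled data
        picked = proba.max(axis=1) >= confidence
        if not picked.any():                  # nothing confident enough; stop
            break
        pseudo = clf.classes_[proba[picked].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[picked]])   # add the pseudo-labeled data
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~picked]
    clf.fit(X_l, y_l)                         # final classifier
    return clf
```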

2. Co-training:

Co-training is used in a special case of the more general multi-view learning setting.

That is, it applies when the training data can be looked at from different views. For example, when building a web-page classification model, the features come from two sources: the URL features of the websites, denoted A, and the text features of the websites, denoted B.

The co-training algorithm is:

• Inputs: an initial collection of labeled documents and a collection of unlabeled documents.

• Loop while there exist documents without class labels:

        • Build classifier A using the A portion of each document.

        • Build classifier B using the B portion of each document.

        • For each class C, pick the unlabeled document about which classifier A is most confident that its class label is C, and add it to the collection of labeled documents.

        • For each class C, pick the unlabeled document about which classifier B is most confident that its class label is C, and add it to the collection of labeled documents.

• Output: two classifiers, A and B, that predict class labels for new documents. These predictions can be combined by multiplying together and then renormalizing their class probability scores.

In other words, build two classifiers from feature sets A and B respectively. Each classifier iterates in the self-training manner on its own (absorbing newly labeled unlabeled data each round), and when the two self-training runs finish, the two classifiers make predictions together.

The main idea is: for data whose features split naturally, build a separate classifier from each feature group; classifiers built from different features can complement one another.
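
A minimal sketch of the algorithm listed above, assuming each document arrives as two aligned feature matrices (the A/URL view and the B/text view) and using scikit-learn's MultinomialNB as an illustrative base learner; the iteration cap and the tie-breaking rule are assumptions not specified in the original description:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(Xa_l, Xb_l, y_l, Xa_u, Xb_u, max_iter=30):
    """Grow the labeled set: each view labels its most confident
    unlabeled document per class, then both classifiers retrain."""
    clf_a, clf_b = MultinomialNB(), MultinomialNB()   # illustrative learners
    for _ in range(max_iter):
        if len(Xa_u) == 0:
            break
        clf_a.fit(Xa_l, y_l)          # classifier A: the A (URL) view
        clf_b.fit(Xb_l, y_l)          # classifier B: the B (text) view
        new_labels = {}
        for clf, X_u in ((clf_a, Xa_u), (clf_b, Xb_u)):
            proba = clf.predict_proba(X_u)
            for c_idx, c in enumerate(clf.classes_):
                i = int(proba[:, c_idx].argmax())  # most confident doc for class c
                new_labels.setdefault(i, c)        # first picker wins on overlap
        idx = sorted(new_labels)
        Xa_l = np.vstack([Xa_l, Xa_u[idx]])  # the same documents join both views
        Xb_l = np.vstack([Xb_l, Xb_u[idx]])
        y_l = np.concatenate([y_l, [new_labels[i] for i in idx]])
        keep = np.setdiff1d(np.arange(len(Xa_u)), idx)
        Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
    return clf_a, clf_b

def predict_combined(clf_a, clf_b, Xa, Xb):
    """Combine views by multiplying class probabilities and renormalizing."""
    p = clf_a.predict_proba(Xa) * clf_b.predict_proba(Xb)
    p /= p.sum(axis=1, keepdims=True)
    return clf_a.classes_[p.argmax(axis=1)]
```

Here `predict_combined` implements the Output step's multiply-and-renormalize rule for combining the two classifiers' predictions.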

 

To summarize:

The most intuitive difference between co-training and self-training is that, during learning, the former has two classifiers while the latter has only one.

 

