Some references:
[1] 李宏毅's machine learning lecture video on semi-supervised learning
[2] Transcript of the 李宏毅 video
(The two resources above cover the material in this order: semi-supervised generative model --> low-density assumption --> smoothness assumption.)
[3] A survey on semi-supervised learning. Engelen et al. 2018.
[4] MixMatch (bilibili video walkthrough)
The following is a summary of the resources above.
1. Why does semi-supervised learning help?
The distribution of the unlabeled data tells us something.
Unlabeled data has only inputs and no labels, but its distribution can still tell us something, provided we adopt some assumptions.
For example, with only the labeled data the decision boundary would be drawn one way; once the distribution of the unlabeled data is taken into account, a rather different boundary becomes more plausible.

2. When does semi-supervised learning work?
Unlabelled data is only useful if it carries information useful for label prediction that is not contained in the labelled data alone or cannot be easily extracted from it.
The main takeaway from these observations is that semi-supervised learning should not be seen as a guaranteed way of achieving improved prediction performance by the mere introduction of unlabelled data. Rather, it should be treated as another direction in the process of finding and configuring a learning algorithm for the task at hand.
3. Three Assumptions
smoothness assumption: if two samples x and x' are close in the input space, their labels y and y' should be the same.
low-density assumption: the decision boundary should not pass through high-density areas in the input space.
manifold assumption: data points on the same low-dimensional manifold should have the same label.

The low-density assumption can be considered the counterpart of the smoothness assumption for the underlying data distribution. This is why the 李宏毅 video introduces the smoothness assumption as follows:
In the video's example, x1, x2, and x3 are three points in a data distribution. If we only measure raw similarity in input space, x2 and x3 are clearly closer to each other. Under the smoothness assumption, however, x1 and x2 lie in the same dense region and are connected by a high-density path, while x2 and x3 are "disconnected", with no high-density path between them; therefore x1 and x2 are considered more "alike".

4. Empirical evaluation of semi-supervised learning methods
- It is important to establish high-quality supervised baselines to allow for proper assessment of the added value of the unlabelled data.
- In research, data sets used for evaluating semi-supervised learning algorithms are usually obtained by simply removing the labels of a large amount of data points from an existing supervised learning data set.
- In practice, the choice of data sets and their partitioning can have significant impact on the relative performance of different learning algorithms.
- Some algorithms may work well when the amount of labelled data is limited and perform poorly when more labelled data is available; others may excel on particular types of data sets but not on others.
- To provide a realistic evaluation of semi-supervised learning algorithms, researchers should thus evaluate their algorithms on a diverse suite of data sets with different quantities of labelled and unlabelled data.
5. Taxonomy of semi-supervised learning methods

- Inductive methods attempt to find a classification model, whereas transductive methods are solely concerned with obtaining label predictions for the given unlabelled data points.
- Inductive methods, like supervised learning methods, yield a classification model that can be used to predict the label of previously unseen data points. Transductive methods do not yield such a model, but instead directly provide predictions.
- Inductive methods involve optimization over prediction models, whereas transductive methods optimize directly over the predictions.
6. Inductive methods
Inductive methods aim to construct a classifier that can generate predictions for any object in the input space.
6.1. Wrapper methods
Wrapper methods first train classifiers on the labelled data, and then use the predictions of the resulting classifiers to generate additional labelled data. The classifiers can then be re-trained on this pseudo-labelled data in addition to the existing labelled data.
6.1.1. Self-training
- Self-training methods consist of a single supervised classifier that is iteratively retrained on the labelled data together with its own most confident pseudo-labels.
- The selection procedure for data to be pseudo-labelled is of particular importance. In typical self-training settings, where this selection is made based on prediction confidence, the quality of the confidence estimates significantly influences algorithm performance.
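As a concrete illustration, here is a minimal sketch of a confidence-thresholded self-training loop. It assumes scikit-learn-style data arrays (`X_lab`, `y_lab`, `X_unlab`), an arbitrary base classifier, and a hypothetical threshold `tau`; it is not a specific algorithm from the references above.

```python
# Minimal self-training sketch (illustrative, not a specific algorithm from the references).
# X_lab, y_lab: labelled inputs and labels; X_unlab: unlabelled inputs; tau: confidence threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, tau=0.95, max_rounds=10):
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        confident = proba.max(axis=1) >= tau                    # keep only high-confidence predictions
        if not confident.any():
            break
        X = np.vstack([X, pool[confident]])                     # add pseudo-labelled points ...
        y = np.concatenate([y, clf.predict(pool[confident])])   # ... with the classifier's own labels
        pool = pool[~confident]                                 # and drop them from the unlabelled pool
    return clf
```

The threshold `tau` directly controls the trade-off mentioned above: a loose threshold adds many, possibly wrong, pseudo-labels, while a strict one adds only a few.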
6.1.2. Co-training
- Co-training is an extension of self-training to multiple supervised classifiers. In co-training, two or more supervised classifiers are iteratively trained on the labelled data, adding their most confident predictions to the labelled data set of the other supervised classifiers in each iteration.
- For co-training to succeed, it is important that the base learners are not too strongly correlated in their predictions. If they are, their potential to provide each other with useful information is limited.
- These classifiers exchange information through unlabelled data, for which different learners predict different labels.
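A minimal sketch of two-view co-training, assuming the feature columns can be split into two views (`view1` and `view2` are hypothetical column-index lists) and using naive Bayes base learners as an arbitrary choice:

```python
# Minimal two-view co-training sketch (illustrative).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_lab, y_lab, X_unlab, view1, view2, n_rounds=5, k=5):
    # each classifier keeps its own growing labelled set
    X1, y1 = X_lab[:, view1], y_lab.copy()
    X2, y2 = X_lab[:, view2], y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(n_rounds):
        c1 = GaussianNB().fit(X1, y1)
        c2 = GaussianNB().fit(X2, y2)
        if len(pool) == 0:
            break
        p1 = c1.predict_proba(pool[:, view1])
        p2 = c2.predict_proba(pool[:, view2])
        # each classifier pseudo-labels its k most confident points for the *other* classifier
        top1 = np.argsort(p1.max(axis=1))[-k:]
        top2 = np.argsort(p2.max(axis=1))[-k:]
        X2 = np.vstack([X2, pool[top1][:, view2]])
        y2 = np.concatenate([y2, c1.predict(pool[top1][:, view1])])
        X1 = np.vstack([X1, pool[top2][:, view1]])
        y1 = np.concatenate([y1, c2.predict(pool[top2][:, view2])])
        pool = np.delete(pool, np.union1d(top1, top2), axis=0)
    return c1, c2
```

Splitting the features into two weakly correlated views is exactly what keeps the two base learners from becoming too strongly correlated.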
6.1.3. Boosting
Ensemble classifiers consist of multiple base classifiers, which are trained and then used to form combined predictions. Boosting builds such an ensemble sequentially, with each base classifier trained on data reweighted towards the mistakes of the previous ones; semi-supervised boosting methods additionally feed pseudo-labelled unlabelled data into these boosting rounds.
6.2. Unsupervised preprocessing
These methods implicitly consist of two steps: first, unsupervised feature extraction or clustering using all the data; second, supervised classification.
6.2.1. Feature extraction
- Principal component analysis, Autoencoder, ...
- They inherently act on the assumption that the input space contains lower-dimensional substructures on which the data lie.
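A minimal sketch of this two-step recipe with PCA as the extractor; `X_lab`, `y_lab`, `X_unlab`, `X_test`, and the number of components are placeholders:

```python
# Minimal sketch: unsupervised feature extraction + supervised classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X_all = np.vstack([X_lab, X_unlab])              # the extractor can use every input, labelled or not
pca = PCA(n_components=10).fit(X_all)            # number of components chosen arbitrarily here

clf = SVC().fit(pca.transform(X_lab), y_lab)     # the supervised step uses only the labelled data
y_pred = clf.predict(pca.transform(X_test))      # X_test: hypothetical test inputs
```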
6.2.2. Cluster-then-label
These methods first apply an unsupervised or semi-supervised clustering algorithm to all available data, and use the resulting clusters to guide the classification process.
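A minimal cluster-then-label sketch using k-means and per-cluster majority voting; the number of clusters and the fallback rule for clusters without labelled points are arbitrary choices, and labels are assumed to be non-negative integers:

```python
# Minimal cluster-then-label sketch (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_lab, y_lab, X_unlab, n_clusters=10):
    X_all = np.vstack([X_lab, X_unlab])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_all)
    lab_clusters = km.labels_[: len(X_lab)]        # cluster id of each labelled point
    unlab_clusters = km.labels_[len(X_lab):]       # cluster id of each unlabelled point
    # majority label per cluster; clusters without labelled points fall back to the global majority
    global_majority = np.bincount(y_lab).argmax()
    cluster_label = np.full(n_clusters, global_majority)
    for c in range(n_clusters):
        members = y_lab[lab_clusters == c]
        if len(members) > 0:
            cluster_label[c] = np.bincount(members).argmax()
    return cluster_label[unlab_clusters]           # predicted labels for the unlabelled points
```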
6.2.3. Pre-training
- Unlabelled data is used to guide the decision boundary towards potentially interesting regions before applying supervised training.
- For deep learning methods, unsupervised pre-training aims to guide the parameters (weights) of a network towards interesting regions in model space using the unlabelled data, before the parameters are fine-tuned with the labelled data.
- Unsupervised pre-training guides the neural network model towards regions in model space that provide better generalization.
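A minimal PyTorch sketch of this two-phase scheme: an autoencoder is pre-trained on all inputs for reconstruction, and the encoder is then fine-tuned with a classification head on the labelled data. Layer sizes, learning rates, and the `unlabelled_loader` / `labelled_loader` data loaders are hypothetical placeholders:

```python
# Minimal unsupervised pre-training sketch (illustrative; sizes and hyperparameters are placeholders).
import torch
import torch.nn as nn

d_in, d_hidden, n_classes = 784, 64, 10
encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
decoder = nn.Linear(d_hidden, d_in)
head = nn.Linear(d_hidden, n_classes)

# Phase 1: unsupervised pre-training on all data (labels not used).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for x in unlabelled_loader:                 # hypothetical DataLoader over all inputs
    loss = nn.functional.mse_loss(decoder(encoder(x)), x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: supervised fine-tuning on the labelled data only.
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
for x, y in labelled_loader:                # hypothetical DataLoader over (input, label) pairs
    loss = nn.functional.cross_entropy(head(encoder(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```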
6.3. Intrinsically semi-supervised methods
These methods directly optimize an objective function with components for labeled and unlabeled samples.
In other words, the objective function of these methods contains at least two terms: one for the labeled data and one for the unlabeled data.
6.3.1. Maximum-margin methods
These methods build on the low-density assumption.
One can incorporate knowledge from the unlabelled data to determine where the density is low and thus, where a large margin can be achieved.
Support vector machines (SVMs): various approaches have been proposed for incorporating the unlabeled data into the optimization objective.
Gaussian processes: incorporating the unlabelled data points into the likelihood function.
Density regularization: the estimated class probability distributions are put into the cost function; specifically, the amount of overlap between the estimated posterior class probabilities is explicitly incorporated into the cost function. When there is a large amount of overlap, the decision boundary passes through a high-density area; when there is a small amount of overlap, it passes through a low-density area.
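One common way to turn this idea into a loss term is to use the entropy of the predicted class distribution on unlabelled points as the measure of overlap; this is an illustrative choice, not necessarily the exact formulation used by the methods above. In the sketch below, `probs_unlab` stands in for the model's predicted class probabilities on unlabelled points and `lam` is a hypothetical weight:

```python
# Minimal sketch of a density-style regularizer: penalize high-entropy (overlapping)
# class probabilities on unlabelled points. probs_unlab and lam are hypothetical.
import numpy as np

def entropy_regularizer(probs_unlab, eps=1e-12):
    # mean prediction entropy: large when the posteriors overlap near the decision boundary
    return -np.mean(np.sum(probs_unlab * np.log(probs_unlab + eps), axis=1))

# total_loss = supervised_loss + lam * entropy_regularizer(probs_unlab)
```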
6.3.2. Perturbation-based methods
The core idea of these methods is that adding a small perturbation (noise) to a sample should not change its label.
The smoothness assumption entails that a predictive model should be robust to local perturbations in its input. This means that, when we perturb a data point with a small amount of noise, the predictions for the noisy and the clean inputs should be similar. Since this expected similarity is not dependent on the true label of the data points, we can make use of unlabelled data.
Many different methods exist for incorporating the smoothness assumption into a given learning algorithm. For instance, one could apply noise to the input data points, and incorporate the difference between the clean and the noisy predictions into the loss function. Alternatively, one could implicitly apply noise to the data points by perturbing the classifier itself. These two approaches give rise to the category of perturbation-based methods.
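A minimal PyTorch sketch of the first variant (noise on the input, with the clean/noisy prediction difference added to the loss); the noise scale `sigma` and the consistency weight `lam` are arbitrary placeholders:

```python
# Minimal consistency-regularization training step (input-noise variant; illustrative).
import torch
import torch.nn.functional as F

def training_step(model, x_lab, y_lab, x_unlab, lam=1.0, sigma=0.1):
    # supervised term on the labelled batch
    sup_loss = F.cross_entropy(model(x_lab), y_lab)
    # consistency term: predictions on clean and noise-perturbed unlabelled inputs should agree
    p_clean = F.softmax(model(x_unlab), dim=1)
    p_noisy = F.softmax(model(x_unlab + sigma * torch.randn_like(x_unlab)), dim=1)
    cons_loss = F.mse_loss(p_noisy, p_clean.detach())
    return sup_loss + lam * cons_loss
```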
Ladder networks: the underlying idea is that latent representations that are useful for input reconstruction can also facilitate class prediction. See the discussion in A survey on semi-supervised learning, the original paper Semi-Supervised Learning with Ladder Networks, or the blog post 階梯網絡Ladder Network_sxt1001的博客-CSDN博客_ladder network.
Pseudo-ensembles: Instead of explicitly perturbing the input data, one can also perturb the neural network model itself.
$\Pi$-model: Instead of comparing the activations of the unperturbed parent model with those of the perturbed models in the cost function, one can also compare the perturbed models directly. In other words, the activations of several perturbed versions of the model are compared with one another, rather than with those of the unperturbed model.
Temporal ensembling: the activations of the neural network at each epoch are compared to its activations at previous epochs. In particular, after each epoch, the output of the network is compared to the exponential moving average of its outputs in previous epochs. For background on exponential moving averages, see 理解滑動平均(exponential moving average). The approach is dubbed temporal ensembling because it penalizes differences between the network outputs at different points in time during training.
Mean teacher: a drawback of temporal ensembling is that the ensemble targets are refreshed only at long intervals. Since the activations for each input are generated only once per epoch, it takes a long time for the activations of unlabelled data points to influence the inference process. Tarvainen and Valpola (2017) attempted to overcome this problem by considering moving averages over connection weights instead of moving averages over network activations. Specifically, they suggested calculating the exponential moving average of the weights at each training iteration, and comparing the resulting final-layer activations to the final-layer activations obtained with the latest set of weights.
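A minimal PyTorch sketch of the weight-averaging step behind the mean teacher: the teacher's parameters are an exponential moving average of the student's, updated at every iteration; the decay `alpha` is a hypothetical value. The consistency loss is then computed between student and teacher predictions rather than against stored past activations.

```python
# Minimal mean-teacher-style EMA weight update (illustrative).
import copy
import torch

def make_teacher(student):
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)          # the teacher is never trained directly
    return teacher

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1.0 - alpha)   # pt = alpha * pt + (1 - alpha) * ps
```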
Virtual adversarial training: these methods also take the direction of the perturbation into account; that is, the perturbation is not isotropic. The motivation is that several studies have suggested that the sensitivity of neural networks to input perturbations is often highly dependent on the direction of these perturbations.
6.3.3. Manifolds
- Manifold regularization techniques define a graph over the data points and implicitly penalize differences in predictions for data points with small geodesic distance (see the sketch after this list).
- Manifold approximation techniques explicitly estimate the manifolds M on which the data lie and optimize an objective function accordingly.
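A minimal numpy sketch of the graph penalty used in manifold regularization: with a similarity matrix W over all points (a Gaussian kernel is an arbitrary choice here) and graph Laplacian L, the term f^T L f equals 0.5 * sum_ij W_ij (f_i - f_j)^2, so points that are close on the graph are pushed towards similar predictions:

```python
# Minimal manifold-regularization penalty sketch (illustrative).
import numpy as np

def laplacian_penalty(f, X, gamma=1.0):
    # f: one prediction per data point; X: all inputs, labelled and unlabelled
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-gamma * d2)                                # Gaussian similarity graph
    L = np.diag(W.sum(axis=1)) - W                         # graph Laplacian
    return f @ L @ f                                       # = 0.5 * sum_ij W_ij * (f_i - f_j)**2
```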
6.4. Generative models
- Mixture models, e.g. Gaussian mixture models
- Generative adversarial networks
- Variational autoencoders
7. Transductive methods
Unlike inductive algorithms, transductive algorithms do not produce a predictor that can operate over the entire input space. Instead, transductive methods yield a set of predictions for the set of unlabelled data points provided to the learning algorithm.
Transductive methods typically define a graph over all data points, both labelled and unlabelled, encoding the pairwise similarity of data points with possibly weighted edges. An objective function is then defined and optimized in order to achieve two goals (a minimal sketch of this scheme follows the list):
- For labelled data points, the predicted labels should match the true labels.
- Similar data points, as defined via the similarity graph, should have the same label predictions.
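The numpy sketch below shows a label-propagation-style iteration that balances these two goals; the Gaussian kernel, the symmetric normalization, and the trade-off parameter `alpha` are standard but arbitrary choices for illustration:

```python
# Minimal graph-based label propagation sketch (illustrative).
import numpy as np

def label_propagation(X, y, labelled_mask, n_classes, gamma=1.0, alpha=0.9, n_iter=100):
    # similarity graph over all points (labelled and unlabelled)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                # symmetrically normalized similarities

    # one-hot targets for labelled points, zeros for unlabelled ones
    # (y entries for unlabelled points are ignored and may be e.g. -1)
    Y = np.zeros((len(X), n_classes))
    Y[labelled_mask, y[labelled_mask]] = 1.0

    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1.0 - alpha) * Y      # smoothness on the graph vs. fitting the labels
    return F.argmax(axis=1)                        # label predictions for every point
```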
