Semi-Supervised Learning Benchmark Datasets

Reposted article; original post dated 2020-03-03 22:36. Topic: semi-supervised learning.

The datasets come from:

Chapelle O, Schölkopf B, Zien A. Semi-Supervised Learning (Chapelle, O. et al., eds.; 2006) [Book Reviews][J]. IEEE Transactions on Neural Networks, 2009, 20(3): 542-542.

They are available online at:

http://olivier.chapelle.cc/ssl-book/benchmarks.html

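It took me a long time to find this, and some of the links online no longer open, so I am saving it here for future reference.

I also found a good site that summarizes semi-supervised datasets: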

Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study

https://sci2s.ugr.es/SelfLabeled

The content of the first link follows:

The Benchmark Data Sets

For each data set, we provide 12 splits (exception: only 10 splits for data set 8) into labeled points and remaining unlabeled points. We ensure that each split contains at least one point of each class. Apart from this, there is no bias in the labeling process.

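Note: each dataset comes with 12 splits (data set 8 is the exception, with 10). In each split, 10 or 100 points are labeled and the rest are unlabeled, and every split contains at least one labeled point from each class.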
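The splitting rule described above can be sketched as follows. This is a minimal illustration, not the procedure the benchmark authors actually used: it assumes uniform sampling with a redraw whenever a class is missing, and `make_split` and `toy_labels` are hypothetical names. Indices here are 0-based, as usual in Python.

```python
import random

def make_split(labels, n_labeled, rng):
    """Draw n_labeled indices uniformly, redrawing until every class appears."""
    classes = set(labels)
    if n_labeled < len(classes):
        raise ValueError("need at least one labeled point per class")
    all_indices = list(range(len(labels)))
    while True:
        labeled = rng.sample(all_indices, n_labeled)
        # Accept the draw only if each class has at least one labeled point.
        if {labels[i] for i in labeled} == classes:
            break
    labeled_set = set(labeled)
    unlabeled = [i for i in all_indices if i not in labeled_set]
    return sorted(labeled), unlabeled

# Toy data (an assumption): 1500 points with made-up binary labels.
rng = random.Random(0)
toy_labels = [rng.choice([-1, 1]) for _ in range(1500)]

# Twelve splits with l=10 labeled points each.
splits = [make_split(toy_labels, 10, rng) for _ in range(12)]
```

Rejecting and redrawing a whole sample keeps the draw uniform over all valid subsets, so the only constraint imposed is the one on class coverage.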

The table contains individual files in matlab 5.0 format (.mat files).

Data Set	Points	Dimensions	Splits with l Labeled Points
g241c (set 5)	1500	241	l=10, l=100
g241n (set 7)	1500	241	l=10, l=100
Digit1 (set 1)	1500	241	l=10, l=100
USPS (set 2)	1500	241	l=10, l=100
COIL (set 6)	1500	241	l=10, l=100
COIL₂ (set 3)	1500	241	l=10, l=100 (binary version of set 6)
BCI (set 4)	400	117	l=10, l=100
Text (set 9)	1500	11960	l=10, l=100
SecStr (set 8)	83,679 + 1,189,472	315	l=100, l=1000, l=10000 (no splitting of extra unlabeled data); matlab script required

You can also download all data sets and splits (excluding the extra unlabeled data of set 8) at once as archive files, in matlab format: gzipped TAR file, ZIP file; in ascii format: gzipped TAR file, ZIP file (here, only the indices of the labeled examples are provided; all other examples are unlabeled). Data sets 8 and 9 are supplied in special formats. In set 8, all attributes are categorical and have to be expanded into a sparse binary vector (21 bits per attribute; cf. the matlab script). In set 9, the data are very sparse, and only non-zero values are supplied as a list of "index:value" pairs.
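The two special formats can be illustrated with a short sketch. It is not the matlab script mentioned above, and it rests on assumptions: the function names are hypothetical, the "index:value" tokens are assumed to be whitespace-separated, and the categorical codes of set 8 are assumed to be integers from 0 to 20. The actual files may differ.

```python
def parse_sparse_line(line):
    """Parse whitespace-separated "index:value" tokens into {index: value}."""
    features = {}
    for token in line.split():
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # indices are kept as given
    return features

def expand_categorical(codes, n_categories=21):
    """One-hot encode each code into n_categories bits, concatenated.

    Assumption: codes are integers in 0..n_categories-1.
    Returns the positions of the 1-bits, i.e. a sparse binary vector.
    """
    active = []
    for position, code in enumerate(codes):
        if not 0 <= code < n_categories:
            raise ValueError(f"code {code} out of range")
        active.append(position * n_categories + code)
    return active

print(parse_sparse_line("3:0.5 17:1.25 11960:2"))  # {3: 0.5, 17: 1.25, 11960: 2.0}
print(expand_categorical([0, 20, 5]))              # [0, 41, 47]
```

Since 315 = 15 × 21, the 315 dimensions listed for SecStr would be consistent with 15 categorical attributes expanded in this way.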

The files use the following variables:

X = matrix of input data; each row corresponds to one example
y = the labels (either {0,1} or {-1,+1} for binary problems)
idxLabs = each row contains the indices of the labeled points for a given split
idxUnls = the same for the unlabeled points
(all indices are 1-based as in matlab, not 0-based as in C)
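Because the indices are 1-based, they need shifting before use in a 0-based language. The sketch below shows this in plain Python on toy data. The helper names and the toy values are assumptions, and reading the actual .mat files (for example with `scipy.io.loadmat`) is not shown.

```python
def normalize_labels(y):
    """Map binary labels given as {0,1} or {-1,+1} onto {-1,+1}."""
    return [-1 if value <= 0 else 1 for value in y]

def split_by_indices(X, y, idx_labs_row):
    """Select labeled and unlabeled examples from one row of 1-based indices."""
    labeled = [i - 1 for i in idx_labs_row]  # convert 1-based to 0-based
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(X)) if i not in labeled_set]
    X_lab = [X[i] for i in labeled]
    y_lab = [y[i] for i in labeled]
    X_unl = [X[i] for i in unlabeled]
    return X_lab, y_lab, X_unl

# Toy stand-ins for X, y and idxLabs (values are made up).
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
y = normalize_labels([0, 1, 1, 0])
idxLabs = [[1, 3], [2, 4]]  # one row per split, 1-based

X_lab, y_lab, X_unl = split_by_indices(X, y, idxLabs[0])
print(X_lab)  # [[0.0, 1.0], [4.0, 5.0]]
print(y_lab)  # [-1, 1]
print(X_unl)  # [[2.0, 3.0], [6.0, 7.0]]
```

Here the unlabeled set is derived as the complement of the labeled indices, which matches the description that all points outside a split's labeled set are unlabeled.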


2020.3.3
