The simplest approach
Download '20news-bydate.pkz' and put it in C:\Users\[Current user]\scikit_learn_data. That is all.
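How the path is chosen
By default scikit-learn stores datasets in C:\Users\[Current user]\scikit_learn_data.
You can also set the environment variable 'SCIKIT_LEARN_DATA'. The program then uses the directory named by that variable as the dataset directory, in place of the default scikit_learn_data folder.
If you want neither of these directories, you can edit the function get_data_home(data_home=None) in site-packages/sklearn/datasets/base.py.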
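The lookup order described above can be sketched with the standard library alone. This is a minimal illustration, not scikit-learn's own code: resolve_data_home and has_cached_20news are hypothetical helper names, and the sketch assumes the value of SCIKIT_LEARN_DATA is used as the data directory itself.

```python
import os
from pathlib import Path


def resolve_data_home(data_home=None):
    """Hypothetical helper: pick the dataset directory in the order
    explicit argument, SCIKIT_LEARN_DATA, then ~/scikit_learn_data."""
    if data_home is None:
        # Assumption: the environment variable holds the full directory path.
        data_home = os.environ.get(
            "SCIKIT_LEARN_DATA", os.path.join("~", "scikit_learn_data")
        )
    # Expand "~" to the current user's home (C:\Users\<name> on Windows).
    return Path(os.path.expanduser(data_home))


def has_cached_20news(data_home=None, cache_name="20news-bydate.pkz"):
    """Report whether the cache file already sits in the data directory."""
    return (resolve_data_home(data_home) / cache_name).is_file()


if __name__ == "__main__":
    print("data directory:", resolve_data_home())
    print("cache present:", has_cached_20news())
```

Running it before calling fetch_20newsgroups shows which folder the .pkz file has to go into. fetch_20newsgroups also accepts a data_home argument, which avoids editing base.py at all.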
Another solution
1. Manually download http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
and save it under scikit_learn_data/20news_home/
2. Edit the function download_20newsgroups in site-packages/sklearn/datasets/twenty_newsgroups.py
and comment out the following code:
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    if os.path.exists(archive_path):
        # Download is not complete as the .tar.gz file is removed after
        # download.
        logger.warning("Download was incomplete, downloading again.")
        os.remove(archive_path)

    logger.warning("Downloading dataset from %s (14 MB)", URL)
    opener = urlopen(URL)
    with open(archive_path, 'wb') as f:
        f.write(opener.read())
3. Run your program. It automatically extracts 20news-bydate.tar.gz and generates the cache file 20news-bydate.pkz.
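Step 1 above can also be done from a script. The following is only a sketch: place_archive is a hypothetical helper, and both the location of the downloaded file and the data directory are values you supply yourself.

```python
import shutil
from pathlib import Path


def place_archive(downloaded, data_home):
    """Hypothetical helper: copy a manually downloaded archive into
    <data_home>/20news_home/ and return the destination path."""
    target_dir = Path(data_home) / "20news_home"
    # Create scikit_learn_data/20news_home/ if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / "20news-bydate.tar.gz"
    shutil.copyfile(downloaded, destination)
    return destination


# Example call (both paths are placeholders to adapt):
# place_archive("20news-bydate.tar.gz", Path.home() / "scikit_learn_data")
```

Because the helper creates 20news_home itself, the folder exists even after the os.makedirs call in download_20newsgroups has been commented out.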