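Machine learning is a branch of computer science that studies algorithms which learn from data on their own, without human intervention.
Scikit-learn (sklearn) and TensorFlow (tf) serve different purposes: Scikit-learn is positioned as a general-purpose machine learning library, while TensorFlow is positioned mainly as a deep learning library.
The first step in data science is usually loading data, so we begin by learning how to load datasets with Scikit-learn.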
Datasets usually come from one of two sources:
- Preparing them yourself
- Obtaining them from a third party
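If you are not a researcher, you will generally obtain datasets from a third party. Several websites offer them.
For example, this page lists many dataset sources: https://www.kdnuggets.com/datasets/index.html.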
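Note: Scikit-learn is one of the SciKits. SciKit is short for SciPy Toolkit; the name comes from the SciPy library, on which the SciKits are built. Besides Scikit-learn there are many other SciKit modules, a list of which can be browsed online. Scikit-learn is the module that focuses on machine learning and data mining.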
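Scikit-learn also ships with some built-in datasets, which we can try loading.
First import the datasets module from sklearn, then call its load_digits() function to load the data.
The loading code: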
# Import `datasets` from `sklearn`
from sklearn import datasets

# Load the `digits` dataset
digits = datasets.load_digits()

# Print the `digits` data
print(digits)
Output:
{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'images': array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],
        [ 0.,  0., 13., ..., 15.,  5.,  0.],
        [ 0.,  3., 15., ..., 11.,  8.,  0.],
        ...,
        [ 0.,  4., 11., ..., 12.,  7.,  0.],
        [ 0.,  2., 14., ..., 12.,  0.,  0.],
        [ 0.,  0.,  6., ...,  0.,  0.,  0.]],
       [[ 0.,  0.,  0., ...,  5.,  0.,  0.],
        [ 0.,  0.,  0., ...,  9.,  0.,  0.],
        [ 0.,  0.,  3., ...,  6.,  0.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.]],
       [[ 0.,  0.,  0., ..., 12.,  0.,  0.],
        [ 0.,  0.,  3., ..., 14.,  0.,  0.],
        [ 0.,  0.,  8., ..., 16.,  0.,  0.],
        ...,
        [ 0.,  9., 16., ...,  0.,  0.,  0.],
        [ 0.,  3., 13., ..., 11.,  5.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.]],
       ...,
       [[ 0.,  0.,  1., ...,  1.,  0.,  0.],
        [ 0.,  0., 13., ...,  2.,  1.,  0.],
        [ 0.,  0., 16., ..., 16.,  5.,  0.],
        ...,
        [ 0.,  0., 16., ..., 15.,  0.,  0.],
        [ 0.,  0., 15., ..., 16.,  0.,  0.],
        [ 0.,  0.,  2., ...,  6.,  0.,  0.]],
       [[ 0.,  0.,  2., ...,  0.,  0.,  0.],
        [ 0.,  0., 14., ..., 15.,  1.,  0.],
        [ 0.,  4., 16., ..., 16.,  7.,  0.],
        ...,
        [ 0.,  0.,  0., ..., 16.,  2.,  0.],
        [ 0.,  0.,  4., ..., 16.,  2.,  0.],
        [ 0.,  0.,  5., ..., 12.,  0.,  0.]],
       [[ 0.,  0., 10., ...,  1.,  0.,  0.],
        [ 0.,  2., 16., ...,  1.,  0.,  0.],
        [ 0.,  0., 15., ..., 15.,  0.,  0.],
        ...,
        [ 0.,  4., 16., ..., 16.,  6.,  0.],
        [ 0.,  8., 16., ..., 16.,  8.,  0.],
        [ 0.,  1.,  8., ..., 12.,  1.,  0.]]]),
 'DESCR': ".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 5620\n :Number of Attributes: 64\n :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n :Missing Attribute Values: None\n :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\n.. topic:: References\n\n - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n Graduate Studies in Science and Engineering, Bogazici University.\n - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n Linear dimensionalityreduction using relevance weighted LDA. School of\n Electrical and Electronic Engineering Nanyang Technological University.\n 2005.\n - Claudio Gentile. A New Approximate Maximal Margin Classification\n Algorithm. NIPS. 2000."}
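Printing the whole object is hard to read, so it can help to look at its parts one at a time. The sketch below is an illustration rather than part of the original walkthrough: it accesses the keys seen in the output above (data, target, images) as attributes and prints their shapes. Note that the bundled copy is only the test portion of the UCI data, so it holds fewer samples than the 5620 instances mentioned in DESCR.

```python
from sklearn import datasets

# Load the bundled digits dataset (a dictionary-like object).
digits = datasets.load_digits()

# The keys shown in the printed output are also available as attributes.
print(digits.data.shape)    # (1797, 64): one flattened 8x8 image per row
print(digits.images.shape)  # (1797, 8, 8): the same pixels, kept as 2-D images
print(digits.target.shape)  # (1797,): one digit label per sample

# `data` is just `images` with each 8x8 image flattened to 64 values.
flattened = digits.images.reshape(len(digits.images), -1)
print((flattened == digits.data).all())  # True
```

In practice, digits.data and digits.target are the two arrays that are usually passed on to a model.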
The datasets module also contains functions for fetching other popular datasets. For example, datasets.fetch_openml can fetch datasets from the OpenML repository.
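As a rough sketch of how fetch_openml might be used, the helper below wraps the call in a small function. The function name fetch_dataset_from_openml and the default version=1 are choices made for this illustration, not something the original text specifies. Calling it needs a network connection the first time.

```python
from sklearn.datasets import fetch_openml


def fetch_dataset_from_openml(name, version=1):
    """Download a dataset from OpenML by name and return (features, target).

    `name` and `version` identify the dataset on OpenML; the default
    version=1 is an assumption made for this sketch.
    """
    # return_X_y=True returns the pair (data, target) instead of a Bunch object.
    return fetch_openml(name=name, version=version, return_X_y=True)
```

For instance, fetch_dataset_from_openml("mnist_784") should download the MNIST digits, a dataset name used in the Scikit-learn documentation; by default the downloaded data is cached locally for later calls.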
The dataset used in the example above can also be downloaded from this address: http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/
The code:
# Import the `pandas` library
import pandas as pd

# Load the dataset with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

# Print the `digits` data
print(digits)
Output:
       0   1   2   3   4   5   6   7   8  ...  56  57  58  59  60  61  62  63  64
0      0   1   6  15  12   1   0   0   0  ...   0   0   6  14   7   1   0   0   0
1      0   0  10  16   6   0   0   0   0  ...   0   0  10  16  15   3   0   0   0
2      0   0   8  15  16  13   0   0   0  ...   0   0   9  14   0   0   0   0   7
3      0   0   0   3  11  16   0   0   0  ...   0   0   0   1  15   2   0   0   4
4      0   0   5  14   4   0   0   0   0  ...   0   0   4  12  14   7   0   0   6
      ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..
3818   0   0   5  13  11   2   0   0   0  ...   0   0   8  13  15  10   1   0   9
3819   0   0   0   1  12   1   0   0   0  ...   0   0   0   4   9   0   0   0   4
3820   0   0   3  15   0   0   0   0   0  ...   0   0   4  14  16   9   0   0   6
3821   0   0   6  16   2   0   0   0   0  ...   0   0   5  16  16  16   5   0   6
3822   0   0   2  15  16  13   1   0   0  ...   0   0   4  14   1   0   0   0   7

[3823 rows x 65 columns]
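Notice that the file in the download URL above has the extension .tra, which marks it as the training dataset. The same page also lists a .tes file, which is the test dataset. In other words, this dataset comes already split into a training set and a test set, and the example above loaded only the training set.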
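The 65 columns reported above are the 64 pixel counts followed by the digit label in the last column, so a natural next step is to separate the two. The sketch below shows one way to do that. The helper name load_optdigits is made up for this illustration, and the two-row toy_csv is invented stand-in data in the same layout, used so the example runs without a network connection.

```python
import io

import pandas as pd


def load_optdigits(source):
    """Read an optdigits-style CSV: no header, 64 pixel columns, then the label."""
    frame = pd.read_csv(source, header=None)
    features = frame.iloc[:, :-1]  # columns 0..63: pixel counts
    labels = frame.iloc[:, -1]     # column 64: digit class
    return features, labels


# Toy stand-in for the real file: two made-up rows in the same 65-column layout.
toy_csv = "\n".join([
    ",".join(["0"] * 64 + ["7"]),
    ",".join(["1"] * 64 + ["4"]),
])
X_toy, y_toy = load_optdigits(io.StringIO(toy_csv))
print(X_toy.shape)     # (2, 64)
print(y_toy.tolist())  # [7, 4]
```

With a network connection, the same function can be given the optdigits.tra URL from the example above to get the training features and labels. The test set could presumably be loaded the same way from the .tes file in the same directory, assuming it shares this layout.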