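Sampling means selecting sample data from a dataset according to some rule. It falls broadly into three categories: random sampling, systematic sampling, and stratified sampling.
Random sampling: draw a specified number of items from the dataset at random. It comes in two forms, with replacement and without replacement.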
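import random

def noRepetRandomSampling(dataMat, number):
    '''
    Sampling without replacement
    :param dataMat: the dataset
    :param number: number of samples to draw
    :return: sample, the sampled data
    '''
    try:
        # random.sample never picks the same position twice
        sample = random.sample(dataMat, number)
        return sample
    except Exception as e:
        # e.g. number > len(dataMat); the error is printed and None is returned
        print(e)

def repetRandomSampling(dataMat, number):
    '''
    Sampling with replacement
    :param dataMat: the dataset
    :param number: number of samples to draw
    :return: sample, the sampled data
    '''
    sample = []
    for _ in range(number):
        # randint covers a <= x <= b, upper bound included, so subtract one
        sample.append(dataMat[random.randint(0, len(dataMat) - 1)])
    return sample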
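As a possible alternative to the hand-written loop, the standard library's `random.choices` draws with replacement in a single call, and a private `random.Random` instance makes the draw repeatable. The sketch below is only an illustration: the helper names `sample_with_replacement` and `sample_without_replacement` and the `seed` parameter are assumptions, not part of the original code.

```python
import random

def sample_with_replacement(data, number, seed=None):
    """Draw `number` items from `data`; the same item may appear more than once."""
    rng = random.Random(seed)  # private generator, so a fixed seed repeats the draw
    return rng.choices(data, k=number)

def sample_without_replacement(data, number, seed=None):
    """Draw `number` items from `data`; each position is used at most once."""
    rng = random.Random(seed)
    # raises ValueError if number > len(data)
    return rng.sample(data, number)

data = list(range(1, 21))
with_rep = sample_with_replacement(data, 6, seed=0)
without_rep = sample_without_replacement(data, 6, seed=0)
print(with_rep)
print(without_rep)
```

Passing the same `seed` twice gives the same sample, which can help when a result needs to be reproduced.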
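Systematic sampling: usually done without replacement, and also known as equal-interval sampling. The dataset is first divided, in order, into n small parts, and then the k-th item is taken from each part.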
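import random

def systemSampling(dataMat, number):
    '''
    Systematic sampling
    :param dataMat: the dataset
    :param number: number of samples to draw
    :return: sample, the sampled data
    '''
    length = len(dataMat)
    k = length // number  # sampling interval (size of each part)
    sample = []
    if k > 0:
        # take the first item of each part: indices 0, k, 2k, ...
        for i in range(number):
            sample.append(dataMat[i * k])
        return sample
    else:
        # more samples requested than items available:
        # fall back to random sampling with replacement
        return repetRandomSampling(dataMat, number)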
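The function above always starts at index 0, so it returns the same items on every run. Classical systematic sampling usually picks a random starting point inside the first interval instead. A minimal sketch of that variant follows; the name `systematic_sample_random_start` and the `seed` parameter are hypothetical, and the choice to raise an error when too many samples are requested is an assumption that differs from the fallback used above.

```python
import random

def systematic_sample_random_start(data, number, seed=None):
    """Equal-interval sampling with a random offset inside the first interval."""
    if number <= 0 or number > len(data):
        raise ValueError("number must be between 1 and len(data)")
    k = len(data) // number      # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)     # 0 <= start < k
    # the largest index is start + (number - 1) * k, which is < number * k <= len(data)
    return [data[start + i * k] for i in range(number)]

data = list(range(1, 21))
picked = systematic_sample_random_start(data, 6, seed=42)
print(picked)
```

With 20 items and 6 samples the interval is 3, so the result is one of three possible sequences, depending on the starting offset.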
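Stratified sampling: first split the data into several categories (strata), then randomly draw a certain number of samples from within each stratum, and finally combine these samples.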
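import random

def stratifiedSampling(dataMat1, dataMat2, dataMat3, number):
    '''
    Stratified sampling
    :param dataMat1: dataset 1
    :param dataMat2: dataset 2
    :param dataMat3: dataset 3
    :param number: number of samples to draw
    :return: sample, the sampled data
    '''
    # equal share per stratum; if number is not divisible by 3,
    # fewer than number items are returned in total
    subNumber = number // 3
    sample = []
    # the result is a list of three sub-lists, one per stratum
    sample.append(noRepetRandomSampling(dataMat1, subNumber))
    sample.append(noRepetRandomSampling(dataMat2, subNumber))
    sample.append(noRepetRandomSampling(dataMat3, subNumber))
    return sample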
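The function above gives every stratum an equal share. When the strata differ in size, a common variant allocates the sample in proportion to each stratum's size. The sketch below illustrates that idea under stated assumptions: the name `proportional_stratified_sample`, the dictionary input, the `seed` parameter, and the largest-remainder rule used to hand out leftover slots are all choices made for this example, not part of the original code.

```python
import random

def proportional_stratified_sample(strata, number, seed=None):
    """Draw `number` items in total, split across strata in proportion to their size.

    strata: dict mapping a stratum label to a list of items.
    Returns one flat list of sampled items.
    """
    total = sum(len(items) for items in strata.values())
    if number < 0 or number > total:
        raise ValueError("number must be between 0 and the total item count")
    rng = random.Random(seed)

    # Integer share per stratum, plus the remainder left over by the division.
    counts, remainders = {}, {}
    for label, items in strata.items():
        counts[label], remainders[label] = divmod(number * len(items), total)

    # Hand the unassigned slots to the strata with the largest remainders.
    leftover = number - sum(counts.values())
    for label in sorted(strata, key=lambda name: remainders[name], reverse=True)[:leftover]:
        counts[label] += 1

    sample = []
    for label, items in strata.items():
        sample.extend(rng.sample(items, counts[label]))  # without replacement
    return sample

strata = {
    "a": list(range(100, 110)),  # 10 items
    "b": list(range(200, 220)),  # 20 items
    "c": list(range(300, 330)),  # 30 items
}
picked = proportional_stratified_sample(strata, 6, seed=1)
print(picked)
```

With stratum sizes of 10, 20, and 30 and a sample size of 6, the allocation works out to 1, 2, and 3 items respectively.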
Test code:
dataMat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
dataMat1 = [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
dataMat2 = [201, 202, 203, 204, 205, 206, 207, 208, 209, 210]
dataMat3 = [301, 302, 303, 304, 305, 306, 307, 308, 309, 310]

print(repetRandomSampling(dataMat, 6))
print(noRepetRandomSampling(dataMat, 6))
print(systemSampling(dataMat, 6))
print(stratifiedSampling(dataMat1, dataMat2, dataMat3, 6))
Output:
E:\Anaconda3\python.exe E:/數據采樣.py
[8, 1, 8, 13, 19, 3]
[14, 8, 5, 1, 17, 16]
[1, 4, 7, 10, 13, 16]
[[108, 105], [201, 208], [301, 308]]
The content above is excerpted from 《機器學習實踐應用》.