This experiment is based on the Microsoft Malware Classification Challenge, a competition hosted by Microsoft on Kaggle.
Dataset
The dataset Microsoft provided exceeds 500 GB uncompressed and covers 9 classes of malware, as shown in the figure below. This experiment follows the approach of the competition's winning team. The data consists of a training set, a test set, and labels for the training set. Each malware sample (with the PE header removed) comes as two files: a .bytes file containing the hexadecimal representation of the binary, and an .asm file produced by the IDA disassembler. After downloading and extracting, the files look like this:
The content of each .asm file looks roughly like this:
Random Sampling
Due to limits on machine performance and storage space, this experiment restricts the data to a training subset of roughly 1/10 of the original. Samples are drawn at random from each of the 9 classes (the code below draws 150 random indices per class, with replacement; each sample has 2 files), so there is no need for pypy or xgboost; numpy, pandas, PIL, and scikit-learn are sufficient. The sampling code is as follows:
```python
import os
import shutil
from random import Random

import pandas as pd

rs = Random()
rs.seed(1)
trainlabels = pd.read_csv('D:\\chrom下載\\malware-classification\\subtrainLabels.csv')
fids = []
opd = pd.DataFrame()
for clabel in range(1, 10):
    # rows of the current class, reindexed from 0
    mids = trainlabels[trainlabels.Class == clabel]
    mids = mids.reset_index(drop=True)
    # draw 150 random row indices (with replacement)
    rchoice = [rs.randint(0, len(mids) - 1) for i in range(150)]
    rids = [mids.loc[i].Id for i in rchoice]
    fids.extend(rids)
    opd = pd.concat([opd, mids.loc[rchoice]])
opd = opd.reset_index(drop=True)
opd.to_csv('D:\\chrom下載\\malware-classification\\sub_subtrainLabels.csv', encoding='utf-8', index=False)
sbase = 'E:\\malware_class\\subtrain\\'
tbase = 'E:\\malware_class\\sub_subtrain\\'
for fid in fids:
    # copy both files (.asm and .bytes) of every sampled Id into the subset directory
    fnames = ['{0}.asm'.format(fid), '{0}.bytes'.format(fid)]
    for fname in fnames:
        shutil.copy(sbase + fname, tbase + fname)
```
Sampling yields 2966 files, corresponding to 1483 distinct malware samples and 1483 labels, shown below:

The left column is the Id; the right column is the Class of the sample with that Id.
Feature Extraction
n-gram Features
The first feature used in this experiment is the n-gram.
Suppose a sentence consists of m words. By the chain rule, the probability of the sentence is

$$p(w_1, w_2, \dots, w_m) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_m \mid w_1, \dots, w_{m-1})$$
To simplify the computation, each word is assumed to depend only on the n−1 words immediately before it, i.e. $p(w_i \mid w_1, \dots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \dots, w_{i-1})$, so the conditioning never has to reach back to the first word and the expression above shortens considerably. This n is the n in n-gram. In this experiment n = 3. The full feature-extraction code is shown below.
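As a toy illustration before the full pipeline (the opcode list below is made up), 3-grams are simply consecutive opcode triples counted with a `Counter`:

```python
from collections import Counter

# a made-up opcode sequence for illustration
ops = ["push", "mov", "sub", "mov", "push", "mov", "sub"]
n = 3
# slide a window of 3 opcodes over the sequence and count each triple
ngrams = Counter(tuple(ops[i:i+n]) for i in range(len(ops) - n + 1))
print(ngrams[("push", "mov", "sub")])  # this triple occurs twice -> 2
```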
```python
import re
from collections import Counter, defaultdict

import pandas as pd

def getOpcodeSequence(filename):
    """Extract the opcode sequence from a .text section of an .asm file."""
    opcode_seq = []
    # match: whitespace, one or more hex byte pairs, then a lowercase mnemonic
    p = re.compile(r'\s([a-fA-F0-9]{2}\s)+\s*([a-z]+)')
    with open(filename, mode="r", encoding='utf-8', errors='ignore') as f:
        for line in f:
            if line.startswith(".text"):
                m = re.findall(p, line)
                if m:
                    opc = m[0][1]
                    if opc != "align":
                        opcode_seq.append(opc)
    return opcode_seq

def getOpcodeNgram(ops, n=3):
    """Slice the opcode sequence into windows of n opcodes and count each window."""
    opngramlist = [tuple(ops[i:i+n]) for i in range(len(ops) - n)]
    return Counter(opngramlist)

basepath = "E:\\malware_class\\subtrain\\"
map3gram = defaultdict(Counter)
subtrain = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
count = 1
for sid in subtrain.Id:
    # compute the 3-gram counts of every file and store them in map3gram
    print("counting the 3-gram of the {0} file...".format(str(count)))
    count += 1
    filename = basepath + sid + ".asm"
    ops = getOpcodeSequence(filename)
    map3gram[sid] = getOpcodeNgram(ops)

# accumulate the global 3-gram counts; keep only the 3-grams occurring
# at least 500 times overall as the selected features
cc = Counter([])
for d in map3gram.values():
    cc += d
selectedfeatures = {}
for k, v in cc.items():
    if v >= 500:
        selectedfeatures[k] = v

# one row per file; each column is the count of one selected 3-gram
dataframelist = []
for fid, op3gram in map3gram.items():
    standard = {}
    standard["Id"] = fid
    for feature in selectedfeatures:
        standard[feature] = op3gram.get(feature, 0)
    dataframelist.append(standard)
df = pd.DataFrame(dataframelist)
df.to_csv("E:\\malware_class\\3gramfeature.csv", index=False)
```
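As a sanity check on the opcode regex, it can be applied to a single IDA-style disassembly line (the line below is made up for illustration: address, hex bytes, mnemonic, operand):

```python
import re

p = re.compile(r'\s([a-fA-F0-9]{2}\s)+\s*([a-z]+)')
# hypothetical .text line from an IDA listing
line = ".text:00401000 56 8D 44 24 08 50 push eax"
m = re.findall(p, line)
# the second capture group is the mnemonic
print(m[0][1])  # push
```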
The final result looks like this:

These are the n-gram features of the 1483 samples: a table of 1483 rows and 2730 columns. All that remains is to feed the features into a learning algorithm. This experiment uses a random forest classifier with 10-fold cross-validation.
```python
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
import pandas as pd

subtrainLabel = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
subtrainfeature = pd.read_csv("E:\\malware_class\\3gramfeature.csv")
subtrain = pd.merge(subtrainLabel, subtrainfeature, on='Id')
labels = subtrain.Class
subtrain.drop(["Class", "Id"], axis=1, inplace=True)
subtrain = subtrain.values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, subtrain, labels, cv=10)
print(clf_s)
```
The resulting accuracies:

```
array([0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
       0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946])
```
Without any hyperparameter tuning, the best fold reaches 0.98648649 and the worst 0.93918919.
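As a quick summary of the run above (the fold scores are copied verbatim; nothing new is trained), the mean accuracy works out to about 0.956:

```python
import numpy as np

# 10-fold scores reported above for the 3-gram features
scores = np.array([0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
                   0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946])
print(round(scores.mean(), 4))  # 0.9562
```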
img Features
Take the leading bytes of each .asm file (the function below defaults to 5000 bytes; the run here reads 1500) and convert them into an image-style pixel vector that serves as the sample's feature. The extraction code is as follows:
```python
import binascii
from collections import defaultdict

import numpy
import pandas as pd

def getMatrixfrom_asm(filename, startindex=0, pixnum=5000):
    """Read pixnum bytes starting at startindex and return them as a uint8 pixel vector."""
    with open(filename, 'rb') as f:
        f.seek(startindex, 0)
        content = f.read(pixnum)
    # hex-encode the bytes, then parse each two-character pair back to 0..255
    hexst = binascii.hexlify(content)
    fh = numpy.array([int(hexst[i:i+2], 16) for i in range(0, len(hexst), 2)])
    return numpy.uint8(fh)

basepath = "E:\\malware_class\\subtrain\\"
mapimg = defaultdict(list)
subtrain = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
i = 0
for sid in subtrain.Id:
    i += 1
    print("dealing with {0}th file...".format(str(i)))
    filename = basepath + sid + ".asm"
    mapimg[sid] = getMatrixfrom_asm(filename, startindex=0, pixnum=1500)

# one row per file, one column per pixel
dataframelist = []
for sid, imf in mapimg.items():
    standard = {}
    standard["Id"] = sid
    for index, value in enumerate(imf):
        standard["pix{0}".format(str(index))] = value
    dataframelist.append(standard)
df = pd.DataFrame(dataframelist)
df.to_csv("E:\\malware_class\\imgfeature.csv", index=False)
```
The extracted feature table is shown below:

The random-forest training code is the same as above. The results:

```
array([0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,
       0.93918919, 0.93243243, 0.9527027 , 0.97297297, 0.96621622])
```
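These pixel vectors can also be rendered as grayscale images with PIL (listed among the dependencies earlier but not used in the code above). This is a minimal sketch, assuming a 1500-pixel vector reshaped to 30 rows of 50 pixels; the vector here is random stand-in data, not a real sample:

```python
import numpy as np
from PIL import Image

# hypothetical 1500-byte pixel vector standing in for one sample's feature
fh = np.random.randint(0, 256, size=1500, dtype=np.uint8)

# reshape the flat vector into a 30x50 grayscale image and save it
im = Image.fromarray(fh.reshape(30, 50))
im.save('sample.png')
```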
Combining the Features
The n-gram features and the img features are concatenated and fed to the algorithm together:
```python
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
import pandas as pd

subtrainLabel = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
subtrainfeature1 = pd.read_csv("E:\\malware_class\\3gramfeature.csv")
subtrainfeature2 = pd.read_csv("E:\\malware_class\\imgfeature.csv")
# join the two feature tables on Id, then attach the labels
# (the Class column lives in the label file, not in the feature files)
subtrain = pd.merge(subtrainfeature1, subtrainfeature2, on='Id')
subtrain = pd.merge(subtrain, subtrainLabel, on='Id')
labels = subtrain.Class
subtrain.drop(["Class", "Id"], axis=1, inplace=True)
subtrain = subtrain.values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, subtrain, labels, cv=10)
print(clf_s)
```
The final results:

```
array([0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,
       0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324])
```
Summary
The accuracies of the three runs are summarized below:
n-gram

```
array([0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
       0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946])
```

img

```
array([0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,
       0.93918919, 0.93243243, 0.9527027 , 0.97297297, 0.96621622])
```

combined

```
array([0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,
       0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324])
```
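To compare the three feature sets at a glance, the fold scores above can simply be averaged (scores copied from the runs; nothing new is trained here):

```python
import numpy as np

runs = {
    "n-gram":   [0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
                 0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946],
    "img":      [0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,
                 0.93918919, 0.93243243, 0.9527027, 0.97297297, 0.96621622],
    "combined": [0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,
                 0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324],
}
for name, scores in runs.items():
    print("{0}: mean={1:.4f}".format(name, np.mean(scores)))
# the combined features average ~0.988, clearly above either feature set alone
```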
The plot is shown below:

