Microsoft Malware Classification


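This experiment is based on the malware classification competition that Microsoft hosted on Kaggle.

Dataset

The dataset provided by Microsoft is larger than 500 GB after decompression and covers 9 classes of malware, as shown in the figure below. The experiment follows the approach used by the winning team of the competition. The data consists of a training set, a test set, and the labels for the training set. Each malware sample (with the PE header removed) comes as two files: a .bytes file holding the hexadecimal representation, and an .asm file generated with the IDA disassembler. After downloading and extracting, the files look like this: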

[Figure: extracted dataset files]

Each .asm file looks roughly like this:

[Figure: excerpt of an .asm file]

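Random sampling

Because of limits on machine performance and storage, this experiment restricts the size of the dataset and uses a training subset of roughly 1/10 of the original data. From each class, 100 samples were drawn at random (9 classes, 2 files per sample, 1800 files in total). This also removes the need for pypy or xgboost; numpy, pandas, PIL and scikit-learn are enough. The random sampling code follows: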

import shutil
from random import Random

import pandas as pd

rs = Random()
rs.seed(1)  # fixed seed so the same sample is drawn on every run
trainlabels = pd.read_csv('D:\\chrom下載\\malware-classification\\subtrainLabels.csv')

fids = []    # Ids of the sampled files
frames = []  # sampled label rows, one DataFrame per class

for clabel in range(1, 10):
    mids = trainlabels[trainlabels.Class == clabel]
    mids = mids.reset_index(drop=True)

    # randint draws with replacement, so an index can appear more than once
    rchoice = [rs.randint(0, len(mids) - 1) for i in range(150)]
    print(rchoice)

    rids = [mids.loc[i].Id for i in rchoice]
    fids.extend(rids)
    frames.append(mids.loc[rchoice])

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
opd = pd.concat(frames).reset_index(drop=True)
print(opd)
opd.to_csv('D:\\chrom下載\\malware-classification\\sub_subtrainLabels.csv', encoding='utf-8', index=False)

sbase = 'E:\\malware_class\\subtrain\\'
tbase = 'E:\\malware_class\\sub_subtrain\\'

for fid in fids:
    fnames = ['{0}.asm'.format(fid),'{0}.bytes'.format(fid)]
    for fname in fnames:
        cspath = sbase + fname
        ctpath = tbase + fname
        print(cspath)
        shutil.copy(cspath,ctpath)
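
The sampling code above draws indices with `randint`, which samples with replacement, so one file can be picked several times. As a point of comparison, here is a minimal sketch of per-class sampling without replacement using only the standard library. The function name `sample_per_class`, the `(Id, Class)` pair layout and the demo Ids are assumptions made for this illustration and are not part of the original experiment.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, seed=1):
    """Pick up to per_class samples from every class, without replacement.

    labels is a list of (sample_id, class_label) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, class_label in labels:
        by_class[class_label].append(sample_id)

    picked = []
    for class_label in sorted(by_class):
        ids = by_class[class_label]
        # A class with fewer than per_class samples is taken in full
        k = min(per_class, len(ids))
        picked.extend((sample_id, class_label) for sample_id in rng.sample(ids, k))
    return picked


# Made-up labels: class 1 has five samples, class 2 has only two
demo_labels = [("a%d" % i, 1) for i in range(5)] + [("b0", 2), ("b1", 2)]
subset = sample_per_class(demo_labels, per_class=3)
print(subset)
```

Capping the draw at the class size keeps small classes from raising an error, at the cost of leaving those classes under-represented.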

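After sampling there are 2966 files, which correspond to 1483 distinct malware samples and 1483 labels. The labels look like this:

[Figure: label table]

The left column is the Id, and the right column is the Class of the sample with that Id.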

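Feature extraction

n-gram features

The first feature used in this experiment is the n-gram.

Suppose a sentence consists of m words $w_1, w_2, \dots, w_m$. The probability of this sentence is $P(w_1, w_2, \dots, w_m)$, and by the chain rule $P(w_1, w_2, \dots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \dots, w_{m-1})$.

To simplify the computation, assume that the current word depends only on the n-1 words before it. There is then no need to trace back to the very first word, which greatly shortens the expression above. This n is the n in n-gram. In this experiment n = 3, and the "words" are the opcodes of the disassembled program. The feature extraction code follows: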

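import re
from collections import Counter, defaultdict
import os
import pandas as pd

def getOpcodeSequence(filename):
    """Return the opcode sequence found in the .text lines of an .asm file."""
    opcode_seq = []
    # one or more hex byte pairs, followed by the opcode mnemonic
    p = re.compile(r'\s([a-fA-F0-9]{2}\s)+\s*([a-z]+)')
    with open(filename,mode="r",encoding='utf-8',errors='ignore') as f:
        for line in f:
            if line.startswith(".text"):
                m = re.findall(p,line)
                if m:
                    opc = m[0][1]
                    if opc != "align":      # skip the align directive
                        opcode_seq.append(opc)
    return opcode_seq

def getOpcodeNgram(ops, n=3):
    """Slide a window of n opcodes over the sequence and count each n-gram."""
    # +1 so that the last window of the sequence is counted as well
    opngramlist = [tuple(ops[i:i+n]) for i in range(len(ops)-n+1)]
    opngram = Counter(opngramlist)
    return opngram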

basepath = "E:\\malware_class\\subtrain\\"
map3gram = defaultdict(Counter)
subtrain = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
count = 1

# Compute the 3-gram counts of every file and store them in map3gram
for sid in subtrain.Id:
    print ("counting the 3-gram of the {0} file...".format(str(count)))
    count += 1
    filename = basepath + sid + ".asm"
    ops = getOpcodeSequence(filename)
    op3gram = getOpcodeNgram(ops)
    map3gram[sid] = op3gram

# Sum the counts over all files and keep only the 3-grams that occur
# at least 500 times in total; these become the feature columns
cc = Counter([])
for d in map3gram.values():
    print(d)
    cc += d
selectedfeatures = {}
tc = 0
for k,v in cc.items():
    if v >= 500:
        selectedfeatures[k] = v
        print (k,v)
        tc += 1

# One row per file, one column per selected 3-gram; each value is the
# number of times that 3-gram occurs in the file
dataframelist = []
for fid,op3gram in map3gram.items():
    standard = {}
    standard["Id"] = fid
    for feature in selectedfeatures:
        if feature in op3gram:
            standard[feature] = op3gram[feature]
        else:
            standard[feature] = 0
    dataframelist.append(standard)

df = pd.DataFrame(dataframelist)
df.to_csv("E:\\malware_class\\3gramfeature.csv",index=False)
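
To make the counting step concrete, here is a small self-contained sketch that applies the same regular expression to a few listing lines and counts the resulting opcode 3-grams. The lines in `demo_lines` are hypothetical and were written only for this illustration; they are not taken from the dataset. The sketch uses `len(ops) - n + 1` windows so that the final window is included.

```python
import re
from collections import Counter

# Same pattern as in the extraction code: hex byte pairs, then a mnemonic
OPCODE_RE = re.compile(r'\s([a-fA-F0-9]{2}\s)+\s*([a-z]+)')

# Hypothetical listing lines, made up for this example
demo_lines = [
    ".text:00401000 56                push    esi",
    ".text:00401001 8D 44 24 08       lea     eax, [esp+8]",
    ".text:00401005 50                push    eax",
    ".text:00401006 CC CC             align 10h",
    ".text:00401008 8B F1             mov     esi, ecx",
    ".data:00402000 00 00             dw 0",
]

opcodes = []
for line in demo_lines:
    if not line.startswith(".text"):
        continue                      # only code lines are considered
    matches = OPCODE_RE.findall(line)
    if matches and matches[0][1] != "align":
        opcodes.append(matches[0][1])

n = 3
counts = Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

print(opcodes)   # ['push', 'lea', 'push', 'mov']
print(counts)
```

Four opcodes give two overlapping 3-grams, `('push', 'lea', 'push')` and `('lea', 'push', 'mov')`, each counted once.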

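The resulting feature table is shown below:

[Figure: n-gram feature table]

These are the n-gram features of the 1483 samples; the feature table has 1483 rows and 2730 columns. The features can now be fed to the learning algorithm. This experiment uses a random forest with 10-fold cross-validation: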

from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import pandas as pd

subtrainLabel = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
subtrainfeature = pd.read_csv("E:\\malware_class\\3gramfeature.csv")

subtrain = pd.merge(subtrainLabel,subtrainfeature,on='Id')
labels = subtrain.Class
subtrain.drop(["Class","Id"], axis=1, inplace=True)
subtrain = subtrain.iloc[:,:].values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, subtrain, labels, cv=10)
print(clf_s)

The resulting accuracies are:

array([0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
       0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946])

Without any hyperparameter tuning, the fold accuracies range from a low of 0.93918919 to a high of 0.98648649.
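
The ten fold scores can be turned back into a single overall accuracy. The sketch below assumes that the 1483 samples were split into folds that are as even as possible (three folds of 149 and seven of 148, larger folds first). That split is an assumption, although it is consistent with the reported scores, for example 142/149 = 0.95302013. The helper name `fold_sizes` is hypothetical.

```python
def fold_sizes(n_samples, n_splits):
    """Split n_samples into n_splits folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1] * extra + [base] * (n_splits - extra)


# Fold accuracies reported above for the n-gram features
scores = [0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
          0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946]

sizes = fold_sizes(1483, 10)                       # assumed fold sizes
correct = [round(s * n) for s, n in zip(scores, sizes)]
overall = sum(correct) / sum(sizes)

print(sizes)
print(correct)
print(round(overall, 4))
```

Under this assumption the overall accuracy is the total number of correctly classified samples divided by 1483, which weights each fold by its size instead of averaging the ten scores directly.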

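Image features

Read the leading bytes of each .asm file and convert them into a uint8 pixel array in Python, which serves as the feature vector of the sample. The function defaults to the first 5000 bytes, while the call in the code below passes pixnum = 1500. The feature extraction code follows: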

import os
import numpy
from collections import *
import pandas as pd
import binascii

def getMatrixfrom_asm(filename, startindex = 0, pixnum = 5000):
    # Read pixnum raw bytes starting at byte offset startindex
    with open(filename, 'rb') as f:
        f.seek(startindex, 0)
        content = f.read(pixnum)
    # Each byte becomes one pixel value in the range 0..255
    hexst = binascii.hexlify(content)
    fh = numpy.array([int(hexst[i:i+2],16) for i in range(0, len(hexst), 2)])
    fh = numpy.uint8(fh)
    return fh

basepath = "E:\\malware_class\\subtrain\\"
mapimg = defaultdict(list)
subtrain = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
i = 0
for sid in subtrain.Id:
    i += 1
    print ("dealing with {0}th file...".format(str(i)))
    filename = basepath + sid + ".asm"
    im = getMatrixfrom_asm(filename, startindex = 0, pixnum = 1500)
    mapimg[sid] = im
    
dataframelist = []
for sid,imf in mapimg.items():
    standard = {}
    standard["Id"] = sid
    for index,value in enumerate(imf):
        colName = "pix{0}".format(str(index))
        standard[colName] = value
    dataframelist.append(standard)

df = pd.DataFrame(dataframelist)
df.to_csv("E:\\malware_class\\imgfeature.csv",index=False)
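
The code above stores the pixels as one flat row per file (pix0, pix1, ...). A file shorter than pixnum bytes yields fewer pixel columns, which leaves missing values in the resulting table. The sketch below shows one possible way to keep every sample the same length by zero-padding, and how the flat vector could be viewed as a 2-D grayscale matrix. The zero padding, the function name `bytes_to_matrix` and the row width are assumptions made for illustration; they are not part of the original experiment.

```python
def bytes_to_matrix(data, n_pixels, width):
    """Turn the first n_pixels bytes of data into rows of width pixel values."""
    pixels = list(data[:n_pixels])               # iterating bytes yields ints 0..255
    pixels += [0] * (n_pixels - len(pixels))     # zero-pad inputs that are too short
    return [pixels[i:i + width] for i in range(0, n_pixels, width)]


# Ten demo bytes padded to twelve pixels and laid out four per row
demo = bytes(range(10))
matrix = bytes_to_matrix(demo, n_pixels=12, width=4)
for row in matrix:
    print(row)
```

For a random forest the 2-D layout makes no difference, since each pixel is just one feature column; the reshape only matters if the matrix is to be viewed or saved as an image.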

The extracted feature table is shown below:

[Figure: image feature table]

The random forest training code is the same as above. The final results are:

array([0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,
       0.93918919, 0.93243243, 0.9527027 , 0.97297297, 0.96621622])

Combining the features

The n-gram features and the image features are merged and fed to the algorithm together. The code follows:

from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np

subtrainLabel = pd.read_csv('E:\\malware_class\\subtrain_label.csv')
subtrainfeature1 = pd.read_csv("E:\\malware_class\\3gramfeature.csv")
subtrainfeature2 = pd.read_csv("E:\\malware_class\\imgfeature.csv")
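subtrain = pd.merge(subtrainfeature1,subtrainfeature2,on='Id')
# Attach the Class column; the two feature tables do not contain it
subtrain = pd.merge(subtrainLabel,subtrain,on='Id')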

labels = subtrain.Class
subtrain.drop(["Class","Id"], axis=1, inplace=True)
subtrain = subtrain.iloc[:,:].values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, subtrain, labels, cv=10)
print(clf_s)

The final results are:

array([0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324])

Summary

The accuracies of the three experiments are summarized below:

n-gram
array([0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946])
img
array([0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,0.93918919, 0.93243243, 0.9527027 , 0.97297297, 0.96621622])
Combined
array([0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324])
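
The three score arrays are easier to compare through their mean and spread. This is a minimal sketch using only the standard library; the dictionary keys are labels chosen for this illustration.

```python
from statistics import mean, stdev

# Fold accuracies copied from the three experiments above
results = {
    "n-gram": [0.95302013, 0.95302013, 0.94630872, 0.94594595, 0.98648649,
               0.95945946, 0.97297297, 0.93918919, 0.94594595, 0.95945946],
    "img": [0.95302013, 0.94630872, 0.96644295, 0.95945946, 0.98648649,
            0.93918919, 0.93243243, 0.9527027, 0.97297297, 0.96621622],
    "combined": [0.99328859, 0.97986577, 0.98657718, 0.97297297, 0.99324324,
                 0.98648649, 0.99324324, 0.98648649, 0.99324324, 0.99324324],
}

# Mean and sample standard deviation of the ten folds per experiment
summary = {name: (mean(s), stdev(s)) for name, s in results.items()}
for name, (m, sd) in summary.items():
    print(f"{name:9s} mean={m:.4f} std={sd:.4f}")
```

The mean fold accuracy is about 0.956 for the n-gram features, 0.958 for the image features and 0.988 for the combined features.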

Plotted, the results look like this:

[Figure: accuracy of the three experiments across the ten folds]

