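The Baidu Netdisk download link for the Drebin samples is in the comments below; feel free to download them. This experiment was inspired by the previous one (Microsoft malware classification) and uses the implementation from this blog post (detecting Android malicious code with machine learning). All of the code can be found at the GitHub address provided by that blogger.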
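Principle
The principle is the same as in the previous experiment: extract the opcodes from the decompiled files and build n-grams with n = 3. See this article for the details. Unlike the previous experiment, this one targets Android apps, so the opcodes themselves are different. In addition, the benign apps in this dataset are much larger than the malicious ones, so a larger app would accumulate higher n-gram counts simply because of its size. For that reason, the n-gram features record whether a sequence appears rather than how often it appears.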
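Dataset
The malware samples come from Drebin; only the first part is used, 1000 malware samples in total. The benign apps come from this website: it provided a little over 1100 benign apps, of which 1000 are used. The benign set is 12.3 GB and the malware set is 1.2 GB, so the benign apps are clearly much larger than the malware.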
Decompilation
Decompile the benign and the malicious apps into \smalis\kind and \smalis\malware respectively. The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 6 14:00:51 2018
@author: 燃燒杯
"""
import os
import subprocess


def disassemble(frompath, topath, num, start=0):
    # Decompile the files at positions [start, num) of the directory listing
    files = os.listdir(frompath)
    files = files[start:num]
    total = len(files)
    for i, file in enumerate(files):
        fullFrompath = os.path.join(frompath, file)
        fullTopath = os.path.join(topath, file)
        # Quote both paths so that file names containing spaces do not break the command
        command = 'apktool d "' + fullFrompath + '" -o "' + fullTopath + '"'
        subprocess.call(command, shell=True)
        print("Disassembled", i + 1, "apps, progress:")
        print((i + 1) * 100 / total, "%")


# Disassemble the malware samples
virus_root = "..\\bit\\virus\\VirusAndroid"
disassemble(virus_root, ".\\smalis\\malware", 600)
# Disassemble the benign samples
kind_root = "..\\bit\\virus\\normalApk"
disassemble(kind_root, ".\\smalis\\kind", 600)
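When it finishes, each app gets a folder named after its file, containing the decompiled files, as shown in the figure below:
The smali folder holds the files from which we extract the opcodes. A smali file looks roughly like this:

The opcodes we want are inside the .method blocks. They fall roughly into the categories shown in the figure below:
Each category is mapped to an uppercase letter to simplify the feature code; for example, move is written as M.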
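To make the mapping concrete, here is a minimal sketch of how the opcodes inside .method blocks could be turned into letter sequences. It is not the code from the referenced repository. Only move -> M comes from the text above; the other six letters (R, G, I, T, P, V), the helper names opcode_letter and method_sequences, and the SAMPLE smali text are assumptions made for illustration.

```python
def opcode_letter(opcode):
    """Map one smali opcode to a category letter, or '' if it is not tracked.

    Only "move" -> "M" is taken from the article; the remaining letters are
    illustrative assumptions.
    """
    if opcode.startswith("move"):
        return "M"
    if opcode.startswith("return"):
        return "R"
    if opcode.startswith("goto"):
        return "G"
    if opcode.startswith("if-"):
        return "I"
    if opcode.startswith("invoke"):
        return "V"
    # iget/aget/sget and iput/aput/sput, with or without a type suffix
    if opcode[:1] in ("i", "a", "s") and opcode[1:4] == "get":
        return "T"
    if opcode[:1] in ("i", "a", "s") and opcode[1:4] == "put":
        return "P"
    return ""


def method_sequences(smali_text):
    """Return one letter sequence per .method block found in smali_text."""
    sequences = []
    current = None  # letters of the method being read, or None outside a method
    for raw in smali_text.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            current = []
        elif line.startswith(".end method"):
            if current is not None:
                sequences.append("".join(current))
            current = None
        elif current is not None and line and not line.startswith((".", "#", ":")):
            # Skip directives, comments and labels; the first token is the opcode
            current.append(opcode_letter(line.split()[0]))
    return sequences


# Hypothetical smali fragment, used only to exercise the functions above
SAMPLE = """
.class public Lcom/example/Demo;
.method public foo()I
    .locals 2
    const/4 v0, 0x1
    move v1, v0
    if-eqz v1, :cond_0
    iget v0, p0, Lcom/example/Demo;->x:I
    :cond_0
    return v0
.end method
.method public bar()V
    invoke-static {}, Lcom/example/Demo;->baz()V
    return-void
.end method
"""

feature = "|".join(method_sequences(SAMPLE))
print(feature)  # MITR|VR
```

Joining the per-method sequences with "|" gives one string per app, so that n-grams are later built within a method and never across method boundaries.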
Opcode extraction
The code corresponds to the bytecode_extract.py file in the GitHub repository mentioned above.
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 6 22:41:06 2018
@author: 燃燒杯
"""
import os

from infrastructure.ware import Ware
from infrastructure.fileutils import DataFile

virusroot = "./smalis/malware"
kindroot = "./smalis/kind"
f = DataFile("./data.csv")


def collect(rootdir, isMalware):
    # Extract the opcode feature of every decompiled app under rootdir
    wares = os.listdir(rootdir)
    total = len(wares)
    for i, name in enumerate(wares):
        warePath = os.path.join(rootdir, name)
        ware = Ware(warePath, isMalware)
        ware.extractFeature(f)
        print("Extracted features from", i + 1, "apps, progress:")
        print((i + 1) * 100 / total, "%")


# 1 marks malware, 0 marks benign software
collect(virusroot, 1)
collect(kindroot, 0)
f.close()
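After extraction, the result looks like the figure below:
The Feature column holds the feature extracted for each file. The opcode sequences of the individual methods are separated by "|".
n-gram features
Extract the n-gram features from the Feature column above. Each value indicates whether that opcode sequence appears. The code is as follows: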
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 9 13:26:50 2018
@author: 燃燒杯
Set-of-words model
"""
import sys

import pandas as pd

from infrastructure.mydict import MyDict

# The n of the n-gram
n = int(sys.argv[1])
print("n = ", n)

origin = pd.read_csv("data.csv")
#origin = pd.read_csv("test.csv")
mdict = MyDict()
feature = origin["Feature"].str.split("|")
total = len(feature)
for i, code in enumerate(feature):
    mdict.newLayer()
    # A row with an empty Feature is NaN rather than a list; skip it
    if not isinstance(code, list):
        continue
    for method in code:
        length = len(method)
        if length < n:
            continue
        # Slide a window of width n over the method's opcode sequence
        for start in range(length - (n - 1)):
            end = start + n
            mdict.mark(method[start:end])
    print("Finished", i + 1, "apps, progress:")
    print((i + 1) * 100 / total, "%")

result = mdict.dict
pd.DataFrame(result, index=origin.index)\
    .to_csv("./" + str(n) + "_gram.csv", index=False)
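The script above depends on MyDict, whose implementation is not shown here. As a rough, self-contained illustration of the same set-of-words idea, the sketch below builds a 0/1 presence matrix with plain Python. The function name ngram_presence and the sample feature strings are hypothetical, and this is not necessarily how MyDict works internally.

```python
def ngram_presence(features, n):
    """Build a 0/1 matrix: one row per app, one column per observed n-gram.

    features: list of strings such as "MITR|VR" (methods separated by "|");
              entries that are not strings (e.g. missing values) give all-zero rows.
    Returns (vocab, matrix), where vocab is the sorted list of n-grams.
    """
    rows = []
    for feature in features:
        present = set()
        if isinstance(feature, str):
            for method in feature.split("|"):
                # Methods shorter than n yield an empty range and add nothing
                for start in range(len(method) - n + 1):
                    present.add(method[start:start + n])
        rows.append(present)
    vocab = sorted(set().union(*rows))
    matrix = [[1 if gram in row else 0 for gram in vocab] for row in rows]
    return vocab, matrix


# Hypothetical feature strings; the third app has no extracted feature
vocab, matrix = ngram_presence(["MITR|VR", "MIMIM", None], 3)
print(vocab)   # ['IMI', 'ITR', 'MIM', 'MIT']
print(matrix)  # [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
```

In the second app, "MIM" occurs twice but is still recorded as 1. That is the difference between this presence feature and a frequency count.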
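The result is shown in the figure:
This produces a 2000 × 343 feature table. There are 343 feature sequences because there are 7 opcode categories and we use 3-grams, which gives 7 × 7 × 7 = 343 possible sequences.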
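The count is easy to check by enumerating every length-3 sequence over a 7-letter alphabet. The letters in ALPHABET are placeholders: only M is named in the text, and the other six are assumptions.

```python
from itertools import product

# Placeholder alphabet: 7 category letters (only "M" is confirmed by the article)
ALPHABET = "MRGITPV"

all_trigrams = ["".join(letters) for letters in product(ALPHABET, repeat=3)]
print(len(all_trigrams))  # 343
```

A table with exactly 343 columns therefore means that every possible 3-gram was observed in at least one app.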
Machine learning
Next comes training. This experiment uses the random forest algorithm with 10-fold cross-validation. The code is as follows:
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import pandas as pd

train_feture = pd.read_csv('D:\\android\\dataset\\smalis\\3_gram.csv')
data = pd.read_csv('D:\\android\\dataset\\smalis\\data.csv')
labels = data["isMalware"]
train_feture = train_feture.iloc[:, :].values
# 500 trees, using all available CPU cores
srf = RF(n_estimators=500, n_jobs=-1)
# Accuracy on each of the 10 folds
clf_s = cross_val_score(srf, train_feture, labels, cv=10)
clf_s  # shown as the notebook cell output
The results are as follows:
array([0.965 , 0.995 , 0.995 , 0.96 , 0.89 ,
0.945 , 0.965 , 0.95 , 0.97487437, 0.97487437])
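Deep learning
While we are at it, let's also try deep learning for the classification and see how it does. The deep learning library is Keras. Here is the code: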
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

test_split = 0.2  # fraction of the samples held out as the test set
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
fetrues = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
labels = data["isMalware"]
p1 = int(len(labels) * (1 - test_split))  # index where the test set starts

# Shuffle features and labels once, with the same permutation, so they stay aligned
index = np.random.permutation(len(fetrues))
train_data = fetrues.iloc[index].values
labels = labels.iloc[index].values

# Split into training and test sets
x_train, x_test = train_data[:p1], train_data[p1:]
y_train, y_test = labels[:p1], labels[p1:]

model = keras.Sequential()
model.add(layers.Dense(50, input_dim=343, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['acc']
)
history = model.fit(x_train, y_train, epochs=60, batch_size=256, validation_data=(x_test, y_test))
The test results are as follows:
The accuracy over the last 10 epochs:
0.9812, 0.9819, 0.9775, 0.9781, 0.9718, 0.9812, 0.9793, 0.9618, 0.9825, 0.9756
I also ran 10-fold cross-validation. The code is as follows:
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from sklearn.model_selection import StratifiedKFold

seed = 7
np.random.seed(seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
# Use plain NumPy arrays so the fold indices are applied by position
labels = data["isMalware"].values
train_data = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
train_data = train_data.iloc[:, :].values

for train, test in kfold.split(train_data, labels):
    # Build a fresh model for every fold so no weights carry over
    model = keras.Sequential()
    model.add(layers.Dense(50, input_dim=343, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['acc']
    )
    model.fit(train_data[train], labels[train], epochs=60, batch_size=256, verbose=0)
    # scores[0] is the loss, scores[1] is the accuracy on the held-out fold
    scores = model.evaluate(train_data[test], labels[test], verbose=0)
    print(scores[1])
    cvscores.append(scores[1])

print(cvscores)
The accuracy on each fold is as follows:
[0.95, 0.985, 0.95, 0.945, 0.975, 0.95, 0.955, 0.96, 0.9798995, 0.9748744]
Comparison chart against the random forest:
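The two sets of fold accuracies reported above can also be compared numerically. The short sketch below uses only the standard library and the numbers already listed in this post. The variable names rf_scores and nn_scores are introduced here for illustration.

```python
from statistics import mean, stdev

# 10-fold accuracies copied from the results reported above
rf_scores = [0.965, 0.995, 0.995, 0.96, 0.89,
             0.945, 0.965, 0.95, 0.97487437, 0.97487437]
nn_scores = [0.95, 0.985, 0.95, 0.945, 0.975,
             0.95, 0.955, 0.96, 0.9798995, 0.9748744]

for name, scores in (("random forest", rf_scores), ("neural network", nn_scores)):
    print(f"{name}: mean={mean(scores):.4f} "
          f"std={stdev(scores):.4f} min={min(scores):.3f}")
```

The two means are nearly identical (about 0.961 for the random forest and about 0.962 for the network). The most visible difference is the random forest's weakest fold at 0.89, against 0.945 for the network. Note that the two runs used different fold splits (cv=10 without shuffling versus a shuffled StratifiedKFold), so the folds are not directly paired.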