Android Malware Classification


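I have posted a Baidu Netdisk download link for the Drebin samples in the comments section below; feel free to download them. This experiment was inspired by the previous one (Microsoft malware classification) and reuses the implementation from the blog post "Detecting Android Malicious Code with Machine Learning". All of the code can be found at the GitHub address provided by that post's author.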

Principle

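The approach follows the previous experiment: extract the opcodes from the decompiled files and build n-grams with n = 3. See this article for the details. Unlike the previous experiment, this one targets Android apps, so the opcodes themselves are different. In addition, the benign apps in this dataset are far larger than the malicious ones, so an n-gram's frequency would partly reflect app size. Each n-gram feature therefore records whether the sequence occurs rather than how often it occurs.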
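To make the difference between the two feature types concrete, here is a minimal sketch. The letter strings and the function names are illustrative assumptions, not taken from the borrowed code; only the idea of 3-grams over opcode letters comes from the text above.

```python
from collections import Counter


def ngrams(seq, n=3):
    """Return every contiguous length-n slice of an opcode-letter string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def count_features(seq, n=3):
    """Frequency model: how many times each n-gram occurs."""
    return Counter(ngrams(seq, n))


def presence_features(seq, n=3):
    """Set model: 1 if the n-gram occurs at all, however often."""
    return {gram: 1 for gram in ngrams(seq, n)}


# Two made-up apps with the same opcode patterns but very different sizes.
small_app = "MRGIMR"
large_app = "MRGI" * 50

print(count_features(small_app)["MRG"], count_features(large_app)["MRG"])
print(presence_features(small_app) == presence_features(large_app))
```

The counts differ by a large factor simply because one string is longer, while the presence features of the two strings are identical.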

Dataset

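The malware samples come from Drebin; only the first part is used, 1000 malicious apps in total. The benign apps come from this website, which offered a little over 1100 of them, of which 1000 are used. The benign set takes up 12.3 GB and the malware set 1.2 GB, so the benign apps are clearly much larger than the malicious ones.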

Decompilation

Decompile the benign and malicious apps into \smalis\kind and \smalis\malware respectively. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  6 14:00:51 2018
@author: 燃燒杯
"""

import os
import subprocess

def disassemble(frompath, topath, num, start=0):
    files = os.listdir(frompath)
    files = files[start:num]
        
    total = len(files)
    
    for i, file in enumerate(files):
        fullFrompath = os.path.join(frompath, file)
        fullTopath = os.path.join(topath, file)
        command = "apktool d " + fullFrompath + " -o " + fullTopath
        subprocess.call(command, shell=True)
        print("Decompiled", i + 1, "apps; progress:")
        print((i + 1) * 100 / total, "%")


# Decompile the malware samples
virus_root = "..\\bit\\virus\\VirusAndroid"
disassemble(virus_root, ".\\smalis\\malware", 600)


# Decompile the benign samples
kind_root = "..\\bit\\virus\\normalApk"
disassemble(kind_root, ".\\smalis\\kind", 600)
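The script above builds the apktool command by string concatenation, which breaks if a path contains spaces. A possible variant, sketched here under the assumption that an `apktool` launcher is on PATH, passes the arguments as a list instead. The helper names `apktool_command` and `decompile` are hypothetical; the `apktool d <apk> -o <dir>` form is the one used in the script above.

```python
import shutil
import subprocess
from pathlib import Path


def apktool_command(apk_path, out_dir, exe="apktool"):
    """Build the `apktool d <apk> -o <dir>` call as an argument list."""
    return [exe, "d", str(apk_path), "-o", str(out_dir)]


def decompile(apk_path, out_root):
    """Decompile one APK into out_root/<file name>; return apktool's exit code."""
    exe = shutil.which("apktool")  # also resolves a .bat wrapper on Windows
    if exe is None:
        raise FileNotFoundError("apktool not found on PATH")
    apk_path = Path(apk_path)
    out_dir = Path(out_root) / apk_path.name
    # No shell is involved, so spaces in the paths need no quoting.
    return subprocess.run(apktool_command(apk_path, out_dir, exe)).returncode
```

Checking the return code makes it possible to log samples that apktool could not decode instead of silently skipping them.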

When it finishes, each app has its own folder named after the file, containing the decompiled files, as shown below:

image-20201015172056064

The smali folder inside it contains the files from which we extract the opcodes. A smali file looks roughly like this:

image-20201015172738379

The opcodes we want to extract are inside the .method blocks. They fall roughly into the categories shown below:

p1

Each category of opcode is mapped to a single uppercase letter to simplify the feature code. For example, move is written as M.
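The mapping can be sketched roughly as follows. Only move → M is stated above; the other six categories and letters, the smali sample, and the function names are assumptions for illustration and may differ from the borrowed implementation.

```python
# Hypothetical category table: only "move" -> "M" comes from the text.
CATEGORY_LETTERS = {
    "move": "M", "return": "R", "goto": "G", "if": "I",
    "get": "T", "put": "P", "invoke": "V",
}


def opcode_letter(opcode):
    """Map one Dalvik opcode to its category letter, or None if uncategorised."""
    for key, letter in CATEGORY_LETTERS.items():
        # iget/sget/aget and iput/sput/aput carry a one-letter prefix.
        if opcode.startswith(key) or opcode[1:].startswith(key):
            return letter
    return None


def method_sequences(smali_text):
    """Return one letter string per .method block in a smali file."""
    sequences, current = [], None
    for line in smali_text.splitlines():
        line = line.strip()
        if line.startswith(".method"):
            current = []
        elif line.startswith(".end method"):
            if current:
                sequences.append("".join(current))
            current = None
        elif current is not None and line and not line.startswith((".", "#", ":")):
            letter = opcode_letter(line.split()[0])
            if letter:
                current.append(letter)
    return sequences


SAMPLE = """
.method public foo()V
    .locals 1
    invoke-virtual {p0}, Lcom/a;->b()I
    move-result v0
    if-eqz v0, :cond_0
    const/4 v0, 0x1
    :cond_0
    return-void
.end method
"""
print("|".join(method_sequences(SAMPLE)))
```

Opcodes outside the chosen categories (const/4 in the sample) are simply dropped, so each method collapses to a short string over a seven-letter alphabet.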

Opcode Extraction

This code corresponds to the bytecode_extract.py file in the GitHub repository mentioned above.

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  6 22:41:06 2018
@author: 燃燒杯
"""

from infrastructure.ware import Ware
from infrastructure.fileutils import DataFile

virusroot = "./smalis/malware"
kindroot = "./smalis/kind"

f = DataFile("./data.csv")

import os

def collect(rootdir, isMalware):
    wares = os.listdir(rootdir)
    total = len(wares)
    for i, ware in enumerate(wares):
        warePath = os.path.join(rootdir, ware)
        ware = Ware(warePath, isMalware)
        ware.extractFeature(f)
        print("Extracted features from", i + 1, "files; progress:")
        print((i + 1) * 100 / total, "%")
        
    
# 1 marks malware, 0 marks benign apps
collect(virusroot, 1)
collect(kindroot, 0)    

f.close()

The extracted result looks like this:

image-20201015172554235

The Feature column holds the features extracted for each file. The opcode sequences of the individual methods are separated by "|".

n-gram Features

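The n-gram features are extracted from the Feature column above; each value indicates whether that opcode sequence occurs. The code is as follows: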

# -*- coding: utf-8 -*-
"""
Created on Fri Feb  9 13:26:50 2018
@author: 燃燒杯
Set-of-words model
"""

import sys

# value of n for the n-grams
n = int(sys.argv[1])
print("n = ", n)

import pandas as pd

origin = pd.read_csv("data.csv")
#origin = pd.read_csv("test.csv")

from infrastructure.mydict import MyDict

mdict = MyDict()

feature = origin["Feature"].str.split("|")
total = len(feature)
for i, code in enumerate(feature):
    mdict.newLayer()
    if not isinstance(code, list):  # skip rows with no feature string (NaN after split)
        continue
    for method in code:
        length = len(method)
        if length < n:
            continue
        for start in range(length - (n - 1)):
            end = start + n
            mdict.mark(method[start:end])
    print("Processed", i + 1, "apps; progress:")
    print((i + 1) * 100 / total, "%")
            
result = mdict.dict
pd.DataFrame(result, index=origin.index)\
               .to_csv("./" + str(n) + "_gram.csv", index=False)
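The MyDict class is not shown in this post. Judging only from how the script calls it (newLayer() once per app, mark() once per n-gram, and a dict attribute that becomes the DataFrame), it could behave roughly like the hypothetical stand-in below. This is a guess at the behaviour, not the author's implementation.

```python
class PresenceTable:
    """Hypothetical stand-in for MyDict: one column per n-gram, one row per app."""

    def __init__(self):
        self.dict = {}  # n-gram -> list of 0/1 flags, one entry per app
        self.rows = 0   # number of apps started so far

    def newLayer(self):
        """Start a new app: append a 0 to every known column."""
        self.rows += 1
        for column in self.dict.values():
            column.append(0)

    def mark(self, gram):
        """Record that `gram` occurs in the current app."""
        if gram not in self.dict:
            # A new n-gram was absent from every earlier app.
            self.dict[gram] = [0] * self.rows
        self.dict[gram][-1] = 1


table = PresenceTable()
table.newLayer()
table.mark("VMI")
table.mark("VMI")  # marking twice still stores 1, not 2
table.newLayer()
table.mark("MIR")
print(table.dict)
```

Every column ends up with exactly one entry per app, which is what lets the script pass the dictionary straight to pd.DataFrame with index=origin.index.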

The result is shown below:

image-20201015173708451

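This produces a 2000 × 343 feature table. There are 343 feature sequences because there are 7 categories of opcodes and we use 3-grams, which gives 7 × 7 × 7 = 343 possible sequences.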
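The count is easy to check by enumerating every 3-letter sequence over a 7-letter alphabet. The letters below are assumed for illustration (only M for move is given earlier); any seven distinct letters give the same total.

```python
from itertools import product

LETTERS = "MRGITPV"  # assumed alphabet, one letter per opcode category

# Every possible 3-gram over the alphabet, in a fixed order.
ALL_TRIGRAMS = ["".join(p) for p in product(LETTERS, repeat=3)]

print(len(ALL_TRIGRAMS), ALL_TRIGRAMS[:3])
```

Since the table has exactly 343 columns, every possible 3-gram must occur in at least one of the apps.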

Machine Learning

Next comes training. This experiment uses the random forest algorithm with 10-fold cross-validation. The code is as follows:

from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import pandas as pd

train_feture = pd.read_csv('D:\\android\\dataset\\smalis\\3_gram.csv')
data = pd.read_csv('D:\\android\\dataset\\smalis\\data.csv')
labels = data["isMalware"]
train_feture = train_feture.iloc[:,:].values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, train_feture, labels, cv=10)

The results are as follows:

array([0.965     , 0.995     , 0.995     , 0.96      , 0.89      ,
       0.945     , 0.965     , 0.95      , 0.97487437, 0.97487437])

Deep Learning

I also ran a deep learning classifier to see how it performs, using keras as the deep learning library. Here is the code:

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

test_split = 0.2  # fraction of samples held out for testing
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
fetrues = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
labels = data["isMalware"]

p1 = int(len(labels) * (1 - test_split))  # number of training samples
# Shuffle features and labels with the same permutation so they stay aligned
index = np.random.permutation(len(fetrues))
train_data = fetrues.iloc[index].values
labels = labels.iloc[index].values
# Split into training and test sets
x_train, y_train = train_data[:p1], labels[:p1]
x_test, y_test = train_data[p1:], labels[p1:]

model = keras.Sequential()
model.add(layers.Dense(50,input_dim = 343, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))
model.compile(
    optimizer = 'adam',
    loss='binary_crossentropy',
    metrics=['acc']
)

history = model.fit(x_train, y_train, epochs=60, batch_size=256, validation_data=(x_test, y_test))

The test results are as follows:

image-20201016093540843

Accuracy over the last 10 epochs:

0.9812, 0.9819, 0.9775, 0.9781, 0.9718, 0.9812, 0.9793, 0.9618, 0.9825, 0.9756

I also ran 10-fold cross-validation, with the following code:

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from sklearn.model_selection import StratifiedKFold
seed = 7
np.random.seed(seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

cvscores = []
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
labels = data["isMalware"]
train_data = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
train_data =  train_data.iloc[:,:].values
for train, test in kfold.split(train_data, labels):
    model = keras.Sequential()
    model.add(layers.Dense(50,input_dim = 343, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(1, activation = 'sigmoid'))
    model.compile(
        optimizer = 'adam',
        loss='binary_crossentropy',
        metrics=['acc']
    )
    model.fit(train_data[train],labels[train],epochs=60, batch_size=256,verbose = 0)
    scores = model.evaluate(train_data[test], labels[test], verbose=0)
    print(scores[1])
    cvscores.append(scores[1])
print(cvscores)

The accuracies are as follows:

[0.95, 0.985, 0.95, 0.945, 0.975, 0.95, 0.955, 0.96, 0.9798995, 0.9748744]

Comparison with random forest:

image-20201016102523772
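For a numeric view of the same comparison, the two sets of fold scores reported above can be summarised with the standard library. This is only a summary of the published numbers; the variable names are illustrative, and the two runs did not necessarily use the same folds, so the scores are not paired.

```python
from statistics import mean, stdev

# Fold accuracies copied from the results above.
rf_scores = [0.965, 0.995, 0.995, 0.96, 0.89,
             0.945, 0.965, 0.95, 0.97487437, 0.97487437]
dnn_scores = [0.95, 0.985, 0.95, 0.945, 0.975,
              0.95, 0.955, 0.96, 0.9798995, 0.9748744]

for name, scores in (("random forest", rf_scores), ("dense network", dnn_scores)):
    print(f"{name}: mean={mean(scores):.4f} "
          f"std={stdev(scores):.4f} min={min(scores):.3f}")
```

Both models average roughly 0.96 across the ten folds; the random forest has the single weakest fold (0.89), so its scores are more spread out.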

