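Introduction
This post introduces the UNSW-NB15 dataset, a dataset for network intrusion detection. An earlier post covered another pair of intrusion detection datasets: Introduction to the KDD99 and NSL-KDD Datasets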
UNSW-NB15 Overview
Official dataset page: The UNSW-NB15 Dataset Description
Download link: UNSW-NB15 Download
The dataset contains nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms.
The dataset has 49 features; each of them is described later in this post.
The CSV data holds 2,540,044 records in total, stored in four CSV files.
The number of records for each attack is listed in UNSW-NB15_LIST_EVENTS.csv; a brief analysis follows later.
The dataset also comes already split into a training set and a testing set: UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv.
The training set has 175,341 records and the testing set has 82,332 records, covering the different types, attack and normal. Figures 1 and 2 of the official description show the testbed configuration and the feature creation method of UNSW-NB15, respectively.
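As a rough sketch of how the split files might be inspected, the snippet below counts records per attack category using only the standard library. It assumes the CSV has a header row with an `attack_cat` column and that a blank category means Normal; both are assumptions to check against the files you download.

```python
import csv
from collections import Counter


def count_by_category(csv_lines, column="attack_cat"):
    """Count rows per value of `column` in CSV text that has a header row.

    `csv_lines` can be an open file or any iterable of lines.
    """
    counts = Counter()
    for row in csv.DictReader(csv_lines):
        # Assumption: a missing or blank category is treated as Normal.
        value = (row.get(column) or "").strip() or "Normal"
        counts[value] += 1
    return counts


# Sizes stated above for the predefined split.
TRAIN_SIZE = 175_341
TEST_SIZE = 82_332
print(TRAIN_SIZE + TEST_SIZE)  # total records across both split files
```

With a real file this would be called as `with open("UNSW_NB15_training-set.csv", newline="") as f: counts = count_by_category(f)`, where the path is wherever you saved the download.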
UNSW-NB15 Feature Descriptions
The dataset has 49 features, described one by one below. The descriptions come from:
- Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." In 2015 military communications and information systems conference (MilCIS), pp. 1-6. IEEE, 2015.
In the feature tables below, the Type abbreviations mean:
- N: nominal,
- I: integer,
- F: float,
- T: timestamp,
- B: binary
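When parsing raw CSV fields, these type codes can be mapped to converters. The following is a minimal sketch, not part of the dataset's tooling: `TYPE_CONVERTERS` and `parse_value` are hypothetical names, and treating timestamps as integer epoch seconds is an assumption. Real files may contain values that need extra cleaning before conversion.

```python
# Map the Type codes used in the feature tables to Python converters.
TYPE_CONVERTERS = {
    "N": str,    # nominal: keep as text
    "I": int,    # integer
    "F": float,  # float
    "T": int,    # timestamp: assumed to be stored as integer epoch seconds
    "B": int,    # binary: 0 or 1
}


def parse_value(raw, type_code):
    """Convert one raw CSV field according to its Type code."""
    value = TYPE_CONVERTERS[type_code](raw.strip())
    if type_code == "B" and value not in (0, 1):
        raise ValueError(f"binary field must be 0 or 1, got {value!r}")
    return value
```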
Flow Features
- #, Name, Type, Description
- ------------------------------
- 1, srcip, N, Source IP address
- 2, sport, I, Source port number
- 3, dstip, N, Destination IP address
- 4, dsport, I, Destination port number
- 5, proto, N, Transaction protocol
Base Features
- 6, state, N, The state and its dependent protocol, e.g. ACC, CLO, else (-)
- 7, dur, F, Record total duration
- 8, sbytes, I, Source to destination bytes
- 9, dbytes, I, Destination to source bytes
- 10, sttl, I, Source to destination time to live
- 11, dttl, I, Destination to source time to live
- 12, sloss, I, Source packets retransmitted or dropped
- 13, dloss, I, Destination packets retransmitted or dropped
- 14, service, N, http, ftp, ssh, dns, ..., else (-)
- 15, sload, F, Source bits per second
- 16, dload, F, Destination bits per second
- 17, spkts, I, Source to destination packet count
- 18, dpkts, I, Destination to source packet count
Content Features
- 19, swin, I, Source TCP window advertisement
- 20, dwin, I, Destination TCP window advertisement
- 21, stcpb, I, Source TCP sequence number
- 22, dtcpb, I, Destination TCP sequence number
- 23, smeansz, I, Mean of the flow packet size transmitted by the src
- 24, dmeansz, I, Mean of the flow packet size transmitted by the dst
- 25, trans_depth, I, The depth into the connection of the http request/response transaction
- 26, res_bdy_len, I, The content size of the data transferred from the server's http service.
Time Features
- 27, sjit, F, Source jitter (mSec)
- 28, djit, F, Destination jitter (mSec)
- 29, stime, T, record start time
- 30, ltime, T, record last time
- 31, sintpkt, F, Source inter-packet arrival time (mSec)
- 32, dintpkt, F, Destination inter-packet arrival time (mSec)
- 33, tcprtt, F, The sum of 'synack' and 'ackdat' of the TCP.
- 34, synack, F, The time between the SYN and the SYN_ACK packets of the TCP.
- 35, ackdat, F, The time between the SYN_ACK and the ACK packets of the TCP.
The features from 1-35 represent the integrated gathered information from data packets. The majority of features are generated from header packets as reflected above.
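Feature 33 is defined as the sum of features 34 and 35, which gives a simple consistency check on a record. The sketch below uses made-up values; the function names and tolerances are illustrative choices, not something the dataset prescribes.

```python
import math


def tcprtt_from_parts(synack, ackdat):
    """TCP connection setup round-trip time as the sum of its two intervals."""
    return synack + ackdat


def is_consistent(record, rel_tol=1e-6, abs_tol=1e-9):
    """Check that a record's tcprtt matches synack + ackdat within tolerance."""
    expected = tcprtt_from_parts(record["synack"], record["ackdat"])
    return math.isclose(record["tcprtt"], expected, rel_tol=rel_tol, abs_tol=abs_tol)


# Made-up values for illustration only.
record = {"synack": 0.0005, "ackdat": 0.0001, "tcprtt": 0.0006}
print(is_consistent(record))
```

A tolerance-based comparison is used because floating-point sums rarely match a stored value bit for bit.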
Additional Generated Features--General purpose features
In the general purpose features, each feature has its own purpose, according to the defence point of view.
- 36, is_sm_ips_ports, B, If the source (1) and destination (3) IP addresses are equal and the port numbers (2)(4) are equal, this variable takes value 1, else 0.
- 37, ct_state_ttl, I, No. for each state (6) according to a specific range of values for source/destination time to live (10) (11).
- 38, ct_flw_http_mthd, I, No. of flows that have methods such as Get and Post in http service.
- 39, is_ftp_login, B, If the ftp session is accessed by user and password then 1 else 0.
- 40, ct_ftp_cmd, I, No. of flows that have a command in ftp session.
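Feature 36 follows directly from the four flow features it references. A minimal sketch of that rule, with illustrative argument values:

```python
def is_sm_ips_ports(srcip, sport, dstip, dsport):
    """Return 1 if source and destination share both IP address and port, else 0."""
    return int(srcip == dstip and sport == dsport)


# Illustrative values only.
print(is_sm_ips_ports("10.0.0.1", 80, "10.0.0.1", 80))
print(is_sm_ips_ports("10.0.0.1", 80, "10.0.0.2", 80))
```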
Additional Generated Features--Connection features
Connection features are created solely to provide defence in connection-attempt scenarios.
Attackers might scan hosts in a capricious way, for example once per minute or once per hour. In order to identify these attackers, the features 36-47 are intended to sort accordingly with the last time feature to capture similar characteristics of the connection records for each 100 connections sequentially ordered.
- 41, ct_srv_src, I, No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (30).
- 42, ct_srv_dst, I, No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (30).
- 43, ct_dst_ltm, I, No. of connections of the same destination address (3) in 100 connections according to the last time (30).
- 44, ct_src_ ltm, I, No. of connections of the same source address (1) in 100 connections according to the last time (30).
- 45, ct_src_dport_ltm, I, No. of connections of the same source address (1) and the destination port (4) in 100 connections according to the last time (30).
- 46, ct_dst_sport_ltm, I, No. of connections of the same destination address (3) and the source port (2) in 100 connections according to the last time (30).
- 47, ct_dst_src_ltm, I, No. of connections of the same source (1) and the destination (3) address in 100 connections according to the last time (30).
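The counting idea behind features 41-47 can be sketched as a sliding window over records sorted by last time. This is only one plausible reading of the descriptions above: the exact window semantics used by the dataset authors are not spelled out here, so the choice below (the current record plus up to 99 records before it) is an assumption, and `count_in_window` is a hypothetical helper.

```python
from collections import deque


def count_in_window(records, key_fields, window=100):
    """For each record, count records in the current window sharing its key.

    `records` must already be sorted by last time (ltime). Assumption: the
    window is the current record plus up to `window - 1` records before it.
    """
    recent = deque()  # keys of the records currently inside the window
    tallies = {}      # key -> number of occurrences inside the window
    result = []
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        recent.append(key)
        tallies[key] = tallies.get(key, 0) + 1
        if len(recent) > window:
            # Drop the oldest record so the window never exceeds its size.
            old = recent.popleft()
            tallies[old] -= 1
            if tallies[old] == 0:
                del tallies[old]
        result.append(tallies[key])
    return result
```

Under this reading, `count_in_window(records, ("service", "srcip"))` would approximate ct_srv_src and `count_in_window(records, ("dstip",))` would approximate ct_dst_ltm. Keeping running tallies avoids rescanning the window for every record.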
Labelled Features
- 48, attack_cat, N, The name of each attack category. This data set has nine attack categories (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms); counting Normal, there are 10 classes in total.
- 49, Label, B, 0 for normal and 1 for attack records
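The two label features are related: the binary label is 0 exactly when the category is Normal. A small sketch of encoding both from an `attack_cat` value follows. The class ordering is arbitrary, treating a blank category as Normal is an assumption, and real files may use variant spellings that would raise a `KeyError` here.

```python
ATTACK_CATEGORIES = [
    "Fuzzers", "Analysis", "Backdoors", "DoS", "Exploits",
    "Generic", "Reconnaissance", "Shellcode", "Worms",
]
CLASSES = ["Normal"] + ATTACK_CATEGORIES  # 10 classes in total
CLASS_INDEX = {name: i for i, name in enumerate(CLASSES)}


def encode(attack_cat):
    """Return (class_index, binary_label) for one attack_cat value."""
    # Assumption: a blank or missing category means a normal record.
    name = (attack_cat or "").strip() or "Normal"
    index = CLASS_INDEX[name]
    label = 0 if name == "Normal" else 1
    return index, label
```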
UNSW-NB15 Data Description
Distribution of the dataset
The list below gives the distribution of all records of the UNSW-NB15 data set. The major categories of the records are normal and attack. The attack records are further classified into nine families according to the nature of the attacks.
- (1) Normal: 2,218,761; Natural transaction data.
- (2) Fuzzers: 24,246; Attempting to suspend a program or network by feeding it randomly generated data.
- (3) Analysis: 2,677; It contains different attacks of port scan, spam and HTML file penetrations.
- (4) Backdoors: 2,329; A technique in which a system security mechanism is bypassed stealthily to access a computer or its data.
- (5) DoS: 16,353; A malicious attempt to make a server or a network resource unavailable to users, usually by temporarily interrupting or suspending the services of a host connected to the Internet.
- (6) Exploits: 44,525; The attacker knows of a security problem within an operating system or a piece of software and leverages that knowledge by exploiting the vulnerability.
- (7) Generic: 215,481; A technique that works against all block ciphers (with a given block and key size), without consideration of the structure of the block cipher.
- (8) Reconnaissance: 13,987; Contains all Strikes that can simulate attacks that gather information.
- (9) Shellcode: 1,511; A small piece of code used as the payload in the exploitation of a software vulnerability.
- (10) Worms: 174; The attacker's program replicates itself in order to spread to other computers. Often, it uses a computer network to spread itself, relying on security failures on the target computer to access it.
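The per-class counts above can be checked against the stated total of 2,540,044 records, and they make the class imbalance easy to see. The snippet below is plain arithmetic on the numbers listed; nothing in it comes from outside this post.

```python
RECORD_COUNTS = {
    "Normal": 2_218_761,
    "Fuzzers": 24_246,
    "Analysis": 2_677,
    "Backdoors": 2_329,
    "DoS": 16_353,
    "Exploits": 44_525,
    "Generic": 215_481,
    "Reconnaissance": 13_987,
    "Shellcode": 1_511,
    "Worms": 174,
}

total = sum(RECORD_COUNTS.values())
attacks = total - RECORD_COUNTS["Normal"]

# Print each class with its count and its share of all records.
for name, count in RECORD_COUNTS.items():
    print(f"{name:15s} {count:>10,d} {100 * count / total:7.3f}%")
print(f"total={total:,d} attacks={attacks:,d}")
```

Normal traffic makes up roughly 87% of the records, so a classifier that always predicts Normal would already look accurate. This is worth keeping in mind when reading accuracy figures.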
UNSW-NB15 Files
Four CSV files of the data records are provided and each CSV file contains attack and normal records. The names of the CSV files are UNSW-NB15_1.csv, UNSW-NB15_2.csv, UNSW-NB15_3.csv and UNSW-NB15_4.csv.
In each CSV file, all the records are ordered according to the last time attribute. Each of the first three CSV files contains 700,000 records and the fourth file contains 440,044 records.
The list of events file is named UNSW-NB15_LIST_EVENTS and contains the attack category and subcategory.
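Reading the four files as one stream could look like the sketch below. It assumes the files carry no header row, so the 49 column names have to be supplied separately; check that assumption against your download. `iter_records` is a hypothetical helper, and it works on open files or any iterables of lines.

```python
import csv

FILE_NAMES = [f"UNSW-NB15_{i}.csv" for i in range(1, 5)]
EXPECTED_ROWS = [700_000, 700_000, 700_000, 440_044]  # as stated above


def iter_records(csv_files, column_names):
    """Yield one dict per row from several header-less CSV sources, in order."""
    for csv_file in csv_files:
        for row in csv.reader(csv_file):
            if not row:
                continue  # skip blank lines
            if len(row) != len(column_names):
                raise ValueError(
                    f"expected {len(column_names)} fields, got {len(row)}"
                )
            yield dict(zip(column_names, row))


print(sum(EXPECTED_ROWS))  # should equal the total record count
```

Since every field arrives as a string, values still need converting to their listed types before any modelling.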
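UNSW-NB15 Accuracy Analysis
This section looks at the accuracy that various algorithms achieve on the UNSW-NB15 dataset. The results come from the following paper.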
- @article{moustafa2016evaluation,
- title={The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set},
- author={Moustafa, Nour and Slay, Jill},
- journal={Information Security Journal: A Global Perspective},
- volume={25},
- number={1-3},
- pages={18--31},
- year={2016},
- publisher={Taylor \& Francis}
- }
Five techniques are used for the evaluation: Naive Bayes (NB) (Panda & Patra, 2007), Decision Tree (DT) (Bouzida & Cuppens, 2006), Artificial Neural Network (ANN) (Bouzida & Cuppens, 2006; Mukkamala et al., 2005), Logistic Regression (LR) (Mukkamala et al., 2005), and Expectation-Maximization (EM) Clustering (Sharif et al., 2012).
The models are evaluated by accuracy and false alarm rate (FAR). For more on evaluation metrics, see: Model Evaluation Metrics in Theory and Practice: the Confusion Matrix
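Both metrics can be computed from confusion-matrix counts. The sketch below takes FAR to be the false positive rate, FP / (FP + TN), which is a common definition but an assumption here; the cited paper's exact formula should be checked before comparing numbers. Labels follow the dataset's convention of 0 for normal and 1 for attack.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means attack."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_alarm_rate(tn, fp):
    """Fraction of normal records flagged as attacks (assumed FAR definition)."""
    return fp / (fp + tn)


# Toy labels for illustration only.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), false_alarm_rate(tn, fp))
```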
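The paper's final test results are shown in the figure below; accuracy comes out at a little under 85%:
[Figure: evaluation results from the paper]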
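UNSW-NB15 Experiments
This section lists some experiments that use the UNSW-NB15 dataset; the code can serve as a reference when running your own.
GitHub roundup for this dataset: GitHub summary: UNSW-NB15 dataset
Preprocessed data: Feature coded UNSW_NB15 intrusion detection data.
Classifying UNSW-NB15 with SVM and Naive Bayes: UNSW-Network_Packet_Classification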