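Notes on Cisco's papers on detecting malicious encrypted TLS flows. Because the samples are imbalanced, the results are actually not very good, and the 99.9% accuracy figures mean little. A key assumption behind Cisco's use of DNS and HTTP is that malware relies on DGA and HTTP C&C (normal HTTP traffic carries images and the like). Cisco initially used logistic regression; the later 2017 paper uses random forest.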

0x00

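This series of notes is an informal record of the questions and thoughts that came up while reading papers. The structure may be loose and cannot fully capture the essence of the original paper; it is meant only for my own later review.

Title: Identifying Encrypted Malware Traffic with Contextual Flow Data
Authors: Blake Anderson (Cisco), David McGrew (Cisco)
Source: AISec '16 Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security
Keywords: Malware; Machine Learning; Transport Layer Security; Network Monitoring

0x01 Problem statement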

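It is necessary to detect malware, and identify its type, from the encrypted traffic it sends and receives.
Traditional feature extraction focuses mostly on packet sizes and timing-related parameters. This paper widens the feature set to include the complete TLS handshake packets, the DNS flows from the same source as the TLS handshake, and the HTTP flows within a 5-minute window (the latter two are called contextual flows). Feeding the features extracted from this data into a supervised machine learning algorithm yields very high classification accuracy.

0x02 Solution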

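Feature extraction: for the contextual flows, from DNS flows we mainly analyze the responses returned by the DNS server that carry an address, together with the TTL value associated with that address; from HTTP flows we mainly analyze the various fields in the HTTP headers. For the TLS stream, we mainly analyze the information provided in the handshake packets. For other packets (such as ordinary TCP, UDP, and ICMP packets, i.e. observable metadata), we extract their side-channel information.
Classification: the features are normalized and fed into a supervised learning algorithm.
Testing uses packets captured in a real network environment.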
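
The normalization step can be sketched roughly as follows. The notes only say that features are normalized before training, so the choice of zero-mean, unit-variance scaling here is an assumption, and `standardize` and the sample rows are hypothetical names and values used only for illustration.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale every feature column to zero mean and unit variance.

    Assumption: this particular scaling is one common choice; the notes
    do not say which normalization the paper applies.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical flows: [mean packet length, number of returned IP addresses]
flows = [[1200.0, 3.0], [800.0, 1.0], [1000.0, 2.0]]
scaled = standardize(flows)
```

The scaled rows can then be handed to whichever supervised learner is being tested; scaling matters most for the regularized linear models mentioned later in these notes.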

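0x03 Feature sources

TLS flows

A TLS flow is unencrypted at the start of the exchange, because the client has to complete a handshake with the remote server. The unencrypted TLS metadata we can observe includes the clientHello and clientKeyExchange messages. From these messages we can infer information such as the TLS library used by the client, and we find that the behavior of benign traffic differs markedly from that of malicious traffic.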

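On the client side, we first look at two TLS features: Offered Ciphersuites and Advertised TLS Extensions. For the former, malicious traffic more often offers the 0x0004 (TLS_RSA_WITH_RC4_128_MD5) suite in the clientHello, whereas benign traffic more often offers the 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA) suite. For the latter, most TLS traffic advertises 0x000d (signature_algorithms), but benign traffic also uses the following extensions, which are rarely seen in malicious traffic:

• 0x0005 (status request)
• 0x3374 (next protocol negotiation)
• 0xff01 (renegotiation info)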

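Next, we look at the difference in client public keys between benign and malicious traffic. Benign traffic tends to use 256-bit elliptic curve public keys, whereas malicious traffic tends to use 2048-bit RSA public keys.

On the server side, the serverHello message gives the ciphersuite the server selected and the TLS extensions it supports. Benign traffic shows more varied choices, while malicious traffic tends to select fairly outdated options. From the certificate message we can obtain the server's certificate chain. The number of certificates in the chain is about the same for malicious and benign traffic, but among chains of length 1, about 70% are self-signed in malicious traffic versus about 0.1% in benign traffic.

In addition, the SubjectAltName X.509 extension and the certificate validity period can also help separate some benign and malicious traffic.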

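DNS flows

Many malware families use domain generation algorithms (DGAs) to generate random domain names, a behavior that clearly sets them apart from normal traffic. This is our entry point for identifying malicious traffic.

Comparing domain name lengths, benign domains roughly follow a Gaussian distribution with a peak at 6 or 7, whereas malicious domains show an extremely sharp peak at 6. Looking at the character classes used in domain names, we find that benign domains use more numeric characters than malicious ones.

Comparing the number of IP addresses carried in DNS responses, benign traffic most often returns 2 or 8 addresses, whereas malicious traffic most often returns 4 or 11. Comparing TTL values in the responses, the most common values in benign traffic are 60, 300, 20, and 30; in malicious traffic 300 is common, but 20 and 30 are not. Malicious traffic also frequently shows a TTL of 100, a value that almost never appears in benign traffic.

Beyond these indicators, the Alexa ranking also reveals differences between the domains in benign and malicious traffic. Domains are divided into 6 classes: top-100, top-1,000, top-10,000, top-100,000, top-1,000,000, and unranked. We then find that 86% of the domains in malicious traffic are unranked.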

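HTTP flows

In HTTP response headers, the fields most commonly used by malicious traffic are Server, Set-Cookie, and Location, whereas benign traffic most commonly uses Connection, Expires, and Last-Modified. In HTTP request headers, benign traffic most commonly uses User-Agent, Accept-Encoding, and Accept-Language.

Looking at field values, the most common Content-Type in benign traffic is image/\*, whereas in malicious traffic it is text/\*. Other MIME values common in malicious traffic are text/html;charset=UTF-8 and text/html;charset=utf-8.

Malicious traffic tends to report an old version of Nginx as its server, whereas benign traffic tends to report an old version of Apache or Nginx.

In malicious traffic, a fairly common User-Agent value is Opera/9.50(WindowsNT6.0;U;en), followed by certain versions of Mozilla/5.0 or Mozilla/4.0; benign traffic generally uses Windows or OS X variants of Mozilla/5.0.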

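0x04 Feature extraction details

Side-channel information

(I did not follow this part; the packet length and timing features are related to Markov chains.)
A length-256 array is created that keeps a count for each byte value seen in the packet payloads.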
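
A minimal sketch of the counting array, assuming it is a length-256 array indexed by byte value (my reading of the paper's byte distribution feature, BD in the results below). The function name `byte_distribution` and the sample payloads are hypothetical.

```python
def byte_distribution(payloads):
    """Count how often each byte value (0-255) occurs across packet payloads.

    Assumption: the array is indexed by byte value, one slot per possible
    value, rather than by payload length.
    """
    counts = [0] * 256
    for payload in payloads:
        for byte in payload:  # iterating over bytes yields integers 0-255
            counts[byte] += 1
    return counts


# Two made-up payload prefixes resembling TLS record headers
bd = byte_distribution([b"\x16\x03\x01", b"\x16\x03\x03"])
```

In practice the counts would probably be divided by the total number of bytes so that flows of different sizes are comparable; that step is left out here.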

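TLS data

Client-based features: the 176 ciphersuite types, the TLS extensions, and the public key length are laid out as a list, and a binary array (containing only 0 and 1) marks which of them apply to the flow in question;
Server-based features: same as above.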
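
The binary marking described above can be sketched like this. The short `KNOWN_CIPHERSUITES` and `KNOWN_EXTENSIONS` lists are illustrative stand-ins (the notes mention 176 ciphersuites), and appending the key length as a plain number is my assumption, since the notes do not say how it is encoded.

```python
# Illustrative subsets only; the real lists would be much longer.
KNOWN_CIPHERSUITES = [0x0004, 0x0005, 0x002F, 0x0035, 0xC02F]
KNOWN_EXTENSIONS = [0x0005, 0x000D, 0x3374, 0xFF01]


def tls_client_features(offered_ciphersuites, advertised_extensions, public_key_bits):
    """Build a 0/1 vector over known ciphersuites and extensions.

    Assumption: the public key length is appended as a raw integer.
    """
    offered = set(offered_ciphersuites)
    advertised = set(advertised_extensions)
    vector = [1 if suite in offered else 0 for suite in KNOWN_CIPHERSUITES]
    vector += [1 if ext in advertised else 0 for ext in KNOWN_EXTENSIONS]
    vector.append(public_key_bits)
    return vector


# A hypothetical client offering two RC4 suites and only signature_algorithms
features = tls_client_features([0x0004, 0x0005], [0x000D], 2048)
```

The same function shape would work for the server side, with the selected ciphersuite and the server's supported extensions as inputs.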

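DNS data

Similar to the approach above, the domain-related features are: 32 possible TTL values plus an "other" option, the number of numeric characters, the number of non-alphanumeric characters, the number of IP addresses returned in the DNS response, and 6 classes for the domain's Alexa rank.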
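
A rough sketch of how such a DNS feature vector might be assembled. `TTL_VALUES` is a shortened, illustrative list (the notes mention 32 values without listing them), and the function name, argument names, and bucket layout are assumptions.

```python
TTL_VALUES = [20, 30, 60, 100, 300, 3600]  # illustrative, not the paper's 32 values
ALEXA_LIMITS = [100, 1_000, 10_000, 100_000, 1_000_000]  # last slot is "unranked"


def dns_features(domain, ttl, ip_count, alexa_rank=None):
    """Concatenate the TTL one-hot, character counts, IP count, and Alexa one-hot."""
    ttl_onehot = [1 if ttl == value else 0 for value in TTL_VALUES]
    ttl_onehot.append(0 if ttl in TTL_VALUES else 1)  # the "other" slot

    digits = sum(ch.isdigit() for ch in domain)
    non_alnum = sum(not ch.isalnum() for ch in domain)  # dots and hyphens count here

    alexa_onehot = [0] * (len(ALEXA_LIMITS) + 1)
    bucket = len(ALEXA_LIMITS)  # default: unranked
    if alexa_rank is not None:
        for index, limit in enumerate(ALEXA_LIMITS):
            if alexa_rank <= limit:
                bucket = index
                break
    alexa_onehot[bucket] = 1

    return ttl_onehot + [digits, non_alnum, ip_count] + alexa_onehot


# A hypothetical unranked domain with TTL 100 and four returned addresses
vector = dns_features("a1b2.example.com", ttl=100, ip_count=4)
```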

HTTP data

Similar to the approach above, 6 fields that frequently appear in HTTP headers are selected, plus an "other" option.
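
One possible reading of this, sketched below, is a presence indicator per selected field plus a flag for any other field. The notes do not name the six fields, so `SELECTED_FIELDS` is a guess drawn from the fields discussed in the HTTP section above, and `http_header_features` is a hypothetical name.

```python
# Assumed selection; the notes do not list the six fields.
SELECTED_FIELDS = [
    "server", "set-cookie", "location",
    "connection", "expires", "last-modified",
]


def http_header_features(headers):
    """Return a 0/1 flag per selected header field plus an "other" flag."""
    names = {name.lower() for name in headers}  # header names are case-insensitive
    vector = [1 if field in names else 0 for field in SELECTED_FIELDS]
    vector.append(1 if names - set(SELECTED_FIELDS) else 0)  # any unlisted field
    return vector


# A made-up response with two selected fields and one unlisted field
flags = http_header_features(
    {"Server": "nginx", "Location": "/x", "Content-Type": "text/html"}
)
```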

0x05 Test results

SPLT + BD + TLS + HTTP + DNS: 99.933%
SPLT + BD + TLS + HTTP: 99.983%
SPLT + BD + TLS + DNS: 99.968%
TLS + HTTP + DNS: 99.988%
SPLT + BD + TLS: 99.933%
HTTP + DNS: 99.985%
TLS + HTTP: 99.955%
TLS + DNS: 99.883%
HTTP: 99.945%
DNS: 99.496%
TLS: 96.335%

Addendum:

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity, a KDD 2017 paper by the same authors.

It describes the TLS handshake:

Figure 1 provides a graphical representation of a simple TLS session. The client initially sends a ClientHello message that provides the server with, among other fields, a list of cipher suites and a set of TLS extensions that the client supports. The cipher suite list is ordered by preference of the client, and each cipher suite defines a set of cryptographic algorithms needed for TLS to operate. The set of extensions provides additional information to the server that facilitates extended functionality, e.g., the Server Name Indication extension indicates the hostname of the server that the client is trying to connect to, which is important for virtual hosting. As explained in Section 4, all of the TLS data features used in this paper are taken from the unencrypted ClientHello message. After the ClientHello, the server sends a ServerHello message that contains the selected cipher suite, selected from the client’s offer list, which defines the set of cryptographic algorithms that will be used to secure the exchanged application data. The ServerHello message also contains a list of extensions that the server supports, where this list is a subset of what the client supports. At this time, the server also sends a Certificate message containing the server’s certificate chain, which can be used to authenticate the server.
The client then sends a ClientKeyExchange message that establishes the premaster secret of the TLS session. Then the client and server exchange ChangeCipherSpec messages indicating that future messages will be encrypted with the negotiated cryptographic parameters. Finally, the client and server begin to exchange application data. In Figure 1, red text represents unencrypted messages, and blue text represents encrypted messages. The current TLS 1.2 handshake protocol provides a lot of interesting, unencrypted information. To enhance privacy, TLS 1.3 will be encrypting more of the handshake, e.g., the Certificate message will be encrypted, but the data features used in this paper will still be available. Many important details were omitted for the sake of brevity, but the associated RFC’s provide the full specification [18, 34]. Because TLS encrypts many of the application-specific features, therefore making traditional deep packet inspection infeasible, many researchers have utilized side-channel information to make useful inferences on the TLS traffic [38]. These data features are typically constructed from the individual packet lengths and packet inter-arrival times of the encrypted session. Commonly used features include the mean of the packet lengths, n-gram or Markov chain based features derived from the sequence of packet lengths, or similarly constructed features for the timing information.
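
The packet-length features mentioned in the excerpt can be sketched as follows: the mean packet length, plus a first-order Markov chain over binned lengths. The bin size of 150 bytes and the 10 bins are assumptions chosen for illustration, and both function names are hypothetical.

```python
def mean_packet_length(packet_lengths):
    """Average payload length over the flow."""
    return sum(packet_lengths) / len(packet_lengths)


def length_transition_matrix(packet_lengths, bin_size=150, num_bins=10):
    """Estimate Markov transition probabilities between length bins.

    Assumption: bin_size and num_bins are illustrative defaults, not
    values confirmed by the excerpt.
    """
    # Map each length to a bin; oversized packets fall into the last bin.
    bins = [min(length // bin_size, num_bins - 1) for length in packet_lengths]
    counts = [[0] * num_bins for _ in range(num_bins)]
    for current, following in zip(bins, bins[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all-zero.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


# A made-up sequence alternating small and full-size packets
lengths = [100, 1500, 100, 1500, 400]
matrix = length_transition_matrix(lengths)
average = mean_packet_length(lengths)
```

Flattening `matrix` row by row gives a fixed-size feature vector regardless of how many packets the flow contains, which is what makes this representation convenient for a classifier. The same construction applies to binned inter-arrival times.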

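I keep feeling that packet size should not be a key feature, but the paper says it is.

Finally, the accuracy of the algorithms:

Sample counts: Total 4,287,892 (Enterprise) and 285,895 (Malware); the ratio of malicious to benign samples is about 7:100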

| Algorithm | Enterprise (Standard) | Enterprise (Enhanced) | Malware (Standard) | Malware (Enhanced) |
|---|---|---|---|---|
| LinReg | 99.92% | 99.28% | 0.00% | 58.65% |
| l2-LogReg | 93.35% | 98.36% | 16.86% | 76.13% |
| l1-LogReg | 92.75% | 98.97% | 19.71% | 75.08% |
| DecTree | 97.55% | 97.02% | 40.98% | 83.33% |
| RandForest | 99.53% | 99.99% | 33.54% | 76.79% |
| SVM | 11.94% | 99.78% | 77.98% | 72.62% |
| MLP | 95.90% | 99.54% | 20.61% | 72.53% |

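Because the samples are imbalanced, the classification results are actually not good; the malware detection rate and accuracy make this clear. The best is only about 83%.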
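
A small worked example of why overall accuracy says little here. It uses the two sample counts quoted above, assuming the larger is the benign (Enterprise) set and the smaller the malware set; `all_benign_baseline` is a hypothetical helper that scores a classifier labelling every flow as benign.

```python
def all_benign_baseline(benign_count, malware_count):
    """Score a trivial classifier that predicts "benign" for every flow."""
    total = benign_count + malware_count
    accuracy = benign_count / total        # every benign flow is correct
    malware_recall = 0 / malware_count     # no malware flow is ever caught
    return accuracy, malware_recall


# Counts from the notes; which one is benign is my reading of the table header.
accuracy, malware_recall = all_benign_baseline(4_287_892, 285_895)
print(f"accuracy={accuracy:.2%}, malware recall={malware_recall:.2%}")
```

The trivial classifier reaches roughly 93.7% accuracy while detecting no malware at all, so the per-class malware figures in the table are the more informative ones.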

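Some key points from Identifying Encrypted Malware Traffic with Contextual Flow Data. The paper contains many figures on feature extraction that are worth studying carefully. Cisco initially used logistic regression, and this is the paper where it does so.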

We can see that malware usually offers a set of three obsolete ciphersuites in the clientHello message including 0x0004 (TLS_RSA_WITH_RC4_128_MD5). In the benign traffic we collected, the 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA) ciphersuite was the most offered. Malware also seems to have comparatively little diversity in the client-supported TLS extensions. 0x000d (signature_algorithms) was the only TLS extension supported in the majority of TLS flows. ∼50% of the DMZ traffic also advertised the following extensions, which were rarely seen in the malware dataset:
• 0x0005 (status request)
• 0x3374 (next protocol negotiation)
• 0xff01 (renegotiation info)
Although not shown, the client’s public key length was another client-based data feature that had significant differences. Most of the DMZ traffic used 256-bit elliptic curve cryptography for the public keys, but most of the malicious traffic used 2048-bit RSA public keys. The serverHello and certificate messages can be used to gain information about the server. The serverHello message contains the selected ciphersuite and supported extensions. As one would expect given the type and diversity of the offered ciphersuites and the advertised extensions, the malicious traffic most often selected obsolete ciphersuites. The DMZ traffic contained a wider variety of supported TLS extensions by the servers.

The certificate message passes the server’s certificate chain to the client. We observed that the number of certificates in the chain for the malware and DMZ data were roughly the same. But, if we restrict our focus on the length-1 chains, ∼70% were self-signed for malware and ∼.1% were self-signed for the DMZ traffic. The number of names in the SubjectAltName (SAN) X.509 extension also differed in the two datasets. For the DMZ traffic, the length of the list was 1 ∼45% of the time. This is in part because a number of Content Distribution Network (CDN) providers, e.g., Akamai, only have one entry. Length-10/12 lists were also common in the DMZ traffic due to some ad services.
Figure 2 also shows the distribution of the validity of the certificates rounded to the nearest day. Similar to the other data features, the period of validity for a server certificate has notable differences in the malicious and DMZ traffic.

Features and their weights:

| Weight | Feature |
|---|---|
| 3.38 | DNS Suffix org |
| 2.99 | DNS TTL 3600 |
| 2.62 | TLS Ciphersuite TLS_RSA_WITH_RC4_128_SHA |
| 2.28 | HTTP Field accept-encoding |
| 1.95 | TLS Ciphersuite SSL_RSA_FIPS_WITH_3DES_EDE_CBC_SHA |
| 1.78 | HTTP Field location |
| 1.38 | DNS Alexa: None |
| 1.21 | TLS Ciphersuite TLS_RSA_WITH_RC4_128_MD5 |
| 1.12 | HTTP Server nginx |
| 1.11 | HTTP Code 404 |
| -2.16 | TLS Extension extended_master_secret |
| -1.65 | HTTP Content Type application/octet-stream |
| -1.61 | HTTP Accept Language en-US,en;q=0.5 |
| -1.35 | TLS Ciphersuite TLS_DHE_RSA_WITH_DES_CBC_SHA |
| -1.10 | HTTP Content Type text/plain;charset=UTF-8 |
| -0.97 | HTTP Server Microsoft-IIS/8.5 |
| -0.95 | DNS Alexa: top-1,000,000 |
| -0.91 | HTTP User-Agent Microsoft-CryptoAPI/6.1 |
| -0.88 | TLS Ciphersuite TLS_ECDHE_ECDSA_WITH_RC4_128_SHA |
| -0.85 | HTTP Content Type application/x-gzip |

Table 2: The data features most relevant to the TLS/DNS/HTTP classifier.
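
To illustrate how such weights act in a logistic regression model, here is a minimal sketch that sums the weights of the features present in a flow and passes the total through a sigmoid. It assumes the features are 0/1 indicators and that positive weights point toward malware; the zero `bias` default is a placeholder, since the table does not give the intercept, and only a handful of rows are copied in.

```python
import math

# A few rows copied from Table 2; the remaining rows are omitted for brevity.
WEIGHTS = {
    "DNS Suffix org": 3.38,
    "DNS TTL 3600": 2.99,
    "TLS Ciphersuite TLS_RSA_WITH_RC4_128_MD5": 1.21,
    "HTTP Server nginx": 1.12,
    "TLS Extension extended_master_secret": -2.16,
    "HTTP Server Microsoft-IIS/8.5": -0.97,
}


def malware_probability(active_features, bias=0.0):
    """Logistic regression score for a flow with the given active features.

    Assumptions: binary features, positive weight means "more likely
    malware", and a placeholder bias of 0.0.
    """
    score = bias + sum(WEIGHTS.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-score))


suspicious = malware_probability(
    ["TLS Ciphersuite TLS_RSA_WITH_RC4_128_MD5", "HTTP Server nginx"]
)
benign_looking = malware_probability(["TLS Extension extended_master_secret"])
```

With these inputs the first flow scores well above 0.5 and the second well below it, which matches the intuition that obsolete ciphersuites raise suspicion while modern extensions lower it.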

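7.2 DNS
The Domain Name System (DNS) [28] is a hierarchical, decentralized means of providing additional information about domain names, in particular the mapping from domain names to IP addresses. Recently, malware has leveraged DNS and domain generation algorithms (DGA) [8] to provide a robust way of operating its command and control channel. There have been many previous results on classifying DNS data as malicious or benign [7,9,24]. None of this work uses DNS data to make inferences about encrypted traffic. Our work also differs in that we illustrate the distributions of different DNS data features, such as TTL values.
7.3 HTTP
The Hypertext Transfer Protocol (HTTP) [17] is an application-level protocol for transferring data on the World Wide Web. As with DNS, threat actors also use HTTP as a command and control channel [29,33]. There has been some work that looks specifically at the features present in HTTP data. In [33], the authors use statistics (such as the average URL length) and string matching on URLs to cluster malware. Again, the specific differences between malware and benign HTTP sessions are not highlighted. [22] analyzes User-Agent field values specifically. We give a more detailed account of more HTTP fields and use this information to build a machine learning classifier for encrypted traffic.

A key assumption behind Cisco's use of DNS and HTTP is that malware relies on DGA and HTTP C&C.

Paper notes: Identifying Encrypted Malware Traffic with Contextual Flow Data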