Parsing a pcap with pyshark to extract HTTP headers

This article is a repost (see the original). 2021-12-18 15:03, tagged python

1. Introduction to pyshark

Python wrapper for tshark, allowing python packet parsing using wireshark dissectors.

There are quite a few python packet parsing modules, this one is different because it doesn't actually parse any packets, it simply uses tshark's (wireshark command-line utility) ability to export XMLs to use its parsing.

This package allows parsing from a capture file or a live capture, using all wireshark dissectors you have installed. Tested on windows/linux.
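
To make the idea concrete: tshark can emit each dissected packet as PDML, an XML format (`tshark -T pdml`), so a wrapper only has to walk that tree. The sketch below parses a hand-written, heavily trimmed PDML fragment with the standard library. The sample XML and the function name `layers_from_pdml` are illustrative assumptions, not pyshark's actual internals.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed PDML sample (real tshark output carries many more attributes).
SAMPLE_PDML = """
<pdml>
  <packet>
    <proto name="ip">
      <field name="ip.src" show="192.168.0.1"/>
      <field name="ip.dst" show="192.168.0.2"/>
    </proto>
    <proto name="udp">
      <field name="udp.srcport" show="1900"/>
    </proto>
  </packet>
</pdml>
"""

def layers_from_pdml(pdml_text):
    """Return one {layer: {field: value}} dict per <packet> element."""
    packets = []
    for packet in ET.fromstring(pdml_text).iter("packet"):
        layers = {}
        for proto in packet.findall("proto"):
            # Each <field> carries its dotted name and a display value
            fields = {f.get("name"): f.get("show") for f in proto.iter("field")}
            layers[proto.get("name")] = fields
        packets.append(layers)
    return packets

parsed = layers_from_pdml(SAMPLE_PDML)
print(parsed[0]["ip"]["ip.src"])
```

Because the dissection work is done by tshark, any protocol Wireshark understands shows up as another `<proto>` element without extra parsing code.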

2. Reading from a capture file

>>> import pyshark
>>> cap = pyshark.FileCapture('/tmp/mycapture.cap')
>>> cap
<FileCapture /tmp/mycapture.cap>
>>> print(cap[0])
Packet (Length: 698)
Layer ETH:
        Destination: aa:bb:cc:dd:ee:ff
        Source: 00:de:ad:be:ef:00
        Type: IP (0x0800)
Layer IP:
        Version: 4
        Header Length: 20 bytes
        Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
        Total Length: 684
        Identification: 0x254f (9551)
        Flags: 0x00
        Fragment offset: 0
        Time to live: 1
        Protocol: UDP (17)
        Header checksum: 0xe148 [correct]
        Source: 192.168.0.1
        Destination: 192.168.0.2

3. Reading from a live interface

>>> capture = pyshark.LiveCapture(interface='eth0')
>>> capture.sniff(timeout=50)
>>> capture
<LiveCapture (5 packets)>
>>> capture[3]
<UDP/HTTP Packet>

for packet in capture.sniff_continuously(packet_count=5):
    print('Just arrived:', packet)

Capturing from a live interface can be done in two ways: either use the sniff() method to capture a given number of packets (or for a given amount of time) and then read the packets from the capture object as a list, or use the sniff_continuously() method as a generator and work on each packet as it arrives. Another alternative is defining a callback for each received packet:

def print_callback(pkt):
    print('Just arrived:', pkt)

capture.apply_on_packets(print_callback, timeout=5)

The capture can also run on multiple interfaces if a list is provided, or all interfaces if no interface is provided. It can even be run through a remote interface using RemoteCapture.

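4. Attributes and methods

After creating a Capture object with LiveCapture or FileCapture, a number of methods and attributes become available at both the capture level and the packet level. PyShark's strength is that it can call on every packet dissector built into tshark.

Getting packet summaries (similar to tshark's capture output)

>>> for pkt in cap:
...:     print(pkt)
...: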
2 0.512323 0.512323 fe80::f141:48a9:9a2c:73e5 ff02::c SSDP 208 M-SEARCH * HTTP/
3 1.331469 0.819146 fe80::159a:5c9f:529c:f1eb ff02::c SSDP 208 M-SEARCH * HTTP/
4 2.093188 0.761719 192.168.1.1 239.255.255.250 SSDP 395 NOTIFY * HTTP/1.  0x0000 (0)
5 2.096287 0.003099 192.168.1.1 239.255.255.250 SSDP 332 NOTIFY * HTTP/1.  0x0000 (0)

Drilling into packet attributes layer by layer

In ipython, use dir(pkt), or type pkt. and press Tab.

>>> pkt.   #(tab auto-complete)
pkt.captured_length     pkt.highest_layer       pkt.ip                  pkt.pretty_print        pkt.transport_layer
pkt.eth                 pkt.http                pkt.layers              pkt.sniff_time          pkt.udp
pkt.frame_info          pkt.interface_captured  pkt.length              pkt.sniff_timestamp
>>>
>>> pkt[pkt.highest_layer].    #(tab auto-complete)
pkt_app.                 pkt_app.get_field_value  pkt_app.raw_mode         pkt_app.request_version
pkt_app.DATA_LAYER       pkt_app.get_raw_value    pkt_app.request
pkt_app.chat             pkt_app.layer_name       pkt_app.request_method
pkt_app.get_field        pkt_app.pretty_print     pkt_app.request_uri
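
Since layers and fields are exposed as attributes, asking for one that a packet does not have raises AttributeError. A small helper built on `getattr` can return a default instead. This is a minimal sketch: `field_or_default` is a hypothetical name, and the `SimpleNamespace` object is only a stand-in for a pyshark packet so the example runs without tshark.

```python
from types import SimpleNamespace

def field_or_default(pkt, layer_name, field_name, default=None):
    """Return pkt.<layer_name>.<field_name>, or default if either is missing."""
    layer = getattr(pkt, layer_name, None)
    if layer is None:
        return default
    return getattr(layer, field_name, default)

# Stand-in for a pyshark packet (assumption: real packets expose the same attributes).
fake_pkt = SimpleNamespace(
    highest_layer="HTTP",
    http=SimpleNamespace(request_method="GET", request_uri="/index.html"),
)

print(field_or_default(fake_pkt, "http", "request_method"))
print(field_or_default(fake_pkt, "dns", "qry_name", default="-"))
```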

The capture object

dir(cap)
Out[3]:
['apply_on_packets',
 'close',
 'current_packet',
 'display_filter',
 'encryption',
 'input_filename',
 'next',
 'next_packet']

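The really powerful parts here are the apply_on_packets() and next() methods. The next() method is what lets a capture object be traversed with a for loop.

apply_on_packets() is another way to traverse packets: it takes a function as its argument and applies it to every packet.

>>> cap = pyshark.FileCapture('test.pcap', keep_packets=False)
>>> def print_highest_layer(pkt):
...     print(pkt.highest_layer)
...
>>> cap.apply_on_packets(print_highest_layer)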
HTTP
HTTP
HTTP
HTTP
HTTP
... (truncated)

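The method is useful for more than printing, for example appending packets to a list for further processing. The script below adds every packet to a list and prints the total: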

import pyshark

def get_capture_count():
    p = pyshark.FileCapture('test.cap.pcap', keep_packets=False)

    count = []
    def counter(*args):
        count.append(args[0])

    p.apply_on_packets(counter, timeout=100000)

    return len(count)

print(get_capture_count())
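
When only totals matter, the callback can update a counter instead of holding on to every packet. The following is a sketch under that assumption: `make_layer_counter` is a hypothetical helper, and it relies only on the `highest_layer` attribute shown earlier. Stand-in packets are used so it runs without tshark.

```python
from collections import Counter
from types import SimpleNamespace

def make_layer_counter():
    """Return (callback, counts); the callback tallies pkt.highest_layer."""
    counts = Counter()

    def callback(pkt):
        counts[pkt.highest_layer] += 1

    return callback, counts

callback, counts = make_layer_counter()

# With pyshark this would be: cap.apply_on_packets(callback, timeout=100)
# Stand-in packets are fed in by hand here.
for layer in ["HTTP", "HTTP", "TCP", "DNS", "HTTP"]:
    callback(SimpleNamespace(highest_layer=layer))

print(counts.most_common())
```

Keeping a tally rather than a list fits well with keep_packets=False, since nothing but the counts stays in memory.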

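5. The FileCapture and LiveCapture modules

The two typical ways to analyze packets in PyShark are the FileCapture and LiveCapture modules.
The former imports packets from a saved capture file; the latter sniffs on a network interface of the local machine.
Both return a capture object. Later articles will cover it in detail.
Let's first look at how the two modules are used.

Both modules offer similar parameters to control which packets end up in the capture object. The definitions below are taken directly from the modules' docstrings:

interface: [LiveCapture only] The network interface to sniff on. If not given, the first available interface is used.
bpf_filter: [LiveCapture only] A BPF (tcpdump) filter to apply while sniffing.
input_file: [FileCapture only] Path to a saved capture file (PCAP or PCAPNG format).
keep_packets: Whether to keep previously read packets after next() is called. Used to save memory when reading large captures.
display_filter: A display filter (i.e. a Wireshark filter) to apply while reading the capture.
only_summaries: Produce packet summaries only. Much faster than a normal read, but carries very little information.
decryption_key: Optional key used to encrypt and decrypt captured traffic.
encryption_type: The encryption standard used in the captured traffic (must be one of WEP, WPA-PWD or WPA-PWK; defaults to WPA-PWK).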
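
These options are ordinary keyword arguments, so they can be collected in a dict and passed on in one go. The sketch below is illustrative only: `file_capture_options` is a hypothetical helper that uses just the parameter names listed above, and which options are accepted may vary between pyshark versions.

```python
def file_capture_options(path, display_filter=None, low_memory=True):
    """Build keyword arguments for pyshark.FileCapture from the options above."""
    options = {
        "input_file": path,
        # keep_packets=False keeps only the current packet in memory
        "keep_packets": not low_memory,
    }
    if display_filter:
        options["display_filter"] = display_filter
    return options

options = file_capture_options("test.pcap", display_filter="http")
print(options)

# Intended use (requires pyshark and tshark to be installed):
#   cap = pyshark.FileCapture(**options)
```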

5.1 The only_summaries option

With only_summaries, the packets in the returned capture object contain only summary information, similar to tshark's default output:

>>> cap = pyshark.FileCapture('test.pcap', only_summaries=True)
>>> print(cap[0])
2 0.512323 0.512323 fe80::f141:48a9:9a2c:73e5 ff02::c SSDP 208 M-SEARCH * HTTP/

>>> dir(cap[0])
['delta', 'destination', 'info', 'ip id', 'length', 'no', 'protocol', 'source', 'stream', 'summary_line', 'time', 'window']

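With this option, reading a capture file becomes very fast, but each packet carries only the attributes below. If you want to collect the IP addresses in a capture to build a conversation list, or use time and packet length to compute bandwidth statistics, this information is enough.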

>>> pkt.     #(tab auto-complete)
pkt.delta         pkt.info          pkt.no            pkt.stream        pkt.window
pkt.destination   pkt.ip id         pkt.protocol      pkt.summary_line
pkt.host          pkt.length        pkt.source        pkt.time

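delta : Difference between the capture time of this packet and the previous one.
destination : Destination address at the IP layer.
info : Short summary of the application-layer data (e.g. "HTTP GET /resource_folder/page.html").
ip id : The IP identification field.
length : Packet length in bytes.
no : Index of the packet in the list.
protocol : Highest-level protocol identified in the packet. (Translator's note: for an HTTP packet carrying JSON data, this may be JSON rather than HTTP.)
source : Source address at the IP layer.
stream : Index identifying which TCP stream the packet belongs to (TCP packets only).
summary_line : All summary attributes joined into a single tab-separated string.
time : Difference between this packet's arrival time and that of the first packet.
window : TCP window size (TCP packets only).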
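
As a rough illustration of the bandwidth and conversation-list use cases mentioned above, the sketch below works on any objects that carry the `time`, `length`, `source` and `destination` attributes. It assumes those attributes arrive as strings, hence the explicit conversions. The function names and the stand-in packets are hypothetical.

```python
from types import SimpleNamespace

def average_throughput(summaries):
    """Average bytes per second between the first and last summary packet."""
    summaries = list(summaries)
    if len(summaries) < 2:
        return 0.0
    total_bytes = sum(int(p.length) for p in summaries)
    duration = float(summaries[-1].time) - float(summaries[0].time)
    return total_bytes / duration if duration > 0 else 0.0

def conversations(summaries):
    """Set of (source, destination) address pairs seen in the capture."""
    return {(p.source, p.destination) for p in summaries}

# Stand-in summary packets so the sketch runs without tshark.
fake = [
    SimpleNamespace(time="0.000000", length="100", source="10.0.0.1", destination="10.0.0.2"),
    SimpleNamespace(time="1.000000", length="200", source="10.0.0.2", destination="10.0.0.1"),
    SimpleNamespace(time="2.000000", length="300", source="10.0.0.1", destination="10.0.0.2"),
]

print(average_throughput(fake))
print(sorted(conversations(fake)))
```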

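5.2 The keep_packets option

PyShark reads a packet into memory only when it is about to be processed. As you work through the packets, PyShark appends each one to the end of a list attribute on the capture object called _packets. With a large number of packets this list takes up a lot of memory, so PyShark provides this option to keep only one packet in memory at a time. If keep_packets is set to False (the default is True), PyShark clears the previous packet from memory when it reads a new one. I found this gives a small speedup when iterating over packets, and every bit helps!

5.3 display_filter and bpf_filter

These filters help your application concentrate on the content you actually want to analyze. Just as when sniffing with Wireshark or tshark, a BPF filter determines which traffic makes it into the returned capture object.
BPF filters are less flexible than Wireshark's display filters, but you can still be creative with the limited set of keywords and offset filters.
For a more detailed explanation of BPF filters, see the official Wireshark documentation.

Here is an example that uses a BPF filter to sniff HTTP traffic: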

>>> cap = pyshark.LiveCapture(interface='en0', bpf_filter='ip and tcp port 80')
>>> cap.sniff(timeout=5)
>>> cap
<LiveCapture (21 packets)>
>>> print(cap[5].highest_layer)
HTTP

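When reading a saved capture file, you can set the display_filter option to limit the returned packets using Wireshark's powerful dissectors.
Here are the first few packets of my test.pcap file without any filter:

>>> cap = pyshark.FileCapture('test.pcap')
>>> for pkt in cap:
...:    print(pkt.highest_layer)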
...:
HTTP
HTTP
HTTP
TCP
HTTP
... (truncated)

With a display filter that restricts the output to DNS traffic:

>>> cap = pyshark.FileCapture('test.pcap', display_filter="dns")
>>> for pkt in cap:
...:    print(pkt.highest_layer)
...:
DNS
DNS
DNS
DNS
DNS
... (truncated)
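
The two filter languages use different syntax for the same intent. As a small sketch, the hypothetical helpers below build a BPF expression (for bpf_filter) and the corresponding Wireshark display filter (for display_filter) that match TCP traffic on a list of ports.

```python
def bpf_for_tcp_ports(ports):
    """BPF (capture) filter matching TCP traffic on any of the given ports."""
    return " or ".join(f"tcp port {int(p)}" for p in ports)

def display_filter_for_tcp_ports(ports):
    """Wireshark display filter matching the same ports."""
    return " || ".join(f"tcp.port == {int(p)}" for p in ports)

print(bpf_for_tcp_ports([80, 8080]))
print(display_filter_for_tcp_ports([80, 8080]))
```

The first string would go to LiveCapture's bpf_filter and the second to display_filter. The BPF version is applied at capture time, so packets that do not match never reach the capture object.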

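6. Dynamic layer references

The dynamic layer attributes mentioned above (such as transport_layer and highest_layer) give us more flexibility when analyzing packets.
If you try to access pkt.dns.qry_resp on every packet, any packet that is not a DNS packet raises AttributeError. The transport layer has a similar problem, because it may be either TCP or UDP. We can use the dynamically referenced layer attributes to get the source and destination addresses, then use try/except to handle packets that are neither TCP nor UDP.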

import pyshark

cap = pyshark.FileCapture('test.pcap')

def print_conversation_header(pkt):
    try:
        protocol =  pkt.transport_layer
        src_addr = pkt.ip.src
        src_port = pkt[pkt.transport_layer].srcport
        dst_addr = pkt.ip.dst
        dst_port = pkt[pkt.transport_layer].dstport
        print('%s  %s:%s --> %s:%s' % (protocol, src_addr, src_port, dst_addr, dst_port))
    except AttributeError:
        # Ignore packets that aren't TCP/UDP or IPv4
        pass

cap.apply_on_packets(print_conversation_header, timeout=100)

The script prints:

UDP  10.10.10.12:51554 --> 239.255.255.250:1900
UDP  10.10.10.12:51554 --> 239.255.255.250:1900
UDP  10.10.10.15:58803 --> 8.8.8.8:53
UDP  8.8.8.8:53 --> 10.10.10.15:58803
TCP  10.10.10.15:58632 --> 192.168.20.197:80
TCP  192.168.20.197:80 --> 10.10.10.15:58632
TCP  10.10.10.15:58632 --> 192.168.20.197:80

7. Parsing the HTTP header

import pyshark
import pandas as pd

cap = pyshark.FileCapture('./data/SUEE1.pcap', display_filter='http')

# Column names for the output DataFrame / CSV
output_csv_field_names = ['host', 'request method', 'request uri', 'request version', 'request full uri', 'user agent', 'referer']

# A for loop iterates over the packets in the pcap.
# Fields are resolved dynamically, so a packet that lacks one raises AttributeError:
"""
for pkt in cap:
    try:
        print(pkt.http.host)
    except AttributeError as e:
        print("Tag was not found")
"""

# I could not find pyshark documentation for these attributes and methods;
# the ones used below were discovered with the dir() function.
# The Wireshark documentation may describe them, since pyshark relies on tshark.

# Collect one dict per packet and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0).
rows = []
for pkt in cap:
    try:
        http_layer_pkt = pkt.http
        rows.append(
            {
                'host': http_layer_pkt.host,
                'request method': http_layer_pkt.request_method,
                'request uri': http_layer_pkt.request_uri,
                'request version': http_layer_pkt.request_version,
                'request full uri': http_layer_pkt.request_full_uri,
                'user agent': http_layer_pkt.user_agent,
                'referer': http_layer_pkt.referer
            }
        )
    except AttributeError:
        # Skip packets that lack any of these fields (e.g. HTTP responses)
        continue

output_df = pd.DataFrame(rows, columns=output_csv_field_names)
print(output_df.head())
print(output_df.shape)
output_df.to_csv("./data/pcap_http_header.csv")

"""
In [11]: dir(cap[0].http)
Out[11]: 
['', 'DATA_LAYER', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_fields', '_field_prefix', '_get_all_field_lines', '_get_all_fields_with_alternates', '_get_field_or_layer_repr', '_get_field_repr', '_layer_name', '_sanitize_field_name', '_ws_expert', '_ws_expert_group', '_ws_expert_message', '_ws_expert_severity', 'chat', 'field_names', 'get', 'get_field', 'get_field_by_showname', 'get_field_value', 'host', 'layer_name', 'pretty_print', 'raw_mode', 'referer', 'request', 'request_full_uri', 'request_line', 'request_method', 'request_number', 'request_uri', 'request_version', 'user_agent']
"""
"""
In [12]: cap[0].http.field_names
Out[12]: ['', '_ws_expert', 'chat', '_ws_expert_message', '_ws_expert_severity', '_ws_expert_group', 'request_method', 'request_uri', 'request_version', 'host', 'request_line', 'user_agent', 'referer', 'request_full_uri', 'request', 'request_number']"""
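
Note that the loop above drops a request entirely if it lacks any one of the seven fields, for instance a request sent without a Referer header. If keeping such packets is preferable, one option is to fill missing fields with None. The sketch below assumes the field names from the dir() output above. `http_header_row` is a hypothetical helper and the stand-in layer exists only so the example runs without tshark.

```python
from types import SimpleNamespace

HTTP_FIELDS = ["host", "request_method", "request_uri", "request_version",
               "request_full_uri", "user_agent", "referer"]

def http_header_row(http_layer, fields=HTTP_FIELDS):
    """Map each field name to its value, using None where the layer lacks it."""
    return {name: getattr(http_layer, name, None) for name in fields}

# Stand-in for pkt.http on a request that carries no Referer header.
fake_http = SimpleNamespace(
    host="example.com",
    request_method="GET",
    request_uri="/",
    request_version="HTTP/1.1",
    request_full_uri="http://example.com/",
    user_agent="curl/8.0",
)

row = http_header_row(fake_http)
print(row["host"], row["referer"])
```

Rows built this way can be appended to a list and turned into a DataFrame in one step, with the missing values showing up as empty cells in the CSV.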
