Linux探測工具BCC(可觀測性)

本文轉載自查看原文 2020-07-08 10:08 2218 監控告警/ linux/ ebpf

BCC(可觀測性)

BCC(可觀測性)

PS：學習BCC源自這篇文章

簡介

BCC是一個用於跟蹤內核和操作程序的工具集，其軟件包中包含了一些有用的工具和例子，它擴展了BPF(Berkeley Packet Filters)，通常被稱為eBPF , 在Linux3.15中首次引入，但大多數BCC的功能需要Libux4.1及以上版本。

BCC使用C語言進行內核檢測，並使用Python和lua作為前端簡化了BPF程序的編寫，它適用於多種任務，如性能分析和網絡流量控制。

BCC支持的分析工具示意圖如下，更多參見官方文檔。

動機

BPF保證加載到內核中的程序不會導致內核崩潰，且不會永遠運行，但BPF本身的通用性又可以支持執行很多類型的計算。可以使用C語言編寫一個有效的BPF程序，當然也可以使用C語言編寫一個無效的BPF程序。在運行程序前，用戶無法知道他的程序是有效的還是無效的。

使用特定的BPF前端后，用戶可以通過編寫腳本並從編譯器獲得反饋。該工具集的目的是提供一個前端，用於創建有效的BPF程序，同時不影響它的靈活性。

目前與BPF集成的工作流比較混亂，有時候會涉及編譯內核源碼樹。該工具集可以減少開發者花費到BPF編譯的時間，將精力聚集到編寫的應用本身，並解決BPF的問題。

該工具集的特性包括：

(共享庫中)端到端的BPF工作流
- 為BPF后端提供改進的C語言
- 與llvm-bpf后端集成，用於JIT
- 動態加載(卸載)JITed 程序
- 支持BPF內核鈎子：socket過濾，tc分類器，tc動作和kprobes
與python綁定使用
socket過濾，tc分類器和kprobes的例子
自包含的工具，用於跟蹤一個運行的系統

未來BCC會支持綁定更多的語言。

版本要求

BCC對內核版本有嚴格的要求，BCC提供的功能需要對應的內核支持才能使用，因此盡量使用最新發布版本的Linux內核。詳細對應關系，參見BPF Features by Linux Kernel Version

安裝

本次按照官方文檔，使用centos7.6操作系統安裝bcc

安裝依賴

sudo yum install -y epel-release
sudo yum update -y
sudo yum groupinstall -y "Development tools"
sudo yum install -y elfutils-libelf-devel cmake3 git bison flex ncurses-devel
sudo yum install -y luajit luajit-devel  # for Lua support

安裝和編譯LLVM

curl  -LO  http://releases.llvm.org/7.0.1/llvm-7.0.1.src.tar.xz
curl  -LO  http://releases.llvm.org/7.0.1/cfe-7.0.1.src.tar.xz
tar -xf cfe-7.0.1.src.tar.xz
tar -xf llvm-7.0.1.src.tar.xz

mkdir clang-build
mkdir llvm-build

cd llvm-build
cmake3 -G "Unix Makefiles" -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
  -DCMAKE_BUILD_TYPE=Release ../llvm-7.0.1.src
make
sudo make install

cd ../clang-build
cmake3 -G "Unix Makefiles" -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
  -DCMAKE_BUILD_TYPE=Release ../cfe-7.0.1.src
make
sudo make install
cd ..

或通過centos-release-scl安裝(推薦，安裝比較快)

yum install -y centos-release-scl
yum-config-manager --enable rhel-server-rhscl-7-rpms
yum install -y devtoolset-7 llvm-toolset-7 llvm-toolset-7-llvm-devel llvm-toolset-7-llvm-static llvm-toolset-7-clang-devel
source scl_source enable devtoolset-7 llvm-toolset-7

注：重啟之后需要重新執行source scl_source enable devtoolset-7 llvm-toolset-7

安裝和編譯BCC

git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake3 ..
make
sudo make install

上述命令執行成功后，bcc會被安裝在/usr/share/bcc路徑下。

windows源碼查看

windows下可以直接使用如下命令安裝BCC即可查看源碼：

> pip install bcc

BCC的基本使用

本節看下BCC自帶的部分工具(/usr/share/bcc/tool)的基本用法：

execsnoop
opensnoop
ext4slower (or btrfs, xfs, zfs*)
biolatency
biosnoop
cachestat
tcpconnect
tcpaccept
tcpretrans
runqlat
profile

工具講解

execsnoop

execsnoop會以行輸出每個新的進程，用於檢查生命周期比較短的進程，這些進程會消耗CPU資源，但不會出現在大多數以周期性采集正在運行的進程快照的監控工具中。

該工具會跟蹤exec()系統調用，而不是fork()，因此它能夠跟蹤大部分新創建的進程，但不是所有的進程(它無法跟蹤一個應用啟動的工作進程，因為此時沒有用到exec())。

# ./execsnoop
PCOMM            PID    RET ARGS
bash             15887    0 /usr/bin/man ls
preconv          15894    0 /usr/bin/preconv -e UTF-8

輸出顯示了父進程的命令名(PCOMM)，PID，exec()調用后的返回值(RET)，命令行參數(ARGS)

examples:
    ./execsnoop           # trace all exec() syscalls
    ./execsnoop -x        # include failed exec()s
    ./execsnoop -T        # include time (HH:MM:SS)
    ./execsnoop -U        # include UID
    ./execsnoop -u 1000   # only trace UID 1000
    ./execsnoop -u root   # get root UID and trace only this
    ./execsnoop -t        # include timestamps
    ./execsnoop -q        # add "quotemarks" around arguments
    ./execsnoop -n main   # only print command lines containing "main"
    ./execsnoop -l tpkg   # only print command where arguments contains "tpkg"
    ./execsnoop --cgroupmap mappath  # only trace cgroups in this BPF map
    ./execsnoop --mntnsmap mappath   # only trace mount namespaces in the map

更多用法參見example。

注解：該工具可以看作是對ps命令的擴展，可以記錄進程的變更信息。

opensnoop

opensnoop以行輸出每個open()系統調用及其詳細信息。

通過打開的文件可以了解到一個應用是如何工作的，確定其數據文件，配置文件和log文件等。有時候應用會因為要讀取一個不存在的文件而導致異常，可以通過opensnoop命令快速定位問題。

$ ./opensnoop -T
TIME(s)       PID    COMM               FD ERR PATH
0.000000000   977    AliYunDun          23   0 /var/log/secure
0.000373000   16884  opensnoop          -1   2 /usr/lib64/python2.7/encodings/ascii.so
0.000380000   16884  opensnoop          -1   2 /usr/lib64/python2.7/encodings/asciimodule.so
0.000385000   16884  opensnoop          12   0 /usr/lib64/python2.7/encodings/ascii.py
0.000391000   16884  opensnoop          13   0 /usr/lib64/python2.7/encodings/ascii.pyc
0.255241000   977    AliYunDun          23   0 /proc/977/stat
0.255329000   977    AliYunDun          23   0 /sys/devices/system/cpu
0.255380000   977    AliYunDun          23   0 /proc/977/stat
0.600429000   977    AliYunDun          23   0 /var/log/secure
1.200850000   977    AliYunDun          23   0 /var/log/secure

opensnoop會按照順序輸出進程打開的文件

examples:
    ./opensnoop           # trace all open() syscalls
    ./opensnoop -T        # include timestamps
    ./opensnoop -U        # include UID
    ./opensnoop -x        # only show failed opens
    ./opensnoop -p 181    # only trace PID 181
    ./opensnoop -t 123    # only trace TID 123
    ./opensnoop -u 1000   # only trace UID 1000
    ./opensnoop -d 10     # trace for 10 seconds only
    ./opensnoop -n main   # only print process names containing "main"
    ./opensnoop -e        # show extended fields
    ./opensnoop -f O_WRONLY -f O_RDWR  # only print calls for writing
    ./opensnoop --cgroupmap mappath  # only trace cgroups in this BPF map
    ./opensnoop --mntnsmap mappath   # only trace mount namespaces in the map

更多用法參見example。

ext4slower (or btrfs, xfs, zfs*)

ext4slower會跟蹤ext4的讀，寫，打開和fsyncs操作，並在超過閾值后打印信息。

對於識別或排除性能問題非常有用：通過文件系統顯示獨立的慢速磁盤I/O。由於磁盤處理I/O是異步的，因此很難將該層的延遲與應用程序的延遲聯系起來。使用本工具，可以在 VFS -> 文件系統接口層面進行跟蹤，將更接近應用的問題所在。

$ ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:35:01 cron           16464  R 1249    0          16.05 common-auth
06:35:01 cron           16463  R 1249    0          16.04 common-auth
06:35:01 cron           16465  R 1249    0          16.03 common-auth
06:35:01 cron           16465  R 4096    0          10.62 login.defs

延遲的評估從發起VFS接口到文件系統的操作開始，到該操作結束為止。中間包括多個過程：塊設備I/O，系統CPU周期，文件系統鎖，運行隊列延遲等。相比評估塊設備接口的延遲更加合理。

注：本工具僅會跟蹤前面列出的通用的文件系統操作，不包含其他文件系統操作(如inode操作)。

類似的用於其他系統的工具還有btrfsslower, xfsslower, 和zfsslower。此外還有fileslower命令，用於跟蹤基於文件的同步讀寫。

默認顯示慢於10ms的I/O操作。

更多用法參見example。

biolatency

biolatency會跟蹤塊設備的I/O延遲，並以直方圖顯示延遲分布，相比iostat給出的平均延遲更加直觀。

# ./biolatency
Tracing block device I/O... Hit Ctrl-C to end.
     usecs           : count     distribution
       0 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 12       |********                              |
     256 -> 511      : 15       |**********                            |
     512 -> 1023     : 43       |*******************************       |
    1024 -> 2047     : 52       |**************************************|
    2048 -> 4095     : 47       |**********************************    |
    4096 -> 8191     : 52       |**************************************|
    8192 -> 16383    : 36       |**************************            |
   16384 -> 32767    : 15       |**********                            |
   32768 -> 65535    : 2        |*                                     |
   65536 -> 131071   : 2        |*                                     |

上面展示了從128微秒到32767微秒之間的I/O延遲分布，最高的延遲為65毫秒到131毫秒之間的2個IO

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
    ./biolatency -F         # show I/O flags separately

更多用法參見example。

biosnoop

biosnoop跟蹤每個塊設備的I/O，並打印每個I/O的延遲。

$ ./biosnoop
TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
0.000000    jbd2/vda1-8    301    vda     W 17073104   40960     0.65
0.000479    jbd2/vda1-8    301    vda     W 17073184   4096      0.42
3.071857    kworker/u4:0   17124  vda     W 2048       12288     0.51
3.071927    kworker/u4:0   17124  vda     W 4196360    4096      0.57

使用時可以結合biolatency，首先使用biolatency -D找出延遲大的磁盤，然后使用biosnoop找出導致延遲的進程。

更多用法參見example。

cachestat

cachestat每秒會打印文件系統cache的概要信息。使用該工具可以定位低緩存命中率和高miss率。

$ ./cachestat.py
    HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
       1        0        0  100.00%            5        191
     198    12136        0    1.61%            5        238
       1    11007        3    0.01%            5        281
       0     6384        0    0.00%            5        306
       1    14464        0    0.01%            5        363
       0    11776        0    0.00%            5        409
       1    11712        0    0.01%            5        454
      32    13184        0    0.24%            5        506
       0    11232        0    0.00%            5        550
       1    13056        0    0.01%            5        601
      16    14720        0    0.11%            5        658
      33     9920        0    0.33%            5        697
       0    13248        0    0.00%            5        749
       4    14144        0    0.03%            5        804
       0     9728        0    0.00%            5        842
       1    10816        0    0.01%            5        885
     808    13504        1    5.65%            5        938
       0    11409        0    0.00%            5        982
       0    11520        0    0.00%            5       1027
       0    15616        0    0.00%            5       1088
       1     9792        0    0.01%            5       1126
       0     8256        0    0.00%            5       1158
       1     9600        0    0.01%            5       1196
     599     4804        0   11.09%            5       1215
       1        0        0  100.00%            5       1215
       0        0        0    0.00%            5       1215
       3        1        0   75.00%            5       1215
   79536       34        0   99.96%            5       1215
   87693      274        4   99.69%            6       1214
   89018     3546        0   96.17%            7       1227
   33531      201        4   99.40%            7       1228
      22       44        0   33.33%            8       1228
       0        0        0    0.00%            8       1228
      73       21        2   77.66%            8       1228

上述首次花費了24s來讀取1GB的文件，體現為高MISSES ，低HITRATIO；第二次讀取時僅花費了4s，HITRATIO接近99%。HITRATIO=HITS/(HITS+MISSES)

tcpconnect+tcpaccept+tcpretrans

推薦使用ss命令查看，ss -i可以看到重傳信息。

runqlat

以直方圖的方式顯示CPU run隊列中的現存等待的時間。用於評估在CPU飽和時等待的時長。

$ ./runqlat -p 980 -L
Tracing run queue latency... Hit Ctrl-C to end.
^C

tid = 1054
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 5        |****************************************|
         8 -> 15         : 1        |********                                |

tid = 1055
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 3        |****************************************|
         4 -> 7          : 2        |**************************              |
         8 -> 15         : 1        |*************                           |
...

上面展示了進程980中的線程的等待時長的分布情況。進程等待時間過程可能是被其他CPU密集型任務占用了CPU。

examples:
    ./runqlat            # summarize run queue latency as a histogram
    ./runqlat 1 10       # print 1 second summaries, 10 times
    ./runqlat -mT 1      # 1s summaries, milliseconds, and timestamps
    ./runqlat -P         # show each PID separately
    ./runqlat -p 185     # trace PID 185 only

profile

推薦使用strace和perf trace命令

BCC的開發

本節介紹使用Python接口進行BCC的開發，包括兩部分：可觀測性和網絡。

可觀測性

Lesson 1. Hello World

執行examples/hello_world.py，並在另一個會話中執行一些命令(如ls)，此時會打印"Hello,World!".

# ./examples/hello_world.py
            bash-13364 [002] d... 24573433.052937: : Hello, World!
            bash-13364 [003] d... 24573436.642808: : Hello, World!
[...]

hello_world.py中的代碼如下：

from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()

包含如下6個內容：

text='...': 定義了一個使用C編寫的內聯的BPF程序。
kprobe__sys_clone(): 這是通過kprobes對內核進行動態跟蹤的一種捷徑。如果C函數以kprobe__開頭，剩下的部分將作為內核函數名來處理，本例為sys_clone()，
void *ctx: ctx包含參數，但鑒於此處沒有使用，因此將其強轉為void *
bpf_trace_printk(): 將printf()輸出到通用的trace_pipe (/sys/kernel/debug/tracing/trace_pipe)的一個簡單內核工具。該函數對於一些場景是合適的，但也有限制：最多3個參數，僅有一個%s，且trace_pipe是全局共享的，因此並行的程序會導致輸出混亂。最好采用BPF_PERF_OUTPUT()接口，后續會介紹。
return 0; kprobe__sys_clone函數的出參類型為int，必須返回一個值。
.trace_print(): 一個bcc例程，從trace_pipe中讀取，並打印輸出。

Lesson 2. sys_sync()

嘗試將跟蹤的系統調用改為sys_sync()，當執行sync命令時會輸出"sys_sync() called"

#!/usr/bin/python
from bcc import BPF
BPF(text='int kprobe__sys_sync(void *ctx) { bpf_trace_printk("sys_sync() called!\\n"); return 0; }').trace_print()

Lesson 3. hello_fields.py

文件路徑為examples/tracing/hello_fields.py。運行時的輸出如下：

$ ./hello_fields.py
TIME(s)            COMM             PID    MESSAGE
546.560775000      bash             1150   Hello, World!
547.127956000      bash             1150   Hello, World!

源碼如下：

from bcc import BPF

# define BPF program
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

內容與hello_world.py類似，通過sys_clone()跟蹤新的進程，但增加了新的內容：

prog =: 將C程序定義為一個變量，后面會引用它，便於使用命令行參數時進行字符串替換。
hello(): 此處定義了一個C函數，而沒有直接使用kprobe__開頭的系統調用，后續會引用它。定義在BPF程序中的C函數將會運行在一個probe上，這些函數都使用pt_reg* ctx作為第一個參數。如果需要定義一些不在probe上執行的輔助函數，則需要將這些函數定義為static inline，以便編譯器進行內聯，有時還需要向其添加_always_inline function屬性。
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello"):為內核clone系統調用函數創建一個kprobe，該kprobe會執行自定義的hello()函數。可以通過多次執行attach_kprobe() ，將自定義的C函數附加到多個內核函數上。
b.trace_fields():從trace_pipe返回固定的字段集，與trace_print()類似，對黑客來說很方便，但實際作為工具時，推薦使用BPF_PERF_OUTPUT()。

Lesson 4. sync_timing.py

還記得系統管理員在重啟前在一個慢控制台上輸入sync三次的日子嗎，這些操作是為了給第一個異步sync完成的時間？有人認為直接執行一行代碼sync;sync;sync更加便利，這種方式后來成為了行業慣例，盡管它違背了最初的目的。此時sync變成了同步執行的指令，這看起來很愚蠢。

下面例子展示了do_sync函數被調用的速度，並打印輸出(如果最近調用的時間超過一秒鍾)，sync;sync;sync會打印輸出第二次和第三次sync。

# ./examples/tracing/sync_timing.py
Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last 95 ms ago
At time 0.10 s: multiple syncs detected, last 96 ms ago

代碼來自 examples/tracing/sync_timing.py。

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

BPF_HASH(last); #初始化一個哈希，用於保存時間，一個哈希可以保存多個key/value對，使用items()進行迭代

int do_trace(struct pt_regs *ctx) {
    u64 ts, *tsp, delta, key = 0;

    // attempt to read stored timestamp
    tsp = last.lookup(&key); //查找是否存在key，哈希在保存時采納的時key的值，而不是key的地址，因此要保證key的值是唯一的
    if (tsp != NULL) {
        delta = bpf_ktime_get_ns() - *tsp;// 打印兩次調用之間的時間差，納秒
        if (delta < 1000000000) {
            // output if time is less than 1 second
            bpf_trace_printk("%d\\n", delta / 1000000);
        }
        last.delete(&key);
    }

    // update stored timestamp
    ts = bpf_ktime_get_ns(); //將當前時間保存到鍵為key的hash中
    last.update(&key, &ts);
    return 0;
}
""")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")#追蹤sync命令
print("Tracing for quick sync's... Ctrl-C to end")

# format output
start = 0
while 1:
    (task, pid, cpu, flags, ts, ms) = b.trace_fields()
    if start == 0:
        start = ts
    ts = ts - start
    print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))

內容包括：

bpf_ktime_get_ns(): 返回納秒為單位的時間
BPF_HASH(last):創建一個hash（關聯數組）的BPF map對象，稱為”last“，由於沒有進一步指定任何參數，因此默認為u64類型的key和value。
key = 0:只在該hash中存儲一個key/value對，其中key為零。
last.lookup(&key): 在hash中查找key，如果存在，則返回對應的值，否則返回NULL。上面將key作為指向指針的地址傳入。
if (tsp != NULL) {: 校驗指針是否為空，如果不為空，則表示該函數並非首次調用，可以計算兩次調用的時間差
last.delete(&key):刪除鍵，原因是內核的.update()存在的bug。
last.update(&key, &ts):將第二個參數表示的時間值保存到鍵為key的hash中，覆蓋之前的數值。

Lesson 5. sync_count.py

修改sync_timing.py程序，保存所有sync系統調用的次數，並打印輸出，修改如下(注意運行不能帶中文字符)：

#!/usr/bin/python

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

BPF_HASH(last);
BPF_HASH(count); //創建一個新的count

int do_trace(struct pt_regs *ctx) {
    u64 curct=1,ts, *tsp, delta, key = 0,key1 = 1;
    u64 *ctp;

    // attempt to read stored timestamp
    tsp = last.lookup(&key);
    ctp = count.lookup(&key1); //獲取保存的地址，一個key對應一個變量地址
    if (NULL != ctp){          //如果存在則累加，並更新
        curct = *ctp+1;   //注意：此處不能使用*curct++，可能是因為內核數據的原因，只能通過count.update更新
		count.update(&key1, &curct); 
    }else{
		count.update(&key1, &curct);//如果不存在直接保存當前值1
    }

    if (tsp != NULL) {
        delta = bpf_ktime_get_ns() - *tsp;
        if (delta < 1000000000) {
            // output if time is less than 1 second
            bpf_trace_printk("count:%d,%d\\n", curct,delta / 1000000); //打印count
        }
        last.delete(&key);
    }

    // update stored timestamp
    ts = bpf_ktime_get_ns();
    last.update(&key, &ts);
    return 0;
}
""")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")

# format output
start = 0
while 1:
    (task, pid, cpu, flags, ts, ms) = b.trace_fields()
    if start == 0:
        start = ts
    ts = ts - start
    print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))

輸出結果如下：

Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last count:2,3 ms ago
At time 0.00 s: multiple syncs detected, last count:3,0 ms ago
At time 0.41 s: multiple syncs detected, last count:4,413 ms ago
At time 0.42 s: multiple syncs detected, last count:5,2 ms ago
At time 0.42 s: multiple syncs detected, last count:6,0 ms ago

Lesson 6. disksnoop.py

執行examples/tracing/disksnoop.py ，部分輸出結果如下：

$ ./disksnoop.py
TIME(s)            T  BYTES    LAT(ms)
4011.638426000     W  65536       1.01
4011.638754000     W  65536       1.33
4011.639049000     W  65536       1.62
4011.639490000     W  65536       2.06
4011.639511000     W  65536       2.08
4011.639750000     W  65536       2.32

代碼為：

[...]
REQ_WRITE = 1		# from include/linux/blk_types.h

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *); //此處的value為一個指針。由於trace_start和trace_completion中的入參req其實是同一個地址，因此此處可以使用指針

void trace_start(struct pt_regs *ctx, struct request *req) {
	// stash start timestamp by request ptr
	u64 ts = bpf_ktime_get_ns();

	start.update(&req, &ts); //使用req的地址作為key
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
	u64 *tsp, delta;

	tsp = start.lookup(&req);
	if (tsp != 0) {
		delta = bpf_ktime_get_ns() - *tsp;
		bpf_trace_printk("%d %x %d\\n", req->__data_len,
		    req->cmd_flags, delta / 1000);
		start.delete(&req);
	}
}
""")

b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_completion")
[...]

包含內容：

REQ_WRITE：此處在python程序中定義了一個內核常量，該變量來自include/linux/blk_types.h，可以直接使用
trace_start(struct pt_regs *ctx, struct request *req)：該函數后續會添加到kprobes中。第一個參數struct pt_regs *ctx用於通過kprpbe函數注冊BPF上下文，第二個參數才是給該函數的參數。此處將該函數附加到blk_start_request上，其首個參數類型為struct request *。
start.update(&req, &ts): 使用指向request結構體的指針作為hash的key，使用指向結構體的指針作為key是個不錯的選擇，因為指針的地址是唯一的(需要注意釋放和重用場景)。此處是給描述磁盤I/O的request結構體打上時間標簽。還有兩個常用的鍵用於存儲時間戳：指向結構體的指針和線程ID。
req->__data_len:此處訪問了struct request的成員，可以在內核源代碼中查看它的定義，並了解其中有哪些成員。bcc實際上會將這些表達重寫為一系列的bpf_probe_read_kernel()調用，有時bcc無法處理復雜的引用，此時需要直接調用bpf_probe_read_kernel()

注：上述程序使用trace_start和trace_completion對訪問磁盤的操作進行打點，blk_start_request，blk_mq_start_request和blk_account_io_completion都是內核函數，具體實現可以在linux源碼中找到，對函數的描述如下(來自內核v5.7.7版本)：
/**
 * blk_mq_start_request - Start processing a request
 * @rq: Pointer to request to be started
 *
 * Function used by device drivers to notify the block layer that a request
 * is going to be processed now, so blk layer can do proper initializations
 * such as starting the timeout timer.
 */
 
 /**
 * blk_mq_complete_request - end I/O on a request
 * @rq:		the request being processed
 *
 * Description:
 *	Ends all I/O on a request. It does not handle partial completions.
 *	The actual completion happens out-of-order, through a IPI handler.
 **/
在內核5.7.7版本中已經不存在blk_start_request函數，因此在最新的bcc 0.1.10版本中上述程序修改為：
if BPF.get_kprobe_functions(b'blk_start_request'): #首先驗證該內核函數是否存在
    b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_completion")
bcc使用get_kprobe_functions函數校驗內涵函數是否存在，可以看到是在/proc/kallsyms中進行檢查的，/proc/kallsyms保存了Linux內核符號表。

Lesson 7. hello_perf_output.py

最后使用bpf_trace_printk()和BPF_PERF_OUTPUT()接口，這也意味着將不再使用trace_field()獲取如PID和時間戳等信息。

執行examples/tracing/hello_perf_output.py的結果如下：

# ./hello_perf_output.py
TIME(s)            COMM             PID    MESSAGE
0.000000000        bash             22986  Hello, perf_output!
0.021080275        systemd-udevd    484    Hello, perf_output!
0.021359520        systemd-udevd    484    Hello, perf_output!
0.021590610        systemd-udevd    484    Hello, perf_output!
[...]

代碼如下：

from bcc import BPF

# define BPF program
prog = """
#include <linux/sched.h>

// define output data structure in C
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

int hello(struct pt_regs *ctx) {
    struct data_t data = {};

    data.pid = bpf_get_current_pid_tgid(); //獲取執行該命令的進程ID
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    events.perf_submit(ctx, &data, sizeof(data));

    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# process event
start = 0
def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
            start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
        "Hello, perf_output!"))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()

需要掌握的內容為：

struct data_t: 定義了將數據從內核空間傳遞到用戶空間的C結構體
BPF_PERF_OUTPUT(events): 將輸出channel命名為"events"
struct data_t data = {};: 創建一個空的data_t結構體
bpf_get_current_pid_tgid(): 獲取低32為的進程ID(內核角度的PID，在用戶空間體現為線程ID)，線程組ID和高32位(在用戶空間體現為PID)。保存低32位的內容，並丟棄高32位的內容。那么應該呈現的是PID還是TGID？對於一個多線程app，其TGID是相同的，因此需要使用PID來進行區分。
bpf_get_current_comm(): 用當前進程名填充第一個參數地址。在上述例子中打點的系統調用位"clone"，在執行clone時會填充clone的進程的PID，時間戳和進程名，因此進程名和PID是一致的。
events.perf_submit():通過一個perf ring buffer將事件交給用戶空間。
def print_event(): 定義一個Python函數，用於從事件流中處理事件
b["events"].event(data): 通過C聲明自動將事件轉為為Python對象
b["events"].open_perf_buffer(print_event): 將Python的print_event與事件流關聯起來。
while 1: b.perf_buffer_poll():阻塞等待事件

Lesson 8. sync_perf_output.py

重新修改前面的sync_timing.py，使用BPF_PERF_OUTPUT進行輸出：

#!/usr/bin/python
#
# sync_timing.py    Trace time between syncs.
#                   For Linux, uses BCC, eBPF. Embedded C.
#
# Written as a basic example of tracing time between events.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

BPF_HASH(last);

struct data_t {
    u64 cur_ts;
	u64 diff_ts;
};

BPF_PERF_OUTPUT(events);

int do_trace(struct pt_regs *ctx) {
    u64 ts, *tsp, key = 0;
    struct data_t data = {}; 
    // attempt to read stored timestamp
    tsp = last.lookup(&key);
    if (tsp != NULL) {
	    data.cur_ts = bpf_ktime_get_ns() / 1000000000;
	    data.diff_ts = bpf_ktime_get_ns() - *tsp;
        if (data.diff_ts < 1000000000) {
            // output if time is less than 1 second
            events.perf_submit(ctx, &data, sizeof(data));
        }
        last.delete(&key);
    }

    // update stored timestamp
    ts = bpf_ktime_get_ns();
    last.update(&key, &ts);
    return 0;
}
""")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")

# format output
start = 0
def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
        start = event.cur_ts
    ts = event.cur_ts - start
    printb(b"At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, event.diff_ts / 1000000))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)

while 1:
    b.perf_buffer_poll()

Lesson 9. bitehist.py

examples/tracing/bitehist.py程序直方圖記錄了磁盤I/O的大小:

# ./bitehist.py
Tracing... Hit Ctrl-C to end.
^C
     kbytes          : count     distribution
       0 -> 1        : 3        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 211      |**********                            |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 800      |**************************************|

代碼如下：

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist); //定義一個直方圖，名為dist

int kprobe__blk_account_io_completion(struct pt_regs *ctx, struct request *req)
{
	dist.increment(bpf_log2l(req->__data_len / 1024)); //增加直方圖的計數，key為bpf_log2l(req->__data_len / 1024)
	return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
	sleep(99999999)
except KeyboardInterrupt:
	print()

# output
b["dist"].print_log2_hist("kbytes") #打印直方圖

回顧之前的內容：

kprobe__: 以該前綴開頭的函數將會被認為是一個內核函數名，並使用kprobe進行檢測，相當於直接對內核函數進行檢測。第一個參數總是struct pt_regs *，其余的參數為內核函數的入參，如果不使用，則可以忽略。其與使用BPF.attach_kprobe()的方式是相同的。
struct pt_regs *ctx, struct request *req: 給kprobe傳參，ctx用於寄存器和BPF上下文，req為傳遞給被檢測函數blk_account_io_completion()的第一個參數。
req->__data_len: 獲取該成員的值。

新的內容為：

BPF_HISTOGRAM(dist): 定義一個BPF map對象，為直方圖，名稱為"dist"
dist.increment():增加直方圖bucket的索引，默認每次增加1，可以使用第二個參數自定義增量。
bpf_log2l(): 對提供的值進行log-2計算，作為直方圖的索引，構造二次冪直方圖。
b["dist"].print_log2_hist("kbytes"): 以二次冪直方圖的方式打印"dist"，列標題為"kbytes"。從內核空間傳遞到用戶空間的唯一數據是bucket計數，這樣更加高效。

Lesson 10. disklatency.py

編寫一個程序計算磁盤I/O時間，並打印延遲的直方圖。可以在前面的disksnoop.py中找到磁盤I/O檢測和計時的方式，直方圖的代碼可以參考bitehist.py。

#!/usr/bin/python

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);
BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();

    start.update(&req, &ts);
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
    u64 *tsp, delta;

    tsp = start.lookup(&req);
    if (tsp != 0) {
        delta = (bpf_ktime_get_ns() - *tsp)/1000;
        dist.increment(bpf_log2l(delta));
        //dist.increment((delta)); /*直接使用非bpf_log2l將無法輸出結果，原因是本機延遲普遍大於250us，直方圖無法輸出這么多行，可以減少delta的比例即可，如設置delta = (bpf_ktime_get_ns() - *tsp)/100000;*/
        start.delete(&req);
    }
}

""")

# header
if BPF.get_kprobe_functions(b'blk_start_request'):
    b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_completion")
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output
b["dist"].print_log2_hist("LAT-us")

輸出如下：

Tracing... Hit Ctrl-C to end.
^C
     LAT-us              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 1        |****************************************|
       512 -> 1023       : 1        |****************************************|

Lesson 11. vfsreadlat.py

本例分為python和C兩部分文件： examples/tracing/vfsreadlat.py和examples/tracing/vfsreadlat.c，直接結果如下

# ./vfsreadlat.py
Tracing... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 2120     |****************************************|
         2 -> 3          : 26       |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 6        |                                        |

     usecs               : count     distribution
         0 -> 1          : 2120     |****************************************|
         2 -> 3          : 26       |                                        |
         4 -> 7          : 2        |                                        |
         8 -> 15         : 4        |                                        |
        16 -> 31         : 2        |                                        |

     usecs               : count     distribution
         0 -> 1          : 627      |****************************************|
         2 -> 3          : 10       |                                        |
         4 -> 7          : 2        |                                        |
         8 -> 15         : 3        |                                        |

內容為：

b = BPF(src_file = "vfsreadlat.c"): 從獨立的源代碼文件中讀取BPF C 程序
b.attach_kretprobe(event="vfs_read", fn_name="do_return"): 將BPF C程序 do_return()附加到內核函數vfs_read()。kretprobe用於檢測程序的返回值，而非入口。區別是attach_kprobe()在進入函數時探測，而attach_kretprobe()在退出函數時探測。
b["dist"].clear()：清除直方圖

Lesson 12. urandomread.py

執行examples/tracing/urandomread.py，並執行命令dd if=/dev/urandom of=/dev/null bs=8k count=5，輸出如下結果：

# ./urandomread.py
TIME(s)            COMM             PID    GOTBITS
4150.692889000     sshd             1420   112
4150.692918000     sshd             1420   384
4150.702547000     sshd             1421   384
4151.182605000     sshd             1420   880
4151.182658000     sshd             1420   8192
4151.182771000     sshd             1420   440
4151.205307000     sshd             1420   128
4151.230603000     sshd             1422   112
4151.230639000     sshd             1422   384
4151.256883000     sshd             1423   384
4151.330354000     sshd             1422   880
4151.330396000     sshd             1422   8192
4151.330503000     sshd             1422   440
4151.353134000     sshd             1422   128
4154.874123000     dd               1452   65536
4154.874193000     dd               1452   65536
4154.874252000     dd               1452   65536
4154.874318000     dd               1452   65536
4154.874383000     dd               1452   65536

源碼為：

from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(random, urandom_read) {
    // args is from /sys/kernel/debug/tracing/events/random/urandom_read/format
    bpf_trace_printk("%d\\n", args->got_bits);
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "GOTBITS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

需要掌握的內容為：

TRACEPOINT_PROBE(random, urandom_read): 檢測內核tracepoints random:urandom_read。這是一個穩定的API，推薦盡可能使用該函數替換kprobes。可以使用perf list列出系統支持的tracepoints，使用BPF程序對tracepoint檢測時，要求內核版本>=4.7。
args->got_bits: args 自動填充為跟蹤點參數，其數據結構如下：

#  cat /sys/kernel/debug/tracing/events/random/urandom_read/format
name: urandom_read
ID: 1075
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:int got_bits;     offset:8;       size:4; signed:1;
        field:int pool_left;    offset:12;      size:4; signed:1;
        field:int input_left;   offset:16;      size:4; signed:1;

print fmt: "got_bits %d nonblocking_pool_entropy_left %d input_entropy_left %d", REC->got_bits, REC->pool_left, REC->input_left

Lesson 13. disksnoop.py fixed

下面使用block:block_rq_issue 和block:block_rq_complete重寫disksnoop.py ：

#!/usr/bin/python
#
# disksnoop.py  Trace block device I/O: basic version of iosnoop.
#               For Linux, uses BCC, eBPF. Embedded C.
#
# Written as a basic example of tracing latency.
#
# Copyright (c) 2015 Brendan Gregg.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 11-Aug-2015   Brendan Gregg   Created this.

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

REQ_WRITE = 1   # from include/linux/blk_types.h

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

struct data_t {
    u64 len;
    char rwbs[8];
    u64 ts;
};

BPF_HASH(start,u64,struct data_t); //定義一個名為start的哈希，key類型為u64，value類型為struct data_t,注意此處非指針

TRACEPOINT_PROBE(block,block_rq_issue){
    u64 key = 0;
    struct data_t data = {}; //在跟蹤block_rq_issue的時候賦值
    data.len = args->bytes;
    bpf_probe_read(&data.rwbs,sizeof(data.rwbs),(void *)args->rwbs);
    data.ts = bpf_ktime_get_ns();

    start.update(&key, &data);
    return 0;
}

TRACEPOINT_PROBE(block,block_rq_complete){
    u64 delta, key = 0;
    struct data_t* datap;
    datap = start.lookup(&key); //在跟蹤block_rq_complete時取出保存的值
    if (datap != NULL) {
        delta = bpf_ktime_get_ns() - datap->ts;
        bpf_trace_printk("%d %x %d\\n", datap->len,
        datap->rwbs, delta / 1000);
        start.delete(&key);
    }
    return 0;
}
""")

# header
print("%-18s %-2s %-7s %8s" % ("TIME(s)", "T", "BYTES", "LAT(ms)"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
        (bytes_s, bflags_s, us_s) = msg.split()

        if int(bflags_s, 16) & REQ_WRITE:
            type_s = b"W"
        elif bytes_s == "0":    # see blk_fill_rwbs() for logic
            type_s = b"M"
        else:
            type_s = b"R"
            ms = float(int(us_s, 10)) / 1000

            printb(b"%-18.9f %-2s %-7s %8.2f" % (ts, type_s, bytes_s, ms))
    except KeyboardInterrupt:
        exit()

Lesson 14. strlen_count.py

本程序探測用戶層面的函數strlen()，計算其字符串參數出現的頻率，執行examples/tracing/strlen_count.py的結果為：

# ./strlen_count.py
Tracing strlen()... Hit Ctrl-C to end.
^C     COUNT STRING
         1 " "
         1 "/bin/ls"
         1 "."
         1 "cpudist.py.1"
         1 ".bashrc"
         1 "ls --color=auto"
         1 "key_t"
[...]
        10 "a7:~# "
        10 "/root"
        12 "LC_ALL"
        12 "en_US.UTF-8"
        13 "en_US.UTF-8"
        20 "~"
        70 "#%^,~:-=?+/}"
       340 "\x01\x1b]0;root@bgregg-test: ~\x07\x02root@bgregg-test:~# "

監控的函數strlen()的位置位於libc，當調用libc中的該函數時，會觸發運行自定義的count函數

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

struct key_t {
    char c[80];
};
BPF_HASH(counts, struct key_t);

int count(struct pt_regs *ctx) {
    if (!PT_REGS_PARM1(ctx))
        return 0;

    struct key_t key = {};
    u64 zero = 0, *val;

    bpf_probe_read_user(&key.c, sizeof(key.c), (void *)PT_REGS_PARM1(ctx));
    // could also use `counts.increment(key)`
    val = counts.lookup_or_try_init(&key, &zero);
    if (val) {
      (*val)++;
    }
    return 0;
};
""")
b.attach_uprobe(name="c", sym="strlen", fn_name="count")

# header
print("Tracing strlen()... Hit Ctrl-C to end.")

# sleep until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

# print output
print("%10s %s" % ("COUNT", "STRING"))
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
    print("%10d \"%s\"" % (v.value, k.c.encode('string-escape')))

PT_REGS_PARM1(ctx): 該宏會獲取傳遞給 strlen()的第一個參數，即字符串。
b.attach_uprobe(name="c", sym="strlen", fn_name="count"): 附加到"c"庫(如果是主程序，則需要使用其路徑，如b.attach_uprobe(name="/usr/bin/python", sym="main", fn_name="do_main"))，更多參見reference。
counts = b.get_table("counts"):獲取一個table對象，用於遍歷。該方式已被廢棄，可以直接使用counts = b.["counts"]
counts.items():返回table中的key的隊列。

Lesson 15. nodejs_http_server.py

本程序會檢查用戶靜態定義跟蹤（USDT）probe，它是內核跟蹤點的用戶級版本，執行examples/tracing/nodejs_http_server.py的結果如下：

# ./nodejs_http_server.py 24728
TIME(s)            COMM             PID    ARGS
24653324.561322998 node             24728  path:/index.html
24653335.343401998 node             24728  path:/images/welcome.png
24653340.510164998 node             24728  path:/images/favicon.png

代碼如下：

from __future__ import print_function
from bcc import BPF, USDT
import sys

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 0

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    uint64_t addr;
    char path[128]={0};
    bpf_usdt_readarg(6, ctx, &addr);
    bpf_probe_read_user(&path, sizeof(path), (void *)addr);
    bpf_trace_printk("path:%s\\n", path);
    return 0;
};
"""

# enable USDT probe from given PID
u = USDT(pid=int(pid))
u.enable_probe(probe="http__server__request", fn_name="do_trace")
if debug:
    print(u.get_text())
    print(bpf_text)

# initialize BPF
b = BPF(text=bpf_text, usdt_contexts=[u])

內容包括：

bpf_usdt_readarg(6, ctx, &addr): 將USDT中的第6個參數的地址保存到addr
bpf_probe_read_user(&path, sizeof(path), (void *)addr):現在字符串addr執行path變量
u = USDT(pid=int(pid)):使用給定的PID初始化USDT跟蹤
u.enable_probe(probe="http__server__request", fn_name="do_trace"): 將BPF C函數do_trace()附加到Node.js的http__server__request USDT probe上
b = BPF(text=bpf_text, usdt_contexts=[u]): 將USDT對象u傳遞給創建的BPF對象

更多參見reference

Lesson 16. task_switch.c

這是一個老的例子，用於回顧上面的內容。

當內核中的每個任務變更時會喚醒該程序，並在BPF map中記錄新舊PIDs。

下面的C程序提出了一個新觀念：上一個參數。BCC前端對該參數進行了特別處理，這樣就可以從kprobe基礎設施傳遞的已保存的上下文中訪問該變量。參數的原型位於第一個位置，需要與被kprobed的內核函數的原型匹配，如果匹配，程序將可以無縫地訪問函數參數。

#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    u32 prev_pid;
    u32 curr_pid;
};

BPF_HASH(stats, struct key_t, u64, 1024);
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
    struct key_t key = {};
    u64 zero = 0, *val;

    key.curr_pid = bpf_get_current_pid_tgid();
    key.prev_pid = prev->pid;

    // could also use `stats.increment(key);`
    val = stats.lookup_or_try_init(&key, &zero);
    if (val) {
      (*val)++;
    }
    return 0;
}

用戶空間的組件會加載上述文件，並附加到finish_task_switch內核函數。BPF對象的[]運算符允許訪問程序中的每個BPF_HASH，進而訪問內核的數據。像使用其他python 字典對象一樣使用該對象：允許讀取、更新和刪除。

BCC實際不會修改內核數據，只是對內核函數調用處進行打點，采集函數入參的信息等。

from bcc import BPF
from time import sleep

b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")

# generate many schedule events
for i in range(0, 100): sleep(0.01)

for k, v in b["stats"].items():
    print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))

Lesson 17. Further Study

更多例子可以參見Sasha Goldshtein's linux-tracing-workshop。

注：官方的開發文檔中沒有給出網絡相關的內容，后續補充

總結

BCC的代碼分為python和C兩部分，C代碼通常運行在內核態，用於打點采集系統調用的信息，然后將結果返回給python代碼；也可以用於對特定lib的函數進行打點探測。
BCC的核心為C代碼，需要了解系統調用接口，作為信息采集點；以及BCC本身提供的C代碼，提供了獲取當前時間，進程PID，map數據結構，輸出等功能。
推薦TRACEPOINT_PROBE進行打點檢測，perf list可以列出系統支持的tracepoint，該tracepoint基本包含了系統的大部分系統調用，如clone，sync等(要求內核版本>=4.7)，但網絡相關的還是需要使用kprobes對內核函數進行打點，如icmp_rcv。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Linux探測工具BCC(網絡) Istio可觀測性可觀測性-LOG BCC觀測工具的使用 Dapr-可觀測性 What is observability?｜一文讀懂什么是系統可觀測性 .NET運行時中的監測和可觀測性從零開始入門 K8s | 可觀測性：你的應用健康嗎？高德打車構建可觀測性系統實踐 OpenTelemetry - 雲原生下可觀測性的新標准