C/C++ source code scanning series: CodeQL

Reposted article, 2021-03-28 16:03

First published at

https://xz.aliyun.com/t/9275

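Overview

CodeQL is a static source code analysis tool that supports C, Python, Java and other languages. Users can write custom rules in the QL language to identify vulnerabilities in software, or scan with the rules that ship with CodeQL.

Environment setup

CodeQL works as follows: it first builds the source code under its own control and collects the information it needs during the build, then saves that information as a code database. Users then write CodeQL rules that search the database for matching code.

The environment used in this section is

Windows: vscode + codeql, used to develop CodeQL rules and run queries
Linux: codeql, used to build the code and create the code database

First download the CodeQL binary package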

https://github.com/github/codeql-cli-binaries/releases

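File names of the binary packages and the platforms they target

codeql-linux64.zip   Linux
codeql-osx64.zip     macOS
codeql-win64.zip     Windows
codeql.zip           all platforms

Download the archive for your platform and extract it to a directory.

On Windows, download codeql-win64.zip and extract it, then follow the vscode-codeql-starter README to set up vscode for writing CodeQL rules and querying databases later on.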

https://github.com/github/vscode-codeql-starter

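After downloading vscode-codeql-starter and installing the CodeQL extension for vscode, open the vscode-codeql-starter workspace in vscode (via File > Open Workspace). Then open the vscode settings, search for codeql, and set Executable Path to the path of codeql.exe.

On Linux, codeql is mainly used to build the code and create the code database, so downloading codeql-linux64.zip and extracting it to a directory is enough.

The following simple example shows how to use it. The code is at

https://github.com/hac425xxx/sca-workshop/tree/master/hello

First, use codeql to build the code and create the database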

$ /home/hac425/sca/codeql/codeql database create --language=cpp -c "gcc hello.c -o hello" ./hello_codedb

Initializing database at /home/hac425/sca-workshop/hello_codedb.
Running command [gcc, hello.c, -o, hello] in /home/hac425/sca-workshop.
Finalizing database at /home/hac425/sca-workshop/hello_codedb.
Successfully created database at /home/hac425/sca-workshop/hello_codedb.

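The command-line options are explained below

--language=cpp  sets the language to cpp
-c  the command to run to build the code, such as make or gcc
./hello_codedb  the path where the database files are stored

For simplicity, -c is given the gcc command directly here. codeql can also create databases from build systems such as make and cmake. For example, you can write a Makefile (named Makefile_hello in the command further down)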

hello:
	gcc hello.c -o hello

Passing the make command to -c then creates the database as well

$ /home/hac425/sca/codeql/codeql database create --language=cpp -c "make -f Makefile_hello" ./hello_codedb

Initializing database at /home/hac425/sca-workshop/hello_codedb.
Running command [make, -f, Makefile_hello] in /home/hac425/sca-workshop.
[2021-02-23 05:09:18] [build] gcc hello.c -o hello
Finalizing database at /home/hac425/sca-workshop/hello_codedb.
Successfully created database at /home/hac425/sca-workshop/hello_codedb.

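Once the database is created, you can load it with the CodeQL extension's From a folder option by opening the directory that contains the database.

Because I create the database on Linux and then load and query it on Windows, the database also needs to be bundled.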

$ /home/hac425/sca/codeql/codeql database bundle -o hello_codedb.zip hello_codedb

Creating bundle metadata for /home/hac425/sca-workshop/hello_codedb...
Creating zip file at /home/hac425/sca-workshop/hello_codedb.zip.

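Command-line options

database bundle  the subcommand that bundles a database
-o  the output archive
hello_codedb  the directory containing the database

Once bundled, the database can be copied to another machine for analysis.

To load a bundled database in vscode, use the extension's From an archive option

After loading, we can start writing rules. Here is a simple CodeQL query that finds every function call in the source and shows the name of the called function and the location of the call.

The QL code is as follows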

import cpp
 
from FunctionCall fc
select fc.getTarget().getQualifiedName(), fc

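Running it lists all function calls

Each entry in the fc column of the results can be clicked to jump to the corresponding source line.

QL language overview and simple examples

CodeQL implements its own QL language, which users use to query the database for the code fragments they need. QL is a logic language: almost every statement in QL is a logical formula. In some cases QL is used much like an ordinary programming language such as Python, but some of the underlying ideas are completely different, as explained below. This section introduces commonly used QL syntax through simple examples; for the full syntax, see the official documentation.

About the sample code

Code path

https://github.com/hac425xxx/sca-workshop/blob/master/ql-example/example.c

Vulnerabilities arise when a program handles untrusted external data. The sample code therefore simulates some functions that obtain external data and sets up scenarios with and without vulnerabilities. We then use CodeQL to find the vulnerable ones.

The functions that simulate obtaining external data are as follows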

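#include <stdlib.h> // for malloc

// fake read byte from taint data
char read_byte()
{
    return 1;
}

// fake read int from taint data
int read_int()
{
    // placeholder value: the body is empty in the source, which is
    // undefined behavior if a caller uses the result
    return 1;
}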

// fake get user input function
char *get_user_input_str()
{
    return (char *)malloc(12);
}

system command execution

Sample code used in this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example
https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/system_query

Vulnerable code

int call_system_example()
{

    char *user = get_user_input_str();

    char *xx = user;

    system(xx);
    return 1;
}

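The bug is that the function first obtains an externally supplied string with get_user_input_str and then passes it to system, which can lead to command execution.

This section uses the search for system command execution bugs to show how QL rules are written. We start with a simple QL query to see what a query is made of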

import cpp

from FunctionCall fc
where fc.getTarget().getName().matches("system")
select fc.getEnclosingFunction(), fc

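This query finds every call to system and shows the function containing the call site and the location of the call. Each clause works as follows:

The import statement imports the required libraries, which wrap predicates and classes for us to use
The from clause declares the variables used in the query; here it declares fc of type FunctionCall, which represents a function call
The where clause sets the conditions the variables must satisfy; here the condition is that the target of the call is named system
The select clause displays the results and lets you choose what to output.

In the query results, the number of columns and the data in each column are determined by the select clause, and each row is one result, much like the output of an SQL statement.

Browsing the query results shows one system call whose argument is a fixed string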

int call_system_const_example()
{
    system("cat /etc/xxx");
    return 1;
}

This call cannot lead to command injection, so we can add a condition to the where clause to filter it out.

import cpp

from FunctionCall fc
where fc.getTarget().getName().matches("system") and not fc.getArgument(0).isConstant()
select fc.getEnclosingFunction(), fc, fc.getArgument(0)

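The where clause uses and to add a conjunct. fc.getArgument(0).isConstant() tells whether the first argument of fc is a constant, so negating it filters out calls to system whose argument is a constant string.

These two examples give a rough idea of how CodeQL queries work. The user declares the program elements needed in the from clause (for example FunctionCall) and writes a number of logical formulas in the where clause. When the query runs, CodeQL gathers all elements of the types declared in from (here, all function calls), checks them against the formulas in where, and passes those for which where holds to select for display.

Put another way, a variable's type in the from clause only stands for a class of program elements with a very large value space; FunctionCall, for instance, can be any function call. The formulas in the where clause constrain fc and shrink its value space, and select then presents all remaining values as a table.

When I first started learning CodeQL this confused me for a while. Having a rough understanding of how QL works helps a lot when writing and debugging rules.

Back to the example: two results remain. One of them, call_system_safe_example, calls clean_data to validate the user input. For teaching purposes only, we assume clean_data guarantees that the user input is clean and returns 0 otherwise, so call_system_safe_example needs to be filtered out.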
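Neither clean_data nor call_system_safe_example is shown in the text. As a rough, hypothetical sketch of what such a pair might look like (the character allow-list and the name run_if_clean are assumptions made for illustration, not the repository's implementation):

```c
#include <ctype.h>
#include <stdlib.h>

// Hypothetical validator: returns 1 if the string is non-empty and every
// character is alphanumeric or one of '_', '-', '.', '/'; returns 0 otherwise.
// The allow-list is an assumption, not taken from the original code.
int clean_data(const char *data)
{
    if (data == NULL || *data == '\0')
        return 0;
    for (const char *p = data; *p != '\0'; p++)
    {
        unsigned char c = (unsigned char)*p;
        if (!isalnum(c) && c != '_' && c != '-' && c != '.' && c != '/')
            return 0;
    }
    return 1;
}

// Hypothetical shape of a safe caller: the command runs only after the
// check succeeds. Returns 0 if the input was rejected, 1 otherwise.
int run_if_clean(const char *cmd)
{
    if (!clean_data(cmd))
        return 0;
    system(cmd);
    return 1;
}
```

The point for the query is only structural: the call to system is guarded by a call to clean_data on the same value.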

For this simple example we can add a few expressions to filter out results where the same function calls both system and clean_data.

import cpp

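// Keep system calls whose enclosing function contains no clean_data call.
// not exists(...) avoids pairing fc with clean_data calls in unrelated functions.
from FunctionCall fc
where
  fc.getTarget().getName().matches("system") and
  not fc.getArgument(0).isConstant() and
  not exists(FunctionCall clean_fc |
    clean_fc.getTarget().getName().matches("clean_data") and
    clean_fc.getEnclosingFunction() = fc.getEnclosingFunction()
  )
select fc.getEnclosingFunction(), fc, fc.getArgument(0)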

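Of course, filtering this way produces false negatives and false positives, for example when the data checked by clean_data is not the data actually passed to system.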

	clean_data(data_1)
	................
	................
	system(data_2)
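To make the mismatch concrete, here is a small sketch in which the validated value and the executed value differ. All names are hypothetical; run_command merely records its argument and stands in for system so that the sketch can run without executing anything.

```c
#include <string.h>

// Hypothetical stand-ins, not from the original repository.
static const char *last_command = NULL;

// Toy validator: rejects any string containing ';'.
static int clean_data(const char *s) { return strchr(s, ';') == NULL; }

// Stands in for system(): only records what would have been executed.
static void run_command(const char *cmd) { last_command = cmd; }

// clean_data and the sink are called in the same function, so a
// "same enclosing function" filter treats this function as safe,
// yet the checked value (data_1) is not the one executed (data_2).
int mismatched_check(const char *data_1, const char *data_2)
{
    if (!clean_data(data_1))
        return 0;
    run_command(data_2);
    return 1;
}
```

A filter based only on which function contains the calls cannot tell this case from a correctly guarded one, which is one reason to move to data flow.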

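In addition, this kind of search cannot tell whether the argument to system is externally controllable.

This is where CodeQL's taint tracking comes in. Sample code follows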

import cpp
import semmle.code.cpp.dataflow.TaintTracking

from FunctionCall system_call, FunctionCall user_input, DataFlow::Node source, DataFlow::Node sink
where
  system_call.getTarget().getName().matches("system") and
  user_input.getTarget().getName().matches("get_user_input_str") and
  sink.asExpr() = system_call.getArgument(0) and
  source.asExpr() = user_input and
  TaintTracking::localTaint(source, sink)
select user_input, user_input.getEnclosingFunction()

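Taint tracking is provided by the TaintTracking module. CodeQL supports local and global taint tracking. Local taint tracking only follows data within a single function and not beyond it, while global taint tracking follows data across the whole source project.

Back to the CodeQL code above: first we need to be clear about our goal and what we already know.

get_user_input_str simulates the program obtaining data from outside; the data in its return value is external data, that is, the taint source
system is the sink; if data flows from get_user_input_str into system, there is very likely a vulnerability

The query is explained below:

First it declares two function calls, system_call and user_input, which are the call expressions for system and get_user_input_str respectively
Then it declares source and sink as the source and sink of the taint tracking
Then sink.asExpr() = system_call.getArgument(0) makes the sink the first argument of the system call
Then source.asExpr() = user_input makes the source the call to get_user_input_str
Finally TaintTracking::localTaint finds flows from source to sink

This query finds call sites where the first argument of system is controlled by the return value of get_user_input_str, for example call_system_example shown earlier.

However, because localTaint is used here, the case below is missed. There are two ways to find it:

Add our_wrapper_system to the sinks
Use global taint tracking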

void our_wrapper_system(char* cmd)
{
    system(cmd);
}

int call_our_wrapper_system_example()
{

    char* user = get_user_input_str();

    char* xx = user;

    our_wrapper_system(xx);
    return 1;
}

The query for the first approach is below; it simply treats our_wrapper_system as a sink as well

import cpp
import semmle.code.cpp.dataflow.TaintTracking

predicate setSystemSink(FunctionCall fc, Expr e) {
  fc.getTarget().getName().matches("system") and
  fc.getArgument(0) = e
}

predicate setWrapperSystemSink(FunctionCall fc, Expr e) {
  fc.getTarget().getName().matches("our_wrapper_system") and
  fc.getArgument(0) = e
}

from FunctionCall fc, FunctionCall user_input, DataFlow::Node source, DataFlow::Node sink
where
  (
    setWrapperSystemSink(fc, sink.asExpr()) or
    setSystemSink(fc, sink.asExpr())
  ) and
  user_input.getTarget().getName().matches("get_user_input_str") and
  sink.asExpr() = fc.getArgument(0) and
  source.asExpr() = user_input and
  TaintTracking::localTaint(source, sink)
select user_input, user_input.getEnclosingFunction()

The code using global taint tracking is as follows

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "get_user_input_str"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      sink.asExpr() = call.getArgument(0) and
      call.getTarget().getName() = "system"
    )
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, SystemCfg cfg
where cfg.hasFlowPath(source, sink)
select source, sink

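ps: exists introduces a variable that is only visible inside its body, similar to a local variable

To use global taint tracking, define a class that extends TaintTracking::Configuration and override isSource and isSink

isSource defines the sources; here the call to get_user_input_str is the source
isSink defines the sinks; here the first argument of system is the sink
Then cfg.hasFlowPath(source, sink) in the where clause finds the code paths from source to sink

The results show that call_system_safe_example is also reported. As mentioned earlier, clean_data ensures the data cannot be used for command injection, so we can override the isSanitizer predicate to drop results where tainted data flows into clean_data. The key code is as follows: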

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.valuenumbering.GlobalValueNumbering

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }
  ............

  override predicate isSanitizer(DataFlow::Node nd) {
    exists(FunctionCall fc |
      fc.getTarget().getName() = "clean_data" and
      globalValueNumber(fc.getArgument(0)) = globalValueNumber(nd.asExpr())
    )
  }
  ............
}

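ps: the results are only correct when globalValueNumber is used. This is presumably related to global value numbering (GVN) from compiler theory, which identifies expressions that compute the same value.

Array out-of-bounds

Code used in this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/array_oob_query

Vulnerable code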

int global_array[40] = {0};

void array_oob()
{
    int user = read_byte();
    global_array[user] = 1;
}

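The function first reads one byte of external input with read_byte and then uses it as an index into global_array. global_array has only 40 elements, so this can lead to an out-of-bounds access.

The vulnerability model is clear, so we use taint tracking to find it: the source is the call to read_byte, and the sink is tainted data used as an array index.

The query code is as follows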

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class ArrayOOBCfg extends TaintTracking::Configuration {
  ArrayOOBCfg() { this = "ArrayOOBCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "read_byte"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(ArrayExpr ae | sink.asExpr() = ae.getArrayOffset())
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, ArrayOOBCfg cfg
where cfg.hasFlowPath(source, sink)
select source.getNode().asExpr().(FunctionCall).getEnclosingFunction(), source, sink

First look at the code that defines the source

source.asExpr().(FunctionCall).getTarget().getName() = "read_byte"

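This makes source the FunctionCall expression that calls read_byte; .(FunctionCall) works much like a type cast.

Next comes the sink. In QL many syntactic constructs have a corresponding class; the array access involved here, for instance, is represented by ArrayExpr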

import cpp

from ArrayExpr ae
select ae, ae.getArrayOffset(), ae.getArrayBase()

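getArrayOffset returns the index part of the array access and getArrayBase returns the base of the array, so the taint query above finds code where data flows from read_byte into an array index.

The query finds all matching code, including one false positive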

void no_array_oob()
{
    int user = read_byte();

    // compare against the element count (40), not the size in bytes
    if (user >= sizeof(global_array) / sizeof(global_array[0]))
        return;

    global_array[user] = 1;
}

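This function checks the value of user, so we can filter the result out with isSanitizer. For simplicity, any user input that appears in the condition of an if statement is considered properly validated.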

  override predicate isSanitizer(DataFlow::Node nd) {
    exists(IfStmt ifs |
      globalValueNumber(ifs.getControllingExpr().getAChild*()) = globalValueNumber(nd.asExpr())
    )
  }

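CodeQL represents an if statement with IfStmt, and getControllingExpr returns its controlling expression. getAChild* then recursively visits all child nodes of that expression, so the predicate holds whenever nd is part of the controlling expression. As in the earlier isSanitizer example, globalValueNumber requires importing semmle.code.cpp.valuenumbering.GlobalValueNumbering.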
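Note that this sanitizer only checks that the value appears in some if condition; it does not check that the comparison itself is correct. For instance, in C sizeof applied to an array yields its size in bytes rather than its element count, so a bound based on a bare sizeof(global_array) would be too large for an int array. A minimal sketch of a check against the element count (checked_store and ARRAY_LEN are hypothetical names, not from the repository):

```c
#include <stddef.h>

// Number of elements in a true array (not a pointer).
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

int global_array[40] = {0};

// Stores value at idx and returns 1 if idx is a valid element index;
// returns 0 and leaves the array untouched otherwise.
int checked_store(int idx, int value)
{
    // Reject negative indexes explicitly, then compare with the element count.
    if (idx < 0 || (size_t)idx >= ARRAY_LEN(global_array))
        return 0;
    global_array[idx] = value;
    return 1;
}
```

A query that wants to rule out wrong checks as well would have to inspect the comparison, which the simple if-based sanitizer does not attempt.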

Reference counting

Code for this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/ref_query

Vulnerable code 1

int ref_leak(int *ref, int a, int b)
{

    ref_get(ref);

    if (a == 2)
    {
        puts("error 2");
        return -1;
    }
    ref_put(ref);
    return 0;
}

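The bug is that when a is 2 the function returns directly without calling ref_put to decrement the reference count. Vulnerability model: some conditional branch that contains a return does not call ref_put to release the reference.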
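For comparison, a balanced version of the function might look like the sketch below. The bodies of ref_get and ref_put are not given in the text, so the simple increment and decrement here are assumptions, and ref_no_leak is a hypothetical name.

```c
#include <stdio.h>

// Hypothetical stand-ins for the reference-count helpers.
void ref_get(int *ref) { (*ref)++; }
void ref_put(int *ref) { (*ref)--; }

// Variant of ref_leak in which every return path releases the reference.
int ref_no_leak(int *ref, int a, int b)
{
    (void)b; // unused, kept to mirror the original signature
    ref_get(ref);

    if (a == 2)
    {
        puts("error 2");
        ref_put(ref); // release before the early return
        return -1;
    }
    ref_put(ref);
    return 0;
}
```

In this shape the block that contains the return also contains a ref_put call, which is exactly the property the query looks for the absence of.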

The query code is as follows

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class RefGetFunctionCall extends FunctionCall {
  RefGetFunctionCall() { this.getTarget().getName() = "ref_get" }
}

class RefPutFunctionCall extends FunctionCall {
  RefPutFunctionCall() { this.getTarget().getName() = "ref_put" }
}

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(ReturnStmt rs |
      this.getAChild*() = rs and
      not exists(RefPutFunctionCall rpfc | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

from RefGetFunctionCall rgfc, EvilIfStmt eifs
where eifs.getEnclosingFunction() = rgfc.getEnclosingFunction()
select eifs.getEnclosingFunction(), eifs

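The code uses classes to define specific function calls; RefPutFunctionCall, for example, represents a call to the ref_put function.

EvilIfStmt then represents an if statement that contains a return statement but no call to ref_put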

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(ReturnStmt rs |
      this.getAChild*() = rs and
      not exists(RefPutFunctionCall rpfc | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

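The rough logic is:

First, this.getAChild*() = rs constrains this to be an if statement that contains a return statement
Then a not exists formula ensures that the block containing rs has no ref_put call.

Vulnerable code 2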

int ref_dec_error(int *ref, int a, int b)
{
    ref_get(ref);

    if (a == 2)
    {
        puts("ref_dec_error 2");
        ref_put(ref);
    }
    ref_put(ref);
    return 0;
}

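The bug is that when a is 2 the function calls ref_put to decrement the reference count but does not return.

Vulnerability model: some conditional branch calls ref_put to release the reference but does not return, so ref_put may be called more than once.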
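The effect on the counter can be seen by running the function next to a corrected variant. As before, the ref_get and ref_put bodies are assumed to be a plain increment and decrement, and ref_dec_fixed is a hypothetical name.

```c
#include <stdio.h>

// Hypothetical stand-ins for the reference-count helpers.
void ref_get(int *ref) { (*ref)++; }
void ref_put(int *ref) { (*ref)--; }

// The function from the text: when a == 2, ref_put runs twice.
int ref_dec_error(int *ref, int a, int b)
{
    (void)b;
    ref_get(ref);
    if (a == 2)
    {
        puts("ref_dec_error 2");
        ref_put(ref);
    }
    ref_put(ref);
    return 0;
}

// One possible fix: return right after the branch releases the reference.
int ref_dec_fixed(int *ref, int a, int b)
{
    (void)b;
    ref_get(ref);
    if (a == 2)
    {
        puts("ref_dec_error 2");
        ref_put(ref);
        return 0; // the ref_put below is no longer reached on this path
    }
    ref_put(ref);
    return 0;
}
```

Starting from a count of 1, ref_dec_error with a equal to 2 leaves the count at 0, one lower than it should be, while ref_dec_fixed leaves it at 1.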

The key part of the QL query is as follows

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(RefPutFunctionCall rpfc |
      this.getAChild*() = rpfc and
      not exists(ReturnStmt rs | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

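Modeling external functions

Code for this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/model_function

A common problem in static taint analysis is that when data flows into an external function (for example a library function without source code), the taint analysis engine may lose track of taint propagation. For example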

int custom_memcpy(char *dst, char *src, int sz);

int call_our_wrapper_system_custom_memcpy_example()
{

    char *user = get_user_input_str();

    char *tmp = malloc(strlen(user) + 1);

    custom_memcpy(tmp, user, strlen(user) + 1); // + 1 copies the terminating NUL as well

    our_wrapper_system(tmp);
    return 1;
}

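This function first obtains external input with get_user_input_str, then calls custom_memcpy to copy the data into tmp, and finally passes tmp to our_wrapper_system, which runs it with system. custom_memcpy is just a wrapper around memcpy whose source is not provided.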
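Since the source of custom_memcpy is not available, the following is only a guess at what such a wrapper might look like; the return value in particular is an assumption. It shows the flow the analyzer cannot see: bytes move from src (argument 1) to dst (argument 0).

```c
#include <string.h>

// Hypothetical implementation of the external function: a thin wrapper
// around memcpy that copies sz bytes from src to dst.
// Returning the number of bytes copied is an assumption.
int custom_memcpy(char *dst, char *src, int sz)
{
    if (sz <= 0)
        return 0;
    memcpy(dst, src, (size_t)sz);
    return sz;
}
```

Without the body, the analysis only sees a call that takes two pointers, so any flow from src to dst has to be described to it explicitly.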

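Running the earlier QL code directly does not find this code: custom_memcpy is an external function, so CodeQL's taint tracking engine does not know how taint propagates through it.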

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "get_user_input_str"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      sink.asExpr() = call.getArgument(0) and
      call.getTarget().getName() = "system"
    )
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, SystemCfg cfg
where cfg.hasFlowPath(source, sink)
select source.getNode().asExpr().(FunctionCall).getEnclosingFunction(), source, sink

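There are two ways to solve this: override the isAdditionalTaintStep predicate, or add a model to the ql source. Both are described below.

Overriding `isAdditionalTaintStep`

With TaintTracking::Configuration, you can override the isAdditionalTaintStep predicate to define custom taint propagation rules. The code is as follows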

  override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(FunctionCall fc |
      pred.asExpr() = fc.getArgument(1) and fc.getTarget().getName() = "custom_memcpy"
      and succ.asDefiningArgument() = fc.getArgument(0)
    )
  }

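isAdditionalTaintStep means that whenever the predicate holds, tainted data flows from pred to succ.

So what is specified here is that tainted data flows from argument 1 of custom_memcpy (src) into argument 0 (dst).

Adding a model to the `ql` source

The ql source includes built-in models for many standard library functions such as strcpy and memcpy. The code path is

cpp\ql\src\semmle\code\cpp\models\implementations\Memcpy.qll

We can adapt these models to quickly model the functions we need. The steps are as follows

First create a new .qll file in that directory. Here I simply copied Memcpy.qll and changed the function name on line 19, since the function is itself a wrapper around memcpy.

Then import it in Models.qll

After that, the query finds the flow.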
