1、From the birth of the Internet until roughly ten years ago, TCP congestion control relied on the classic algorithms: Reno, New Reno, BIC, CUBIC. They served low-bandwidth wired networks well for decades. As bandwidth grew and wireless communication became ubiquitous, these classics gradually stopped fitting the new environment:
- On wireless links (cellular, Wi-Fi), packets get corrupted over the air because of channel contention and similar effects. The network may not be congested at all; it is a plain transmission error, yet it gets misjudged as congestion!
- Network devices now carry much larger buffers and can hold far more packets. Congestion already exists once a buffer fills up, but at that moment nothing has been dropped yet (and the sender has not timed out), so if loss is the congestion signal, the sender wrongly concludes there is no congestion and reacts too late; exactly the opposite of the previous case!
The root cause is that traditional congestion control equates packet loss/corruption with network congestion. The consequences of this flawed assumption:
- The throughput of the whole network becomes a sawtooth: ramp up, then halve the congestion window (cwnd) once the threshold is hit or a packet is lost, ramp up again, halve again, over and over. This bandwidth oscillation keeps utilization and throughput low most of the time!
- Large end-to-end latency: the buffers of intermediate devices fill up and packets sit there queuing; since nothing has been dropped yet, the sender cannot tell that the path is congested
- Aggressiveness: such algorithms grab bandwidth from other flows, which hurts the network as a whole and leaves bandwidth unevenly shared
2、Given all these shortcomings of the traditional algorithms, what does BBR do instead? Fundamentally, congestion happens when the source sends so much data in a short time that it exceeds the buffers or the forwarding capacity of routers and other intermediate devices, so BBR attacks congestion along two lines:
(1) The source must not send faster than the bottleneck bandwidth, so that long queues (and with them congestion) never build up
- Reno and CUBIC send packets in bursts: 4, 8, ... segments at a time, which can momentarily fill a router's buffer and exceed the bottleneck link's bandwidth. The packets therefore have to be spaced out so that BtlBw is never momentarily exceeded; how is this spacing interval computed?
- The pacing gain pacing_gain takes values such as 1, 1.25 and 0.75; the spacing interval is simply packet.size / pacing_rate, i.e. next_send_time = now() + packet.size / pacing_rate (see the sketch below);
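To make the spacing concrete, here is a minimal C sketch of the pacing computation just described (the function name, the units and the use of floating point are my own, this is not BBR's actual code):

#include <stdint.h>

/* Illustrative only: next_send_time = now + packet.size / pacing_rate,
 * where pacing_rate = pacing_gain * BtlBw; all times are in microseconds. */
static uint64_t next_send_time_us(uint64_t now_us, uint32_t packet_bytes,
                                  uint64_t btlbw_bytes_per_sec, double pacing_gain)
{
    double pacing_rate = pacing_gain * (double)btlbw_bytes_per_sec;      /* bytes per second */
    uint64_t interval_us = (uint64_t)((double)packet_bytes * 1e6 / pacing_rate);
    return now_us + interval_us;
}

For example, with a 1500-byte packet, a 10 MB/s bottleneck and pacing_gain = 1, the interval is 150 us: the packets are spread evenly over the RTT instead of being blasted out back to back.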
(2) BDP = RTT * BtlBw. The unacknowledged in-flight data (inflight) from the source must not exceed the BDP; in other words, the total amount of data in transit on the round-trip path must stay below RTT * BtlBw
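To put a number on the BDP: with RTprop = 40 ms and BtlBw = 100 Mbit/s (12.5 MB/s), BDP = 0.04 s * 12.5e6 B/s = 500 KB, so the sender should keep at most about 500 KB in flight. A tiny illustrative snippet (the values and names here are mine, just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: BDP (bytes) = RTprop (seconds) * BtlBw (bytes/second). */
static uint64_t bdp_bytes(double rtprop_sec, uint64_t btlbw_bytes_per_sec)
{
    return (uint64_t)(rtprop_sec * (double)btlbw_bytes_per_sec);
}

int main(void)
{
    /* 40 ms propagation delay, 100 Mbit/s = 12.5e6 bytes/s bottleneck */
    printf("BDP = %llu bytes\n",
           (unsigned long long)bdp_bytes(0.040, 12500000));   /* prints 500000 */
    return 0;
}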
3、BBR's congestion control needs two variables: RTT (also called RTprop, the round-trip propagation time) and BtlBw (the bottleneck bandwidth), i.e. the propagation delay and the bottleneck bandwidth of the path. How are these two measured accurately?
(1) Definition of RTT: the time from when the source sends a packet until it receives the corresponding ACK, i.e. one full round trip. It is best measured while the connection is application-limited, concretely:
- During the handshake: very little data has been sent, so the path carries little traffic; the SYN/SYN+ACK round trip can be used as the RTT;
- After the handshake: if both sides only run an interactive application and exchange little data, the bottleneck link is still far from saturated, so the round trip of a data packet and its ACK can likewise be used as the RTT;
- What if both sides are sending at full throttle and the bottleneck link is saturated; how is RTT measured then? The only option is to periodically (say every 10 s up to a few minutes) spend about 2% of the time (roughly 200 ms to a few seconds) deliberately slowing down, so that the connection drops back into the application-limited regime and the RTT can be re-measured (this is also why BBR is comparatively fair and does not maliciously squeeze out the rest of the network)!
(2) Measuring the bottleneck bandwidth BtlBw: sample the delivery rate repeatedly while bandwidth-limited and take the maximum recent delivery rate as BtlBw. Concretely: after the connection is established, keep increasing the amount of inflight data; once the delivery rate fails to grow by 25% for three consecutive RTTs, the connection is considered bandwidth-limited. The measurement window should cover at least 6 RTTs, ideally around 10 RTTs (a sketch of the filters follows)!
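The two estimators can be pictured with the following simplified sketch (my own simplification, not the kernel's minmax code): BtlBw is the maximum delivery-rate sample over the last ~10 RTT rounds, kept per round so that stale rounds age out; RTprop is maintained symmetrically as a minimum over a ~10-second window.

#include <stdint.h>

/* Simplified windowed max filter: remember the best delivery-rate sample of
 * each of the last BW_WIN_ROUNDS RTT rounds and report the max over them as
 * the BtlBw estimate. (Sketch only; the kernel keeps just 3 samples.) */
#define BW_WIN_ROUNDS 10

struct bw_filter {
    uint64_t best[BW_WIN_ROUNDS];      /* best sample seen in round round_of[i] */
    uint32_t round_of[BW_WIN_ROUNDS];
};

static void bw_filter_add(struct bw_filter *f, uint32_t round, uint64_t bw_sample)
{
    uint32_t slot = round % BW_WIN_ROUNDS;
    if (f->round_of[slot] != round) {  /* a newer round reuses this slot: reset it */
        f->round_of[slot] = round;
        f->best[slot] = 0;
    }
    if (bw_sample > f->best[slot])
        f->best[slot] = bw_sample;
}

static uint64_t bw_filter_max(const struct bw_filter *f, uint32_t cur_round)
{
    uint64_t max = 0;
    for (int i = 0; i < BW_WIN_ROUNDS; i++)
        if (cur_round - f->round_of[i] < BW_WIN_ROUNDS && f->best[i] > max)
            max = f->best[i];          /* only rounds still inside the window count */
    return max;
}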
(3) Quite a few concepts have come up so far; the relationships among them are as follows:
- At first, while the source's sending rate has not yet filled the BDP, there is still spare capacity on the path: this is the application-limited regime (bluntly: the application isn't supplying enough data), so the RTT stays flat while the delivery rate of the whole path keeps climbing!
- Once inflight reaches the BDP but is still below BDP + BtlBufSize, the pipe itself is full but the bottleneck device's buffer is not yet: this is the bandwidth-limited regime (bluntly: bandwidth is the constraint). If the source keeps accelerating now, the RTT goes straight up, while the delivery rate cannot improve because there is no spare capacity left!
- If the source keeps firing at full blast until inflight exceeds BDP + BtlBufSize, both the pipe and the router buffers are full, and the router has no choice but to drop packets: this is the buffer-limited regime (bluntly: the buffer is the constraint)
- Congestion control, then, is about keeping the inflight data from exceeding the BDP! That is why BBR steers by the two variables RTT and BtlBw instead of the traditional crude "halve on loss" rule (the three regimes are summarized in the sketch below)!
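The three regimes can be summed up in a small sketch (the names are mine; BDP and the bottleneck buffer size are assumed to be known):

#include <stdint.h>

enum path_regime {
    APP_LIMITED,        /* inflight < BDP: pipe not full, RTT flat, delivery rate still rising */
    BANDWIDTH_LIMITED,  /* BDP <= inflight < BDP + BtlBufSize: pipe full, queue builds, RTT rises */
    BUFFER_LIMITED      /* inflight >= BDP + BtlBufSize: buffer overflows, the router must drop */
};

static enum path_regime classify(uint64_t inflight, uint64_t bdp, uint64_t btl_buf_size)
{
    if (inflight < bdp)
        return APP_LIMITED;
    if (inflight < bdp + btl_buf_size)
        return BANDWIDTH_LIMITED;
    return BUFFER_LIMITED;
}

BBR tries to operate right at the inflight ≈ BDP boundary, whereas loss-based algorithms only react at the BUFFER_LIMITED boundary, after the queue is already full.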
4、Everything above is the background and theory behind BBR; how is it actually implemented? The algorithm cycles through four states: STARTUP, DRAIN, PROBE_BW and PROBE_RTT!
(1) Startup: as the name says, this is the initial phase! Right after startup the two endpoints have exchanged little data and throughput is low. To fill the available bandwidth as quickly as possible, Startup implements a binary search for BtlBw: the sending rate is doubled each round using a gain of 2/ln2, so the pipe fills up very quickly! Once the delivery rate stops growing by 25% for three consecutive RTTs, the bandwidth-limited state has been reached and BtlBw is known (and, together with RTprop, the BDP = RTT * BtlBw); a simplified sketch of this "full pipe" check follows.
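The "three RTTs without 25% growth" check can be sketched as follows (modeled on the kernel's bbr_check_full_bw_reached, but the struct and names here are my own simplification):

#include <stdint.h>
#include <stdbool.h>

struct full_bw_state {
    uint64_t full_bw;        /* best per-round bandwidth seen so far */
    uint32_t full_bw_cnt;    /* consecutive rounds without ~25% growth */
    bool     full_bw_reached;
};

/* Call once per RTT round with that round's best bandwidth sample. */
static void check_full_bw_reached(struct full_bw_state *s, uint64_t round_max_bw)
{
    if (s->full_bw_reached)
        return;
    if (round_max_bw >= s->full_bw + s->full_bw / 4) {  /* still growing by >= 25% */
        s->full_bw = round_max_bw;
        s->full_bw_cnt = 0;
        return;
    }
    if (++s->full_bw_cnt >= 3)      /* three stalled rounds: the pipe is full, leave Startup */
        s->full_bw_reached = true;
}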
(2) Drain: the flooding of the Startup phase over-fills the path and throughput suffers, so the sender now gradually lowers its sending rate until inflight < BDP, avoiding congestion
(3) Probe_BW: once the queue built during Startup has been drained, inflight is basically stable; this is BBR's steady state. As the name says, this phase keeps probing for more bandwidth!
(4) Probe_RTT: routes may change while data is being transferred, so the previously measured RTT may no longer hold; BBR therefore has to step out of Probe_BW periodically. The method is simple: spend about 2% of the time dropping back into the application-limited regime and re-measure the RTT; because every flow re-measures the RTT this way, competing flows end up sharing the bandwidth relatively fairly! (the four states are summarized in the sketch below)
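Putting the four states together, the lifecycle of a BBR flow looks roughly like this (my own summary; the kernel encodes the same states in its enum bbr_mode):

/* Rough sketch of BBR's state machine, not kernel code:
 *
 *   STARTUP   --(bw stops growing 25% for 3 RTTs)-----------------> DRAIN
 *   DRAIN     --(inflight drained down to <= BDP)-----------------> PROBE_BW
 *   PROBE_BW  --(min RTT estimate older than ~10 s)---------------> PROBE_RTT
 *   PROBE_RTT --(>= 200 ms and one round with ~4 packets in flight)-> back to
 *               PROBE_BW (or STARTUP if the pipe was never filled)
 */
enum bbr_state { STARTUP, DRAIN, PROBE_BW, PROBE_RTT };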
5、At any moment an endpoint is either sending or receiving data. Google's paper gives pseudocode for what has to happen in each case!
(1) Actions taken when an ACK arrives:
function onAck(packet)
    rtt = now - packet.sendtime                      // receive time minus the send time recorded in the packet = RTT
    update_min_filter(RTpropFilter, rtt)             // update the RTT estimate
    delivered += packet.size
    delivered_time = now
    // compute the current actual delivery rate
    delivery_rate = (delivered - packet.delivered) / (delivered_time - packet.delivered_time)
    if (delivery_rate > BtlBwFilter.current_max      // the measured rate already exceeds the current BtlBw estimate, or
        || !packet.app_limited)                      // the sample is not app-limited (app-limited samples say nothing about BtlBw)
        update_max_filter(BtlBwFilter, delivery_rate)    // update the BtlBw estimate
    if (app_limited_until > 0)                       // bytes that can still be sent before the bottleneck bandwidth is reached
        app_limited_until = app_limited_until - packet.size
In short: every ACKed packet updates the RTT estimate, but only some packets update BtlBw!
(2) Actions taken when sending a packet:
function send(packet)
    bdp = BtlBwFilter.current_max * RTpropFilter.current_min    // compute the BDP
    if (inflight >= cwnd_gain * bdp)     // the data already in flight exceeds the allowed maximum
        return                           // just return; wait for the next ACK or a retransmission timeout
    // reaching here means inflight < cwnd_gain * bdp, i.e. the data in flight is below the pipe's capacity
    if (now >= next_send_time)
        packet = nextPacketToSend()
        if (!packet)                     // nothing to send
            app_limited_until = inflight     // update "bytes that can still be sent before the bottleneck capacity is reached"
            return
        packet.app_limited = (app_limited_until > 0)     // if some bytes can still be sent before hitting the bottleneck, we are app-limited
        packet.sendtime = now
        packet.delivered = delivered
        packet.delivered_time = delivered_time
        ship(packet)
        // the next send time; this pacing is what actually controls congestion
        next_send_time = now + packet.size / (pacing_gain * BtlBwFilter.current_max)
    // arm a timer that calls send() again when next_send_time expires
    timerCallbackAt(send, next_send_time)
In short: first check whether the inflight data already exceeds the (gain-scaled) BDP; if so, return immediately and end send(); if not, ship the next packet and re-arm the timer for the next send time!
6、Now let's look at Google's BBR source code, which lives in net/ipv4/tcp_bbr.c (I'm reading the Linux 4.9 sources)!
(1) Overview of BBR's key functions: remember the struct that every congestion-control algorithm registers? This is BBR's registration structure!
static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
    .flags          = TCP_CONG_NON_RESTRICTED,
    .name           = "bbr",
    .owner          = THIS_MODULE,
    .init           = bbr_init,
    .cong_control   = bbr_main,
    .sndbuf_expand  = bbr_sndbuf_expand,
    .undo_cwnd      = bbr_undo_cwnd,
    .cwnd_event     = bbr_cwnd_event,
    .ssthresh       = bbr_ssthresh,
    .tso_segs_goal  = bbr_tso_segs_goal,
    .get_info       = bbr_get_info,
    .set_state      = bbr_set_state,
};
(2) Since .cong_control points to bbr_main, that is clearly the entry point of the congestion-control algorithm!
static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw;

    bbr_update_model(sk, rs);
    bw = bbr_bw(sk);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
    bbr_set_tso_segs_goal(sk);
    bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}
Judging from the names of the functions called here, some estimate the bandwidth, some set the pacing_rate (which is how BBR throttles the sending speed and thus controls congestion), and some set the congestion window. Peeling back the layers of calls, the most important functions are the following:
(3) bbr_update_bw: estimate the bandwidth
/* Estimate the bandwidth based on how fast packets are delivered.
 * Overall: 1. advance the per-RTT round counter; 2. bandwidth sample =
 * delivered * BW_UNIT / sampling interval; 3. feed the sample into the
 * windowed max filter that tracks BtlBw. */
static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;

    bbr->round_start = 0;
    if (rs->delivered < 0 || rs->interval_us <= 0)
        return; /* Not a valid observation */

    /* See if we've reached the next RTT */
    if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
        bbr->next_rtt_delivered = tp->delivered;
        bbr->rtt_cnt++;
        bbr->round_start = 1;
        bbr->packet_conservation = 0;
    }

    bbr_lt_bw_sampling(sk, rs);

    /* Divide delivered by the interval to find a (lower bound) bottleneck
     * bandwidth sample. Delivered is in packets and interval_us in uS and
     * ratio will be <<1 for most connections. So delivered is first scaled.
     * This is the bandwidth computation. */
    bw = (u64)rs->delivered * BW_UNIT;
    do_div(bw, rs->interval_us);

    /* If this sample is application-limited, it is likely to have a very
     * low delivered count that represents application behavior rather than
     * the available network rate. Such a sample could drag down estimated
     * bw, causing needless slow-down. Thus, to continue to send at the
     * last measured network rate, we filter out app-limited samples unless
     * they describe the path bw at least as well as our bw model.
     *
     * So the goal during app-limited phase is to proceed with the best
     * network rate no matter how long. We automatically leave this
     * phase when app writes faster than the network can deliver :)
     */
    if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
        /* Incorporate new sample into our max bw filter:
         * the bw sample is fed into the windowed max filter */
        minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
    }
}
(4) bbr_set_pacing_rate: control how fast packets leave by setting pacing_rate:
/* Pace using current bw estimate and a gain factor. In order to help drive the
 * network toward lower queues while maintaining high utilization and low
 * latency, the average pacing rate aims to be slightly (~1%) lower than the
 * estimated bandwidth. This is an important aspect of the design. In this
 * implementation this slightly lower pacing rate is achieved implicitly by not
 * including link-layer headers in the packet size used for the pacing rate.
 */
static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u64 rate = bw;

    rate = bbr_rate_bytes_per_sec(sk, rate, gain);
    rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    if (bbr->mode != BBR_STARTUP || rate > sk->sk_pacing_rate)
        sk->sk_pacing_rate = rate;
}
(5) bbr_update_min_rtt: update the minimum RTT
/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
 * periodically drain the bottleneck queue, to converge to measure the true
 * min_rtt (unloaded propagation delay). This allows the flows to keep queues
 * small (reducing queuing delay and packet loss) and achieve fairness among
 * BBR flows.
 *
 * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
 * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
 * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
 * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
 * re-enter the previous mode. BBR uses 200ms to approximately bound the
 * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
 *
 * Note that flows need only pay 2% if they are busy sending over the last 10
 * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
 * natural silences or low-rate periods within 10 seconds where the rate is low
 * enough for long enough to drain its queue in the bottleneck. We pick up
 * these min RTT measurements opportunistically with our min_rtt filter. :-)
 */
static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    bool filter_expired;

    /* Track min RTT seen in the min_rtt_win_sec filter window: */
    filter_expired = after(tcp_time_stamp,
                           bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
    if (rs->rtt_us >= 0 &&
        (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
        bbr->min_rtt_us = rs->rtt_us;
        bbr->min_rtt_stamp = tcp_time_stamp;
    }

    if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
        !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
        bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
        bbr->pacing_gain = BBR_UNIT;
        bbr->cwnd_gain = BBR_UNIT;
        bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
        bbr->probe_rtt_done_stamp = 0;
    }

    if (bbr->mode == BBR_PROBE_RTT) {  /* we are in PROBE_RTT mode */
        /* Ignore low rate samples during this mode. */
        tp->app_limited =
            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
        /* Maintain min packets in flight for max(200 ms, 1 round). */
        if (!bbr->probe_rtt_done_stamp &&
            tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
            bbr->probe_rtt_done_stamp = tcp_time_stamp +
                msecs_to_jiffies(bbr_probe_rtt_mode_ms);
            bbr->probe_rtt_round_done = 0;
            bbr->next_rtt_delivered = tp->delivered;
        } else if (bbr->probe_rtt_done_stamp) {
            if (bbr->round_start)
                bbr->probe_rtt_round_done = 1;
            if (bbr->probe_rtt_round_done &&
                after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
                bbr->min_rtt_stamp = tcp_time_stamp;
                bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
                bbr_reset_mode(sk);
            }
        }
    }
    bbr->idle_restart = 0;
}
7、Summary: BBR no longer treats packet loss as the congestion signal, and no longer maintains the congestion window with the AIMD (additive-increase, multiplicative-decrease) rule. Instead it continuously samples and estimates the maximum bandwidth and the minimum delay (the network topology is a black box to both endpoints and can never be tracked exactly in real time, so continuous sampling is the only option) and uses their product as the sending window. On top of that, BBR introduces a pacing rate that limits the sending speed and works together with cwnd to smooth out bursts!
References:
1、Google BBR source-code analysis: https://www.cnblogs.com/HadesBlog/p/13347418.html
2、BBR congestion control algorithm (video): https://www.bilibili.com/video/BV1iq4y1H7Zf/?spm_id_from=333.788.recommend_more_video.-1
3、Google's BBR paper (Chinese translation), "Congestion-Based Congestion Control": http://arthurchiao.art/blog/bbr-paper-zh/