1、From the birth of the Internet until roughly ten years ago, TCP congestion control relied on the classic algorithms: Reno, New Reno, BIC, CUBIC. They served low-bandwidth wired networks well for decades. As bandwidth grew and wireless communication became ubiquitous, these classics gradually stopped fitting the new environment:
- On wireless links (cellular, Wi-Fi), packets get corrupted over the air because of channel contention and similar effects. The network may not be congested at all; it is a plain transmission error, yet it gets misjudged as congestion!
- Network devices now carry much larger buffers and can hold far more packets. Congestion already exists once a buffer fills up, but at that moment nothing has been dropped yet (and the sender has not timed out), so if loss is the congestion signal, the sender wrongly concludes there is no congestion and reacts too late; exactly the opposite of the previous case!
The root cause is that traditional congestion control equates packet loss/corruption with network congestion. The consequences of this flawed assumption:
- The throughput of the whole network becomes a sawtooth: ramp up, then halve the congestion window (cwnd) once the threshold is hit or a packet is lost, ramp up again, halve again, over and over. This bandwidth oscillation keeps utilization and throughput low most of the time!
- Large end-to-end latency: the buffers of intermediate devices fill up and packets sit there queuing; since nothing has been dropped yet, the sender cannot tell that the path is congested
- Aggressiveness: such algorithms grab bandwidth from other flows, which hurts the network as a whole and leaves bandwidth unevenly shared
2、Given all these shortcomings of the traditional algorithms, what does BBR do instead? Fundamentally, congestion happens when the source sends so much data in a short time that it exceeds the buffers or the forwarding capacity of routers and other intermediate devices, so BBR attacks congestion along two lines:
(1) The source must not send faster than the bottleneck bandwidth, so that long queues (and with them congestion) never build up
- Reno and CUBIC send packets in bursts: 4, 8, ... segments at a time, which can momentarily fill a router's buffer and exceed the bottleneck link's bandwidth. The packets therefore have to be spaced out so that BtlBw is never momentarily exceeded; how is this spacing interval computed?
- The pacing gain pacing_gain takes values such as 1, 1.25 and 0.75; the spacing interval is simply packet.size / pacing_rate, i.e. next_send_time = now() + packet.size / pacing_rate (see the sketch below);
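To make the spacing concrete, here is a minimal C sketch of the pacing computation just described (the function name, the units and the use of floating point are my own, this is not BBR's actual code):

#include <stdint.h>

/* Illustrative only: next_send_time = now + packet.size / pacing_rate,
 * where pacing_rate = pacing_gain * BtlBw; all times are in microseconds. */
static uint64_t next_send_time_us(uint64_t now_us, uint32_t packet_bytes,
                                  uint64_t btlbw_bytes_per_sec, double pacing_gain)
{
    double pacing_rate = pacing_gain * (double)btlbw_bytes_per_sec;      /* bytes per second */
    uint64_t interval_us = (uint64_t)((double)packet_bytes * 1e6 / pacing_rate);
    return now_us + interval_us;
}

For example, with a 1500-byte packet, a 10 MB/s bottleneck and pacing_gain = 1, the interval is 150 us: the packets are spread evenly over the RTT instead of being blasted out back to back.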
(2) BDP = RTT * BtlBw. The unacknowledged in-flight data (inflight) from the source must not exceed the BDP; in other words, the total amount of data in transit on the round-trip path must stay below RTT * BtlBw
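To put a number on the BDP: with RTprop = 40 ms and BtlBw = 100 Mbit/s (12.5 MB/s), BDP = 0.04 s * 12.5e6 B/s = 500 KB, so the sender should keep at most about 500 KB in flight. A tiny illustrative snippet (the values and names here are mine, just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: BDP (bytes) = RTprop (seconds) * BtlBw (bytes/second). */
static uint64_t bdp_bytes(double rtprop_sec, uint64_t btlbw_bytes_per_sec)
{
    return (uint64_t)(rtprop_sec * (double)btlbw_bytes_per_sec);
}

int main(void)
{
    /* 40 ms propagation delay, 100 Mbit/s = 12.5e6 bytes/s bottleneck */
    printf("BDP = %llu bytes\n",
           (unsigned long long)bdp_bytes(0.040, 12500000));   /* prints 500000 */
    return 0;
}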
3、BBR's congestion control needs two variables: RTT (also called RTprop, the round-trip propagation time) and BtlBw (the bottleneck bandwidth), i.e. the propagation delay and the bottleneck bandwidth of the path. How are these two measured accurately?
(1) Definition of RTT: the time from when the source sends a packet until it receives the corresponding ACK, i.e. one full round trip. It is best measured while the connection is application-limited, concretely:
- During the handshake: very little data has been sent, so the path carries little traffic; the SYN/SYN+ACK round trip can be used as the RTT;
- After the handshake: if both sides only run an interactive application and exchange little data, the bottleneck link is still far from saturated, so the round trip of a data packet and its ACK can likewise be used as the RTT;
- What if both sides are sending at full throttle and the bottleneck link is saturated; how is RTT measured then? The only option is to periodically (say every 10 s up to a few minutes) spend about 2% of the time (roughly 200 ms to a few seconds) deliberately slowing down, so that the connection drops back into the application-limited regime and the RTT can be re-measured (this is also why BBR is comparatively fair and does not maliciously squeeze out the rest of the network)!
(2) Measuring the bottleneck bandwidth BtlBw: sample the delivery rate repeatedly while bandwidth-limited and take the maximum recent delivery rate as BtlBw. Concretely: after the connection is established, keep increasing the amount of inflight data; once the delivery rate fails to grow by 25% for three consecutive RTTs, the connection is considered bandwidth-limited. The measurement window should cover at least 6 RTTs, ideally around 10 RTTs (a sketch of the filters follows)!
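The two estimators can be pictured with the following simplified sketch (my own simplification, not the kernel's minmax code): BtlBw is the maximum delivery-rate sample over the last ~10 RTT rounds, kept per round so that stale rounds age out; RTprop is maintained symmetrically as a minimum over a ~10-second window.

#include <stdint.h>

/* Simplified windowed max filter: remember the best delivery-rate sample of
 * each of the last BW_WIN_ROUNDS RTT rounds and report the max over them as
 * the BtlBw estimate. (Sketch only; the kernel keeps just 3 samples.) */
#define BW_WIN_ROUNDS 10

struct bw_filter {
    uint64_t best[BW_WIN_ROUNDS];      /* best sample seen in round round_of[i] */
    uint32_t round_of[BW_WIN_ROUNDS];
};

static void bw_filter_add(struct bw_filter *f, uint32_t round, uint64_t bw_sample)
{
    uint32_t slot = round % BW_WIN_ROUNDS;
    if (f->round_of[slot] != round) {  /* a newer round reuses this slot: reset it */
        f->round_of[slot] = round;
        f->best[slot] = 0;
    }
    if (bw_sample > f->best[slot])
        f->best[slot] = bw_sample;
}

static uint64_t bw_filter_max(const struct bw_filter *f, uint32_t cur_round)
{
    uint64_t max = 0;
    for (int i = 0; i < BW_WIN_ROUNDS; i++)
        if (cur_round - f->round_of[i] < BW_WIN_ROUNDS && f->best[i] > max)
            max = f->best[i];          /* only rounds still inside the window count */
    return max;
}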
(3) Quite a few concepts have come up so far; the relationships among them are as follows:
- At first, while the source's sending rate has not yet filled the BDP, there is still spare capacity on the path: this is the application-limited regime (bluntly: the application isn't supplying enough data), so the RTT stays flat while the delivery rate of the whole path keeps climbing!
- Once inflight reaches the BDP but is still below BDP + BtlBufSize, the pipe itself is full but the bottleneck device's buffer is not yet: this is the bandwidth-limited regime (bluntly: bandwidth is the constraint). If the source keeps accelerating now, the RTT goes straight up, while the delivery rate cannot improve because there is no spare capacity left!
- If the source keeps firing at full blast until inflight exceeds BDP + BtlBufSize, both the pipe and the router buffers are full, and the router has no choice but to drop packets: this is the buffer-limited regime (bluntly: the buffer is the constraint)
- Congestion control, then, is about keeping the inflight data from exceeding the BDP! That is why BBR steers by the two variables RTT and BtlBw instead of the traditional crude "halve on loss" rule (the three regimes are summarized in the sketch below)!
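The three regimes can be summed up in a small sketch (the names are mine; BDP and the bottleneck buffer size are assumed to be known):

#include <stdint.h>

enum path_regime {
    APP_LIMITED,        /* inflight < BDP: pipe not full, RTT flat, delivery rate still rising */
    BANDWIDTH_LIMITED,  /* BDP <= inflight < BDP + BtlBufSize: pipe full, queue builds, RTT rises */
    BUFFER_LIMITED      /* inflight >= BDP + BtlBufSize: buffer overflows, the router must drop */
};

static enum path_regime classify(uint64_t inflight, uint64_t bdp, uint64_t btl_buf_size)
{
    if (inflight < bdp)
        return APP_LIMITED;
    if (inflight < bdp + btl_buf_size)
        return BANDWIDTH_LIMITED;
    return BUFFER_LIMITED;
}

BBR tries to operate right at the inflight ≈ BDP boundary, whereas loss-based algorithms only react at the BUFFER_LIMITED boundary, after the queue is already full.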
4、Everything above is the background and theory behind BBR; how is it actually implemented? The algorithm cycles through four states: STARTUP, DRAIN, PROBE_BW and PROBE_RTT!
(1) Startup: as the name says, this is the initial phase! Right after startup the two endpoints have exchanged little data and throughput is low. To fill the available bandwidth as quickly as possible, Startup implements a binary search for BtlBw: the sending rate is doubled each round using a gain of 2/ln2, so the pipe fills up very quickly! Once the delivery rate stops growing by 25% for three consecutive RTTs, the bandwidth-limited state has been reached and BtlBw is known (and, together with RTprop, the BDP = RTT * BtlBw); a simplified sketch of this "full pipe" check follows.
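The "three RTTs without 25% growth" check can be sketched as follows (modeled on the kernel's bbr_check_full_bw_reached, but the struct and names here are my own simplification):

#include <stdint.h>
#include <stdbool.h>

struct full_bw_state {
    uint64_t full_bw;        /* best per-round bandwidth seen so far */
    uint32_t full_bw_cnt;    /* consecutive rounds without ~25% growth */
    bool     full_bw_reached;
};

/* Call once per RTT round with that round's best bandwidth sample. */
static void check_full_bw_reached(struct full_bw_state *s, uint64_t round_max_bw)
{
    if (s->full_bw_reached)
        return;
    if (round_max_bw >= s->full_bw + s->full_bw / 4) {  /* still growing by >= 25% */
        s->full_bw = round_max_bw;
        s->full_bw_cnt = 0;
        return;
    }
    if (++s->full_bw_cnt >= 3)      /* three stalled rounds: the pipe is full, leave Startup */
        s->full_bw_reached = true;
}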
(2) Drain: the flooding of the Startup phase over-fills the path and throughput suffers, so the sender now gradually lowers its sending rate until inflight < BDP, avoiding congestion
(3) Probe_BW: once the queue built during Startup has been drained, inflight is basically stable; this is BBR's steady state. As the name says, this phase keeps probing for more bandwidth!
(4) Probe_RTT: routes may change while data is being transferred, so the previously measured RTT may no longer hold; BBR therefore has to step out of Probe_BW periodically. The method is simple: spend about 2% of the time dropping back into the application-limited regime and re-measure the RTT; because every flow re-measures the RTT this way, competing flows end up sharing the bandwidth relatively fairly! (the four states are summarized in the sketch below)
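Putting the four states together, the lifecycle of a BBR flow looks roughly like this (my own summary; the kernel encodes the same states in its enum bbr_mode):

/* Rough sketch of BBR's state machine, not kernel code:
 *
 *   STARTUP   --(bw stops growing 25% for 3 RTTs)-----------------> DRAIN
 *   DRAIN     --(inflight drained down to <= BDP)-----------------> PROBE_BW
 *   PROBE_BW  --(min RTT estimate older than ~10 s)---------------> PROBE_RTT
 *   PROBE_RTT --(>= 200 ms and one round with ~4 packets in flight)-> back to
 *               PROBE_BW (or STARTUP if the pipe was never filled)
 */
enum bbr_state { STARTUP, DRAIN, PROBE_BW, PROBE_RTT };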
5、At any moment an endpoint is either sending or receiving data. Google's paper gives pseudocode for what has to happen in each case!
(1) Actions taken when an ACK arrives:
function onAck(packet)
    rtt = now - packet.sendtime                      // receive time minus the send time recorded in the packet = RTT
    update_min_filter(RTpropFilter, rtt)             // update the RTT estimate
    delivered += packet.size
    delivered_time = now
    // compute the current actual delivery rate
    delivery_rate = (delivered - packet.delivered) / (delivered_time - packet.delivered_time)
    if (delivery_rate > BtlBwFilter.current_max      // the measured rate already exceeds the current BtlBw estimate, or
        || !packet.app_limited)                      // the sample is not app-limited (app-limited samples say nothing about BtlBw)
        update_max_filter(BtlBwFilter, delivery_rate)    // update the BtlBw estimate
    if (app_limited_until > 0)                       // bytes that can still be sent before the bottleneck bandwidth is reached
        app_limited_until = app_limited_until - packet.size
In short: every ACKed packet updates the RTT estimate, but only some packets update BtlBw!
(2) Actions taken when sending a packet:
function send(packet)
    bdp = BtlBwFilter.current_max * RTpropFilter.current_min    // compute the BDP
    if (inflight >= cwnd_gain * bdp)     // the data already in flight exceeds the allowed maximum
        return                           // just return; wait for the next ACK or a retransmission timeout
    // reaching here means inflight < cwnd_gain * bdp, i.e. the data in flight is below the pipe's capacity
    if (now >= next_send_time)
        packet = nextPacketToSend()
        if (!packet)                     // nothing to send
            app_limited_until = inflight     // update "bytes that can still be sent before the bottleneck capacity is reached"
            return
        packet.app_limited = (app_limited_until > 0)     // if some bytes can still be sent before hitting the bottleneck, we are app-limited
        packet.sendtime = now
        packet.delivered = delivered
        packet.delivered_time = delivered_time
        ship(packet)
        // the next send time; this pacing is what actually controls congestion
        next_send_time = now + packet.size / (pacing_gain * BtlBwFilter.current_max)
    // arm a timer that calls send() again when next_send_time expires
    timerCallbackAt(send, next_send_time)
In short: first check whether the inflight data already exceeds the (gain-scaled) BDP; if so, return immediately and end send(); if not, ship the next packet and re-arm the timer for the next send time!
6、Now let's look at Google's BBR source code, which lives in net/ipv4/tcp_bbr.c (I'm reading the Linux 4.9 sources)!
(1) Overview of BBR's key functions: remember the struct that every congestion-control algorithm registers? This is BBR's registration structure!
static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
    .flags          = TCP_CONG_NON_RESTRICTED,
    .name           = "bbr",
    .owner          = THIS_MODULE,
    .init           = bbr_init,
    .cong_control   = bbr_main,
    .sndbuf_expand  = bbr_sndbuf_expand,
    .undo_cwnd      = bbr_undo_cwnd,
    .cwnd_event     = bbr_cwnd_event,
    .ssthresh       = bbr_ssthresh,
    .tso_segs_goal  = bbr_tso_segs_goal,
    .get_info       = bbr_get_info,
    .set_state      = bbr_set_state,
};
(2) Since .cong_control points to bbr_main, that is clearly the entry point of the congestion-control algorithm!
static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw;

    bbr_update_model(sk, rs);
    bw = bbr_bw(sk);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
    bbr_set_tso_segs_goal(sk);
    bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}
Judging from the names of the functions called here, some estimate the bandwidth, some set the pacing_rate (which is how BBR throttles the sending speed and thus controls congestion), and some set the congestion window. Peeling back the layers of calls, the most important functions are the following:
(3) bbr_update_bw: estimate the bandwidth
/* Estimate the bandwidth based on how fast packets are delivered.
 * Overall: 1. advance the per-RTT round counter; 2. bandwidth sample =
 * delivered * BW_UNIT / sampling interval; 3. feed the sample into the
 * windowed max filter that tracks BtlBw. */
static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;

    bbr->round_start = 0;
    if (rs->delivered < 0 || rs->interval_us <= 0)
        return; /* Not a valid observation */

    /* See if we've reached the next RTT */
    if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
        bbr->next_rtt_delivered = tp->delivered;
        bbr->rtt_cnt++;
        bbr->round_start = 1;
        bbr->packet_conservation = 0;
    }

    bbr_lt_bw_sampling(sk, rs);

    /* Divide delivered by the interval to find a (lower bound) bottleneck
     * bandwidth sample. Delivered is in packets and interval_us in uS and
     * ratio will be <<1 for most connections. So delivered is first scaled.
     * This is the bandwidth computation. */
    bw = (u64)rs->delivered * BW_UNIT;
    do_div(bw, rs->interval_us);

    /* If this sample is application-limited, it is likely to have a very
     * low delivered count that represents application behavior rather than
     * the available network rate. Such a sample could drag down estimated
     * bw, causing needless slow-down. Thus, to continue to send at the
     * last measured network rate, we filter out app-limited samples unless
     * they describe the path bw at least as well as our bw model.
     *
     * So the goal during app-limited phase is to proceed with the best
     * network rate no matter how long. We automatically leave this
     * phase when app writes faster than the network can deliver :)
     */
    if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
        /* Incorporate new sample into our max bw filter:
         * the bw sample is fed into the windowed max filter */
        minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
    }
}
(4) bbr_set_pacing_rate: control how fast packets leave by setting pacing_rate:
/* Pace using current bw estimate and a gain factor. In order to help drive the
 * network toward lower queues while maintaining high utilization and low
 * latency, the average pacing rate aims to be slightly (~1%) lower than the
 * estimated bandwidth. This is an important aspect of the design. In this
 * implementation this slightly lower pacing rate is achieved implicitly by not
 * including link-layer headers in the packet size used for the pacing rate.
 */
static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u64 rate = bw;

    rate = bbr_rate_bytes_per_sec(sk, rate, gain);
    rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    if (bbr->mode != BBR_STARTUP || rate > sk->sk_pacing_rate)
        sk->sk_pacing_rate = rate;
}
(5) bbr_update_min_rtt: update the minimum RTT
/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
 * periodically drain the bottleneck queue, to converge to measure the true
 * min_rtt (unloaded propagation delay). This allows the flows to keep queues
 * small (reducing queuing delay and packet loss) and achieve fairness among
 * BBR flows.
 *
 * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
 * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
 * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
 * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
 * re-enter the previous mode. BBR uses 200ms to approximately bound the
 * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
 *
 * Note that flows need only pay 2% if they are busy sending over the last 10
 * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
 * natural silences or low-rate periods within 10 seconds where the rate is low
 * enough for long enough to drain its queue in the bottleneck. We pick up
 * these min RTT measurements opportunistically with our min_rtt filter. :-)
 */
static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    bool filter_expired;

    /* Track min RTT seen in the min_rtt_win_sec filter window: */
    filter_expired = after(tcp_time_stamp,
                           bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
    if (rs->rtt_us >= 0 &&
        (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
        bbr->min_rtt_us = rs->rtt_us;
        bbr->min_rtt_stamp = tcp_time_stamp;
    }

    if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
        !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
        bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
        bbr->pacing_gain = BBR_UNIT;
        bbr->cwnd_gain = BBR_UNIT;
        bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
        bbr->probe_rtt_done_stamp = 0;
    }

    if (bbr->mode == BBR_PROBE_RTT) {  /* we are in PROBE_RTT mode */
        /* Ignore low rate samples during this mode. */
        tp->app_limited =
            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
        /* Maintain min packets in flight for max(200 ms, 1 round). */
        if (!bbr->probe_rtt_done_stamp &&
            tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
            bbr->probe_rtt_done_stamp = tcp_time_stamp +
                msecs_to_jiffies(bbr_probe_rtt_mode_ms);
            bbr->probe_rtt_round_done = 0;
            bbr->next_rtt_delivered = tp->delivered;
        } else if (bbr->probe_rtt_done_stamp) {
            if (bbr->round_start)
                bbr->probe_rtt_round_done = 1;
            if (bbr->probe_rtt_round_done &&
                after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
                bbr->min_rtt_stamp = tcp_time_stamp;
                bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
                bbr_reset_mode(sk);
            }
        }
    }
    bbr->idle_restart = 0;
}
7、Summary: BBR no longer treats packet loss as the congestion signal, and no longer maintains the congestion window with the AIMD (additive-increase, multiplicative-decrease) rule. Instead it continuously samples and estimates the maximum bandwidth and the minimum delay (the network topology is a black box to both endpoints and can never be tracked exactly in real time, so continuous sampling is the only option) and uses their product as the sending window. On top of that, BBR introduces a pacing rate that limits the sending speed and works together with cwnd to smooth out bursts!
References:
1、Google BBR source-code analysis: https://www.cnblogs.com/HadesBlog/p/13347418.html
2、BBR congestion control algorithm (video): https://www.bilibili.com/video/BV1iq4y1H7Zf/?spm_id_from=333.788.recommend_more_video.-1
3、Google's BBR paper (Chinese translation), "Congestion-Based Congestion Control": http://arthurchiao.art/blog/bbr-paper-zh/