How Linux Handles the TIME_WAIT and FIN_WAIT_2 States


  1. The analysis uses the 3.10 kernel as the baseline; 4.1+ kernels changed how FIN_WAIT_2 is handled, which is covered later.
  2. The code excerpts are moderately trimmed.

TL;DR

  • The TIME_WAIT timeout in Linux TCP defaults to 60 seconds and cannot be changed.
  • FIN_WAIT_2 and TIME_WAIT share one implementation in Linux TCP.
  • The FIN_WAIT_2 timeout can be tuned via tcp_fin_timeout.
  • The mechanism behind tcp_fin_timeout changed between the 3.10 and 4.1+ kernels.
  • Both tcp_tw_reuse and tcp_tw_recycle require timestamps to be enabled, and both are unfriendly to NAT.
  • A 4.3+ kernel is recommended; see the end for parameter settings.


Figure 1. TCP state machine

Source Code Walkthrough

Entry Point

tcp_input.c#tcp_fin serves as the entry point for this analysis: when the side that initiated the close receives the FIN sent by the other side, it enters TIME_WAIT for further processing.

link:linux/net/ipv4/tcp_input.c

/*
 *  /net/ipv4/tcp_input.c
 *  ...
 *	If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
 */
static void tcp_fin(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	inet_csk_schedule_ack(sk);

	sk->sk_shutdown |= RCV_SHUTDOWN;
	sock_set_flag(sk, SOCK_DONE);

	switch (sk->sk_state) {
	case TCP_SYN_RECV:
	case TCP_ESTABLISHED:
		...
	case TCP_CLOSE_WAIT:
	case TCP_CLOSING:
		...
	case TCP_LAST_ACK:
		...
	case TCP_FIN_WAIT1:
		...
	case TCP_FIN_WAIT2:
		/* Received the FIN from the passively-closing side -- send an ACK and move to TIME_WAIT */
		tcp_send_ack(sk);
		tcp_time_wait(sk, TCP_TIME_WAIT, 0);
		break;
	default:
		/* Only TCP_LISTEN and TCP_CLOSE are left, in these
		 * cases we should never reach this piece of code.
		 */
		pr_err("%s: Impossible, sk->sk_state=%d\n",
		       __func__, sk->sk_state);
		break;
	}
	...
}

Handling TIME_WAIT

tcp_minisocks.c handles state recycling, caps the number of time-wait buckets, and so on. The conclusions up front:

a) net.ipv4.tcp_tw_recycle only gives fast recycling when net.ipv4.tcp_timestamps is enabled as well.
b) For a connection in TIME_WAIT, the cleanup time is the fixed default of 60 s and cannot be changed.
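
Conclusion (b) comes straight from the fact that TCP_TIMEWAIT_LEN is a compile-time constant rather than a sysctl; the (trimmed) definition in include/net/tcp.h looks roughly like this:

/* include/net/tcp.h (trimmed) */
#define TCP_TIMEWAIT_LEN (60*HZ)          /* how long to wait to destroy TIME-WAIT state, ~60 s */
#define TCP_FIN_TIMEOUT  TCP_TIMEWAIT_LEN /* BSD-style FIN_WAIT2 deadlock breaker               */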

link:linux/net/ipv4/tcp_minisocks.c

// The tcp_death_row structure
/*
 * tcp_death_row uses two reaping mechanisms: sockets with a longer timeout are
 * queued on the tw_timer, while sockets with a shorter timeout are queued on
 * the twcal_timer.
 * The tw_timer granularity is TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS = 7.5 s.
 * The twcal_timer granularity is not a fixed value but is derived from HZ
 * (250 on the 3.10 kernel in question): one slot is 1 << INET_TWDR_RECYCLE_TICK
 * jiffies, i.e. (1 << 5) = 32 jiffies, roughly 32/250 ≈ 1/8 s.
 * The actual dispatch happens in inet_twsk_schedule().
 */
struct inet_timewait_death_row tcp_death_row = {
	.sysctl_max_tw_buckets = NR_FILE * 2,
	.period		= TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
	.death_lock	= __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
	.hashinfo	= &tcp_hashinfo,
	.tw_timer	= TIMER_INITIALIZER(inet_twdr_hangman, 0,
					    (unsigned long)&tcp_death_row),
	.twkill_work	= __WORK_INITIALIZER(tcp_death_row.twkill_work,
					     inet_twdr_twkill_work),
/* Short-time timewait calendar */

	.twcal_hand	= -1,
	.twcal_timer	= TIMER_INITIALIZER(inet_twdr_twcal_tick, 0,
					    (unsigned long)&tcp_death_row),
};
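
As a sanity check of the numbers in the comment above, the two granularities can be recomputed with a few lines of userspace C. The macro values below are assumptions matching the scenario described there (HZ=250, INET_TWDR_TWKILL_SLOTS=8, INET_TWDR_RECYCLE_TICK=5):

#include <stdio.h>

#define HZ                      250
#define TCP_TIMEWAIT_LEN        (60 * HZ)
#define INET_TWDR_TWKILL_SLOTS  8
#define INET_TWDR_RECYCLE_TICK  5

int main(void)
{
	/* Slow wheel (tw_timer): seconds per slot. */
	printf("tw_timer granularity:    %.2f s\n",
	       (double)TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS / HZ); /* 7.50 s  */

	/* Fast wheel (twcal_timer): one slot is 1 << INET_TWDR_RECYCLE_TICK jiffies. */
	printf("twcal_timer granularity: %.3f s\n",
	       (double)(1 << INET_TWDR_RECYCLE_TICK) / HZ);             /* 0.128 s */
	return 0;
}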

/*
 * /net/ipv4/tcp_minisocks.c
 * Move the socket into the time-wait or fin-wait-2 state
 */
void tcp_time_wait(struct sock *sk, int state, int timeo)
{
	struct inet_timewait_sock *tw = NULL;
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct tcp_sock *tp = tcp_sk(sk);
	bool recycle_ok = false;

  // If recycle is enabled and a timestamp was seen on the connection, set recycle_ok = true to prepare for fast recycling below
	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp)
		recycle_ok = tcp_remember_stamp(sk);

  // If the number of sockets currently waiting in time-wait is below the configured
  // bucket limit, allocate a tw sock for this connection and queue it for processing
	if (tcp_death_row.tw_count < tcp_death_row.sysctl_max_tw_buckets)
		tw = inet_twsk_alloc(sk, state);

	if (tw != NULL) {
		struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
		const int rto = (icsk->icsk_rto << 2) - (icsk->icsk_rto >> 1); // 3.5*RTO
		struct inet_sock *inet = inet_sk(sk);

		tw->tw_transparent	= inet->transparent;
		tw->tw_mark		= sk->sk_mark;
		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
		tcptw->tw_snd_nxt	= tp->snd_nxt;
		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);
		tcptw->tw_ts_recent	= tp->rx_opt.ts_recent;
		tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
		tcptw->tw_ts_offset	= tp->tsoffset;
		tcptw->tw_last_oow_ack_time = 0;

... ifdef endif...

		/* Get the TIME_WAIT timeout firing. */
    // The timeo argument is 0 when we come from tcp_fin(), so it gets bumped up to 3.5*RTO here
		if (timeo < rto)
			timeo = rto;

    // With recycling enabled the timeout becomes 3.5*RTO; otherwise it is set to 60 s, and when
    // the state is TIME_WAIT, timeo is also forced to 60 s -- that is the time used by the
    // processing below.
    // Note: 3.5*RTO is normally far below the configured 60 s, unless network jitter or hardware
    // trouble forces repeated retransmission timeouts on the connection
		if (recycle_ok) {
			tw->tw_timeout = rto;
		} else {
			tw->tw_timeout = TCP_TIMEWAIT_LEN;
			if (state == TCP_TIME_WAIT)
				timeo = TCP_TIMEWAIT_LEN;
		}

		/* Linkage updates. */
		__inet_twsk_hashdance(tw, sk, &tcp_hashinfo);

    // Two kinds of timers:
    // 1. the 60 s timer defined by TCP_TIMEWAIT_LEN
    // 2. the timer whose timeout is 3.5*RTO
		inet_twsk_schedule(tw, &tcp_death_row, timeo,
				   TCP_TIMEWAIT_LEN);
		inet_twsk_put(tw);
	} else {
		/* Sorry, if we're out of memory, just CLOSE this
		 * socket up.  We've got bigger problems than
		 * non-graceful socket closings.
		 */
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
	}

	tcp_update_metrics(sk);
	tcp_done(sk);
}
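
The rto line above uses shifts to compute 3.5*RTO: (rto << 2) - (rto >> 1) = 4*RTO - RTO/2. A minimal check, using an illustrative RTO of 200 ms on an HZ=250 kernel (the RTO value itself is just an example):

#include <stdio.h>

int main(void)
{
	const int rto   = 50;                        /* example RTO: 200 ms at HZ=250 */
	const int timeo = (rto << 2) - (rto >> 1);   /* 4*RTO - RTO/2 = 3.5*RTO       */

	printf("3.5 * RTO = %d jiffies (%d ms)\n", timeo, timeo * 1000 / 250);
	/* prints: 3.5 * RTO = 175 jiffies (700 ms) */
	return 0;
}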

TIME_WAIT timer wheel processing

A slot is computed from the timeo value; based on the result, the socket is placed on one of the two timers to wait for cleanup.

/*
* /net/ipv4/inet_timewait_sock.c
*/
void inet_twsk_schedule(struct inet_timewait_sock *tw,
		       struct inet_timewait_death_row *twdr,
		       const int timeo, const int timewait_len)
{
	struct hlist_head *list;
	unsigned int slot;

	/* timeout := RTO * 3.5
	 *
	 * 3.5 = 1+2+0.5 to wait for two retransmits.
	 *
	 * RATIONALE: if FIN arrived and we entered TIME-WAIT state,
	 * our ACK acking that FIN can be lost. If N subsequent retransmitted
	 * FINs (or previous seqments) are lost (probability of such event
	 * is p^(N+1), where p is probability to lose single packet and
	 * time to detect the loss is about RTO*(2^N - 1) with exponential
	 * backoff). Normal timewait length is calculated so, that we
	 * waited at least for one retransmitted FIN (maximal RTO is 120sec).
	 * [ BTW Linux. following BSD, violates this requirement waiting
	 *   only for 60sec, we should wait at least for 240 secs.
	 *   Well, 240 consumes too much of resources 8)
	 * ]
	 * This interval is not reduced to catch old duplicate and
	 * responces to our wandering segments living for two MSLs.
	 * However, if we use PAWS to detect
	 * old duplicates, we can reduce the interval to bounds required
	 * by RTO, rather than MSL. So, if peer understands PAWS, we
	 * kill tw bucket after 3.5*RTO (it is important that this number
	 * is greater than TS tick!) and detect old duplicates with help
	 * of PAWS.
	 */
  // Compute the slot number from the timeout value
	slot = (timeo + (1 << INET_TWDR_RECYCLE_TICK) - 1) >> INET_TWDR_RECYCLE_TICK;

	spin_lock(&twdr->death_lock);

	/* Unlink it, if it was scheduled */
	if (inet_twsk_del_dead_node(tw))
		twdr->tw_count--;
	else
		atomic_inc(&tw->tw_refcnt);
  
  // If the computed slot reaches INET_TWDR_RECYCLE_SLOTS (1 << 5 = 32) or more, it goes to the slow timer; otherwise it goes to the fast timer
	if (slot >= INET_TWDR_RECYCLE_SLOTS) {
		/* Schedule to slow timer */
		if (timeo >= timewait_len) {
			slot = INET_TWDR_TWKILL_SLOTS - 1;
		} else {
			slot = DIV_ROUND_UP(timeo, twdr->period);
			if (slot >= INET_TWDR_TWKILL_SLOTS)
				slot = INET_TWDR_TWKILL_SLOTS - 1;
		}
		tw->tw_ttd = inet_tw_time_stamp() + timeo;
		slot = (twdr->slot + slot) & (INET_TWDR_TWKILL_SLOTS - 1);
		list = &twdr->cells[slot];
	} else {
		tw->tw_ttd = inet_tw_time_stamp() + (slot << INET_TWDR_RECYCLE_TICK);

		if (twdr->twcal_hand < 0) {
			twdr->twcal_hand = 0;
			twdr->twcal_jiffie = jiffies;
			twdr->twcal_timer.expires = twdr->twcal_jiffie +
					      (slot << INET_TWDR_RECYCLE_TICK);
			add_timer(&twdr->twcal_timer);
		} else {
			if (time_after(twdr->twcal_timer.expires,
				       jiffies + (slot << INET_TWDR_RECYCLE_TICK)))
				mod_timer(&twdr->twcal_timer,
					  jiffies + (slot << INET_TWDR_RECYCLE_TICK));
			slot = (twdr->twcal_hand + slot) & (INET_TWDR_RECYCLE_SLOTS - 1);
		}
		list = &twdr->twcal_row[slot];
	}

	hlist_add_head(&tw->tw_death_node, list);

	if (twdr->tw_count++ == 0)
		mod_timer(&twdr->tw_timer, jiffies + twdr->period);
	spin_unlock(&twdr->death_lock);
}
EXPORT_SYMBOL_GPL(inet_twsk_schedule);
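
To make the branch above concrete, here is a small worked example of the slot math, under the same assumptions as before (HZ=250, INET_TWDR_RECYCLE_TICK=5, INET_TWDR_RECYCLE_SLOTS=32, TCP_TIMEWAIT_LEN=60*HZ; the RTO value is illustrative only). A recycled socket with timeo = 3.5*RTO lands on the fast twcal_timer, while the default 60 s timeout lands on the slow tw_timer:

#include <stdio.h>

#define HZ                       250
#define TCP_TIMEWAIT_LEN         (60 * HZ)
#define INET_TWDR_RECYCLE_TICK   5
#define INET_TWDR_RECYCLE_SLOTS  32

static void which_timer(int timeo)
{
	unsigned int slot = (timeo + (1 << INET_TWDR_RECYCLE_TICK) - 1)
				>> INET_TWDR_RECYCLE_TICK;

	printf("timeo = %5d jiffies -> slot = %3u -> %s\n", timeo, slot,
	       slot >= INET_TWDR_RECYCLE_SLOTS ? "slow timer (tw_timer)"
					       : "fast timer (twcal_timer)");
}

int main(void)
{
	which_timer(175);              /* 3.5*RTO for a 200 ms RTO: slot 6, fast timer */
	which_timer(TCP_TIMEWAIT_LEN); /* default 60 s: slot 469, slow timer           */
	return 0;
}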

What Changed in 4.1+ Kernels

The 4.1 kernel reworked the TIME_WAIT handling logic; see the commit referenced below for the full change. A rough translation of the commit message follows.

tcp/dccp: get rid of the central time-wait timer

Some 15 years ago, when memory was expensive and machines had a single CPU, using one timer wheel for timewait sockets was a reasonable choice. But it does not scale, the code is ugly, and it is a source of huge latencies (CPUs are routinely seen spinning on the death_lock spinlock for up to 30 ms). We can now afford an extra 64 bytes per timewait socket and spread the timewait load over all CPUs for much better behavior.

Test results:
In the tests below, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1 on the server side (lpaa24).

Before the patch:
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171
While the test is running, latencies of 25 to 33 ms can be observed:

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After the patch:
Throughput is up by about 90%:

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

Even though network utilization increased by more than 90%, latency stays very low:

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

commit:789f558cfb3680aeb52de137418637f6b04b7d22

link:v4.1/net/ipv4/inet_timewait_sock.c

void inet_twsk_schedule(struct inet_timewait_sock *tw, const int timeo)
{
	tw->tw_kill = timeo <= 4*HZ;
	if (!mod_timer_pinned(&tw->tw_timer, jiffies + timeo)) {
		atomic_inc(&tw->tw_refcnt);
		atomic_inc(&tw->tw_dr->tw_count);
	}
}
EXPORT_SYMBOL_GPL(inet_twsk_schedule);

This change was then revised in 4.3; see the commit referenced below for details. A rough translation of that commit message:

When creating a timewait socket, the timer has to be armed before other CPUs are allowed to find the socket. The signal that lets CPUs find the socket is tw_refcnt being set to a non-zero value.

Since tw_refcnt is set in __inet_twsk_hashdance(), inet_twsk_schedule() therefore has to be called first.

This also means the tw_refcnt changes have to be removed from inet_twsk_schedule() and handled by the caller instead.

Note that because mod_timer_pinned() is used and the code runs in BH context, the timer is guaranteed not to expire before tw_refcnt has been set.

To make things more readable, the inet_twsk_reschedule() helper was introduced.

When rearming the timer, mod_timer_pending() can be used to make sure a timer that was already cancelled is not rearmed.

Note: this bug can only trigger if packets of a flow can hit multiple CPUs, which normally does not happen unless flow steering is somehow broken. The bug was found roughly five months after the change went in.

SYN_RECV sockets in reqsk_queue_hash_req() need a similar fix, which will be provided in a separate patch for proper tracking.

commit:ed2e923945892a8372ab70d2f61d364b0b6d9054

link:v4.3/net/ipv4/inet_timewait_sock.c#L222

void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
{
	if (!rearm) {
		BUG_ON(mod_timer_pinned(&tw->tw_timer, jiffies + timeo));
		atomic_inc(&tw->tw_dr->tw_count);
	} else {
		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
	}
}

In short: CPU utilization improves and throughput goes up. A 4.3+ kernel is recommended.

Common Tunables

net.ipv4.tcp_tw_reuse

Conditions for reusing a TIME_WAIT connection:

  • tcp_timestamps = 1 (enabled).
  • tcp_tw_reuse = 1 (enabled).
  • The timestamp of the new connection is greater than the timestamp recorded for the previous connection.
  • The socket has been in TIME_WAIT for more than 1 second: get_seconds() - tcptw->tw_ts_recent_stamp > 1.

Connection types that can be reused: only outbound (outgoing) connections; inbound connections are never reused.

What "safe" means here:

  • TIME_WAIT prevents stray retransmitted segments from being wrongly accepted by a later connection; with the timestamp mechanism in place, such duplicate segments are simply dropped.
  • TIME_WAIT also keeps the passive-close side from being stuck in LAST_ACK forever when the final ACK from the active-close side is lost (for example dropped because of network delay), which would prevent it from closing the connection cleanly. To cover this, the side in LAST_ACK keeps retransmitting its FIN, and the side in TIME_WAIT stays around long enough to re-ACK it.

The reuse check itself is implemented in tcp_twsk_unique() (net/ipv4/tcp_ipv4.c):
int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
{
	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
	struct tcp_sock *tp = tcp_sk(sk);

	/* With PAWS, it is safe from the viewpoint
	   of data integrity. Even without PAWS it is safe provided sequence
	   spaces do not overlap i.e. at data rates <= 80Mbit/sec.

	   Actually, the idea is close to VJ's one, only timestamp cache is
	   held not per host, but per port pair and TW bucket is used as state
	   holder.

	   If TW bucket has been already destroyed we fall back to VJ's scheme
	   and use initial timestamp retrieved from peer table.
	 */
    // Requires the timestamp option to be enabled
	if (tcptw->tw_ts_recent_stamp &&
	    (twp == NULL || (sysctl_tcp_tw_reuse &&
			     get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
		tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
		if (tp->write_seq == 0)
			tp->write_seq = 1;
		tp->rx_opt.ts_recent	   = tcptw->tw_ts_recent;
		tp->rx_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
		sock_hold(sktw);
		return 1;
	}

	return 0;
}
EXPORT_SYMBOL_GPL(tcp_twsk_unique);

net.ipv4.tcp_tw_recycle

See the analysis in the "Handling TIME_WAIT" section above.

Enabling tw_recycle is not recommended. In fact, the net.ipv4.tcp_tw_recycle parameter was removed altogether in Linux kernel 4.12; see the corresponding commit.

tcp_max_tw_buckets

Caps the number of TIME_WAIT sockets. Its purpose is to fend off simple DoS attacks, so do not lower it without good reason. If it is shrunk, the system destroys the excess TIME_WAIT sockets and the log shows: "TCP: time wait bucket table overflow".

How to Choose the Right Values

For the analysis of timer granularity, see reference [3].

4.1 kernel

  1. tcp_fin_timeout <= 3: the FIN_WAIT_2 timeout equals tcp_fin_timeout.
  2. 3 < tcp_fin_timeout <= 60: the FIN_WAIT_2 timeout is tcp_fin_timeout plus a timer-granularity error (in units of roughly 7 seconds).
  3. tcp_fin_timeout > 60: the socket first goes through a keepalive phase lasting tmo = tcp_fin_timeout - 60, then a timewait phase lasting (tcp_fin_timeout - 60) plus a timer-granularity error; depending on the value of (tcp_fin_timeout - 60), that error falls into one of the two granularity ranges above (units of 1/8 second or of roughly 7 seconds).

4.3+ kernel

  1. tcp_fin_timeout <= 60: the FIN_WAIT_2 timeout equals tcp_fin_timeout.
  2. tcp_fin_timeout > 60: the socket first goes through a keepalive phase lasting tmo = tcp_fin_timeout - 60, then a timewait phase of the same length tmo = tcp_fin_timeout - 60 (the dispatch behind this split is sketched below).
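
The keepalive/timewait split in the two lists above reflects how the kernel hands a dead FIN_WAIT_2 socket over: when the effective timeout exceeds TCP_TIMEWAIT_LEN (60 s), the excess is first waited out on the keepalive timer (see the FIN_WAIT_2 branch of tcp_close() in net/ipv4/tcp.c) before tcp_time_wait(sk, TCP_FIN_WAIT2, tmo) is called. The following userspace sketch models only that arithmetic, under the assumptions of the lists above; it is an illustration of the rule, not kernel code:

#include <stdio.h>

/* Model of how tcp_fin_timeout is split between the keepalive phase and the
 * timewait phase for a dead FIN_WAIT_2 socket (the granularity error on
 * pre-4.1 kernels is ignored here). */
#define TCP_TIMEWAIT_LEN_SEC 60

static void fin_wait2_split(int tcp_fin_timeout)
{
	int keepalive = 0, timewait;

	if (tcp_fin_timeout > TCP_TIMEWAIT_LEN_SEC) {
		/* The excess over 60 s is served by the keepalive timer first,
		 * then the socket enters the timewait machinery for the same
		 * excess. */
		keepalive = tcp_fin_timeout - TCP_TIMEWAIT_LEN_SEC;
		timewait  = tcp_fin_timeout - TCP_TIMEWAIT_LEN_SEC;
	} else {
		/* <= 60 s: handed straight to the timewait machinery. */
		timewait = tcp_fin_timeout;
	}
	printf("tcp_fin_timeout = %2d -> keepalive %2d s + timewait %2d s\n",
	       tcp_fin_timeout, keepalive, timewait);
}

int main(void)
{
	fin_wait2_split(30);   /* -> keepalive  0 s + timewait 30 s */
	fin_wait2_split(60);   /* -> keepalive  0 s + timewait 60 s */
	fin_wait2_split(90);   /* -> keepalive 30 s + timewait 30 s */
	return 0;
}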

References

[1] An analysis of the essentials of the Linux TCP FIN_WAIT_2/TIME_WAIT states, https://blog.csdn.net/dog250/article/details/81582604

[2] Fast recycling and reuse of TCP TIME_WAIT, https://blog.csdn.net/dog250/article/details/13760985

[3] A study of the tcp_fin_timeout parameter prompted by optimizing the FIN_WAIT_2 timeout, https://www.talkwithtrend.com/Article/251641

[4] TCP TIME_WAIT explained, https://www.zhuxiaodong.net/2018/tcp-time-wait-instruction/

