tcp_rcv_established() splits segment processing into two classes: the fast path and the slow path, whose names say it all. The point of the split is to speed up processing: in the normal case segments arrive in order and the network is stable, so the fast path can place the segment directly on the receive queue. Every other case falls back to the slow path.
In the stack this is implemented with header prediction: every TCP sock carries a pred_flags member, and that field is the basis for the decision.
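The prediction word is rebuilt whenever the fast path is (re-)enabled. A sketch of the helpers, roughly as they appear in include/net/tcp.h: pred_flags packs the expected data offset, the ACK flag and the currently advertised send window, which is exactly what the "pred_flags is 0xS?10 << 16 + snd_wnd" comment in tcp_rcv_established() describes.

static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
{
        /* doff (header length in 32-bit words) | ACK flag | advertised window */
        tp->pred_flags = htonl((tp->tcp_header_len << 26) |
                               ntohl(TCP_FLAG_ACK) |
                               snd_wnd);
}

static inline void tcp_fast_path_on(struct tcp_sock *tp)
{
        __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
}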
Taking the fast path implies the following:
- Either the data transfer is taking place in only one direction (we are the receiver and are not transmitting any data), or, if we are also sending data, the window advertised by the other end is constant. The latter means we have not transmitted anything of our own for quite some time but keep receiving data from the other end, whose advertised receive window stays constant.
- Other than the PSH|ACK flags, no other flag is set in the TCP header (ACK is set on every TCP segment). If any other flag is set, such as URG, FIN, SYN, ECN, RST or CWR, something important needs attention and we must move into the slow path.
- The header length is unchanged. If the TCP header length has not changed, no TCP option has been added or removed, and together with the two conditions above we can safely assume there is nothing special to attend to.
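The flag and header-length checks are folded into a single 32-bit comparison: tcp_flag_word() picks out the header word that holds the data offset, flags and window, and TCP_HP_BITS masks off the reserved bits and PSH before that word is compared against pred_flags. A sketch of the relevant definitions, roughly as they appear in include/linux/tcp.h and include/net/tcp.h:

union tcp_word_hdr {
        struct tcphdr hdr;
        __be32        words[5];
};

/* the 4th 32-bit word of the header: doff, flags and window */
#define tcp_flag_word(tp)   (((union tcp_word_hdr *)(tp))->words[3])

#define TCP_RESERVED_BITS   0x0F000000U
/* everything except the reserved bits and PSH takes part in header prediction */
#define TCP_HP_BITS         (~(TCP_RESERVED_BITS | TCP_FLAG_PSH))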
Triggers that move us from the fast path into the slow path (pred_flags is cleared to 0 once we enter the slow path; the excerpts after this list show where that happens):
1 An out-of-order segment is received in tcp_data_queue.
2 We run out of buffer space and start dropping packets in tcp_prune_queue.
3 An urgent pointer is encountered in tcp_urgent_check.
4 The window we advertise in tcp_select_window drops to 0.
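Abridged, hedged excerpts (from net/ipv4/tcp_input.c and net/ipv4/tcp_output.c of roughly this kernel generation) showing two of the spots where header prediction is switched off:

/* tcp_data_queue_ofo(): an out-of-order segment arrived */
        /* Disable header prediction. */
        tp->pred_flags = 0;
        inet_csk_schedule_ack(sk);

/* tcp_select_window(): we are about to advertise a zero window */
        if (new_win == 0) {
                /* zero-window probing is only handled in the slow path */
                tp->pred_flags = 0;
                ...
        }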
Triggers that move us from the slow path back to the fast path:
1 When we have read past an urgent byte in tcp_recvmsg(). Once an urgent byte is signalled we stay in slow-path mode until it has been received, because urgent data is handled in the slow path of tcp_rcv_established().
2 When the out-of-order queue in tcp_data_queue has been drained because the gap was filled, tcp_fast_path_check runs (see the sketch after this list).
3 When tcp_ack_update_window() updates the advertised window.
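The re-enable check itself is small. A sketch of tcp_fast_path_check(), roughly as in include/net/tcp.h of kernels in which the out-of-order queue is an rbtree: header prediction is turned back on only when the out-of-order queue is empty, the receive window is non-zero, there is receive buffer left and no urgent data is pending.

static inline void tcp_fast_path_check(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
            tp->rcv_wnd &&
            atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
            !tp->urg_data)
                tcp_fast_path_on(tp);	/* rebuild pred_flags, see above */
}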
/*
 *	TCP receive function for the ESTABLISHED state.
 *
 *	It is split into a fast path and a slow path. The fast path is
 *	disabled when:
 *	- A zero window was announced from us - zero window probing
 *	  is only handled properly in the slow path.
 *	- Out of order segments arrived.
 *	- Urgent data is expected.
 *	- There is no buffer space left
 *	- Unexpected TCP flags/window values/header lengths are received
 *	  (detected by checking the TCP header against pred_flags)
 *	- Data is sent in both directions. Fast path only supports pure senders
 *	  or pure receivers (this means either the sequence number or the ack
 *	  value must stay constant)
 *	- Unexpected TCP option.
 *
 *	When these conditions are not satisfied it drops into a standard
 *	receive procedure patterned after RFC793 to handle all cases.
 *	The first three cases are guaranteed by proper pred_flags setting,
 *	the rest is checked inline. Fast processing is turned on in
 *	tcp_data_queue when everything is OK.
 */
void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
			 const struct tcphdr *th, unsigned int len)
{
	struct tcp_sock *tp = tcp_sk(sk);

	skb_mstamp_get(&tp->tcp_mstamp);
	if (unlikely(!sk->sk_rx_dst))	/* no cached route: set it up again */
		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
	/*
	 *	Header prediction.
	 *	The code loosely follows the one in the famous
	 *	"30 instruction TCP receive" Van Jacobson mail.
	 *
	 *	Van's trick is to deposit buffers into socket queue
	 *	on a device interrupt, to call tcp_recv function
	 *	on the receive process context and checksum and copy
	 *	the buffer to user space. smart...
	 *
	 *	Our current scheme is not silly either but we take the
	 *	extra cost of the net_bh soft interrupt processing...
	 *	We do checksum and copy also but from device to kernel.
	 */

	tp->rx_opt.saw_tstamp = 0;

	/*	pred_flags is 0xS?10 << 16 + snd_wnd
	 *	if header_prediction is to be made
	 *	'S' will always be tp->tcp_header_len >> 2
	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
	 *	turn it off	(when there are holes in the receive
	 *	 space for instance)
	 *	PSH flag is ignored.
	 */

	/* Fast-path check: header/window match pred_flags, the sequence number
	 * is the one we expect next, and the ACK does not acknowledge data we
	 * have not sent yet.
	 * TCP_HP_BITS simply masks the PSH flag out of the comparison.  Only
	 * when header prediction succeeds AND the segment arrives in order
	 * (its first sequence number is the next one we expect) do we take
	 * the fast path.
	 */
	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
		int tcp_header_len = tp->tcp_header_len;	/* TCP header length */

		/* Timestamp header prediction: tcp_header_len
		 * is automatically equal to th->doff*4 due to pred_flags
		 * match.
		 */

		/* Check timestamp: the timestamp option is present */
		if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
			/* No? Slow path!  Parsing the aligned timestamp option failed. */
			if (!tcp_parse_aligned_timestamp(tp, th))
				goto slow_path;

			/* If PAWS failed, check it more carefully in slow path.
			 * The segment carries the expected sequence number, but its
			 * timestamp (rcv_tsval, the timestamp of the newly received
			 * segment) is older than ts_recent (the most recent timestamp
			 * previously seen from the peer).  That suggests sequence
			 * wrap-around, so go to the slow path and let the full PAWS
			 * check decide.
			 */
			if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
				goto slow_path;

			/* DO NOT update ts_recent here, if checksum fails
			 * and timestamp was corrupted part, it will result
			 * in a hung connection since we will drop all
			 * future packets due to the PAWS test.
			 */
		}

		if (len <= tcp_header_len) {
			/* No payload: only a TCP header arrived.  (A pure ACK is sent
			 * when there is no data to piggyback or when the receiver has
			 * detected out-of-order data.)  The block above used the
			 * timestamp option to check PAWS (Protect Against Wrapped
			 * Sequence numbers).  The work done here is:
			 * 1. Save the peer's latest timestamp via tcp_store_ts_recent().
			 *    As the if checks show, TCP always echoes the timestamp of
			 *    the first segment that arrived since the last echo:
			 *    rcv_wup is reset to rcv_nxt only when we send (and echo a
			 *    timestamp), so after the first segment following the
			 *    previous echo, rcv_nxt has advanced while rcv_wup has not,
			 *    and later segments therefore skip this call.
			 * 2. ACK processing via tcp_ack() - a very complex function that
			 *    covers congestion control, acknowledgment handling, etc.
			 * 3. Check whether there is data waiting to be sent:
			 *    tcp_data_snd_check().
			 */
			/* Bulk data transfer: sender */
			if (len == tcp_header_len) {
				/* Predicted packet is in window by definition.
				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
				 * Hence, check seq<=rcv_wup reduces to:
				 */
				/* Timestamp option present && all received data already
				 * acknowledged: save the timestamp.
				 *
				 * static void tcp_store_ts_recent(struct tcp_sock *tp)
				 * {
				 *	tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
				 *	tp->rx_opt.ts_recent_stamp = get_seconds();
				 * }
				 */
				if (tcp_header_len ==
				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
				    tp->rcv_nxt == tp->rcv_wup)
					tcp_store_ts_recent(tp);

				/* We know that such packets are checksummed
				 * on entry.
				 */
				tcp_ack(sk, skb, 0);	/* fast-path ACK processing */
				__kfree_skb(skb);
				/* An ACK arrived: check whether there is data to send and
				 * whether the send buffer allows it; give pending segments
				 * a chance to go out (tcp_push_pending_frames).
				 */
				tcp_data_snd_check(sk);
				return;
			} else { /* Header too small */
				/* shorter than the TCP header itself: bogus packet */
				TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
				goto discard;
			}
		} else {
			/* The segment carries data and passed header prediction, so it
			 * is exactly what we expected: process the payload.
			 */
			int eaten = 0;
			bool fragstolen = false;

			if (tp->ucopy.task == current &&	/* process context */
			    /* copied_seq == rcv_nxt: the segment being received starts
			     * exactly where the data not yet copied to user space ends */
			    tp->copied_seq == tp->rcv_nxt &&
			    /* payload fits in the user-space buffer */
			    len - tcp_header_len <= tp->ucopy.len &&
			    /* the socket is locked by the user-space reader */
			    sock_owned_by_user(sk)) {
				__set_current_state(TASK_RUNNING);	/* mark the task runnable */

				/* copy the payload straight into the user's msghdr */
				if (!tcp_copy_to_iovec(sk, skb, tcp_header_len)) {
					/* Predicted packet is in window by definition.
					 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
					 * Hence, check seq<=rcv_wup reduces to:
					 */
					/* Timestamp option present && all received data
					 * acknowledged: update the timestamp.
					 */
					if (tcp_header_len ==
					    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
					    tp->rcv_nxt == tp->rcv_wup)
						tcp_store_ts_recent(tp);

					tcp_rcv_rtt_measure_ts(sk, skb);	/* receiver-side RTT estimate */

					__skb_pull(skb, tcp_header_len);
					/* advance the next expected sequence number */
					tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
					NET_INC_STATS(sock_net(sk),
						      LINUX_MIB_TCPHPHITSTOUSER);
					eaten = 1;
				}
			}
			/* Nothing was copied to user space, or the copy failed: the data
			 * did not go through ucopy.
			 */
			if (!eaten) {
				if (tcp_checksum_complete(skb))
					goto csum_error;

				/* skb larger than the pre-allocated forward space */
				if ((int)skb->truesize > sk->sk_forward_alloc)
					goto step5;

				/* Predicted packet is in window by definition.
				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
				 * Hence, check seq<=rcv_wup reduces to:
				 */
				/* Timestamp option present and all data acknowledged:
				 * nothing was sent and no other segment was received since
				 * the last ACK, and this segment is in order, so update
				 * the timestamp.
				 */
				if (tcp_header_len ==
				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
				    tp->rcv_nxt == tp->rcv_wup)
					tcp_store_ts_recent(tp);

				tcp_rcv_rtt_measure_ts(sk, skb);

				NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);

				/* Bulk data transfer: receiver */
				/* Queue the data on sk_receive_queue: the TCP header is
				 * stripped, the skb is cached for the reading process, its
				 * owner and release callback are set, the accounted receive
				 * memory is updated, and tp->rcv_nxt as well as
				 * tp->bytes_received are advanced.
				 */
				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
						      &fragstolen);
			}

			tcp_event_data_recv(sk, skb);	/* inet_csk_schedule_ack, RTT update */

			/* the ACK acknowledges new data */
			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
				/* Well, only one small jumplet in fast path... */
				tcp_ack(sk, skb, FLAG_DATA);	/* process the ACK */
				tcp_data_snd_check(sk);	/* send pending data if possible */
				/* No ACK scheduled: tcp_event_data_recv marked one, but it
				 * may already have gone out, so skip the send check.
				 */
				if (!inet_csk_ack_scheduled(sk))
					goto no_ack;
			}

			/* check whether an ACK needs to be sent, and send it if so */
			__tcp_ack_snd_check(sk, 0);
no_ack:
			if (eaten)
				kfree_skb_partial(skb, fragstolen);
			sk->sk_data_ready(sk);
			return;
		}
	}

slow_path:
	/* bad length or bad checksum */
	if (len < (th->doff << 2) || tcp_checksum_complete(skb))
		goto csum_error;

	/* no ACK, no RST, no SYN */
	if (!th->ack && !th->rst && !th->syn)
		goto discard;

	/*
	 *	Standard slow path.
	 */

	/* all the validity checks */
	if (!tcp_validate_incoming(sk, skb, th, 1))
		return;

step5:
	/* process the ACK */
	if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
		goto discard;

	tcp_rcv_rtt_measure_ts(sk, skb);	/* RTT estimation */

	/* Process urgent data. */
	tcp_urg(sk, skb, th);

	/* step 7: process the segment text (queue the payload) */
	tcp_data_queue(sk, skb);

	tcp_data_snd_check(sk);	/* send pending data if any */
	tcp_ack_snd_check(sk);	/* send an ACK if needed */
	return;

csum_error:
	TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
	TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);

discard:
	tcp_drop(sk, skb);
}
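For reference, the queueing helpers used by the fast path above; a sketch of tcp_rcv_nxt_update() and tcp_queue_rcv(), roughly as they appear in net/ipv4/tcp_input.c of this kernel generation (signatures have shifted across versions): the header is stripped, the skb is coalesced with the tail of sk_receive_queue if possible, rcv_nxt/bytes_received are advanced, and otherwise the skb is queued and charged to the socket.

static void tcp_rcv_nxt_update(struct tcp_sock *tp, u32 seq)
{
        u32 delta = seq - tp->rcv_nxt;

        sock_owned_by_me((struct sock *)tp);
        tp->bytes_received += delta;	/* account the newly received bytes */
        tp->rcv_nxt = seq;		/* next sequence number we expect */
}

static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
                                      int hdrlen, bool *fragstolen)
{
        int eaten;
        struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);

        __skb_pull(skb, hdrlen);			/* drop the TCP header */
        eaten = (tail &&
                 tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0;
        tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
        if (!eaten) {
                __skb_queue_tail(&sk->sk_receive_queue, skb);
                skb_set_owner_r(skb, sk);		/* charge the skb to the socket */
        }
        return eaten;
}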
/* There is something which you must keep in mind when you analyze the
 * behavior of the tp->ato delayed ack timeout interval.  When a
 * connection starts up, we want to ack as quickly as possible.  The
 * problem is that "good" TCP's do slow start at the beginning of data
 * transmission.  The means that until we send the first few ACK's the
 * sender will sit on his end and only queue most of his data, because
 * he can only send snd_cwnd unacked packets at any given time.  For
 * each ACK we send, he increments snd_cwnd and transmits more of his
 * queue.  -DaveM
 */
static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	u32 now;

	inet_csk_schedule_ack(sk);	/* data was received: mark that an ACK must be scheduled */

	/* --------------------------------- */

	icsk->icsk_ack.lrcvtime = now;

	tcp_ecn_check_ce(tp, skb);

	if (skb->len >= 128)
		tcp_grow_window(sk, skb);
}

/* rcv_ssthresh is a threshold on the current receive window; it is initialised
 * to rcv_wnd and works together with it.  When the local socket receives data
 * and certain conditions hold, rcv_ssthresh is increased.  The next time a TCP
 * header is built and the receive window is advertised to the peer, rcv_wnd is
 * updated, but its value may not exceed rcv_ssthresh.  Together the two give a
 * slowly growing advertised window.
 */
static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* Check #1 */
	if (tp->rcv_ssthresh < tp->window_clamp &&
	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
	    !tcp_under_memory_pressure(sk)) {
		int incr;

		/* Check #2. Increase window, if skb with such overhead
		 * will fit to rcvbuf in future.
		 */
		if (tcp_win_from_space(skb->truesize) <= skb->len)
			incr = 2 * tp->advmss;
		else
			incr = __tcp_grow_window(sk, skb);

		if (incr) {
			incr = max_t(int, incr, 2 * skb->len);
			tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
					       tp->window_clamp);
			inet_csk(sk)->icsk_ack.quick |= 1;
		}
	}
}
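The "Check #2" condition compares the usable window contributed by this skb against its payload length; tcp_win_from_space() discounts per-skb overhead according to the tcp_adv_win_scale sysctl. A sketch of that helper, roughly as defined in include/net/tcp.h for kernels of this generation (the sysctl later became per-netns):

static inline int tcp_win_from_space(int space)
{
        /* tcp_adv_win_scale reserves a fraction of the buffer for overhead */
        return sysctl_tcp_adv_win_scale <= 0 ?
                (space >> (-sysctl_tcp_adv_win_scale)) :
                space - (space >> sysctl_tcp_adv_win_scale);
}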
/* Does PAWS and seqno based validation of an incoming segment, flags will
 * play significant role here.
 */
static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
				  const struct tcphdr *th, int syn_inerr)
{
	struct tcp_sock *tp = tcp_sk(sk);
	bool rst_seq_match = false;

	/* RFC1323: H1. Apply PAWS check first.
	 *
	 * PAWS discards a packet only when all of the following hold:
	 * 1. The difference between the last recorded timestamp (ts_recent)
	 *    and the timestamp carried in the current segment is more than
	 *    TCP_PAWS_WINDOW (= 1): a segment transmitted one clock tick
	 *    before an already-seen segment is still acceptable, since
	 *    reordering may simply have let the later one arrive first.
	 * 2. Fewer than 24 days have elapsed since the timestamp was last
	 *    stored.
	 * 3. tcp_disordered_ack() returns 0.
	 *
	 * static inline bool tcp_paws_discard(const struct sock *sk,
	 *				       const struct sk_buff *skb)
	 * {
	 *	const struct tcp_sock *tp = tcp_sk(sk);
	 *
	 *	return !tcp_paws_check(&tp->rx_opt, TCP_PAWS_WINDOW) &&
	 *	       !tcp_disordered_ack(sk, skb);
	 * }
	 *
	 * tcp_paws_discard
	 *   |--> tcp_disordered_ack
	 *
	 * The key point is that the local side uses tcp_disordered_ack() to
	 * judge the segment it has just received.  Its logic, summarised,
	 * under the premise that the segment's timestamp suggests wrap-around:
	 * a) If the segment is not a pure ACK, discard it: the data it carries
	 *    is clearly old and not what the local side currently expects (see
	 *    the section on the rationale of PAWS).
	 * b) If the segment is not the one the local side expects, discard it;
	 *    the reason is obvious.
	 * c) If the segment is a pure ACK but not a duplicate ACK (one
	 *    triggered at the peer by later local data arriving correctly),
	 *    discard it: it is an old ACK, not one sent per received segment
	 *    to speed up the local side's retransmission.
	 * d) If it is a duplicate ACK whose timestamp lags by more than the
	 *    retransmission RTO of the lost segment, discard it: the local
	 *    side has already retransmitted the segment that provoked the
	 *    duplicate ACK, so further copies can simply be dropped.
	 * e) If it is a duplicate ACK within one RTO, it must NOT be
	 *    discarded: it means the peer has received segments sent after
	 *    the lost one, and the local side has to act on it (for example,
	 *    retransmit immediately).
	 *
	 * An important distinction: once the timestamp looks wrapped, a pure
	 * ACK and a data segment carrying an ACK are treated very differently.
	 * The latter can be dropped at once, because between a given seq in
	 * one window and the same seq in the next window the window must have
	 * changed at some point, so ts_recent must have been updated and PAWS
	 * can safely reject the segment.  A pure ACK, however, cannot simply
	 * be dropped: it is perfectly plausible that the local receive buffer
	 * is large and the peer's sequence space wraps quickly, so after one
	 * of the local side's segments is lost the peer sends a duplicate ACK
	 * for every later segment it receives; the ack_seq of those duplicate
	 * ACKs is the lost segment's sequence number while their own seq has
	 * already wrapped.  Whether such a duplicate ACK is the pre-wrap one
	 * or the post-wrap one with the same seq, the local side must process
	 * it (start retransmitting immediately) rather than discard it.
	 *
	 * From http://abcdxyzk.github.io/blog/2015/04/01/kernel-net-estab/
	 */
	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
	    tcp_paws_discard(sk, skb)) {
		if (!th->rst) {
			NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
			if (!tcp_oow_rate_limited(sock_net(sk), skb,
						  LINUX_MIB_TCPACKSKIPPEDPAWS,
						  &tp->last_oow_ack_time))
				tcp_send_dupack(sk, skb);
			goto discard;
		}
		/* Reset is accepted even if it did not pass PAWS. */
	}

	/* Step 1: check sequence number
	 *
	 * Verify that the segment's sequence numbers are acceptable; if they
	 * are not (and RST is not set), send a duplicate ACK via
	 * tcp_send_dupack().  Since rcv_wup is updated when we send an ACK
	 * (in tcp_select_window), everything before rcv_wup has already been
	 * acknowledged, so the segment's end sequence number must at least
	 * exceed it, and its start sequence number must fall inside the
	 * receive window.
	 */
	if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
		/* RFC793, page 37: "In all states except SYN-SENT, all reset
		 * (RST) segments are validated by checking their SEQ-fields."
		 * And page 69: "If an incoming segment is not acceptable,
		 * an acknowledgment should be sent in reply (unless the RST
		 * bit is set, if so drop the segment and return)".
		 */
		if (!th->rst) {
			if (th->syn)
				goto syn_challenge;
			if (!tcp_oow_rate_limited(sock_net(sk), skb,
						  LINUX_MIB_TCPACKSKIPPEDSEQ,
						  &tp->last_oow_ack_time))
				tcp_send_dupack(sk, skb);
		} else if (tcp_reset_check(sk, skb)) {
			tcp_reset(sk);
		}
		goto discard;
	}

	/* Step 2: check RST bit - if RST is set, handle it via tcp_reset() */
	if (th->rst) {
		/* RFC 5961 3.2 (extend to match against (RCV.NXT - 1) after a
		 * FIN and SACK too if available):
		 * If seq num matches RCV.NXT or (RCV.NXT - 1) after a FIN, or
		 * the right-most SACK block,
		 * then
		 *     RESET the connection
		 * else
		 *     Send a challenge ACK
		 */
		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
		    tcp_reset_check(sk, skb)) {
			rst_seq_match = true;
		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
			struct tcp_sack_block *sp = &tp->selective_acks[0];
			int max_sack = sp[0].end_seq;
			int this_sack;

			for (this_sack = 1; this_sack < tp->rx_opt.num_sacks;
			     ++this_sack) {
				max_sack = after(sp[this_sack].end_seq,
						 max_sack) ?
					sp[this_sack].end_seq : max_sack;
			}

			if (TCP_SKB_CB(skb)->seq == max_sack)
				rst_seq_match = true;
		}

		if (rst_seq_match)
			tcp_reset(sk);
		else {
			/* Disable TFO if RST is out-of-order
			 * and no data has been received
			 * for current active TFO socket
			 */
			if (tp->syn_fastopen && !tp->data_segs_in &&
			    sk->sk_state == TCP_ESTABLISHED)
				tcp_fastopen_active_disable(sk);
			tcp_send_challenge_ack(sk, skb);
		}
		goto discard;
	}

	/* step 3: check security and precedence [ignored] */

	/* step 4: Check for a SYN
	 * RFC 5961 4.2 : Send a challenge ack
	 * No data is sent between a retransmitted SYN and the original SYN,
	 * so the two SYNs carry the same sequence number.
	 */
	if (th->syn) {
syn_challenge:
		if (syn_inerr)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
		tcp_send_challenge_ack(sk, skb);
		goto discard;
	}

	return true;

discard:
	tcp_drop(sk, skb);
	return false;
}
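The duplicate-ACK escape hatch described in points a) to e) above lives in tcp_disordered_ack(). A sketch, roughly as in net/ipv4/tcp_input.c of this generation: a segment that apparently fails PAWS is tolerated only if it is an in-sequence pure ACK that acknowledges nothing new, does not update the window, and whose timestamp lag stays within one RTO.

static int tcp_disordered_ack(const struct sock *sk, const struct sk_buff *skb)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        const struct tcphdr *th = tcp_hdr(skb);
        u32 seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;

        return (/* 1. Pure ACK with correct sequence number. */
                (th->ack && seq == TCP_SKB_CB(skb)->end_seq && seq == tp->rcv_nxt) &&

                /* 2. ... and duplicate ACK. */
                ack == tp->snd_una &&

                /* 3. ... and does not update window. */
                !tcp_may_update_window(tp, ack, seq,
                                       ntohs(th->window) << tp->rx_opt.snd_wscale) &&

                /* 4. ... and sits in replay window (within one RTO). */
                (s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) <=
                        (inet_csk(sk)->icsk_rto * 1024) / HZ);
}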
From the analysis above we can see:
			if (tp->ucopy.task == current &&	/* process context */
			    /* copied_seq == rcv_nxt: the segment being received starts
			     * exactly where the data not yet copied from kernel space to
			     * user space ends */
			    tp->copied_seq == tp->rcv_nxt &&
			    /* payload fits in the remaining user-space buffer */
			    len - tcp_header_len <= tp->ucopy.len &&
			    /* the socket is locked by user space: the process is inside a
			     * recv() call, waiting for data from the kernel */
			    sock_owned_by_user(sk)) {
Besides receiving data through the recvmsg() system call, the receive path can also copy data proactively from kernel space straight into user space. Note that when copying, the TCP header must not be copied to user space.
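A sketch of the direct-copy helper used in the fast path, roughly as tcp_copy_to_iovec() appears in net/ipv4/tcp_input.c of kernels that still have the ucopy path: the TCP header (hlen bytes) is skipped, the payload is copied (and checksummed if needed) into the reader's msghdr, and the bookkeeping (ucopy.len, copied_seq, receive-space estimate) is updated only when the copy succeeds.

static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
{
        struct tcp_sock *tp = tcp_sk(sk);
        int chunk = skb->len - hlen;	/* payload only: skip the TCP header */
        int err;

        if (skb_csum_unnecessary(skb))
                err = skb_copy_datagram_msg(skb, hlen, tp->ucopy.msg, chunk);
        else
                err = skb_copy_and_csum_datagram_msg(skb, hlen, tp->ucopy.msg);

        if (!err) {
                tp->ucopy.len -= chunk;		/* user buffer space left */
                tp->copied_seq += chunk;	/* bytes handed to user space */
                tcp_rcv_space_adjust(sk);	/* re-estimate the receive space */
        }

        return err;
}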
