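Terminology
ABC
- English name: Appropriate Byte Count
- Chinese name: 適當字節計數
- Description: ABC (RFC 3465) is a way of increasing the congestion window (cwnd) more slowly in response to partial acknowledgments.
Possible values are:
- 0: increase cwnd once per acknowledgment (no ABC)
- 1: increase cwnd once per acknowledgment of a full-sized segment
- 2: allow cwnd to increase by two if the acknowledgment covers two segments, to compensate for delayed acknowledgments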
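The three modes can be compared with a small model. This is a minimal sketch rather than kernel code: the `abc_state` struct and `abc_on_ack()` helper are hypothetical names, cwnd is counted in segments, and only slow-start growth is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a sender's slow-start state (not a kernel struct). */
struct abc_state {
    uint32_t cwnd;        /* congestion window, in segments */
    uint32_t bytes_acked; /* bytes acknowledged since cwnd last grew */
};

/* Apply one ACK that newly acknowledges `acked` bytes.
 * mode 0: grow once per ACK; mode 1: grow once per full segment acked;
 * mode 2: like mode 1, but grow by two when the ACK covers two segments. */
static void abc_on_ack(struct abc_state *s, int mode, uint32_t acked, uint32_t mss)
{
    assert(mss > 0);
    if (mode == 0) {
        s->cwnd += 1;
        return;
    }
    s->bytes_acked += acked;
    if (s->bytes_acked < mss)
        return; /* partial acknowledgment: no growth yet */
    s->cwnd += (mode == 2 && s->bytes_acked >= 2 * mss) ? 2 : 1;
    s->bytes_acked = 0;
}
```

With an MSS of 1000 bytes, two 500-byte ACKs grow cwnd by two segments in mode 0 but only by one in mode 1, which is the "slower" growth the description refers to.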
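SACK
- English name: Selective Acknowledgment
- Chinese name: 選擇性確認
- Description: SACK is a TCP option that lets the receiver tell the sender which blocks of data it has already received out of order, from which the sender can work out which segments were lost.
With this information TCP can retransmit only the segments that were really lost. Note that SACK information is sent only when out-of-order segments have been received; TCP's ACK is still
based on cumulative acknowledgment. In other words, if the sequence number of a received segment is the one expected, a cumulative ACK is sent; SACK only covers
segments that arrive out of order.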
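How the sender can derive the missing ranges from a cumulative ACK plus SACK blocks can be sketched as below. The `seq_range` type and `sack_holes()` function are illustrative names, the blocks are assumed to be sorted, non-overlapping and above the cumulative ACK, and sequence-number wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open sequence range [start, end). Hypothetical type for this sketch. */
struct seq_range {
    uint32_t start;
    uint32_t end;
};

/* Walk the SACK blocks and record every gap between the cumulative ACK and
 * the highest SACKed byte. Returns the number of holes written to `holes`. */
static size_t sack_holes(uint32_t cum_ack, const struct seq_range *sacks, size_t n,
                         struct seq_range *holes, size_t max_holes)
{
    size_t count = 0;
    uint32_t next = cum_ack; /* first byte not yet known to be received */

    for (size_t i = 0; i < n; i++) {
        if (sacks[i].start > next && count < max_holes) {
            holes[count].start = next;
            holes[count].end = sacks[i].start;
            count++;
        }
        if (sacks[i].end > next)
            next = sacks[i].end;
    }
    return count;
}
```

For a cumulative ACK of 1000 and SACK blocks [2000, 3000) and [4000, 5000), the holes are [1000, 2000) and [3000, 4000); only those ranges are candidates for retransmission.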
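D-SACK
- English name: Duplicate Selective Acknowledgment
- Chinese name: 重復的SACK
- Description: RFC 2883 extends SACK. The information in a SACK option describes received segments, which may have been received normally or received more than once.
With the extension, D-SACK can use the SACK option to describe segments that were received in duplicate. Note, however, that D-SACK only reports the part of the most recently
received segment that duplicates data already received.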
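RFC 2883 lets the sender recognise a D-SACK from the first block of the option: the block either lies below the cumulative ACK or is contained in the second block. A minimal sketch of that check follows; `sack_block` and `first_block_is_dsack()` are names made up for this example, and wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open sequence range [start, end). Hypothetical type for this sketch. */
struct sack_block {
    uint32_t start;
    uint32_t end;
};

/* Returns true when the first SACK block reports duplicate data. */
static bool first_block_is_dsack(uint32_t cum_ack, const struct sack_block *b, size_t n)
{
    if (n == 0)
        return false;
    /* Case 1: the block starts below the cumulative ACK. */
    if (b[0].start < cum_ack)
        return true;
    /* Case 2: the block is fully covered by the second block. */
    if (n > 1 && b[0].start >= b[1].start && b[0].end <= b[1].end)
        return true;
    return false;
}
```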
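FACK
- English name: Forward Acknowledgment
- Chinese name: 提前確認
- Description: The FACK algorithm takes an aggressive approach and treats every range left unacknowledged between SACK blocks as lost. This usually gives better network performance, but it is overly aggressive, because a range not covered by SACK may merely have been reordered rather than lost.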
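F-RTO
- English name: Forward RTO-Recovery
- Chinese name: 虛假超時
- Description: The basic idea of F-RTO is to determine whether an RTO was genuine, and from that decide whether to run the congestion avoidance algorithm. It does so by observing the two ACKs that follow the RTO. If they are not duplicate ACKs and they acknowledge segments that were not retransmitted,
the RTO is considered spurious and the congestion avoidance algorithm is not run.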
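Nagle algorithm
- Description: The main goal of the Nagle algorithm is to reduce the number of small packets on the network. When the data you send is too small, TCP does not send it immediately; it buffers the data until a full-sized segment has accumulated or all previously sent data has been acknowledged.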
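The decision Nagle makes for each pending segment can be written as a small predicate. This is a sketch of the classic rule only, not the kernel's implementation; `nagle_may_send()` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a pending segment of `len` bytes may be sent now.
 * `unacked_bytes` is the amount of data sent but not yet acknowledged;
 * `nodelay` models a socket on which Nagle has been disabled. */
static bool nagle_may_send(uint32_t len, uint32_t mss, uint32_t unacked_bytes, bool nodelay)
{
    assert(mss > 0);
    if (nodelay)
        return true;           /* Nagle disabled: always send */
    if (len >= mss)
        return true;           /* full-sized segment: always send */
    return unacked_bytes == 0; /* small segment: only when nothing is in flight */
}
```

A 100-byte write goes out immediately on an idle connection, but is held back while 500 bytes are still unacknowledged.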
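CORK algorithm
- Description: The intent of the CORK algorithm is to improve network utilization. Ideally it avoids sending small packets altogether, sending only full packets plus the small packets that cannot be avoided.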
tcp_v4_connect()
- Description: Establishes a connection to the server and sends the SYN segment.
- Return value: 0 or an error code
- Key code path:
#+BEGIN_SRC c
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
    .....
    /* Set the destination address and destination port */
    inet->dport = usin->sin_port;
    inet->daddr = daddr;
    ....
    /* Initialize the upper bound of the MSS */
    tp->rx_opt.mss_clamp = 536;

    /* Socket identity is still unknown (sport may be zero).
     * However we set state to SYN-SENT and not releasing socket
     * lock select source port, enter ourselves into the hash tables and
     * complete initialization after this.
     */
    tcp_set_state(sk, TCP_SYN_SENT); /* Set the state */
    err = tcp_v4_hash_connect(sk); /* Add the transmission control block to the ehash table and allocate a port dynamically */
    if (err)
        goto failure;
    ....
    if (!tp->write_seq) /* The initial sequence number has not been computed yet */
        /* Compute the initial sequence number from both sides' addresses and ports */
        tp->write_seq = secure_tcp_sequence_number(inet->saddr,
                                                   inet->daddr,
                                                   inet->sport,
                                                   usin->sin_port);

    /* Derive an initial id from the initial sequence number and the current time */
    inet->id = tp->write_seq ^ jiffies;

    /* Send the SYN segment */
    err = tcp_connect(sk);
    rt = NULL;
    if (err)
        goto failure;

    return 0;
}
#+END_SRC
sys_accept()
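- Description: Calls tcp_accept(), allocates a connection descriptor for the newsk it returns, and returns that descriptor to user space.
- Return value: the connection descriptor
- Key code path: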
#+BEGIN_SRC c
asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen)
{
    struct socket *sock, *newsock;
    .....
    sock = sockfd_lookup(fd, &err); /* Get the socket of the listening port */
    .....
    if (!(newsock = sock_alloc())) /* Allocate a new socket to handle the connection with the client */
    .....
    /* Call the transport-layer accept; for TCP this is inet_accept */
    err = sock->ops->accept(sock, newsock, sock->file->f_flags);
    ....
    if (upeer_sockaddr) { /* The caller wants the peer's socket address and port */
        /* Call the transport-layer callback to get the peer's address and port */
        if(newsock->ops->getname(newsock, (struct sockaddr *)address, &len, 2)<0) {
        }
        /* On success, copy it to user space */
        err = move_addr_to_user(address, len, upeer_sockaddr, upeer_addrlen);
    }
    .....
    if ((err = sock_map_fd(newsock)) < 0) /* Allocate a file descriptor for the new connection */

    return err;
}
#+END_SRC
tcp_accept()
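[Note]: From kernel 2.6.32 onward the corresponding function is inet_csk_accept().
- Description: Within the allowed time, checks whether the tcp_sock->accept_queue queue is non-empty, which means a new connection has come in.
- Return value: (struct sock *)newsk;
- Key code path: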
#+BEGIN_SRC c
struct sock *tcp_accept(struct sock *sk, int flags, int *err)
{
    ....
    /* Find already established connection */
    if (!tp->accept_queue) { /* The accept queue is empty: no new connection has arrived yet */
        long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK); /* Returns if the socket is non-blocking or no new connection arrives in time */

        if (!timeo) /* Zero timeout (non-blocking) and no new connection: exit */
            goto out;

        /* Reaching here means blocking is allowed: wait for a new transmission control block to arrive */
        error = wait_for_connect(sk, timeo);
        if (error)
            goto out;
    }

    req = tp->accept_queue;
    if ((tp->accept_queue = req->dl_next) == NULL)
        tp->accept_queue_tail = NULL;

    newsk = req->sk;
    sk_acceptq_removed(sk);
    tcp_openreq_fastfree(req);
    ....

    return newsk;
}
#+END_SRC
Three-way handshake
Client sends the SYN segment
- Sent via tcp_v4_connect()->tcp_connect()->tcp_transmit_skb(); the state is set to TCP_SYN_SENT.
- Key code path:
#+BEGIN_SRC c
/* Build and send the SYN segment */
int tcp_connect(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *buff;

    tcp_connect_init(sk); /* Initialize the connection-related members of the transmission control block */

    /* Allocate and initialize a buffer for the SYN segment */
    buff = alloc_skb(MAX_TCP_HEADER + 15, sk->sk_allocation);
    if (unlikely(buff == NULL))
        return -ENOBUFS;

    /* Reserve space for headers. */
    skb_reserve(buff, MAX_TCP_HEADER);

    TCP_SKB_CB(buff)->flags = TCPCB_FLAG_SYN;
    TCP_ECN_send_syn(sk, tp, buff);
    TCP_SKB_CB(buff)->sacked = 0;
    skb_shinfo(buff)->tso_segs = 1;
    skb_shinfo(buff)->tso_size = 0;
    buff->csum = 0;
    TCP_SKB_CB(buff)->seq = tp->write_seq++;
    TCP_SKB_CB(buff)->end_seq = tp->write_seq;
    tp->snd_nxt = tp->write_seq;
    tp->pushed_seq = tp->write_seq;
    tcp_ca_init(tp);

    /* Send it off. */
    TCP_SKB_CB(buff)->when = tcp_time_stamp;
    tp->retrans_stamp = TCP_SKB_CB(buff)->when;

    /* Add the segment to the send queue */
    __skb_queue_tail(&sk->sk_write_queue, buff);
    sk_charge_skb(sk, buff);
    tp->packets_out += tcp_skb_pcount(buff);
    /* Send the SYN segment */
    tcp_transmit_skb(sk, skb_clone(buff, GFP_KERNEL));
    TCP_INC_STATS(TCP_MIB_ACTIVEOPENS);

    /* Timer for repeating the SYN until an answer. */
    /* Start the retransmission timer */
    tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto);
    return 0;
}
#+END_SRC
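Server receives the SYN segment and sends the SYN/ACK
- Via tcp_v4_do_rcv()->tcp_rcv_state_process()->tcp_v4_conn_request()->tcp_v4_send_synack().
- tcp_v4_send_synack()
  - tcp_make_synack(sk, dst, req); * Build the SYN+ACK segment from the route, the transmission control block and the connection request block *
  - ip_build_and_send_pkt(); * Build the IP datagram and send it *

Figure: How the server handles a received SYN segment and sends the SYN/ACK.
- Key code path: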
#+BEGIN_SRC c
/* Send a SYN+ACK segment to the client */
static int tcp_v4_send_synack(struct sock *sk, struct open_request *req,
                              struct dst_entry *dst)
{
    int err = -1;
    struct sk_buff * skb;

    /* First, grab a route. */
    /* Look up the route to the client */
    if (!dst && (dst = tcp_v4_route_req(sk, req)) == NULL)
        goto out;

    /* Build the SYN+ACK segment from the route, the transmission control block and the connection request block */
    skb = tcp_make_synack(sk, dst, req);

    if (skb) { /* The SYN+ACK segment was built successfully */
        struct tcphdr *th = skb->h.th;

        /* Compute the checksum */
        th->check = tcp_v4_check(th, skb->len,
                                 req->af.v4_req.loc_addr,
                                 req->af.v4_req.rmt_addr,
                                 csum_partial((char *)th, skb->len,
                                              skb->csum));

        /* Build the IP datagram and send it */
        err = ip_build_and_send_pkt(skb, sk, req->af.v4_req.loc_addr,
                                    req->af.v4_req.rmt_addr,
                                    req->af.v4_req.opt);
        if (err == NET_XMIT_CN)
            err = 0;
    }

out:
    dst_release(dst);
    return err;
}
#+END_SRC
Client replies with the ACK segment
- Via tcp_v4_do_rcv()->tcp_rcv_state_process(). At this point the client is in the TCP_SYN_SENT state.
- tcp_rcv_synsent_state_process(); * Handles TCP segments received in the SYN_SENT state *
  - tcp_ack(); * Process the received ACK *
  - tcp_send_ack(); * On an active open, send the ACK to the server to complete the connection, and update the window *
    - alloc_skb(); * Build the ACK segment *
    - tcp_transmit_skb(); * Send the ACK segment *
  - tcp_urg(sk, skb, th); * After the second step of the handshake has been processed, out-of-band data still has to be handled *
  - tcp_data_snd_check(sk); * Check whether there is data to send *
    - Check whether data is waiting on the sk->sk_send_head queue.
    - tcp_write_xmit(); * Send the segments on the TCP send queue *
- Key code path:
tcp_rcv_synsent_state_process()
#+BEGIN_SRC c
/* Process a segment received in the SYN_SENT state; out-of-band data is not handled here */
static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
                                         struct tcphdr *th, unsigned len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int saved_clamp = tp->rx_opt.mss_clamp;

    /* Parse the TCP options and save them in the transmission control block */
    tcp_parse_options(skb, &tp->rx_opt, 0);

    if (th->ack) { /* Handle the ACK flag */
        /* rfc793:
         * "If the state is SYN-SENT then
         *    first check the ACK bit
         *      If the ACK bit is set
         *        If SEG.ACK =< ISS, or SEG.ACK > SND.NXT, send
         *        a reset (unless the RST bit is set, if so drop
         *        the segment and return)"
         *
         *  We do not send data with SYN, so that RFC-correct
         *  test reduces to:
         */
        if (TCP_SKB_CB(skb)->ack_seq != tp->snd_nxt)
            goto reset_and_undo;

        if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
            !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
                     tcp_time_stamp)) {
            NET_INC_STATS_BH(LINUX_MIB_PAWSACTIVEREJECTED);
            goto reset_and_undo;
        }

        /* Now ACK is acceptable.
         *
         * "If the RST bit is set
         *    If the ACK was acceptable then signal the user "error:
         *    connection reset", drop the segment, enter CLOSED state,
         *    delete TCB, and return."
         */

        if (th->rst) { /* ACK+RST received: tcp_reset() sets the error code and closes the socket */
            tcp_reset(sk);
            goto discard;
        }

        /* rfc793:
         *   "fifth, if neither of the SYN or RST bits is set then
         *    drop the segment and return."
         *
         *    See note below!
         *                                        --ANK(990513)
         */
        if (!th->syn) /* A segment received in SYN_SENT must carry SYN; otherwise it is invalid and is dropped */
            goto discard_and_undo;

        /* rfc793:
         *   "If the SYN bit is on ...
         *    are acceptable then ...
         *    (our SYN has been ACKed), change the connection
         *    state to ESTABLISHED..."
         */

        /* Get the explicit congestion notification (ECN) capability from the header flags */
        TCP_ECN_rcv_synack(tp, th);
        if (tp->ecn_flags&TCP_ECN_OK) /* If ECN is supported, set the flag */
            sk->sk_no_largesend = 1;

        /* Set the window-related members */
        tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
        tcp_ack(sk, skb, FLAG_SLOWPATH);

        /* Ok.. it's good. Set up sequence numbers and
         * move to established.
         */
        tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
        tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;

        /* RFC1323: The window in SYN & SYN/ACK segments is
         * never scaled.
         */
        tp->snd_wnd = ntohs(th->window);
        tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(skb)->seq);

        if (!tp->rx_opt.wscale_ok) {
            tp->rx_opt.snd_wscale = tp->rx_opt.rcv_wscale = 0;
            tp->window_clamp = min(tp->window_clamp, 65535U);
        }

        if (tp->rx_opt.saw_tstamp) { /* Set the relevant fields depending on whether the timestamp option is supported */
            tp->rx_opt.tstamp_ok = 1;
            tp->tcp_header_len =
                sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
            tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
            tcp_store_ts_recent(tp);
        } else {
            tp->tcp_header_len = sizeof(struct tcphdr);
        }

        /* Initialize PMTU, MSS and related members */
        if (tp->rx_opt.sack_ok && sysctl_tcp_fack)
            tp->rx_opt.sack_ok |= 2;

        tcp_sync_mss(sk, tp->pmtu_cookie);
        tcp_initialize_rcv_mss(sk);

        /* Remember, tcp_poll() does not lock socket!
         * Change state from SYN-SENT only after copied_seq
         * is initialized. */
        tp->copied_seq = tp->rcv_nxt;
        mb();
        tcp_set_state(sk, TCP_ESTABLISHED);

        /* Make sure socket is routed, for correct metrics. */
        tp->af_specific->rebuild_header(sk);

        tcp_init_metrics(sk);

        /* Prevent spurious tcp_cwnd_restart() on first data
         * packet.
         */
        tp->lsndtime = tcp_time_stamp;

        tcp_init_buffer_space(sk);

        /* If keepalive is enabled, start the keepalive timer */
        if (sock_flag(sk, SOCK_KEEPOPEN))
            tcp_reset_keepalive_timer(sk, keepalive_time_when(tp));

        if (!tp->rx_opt.snd_wscale) /* Header prediction */
            __tcp_fast_path_on(tp, tp->snd_wnd);
        else
            tp->pred_flags = 0;

        if (!sock_flag(sk, SOCK_DEAD)) { /* If the socket is not SOCK_DEAD, wake up the processes waiting on it */
            sk->sk_state_change(sk);
            sk_wake_async(sk, 0, POLL_OUT);
        }

        /* Connection established; enter delayed-ACK mode if appropriate */
        if (sk->sk_write_pending || tp->defer_accept || tp->ack.pingpong) {
            /* Save one ACK. Data will be ready after
             * several ticks, if write_pending is set.
             *
             * It may be deleted, but with this feature tcpdumps
             * look so _wonderfully_ clever, that I was not able
             * to stand against the temptation 8)     --ANK
             */
            tcp_schedule_ack(tp);
            tp->ack.lrcvtime = tcp_time_stamp;
            tp->ack.ato = TCP_ATO_MIN;
            tcp_incr_quickack(tp);
            tcp_enter_quickack_mode(tp);
            tcp_reset_xmit_timer(sk, TCP_TIME_DACK, TCP_DELACK_MAX);

discard:
            __kfree_skb(skb);
            return 0;
        } else { /* No delayed ACK needed: send the ACK segment immediately */
            tcp_send_ack(sk);
        }
        return -1;
    }

    /* No ACK in the segment */

    if (th->rst) { /* RST received without ACK: drop the segment */
        /* rfc793:
         * "If the RST bit is set
         *
         *      Otherwise (no ACK) drop the segment and return."
         */

        goto discard_and_undo;
    }

    /* PAWS check. */
    /* PAWS check failed: drop the segment as well */
    if (tp->rx_opt.ts_recent_stamp && tp->rx_opt.saw_tstamp && tcp_paws_check(&tp->rx_opt, 0))
        goto discard_and_undo;

    /* A SYN without ACK received in SYN_SENT means both ends are opening simultaneously */
    if (th->syn) {
        /* We see SYN without ACK. It is attempt of
         * simultaneous connect with crossed SYNs.
         * Particularly, it can be connect to self.
         */
        tcp_set_state(sk, TCP_SYN_RECV); /* Set the state to TCP_SYN_RECV */

        if (tp->rx_opt.saw_tstamp) { /* Set the timestamp-related fields */
            tp->rx_opt.tstamp_ok = 1;
            tcp_store_ts_recent(tp);
            tp->tcp_header_len =
                sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
        } else {
            tp->tcp_header_len = sizeof(struct tcphdr);
        }

        /* Initialize the window-related members */
        tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
        tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;

        /* RFC1323: The window in SYN & SYN/ACK segments is
         * never scaled.
         */
        tp->snd_wnd = ntohs(th->window);
        tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
        tp->max_window = tp->snd_wnd;

        TCP_ECN_rcv_syn(tp, th); /* Get the explicit congestion notification capability from the header flags */
        if (tp->ecn_flags&TCP_ECN_OK)
            sk->sk_no_largesend = 1;

        /* Initialize the MSS-related members */
        tcp_sync_mss(sk, tp->pmtu_cookie);
        tcp_initialize_rcv_mss(sk);

        /* Send a SYN+ACK segment to the peer and discard the received SYN segment */
        tcp_send_synack(sk);
#if 0
        /* Note, we could accept data and URG from this segment.
         * There are no obstacles to make this.
         *
         * However, if we ignore data in ACKless segments sometimes,
         * we have no reasons to accept it sometimes.
         * Also, seems the code doing it in step6 of tcp_rcv_state_process
         * is not flawless. So, discard packet for sanity.
         * Uncomment this return to process the data.
         */
        return -1;
#else
        goto discard;
#endif
    }
    /* "fifth, if neither of the SYN or RST bits is set then
     * drop the segment and return."
     */

discard_and_undo:
    tcp_clear_options(&tp->rx_opt);
    tp->rx_opt.mss_clamp = saved_clamp;
    goto discard;

reset_and_undo:
    tcp_clear_options(&tp->rx_opt);
    tp->rx_opt.mss_clamp = saved_clamp;
    return 1;
}
#+END_SRC
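The branching in tcp_rcv_synsent_state_process() is easier to see once the option, window and timer handling is stripped away. The following userspace model keeps only the flag checks; the `seg` struct, the `synsent_result` enum and `synsent_classify()` are hypothetical names for this sketch, and the timestamp and PAWS checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Outcome of processing one segment in SYN_SENT (names are illustrative). */
enum synsent_result {
    SS_SEND_RESET,  /* unacceptable ACK: the caller answers with a reset */
    SS_CONN_RESET,  /* acceptable ACK with RST: connection is reset */
    SS_DISCARD,     /* segment is dropped, state unchanged */
    SS_ESTABLISHED, /* SYN+ACK accepted: move to ESTABLISHED */
    SS_SYN_RECV     /* bare SYN: simultaneous open, move to SYN_RECV */
};

/* The header fields the decision depends on. */
struct seg {
    bool syn, ack, rst;
    uint32_t ack_seq;
};

static enum synsent_result synsent_classify(const struct seg *s, uint32_t snd_nxt)
{
    if (s->ack) {
        if (s->ack_seq != snd_nxt)
            return SS_SEND_RESET;
        if (s->rst)
            return SS_CONN_RESET;
        if (!s->syn)
            return SS_DISCARD;
        return SS_ESTABLISHED;
    }
    if (s->rst)
        return SS_DISCARD;
    if (s->syn)
        return SS_SYN_RECV;
    return SS_DISCARD;
}
```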
Server receives the ACK segment
- Via tcp_v4_do_rcv()->tcp_rcv_state_process(). The server moves from the TCP_SYN_RECV state to the TCP_ESTABLISHED state.
- Key code path:
#+BEGIN_SRC c
/* Handles TCP segments in every state except ESTABLISHED and TIME_WAIT */
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
                          struct tcphdr *th, unsigned len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int queued = 0;

    tp->rx_opt.saw_tstamp = 0;

    switch (sk->sk_state) {
    .....
    /* Handling of the SYN_RECV state */
    if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp && /* Parse the TCP options; the header carries a timestamp option */
        tcp_paws_discard(tp, skb)) { /* PAWS check failed: drop the segment */
        if (!th->rst) { /* Not an RST segment */
            /* Send a duplicate ACK to the peer to say the received segment was already processed */
            NET_INC_STATS_BH(LINUX_MIB_PAWSESTABREJECTED);
            tcp_send_dupack(sk, skb);
            goto discard;
        }
        /* Reset is accepted even if it did not pass PAWS. */
    }

    /* step 1: check sequence number */
    if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) { /* The segment's sequence number is invalid */
        if (!th->rst) /* No RST flag: send a duplicate ACK to the peer */
            tcp_send_dupack(sk, skb);
        goto discard;
    }

    /* step 2: check RST bit */
    if(th->rst) { /* RST flag set: reset the connection */
        tcp_reset(sk);
        goto discard;
    }

    /* Update the timestamp if necessary */
    tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);

    /* step 3: check security and precedence [ignored] */

    /* step 4:
     *
     * Check for a SYN in window.
     */
    if (th->syn && !before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) { /* SYN flag set and the sequence number is inside the receive window */
        NET_INC_STATS_BH(LINUX_MIB_TCPABORTONSYN);
        tcp_reset(sk); /* Reset the connection */
        return 1;
    }

    /* step 5: check the ACK field */
    if (th->ack) { /* ACK flag set */
        /* Check whether the ACK is a normal third step of the handshake */
        int acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH);

        switch(sk->sk_state) {
        case TCP_SYN_RECV:
            if (acceptable) {
                tp->copied_seq = tp->rcv_nxt;
                mb();
                /* Normal third step of the handshake: set the connection state to TCP_ESTABLISHED */
                tcp_set_state(sk, TCP_ESTABLISHED);
                sk->sk_state_change(sk);

                /* Note, that this wakeup is only for marginal
                 * crossed SYN case. Passively open sockets
                 * are not waked up, because sk->sk_sleep ==
                 * NULL and sk->sk_socket == NULL.
                 */
                if (sk->sk_socket) { /* The state is now valid: wake up the waiting threads */
                    sk_wake_async(sk,0,POLL_OUT);
                }

                /* Initialize the transmission control block; if a timestamp option is present and the smoothed RTT is 0, the retransmission timeout has to be computed */
                tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
                tp->snd_wnd = ntohs(th->window) <<
                              tp->rx_opt.snd_wscale;
                tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq,
                            TCP_SKB_CB(skb)->seq);

                /* tcp_ack considers this ACK as duplicate
                 * and does not calculate rtt.
                 * Fix it at least with timestamps.
                 */
                if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
                    !tp->srtt)
                    tcp_ack_saw_tstamp(tp, 0);

                if (tp->rx_opt.tstamp_ok)
                    tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;

                /* Make sure socket is routed, for
                 * correct metrics.
                 */
                /* Set up the route and initialize the congestion control module */
                tp->af_specific->rebuild_header(sk);

                tcp_init_metrics(sk);

                /* Prevent spurious tcp_cwnd_restart() on
                 * first data packet.
                 */
                tp->lsndtime = tcp_time_stamp; /* Update the time the last data packet was sent */

                tcp_initialize_rcv_mss(sk);
                tcp_init_buffer_space(sk);
                tcp_fast_path_on(tp); /* Compute the flags used for TCP header prediction */
            } else {
                return 1;
            }
            break;
        .....
        }
    } else
        goto discard;
    .....

    /* step 6: check the URG bit */
    tcp_urg(sk, skb, th); /* Check the out-of-band data bit */

    /* tcp_data could move socket to TIME-WAIT */
    if (sk->sk_state != TCP_CLOSE) { /* If tcp_data needs to send data and an ACK, that is handled here */
        tcp_data_snd_check(sk);
        tcp_ack_snd_check(sk);
    }

    if (!queued) { /* The segment was not queued, or an earlier step asked for it to be released: free it */
discard:
        __kfree_skb(skb);
    }
    return 0;
}
#+END_SRC
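Step 4 above relies on before(), and the SYN_SENT code uses between(); both compare 32-bit sequence numbers that may wrap around. A minimal userspace sketch of such wraparound-safe comparisons follows. The `seq_` prefixed names are chosen for this example; the arithmetic follows the usual signed-difference idiom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if sequence number a comes before b, modulo 2^32. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True if sequence number a comes after b, modulo 2^32. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

/* True if lo <= x <= hi in modular sequence space. */
static bool seq_between(uint32_t x, uint32_t lo, uint32_t hi)
{
    return hi - lo >= x - lo;
}
```

A plain `a < b` would report that 0xFFFFFFF0 comes after 0x10, although the second value is only 0x20 bytes further along once the counter has wrapped.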
Data transfer
Client requests data
- Via send() -> sendto() -> tcp_sendmsg(). At this point the client is in the TCP_ESTABLISHED state.
send()
send() simply calls sendto().
#+BEGIN_SRC c
/*
 *	Send a datagram down a socket.
 */

SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
		unsigned, flags)
{
    return sys_sendto(fd, buff, len, flags, NULL, 0);
}
#+END_SRC
sendto()
#+BEGIN_SRC c
/*
 *	Send a datagram to a given address. We move the address into kernel
 *	space and check the user space data area is readable before invoking
 *	the protocol.
 */

SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
		unsigned, flags, struct sockaddr __user *, addr,
		int, addr_len)
{
    struct socket *sock;
    struct sockaddr_storage address;
    int err;
    struct msghdr msg;
    struct iovec iov;
    int fput_needed;

    if (len > INT_MAX)
        len = INT_MAX;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        goto out;

    /* Note that the user-space buff is assigned directly to iov.iov_base, with iov.iov_len = len */
    iov.iov_base = buff;
    iov.iov_len = len;
    msg.msg_name = NULL;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = NULL;
    msg.msg_controllen = 0;
    msg.msg_namelen = 0;
    if (addr) {
        err = move_addr_to_kernel(addr, addr_len, (struct sockaddr *)&address);
        if (err < 0)
            goto out_put;
        msg.msg_name = (struct sockaddr *)&address;
        msg.msg_namelen = addr_len;
    }
    if (sock->file->f_flags & O_NONBLOCK)
        flags |= MSG_DONTWAIT;
    msg.msg_flags = flags;
    err = sock_sendmsg(sock, &msg, len);

out_put:
    fput_light(sock->file, fput_needed);
out:
    return err;
}
#+END_SRC
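From user space, the relationship between the two calls can be observed directly: on a connected socket, send() behaves like sendto() with a NULL address. The sketch below assumes a POSIX system and uses an AF_UNIX stream socketpair instead of TCP so that it needs no network; `send_and_sendto_demo()` is a name made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends "ab" with send() and "cd" with sendto(..., NULL, 0) over a connected
 * socket pair, then reads the four bytes back into `out` (at least 5 bytes).
 * Returns 0 on success, -1 on failure. */
static int send_and_sendto_demo(char *out, size_t outlen)
{
    int sv[2];
    int rc = -1;

    if (outlen < 5 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    if (send(sv[0], "ab", 2, 0) == 2 &&
        sendto(sv[0], "cd", 2, 0, NULL, 0) == 2) {
        size_t got = 0;
        while (got < 4) {
            ssize_t n = recv(sv[1], out + got, 4 - got, 0);
            if (n <= 0)
                break;
            got += (size_t)n;
        }
        if (got == 4) {
            out[4] = '\0';
            rc = 0;
        }
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```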
__sys_sendmsg()
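Key path:
- copy_from_user() copies the user's struct msghdr into the kernel's msg_sys.
- verify_iovec() copies the user's iovec array (the descriptors of the user buffers, not the data itself) into the kernel's iovstack.
- Finally sock_sendmsg() is called.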
#+BEGIN_SRC c
static int __sys_sendmsg(struct socket *sock, struct msghdr __user *msg,
                         struct msghdr *msg_sys, unsigned flags,
                         struct used_address *used_address)
{
    struct compat_msghdr __user *msg_compat =
        (struct compat_msghdr __user *)msg;
    struct sockaddr_storage address;
    struct iovec iovstack[UIO_FASTIOV], *iov = iovstack;
    unsigned char ctl[sizeof(struct cmsghdr) + 20]
        __attribute__ ((aligned(sizeof(__kernel_size_t))));
    /* 20 is size of ipv6_pktinfo */
    unsigned char *ctl_buf = ctl;
    int err, ctl_len, iov_size, total_len;

    err = -EFAULT;
    if (MSG_CMSG_COMPAT & flags) {
        if (get_compat_msghdr(msg_sys, msg_compat))
            return -EFAULT;
    }
    else if (copy_from_user(msg_sys, msg, sizeof(struct msghdr)))
        return -EFAULT;

    /* do not move before msg_sys is valid */
    err = -EMSGSIZE;
    if (msg_sys->msg_iovlen > UIO_MAXIOV)
        goto out;

    /* Check whether to allocate the iovec area */
    err = -ENOMEM;
    iov_size = msg_sys->msg_iovlen * sizeof(struct iovec);
    if (msg_sys->msg_iovlen > UIO_FASTIOV) {
        iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
        if (!iov)
            goto out;
    }

    /* This will also move the address data into kernel space */
    if (MSG_CMSG_COMPAT & flags) {
        err = verify_compat_iovec(msg_sys, iov,
                                  (struct sockaddr *)&address,
                                  VERIFY_READ);
    } else
        err = verify_iovec(msg_sys, iov,
                           (struct sockaddr *)&address,
                           VERIFY_READ);
    if (err < 0)
        goto out_freeiov;
    total_len = err;

    err = -ENOBUFS;

    if (msg_sys->msg_controllen > INT_MAX)
        goto out_freeiov;
    ctl_len = msg_sys->msg_controllen;
    if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
        err =
            cmsghdr_from_user_compat_to_kern(msg_sys, sock->sk, ctl,
                                             sizeof(ctl));
        if (err)
            goto out_freeiov;
        ctl_buf = msg_sys->msg_control;
        ctl_len = msg_sys->msg_controllen;
    } else if (ctl_len) {
        if (ctl_len > sizeof(ctl)) {
            ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
            if (ctl_buf == NULL)
                goto out_freeiov;
        }
        err = -EFAULT;
        /*
         * Careful! Before this, msg_sys->msg_control contains a user pointer.
         * Afterwards, it will be a kernel pointer. Thus the compiler-assisted
         * checking falls down on this.
         */
        if (copy_from_user(ctl_buf, (void __user *)msg_sys->msg_control,
                           ctl_len))
            goto out_freectl;
        msg_sys->msg_control = ctl_buf;
    }
    msg_sys->msg_flags = flags;

    if (sock->file->f_flags & O_NONBLOCK)
        msg_sys->msg_flags |= MSG_DONTWAIT;
    /*
     * If this is sendmmsg() and current destination address is same as
     * previously succeeded address, omit asking LSM's decision.
     * used_address->name_len is initialized to UINT_MAX so that the first
     * destination address never matches.
     */
    if (used_address && used_address->name_len == msg_sys->msg_namelen &&
        !memcmp(&used_address->name, msg->msg_name,
                used_address->name_len)) {
        err = sock_sendmsg_nosec(sock, msg_sys, total_len);
        goto out_freectl;
    }
    err = sock_sendmsg(sock, msg_sys, total_len);
    /*
     * If this is sendmmsg() and sending to current destination address was
     * successful, remember it.
     */
    if (used_address && err >= 0) {
        used_address->name_len = msg_sys->msg_namelen;
        memcpy(&used_address->name, msg->msg_name,
               used_address->name_len);
    }

out_freectl:
    if (ctl_buf != ctl)
        sock_kfree_s(sock->sk, ctl_buf, ctl_len);
out_freeiov:
    if (iov != iovstack)
        sock_kfree_s(sock->sk, iov, iov_size);
out:
    return err;
}
#+END_SRC
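The msghdr and iovec structures that __sys_sendmsg() copies in are the ones a program fills before calling sendmsg(). The following userspace sketch, assuming a POSIX system, gathers two separate buffers into one call; `gather_send()` is a hypothetical helper written for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends strings a and b back to back with a single sendmsg() call.
 * Returns the number of bytes sent, or -1 on error. */
static ssize_t gather_send(int fd, const char *a, const char *b)
{
    struct iovec iov[2];
    struct msghdr msg;

    /* Each iovec entry only describes a buffer: a pointer and a length. */
    iov[0].iov_base = (void *)a;
    iov[0].iov_len = strlen(a);
    iov[1].iov_base = (void *)b;
    iov[1].iov_len = strlen(b);

    memset(&msg, 0, sizeof(msg)); /* no address, no control data */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(fd, &msg, 0);
}
```

The kernel first copies the small descriptor array, and the payload bytes are copied later, when the protocol's sendmsg handler walks the iovec.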
tcp_sendmsg():
#+BEGIN_SRC c
/* TCP-layer implementation of the sendmsg system call */
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t size)
{
    struct iovec *iov;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int iovlen, flags;
    int mss_now;
    int err, copied;
    long timeo;

    /* Take the socket lock */
    lock_sock(sk);
    TCP_CHECK_TIMER(sk);

    /* Compute the blocking timeout from the flags */
    flags = msg->msg_flags;
    timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);

    /* Wait for a connection to finish. */
    if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) /* Only these two states can send data */
        if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) /* In other states wait for the connection to be established; handle the error on timeout */
            goto out_err;

    /* This should be in poll */
    clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

    /* Get the effective MSS; with OOB data TSO cannot be used, so the MSS should be the smaller value */
    mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));

    /* Ok commence sending. */
    /* Get the number of data blocks to send and the pointer to them */
    iovlen = msg->msg_iovlen;
    iov = msg->msg_iov;
    /* copied is the number of bytes copied from the user data blocks into skbs. */
    copied = 0;

    err = -EPIPE;
    /* If the socket has an error or is shut down for sending, no data may be sent: return EPIPE */
    if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
        goto do_error;

    while (--iovlen >= 0) { /* Process every data block to be sent */
        int seglen = iov->iov_len;
        unsigned char __user *from = iov->iov_base;

        iov++;

        while (seglen > 0) { /* Process all the data in one block */
            int copy;

            skb = sk->sk_write_queue.prev;

            if (!sk->sk_send_head || /* The send queue is empty, so the skb fetched above is not valid */
                (copy = mss_now - skb->len) <= 0) { /* The skb is valid but has no room left for new data */

new_segment:
                /* Allocate new segment. If the interface is SG,
                 * allocate skb fitting to single page.
                 */
                if (!sk_stream_memory_free(sk)) /* Queued data has reached the send buffer limit: wait for buffer space */
                    goto wait_for_sndbuf;

                skb = sk_stream_alloc_pskb(sk, select_size(sk, tp),
                                           0, sk->sk_allocation); /* Allocate a new skb */
                if (!skb) /* Allocation failed, the system is short of memory: wait */
                    goto wait_for_memory;

                /*
                 * Check whether we can use HW checksum.
                 */
                if (sk->sk_route_caps &
                    (NETIF_F_IP_CSUM | NETIF_F_NO_CSUM |
                     NETIF_F_HW_CSUM)) /* Decide from the route's device features whether hardware computes the checksum */
                    skb->ip_summed = CHECKSUM_HW;

                skb_entail(sk, tp, skb); /* Append the skb to the tail of the send queue */
                copy = mss_now; /* The amount to copy this time is one MSS */
            }

            /* Try to append data to the end of skb. */
            if (copy > seglen) /* Do not copy more than remains in the current block */
                copy = seglen;

            /* Where to copy to? */
            if (skb_tailroom(skb) > 0) { /* There is still room at the tail of the skb linear area */
                /* We have some space in skb head. Superb! */
                if (copy > skb_tailroom(skb)) /* Copy only as much as the remaining tail room */
                    copy = skb_tailroom(skb);
                /* Copy the given length from user space into the skb; bail out on failure */
                if ((err = skb_add_data(skb, from, copy)) != 0)
                    goto do_fault;
            } else { /* No room left in the linear area: copy into the scatter/gather area */
                int merge = 0; /* Whether to append to the existing page */
                int i = skb_shinfo(skb)->nr_frags; /* Number of scatter/gather fragments */
                struct page *page = TCP_PAGE(sk); /* Fragment page */
                int off = TCP_OFF(sk); /* Offset within the fragment */

                if (skb_can_coalesce(skb, i, page, off) &&
                    off != PAGE_SIZE) { /* The current fragment can still take data */
                    /* We can extend the last page
                     * fragment. */
                    merge = 1;
                } else if (i == MAX_SKB_FRAGS || /* The pages in the skb cannot take more data; check whether another page can be added */
                           (!i &&
                            !(sk->sk_route_caps & NETIF_F_SG))) { /* The NIC does not support S/G, so no fragments */
                    /* Need to add new fragment and cannot
                     * do this because interface is non-SG,
                     * or because all the page slots are
                     * busy. */
                    tcp_mark_push(tp, skb); /* This skb can be pushed */
                    goto new_segment; /* Allocate another skb */
                } else if (page) { /* The fragment limit is not reached; check whether the current page has room */
                    /* If page is cached, align
                     * offset to L1 cache boundary
                     */
                    off = (off + L1_CACHE_BYTES - 1) &
                          ~(L1_CACHE_BYTES - 1);
                    if (off == PAGE_SIZE) { /* The last page is full: a new page is needed */
                        put_page(page);
                        TCP_PAGE(sk) = page = NULL;
                    }
                }

                if (!page) { /* A new page is needed */
                    /* Allocate new cache page. */
                    if (!(page = sk_stream_alloc_page(sk))) /* Allocate a page; wait for memory if there is none */
                        goto wait_for_memory;
                    off = 0;
                }

                if (copy > PAGE_SIZE - off) /* Do not copy more than the space left in the page */
                    copy = PAGE_SIZE - off;

                /* Time to copy data. We are close to
                 * the end! */
                err = skb_copy_to_page(sk, from, skb, page,
                                       off, copy); /* Copy the data from user space into the page */
                if (err) { /* The copy failed */
                    /* If this page was new, give it to the
                     * socket so it does not get leaked.
                     */
                    if (!TCP_PAGE(sk)) { /* A newly allocated page is recorded on the socket for later use */
                        TCP_PAGE(sk) = page;
                        TCP_OFF(sk) = 0;
                    }
                    goto do_error;
                }

                /* Update the skb. */
                /* Update the skb's fragment information */
                if (merge) { /* Data was appended to the last page */
                    skb_shinfo(skb)->frags[i - 1].size +=
                        copy; /* Update the data length of the last page */
                } else { /* A newly added page */
                    /* Update the fragment information in the skb */
                    skb_fill_page_desc(skb, i, page, off, copy);
                    if (TCP_PAGE(sk)) {
                        get_page(page);
                    } else if (off + copy < PAGE_SIZE) {
                        get_page(page);
                        TCP_PAGE(sk) = page;
                    }
                }

                /* Update the offset within the page */
                TCP_OFF(sk) = off + copy;
            }

            if (!copied) /* Nothing has been copied before this: clear the PSH flag */
                TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;

            tp->write_seq += copy; /* Update the sequence number of the last byte in the send queue */
            TCP_SKB_CB(skb)->end_seq += copy; /* Update the skb's end sequence number */
            skb_shinfo(skb)->tso_segs = 0;

            /* Advance the copy pointers */
            from += copy;
            copied += copy;
            /* Exit once all the data has been copied */
            if ((seglen -= copy) == 0 && iovlen == 0)
                goto out;

            /* If the skb holds less than one MSS, more data can be copied into it. For OOB data the send step is skipped as well and copying continues */
            if (skb->len != mss_now || (flags & MSG_OOB))
                continue;

            if (forced_push(tp)) { /* Data must be sent now: the data queued since the last push exceeds half of the advertised window */
                /* Set the PSH flag and send the data */
                tcp_mark_push(tp, skb);
                __tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH);
            } else if (skb == sk->sk_send_head) /* Sending is not mandatory, but this is the only segment on the send queue, so send it too */
                tcp_push_one(sk, mss_now);
            continue;

wait_for_sndbuf:
            /* Waiting because the send queue is full */
            set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
            if (copied) /* Out of memory, but this call has already copied data into the buffer: call tcp_push to send it */
                tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

            /* Wait for memory to become available */
            if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
                goto do_error; /* Really no memory: fail after the timeout */

            /* The MSS may have changed while sleeping: recompute it */
            mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
        }
    }

out:
    if (copied) /* Data was copied from user space: send it */
        tcp_push(sk, tp, flags, mss_now, tp->nonagle);
    TCP_CHECK_TIMER(sk);
    release_sock(sk); /* Release the lock, then return */
    return copied;

do_fault:
    if (!skb->len) { /* The copy failed; a zero-length skb was newly allocated, so free it */
        if (sk->sk_send_head == skb) /* If the skb is the head of the send queue, clear the head */
            sk->sk_send_head = NULL;
        __skb_unlink(skb, skb->list);
        sk_stream_free_skb(sk, skb); /* Free the skb */
    }

do_error:
    if (copied)
        goto out;
out_err:
    err = sk_stream_error(sk, flags, err);
    TCP_CHECK_TIMER(sk);
    release_sock(sk);
    return err;
}
#+END_SRC
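The core of the inner loop is the way user data is packed into MSS-sized skbs: fill the tail skb first, then allocate new ones. The model below is a deliberately reduced sketch of that packing. `queue_model` and `model_sendmsg()` are hypothetical names, and it ignores sk_send_head, tail room, page fragments, memory limits and pushes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the write queue: how many skbs it holds and how many
 * bytes the last one contains. */
struct queue_model {
    uint32_t skbs;
    uint32_t tail_len;
};

/* Copy `len` bytes of user data into the queue, at most `mss` bytes per skb. */
static void model_sendmsg(struct queue_model *q, uint32_t len, uint32_t mss)
{
    assert(mss > 0);
    while (len > 0) {
        /* Room left in the tail skb, if there is one (copy = mss_now - skb->len). */
        uint32_t room = (q->skbs > 0 && q->tail_len < mss) ? mss - q->tail_len : 0;

        if (room == 0) { /* corresponds to the new_segment path */
            q->skbs++;
            q->tail_len = 0;
            room = mss;
        }
        uint32_t copy = len < room ? len : room;
        q->tail_len += copy;
        len -= copy;
    }
}
```

Writing 3000 bytes with an MSS of 1460 yields three skbs, the last holding 80 bytes; a following 100-byte write is appended to that tail skb instead of allocating a fourth.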
Server responds to the request
- Via tcp_v4_do_rcv()->tcp_rcv_established(). At this point the server is in the TCP_ESTABLISHED state.
- Key code path:
#+BEGIN_SRC c -n
#+END_SRC
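Chapter 25: Transmission control block
25.4 Memory management of the transmission control block
25.4.4 Allocating and releasing receive buffers
The book says that TCP uses sk_stream_set_owner_r() to set the owning sk of an skb, whereas in kernel-2.6.32
TCP and UDP both use skb_set_owner_r().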
