TCP Three-Way Handshake: A Walk Through the Linux Kernel Source


TCP is a connection-oriented protocol. A connection-oriented transport protocol establishes a virtual path between the source and the destination, and all segments belonging to the same message travel along this virtual path. Using one virtual path for the whole message makes it easier to run the acknowledgment process and to retransmit segments that are damaged or lost. TCP operates at a higher layer than IP: it uses IP's service to deliver individual segments to the receiver, but the connection itself is managed by TCP. If a segment is lost or damaged, it is retransmitted; IP knows nothing of TCP's retransmissions. If a segment arrives out of order, TCP holds it until the missing segments arrive; IP is equally unaware of this reordering.

In TCP, connection-oriented transmission goes through three phases: connection establishment, data transfer, and connection termination.

First, a brief outline of TCP connection establishment, the three-way handshake:

An application called the client wants to use TCP as its transport protocol to establish a connection with another application called the server. The process starts on the server side: the server program tells its TCP that it is ready to accept a connection. This is called a passive open. Although the server's TCP is now ready to accept a connection from any machine in the world, it cannot complete the connection by itself.

The request issued by the client program is called an active open. A client that wants to connect to an open server tells its TCP that it needs to reach that particular server. TCP can now begin the three-way handshake.

(Figure omitted: the three-way handshake — the client sends SYN, the server replies with SYN+ACK, and the client replies with ACK.)

The client sends the first segment, the SYN segment, in which only the SYN flag is set to 1. Its purpose is to synchronize sequence numbers. The client picks a random number as its first sequence number and sends it to the server; this is the initial sequence number (ISN). The segment carries no acknowledgment number and does not define a window size — a window size is only meaningful in a segment that carries an acknowledgment. The segment may carry options. The SYN segment is a control segment: it carries no data, yet it consumes one sequence number, so when data transfer begins the sequence number is incremented by 1. In other words, a SYN segment contains no real data but still uses up a sequence number.

The server sends the second segment, the SYN+ACK segment, in which both the SYN and ACK flags are set to 1. This segment serves two purposes. First, it is the SYN segment for communication in the other direction: the server uses it to synchronize its own initial sequence number for the bytes it will send to the client. Second, through the ACK flag, the server acknowledges receipt of the client's SYN and indicates the next sequence number it expects from the client. Because this segment carries an acknowledgment, it must also define the receive window size, rwnd. A SYN+ACK segment carries no data, but it too consumes one sequence number.

The client sends the third segment, a plain ACK segment. It uses the ACK flag and the acknowledgment-number field to confirm receipt of the second segment. An ACK segment consumes a sequence number only if it carries data; otherwise it consumes none.
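From user space the whole exchange above is hidden behind three system calls: listen() is the passive open, connect() is the active open, and accept() dequeues the established connection. A minimal sketch (loopback only, error handling reduced to early returns) that drives one handshake:

```c
/* Sketch: the kernel performs the three-way handshake; user space only
 * observes it through listen()/connect()/accept(). Returns 0 on success. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int loopback_handshake(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                     /* let the kernel pick a port */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    socklen_t len = sizeof(addr);
    getsockname(srv, (struct sockaddr *)&addr, &len); /* learn the port */

    /* Passive open: the server tells its TCP it is ready to accept. */
    if (listen(srv, 1) < 0)
        return -1;

    /* Active open: connect() sends the SYN and, on a blocking socket,
     * returns only after the SYN+ACK arrived and the final ACK was sent. */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* The connection is already ESTABLISHED; accept() merely dequeues it. */
    int conn = accept(srv, NULL, NULL);
    int ok = conn >= 0 ? 0 : -1;

    close(conn);
    close(cli);
    close(srv);
    return ok;
}
```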

Now let's go into the kernel source and see how this process is implemented.

First, the client initiates the connection by calling connect:

/*
 *    Attempt to connect to a socket with the server address.  The address
 *    is in user space so we verify it is OK and move it to kernel space.
 *
 *    For 1003.1g we need to add clean support for a bind to AF_UNSPEC to
 *    break bindings
 *
 *    NOTE: 1003.1g draft 6.3 is broken with respect to AX.25/NetROM and
 *    other SEQPACKET protocols that take time to connect() as it doesn't
 *    include the -EINPROGRESS status for such sockets.
 */

int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
{
    struct socket *sock;
    struct sockaddr_storage address;
    int err, fput_needed;

    //look up the socket object from the file descriptor
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        goto out;
    //copy the address from user space into kernel space
    err = move_addr_to_kernel(uservaddr, addrlen, &address);
    if (err < 0)
        goto out_put;
    //LSM security hook, not relevant here
    err =
        security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
    if (err)
        goto out_put;

    //for stream sockets, sock->ops is inet_stream_ops --> inet_stream_connect
    //for datagram sockets, sock->ops is inet_dgram_ops --> inet_dgram_connect
    err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
                 sock->file->f_flags);
out_put:
    fput_light(sock->file, fput_needed);
out:
    return err;
}

SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
        int, addrlen)
{
    return __sys_connect(fd, uservaddr, addrlen);
}

This function does three things:

First, it finds the socket object from the file descriptor;

Second, it copies the address information from user space into kernel space;

Third, it calls the connect function for the socket's type.

For a stream socket that connect function is inet_stream_connect, which we analyze next:

int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
            int addr_len, int flags)
{
    int err;
 
    lock_sock(sock->sk);
    err = __inet_stream_connect(sock, uaddr, addr_len, flags);
    release_sock(sock->sk);
    return err;
}
 
/*
 *    Connect to a remote host. There is regrettably still a little
 *    TCP 'magic' in here.
 */
 
//1. Check the socket address length and address family.
//2. Check the socket state: it must be SS_UNCONNECTED or SS_CONNECTING.
//3. Call tcp_v4_connect() to send the SYN.
//4. Wait for the rest of the handshake to complete:
int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
              int addr_len, int flags)
{
    struct sock *sk = sock->sk;
    int err;
    long timeo;
 
    if (addr_len < sizeof(uaddr->sa_family))
        return -EINVAL;
 
    //check the address family
    if (uaddr->sa_family == AF_UNSPEC) {
        err = sk->sk_prot->disconnect(sk, flags);
        sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
        goto out;
    }
 
    switch (sock->state) {
    default:
        err = -EINVAL;
        goto out;
    case SS_CONNECTED:
        //already connected
        err = -EISCONN;
        goto out;
    case SS_CONNECTING:
        //connection already in progress
        err = -EALREADY;
        /* Fall out of switch with err, set for this state */
        break;
    case SS_UNCONNECTED:
        err = -EISCONN;
        if (sk->sk_state != TCP_CLOSE)
            goto out;
 
        //for stream sockets: sock->ops is inet_stream_ops --> inet_stream_connect --> tcp_prot --> tcp_v4_connect

        //for datagram sockets: sock->ops is inet_dgram_ops --> inet_dgram_connect --> udp_prot --> ip4_datagram_connect
        err = sk->sk_prot->connect(sk, uaddr, addr_len);
        if (err < 0)
            goto out;
 
        //the protocol-level work is done, but our own bookkeeping is not,
        //so switch the socket to the "connecting" state
        sock->state = SS_CONNECTING;
 
        /* Just entered SS_CONNECTING state; the only
         * difference is that return value in non-blocking
         * case is EINPROGRESS, rather than EALREADY.
         */
        err = -EINPROGRESS;
        break;
    }
 
    //get the blocking timeout timeo; for a non-blocking socket timeo is 0
    //connect()'s timeout is sk->sk_sndtimeo, initialized in sock_init_data() to MAX_SCHEDULE_TIMEOUT (wait forever); it can be changed with the SO_SNDTIMEO socket option
    timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
 
    if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
        int writebias = (sk->sk_protocol == IPPROTO_TCP) &&
                tcp_sk(sk)->fastopen_req &&
                tcp_sk(sk)->fastopen_req->data ? 1 : 0;
 
        /* Error code is set above */
 
    //if the socket is non-blocking, return the error code -EINPROGRESS right away.
    //if the socket is blocking, call inet_wait_for_connect() and sleep. The sleep is woken in three cases:
    //(1) with SO_SNDTIMEO set, the sleep exceeds the limit and returns 0; connect() returns -EINPROGRESS.
    //(2) a signal arrives; the remaining wait time is returned; connect() returns -ERESTARTSYS or -EINTR.
    //(3) the handshake completes and the sock state moves from TCP_SYN_SENT or TCP_SYN_RECV to TCP_ESTABLISHED
    if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
            goto out;
 
        err = sock_intr_errno(timeo);
        //the process caught a signal; if err is -ERESTARTSYS, the C library will call connect() again
        if (signal_pending(current))
            goto out;
    }
 
    /* Connection was closed by RST, timeout, ICMP error
     * or another process disconnected us.
     */
    if (sk->sk_state == TCP_CLOSE)
        goto sock_error;
 
    /* sk->sk_err may be not zero now, if RECVERR was ordered by user
     * and error was received after socket entered established state.
     * Hence, it is handled normally after connect() return successfully.
     */
 
    //update the socket state: the connection is established
    sock->state = SS_CONNECTED;
 
    //clear the error code
    err = 0;
out:
    return err;
 
sock_error:
    err = sock_error(sk) ? : -ECONNABORTED;
    sock->state = SS_UNCONNECTED;
 
    //for TCP, sk_prot is tcp_prot, so disconnect is tcp_disconnect()
    if (sk->sk_prot->disconnect(sk, flags))
        //if that fails
        sock->state = SS_DISCONNECTING;
    goto out;
}

This function does four things:

1. checks the socket address length and address family;

2. checks the socket state, which must be SS_UNCONNECTED or SS_CONNECTING;

3. calls the underlying protocol's connect function — for a stream socket the protocol is TCP, so tcp_v4_connect() is called;

4. for a blocking call, waits for the rest of the handshake to complete; for a non-blocking call, returns -EINPROGRESS immediately.
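The blocking/non-blocking split in step 4 is visible from user space. A minimal sketch (a loopback listener is set up inside the function purely for demonstration): a non-blocking connect() typically fails with errno EINPROGRESS, and completion of the handshake is then detected by polling for writability and reading SO_ERROR:

```c
/* Sketch of non-blocking connect() completion. Returns 0 once the
 * handshake over loopback has finished, -1 on any failure. */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

int nonblocking_connect_demo(void)
{
    /* loopback listener so the connect has a peer to talk to */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 1) < 0)
        return -1;
    socklen_t len = sizeof(addr);
    getsockname(srv, (struct sockaddr *)&addr, &len);

    int cli = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(cli, F_SETFL, fcntl(cli, F_GETFL) | O_NONBLOCK);

    int rc = connect(cli, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno != EINPROGRESS)
        return -1;                     /* a real failure */
    if (rc < 0) {
        /* handshake in flight: wait for writability, then check SO_ERROR */
        struct pollfd pfd = { .fd = cli, .events = POLLOUT };
        if (poll(&pfd, 1, 1000) != 1)
            return -1;
        int err = 0;
        socklen_t elen = sizeof(err);
        getsockopt(cli, SOL_SOCKET, SO_ERROR, &err, &elen);
        if (err != 0)
            return -1;
    }
    close(cli);
    close(srv);
    return 0;
}
```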

The three-way handshake is normally initiated by the client through connect, so we analyze tcp_v4_connect first:

/* This will initiate an outgoing connection. */
 
//for TCP, connecting means sending a SYN segment and, when the server's reply arrives, answering it with an ACK — that is, performing the first and third steps of the three-way handshake
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
    struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
    struct inet_sock *inet = inet_sk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    __be16 orig_sport, orig_dport;
    __be32 daddr, nexthop;
    struct flowi4 *fl4;
    struct rtable *rt;
    int err;
    struct ip_options_rcu *inet_opt;
 
    if (addr_len < sizeof(struct sockaddr_in))
        return -EINVAL;
 
    if (usin->sin_family != AF_INET)
        return -EAFNOSUPPORT;
 
    //tentatively set both the next-hop and the destination address to the
    //address supplied by the user
    nexthop = daddr = usin->sin_addr.s_addr;
    inet_opt = rcu_dereference_protected(inet->inet_opt,
                         lockdep_sock_is_held(sk));

    //with strict source routing, take the next hop from the IP option
    if (inet_opt && inet_opt->opt.srr) {
        if (!daddr)
            return -EINVAL;
        nexthop = inet_opt->opt.faddr;
    }
 
    //source port
    orig_sport = inet->inet_sport;

    //destination port
    orig_dport = usin->sin_port;
    
    fl4 = &inet->cork.fl.u.ip4;
 
    //route lookup; with source routing this picks a suitable next hop
    rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
                  RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
                  IPPROTO_TCP,
                  orig_sport, orig_dport, sk);
    if (IS_ERR(rt)) {
        err = PTR_ERR(rt);
        if (err == -ENETUNREACH)
            IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
        return err;
    }
 
    //validate the route returned by the lookup: TCP may not use
    //multicast or broadcast routes
    if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
        ip_rt_put(rt);
        return -ENETUNREACH;
    }

    //without strict source routing, update the destination address with
    //the one returned by the route lookup
    if (!inet_opt || !inet_opt->opt.srr)
        daddr = fl4->daddr;

    //if no source address has been set yet, use the one from the route
    if (!inet->inet_saddr)
        inet->inet_saddr = fl4->saddr;
    sk_rcv_saddr_set(sk, inet->inet_saddr);

    //timestamp state inherited from a previous connection to a different
    //peer must be reset
    if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
        /* Reset inherited state */
        tp->rx_opt.ts_recent       = 0;
        tp->rx_opt.ts_recent_stamp = 0;
        if (likely(!tp->repair))
            tp->write_seq       = 0;
    }
 
    if (tcp_death_row.sysctl_tw_recycle &&
        !tp->rx_opt.ts_recent_stamp && fl4->daddr == daddr)
        tcp_fetch_timewait_stamp(sk, &rt->dst);
 
    //save the destination address and port
    inet->inet_dport = usin->sin_port;
    sk_daddr_set(sk, daddr);
 
    inet_csk(sk)->icsk_ext_hdr_len = 0;
    if (inet_opt)
        inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 
    //set the minimum allowed MSS clamp, TCP_MSS_DEFAULT (536)
    tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
 
    /* Socket identity is still unknown (sport may be zero).
     * However we set state to SYN-SENT and not releasing socket
     * lock select source port, enter ourselves into the hash tables and
     * complete initialization after this.
     */
 
    //the socket state is set to TCP_SYN_SENT
    tcp_set_state(sk, TCP_SYN_SENT);
    //dynamically pick a local port (much like bind(2) does) and insert
    //the sock into the connection hash tables
    err = inet_hash_connect(&tcp_death_row, sk);
    if (err)
        goto failure;

    sk_set_txhash(sk);

    //redo the route lookup with the finally chosen ports
    rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
                   inet->inet_sport, inet->inet_dport, sk);

    if (IS_ERR(rt)) {
        err = PTR_ERR(rt);
        rt = NULL;
        goto failure;
    }
    /* OK, now commit destination to socket.  */
 
    //commit the route to the socket and set up GSO (NIC segmentation)
    sk->sk_gso_type = SKB_GSO_TCPV4;
    sk_setup_caps(sk, &rt->dst);
 
    //if the initial sequence number has not been chosen yet
    if (!tp->write_seq && likely(!tp->repair))
        //derive the ISN from the addresses and ports of both ends
        tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
                               inet->inet_daddr,
                               inet->inet_sport,
                               usin->sin_port);
 
    //initialize the IP identification field from the ISN and the current jiffies
    inet->inet_id = tp->write_seq ^ jiffies;
    
    //build a complete SYN segment from the information in sk and send it out
    err = tcp_connect(sk);
 
    rt = NULL;
    if (err)
        goto failure;
 
    return 0;
 
failure:
    /*
     * This unhashes the socket and releases the local port,
     * if necessary.
     */
    tcp_set_state(sk, TCP_CLOSE);
    ip_rt_put(rt);
    sk->sk_route_caps = 0;
    inet->inet_dport = 0;
    return err;
}

This function mainly:

1. performs the route lookup to obtain the next hop, and updates the socket object accordingly;

2. sets the socket state to TCP_SYN_SENT;

3. picks a random initial sequence number if one has not been set;

4. calls tcp_connect to build and send the SYN segment.
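Step 3 deserves a note: secure_tcp_sequence_number() derives the ISN from a keyed hash over the connection 4-tuple plus a clock component (the approach standardized in RFC 6528). The mixing function below is a toy stand-in, not the kernel's actual hash; only the overall shape matches — a hash of the 4-tuple and a secret, plus a clock term so ISNs for a reused 4-tuple keep advancing:

```c
/* Toy sketch of ISN generation: keyed hash of the 4-tuple + clock.
 * The mix() function is illustrative only, NOT the kernel's algorithm. */
#include <stdint.h>

static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;              /* golden-ratio multiplier, toy mixing */
    return (h << 13) | (h >> 19);  /* rotate to spread the bits */
}

uint32_t toy_secure_isn(uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t secret, uint32_t usec_clock)
{
    uint32_t h = secret;
    h = mix(h, saddr);
    h = mix(h, daddr);
    h = mix(h, ((uint32_t)sport << 16) | dport);
    /* the clock term keeps the ISN advancing for a reused 4-tuple */
    return h + usec_clock;
}
```

Because the hash part depends only on the 4-tuple and the secret, the same endpoints always get ISNs that differ exactly by the clock delta, while different 4-tuples get unrelated values.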

On to tcp_connect:

/* Build a SYN and send it off. */
//called via tcp_v4_connect()->tcp_connect()->tcp_transmit_skb(); the socket is already in TCP_SYN_SENT.
int tcp_connect(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *buff;
    int err;
 
    //initialize the connection-related members of the transmission control block
    tcp_connect_init(sk);
 
    if (unlikely(tp->repair)) {
        tcp_finish_connect(sk, NULL);
        return 0;
    }
    //allocate an skbuff for the SYN segment and initialize it
    buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
    if (unlikely(!buff))
        return -ENOBUFS;
 
    //build the SYN segment

    //write_seq was already initialized to a random value in tcp_v4_connect
    tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
    
    tp->retrans_stamp = tcp_time_stamp;
 
    //add the segment to the send queue
    tcp_connect_queue_skb(sk, buff);
 
    //Explicit Congestion Notification:
    //routers notify TCP of congestion. As a segment travels, a router uses two
    //bits in the IP header to record congestion, so when the segment arrives the
    //receiver knows whether it experienced congestion somewhere. But it is the
    //sender, not the receiver, that needs this information, so the receiver
    //reports it in the next ACK, and the sender responds by shrinking its
    //congestion window.
    tcp_ecn_send_syn(sk, buff);
 
    /* Send off SYN; include data in Fast Open. */
    err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
 
          //build the TCP and IP headers and transmit
          tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
    if (err == -ECONNREFUSED)
        return err;
 
    /* We change tp->snd_nxt after the tcp_transmit_skb() call
     * in order to make this packet get counted in tcpOutSegs.
     */
    tp->snd_nxt = tp->write_seq;
    tp->pushed_seq = tp->write_seq;
    TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
 
    /* Timer for repeating the SYN until an answer. */
 
    //start the retransmission timer
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
    return 0;
}

This function:

1. initializes the connection-related fields of the socket;

2. allocates an sk_buff;

3. initializes the sk_buff as a SYN segment — in essence this fills in tcp_skb_cb, which is used later when the TCP header is built;

4. calls tcp_connect_queue_skb() to add the sk_buff to the send queue sk->sk_write_queue;

5. calls tcp_transmit_skb() to build the TCP header and hand the segment to the network layer;

6. starts the retransmission timer.

tcp_transmit_skb() essentially moves the sk_buff's data pointer, fills in the TCP header, and then passes the segment down to the network layer to be sent.
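The retransmission timer armed in step 6 doubles its timeout on every expiry (exponential backoff). A small sketch of the resulting schedule, assuming a 1-second initial RTO and the Linux default of 5 SYN retries, reproduces the 63-second total often quoted for SYN timeouts:

```c
/* Sketch of the SYN retransmission schedule: the timer armed by
 * inet_csk_reset_xmit_timer() doubles on each timeout (exponential
 * backoff, capped in reality at TCP_RTO_MAX). Returns the total time,
 * in seconds, before the connection attempt is abandoned. */
unsigned int syn_total_timeout(unsigned int initial_rto, unsigned int retries)
{
    unsigned int total = 0, rto = initial_rto;

    /* wait out the initial SYN plus each retry before giving up */
    for (unsigned int i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2;
    }
    return total;
}
```

With initial_rto = 1 and retries = 5 this yields 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, matching the SYN-flood discussion later in this article.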

This completes the client's part of the first handshake step: the SYN travels to the server, and once the server has processed it the first step is done. The client socket is now in the TCP_SYN_SENT state.
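The client-side transitions covered so far can be condensed into a toy state machine (simplified: no simultaneous open, no Fast Open; in the kernel the real transitions happen in tcp_set_state()):

```c
/* Toy model of the client-side handshake states, for illustration only. */
enum toy_state { TOY_CLOSE, TOY_SYN_SENT, TOY_ESTABLISHED };
enum toy_event { EV_CONNECT, EV_SYNACK_RCVD };

/* returns the next state, or the current state if the event is invalid */
enum toy_state toy_client_next(enum toy_state s, enum toy_event e)
{
    if (s == TOY_CLOSE && e == EV_CONNECT)
        return TOY_SYN_SENT;        /* tcp_v4_connect(): SYN sent */
    if (s == TOY_SYN_SENT && e == EV_SYNACK_RCVD)
        return TOY_ESTABLISHED;     /* ACK of the SYN+ACK sent */
    return s;
}
```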

 

On to the second handshake step: what the server does when it receives the segment sent by the client.

On the server, the receive path enters TCP through the tcp_v4_rcv() function:

/*
 *    From tcp_input.c
 */
 
//NIC driver-->netif_receive_skb()--->ip_rcv()--->ip_local_deliver_finish()---> tcp_v4_rcv()
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct net *net = dev_net(skb->dev);
    const struct iphdr *iph;
    const struct tcphdr *th;
    bool refcounted;
    struct sock *sk;
    int ret;
 
    //drop packets that are not addressed to this host
    if (skb->pkt_type != PACKET_HOST)
        goto discard_it;
 
    /* Count it even if it's bad */
    __TCP_INC_STATS(net, TCP_MIB_INSEGS);
 
 
    //is the packet at least as long as a basic TCP header?
    if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
        goto discard_it;
 
    //TCP header; re-read below because pskb_may_pull() may relocate skb->data
    th = (const struct tcphdr *)skb->data;
 
    if (unlikely(th->doff < sizeof(struct tcphdr) / 4))
        goto bad_packet;
    
    if (!pskb_may_pull(skb, th->doff * 4))
        goto discard_it;
 
    /* An explanation is required here, I think.
     * Packet length and doff are validated by header prediction,
     * provided case of th->doff==0 is eliminated.
     * So, we defer the checks. */
 
    if (skb_checksum_init(skb, IPPROTO_TCP, inet_compute_pseudo))
        goto csum_error;
    
    //re-read the TCP header pointer: skb->data may have moved
    th = (const struct tcphdr *)skb->data;
 
    //get the IP header
    iph = ip_hdr(skb);
    /* This is tricky : We move IPCB at its correct location into TCP_SKB_CB()
     * barrier() makes sure compiler wont play fool^Waliasing games.
     */
    memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
        sizeof(struct inet_skb_parm));
    barrier();
 
    TCP_SKB_CB(skb)->seq = ntohl(th->seq);
    TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
                    skb->len - th->doff * 4);
    TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
    TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
    TCP_SKB_CB(skb)->tcp_tw_isn = 0;
    TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
    TCP_SKB_CB(skb)->sacked     = 0;
 
lookup:
    //look up the sock object from the source port, destination port and
    //receiving interface: first in the established-connections hash table,
    //then, failing that, in the listening hash table

    //during connection establishment it can only be found in the listening table
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
                   th->dest, &refcounted);
    
    //if no socket is found to handle it, drop the packet
    if (!sk)
        goto no_tcp_socket;
 
process:
 
    //TIME_WAIT sockets are handled separately
    if (sk->sk_state == TCP_TIME_WAIT)
        goto do_time_wait;
 
    if (sk->sk_state == TCP_NEW_SYN_RECV) {
        struct request_sock *req = inet_reqsk(sk);
        struct sock *nsk;
 
        sk = req->rsk_listener;
        if (unlikely(tcp_v4_inbound_md5_hash(sk, skb))) {
            sk_drops_add(sk, skb);
            reqsk_put(req);
            goto discard_it;
        }
        if (unlikely(sk->sk_state != TCP_LISTEN)) {
            inet_csk_reqsk_queue_drop_and_put(sk, req);
            goto lookup;
        }
        /* We own a reference on the listener, increase it again
         * as we might lose it too soon.
         */
        sock_hold(sk);
        refcounted = true;
        nsk = tcp_check_req(sk, skb, req, false);
        if (!nsk) {
            reqsk_put(req);
            goto discard_and_relse;
        }
        if (nsk == sk) {
            reqsk_put(req);
        } else if (tcp_child_process(sk, nsk, skb)) {
            tcp_v4_send_reset(nsk, skb);
            goto discard_and_relse;
        } else {
            sock_put(sk);
            return 0;
        }
    }
    if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
        __NET_INC_STATS(net, LINUX_MIB_TCPMINTTLDROP);
        goto discard_and_relse;
    }
 
    if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
        goto discard_and_relse;
 
    if (tcp_v4_inbound_md5_hash(sk, skb))
        goto discard_and_relse;
 
    nf_reset(skb);
 
    if (tcp_filter(sk, skb))
        goto discard_and_relse;
    
    //re-read the header pointers once more after tcp_filter()
    th = (const struct tcphdr *)skb->data;
    iph = ip_hdr(skb);
 
    skb->dev = NULL;
 
    //if the socket is in the listening state --> this is the path we focus on
    if (sk->sk_state == TCP_LISTEN) {
        ret = tcp_v4_do_rcv(sk, skb);
        goto put_and_return;
    }
 
    sk_incoming_cpu_update(sk);
 
    bh_lock_sock_nested(sk);
    tcp_segs_in(tcp_sk(sk), skb);
    ret = 0;
    
    //check whether a user-space process currently holds the sock lock;
    //if sock_owned_by_user() is true, the sock state must not be changed here
    if (!sock_owned_by_user(sk)) {
        if (!tcp_prequeue(sk, skb))
            //-------------------------------------------------------->
            ret = tcp_v4_do_rcv(sk, skb);
    } else if (tcp_add_backlog(sk, skb)) {
        goto discard_and_relse;
    }
    bh_unlock_sock(sk);
 
put_and_return:
    if (refcounted)
        sock_put(sk);
 
    return ret;
 
no_tcp_socket:
    if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
        goto discard_it;
 
    if (tcp_checksum_complete(skb)) {
csum_error:
        __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
bad_packet:
        __TCP_INC_STATS(net, TCP_MIB_INERRS);
    } else {
        tcp_v4_send_reset(NULL, skb);
    }
 
discard_it:
    /* Discard frame. */
    kfree_skb(skb);
    return 0;
 
discard_and_relse:
    sk_drops_add(sk, skb);
    if (refcounted)
        sock_put(sk);
    goto discard_it;
 
do_time_wait:
    if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
        inet_twsk_put(inet_twsk(sk));
        goto discard_it;
    }
 
    if (tcp_checksum_complete(skb)) {
        inet_twsk_put(inet_twsk(sk));
        goto csum_error;
    }
    switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
    case TCP_TW_SYN: {
        struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
                            &tcp_hashinfo, skb,
                            __tcp_hdrlen(th),
                            iph->saddr, th->source,
                            iph->daddr, th->dest,
                            inet_iif(skb));
        if (sk2) {
            inet_twsk_deschedule_put(inet_twsk(sk));
            sk = sk2;
            refcounted = false;
            goto process;
        }
        /* Fall through to ACK */
    }
    case TCP_TW_ACK:
        tcp_v4_timewait_ack(sk, skb);
        break;
    case TCP_TW_RST:
        tcp_v4_send_reset(sk, skb);
        inet_twsk_deschedule_put(inet_twsk(sk));
        goto discard_it;
    case TCP_TW_SUCCESS:;
    }
    goto discard_it;
}

The main job of this function is to find, from the TCP header, the socket that should handle the segment, and then branch on the socket's state. In our case the socket is in the TCP_LISTEN state, so tcp_v4_do_rcv() is called directly; for a listening socket that function does little more than validate the segment before handing it on to tcp_rcv_state_process():

/*
 *    This function implements the receiving procedure of RFC 793 for
 *    all states except ESTABLISHED and TIME_WAIT.
 *    It's called from both tcp_v4_rcv and tcp_v6_rcv and should be
 *    address independent.
 */
 
 
//TCP segment processing for every state except ESTABLISHED and TIME_WAIT is implemented here

// tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_v4_conn_request() -> tcp_v4_send_synack().
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);
    const struct tcphdr *th = tcp_hdr(skb);
    struct request_sock *req;
    int queued = 0;
    bool acceptable;
 
    switch (sk->sk_state) {
        
    //TCP_CLOSE: nothing to do, drop the segment
    case TCP_CLOSE:
        goto discard;
 
    //server side: handling of the first handshake step
    case TCP_LISTEN:
        if (th->ack)
            return 1;
 
        if (th->rst)
            goto discard;
 
        if (th->syn) {
            if (th->fin)
                goto discard;
            // tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_v4_conn_request() -> tcp_v4_send_synack().        
            if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
                return 1;
 
            consume_skb(skb);
            return 0;
        }
        goto discard;
 
    //client side: handling of the second handshake step
    case TCP_SYN_SENT:
        tp->rx_opt.saw_tstamp = 0;
 
        //process a TCP segment received in the SYN_SENT state
        queued = tcp_rcv_synsent_state_process(sk, skb, th);
        if (queued >= 0)
            return queued;
 
        /* Do step6 onward by hand. */
 
        //after the second handshake step, handle any urgent (out-of-band) data
        tcp_urg(sk, skb, th);
        __kfree_skb(skb);
 
        //check whether there is data waiting to be sent
        tcp_data_snd_check(sk);
        return 0;
    }
 
    tp->rx_opt.saw_tstamp = 0;
    req = tp->fastopen_rsk;
    if (req) {
        WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
            sk->sk_state != TCP_FIN_WAIT1);
 
        if (!tcp_check_req(sk, skb, req, true))
            goto discard;
    }
 
    if (!th->ack && !th->rst && !th->syn)
        goto discard;
 
    if (!tcp_validate_incoming(sk, skb, th, 0))
        return 0;
 
    /* step 5: check the ACK field */
    acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH |
                      FLAG_UPDATE_TS_RECENT) > 0;
 
    switch (sk->sk_state) {
    //server side: handling of the third handshake step
    case TCP_SYN_RECV:
        if (!acceptable)
            return 1;
 
        if (!tp->srtt_us)
            tcp_synack_rtt_meas(sk, req);
 
        /* Once we leave TCP_SYN_RECV, we no longer need req
         * so release it.
         */
        if (req) {
            inet_csk(sk)->icsk_retransmits = 0;
            reqsk_fastopen_remove(sk, req, false);
        } else {
            /* Make sure socket is routed, for correct metrics. */
 
            //rebuild the route and initialize the congestion control module
            icsk->icsk_af_ops->rebuild_header(sk);
            tcp_init_congestion_control(sk);
 
            tcp_mtup_init(sk);
            tp->copied_seq = tp->rcv_nxt;
            tcp_init_buffer_space(sk);
        }
        smp_mb();
        //normal third handshake step: set the connection state to TCP_ESTABLISHED
        tcp_set_state(sk, TCP_ESTABLISHED);
        sk->sk_state_change(sk);
 
        /* Note, that this wakeup is only for marginal crossed SYN case.
         * Passively open sockets are not waked up, because
         * sk->sk_sleep == NULL and sk->sk_socket == NULL.
         */
 
        //the state is now established: wake up any waiting threads
        if (sk->sk_socket)
            sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
 
        tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
        tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
        tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 
        if (tp->rx_opt.tstamp_ok)
            tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
        if (req) {
            /* Re-arm the timer because data may have been sent out.
             * This is similar to the regular data transmission case
             * when new data has just been ack'ed.
             *
             * (TFO) - we could try to be more aggressive and
             * retransmitting any data sooner based on when they
             * are sent out.
             */
            tcp_rearm_rto(sk);
        } else
            tcp_init_metrics(sk);
 
        if (!inet_csk(sk)->icsk_ca_ops->cong_control)
            tcp_update_pacing_rate(sk);
 
        /* Prevent spurious tcp_cwnd_restart() on first data packet */
 
        //update the time of the last sent data packet
        tp->lsndtime = tcp_time_stamp;
 
        tcp_initialize_rcv_mss(sk);
 
        //compute the flags used for TCP header prediction
        tcp_fast_path_on(tp);
        break;
 
    case TCP_FIN_WAIT1: {
        struct dst_entry *dst;
        int tmo;
 
        /* If we enter the TCP_FIN_WAIT1 state and we are a
         * Fast Open socket and this is the first acceptable
         * ACK we have received, this would have acknowledged
         * our SYNACK so stop the SYNACK timer.
         */
        if (req) {
            /* Return RST if ack_seq is invalid.
             * Note that RFC793 only says to generate a
             * DUPACK for it but for TCP Fast Open it seems
             * better to treat this case like TCP_SYN_RECV
             * above.
             */
            if (!acceptable)
                return 1;
            /* We no longer need the request sock. */
            reqsk_fastopen_remove(sk, req, false);
            tcp_rearm_rto(sk);
        }
        if (tp->snd_una != tp->write_seq)
            break;
 
        tcp_set_state(sk, TCP_FIN_WAIT2);
        sk->sk_shutdown |= SEND_SHUTDOWN;
 
        dst = __sk_dst_get(sk);
        if (dst)
            dst_confirm(dst);
 
        if (!sock_flag(sk, SOCK_DEAD)) {
            /* Wake up lingering close() */
            sk->sk_state_change(sk);
            break;
        }
 
        if (tp->linger2 < 0 ||
            (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
             after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))) {
            tcp_done(sk);
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
            return 1;
        }
 
        tmo = tcp_fin_time(sk);
        if (tmo > TCP_TIMEWAIT_LEN) {
            inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
        } else if (th->fin || sock_owned_by_user(sk)) {
            /* Bad case. We could lose such FIN otherwise.
             * It is not a big problem, but it looks confusing
             * and not so rare event. We still can lose it now,
             * if it spins in bh_lock_sock(), but it is really
             * marginal case.
             */
            inet_csk_reset_keepalive_timer(sk, tmo);
        } else {
            tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
            goto discard;
        }
        break;
    }
 
    case TCP_CLOSING:
        if (tp->snd_una == tp->write_seq) {
            tcp_time_wait(sk, TCP_TIME_WAIT, 0);
            goto discard;
        }
        break;
 
    case TCP_LAST_ACK:
        if (tp->snd_una == tp->write_seq) {
            tcp_update_metrics(sk);
            tcp_done(sk);
            goto discard;
        }
        break;
    }
 
    /* step 6: check the URG bit */
    tcp_urg(sk, skb, th);
 
    /* step 7: process the segment text */
    switch (sk->sk_state) {
    case TCP_CLOSE_WAIT:
    case TCP_CLOSING:
    case TCP_LAST_ACK:
        if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
            break;
    case TCP_FIN_WAIT1:
    case TCP_FIN_WAIT2:
        /* RFC 793 says to queue data in these states,
         * RFC 1122 says we MUST send a reset.
         * BSD 4.4 also does reset.
         */
        if (sk->sk_shutdown & RCV_SHUTDOWN) {
            if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
                after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
                tcp_reset(sk);
                return 1;
            }
        }
        /* Fall through */
    case TCP_ESTABLISHED:
        tcp_data_queue(sk, skb);
        queued = 1;
        break;
    }
 
    /* tcp_data could move socket to TIME-WAIT */
    if (sk->sk_state != TCP_CLOSE) {
        tcp_data_snd_check(sk);
        tcp_ack_snd_check(sk);
    }
 
    if (!queued) {
discard:
        tcp_drop(sk, skb);
    }
    return 0;
}

This is the core of TCP connection handling: sockets in almost every state have their incoming segments processed here. For the server, the state on receiving the first handshake segment is TCP_LISTEN, so the segment is handed to the tcp_v4_conn_request function:

// tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_v4_conn_request() -> tcp_v4_send_synack().        
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
    /* Never answer to SYNs send to broadcast or multicast */
    if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
        goto drop;
 
    //tcp_request_sock_ops is defined in tcp_ipv4.c, line 1256

    //inet_init --> proto_register --> req_prot_init --> initializes the cache name
    return tcp_conn_request(&tcp_request_sock_ops,
                &tcp_request_sock_ipv4_ops, sk, skb);
 
drop:
    tcp_listendrop(sk);
    return 0;
}
int tcp_conn_request(struct request_sock_ops *rsk_ops,
             const struct tcp_request_sock_ops *af_ops,
             struct sock *sk, struct sk_buff *skb)
{
    struct tcp_fastopen_cookie foc = { .len = -1 };
    __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    struct tcp_options_received tmp_opt;
    struct tcp_sock *tp = tcp_sk(sk);
    struct net *net = sock_net(sk);
    struct sock *fastopen_sk = NULL;
    struct dst_entry *dst = NULL;
    struct request_sock *req;
    bool want_cookie = false;
    struct flowi fl;
 
    /* TW buckets are converted to open requests without
     * limitations, they conserve resources and peer is
     * evidently real one.
     */
 
    //TCP SYN-flood handling.

    //If a client sends a SYN and then dies, the SYN-ACK the server sends back
    //never gets its ACK; the connection is neither established nor failed. The
    //server needs a timeout to tear such a connection down, otherwise its entry
    //would occupy a slot in the server's SYN queue forever, and enough of them
    //would exhaust the queue and starve legitimate connections.

    //By default Linux retransmits the SYN-ACK 5 times, with the retry interval
    //starting at 1s and doubling each time: 1s, 2s, 4s, 8s, 16s, i.e. 31s in
    //total; after the 5th retransmission it still waits another 32s to learn
    //that the 5th also timed out, so the whole sequence takes
    //1s + 2s + 4s + 8s + 16s + 32s = 63s before TCP drops the connection.
    //This 63-second window gives attackers an opening: by sending a flood of
    //SYNs in a short time (a SYN flood attack) they can exhaust the server's
    //SYN queue. Linux provides several TCP parameters to cope with excess SYNs:
    //tcp_syncookies, tcp_synack_retries, tcp_max_syn_backlog and
    //tcp_abort_on_overflow.
    if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
         inet_csk_reqsk_queue_is_full(sk)) && !isn) {
        want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
        if (!want_cookie)
            goto drop;
    }
 
 
    /* Accept backlog is full. If we have already queued enough
     * of warm entries in syn queue, drop request. It is better than
     * clogging syn queue with openreqs with exponentially increasing
     * timeout.
     */
    if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
        NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        goto drop;
    }
 
    //allocate a request_sock object to represent this half-open connection

    //during the three-way handshake the server keeps a queue of half-open
    //connections, with one entry per client SYN (the request_sock structure is
    //created as soon as the SYN arrives and stored in this queue). An entry
    //means the server has received the SYN, has acknowledged it with a SYN+ACK
    //(the second handshake step), and is waiting for the client's ACK.
    //Connections in this queue are in the SYN_RECV state on the server; when
    //the client's ACK arrives, the entry is removed and the server moves to
    //ESTABLISHED. This SYN queue has length
    //max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog), where tcp_max_syn_backlog
    //is configured under /proc/sys/net/ipv4/.

    // tcp_request_sock_ops
    //inet_init --> proto_register --> req_prot_init --> initializes the cache name
    req = inet_reqsk_alloc(rsk_ops, sk, !want_cookie);
    if (!req)
        goto drop;
 
    //protocol-specific operations for this request_sock
    tcp_rsk(req)->af_specific = af_ops;
 
    tcp_clear_options(&tmp_opt);
    tmp_opt.mss_clamp = af_ops->mss_clamp;
    tmp_opt.user_mss  = tp->rx_opt.user_mss;
    tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
 
    if (want_cookie && !tmp_opt.saw_tstamp)
        tcp_clear_options(&tmp_opt);
 
    tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
    //initialize the connection request block: request_sock, inet_request_sock and tcp_request_sock
    tcp_openreq_init(req, &tmp_opt, skb, sk);
    
    inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent;
 
    /* Note: tcp_v6_init_req() might override ir_iif for link locals */
    inet_rsk(req)->ir_iif = inet_request_bound_dev_if(sk, skb);
 
    //tcp_request_sock_ipv4_ops  --> tcp_v4_init_req
    af_ops->init_req(req, sk, skb);
 
    if (security_inet_conn_request(sk, skb, req))
        goto drop_and_free;
 
    if (!want_cookie && !isn) {
        /* VJ's idea. We save last timestamp seen
         * from the destination in peer table, when entering
         * state TIME-WAIT, and check against it before
         * accepting new connection request.
         *
         * If "isn" is not zero, this request hit alive
         * timewait bucket, so that all the necessary checks
         * are made in the function processing timewait state.
         */
        if (tcp_death_row.sysctl_tw_recycle) {
            bool strict;
 
            dst = af_ops->route_req(sk, &fl, req, &strict);
 
            if (dst && strict &&
                !tcp_peer_is_proven(req, dst, true,
                        tmp_opt.saw_tstamp)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                goto drop_and_release;
            }
        }
        /* Kill the following clause, if you dislike this way. */
        else if (!net->ipv4.sysctl_tcp_syncookies &&
             (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
              (sysctl_max_syn_backlog >> 2)) &&
             !tcp_peer_is_proven(req, dst, false,
                         tmp_opt.saw_tstamp)) {
            /* Without syncookies last quarter of
             * backlog is filled with destinations,
             * proven to be alive.
             * It means that we continue to communicate
             * to destinations, already remembered
             * to the moment of synflood.
             */
            pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
                    rsk_ops->family);
            goto drop_and_release;
        }
 
        isn = af_ops->init_seq(skb);
    }
    if (!dst) {
        dst = af_ops->route_req(sk, &fl, req, NULL);
        if (!dst)
            goto drop_and_free;
    }
    //Explicit Congestion Notification (ECN) setup
    tcp_ecn_create_request(req, skb, sk, dst);
 
    if (want_cookie) {
        isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
        req->cookie_ts = tmp_opt.tstamp_ok;
        if (!tmp_opt.tstamp_ok)
            inet_rsk(req)->ecn_ok = 0;
    }
 
    tcp_rsk(req)->snt_isn = isn;
    tcp_rsk(req)->txhash = net_tx_rndhash();
 
    //initialize the receive window
    tcp_openreq_init_rwin(req, sk, dst);
    if (!want_cookie) {
        tcp_reqsk_record_syn(sk, req, skb);
        fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
    }
    //TCP Fast Open: data carried during the handshake
    if (fastopen_sk) {
        af_ops->send_synack(fastopen_sk, dst, &fl, req,
                    &foc, TCP_SYNACK_FASTOPEN);
        /* Add the child socket directly into the accept queue */
        inet_csk_reqsk_queue_add(sk, req, fastopen_sk);
        sk->sk_data_ready(sk);
        bh_unlock_sock(fastopen_sk);
        sock_put(fastopen_sk);
    } else {
 
        //mark this request as not using TFO
        tcp_rsk(req)->tfo_listener = false;
        if (!want_cookie)
            inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
 
        // tcp_v4_do_rcv() -> tcp_rcv_state_process() -> tcp_v4_conn_request() -> tcp_v4_send_synack().    
        
        // tcp_request_sock_ipv4_ops  --> tcp_v4_send_synack
        af_ops->send_synack(sk, dst, &fl, req, &foc,
                    !want_cookie ? TCP_SYNACK_NORMAL :
                           TCP_SYNACK_COOKIE);
        if (want_cookie) {
            reqsk_free(req);
            return 0;
        }
    }
    reqsk_put(req);
    return 0;
 
drop_and_release:
    dst_release(dst);
drop_and_free:
    reqsk_free(req);
drop:
    tcp_listendrop(sk);
    return 0;
}

This function does quite a lot, but two points matter here:

1 It allocates a request_sock object to represent this connection request (state TCP_NEW_SYN_RECV). If the SYN-flood mitigation options are not in effect, the request_sock is added to the hash table that also holds ESTABLISHED-state sockets (with the mitigation enabled no request_sock is kept at all; one is only allocated once the connection completes).

2 It calls tcp_v4_send_synack to reply to the client with SYN+ACK, starting the second handshake step.

Let's look at that function:

//send the SYN+ACK segment to the client
static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
                  struct flowi *fl,
                  struct request_sock *req,
                  struct tcp_fastopen_cookie *foc,
                  enum tcp_synack_type synack_type)
{
    const struct inet_request_sock *ireq = inet_rsk(req);
    struct flowi4 fl4;
    int err = -1;
    struct sk_buff *skb;
 
    /* First, grab a route. */
 
    //look up the route to the client
    if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
        return -1;
 
    //build the SYN+ACK segment from the route, the transport control block and the connection request block
    skb = tcp_make_synack(sk, dst, req, foc, synack_type);
 
    //the SYN+ACK segment was built successfully
    if (skb) {
 
        //compute the checksum
        __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
 
 
        //build the IP datagram and send it out
        err = ip_build_and_send_pkt(skb, sk, ireq->ir_loc_addr,
                        ireq->ir_rmt_addr,
                        ireq->opt);
        err = net_xmit_eval(err);
    }
 
    return err;
}

The function looks up the route to the client, builds the SYN+ACK segment, and hands it to the network layer via ip_build_and_send_pkt. This completes the second handshake step. On receiving the SYN+ACK, the client replies with the final ACK and its socket moves to TCP_ESTABLISHED, while on the server side the request is still in TCP_NEW_SYN_RECV; the following function then handles the third handshake step:

int tcp_v4_rcv(struct sk_buff *skb)
{
.............
 
 
    //On receiving the final ACK of the handshake, the request in TCP_NEW_SYN_RECV state is looked up, a new sock is created that enters TCP_SYN_RECV and finally TCP_ESTABLISHED, and it is placed on the accept queue to notify select/epoll
    if (sk->sk_state == TCP_NEW_SYN_RECV) {
        struct request_sock *req = inet_reqsk(sk);
        struct sock *nsk;
 
        sk = req->rsk_listener;
        if (unlikely(tcp_v4_inbound_md5_hash(sk, skb))) {
            sk_drops_add(sk, skb);
            reqsk_put(req);
            goto discard_it;
        }
        if (unlikely(sk->sk_state != TCP_LISTEN)) {
            inet_csk_reqsk_queue_drop_and_put(sk, req);
            goto lookup;
        }
        /* We own a reference on the listener, increase it again
         * as we might lose it too soon.
         */
        sock_hold(sk);
        refcounted = true;
 
        //create a new sock in TCP_SYN_RECV state
        nsk = tcp_check_req(sk, skb, req, false);
        if (!nsk) {
            reqsk_put(req);
            goto discard_and_relse;
        }
        if (nsk == sk) {
            reqsk_put(req);
 
        //calls tcp_rcv_state_process
        } else if (tcp_child_process(sk, nsk, skb)) {
            tcp_v4_send_reset(nsk, skb);
            goto discard_and_relse;
        } else { //on success, return directly
            sock_put(sk);
            return 0;
        }
    }
}

Let's step into tcp_check_req() to see how the new socket is created during the third handshake step:

/*
 * Process an incoming packet for SYN_RECV sockets represented as a
 * request_sock. Normally sk is the listener socket but for TFO it
 * points to the child socket.
 *
 * XXX (TFO) - The current impl contains a special check for ack
 * validation and inside tcp_v4_reqsk_send_ack(). Can we do better?
 *
 * We don't need to initialize tmp_opt.sack_ok as we don't use the results
 */
 
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
               struct request_sock *req,
               bool fastopen)
{
    struct tcp_options_received tmp_opt;
    struct sock *child;
    const struct tcphdr *th = tcp_hdr(skb);
    __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
    bool paws_reject = false;
    bool own_req;
 
    tmp_opt.saw_tstamp = 0;
    if (th->doff > (sizeof(struct tcphdr)>>2)) {
        tcp_parse_options(skb, &tmp_opt, 0, NULL);
 
        if (tmp_opt.saw_tstamp) {
            tmp_opt.ts_recent = req->ts_recent;
            /* We do not store true stamp, but it is not required,
             * it can be estimated (approximately)
             * from another data.
             */
            tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->num_timeout);
            paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
        }
    }
 
    /* Check for pure retransmitted SYN. */
    if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
        flg == TCP_FLAG_SYN &&
        !paws_reject) {
        /*
         * RFC793 draws (Incorrectly! It was fixed in RFC1122)
         * this case on figure 6 and figure 8, but formal
         * protocol description says NOTHING.
         * To be more exact, it says that we should send ACK,
         * because this segment (at least, if it has no data)
         * is out of window.
         *
         *  CONCLUSION: RFC793 (even with RFC1122) DOES NOT
         *  describe SYN-RECV state. All the description
         *  is wrong, we cannot believe to it and should
         *  rely only on common sense and implementation
         *  experience.
         *
         * Enforce "SYN-ACK" according to figure 8, figure 6
         * of RFC793, fixed by RFC1122.
         *
         * Note that even if there is new data in the SYN packet
         * they will be thrown away too.
         *
         * Reset timer after retransmitting SYNACK, similar to
         * the idea of fast retransmit in recovery.
         */
        if (!tcp_oow_rate_limited(sock_net(sk), skb,
                      LINUX_MIB_TCPACKSKIPPEDSYNRECV,
                      &tcp_rsk(req)->last_oow_ack_time) &&
 
            !inet_rtx_syn_ack(sk, req)) {
            unsigned long expires = jiffies;
 
            expires += min(TCP_TIMEOUT_INIT << req->num_timeout,
                       TCP_RTO_MAX);
            if (!fastopen)
                mod_timer_pending(&req->rsk_timer, expires);
            else
                req->rsk_timer.expires = expires;
        }
        return NULL;
    }
 
    /* Further reproduces section "SEGMENT ARRIVES"
       for state SYN-RECEIVED of RFC793.
       It is broken, however, it does not work only
       when SYNs are crossed.
       You would think that SYN crossing is impossible here, since
       we should have a SYN_SENT socket (from connect()) on our end,
       but this is not true if the crossed SYNs were sent to both
       ends by a malicious third party.  We must defend against this,
       and to do that we first verify the ACK (as per RFC793, page
       36) and reset if it is invalid.  Is this a true full defense?
       To convince ourselves, let us consider a way in which the ACK
       test can still pass in this 'malicious crossed SYNs' case.
       Malicious sender sends identical SYNs (and thus identical sequence
       numbers) to both A and B:
        A: gets SYN, seq=7
        B: gets SYN, seq=7
       By our good fortune, both A and B select the same initial
       send sequence number of seven :-)
        A: sends SYN|ACK, seq=7, ack_seq=8
        B: sends SYN|ACK, seq=7, ack_seq=8
       So we are now A eating this SYN|ACK, ACK test passes.  So
       does sequence test, SYN is truncated, and thus we consider
       it a bare ACK.
       If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
       bare ACK.  Otherwise, we create an established connection.  Both
       ends (listening sockets) accept the new incoming connection and try
       to talk to each other. 8-)
       Note: This case is both harmless, and rare.  Possibility is about the
       same as us discovering intelligent life on another plant tomorrow.
       But generally, we should (RFC lies!) to accept ACK
       from SYNACK both here and in tcp_rcv_state_process().
       tcp_rcv_state_process() does not, hence, we do not too.
       Note that the case is absolutely generic:
       we cannot optimize anything here without
       violating protocol. All the checks must be made
       before attempt to create socket.
     */
 
    /* RFC793 page 36: "If the connection is in any non-synchronized state ...
     *                  and the incoming segment acknowledges something not yet
     *                  sent (the segment carries an unacceptable ACK) ...
     *                  a reset is sent."
     *
     * Invalid ACK: reset will be sent by listening socket.
     * Note that the ACK validity check for a Fast Open socket is done
     * elsewhere and is checked directly against the child socket rather
     * than req because user data may have been sent out.
     */
    if ((flg & TCP_FLAG_ACK) && !fastopen &&
        (TCP_SKB_CB(skb)->ack_seq !=
         tcp_rsk(req)->snt_isn + 1))
        return sk;
 
    /* Also, it would be not so bad idea to check rcv_tsecr, which
     * is essentially ACK extension and too early or too late values
     * should cause reset in unsynchronized states.
     */
 
    /* RFC793: "first check sequence number". */
 
    if (paws_reject || !tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
                      tcp_rsk(req)->rcv_nxt, tcp_rsk(req)->rcv_nxt + req->rsk_rcv_wnd)) {
        /* Out of window: send ACK and drop. */
        if (!(flg & TCP_FLAG_RST) &&
            !tcp_oow_rate_limited(sock_net(sk), skb,
                      LINUX_MIB_TCPACKSKIPPEDSYNRECV,
                      &tcp_rsk(req)->last_oow_ack_time))
            req->rsk_ops->send_ack(sk, skb, req);
        if (paws_reject)
            __NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
        return NULL;
    }
 
    /* In sequence, PAWS is OK. */
 
    if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
        req->ts_recent = tmp_opt.rcv_tsval;
 
    if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
        /* Truncate SYN, it is out of window starting
           at tcp_rsk(req)->rcv_isn + 1. */
        flg &= ~TCP_FLAG_SYN;
    }
 
    /* RFC793: "second check the RST bit" and
     *       "fourth, check the SYN bit"
     */
    if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) {
        __TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
        goto embryonic_reset;
    }
 
    /* ACK sequence verified above, just make sure ACK is
     * set.  If ACK not set, just silently drop the packet.
     *
     * XXX (TFO) - if we ever allow "data after SYN", the
     * following check needs to be removed.
     */
    if (!(flg & TCP_FLAG_ACK))
        return NULL;
 
    /* For Fast Open no more processing is needed (sk is the
     * child socket).
     */
    if (fastopen)
        return sk;
 
    /* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
    if (req->num_timeout < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
        TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
        inet_rsk(req)->acked = 1;
        __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
        return NULL;
    }
 
    /* OK, ACK is valid, create big socket and
     * feed this segment to it. It will repeat all
     * the tests. THIS SEGMENT MUST MOVE SOCKET TO
     * ESTABLISHED STATE. If it will be dropped after
     * socket is created, wait for troubles.
     */
 
    //create the child sk and remove the req sock from the ehash:  ipv4_specific --> tcp_v4_syn_recv_sock
    child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL,
                             req, &own_req);
    if (!child)
        goto listen_overflow;
 
    //sk->sk_rxhash = skb->hash; 
    sock_rps_save_rxhash(child, skb);
 
    //update rtt_min, srtt and rto
    tcp_synack_rtt_meas(child, req);
    //insert into the accept queue
    return inet_csk_complete_hashdance(sk, child, req, own_req);
 
listen_overflow:
    if (!sysctl_tcp_abort_on_overflow) {
        inet_rsk(req)->acked = 1;
        return NULL;
    }
 
embryonic_reset:
    if (!(flg & TCP_FLAG_RST)) {
        /* Received a bad SYN pkt - for TFO We try not to reset
         * the local connection unless it's really necessary to
         * avoid becoming vulnerable to outside attack aiming at
         * resetting legit local connections.
         */
        req->rsk_ops->send_reset(sk, skb);
    } else if (fastopen) { /* received a valid RST pkt */
        reqsk_fastopen_remove(sk, req, true);
        tcp_reset(sk);
    }
    if (!fastopen) {
        inet_csk_reqsk_queue_drop(sk, req);
        __NET_INC_STATS(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
    }
    return NULL;
}

Two points deserve attention:

1 Through the call chain tcp_v4_syn_recv_sock --> tcp_create_openreq_child --> inet_csk_clone_lock, a new sock is created with its state set to TCP_SYN_RECV; tcp_v4_syn_recv_sock also calls inet_ehash_nolisten to add the new sock to the ESTABLISHED-state hash table;

2 inet_csk_complete_hashdance then inserts the new sock into the accept queue.

At this point we have a new sock representing this connection, in state TCP_SYN_RECV. Next, tcp_child_process is called, which in turn calls tcp_rcv_state_process:

/*
 * Queue segment on the new socket if the new socket is active,
 * otherwise we just shortcircuit this and continue with
 * the new socket.
 *
 * For the vast majority of cases child->sk_state will be TCP_SYN_RECV
 * when entering. But other states are possible due to a race condition
 * where after __inet_lookup_established() fails but before the listener
 * locked is obtained, other packets cause the same connection to
 * be created.
 */
 
int tcp_child_process(struct sock *parent, struct sock *child,
              struct sk_buff *skb)
{
    int ret = 0;
    int state = child->sk_state;
 
    tcp_segs_in(tcp_sk(child), skb);
    if (!sock_owned_by_user(child)) {
        ret = tcp_rcv_state_process(child, skb);
        /* Wakeup parent, send SIGIO */
        if (state == TCP_SYN_RECV && child->sk_state != state)
            parent->sk_data_ready(parent);
    } else {
        /* Alas, it is possible again, because we do lookup
         * in main socket hash table and lock on listening
         * socket does not protect us more.
         */
        __sk_add_backlog(child, skb);
    }
 
    bh_unlock_sock(child);
    sock_put(child);
    return ret;
}

This brings us back to tcp_rcv_state_process, where a socket in TCP_SYN_RECV state is handled by the following code (only the TCP_SYN_RECV branch is shown):

    //server side: handle the third handshake step
    case TCP_SYN_RECV:
        if (!acceptable)
            return 1;
 
        if (!tp->srtt_us)
            tcp_synack_rtt_meas(sk, req);
 
        /* Once we leave TCP_SYN_RECV, we no longer need req
         * so release it.
         */
        if (req) {
            inet_csk(sk)->icsk_retransmits = 0;
            reqsk_fastopen_remove(sk, req, false);
        } else {
            /* Make sure socket is routed, for correct metrics. */
 
            //set up the route and initialize the congestion control module
            icsk->icsk_af_ops->rebuild_header(sk);
            tcp_init_congestion_control(sk);
 
            tcp_mtup_init(sk);
            tp->copied_seq = tp->rcv_nxt;
            tcp_init_buffer_space(sk);
        }
        smp_mb();
        //normal third handshake step: set the connection state to TCP_ESTABLISHED
        tcp_set_state(sk, TCP_ESTABLISHED);
        sk->sk_state_change(sk);
 
        /* Note, that this wakeup is only for marginal crossed SYN case.
         * Passively open sockets are not waked up, because
         * sk->sk_sleep == NULL and sk->sk_socket == NULL.
         */
 
        //the state is now established; wake up any threads waiting on the socket
        if (sk->sk_socket)
            sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
 
        tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
        tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
        tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 
        if (tp->rx_opt.tstamp_ok)
            tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
        if (req) {
            /* Re-arm the timer because data may have been sent out.
             * This is similar to the regular data transmission case
             * when new data has just been ack'ed.
             *
             * (TFO) - we could try to be more aggressive and
             * retransmitting any data sooner based on when they
             * are sent out.
             */
            tcp_rearm_rto(sk);
        } else
            tcp_init_metrics(sk);
 
        if (!inet_csk(sk)->icsk_ca_ops->cong_control)
            tcp_update_pacing_rate(sk);
 
        /* Prevent spurious tcp_cwnd_restart() on first data packet */
 
        //record the time of the most recent data transmission
        tp->lsndtime = tcp_time_stamp;
 
        tcp_initialize_rcv_mss(sk);
 
        //compute the TCP header prediction flags
        tcp_fast_path_on(tp);
        break;

As the code shows, the socket's window, MSS and related fields are set up, and the sock state is finally set to TCP_ESTABLISHED. The three-way handshake is now complete; the connection then waits for the application to call accept() and retrieve the socket.

