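Linux Kernel: Network Stack Implementation Analysis (2): The Path of a Data Packet (Repost)

Reposted 2014-03-16 22:46. Category: Linux study

Original article: http://blog.csdn.net/yming0221/article/details/7492423

Author: 閆明

This analysis is based on Linux Kernel 1.2.13.

Note: "Part 1" and "Part 2" in the titles refer to the direction in which the packet is followed. Part 1 traces it from the bottom layer upward, and Part 2 traces it from the top layer downward.

Part 1:

The previous post gave a high-level view of how the network stack is initialized in the Linux kernel. This post takes the same high-level view of how a packet is passed between the network layers.

The OSI model and the TCP/IP model are layered as follows:

[Figure: layers of the OSI model and the TCP/IP model]

The previous post showed the layering of the network stack:

[Figure: layers of the Linux network stack]

We start at the lowest layer and follow a packet on its way up.

1. Network interface layer

* The hardware listens on the physical medium and receives data. Once the received data has filled the buffer, the hardware raises an interrupt, and the system jumps to the interrupt service routine.

* In the interrupt service routine, the data is copied from the hardware buffer into a kernel-space buffer and wrapped in a data structure (sk_buff). The driver then calls netif_rx(), the interface function offered to the driver layer, to hand the packet to the link layer. This function is implemented in net/inet/dev.c. (dev.c plays a central role in the network stack: it connects the driver layer below it with the network layer above it, and can be regarded as the implementation of the link-layer module.)

netif_rx() is implemented as follows: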


/*
 * Receive a packet from a device driver and queue it for the upper
 * (protocol) levels. It always succeeds. This is the recommended
 * interface to use.
 * It hands data received from the device driver layer to the upper
 * protocol layers; the function is really just an interface.
 */
void netif_rx(struct sk_buff *skb)
{
    static int dropping = 0;
    /*
     * Any received buffers are un-owned and should be discarded
     * when freed. These will be updated later as the frames get
     * owners.
     */
    skb->sk = NULL;
    skb->free = 1;
    if(skb->stamp.tv_sec==0)
        skb->stamp = xtime;
    /*
     * Check that we aren't overdoing things.
     */
    if (!backlog_size)
        dropping = 0;
    else if (backlog_size > 300)
        dropping = 1;
    if (dropping)
    {
        kfree_skb(skb, FREE_READ);
        return;
    }
    /*
     * Add it to the "backlog" queue.
     */
#ifdef CONFIG_SKB_CHECK
    IS_SKB(skb);
#endif
    skb_queue_tail(&backlog,skb); // append to the backlog queue
    backlog_size++;
    /*
     * If any packet arrived, mark it for processing after the
     * hardware interrupt returns.
     */
    mark_bh(NET_BH); // the bottom half technique shortens the time spent in the interrupt handler
    return;
}

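netif_rx() uses the bottom half technique. The idea is to split interrupt handling into two parts: the top half does the work with strict real-time requirements, and the bottom half does the work that can be finished a little later. This shortens the time spent in the interrupt handler itself and improves overall system performance. The technique will be analyzed in detail in a later post.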
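
The split can be illustrated with a small user-space sketch. It is not the kernel's implementation: every name in it (`toy_bh_base`, `toy_mark`-style pending bits, `TOY_NET_BH`, the counters) is a hypothetical stand-in. The "interrupt handler" only queues work and sets a pending bit, and the registered routine does the real processing when the pending bits are examined later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the top-half / bottom-half split. */
#define TOY_NET_BH 0   /* index of the "network" bottom half */
#define TOY_MAX_BH 8

typedef void (*toy_bh_routine)(void);

static toy_bh_routine toy_bh_base[TOY_MAX_BH]; /* registered routines */
static unsigned long toy_bh_active;            /* one pending bit per routine */

static int toy_backlog;    /* packets queued by the top half */
static int toy_processed;  /* packets handled by the bottom half */

/* Install a routine for one bottom-half slot. */
static void toy_register_bh(int nr, toy_bh_routine fn)
{
    assert(nr >= 0 && nr < TOY_MAX_BH);
    toy_bh_base[nr] = fn;
}

/* Top half: do the minimum (queue the packet) and mark work as pending. */
static void toy_interrupt_handler(void)
{
    toy_backlog++;
    toy_bh_active |= 1UL << TOY_NET_BH;
}

/* Bottom half: the slower processing, run outside the "interrupt". */
static void toy_net_bh(void)
{
    while (toy_backlog > 0) {
        toy_backlog--;
        toy_processed++;
    }
}

/* Run every routine whose pending bit is set, clearing the bit first. */
static void toy_run_bottom_halves(void)
{
    for (int i = 0; i < TOY_MAX_BH; i++) {
        unsigned long mask = 1UL << i;
        if ((toy_bh_active & mask) && toy_bh_base[i] != NULL) {
            toy_bh_active &= ~mask;
            toy_bh_base[i]();
        }
    }
}
```

After `toy_register_bh(TOY_NET_BH, toy_net_bh)`, several calls to `toy_interrupt_handler()` only grow the backlog; nothing is processed until `toy_run_bottom_halves()` runs, which is the point of the technique.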

From the previous post we know that, when the network stack is initialized, net_bh is installed as the handler for the NET bottom half (around line 1375 of socket.c):


bh_base[NET_BH].routine= net_bh; // install net_bh as the handler for the NET bottom half

* The function net_bh is implemented in net/inet/dev.c:


/*
 * When we are called the queue is ready to grab, the interrupts are
 * on and hardware can interrupt and queue to the receive queue a we
 * run with no problems.
 * This is run as a bottom half after an interrupt handler that does
 * mark_bh(NET_BH);
 */
void net_bh(void *tmp)
{
    struct sk_buff *skb;
    struct packet_type *ptype;
    struct packet_type *pt_prev;
    unsigned short type;
    /*
     * Atomically check and mark our BUSY state.
     */
    if (set_bit(1, (void*)&in_bh)) // mark the BUSY state
        return;
    /*
     * Can we send anything now? We want to clear the
     * decks for any more sends that get done as we
     * process the input.
     */
    dev_transmit(); // send any packets waiting for transmission
    /*
     * Any data left to process. This may occur because a
     * mark_bh() is done after we empty the queue including
     * that from the device which does a mark_bh() just after
     */
    cli(); // interrupts are disabled around queue operations to keep the queue consistent
    /*
     * While the queue is not empty
     */
    while((skb=skb_dequeue(&backlog))!=NULL) // dequeue until the queue is empty
    {
        /*
         * We have a packet. Therefore the queue has shrunk
         */
        backlog_size--; // one fewer packet in the queue
        sti();
        /*
         * Bump the pointer to the next structure.
         * This assumes that the basic 'skb' pointer points to
         * the MAC header, if any (as indicated by its "length"
         * field). Take care now!
         */
        skb->h.raw = skb->data + skb->dev->hard_header_len;
        skb->len -= skb->dev->hard_header_len;
        /*
         * Fetch the packet protocol ID. This is also quite ugly, as
         * it depends on the protocol driver (the interface itself) to
         * know what the type is, or where to get it from. The Ethernet
         * interfaces fetch the ID from the two bytes in the Ethernet MAC
         * header (the h_proto field in struct ethhdr), but other drivers
         * may either use the ethernet ID's or extra ones that do not
         * clash (eg ETH_P_AX25). We could set this before we queue the
         * frame. In fact I may change this when I have time.
         */
        type = skb->dev->type_trans(skb, skb->dev); // get the protocol type of this packet
        /*
         * We got a packet ID. Now loop over the "known protocols"
         * table (which is actually a linked list, but this will
         * change soon if I get my way- FvK), and forward the packet
         * to anyone who wants it.
         *
         * [FvK didn't get his way but he is right this ought to be
         * hashed so we typically get a single hit. The speed cost
         * here is minimal but no doubt adds up at the 4,000+ pkts/second
         * rate we can hit flat out]
         */
        pt_prev = NULL;
        for (ptype = ptype_base; ptype != NULL; ptype = ptype->next) // walk the protocol list that ptype_base points to
        {
            // does the protocol ID match?
            if ((ptype->type == type || ptype->type == htons(ETH_P_ALL)) && (!ptype->dev || ptype->dev==skb->dev))
            {
                /*
                 * We already have a match queued. Deliver
                 * to it and then remember the new match
                 */
                if(pt_prev)
                {
                    struct sk_buff *skb2;
                    skb2=skb_clone(skb, GFP_ATOMIC); // clone the packet structure
                    /*
                     * Kick the protocol handler. This should be fast
                     * and efficient code.
                     */
                    if(skb2)
                        pt_prev->func(skb2, skb->dev, pt_prev); // call the handler of the matching protocol;
                                                                // which one depends on the protocol type,
                                                                // e.g. for IP the handler is ip_rcv
                }
                /* Remember the current last to do */
                pt_prev=ptype;
            }
        } /* End of protocol list loop */
        /*
         * Is there a last item to send to ?
         */
        if(pt_prev)
            pt_prev->func(skb, skb->dev, pt_prev);
        /*
         * Has an unknown packet has been received ?
         */
        else
            kfree_skb(skb, FREE_WRITE);
        /*
         * Again, see if we can transmit anything now.
         * [Ought to take this out judging by tests it slows
         * us down not speeds us up]
         */
        dev_transmit();
        cli();
    } /* End of queue loop */
    /*
     * We have emptied the queue
     */
    in_bh = 0; // clear the BUSY state
    sti();
    /*
     * One last output flush.
     */
    dev_transmit(); // flush any pending output
}
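
The protocol loop in net_bh delays each delivery by one match, so that only the last matching handler receives the original buffer and every earlier match receives a clone. The following user-space sketch isolates that pattern. All names (`toy_pkt`, `toy_ptype`, `TOY_P_ALL`, `toy_deliver`) are hypothetical, and the simplified ownership rule (each handler frees what it is given) is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for sk_buff and packet_type. */
struct toy_pkt {
    unsigned short type;
    int is_clone;
};

struct toy_ptype {
    unsigned short type;               /* protocol ID, or TOY_P_ALL */
    void (*func)(struct toy_pkt *pkt); /* handler; takes ownership of pkt */
    struct toy_ptype *next;
};

#define TOY_P_ALL 0x0003   /* wildcard: receive every packet */

static int toy_originals_seen;
static int toy_clones_seen;

/* Example handler: counts what it got and frees the buffer it owns. */
static void toy_handler(struct toy_pkt *pkt)
{
    if (pkt->is_clone)
        toy_clones_seen++;
    else
        toy_originals_seen++;
    free(pkt);
}

static struct toy_pkt *toy_clone(const struct toy_pkt *pkt)
{
    struct toy_pkt *copy = malloc(sizeof(*copy));
    if (copy != NULL) {
        memcpy(copy, pkt, sizeof(*copy));
        copy->is_clone = 1;
    }
    return copy;
}

/* Deliver pkt to every matching entry and return the number of matches.
 * Each match is remembered and only served when the next match is found,
 * so the final match can take the original without a copy. */
static int toy_deliver(struct toy_ptype *list, struct toy_pkt *pkt)
{
    struct toy_ptype *prev = NULL;
    int matches = 0;

    assert(pkt != NULL);
    for (struct toy_ptype *pt = list; pt != NULL; pt = pt->next) {
        if (pt->type != pkt->type && pt->type != TOY_P_ALL)
            continue;
        matches++;
        if (prev != NULL) {
            struct toy_pkt *copy = toy_clone(pkt);
            if (copy != NULL)
                prev->func(copy);
        }
        prev = pt;
    }
    if (prev != NULL)
        prev->func(pkt);   /* the last match gets the original */
    else
        free(pkt);         /* nobody wanted it */
    return matches;
}
```

With n matching handlers this makes n-1 copies instead of n, which matters on the common path where exactly one protocol claims the packet and no copy is needed at all.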

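2. Network layer

* Take an IP packet as the example. When it is passed up from the link layer to the network layer, ip_rcv is called. After finishing the processing for its own layer, ip_rcv calls the handler of the transport protocol named in the IP header.

UDP maps to udp_rcv, TCP to tcp_rcv, ICMP to icmp_rcv and IGMP to igmp_rcv. (ICMP and IGMP are usually called network-layer protocols, but both are encapsulated in IP, so here they are treated like transport-layer protocols.)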
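
The lookup from protocol number to handler can be sketched as a small hash table of linked lists, indexed by the protocol number masked to the table size. This is a simplified model under assumptions, not the kernel code: the `toy_` names and the table size of 32 are hypothetical, and a string stands in for the handler function.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INET_PROTOS 32   /* must be a power of two for the mask */

/* Protocol numbers as carried in the IP header's protocol field. */
enum { TOY_ICMP = 1, TOY_IGMP = 2, TOY_TCP = 6, TOY_UDP = 17 };

struct toy_inet_protocol {
    unsigned char protocol;          /* IP protocol number */
    const char *handler_name;        /* stands in for the handler pointer */
    struct toy_inet_protocol *next;  /* next entry in the same bucket */
};

static struct toy_inet_protocol *toy_inet_protos[TOY_MAX_INET_PROTOS];

/* Register a protocol at the head of its hash bucket. */
static void toy_add_protocol(struct toy_inet_protocol *prot)
{
    unsigned char hash;

    assert(prot != NULL);
    hash = prot->protocol & (TOY_MAX_INET_PROTOS - 1);
    prot->next = toy_inet_protos[hash];
    toy_inet_protos[hash] = prot;
}

/* Find the entry for a protocol number, or NULL if none is registered.
 * The bucket may hold other protocols that hash to the same slot, so the
 * number is compared again while walking the list. */
static struct toy_inet_protocol *toy_find_protocol(unsigned char protocol)
{
    unsigned char hash = protocol & (TOY_MAX_INET_PROTOS - 1);

    for (struct toy_inet_protocol *p = toy_inet_protos[hash]; p != NULL; p = p->next)
        if (p->protocol == protocol)
            return p;
    return NULL;
}
```

A lookup that returns NULL corresponds to the case where no protocol claims the datagram, for which the receiver would report "protocol unreachable".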

ip_rcv is fairly complex and will be analyzed in detail later. It is pasted here to give a clearer overall picture.


/*
 * This function receives all incoming IP datagrams.
 */
int ip_rcv(struct sk_buff *skb, struct device *dev, struct packet_type *pt)
{
    struct iphdr *iph = skb->h.iph;
    struct sock *raw_sk=NULL;
    unsigned char hash;
    unsigned char flag = 0;
    unsigned char opts_p = 0; /* Set iff the packet has options. */
    struct inet_protocol *ipprot;
    static struct options opt; /* since we don't use these yet, and they
                                  take up stack space. */
    int brd=IS_MYADDR;
    int is_frag=0;
#ifdef CONFIG_IP_FIREWALL
    int err;
#endif
    ip_statistics.IpInReceives++;
    /*
     * Tag the ip header of this packet so we can find it
     */
    skb->ip_hdr = iph;
    /*
     * Is the datagram acceptable?
     *
     * 1. Length at least the size of an ip header
     * 2. Version of 4
     * 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums]
     * (4. We ought to check for IP multicast addresses and undefined types.. does this matter ?)
     */
    if (skb->len<sizeof(struct iphdr) || iph->ihl<5 || iph->version != 4 ||
        skb->len<ntohs(iph->tot_len) || ip_fast_csum((unsigned char *)iph, iph->ihl) !=0)
    {
        ip_statistics.IpInHdrErrors++;
        kfree_skb(skb, FREE_WRITE);
        return(0);
    }
    /*
     * See if the firewall wants to dispose of the packet.
     */
#ifdef CONFIG_IP_FIREWALL
    if ((err=ip_fw_chk(iph,dev,ip_fw_blk_chain,ip_fw_blk_policy, 0))!=1)
    {
        if(err==-1)
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0, dev);
        kfree_skb(skb, FREE_WRITE);
        return 0;
    }
#endif
    /*
     * Our transport medium may have padded the buffer out. Now we know it
     * is IP we can trim to the true length of the frame.
     */
    skb->len=ntohs(iph->tot_len);
    /*
     * Next analyse the packet for options. Studies show under one packet in
     * a thousand have options....
     */
    if (iph->ihl != 5)
    { /* Fast path for the typical optionless IP packet. */
        memset((char *) &opt, 0, sizeof(opt));
        if (do_options(iph, &opt) != 0)
            return 0;
        opts_p = 1;
    }
    /*
     * Remember if the frame is fragmented.
     */
    if(iph->frag_off)
    {
        if (iph->frag_off & 0x0020)
            is_frag|=1;
        /*
         * Last fragment ?
         */
        if (ntohs(iph->frag_off) & 0x1fff)
            is_frag|=2;
    }
    /*
     * Do any IP forwarding required. chk_addr() is expensive -- avoid it someday.
     *
     * This is inefficient. While finding out if it is for us we could also compute
     * the routing table entry. This is where the great unified cache theory comes
     * in as and when someone implements it
     *
     * For most hosts over 99% of packets match the first conditional
     * and don't go via ip_chk_addr. Note: brd is set to IS_MYADDR at
     * function entry.
     */
    if ( iph->daddr != skb->dev->pa_addr && (brd = ip_chk_addr(iph->daddr)) == 0)
    {
        /*
         * Don't forward multicast or broadcast frames.
         */
        if(skb->pkt_type!=PACKET_HOST || brd==IS_BROADCAST)
        {
            kfree_skb(skb,FREE_WRITE);
            return 0;
        }
        /*
         * The packet is for another target. Forward the frame
         */
#ifdef CONFIG_IP_FORWARD
        ip_forward(skb, dev, is_frag);
#else
        /* printk("Machine %lx tried to use us as a forwarder to %lx but we have forwarding disabled!\n",
           iph->saddr,iph->daddr);*/
        ip_statistics.IpInAddrErrors++;
#endif
        /*
         * The forwarder is inefficient and copies the packet. We
         * free the original now.
         */
        kfree_skb(skb, FREE_WRITE);
        return(0);
    }
#ifdef CONFIG_IP_MULTICAST
    if(brd==IS_MULTICAST && iph->daddr!=IGMP_ALL_HOSTS && !(dev->flags&IFF_LOOPBACK))
    {
        /*
         * Check it is for one of our groups
         */
        struct ip_mc_list *ip_mc=dev->ip_mc_list;
        do
        {
            if(ip_mc==NULL)
            {
                kfree_skb(skb, FREE_WRITE);
                return 0;
            }
            if(ip_mc->multiaddr==iph->daddr)
                break;
            ip_mc=ip_mc->next;
        }
        while(1);
    }
#endif
    /*
     * Account for the packet
     */
#ifdef CONFIG_IP_ACCT
    ip_acct_cnt(iph,dev, ip_acct_chain);
#endif
    /*
     * Reassemble IP fragments.
     */
    if(is_frag)
    {
        /* Defragment. Obtain the complete packet if there is one */
        skb=ip_defrag(iph,skb,dev);
        if(skb==NULL)
            return 0;
        skb->dev = dev;
        iph=skb->h.iph;
    }
    /*
     * Point into the IP datagram, just past the header.
     */
    skb->ip_hdr = iph;
    skb->h.raw += iph->ihl*4;
    /*
     * Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies.
     */
    hash = iph->protocol & (SOCK_ARRAY_SIZE-1);
    /* If there maybe a raw socket we must check - if not we don't care less */
    if((raw_sk=raw_prot.sock_array[hash])!=NULL)
    {
        struct sock *sknext=NULL;
        struct sk_buff *skb1;
        raw_sk=get_sock_raw(raw_sk, hash, iph->saddr, iph->daddr);
        if(raw_sk) /* Any raw sockets */
        {
            do
            {
                /* Find the next */
                sknext=get_sock_raw(raw_sk->next, hash, iph->saddr, iph->daddr);
                if(sknext)
                    skb1=skb_clone(skb, GFP_ATOMIC);
                else
                    break; /* One pending raw socket left */
                if(skb1)
                    raw_rcv(raw_sk, skb1, dev, iph->saddr,iph->daddr);
                raw_sk=sknext;
            }
            while(raw_sk!=NULL);
            /* Here either raw_sk is the last raw socket, or NULL if none */
            /* We deliver to the last raw socket AFTER the protocol checks as it avoids a surplus copy */
        }
    }
    /*
     * skb->h.raw now points at the protocol beyond the IP header.
     */
    hash = iph->protocol & (MAX_INET_PROTOS -1);
    for (ipprot = (struct inet_protocol *)inet_protos[hash];ipprot != NULL;ipprot=(struct inet_protocol *)ipprot->next)
    {
        struct sk_buff *skb2;
        if (ipprot->protocol != iph->protocol)
            continue;
        /*
         * See if we need to make a copy of it. This will
         * only be set if more than one protocol wants it.
         * and then not for the last one. If there is a pending
         * raw delivery wait for that
         */
        if (ipprot->copy || raw_sk)
        {
            skb2 = skb_clone(skb, GFP_ATOMIC);
            if(skb2==NULL)
                continue;
        }
        else
        {
            skb2 = skb;
        }
        flag = 1;
        /*
         * Pass on the datagram to each protocol that wants it,
         * based on the datagram protocol. We should really
         * check the protocol handler's return values here...
         */
        ipprot->handler(skb2, dev, opts_p ? &opt : 0, iph->daddr,
                        (ntohs(iph->tot_len) - (iph->ihl * 4)),
                        iph->saddr, 0, ipprot);
    }
    /*
     * All protocols checked.
     * If this packet was a broadcast, we may *not* reply to it, since that
     * causes (proven, grin) ARP storms and a leakage of memory (i.e. all
     * ICMP reply messages get queued up for transmission...)
     */
    if(raw_sk!=NULL) /* Shift to last raw user */
        raw_rcv(raw_sk, skb, dev, iph->saddr, iph->daddr);
    else if (!flag) /* Free and report errors */
    {
        if (brd != IS_BROADCAST && brd!=IS_MULTICAST)
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0, dev);
        kfree_skb(skb, FREE_WRITE);
    }
    return(0);
}

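3. Transport layer

If the IP header indicates that the data is carried by TCP, ip_rcv calls tcp_rcv. The overall flow of tcp_rcv is:

"The sock structures of all sockets that use TCP are hung on the sock_array array of the proto structure represented by the global variable tcp_prot, inserted with the local port number as the index. So when tcp_rcv receives a packet, after the necessary checks and processing it uses the destination port in the TCP header (for a received packet, the destination port is the port used locally) as the index to get the right queue of sock structures from the sock_array of tcp_prot. It then walks that queue, using further criteria to find the matching sock structure. Once the matching sock structure is found, the packet is appended to that sock's buffer queue (the receive_queue field of the sock structure), which completes the reception of the packet."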
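
The port-indexed lookup described in this passage can be sketched as follows. This is a simplified model under stated assumptions: the `toy_` names and the array size of 256 are hypothetical, ports are kept in host byte order, and the "further criteria" are reduced to one rule (a fully specified connection wins over a wildcard listener), which is simpler than what the kernel's get_sock() does.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SOCK_ARRAY_SIZE 256   /* hypothetical; a power of two */

struct toy_sock {
    unsigned short local_port;   /* port this socket is bound to */
    unsigned short remote_port;  /* 0 means "any" (listening socket) */
    unsigned long remote_addr;   /* 0 means "any" */
    struct toy_sock *next;       /* next socket in the same bucket */
};

static struct toy_sock *toy_sock_array[TOY_SOCK_ARRAY_SIZE];

/* Insert a socket into the bucket selected by its local port. */
static void toy_put_sock(struct toy_sock *sk)
{
    unsigned idx;

    assert(sk != NULL);
    idx = sk->local_port & (TOY_SOCK_ARRAY_SIZE - 1);
    sk->next = toy_sock_array[idx];
    toy_sock_array[idx] = sk;
}

/* Look up the socket for a segment: index by destination (local) port,
 * then walk the bucket. An exact match on the remote end is returned at
 * once; a wildcard listener is kept as the fallback. */
static struct toy_sock *toy_get_sock(unsigned short dest_port,
                                     unsigned long src_addr,
                                     unsigned short src_port)
{
    struct toy_sock *wildcard = NULL;
    unsigned idx = dest_port & (TOY_SOCK_ARRAY_SIZE - 1);

    for (struct toy_sock *sk = toy_sock_array[idx]; sk != NULL; sk = sk->next) {
        if (sk->local_port != dest_port)
            continue;                 /* different port, same bucket */
        if (sk->remote_addr == src_addr && sk->remote_port == src_port)
            return sk;                /* exact match */
        if (sk->remote_addr == 0 && sk->remote_port == 0)
            wildcard = sk;            /* listener: keep as fallback */
    }
    return wildcard;
}
```

Because the index is only the low bits of the port, two different ports can share a bucket, which is why the full port number is compared again during the walk.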

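The implementation of tcp_rcv is also fairly complex, which follows from the complexity of the TCP protocol itself. The code is attached below: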


/*
 * A TCP packet has arrived.
 */
int tcp_rcv(struct sk_buff *skb, struct device *dev, struct options *opt,
    unsigned long daddr, unsigned short len,
    unsigned long saddr, int redo, struct inet_protocol * protocol)
{
    struct tcphdr *th;
    struct sock *sk;
    int syn_ok=0;
    if (!skb)
    {
        printk("IMPOSSIBLE 1\n");
        return(0);
    }
    if (!dev)
    {
        printk("IMPOSSIBLE 2\n");
        return(0);
    }
    tcp_statistics.TcpInSegs++;
    if(skb->pkt_type!=PACKET_HOST)
    {
        kfree_skb(skb,FREE_READ);
        return(0);
    }
    th = skb->h.th;
    /*
     * Find the socket.
     */
    sk = get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);
    /*
     * If this socket has got a reset it's to all intents and purposes
     * really dead. Count closed sockets as dead.
     *
     * Note: BSD appears to have a bug here. A 'closed' TCP in BSD
     * simply drops data. This seems incorrect as a 'closed' TCP doesn't
     * exist so should cause resets as if the port was unreachable.
     */
    if (sk!=NULL && (sk->zapped || sk->state==TCP_CLOSE))
        sk=NULL;
    if (!redo)
    {
        if (tcp_check(th, len, saddr, daddr ))
        {
            skb->sk = NULL;
            kfree_skb(skb,FREE_READ);
            /*
             * We don't release the socket because it was
             * never marked in use.
             */
            return(0);
        }
        th->seq = ntohl(th->seq);
        /* See if we know about the socket. */
        if (sk == NULL)
        {
            /*
             * No such TCB. If th->rst is 0 send a reset (checked in tcp_reset)
             */
            tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);
            skb->sk = NULL;
            /*
             * Discard frame
             */
            kfree_skb(skb, FREE_READ);
            return(0);
        }
        skb->len = len;
        skb->acked = 0;
        skb->used = 0;
        skb->free = 0;
        skb->saddr = daddr;
        skb->daddr = saddr;
        /* We may need to add it to the backlog here. */
        cli();
        if (sk->inuse)
        {
            skb_queue_tail(&sk->back_log, skb);
            sti();
            return(0);
        }
        sk->inuse = 1;
        sti();
    }
    else
    {
        if (sk==NULL)
        {
            tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);
            skb->sk = NULL;
            kfree_skb(skb, FREE_READ);
            return(0);
        }
    }
    if (!sk->prot)
    {
        printk("IMPOSSIBLE 3\n");
        return(0);
    }
    /*
     * Charge the memory to the socket.
     */
    if (sk->rmem_alloc + skb->mem_len >= sk->rcvbuf)
    {
        kfree_skb(skb, FREE_READ);
        release_sock(sk);
        return(0);
    }
    skb->sk=sk;
    sk->rmem_alloc += skb->mem_len;
    /*
     * This basically follows the flow suggested by RFC793, with the corrections in RFC1122. We
     * don't implement precedence and we process URG incorrectly (deliberately so) for BSD bug
     * compatibility. We also set up variables more thoroughly [Karn notes in the
     * KA9Q code the RFC793 incoming segment rules don't initialise the variables for all paths].
     */
    if(sk->state!=TCP_ESTABLISHED) /* Skip this lot for normal flow */
    {
        /*
         * Now deal with unusual cases.
         */
        if(sk->state==TCP_LISTEN)
        {
            if(th->ack) /* These use the socket TOS.. might want to be the received TOS */
                tcp_reset(daddr,saddr,th,sk->prot,opt,dev,sk->ip_tos, sk->ip_ttl);
            /*
             * We don't care for RST, and non SYN are absorbed (old segments)
             * Broadcast/multicast SYN isn't allowed. Note - bug if you change the
             * netmask on a running connection it can go broadcast. Even Sun's have
             * this problem so I'm ignoring it
             */
            if(th->rst || !th->syn || th->ack || ip_chk_addr(daddr)!=IS_MYADDR)
            {
                kfree_skb(skb, FREE_READ);
                release_sock(sk);
                return 0;
            }
            /*
             * Guess we need to make a new socket up
             */
            tcp_conn_request(sk, skb, daddr, saddr, opt, dev, tcp_init_seq());
            /*
             * Now we have several options: In theory there is nothing else
             * in the frame. KA9Q has an option to send data with the syn,
             * BSD accepts data with the syn up to the [to be] advertised window
             * and Solaris 2.1 gives you a protocol error. For now we just ignore
             * it, that fits the spec precisely and avoids incompatibilities. It
             * would be nice in future to drop through and process the data.
             */
            release_sock(sk);
            return 0;
        }
        /* retransmitted SYN? */
        if (sk->state == TCP_SYN_RECV && th->syn && th->seq+1 == sk->acked_seq)
        {
            kfree_skb(skb, FREE_READ);
            release_sock(sk);
            return 0;
        }
        /*
         * SYN sent means we have to look for a suitable ack and either reset
         * for bad matches or go to connected
         */
        if(sk->state==TCP_SYN_SENT)
        {
            /* Crossed SYN or previous junk segment */
            if(th->ack)
            {
                /* We got an ack, but it's not a good ack */
                if(!tcp_ack(sk,th,saddr,len))
                {
                    /* Reset the ack - its an ack from a
                       different connection [ th->rst is checked in tcp_reset()] */
                    tcp_statistics.TcpAttemptFails++;
                    tcp_reset(daddr, saddr, th,
                        sk->prot, opt,dev,sk->ip_tos,sk->ip_ttl);
                    kfree_skb(skb, FREE_READ);
                    release_sock(sk);
                    return(0);
                }
                if(th->rst)
                    return tcp_std_reset(sk,skb);
                if(!th->syn)
                {
                    /* A valid ack from a different connection
                       start. Shouldn't happen but cover it */
                    kfree_skb(skb, FREE_READ);
                    release_sock(sk);
                    return 0;
                }
                /*
                 * Ok.. it's good. Set up sequence numbers and
                 * move to established.
                 */
                syn_ok=1; /* Don't reset this connection for the syn */
                sk->acked_seq=th->seq+1;
                sk->fin_seq=th->seq;
                tcp_send_ack(sk->sent_seq,sk->acked_seq,sk,th,sk->daddr);
                tcp_set_state(sk, TCP_ESTABLISHED);
                tcp_options(sk,th);
                sk->dummy_th.dest=th->source;
                sk->copied_seq = sk->acked_seq;
                if(!sk->dead)
                {
                    sk->state_change(sk);
                    sock_wake_async(sk->socket, 0);
                }
                if(sk->max_window==0)
                {
                    sk->max_window = 32;
                    sk->mss = min(sk->max_window, sk->mtu);
                }
            }
            else
            {
                /* See if SYN's cross. Drop if boring */
                if(th->syn && !th->rst)
                {
                    /* Crossed SYN's are fine - but talking to
                       yourself is right out... */
                    if(sk->saddr==saddr && sk->daddr==daddr &&
                        sk->dummy_th.source==th->source &&
                        sk->dummy_th.dest==th->dest)
                    {
                        tcp_statistics.TcpAttemptFails++;
                        return tcp_std_reset(sk,skb);
                    }
                    tcp_set_state(sk,TCP_SYN_RECV);
                    /*
                     * FIXME:
                     * Must send SYN|ACK here
                     */
                }
                /* Discard junk segment */
                kfree_skb(skb, FREE_READ);
                release_sock(sk);
                return 0;
            }
            /*
             * SYN_RECV with data maybe.. drop through
             */
            goto rfc_step6;
        }
        /*
         * BSD has a funny hack with TIME_WAIT and fast reuse of a port. There is
         * a more complex suggestion for fixing these reuse issues in RFC1644
         * but not yet ready for general use. Also see RFC1379.
         */
#define BSD_TIME_WAIT
#ifdef BSD_TIME_WAIT
        if (sk->state == TCP_TIME_WAIT && th->syn && sk->dead &&
            after(th->seq, sk->acked_seq) && !th->rst)
        {
            long seq=sk->write_seq;
            if(sk->debug)
                printk("Doing a BSD time wait\n");
            tcp_statistics.TcpEstabResets++;
            sk->rmem_alloc -= skb->mem_len;
            skb->sk = NULL;
            sk->err=ECONNRESET;
            tcp_set_state(sk, TCP_CLOSE);
            sk->shutdown = SHUTDOWN_MASK;
            release_sock(sk);
            sk=get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);
            if (sk && sk->state==TCP_LISTEN)
            {
                sk->inuse=1;
                skb->sk = sk;
                sk->rmem_alloc += skb->mem_len;
                tcp_conn_request(sk, skb, daddr, saddr,opt, dev,seq+128000);
                release_sock(sk);
                return 0;
            }
            kfree_skb(skb, FREE_READ);
            return 0;
        }
#endif
    }
    /*
     * We are now in normal data flow (see the step list in the RFC)
     * Note most of these are inline now. I'll inline the lot when
     * I have time to test it hard and look at what gcc outputs
     */
    if(!tcp_sequence(sk,th,len,opt,saddr,dev))
    {
        kfree_skb(skb, FREE_READ);
        release_sock(sk);
        return 0;
    }
    if(th->rst)
        return tcp_std_reset(sk,skb);
    /*
     * !syn_ok is effectively the state test in RFC793.
     */
    if(th->syn && !syn_ok)
    {
        tcp_reset(daddr,saddr,th, &tcp_prot, opt, dev, skb->ip_hdr->tos, 255);
        return tcp_std_reset(sk,skb);
    }
    /*
     * Process the ACK
     */
    if(th->ack && !tcp_ack(sk,th,saddr,len))
    {
        /*
         * Our three way handshake failed.
         */
        if(sk->state==TCP_SYN_RECV)
        {
            tcp_reset(daddr, saddr, th,sk->prot, opt, dev,sk->ip_tos,sk->ip_ttl);
        }
        kfree_skb(skb, FREE_READ);
        release_sock(sk);
        return 0;
    }
rfc_step6: /* I'll clean this up later */
    /*
     * Process urgent data
     */
    if(tcp_urg(sk, th, saddr, len))
    {
        kfree_skb(skb, FREE_READ);
        release_sock(sk);
        return 0;
    }
    /*
     * Process the encapsulated data
     */
    if(tcp_data(skb,sk, saddr, len))
    {
        kfree_skb(skb, FREE_READ);
        release_sock(sk);
        return 0;
    }
    /*
     * And done
     */
    release_sock(sk);
    return 0;
}

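4. Application layer

When a user process wants to receive data, the kernel first uses the inode behind the file descriptor to find the socket structure and, from it, the sock structure. It then reads packets from the receive_queue queue of the sock structure and copies the data into the user-space buffer. The data has now travelled all the way from the hardware to user space, which completes one full bottom-up transfer.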
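
The last step, taking a packet off the socket's queue and copying it out, can be sketched in user space like this. The types and functions are hypothetical stand-ins, the queue is a plain singly linked list, and the datagram-style behaviour (one packet per read, excess bytes dropped) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for sk_buff and the per-socket receive queue. */
struct toy_skb {
    const unsigned char *data;
    size_t len;
    struct toy_skb *next;
};

struct toy_sock {
    struct toy_skb *rq_head;   /* oldest queued packet */
    struct toy_skb *rq_tail;   /* newest queued packet */
};

/* Protocol side: append a received packet to the socket's queue. */
static void toy_queue_rcv(struct toy_sock *sk, struct toy_skb *skb)
{
    assert(sk != NULL && skb != NULL);
    skb->next = NULL;
    if (sk->rq_tail != NULL)
        sk->rq_tail->next = skb;
    else
        sk->rq_head = skb;
    sk->rq_tail = skb;
}

/* Reader side: take the oldest packet and copy up to size bytes of it into
 * the caller's buffer. Returns the number of bytes copied, or -1 if nothing
 * is queued. */
static int toy_read(struct toy_sock *sk, unsigned char *ubuf, size_t size)
{
    struct toy_skb *skb = sk->rq_head;
    size_t n;

    if (skb == NULL)
        return -1;
    sk->rq_head = skb->next;
    if (sk->rq_head == NULL)
        sk->rq_tail = NULL;
    n = skb->len < size ? skb->len : size;
    memcpy(ubuf, skb->data, n);   /* a kernel would use a user-space copy routine here */
    return (int)n;
}
```

In a real kernel the reader would also sleep when the queue is empty (unless the socket is non-blocking) and the copy would cross the kernel/user boundary; both are left out of the sketch.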

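Part 2:

Part 1 of this post, "Linux Kernel: Network Stack Implementation Analysis (2): The Path of a Data Packet (Part 1)", followed a packet from the network card through the driver and link layer, the network layer and the transport layer up to the application layer.

This part looks at how data generated on the local machine travels down through the transport layer and the network layer to the physical layer.

In summary, the data flow is as follows:

[Figure: data flow of an outgoing packet]

I. Application layer

The application layer can reach kernel functions through system calls or file operations. The BSD-layer function sock_write() calls the INET-layer function inet_write().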


/*
 * Write data to a socket. We verify that the user area ubuf..ubuf+size-1 is
 * readable by the user process.
 */
static int sock_write(struct inode *inode, struct file *file, char *ubuf, int size)
{
    struct socket *sock;
    int err;
    if (!(sock = socki_lookup(inode)))
    {
        printk("NET: sock_write: can't find socket for inode!\n");
        return(-EBADF);
    }
    if (sock->flags & SO_ACCEPTCON)
        return(-EINVAL);
    if(size<0)
        return -EINVAL;
    if(size==0)
        return 0;
    if ((err=verify_area(VERIFY_READ,ubuf,size))<0)
        return err;
    return(sock->ops->write(sock, ubuf, size,(file->f_flags & O_NONBLOCK)));
}

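The INET-layer function inet_write() does its work by calling inet_send() in the same layer, and inet_send() calls the write function of the specific transport protocol through sk->prot->write. For UDP, that function is udp_write().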


static int inet_send(struct socket *sock, void *ubuf, int size, int noblock,
    unsigned flags)
{
    struct sock *sk = (struct sock *) sock->data;
    if (sk->shutdown & SEND_SHUTDOWN)
    {
        send_sig(SIGPIPE, current, 1);
        return(-EPIPE);
    }
    if(sk->err)
        return inet_error(sk);
    /* We may need to bind the socket. */
    if(inet_autobind(sk)!=0)
        return(-EAGAIN);
    return(sk->prot->write(sk, (unsigned char *) ubuf, size, noblock, flags));
}

static int inet_write(struct socket *sock, char *ubuf, int size, int noblock)
{
    return inet_send(sock,ubuf,size,noblock,0);
}
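
The chain sock_write() → sock->ops->write → sk->prot->write is two levels of dispatch through tables of function pointers. A minimal sketch of that layering follows. Every type and function in it is a hypothetical stand-in (`toy_proto_ops` plays the role of the per-family operations, `toy_proto` the per-protocol ones), and the argument lists are much shorter than the real ones.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock;
struct toy_socket;

/* Per-protocol operations (the role struct proto plays). */
struct toy_proto {
    const char *name;
    int (*write)(struct toy_sock *sk, const char *buf, int len);
};

/* Per-family operations (the role the socket's ops table plays). */
struct toy_proto_ops {
    int (*write)(struct toy_socket *sock, const char *buf, int len);
};

struct toy_sock {
    const struct toy_proto *prot;
    int bytes_sent;
};

struct toy_socket {
    const struct toy_proto_ops *ops;
    void *data;                 /* points at the family's private sock */
};

/* Transport layer: a pretend UDP write that just counts bytes. */
static int toy_udp_write(struct toy_sock *sk, const char *buf, int len)
{
    (void)buf;
    sk->bytes_sent += len;
    return len;
}

/* INET layer: recover the sock and dispatch on its protocol. */
static int toy_inet_write(struct toy_socket *sock, const char *buf, int len)
{
    struct toy_sock *sk = sock->data;
    return sk->prot->write(sk, buf, len);
}

/* BSD socket layer: validate arguments, then dispatch on the family. */
static int toy_sock_write(struct toy_socket *sock, const char *buf, int len)
{
    assert(sock != NULL && sock->ops != NULL);
    if (len < 0)
        return -1;
    if (len == 0)
        return 0;
    return sock->ops->write(sock, buf, len);
}

static const struct toy_proto toy_udp_prot = { "UDP", toy_udp_write };
static const struct toy_proto_ops toy_inet_ops = { toy_inet_write };
```

The upper layer never names the lower function directly: swapping `toy_udp_prot` for another table changes the transport without touching `toy_sock_write()` or `toy_inet_write()`.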

II. Transport layer

In the transport layer, udp_write() does its work by calling udp_sendto() in the same layer.


/*
 * In BSD SOCK_DGRAM a write is just like a send.
 */
static int udp_write(struct sock *sk, unsigned char *buff, int len, int noblock,
    unsigned flags)
{
    return(udp_sendto(sk, buff, len, noblock, flags, NULL, 0));
}

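udp_sendto() in turn calls udp_send(), which sets up the sk_buff structure, fills in the headers and sends the data. The details will be analyzed later.

At the end of udp_send(), the packet is handed down to the network layer by calling ip_queue_xmit() through sk->prot->queue_xmit.

The definition of udp_prot is: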


struct proto udp_prot = {
    sock_wmalloc,
    sock_rmalloc,
    sock_wfree,
    sock_rfree,
    sock_rspace,
    sock_wspace,
    udp_close,
    udp_read,
    udp_write,
    udp_sendto,
    udp_recvfrom,
    ip_build_header,
    udp_connect,
    NULL,
    ip_queue_xmit,
    NULL,
    NULL,
    NULL,
    udp_rcv,
    datagram_select,
    udp_ioctl,
    NULL,
    NULL,
    ip_setsockopt,
    ip_getsockopt,
    128,
    0,
    {NULL,},
    "UDP",
    0, 0
};


static int udp_send(struct sock *sk, struct sockaddr_in *sin,
    unsigned char *from, int len, int rt)
{
    struct sk_buff *skb;
    struct device *dev;
    struct udphdr *uh;
    unsigned char *buff;
    unsigned long saddr;
    int size, tmp;
    int ttl;
    /*
     * Allocate an sk_buff copy of the packet.
     */
    ........................
    /*
     * Now build the IP and MAC header.
     */
    ..........................
    /*
     * Fill in the UDP header.
     */
    ..............................
    /*
     * Copy the user data.
     */
    memcpy_fromfs(buff, from, len);
    /*
     * Set up the UDP checksum.
     */
    udp_send_check(uh, saddr, sin->sin_addr.s_addr, skb->len - tmp, sk);
    /*
     * Send the datagram to the interface.
     */
    udp_statistics.UdpOutDatagrams++;
    sk->prot->queue_xmit(sk, dev, skb, 1);
    return(len);
}

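III. Network layer

In the network layer, ip_queue_xmit() performs a series of operations on the packet, such as checking whether it needs to be fragmented and whether it is a multicast packet, and finally calls dev_queue_xmit() to send the data.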


/*
 * Queues a packet to be sent, and starts the transmitter
 * if necessary. if free = 1 then we free the block after
 * transmit, otherwise we don't. If free==2 we not only
 * free the block but also don't assign a new ip seq number.
 * This routine also needs to put in the total length,
 * and compute the checksum
 */
void ip_queue_xmit(struct sock *sk, struct device *dev,
    struct sk_buff *skb, int free)
{
    struct iphdr *iph;
    unsigned char *ptr;
    /* Sanity check */
    ............
    /*
     * Do some book-keeping in the packet for later
     */
    ...........
    /*
     * Find the IP header and set the length. This is bad
     * but once we get the skb data handling code in the
     * hardware will push its header sensibly and we will
     * set skb->ip_hdr to avoid this mess and the fixed
     * header length problem
     */
    ..............
    /*
     * No reassigning numbers to fragments...
     */
    if(free!=2)
        iph->id = htons(ip_id_count++);
    else
        free=1;
    /* All buffers without an owner socket get freed */
    if (sk == NULL)
        free = 1;
    skb->free = free;
    /*
     * Do we need to fragment. Again this is inefficient.
     * We need to somehow lock the original buffer and use
     * bits of it.
     */
    ................
    /*
     * Add an IP checksum
     */
    ip_send_check(iph);
    /*
     * Print the frame when debugging
     */
    /*
     * More debugging. You cannot queue a packet already on a list
     * Spot this and moan loudly.
     */
    .......................
    /*
     * If a sender wishes the packet to remain unfreed
     * we add it to his send queue. This arguably belongs
     * in the TCP level since nobody else uses it. BUT
     * remember IPng might change all the rules.
     */
    ......................
    /*
     * If the indicated interface is up and running, send the packet.
     */
    ip_statistics.IpOutRequests++;
    .............................
    .............................
    if((dev->flags&IFF_BROADCAST) && iph->daddr==dev->pa_brdaddr && !(dev->flags&IFF_LOOPBACK))
        ip_loopback(dev,skb);
    if (dev->flags & IFF_UP)
    {
        /*
         * If we have an owner use its priority setting,
         * otherwise use NORMAL
         */
        if (sk != NULL)
        {
            dev_queue_xmit(skb, dev, sk->priority);
        }
        else
        {
            dev_queue_xmit(skb, dev, SOPRI_NORMAL);
        }
    }
    else
    {
        ip_statistics.IpOutDiscards++;
        if (free)
            kfree_skb(skb, FREE_WRITE);
    }
}
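
One step in this listing, ip_send_check(iph), fills in the IP header checksum. The kernel routine itself is not shown in this post, so the following is an independent sketch of the standard Internet checksum over an IPv4 header, with hypothetical function names. It works on raw bytes in network order rather than on struct iphdr, so that it does not depend on the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of a header of ihl 32-bit words, folded to 16 bits
 * and inverted. Byte pairs are combined big-endian. */
static uint16_t toy_ip_checksum(const uint8_t *hdr, unsigned ihl)
{
    uint32_t sum = 0;

    assert(hdr != NULL && ihl >= 5);
    for (unsigned i = 0; i < ihl * 4; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carries back in */
    return (uint16_t)~sum;
}

/* Sender: zero the checksum field (bytes 10 and 11), then fill it in. */
static void toy_ip_send_check(uint8_t *hdr, unsigned ihl)
{
    uint16_t csum;

    hdr[10] = 0;
    hdr[11] = 0;
    csum = toy_ip_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}

/* Receiver: a header whose checksum field is correct sums to zero. */
static int toy_ip_header_ok(const uint8_t *hdr, unsigned ihl)
{
    return toy_ip_checksum(hdr, ihl) == 0;
}
```

The receive side shown earlier relies on the same property: ip_rcv drops the datagram when ip_fast_csum() over the header does not return 0.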

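IV. Driver layer (link layer)

Inside dev_queue_xmit(), the packet is sent by calling the transmit function of the specific device:

dev->hard_start_xmit(skb, dev);

The device's transmit function was already set when the network was initialized.

The 8390 network card serves here as the example of how the driver layer works. The function ethdev_init() in drivers/net/8390.c sets it up as follows: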


/* Initialize the rest of the 8390 device structure. */
int ethdev_init(struct device *dev)
{
    if (ei_debug > 1)
        printk(version);
    if (dev->priv == NULL) { // allocate the private area
        struct ei_device *ei_local; // per-device structure of the 8390 card
        dev->priv = kmalloc(sizeof(struct ei_device), GFP_KERNEL); // allocate kernel memory
        memset(dev->priv, 0, sizeof(struct ei_device));
        ei_local = (struct ei_device *)dev->priv;
#ifndef NO_PINGPONG
        ei_local->pingpong = 1;
#endif
    }
    /* The open call may be overridden by the card-specific code. */
    if (dev->open == NULL)
        dev->open = &ei_open; // the device's open function
    /* We should have a dev->stop entry also. */
    dev->hard_start_xmit = &ei_start_xmit; // the device's transmit function, defined in 8390.c
    dev->get_stats = get_stats;
#ifdef HAVE_MULTICAST
    dev->set_multicast_list = &set_multicast_list;
#endif
    ether_setup(dev);
    return 0;
}
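
The relationship between the generic layer and the driver comes down to a function pointer that the driver fills in at initialization and the generic code calls at transmit time. A minimal sketch, with hypothetical `toy_` names standing in for the kernel's device structure and the 8390 routines:

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb { int len; };

struct toy_device {
    const char *name;
    int (*hard_start_xmit)(struct toy_skb *skb, struct toy_device *dev);
    int tx_packets;
};

/* Driver-specific transmit routine (stands in for ei_start_xmit). */
static int toy_ei_start_xmit(struct toy_skb *skb, struct toy_device *dev)
{
    (void)skb;
    dev->tx_packets++;
    return 0;   /* 0: the frame was accepted */
}

/* Driver init: plug the routine into the generic device structure. */
static int toy_ethdev_init(struct toy_device *dev)
{
    assert(dev != NULL);
    dev->hard_start_xmit = toy_ei_start_xmit;
    dev->tx_packets = 0;
    return 0;
}

/* Generic layer: knows nothing about the hardware, only the pointer. */
static int toy_dev_queue_xmit(struct toy_skb *skb, struct toy_device *dev)
{
    if (dev->hard_start_xmit == NULL)
        return -1;          /* no driver attached */
    return dev->hard_start_xmit(skb, dev);
}
```

Supporting another card then means writing another transmit routine and another init function; the generic transmit path stays the same.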

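The transmit function in the driver is fairly complex and closely tied to the hardware, so it is not analyzed in detail here.

This gives a rough picture of the data path that network data takes from the application layer to the physical layer. Later posts will analyze it in detail.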
