Commit Graph

420 Commits

Author SHA1 Message Date
David S. Miller 67f11f4ded net: Remove linux/prefetch.h include from linux/skbuff.h
No longer needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 20:54:11 -04:00
David S. Miller 0fcbe742ea net: Remove prefetches from SKB list handlers.
Noticed by Linus.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 20:35:29 -04:00
Heiko Carstens 34ea646c9f net: add missing prefetch.h include
Fixes build errors on s390 and probably other archs as well:

  In file included from net/ipv4/ip_forward.c:32:0:
  include/net/udp.h: In function 'udp_csum_outgoing':
  include/net/udp.h:141:2: error: implicit declaration of function 'prefetch'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 11:26:02 -07:00
Eric Dumazet 0a14842f5a net: filter: Just In Time compiler for x86-64
In order to speedup packet filtering, here is an implementation of a
JIT compiler for x86_64

It is disabled by default, and must be enabled by the admin.

echo 1 >/proc/sys/net/core/bpf_jit_enable

It uses module_alloc() and module_free() to get memory in the 2GB text
kernel range since we call helpers functions from the generated code.

EAX : BPF A accumulator
EBX : BPF X accumulator
RDI : pointer to skb   (first argument given to JIT function)
RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
r9d : skb->len - skb->data_len (headlen)
r8  : skb->data

To get a trace of generated code, use :

echo 2 >/proc/sys/net/core/bpf_jit_enable

Example of generated code :

# tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

flen=18 proglen=147 pass=3 image=ffffffffa00b5000
JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
JIT code: ffffffffa00b5090: c0 c9 c3

BPF program is 144 bytes long, so native program is almost same size ;)

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 8
(002) ld       [26]
(003) and      #0xffffff00
(004) jeq      #0xc0a81400      jt 16   jf 5
(005) ld       [30]
(006) and      #0xffffff00
(007) jeq      #0xc0a81400      jt 16   jf 17
(008) jeq      #0x806           jt 10   jf 9
(009) jeq      #0x8035          jt 10   jf 17
(010) ld       [28]
(011) and      #0xffffff00
(012) jeq      #0xc0a81400      jt 16   jf 13
(013) ld       [38]
(014) and      #0xffffff00
(015) jeq      #0xc0a81400      jt 16   jf 17
(016) ret      #65535
(017) ret      #0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-27 23:05:08 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
David S. Miller eec009548e net: Fix warnings caused by MAX_SKB_FRAGS change.
After commit a715dea3c8 ("net: Always
allocate at least 16 skb frags regardless of page size"), the value
of MAX_SKB_FRAGS can now take on either an "unsigned long" or an
"int" value.

This causes warnings like:

net/packet/af_packet.c: In function ‘tpacket_fill_skb’:
net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’

Fix by forcing the constant to be unsigned long, otherwise we have
a situation where the type of a system wide constant is variable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-29 23:34:08 -07:00
Anton Blanchard a715dea3c8 net: Always allocate at least 16 skb frags regardless of page size
When analysing performance of the cxgb3 on a ppc64 box I noticed that
we weren't doing much GRO merging. It turns out we are limited by the
number of SKB frags:

#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)

With a 4kB page size we have 18 frags, but with a 64kB page size we
only have 3 frags.

I ran a single stream TCP bandwidth test to compare the performance of

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-28 22:26:32 -07:00
Jiri Pirko 8a4eb5734e net: introduce rx_handler results and logic around that
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.

kernel-doc for rx_handler_result is taken from Nicolas' patch.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-16 12:53:54 -07:00
Michał Mirosław 04ed3e741d net: change netdev->features to u32
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:32:47 -08:00
David S. Miller 686a295553 net: Add safe reverse SKB queue walkers.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 22:47:32 -08:00
KOVACS Krisztian 2fc72c7b84 netfilter: fix compilation when conntrack is disabled but tproxy is enabled
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.

This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.

Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-01-12 20:25:08 +01:00
Michał Mirosław 04fb451eff net: Introduce skb_checksum_start_offset()
Introduce skb_checksum_start_offset() to replace repetitive calculation.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:43:14 -08:00
Vladislav Zolotarov a3d22a68d7 bnx2x: Take the distribution range definition out of skb_tx_hash()
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).

Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:15:53 -08:00
Tom Herbert 3853b5841c xps: Improvements in TX queue selection
In dev_pick_tx, don't do work in calculating queue
index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:19 -08:00
Eric Dumazet 27b75c95f1 net: avoid RCU for NOCACHE dst
There is no point using RCU for dst we allocate for a very short time
(used once).

Change dst_release() to take DST_NOCACHE into account, but also change
skb_dst_set_noref() to force a refcount increment for such dst.

This is a _huge_ gain, because we dont waste memory to store xx thousand
of dsts. Instead of queueing them to RCU, we can free them instantly.

CPU caches can stay hot, re-using same memory blocks to hold temporary
dsts.

Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
since atomic_dec_return() implies a full memory barrier.

Stress test, 160.000.000 udp frames sent, IP route cache disabled
(DDOS).

Before:

real    0m38.091s
user    0m13.189s
sys     7m53.018s

After:

real	0m29.946s
user	0m12.157s
sys	7m40.605s

For reference, if IP route cache was enabled :

real	0m32.030s
user	0m10.521s
sys	8m15.243s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 03:02:23 -07:00
Eric Dumazet 564824b0c5 net: allocate skbs on local node
commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-16 11:13:19 -07:00
Eric Dumazet cb4dfe562c net: skb_frag_t can be smaller on small arches
On 32bit arches, if PAGE_SIZE is smaller than 65536, we can use 16bit
offset and size fields. This patch saves 72 bytes per skb on i386, or
128 bytes after rounding.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 18:31:13 -07:00
Eric Dumazet a02cec2155 net: return operator cleanup
Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-23 14:33:39 -07:00
Eric Dumazet bc8acf2c8c drivers/net: avoid some skb->ip_summed initializations
fresh skbs have ip_summed set to CHECKSUM_NONE (0)

We can avoid setting again skb->ip_summed to CHECKSUM_NONE in drivers.

Introduce skb_checksum_none_assert() helper so that we keep this
assertion documented in driver sources.

Change most occurrences of :

skb->ip_summed = CHECKSUM_NONE;

by :

skb_checksum_none_assert(skb);

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 19:06:22 -07:00
David S. Miller 21dc330157 net: Rename skb_has_frags to skb_has_frag_list
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 00:13:46 -07:00
Oliver Hartkopp 2244d07bfa net: simplify flags for tx timestamping
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 00:08:30 -07:00
Krishna Kumar bfb564e739 core: Factor out flow calculation from get_rps_cpu
Factor out flow calculation code from get_rps_cpu, since other
functions can use the same code.

Revisions:

v2 (Ben): Separate flow calcuation out and use in select queue.
v3 (Arnd): Don't re-implement MIN.
v4 (Changli): skb->data points to ethernet header in macvtap, and
	make a fast path. Tested macvtap with this patch.
v5 (Changli):
	- Cache skb->rxhash in skb_get_rxhash
	- macvtap may not have pow(2) queues, so change code for
	  queue selection.
    (Arnd):
	- Use first available queue if all fails.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-16 21:06:24 -07:00
Changli Gao f9599ce111 sk_buff: introduce pskb_network_may_pull()
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-04 21:53:14 -07:00
Oliver Hartkopp cff0d6e6ed can-raw: Fix skb_orphan_try handling
Commit fc6055a5ba (net: Introduce
skb_orphan_try()) allows an early orphan of the skb and takes care on
tx timestamping, which needs the sk-reference in the skb on driver level.
So does the can-raw socket, which has not been taken into account here.

The patch below adds a 'prevent_sk_orphan' bit in the skb tx shared info,
which fixes the problem discovered by Matthias Fuchs here:

      http://marc.info/?t=128030411900003&r=1&w=2

Even if it's not a primary tx timestamp topic it fits well into some skb
shared tx context. Or should be find a different place for the information to
protect the sk reference until it reaches the driver level?

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-03 00:31:48 -07:00
Eric Dumazet fed66381d6 net: pskb_expand_head() optimization
Move frags[] at the end of struct skb_shared_info, and make
pskb_expand_head() copy only the used part of it instead of whole array.

This should avoid kmemcheck warnings and speedup pskb_expand_head() as
well, avoiding a lot of cache misses.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-24 21:05:57 -07:00
Richard Cochran c1f19b51d1 net: support time stamping in phy devices.
This patch adds a new networking option to allow hardware time stamps
from PHY devices. When enabled, likely candidates among incoming and
outgoing network packets are offered to the PHY driver for possible
time stamping. When accepted by the PHY driver, incoming packets are
deferred for later delivery by the driver.

The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
and callbacks for transmit and receive time stamping. Drivers may
optionally implement these functions.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-18 19:15:26 -07:00
Richard Cochran 4507a71507 net: add driver hook for tx time stamping.
This patch adds a hook for transmit time stamps. The transmit hook
allows a software fallback for transmit time stamps, for MACs
lacking time stamping hardware. Using the hook will still require
adding an inline function call to each MAC driver.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-18 19:15:25 -07:00
Eric Dumazet 5933dd2f02 net: NET_SKB_PAD should depend on L1_CACHE_BYTES
In old kernels, NET_SKB_PAD was defined to 16.

Then commit d6301d3dd1 (net: Increase default NET_SKB_PAD to 32), and
commit 18e8c134f4 (net: Increase NET_SKB_PAD to 64 bytes) increased it
to 64.

While first patch was governed by network stack needs, second was more
driven by performance issues on current hardware. Real intent was to
align data on a cache line boundary.

So use max(32, L1_CACHE_BYTES) instead of 64, to be more generic.

Remove microblaze and powerpc own NET_SKB_PAD definitions.

Thanks to Alexander Duyck and David Miller for their comments.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 18:16:43 -07:00
David S. Miller 62522d36d7 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-06-11 13:32:31 -07:00
John Fastabend 597a264b1a net: deliver skbs on inactive slaves to exact matches
Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop().  This behavior is different then
the non-accelerated path and for pkts over a bonded vlan.

For example,

vlanx -> bond0 -> ethx

will be dropped in the vlan path and not delivered to any
packet handlers at all.  However,

bond0 -> vlanx -> ethx

and

bond0 -> ethx

will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.

This patch adds a sk_buff flag which is used for tagging
skbs that would previously been dropped and allows the
skb to continue to skb_netif_recv().  Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly.  This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped out right in the vlan path.

I have tested the following 4 configurations in failover modes
and load balancing modes.

# bond0 -> ethx

# vlanx -> bond0 -> ethx

# bond0 -> vlanx -> ethx

# bond0 -> ethx
            |
  vlanx -> --

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10 22:23:34 -07:00
Alexander Duyck b78462ebc6 skbuff: add check for non-linear to warn_if_lro and needs_linearize
We can avoid an unecessary cache miss by checking if the skb is non-linear
before accessing gso_size/gso_type in skb_warn_if_lro, the same can also be
done to avoid a cache miss on nr_frags if data_len is 0.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 02:23:16 -07:00
Changli Gao 5b0daa3474 skb: make skb_recycle_check() return a bool value
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-29 00:12:13 -07:00
Eric Dumazet 7fee226ad2 net: add a noref bit on skb dst
Use low order bit of skb->_skb_dst to tell dst is not refcounted.

Change _skb_dst to _skb_refdst to make sure all uses are catched.

skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.

skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().

Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:50 -07:00
Eric Dumazet 18e8c134f4 net: Increase NET_SKB_PAD to 64 bytes
eth_type_trans() & get_rps_cpus() currently need two 64bytes cache
lines in packet to compute rxhash.

Increasing NET_SKB_PAD from 32 to 64 reduces the need to one cache
line only, and makes RPS faster.

NET_IP_ALIGN(2) + ethernet_header(14) + IP_header(20/40) + ports(8)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 21:58:51 -07:00
Eric Dumazet ec7d2f2cf3 net: __alloc_skb() speedup
With following patch I can reach maximum rate of my pktgen+udpsink
simulator :
- 'old' machine : dual quad core E5450  @3.00GHz
- 64 UDP rx flows (only differ by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- rps dispatched on 7 other cores. (~130.000 IPI per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC
- 1.080.000 pps without a single drop at NIC level.

Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch
first sk_buff cache line, the second to prefetch the shinfo part.

Also using one memset() to initialize all skb_shared_info fields instead
of one by one to reduce number of instructions, using long word moves.

All skb_shared_info fields before 'dataref' are cleared in 
__alloc_skb().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-05 01:07:37 -07:00
David S. Miller 47d29646a2 net: Inline skb_pull() in eth_type_trans().
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot")
we uninlined skb_pull.

But in some critical paths it makes sense to inline this thing
and it helps performance significantly.

Create an skb_pull_inline() so that we can do this in a way that
serves also as annotation.

Based upon a patch by Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02 02:21:44 -07:00
Rami Rosen ccb7c7732e net: Remove two unnecessary exports (skbuff).
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in skbuff.c ; this patch
removes these two exports. It also marks these functions as 'static'
and removeS the extern declarations of them from
include/linux/skbuff.h

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 22:39:53 -07:00
Alexander Duyck cd58950a53 skbuff: remove unused dev_consume_skb macro definition
dev_consume_skb and kfree_skb_clean have no users and in the case of
kfree_skb_clean could cause potential build issues since I cannot find
where it is defined.  Based on the patch in which it was introduced it
appears to have been a bit of leftover code from an earlier version of the
patch in which kfree_skb_clean was dropped in favor of consume_skb.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 02:58:24 -07:00
David S. Miller 4a35ecf8bf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/via-velocity.c
	drivers/net/wireless/iwlwifi/iwl-agn.c
2010-04-06 23:53:30 -07:00
Alexander Duyck 03e6d819c2 skbuff: remove unused dma_head & dma_maps fields
The dma map fields in the skb_shared_info structure no longer has any users
and can be dropped since it is making the skb_shared_info unecessarily larger.

Running slabtop show that we were using 4K slabs for the skb->head on x86_64 w/
an allocation size of 1522.  It turns out that the dma_head and dma_maps array
made skb_shared large enough that we had crossed over the 2k boundary with
standard frames and as such we were using 4k blocks of memory for all skbs.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-24 11:13:35 -07:00
Tom Herbert 0a9627f264 rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS  567K tps at 61% CPU (4 HW RX queues)
   Without RPS  738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:18 -07:00
Eric Dumazet 4ab408dea0 net: fix protocol sk_buff field
Commit e992cd9b72 (kmemcheck: make bitfield annotations truly no-ops
when disabled) allows us to revert a workaround we did in the past to
not add holes in sk_buff structure.

This patch partially reverts commit 14d18a81b5
(net: fix kmemcheck annotations) so that sparse doesnt complain:

include/linux/skbuff.h:357:41: error: invalid bitfield specifier for
type restricted __be16.

Reported-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-02 03:05:05 -08:00
Felix Fietkau da3f5cf1f8 skbuff: align sk_buff::cb to 64 bit and close some potential holes
The alignment requirement for 64-bit load/store instructions on ARM is
implementation defined. Some CPUs (such as Marvell Feroceon) do not
generate an exception, if such an instruction is executed with an
address that is not 64 bit aligned. In such a case, the Feroceon
corrupts adjacent memory, which showed up in my tests as a crash in the
rx path of ath9k that only occured with CONFIG_XFRM set.

This crash happened, because the first field of the mac80211 rx status
info in the cb is an u64, and changing it corrupted the skb->sp field.

This patch also closes some potential pre-existing holes in the sk_buff
struct surrounding the cb[] area.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 03:16:59 -08:00
Ben Hutchings 1a5778aa00 net: Fix first line of kernel-doc for a few functions
The function name must be followed by a space, hypen, space, and a
short description.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-14 22:35:47 -08:00
Alexander Duyck c81c2d9544 skbuff: remove skb_dma_map/unmap
The two functions skb_dma_map/unmap are unsafe to use as they cause
problems when packets are cloned and sent to multiple devices while a HW
IOMMU is enabled.  Due to this it is best to remove the code so it is not
used by any other network driver maintainters.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 19:57:15 -08:00
Eric Dumazet 8964be4a9a net: rename skb->iif to skb->skb_iif
To help grep games, rename iif to skb_iif

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-20 15:35:04 -08:00
David S. Miller 230f9bb701 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/usb/cdc_ether.c

All CDC ethernet devices of type USB_CLASS_COMM need to use
'&mbm_info'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-06 00:55:55 -08:00
Eric Dumazet d94d9fee9f net: cleanup include/linux
This cleanup patch puts struct/union/enum opening braces,
in first line to ease grep games.

struct something
{

becomes :

struct something {

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 09:50:58 -08:00
Eric Dumazet 9d410c7960 net: fix sk_forward_alloc corruption
On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.

Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-30 12:25:12 -07:00
Eric Dumazet 14d18a81b5 net: fix kmemcheck annotations
struct sk_buff kmemcheck annotations enlarged this structure by 8/16 bytes

Fix this by moving 'protocol' inside flags1 bitfield,
and queue_mapping inside flags2 bitfield.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 22:48:44 -07:00
Eric Dumazet 61321bbd62 net: Add netdev_alloc_skb_ip_align() helper
Instead of hardcoding NET_IP_ALIGN stuff in various network drivers,
we can add a helper around netdev_alloc_skb()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 03:44:03 -07:00
Neil Horman 3b885787ea net: Generalize socket rx gap / receive queue overflow cmsg
Create a new socket level option to report number of queue overflows

Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames.  This value was
exported via a SOL_PACKET level cmsg.  AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option.  As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames.  It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count).  Tested
successfully by me.

Notes:

1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.

2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me.  This also saves us having
to code in a per-protocol opt in mechanism.

3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-12 13:26:31 -07:00
Johannes Berg 72bce62775 net: remove unused skb->do_not_encrypt
mac80211 required this due to the master netdev, but now
it can put all information into skb->cb and this can go.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-24 15:05:31 -04:00
Tobias Klauser 8660c1240e skbuff.h: Fix comment for NET_IP_ALIGN
Use the correct function call for skb_reserve in the comment for
NET_IP_ALIGN.

Signed-off-by: Tobias Klauser <klto@zhaw.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-14 12:03:42 -07:00
Linus Torvalds d2aa455037 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
  netxen: fix tx ring accounting
  netxen: fix detection of cut-thru firmware mode
  forcedeth: fix dma api mismatches
  atm: sk_wmem_alloc initial value is one
  net: correct off-by-one write allocations reports
  via-velocity : fix no link detection on boot
  Net / e100: Fix suspend of devices that cannot be power managed
  TI DaVinci EMAC : Fix rmmod error
  net: group address list and its count
  ipv4: Fix fib_trie rebalancing, part 2
  pkt_sched: Update drops stats in act_police
  sky2: version 1.23
  sky2: add GRO support
  sky2: skb recycling
  sky2: reduce default transmit ring
  sky2: receive counter update
  sky2: fix shutdown synchronization
  sky2: PCI irq issues
  sky2: more receive shutdown
  sky2: turn off pause during shutdown
  ...

Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
2009-06-18 14:07:15 -07:00
Randy Dunlap a42fc8f694 skbuff.h: fix skb_dst kernel-doc
Fix kernel-doc warnings (missing + extra entries) in skbuff.h.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-17 04:31:11 -07:00
Linus Torvalds b3fec0fe35 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
  signal: fix __send_signal() false positive kmemcheck warning
  fs: fix do_mount_root() false positive kmemcheck warning
  fs: introduce __getname_gfp()
  trace: annotate bitfields in struct ring_buffer_event
  net: annotate struct sock bitfield
  c2port: annotate bitfield for kmemcheck
  net: annotate inet_timewait_sock bitfields
  ieee1394/csr1212: fix false positive kmemcheck report
  ieee1394: annotate bitfield
  net: annotate bitfields in struct inet_sock
  net: use kmemcheck bitfields API for skbuff
  kmemcheck: introduce bitfield API
  kmemcheck: add opcode self-testing at boot
  x86: unify pte_hidden
  x86: make _PAGE_HIDDEN conditional
  kmemcheck: make kconfig accessible for other architectures
  kmemcheck: enable in the x86 Kconfig
  kmemcheck: add hooks for the page allocator
  kmemcheck: add hooks for page- and sg-dma-mappings
  kmemcheck: don't track page tables
  ...
2009-06-16 13:09:51 -07:00
Vegard Nossum fe55f6d5c0 net: use kmemcheck bitfields API for skbuff
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:25 +02:00
Johannes Berg 8f77f3849c mac80211: do not pass PS frames out of mac80211 again
In order to handle powersave frames properly we had needed
to pass these out to the device queues again, and introduce
the skb->requeue bit. This, however, also has unnecessary
overhead by needing to 'clean up' already tried frames, and
this clean-up code is also buggy when software encryption
is used.

Instead of sending the frames via the master netdev queue
again, simply put them into the pending queue. This also
fixes a problem where frames for that particular station
could be reordered when some were still on the software
queues and older ones are re-injected into the software
queue after them.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-06-10 13:28:37 -04:00
David S. Miller ee03987170 skbuff: Add frag list abstraction interfaces.
With the hope that these can be used to eliminate direct
references to the frag list implementation.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-09 00:17:13 -07:00
Eric Dumazet 042a53a9e4 net: skb_shared_info optimization
skb_dma_unmap() is quite expensive for small packets,
because we use two different cache lines from skb_shared_info.

One to access nr_frags, one to access dma_maps[0]

Instead of dma_maps being an array of MAX_SKB_FRAGS + 1 elements,
let dma_head alone in a new dma_head field, close to nr_frags,
to reduce cache lines misses.

Tested on my dev machine (bnx2 & tg3 adapters), nice speedup !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-08 00:21:48 -07:00
Eric Dumazet eae3f29cc7 net: num_dma_maps is not used
Get rid of num_dma_maps in struct skb_shared_info, as it seems unused.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-08 00:20:23 -07:00
Eric Dumazet e5b9215ef9 net: skb cleanup
Can remove anonymous union now it has one field.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:05 -07:00
Eric Dumazet adf30907d6 net: skb->dst accessors
Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:04 -07:00
Eric Dumazet 511c3f92ad net: skb->rtable accessor
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:02 -07:00
Eric Dumazet dfbf97f3ac net: add _skb_dst opaque field
struct sk_buff uses one union to define dst and rtable fields.

We want to replace direct access to these pointers by accessors.

First patch adds a new "unsigned long _skb_dst;" opaque field
in this union.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:01 -07:00
Johann Baudy 69e3c75f4d net: TX_RING and packet mmap
New packet socket feature that makes packet socket more efficient for
transmission.

- It reduces number of system call through a PACKET_TX_RING mechanism,
  based on PACKET_RX_RING (Circular buffer allocated in kernel space
  which is mmapped from user space).

- It minimizes CPU copy using fragmented SKB (almost zero copy).

Signed-off-by: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-18 22:11:22 -07:00
Michael S. Tsirkin 6f26c9a755 tun: fix tun_chr_aio_write so that aio works
aio_write gets const struct iovec * but tun_chr_aio_write casts this to struct
iovec * and modifies the iovec. As a result, attempts to use io_submit
to send packets to a tun device fail with weird errors such as EINVAL.

Since tun is the only user of skb_copy_datagram_from_iovec, we can
fix this simply by changing the later so that it does not
touch the iovec passed to it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-21 05:42:46 -07:00
Michael S. Tsirkin 0a1ec07a67 net: skb_copy_datagram_const_iovec()
There's an skb_copy_datagram_iovec() to copy out of a paged skb,
but it modifies the iovec, and does not support starting
at an offset in the destination. We want both in tun.c, so let's
add the function.

It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
be annoying.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-21 05:42:44 -07:00
David S. Miller 13223cb02c Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-03-29 01:40:34 -07:00
Randy Dunlap 4b21cd4eed skbuff.h: fix missing kernel-doc
Add missing struct field to fix kernel-doc warning:

Warning(include/linux/skbuff.h:182): No description found for parameter 'flags'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-28 23:38:40 -07:00
Linus Torvalds d54b3538b0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (119 commits)
  [SCSI] scsi_dh_rdac: Retry for NOT_READY check condition
  [SCSI] mpt2sas: make global symbols unique
  [SCSI] sd: Make revalidate less chatty
  [SCSI] sd: Try READ CAPACITY 16 first for SBC-2 devices
  [SCSI] sd: Refactor sd_read_capacity()
  [SCSI] mpt2sas v00.100.11.15
  [SCSI] mpt2sas: add MPT2SAS_MINOR(221) to miscdevice.h
  [SCSI] ch: Add scsi type modalias
  [SCSI] 3w-9xxx: add power management support
  [SCSI] bsg: add linux/types.h include to bsg.h
  [SCSI] cxgb3i: fix function descriptions
  [SCSI] libiscsi: fix possbile null ptr session command cleanup
  [SCSI] iscsi class: remove host no argument from session creation callout
  [SCSI] libiscsi: pass session failure a session struct
  [SCSI] iscsi lib: remove qdepth param from iscsi host allocation
  [SCSI] iscsi lib: have lib create work queue for transmitting IO
  [SCSI] iscsi class: fix lock dep warning on logout
  [SCSI] libiscsi: don't cap queue depth in iscsi modules
  [SCSI] iscsi_tcp: replace scsi_debug/tcp_debug logging with iscsi conn logging
  [SCSI] libiscsi_tcp: replace tcp_debug/scsi_debug logging with session/conn logging
  ...
2009-03-28 13:30:43 -07:00
Stephen Hemminger 9247744e5e skb: expose and constify hash primitives
Some minor changes to queue hashing:
 1. Use const on accessor functions
 2. Export skb_tx_hash for use in drivers (see ixgbe)

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-21 13:39:26 -07:00
Chris Leech 01d5b2fca1 [SCSI] net: define feature flags for FCoE offloads
Define feature flags for FCoE offloads.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: Yi Zou <yi.zou@intel.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-03-13 15:11:07 -05:00
Neil Horman ead2ceb0ec Network Drop Monitor: Adding kfree_skb_clean for non-drops and modifying end-of-line points for skbs
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 include/linux/skbuff.h |    4 +++-
 net/core/datagram.c    |    2 +-
 net/core/skbuff.c      |   22 ++++++++++++++++++++++
 net/ipv4/arp.c         |    2 +-
 net/ipv4/udp.c         |    2 +-
 net/packet/af_packet.c |    2 +-
 6 files changed, 29 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-13 12:09:28 -07:00
Randy Dunlap d3a21be86c skbuff.h: fix timestamps kernel-doc
Fix skbuff.h kernel-doc for timestamps: must include "struct" keyword,
otherwise there are kernel-doc errors:

Error(linux-next-20090227//include/linux/skbuff.h:161): cannot understand prototype: 'struct skb_shared_hwtstamps '
Error(linux-next-20090227//include/linux/skbuff.h:177): cannot understand prototype: 'union skb_shared_tx '

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-02 03:15:58 -08:00
David S. Miller e70049b9e7 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-02-24 03:50:29 -08:00
David S. Miller 92a0acce18 net: Kill skb_truesize_check(), it only catches false-positives.
A long time ago we had bugs, primarily in TCP, where we would modify
skb->truesize (for TSO queue collapsing) in ways which would corrupt
the socket memory accounting.

skb_truesize_check() was added in order to try and catch this error
more systematically.

However this debugging check has morphed into a Frankenstein of sorts
and these days it does nothing other than catch false-positives.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-17 21:24:05 -08:00
Patrick Ohly ac45f602ee net: infrastructure for hardware time stamping
The additional per-packet information (16 bytes for time stamps, 1
byte for flags) is stored for all packets in the skb_shared_info
struct. This implementation detail is hidden from users of that
information via skb_* accessor functions. A separate struct resp.
union is used for the additional information so that it can be
stored/copied easily outside of skb_shared_info.

Compared to previous implementations (reusing the tstamp field
depending on the context, optional additional structures) this
is the simplest solution. It does not extend sk_buff itself.

TX time stamping is implemented in software if the device driver
doesn't support hardware time stamping.

The new semantic for hardware/software time stamping around
ndo_start_xmit() is based on two assumptions about existing
network device drivers which don't support hardware time
stamping and know nothing about it:
 - they leave the new skb_shared_tx unmodified
 - the keep the connection to the originating socket in skb->sk
   alive, i.e., don't call skb_orphan()

Given that skb_shared_tx is new, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:34 -08:00
David S. Miller d54e6d8727 net: Kill skbuff macros from the stone ages.
This kills of HAVE_ALLOC_SKB and HAVE_ALIGNABLE_SKB.

Nothing in-tree uses them and nothing in-tree has used them
since 2.0.x times.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-09 23:45:29 -08:00
David S. Miller d6301d3dd1 net: Increase default NET_SKB_PAD to 32.
Several devices need to insert some "pre headers" in front of the
main packet data when they transmit a packet.

Currently we allocate only 16 bytes of pad room and this ends up not
being enough for some types of hardware (NIU, usb-net, s390 qeth,
etc.)

So increase this to 32.

Note that drivers still need to check in their transmit routine
whether enough headroom exists, and if not use skb_realloc_headroom().
Tunneling, IPSEC, and other encapsulation methods can cause the
padding area to be used up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-08 19:24:13 -08:00
Herbert Xu 86911732d3 gro: Avoid copying headers of unmerged packets
Unfortunately simplicity isn't always the best.  The fraginfo
interface turned out to be suboptimal.  The problem was quite
obvious.  For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.

LRO didn't have this problem because it directly read the headers
from the frags structure.

This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it.  Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:33:03 -08:00
David S. Miller d5a9e24afb net: Allow RX queue selection to seed TX queue hashing.
The idea is that drivers which implement multiqueue RX
pre-seed the SKB by recording the RX queue selected by
the hardware.

If such a seed is found on TX, we'll use that to select
the outgoing TX queue.

This helps get more consistent load balancing on router
and firewall loads.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-27 16:22:11 -08:00
Herbert Xu 71d93b39e5 net: Add skb_gro_receive
This patch adds the helper skb_gro_receive to merge packets for
GRO.  The current method is to allocate a new header skb and then
chain the original packets to its frag_list.  This is done to
make it easier to integrate into the existing GSO framework.

In future as GSO is moved into the drivers, we can undo this and
simply chain the original packets together.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 23:42:33 -08:00
Ilpo Järvinen 832d11c5cd tcp: Try to restore large SKBs while SACK processing
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
Then we're in problems when cleanup work for them has to be done
when a large cumulative ACK comes. Try to return back to pre-split
state already while more and more SACK info gets discovered by
combining newly discovered SACK areas with the previous skb if
that's SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) Write queue has less skbs to process (affect everything
   which has to walk in the queue past the sacked areas)
3) Write queue is consistent whole the time, so no other parts
   of TCP has to be aware of this (this was not the case with
   some other approach that was, well, quite intrusive all
   around).
4) Clean_rtx_queue can release most of the pages using single
   put_page instead of previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too which allows construction of skbs
that are even larger than what tso split them to and it handles
hole per on every nth patterns that often occur during slow start
overshoot pretty nicely. Though this to be really useful also
a retransmission would have to get lost since cumulative ACKs
advance one hole at a time in the most typical case.

TODO: handle upwards only merging. That should be rather easy
when segment is fully sacked but I'm leaving that as future
work item (it won't make very large difference anyway since
this current approach already covers quite a lot of normal
cases).

I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment but later on
realized that it won't be that necessary at all to store the
timestamp of the last segment. The cases that can occur are
basically either:
  1) ambiguous => no sensible measurement can be taken anyway
  2) non-ambiguous is due to reordering => having the timestamp
     of the last segment there is just skewing things more off
     than does some good since the ack got triggered by one of
     the holes (besides some substle issues that would make
     determining right hole/skb even harder problem). Anyway,
     it has nothing to do with this change then.

I choose to route some abnormal looking cases with goto noop,
some could be handled differently (eg., by stopping the
walking at that skb but again). In general, they either
shouldn't happen at all or are rare enough to make no difference
in practice.

In theory this change (as whole) could cause some macroscale
regression (global) because of cache misses that are taken over
the round-trip time but it gets very likely better because of much
less (local) cache misses per other write queue walkers and the
big recovery clearing cumulative ack.

Worth to note that these benefits would be very easy to get also
without TSO/GSO being on as long as the data is in pages so that
we can merge them. Currently I won't let that happen because
DSACK splitting at fragment that would mess up pcounts due to
sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
avoided, we have some conditions that can be made less strict.

TODO: I will probably have to convert the excessive pointer
passing to struct sacktag_state... :-)

My testing revealed that considerable amount of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...

[The rest is considering future work instead since I got
repeatably EFAULT to tcpdump's recvfrom when I added
pskb_expand_head to deal with clones, so I separated that
into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing previous SACK block, less allocs for new skbs
   are done, basically a new alloc is needed only when new hole
   is detected and when the previous skb runs out of frags space

...which now only happens of if reclaim is fast enough to dispose
the clone before the SACK block comes in (the window is RTT long),
otherwise we'll have to alloc some.

With clones being handled I got these numbers (will be somewhat
worse without that), taken with fine-grained mibs:

                  TCPSackShifted 398
                   TCPSackMerged 877
            TCPSackShiftFallback 320
      TCPSACKCOLLAPSEFALLBACKGSO 0
  TCPSACKCOLLAPSEFALLBACKSKBBITS 0
  TCPSACKCOLLAPSEFALLBACKSKBDATA 0
    TCPSACKCOLLAPSEFALLBACKBELOW 0
    TCPSACKCOLLAPSEFALLBACKFIRST 1
 TCPSACKCOLLAPSEFALLBACKPREVBITS 318
      TCPSACKCOLLAPSEFALLBACKMSS 1
   TCPSACKCOLLAPSEFALLBACKNOHEAD 0
    TCPSACKCOLLAPSEFALLBACKSHIFT 0
          TCPSACKCOLLAPSENOOPSEQ 0
  TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
     TCPSACKCOLLAPSENOOPSMALLLEN 0
             TCPSACKCOLLAPSEHOLE 12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:20:15 -08:00
Sujith 8b30b1fe36 mac80211: Re-enable aggregation
Wireless HW without any dedicated queues for aggregation
do not need the ampdu_queues mechanism present right now
in mac80211. Since mac80211 is still incomplete wrt TX MQ
changes, do not allow aggregation sessions for drivers that
set ampdu_queues.

This is only an interim hack until Intel fixes the requeue issue.

Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Signed-off-by: Luis Rodriguez <Luis.Rodriguez@Atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-10-31 19:02:14 -04:00
Alexey Dobriyan def8b4faff net: reduce structures when XFRM=n
ifdef out
* struct sk_buff::sp		(pointer)
* struct dst_entry::xfrm	(pointer)
* struct sock::sk_policy	(2 pointers)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 13:24:06 -07:00
Peter Zijlstra 654bed16cf net: packet split receive api
Add some packet-split receive hooks.

For one this allows to do NUMA node affine page allocs. Later on these
hooks will be extended to do emergency reserve allocations for
fragments.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:22:33 -07:00
Lennert Buytenhek 04a4bb55bc net: add skb_recycle_check() to enable netdriver skb recycling
This patch adds skb_recycle_check(), which can be used by a network
driver after transmitting an skb to check whether this skb can be
recycled as a receive buffer.

skb_recycle_check() checks that the skb is not shared or cloned, and
that it is linear and its head portion large enough (as determined by
the driver) to be recycled as a receive buffer.  If these conditions
are met, it does any necessary reference count dropping and cleans
up the skbuff as if it just came from __alloc_skb().

Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-01 02:33:12 -07:00
David S. Miller 1164f52a24 net: Add skb_queue_walk_from() and skb_queue_walk_from_safe().
These will be used by TCP write queue handling and elsewhere.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-23 00:49:44 -07:00
David S. Miller 249c8b42c7 net: Add skb_queue_next().
A lot of code wants to iterate over an SKB queue at the top level using
it's own control structure and iterator scheme.

Provide skb_queue_next(), which is only valid to invoke if
skb_queue_is_last() returns false.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-23 00:44:42 -07:00
David S. Miller fc7ebb212d net: Add skb_queue_is_last().
Several bits of code want to know "is this the last SKB in
a queue", and all of them implement this by hand.

Provide an common interface to make this check.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-23 00:34:07 -07:00
David S. Miller 1d4a31dde9 net: Fix bus in SKB queue splicing interfaces.
Handle the case of head being non-empty, by adding list->qlen
to head->qlen instead of using direct assignment.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-22 21:57:21 -07:00
David S. Miller 67fed45930 net: Add new interfaces for SKB list light-weight init and splicing.
This will be used by subsequent changesets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-21 22:36:24 -07:00
David S. Miller a40c24a133 net: Add SKB DMA mapping helper functions.
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-11 04:51:14 -07:00
David S. Miller 271bff7afb net: Add DMA mapping tokens to skb_shared_info.
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-11 04:48:58 -07:00
Rusty Russell db543c1f97 net: skb_copy_datagram_from_iovec()
There's an skb_copy_datagram_iovec() to copy out of a paged skb, but
nothing the other way around (because we don't do that).

We want to allocate big skbs in tun.c, so let's add the function.
It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
be annoying.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-15 19:52:30 -07:00
Gerrit Renker 987c402ac3 skbuff: Code readability NiT
Inserting a space between the `-' improved the C readability (some languages
allow hyphens within functions and variable names, which is confusing).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-11 18:17:17 -07:00
Randy Dunlap 4a7b61d235 skbuff: add missing kernel-doc for do_not_encrypt
Add missing kernel-doc notation to sk_buff:

Warning(linux-2.6.27-rc1-git2//include/linux/skbuff.h:345): No description found for parameter 'do_not_encrypt'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-31 20:52:08 -07:00
Johannes Berg d0f0980414 mac80211: partially fix skb->cb use
This patch fixes mac80211 to not use the skb->cb over the queue step
from virtual interfaces to the master. The patch also, for now,
disables aggregation because that would still require requeuing,
will fix that in a separate patch. There are two other places (software
requeue and powersaving stations) where requeue can happen, but that is
not currently used by any drivers/not possible to use respectively.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-07-29 16:55:08 -04:00
Patrick McHardy 6aa895b047 vlan: Don't store VLAN tag in cb
Use a real skb member to store the skb to avoid clashes with qdiscs,
which are allowed to use the cb area themselves. As currently only real
devices that consume the skb set the NETIF_F_HW_VLAN_TX flag, no explicit
invalidation is neccessary.

The new member fills a hole on 64 bit, the skb layout changes from:

        __u32                      mark;                 /*   172     4 */
        sk_buff_data_t             transport_header;     /*   176     4 */
        sk_buff_data_t             network_header;       /*   180     4 */
        sk_buff_data_t             mac_header;           /*   184     4 */
        sk_buff_data_t             tail;                 /*   188     4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        sk_buff_data_t             end;                  /*   192     4 */

        /* XXX 4 bytes hole, try to pack */

to

        __u32                      mark;                 /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */

        /* XXX 2 bytes hole, try to pack */

        sk_buff_data_t             transport_header;     /*   180     4 */
        sk_buff_data_t             network_header;       /*   184     4 */

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-14 22:49:06 -07:00
David S. Miller b19fa1fa91 net: Delete NETDEVICES_MULTIQUEUE kconfig option.
Multiple TX queue support is a core networking feature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08 23:14:24 -07:00
Ben Hutchings 4497b0763c net: Discard and warn about LRO'd skbs received for forwarding
Add skb_warn_if_lro() to test whether an skb was received with LRO and
warn if so.

Change br_forward(), ip_forward() and ip6_forward() to call it) and
discard the skb if it returns true.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-19 16:22:28 -07:00
Randy Dunlap 553a56726b skbuff: fix missing kernel-doc notation
Add kernel-doc notation for ndisc_nodetype:

Warning(linux-2.6.25-git2//include/linux/skbuff.h:340): No description found for parameter 'ndisc_nodetype'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-21 15:51:36 -07:00
Gerrit Renker f5572855ec [SKB]: __skb_queue_tail = __skb_insert before
This expresses __skb_queue_tail() in terms of __skb_insert(),
using __skb_insert_before() as auxiliary function.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 00:05:28 -07:00
Gerrit Renker 7de6c03336 [SKB]: __skb_append = __skb_queue_after
This expresses __skb_append in terms of __skb_queue_after, exploiting that

  __skb_append(old, new, list) = __skb_queue_after(list, old, new).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 00:05:09 -07:00
Gerrit Renker bf29927588 [SKB]: __skb_queue_after(prev) = __skb_insert(prev, prev->next)
By reordering, __skb_queue_after() is expressed in terms of __skb_insert().

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 00:04:51 -07:00
Gerrit Renker f525c06d12 [SKB]: __skb_dequeue = skb_peek + __skb_unlink
By rearranging the order of declarations, __skb_dequeue() is expressed in terms of

 * skb_peek() and
 * __skb_unlink(),

thus in effect mirroring the analogue implementation of __skb_dequeue_tail().

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 00:04:12 -07:00
YOSHIFUJI Hideaki de357cc013 [IPV6] NDISC: Don't rely on node-type hint from L2 unless required.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-04-03 10:06:01 +09:00
Templin, Fred L fadf6bf060 [IPV6] SIT: Add PRL management for ISATAP.
This patch updates the Linux the Intra-Site Automatic Tunnel Addressing
Protocol (ISATAP) implementation. It places the ISATAP potential router
list (PRL) in the kernel and adds three new private ioctls for PRL
management.

[Add several changes of structure name, constant names etc. - yoshfuji]

Signed-off-by: Fred L. Templin <fred.l.templin@boeing.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-04-03 10:05:58 +09:00
Ilpo Järvinen 419ae74ecc [NET]: uninline skb_trim, de-bloats
Allyesconfig (v2.6.24-mm1):
-10976  209 funcs, 123 +, 11099 -, diff: -10976 --- skb_trim

Without number of debug related CONFIGs (v2.6.25-rc2-mm1):
-7360  192 funcs, 131 +, 7491 -, diff: -7360 --- skb_trim
skb_trim                      |  +42

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 17:54:01 -07:00
Ilpo Järvinen c2aa270ad7 [NET]: uninline skb_push, de-bloats a lot
Allyesconfig (v2.6.24-mm1):

-21593  356 funcs, 2418 +, 24011 -, diff: -21593 --- skb_push

Without many debug related CONFIGs (v2.6.25-rc2-mm1):

-13890  341 funcs, 189 +, 14079 -, diff: -13890 --- skb_push
skb_push                      |  +46

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 17:52:40 -07:00
Ilpo Järvinen f58518e678 [NET]: uninline dev_alloc_skb, de-bloats a lot
Allyesconfig (v2.6.24-mm1):

-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb

Without many debug CONFIGs (v2.6.25-rc2-mm1):

-12178  382 funcs, 157 +, 12335 -, diff: -12178 --- dev_alloc_skb
dev_alloc_skb                 |  +37

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 17:51:31 -07:00
Ilpo Järvinen 6be8ac2fdc [NET]: uninline skb_pull, de-bloats a lot
Allyesconfig (v2.6.24-mm1):

-28162  354 funcs, 3005 +, 31167 -, diff: -28162 --- skb_pull

Without number of debug related CONFIGs (v2.6.25-rc2-mm1):

-9697  338 funcs, 221 +, 9918 -, diff: -9697 --- skb_pull
skb_pull                      |  +44

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 17:47:24 -07:00
Ilpo Järvinen 0dde3e1648 [NET]: uninline skb_put, de-bloats a lot
Allyesconfig (v2.6.24-mm1):

~500 files changed
...
 869 funcs, 198 +, 111003 -, diff: -110805 --- skb_put
  skb_put                       | +104

Without number of debug related CONFIGs (v2.6.25-rc2-mm1):

-60744  855 funcs, 861 +, 61605 -, diff: -60744 --- skb_put
  skb_put                       |  +57

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 17:43:41 -07:00
Eric Dumazet ee6b967301 [IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts
(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like  :
union {
     struct dst_entry *dst;
     struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-05 18:30:47 -08:00
Randy Dunlap 3172936341 net: fix kernel-doc warnings in header files
Add missing structure kernel-doc descriptions to sock.h & skbuff.h
to fix kernel-doc warnings.

(I think that Stephen H. sent a similar patch, but I can't find it.
I just want to kill the warnings, with either patch.)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-18 20:52:13 -08:00
Rusty Russell f35d9d8aae virtio: Implement skb_partial_csum_set, for setting partial csums on untrusted packets.
Use it in virtio_net (replacing buggy version there), it's also going
to be used by TAP for partial csum support.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
2008-02-04 23:49:56 +11:00
Patrick McHardy 2fd8e526f4 [NETFILTER]: bridge netfilter: remove nf_bridge_info read-only netoutdev member
Before the removal of the deferred output hooks, netoutdev was used in
case of VLANs on top of a bridge to store the VLAN device, so the
deferred hooks would see the correct output device. This isn't
necessary anymore since we're calling the output hooks for the correct
device directly in the IP stack.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 19:27:29 -08:00
Herbert Xu a59322be07 [UDP]: Only increment counter on first peek/recv
The previous move of the the UDP inDatagrams counter caused each
peek of the same packet to be counted separately.  This may be
undesirable.

This patch fixes this by adding a bit to sk_buff to record whether
this packet has already been seen through skb_recv_datagram.  We
then only increment the counter when the packet is seen for the
first time.

The only dodgy part is the fact that skb_recv_datagram doesn't have
a good way of returning this new bit of information.  So I've added
a new function __skb_recv_datagram that does return this and made
skb_recv_datagram a wrapper around it.

The plan is to eventually replace all uses of skb_recv_datagram with
this new function at which time it can be renamed its proper name.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:34 -08:00
Herbert Xu 27ab256864 [UDP]: Avoid repeated counting of checksum errors due to peeking
Currently it is possible for two processes to peek on the same socket
and end up incrementing the error counter twice for the same packet.

This patch fixes it by making skb_kill_datagram return whether it
succeeded in unlinking the packet and only incrementing the counter
if it did.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:32 -08:00
Jens Axboe 9c55e01c0c [TCP]: Splice receive support.
Support for network splice receive.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:31 -08:00
Herbert Xu 2d4baff8da [SKBUFF]: Free old skb properly in skb_morph
The skb_morph function only freed the data part of the dst skb, but leaked
the auxiliary data such as the netfilter fields.  This patch fixes this by
moving the relevant parts from __kfree_skb to skb_release_all and calling
it in skb_morph.

It also makes kfree_skbmem static since it's no longer called anywhere else
and it now no longer does skb_release_data.

Thanks to Yasuyuki KOZAKAI for finding this problem and posting a patch for
it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 23:11:19 +08:00
Chuck Lever 78608ba032 [NET]: Fix skb_truesize_check() assertion
The intent of the assertion in skb_truesize_check() is to check
for skb->truesize being decremented too much by other code,
resulting in a wraparound below zero.

The type of the right side of the comparison causes the compiler to
promote the left side to an unsigned type, despite the presence of an
explicit type cast.  This defeats the check for negativity.

Ensure both sides of the comparison are a signed type to prevent the
implicit type conversion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:53:30 -08:00
Chuck Lever c2636b4d9e [NET]: Treat the sign of the result of skb_headroom() consistently
In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer.  Make
the comparisons consistent and correct.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:55 -07:00
Pavel Emelyanov e3fa259bcb [NET]: Cut off the queue_mapping field from sk_buff
Just hide it behind the #ifdef, because nobody wants
it now.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:57 -07:00
Pavel Emelyanov 4e3ab47a54 [NET]: Make and use skb_get_queue_mapping
Make the helper for getting the field, symmetrical to
the "set" one. Return 0 if CONFIG_NETDEVICES_MULTIQUEUE=n

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:56 -07:00
Herbert Xu deea84b0ae [NET]: Fix SKB_WITH_OVERHEAD calculation
The calculation in SKB_WITH_OVERHEAD is incorrect in that it can cause
an overflow across a page boundary which is what it's meant to prevent.
In particular, the header length (X) should not be lumped together with
skb_shared_info.  The latter needs to be aligned properly while the header
has no choice but to sit in front of wherever the payload is.

Therefore the correct calculation is to take away the aligned size of
skb_shared_info, and then subtract the header length.  The resulting
quantity L satisfies the following inequality:

	SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <= PAGE_SIZE

This is the quantity used by alloc_skb to do the actual allocation.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:53 -07:00
Linus Torvalds a52cefc80f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
  [IPV6]: Consolidate the ip6_pol_route_(input|output) pair
  [TCP]: Make snd_cwnd_cnt 32-bit
  [TCP]: Update the /proc/net/tcp documentation
  [NETNS]: Don't panic on creating the namespace's loopback
  [NEIGH]: Ensure that pneigh_lookup is protected with RTNL
  [INET]: kmalloc+memset -> kzalloc in frag_alloc_queue
  [ISDN]: Fix compile with CONFIG_ISDN_X25 disabled.
  [IPV6]: Replace sk_buff ** with sk_buff * in input handlers
  [SELINUX]: Update for netfilter ->hook() arg changes.
  [INET]: Consolidate the xxx_put
  [INET]: Small cleanup for xxx_put after evictor consolidation
  [INET]: Consolidate the xxx_evictor
  [INET]: Consolidate the xxx_frag_destroy
  [INET]: Consolidate xxx_the secret_rebuild
  [INET]: Consolidate the xxx_frag_kill
  [INET]: Collect common frag sysctl variables together
  [INET]: Collect frag queues management objects together
  [INET]: Move common fields from frag_queues in one place.
  [TG3]: Fix performance regression on 5705.
  [ISDN]: Remove local copy of device name to make sure renames work.
  ...
2007-10-15 14:06:58 -07:00
Herbert Xu e0053ec07e [SKBUFF]: Add skb_morph
This patch creates a new function skb_morph that's just like skb_clone
except that it lets user provide the spare skb that will be overwritten
by the one that's to be cloned.

This will be used by IP fragment reassembly so that we get back the same
skb that went in last (rather than the head skb that we get now which
requires us to carry around double pointers all over the place).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Brice Goglin eabd7e35c0 Add skb_is_gso_v6
Add skb_is_gso_v6().

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-10-15 14:24:07 -04:00
Herbert Xu d9cc20484e [NET] skbuff: Add skb_cow_head
This patch adds an optimised version of skb_cow that avoids the copy if
the header can be modified even if the rest of the payload is cloned.

This can be used in encapsulating paths where we only need to modify the
header.  As it is, this can be used in PPPOE and bridging.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-16 16:21:16 -07:00
David S. Miller a309bb072b [NET]: Page offsets and lengths need to be __u32.
Based upon a report from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-31 02:28:28 -07:00
Al Viro 4381ca3c23 fix return type of skb_checksum_complete()
It returns __sum16, not unsigned int

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-15 16:40:51 -07:00
Herbert Xu c6c6e3e05c [NET]: Update comments for skb checksums
Rusty (whose comments we should all study and emulate :) pointed
out that our comments for skb checksums are no longer up-to-date.
So here is a patch to

1) add the case of partial checksums on input;
2) update partial checksum case to mention csum_start/csum_offset;
3) mention the new IPv6 feature bit.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:41:55 -07:00
Jozsef Kadlecsik ba9dda3ab5 [NETFILTER]: x_tables: add TRACE target
The TRACE target can be used to follow IP and IPv6 packets through
the ruleset.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick NcHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:17:14 -07:00
Peter P Waskiewicz Jr f25f4e4480 [CORE] Stack changes to add multiqueue hardware support API
Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:21 -07:00
Patrick McHardy 334a8132d9 [SKBUFF]: Keep track of writable header len of headerless clones
Currently NAT (and others) that want to modify cloned skbs copy them,
even if in the vast majority of cases its not necessary because the
skb is a clone made by TCP and the portion NAT wants to modify is
actually writable because TCP release the header reference before
cloning.

The problem is that there is no clean way for NAT to find out how
long the writable header area is, so this patch introduces skb->hdr_len
to hold this length. When a headerless skb is cloned skb->hdr_len
is set to the current headroom, for regular clones it is copied from
the original. A new function skb_clone_writable(skb, len) returns
whether the skb is writable up to len bytes from skb->data. To avoid
enlarging the skb the mac_len field is reduced to 16 bit and the
new hdr_len field is put in the remaining 16 bit.

I've done a few rough benchmarks of NAT (not with this exact patch,
but a very similar one). As expected it saves huge amounts of system
time in case of sendfile, bringing it down to basically the same
amount as without NAT, with sendmsg it only helps on loopback,
probably because of the large MTU.

Transmit a 1GB file using sendfile/sendmsg over eth0/lo with and
without NAT:

- sendfile eth0, no NAT:	sys     0m0.388s
- sendfile eth0, NAT:		sys     0m1.835s
- sendfile eth0: NAT + path:	sys     0m0.370s	(~ -80%)

- sendfile lo, no NAT:		sys     0m0.258s
- sendfile lo, NAT:		sys     0m2.609s
- sendfile lo, NAT + patch:	sys     0m0.260s	(~ -90%)

- sendmsg eth0, no NAT:		sys     0m2.508s
- sendmsg eth0, NAT:		sys     0m2.539s
- sendmsg eth0, NAT + patch:	sys     0m2.445s	(no change)

- sendmsg lo, no NAT:		sys	0m2.151s
- sendmsg lo, NAT:		sys     0m3.557s
- sendmsg lo, NAT + patch:	sys     0m2.159s	(~ -40%)

I expect other users can see a similar performance improvement,
packet mangling iptables targets, ipip and ip_gre come to mind ..

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:15:37 -07:00
Ilpo Järvinen b9ce204f0a [TCP]: Congestion control API RTT sampling fix
Commit 164891aadf broke RTT
sampling of congestion control modules. Inaccurate timestamps
could be fed to them without providing any way for them to
identify such cases. Previously RTT sampler was called only if
FLAG_RETRANS_DATA_ACKED was not set filtering inaccurate
timestamps nicely. In addition, the new behavior could give an
invalid timestamp (zero) to RTT sampler if only skbs with
TCPCB_RETRANS were ACKed. This solves both problems.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-15 15:08:43 -07:00
Randy Dunlap be52178b9f [NET] skbuff: fix kernel-doc
Fix skbuff.h kernel-doc:
linux-2.6.21-git4//include/linux/skbuff.h:316): No description found for parameter 'transport_header'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:16:20 -07:00
James Chapman 46f8914e53 [SKB]: Introduce skb_queue_walk_safe()
This patch provides a method for walking skb lists while inserting or
removing skbs from the list.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-30 00:07:31 -07:00
Stephen Hemminger 164891aadf [TCP]: Congestion control API update.
Do some simple changes to make congestion control API faster/cleaner.
* use ktime_t rather than timeval
* merge rtt sampling into existing ack callback
  this means one indirect call versus two per ack.
* use flags bits to store options/settings

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:45 -07:00
Stephen Hemminger 0c6fcc8a8c [NET] skbuff: skb_store_bits const is backwards
Getting warnings becuase skb_store_bits has skb as constant,
but the function overwrites it. Looks like const was on the
wrong side.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:17 -07:00
Herbert Xu 604763722c [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY.  Therefore we should
treat it as such in the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:43 -07:00
Herbert Xu 663ead3bb8 [NET]: Use csum_start offset instead of skb_transport_header
The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload.  Unfortunately,
the same pointer is also used during receive side processing.

This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.

This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums.  This offset is calculated from skb->head so that
it does not have to change when skb->data changes.

No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.

For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:40 -07:00
David Howells 716ea3a7aa [NET]: Move generic skbuff stuff from XFRM code to generic code
Move generic skbuff stuff from XFRM code to generic code so that
AF_RXRPC can use it too.

The kdoc comments I've attached to the functions needs to be checked
by whoever wrote them as I had to make some guesses about the workings
of these functions.

Signed-off-By: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:33 -07:00
Arnaldo Carvalho de Melo 27d7ff46a3 [SK_BUFF]: Introduce skb_copy_to_linear_data{_offset}
To clearly state the intent of copying to linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
2007-04-25 22:28:29 -07:00
Arnaldo Carvalho de Melo d626f62b11 [SK_BUFF]: Introduce skb_copy_from_linear_data{_offset}
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:28:23 -07:00
Herbert Xu 35fc92a9de [NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE
Right now Xen has a horrible hack that lets it forward packets with
partial checksums.  One of the reasons that CHECKSUM_PARTIAL and
CHECKSUM_COMPLETE were added is so that we can get rid of this hack
(where it creates two extra bits in the skbuff to essentially mirror
ip_summed without being destroyed by the forwarding code).

I had forgotten that I've already gone through all the deivce drivers
last time around to make sure that they're looking at ip_summed ==
CHECKSUM_PARTIAL rather than ip_summed != 0 on transmit.  In any case,
I've now done that again so it should definitely be safe.

Unfortunately nobody has yet added any code to update CHECKSUM_COMPLETE
values on forward so we I'm setting that to CHECKSUM_NONE.  This should
be safe to remove for bridging but I'd like to check that code path
first.

So here is the patch that lets us get rid of the hack by preserving
ip_summed (mostly) on forwarded packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:16 -07:00
Yasuyuki Kozakai de6e05c49f [NETFILTER]: nf_conntrack: kill destroy() in struct nf_conntrack for diet
The destructor per conntrack is unnecessary, then this replaces it with
system wide destructor.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:45 -07:00
Yasuyuki Kozakai 5f79e0f916 [NETFILTER]: nf_conntrack: don't use nfct in skb if conntrack is disabled
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:44 -07:00
Arnaldo Carvalho de Melo 897933bcdf [SK_BUFF]: Remove skb_add_mtu() leftovers
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:26:35 -07:00
Arnaldo Carvalho de Melo 4305b54135 [SK_BUFF]: Convert skb->end to sk_buff_data_t
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:29 -07:00
Arnaldo Carvalho de Melo 27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
Arnaldo Carvalho de Melo 2e07fa9cd3 [SK_BUFF]: Use offsets for skb->{mac,network,transport}_header on 64bit architectures
With this we save 8 bytes per network packet, leaving a 4 bytes hole to be used
in further shrinking work, likely with the offsetization of other pointers,
such as ->{data,tail,end}, at the cost of adds, that were minimized by the
usual practice of setting skb->{mac,nh,n}.raw to a local variable that is then
accessed multiple times in each function, it also is not more expensive than
before with regards to most of the handling of such headers, like setting one
of these headers to another (transport to network, etc), or subtracting, adding
to/from it, comparing them, etc.

Now we have this layout for sk_buff on a x86_64 machine:

[acme@mica net-2.6.22]$ pahole vmlinux sk_buff
struct sk_buff {
	struct sk_buff *       next;             /*   0   8 */
	struct sk_buff *       prev;             /*   8   8 */
	struct rb_node         rb;               /*  16  24 */
	struct sock *          sk;               /*  40   8 */
	ktime_t                tstamp;           /*  48   8 */
	struct net_device *    dev;              /*  56   8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct net_device *    input_dev;        /*  64   8 */
	sk_buff_data_t         transport_header; /*  72   4 */
	sk_buff_data_t         network_header;   /*  76   4 */
	sk_buff_data_t         mac_header;       /*  80   4 */

	/* XXX 4 bytes hole, try to pack */

	struct dst_entry *     dst;              /*  88   8 */
	struct sec_path *      sp;               /*  96   8 */
	char                   cb[48];           /* 104  48 */
	/* cacheline 2 boundary (128 bytes) was 24 bytes ago*/
	unsigned int           len;              /* 152   4 */
	unsigned int           data_len;         /* 156   4 */
	unsigned int           mac_len;          /* 160   4 */
	union {
		__wsum         csum;             /*       4 */
		__u32          csum_offset;      /*       4 */
	};                                       /* 164   4 */
	__u32                  priority;         /* 168   4 */
	__u8                   local_df:1;       /* 172   1 */
	__u8                   cloned:1;         /* 172   1 */
	__u8                   ip_summed:2;      /* 172   1 */
	__u8                   nohdr:1;          /* 172   1 */
	__u8                   nfctinfo:3;       /* 172   1 */
	__u8                   pkt_type:3;       /* 173   1 */
	__u8                   fclone:2;         /* 173   1 */
	__u8                   ipvs_property:1;  /* 173   1 */

	/* XXX 2 bits hole, try to pack */

	__be16                 protocol;         /* 174   2 */
	void    (*destructor)(struct sk_buff *); /* 176   8 */
	struct nf_conntrack *  nfct;             /* 184   8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	struct sk_buff *       nfct_reasm;       /* 192   8 */
	struct nf_bridge_info *nf_bridge;        /* 200   8 */
	__u16                  tc_index;         /* 208   2 */
	__u16                  tc_verd;          /* 210   2 */
	dma_cookie_t           dma_cookie;       /* 212   4 */
	__u32                  secmark;          /* 216   4 */
	__u32                  mark;             /* 220   4 */
	unsigned int           truesize;         /* 224   4 */
	atomic_t               users;            /* 228   4 */
	unsigned char *        head;             /* 232   8 */
	unsigned char *        data;             /* 240   8 */
	unsigned char *        tail;             /* 248   8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned char *        end;              /* 256   8 */
}; /* size: 264, cachelines: 5 */
   /* sum members: 260, holes: 1, sum holes: 4 */
   /* bit holes: 1, sum bit holes: 2 bits */
   /* last cacheline: 8 bytes */

On 32 bits nothing changes, and pointers continue to be used with the compiler
turning all this abstraction layer into dust. But there are some sk_buff
validation tricks that are now possible, humm... :-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:21 -07:00
Arnaldo Carvalho de Melo b0e380b1d8 [SK_BUFF]: unions of just one member don't get anything done, kill them
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:20 -07:00
Arnaldo Carvalho de Melo cfe1fc7759 [SK_BUFF]: Introduce skb_network_header_len
For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:19 -07:00
Yasuyuki Kozakai e7ac05f340 [NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb
This unifies the codes to copy netfilter related datas. Before copying,
nf_copy() puts original members in destination skb.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:55 -07:00
Yasuyuki Kozakai edda553c32 [NETFILTER]: nf_conntrack: add __nf_copy() to copy members in skb
This unifies the codes to copy netfilter related datas. Note that
__nf_copy() assumes destination skb doesn't have any netfilter
related members.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:54 -07:00
Arnaldo Carvalho de Melo 9c70220b73 [SK_BUFF]: Introduce skb_transport_header(skb)
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo 39b89160df [SK_BUFF]: Introduce ipipv6_hdr(), remove skb->h.ipv6h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:28 -07:00
Arnaldo Carvalho de Melo b0061ce49c [SK_BUFF]: Introduce ipip_hdr(), remove skb->h.ipiph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:27 -07:00
Arnaldo Carvalho de Melo aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo 88c7664f13 [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:23 -07:00
Arnaldo Carvalho de Melo 4bedb45203 [SK_BUFF]: Introduce udp_hdr(), remove skb->h.uh
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:22 -07:00
Arnaldo Carvalho de Melo d9edf9e2be [SK_BUFF]: Introduce igmp_hdr() & friends, remove skb->h.igmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:21 -07:00
Arnaldo Carvalho de Melo 967b05f64e [SK_BUFF]: Introduce skb_set_transport_header
For the cases where the transport header is being set to a offset from
skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:17 -07:00
Arnaldo Carvalho de Melo ea2ae17d64 [SK_BUFF]: Introduce skb_transport_offset()
For the quite common 'skb->h.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:16 -07:00
Arnaldo Carvalho de Melo badff6d01a [SK_BUFF]: Introduce skb_reset_transport_header(skb)
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:15 -07:00
Arnaldo Carvalho de Melo 0660e03f6b [SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h
Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:14 -07:00
Arnaldo Carvalho de Melo d0a92be05e [SK_BUFF]: Introduce arp_hdr(), remove skb->nh.arph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:12 -07:00
Arnaldo Carvalho de Melo eddc9ec53b [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:10 -07:00
Arnaldo Carvalho de Melo c14d2450cb [SK_BUFF]: Introduce skb_set_network_header
For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:01 -07:00
Arnaldo Carvalho de Melo d56f90a7c9 [SK_BUFF]: Introduce skb_network_header()
For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:59 -07:00
Arnaldo Carvalho de Melo bbe735e424 [SK_BUFF]: Introduce skb_network_offset()
For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:58 -07:00
Arnaldo Carvalho de Melo c1d2bbe1cd [SK_BUFF]: Introduce skb_reset_network_header(skb)
For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:46 -07:00
Arnaldo Carvalho de Melo 98e399f82a [SK_BUFF]: Introduce skb_mac_header()
For the places where we need a pointer to the mac header, it is still legal to
touch skb->mac.raw directly if just adding to, subtracting from or setting it
to another layer header.

This one also converts some more cases to skb_reset_mac_header() that my
regex missed as it had no spaces before nor after '=', ugh.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:41 -07:00
Arnaldo Carvalho de Melo 48d49d0ccd [SK_BUFF]: Introduce skb_set_mac_header()
For the cases where we want to set skb->mac.raw to an offset from skb->data.

Simple cases first, the memmove ones and specially pktgen will be left for later.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:37 -07:00
Arnaldo Carvalho de Melo 459a98ed88 [SK_BUFF]: Introduce skb_reset_mac_header(skb)
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:32 -07:00
Herbert Xu 759e5d0064 [UDP]: Clean up UDP-Lite receive checksum
This patch eliminates some duplicate code for the verification of
receive checksums between UDP-Lite and UDP.  It does this by
introducing __skb_checksum_complete_head which is identical to
__skb_checksum_complete_head apart from the fact that it takes
a length parameter rather than computing the first skb->len bytes.

As a result UDP-Lite will be able to use hardware checksum offload
for packets which do not use partial coverage checksums.  It also
means that UDP-Lite loopback no longer does unnecessary checksum
verification.

If any NICs start support UDP-Lite this would also start working
automatically.

This patch removes the assumption that msg_flags has MSG_TRUNC clear
upon entry in recvmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:51 -07:00
David S. Miller fc910a2783 [NETLINK]: Limit NLMSG_GOODSIZE to 8K.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:45 -07:00
Eric Dumazet b7aa0bf70c [NET]: convert network timestamps to ktime_t
We currently use a special structure (struct skb_timeval) and plain
'struct timeval' to store packet timestamps in sk_buffs and struct
sock.

This has some drawbacks :
- Fixed resolution of micro second.
- Waste of space on 64bit platforms where sizeof(struct timeval)=16

I suggest using ktime_t that is a nice abstraction of high resolution
time services, currently capable of nanosecond resolution.

As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
a 8 byte shrink of this structure on 64bit architectures. Some other
structures also benefit from this size reduction (struct ipq in
ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

Once this ktime infrastructure adopted, we can more easily provide
nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

Note : this patch includes a bug correction in
compat_sock_get_timestamp() where a "err = 0;" was missing (so this
syscall returned -ENOENT instead of 0)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
CC: John find <linux.kernel@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:34 -07:00
Pavel Emelianov c2ecba7171 [NET]: Set a separate lockdep class for neighbour table's proxy_queue
Otherwise the following calltrace will lead to a wrong
lockdep warning:

  neigh_proxy_process()
    `- lock(neigh_table->proxy_queue.lock);
  arp_redo /* via tbl->proxy_redo */
  arp_process
  neigh_event_ns
  neigh_update
  skb_queue_purge
    `- lock(neighbor->arp_queue.lock);

This is not a deadlock actually, as neighbor table's proxy_queue
and the neighbor's arp_queue are different queues.

Lockdep thinks there is a deadlock as both queues are initialized
with skb_queue_head_init() and thus have a common class.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:31 -07:00
Herbert Xu b4dfa0b1fb [NET]: Get rid of alloc_skb_from_cache
Since this was added originally for Xen, and Xen has recently (~2.6.18)
stopped using this function, we can safely get rid of it.  Good timing
too since this function has started to bit rot.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:16 -07:00
Patrick McHardy c01003c205 [IFB]: Fix crash on input device removal
The input_device pointer is not refcounted, which means the device may
disappear while packets are queued, causing a crash when ifb passes packets
with a stale skb->dev pointer to netif_rx().

Fix by storing the interface index instead and do a lookup where neccessary.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-29 11:46:52 -07:00
Christoph Lameter e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Hellwig b30973f877 [PATCH] node-aware skb allocation
Node-aware allocation of skbs for the receive path.

Details:

  - __alloc_skb gets a new node argument and cals the node-aware
    slab functions with it.
  - netdev_alloc_skb passed the node number it gets from dev_to_node
    to it, everyone else passes -1 (any node)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Al Viro a80958f484 [PATCH] fix fallout from header dependency trimming
OK, that seems to be enough to deal with the mess.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-04 12:45:29 -08:00
Al Viro d7fe0f241d [PATCH] severing skbuff.h -> mm.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:34 -05:00
Al Viro bd01f843c3 [PATCH] severing skbuff.h -> poll.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:31 -05:00
Al Viro a1f8e7f7fb [PATCH] severing skbuff.h -> highmem.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:29 -05:00
Al Viro ff1dcadb1b [NET]: Split skb->csum
... into anonymous union of __wsum and __u32 (csum and csum_offset resp.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:27:18 -08:00
Al Viro 1f61ab5ca5 [NET]: Preliminaty annotation of skb->csum.
It's still not completely right; we need to split it into anon unions
of __wsum and unsigned - for cases when we use it for partial checksum
and for offset of checksum in skb

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:44 -08:00
Al Viro b51655b958 [NET]: Annotate __skb_checksum_complete() and friends.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:38 -08:00
Al Viro 81d7766276 [NET]: Annotate skb_copy_and_csum_bits() and callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:36 -08:00
Al Viro 2bbbc86890 [NET]: Annotate skb_checksum() and callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:35 -08:00
Al Viro 5084205faf [NET]: Annotate callers of csum_partial_copy_...() and csum_and_copy...() in net/*
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:33 -08:00
Thomas Graf 82e91ffef6 [NET]: Turn nfmark into generic mark
nfmark is being used in various subsystems and has become
the defacto mark field for all kinds of packets. Therefore
it makes sense to rename it to `mark' and remove the
dependency on CONFIG_NETFILTER.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:21:38 -08:00
Al Viro ae08e1f092 [IPV6]: ip6_output annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:21:26 -08:00