The function is quite big and has several call sites and nothing
to collapse by compiler optimization on inlining.
Besides it's nicer to read in a in .c file.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spring cleaning time...
There seems to be a lot of places in the network code that have
extra bogus semicolons after conditionals. Most commonly is a
bogus semicolon after: switch() { }
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an implementation of TCP Illinois invented by Shao Liu
at University of Illinois. It is a another variant of Reno which adapts
the alpha and beta parameters based on RTT. The basic idea is to increase
window less rapidly as delay approaches the maximum. See the papers
and talks to get a more complete description.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also fixes memory leak in error path.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery). This is particularly useful for diagnostic utilities, like
traceroute/tracepath.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're now holding the rtnl during the entire dump operation, we can
remove additional locking for rtnl protected data. This patch does that
for all simple cases (dev_base_lock for dev_base walking, RCU protection
for FIB rule dumping).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
All LOG targets always use their internal logging function nowadays, so
remove the incorrect error message and handle real errors (!= -EEXIST)
by failing to load.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When mangling packets forwarded to a HW checksumming capable device,
offload recalculation of the checksum instead of doing it in software.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should
treat it as such in the stack.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload. Unfortunately,
the same pointer is also used during receive side processing.
This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.
This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums. This offset is calculated from skb->head so that
it does not have to change when skb->data changes.
No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.
For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
esp_init_state doesn't account for the beet pseudo header in the header_len
calculation, which may result in undersized skbs hitting xfrm4_beet_output,
causing unnecessary reallocations in ip_finish_output2.
The skbs should still always have enough room to avoid causing
skb_under_panic in skb_push since we have at least 16 bytes available
from LL_RESERVED_SPACE in xfrm_state_check_space.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the probing based MTU estimation, which usually takes 2-3 iterations
to find a fitting value and may underestimate the MTU, by an exact calculation.
Also fix underestimation of the XFRM trailer_len, which causes unnecessary
reallocations.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clearly state the intent of copying to linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
As IPPROTO_TCP is 6, it makes sense to make sure inet_protos[] array
is properly cache line aligned to avoid false sharing on SMP.
c0680540 b peer_total
c0680544 b inet_peer_unused_head
c0680560 B inet_protos
On i386 this example, we can see that inet_protos[IPPROTO_TCP] shares
a potentially hot (and modified) cache line.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_memory_pressure and tcp_socket currently share a cache line with tcp_memory_allocated, tcp_sockets_allocated.
(Very hot cache line)
It makes sense to declare these variables as __read_mostly, to avoid false sharing on SMP.
ffffffff8081d9c0 B tcp_orphan_count
ffffffff8081d9c4 B tcp_memory_allocated
ffffffff8081d9c8 B tcp_sockets_allocated
ffffffff8081d9cc B tcp_memory_pressure
ffffffff8081d9d0 b tcp_md5sig_users
ffffffff8081d9d8 b tcp_md5sig_pool
ffffffff8081d9e0 b warntime.31570
ffffffff8081d9e8 b tcp_socket
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The results of FIB rules lookups are cached in the routing cache
except for IPv6 as no such cache exists. So far, it was the
responsibility of the user to flush the cache after modifying any
rules. This lead to many false bug reports due to misunderstanding
of this concept.
This patch automatically flushes the route cache after inserting
or deleting a rule.
Thanks to Muli Ben-Yehuda <muli@il.ibm.com> for catching a bug
in the previous patch.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a very tiny probability that build_ehash_secret() is called
at the same time by different CPUS.
Also, using __read_mostly is a must for inet_ehash_secret
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now Xen has a horrible hack that lets it forward packets with
partial checksums. One of the reasons that CHECKSUM_PARTIAL and
CHECKSUM_COMPLETE were added is so that we can get rid of this hack
(where it creates two extra bits in the skbuff to essentially mirror
ip_summed without being destroyed by the forwarding code).
I had forgotten that I've already gone through all the deivce drivers
last time around to make sure that they're looking at ip_summed ==
CHECKSUM_PARTIAL rather than ip_summed != 0 on transmit. In any case,
I've now done that again so it should definitely be safe.
Unfortunately nobody has yet added any code to update CHECKSUM_COMPLETE
values on forward so we I'm setting that to CHECKSUM_NONE. This should
be safe to remove for bridging but I'd like to check that code path
first.
So here is the patch that lets us get rid of the hack by preserving
ip_summed (mostly) on forwarded packets.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
this is a small patch by Janusz Krzysztofik to ip_route_output_slow()
that allows VIP-less LVS linux director to generate packets
originating >From VIP if sysctl_ip_nonlocal_bind is set.
In a nutshell, the intention is for an LVS linux director to be able
to send ICMP unreachable responses to end-users when real-servers are
removed.
http://archive.linuxvirtualserver.org/html/lvs-users/2007-01/msg00106.html
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change tcp_probe to use ktime (needed to add one export).
Add option to only get events when cwnd changes - from Doug Leith
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following update received from Injong updates TCP cubic to the latest
version. I am running more complete tests and will have results after 4/1.
According to Injong: the new version improves on its scalability,
fairness and stability. So in all properties, we confirmed it shows better
performance.
NCSU results (for 2.6.18 and 2.6.20) available:
http://netsrv.csc.ncsu.edu/wiki/index.php/TCP_Testing
This version is described in a new Internet draft for CUBIC.
http://www.ietf.org/internet-drafts/draft-rhee-tcp-cubic-00.txt
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do fragmentation check in ip_forward, similar to ipv6 forwarding.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The days are gone when this was not an issue, there are folks out
there with huge bot networks that can be used to attack the
established hash tables on remote systems.
So just like the routing cache and connection tracking
hash, use Jenkins hash with random secret input.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all users of netlink_dump_start() use netlink_run_queue()
to process the receive queue, it is possible to return -EINTR from
netlink_dump_start() directly, therefore simplying the callers.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Makes use of netlink_run_queue() to process the receive queue and
converts inet_diag_rcv_msg() to use the type safe netlink interface.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use willy's work in optimizing cube root by having table for small values.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements a unified, protocol independant rules dumping function
which is capable of both, dumping a specific protocol family or
all of them. This speeds up dumping as less lookups are required.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The threshold for root node can be more aggressive set to get
better tree compression. The new setting mekes the root grow
from 16 to 19 bits and substansial improvemnt in Aver depth
this with the current table of 214393 prefixes
But really the dynamic resize should need more investigation
both in terms convergence and performance and maybe it should
be possible to change...
Maybe just for the brave to start with or we may have to back
this out.
The patch below adds break condition for the resize operations. If
we don't achieve the desired fill factor a warning is printed. Trie
should still be operational but new thresholds should be considered.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)
Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This time we have to set it to skb->tail that is not anymore equal to
skb->data, so we either add a new helper or just add the skb->tail - skb->data
offset, for now do the later.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>