Commit Graph

4082 Commits

Author SHA1 Message Date
David S. Miller 452edd598f xfrm: Return dst directly from xfrm_lookup()
Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02 13:27:41 -08:00
Herbert Xu 07df5294a7 inet: Replace left-over references to inet->cork
The patch to replace inet->cork with cork left out two spots in
__ip_append_data that can result in bogus packet construction.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 23:00:58 -08:00
David S. Miller f6d460cf0e ipv4: Make icmp route lookup code a bit clearer.
The route lookup code in icmp_send() is slightly tricky as a result of
having to handle all of the requirements of RFC 4301 host relookups.

Pull the route resolution into a separate function, so that the error
handling and route reference counting is hopefully easier to see and
contained wholly within this new routine.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 15:49:55 -08:00
David S. Miller 2774c131b1 xfrm: Handle blackhole route creation via afinfo.
That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:59:04 -08:00
David S. Miller 80c0bc9e37 xfrm: Kill XFRM_LOOKUP_WAIT flag.
This can be determined from the flow flags instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:36:37 -08:00
David S. Miller 273447b352 ipv4: Kill can_sleep arg to ip_route_output_flow()
This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:27:04 -08:00
David S. Miller 5df65e5567 net: Add FLOWI_FLAG_CAN_SLEEP.
And set it in contexts where the route resolution can sleep.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:22:19 -08:00
David S. Miller 420d44daa7 ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"
Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:19:23 -08:00
David S. Miller abdf7e7239 ipv4: Change final ip_route_connect() arg to boolean "can_sleep".
Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:15:24 -08:00
Herbert Xu 903ab86d19 udp: Add lockless transmit path
The UDP transmit path has been running under the socket lock
for a long time because of the corking feature.  This means that
transmitting to the same socket in multiple threads does not
scale at all.

However, as most users don't actually use corking, the locking
can be removed in the common case.

This patch creates a lockless fast path where corking is not used.

Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits.  In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.

As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:42 -08:00
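The send buffer overshoot described in the message above can be illustrated with a small Python model. This is only a sketch under an assumed check-then-charge ordering (every CPU tests the limit against the same snapshot before any of them accounts for its packet); the function name and parameters are hypothetical and are not kernel code.

```python
def simulate_racy_send(limit, queued, num_cpus, packet_size):
    """Model num_cpus senders racing on one socket without a lock.

    Assumption: each CPU checks `queued < limit` against the same
    snapshot, and only afterwards charges its packet to the socket.
    """
    # Phase 1: every CPU performs the limit check on the shared snapshot.
    admitted = [queued < limit for _ in range(num_cpus)]
    # Phase 2: each admitted CPU charges one packet.
    return queued + packet_size * sum(admitted)


def overshoot_bound(limit, num_cpus, packet_size):
    """Upper bound quoted in the commit message."""
    return limit + num_cpus * packet_size
```

With a limit of 212992 bytes, 212991 bytes already queued, 4 CPUs and 1500-byte packets, the model ends at 218991 bytes, which stays within the quoted bound of limit + (number of CPUs) * (packet size).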
Herbert Xu f6b9664f8b udp: Switch to ip_finish_skb
This patch converts UDP to use the new ip_finish_skb API.  This
would then allow us to more easily use ip_make_skb which allows
UDP to run without a socket lock.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:03 -08:00
Herbert Xu 1c32c5ad6f inet: Add ip_make_skb and ip_finish_skb
This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced.  The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used, and should
have a one-to-one correspondence to sendmsg.

This patch also adds the helper ip_finish_skb which is meant to
replace ip_push_pending_frames when corking is required.
Previously the protocol stack would peek at the socket write
queue and add its header to the first packet.  With ip_finish_skb,
the protocol stack can directly operate on the final skb instead,
just like the non-corking case with ip_make_skb.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:03 -08:00
Herbert Xu 1470ddf7f8 inet: Remove explicit write references to sk/inet in ip_append_data
In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than those that are designed to allow that such as atomic counters).

This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:02 -08:00
Herbert Xu 5a2ef92023 inet: Remove unused sk_sndmsg_* from UFO
UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless.  It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:02 -08:00
David S. Miller dca8b089c9 ipv4: Rearrange how ip_route_newports() gets port keys.
ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
   to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
   from the routing cache entry's flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 13:38:12 -08:00
David S. Miller 5e6b930f21 xfrm: Const'ify address arguments to ->dst_lookup()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 23:07:38 -08:00
David S. Miller 19bd62441c xfrm: Const'ify tmpl and address arguments to ->init_temprop()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 23:07:37 -08:00
David S. Miller 73e5ebb20f xfrm: Mark flowi arg to ->init_tempsel() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:51:44 -08:00
David S. Miller 0c7b3eefb4 xfrm: Mark flowi arg to ->fill_dst() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:48:57 -08:00
David S. Miller 05d8402576 xfrm: Mark flowi arg to ->get_tos() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:47:10 -08:00
Shan Wei 089c34827e tcp: Remove debug macro of TCP_CHECK_TIMER
TCP_CHECK_TIMER is no longer used for debugging; it does nothing.
It has been that way for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:10:14 -08:00
David S. Miller da935c66ba Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/e1000e/netdev.c
	net/xfrm/xfrm_policy.c
2011-02-19 19:17:35 -08:00
Eric Dumazet 91035f0b7d tcp: fix inet_twsk_deschedule()
Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()

This is caused by inet_twsk_purge(), run from process context,
and commit 575f4cd5a5 (net: Use rcu lookups in inet_twsk_purge.)
removed the BH disabling that was necessary.

Add the BH disabling but fine grained, right before calling
inet_twsk_deschedule(), instead of whole function.

With help from Linus Torvalds and Eric W. Biederman

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Pavel Emelyanov <xemul@openvz.org>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: stable <stable@kernel.org> (# 2.6.33+)
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-19 18:59:04 -08:00
David S. Miller 9435eb1cf0 ipv4: Implement __ip_dev_find using new interface address hash.
Much quicker than going through the FIB tables.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 12:43:09 -08:00
David S. Miller fd23c3b311 ipv4: Add hash table of interface addresses.
This will be used to optimize __ip_dev_find() and friends.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 12:42:28 -08:00
Eric Dumazet 214f45c91b net: provide default_advmss() methods to blackhole dst_ops
Commit 0dbaee3b37 (net: Abstract default ADVMSS behind an
accessor.) introduced a possible crash in tcp_connect_init(), when
dst->default_advmss() is called from dst_metric_advmss()

Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:39:01 -08:00
David S. Miller 982721f391 ipv4: Use const'ify fib_result deep in the route call chains.
The only troublesome bit here is __mkroute_output which wants
to override res->fi and res->type, compute those in local
variables instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:54:42 -08:00
David S. Miller 3b004569d8 ipv4: Avoid use of signed integers in fib_trie code.
GCC emits all kinds of crazy zero extensions when we go from signed
int, to unsigned short, etc. etc.

This transformation has to be legal because:

1) In tkey_extract_bits() in mask_pfx(), the values are used to
   perform shifts, on which negative values are undefined by C.

2) In fib_table_lookup() we perform comparisons with unsigned
   values, constants, and additions.  None of which should
   encounter negative values.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:49:26 -08:00
David S. Miller 3c7bd1a140 net: Add initial_ref arg to dst_alloc().
This allows avoiding multiple writes to the initial __refcnt.

The simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted; the rest have been left
alone and kept at the existing "0".

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:44:00 -08:00
David S. Miller 0c4dcd58fd ipv4: Consolidate ipv4 dst allocation logic.
This also allows us to combine all the dst->flags settings and avoid
read/modify/write sequences to this struct member.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:42:37 -08:00
David S. Miller 010c2708e5 ipv4: Move rcu_read_{lock,unlock}() into ip_route_output_slow().
Simplifies tail of __ip_route_output_key().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:37:09 -08:00
David S. Miller 5ada552746 ipv4: Simplify output route creation call sequence.
There's a lot of redundancy and unnecessary stack frames
in the output route creation path.

1) Make __mkroute_output() return error pointers.

2) Eliminate ip_mkroute_output() entirely, made possible by #1.

3) Call __mkroute_output() directly and handle the returned error
   pointers in ip_route_output_slow().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:29:00 -08:00
David S. Miller f39925dbde ipv4: Cache learned redirect information in inetpeer.
Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.

Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-14 21:33:27 -08:00
David S. Miller 2c8cec5c10 ipv4: Cache learned PMTU information in inetpeer.
The general idea is that if we learn new PMTU information, we
bump the peer genid.

This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.

Learned PMTU information self-expires.

This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.

As a consequence:

1) When the path appears unreachable (dst_ops->link_failure
   or dst_ops->negative_advice) we unwind the PMTU state if
   it is out of date, instead of killing the cached route.

   A redirected route will still be invalidated in these
   situations.

2) rt_check_expire(), rt_worker_func(), et al. are no longer
   necessary at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-14 21:33:07 -08:00
Ian Campbell d11327ad66 arp_notify: unconditionally send gratuitous ARP for NETDEV_NOTIFY_PEERS.
NETDEV_NOTIFY_PEERS is an explicit request by the driver to send a link
notification while NETDEV_UP/NETDEV_CHANGEADDR generate link
notifications as a sort of side effect.

In the latter cases the sysctl option is present because link
notification events can have undesired effects e.g. if the link is
flapping. I don't think this applies in the case of an explicit
request from a driver.

This patch makes NETDEV_NOTIFY_PEERS unconditional; if preferred we
could add a new sysctl for this case which defaults to on.

This change causes Xen post-migration ARP notifications (which cause
switches to relearn their MAC tables etc) to be sent by default.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-14 17:47:15 -08:00
Eric Dumazet 31d409373c ipv4: fix rcu lock imbalance in fib_select_default()
Commit 0c838ff1ad (ipv4: Consolidate all default route selection
implementations.) forgot to remove one rcu_read_unlock() from
fib_select_default().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-14 11:23:04 -08:00
Steffen Klassert 946bf5ee3c ip_gre: Add IPPROTO_GRE to flowi in ipgre_tunnel_xmit
Commit 5811662b15 ("net: use the macros
defined for the members of flowi") accidentally removed the setting of
IPPROTO_GRE from the struct flowi in ipgre_tunnel_xmit. This patch
restores it.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-11 11:23:12 -08:00
David S. Miller 6431cbc25f inet: Create a mechanism for upward inetpeer propagation into routes.
If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation of a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids; that comes in
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:33:41 -08:00
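The generation ID scheme described in the message above can be sketched as a small Python model. It only mirrors the steps named in the text (remember the genid at route creation, bump it on force-create, re-attach in the check callback); the class and method names are hypothetical and do not correspond to kernel structures or functions.

```python
class PeerCache:
    """Toy stand-in for the inetpeer cache plus its global generation ID."""

    def __init__(self):
        self.genid = 0
        self.peers = {}

    def lookup(self, daddr):
        # Returns the peer entry for daddr, or None if none exists.
        return self.peers.get(daddr)

    def force_create(self, daddr, info):
        # New path information arrived: create or update the entry
        # and bump the generation ID so cached routes re-check.
        entry = self.peers.setdefault(daddr, {})
        entry.update(info)
        self.genid += 1
        return entry


class CachedRoute:
    """Toy cached route that remembers the genid seen at attachment time."""

    def __init__(self, daddr, cache):
        self.daddr = daddr
        self.peer = cache.lookup(daddr)
        self.genid = cache.genid

    def check(self, cache):
        # Models the role of dst_ops->check(): on a genid mismatch,
        # attach a peer if the route has none, then resync the genid.
        if self.genid != cache.genid:
            if self.peer is None:
                self.peer = cache.lookup(self.daddr)
            self.genid = cache.genid
        return self.peer
```

In this model a route created before any peer entry exists starts with no peer; after a later `force_create()` for the same address, the next `check()` call attaches the new entry and brings the route's genid back in line with the cache.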
David S. Miller ddd4aa424b inetpeer: Add redirect and PMTU discovery cached info.
Validity of the cached PMTU information is indicated by its
expiration value being non-zero, just as per dst->expires.

The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.

In this way PMTU expiration does not kill off the cached route as is
done currently.

Redirect information is permanent, or at least until another redirect
is received.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:29:30 -08:00
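The save-and-restore scheme for learned PMTU values described above can be illustrated with a short Python model. It is a sketch only: the class, its fields, and the integer clock are assumptions made for illustration, with a zero expiration meaning that no learned PMTU is cached, as the message states.

```python
class PmtuState:
    """Toy model of a cached PMTU that self-expires (times are positive ints)."""

    def __init__(self, mtu):
        self.mtu = mtu        # value currently in effect
        self.orig_mtu = 0     # pre-ICMP value, saved while a learned PMTU is live
        self.expires = 0      # 0 means no learned PMTU is cached

    def learn(self, new_mtu, now, lifetime):
        # Remember the pre-ICMP value only the first time, so repeated
        # updates do not overwrite it with an already-learned value.
        if self.expires == 0:
            self.orig_mtu = self.mtu
        self.mtu = new_mtu
        self.expires = now + lifetime

    def current(self, now):
        # At expiration, restore the saved value instead of
        # discarding the whole cached route.
        if self.expires != 0 and now >= self.expires:
            self.mtu = self.orig_mtu
            self.orig_mtu = 0
            self.expires = 0
        return self.mtu
```

For example, a state created with an MTU of 1500 that learns 1400 at time 100 with a lifetime of 600 reports 1400 until time 700, then reverts to 1500 and clears its expiration.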
David S. Miller 7a71ed899e inetpeer: Abstract address representation further.
Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in every address we store is entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:22:28 -08:00
David S. Miller 8d13a2a9fb net: Kill NETEVENT_PMTU_UPDATE.
Nobody actually does anything in response to the event,
so just kill it off.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 16:17:55 -08:00
Nicolas Dichtel fa9921e46f ipsec: allow to align IPv4 AH on 32 bits
The Linux IPv4 AH stack aligns the AH header on a 64 bit boundary
(like in IPv6). This is not RFC compliant (see RFC4302, Section
3.3.3.2.1), it should be aligned on 32 bits.

For most of the authentication algorithms, the ICV size is 96 bits.
The AH header alignment on 32 or 64 bits gives the same results.

However for SHA-256-128 for instance, the wrong 64 bit alignment results
in adding useless padding in IPv4 AH, which is forbidden by the RFC.

To avoid breaking backward compatibility, we use a new flag
(XFRM_STATE_ALIGN4) to change original behavior.

Initial patch from Dang Hongwu <hongwu.dang@6wind.com> and
Christophe Gouault <christophe.gouault@6wind.com>.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 14:00:40 -08:00
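The effect of 32-bit versus 64-bit alignment on the AH header length can be worked through with a few lines of Python. This is an illustrative calculation, not the kernel's code: the helper names are hypothetical, and the 12-byte fixed part is the sum of the Next Header, Payload Len, Reserved, SPI and Sequence Number fields of the AH header.

```python
AH_FIXED_BYTES = 12  # Next Header (1) + Payload Len (1) + Reserved (2) + SPI (4) + Sequence Number (4)


def align_up(length, boundary):
    """Round length up to a multiple of boundary (boundary is a power of two)."""
    return (length + boundary - 1) & ~(boundary - 1)


def ah_header_len(icv_bytes, boundary):
    """Total AH header length in bytes for a given ICV size and alignment."""
    return align_up(AH_FIXED_BYTES + icv_bytes, boundary)


def ah_padding(icv_bytes, boundary):
    """Padding bytes added purely to satisfy the alignment."""
    return ah_header_len(icv_bytes, boundary) - (AH_FIXED_BYTES + icv_bytes)
```

A 96-bit ICV (12 bytes) gives a 24-byte header under either alignment, so the two agree. A 128-bit ICV (16 bytes, as with SHA-256-128) gives 28 bytes with 32-bit alignment but 32 bytes with 64-bit alignment, which is the 4 bytes of unnecessary padding the message refers to.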
David S. Miller 92d8682926 inetpeer: Move ICMP rate limiting state into inet_peer entries.
Like metrics, the ICMP rate limiting bits are cached state about
a destination.  So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-04 15:59:53 -08:00
David S. Miller 0131ba451e ipv4: Don't miss existing cached metrics in new routes.
Always lookup to see if we have an existing inetpeer entry for
a route.  Let FLOWI_FLAG_PRECOW_METRICS merely influence the
"create" argument to rt_bind_peer().

Also, call rt_bind_peer() unconditionally since it is not
possible for rt->peer to be non-NULL at this point.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-04 14:37:30 -08:00
David S. Miller bd4a6974cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2011-02-04 14:28:58 -08:00
David S. Miller ca6b8bb097 net: Support compat SIOCGETVIFCNT ioctl in ipv4.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-03 17:24:28 -08:00
David S. Miller 0033d5ad27 net: Fix bug in compat SIOCGETSGCNT handling.
Commit 709b46e8d9 ("net: Add compat
ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT") added the
correct plumbing to handle SIOCGETSGCNT properly.

However, whilst defining a proper "struct compat_sioc_sg_req" it
isn't actually used in ipmr_compat_ioctl().

Correct this oversight.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-03 17:21:31 -08:00
David S. Miller b299e4f001 ipv4: Fix fib_trie build in some configurations.
If we end up including include/linux/node.h (either explicitly
or implicitly) that header has a definition of "struct node"
too.

So rename the one we use in fib_trie to "rt_trie_node" to avoid
the conflict.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 20:48:47 -08:00
David S. Miller 442b9635c5 tcp: Increase the initial congestion window to 10.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Nandita Dukkipati <nanditad@google.com>
2011-02-02 20:48:47 -08:00
David S. Miller 0bc0be7f20 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6
2011-02-02 15:52:23 -08:00