This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.
On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink_broadcast users must initialize NETLINK_CB(skb).dst_groups to the
destination group mask for netlink_recvmsg.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Remove bogus code for compiling netlink as module
- Add module refcounting support for modules implementing a netlink
protocol
- Add support for autoloading modules that implement a netlink protocol
as soon as someone opens a socket for that protocol
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spotted by, and original patch by, Balazs Scheidler.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the protocol specific config options out to the specific protocols.
With this change net/Kconfig now starts to become readable and serve as a
good basis for further re-structuring.
The menu structure is left almost intact, except that indention is
fixed in most cases. Most visible are the INET changes where several
"depends on INET" are replaced with a single ifdef INET / endif pair.
Several new files were created to accomplish this change - they are
small but serve the purpose that config options are now distributed
out where they belongs.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the xfrm_state_afinfo->init_flags hook which allows
each address family to perform any common initialisation that does
not require a corresponding destructor call.
It will be used subsequently to set the XFRM_STATE_NOPMTUDISC flag
in IPv4.
It also fixes up the error codes returned by xfrm_init_state.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds xfrm_init_state which is simply a wrapper that calls
xfrm_get_type and subsequently x->type->init_state. It also gets rid
of the unused args argument.
Abstracting it out allows us to add common initialisation code, e.g.,
to set family-specific flags.
The add_time setting in xfrm_user.c was deleted because it's already
set by xfrm_state_alloc.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the format of the XFRM_MSG_DELSA and
XFRM_MSG_DELPOLICY notification so that the main message
sent is of the same format as that received by the kernel
if the original message was via netlink. This also means
that we won't lose the byid information carried in km_event.
Since this user interface is introduced by Jamal's patch
we can still afford to change it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small fixup to use netlink macros instead of hardcoding.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu wrote:
> @@ -1254,6 +1326,7 @@ static int pfkey_add(struct sock *sk, st
> if (IS_ERR(x))
> return PTR_ERR(x);
>
> + xfrm_state_hold(x);
This introduces a leak when xfrm_state_add()/xfrm_state_update()
fail. We hold two references (one from xfrm_state_alloc(), one
from xfrm_state_hold()), but only drop one. We need to take the
reference because the reference from xfrm_state_alloc() can
be dropped by __xfrm_state_delete(), so the fix is to drop both
references on error. Same problem in xfrm_user.c.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes XFRM_SAP_* and converts them over to XFRM_MSG_*.
The netlink interface is meant to map directly onto the underlying
xfrm subsystem. Therefore rather than using a new independent
representation for the events we can simply use the existing ones
from xfrm_user.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch fixes policy deletion in xfrm_user so that it sets
km_event.data.byid. This puts xfrm_user on par with what af_key
does in this case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch ensures that the hard state/policy expire notifications are
only sent when the state/policy is successfully removed from their
respective tables.
As it is, it's possible for a state/policy to both expire through
reaching a hard limit, as well as being deleted by the user.
Note that this behaviour isn't actually forbidden by RFC 2367.
However, it is a quality of implementation issue.
As an added bonus, the restructuring in this patch will help
eventually in moving the expire notifications from softirq
context into process context, thus improving their reliability.
One important side-effect from this change is that SAs reaching
their hard byte/packet limits are now deleted immediately, just
like SAs that have reached their hard time limits.
Previously they were announced immediately but only deleted after
30 seconds.
This is bad because it prevents the system from issuing an ACQUIRE
command until the existing state was deleted by the user or expires
after the time is up.
In the scenario where the expire notification was lost this introduces
a 30 second delay into the system for no good reason.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Heres the final patch.
What this patch provides
- netlink xfrm events
- ability to have events generated by netlink propagated to pfkey
and vice versa.
- fixes the acquire lets-be-happy-with-one-success issue
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
[XFRM] Call dst_check() with appropriate cookie
This fixes infinite loop issue with IPv6 tunnel mode.
Signed-off-by: Kazunori Miyazawa <kazunori@miyazawa.org>
Signed-off-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to verify that the payload contains enough data so that
attach_one_algo can copy alg_key_len bits from the payload.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable alg_key_len is in bits and not bytes. The function
attach_one_algo is currently using it as if it were in bytes.
This causes it to read memory which may not be there.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like skb_cow_data() does not set
proper owner for newly created skb.
If we have several fragments for skb and some of them
are shared(?) or cloned (like in async IPsec) there
might be a situation when we require recreating skb and
thus using skb_copy() for it.
Newly created skb has neither a destructor nor a socket
assotiated with it, which must be copied from the old skb.
As far as I can see, current code sets destructor and socket
for the first one skb only and uses truesize of the first skb
only to increment sk_wmem_alloc value.
If above "analysis" is correct then attached patch fixes that.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I found a bug that stopped IPsec/IPv6 from working. About
a month ago IPv6 started using rt6i_idev->dev on the cached socket dst
entries. If the cached socket dst entry is IPsec, then rt6i_idev will
be NULL.
Since we want to look at the rt6i_idev of the original route in this
case, the easiest fix is to store rt6i_idev in the IPsec dst entry just
as we do for a number of other IPv6 route attributes. Unfortunately
this means that we need some new code to handle the references to
rt6i_idev. That's why this patch is bigger than it would otherwise be.
I've also done the same thing for IPv4 since it is conceivable that
once these idev attributes start getting used for accounting, we
probably need to dereference them for IPv4 IPsec entries too.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we free up a partially processed packet because it's
skb->len dropped to zero, we need to decrement qlen because
we are dropping out of the top-level loop so it will do
the decrement for us.
Spotted by Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
The qlen should continue to decrement, even if we
pop partially processed SKBs back onto the receive queue.
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's recap the problem. The current asynchronous netlink kernel
message processing is vulnerable to these attacks:
1) Hit and run: Attacker sends one or more messages and then exits
before they're processed. This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.
2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time. If this is successfully exploited it could
also be used to hold rtnl forever.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
Firstly let's cross off solution c). It only solves the first
problem and it has user-visible impacts. In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).
So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.
For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.
However, it does have a number of deficiencies compared to the
stream mode solution:
1) User-space to user-space netlink communication is still vulnerable.
2) Inefficient use of resources. This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.
3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue. The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue. The
attacker then continues to call sendmsg with the same message in a loop.
Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.
In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point. In the mean time, here is a patch
that implements synchronous processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts xfrm_msg_min and xfrm_dispatch to use c99 designated
initializers to make greping a little bit easier. Also replaces
two hardcoded message type with meaningful names.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use 'daddr' instead of &tmpl->id.daddr, since the latter
might be zero. Also, only perform the lookup when
tmpl->id.spi is non-zero.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!