In the IPV6_FL_A_GET case the hash is checked for flowlabels
with the given label. If it is not found, the lock, protecting
the hash, is dropped to be re-get for writing. After this a
newly allocated entry is inserted, but no checks are performed
to catch a classical SMP race, when the conflicting label may
be inserted on another cpu.
Use the (currently unused) return value from fl_intern() to
return the conflicting entry (if found) and re-check, whether
we can reuse it (IPV6_FL_F_EXCL) or return -EEXISTS.
Also add the comment, about why not re-lookup the current
sock for conflicting flowlabel entry.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine scans the ipv6_fl_list whose update is
protected with the socket lock and the ip6_sk_fl_lock.
Since the socket lock is not taken in the lookup, use
the other one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new flowlabels should be inserted into the sock list
under the ip6_sk_fl_lock. This was lost in one place.
This list is naturally protected with the socket lock, but
the fl6_sock_lookup() is called without it, so another
protection is required.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new field to xfrm states called inner_mode. The existing
mode object is renamed to outer_mode.
This is the first part of an attempt to fix inter-family transforms. As it
is we always use the outer family when determining which mode to use. As a
result we may end up shoving IPv4 packets into netfilter6 and vice versa.
What we really want is to use the inner family for the first part of outbound
processing and the outer family for the second part. For inbound processing
we'd use the opposite pairing.
I've also added a check to prevent silly combinations such as transport mode
with inter-family transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Combining RO and AH/ESP/IPCOMP does not make sense. So this patch adds a
check in the state initialisation function to prevent this.
This allows us to safely remove the mode input function of RO since it
can never be called anymore. Indeed, if somehow it does get called we'll
know about it through an OOPS instead of it slipping past silently.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is convenient to have a pointer from xfrm_state to address-specific
functions such as the output function for a family. Currently the
address-specific policy code calls out to the xfrm state code to get
those pointers when we could get it in an easier way via the state
itself.
This patch adds an xfrm_state_afinfo to xfrm_mode (since they're
address-specific) and changes the policy code to use it. I've also
added an owner field to do reference counting on the module providing
the afinfo even though it isn't strictly necessary today since IPv6
can't be unloaded yet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently BEET mode does not reinject the packet back into the stack
like tunnel mode does. Since BEET should behave just like tunnel mode
this is incorrect.
This patch fixes this by introducing a flags field to xfrm_mode that
tells the IPsec code whether it should terminate and reinject the packet
back into the stack.
It then sets the flag for BEET and tunnel mode.
I've also added a number of missing BEET checks elsewhere where we check
whether a given mode is a tunnel or not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not every transform needs to zap ip_summed. For example, a pure tunnel
mode encapsulation does not affect the hardware checksum at all. In fact,
every algorithm (that needs this) other than AH6 already does its own
ip_summed zapping.
This patch moves the zapping into AH6 which is in line with what IPv4 does.
Possible future optimisation: Checksum the data as we copy them in IPComp.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently xfrm6_rcv_spi gets the nexthdr value itself from the packet.
This means that we need to fix up the value in case we have a 4-on-6
tunnel. Moving this logic into the caller simplifies things and allows
us to merge the code with IPv4.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that my recent patch broke 6-on-4 pure IPsec tunnels (the ones
that are only used for incompressible IPsec packets). Subsequent reviews
show that I broke 6-on-6 pure tunnels more than three years ago and nobody
ever noticed. I suppose every must be testing 6-on-6 IPComp with large
pings which are very compressible :)
This patch fixes both cases.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This functions is never called with NULL or not setup argument,
so the checks inside are redundant.
Also, the return value is always -ENOMEM, so no need in
additional variable for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we now allocate the queues in inet_fragment.c, we
can safely free it in the same place. The ->destructor
callback thus becomes optional for inet_frags.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this callback is used to check for conflicts in
hashtable when inserting a newly created frag queue, we can
do the same by checking for matching the queue with the
argument, used to create one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here we need another callback ->match to check whether the
entry found in hash matches the key passed. The key used
is the same as the creation argument for inet_frag_create.
Yet again, this ->match is the same for netfilter and ipv6.
Running a frew steps forward - this callback will later
replace the ->equal one.
Since the inet_frag_find() uses the already consolidated
inet_frag_create() remove the xxx_frag_create from protocol
codes.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one uses the xxx_frag_intern() and xxx_frag_alloc()
routines, which are already consolidated, so remove them
from protocol code (as promised).
The ->constructor callback is used to init the rest of
the frag queue and it is the same for netfilter and ipv6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just perform the kzalloc() allocation and setup common
fields in the inet_frag_queue(). Then return the result
to the caller to initialize the rest.
The inet_frag_alloc() may return NULL, so check the
return value before doing the container_of(). This looks
ugly, but the xxx_frag_alloc() will be removed soon.
The xxx_expire() timer callbacks are patches,
because the argument is now the inet_frag_queue, not
the protocol specific queue.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine checks for the existence of a given entry
in the hash table and inserts the new one if needed.
The ->equal callback is used to compare two frag_queue-s
together, but this one is temporary and will be removed
later. The netfilter code and the ipv6 one use the same
routine to compare frags.
The inet_frag_intern() always returns non-NULL pointer,
so convert the inet_frag_queue into protocol specific
one (with the container_of) without any checks.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the hash value is already calculated in xxx_find, we can
simply use it later. This is already done in netfilter code,
so make the same in ipv4 and ipv6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The icmpv6msg mib statistics is not freed.
This is almost not critical for current kernel, since ipv6
module is unloadable, but this can happen on load error and
will happen every time we stop the network namespace (when
we have one, of course).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The difference in both functions is in the "id" passed to
the rt6_select, so just pass it as an extra argument from
two outer helpers.
This is minus 60 lines of code and 360 bytes of .text
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed from the IPv6 input path,
this patch converts all occurances of sk_buff ** to sk_buff * in IPv6 input
handlers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
These ones use the generic data types too, so move
them in one place.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the evictor code is consolidated there is no need in
passing the extra pointer to the xxx_put() functions.
The only place when it made sense was the evictor code itself.
Maybe this change must got with the previous (or with the
next) patch, but I try to make them shorter as much as
possible to simplify the review (but they are still large
anyway), so this change goes in a separate patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The evictors collect some statistics for ipv4 and ipv6,
so make it return the number of evicted queues and account
them all at once in the caller.
The XXX_ADD_STATS_BH() macros are just for this case,
but maybe there are places in code, that can make use of
them as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make in possible we need to know the exact frag queue
size for inet_frags->mem management and two callbacks:
* to destoy the skb (optional, used in conntracks only)
* to free the queue itself (mandatory, but later I plan to
move the allocation and the destruction of frag_queues
into the common place, so this callback will most likely
be optional too).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This code works with the generic data types as well, so
move this into inet_fragment.c
This move makes it possible to hide the secret_timer
management and the secret_rebuild routine completely in
the inet_fragment.c
Introduce the ->hashfn() callback in inet_frags() to get
the hashfun for a given inet_frag_queue() object.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since now all the xxx_frag_kill functions now work
with the generic inet_frag_queue data type, this can
be moved into a common place.
The xxx_unlink() code is moved as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some sysctl variables are used to tune the frag queues
management and it will be useful to work with them in
a common way in the future, so move them into one
structure, moreover they are the same for all the frag
management codes.
I don't place them in the existing inet_frags object,
introduced in the previous patch for two reasons:
1. to keep them in the __read_mostly section;
2. not to export the whole inet_frags objects outside.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some objects that are common in all the places
which are used to keep track of frag queues, they are:
* hash table
* LRU list
* rw lock
* rnd number for hash function
* the number of queues
* the amount of memory occupied by queues
* secret timer
Move all this stuff into one structure (struct inet_frags)
to make it possible use them uniformly in the future. Like
with the previous patch this mostly consists of hunks like
- write_lock(&ipfrag_lock);
+ write_lock(&ip4_frags.lock);
To address the issue with exporting the number of queues and
the amount of memory occupied by queues outside the .c file
they are declared in, I introduce a couple of helpers.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the struct inet_frag_queue in include/net/inet_frag.h
file and place there all the common fields from three structs:
* struct ipq in ipv4/ip_fragment.c
* struct nf_ct_frag6_queue in nf_conntrack_reasm.c
* struct frag_queue in ipv6/reassembly.c
After this, replace these fields on appropriate structures with
this structure instance and fix the users to use correct names
i.e. hunks like
- atomic_dec(&fq->refcnt);
+ atomic_dec(&fq->q.refcnt);
(these occupy most of the patch)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Uninline netfilter okfns for those cases where gcc can generate tail-calls.
Before:
text data bss dec hex filename
8994153 1016524 524652 10535329 a0c1a1 vmlinux
After:
text data bss dec hex filename
8992761 1016524 524652 10533937 a0bc31 vmlinux
-------------------------------------------------------
-1392
All cases have been verified to generate tail-calls with and without netfilter.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Coverity checker spotted that we have already oops'ed if "dst" was
NULL.
Since "dst" being NULL doesn't seem to be possible at this point this
patch removes the NULL check.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Acked-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces unnecessary uses of skb_copy by pskb_expand_head
on the IPv6 input path.
This allows us to remove the double pointers later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the same change taht was done to ip_defrag. It
makes ipv6_frag_rcv return the last packet received of a train of fragments
rather than the head of that sequence.
This allows us to get rid of the sk_buff ** argument later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed, this patch mops up by
finally replacing all occurances of sk_buff ** in the netfilter API by
sk_buff *.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces unnecessary uses of skb_copy, pskb_copy and
skb_realloc_headroom by functions such as skb_make_writable and
pskb_expand_head.
This allows us to remove the double pointers later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all callers of netfilter can guarantee that the skb is not shared,
we no longer have to copy the skb in skb_make_writable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
From RFC 3493, Section 5.2:
IPV6_MULTICAST_IF
Set the interface to use for outgoing multicast packets. The
argument is the index of the interface to use. If the
interface index is specified as zero, the system selects the
interface (for example, by looking up the address in a routing
table and using the resulting interface).
This patch adds support for (index == 0) to reset the value to it's
original state, allowing the system to choose the best interface. IPv4
already behaves this way.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed before, this patch provides userland with a way to access
relevant options in Router Advertisements, after they are processed
and validated by the kernel. Extra options are processed in a generic
way; this patch only exports RDNSS options described in RFC5006, but
support to control which options are exported could be easily added.
A new rtnetlink message type is defined, to transport Neighbor
Discovery options, along with optional context information. At the
moment only the address of the router sending an RDNSS option is
included, but additional attributes may be later defined, if needed by
new use cases.
Signed-off-by: Pierre Ynard <linkfanel@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced
asynchronious user -> kernel communication.
The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.
Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.
This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.
Kernel -> user path in netlink_unicast remains untouched.
EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expansion of original idea from Denis V. Lunev <den@openvz.org>
Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.
The locking might seem like overkill, but there are
cases where sysadmin might want to change value in the
middle of a DoS attack.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the setting of the IP length and checksum fields out of
the transforms and into the xfrmX_output functions. This would help future
efforts in merging the transforms themselves.
It also adds an optimisation to ipcomp due to the fact that the transport
offset is guaranteed to be zero.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since
they're identical to the IPv4 versions. Duplicating them would only create
problems for ourselves later when we need to add things like extended
sequence numbers.
I've also added transport header type conversion headers for these types
which are now used by the transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 calling convention for x->mode->output is more general and could
help an eventual protocol-generic x->type->output implementation. This
patch adopts it for IPv4 as well and modifies the IPv4 type output functions
accordingly.
It also rewrites the IPv6 mac/transport header calculation to be based off
the network header where practical.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the calling convention so that on entry from
x->mode->output and before entry into x->type->output skb->data
will point to the payload instead of the IP header.
This is essentially a redistribution of skb_push/skb_pull calls
with the aim of minimising them on the common path of tunnel +
ESP.
It'll also let us use the same calling convention between IPv4
and IPv6 with the next patch.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The beet output function completely kills any extension headers by replacing
them with the IPv6 header. This is because it essentially ignores the
result of ip6_find_1stfragopt by simply acting as if there aren't any
extension headers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To judge the timing for DAD, netif_carrier_ok() is used. However,
there is a possibility that dev->qdisc stays noop_qdisc even if
netif_carrier_ok() returns true. In that case, DAD NS is not sent out.
We need to defer the IPv6 device initialization until a valid qdisc
is specified.
Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.
The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch releases the lock on the state before calling x->type->output.
It also adds the lock to the spots where they're currently needed.
Most of those places (all except mip6) are expected to disappear with
async crypto.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current the x->mode->output functions store the IPv6 nh pointer in the
skb network header. This is inconvenient because the network header then
has to be fixed up before the packet can leave the IPsec stack. The mac
header field is unused on output so we can use that to store this instead.
This patch does that and removes the network header fix-up in xfrm_output.
It also uses ipv6_hdr where appropriate in the x->type->output functions.
There is also a minor clean-up in esp4 to make it use the same code as
esp6 to help any subsequent effort to merge the two.
Lastly it kills two redundant skb_set_* statements in BEET that were
simply copied over from transport mode.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_fib.c, fib6_clean_node() casts a fib6_walker_t pointer to
a fib6_cleaner_t pointer assuming a struct fib6_walker_t (field 'w')
is the first field in struct fib6_walker_t.
To prevent any future problems that may occur if one day a field
is inadvertently inserted before the 'w' field in struct fib6_cleaner_t,
(and to improve readability), this patch uses the container_of() macro.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lastused update check in xfrm_output can be done just as well in
the mode output function which is specific to RO.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The replay counter is one of only two remaining things in the output code
that requires a lock on the xfrm state (the other being the crypto). This
patch moves it into the generic xfrm_output so we can remove the lock from
the transforms themselves.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the code in xfrm4_output_one and xfrm6_output_one are identical so
this patch moves them into a common xfrm_output function which will live
in net/xfrm.
In fact this would seem to fix a bug as on IPv4 we never reset the network
header after a transform which may upset netfilter later on.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The keys are only used during initialisation so we don't need to carry them
in esp_data. Since we don't have to allocate them again, there is no need
to place a limit on the authentication key length anymore.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The keys are only used during initialisation so we don't need to carry them
in esp_data. Since we don't have to allocate them again, there is no need
to place a limit on the authentication key length anymore.
This patch also kills the unused auth.icv member.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bunch of sparse warnings. Mostly about 0 used as
NULL pointer, and shadowed variable declarations.
One notable case was that hash size should have been unsigned.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no struct nfattr anymore, rename functions to 'nlattr'.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of the duplicated rtnetlink macros and use the generic netlink
attribute functions. The old duplicated stuff is moved to a new header
file that exists just for userspace.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap the hard_header_parse function to simplify next step of
header_ops conversion.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inline for common usage of hardware header creation, and
fix bug in IPV6 mcast where the assumption about negative return is
an errno. Negative return from hard_header means not enough space
was available,(ie -N bytes).
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes loopback_dev per network namespace. Adding
code to create a different loopback device for each network
namespace and adding the code to free a loopback device
when a network namespace exits.
This patch modifies all users the loopback_dev so they
access it as init_net.loopback_dev, keeping all of the
code compiling and working. A later pass will be needed to
update the users to use something other than the initial network
namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all occurences to the static variable
loopback_dev to a pointer loopback_dev. That provides the
mindless, trivial, uninteressting change part for the dynamic
allocation for the loopback.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-By: Kirill Korotaev <dev@sw.ru>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.
These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).
Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.
[new to 2nd revision]
6) support per-interface ICMP stats
7) use common macro for per-device stat macros
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch slightly cleanups FIB rules framework. rules_list as a pointer
on struct fib_rules_ops is useless. It is always assigned with a static
per/subsystem list in IPv4, IPv6 and DecNet.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove useless message. We get the right message from another
subsystem.
Signed-off-by: Milan Kocian <milon@wq.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's been a useless no-op for long enough in 2.6 so I figured it's time to
remove it. The number of people that could object because they're
maintaining unified 2.4 and 2.6 drivers is probably rather small.
[ Handled drivers added by netdev tree and some missed IRDA cases... -DaveM ]
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows the generic attribute interface to be used within
the netfilter subsystem where this flag was initially introduced.
The byte-order flag is yet unused, it's intended use is to
allow automatic byte order convertions for all atomic types.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each netlink socket will live in exactly one network namespace,
this includes the controlling kernel sockets.
This patch updates all of the existing netlink protocols
to only support the initial network namespace. Request
by clients in other namespaces will get -ECONREFUSED.
As they would if the kernel did not have the support for
that netlink protocol compiled in.
As each netlink protocol is updated to be multiple network
namespace safe it can register multiple kernel sockets
to acquire a presence in the rest of the network namespaces.
The implementation in af_netlink is a simple filter implementation
at hash table insertion and hash table look up time.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every user of the network device notifiers is either a protocol
stack or a pseudo device. If a protocol stack that does not have
support for multiple network namespaces receives an event for a
device that is not in the initial network namespace it quite possibly
can get confused and do the wrong thing.
To avoid problems until all of the protocol stacks are converted
this patch modifies all netdev event handlers to ignore events on
devices that are not in the initial network namespace.
As the rest of the code is made network namespace aware these
checks can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies every packet receive function
registered with dev_add_pack() to drop packets if they
are not from the initial network namespace.
This should ensure that the various network stacks do
not receive packets in a anything but the initial network
namespace until the code has been converted and is ready
for them.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch passes in the namespace a new socket should be created in
and has the socket code do the appropriate reference counting. By
virtue of this all socket create methods are touched. In addition
the socket create methods are modified so that they will fail if
you attempt to create a socket in a non-default network namespace.
Failing if we attempt to create a socket outside of the default
network namespace ensures that as we incrementally make the network stack
network namespace aware we will not export functionality that someone
has not audited and made certain is network namespace safe.
Allowing us to partially enable network namespaces before all of the
exotic protocols are supported.
Any protocol layers I have missed will fail to compile because I now
pass an extra parameter into the socket creation code.
[ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.
Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trivial patch removes the unneeded pointer iph, which is never used.
Signed-off-by: Micah Gruber <micah.gruber@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 IPsec tunnel gateway incorrectly sends redirect to
router or sender when network device the IPsec tunnelled packet
is arrived is the same as the one the decapsulated packet
is sent.
With this patch, it omits to send the redirect when the forwarding
skbuff carries secpath, since such skbuff should be assumed as
a decapsulated packet from IPsec tunnel by own.
It may be a rare case for an IPsec security gateway, however
it is not rare when the gateway is MIPv6 Home Agent since
the another tunnel end-point is Mobile Node and it changes
the attached network.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When XFRM policy and state are ready after TCP connection is started,
the traffic should be transformed immediately, however it does not
on IPv6 TCP.
It depends on a dst cache replacement policy with connected socket.
It seems that the replacement is always done for IPv4, however, on
IPv6 case it is done only when routing cookie is changed.
This patch fix that non-transformation dst can be changed to
transformation one.
This behavior is required by MIPv6 and improves IPv6 IPsec.
Fixes by Masahide NAKAMURA.
Signed-off-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add v4mapped address inline to avoid calls to ipv6_addr_type().
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ICMPv6 Target address is multicast, Linux processes the
redirect instead of dropping it. The problem is in this code in
ndisc_redirect_rcv():
if (ipv6_addr_equal(dest, target)) {
on_link = 1;
} else if (!(ipv6_addr_type(target) & IPV6_ADDR_LINKLOCAL)) {
ND_PRINTK2(KERN_WARNING
"ICMPv6 Redirect: target address is not
link-local.\n");
return;
}
This second check will succeed if the Target address is, for example,
FF02::1 because it has link-local scope. Instead, it should be checking
if it's a unicast link-local address, as stated in RFC 2461/4861 Section
8.1:
- The ICMP Target Address is either a link-local address (when
redirected to a router) or the same as the ICMP Destination
Address (when redirected to the on-link destination).
I know this doesn't explicitly say unicast link-local address, but it's
implied.
This bug is preventing Linux kernels from achieving IPv6 Logo Phase II
certification because of a recent error that was found in the TAHI test
suite - Neighbor Disovery suite test 206 (v6LC.2.3.6_G) had the
multicast address in the Destination field instead of Target field, so
we were passing the test. This won't be the case anymore.
The patch below fixes this problem, and also fixes ndisc_send_redirect()
to not send an invalid redirect with a multicast address in the Target
field. I re-ran the TAHI Neighbor Discovery section to make sure Linux
passes all 245 tests now.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a report and initial patch by Peter Lieven.
tcp4_md5sig_key and tcp6_md5sig_key need to start with
the exact same members as tcp_md5sig_key. Because they
are both cast to that type by tcp_v{4,6}_md5_do_lookup().
Unfortunately tcp{4,6}_md5sig_key use a u16 for the key
length instead of a u8, which is what tcp_md5sig_key
uses. This just so happens to work by accident on
little-endian, but on big-endian it doesn't.
Instead of casting, just place tcp_md5sig_key as the first member of
the address-family specific structures, adjust the access sites, and
kill off the ugly casts.
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 95c385 broke proper source address selection for cases in which
there is a address which is makred 'deprecated'. The commit mistakenly
changed ifa->flags to ifa_result->flags (probably copy/paste error from a
few lines above) in the 'Rule 3' address selection code.
The patch restores the previous RFC-compliant behavior.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of skbs in sk->write_queue do not have skb->dst because
we do not fill skb->dst when we allocate new skb in append_data().
BTW, I think we may not need to (or we should not) increment some stats
when using corking; if 100 sendmsg() (with MSG_MORE) result in 2 packets,
how many should we increment?
If 100, we should set skb->dst for every queued skbs.
If 1 (or 2 (*)), we increment the stats for the first queued skb and
we should just skip incrementing OutDiscards for the rest of queued skbs,
adn we should also impelement this semantics in other places;
e.g., we should increment other stats just once, not 100 times.
*: depends on the place we are discarding the datagram.
I guess should just increment by 1 (or 2).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
So I've had a deadlock reported to me. I've found that the sequence of
events goes like this:
1) process A (modprobe) runs to remove ip_tables.ko
2) process B (iptables-restore) runs and calls setsockopt on a netfilter socket,
increasing the ip_tables socket_ops use count
3) process A acquires a file lock on the file ip_tables.ko, calls remove_module
in the kernel, which in turn executes the ip_tables module cleanup routine,
which calls nf_unregister_sockopt
4) nf_unregister_sockopt, seeing that the use count is non-zero, puts the
calling process into uninterruptible sleep, expecting the process using the
socket option code to wake it up when it exits the kernel
4) the user of the socket option code (process B) in do_ipt_get_ctl, calls
ipt_find_table_lock, which in this case calls request_module to load
ip_tables_nat.ko
5) request_module forks a copy of modprobe (process C) to load the module and
blocks until modprobe exits.
6) Process C. forked by request_module process the dependencies of
ip_tables_nat.ko, of which ip_tables.ko is one.
7) Process C attempts to lock the request module and all its dependencies, it
blocks when it attempts to lock ip_tables.ko (which was previously locked in
step 3)
Theres not really any great permanent solution to this that I can see, but I've
developed a two part solution that corrects the problem
Part 1) Modifies the nf_sockopt registration code so that, instead of using a
use counter internal to the nf_sockopt_ops structure, we instead use a pointer
to the registering modules owner to do module reference counting when nf_sockopt
calls a modules set/get routine. This prevents the deadlock by preventing set 4
from happening.
Part 2) Enhances the modprobe utilty so that by default it preforms non-blocking
remove operations (the same way rmmod does), and add an option to explicity
request blocking operation. So if you select blocking operation in modprobe you
can still cause the above deadlock, but only if you explicity try (and since
root can do any old stupid thing it would like.... :) ).
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Denis V. Lunev <den@openvz.org>
addrconf_dad_failure calls addrconf_dad_stop which takes referenced address
and drops the count. So, in6_ifa_put perrformed at out: is extra. This
results in message: "Freeing alive inet6 address" and not released dst entries.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix IP[V6]_ADD_MEMBERSHIP and IP[V6]_DROP_MEMBERSHIP to
return -EPROTO for connection oriented sockets.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A similar fix to netfilter from Eric Dumazet inspired me to
look around a bit by using some grep/sed stuff as looking for
this kind of bugs seemed easy to automate. This is one of them
I found where it looks like this semicolon is not valid.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleans up duplicate includes in
net/ipv6/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discovered by Evegniy Polyakov, if we try to sendmsg after
a connection reset, we can do incredibly stupid things.
The core issue is that inet_sendmsg() tries to autobind the
socket, but we should never do that for TCP. Instead we should
just go straight into TCP's sendmsg() code which will do all
of the necessary state and pending socket error checks.
TCP's sendpage already directly vectors to tcp_sendpage(), so this
merely brings sendmsg() in line with that.
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_addr_type() doesn't check for 'Unique Local IPv6 Unicast
Addresses' (RFC4193) and returns IPV6_ADDR_RESERVED for that range.
SCTP uses this function and will fail bind() and connect() calls that
use RFC4193 addresses, SCTP will also ignore inbound connections from
RFC4193 addresses if listening on IPV6_ADDR_ANY.
There may be other users of ipv6_addr_type() that could also have
problems.
Signed-off-by: Dave Johnson <djohnson@sw.starentnetworks.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that netdev notifications can fail, we can use this to signal
errors during registration for IPv4/IPv6. In particular, if we
fail to allocate memory for the inet device, we can fail the netdev
registration.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ADVMSS value was incorrectly updated for ALL routes when the MTU
is updated because it's outside the effect of the if statement's
condition.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert rel_info to host-endian before calling ip6_tnl_err().
The things become much more straightforward that way.
The key observation (and the reason why that code actually
worked) is that after ip6_tnl_err() we either immediately
bailed out or had rel_info set to 0 or had it set to host-endian
and guaranteed to hit
(rel_type == ICMP_DEST_UNREACH && rel_code == ICMP_FRAG_NEEDED)
case. So inconsistent endianness didn't really lead to bugs,
but it had been subtle and prone to breakage. New variant is
saner and obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>