The ipv6_fl_socklist from listening socket is inadvertently shared
with new socket created for connection. This leads to a variety of
interesting, but fatal, bugs. For example, removing one of the
sockets may lead to the other socket's encountering a page fault
when the now freed list is referenced.
The fix is to not share the flow label list with the new socket.
Signed-off-by: Masayuki Nakagawa <nakagawa.msy@ncos.nec.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change tcp_mem initialization function. The fraction of total memory
is now a continuous function of memory size, and independent of page
size.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
ANK says: "It is rarely used, that's wy it was not noticed.
But in the places, where it is used, it should be disaster."
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hello, Just discussed this Patrick...
We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete
both are holding RTNL. So there shouldn't be need for this preempt stuff.
This is assumed to a leftover from an older RCU-take.
> Mhh .. I think I just remembered something - me incorrectly suggesting
> to add it there while we were talking about this at OLS :) IIRC the
> idea was to make sure tnode_free (which at that time didn't use
> call_rcu) wouldn't free memory while still in use in a rcu read-side
> critical section. It should have been synchronize_rcu of course,
> but with tnode_free using call_rcu it seems to be completely
> unnecessary. So I guess we can simply remove it.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that in xfrm_state_add we look for the larval SA in a few
places without checking for protocol match. So when using both
AH and ESP, whichever one gets added first, deletes the larval SA.
It seems AH always gets added first and ESP is always the larval
SA's protocol since the xfrm->tmpl has it first. Thus causing the
additional km_query()
Adding the check eliminates accidental double SA creation.
Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Delete the apparently superfluous source file
net/wanrouter/af_wanpipe.c.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill warning about unused variable `in_dev' when CONFIG_IP_MULTICAST
is not set.
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 484b366932 added support for the CIPSO
ranged categories tag. However, it appears that I made a mistake when rebasing
then patch to the latest upstream sources for submission and dropped the part
of the patch that actually parses the tag on incoming packets. This patch
fixes this mistake by adding the required function call to the
cipso_v4_skbuff_getattr() function.
I've run this patch over the weekend and have not noticed any problems.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
User supplied len < 0 can cause leak of kernel memory.
Use unsigned compare instead.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I came across this bug in http://bugzilla.kernel.org/show_bug.cgi?id=8155
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TX CCID needs the write_xmit_timer for delaying packet sends. Previously
this timer was only activated on active (connecting) sockets.
This patch initialises the write_xmit_timer in sync with the other timers, i.e.
the timer will be ready on any socket. This is used by applications with a
listening socket which start to stream after receiving an initiation by the
client. The write_xmit_timer is stopped when the application closes, as before.
Was tested to work and to remove the timer bug reported on dccp@vger.
Also moved timer initialisation into timer.c (static).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return negative error value (embedded in the pointer) instead of
returning NULL.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
lockdep found that dev->lock taken from softirq in ipv6_add_addr
is also taken in sctp_v6_copy_addrlist with softirqs enabled, so
lockup is possible.
Noticed-by: Simon Arlott <simon@arlott.org>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Bluetooth] Fix socket locking in hci_sock_dev_event()
hci_sock_dev_event() uses bh_lock_sock() to lock the socket lock.
This is not deadlock-safe against locking of the same socket lock in
l2cap_connect_cfm() from softirq context. In addition to that,
hci_sock_dev_event() doesn't seem to be called from softirq context,
so it is safe to use lock_sock()/release_sock() instead.
The lockdep warning can be triggered on my T42p simply by switching
the Bluetooth off by the keyboard button.
=================================
[ INFO: inconsistent lock state ]
2.6.21-rc2 #4
---------------------------------
inconsistent {in-softirq-W} -> {softirq-on-W} usage.
khubd/156 [HC0[0]:SC0[0]:HE1:SE1] takes:
(slock-AF_BLUETOOTH){-+..}, at: [<e0ca5520>] hci_sock_dev_event+0xa8/0xc5 [bluetooth]
{in-softirq-W} state was registered at:
[<c012d1db>] mark_lock+0x59/0x414
[<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
[<c012dfd7>] __lock_acquire+0x3e5/0xb99
[<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
[<c012e7f2>] lock_acquire+0x67/0x81
[<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
[<c036ee72>] _spin_lock+0x29/0x34
[<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
[<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
[<e0ca17c3>] hci_send_cmd+0x126/0x14f [bluetooth]
[<e0ca4ce4>] hci_event_packet+0x729/0xebd [bluetooth]
[<e0ca205b>] hci_rx_task+0x2a/0x20f [bluetooth]
[<e0ca209d>] hci_rx_task+0x6c/0x20f [bluetooth]
[<c012d7be>] trace_hardirqs_on+0x10d/0x14e
[<c011ac85>] tasklet_action+0x3d/0x68
[<c011abba>] __do_softirq+0x41/0x92
[<c011ac32>] do_softirq+0x27/0x3d
[<c0105134>] do_IRQ+0x7b/0x8f
[<c0103dec>] common_interrupt+0x24/0x34
[<c0103df6>] common_interrupt+0x2e/0x34
[<c0248e65>] acpi_processor_idle+0x1b3/0x34a
[<c0248e68>] acpi_processor_idle+0x1b6/0x34a
[<c010232b>] cpu_idle+0x39/0x4e
[<c04bab0c>] start_kernel+0x372/0x37a
[<c04ba42b>] unknown_bootoption+0x0/0x202
[<ffffffff>] 0xffffffff
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One change introduced by the workqueue removal patch is that adding an
interface that is up to a bridge which is also up does not ever call
br_stp_enable_port(), leaving the port in DISABLED state until we do
ifconfig down and up or link events occur.
The following patch to the br_add_if function fixes it.
This is a regression introduced in 2.6.21.
Submitted-by: Aji_Srinivas@emc.com
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we add the IPv6 device at registration time we don't need
to set IF_READY in ipv6_add_dev anymore because we will always get
a NETDEV_UP event later on should the device ever become ready.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inside pfkey_delete and xfrm_del_sa the audit hooks were not called if
there was any permission/security failures in attempting to do the del
operation (such as permission denied from security_xfrm_state_delete).
This patch moves the audit hook to the exit path such that all failures
(and successes) will actually get audited.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pfkey_spdget neither had an LSM security hook nor auditing for the
removal of xfrm_policy structs. The security hook was added when it was
moved into xfrm_policy_byid instead of the callers to that function by
my earlier patch and this patch adds the auditing hooks as well.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The security hooks to check permissions to remove an xfrm_policy were
actually done after the policy was removed. Since the unlinking and
deletion are done in xfrm_policy_by* functions this moves the hooks
inside those 2 functions. There we have all the information needed to
do the security check and it can be done before the deletion. Since
auditing requires the result of that security check err has to be passed
back and forth from the xfrm_policy_by* functions.
This patch also fixes a bug where a deletion that failed the security
check could cause improper accounting on the xfrm_policy
(xfrm_get_policy didn't have a put on the exit path for the hold taken
by xfrm_policy_by*)
It also fixes the return code when no policy is found in
xfrm_add_pol_expire. In old code (at least back in the 2.6.18 days) err
wasn't used before the return when no policy is found and so the
initialization would cause err to be ENOENT. But since err has since
been used above when we don't get a policy back from the xfrm_policy_by*
function we would always return 0 instead of the intended ENOENT. Also
fixed some white space damage in the same area.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts an earlier patch which disabled bidirectional mode, meaning that
a listening (passive) socket was not allowed to write to the other (active)
end of the connection.
This mode had been disabled when there were problems with CCID3, but it
imposes a constraint on socket programming and thus hinders deployment.
A change is included to ignore RX feedback received by the TX CCID3 module.
Many thanks to Andre Noll for pointing out this issue.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
physoutdev is only set on purely bridged packet, when nfnetlink_log is used
in the OUTPUT/FORWARD/POSTROUTING hooks on packets forwarded from or to a
bridge it crashes when trying to dereference skb->nf_bridge->physoutdev.
Reported by Holger Eitzenberger <heitzenberger@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace expects a zero-terminated string, so include the trailing
zero in the netlink message.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The individual fragments of a packet reassembled by conntrack have the
conntrack reference from the reassembled packet attached, but nfctinfo
is not copied. This leaves it initialized to 0, which unfortunately is
the value of IP_CT_ESTABLISHED.
The result is that all IPv6 fragments are tracked as ESTABLISHED,
allowing them to bypass a usual ruleset which accepts ESTABLISHED
packets early.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
sis900 warning fixes
mv643xx_eth: Place explicit port number in mv643xx_eth_platform_data
pcnet32: Fix PCnet32 performance bug on non-coherent architecutres
__devinit & __devexit cleanups for de2104x driver
3c59x: Handle pci_enable_device() failure while resuming
dmfe: Fix link detection
dmfe: fix two bugs
dmfe: trivial/spelling fixes
revert "drivers/net/tulip/dmfe: support basic carrier detection"
ucc_geth: returns NETDEV_TX_BUSY when BD ring is full
ucc_geth: Fix BD processing
natsemi: netpoll fixes
bonding: Improve IGMP join processing
bonding: only receive ARPs for us
bonding: fix double dev_add_pack
This mirrors a recent change in tcp_open_req_child, whereby the icsk_rto of the
newly created child socket was not set (but rather on the parent socket). Same
fix for DCCP.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug caused by a previous patch, which causes DCCP servers in
LISTEN state to not receive packets.
This patch changes the logic so that
* servers in either LISTEN or OPEN state get the RX half connection packets
* clients in OPEN state get the TX half connection packets
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a typo in compat_sock_common_getsockopt.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts two changes:
8488df894d248f06726e
A backlog value of N really does mean allow "N + 1" connections
to queue to a listening socket. This allows one to specify
"0" as the backlog and still get 1 connection.
Noticed by Gerrit Renker and Rick Jones.
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a module param "pool_mode" for sunrpc.ko which allows a sysadmin to
choose the mode for mapping NFS thread service pools to CPUs. Values are:
auto choose a mapping mode heuristically
global (default, same as the pre-2.6.19 code) a single global pool
percpu one pool per CPU
pernode one pool per NUMA node
Note that since 2.6.19 the hardcoded behaviour has been "auto", this patch
makes the default "global".
The pool mode can be changed after boot/modprobe using /sys, if the NFS and
lockd services have been shut down. A useful side effect of this change is to
fix a small memory leak when unloading the module.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the last thread of nfsd exits, it shuts down all related sockets. It
currently uses svc_close_socket to do this, but that only is immediately
effective if the socket is not SK_BUSY.
If the socket is busy - i.e. if a request has arrived that has not yet been
processes - svc_close_socket is not effective and the shutdown process spins.
So create a new svc_force_close_socket which removes the SK_BUSY flag is set
and then calls svc_close_socket.
Also change some open-codes loops in svc_destroy to use
list_for_each_entry_safe.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They don't really save that much, and aren't worth the hassle.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sunrpc server code needs to know the source and destination address for
UDP packets so it can reply properly. It currently copies code out of the
network stack to pick the pieces out of the skb. This is ugly and causes
compile problems with the IPv6 stuff.
So, rip that out and use recv_msg instead. This is a much cleaner interface,
but has a slight cost in that the checksum is now checked before the copy, so
we don't benefit from doing both at the same time. This can probably be
fixed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In active-backup mode, the current bonding code duplicates IGMP
traffic to all slaves, so that switches are up to date in case of a
failover from an active to a backup interface. If bonding then fails
back to the original active interface, it is likely that the "active
slave" switch's IGMP forwarding for the port will be out of date until
some event occurs to refresh the switch (e.g., a membership query).
This patch alters the behavior of bonding to no longer flood
IGMP to all ports, and to issue IGMP JOINs to the newly active port at
the time of a failover. This insures that switches are kept up to date
for all cases.
"GOELLESCH Niels" <niels.goellesch@eurocontrol.int> originally
reported this problem, and included a patch. His original patch was
modified by Jay Vosburgh to additionally remove the existing IGMP flood
behavior, use RCU, streamline code paths, fix trailing white space, and
adjust for style.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix reference counting (memory leak) problem in __nfulnl_send() and callers
related to packet queueing.
Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Count module references correctly: after instance_destroy() there
might be timer pending and holding a reference for this netlink instance.
Based on patch by Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate possible NULL pointer dereference in nfulnl_recv_config().
Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paranoia: instance_put() might have freed the inst pointer when we
spin_unlock_bh().
Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop reference leaking in nfulnl_log_packet(). If we start a timer we
are already taking another reference.
Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some stacks apparently send packets with SYN|URG set. Linux accepts
these packets, so TCP conntrack should to.
Pointed out by Martijn Posthuma <posthuma@sangine.com>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_conntrack_netlink config option is named CONFIG_NF_CT_NETLINK,
but multiple files use CONFIG_IP_NF_CONNTRACK_NETLINK or
CONFIG_NF_CONNTRACK_NETLINK for ifdefs.
Fix this and reformat all CONFIG_NF_CT_NETLINK ifdefs to only use a line.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix {nf,ip}_ct_iterate_cleanup unconfirmed list handling:
- unconfirmed entries can not be killed manually, they are removed on
confirmation or final destruction of the conntrack entry, which means
we might iterate forever without making forward progress.
This can happen in combination with the conntrack event cache, which
holds a reference to the conntrack entry, which is only released when
the packet makes it all the way through the stack or a different
packet is handled.
- taking references to an unconfirmed entry and using it outside the
locked section doesn't work, the list entries are not refcounted and
another CPU might already be waiting to destroy the entry
What the code really wants to do is make sure the references of the hash
table to the selected conntrack entries are released, so they will be
destroyed once all references from skbs and the event cache are dropped.
Since unconfirmed entries haven't even entered the hash yet, simply mark
them as dying and skip confirmation based on that.
Reported and tested by Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the vlan_group struct into a multi-allocated struct. On
x86_64, the size of the original struct is a little more than 32KB, causing
a 4-order allocation, which is prune to problems caused by buddy-system
external fragmentation conditions.
I couldn't just use vmalloc() because vfree() cannot be called in the
softirq context of the RCU callback.
Signed-off-by: Dan Aloni <da-x@monatomic.org>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current CIPSO engine has a problem where it does not verify that
the given sensitivity level has a valid CIPSO mapping when the "std"
CIPSO DOI type is used. The end result is that bad packets are sent
on the wire which should have never been sent in the first place.
This patch corrects this problem by verifying the sensitivity level
mapping similar to what is done with the category mapping. This patch
also changes the returned error code in this case to -EPERM to better
match what the category mapping verification code returns.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>