linux

Commit Graph

Author	SHA1	Message	Date
Eric Dumazet	d485d500cf	netfilter: tproxy: nf_tproxy_assign_sock() can handle tw sockets transparent field of a socket is either inet_twsk(sk)->tw_transparent for timewait sockets, or inet_sk(sk)->transparent for other sockets (TCP/UDP). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-22 13:13:31 -07:00
David S. Miller	a0741ca949	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6	2010-09-21 18:17:19 -07:00
Sjur Braendeland	9e2e8f14d4	caif: Use default send and receive buffer size in caif_socket. CAIF sockets should use socket's default send and receive buffers sizes. Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:05:31 -07:00
Sjur Braendeland	e5e03ce1e5	caif: Fix function NULL pointer check. Check that receive function pointer is not null before calling it. Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:05:31 -07:00
Sjur Braendeland	b04367df66	caif: Minor fixes in log prints. Use pr_debug for flow control printouts, and refine an error printout. Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:05:30 -07:00
Sjur Braendeland	9c44c9fa78	caif: Remove buggy re-definition of pr_debug Remove debugging quirk redefining pr_debug to pr_warning. Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:05:30 -07:00
Eric Dumazet	756e64a0b1	net: constify some ppp/pptp structs Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:04:47 -07:00
Andy Shevchenko	82fd5b5d1e	net: core: use kernel's converter from hex to bin Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:04:45 -07:00
David S. Miller	73da16c28e	ethtool: Fix build due to lack of ethtool.h include. net/core/ethtool.c: In function 'ethtool_get_regs': net/core/ethtool.c:818:2: error: implicit declaration of function 'vmalloc' net/core/ethtool.c:818:9: warning: assignment makes pointer from integer without a cast net/core/ethtool.c:833:2: error: implicit declaration of function 'vfree' Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 16:12:11 -07:00
David S. Miller	98e684bd5c	Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/net-next-2.6	2010-09-21 16:00:40 -07:00
Eric Dumazet	3d13008e73	ip: fix truesize mismatch in ip fragmentation Special care should be taken when slow path is hit in ip_fragment() : When walking through frags, we transfert truesize ownership from skb to frags. Then if we hit a slow_path condition, we must undo this or risk uncharging frags->truesize twice, and in the end, having negative socket sk_wmem_alloc counter, or even freeing socket sooner than expected. Many thanks to Nick Bowler, who provided a very clean bug report and test program. Thanks to Jarek for reviewing my first patch and providing a V2 While Nick bisection pointed to commit `2b85a34e91` (net: No more expensive sock_hold()/sock_put() on each tx), underlying bug is older (2.6.12-rc5) A side effect is to extend work done in commit `b2722b1c3a` (ip_fragment: also adjust skb->truesize for packets not owned by a socket) to ipv6 as well. Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com> Tested-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 15:05:50 -07:00
Ben Hutchings	a77f5db361	ethtool: Allocate register dump buffer with vmalloc() Some NICs have huge register files which exceed the maximum heap allocation size. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 14:57:59 -07:00
John W. Linville	b618f6f885	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem Conflicts: arch/arm/mach-omap2/board-omap3pandora.c drivers/net/wireless/ath/ath5k/base.c	2010-09-21 15:49:14 -04:00
David S. Miller	2d813760d7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2010-09-21 12:26:07 -07:00
Gerrit Renker	536bb20b45	dccp ccid-3: Remove redundant 'options_received' struct The `options_received' struct is redundant, since it re-duplicates the existing `p' and `x_recv' fields. This patch removes the sub-struct and migrates the format conversion operations to ccid3_hc_tx_parse_options(). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-21 12:14:26 +02:00
Gerrit Renker	792e6d3389	dccp tfrc/ccid-3: computing the loss rate from the Loss Event Rate This adds a function to take care of the following, separate cases occurring in the computation of the Loss Rate p: * 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5; * 1/0 is mapped into 100%, the maximum; * to avoid that p = 1/x is rounded down to 0 when x is very large, since this means accidentally re-entering slow-start indicated by p == 0, the minimum resolution value of p is now returned instead; * a bug in ccid3_hc_rx_getsockopt is fixed: 1/0 was mapped into ~0U. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-21 12:14:26 +02:00
Gerrit Renker	80763dfbac	dccp ccid-3: remove dead states This patch is thanks to an investigation by Leandro Sales de Melo and his colleagues. They worked out two state diagrams which highlight the fact that the xxx_TERM states in CCID-3/4 are in fact not necessary. And this can be confirmed by in turn looking at the code: the xxx_TERM states are only ever set in ccid3_hc_{rx,tx}_exit(): when CCID-3 sets the state to xxx_TERM, it is at a time where no more processing should be going on, hence it is not necessary to introduce a dedicated exit state - this is already implied by unloading the CCID. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-21 12:14:26 +02:00
Gerrit Renker	a18213d1d2	dccp: Replace magic CCID-specific numbers by symbolic constants The constants DCCPO_{MIN,MAX}_CCID_SPECIFIC are nowhere used in the code, but instead for the CCID-specific options numbers are used. This patch unifies the use of CCID-specific option numbers, by adding symbolic names reflecting the definitions in RFC 4340, 10.3. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-21 12:14:25 +02:00
Gerrit Renker	4874c131d7	dccp: Add packet type information to CCID-specific option parsing This 1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state which options may (not) appear on what packet type. 2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets") and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets"). 3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This is also no longer necessary, since the CCID-specific option-parsing routines are passed every single parameter of the type-length-value option encoding. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-21 12:14:25 +02:00
Tom Marshall	a4d258036e	tcp: Fix race in tcp_poll If a RST comes in immediately after checking sk->sk_err, tcp_poll will return POLLIN but not POLLOUT. Fix this by checking sk->sk_err at the end of tcp_poll. Additionally, ensure the correct order of operations on SMP machines with memory barriers. Signed-off-by: Tom Marshall <tdm.code@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-20 15:42:05 -07:00
David S. Miller	9828e6e6e3	rose: Fix signedness issues wrt. digi count. Just use explicit casts, since we really can't change the types of structures exported to userspace which have been around for 15 years or so. Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-20 15:40:35 -07:00
Thomas Egerer	8444cf712c	xfrm: Allow different selector family in temporary state The family parameter xfrm_state_find is used to find a state matching a certain policy. This value is set to the template's family (encap_family) right before xfrm_state_find is called. The family parameter is however also used to construct a temporary state in xfrm_state_find itself which is wrong for inter-family scenarios because it produces a selector for the wrong family. Since this selector is included in the xfrm_user_acquire structure, user space programs misinterpret IPv6 addresses as IPv4 and vice versa. This patch splits up the original init_tempsel function into a part that initializes the selector respectively the props and id of the temporary state, to allow for differing ip address families whithin the state. Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-20 11:11:38 -07:00
Johannes Berg	df6d02300f	wext: fix potential private ioctl memory content leak When a driver doesn't fill the entire buffer, old heap contents may remain, and if it also doesn't update the length properly, this old heap content will be copied back to userspace. It is very unlikely that this happens in any of the drivers using private ioctls since it would show up as junk being reported by iwpriv, but it seems better to be safe here, so use kzalloc. Reported-by: Jeff Mahoney <jeffm@suse.com> Cc: stable@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-20 13:41:40 -04:00
Eric Dumazet	8990f468ae	net: rx_dropped accounting Under load, netif_rx() can drop incoming packets but administrators dont have a chance to spot which device needs some tuning (RPS activation for example) This patch adds rx_dropped accounting in vlans and tunnels. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-20 10:08:58 -07:00
Eric Dumazet	842c74bffc	ip_gre: CONFIG_IPV6_MODULE support ipv6 can be a module, we should test CONFIG_IPV6 and CONFIG_IPV6_MODULE to enable ipv6 bits in ip_gre. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-20 10:06:12 -07:00
Bandan Das	462fb2af97	bridge : Sanitize skb before it enters the IP stack Related dicussion here : http://lkml.org/lkml/2010/9/3/16 Introduce a function br_parse_ip_options that will audit the skb and possibly refill IP options before a packet enters the IP stack. If no options are present, the function will zero out the skb cb area so that it is not misinterpreted as options by some unsuspecting IP layer routine. If packet consistency fails, drop it. Signed-off-by: Bandan Das <bandan.das@stratus.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 12:42:34 -07:00
Dan Carpenter	aef3ea33e8	rds: spin_lock_irq() is not nestable This is basically just a cleanup. IRQs were disabled on the previous line so we don't need to do it again here. In the current code IRQs would get turned on one line earlier than intended. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:44 -07:00
Dan Carpenter	f4fa7f3807	rds: double unlock in rds_ib_cm_handle_connect() We unlock after we goto out. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:44 -07:00
Dan Carpenter	9b9d2e00bf	rds: signedness bug In the original code if the copy_from_user() fails in rds_rdma_pages() then the error handling fails and we get a stack trace from kmalloc(). Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:43 -07:00
Linus Torvalds	7d7dee96e1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits) dca: disable dca on IOAT ver.3.0 multiple-IOH platforms netpoll: Disable IRQ around RCU dereference in netpoll_rx sctp: Do not reset the packet during sctp_packet_config(). net/llc: storing negative error codes in unsigned short MAINTAINERS: move atlx discussions to netdev drivers/net/cxgb3/cxgb3_main.c: prevent reading uninitialized stack memory drivers/net/eql.c: prevent reading uninitialized stack memory drivers/net/usb/hso.c: prevent reading uninitialized memory xfrm: dont assume rcu_read_lock in xfrm_output_one() r8169: Handle rxfifo errors on 8168 chips 3c59x: Remove atomic context inside vortex_{set\|get}_wol tcp: Prevent overzealous packetization by SWS logic. net: RPS needs to depend upon USE_GENERIC_SMP_HELPERS phylib: fix PAL state machine restart on resume net: use rcu_barrier() in rollback_registered_many bonding: correctly process non-linear skbs ipv4: enable getsockopt() for IP_NODEFRAG ipv4: force_igmp_version ignored when a IGMPv3 query received ppp: potential NULL dereference in ppp_mp_explode() net/llc: make opt unsigned in llc_ui_setsockopt() ...	2010-09-19 11:05:50 -07:00
Ben Hutchings	be2902daee	ethtool, ixgbe: Move RX n-tuple mask fixup to ethtool The ethtool utility does not set masks for flow parameters that are not specified, so if both value and mask are 0 then this must be treated as equivalent to a mask with all bits set. Currently that is done in the only driver that implements RX n-tuple filtering, ixgbe. Move it to the ethtool core. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 16:53:23 -07:00
Vlad Yasevich	4bdab43323	sctp: Do not reset the packet during sctp_packet_config(). sctp_packet_config() is called when getting the packet ready for appending of chunks. The function should not touch the current state, since it's possible to ping-pong between two transports when sending, and that can result packet corruption followed by skb overlfow crash. Reported-by: Thomas Dreibholz <dreibh@iem.uni-due.de> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 16:47:56 -07:00
David Lamparter	3b27e10555	netns: keep vlan slaves on master netns move previously, if a vlan master device was moved from one network namespace to another, all 802.1q and macvlan slaves were deleted. we can use dev->reg_state to figure out whether dev_change_net_namespace is happening, since that won't set dev->reg_state NETREG_UNREGISTERING. so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when reg_state is not NETREG_UNREGISTERING. Signed-off-by: David Lamparter <equinox@diac24.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 16:46:04 -07:00
Eric Dumazet	67c9660831	ethtool: change ethtool_set_gro() to use ethtool_op_get_rx_csum To be able to switch on GRO on a device, ethtool_set_gro() checks this device provides a get_rx_csum() method. Some devices dont provide this method, while they do support RX checksumming. This patch allows bonding to support GRO : ethtool -K bond0 gro on Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 11:56:18 -07:00
Dan Carpenter	2507136f74	net/llc: storing negative error codes in unsigned short If the alloc_skb() fails then we return 65431 instead of -ENOBUFS (-105). Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-16 22:38:23 -07:00
Eric Dumazet	9476763262	ip6tnl: get rid of ip6_tnl_lock As RTNL is held while doing tunnels inserts and deletes, we can remove ip6_tnl_lock spinlock. My initial RCU conversion was conservative and converted the rwlock to spinlock, with no RTNL requirement. Use appropriate rcu annotations and modern lockdep checks as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-16 21:58:44 -07:00
Eric Dumazet	e71895a1be	xfrm: dont assume rcu_read_lock in xfrm_output_one() ip_local_out() is called with rcu_read_lock() held from ip_queue_xmit() but not from other call sites. Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-16 21:46:15 -07:00
Stephen Rothwell	caeda9b926	net: include inetdevice.h for rcu_dereference_raw api change rcu_dereference_raw() now needs to know the type of its argument. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-16 21:39:16 -07:00
Luis R. Rodriguez	f01a067d9e	mac80211: send last 3/5 probe requests as unicast Some buggy APs do not respond to unicast probe requests or send unicast probe requests very delayed so in the worst case we should try to send broadcast probe requests, otherwise we can get disconnected from these APs. Even if drivers do not have filters to disregard probe responses from foreign APs mac80211 will only process probe responses from our associated AP for re-arming connection monitoring. We need to do this since the beacon monitor does not push back the connection monitor by design so even if we are getting beacons from these type of APs our connection monitor currently relies heavily on the way the probe requests are received on the AP. An example of an AP affected by this is the Nexus One, but this has also been observed with random APs. We can probably optimize this later by using null funcs instead of probe requests. For more details refer to: http://code.google.com/p/chromium-os/issues/detail?id=5715 This patch has fixes for stable kernels [2.6.35+]. Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:18 -04:00
Luis R. Rodriguez	3bc3c0d748	mac80211: disable beacon monitor while going offchannel The beacon monitor should be disabled when going off channel to prevent spurious warnings and triggering connection deterioration work such as sending probe requests. Re-enable the beacon monitor once we come back to the home channel. This patch has fixes for stable kernels [2.6.34+]. Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:17 -04:00
Luis R. Rodriguez	d3a910a8e4	mac80211: make the beacon monitor available externally This will be used by other components next. The beacon monitor was added as of 2.6.34 so these fixes are applicable only to kernels >= 2.6.34. Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:16 -04:00
Luis R. Rodriguez	4730d5977f	mac80211: reset connection idle when going offchannel When we go offchannel mac80211 currently leaves alive the connection idle monitor. This should be instead postponed until we come back to our home channel, otherwise by the time we get back to the home channel we could be triggering unecesary probe requests. For APs that do not respond to unicast probe requests (Nexus One is a simple example) this means we essentially get disconnected after the probes fails. This patch has stable fixes for kernels [2.6.35+] Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:15 -04:00
Luis R. Rodriguez	0c699c3a75	mac80211: reset probe send counter upon connection timer reset Upon beacon loss we send probe requests after 30 seconds of idle time and we wait for each probe response 1/2 second. We send a total of 3 probe requests before giving up on the AP. In the case that we reset the connection idle monitor we should reset the probe requests count to 0. Right now this won't help in any way but the next patch will. This patch has fixes for stable kernel [2.6.35+]. Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:14 -04:00
Luis R. Rodriguez	be099e82e9	mac80211: add helper for reseting the connection monitor This will be used in another place later. The connection monitor was added as of 2.6.35 so these fixes will be applicable to >= 2.6.35. Cc: stable@kernel.org Cc: Paul Stewart <pstew@google.com> Cc: Amod Bodas <amod.bodas@atheros.com> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:13 -04:00
Johannes Berg	2ca27bcff7	mac80211: add p2p device type support When a driver advertises p2p device support, mac80211 will handle it, but internally it will rewrite the interface type to STA/AP rather than P2P-STA/GO since otherwise a lot of paths need to be touched that are otherwise identical. A p2p boolean tells drivers whether or not a given interface will be used for p2p or not. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:07 -04:00
Johannes Berg	074ac8df9f	cfg80211/nl80211: introduce p2p device types This adds P2P-STA and P2P-GO as device types so we can distinguish between those and normal STA or AP (respectively) type interfaces. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:06 -04:00
Johannes Berg	2d2080c3c1	mac80211: set running state earlier When an interface is brought up, the recent changes to allow changing type-while-up only set the running bit after everything was done. This broke a number of things, including idle calculation for monitor interfaces, and it also broke WDS station insertion (although nobody noticed yet). Thus, change the code to set the running bit earlier, but keep it after the driver's add_interface was called because otherwise drivers may iterate over interfaces they haven't fully set up yet. Reported-by: Rajkumar Manoharan <rmanoharan@atheros.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:01 -04:00
Johannes Berg	46a5ebaf02	cfg80211/mac80211: use lockdep_assert_held Instead of using a WARN_ON(!mutex_is_locked()) use lockdep_assert_held() which compiles away completely when lockdep isn't enabled, and also is a more accurate assertion since it checks that the current thread is holding the mutex. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:46:00 -04:00
Johannes Berg	f5521b1388	mac80211: use correct station flags lock This code is modifying the station flags, and as such should hold the flags lock so it can do so atomically vs. other flags modifications and readers. This issue was introduced when this code was added in `eccb8e8f`, as it used the wrong lock (thus not fixing the race that was previously documented in a comment.) Cc: stable@kernel.org [2.6.31+] Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:45:58 -04:00
Joe Perches	9c37663929	include/net/cfg80211.h: wiphy_<level> messages use dev_printk The output becomes: [ 41.261941] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht' Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-16 15:19:44 -04:00
Brandon Philips	16c3ea785f	net: enable GRO by default for vlan devices Currently vlan devices don't have GRO by default as none of the Ethernet drivers add NETIF_F_GRO to their vlan_features. As GRO is a software feature add GRO to dev->vlan_features in register_netdevice() and let vlan_dev_init() take care that it gets enabled only when dev->features has NETIF_F_GRO too. Signed-off-by: Brandon Philips <bphilips@suse.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 22:32:29 -07:00
Eric Dumazet	95ae6b228f	ipv4: ip_ptr cleanups dev->ip_ptr is protected by rtnl and rcu. Yet some places dont use appropriate primitives and/or locking rules. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 22:06:05 -07:00
David S. Miller	9e0064a545	phonet: Fix build warning. net/phonet/socket.c: In function ‘pn_res_seq_show’: net/phonet/socket.c:726: warning: format ‘%02X’ expects type ‘unsigned int’, but argument 3 has type ‘long int’ Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:34:41 -07:00
Rémi Denis-Courmont	507215f8d0	Phonet: list subscribed resources via proc_fs Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:31:33 -07:00
Rémi Denis-Courmont	b6a563b2af	Phonet: look up the resource routing table when forwarding Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:31:33 -07:00
Rémi Denis-Courmont	7417fa83c1	Phonet: hook resource routing to userspace via ioctl()'s I wish we could use something cleaner, such as bind(). But that would not work since resource subscription is orthogonal/in addition to the normal object ID allocated via bind(). This is similar to multicasting which also uses ioctl()'s. Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:31:32 -07:00
Rémi Denis-Courmont	4e3d16ce5e	Phonet: resource routing backend When both destination device and object are nul, Phonet routes the packet according to the resource field. In fact, this is the most common pattern when sending Phonet "request" packets. In this case, the packet is delivered to whichever endpoint (socket) has registered the resource. This adds a new table so that Linux processes can register their Phonet sockets to Phonet resources, if they have adequate privileges. (Namespace support is not implemented at the moment.) Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:31:32 -07:00
Rémi Denis-Courmont	6482f554e2	Phonet: remove dangling pipe if an endpoint is closed early Closing a pipe endpoint is not normally allowed by the Phonet pipe, other than as a side after-effect of removing the pipe between two endpoints. But there is no way to prevent Linux userspace processes from being killed or suffering from bugs, so this can still happen. We might as well forcefully close Phonet pipe endpoints then. The cellular modem supports only a few existing pipes at a time. So we really should not leak them. This change instructs the modem to destroy the pipe if either of the pipe's endpoint (Linux socket) is closed too early. Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 21:31:31 -07:00
David S. Miller	7fedd7e5df	Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/net-next-2.6	2010-09-15 20:21:48 -07:00
Arnd Bergmann	0a5f1d476a	irda/irnet: use noop_llseek There may be applications trying to seek on the irnet character device, so we should use noop_llseek to avoid returning an error when the default llseek changes to no_llseek. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 19:29:55 -07:00
Eric Dumazet	3a43be3c32	sit: get rid of ipip6_lock As RTNL is held while doing tunnels inserts and deletes, we can remove ipip6_lock spinlock. My initial RCU conversion was conservative and converted the rwlock to spinlock, with no RTNL requirement. Use appropriate rcu annotations and modern lockdep checks as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 19:29:47 -07:00
Eric Dumazet	1507850b40	gre: get rid of ipgre_lock As RTNL is held while doing tunnels inserts and deletes, we can remove ipgre_lock spinlock. My initial RCU conversion was conservative and converted the rwlock to spinlock, with no RTNL requirement. Use appropriate rcu annotations and modern lockdep checks as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 19:29:46 -07:00
Eric Dumazet	b7285b7912	ipip: get rid of ipip_lock As RTNL is held while doing tunnels inserts and deletes, we can remove ipip_lock spinlock. My initial RCU conversion was conservative and converted the rwlock to spinlock, with no RTNL requirement. Use appropriate rcu annotations and modern lockdep checks as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 19:29:46 -07:00
Ben Hutchings	e0de7c93b9	ethtool: Remove unimplemented flow specification types struct ethtool_rawip4_spec and struct ethtool_ether_spec are neither commented nor used by any driver, so remove them. Adjust padding in the user-visible unions that included these structures. Fix references to struct ethtool_rawip4_spec in ethtool_get_rx_ntuple(), which should use struct ethtool_usrip4_spec. struct ethtool_usrip4_spec cannot hold IPv6 host addresses and there is no separate structure that can, so remove ETH_RX_NFC_IP6 and the reference to it in niu. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 14:42:13 -07:00
Gerrit Renker	37efb03fbd	dccp ccid-3: Simplify and consolidate tx_parse_options This simplifies and consolidates the TX option-parsing code: 1. The Loss Intervals option is not currently used, so dead code related to this option is removed. I am aware of no plans to support the option, but if someone wants to implement it (e.g. for inter-op tests), it is better to start afresh than having to also update currently unused code. 2. The Loss Event and Receive Rate options have a lot of code in common (both are 32 bit, both have same length etc.), so this is consolidated. 3. The test against GSR is not necessary, because - on first loading CCID3, ccid_new() zeroes out all fields in the socket; - ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to pinv = opt_recv->ccid3or_loss_event_rate; if (pinv == ~0U \|\| pinv == 0) hctx->p = 0; - as a result, the sequence number field is removed from opt_recv. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-15 12:36:02 +02:00
Gerrit Renker	d2c726309d	dccp ccid-3: remove buggy RTT-sampling history lookup This removes the RTT-sampling function tfrc_tx_hist_rtt(), since 1. it suffered from complex passing of return values (the return value both indicated successful lookup while the value doubled as RTT sample); 2. when for some odd reason the sample value equalled 0, this triggered a bug warning about "bogus Ack", due to the ambiguity of the return value; 3. on a passive host which has not sent anything the TX history is empty and thus will lead to unwanted "bogus Ack" warnings such as ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148 ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606. The fix is to replace the implicit encoding by performing the steps manually. Furthermore, the "bogus Ack" warning has been removed, since it can actually be triggered due to several reasons (network reordering, old packet, (3) above), hence it is not very useful. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-15 12:36:02 +02:00
Gerrit Renker	20cbd3e120	dccp ccid-3: A lower bound for the inter-packet scheduling algorithm This fixes a subtle bug in the calculation of the inter-packet gap and shows that t_delta, as it is currently used, is not needed. The algorithm from RFC 5348, 8.3 below continually computes a send time t_nom, which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies the scheduling granularity, s the packet size, and X the sending rate: t_distance = t_nom - t_now; // in microseconds t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds if (t_distance >= t_delta) { reschedule after (t_distance / 1000) milliseconds; } else { t_ipi = s / X; // inter-packet interval in usec t_nom += t_ipi; // compute the next send time send packet now; } Problem: -------- Rescheduling requires a conversion into milliseconds (sk_reset_timer()). The highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher granularity does not make much sense here. As a consequence, values of t_distance < 1000 are truncated to 0. This issue has so far been resolved by using instead if (t_distance >= t_delta + 1000) reschedule after (t_distance / 1000) milliseconds; This is unnecessarily large, a lower bound is t_delta' = max(t_delta, 1000). And it implies a further simplification: a) when HZ >= 500, then t_delta <= t_gran/2 = 10^6/(2HZ) <= 1000, so that t_delta' = MAX(1000, t_delta) = 1000 (constant value); b) when HZ < 500, then t_delta = 1/2MIN(rtt, t_ipi, t_gran) <= t_gran/2, so that 1000 <= t_delta' <= t_gran/2. The maximum error of using a constant t_delta in (b) is less than half a jiffy. Fix: ---- The patch replaces t_delta with a constant, whose value depends on CONFIG_HZ, changing the above algorithm to: if (t_distance >= t_delta') reschedule after (t_distance / 1000) milliseconds; where t_delta' = 10^6/(2*HZ) if HZ < 500, and t_delta' = 1000 otherwise. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2010-09-15 12:36:01 +02:00
David S. Miller	6dcbc12290	net: RPS needs to depend upon USE_GENERIC_SMP_HELPERS You cannot invoke __smp_call_function_single() unless the architecture sets this symbol. Reported-by: Daniel Hellstrom <daniel@gaisler.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 21:42:22 -07:00
andrew hendry	21a4591794	X.25 remove bkl in connect Connect already has socket locking. Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 20:39:09 -07:00
Andrew Hendry	141646ce56	X.25 remove bkl in accept Accept already has socket locking. [ Extend socket locking over TCP_LISTEN state test. -DaveM ] Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 20:38:54 -07:00
andrew hendry	90c27297a9	X.25 remove bkl in bind Accept updates socket values in 3 lines so wrapped with lock_sock. Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 20:34:52 -07:00
andrew hendry	25aa4efe4f	X.25 remove bkl in listen Listen updates socket values and needs lock_sock. Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 20:34:52 -07:00
Joe Perches	55b1804c67	net/irda: Use static const char * const where possible Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 20:22:05 -07:00
Linus Torvalds	de8d4f5d75	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies statfs() gives ESTALE error NFS: Fix a typo in nfs_sockaddr_match_ipaddr6 sunrpc: increase MAX_HASHTABLE_BITS to 14 gss:spkm3 miss returning error to caller when import security context gss:krb5 miss returning error to caller when import security context Remove incorrect do_vfs_lock message SUNRPC: cleanup state-machine ordering SUNRPC: Fix a race in rpc_info_open SUNRPC: Fix race corrupting rpc upcall Fix null dereference in call_allocate	2010-09-14 17:04:48 -07:00
Eric Dumazet	ef885afbf8	net: use rcu_barrier() in rollback_registered_many netdev_wait_allrefs() waits that all references to a device vanishes. It currently uses a _very_ pessimistic 250 ms delay between each probe. Some users reported that no more than 4 devices can be dismantled per second, this is a pretty serious problem for some setups. Most of the time, a refcount is about to be released by an RCU callback, that is still in flight because rollback_registered_many() uses a synchronize_rcu() call instead of rcu_barrier(). Problem is visible if number of online cpus is one, because synchronize_rcu() is then a no op. time to remove 50 ipip tunnels on a UP machine : before patch : real 11.910s after patch : real 1.250s Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Octavian Purdila <opurdila@ixiacom.com> Reported-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 14:27:29 -07:00
Johannes Berg	a2c1e3dad5	mac80211: match only assigned bss in sta_info_get_bss sta_info_get_bss() is used to match STA pointers for VLAN/AP interfaces, but if the same station is also added to multiple other interfaces it will erroneously match because both pointers are NULL, fix this by ignoring NULL pointers here. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-14 16:14:27 -04:00
Stanislaw Gruszka	edeb78a7fa	mac80211: wait for scan work complete before restarting hw This is needed to avoid warning in ieee80211_restart_hw about hardware scan in progress. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Wey-Yi W Guy <wey-yi.w.guy@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-14 16:14:25 -04:00
Steve deRosier	740c1aa3b0	mac80211: Fix dangling pointer in ieee80211_xmit hdr pointer is left dangling after call to ieee80211_skb_resize. This can cause guards around mesh path selection to fail. Signed-off-by: Steve deRosier <steve@cozybit.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-14 16:08:03 -04:00
Bill Jordan	a1e567c83f	nl80211: Uninitialized variable There is a path in nl80211_set_wiphy where result is tested but uninitialized. I am hitting this path when I attempt: sh# iw dev wlan0 set channel 10 command failed: Unknown error 1069727332 (-1069727332) Signed-off-by: William Jordan <bjordan@rajant.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-14 16:08:03 -04:00
Nikitas Angelinas	942623166d	net/wireless: use ARRAY_SIZE macro in radiotap.c Replace sizeof(rtap_namespace_sizes) / sizeof(rtap_namespace_sizes[0]) with ARRAY_SIZE(rtap_namespace_sizes) in net/wireless/radiotap.c Signed-off-by: Nikitas Angelinas <nikitasangelinas@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-09-14 16:05:57 -04:00
Eric Dumazet	83b6b1f5d1	flow: better memory management Allocate hash tables for every online cpus, not every possible ones. NUMA aware allocations. Dont use a full page on arches where PAGE_SIZE > 1024sizeof(void ) misc: __percpu , __read_mostly, __cpuinit annotations flow_compare_t is just an "unsigned long" Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-13 20:02:50 -07:00
Michael Kerrisk	a89b47639f	ipv4: enable getsockopt() for IP_NODEFRAG While integrating your man-pages patch for IP_NODEFRAG, I noticed that this option is settable by setsockopt(), but not gettable by getsockopt(). I suppose this is not intended. The (untested, trivial) patch below adds getsockopt() support. Signed-off-by: Michael kerrisk <mtk.manpages@gmail.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-13 19:57:23 -07:00
Bob Arendt	7998156344	ipv4: force_igmp_version ignored when a IGMPv3 query received After all these years, it turns out that the /proc/sys/net/ipv4/conf//force_igmp_version parameter isn't fully implemented. Symptom: When set force_igmp_version to a value of 2, the kernel should only perform multicast IGMPv2 operations (IETF rfc2236). An host-initiated Join message will be sent as a IGMPv2 Join message. But if a IGMPv3 query message is received, the host responds with a IGMPv3 join message. Per rfc3376 and rfc2236, a IGMPv2 host should treat a IGMPv3 query as a IGMPv2 query and respond with an IGMPv2 Join message. Consequences: This is an issue when a IGMPv3 capable switch is the querier and will only issue IGMPv3 queries (which double as IGMPv2 querys) and there's an intermediate switch that is only IGMPv2 capable. The intermediate switch processes the initial v2 Join, but fails to recognize the IGMPv3 Join responses to the Query, resulting in a dropped connection when the intermediate v2-only switch times it out. Identifying issue in the kernel source: The issue is in this section of code (in net/ipv4/igmp.c), which is called when an IGMP query is received (from mainline 2.6.36-rc3 gitweb): ... A IGMPv3 query has a length >= 12 and no sources. This routine will exit after line 880, setting the general query timer (random timeout between 0 and query response time). This calls igmp_gq_timer_expire(): ... .. which only sends a v3 response. So if a v3 query is received, the kernel always sends a v3 response. IGMP queries happen once every 60 sec (per vlan), so the traffic is low. A IGMPv3 query is* a strict superset of a IGMPv2 query, so this patch properly short circuit's the v3 behaviour. One issue is that this does not address force_igmp_version=1. Then again, I've never seen any IGMPv1 multicast equipment in the wild. However there is a lot of v2-only equipment. If it's necessary to support the IGMPv1 case as well: 837 if (len == 8 \|\| IGMP_V2_SEEN(in_dev) \|\| IGMP_V1_SEEN(in_dev)) { Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-13 12:56:51 -07:00
Dan Carpenter	339db11b21	net/llc: make opt unsigned in llc_ui_setsockopt() The members of struct llc_sock are unsigned so if we pass a negative value for "opt" it can cause a sign bug. Also it can cause an integer overflow when we multiply "opt * HZ". CC: stable@kernel.org Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-13 12:44:10 -07:00
Latchesar Ionkov	62b2be591a	fs/9p, net/9p: memory leak fixes Four memory leak fixes in the 9P code. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-09-13 08:13:02 -05:00
Miquel van Smoorenburg	db5fe26541	sunrpc: increase MAX_HASHTABLE_BITS to 14 The maximum size of the authcache is now set to 1024 (10 bits), but on our server we need at least 4096 (12 bits). Increase MAX_HASHTABLE_BITS to 14. This is a maximum of 16384 entries, each containing a pointer (8 bytes on x86_64). This is exactly the limit of kmalloc() (128K). Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-09-12 19:55:26 -04:00
Bian Naimeng	651b2933b2	gss:spkm3 miss returning error to caller when import security context spkm3 miss returning error to up layer when import security context, it may be return ok though it has failed to import security context. Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-09-12 19:55:26 -04:00
Bian Naimeng	ce8477e117	gss:krb5 miss returning error to caller when import security context krb5 miss returning error to up layer when import security context, it may be return ok though it has failed to import security context. Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-09-12 19:55:25 -04:00
J. Bruce Fields	55576244eb	SUNRPC: cleanup state-machine ordering This is just a minor cleanup: net/sunrpc/clnt.c clarifies the rpc client state machine by commenting each state and by laying out the functions implementing each state in the order that each state is normally executed (in the absence of errors). The previous patch "Fix null dereference in call_allocate" changed the order of the states. Move the functions and update the comments to reflect the change. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-09-12 19:55:25 -04:00
Trond Myklebust	006abe887c	SUNRPC: Fix a race in rpc_info_open There is a race between rpc_info_open and rpc_release_client() in that nothing stops a process from opening the file after the clnt->cl_kref goes to zero. Fix this by using atomic_inc_unless_zero()... Reported-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-09-12 19:55:25 -04:00
Trond Myklebust	5a67657a2e	SUNRPC: Fix race corrupting rpc upcall If rpc_queue_upcall() adds a new upcall to the rpci->pipe list just after rpc_pipe_release calls rpc_purge_list(), but before it calls gss_pipe_release (as rpci->ops->release_pipe(inode)), then the latter will free a message without deleting it from the rpci->pipe list. We will be left with a freed object on the rpc->pipe list. Most frequent symptoms are kernel crashes in rpc.gssd system calls on the pipe in question. Reported-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-09-12 19:55:25 -04:00
J. Bruce Fields	f2d47d02fd	Fix null dereference in call_allocate In call_allocate we need to reach the auth in order to factor au_cslack into the allocation. As of `a17c2153d2` "SUNRPC: Move the bound cred to struct rpc_rqst", call_allocate attempts to do this by dereferencing tk_client->cl_auth, however this is not guaranteed to be defined--cl_auth can be zero in the case of gss context destruction (see rpc_free_auth). Reorder the client state machine to bind credentials before allocating, so that we can instead reach the auth through the cred. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-09-12 19:55:25 -04:00
David S. Miller	a505b3b30f	sch_atm: Fix potential NULL deref. The list_head conversion unearther an unnecessary flow check. Since flow is always NULL here we don't need to see if a matching flow exists already. Reported-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-12 11:56:44 -07:00
stephen hemminger	9ca7f87622	pkt_sched: remov unnecessary bh_disable Now that est_tree_lock is acquired with BH protection, the other call is unnecessary. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-10 12:47:59 -07:00
Eric Dumazet	a034ee3cca	fib: cleanups Use rcu_dereference_rtnl() helper Change hard coded constants in fib_flag_trans() 7 -> RTN_UNREACHABLE 8 -> RTN_PROHIBIT Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-10 12:32:02 -07:00
David S. Miller	e548833df8	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/main.c	2010-09-09 22:27:33 -07:00
Paul Gortmaker	b2abd4c033	tipc: Optimize handling excess content on incoming messages Remove code that trimmed excess trailing info from incoming messages arriving over an Ethernet interface. TIPC now ignores the extra info while the message is being processed by the node, and only trims it off if the message is retransmitted to another node. (This latter step is done to ensure the extra info doesn't cause the sk_buff to exceed the outgoing interface's MTU limit.) The outgoing buffer is guaranteed to be linear. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 21:34:14 -07:00
Eric Dumazet	49d61e2390	tunnels: missing rcu_assign_pointer() xfrm4_tunnel_register() & xfrm6_tunnel_register() should use rcu_assign_pointer() to make sure previous writes (to handler->next) are committed to memory before chain insertion. deregister functions dont need a particular barrier. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:02:39 -07:00
Namhyung Kim	f39234d606	net/core: add lock context change annotations in net/core/sock.c __lock_sock() and __release_sock() releases and regrabs lock but were missing proper annotations. Add it. This removes following warning from sparse. (Currently __lock_sock() does not emit any warning about it but I think it is better to add also.) net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:02:39 -07:00
Namhyung Kim	a700d8be73	net/core: remove address space warnings on verify_iovec() move_addr_to_kernel() and copy_from_user() requires their argument as __user pointer but were missing proper markups. Add it. This removes following warnings from sparse. net/core/iovec.c:44:52: warning: incorrect type in argument 1 (different address spaces) net/core/iovec.c:44:52: expected void [noderef] <asn:1>uaddr net/core/iovec.c:44:52: got void msg_name net/core/iovec.c:55:34: warning: incorrect type in argument 2 (different address spaces) net/core/iovec.c:55:34: expected void const [noderef] <asn:1>from net/core/iovec.c:55:34: got struct iovec msg_iov Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:02:38 -07:00
Joe Perches	123031c0ee	sctp: fix test for end of loop Add a list_has_sctp_addr function to simplify loop Based on a patches by Dan Carpenter and David Miller Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:00:29 -07:00
David S. Miller	cf0ac2b8a7	Merge branch 'for-davem' of git://oss.oracle.com/git/agrover/linux-2.6	2010-09-09 14:58:11 -07:00
David S. Miller	e199e6136c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6	2010-09-08 23:49:04 -07:00
Eric Dumazet	719f835853	udp: add rehash on connect() commit `30fff923` introduced in linux-2.6.33 (udp: bind() optimisation) added a secondary hash on UDP, hashed on (local addr, local port). Problem is that following sequence : fd = socket(...) connect(fd, &remote, ...) not only selects remote end point (address and port), but also sets local address, while UDP stack stored in secondary hash table the socket while its local address was INADDR_ANY (or ipv6 equivalent) Sequence is : - autobind() : choose a random local port, insert socket in hash tables [while local address is INADDR_ANY] - connect() : set remote address and port, change local address to IP given by a route lookup. When an incoming UDP frame comes, if more than 10 sockets are found in primary hash table, we switch to secondary table, and fail to find socket because its local address changed. One solution to this problem is to rehash datagram socket if needed. We add a new rehash(struct socket *) method in "struct proto", and implement this method for UDP v4 & v6, using a common helper. This rehashing only takes care of secondary hash table, since primary hash (based on local port only) is not changed. Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-08 21:45:01 -07:00
Eric Dumazet	e0386005ff	net: inet_add_protocol() can use cmpxchg() Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and friends. inet_protos[] & inet6_protos[] are moved to read_mostly section Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-08 21:31:35 -07:00
Andy Grover	20c72bd5f5	RDS: Implement masked atomic operations Add two CMSGs for masked versions of cswp and fadd. args struct modified to use a union for different atomic op type's arguments. Change IB to do masked atomic ops. Atomic op type in rds_message similarly unionized. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:51 -07:00
Zach Brown	59f740a6ae	RDS/IB: print string constants in more places This prints the constant identifier for work completion status and rdma cm event types, like we already do for IB event types. A core string array helper is added that each string type uses. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:50 -07:00
Zach Brown	4518071ac1	RDS: cancel connection work structs as we shut down Nothing was canceling the send and receive work that might have been queued as a conn was being destroyed. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:49 -07:00
Zach Brown	ffcec0e110	RDS: don't call rds_conn_shutdown() from rds_conn_destroy() rds_conn_shutdown() can return before the connection is shut down when it encounters an existing state that it doesn't understand. This lets rds_conn_destroy() then start tearing down the conn from under paths that are still using it. It's more reliable the shutdown work and wait for krdsd to complete the shutdown callback. This stopped some hangs I was seeing where krdsd was trying to shut down a freed conn. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:48 -07:00
Zach Brown	5adb5bc65f	RDS: have sockets get transport module references Right now there's nothing to stop the various paths that use rs->rs_transport from racing with rmmod and executing freed transport code. The simple fix is to have binding to a transport also hold a reference to the transport's module, removing this class of races. We already had an unused t_owner field which was set for the modular transports and which wasn't set for the built-in loop transport. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:47 -07:00
Zach Brown	77510481c0	RDS: remove old rs_transport comment rs_transport is now also used by the rdma paths once the socket is bound. We don't need this stale comment to tell us what cscope can. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:46 -07:00
Zach Brown	fe8ff6b58f	RDS: lock rds_conn_count decrement in rds_conn_destroy() rds_conn_destroy() can race with all other modifications of the rds_conn_count but it was modifying the count without locking. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:45 -07:00
Zach Brown	ea819867b7	RDS/IB: protect the list of IB devices The RDS IB device list wasn't protected by any locking. Traversal in both the get_mr and FMR flushing paths could race with additon and removal. List manipulation is done with RCU primatives and is protected by the write side of a rwsem. The list traversal in the get_mr fast path is protected by a rcu read critical section. The FMR list traversal is more problematic because it can block while traversing the list. We protect this with the read side of the rwsem. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:44 -07:00
Zach Brown	1bde04a63d	RDS/IB: print IB event strings as well as their number It's nice to not have to go digging in the code to see which event occurred. It's easy to throw together a quick array that maps the ib event enums to their strings. I didn't see anything in the stack that does this translation for us, but I also didn't look very hard. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:43 -07:00
Chris Mason	8576f374ac	RDS: flush fmrs before allocating new ones Flushing FMRs is somewhat expensive, and is currently kicked off when the interrupt handler notices that we are getting low. The result of this is that FMR flushing only happens from the interrupt cpus. This spreads the load more effectively by triggering flushes just before we allocate a new FMR. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:16:42 -07:00
Chris Mason	b4e1da3c9a	RDS: properly use sg_init_table This is only needed to keep debugging code from bugging. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:16:41 -07:00
Zach Brown	f046011cd7	RDS/IB: track signaled sends We're seeing bugs today where IB connection shutdown clears the send ring while the tasklet is processing completed sends. Implementation details cause this to dereference a null pointer. Shutdown needs to wait for send completion to stop before tearing down the connection. We can't simply wait for the ring to empty because it may contain unsignaled sends that will never be processed. This patch tracks the number of signaled sends that we've posted and waits for them to complete. It also makes sure that the tasklet has finished executing. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:40 -07:00
Zach Brown	ef87b7ea39	RDS: remove __init and __exit annotation The trivial amount of memory saved isn't worth the cost of dealing with section mismatches. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:39 -07:00
Andy Grover	c20f5b9633	RDS/IB: Use SLAB_HWCACHE_ALIGN flag for kmem_cache_create() We are definitely counting cycles as closely as DaveM, so ensure hwcache alignment for our recv ring control structs. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:38 -07:00
Zach Brown	d455ab6409	RDS/IB: always process recv completions The recv refill path was leaking fragments because the recv event handler had marked a ring element as free without freeing its frag. This was happening because it wasn't processing receives when the conn wasn't marked up or connecting, as can be the case if it races with rmmod. Two observations support always processing receives in the callback. First, buildup should only post receives, thus triggering recv event handler calls, once it has built up all the state to handle them. Teardown should destroy the CQ and drain the ring before tearing down the state needed to process recvs. Both appear to be true today. Second, this test was fundamentally racy. There is nothing to stop rmmod and connection destruction from swooping in the moment after the conn state was sampled but before real receive procesing starts. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:36 -07:00
Zach Brown	80c51be56f	RDS: return to a single-threaded krdsd We were seeing very nasty bugs due to fundamental assumption the current code makes about concurrent work struct processing. The code simpy isn't able to handle concurrent connection shutdown work function execution today, for example, which is very much possible once a multi-threaded krdsd was introduced. The problem compounds as additional work structs are added to the mix. krdsd is no longer perforance critical now that send and receive posting and FMR flushing are done elsewhere, so the safest fix is to move back to the single threaded krdsd that the current code was built around. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:35 -07:00
Zach Brown	515e079dab	RDS/IB: create a work queue for FMR flushing This patch moves the FMR flushing work in to its own mult-threaded work queue. This is to maintain performance in preparation for returning the main krdsd work queue back to a single threaded work queue to avoid deep-rooted concurrency bugs. This is also good because it further separates FMRs, which might be removed some day, from the rest of the code base. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:34 -07:00
Zach Brown	8aeb1ba663	RDS/IB: destroy connections on rmmod IB connections were not being destroyed during rmmod. First, recently IB device removal callback was changed to disconnect connections that used the removing device rather than destroying them. So connections with devices during rmmod were not being destroyed. Second, rds_ib_destroy_nodev_conns() was being called before connections are disassociated with devices. It would almost never find connections in the nodev list. We first get rid of rds_ib_destroy_conns(), which is no longer called, and refactor the existing caller into the main body of the function and get rid of the list and lock wrappers. Then we call rds_ib_destroy_nodev_conns() after ib_unregister_client() has removed the IB device from all the conns and put the conns on the nodev list. The result is that IB connections are destroyed by rmmod. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:33 -07:00
Zach Brown	24fa163a4b	RDS/IB: wait for IB dev freeing work to finish during rmmod The RDS IB client removal callback can queue work to drop the final reference to an IB device. We have to make sure that this function has returned before we complete rmmod or the work threads can try to execute freed code. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:32 -07:00
Andy Grover	b6fb0df12d	RDS/IB: Make ib_recv_refill return void Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:31 -07:00
Andy Grover	fbf4d7e3d0	RDS: Remove unused XLIST_PTR_TAIL and xlist_protect() Not used. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:06 -07:00
Andy Grover	c9455d9996	RDS: whitespace	2010-09-08 18:15:32 -07:00
Chris Mason	7a0ff5dbdd	RDS: use delayed work for the FMR flushes Using a delayed work queue helps us make sure a healthy number of FMRs have queued up over the limit. It makes for a large improvement in RDMA iops. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:30 -07:00
Chris Mason	eabb732279	rds: more FMRs are faster When we add more FMRs, we flush them less often and so we go faster. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:29 -07:00
Chris Mason	6fa70da608	rds: recycle FMRs through lockless lists FRM allocation and recycling is performance critical and fairly lock intensive. The current code has a per connection lock that all processes bang on and it becomes a major bottleneck on large systems. This changes things to use a number of cmpxchg based lists instead, allowing us to go through the whole FMR lifecycle without locking inside RDS. Zach Brown pointed out that our usage of cmpxchg for xlist removal is racey if someone manages to remove and add back an FMR struct into the list while another CPU can see the FMR's address at the head of the list. The second CPU might assume the list hasn't changed when in fact any number of operations might have happened in between the deletion and reinsertion. This commit maintains a per cpu count of CPUs that are currently in xlist removal, and establishes a grace period to make sure that nobody can see an entry we have just removed from the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:28 -07:00
Zach Brown	0f4b1c7e89	rds: fix rds_send_xmit() serialization rds_send_xmit() was changed to hold an interrupt masking spinlock instead of a mutex so that it could be called from the IB receive tasklet path. This broke the TCP transport because its xmit method can block and masks and unmasks interrupts. This patch serializes callers to rds_send_xmit() with a simple bit instead of the current spinlock or previous mutex. This enables rds_send_xmit() to be called from any context and to call functions which block. Getting rid of the c_send_lock exposes the bare c_lock acquisitions which are changed to block interrupts. A waitqueue is added so that rds_conn_shutdown() can wait for callers to leave rds_send_xmit() before tearing down partial send state. This lets us get rid of c_senders. rds_send_xmit() is changed to check the conn state after acquiring the RDS_IN_XMIT bit to resolve races with the shutdown path. Previously both worked with the conn state and then the lock in the same order, allowing them to race and execute the paths concurrently. rds_send_reset() isn't racing with rds_send_xmit() now that rds_conn_shutdown() properly ensures that rds_send_xmit() can't start once the conn state has been changed. We can remove its previous use of the spinlock. Finally, c_send_generation is redundant. Callers can race to test the c_flags bit by simply retrying instead of racing to test the c_send_generation atomic. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:27 -07:00
Zach Brown	501dcccdb7	rds: block ints when acquiring c_lock in rds_conn_message_info() conn->c_lock is acquired in interrupt context. rds_conn_message_info() is called from user context and was acquiring c_lock without blocking interrupts, leading to possible deadlocks. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:26 -07:00
Zach Brown	671202f349	rds: remove unused rds_send_acked_before() rds_send_acked_before() wasn't blocking interrupts when acquiring c_lock from user context but nothing calls it. Rather than fix its use of c_lock we just remove the function. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:25 -07:00
Chris Mason	037f18a307	RDS: use friendly gfp masks for prefill When prefilling the rds frags, we end up doing a lot of allocations. We're not in atomic context here, and so there's no reason to dip into atomic reserves. This changes the prefills to use masks that allow waiting. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:24 -07:00
Chris Mason	3324412587	RDS/IB: Add caching of frags and incs This patch is based heavily on an initial patch by Chris Mason. Instead of freeing slab memory and pages, it keeps them, and funnels them back to be reused. The lock minimization strategy uses xchg and cmpxchg atomic ops for manipulation of pointers to list heads. We anchor the lists with a pointer to a list_head struct instead of a static list_head struct. We just have to carefully use the existing primitives with the difference between a pointer and a static head struct. For example, 'list_empty()' means that our anchor pointer points to a list with a single item instead of meaning that our static head element doesn't point to any list items. Original patch by Chris, with significant mods and fixes by Andy and Zach. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:23 -07:00
Andy Grover	fc24f78085	RDS/IB: Remove ib_recv_unmap_page() All it does is call unmap_sg(), so just call that directly. The comment above unmap_page also may be incorrect, so we shouldn't hold on to it, either. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:22 -07:00
Andy Grover	3427e854e1	RDS: Assume recv->r_frag is always NULL in refill_one() refill_one() should never be called on a recv struct that doesn't need a new r_frag allocated. Add a WARN and remove conditional around r_frag alloc code. Also, add a comment to explain why r_ibinc may or may not need refilling. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:21 -07:00
Andy Grover	0b088e003c	RDS: Use page_remainder_alloc() for recv bufs Instead of splitting up a page into RDS_FRAG_SIZE chunks ourselves, ask rds_page_remainder_alloc() to do it. While it is possible PAGE_SIZE > FRAG_SIZE, on x86en it isn't, so having duplicate "carve up a page into buffers" code seems excessive. The other modification this spawns is the use of a single struct scatterlist in rds_page_frag instead of a bare page ptr. This causes verbosity to increase in some places, and decrease in others. Finally, I decided to unify the lifetimes and alloc/free of rds_page_frag and its page. This is a nice simplification in itself, but will be extra-nice once we come to adding cmason's recycling patch. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:20 -07:00
Zach Brown	fc19de38be	RDS/IB: disconnect when IB devices are removed Currently IB device removal destroys connections which are associated with the device. This prevents connections from being re-established when replacement devices are added. Instead we'll queue shutdown work on the connections as their devices are removed. When we see that devices are added we triger connection attempts on all connections that don't currently have a device. The result is that RDS sockets can resume device-independent work (bcopy, not RDMA) across IB device removal and restoration. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:19 -07:00
Zach Brown	f3c6808d3d	RDS: introduce rds_conn_connect_if_down() A few paths had the same block of code to queue a connection's connect work if it was in the right state. Let's move this in to a helper function. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:18 -07:00
Zach Brown	3e0249f9c0	RDS/IB: add refcount tracking to struct rds_ib_device The RDS IB client .remove callback used to free the rds_ibdev for the given device unconditionally. This could race other users of the struct. This patch adds refcounting so that we only free the rds_ibdev once all of its users are done. Many rds_ibdev users are tied to connections. We give the connection a reference and change these users to reference the device in the connection instead of looking it up in the IB client data. The only user of the IB client data remaining is the first lookup of the device as connections are built up. Incrementing the reference count of a device found in the IB client data could race with final freeing so we use an RCU grace period to make sure that freeing won't happen until those lookups are done. MRs need the rds_ibdev to get at the pool that they're freed in to. They exist outside a connection and many MRs can reference different devices from one socket, so it was natural to have each MR hold a reference. MR refs can be dropped from interrupt handlers and final device teardown can block so we push it off to a work struct. Pool teardown had to be fixed to cancel its pending work instead of deadlocking waiting for all queued work, including itself, to finish. MRs get their reference from the global device list, which gets a reference. It is left unprotected by locks and remains racy. A simple global lock would be a significant bottleneck. More scalable (complicated) locking should be done carefully in a later patch. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:17 -07:00
Zach Brown	89bf9d4158	RDS/IB: get the xmit max_sge from the RDS IB device on the connection rds_ib_xmit_rdma() was calling ib_get_client_data() to get at the rds_ibdevice just to get the max_sge for the transmit. This patch instead has it get it directly off the rds_ibdev which is stored on the connection. The current code won't free the rds_ibdev until all the IB connections that use it are freed. So it's safe to reference the rds_ibdev this way. In the future it also makes it easier to support proper reference counting of the rds_ibdev struct. As an additional bonus, this gets rid of the performance hit of calling in to the IB stack to look up the rds_ibdev. The current implementation in the IB stack acquires an interrupt blocking spinlock to protect the registration of client callback data. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:16 -07:00
Zach Brown	a46ca94e7f	RDS/IB: rds_ib_cm_handle_connect() forgot to unlock c_cm_lock rds_ib_cm_handle_connect() could return without unlocking the c_conn_lock if rds_setup_qp() failed. Rather than adding another imbalanced mutex_unlock() to this error path we only unlock the mutex once as we exit the function, reducing the likelyhood of making this same mistake in the future. We remove the previous mulitple return sites, leaving one unambigious return path. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:15 -07:00
Chris Mason	1cc2228c59	rds: Fix reference counting on the for xmit_atomic and xmit_rdma This makes sure we have the proper number of references in rds_ib_xmit_atomic and rds_ib_xmit_rdma. We also consistently drop references the same way for all message types as the IOs end. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:13 -07:00
Chris Mason	bcf50ef2ce	rds: use RCU to protect the connection hash The connection hash was almost entirely RCU ready, this just makes the final couple of changes to use RCU instead of spinlocks for everything. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:12 -07:00
Chris Mason	abf454398c	RDS: use locking on the connection hash list rds_conn_destroy really needs locking while it changes the connection hash. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:11 -07:00
Chris Mason	c9e65383a2	rds: Fix RDMA message reference counting The RDS send_xmit code was trying to get fancy with message counting and was dropping the final reference on the RDMA messages too early. This resulted in memory corruption and oopsen. The fix here is to always add a ref as the parts of the message passes through rds_send_xmit, and always drop a ref as the parts of the message go through completion handling. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:10 -07:00
Chris Mason	7e3f2952ee	rds: don't let RDS shutdown a connection while senders are present This is the first in a long line of patches that tries to fix races between RDS connection shutdown and RDS traffic. Here we are maintaining a count of active senders to make sure the connection doesn't go away while they are using it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:09 -07:00
Chris Mason	38a4e5e613	rds: Use RCU for the bind lookup searches The RDS bind lookups are somewhat expensive in terms of CPU time and locking overhead. This commit changes them into a faster RCU based hash tree instead of the rbtrees they were using before. On large NUMA systems it is a significant improvement. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:08 -07:00
Andy Grover	e4c52c98e0	RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node() Allocate send/recv rings in memory that is node-local to the HCA. This significantly helps performance. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:14:06 -07:00
Andy Grover	4a81802b5e	RDS/IB: Remove unused variable in ib_remove_addr() Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:29 -07:00
Chris Mason	764f2dd92f	rds: rcu-ize rds_ib_get_device() rds_ib_get_device is called very often as we turn an ip address into a corresponding device structure. It currently take a global spinlock as it walks different lists to find active devices. This commit changes the lists over to RCU, which isn't very complex because they are not updated very often at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:28 -07:00
Chris Mason	c83188dcd7	rds: per-rm flush_wait waitq This removes a global waitqueue used to wait for rds messages and replaces it with a waitqueue inside the rds_message struct. The global waitqueue turns into a global lock and significantly bottlenecks operations on large machines. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:27 -07:00
Chris Mason	976673ee1b	rds: switch to rwlock on bind_lock The bind_lock is almost entirely readonly, but it gets hammered during normal operations and is a major bottleneck. This commit changes it to an rwlock, which takes it from 80% of the system time on a big numa machine down to much lower numbers. A better fix would involve RCU, which is done in a later commit Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:26 -07:00
Andy Grover	ce47f52f42	RDS: Update comments in rds_send_xmit() Update comments to reflect changes in previous commit. Keeping as separate commits due to different authorship. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:25 -07:00
Chris Mason	9e29db0e36	RDS: Use a generation counter to avoid rds_send_xmit loop rds_send_xmit is required to loop around after it releases the lock because someone else could done a trylock, found someone working on the list and backed off. But, once we drop our lock, it is possible that someone else does come in and make progress on the list. We should detect this and not loop around if another process is actually working on the list. This patch adds a generation counter that is bumped every time we get the lock and do some send work. If the retry notices someone else has bumped the generation counter, it does not need to loop around and continue working. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:24 -07:00
Andy Grover	acfcd4d4ec	RDS: Get pong working again Call send_xmit() directly from pong() Set pongs as op_active Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:23 -07:00
Andy Grover	a40aa9233a	RDS: Do wait_event_interruptible instead of wait_event Can't see a reason not to allow signals to interrupt the wait. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:22 -07:00
Andy Grover	fcc5450c63	RDS: Remove send_quota from send_xmit() The purpose of the send quota was really to give fairness when different connections were all using the same workq thread to send backlogged msgs -- they could only send so many before another connection could make progress. Now that each connection is pushing the backlog from its completion handler, they are all guaranteed to make progress and the quota isn't needed any longer. A thread will have to send all previously queued data, as well as any further msgs placed on the queue while while c_send_lock was held. In a pathological case a single process can get roped into doing this for long periods while other threads get off free. But, since it can only do this until the transport reports full, this is a bounded scenario. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:21 -07:00
Andy Grover	51e2cba8b5	RDS: Move atomic stats from general to ib-specific area Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:20 -07:00
Andy Grover	ab1a6926f5	RDS: rds_message_unmapped() doesn't need to check if queue active If the queue has nobody on it, then wake_up does nothing. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:19 -07:00
Andy Grover	cf4b7389ee	RDS: Fix locking in send on m_rs_lock Do not nest m_rs_lock under c_lock Disable interrupts in {rdma,atomic}_send_complete Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:18 -07:00
Andy Grover	f2ec76f288	RDS: Use NOWAIT in message_map_pages() Can no longer block, so use NOWAIT. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:17 -07:00
Andy Grover	2fa57129df	RDS: Bypass workqueue when queueing cong updates Now that rds_send_xmit() does not block, we can call it directly instead of going through the helper thread. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:16 -07:00
Andy Grover	a7d3a28148	RDS: Call rds_send_xmit() directly from sendmsg() rds_sendmsg() is calling the send worker function to send the just-queued datagrams, presumably because it wants the behavior where anything not sent will re-call the send worker. We now ensure all queued datagrams are sent by retrying from the send completion handler, so this isn't needed any more. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:15 -07:00
Andy Grover	2ad8099b58	RDS: rds_send_xmit() locking/irq fixes rds_message_put() cannot be called with irqs off, so move it after irqs are re-enabled. Spinlocks throughout the function do not to use _irqsave because the lock of c_send_lock at top already disabled irqs. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:13 -07:00
Andy Grover	049ee3f500	RDS: Change send lock from a mutex to a spinlock This change allows us to call rds_send_xmit() from a tasklet, which is crucial to our new operating model. * Change c_send_lock to a spinlock * Update stats fields "sem_" to "_lock" * Remove unneeded rds_conn_is_sending() About locking between shutdown and send -- send checks if the connection is up. Shutdown puts the connection into DISCONNECTING. After this, all threads entering send will exit immediately. However, a thread could be in send_xmit(), so shutdown acquires the c_send_lock to ensure everyone is out before proceeding with connection shutdown. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:12 -07:00
Andy Grover	f17a1a55fb	RDS: Refill recv ring directly from tasklet Performance is better if we use allocations that don't block to refill the receive ring. Since the whole reason we were kicking out to the worker thread was so we could do blocking allocs, we no longer need to do this. Remove gfp params from rds_ib_recv_refill(); we always use GFP_NOWAIT. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:11 -07:00
Andy Grover	77dd550e55	RDS: Stop supporting old cong map sending method We now ask the transport to give us a rm for the congestion map, and then we handle it normally. Previously, the transport defined a function that we would call to send a congestion map. Convert TCP and loop transports to new cong map method. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:10 -07:00
Andy Grover	e32b4a7049	RDS/IB: Do not wait for send ring to be empty on conn shutdown Now that we are signaling send completions much less, we are likely to have dirty entries in the send queue when the connection is shut down (on rmmod, for example.) These are cleaned up a little further down in conn_shutdown, but if we wait on the ring_empty_wait for them, it'll never happen, and we hand on unload. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:09 -07:00
Andy Grover	ff3d7d3613	RDS: Perform unmapping ops in stages Previously, RDS would wait until the final send WR had completed and then handle cleanup. With silent ops, we do not know if an atomic, rdma, or data op will be last. This patch handles any of these cases by keeping a pointer to the last op in the message in m_last_op. When the TX completion event fires, rds dispatches to per-op-type cleanup functions, and then does whole-message cleanup, if the last op equalled m_last_op. This patch also moves towards having op-specific functions take the op struct, instead of the overall rm struct. rds_ib_connection has a pointer to keep track of a a partially- completed data send operation. This patch changes it from an rds_message pointer to the narrower rm_data_op pointer, and modifies places that use this pointer as needed. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:08 -07:00
Andy Grover	aa0a4ef4ac	RDS: Make sure cmsgs aren't used in improper ways It hasn't cropped up in the field, but this code ensures it is impossible to issue operations that pass an rdma cookie (DEST, MAP) in the same sendmsg call that's actually initiating rdma or atomic ops. Disallowing this perverse-but-technically-allowed usage makes silent RDMA heuristics slightly easier. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:07 -07:00
Andy Grover	2c3a5f9abb	RDS: Add flag for silent ops. Do atomic op before RDMA Add a flag to the API so users can indicate they want silent operations. This is needed because silent ops cannot be used with USE_ONCE MRs, so we can't just assume silent. Also, change send_xmit to do atomic op before rdma op if both are present, and centralize the hairy logic to determine if we want to attempt silent, or not. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:06 -07:00
Andy Grover	7e3bd65ebf	RDS: Move some variables around for consistency Also, add a comment. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:05 -07:00
Andy Grover	940786eb0a	RDS: queue failure notifications for dropped atomic ops When dropping ops in the send queue, we notify the client of failed rdma ops they asked for notifications on, but not atomic ops. It should be for both. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:04 -07:00
Andy Grover	ee4c7b47e4	RDS: Add a warning if trying to allocate 0 sgs rds_message_alloc_sgs() only works when nents is nonzero. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:03 -07:00
Andy Grover	372cd7dedf	RDS: Do not set op_active in r_m_copy_from_user(). Do not allocate sgs for data for 0-length datagrams Set data.op_active in rds_sendmsg() instead of rds_message_copy_from_user(). Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:02 -07:00
Andy Grover	5b2366bd28	RDS: Rewrite rds_send_xmit Simplify rds_send_xmit(). Send a congestion map (via xmit_cong_map) without decrementing send_quota. Move resetting of conn xmit variables to end of loop. Update comments. Implement a special case to turn off sending an rds header when there is an atomic op and no other data. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:01 -07:00
Andy Grover	6c7cc6e469	RDS: Rename data op members prefix from m_ to op_ For consistency. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:59 -07:00
Andy Grover	f8b3aaf2ba	RDS: Remove struct rds_rdma_op A big changeset, but it's all pretty dumb. struct rds_rdma_op was already embedded in struct rm_rdma_op. Remove rds_rdma_op and put its members in rm_rdma_op. Rename members with "op_" prefix instead of "r_", for consistency. Of course this breaks a lot, so fixup the code accordingly. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:58 -07:00
Andy Grover	d0ab25a83c	RDS: purge atomic resources too in rds_message_purge() Add atomic_free_op function, analogous to rdma_free_op, and call it in rds_message_purge(). Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:57 -07:00
Andy Grover	4324879df0	RDS: Inline rdma_prepare into cmsg_rdma_args cmsg_rdma_args just calls rdma_prepare and does a little arg checking -- not quite enough to justify its existence. Plus, it is the only caller of rdma_prepare(). Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:56 -07:00
Andy Grover	241eef3e2f	RDS: Implement silent atomics Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:55 -07:00
Andy Grover	d37c935905	RDS: Move loop-only function to loop.c Also, try to better-document the locking around the rm and its m_inc in loop.c. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:54 -07:00
Andy Grover	c8de3f1005	RDS/IB: Make all flow control code conditional on i_flowctl Maybe things worked fine with the flow control code running even in the non-flow-control case, but making it explicitly conditional helps the non-fc case be easier to read. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:53 -07:00
Andy Grover	1d34f17571	RDS: Remove unsignaled_bytes sysctl Removed unsignaled_bytes sysctl and code to signal based on it. I believe unsignaled_wrs is more than sufficient for our purposes. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:52 -07:00
Andy Grover	da5a06cef5	RDS: rewrite rds_ib_xmit Now that the header always goes first, it is possible to simplify rds_ib_xmit. Instead of having a path to handle 0-byte dgrams and another path to handle >0, these can both be handled in one path. This lets us eliminate xmit_populate_wr(). Rename sent to bytes_sent, to differentiate better from other variable named "send". Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:51 -07:00
Andy Grover	919ced4ce7	RDS/IB: Remove ib_[header/data]_sge() functions These functions were to cope with differently ordered sg entries depending on RDS 3.0 or 3.1+. Now that we've dropped 3.0 compatibility we no longer need them. Also, modify usage sites for these to refer to sge[0] or [1] directly. Reorder code to initialize header sgs first. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:50 -07:00
Andy Grover	6f3d05db0d	RDS/IB: Remove dead code Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:49 -07:00
Andy Grover	f147dd9eca	RDS/IB: Disallow connections less than RDS 3.1 RDS 3.0 connections (in OFED 1.3 and earlier) put the header at the end. 3.1 connections put it at the head. The code has significant added complexity in order to handle both configurations. In OFED 1.6 we can drop this and simplify the code by only supporting "header-first" configuration. This patch checks the protocol version, and if prior to 3.1, does not complete the connection. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:48 -07:00
Andy Grover	9c030391e8	RDS/IB: eliminate duplicate code both atomics and rdmas need to convert ib-specific completion codes into RDS status codes. Rename rds_ib_rdma_send_complete to rds_ib_send_complete, and have it take a pointer to the function to call with the new error code. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:47 -07:00
Andy Grover	809fa148a2	RDS: inc_purge() transport function unused - remove it Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:46 -07:00
Andy Grover	6200ed7799	RDS: Whitespace Tidy up some whitespace issues. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:44 -07:00
Andy Grover	d22faec22c	RDS: Do not mask address when pinning pages This does not appear to be necessary. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:43 -07:00
Andy Grover	40589e74f7	RDS: Base init_depth and responder_resources on hw values Instead of using a constant for initiator_depth and responder_resources, read the per-QP values when the device is enumerated, and then use these values when creating the connection. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:42 -07:00
Andy Grover	15133f6e67	RDS: Implement atomic operations Implement a CMSG-based interface to do FADD and CSWP ops. Alter send routines to handle atomic ops. Add atomic counters to stats. Add xmit_atomic() to struct rds_transport Inline rds_ib_send_unmap_rdma into unmap_rm Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:41 -07:00
Andy Grover	a63273d499	RDS: Clear up some confusing code in send_remove_from_sock The previous code was correct, but made the assumption that if r_notifier was non-NULL then either r_recverr or r_notify was true. Valid, but fragile. Changed to explicitly check r_recverr (shows up in greps for recverr now, too.) Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:40 -07:00
Andy Grover	f4dd96f7b2	RDS: make sure all sgs alloced are initialized rds_message_alloc_sgs() now returns correctly-initialized sg lists, so calleds need not do this themselves. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:39 -07:00
Andy Grover	ff87e97a9d	RDS: make m_rdma_op a member of rds_message This eliminates a separate memory alloc, although it is now necessary to add an "r_active" flag, since it is no longer to use the m_rdma_op pointer as an indicator of if an rdma op is present. rdma SGs allocated from rm sg pool. rds_rm_size also gets bigger. It's a little inefficient to run through CMSGs twice, but it makes later steps a lot smoother. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:38 -07:00
Andy Grover	21f79afa5f	RDS: fold rdma.h into rds.h RDMA is now an intrinsic part of RDS, so it's easier to just have a single header. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:11:37 -07:00

... 2 3 4 5 6 ...

16726 Commits