Commit Graph

13654 Commits

Author SHA1 Message Date
Yi Zou d653479941 vlan: Add support to netdev_ops.ndo_fcoe_get_wwn for VLAN device
Implements the netdev_ops.ndo_fcoe_get_wwn for VLAN device.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:04:04 -07:00
Andreas Petlund ea84e5555a net: Corrected spelling error heurestics->heuristics
Corrected a spelling error in a function name.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 04:00:03 -07:00
Eric Dumazet ac5e3af999 net: sysfs: ethtool_ops can be NULL
commit d519e17e2d
(net: export device speed and duplex via sysfs)
made the wrong assumption that netdev->ethtool_ops was always set.

This makes possible to crash kernel and let rtnl in locked state.

modprobe dummy
ip link set dummy0 up
(udev runs and crash)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 03:59:46 -07:00
Eric Dumazet eef6dd65e3 gre: Optimize multiple unregistration
Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:09 -07:00
Eric Dumazet 0694c4c016 ipip: Optimize multiple unregistration
Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:08 -07:00
Eric Dumazet 63c8099d90 vlan: Optimize multiple unregistration
Use unregister_netdevice_many() to speedup master device unregister.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:08 -07:00
Eric Dumazet 23289a37e2 net: add a list_head parameter to dellink() method
Adding a list_head parameter to rtnl_link_ops->dellink() methods
allow us to queue devices on a list, in order to dismantle
them all at once.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:07 -07:00
Eric Dumazet 9b5e383c11 net: Introduce unregister_netdevice_many()
Introduce rollback_registered_many() and unregister_netdevice_many()

rollback_registered_many() is able to perform necessary steps at device dismantle
time, factorizing two expensive synchronize_net() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:06 -07:00
Eric Dumazet 44a0873d52 net: Introduce unregister_netdevice_queue()
This patchs adds an unreg_list anchor to struct net_device, and
introduces an unregister_netdevice_queue() function, able to queue
a net_device to a list instead of immediately unregister it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-28 02:22:06 -07:00
David S. Miller cfadf853f6 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/sh_eth.c
2009-10-27 01:03:26 -07:00
Eric Dumazet 05423b2413 vlan: allow null VLAN ID to be used
We currently use a 16 bit field (vlan_tci) to store VLAN ID/PRIO on a skb.

Null value is used as a special value, meaning vlan tagging not enabled.
This forbids use of null vlan ID.

As pointed by David, some drivers use the 3 high order bits (PRIO)

As VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag, and
allow null VLAN ID.

In case future code really wants to use VLAN_CFI_MASK, we'll have to use
a bit outside of vlan_tci.

#define VLAN_PRIO_MASK         0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT        13
#define VLAN_CFI_MASK          0x1000 /* Canonical Format Indicator */
#define VLAN_TAG_PRESENT       VLAN_CFI_MASK
#define VLAN_VID_MASK          0x0fff /* VLAN Identifier */

Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-27 01:02:33 -07:00
Eric Dumazet 66ed1e5ec1 pktgen: Dont leak kernel memory
While playing with pktgen, I realized IP ID was not filled and a
random value was taken, possibly leaking 2 bytes of kernel memory.
 
We can use an increasing ID, this can help diagnostics anyway.

Also clear packet payload, instead of leaking kernel memory.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:55:20 -07:00
Eric Dumazet 7c28bd0b8e rtnetlink: speedup rtnl_dump_ifinfo()
When handling large number of netdevice, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.

Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table.

This considerably speedups "ip link" operations

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:13:17 -07:00
Eric Dumazet 8d5b2c084d gre: convert hash tables locking to RCU
GRE tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:59 -07:00
Eric Dumazet 2922bc8aed ip6tnl: convert hash tables locking to RCU
ip6_tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:58 -07:00
Eric Dumazet 8f95dd63a2 ipip: convert hash tables locking to RCU
IPIP tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:57 -07:00
Eric Dumazet 91cc3bb0b0 xfrm6_tunnel: RCU conversion
xfrm6_tunnels use one rwlock to protect their hash tables.

Plain and straightforward conversion to RCU locking to permit better SMP
performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:57 -07:00
Eric Dumazet 4543c10de2 ipv6 sit: RCU conversion phase II
SIT tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:56 -07:00
Eric Dumazet ef9a9d1183 ipv6 sit: RCU conversion phase I
SIT tunnels use one rwlock to protect their prl entries.

This first patch adds RCU locking for prl management,
with standard call_rcu() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-24 06:07:55 -07:00
jamal 1c55d62e77 pkt_sched: skbedit add support for setting mark
This adds support for setting the skb mark.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-22 21:56:42 -07:00
Arjan van de Ven c62f4c453a net: use WARN() for the WARN_ON in commit b6b39e8f3f
Commit b6b39e8f3f (tcp: Try to catch MSG_PEEK bug) added a printk()
to the WARN_ON() that's in tcp.c. This patch changes this combination
to WARN(); the advantage of WARN() is that the printk message shows up
inside the message, so that kerneloops.org will collect the message.

In addition, this gets rid of an extra if() statement.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-22 21:37:56 -07:00
Eric Dumazet a3d1289126 rtnetlink: rtnl_setlink() and rtnl_getlink() changes
rtnl_getlink() & rtnl_setlink() run with RTNL held, we can use
__dev_get_by_index() and __dev_get_by_name() variants and avoid
dev_hold()/dev_put()

Adds to rtnl_getlink() the capability to find a device by its name,
not only by its index.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-22 04:33:48 -07:00
Krishna Kumar a4ee3ce329 net: Use sk_tx_queue_mapping for connected sockets
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:47 -07:00
Krishna Kumar ea94ff3b55 net: Fix for dst_negative_advice
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:46 -07:00
Krishna Kumar f04c827624 net: IPv6 changes
IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use existing
macro to do the work.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:45 -07:00
Krishna Kumar e022f0b4a0 net: Introduce sk_tx_queue_mapping
Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:45 -07:00
Eric Dumazet d19742fb1c filter: Add SKF_AD_QUEUE instruction
It can help being able to filter packets on their queue_mapping.

If filter performance is not good, we could add a "numqueue" field
in struct packet_type, so that netif_nit_deliver() and other functions
can directly ignore packets with not expected queue number.

Lets experiment this simple filter extension first.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 01:06:22 -07:00
Eric Dumazet ad959e76f0 af_packet: mc_drop/flush_mclist changes
We hold RTNL, we can use __dev_get_by_index() instead of dev_get_by_index()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 01:02:06 -07:00
Eric Dumazet 94b059520d af_packet: Avoid cache line dirtying
While doing multiple captures, I found af_packet was dirtying cache line
containing its prot_hook.

This slow down machines where several cpus are necessary to handle capture
traffic, as each prot_hook is traversed for each packet coming in or out
the host.

This patches moves "struct packet_type prot_hook" to the end of
packet_sock, and uses a ____cacheline_aligned_in_smp to make sure
this remains shared by all cpus.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 01:02:06 -07:00
Herbert Xu b6b39e8f3f tcp: Try to catch MSG_PEEK bug
This patch tries to print out more information when we hit the
MSG_PEEK bug in tcp_recvmsg.  It's been around since at least
2005 and it's about time that we finally fix it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 00:51:57 -07:00
John Dykstra 0eae750e60 IP: Cleanups
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.

Remove redundant test in ip_rt_frag_needed() (done by caller).

Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 23:22:52 -07:00
jamal 7e75f93eda pkt_sched: ingress socket filter by mark
Allow bpf to set a filter to drop packets that dont
match a specific mark

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 23:22:49 -07:00
Eric Dumazet 55b8050353 net: Fix IP_MULTICAST_IF
ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.

Fix is to call dev_get_by_index()/dev_put().

If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 21:34:20 -07:00
Dave Young 45054dc1bf bluetooth: static lock key fix
When shutdown ppp connection, lockdep waring about non-static key
will happen, it is caused by the lock is not initialized properly
at that time.

Fix with tuning the lock/skb_queue_head init order

[   94.339261] INFO: trying to register non-static key.
[   94.342509] the code is fine but needs lockdep annotation.
[   94.342509] turning off the locking correctness validator.
[   94.342509] Pid: 0, comm: swapper Not tainted 2.6.31-mm1 #2
[   94.342509] Call Trace:
[   94.342509]  [<c0248fbe>] register_lock_class+0x58/0x241
[   94.342509]  [<c024b5df>] ? __lock_acquire+0xb57/0xb73
[   94.342509]  [<c024ab34>] __lock_acquire+0xac/0xb73
[   94.342509]  [<c024b7fa>] ? lock_release_non_nested+0x17b/0x1de
[   94.342509]  [<c024b662>] lock_acquire+0x67/0x84
[   94.342509]  [<c04cd1eb>] ? skb_dequeue+0x15/0x41
[   94.342509]  [<c054a857>] _spin_lock_irqsave+0x2f/0x3f
[   94.342509]  [<c04cd1eb>] ? skb_dequeue+0x15/0x41
[   94.342509]  [<c04cd1eb>] skb_dequeue+0x15/0x41
[   94.342509]  [<c054a648>] ? _read_unlock+0x1d/0x20
[   94.342509]  [<c04cd641>] skb_queue_purge+0x14/0x1b
[   94.342509]  [<fab94fdc>] l2cap_recv_frame+0xea1/0x115a [l2cap]
[   94.342509]  [<c024b5df>] ? __lock_acquire+0xb57/0xb73
[   94.342509]  [<c0249c04>] ? mark_lock+0x1e/0x1c7
[   94.342509]  [<f8364963>] ? hci_rx_task+0xd2/0x1bc [bluetooth]
[   94.342509]  [<fab95346>] l2cap_recv_acldata+0xb1/0x1c6 [l2cap]
[   94.342509]  [<f8364997>] hci_rx_task+0x106/0x1bc [bluetooth]
[   94.342509]  [<fab95295>] ? l2cap_recv_acldata+0x0/0x1c6 [l2cap]
[   94.342509]  [<c02302c4>] tasklet_action+0x69/0xc1
[   94.342509]  [<c022fbef>] __do_softirq+0x94/0x11e
[   94.342509]  [<c022fcaf>] do_softirq+0x36/0x5a
[   94.342509]  [<c022fe14>] irq_exit+0x35/0x68
[   94.342509]  [<c0204ced>] do_IRQ+0x72/0x89
[   94.342509]  [<c02038ee>] common_interrupt+0x2e/0x34
[   94.342509]  [<c024007b>] ? pm_qos_add_requirement+0x63/0x9d
[   94.342509]  [<c038e8a5>] ? acpi_idle_enter_bm+0x209/0x238
[   94.342509]  [<c049d238>] cpuidle_idle_call+0x5c/0x94
[   94.342509]  [<c02023f8>] cpu_idle+0x4e/0x6f
[   94.342509]  [<c0534153>] rest_init+0x53/0x55
[   94.342509]  [<c0781894>] start_kernel+0x2f0/0x2f5
[   94.342509]  [<c0781091>] i386_start_kernel+0x91/0x96

Reported-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Tested-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:36:49 -07:00
Dave Young f74c77cb11 bluetooth: scheduling while atomic bug fix
Due to driver core changes dev_set_drvdata will call kzalloc which should be
in might_sleep context, but hci_conn_add will be called in atomic context

Like dev_set_name move dev_set_drvdata to work queue function.

oops as following:

Oct  2 17:41:59 darkstar kernel: [  438.001341] BUG: sleeping function called from invalid context at mm/slqb.c:1546
Oct  2 17:41:59 darkstar kernel: [  438.001345] in_atomic(): 1, irqs_disabled(): 0, pid: 2133, name: sdptool
Oct  2 17:41:59 darkstar kernel: [  438.001348] 2 locks held by sdptool/2133:
Oct  2 17:41:59 darkstar kernel: [  438.001350]  #0:  (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.+.}, at: [<faa1d2f5>] lock_sock+0xa/0xc [l2cap]
Oct  2 17:41:59 darkstar kernel: [  438.001360]  #1:  (&hdev->lock){+.-.+.}, at: [<faa20e16>] l2cap_sock_connect+0x103/0x26b [l2cap]
Oct  2 17:41:59 darkstar kernel: [  438.001371] Pid: 2133, comm: sdptool Not tainted 2.6.31-mm1 #2
Oct  2 17:41:59 darkstar kernel: [  438.001373] Call Trace:
Oct  2 17:41:59 darkstar kernel: [  438.001381]  [<c022433f>] __might_sleep+0xde/0xe5
Oct  2 17:41:59 darkstar kernel: [  438.001386]  [<c0298843>] __kmalloc+0x4a/0x15a
Oct  2 17:41:59 darkstar kernel: [  438.001392]  [<c03f0065>] ? kzalloc+0xb/0xd
Oct  2 17:41:59 darkstar kernel: [  438.001396]  [<c03f0065>] kzalloc+0xb/0xd
Oct  2 17:41:59 darkstar kernel: [  438.001400]  [<c03f04ff>] device_private_init+0x15/0x3d
Oct  2 17:41:59 darkstar kernel: [  438.001405]  [<c03f24c5>] dev_set_drvdata+0x18/0x26
Oct  2 17:41:59 darkstar kernel: [  438.001414]  [<fa51fff7>] hci_conn_init_sysfs+0x40/0xd9 [bluetooth]
Oct  2 17:41:59 darkstar kernel: [  438.001422]  [<fa51cdc0>] ? hci_conn_add+0x128/0x186 [bluetooth]
Oct  2 17:41:59 darkstar kernel: [  438.001429]  [<fa51ce0f>] hci_conn_add+0x177/0x186 [bluetooth]
Oct  2 17:41:59 darkstar kernel: [  438.001437]  [<fa51cf8a>] hci_connect+0x3c/0xfb [bluetooth]
Oct  2 17:41:59 darkstar kernel: [  438.001442]  [<faa20e87>] l2cap_sock_connect+0x174/0x26b [l2cap]
Oct  2 17:41:59 darkstar kernel: [  438.001448]  [<c04c8df5>] sys_connect+0x60/0x7a
Oct  2 17:41:59 darkstar kernel: [  438.001453]  [<c024b703>] ? lock_release_non_nested+0x84/0x1de
Oct  2 17:41:59 darkstar kernel: [  438.001458]  [<c028804b>] ? might_fault+0x47/0x81
Oct  2 17:41:59 darkstar kernel: [  438.001462]  [<c028804b>] ? might_fault+0x47/0x81
Oct  2 17:41:59 darkstar kernel: [  438.001468]  [<c033361f>] ? __copy_from_user_ll+0x11/0xce
Oct  2 17:41:59 darkstar kernel: [  438.001472]  [<c04c9419>] sys_socketcall+0x82/0x17b
Oct  2 17:41:59 darkstar kernel: [  438.001477]  [<c020329d>] syscall_call+0x7/0xb

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:36:45 -07:00
Julian Anastasov b103cf3438 tcp: fix TCP_DEFER_ACCEPT retrans calculation
Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:19:06 -07:00
Julian Anastasov 0c3d79bce4 tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:19:03 -07:00
Julian Anastasov d1b99ba41d tcp: accept socket after TCP_DEFER_ACCEPT period
Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:19:01 -07:00
David S. Miller a1a2ad9151 Revert "tcp: fix tcp_defer_accept to consider the timeout"
This reverts commit 6d01a026b7.

Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
with a more correct way to deal with this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19 19:12:36 -07:00
Tomoki Sekiyama 77238f2b94 AF_UNIX: Fix deadlock on connecting to shutdown socket
I found a deadlock bug in UNIX domain socket, which makes able to DoS
attack against the local machine by non-root users.

How to reproduce:
1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
    namespace(*), and shutdown(2) it.
 2. Repeat connect(2)ing to the listening socket from the other sockets
    until the connection backlog is full-filled.
 3. connect(2) takes the CPU forever. If every core is taken, the
    system hangs.

PoC code: (Run as many times as cores on SMP machines.)

int main(void)
{
	int ret;
	int csd;
	int lsd;
	struct sockaddr_un sun;

	/* make an abstruct name address (*) */
	memset(&sun, 0, sizeof(sun));
	sun.sun_family = PF_UNIX;
	sprintf(&sun.sun_path[1], "%d", getpid());

	/* create the listening socket and shutdown */
	lsd = socket(AF_UNIX, SOCK_STREAM, 0);
	bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
	listen(lsd, 1);
	shutdown(lsd, SHUT_RDWR);

	/* connect loop */
	alarm(15); /* forcely exit the loop after 15 sec */
	for (;;) {
		csd = socket(AF_UNIX, SOCK_STREAM, 0);
		ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
		if (-1 == ret) {
			perror("connect()");
			break;
		}
		puts("Connection OK");
	}
	return 0;
}

(*) Make sun_path[0] = 0 to use the abstruct namespace.
    If a file-based socket is used, the system doesn't deadlock because
    of context switches in the file system layer.

Why this happens:
 Error checks between unix_socket_connect() and unix_wait_for_peer() are
 inconsistent. The former calls the latter to wait until the backlog is
 processed. Despite the latter returns without doing anything when the
 socket is shutdown, the former doesn't check the shutdown state and
 just retries calling the latter forever.

Patch:
 The patch below adds shutdown check into unix_socket_connect(), so
 connect(2) to the shutdown socket will return -ECONREFUSED.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 23:17:37 -07:00
Steffen Klassert eb2ff967a5 xfrm: remove skb_icv_walk
The last users of skb_icv_walk are converted to ahash now,
so skb_icv_walk is unused and can be removed.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 21:32:01 -07:00
Steffen Klassert 8631e9bdfe ah6: convert to ahash
This patch converts ah6 to the new ahash interface.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 21:31:59 -07:00
Steffen Klassert dff3bb0626 ah4: convert to ahash
This patch converts ah4 to the new ahash interface.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 21:31:59 -07:00
Eric Dumazet 8edf19c2fe net: sk_drops consolidation part 2
- skb_kill_datagram() can increment sk->sk_drops itself, not callers.

- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 18:52:54 -07:00
Eric Dumazet c720c7e838 inet: rename some inet_sock fields
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 18:52:53 -07:00
Krishna Kumar 988ade6b8e genetlink: Optimize and one bug fix in genl_generate_id()
1. GENL_MIN_ID is a valid id -> no need to start at
   GENL_MIN_ID + 1.
2. Avoid going through the ids two times: If we start at
   GENL_MIN_ID+1 (*or bigger*) and all ids are over!, the
   code iterates through the list twice (*or lesser*).
3. Simplify code - no need to start at idx=0 which gets
   reset to GENL_MIN_ID.

Patch on net-next-2.6. Reboot test shows that first id
passed to genl_register_family was 16, next two were
GENL_ID_GENERATE and genl_generate_id returned 17 & 18
(user level testing of same code shows expected values
across entire range of MIN/MAX).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-17 23:57:26 -07:00
Krishna Kumar 93860b08e3 genetlink: Optimize genl_register_family()
genl_register_family() doesn't need to call genl_family_find_byid
when GENL_ID_GENERATE is passed during register.

Patch on net-next-2.6, compile and reboot testing only.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-17 23:57:26 -07:00
Ursula Braun 9a4ff8d417 af_iucv: remove duplicate sock_set_flag
Remove duplicate sock_set_flag(sk, SOCK_ZAPPED) in iucv_sock_close,
which has been overlooked in September-commit
7514bab04e.

Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-17 23:57:20 -07:00
Hendrik Brueckner 49f5eba757 af_iucv: use sk functions to modify sk->sk_ack_backlog
Instead of modifying sk->sk_ack_backlog directly, use respective
socket functions.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-17 23:57:19 -07:00
Rémi Denis-Courmont 21912d1ca2 Phonet: hold socket before giving it to sk_deliver_skb()
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-15 12:30:42 -07:00