Commit Graph

99005 Commits

Author SHA1 Message Date
Pekka Enberg da02b23192 ipg: per-device max_rxframe_size
Add a ->max_rxframe member to struct ipg_nic_private and convert the users of
IPG_MAX_RXFRAME_SIZE to use it instead to enable per-device jumbo frame
configuration.

Tested-by: Andrew Savchenko <Bircoph@list.ru>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-07-04 08:46:53 -04:00
Pekka Enberg 39f205854c ipg: per-device rxsupport_size
Add a ->max_rxframe member to struct ipg_nic_private and convert the users of
IPG_RXSUPPORT_SIZE to use it instead to enable per-device jumbo frame
configuration.

Tested-by: Andrew Savchenko <Bircoph@list.ru>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-07-04 08:46:53 -04:00
Pekka Enberg 18a9cdb9c7 ipg: per-device rxfrag_size
Add a ->max_rxframe member to struct ipg_nic_private and convert the users of
IPG_RXFRAG_SIZE to use it instead to enable per-device jumbo frame
configuration.

Tested-by: Andrew Savchenko <Bircoph@list.ru>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-07-04 08:46:52 -04:00
Pekka Enberg 8304295521 ipg: remove jumbo frame #ifdef from mtu
Remove JUMBO_FRAME #ifdef from dev->mtu setting in ipg_nic_open() so that we
can make IPG_TXFRAG_SIZE configurable.

Tested-by: Andrew Savchenko <Bircoph@list.ru>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-07-04 08:46:51 -04:00
Pekka Enberg 024f4d88b4 ipg: always compile in jumbo frame support
Add a ->is_jumbo boolean to struct ipg_nic_private and fix up
ipg_interrupt_handler() to call the jumbo frame version of ipg_nic_rx() if the
boolean is set to true. Also remove the JUMBO_FRAME #ifdefs so we can always
compile in support for jumbo frames.

Tested-by: Andrew Savchenko <Bircoph@list.ru>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-07-04 08:46:51 -04:00
Jeff Garzik 16e605a2a0 Merge branch 'r8169-next' of git://git.kernel.org/pub/scm/linux/kernel/git/romieu/netdev-2.6 into upstream-next 2008-07-04 08:35:57 -04:00
David S. Miller 44d28ab19c Merge branch 'net-next-2.6-v6ready-20080703' of git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-next 2008-07-03 03:07:58 -07:00
YOSHIFUJI Hideaki e0835f8fa5 ipv4,ipv6 mroute: Add some helper inline functions to remove ugly ifdefs.
ip{,v6}_mroute_{set,get}sockopt() should not matter by optimization but
it would be better not to depend on optimization semantically.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:57 +09:00
Wang Chen 03d2f897e9 ipv4: Do cleanup for ip_mr_init
Same as ip6_mr_init(), make ip_mr_init() return errno if fails.
But do not do error handling in inet_init(), just print a msg.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:57 +09:00
Wang Chen 623d1a1af7 ipv6: Do cleanup for ip6_mr_init.
If do not do it, we will get following issues:
1. Leaving junks after inet6_init failing halfway.
2. Leaving proc and notifier junks after ipv6 modules unloading.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:56 +09:00
YOSHIFUJI Hideaki dd3abc4ef5 ipv6 route: Prefer outgoing interface with source address assigned.
Outgoing interface is selected by the route decision if unspecified.
Let's prefer routes via interface(s) with the address assigned if we
have multiple routes with same cost.
With help from Naohiro Ooiwa <nooiwa@miraclelinux.com>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:56 +09:00
YOSHIFUJI Hideaki 1b34be74cb ipv6 addrconf: add accept_dad sysctl to control DAD operation.
- If 0, disable DAD.
- If 1, perform DAD (default).
- If >1, perform DAD and disable IPv6 operation if DAD for MAC-based
  link-local address has been failed (RFC4862 5.4.5).

We do not follow RFC4862 by default.  Refer to the netdev thread entitled
"Linux IPv6 DAD not full conform to RFC 4862 ?"
	http://www.spinics.net/lists/netdev/msg52027.html

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:56 +09:00
YOSHIFUJI Hideaki 778d80be52 ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:55 +09:00
YOSHIFUJI Hideaki 5ce83afaac ipv6: Assume the loopback address in link-local scope.
Handle interface property strictly when looking up a route
for the loopback address (RFC4291 2.5.3).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:55 +09:00
YOSHIFUJI Hideaki f81b2e7d8c ipv6: Do not forward packets with the unspecified source address.
RFC4291 2.5.2.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:55 +09:00
YOSHIFUJI Hideaki d68b82705a ipv6: Do not assign non-valid address on interface.
Check the type of the address when adding a new one on interface.
- the unspecified address (::) is always disallowed (RFC4291 2.5.2)
- the loopback address is disallowed unless the interface is (one of)
  loopback (RFC4291 2.5.3).
- multicast addresses are disallowed.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 17:51:55 +09:00
Pavel Emelyanov 40b215e594 tcp: de-bloat a bit with factoring NET_INC_STATS_BH out
There are some places in TCP that select one MIB index to
bump snmp statistics like this:

	if (<something>)
		NET_INC_STATS_BH(<some_id>);
	else if (<something_else>)
		NET_INC_STATS_BH(<some_other_id>);
	...
	else
		NET_INC_STATS_BH(<default_id>);

or in a more tricky but still similar way.

On the other hand, this NET_INC_STATS_BH is a camouflaged
increment of percpu variable, which is not that small.

Factoring those cases out de-bloats 235 bytes on non-preemptible
i386 config and drives parts of the code into 80 columns.

add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
function                                     old     new   delta
tcp_fastretrans_alert                       1437    1424     -13
tcp_dsack_set                                137     124     -13
tcp_xmit_retransmit_queue                    690     676     -14
tcp_try_undo_recovery                        283     265     -18
tcp_sacktag_write_queue                     1550    1515     -35
tcp_update_reordering                        162     106     -56
tcp_retransmit_timer                         990     904     -86

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-03 01:05:41 -07:00
Santwona Behera b4653e9945 niu: Add support for rx flow hash configuration.
Implemented ethtool callback functions for configuring receive flow
hashing in the niu driver.

Signed-off-by: Santwona Behera <santwona.behera@sun.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-02 03:49:11 -07:00
Santwona Behera 0853ad66b1 netdev: Add support for rx flow hash configuration, using ethtool.
Added new interfaces to ethtool to configure receive network flow
distribution across multiple rx rings using hashing.

Signed-off-by: Santwona Behera <santwona.behera@sun.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-02 03:47:41 -07:00
Vlad Yasevich ecbed6a419 sctp: Mark GET_PEER|LOCAL_ADDR_OLD deprecated.
Socket options SCTP_GET_PEER_ADDR_OLD, SCTP_GET_PEER_ADDR_NUM_OLD,
SCTP_GET_LOCAL_ADDR_OLD, and SCTP_GET_PEER_LOCAL_ADDR_NUM_OLD
have been replaced by newer versions a since 2005.  It's time
to officially deprecate them and schedule them for removal.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01 20:06:22 -07:00
Stephen Hemminger 6dbf4bcac9 icmp: fix units for ratelimit
Convert the sysctl values for icmp ratelimit to use milliseconds instead
of jiffies which is based on kernel configured HZ.
Internal kernel jiffies are not a proper unit for any userspace API.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01 19:29:07 -07:00
Francois Romieu 865c652d6b r8169: remove non-napi code
It will almost unavoidably cause some breakage but it
is long overdue.

The driver identification string has been updated, a
lost tabulation and some unused code have been removed.
Otherwise the code paths should stay the same.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Edward Hsu <edward_hsu@realtek.com.tw>
2008-06-29 15:08:28 +02:00
Francois Romieu 1087f4f4af r8169: multicast register update (sync with Realtek's 8.004.00 8168 driver)
The layout of the 8168 serie is different from that of the 8110 one.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Edward Hsu <edward_hsu@realtek.com.tw>
2008-06-29 15:08:28 +02:00
David S. Miller 28f49d8fec Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/linville/wireless-next-2.6 2008-06-28 22:57:58 -07:00
David S. Miller 332e4af80d Merge branch 'davem-next' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6 2008-06-28 21:28:46 -07:00
Jeff Garzik be0976be91 [netdrvr] kill sync_irq-before-freq_irq pattern
synchronize_irq() is superfluous when free_irq() call immediately follows it,
because free_irq() also does a synchronize_irq() call of its own.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:38 -04:00
Jeff Garzik 28cd4289ab [netdrvr] fealnx: clean up nasty mess of arch ifdefs
Clean up config/burst value arch-specific setup.

* bcrvalue only varied by its big-endian bit
* crvalue only varied for certain types of x86-32 chips

This should make fealnx quite a bit more portable, without any behavior
change.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:37 -04:00
Harvey Harrison 445854f4c4 tulip: remove wrapper around get_unaligned
DE_UNALIGNED_16 is always being passed a u16 *, no need to have the
wrapper with two casts in it, just call get_unaligned directly.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:37 -04:00
Tobias Diedrich f5ccbcfaca Fix forcedeth hibernate/wake-on-lan problems
This patch is the minimal amount of code needed to support
wake-on-lan in platform mode properly (i.e. "ethtool -s eth0 wol g"
is sufficient, no additional magic needed) for me.

This is derived from David Brownells patch
(http://lists.laptop.org/pipermail/devel/2007-April/004691.html).
However I decided to move the hook into pci-acpi.c since the other
two pci hooks also live there and pci and acpi are the only users of
the platform_enable_wakeup-hook.

As a 'side-effect' this also makes wake on usb activity work for me
and I had to disable usb wakeup (which is enabled by default) using
the power/wakeup sysfs functionality ("echo disabled >
${sysfs_path_to_device}/power/wakeup").

(BTW I first thought the 'immediate reboot because of usb wake' effect is
caused by the optical mouse generating a wake event, but it rather
seems to be a problem with a flaky secondary usb host controller,
which sees a connected device where nothing is attached)

Signed-off-by: Tobias Diedrich <ranma+kernel@tdiedrich.de>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:36 -04:00
Tobias Diedrich 9a60a82600 Fix forcedeth hibernate/wake-on-lan problems
We currently don't signal the kernel we that this device can wake
the system.  Call device_init_wakeup() to correct this.
Without this device_can_wakeup and device_may_wakeup will return
incorrect values.
Together with the minimized acpi wakeup patch (6/4 ;)), which will
follow in the next mail, this really makes wake-on-lan work for me
as expected (i.e. "ethtool -s eth0 wol g" is sufficient, no
additional magic needed).

Signed-off-by: Tobias Diedrich <ranma+kernel@tdiedrich.de>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:35 -04:00
Márton Németh a9879c4fca 8139too: some style cleanups
Clean up the following errors and warnings reported by checkpatch.pl:
 + ERROR: Macros with complex values should be enclosed in parenthesis
 + WARNING: __func__ should be used instead of gcc specific __FUNCTION__
 + WARNING: plain inline is preferred over __inline__
 + WARNING: Use #include <linux/io.h> instead of <asm/io.h>
 + WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>

The changes were verified with by comparing the "objdump -d 8139too.ko"
output which is exactly the same for the old and new version in case of
config CONFIG_8139TOO=m, CONFIG_8139TOO_PIO=n, CONFIG_8139TOO_TUNE_TWISTER=n,
CONFIG_8139TOO_8129=n, CONFIG_8139_OLD_RX_RESET=n.
Software versions used: gcc 4.2.3, objdump 2.18.0.20080103, on elf32-i386.

Signed-off-by: Márton Németh <nm127@freemail.hu>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:35 -04:00
Jussi Kivilinna 818727badc rndis_host: pass buffer length to rndis_command
Pass buffer length to rndis_command so that rndis_command can read full
response buffer from device instead of max CONTROL_BUFFER_SIZE bytes.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:34 -04:00
Nobuhiro Iwamatsu 0caa11663c net: sh_eth: Fix compile error sh_eth
Fix compile error on sh_eth and remove base address macro.

Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:33 -04:00
Andy Gospodarek b45f87681e e1000: remove e1000_clean_tx_irq call from e1000_netpoll
The call to e1000_clean_tx_irq in e1000_netpoll can race with the call
to e1000_clean_tx_irq in e1000_clean.  With a small bit of tweaking to
to netpoll_send_skb to simulate a system that was under extreme stress,
I was able to reproduce these concurrent calls.  This can result in
multiple frees to the skbs on the tx ring buffer.

Dropping this call from e1000_netpoll should be fine since we can rely
on the calls in e1000_clean to do what is needed since napi will poll
the hardware just after calling poll_controller.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:32 -04:00
Taku Izumi 42bfd33ab7 igb: make ioport free
This patch makes igb driver ioport-free.
This corrects behavior in probe function so as not to request ioport
resources as long as they are not really needed.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:32 -04:00
Taku Izumi 6e4f6f6b40 e1000e: make ioport free
This patch makes e1000e driver ioport-free.
This corrects behavior in probe function so as not to request ioport
resources as long as they are not really needed.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:31 -04:00
Auke Kok d03157babe e1000: remove PCI Express device IDs
We do not want to prolong the situation much longer that e1000
and e1000e support these devices at the same time. As a result,
take out the bandage that was added for the interim period
and remove all the PCI Express device IDs from e1000.

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-06-28 10:23:30 -04:00
David S. Miller 1b63ba8a86 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/iwlwifi/iwl4965-base.c
2008-06-28 01:19:40 -07:00
YOSHIFUJI Hideaki d420895efb ipv6 route: Convert rt6_device_match() to use RT6_LOOKUP_F_xxx flags.
The commit 77d16f450a ("[IPV6] ROUTE:
Unify RT6_F_xxx and RT6_SELECT_F_xxx flags") intended to pass various
routing lookup hints around RT6_LOOKUP_F_xxx flags, but conversion was
missing for rt6_device_match().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:14:54 -07:00
Paul Moore 59d88c00ca netlabel: Fix a problem when dumping the default IPv6 static labels
There is a missing "!" in a conditional statement which is causing entries to
be skipped when dumping the default IPv6 static label entries.  This can be
demonstrated by running the following:

 # netlabelctl unlbl add default address:::1 \
                                 label:system_u:object_r:unlabeled_t:s0
 # netlabelctl -p unlbl list

... you will notice that the entry for the IPv6 localhost address is not
displayed but does exist (works correctly, causes collisions when attempting
to add duplicate entries, etc.).

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:12:32 -07:00
Eli Cohen 251a4b320f net/inet_lro: remove setting skb->ip_summed when not LRO-able
When an SKB cannot be chained to a session, the current code attempts
to "restore" its ip_summed field from lro_mgr->ip_summed. However,
lro_mgr->ip_summed does not hold the original value; in fact, we'd
better not touch skb->ip_summed since it is not modified by the code
in the path leading to a failure to chain it.  Also use a cleaer
comment to the describe the ip_summed field of struct net_lro_mgr.

Issue raised by Or Gerlitz <ogerlitz@voltaire.com>

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:09:00 -07:00
Pavel Emelyanov 9a375803fe inet fragments: fix race between inet_frag_find and inet_frag_secret_rebuild
The problem is that while we work w/o the inet_frags.lock even
read-locked the secret rebuild timer may occur (on another CPU, since
BHs are still disabled in the inet_frag_find) and change the rnd seed
for ipv4/6 fragments.

It was caused by my patch fd9e63544c
([INET]: Omit double hash calculations in xxx_frag_intern) late 
in the 2.6.24 kernel, so this should probably be queued to -stable.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:06:08 -07:00
Li Zefan a0a61a604c CONNECTOR: add a proc entry to list connectors
I got a problem when I wanted to check if the kernel supports process
event connector, and It seems there's no way to do this check.

At best I can check if the kernel supports connector or not, by looking
into /proc/net/netlink, or maybe checking the return value of bind() to
see if it's ENOENT.

So it would be useful to add /proc/net/connector to list all supported
connectors:
 # cat /proc/net/connector
 Name            ID
 connector       4294967295:4294967295
 cn_proc         1:1
 w1              3:1

Changelog:
- fix memory leak: s/seq_release/single_release
- use spin_lock_bh instead of spin_lock_irqsave

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:03:24 -07:00
Julius Volz 10b595aff1 netlink: Fix some doc comments in net/netlink/attr.c
Fix some doc comments to match function and attribute names in
net/netlink/attr.c.

Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:02:14 -07:00
Stephen Hemminger 7be87351a1 tcp: /proc/net/tcp rto,ato values not scaled properly (v2)
I found another case where we are sending information to userspace
in the wrong HZ scale.  This should have been fixed back in 2.5 :-(

This means an ABI change but as it stands there is no way for an application
like ss to get the right value.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:00:19 -07:00
Adrian Bunk c88e6f51c2 include/linux/netdevice.h: don't export MAX_HEADER to userspace
Due to the CONFIG_'s the value is anyway not correct in userspace.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:54:54 -07:00
Adrian Bunk ede16af4cd pkt_sched: Remove CONFIG_NET_SCH_RR
Commit d62733c8e4
([SCHED]: Qdisc changes and sch_rr added for multiqueue)
added a NET_SCH_RR option that was unused since the code
went unconditionally into sch_prio.

Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:54:05 -07:00
WANG Cong 01e123d79a pkt_sched: ERR_PTR() ususally encodes an negative errno, not positive.
Note, in the following patch, 'err' is initialized as:

int err = -ENOBUFS;

Signed-off-by: WANG Cong <wcong@critical-links.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:51:35 -07:00
Wang Chen 5dbaec5dc6 netdevice: Fix typo of dev_unicast_add() comment
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:35:16 -07:00
Rainer Weikusat ec0d215f94 af_unix: fix 'poll for write'/connected DGRAM sockets
For n:1 'datagram connections' (eg /dev/log), the unix_dgram_sendmsg
routine implements a form of receiver-imposed flow control by
comparing the length of the receive queue of the 'peer socket' with
the max_ack_backlog value stored in the corresponding sock structure,
either blocking the thread which caused the send-routine to be called
or returning EAGAIN. This routine is used by both SOCK_DGRAM and
SOCK_SEQPACKET sockets. The poll-implementation for these socket types
is datagram_poll from core/datagram.c. A socket is deemed to be
writeable by this routine when the memory presently consumed by
datagrams owned by it is less than the configured socket send buffer
size. This is always wrong for PF_UNIX non-stream sockets connected to
server sockets dealing with (potentially) multiple clients if the
abovementioned receive queue is currently considered to be full.
'poll' will then return, indicating that the socket is writeable, but
a subsequent write result in EAGAIN, effectively causing an (usual)
application to 'poll for writeability by repeated send request with
O_NONBLOCK set' until it has consumed its time quantum.

The change below uses a suitably modified variant of the datagram_poll
routines for both type of PF_UNIX sockets, which tests if the
recv-queue of the peer a socket is connected to is presently
considered to be 'full' as part of the 'is this socket
writeable'-checking code. The socket being polled is additionally
put onto the peer_wait wait queue associated with its peer, because the
unix_dgram_recvmsg routine does a wake up on this queue after a
datagram was received and the 'other wakeup call' is done implicitly
as part of skb destruction, meaning, a process blocked in poll
because of a full peer receive queue could otherwise sleep forever
if no datagram owned by its socket was already sitting on this queue.
Among this change is a small (inline) helper routine named
'unix_recvq_full', which consolidates the actual testing code (in three
different places) into a single location.

Signed-off-by: Rainer Weikusat <rweikusat@mssgmbh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 19:34:18 -07:00