Commit Graph

146377 Commits

Author SHA1 Message Date
Pablo Neira Ayuso
dd7669a92c netfilter: conntrack: optional reliable conntrack event delivery
This patch improves ctnetlink event reliability if one broadcast
listener has set the NETLINK_BROADCAST_ERROR socket option.

The logic is the following: if an event delivery fails, we keep
the undelivered events in the missed event cache. Once the next
packet arrives, we add the new events (if any) to the missed
events in the cache and we try a new delivery, and so on. Thus,
if ctnetlink fails to deliver an event, we try to deliver them
once we see a new packet. Therefore, we may lose state
transitions but the userspace process gets in sync at some point.

At worst case, if no events were delivered to userspace, we make
sure that destroy events are successfully delivered. Basically,
if ctnetlink fails to deliver the destroy event, we remove the
conntrack entry from the hashes and we insert them in the dying
list, which contains inactive entries. Then, the conntrack timer
is added with an extra grace timeout of random32() % 15 seconds
to trigger the event again (this grace timeout is tunable via
/proc). The use of a limited random timeout value allows
distributing the "destroy" resends, thus, avoiding accumulating
lots "destroy" events at the same time. Event delivery may
re-order but we can identify them by means of the tuple plus
the conntrack ID.

The maximum number of conntrack entries (active or inactive) is
still handled by nf_conntrack_max. Thus, we may start dropping
packets at some point if we accumulate a lot of inactive conntrack
entries that did not successfully report the destroy event to
userspace.

During my stress tests consisting of setting a very small buffer
of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
flag, and generating lots of very small connections, I noticed
very few destroy entries on the fly waiting to be resend.

A simple way to test this patch consist of creating a lot of
entries, set a very small Netlink buffer in conntrackd (+ a patch
which is not in the git tree to set the BROADCAST_ERROR flag)
and invoke `conntrack -F'.

For expectations, no changes are introduced in this patch.
Currently, event delivery is only done for new expectations (no
events from expectation expiration, removal and confirmation).
In that case, they need a per-expectation event cache to implement
the same idea that is exposed in this patch.

This patch can be useful to provide reliable flow-accouting. We
still have to add a new conntrack extension to store the creation
and destroy time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:30:52 +02:00
Pablo Neira Ayuso
d219dce76c list_nulls: add hlist_nulls_add_head and hlist_nulls_del
This patch adds the hlist_nulls_add_head() function which is
based on hlist_nulls_add_head_rcu() but without the use of
rcu_assign_pointer(). It also adds hlist_nulls_del which is
exactly the same like hlist_nulls_del_rcu().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:28:57 +02:00
Pablo Neira Ayuso
9858a3ae1d netfilter: conntrack: move helper destruction to nf_ct_helper_destroy()
This patch moves the helper destruction to a function that lives
in nf_conntrack_helper.c. This new function is used in the patch
to add ctnetlink reliable event delivery.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:28:22 +02:00
Pablo Neira Ayuso
a0891aa6a6 netfilter: conntrack: move event caching to conntrack extension infrastructure
This patch reworks the per-cpu event caching to use the conntrack
extension infrastructure.

The main drawback is that we consume more memory per conntrack
if event delivery is enabled. This patch is required by the
reliable event delivery that follows to this patch.

BTW, this patch allows you to enable/disable event delivery via
/proc/sys/net/netfilter/nf_conntrack_events in runtime, although
you can still disable event caching as compilation option.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:26:29 +02:00
Patrick McHardy
65cb9fda32 netfilter: nf_conntrack: use mod_timer_pending() for conntrack refresh
Use mod_timer_pending() instead of atomic sequence of del_timer()/
add_timer(). mod_timer_pending() does not rearm an inactive timer,
so we don't need the conntrack lock anymore to make sure we don't
accidentally rearm a timer of a conntrack which is in the process
of being destroyed.

With this change, we don't need to take the global lock anymore at all,
counter updates can be performed under the per-conntrack lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:21:49 +02:00
Patrick McHardy
266d07cb1c netfilter: nf_log: fix sleeping function called from invalid context
Fix regression introduced by 17625274 "netfilter: sysctl support of
logger choice":

BUG: sleeping function called from invalid context at /mnt/s390test/linux-2.6-tip/arch/s390/include/asm/uaccess.h:234
in_atomic(): 1, irqs_disabled(): 0, pid: 3245, name: sysctl
CPU: 1 Not tainted 2.6.30-rc8-tipjun10-02053-g39ae214 #1
Process sysctl (pid: 3245, task: 000000007f675da0, ksp: 000000007eb17cf0)
0000000000000000 000000007eb17be8 0000000000000002 0000000000000000
       000000007eb17c88 000000007eb17c00 000000007eb17c00 0000000000048156
       00000000003e2de8 000000007f676118 000000007eb17f10 0000000000000000
       0000000000000000 000000007eb17be8 000000000000000d 000000007eb17c58
       00000000003e2050 000000000001635c 000000007eb17be8 000000007eb17c30
Call Trace:
(Ý<00000000000162e6>¨ show_trace+0x13a/0x148)
 Ý<00000000000349ea>¨ __might_sleep+0x13a/0x164
 Ý<0000000000050300>¨ proc_dostring+0x134/0x22c
 Ý<0000000000312b70>¨ nf_log_proc_dostring+0xfc/0x188
 Ý<0000000000136f5e>¨ proc_sys_call_handler+0xf6/0x118
 Ý<0000000000136fda>¨ proc_sys_read+0x26/0x34
 Ý<00000000000d6e9c>¨ vfs_read+0xac/0x158
 Ý<00000000000d703e>¨ SyS_read+0x56/0x88
 Ý<0000000000027f42>¨ sysc_noemu+0x10/0x16

Use the nf_log_mutex instead of RCU to fix this.

Reported-and-tested-by: Maran Pakkirisamy <maranpsamy@in.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:21:10 +02:00
Herbert Xu
8981f01001 virtio_net: Fix IP alignment on non-mergeable RX path
We need to enforce the IP alignment on the non-mergeable RX path just
like the other RX path.  Not doing so results in misaligned IP
headers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 20:55:17 -07:00
David S. Miller
adf76cfe24 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2009-06-11 20:00:44 -07:00
Eric Dumazet
17d0cdfa8f net: ntohs() misuse
Some drivers incorrectly use ntohs() instead of htons()

A cleanup as htons() returns same result than ntohs(),
but better to use the proper one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 17:23:24 -07:00
David S. Miller
3ee40c376a Merge branch 'linux-2.6.31.y' of git://git.kernel.org/pub/scm/linux/kernel/git/inaky/wimax 2009-06-11 17:11:33 -07:00
Patrick McHardy
24992eacd8 netfilter: ip_tables: fix build error
Fix build error introduced by commit bb70dfa5 (netfilter: xtables:
consolidate comefrom debug cast access):

net/ipv4/netfilter/ip_tables.c: In function 'ipt_do_table':
net/ipv4/netfilter/ip_tables.c:421: error: 'comefrom' undeclared (first use in this function)
net/ipv4/netfilter/ip_tables.c:421: error: (Each undeclared identifier is reported only once
net/ipv4/netfilter/ip_tables.c:421: error: for each function it appears in.)

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-12 01:53:09 +02:00
Inaky Perez-Gonzalez
98eb0f53e2 wimax: fix gcc warnings in sh4 when calling BUG()
SH4's BUG() seems to confuse the compiler as it is considered to
return; thus, some functions would trigger usage of uninitialized
variables or non-void functions returning void.

Work around by initializing/returning.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 11:47:39 -07:00
Inaky Perez-Gonzalez
d2f4c10544 wimax: fix warning caused by not checking retval of rfkill_set_hw_state()
Caused by an API update. The return value can be safely ignored, as
there is notthing we can do with it.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 11:12:48 -07:00
Karsten Keil
670025478c mISDN: cleanup mISDNhw.h
Remove unused stuff.

Signed-off-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:05:32 +02:00
Karsten Keil
8164491dd6 mISDN: Do not disable IRQ in ph_data_ind()
This fix triggering the WARN_ON_ONCE(in_irq() || irqs_disabled()); in
local_bh_enable().

Here is no need to grab this lock, this was wrong at all and may
cause a deadlock and access to freed memory, since on a TEI remove
the current listelement can be deleted under us. So this is clearly
a case for list_for_each_entry_safe.

Signed-off-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:05:18 +02:00
Roel Kluin
395df11f5f drivers/isdn/i4l/isdn_tty.c: fix check for array overindexing
The check for overindexing of dev->mdm.info[] has an off-by-one.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:05:10 +02:00
Andreas Mohr
cdae28e1a2 mISDN: Free hfcpci IRQ if init was not successful
If we get no interrupts for after 3 resets we need to unregister
the interrupt function, which is already done outside the loop.

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:04:59 +02:00
Karsten Keil
1ce1513f48 mISDN: Fix overlapping data access
Remove code rewriting a buffer by itself.
This fix bug 12970 on bugzilla.kernel.org.

Signed-off-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:04:54 +02:00
Karsten Keil
8a745b9d91 ISDN:Fix DMA alloc for hfcpci
Replace wrong code with correct DMA API functions.

Signed-off-by: Karsten Keil <keil@b1-systems.de>
2009-06-11 19:04:48 +02:00
Patrick McHardy
334a47f634 netfilter: nf_ct_tcp: fix up build after merge
Replace the last occurence of tcp_lock by the per-conntrack lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-11 16:16:09 +02:00
Patrick McHardy
36432dae73 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-06-11 16:00:49 +02:00
David S. Miller
bb400801c2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-next-2.6 2009-06-11 05:47:43 -07:00
Stephen Hemminger
130aa61a77 bonding: fix multiple module load problem
Some users still load bond module multiple times to create bonding
devices.  This accidentally was broken by a later patch about
the time sysfs was fixed.  According to Jay, it was broken
by:
   commit b8a9787edd
   Author: Jay Vosburgh <fubar@us.ibm.com>
   Date:   Fri Jun 13 18:12:04 2008 -0700

     bonding: Allow setting max_bonds to zero

Note: sysfs and procfs still produce WARN() messages when this is done
so the sysfs method is the recommended API.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 05:46:04 -07:00
Timo Teras
5ef12d98a1 neigh: fix state transition INCOMPLETE->FAILED via Netlink request
The current code errors out the INCOMPLETE neigh entry skb queue only from
the timer if maximum probes have been attempted and there has been no reply.
This also causes the transtion to FAILED state.

However, the neigh entry can be also updated via Netlink to inform that the
address is unavailable.  Currently, neigh_update() just stops the timers and
leaves the pending skb's unreleased. This results that the clean up code in
the timer callback is never called, preventing also proper garbage collection.

This fixes neigh_update() to process the pending skb queue immediately if
INCOMPLETE -> FAILED state transtion occurs due to a Netlink request.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 04:16:28 -07:00
Cindy H Kao
0bcfc5ef01 wimax/i2400m: use -EL3RST to indicate device reset instead of -ERESTARTSYS
When the i2400m device resets, the driver code will force some
functions to return a -ERESTARTSYS error code, which can is used by
the caller to determine which recovery actions to take.

However, in certain situations the only thing that can be done is to
bubble up said error code to user space, for handling.

However, -ERESTARSYS was a poor choice, as it is supposed to be used
by the kernel only.

As such, replace -ERESTARTSYS with -EL3RST; as well, in
i2400m_msg_to_dev(), when the device is in boot mode (following a
recent reset), return -EL3RST instead of -ENODEV (meaning the device
is in bootrom mode after a reset, not that the device was
disconnected, and thus, normal commands cannot be executed).

Signed-off-by: Cindy H Kao <cindy.h.kao@intel.com>
2009-06-11 03:30:26 -07:00
Cindy H Kao
8b5b30ee7d wimax/i2400m: when bootstrap fails, reinitialize the bootrom
When a device reset happens during firmware load [in
i2400m_dev_bootstrap()], __i2400m_dev_start() will retry a number of
times. However, for those retries to be able to accomplish anything,
the device's bootrom has to be reinitialized.

Thus, on the retry path, pass the I2400M_MAC_REINIT to the firmware
load code.

Signed-off-by: Cindy H Kao <cindy.h.kao@intel.com>
2009-06-11 03:30:26 -07:00
Inaky Perez-Gonzalez
16820c166d wimax/i2400m/sdio: Move all the RX code to a unified, IRQ based receive routine
The current SDIO code was working in polling mode for boot-mode
(firmware load) mode. This was causing issues on some hardware.

Moved all the RX code to use a unified IRQ handler that based on the
type of data the device is sending can discriminate and decide which
is the right destination.

As well, all the reads from the device are made to be at least the
block size (256); the driver will ignore the rest when not needed.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:25 -07:00
Inaky Perez-Gonzalez
6e053d6c79 wimax/i2400m: don't reset device when bootrom init retries are exceeded
When i2400m_bootrom_init() fails to put the device into a state of
being ready to accept firmware, the driver was currently trying to
reset it if it failed to do so. This is not too useful; as part of
trying to put the device in the right state a few resets have already
been tried.

At this point, things are probably fried out and an extra reset might
do more harm than good (for example causing reseting of other
functions in the same composite device).

So it is left up to the callers to determine the error path to take
(at the end this is always i2400m_setup(), who depending on how many
retries are left, might give up on the device).

From a fix by Cindy H. Kao.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:25 -07:00
Dirk Brandewie
1c0b2dd757 wimax/i2400m/sdio: Add device specific poke table.
Add a poke table for the SDIO device (as it is different than USB).

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
2009-06-11 03:30:24 -07:00
Dirk Brandewie
7308a0c239 wimax/i2400m: move boot time poke table out of common driver
This change moves the table of "pokes" performed on the device at boot
time to the bus specific portion of the driver.

Different models of the i2400m device supported by this driver require
different poke tables, thus having a single table that works for all
is impossible. For that, the table is moved to the bus-specific
driver, who can decide which table to use based on the specifics of
the device and point the generic driver to it.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
2009-06-11 03:30:24 -07:00
Inaky Perez-Gonzalez
ecddfd5ed7 wimax/i2400m: Allow bus-specific driver to specify retry count
The code that sets up the i2400m (firmware load and general driver
setup after it) includes a couple of retry loops.

The SDIO device sometimes can get in more complicated corners than the
USB one (due to its interaction with other SDIO functions), that
require trying a few more times.

To solve that, without having a failing USB device taking longer to be
considered dead, allow the retry counts to be specified by the
bus-specific driver, which the general driver takes as a parameter.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:23 -07:00
Inaky Perez-Gonzalez
b4013f91cd wimax/i2400m: if a device reboot happens during probe, handle it
When a device reboot happens when we are under probe, with init_mutex
taken, make sure we can recover. Have dev_reset_handle set boot mode
and i2400m_msg_to_dev() will see it and fail gracefully instead of
timing out.

Found and diagnosed by Cindy H. Kao.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:23 -07:00
Inaky Perez-Gonzalez
59063afa0a wimax/i2400m: fix oops when the TX FIFO fills up due to a missing check
When the TX FIFO filled up and i2400m_tx_new() failed to allocate a
new TX message header, a missing check for said condition was causing a
kernel oops when trying to dereference a NULL i2400m->tx_msg pointer.

Found and diagnosed by Cindy H. Kao.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:22 -07:00
Inaky Perez-Gonzalez
b4bd07e3b1 wimax/i2400m: don't reset device on i2400m_dev_shutdown()
i2400m_dev_shutdown() tried to reset the device to put it in a known
state before shutting down.

But that turned out to be pointless. We reach this case in two paths:

 1 - when the device resets, to clean up state
 2 - when the driver is unloaded, for the same

however, in both cases it is pointless; in (1) the device is already
reset, why do it again? in (2) we can't -- the USB stack, for example,
doesn't allow communicating with the device when the driver is being
unbound and if the device is disconnected, the device is gone already.

So just remove it. Leave the function as a placeholder for future
cleanups that will be done from data allocated by the driver during
device operation.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:22 -07:00
Inaky Perez-Gonzalez
2971a5bac8 wimax/i2400m: fix panic due to missed corner cases on tail_room calculation
i2400m_tx_skip_tail() needs to handle the special case of being called
when the tail room that is left over in the FIFO is zero.

This happens when a TX message header was opened at the very end of
the FIFO (without payloads). The i2400m_tx_close() code already marked
said TX message (header) to be skipped and this function should be
doing nothing.

It is called anyway because it is part of a common "corner case" path
handling which takes care of more cases than only this one.

The tail room computation was also improved to take care of the case
when tx_in is at the end of the buffer boundary; tail_room has to be
modded (%) to the buffer size. To do that in a single well-documented
place, __i2400m_tx_tail_room() is introduced and used.

Treat i2400m->tx_in == 0 as a corner case and handle it accordingly.

Found and diagnosed by Cindy H. Kao.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:21 -07:00
Inaky Perez-Gonzalez
c56affafdd wimax/i2400m: fix panic/warnings caused by missed check on empty TX message
In some situations, when a new TX message header is started, there
might be no space for data payloads. In this case the message is left
with zero payloads and the i2400m_tx_close() function has just to mark
it as "to skip". If it tries to go ahead it will overwrite things
because there is no space to add padding as defined by the
bus-specific layer. This can cause buffer overruns and in some stress
cases, panics.

Found and diagnosed by Cindy H. Kao.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:21 -07:00
Inaky Perez-Gonzalez
8593a1967f wimax/i2400m: rename misleading I2400M_PL_PAD to I2400M_PL_ALIGN
The constant is being use as an alignment factor, not as a padding
factor; made reading/reviewing the code quite confusing.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:20 -07:00
Dirk Brandewie
10b1de6b77 wimax/i2400m/sdio: Implement I2400M_RT_BUS reset type
This reset type causes the WiMAX function to be disabled and
re-enabled, which will force the WiMAX device to reset and enter boot
mode.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
2009-06-11 03:30:20 -07:00
Dirk Brandewie
ead6823991 wimax/i2400m: Change d_printf() level for secure boot messages
Changing debug level of print out to support validation engineers
getting the messages they need.

Signed-off-by:  <dirk.j.brandewie@intel.com>
2009-06-11 03:30:19 -07:00
Inaky Perez-Gonzalez
16eafba8de wimax/i2400m: i2400m_schedule_work() doesn't need i2400m->work_queue
By mistake, the BUG_ON() check was left in there and it will fail when
called if i2400m->work_queue is still not setup.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:19 -07:00
Inaky Perez-Gonzalez
e9a6b45be5 wimax/i2400m: i2400m's work queue should be initialized before RX support
RX support is the only user of the work-queue, to process
reports/notifications from the device. Thus, it needs the work queue
to be initialized first.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:18 -07:00
Inaky Perez-Gonzalez
fff1068559 wimax/i2400m: don't call netif_start_queue() in _tx_msg_sent()
Reported and fixed by Cindy H Kao.

When the device is stopped __i2400m_dev_stop() stops the network
queue.

However, when this is done in the middle of heavy network operation,
when the bus-specific subdriver is still wrapping up and it reports a
sent TX transaction with _tx_msg_sent() right after the device was
stopped, the queue was being started again, which was causing a stream
of oopsen and finally a panic.

In any case, said call has no place there. It's a left over from an
early implementation that was discarded later on.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:18 -07:00
Inaky Perez-Gonzalez
fb10167478 wimax/i2400m: introduce module parameter to disable entering power save
The i2400m driver waits for the device to report being ready for
entering power save before asking it to do so. This module parameter
allows control of said operation; if disabled, the driver won't ask
the device to enter power save mode.

This is useful in setups where power saving is not so important or
when the overhead imposed by network reentry after power save is not
acceptable; by combining this with parameter 'idle_mode_disabled', the
driver will always maintain both the connection and the device in
active state.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
2009-06-11 03:30:17 -07:00
Eric Dumazet
2b85a34e91 net: No more expensive sock_hold()/sock_put() on each tx
One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.

skb_set_owner_w() doesnt call sock_hold() anymore

sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:55:43 -07:00
Figo.zhang
f2333a014c netxen: No need to check vfree() pointer.
vfree() does its own 'NULL' check, so no need for check before
calling it.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:49:10 -07:00
Sathya Perla
934037bc2e be2net: Fix be_tx_q_clean() being called on freed queues
In the tx queue destroy path, be_tx_q_clean() is currently called after the tx queues are freed; it must be called before.

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:47:19 -07:00
Sathya Perla
a7a0ef31de be2net: Fix early reset of rx-completion
be_rx_compl_get() must not reset(via the valid word) the rx_compl as the rx_compl is not processed yet; it must be reset after it is processed.

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:47:18 -07:00
Sathya Perla
76fbb42919 be2net: Fix rx stats updation in non-lro path
rx stats are not getting updated when an rx_compl with only one frag is rcvd in non-lro path.

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:47:16 -07:00
Sathya Perla
6811086899 be2net: fix netdev stats rx_errors and rx_dropped
Fix netdev stat rx_errors to cover length related errors and checksum errors and rx_dropped to the pkts dropped due to lack of buffers

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:47:16 -07:00
Sathya Perla
b305be78a0 be2net: Use cancel_delayed_work_sync instead of cancel_delayed_work()
Use cancel_delayed_work_sycn instead of cancel_delayed_work() to reliably kill be_worker() as it rearms itself.

Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:47:15 -07:00