Commit Graph

1796 Commits

Author SHA1 Message Date
Eric Dumazet 7fee226ad2 net: add a noref bit on skb dst
Use low order bit of skb->_skb_dst to tell dst is not refcounted.

Change _skb_dst to _skb_refdst to make sure all uses are catched.

skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.

skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().

Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:50 -07:00
Eric Dumazet ebda37c27d rps: avoid one atomic in enqueue_to_backlog
If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic
test_and_set_bit() from napi_schedule_prep().

We now have same number of atomic ops per netif_rx() calls than with
pre-RPS kernel.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:50 -07:00
David S. Miller 6811d58fc1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	include/linux/if_link.h
2010-05-16 22:26:58 -07:00
Chris Wright c02db8c629 rtnetlink: make SR-IOV VF interface symmetric
Now we have a set of nested attributes:

  IFLA_VFINFO_LIST (NESTED)
    IFLA_VF_INFO (NESTED)
      IFLA_VF_MAC
      IFLA_VF_VLAN
      IFLA_VF_TX_RATE

This allows a single set to operate on multiple attributes if desired.
Among other things, it means a dump can be replayed to set state.

The current interface has yet to be released, so this seems like
something to consider for 2.6.34.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 01:05:45 -07:00
Eric Dumazet a465419b1f net: Introduce sk_route_nocaps
TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.

Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 00:36:33 -07:00
Eric Dumazet 3b098e2d7c net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in RX path.

If netif_receive_skb() is used, its deferred after RPS dispatch.

If netif_rx() is used, its done before RPS dispatch.

This can give strange tcpdump timestamps results.

I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (ie timestamps taken at the time packet
was delivered by NIC driver to our stack), even if NAPI already can
defer timestamping a bit (RPS can help to reduce the gap)

Tom Herbert prefer to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.

Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue

Its default value (1), means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.

Setting a 0 value permits to sample timestamps when processing backlog,
after RPS dispatch, to lower the load of the pre-RPS cpu.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:57:10 -07:00
Jiri Pirko a14462f1bd net: adjust handle_macvlan to pass port struct to hook
Now there's null check here and also again in the hook. Looking at bridge bits
which are simmilar, port structure is rcu_dereferenced right away in
handle_bridge and passed to hook. Looks nicer.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:48:02 -07:00
Steven Rostedt 38516ab59f tracing: Let tracepoints have data passed to tracepoint callbacks
This patch adds data to be passed to tracepoint callbacks.

The created functions from DECLARE_TRACE() now need a mandatory data
parameter. For example:

DECLARE_TRACE(mytracepoint, int value, value)

Will create the register function:

int register_trace_mytracepoint((void(*)(void *data, int value))probe,
                                void *data);

As the first argument, all callbacks (probes) must take a (void *data)
parameter. So a callback for the above tracepoint will look like:

void myprobe(void *data, int value)
{
}

The callback may choose to ignore the data parameter.

This change allows callbacks to register a private data pointer along
with the function probe.

	void mycallback(void *data, int value);

	register_trace_mytracepoint(mycallback, mydata);

Then the mycallback() will receive the "mydata" as the first parameter
before the args.

A more detailed example:

  DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));

  /* In the C file */

  DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));

  [...]

       trace_mytracepoint(status);

  /* In a file registering this tracepoint */

  int my_callback(void *data, int status)
  {
	struct my_struct my_data = data;
	[...]
  }

  [...]
	my_data = kmalloc(sizeof(*my_data), GFP_KERNEL);
	init_my_data(my_data);
	register_trace_mytracepoint(my_callback, my_data);

The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:

	unregister_trace_mytracepoint(my_callback, my_data);

Because of the data parameter, tracepoints declared this way can not have
no args. That is:

  DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS());

will cause an error.

If no arguments are needed, a new macro can be used instead:

  DECLARE_TRACE_NOARGS(mytracepoint);

Since there are no arguments, the proto and args fields are left out.

This is part of a series to make the tracepoint footprint smaller:

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4914025	1088868	 861512	6864405	 68be15	vmlinux.class
4918492	1084612	 861512	6864616	 68bee8	vmlinux.tracepoint

Again, this patch also increases the size of the kernel, but
lays the ground work for decreasing it.

 v5: Fixed net/core/drop_monitor.c to handle these updates.

 v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the
     #ifdef CONFIG_TRACE_POINTS, since the two are the same in both
     cases. The __DECLARE_TRACE() is what changes.
     Thanks to Frederic Weisbecker for pointing this out.

 v3: Made all register_* functions require data to be passed and
     all callbacks to take a void * parameter as its first argument.
     This makes the calling functions comply with C standards.

     Also added more comments to the modifications of DECLARE_TRACE().

 v2: Made the DECLARE_TRACE() have the ability to pass arguments
     and added a new DECLARE_TRACE_NOARGS() for tracepoints that
     do not need any arguments.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 09:50:34 -04:00
David S. Miller 278554bd65 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/wireless/ath/ar9170/usb.c
	drivers/scsi/iscsi_tcp.c
	net/ipv4/ipmr.c
2010-05-12 00:05:35 -07:00
Eric Dumazet eecfd7c4e3 rps: Various optimizations
Introduce ____napi_schedule() helper for callers in irq disabled
contexts. rps_trigger_softirq() becomes a leaf function.

Use container_of() in process_backlog() instead of accessing per_cpu
address.

Use a custom inlined version of __napi_complete() in process_backlog()
to avoid one locked instruction :

 only current cpu owns and manipulates this napi,
 and NAPI_STATE_SCHED is the only possible flag set on backlog.
 we can use a plain write instead of clear_bit(),
 and we dont need an smp_mb() memory barrier, since RPS is on,
 backlog is protected by a spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 22:07:48 -07:00
Eric Dumazet 6ec82562ff veth: Dont kfree_skb() after dev_forward_skb()
In case of congestion, netif_rx() frees the skb, so we must assume
dev_forward_skb() also consume skb.

Bug introduced by commit 445409602c
(veth: move loopback logic to common location)

We must change dev_forward_skb() to always consume skb, and veth to not
double free it.

Bug report : http://marc.info/?l=linux-netdev&m=127310770900442&w=3

Reported-by: Martín Ferrari <martin.ferrari@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 00:53:53 -07:00
WANG Cong 0e34e93177 netpoll: add generic support for bridge and bonding devices
This whole patchset is for adding netpoll support to bridge and bonding
devices. I already tested it for bridge, bonding, bridge over bonding,
and bonding over bridge. It looks fine now.

To make bridge and bonding support netpoll, we need to adjust
some netpoll generic code. This patch does the following things:

1) introduce two new priv_flags for struct net_device:
   IFF_IN_NETPOLL which identifies we are processing a netpoll;
   IFF_DISABLE_NETPOLL is used to disable netpoll support for a device
   at run-time;

2) introduce one new method for netdev_ops:
   ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is
     removed.

3) introduce netpoll_poll_dev() which takes a struct net_device * parameter;
   export netpoll_send_skb() and netpoll_poll_dev() which will be used later;

4) hide a pointer to struct netpoll in struct netpoll_info, ditto.

5) introduce ->real_dev for struct netpoll.

6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable
   netconsole before releasing a slave, to avoid deadlocks.

Cc: David Miller <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06 00:47:21 -07:00
Eric Dumazet ec7d2f2cf3 net: __alloc_skb() speedup
With following patch I can reach maximum rate of my pktgen+udpsink
simulator :
- 'old' machine : dual quad core E5450  @3.00GHz
- 64 UDP rx flows (only differ by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- rps dispatched on 7 other cores. (~130.000 IPI per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC
- 1.080.000 pps without a single drop at NIC level.

Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch
first sk_buff cache line, the second to prefetch the shinfo part.

Also using one memset() to initialize all skb_shared_info fields instead
of one by one to reduce number of instructions, using long word moves.

All skb_shared_info fields before 'dataref' are cleared in 
__alloc_skb().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-05 01:07:37 -07:00
Eric Dumazet 93bb64eac1 net: skb_free_datagram_locked() fix
Commit 4b0b72f7dd ( net: speedup udp receive path )
introduced a bug in skb_free_datagram_locked().

We should not skb_orphan() skb if we dont have the guarantee we are the
last skb user, this might happen with MSG_PEEK concurrent users.

To keep socket locked for the smallest period of time, we split
consume_skb() logic, inlined in skb_free_datagram_locked()

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-03 23:18:14 -07:00
Changli Gao dee42870a4 net: fix softnet_stat
Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
without any protection. And enqueue_to_backlog should update the netdev_rx_stat
of the target CPU.

This patch renames softnet_data.total to softnet_data.processed: the number of
packets processed in uppper levels(IP stacks).

softnet_stat data is moved into softnet_data.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |   17 +++++++----------
 net/core/dev.c            |   26 ++++++++++++--------------
 net/sched/sch_generic.c   |    2 +-
 3 files changed, 20 insertions(+), 25 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02 22:26:57 -07:00
David S. Miller 47d29646a2 net: Inline skb_pull() in eth_type_trans().
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot")
we uninlined skb_pull.

But in some critical paths it makes sense to inline this thing
and it helps performance significantly.

Create an skb_pull_inline() so that we can do this in a way that
serves also as annotation.

Based upon a patch by Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02 02:21:44 -07:00
Eric Dumazet 4381548237 net: sock_def_readable() and friends RCU conversion
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
  - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
  - Use wq_has_sleeper() to eventually wakeup tasks.
  - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
  macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-01 15:00:15 -07:00
Eric Dumazet 4b0b72f7dd net: speedup udp receive path
Since commit 95766fff ([UDP]: Add memory accounting.), 
each received packet needs one extra sock_lock()/sock_release() pair.

This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.

This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.

skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.

UDP receive path now take the socket spinlock only once.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-28 14:35:48 -07:00
Jiri Pirko 05fceb4ad7 net: disallow to use net_assign_generic externally
Now there's no need to use this fuction directly because it's handled by
register_pernet_device. So to make this simple and easy to understand,
make this static to do not tempt potentional users.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 15:49:02 -07:00
Eric Dumazet c377411f24 net: sk_add_backlog() take rmem_alloc into account
Current socket backlog limit is not enough to really stop DDOS attacks,
because user thread spend many time to process a full backlog each
round, and user might crazy spin on socket lock.

We should add backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let user run without being slow down too much.

Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.

Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on a 8 core machine.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 15:13:20 -07:00
Changli Gao 6e7676c1a7 net: batch skb dequeueing from softnet input_pkt_queue
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled.

Note: in the worst case, the number of packets in a softnet_data may
be double of netdev_max_backlog.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 15:11:49 -07:00
Changli Gao a9cbd588fd net: reimplement softnet_data.output_queue as a FIFO queue
reimplement softnet_data.output_queue as a FIFO queue to keep the
fairness among the qdiscs rescheduled.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
----
 include/linux/netdevice.h |    1 +
 net/core/dev.c            |   22 ++++++++++++----------
 2 files changed, 13 insertions(+), 10 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 14:32:12 -07:00
David S. Miller bb61187465 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6 2010-04-27 12:57:39 -07:00
David S. Miller e1703b36c3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/e100.c
	drivers/net/e1000e/netdev.c
2010-04-27 12:49:13 -07:00
Patrick McHardy 25239cee7e net: rtnetlink: decouple rtnetlink address families from real address families
Decouple rtnetlink address families from real address families in socket.h to
be able to add rtnetlink interfaces to code that is not a real address family
without increasing AF_MAX/NPROTO.

This will be used to add support for multicast route dumping from all tables
as the proc interface can't be extended to support anything but the main table
without breaking compatibility.

This partialy undoes the patch to introduce independant families for routing
rules and converts ipmr routing rules to a new rtnetlink family. Similar to
that patch, values up to 127 are reserved for real address families, values
above that may be used arbitrarily.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-26 16:13:54 +02:00
Patrick McHardy 3d0c9c4eb2 net: fib_rules: mark arguments to fib_rules_register const and __net_initdata
fib_rules_register() duplicates the template passed to it without modification,
mark the argument as const. Additionally the templates are only needed when
instantiating a new namespace, so mark them as __net_initdata, which means
they can be discarded when CONFIG_NET_NS=n.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-26 16:02:04 +02:00
Jiri Pirko b3c981d2bb netns: rename unregister_pernet_subsys parameter
Stay consistent with other functions and with comment also and name
pernet_operations parameter properly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-25 00:49:56 -07:00
Changli Gao 8c52d509e8 rps: optimize rps_get_cpu()
optimize rps_get_cpu().

don't initialize ports when we can get the ports. one memory access
for ports than two.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-24 22:50:10 -07:00
Paul LeoNerd Evans 40eaf96271 net: Socket filter ancilliary data access for skb->dev->type
Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.

This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.

Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 16:05:44 -07:00
Dan Carpenter 80032cffb9 rtnetlink: potential ERR_PTR dereference
In the original code, if rtnl_create_link() returned an ERR_PTR then that
would get passed to rtnl_configure_link() which dereferences it.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 15:57:26 -07:00
David S. Miller 9ccb897594 net: Orphan and de-dst skbs earlier in xmit path.
This way GSO packets don't get handled differently.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-22 01:02:07 -07:00
Eric Dumazet e326bed2f4 rps: immediate send IPI in process_backlog()
If some skb are queued to our backlog, we are delaying IPI sending at
the end of net_rx_action(), increasing latencies. This defeats the
queueing, since we want to quickly dispatch packets to the pool of
worker cpus, then eventually deeply process our packets.

It's better to send IPI before processing our packets in upper layers,
from process_backlog().

Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
local irq in net_rps_action(), sorry for the confusion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 00:22:45 -07:00
Eric Dumazet 9a20e3197e net: Introduce skb_orphan_try()
At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 22:54:08 -07:00
David S. Miller 87eb367003 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-6000.c
	net/core/dev.c
2010-04-21 01:14:25 -07:00
David Howells 05d17608a6 net: Fix an RCU warning in dev_pick_tx()
Fix the following RCU warning in dev_pick_tx():

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/core/dev.c:1993 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by swapper/0:
 #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
 #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
Call Trace:
 <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
 [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
 [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
 [<ffffffff81350c34>] ip6_output2+0x256/0x261
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
 [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
 [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
 [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
 [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
 [<ffffffff81035531>] ? __do_softirq+0x69/0x140
 [<ffffffff8103556a>] __do_softirq+0xa2/0x140
 [<ffffffff81002e0c>] call_softirq+0x1c/0x28
 [<ffffffff81004b54>] do_softirq+0x38/0x80
 [<ffffffff81034f06>] irq_exit+0x45/0x47
 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
 [<ffffffff810011cb>] cpu_idle+0x4d/0x83
 [<ffffffff81380b05>] rest_init+0xb9/0xc0
 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
 [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb

An rcu_dereference() should be an rcu_dereference_bh().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 01:09:44 -07:00
Rami Rosen ccb7c7732e net: Remove two unnecessary exports (skbuff).
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in skbuff.c ; this patch
removes these two exports. It also marks these functions as 'static'
and removeS the extern declarations of them from
include/linux/skbuff.h

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 22:39:53 -07:00
Eric Dumazet aa39514516 net: sk_sleep() helper
Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 16:37:13 -07:00
Jiri Pirko ab9304717f net: emphasize rtnl lock required in call_netdevice_notifiers
Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be
present here to make sure that all callers of call_netdevice_notifiers
does the locking properly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:45:37 -07:00
Eric Dumazet b249dcb82d rps: consistent rxhash
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.

This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.

tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)

Also fixed some sparse warnings.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:06 -07:00
Eric Dumazet e36fa2f7e9 rps: cleanups
struct softnet_data holds many queues, so consistent use "sd" name
instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:05 -07:00
Eric Dumazet f5acb907dc rps: static functions
store_rps_map() & store_rps_dev_flow_table_cnt() are static.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19 14:40:57 -07:00
Eric Dumazet 88751275b8 rps: shortcut net_rps_action()
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19 13:20:34 -07:00
Eric Dumazet fc6055a5ba net: Introduce skb_orphan_try()
Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.

David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:39:41 -07:00
Eric Dumazet 9958da0501 net: remove time limit in process_backlog()
- There is no point to enforce a time limit in process_backlog(), since
other napi instances dont follow same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:36:13 -07:00
Eric Dumazet 8770acf049 rps: rps_sock_flow_table is mostly read
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 00:54:36 -07:00
Tom Herbert fec5e652e5 rfs: Receive Flow Steering
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
Eric Dumazet 8728c544a9 net: dev_pick_tx() fix
When dev_pick_tx() caches tx queue_index on a socket, we must check
socket dst_entry matches skb one, or risk a crash later, as reported by
Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different number of queues.

Bug introduced by commit a4ee3ce3
(net: Use sk_tx_queue_mapping for connected sockets)

Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 01:27:11 -07:00
Eric Dumazet b0e28f1eff net: netif_rx() must disable preemption
Eric Paris reported netif_rx() is calling smp_processor_id() from
preemptible context, in particular when caller is
ip_dev_loopback_xmit().

RPS commit added this smp_processor_id() call, this patch makes sure
preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, we
can dot it a bit earlier.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 00:14:07 -07:00
Patrick McHardy 0f87b1dd01 net: fib_rules: decouple address families from real address families
Decouple the address family values used for fib_rules from the real
address families in socket.h. This allows to use fib_rules for
code that is not a real address family without increasing AF_MAX/NPROTO.

Values up to 127 are reserved for real address families and map directly
to the corresponding AF value, values starting from 128 are for other
uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers
for these families.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 14:49:31 -07:00
Patrick McHardy 28bb17268b net: fib_rules: set family in fib_rule_hdr centrally
All fib_rules implementations need to set the family in their ->fill()
functions. Since the value is available to the generic fib_nl_fill_rule()
function, set it there.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 14:49:30 -07:00
Patrick McHardy d8a566beaa net: fib_rules: consolidate IPv4 and DECnet ->default_pref() functions.
Both functions are equivalent, consolidate them since a following patch
needs a third implementation for multicast routing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 14:49:30 -07:00
stephen hemminger 5611551103 dst: don't inline dst_ifdown
The function dst_ifdown is called only two places but in a non-
performance critical code path, there is no reason to inline it.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 03:32:44 -07:00
Eric Dumazet acbbc07145 net: uninline skb_bond_should_drop()
skb_bond_should_drop() is too big to be inlined.

This patch reduces kernel text size, and its compilation time as well
(shrinking include/linux/netdevice.h)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 03:32:42 -07:00
Eric Dumazet b6c6712a42 net: sk_dst_cache RCUification
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and use RCU for readers.

__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 01:41:33 -07:00
Eric Dumazet 7a161ea924 net: Dont use netdev_warn()
Dont use netdev_warn() in dev_cap_txqueue() and get_rps_cpu() so that we
can catch following warnings without crash.

bond0.2240 received packet on queue 6, but number of RX queues is 1
bond0.2240 received packet on queue 11, but number of RX queues is 1

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 01:41:32 -07:00
David S. Miller 871039f02f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/stmmac/stmmac_main.c
	drivers/net/wireless/wl12xx/wl1271_cmd.c
	drivers/net/wireless/wl12xx/wl1271_main.c
	drivers/net/wireless/wl12xx/wl1271_spi.c
	net/core/ethtool.c
	net/mac80211/scan.c
2010-04-11 14:53:53 -07:00
chavey 97f8aefbbf net: fix ethtool coding style errors and warnings
Fix coding style errors and warnings output while running checkpatch.pl
on the files net/core/ethtool.c and include/linux/ethtool.h

Signed-off-by: chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07 21:54:42 -07:00
Eric Dumazet 298b9e44be net: include linux/proc_fs.h in dev_addr_lists.c
As pointed by Randy Dunlap, we must include linux/proc_fs.h in
net/core/dev_addr_lists.c, regardless of CONFIG_PROC_FS

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>, 
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07 16:46:36 -07:00
Timo Teräs 8e4795605d flow: delayed deletion of flow cache entries
Speed up lookups by freeing flow cache entries later. After
virtualizing flow cache entry operations, the flow cache may now
end up calling policy or bundle destructor which can be slowish.

As gc_list is more effective with double linked list, the flow cache
is converted to use common hlist and list macroes where appropriate.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07 03:43:20 -07:00
Timo Teräs fe1a5f031e flow: virtualize flow cache entry methods
This allows to validate the cached object before returning it.
It also allows to destruct object properly, if the last reference
was held in flow cache. This is also a prepartion for caching
bundles in the flow cache.

In return for virtualizing the methods, we save on:
- not having to regenerate the whole flow cache on policy removal:
  each flow matching a killed policy gets refreshed as the getter
  function notices it smartly.
- we do not have to call flow_cache_flush from policy gc, since the
  flow cache now properly deletes the object if it had any references

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07 03:43:18 -07:00
David S. Miller 4a35ecf8bf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/via-velocity.c
	drivers/net/wireless/iwlwifi/iwl-agn.c
2010-04-06 23:53:30 -07:00
Eric Dumazet e4008276fd net: Add a missing local_irq_enable()
As noticed by Changli Gao, we must call local_irq_enable() after
rps_unlock()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-05 15:42:39 -07:00
Tom Herbert 5a6d234e73 rps: fixed missed rps_unlock
Fix spin_unlock_irq which needs to be rps_unlock.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-05 14:37:55 -07:00
Jiri Pirko 22bedad3ce net: convert multicast list to list_head
Converts the list and the core manipulating with it to be the same as uc_list.

+uses two functions for adding/removing mc address (normal and "global"
 variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
 manipulation with lists on a sandbox (used in bonding and 80211 drivers)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-03 14:22:15 -07:00
Jiri Pirko a748ee2426 net: move address list functions to a separate file
+little renaming of unicast functions to be smooth with multicast ones

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-03 14:22:11 -07:00
Eric Dumazet 9092c658ba net: illegal_highdma() fix
Followup to commit 5acbbd428d
(net: change illegal_highdma to use dma_mask)

If dev->dev.parent is NULL, we should not try to dereference it.

Dont force inline illegal_highdma() as its pretty big now.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-02 13:34:49 -07:00
FUJITA Tomonori 5acbbd428d net: change illegal_highdma to use dma_mask
Robert Hancock pointed out two problems about NETIF_F_HIGHDMA:

-Many drivers only set the flag when they detect they can use 64-bit DMA,
since otherwise they could receive DMA addresses that they can't handle
(which on platforms without IOMMU/SWIOTLB support is fatal). This means that if
64-bit support isn't available, even buffers located below 4GB will get copied
unnecessarily.

-Some drivers set the flag even though they can't actually handle 64-bit DMA,
which would mean that on platforms without IOMMU/SWIOTLB they would get a DMA
mapping error if the memory they received happened to be located above 4GB.

http://lkml.org/lkml/2010/3/3/530

We can use the dma_mask if we need bouncing or not here. Then we can
safely fix drivers that misuse NETIF_F_HIGHDMA.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 19:53:12 -07:00
Timo Teräs d7997fe1f4 flow: structurize flow cache
Group all per-cpu data to one structure instead of having many
globals. Also prepare the internals so that we can have multiple
instances of the flow cache if needed.

Only the kmem_cache is left as a global as all flow caches share
the same element size, and benefit from using a common cache.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 19:41:36 -07:00
Changli Gao 152102c7f2 rps: keep the old behavior on SMP without rps
keep the old behavior on SMP without rps

RPS introduces a lock operation to per cpu variable input_pkt_queue on
SMP whenever rps is enabled or not. On SMP without RPS, this lock isn't
needed at all.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/core/dev.c | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 18:41:40 -07:00
laurent chavey 598ed9367a fix net/core/dst.c coding style error and warnings
Fix coding style errors and warnings output while running checkpatch.pl
on the file net/core/dst.c.

Signed-off-by: chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-30 23:51:08 -07:00
stephen hemminger b00fabb402 netdev: ethtool RXHASH flag
This adds ethtool and device feature flag to allow control
of receive hashing offload.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-30 23:51:08 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Stephen Rothwell 30bde1f507 rps: fix net-sysfs build for !CONFIG_RPS
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-29 01:00:44 -07:00
Eric Dumazet 10f744d205 net: __netif_receive_skb should be static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-28 23:07:20 -07:00
Jan Engelhardt adcfe1964e net: increase preallocated size of nlmsg to accomodate for IFLA_STATS64
When more data is stuffed into an nlmsg than initially projected, an
extra allocation needs to be done. Reserve enough for IFLA_STATS64 so
that this does not to needlessy happen.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-27 17:15:29 -07:00
Jan Engelhardt 14a4b42bd6 net: fix unaligned access in IFLA_STATS64
Tony Luck observes that the original IFLA_STATS64 submission causes
unaligned accesses. This is because nla_data() returns a pointer to a
memory region that is only aligned to 32 bits. Do some memcpying to
workaround this.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-27 16:35:50 -07:00
Eric Dumazet df3345457a rps: add CONFIG_RPS
RPS currently depends on SMP and SYSFS

Adding a CONFIG_RPS makes sense in case this requirement changes in the
future. This patch saves about 1500 bytes of kernel text in case SMP is
on but SYSFS is off.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-25 12:07:00 -07:00
Tom Herbert e51d739ab7 net: Fix locking in flush_backlog
Need to take spinlocks when dequeuing from input_pkt_queue in flush_backlog.
Also, flush_backlog can now be called directly from netdev_run_todo.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-23 23:17:18 -07:00
Amerigo Wang 5fc05f8764 netpoll: warn when there are spaces in parameters
v2: update according to Frans' comments.

Currently, if we leave spaces before dst port,
netconsole will silently accept it as 0. Warn about this.

Also, when spaces appear in other places, make them
visible in error messages.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-22 20:05:45 -07:00
Tom Herbert e880eb6c5c rps: Fix build with CONFIG_SYSFS enabled
Fix build with CONFIG_SYSFS not enabled.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-22 18:06:47 -07:00
Robert Olsson e99b99b471 pktgen node allocation
Here is patch to manipulate packet node allocation and implicitly
how packets are DMA'd etc.

The flag NODE_ALLOC enables the function and numa_node_id();
when enabled it can also be explicitly controlled via a new
node parameter

Tested this with 10 Intel 82599 ports w. TYAN S7025 E5520 CPU's.
Was able to TX/DMA ~80 Gbit/s to Ethernet wires.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 20:33:36 -07:00
Eric Dumazet 99fe3c391d net: dev_getfirstbyhwtype() optimization
Use RCU to avoid RTNL use in dev_getfirstbyhwtype()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 20:33:36 -07:00
Eric Dumazet 283f2fe87e net: speedup netdev_set_master()
We currently force a synchronize_net() in netdev_set_master()

This seems necessary only when a slave had a master and we dismantle it.

In the other case ("ifenslave bond0 ethO"), we dont need this long
delay.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 18:34:15 -07:00
Jiri Pirko 32a806c194 bonding: flush unicast and multicast lists when changing type
After the type change, addresses in unicast and multicast lists wouldn't make
sense, not to mention possible different lenghts. So flush both lists here.

Note "dev_addr_discard" will be very soon replaced by "dev_mc_flush" (once
mc_list conversion will be done).

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 18:31:34 -07:00
Patrick McHardy 755d0e77ac net: rtnetlink: ignore NETDEV_PRE_TYPE_CHANGE in rtnetlink_event()
Ignore the new NETDEV_PRE_TYPE_CHANGE event in rtnetlink_event() since
there have been no changes userspace needs to be notified of.

Also add a comment to the netdev notifier event definitions to remind
people to update the exclusion list when adding new event types.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-21 18:31:34 -07:00
David S. Miller e77c8e83dd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-20 15:24:29 -07:00
Eric Dumazet 0641e4fbf2 net: Potential null skb->dev dereference
When doing "ifenslave -d bond0 eth0", there is chance to get NULL
dereference in netif_receive_skb(), because dev->master suddenly becomes
NULL after we tested it.

We should use ACCESS_ONCE() to avoid this (or rcu_dereference())

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 21:16:45 -07:00
Jiri Pirko 3ca5b4042e bonding: check return value of nofitier when changing type
This patch adds the possibility to refuse the bonding type change for
other subsystems (such as for example bridge, vlan, etc.)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 20:00:02 -07:00
Tom Herbert 1e94d72fea rps: Fixed build with CONFIG_SMP not enabled.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-18 17:45:44 -07:00
Jan Engelhardt 10708f37ae net: core: add IFLA_STATS64 support
`ip -s link` shows interface counters truncated to 32 bit. This is
because interface statistics are transported only in 32-bit quantity
to userspace. This commit adds a new IFLA_STATS64 attribute that
exports them in full 64 bit.

References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:22 -07:00
Eric Dumazet 2fb3573dfb net: remove rcu locking from fib_rules_event()
We hold RTNL at this point and dont use RCU variants of list traversals,
we dont need rcu_read_lock()/rcu_read_unlock()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:19 -07:00
Tom Herbert 0a9627f264 rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS  567K tps at 61% CPU (4 HW RX queues)
   Without RPS  738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:18 -07:00
Jiri Slaby 21edbb223e NET: netpoll, fix potential NULL ptr dereference
Stanse found that one error path in netpoll_setup dereferences npinfo
even though it is NULL. Avoid that by adding new label and go to that
instead.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Borkmann <danborkmann@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: chavey@google.com
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 14:15:45 -07:00
Eric Dumazet 3041f51707 net: Fix dev_mc_add()
Commit 6e17d45a (net: add addr len check to dev_mc_add)
added a bug in dev_mc_add(), since it can now exit with a lock
imbalance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-10 07:32:28 -08:00
Eric Dumazet 0a141509ed net: Annotates neigh_invalidate()
Annotates neigh_invalidate() with __releases() and __acquires() for
sparse sake.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-10 07:32:28 -08:00
Eric Dumazet f5c445ed41 ethtool: Use noinline_for_stack
Use self documenting noinline_for_stack instead of duplicated comments.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-08 12:17:04 -08:00
Dan Carpenter 72150e9b7f sock.c: potential null dereference
We test that "prot->rsk_prot" is non-null right before we dereference it
on this line.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-07 15:25:50 -08:00
Jeff Garzik d17792ebdf ethtool: Add direct access to ops->get_sset_count
On 03/04/2010 09:26 AM, Ben Hutchings wrote:
> On Thu, 2010-03-04 at 00:51 -0800, Jeff Kirsher wrote:
>> From: Jeff Garzik<jgarzik@redhat.com>
>>
>> This patch is an alternative approach for accessing string
>> counts, vs. the drvinfo indirect approach.  This way the drvinfo
>> space doesn't run out, and we don't break ABI later.
> [...]
>> --- a/net/core/ethtool.c
>> +++ b/net/core/ethtool.c
>> @@ -214,6 +214,10 @@ static noinline int ethtool_get_drvinfo(struct net_device *dev, void __user *use
>>   	info.cmd = ETHTOOL_GDRVINFO;
>>   	ops->get_drvinfo(dev,&info);
>>
>> +	/*
>> +	 * this method of obtaining string set info is deprecated;
>> +	 * consider using ETHTOOL_GSSET_INFO instead
>> +	 */
>
> This comment belongs on the interface (ethtool.h) not the
> implementation.

Debatable -- the current comment is located at the callsite of
ops->get_sset_count(), which is where an implementor might think to add
a new call.  Not all the numeric fields in ethtool_drvinfo are obtained
from ->get_sset_count().

Hence the "some" in the attached patch to include/linux/ethtool.h,
addressing your comment.

> [...]
>> +static noinline int ethtool_get_sset_info(struct net_device *dev,
>> +                                          void __user *useraddr)
>> +{
> [...]
>> +	/* calculate size of return buffer */
>> +	for (i = 0; i<  64; i++)
>> +		if (sset_mask&  (1ULL<<  i))
>> +			n_bits++;
> [...]
>
> We have a function for this:
>
> 	n_bits = hweight64(sset_mask);

Agreed.

I've attached a follow-up patch, which should enable my/Jeff's kernel
patch to be applied, followed by this one.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 14:00:17 -08:00
Jeff Garzik 723b2f57ad ethtool: Add direct access to ops->get_sset_count
This patch is an alternative approach for accessing string
counts, vs. the drvinfo indirect approach.  This way the drvinfo
space doesn't run out, and we don't break ABI later.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 14:00:17 -08:00
Zhu Yi a3a858ff18 net: backlog functions rename
sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog

Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 13:34:03 -08:00
Zhu Yi 8eae939f14 net: add limit for socket backlog
We got system OOM while running some UDP netperf testing on the loopback
device. The case is multiple senders sent stream UDP packets to a single
receiver via loopback on local host. Of course, the receiver is not able
to handle all the packets in time. But we surprisingly found that these
packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
Instead, they are kept queuing to sk->sk_backlog and finally ate up all
the memory. We believe this is a secure hole that a none privileged user
can crash the system.

The root cause for this problem is, when the receiver is doing
__release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from backlog to
sk_receive_queue with the softirq enabled. In the above case, multiple
busy senders will almost make it an endless loop. The skbs in the
backlog end up eat all the system memory.

The issue is not only for UDP. Any protocols using socket backlog is
potentially affected. The patch adds limit for socket backlog so that
the backlog size cannot be expanded endlessly.

Reported-by: Alex Shi <alex.shi@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Allan Stephens <allan.stephens@windriver.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05 13:33:59 -08:00
David S. Miller 47871889c6 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/firmware/iscsi_ibft.c
2010-02-28 19:23:06 -08:00
Eric W. Biederman 76dadd76c2 scm: Only support SCM_RIGHTS on unix domain sockets.
We use scm_send and scm_recv on both unix domain and
netlink sockets, but only unix domain sockets support
everything required for file descriptor passing,
so error if someone attempts to pass file descriptors
over netlink sockets.

Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-28 18:22:02 -08:00
Jeff Garzik 9675478bba ethtool: do not set some flags, if others failed
NETIF_F_NTUPLE flag setting introduced a bug:  non-ntuple flags
like LRO may be successfully set, before ioctl(2) returns failure
to userspace.

The set-flags operation should be all-or-none, rather than leaving
things in an inconsistent state prior to reporting failure to
userspace.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-28 01:40:30 -08:00
Patrick McHardy 3729d50212 rtnetlink: support specifying device flags on device creation
commit e8469ed959c373c2ff9e6f488aa5a14971aebe1f
Author: Patrick McHardy <kaber@trash.net>
Date:   Tue Feb 23 20:41:30 2010 +0100

Support specifying the initial device flags when creating a device though
rtnl_link. Devices allocated by rtnl_create_link() are marked as INITIALIZING
in order to surpress netlink registration notifications. To complete setup,
rtnl_configure_link() must be called, which performs the device flag changes
and invokes the deferred notifiers if everything went well.

Two examples:

# add macvlan to eth0
#
$ ip link add link eth0 up allmulticast on type macvlan

[LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 26:f8:84:02:f9:2a brd ff:ff:ff:ff:ff:ff
[ROUTE]ff00::/8 dev macvlan0  table local  metric 256  mtu 1500 advmss 1440 hoplimit 0
[ROUTE]fe80::/64 dev macvlan0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 0
[LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500
    link/ether 26:f8:84:02:f9:2a
[ADDR]11: macvlan0    inet6 fe80::24f8:84ff:fe02:f92a/64 scope link
       valid_lft forever preferred_lft forever
[ROUTE]local fe80::24f8:84ff:fe02:f92a via :: dev lo  table local  proto none  metric 0  mtu 16436 advmss 16376 hoplimit 0
[ROUTE]default via fe80::215:e9ff:fef0:10f8 dev macvlan0  proto kernel  metric 1024  mtu 1500 advmss 1440 hoplimit 0
[NEIGH]fe80::215:e9ff:fef0:10f8 dev macvlan0 lladdr 00:15:e9:f0:10:f8 router STALE
[ROUTE]2001:6f8:974::/64 dev macvlan0  proto kernel  metric 256  expires 0sec mtu 1500 advmss 1440 hoplimit 0
[PREFIX]prefix 2001:6f8:974::/64 dev macvlan0 onlink autoconf valid 14400 preferred 131084
[ADDR]11: macvlan0    inet6 2001:6f8:974:0:24f8:84ff:fe02:f92a/64 scope global dynamic
       valid_lft 86399sec preferred_lft 14399sec

# add VLAN to eth1, eth1 is down
#
$ ip link add link eth1 up type vlan id 1000
RTNETLINK answers: Network is down

<no events>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:40 -08:00
Patrick McHardy bd38081160 dev: support deferring device flag change notifications
Split dev_change_flags() into two functions: __dev_change_flags() to
perform the actual changes and __dev_notify_flags() to invoke netdevice
notifiers. This will be used by rtnl_link to defer netlink notifications
until the device has been fully configured.

This changes ordering of some operations, in particular:

- netlink notifications are sent after all changes have been performed.
  As a side effect this surpresses one unnecessary netlink message when
  the IFF_UP and other flags are changed simultaneously.

- The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked
  after all changes have been performed. Their relative is unchanged.

- net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead
  of afterwards. This should not make any difference since both RX and TX
  are already shut down at this point.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:40 -08:00
Patrick McHardy a2835763e1 rtnetlink: handle rtnl_link netlink notifications manually
In order to support specifying device flags during device creation,
we must be able to roll back device registration in case setting the
flags fails without sending any notifications related to the device
to userspace.

This patch changes rollback_registered_many() and register_netdevice()
to manually send netlink notifications for devices not handled by
rtnl_link and allows to defer notifications for devices handled by
rtnl_link until setup is complete.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:39 -08:00
Patrick McHardy 10de05afe0 rtnetlink: ignore NETDEV_PRE_UP notifier in rtnetlink_event()
Commit 3b8bcfd (net: introduce pre-up netdev notifier) added a new
notifier which is run before a device is set UP for use by cfg80211.

The patch missed to add the new notifier to the ignore list in
rtnetlink_event(), so we currently get an unnecessary netlink
notification before a device is set UP.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-27 02:43:39 -08:00
David S. Miller 738b0343e7 Revert "ethtool: Add n-tuple string length to drvinfo and return it"
This reverts commit c79c5ffdce.

As Jeff points out we can't break the user visible interface
like this, we need to add this into the reserved[] thing.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 05:12:02 -08:00
Jiri Pirko 6e17d45ae3 net: add addr len check to dev_mc_add
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 04:22:26 -08:00
Peter Waskiewicz c79c5ffdce ethtool: Add n-tuple string length to drvinfo and return it
The drvinfo struct should include the number of strings that
get_rx_ntuple will return.  It will be variable if an underlying
driver implements its own get_rx_ntuple routine, so userspace
needs to know how much data is coming.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 04:18:43 -08:00
stephen hemminger e5e26d75f4 netdev: use list_first_entry macro
Use list_first_entry macro; no longer any need to use
'next' directly in list to find first entry.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 04:18:35 -08:00
Williams, Mitch A 4edb246626 rtnetlink: clean up SR-IOV config interface
This patch consists of a few minor cleanups to the SR-IOV
configurion code in rtnetlink.
- Remove unneccesary lock
- Remove unneccesary casts
- Return correct error code for no driver support

These changes are based on comments from Patrick McHardy

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-26 04:18:35 -08:00
David S. Miller 0448873480 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-25 23:22:42 -08:00
Paul E. McKenney a898def29e net: Add checking to rcu_dereference() primitives
Update rcu_dereference() primitives to use new lockdep-based
checking. The rcu_dereference() in __in6_dev_get() may be
protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
The rcu_dereference() in __sk_free() is protected by the fact
that it is never reached if an update could change it.  Check
for this by using rcu_dereference_check() to verify that the
struct sock's ->sk_wmem_alloc counter is zero.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 09:41:03 +01:00
Ajit Khaparde c4d49794ff net: bug fix for vlan + gro issue
Traffic (tcp) doesnot start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because, the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets the skb->dev to napi->dev from the previously
set vlan netdev interface. This causes the ip_route_input to drop the
incoming packet considering it as a packet coming from a martian source.

I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.

CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-23 19:09:31 -08:00
Jamal Hadi Salim bd55775c8d xfrm: SA lookups signature with mark
pass mark to all SA lookups to prepare them for when we add code
to have them search.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-22 16:20:22 -08:00
Eric W. Biederman b8afe64161 net-sysfs: Use rtnl_trylock in wireless sysfs methods.
The wireless sysfs methods like the rest of the networking sysfs
methods are removed with the rtnl_lock held and block until
the existing methods stop executing.  So use rtnl_trylock
and restart_syscall so that the code continues to work.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-19 15:40:51 -08:00
Michael S. Tsirkin 5ff3f07367 net: export attach/detach filter routines
Export sk_attach_filter/sk_detach_filter routines,
so that tun module can use them.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 16:35:16 -08:00
Ajit Khaparde e76b69cc01 net: bug fix for vlan + gro issue
Traffic (tcp) doesnot start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because, the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets the skb->dev to napi->dev from the previously
set vlan netdev interface. This causes the ip_route_input to drop the
incoming packet considering it as a packet coming from a martian source.

I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.

CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 15:59:47 -08:00
Ben Hutchings 7af3351f71 ethtool: Don't flush n-tuple list from ethtool_reset()
The n-tuple list should be flushed if and only if the ETH_RESET_FILTER
flag is set and the driver is able to reset filtering/flow direction
hardware without also resetting a component whose flag is not set.
This test is best left to the driver.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 13:38:10 -08:00
Alexey Dobriyan faf234220f net: use kasprintf() for socket cache names
kasprintf() makes code smaller.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 13:27:11 -08:00
Alexey Dobriyan dc4c2c3105 net: remove INIT_RCU_HEAD() usage
call_rcu() will unconditionally reinitialize RCU head anyway.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17 00:03:27 -08:00
David S. Miller 2bb4646fce Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-16 22:09:29 -08:00
Eric W. Biederman 54716e3beb net neigh: Decouple per interface neighbour table controls from binary sysctls
Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls.  This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.

Don't pass the binary sysctl path for neighour table entries
into neigh_sysctl_register.  These parameters are no longer
used and so are just dead code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-16 15:55:18 -08:00
stephen hemminger 1cab819b5e ethtool: allow non-admin user to read GRO settings.
Looks like an oversight in GRO design.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-16 14:53:23 -08:00
Eric Dumazet 339c6e9985 ethtool: reduce stack usage
dev_ethtool() is currently using 604 bytes of stack, even with gcc-4.4.2

objdump -d vmlinux | scripts/checkstack.pl
...
0xc04bbc33 dev_ethtool [vmlinux]:			604
...
Adding noinline attributes to selected functions can reduce stack usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-15 21:51:33 -08:00
Peter Waskiewicz 0d643e1fb4 ethtool: Move n-tuple capability check into set_flags
set_flags should check if the underlying device supports
n-tuple filter programming before setting the device flags
on the netdevice.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-15 21:49:47 -08:00
Peter Waskiewicz e858911804 ethtool: Fix filter addition when caching n-tuple filters
We can allow a filter to be added successfully to the underlying
hardware, but still return an error if the cached list memory
allocation fails.  This patch fixes that condition.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-15 21:49:47 -08:00
Williams, Mitch A ebc08a6f47 rtnetlink: Add VF config code to rtnetlink
Add code to allow rtnetlink clients to query and set VF information through
the PF driver.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-12 16:56:08 -08:00
Jiri Pirko 4cd24eaf0c net: use netdev_mc_count and netdev_mc_empty when appropriate
This patch replaces dev->mc_count in all drivers (hopefully I didn't miss
anything). Used spatch and did small tweaks and conding style changes when
it was suitable.

Jirka

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-12 11:38:58 -08:00
Roland Dreier 8e5574211d ethtool: Use explicit designated initializers for .cmd
Initialize the .cmd member of various ethtool using a designated struct
initializer rather.  This makes things a teeny bit more robust, although
the chance of a struct layout changing is extremely remote, and also
makes the code a little easier to read.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-11 12:14:23 -08:00
Peter P Waskiewicz Jr 15682bc488 ethtool: Introduce n-tuple filter programming support
This patchset enables the ethtool layer to program n-tuple
filters to an underlying device.  The idea is to allow capable
hardware to have static rules applied that can assist steering
flows into appropriate queues.

Hardware that is known to support these types of filters today
are ixgbe and niu.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-10 20:03:05 -08:00
David S. Miller b1109bf085 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-09 11:44:44 -08:00
Eric Dumazet 2fc1b5dd99 dst: call cond_resched() in dst_gc_task()
Kernel bugzilla #15239

On some workloads, it is quite possible to get a huge dst list to
process in dst_gc_task(), and trigger soft lockup detection.

Fix is to call cond_resched(), as we run in process context.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-08 15:00:39 -08:00
Rafael J. Wysocki 1b3f720bf0 pktgen: Fix freezing problem
Add missing try_to_freeze() to one of the pktgen_thread_worker() code
paths so that it doesn't block suspend/hibernation.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=15006

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: Ciprian Dorin Craciun <ciprian.craciun@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-04 14:00:41 -08:00
Arnd Bergmann 8a83a00b07 net: maintain namespace isolation between vlan and real device
In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-03 20:20:32 -08:00
Jiri Pirko 32e7bfc411 net: use helpers to access uc list V2
This patch introduces three macros to work with uc list from net drivers.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-25 13:36:10 -08:00
Alexey Dobriyan 81c1ebfc43 neigh: simplify seq_file code
Simpily pass 'struct neigh_table' with seq_file private pointer,
and save one dereference. Proc entry itself isn't interesting.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-23 01:21:27 -08:00
Krishna Kumar 4b258461c0 net: Optimize non-gso test checks
Avoid checking twice whether skb needs to be linearized, if one
skb_linearize was already done.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-21 01:26:29 -08:00
David S. Miller 11380a4b2d net: Unexport napi_gro_flush().
Nothing outside of net/core/dev.c uses it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-19 13:46:10 -08:00
Alexey Dobriyan 2c8c1e7297 net: spread __net_init, __net_exit
__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:16:02 -08:00
H Hartley Sweeten 4d0392be21 net/core/sock.c: quiet sparse noise
In sock_getsockopt the symbol 'lv' is declared as an
unsigned int type, probably due to sizeof returning a
size_t which is really an unsigned int.

This produces a sparse warning for SO_PEERNAME due to
the sock->ops->getname() call:

warning: incorrect type in argument 3 (different signedness)
   expected int *sockaddr_len
   got unsigned int *<noident>

Quiet the warning by changing the type of 'lv' to an int.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-15 01:08:58 -08:00
Daniel Borkmann 508e14b4a4 netpoll: allow execution of multiple rx_hooks per interface
Signed-off-by: Daniel Borkmann <danborkmann@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-13 20:38:26 -08:00
Linus Torvalds 597d8c7178 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
  sky2: Fix oops in sky2_xmit_frame() after TX timeout
  Documentation/3c509: document ethtool support
  af_packet: Don't use skb after dev_queue_xmit()
  vxge: use pci_dma_mapping_error to test return value
  netfilter: ebtables: enforce CAP_NET_ADMIN
  e1000e: fix and commonize code for setting the receive address registers
  e1000e: e1000e_enable_tx_pkt_filtering() returns wrong value
  e1000e: perform 10/100 adaptive IFS only on parts that support it
  e1000e: don't accumulate PHY statistics on PHY read failure
  e1000e: call pci_save_state() after pci_restore_state()
  netxen: update version to 4.0.72
  netxen: fix set mac addr
  netxen: fix smatch warning
  netxen: fix tx ring memory leak
  tcp: update the netstamp_needed counter when cloning sockets
  TI DaVinci EMAC: Handle emac module clock correctly.
  dmfe/tulip: Let dmfe handle DM910x except for SPARC on-board chips
  ixgbe: Fix compiler warning about variable being used uninitialized
  netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq()
  mv643xx_eth: don't include cache padding in rx desc buffer size
  ...

Fix trivial conflict in drivers/scsi/cxgb3i/cxgb3i_offload.c
2010-01-12 20:53:29 -08:00
David S. Miller d4a66e752d Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/benet/be_cmds.h
	include/linux/sysctl.h
2010-01-10 22:55:03 -08:00
Octavian Purdila 704da560c0 tcp: update the netstamp_needed counter when cloning sockets
This fixes a netstamp_needed accounting issue when the listen socket
has SO_TIMESTAMP set:

    s = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, 1); -> netstamp_needed = 1
    bind(s, ...);
    listen(s, ...);
    s2 = accept(s, ...); -> netstamp_needed = 1
    close(s2); -> netstamp_needed = 0
    close(s); -> netstamp_needed = -1

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-08 00:00:09 -08:00
Jesper Dangaard Brouer 2d13bafeba net: Make it easier to parse /proc/net/dev contents.
The contents of /proc/net/dev is annoying to parse, because
it changes whether there is a space after the "ethX:" or not.
It depends upon the size of the "Receive bytes" counter,
if the number is below 7 digits, then there is whitespaces
else if the number is 8 digits or above there is no space
between the ":" and the number.

This patch changes the output to assure there is always a space
between the ":" and the number.  Given that all existing userspace
application already need to handle the whitespaces, I see
no breakage of existing tools.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:59:10 -08:00
Andy Gospodarek ca8d9ea30b fix bonding: allow arp_ip_targets on separate vlans to use arp validation
On Wed, Jan 06, 2010 at 10:10:03PM +0100, Eric Dumazet wrote:
> Le 06/01/2010 19:38, Eric Dumazet a écrit :
> >
> > (net-next-2.6 doesnt work well on my bond/vlan setup, I suspect I need a bisection)
>
> David, I had to revert 1f3c8804ac
> (bonding: allow arp_ip_targets on separate vlans to use arp validation)
>
> Or else, my vlan devices dont work (unfortunatly I dont have much time
> these days to debug the thing)
>
> My config :
>
>               +---------+
> vlan.103 -----+ bond0   +--- eth1 (bnx2)
>               |         +
> vlan.825 -----+         +--- eth2 (tg3)
>               +---------+
>
> $ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth2
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> Slave Interface: eth1  (bnx2)
> MII Status: down
> Link Failure Count: 1
> Permanent HW addr: 00:1e:0b:ec:d3:d2
>
> Slave Interface: eth2   (tg3)
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1e:0b:92:78:50
>

This patch fixes up a problem with found with commit
1f3c8804ac.  The original change
overloaded null_or_orig, but doing that prevented any packet handlers
that were not tied to a specific device (i.e. ptype->dev == NULL) from
ever receiving any frames.

The null_or_orig variable cannot be overloaded, and must be kept as NULL
to prevent the frame from being ignored by packet handlers designed to
accept frames on any interface.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:46:39 -08:00
Andy Gospodarek 1f3c8804ac bonding: allow arp_ip_targets on separate vlans to use arp validation
This allows a bond device to specify an arp_ip_target as a host that is
not on the same vlan as the base bond device and still use arp
validation.  A configuration like this, now works:

BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=10.0.100.1 arp_validate=3"

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever
9: bond0.100@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.100.2/24 brd 10.0.100.255 scope global bond0.100
    inet6 fe80::213:21ff:febe:33e9/64 scope link
       valid_lft forever preferred_lft forever

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.0.100.1

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:40:05:30:ff:30

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:be:33:e9

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-03 21:17:16 -08:00