Commit Graph

745 Commits

Author SHA1 Message Date
Pavel Emelyanov 3f0666ee30 [NET]: Auto-zero the allocated sock object
We have a __GFP_ZERO flag that allocates a zeroed chunk of memory.
Use it in the sk_alloc() and avoid a hand-made memset().

This is a temporary patch that will help us in the nearest future :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:34:42 -07:00
Pavel Emelyanov c308c1b20e [NET]: Cleanup the allocation/freeing of the sock object
The sock object is allocated either from the generic cache with
the kmalloc, or from the proc->slab cache.

Move this logic into an isolated set of helpers and make the
sk_alloc/sk_free look a bit nicer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:33:50 -07:00
Pavel Emelyanov 1e2e6b89f1 [NET]: Move the get_net() from sock_copy()
The sock_copy() is supposed to just clone the socket. In a perfect
world it has to be just memcpy, but we have to handle the security
mark correctly. All the extra setup must be performed in sk_clone() 
call, so move the get_net() into more proper place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:31:26 -07:00
Pavel Emelyanov f1a6c4da14 [NET]: Move the sock_copy() from the header
The sock_copy() call is not used outside the sock.c file,
so just move it into a sock.c

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:29:45 -07:00
David S. Miller 51c739d1f4 [NET]: Fix incorrect sg_mark_end() calls.
This fixes scatterlist corruptions added by

	commit 68e3f5dd4d
	[CRYPTO] users: Fix up scatterlist conversion errors

The issue is that the code calls sg_mark_end() which clobbers the
sg_page() pointer of the final scatterlist entry.

The first part fo the fix makes skb_to_sgvec() do __sg_mark_end().

After considering all skb_to_sgvec() call sites the most correct
solution is to call __sg_mark_end() in skb_to_sgvec() since that is
what all of the callers would end up doing anyways.

I suspect this might have fixed some problems in virtio_net which is
the sole non-crypto user of skb_to_sgvec().

Other similar sg_mark_end() cases were converted over to
__sg_mark_end() as well.

Arguably sg_mark_end() is a poorly named function because it doesn't
just "mark", it clears out the page pointer as a side effect, which is
what led to these bugs in the first place.

The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
and arguably it could be converted to __sg_mark_end() if only so that
we can delete this confusing interface from linux/scatterlist.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:29:29 -07:00
Daniel Lezcano 310928d963 [NETNS]: fix net released by rcu callback
When a network namespace reference is held by a network subsystem,
and when this reference is decremented in a rcu update callback, we
must ensure that there is no more outstanding rcu update before
trying to free the network namespace.

In the normal case, the rcu_barrier is called when the network namespace
is exiting in the cleanup_net function.

But when a network namespace creation fails, and the subsystems are
undone (like the cleanup), the rcu_barrier is missing.

This patch adds the missing rcu_barrier.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:16:21 -07:00
Daniel Lezcano 93ee31f14f [NET]: Fix free_netdev on register_netdev failure.
Point 1:
The unregistering of a network device schedule a netdev_run_todo.
This function calls dev->destructor when it is set and the
destructor calls free_netdev.

Point 2:
In the case of an initialization of a network device the usual code
is:
 * alloc_netdev
 * register_netdev
    -> if this one fails, call free_netdev and exit with error.

Point 3:
In the register_netdevice function at the later state, when the device
is at the registered state, a call to the netdevice_notifiers is made.
If one of the notification falls into an error, a rollback to the
registered state is done using unregister_netdevice.

Conclusion:
When a network device fails to register during initialization because
one network subsystem returned an error during a notification call
chain, the network device is freed twice because of fact 1 and fact 2.
The second free_netdev will be done with an invalid pointer.

Proposed solution:
The following patch move all the code of unregister_netdevice *except*
the call to net_set_todo, to a new function "rollback_registered".

The following functions are changed in this way:
 * register_netdevice: calls rollback_registered when a notification fails
 * unregister_netdevice: calls rollback_register + net_set_todo, the call
                         order to net_set_todo is changed because it is the
                         latest now. Since it justs add an element to a list
                         that should not break anything.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:16:18 -07:00
David S. Miller 0a7606c121 [NET]: Fix race between poll_napi() and net_rx_action()
netpoll_poll_lock() synchronizes the ->poll() invocation
code paths, but once we have the lock we have to make
sure that NAPI_STATE_SCHED is still set.  Otherwise we
get:

	cpu 0			cpu 1

	net_rx_action()		poll_napi()
	netpoll_poll_lock()	... spin on ->poll_lock
	->poll()
	  netif_rx_complete
	netpoll_poll_unlock()	acquire ->poll_lock()
				->poll()
				 netif_rx_complete()
				 CRASH

Based upon a bug report from Tina Yang.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-29 22:37:28 -07:00
Eric W. Biederman ceaa79c434 [NETNS]: Fix get_net_ns_by_pid
The pid namespace patches changed the semantics of
find_task_by_pid without breaking the compile resulting
in get_net_ns_by_pid doing the wrong thing.

So switch to using the intended find_task_by_vpid.

Combined with Denis' earlier patch to make netlink traffic
fully synchronous the inadvertent race I introduced with
accessing current is actually removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:56:12 -07:00
Eric W. Biederman 2b008b0a8e [NET]: Marking struct pernet_operations __net_initdata was inappropriate
It is not safe to to place struct pernet_operations in a special section.
We need struct pernet_operations to last until we call unregister_pernet_subsys.
Which doesn't happen until module unload.

So marking struct pernet_operations is a disaster for modules in two ways.
- We discard it before we call the exit method it points to.
- Because I keep struct pernet_operations on a linked list discarding
  it for compiled in code removes elements in the middle of a linked
  list and does horrible things for linked insert.

So this looks safe assuming __exit_refok is not discarded
for modules.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:54:53 -07:00
Adrian Bunk bbbb1a812d [NET]: Unexport sock_enable_timestamp().
sock_enable_timestamp() no longer has any modular users.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 03:59:45 -07:00
Stephen Hemminger c8d90dca32 [NET] dev_change_name: ignore changes to same name
Prevent error/backtrace from dev_rename() when changing
name of network device to the same name. This is a common
situation with udev and other scripts that bind addr to device.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 03:53:42 -07:00
Jamal Hadi Salim a057ae3c10 [NET_CLS_ACT]: Use skb_act_clone
clean skb_clone of any signs of CONFIG_NET_CLS_ACT and
have mirred us skb_act_clone()

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 02:47:54 -07:00
Linus Torvalds 06dbbfef82 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  [IPV4]: Explicitly call fib_get_table() in fib_frontend.c
  [NET]: Use BUILD_BUG_ON in net/core/flowi.c
  [NET]: Remove in-code externs for some functions from net/core/dev.c
  [NET]: Don't declare extern variables in net/core/sysctl_net_core.c
  [TCP]: Remove unneeded implicit type cast when calling tcp_minshall_update()
  [NET]: Treat the sign of the result of skb_headroom() consistently
  [9P]: Fix missing unlock before return in p9_mux_poll_start
  [PKT_SCHED]: Fix sch_prio.c build with CONFIG_NETDEVICES_MULTIQUEUE
  [IPV4] ip_gre: sendto/recvfrom NBMA address
  [SCTP]: Consolidate sctp_ulpq_renege_xxx functions
  [NETLINK]: Fix ACK processing after netlink_dump_start
  [VLAN]: MAINTAINERS update
  [DCCP]: Implement SIOCINQ/FIONREAD
  [NET]: Validate device addr prior to interface-up
2007-10-25 15:50:32 -07:00
Jens Axboe 642f149031 SG: Change sg_set_page() to take length and offset argument
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.

Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-24 11:20:47 +02:00
Pavel Emelyanov f0fe91ded3 [NET]: Use BUILD_BUG_ON in net/core/flowi.c
Instead of ugly extern not-existing function.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:57 -07:00
Pavel Emelyanov 342709efc7 [NET]: Remove in-code externs for some functions from net/core/dev.c
Inconsistent prototype and real type for functions may have worse
consequences, than those for variables, so move them into a header.

Since they are used privately in net/core, make this file reside in
the same place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:56 -07:00
Pavel Emelyanov a37ae4086e [NET]: Don't declare extern variables in net/core/sysctl_net_core.c
Some are already declared in include/linux/netdevice.h, while
some others (xfrm ones) need to be declared.

The driver/net/rrunner.c just uses same extern as well, so
cleanup it also.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:56 -07:00
Jeff Garzik bada339ba2 [NET]: Validate device addr prior to interface-up
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:50 -07:00
Linus Torvalds f09cc910fe Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
  [IPSEC] IPV6: Fix to add tunnel mode SA correctly.
  [NET]: Cut off the queue_mapping field from sk_buff
  [NET]: Hide the queue_mapping field inside netif_subqueue_stopped
  [NET]: Make and use skb_get_queue_mapping
  [NET]: Use the skb_set_queue_mapping where appropriate
  [INET]: Use MODULE_ALIAS_NET_PF_PROTO_TYPE where possible.
  [INET]: Let inet_diag and friends autoload
  [NIU]: Cleanup PAGE_SIZE checks a bit
  [NET]: Fix SKB_WITH_OVERHEAD calculation
  [ATM]: Fix clip module reload crash.
  [TG3]: Update version to 3.85
  [TG3]: PCI command adjustment
  [TG3]: Add management FW version to ethtool report
  [TG3]: Add 5723 support
  [Bluetooth] Convert RFCOMM to use kthread API
  [Bluetooth] Add constant for Bluetooth socket options level
  [Bluetooth] Add support for handling simple eSCO links
  [Bluetooth] Add address and channel attribute to RFCOMM TTY device
  [Bluetooth] Fix wrong argument in debug code of HIDP
  [Bluetooth] Add generic driver for Bluetooth USB devices
  ...
2007-10-22 19:22:33 -07:00
Jens Axboe fa05f1286b Update net/ to use sg helpers
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-22 21:19:56 +02:00
Pavel Emelyanov 668f895a85 [NET]: Hide the queue_mapping field inside netif_subqueue_stopped
Many places get the queue_mapping field from skb to pass it to the
netif_subqueue_stopped() which will be 0 in any case.

Make the helper that works with sk_buff

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:56 -07:00
Pavel Emelyanov dfa4091129 [NET]: Use the skb_set_queue_mapping where appropriate
There's already such a helper to initialize this field.  Use it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:55 -07:00
Randy Dunlap bfb85c9f75 [ATM]: Fix clip module reload crash.
net/atm/clip.c crashes the kernel if it (module) is loaded, removed,
and then loaded again.  Its exit call to neigh_table_clear()
should destroy the cache after freeing it.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:52 -07:00
Jan Engelhardt 96de0e252c Convert files to UTF-8 and some cleanups
* Convert files to UTF-8.

  * Also correct some people's names
    (one example is Eißfeldt, which was found in a source file.
    Given that the author used an ß at all in a source file
    indicates that the real name has in fact a 'ß' and not an 'ss',
    which is commonly used as a substitute for 'ß' when limited to
    7bit.)

  * Correct town names (Goettingen -> Göttingen)

  * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19 23:21:04 +02:00
Linus Torvalds 804b908adf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NET]: Fix possible dev_deactivate race condition
  [INET]: Justification for local port range robustness.
  [PACKET]: Kill unused pg_vec_endpage() function
  [NET]: QoS/Sched as menuconfig
  [NET]: Fix bug in sk_filter race cures.
  [PATCH] mac80211: make ieee802_11_parse_elems return void
2007-10-19 11:54:39 -07:00
Pavel Emelyanov ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Jiri Slaby 1977f03272 remove asm/bitops.h includes
remove asm/bitops.h includes

including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.

Cc: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:41 -07:00
Pavel Emelyanov b488893a39 pid namespaces: changes to show virtual ids to user
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
 - all in-kernel data structures must store either struct pid itself
   or the pid's global nr, obtained with pid_nr() call;
 - when seeking the task from kernel code with the stored id one
   should use find_task_by_pid() call that works with global pids;
 - when showing pid's numerical value to the user the virtual one
   should be used, but however when one shows task's pid outside this
   task's namespace the global one is to be used;
 - when getting the pid from userspace one need to consider this as
   the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Pavel Emelyanov cf7b708c8d Make access to task's nsproxy lighter
When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists.  This is
slow on read-only paths and may be impossible in some cases.

E.g.  Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

     rcu_read_lock();
     nsproxy = task_nsproxy(tsk);
     if (nsproxy != NULL) {
             / *
               * work with the namespaces here
               * e.g. get the reference on one of them
               * /
     } / *
         * NULL task_nsproxy() means that this task is
         * almost dead (zombie)
         * /
     rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and,
of course, tested.

[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:37 -07:00
Olof Johansson 9b013e05e0 [NET]: Fix bug in sk_filter race cures.
Looks like this might be causing problems, at least for me on ppc. This
happened during a normal boot, right around first interface config/dhcp
run..

cpu 0x0: Vector: 300 (Data Access) at [c00000000147b820]
    pc: c000000000435e5c: .sk_filter_delayed_uncharge+0x1c/0x60
    lr: c0000000004360d0: .sk_attach_filter+0x170/0x180
    sp: c00000000147baa0
   msr: 9000000000009032
   dar: 4
 dsisr: 40000000
  current = 0xc000000004780fa0
  paca    = 0xc000000000650480
    pid   = 1295, comm = dhclient3
0:mon> t
[c00000000147bb20] c0000000004360d0 .sk_attach_filter+0x170/0x180
[c00000000147bbd0] c000000000418988 .sock_setsockopt+0x788/0x7f0
[c00000000147bcb0] c000000000438a74 .compat_sys_setsockopt+0x4e4/0x5a0
[c00000000147bd90] c00000000043955c .compat_sys_socketcall+0x25c/0x2b0
[c00000000147be30] c000000000007508 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 000000000ff618d8
SP (fffdf040) is in userspace
0:mon> 

I.e. null pointer deref at sk_filter_delayed_uncharge+0x1c:

0:mon> di $.sk_filter_delayed_uncharge
c000000000435e40  7c0802a6      mflr    r0
c000000000435e44  fbc1fff0      std     r30,-16(r1)
c000000000435e48  7c8b2378      mr      r11,r4
c000000000435e4c  ebc2cdd0      ld      r30,-12848(r2)
c000000000435e50  f8010010      std     r0,16(r1)
c000000000435e54  f821ff81      stdu    r1,-128(r1)
c000000000435e58  380300a4      addi    r0,r3,164
c000000000435e5c  81240004      lwz     r9,4(r4)

That's the deref of fp:

static void sk_filter_delayed_uncharge(struct sock *sk, struct sk_filter *fp)
{
        unsigned int size = sk_filter_len(fp);
...

That is called from sk_attach_filter():

...
        rcu_read_lock_bh();
        old_fp = rcu_dereference(sk->sk_filter);
        rcu_assign_pointer(sk->sk_filter, fp);
        rcu_read_unlock_bh();

        sk_filter_delayed_uncharge(sk, old_fp);
        return 0;
...

So, looks like rcu_dereference() returned NULL. I don't know the
filter code at all, but it seems like it might be a valid case?
sk_detach_filter() seems to handle a NULL sk_filter, at least.

So, this needs review by someone who knows the filter, but it fixes the
problem for me:

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-18 21:48:39 -07:00
Linus Torvalds a57793651f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (51 commits)
  [IPV6]: Fix again the fl6_sock_lookup() fixed locking
  [NETFILTER]: nf_conntrack_tcp: fix connection reopening fix
  [IPV6]: Fix race in ipv6_flowlabel_opt() when inserting two labels
  [IPV6]: Lost locking in fl6_sock_lookup
  [IPV6]: Lost locking when inserting a flowlabel in ipv6_fl_list
  [NETFILTER]: xt_sctp: fix mistake to pass a pointer where array is required
  [NET]: Fix OOPS due to missing check in dev_parse_header().
  [TCP]: Remove lost_retrans zero seqno special cases
  [NET]: fix carrier-on bug?
  [NET]: Fix uninitialised variable in ip_frag_reasm()
  [IPSEC]: Rename mode to outer_mode and add inner_mode
  [IPSEC]: Disallow combinations of RO and AH/ESP/IPCOMP
  [IPSEC]: Use the top IPv4 route's peer instead of the bottom
  [IPSEC]: Store afinfo pointer in xfrm_mode
  [IPSEC]: Add missing BEET checks
  [IPSEC]: Move type and mode map into xfrm_state.c
  [IPSEC]: Fix length check in xfrm_parse_spi
  [IPSEC]: Move ip_summed zapping out of xfrm6_rcv_spi
  [IPSEC]: Get nexthdr from caller in xfrm6_rcv_spi
  [IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input
  ...
2007-10-18 14:40:30 -07:00
Eric W. Biederman d12af679bc sysctl: fix neighbour table sysctls.
- In ipv6 ndisc_ifinfo_syctl_change so it doesn't depend on binary
  sysctl names for a function that works with proc.

- In neighbour.c reorder the table to put the possibly unused entries
  at the end so we can remove them by terminating the table early.

- In neighbour.c kill the entries with questionable binary sysctl
  handling behavior.

- In neighbour.c if we don't have a strategy routine remove the
  binary path.  So we don't the default sysctl strategy routine
  on data that is not ready for it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:22 -07:00
Herbert Xu 13996378e6 [IPSEC]: Rename mode to outer_mode and add inner_mode
This patch adds a new field to xfrm states called inner_mode.  The existing
mode object is renamed to outer_mode.

This is the first part of an attempt to fix inter-family transforms.  As it
is we always use the outer family when determining which mode to use.  As a
result we may end up shoving IPv4 packets into netfilter6 and vice versa.

What we really want is to use the inner family for the first part of outbound
processing and the outer family for the second part.  For inbound processing
we'd use the opposite pairing.

I've also added a check to prevent silly combinations such as transport mode
with inter-family transforms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:35:51 -07:00
Pavel Emelyanov 47e958eac2 [NET]: Fix the race between sk_filter_(de|at)tach and sk_clone()
The proposed fix is to delay the reference counter decrement
until the quiescent state pass. This will give sk_clone() a
chance to get the reference on the cloned filter.

Regular sk_filter_uncharge can happen from the sk_free() only
and there's no need in delaying the put - the socket is dead
anyway and is to be release itself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:22:42 -07:00
Pavel Emelyanov d3904b7399 [NET]: Cleanup the error path in sk_attach_filter
The sk_filter_uncharge is called for error handling and
for releasing the former filter, but this will have to
be done in a bit different manner, so cleanup the error
path a bit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:22:17 -07:00
Pavel Emelyanov 309dd5fc87 [NET]: Move the filter releasing into a separate call
This is done merely as a preparation for the fix.

The sk_filter_uncharge() unaccounts the filter memory and calls
the sk_filter_release(), which in turn decrements the refcount
anf frees the filter.

The latter function will be required separately.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:21:51 -07:00
Pavel Emelyanov 55b333253d [NET]: Introduce the sk_detach_filter() call
Filter is attached in a separate function, so do the
same for filter detaching.

This also removes one variable sock_setsockopt().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:21:26 -07:00
Pavel Emelyanov 4ae289444b [NEIGH]: Ensure that pneigh_lookup is protected with RTNL
The pnigh_lookup is used to lookup proxy entries and to 
create them in case lookup failed. 

However, the "creation" code does not perform the re-lookup
after GFP_KERNEL allocation. This is done because the code
is expected to be protected with the RTNL lock, so add the 
assertion (mainly to address future questions from new network 
developers like me :) ).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:54:15 -07:00
Herbert Xu a030847e9f [NET]: Avoid copying TCP packets unnecessarily
TCP packets all have writable heads, that is, even though it's cloned, it is
writable up to the end of the TCP header.  This patch makes skb_checksum_help
aware of this fact by using skb_clone_writable and avoiding a copy for TCP.

I've also modified the BUG_ON tests to be unsigned.  The only case where this
makes a difference is if csum_start points to a location before skb->data.
Since skb->data should always include the header where the checksum field
is (and all currently callers adhere to that), this change is safe and may
uncover bugs later.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:34 -07:00
Herbert Xu 172a863f2b [NET]: Fix csum_start update in pskb_expand_head
I got confused by the dual nature of the off variable in the
function pskb_expand_head.  The csum_start offset should use
nhead instead of off which can change depending on whether we
are using offsets or pointers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:33 -07:00
Herbert Xu f697c3e8b3 [NET]: Avoid unnecessary cloning for ingress filtering
As it is we always invoke pt_prev before ing_filter, even if there are no
ingress filters attached.  This can cause unnecessary cloning in pt_prev.

This patch changes it so that we only invoke pt_prev if there are ingress
filters attached.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:26 -07:00
Herbert Xu e0053ec07e [SKBUFF]: Add skb_morph
This patch creates a new function skb_morph that's just like skb_clone
except that it lets user provide the spare skb that will be overwritten
by the one that's to be cloned.

This will be used by IP fragment reassembly so that we get back the same
skb that went in last (rather than the head skb that we get now which
requires us to carry around double pointers all over the place).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Herbert Xu dec18810c5 [SKBUFF]: Merge common code between copy_skb_header and skb_clone
This patch creates a new function __copy_skb_header to merge the common
code between copy_skb_header and skb_clone.  Having two functions which
are largely the same is a source of wasted labour as well as confusion.

In fact the tc_verd stuff is almost certainly a bug since it's treated
differently in skb_clone compared to the callers of copy_skb_header
(skb_copy/pskb_copy/skb_copy_expand).

I've kept that difference in tact with a comment added asking for
clarification.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Randy Dunlap c4ea43c552 net core: fix kernel-doc for new function parameters
Fix networking code kernel-doc for newly added parameters.

Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev'
Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n'
Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net'
Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-13 09:52:26 -07:00
Kay Sievers 7eff2e7a8b Driver core: change add_uevent_var to use a struct
This changes the uevent buffer functions to use a struct instead of a
long list of parameters. It does no longer require the caller to do the
proper buffer termination and size accounting, which is currently wrong
in some places. It fixes a known bug where parts of the uevent
environment are overwritten because of wrong index calculations.

Many thanks to Mathieu Desnoyers for finding bugs and improving the
error handling.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:01 -07:00
Denis V. Lunev cd40b7d398 [NET]: make netlink user -> kernel interface synchronious
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced 
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:15:29 -07:00
Denis V. Lunev 1536cc0d55 [NET]: rtnl_unlock cleanups
There is no need to process outstanding netlink user->kernel packets
during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
anymore.

Normal code path is the following:
netlink_sendmsg
   netlink_unicast
       netlink_sendskb
           skb_queue_tail
           netlink_data_ready
               rtnetlink_rcv
                   mutex_lock(&rtnl_mutex);
                   netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
                   mutex_unlock(&rtnl_mutex);

So, it is possible, that packets can be present in the rtnl->sk_receive_queue
during rtnl_unlock, but there is no need to process them at that moment as
rtnetlink_rcv for that packet is pending.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:12:58 -07:00
Pavel Emelyanov 9b77265235 [NET]: Remove double dev->flags checking when calling dev_close()
The unregister_netdevice() and dev_change_net_namespace()
both check for dev->flags to be IFF_UP before calling the
dev_close(), but the dev_close() checks for IFF_UP itself,
so remove those unneeded checks.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:52 -07:00
Pavel Emelyanov 32f0c4cbe4 [NETNS]: Don't memset() netns to zero manually
The newly created net namespace is set to 0 with memset()
in setup_net(). The setup_net() is also called for the
init_net_ns(), which is zeroed naturally as a global var.

So remove this memset and allocate new nets with the
kmem_cache_zalloc().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:59 -07:00