Commit Graph

4286 Commits

Author SHA1 Message Date
Eric Dumazet 4b9d9be839 inetpeer: remove unused list
Andi Kleen and Tim Chen reported huge contention on inetpeer
unused_peers.lock, on memcached workload on a 40 core machine, with
disabled route cache.

It appears we constantly flip peers refcnt between 0 and 1 values, and
we must insert/remove peers from unused_peers.list, holding a contended
spinlock.

Remove this list completely and perform a garbage collection on-the-fly,
at lookup time, using the expired nodes we met during the tree
traversal.

This removes a lot of code, makes locking more standard, and obsoletes
two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
removes two pointers in inet_peer structure.

There is still a false sharing effect because refcnt is in first cache
line of object [were the links and keys used by lookups are located], we
might move it at the end of inet_peer structure to let this first cache
line mostly read by cpus.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00
Jerry Chu 9ad7c049f0 tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side
This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.

It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.

The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00
Linus Torvalds 0e833d8cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
  tg3: Fix tg3_skb_error_unmap()
  net: tracepoint of net_dev_xmit sees freed skb and causes panic
  drivers/net/can/flexcan.c: add missing clk_put
  net: dm9000: Get the chip in a known good state before enabling interrupts
  drivers/net/davinci_emac.c: add missing clk_put
  af-packet: Add flag to distinguish VID 0 from no-vlan.
  caif: Fix race when conditionally taking rtnl lock
  usbnet/cdc_ncm: add missing .reset_resume hook
  vlan: fix typo in vlan_dev_hard_start_xmit()
  net/ipv4: Check for mistakenly passed in non-IPv4 address
  iwl4965: correctly validate temperature value
  bluetooth l2cap: fix locking in l2cap_global_chan_by_psm
  ath9k: fix two more bugs in tx power
  cfg80211: don't drop p2p probe responses
  Revert "net: fix section mismatches"
  drivers/net/usb/catc.c: Fix potential deadlock in catc_ctrl_run()
  sctp: stop pending timers and purge queues when peer restart asoc
  drivers/net: ks8842 Fix crash on received packet when in PIO mode.
  ip_options_compile: properly handle unaligned pointer
  iwlagn: fix incorrect PCI subsystem id for 6150 devices
  ...
2011-06-04 23:16:00 +09:00
Marcus Meissner d0733d2e29 net/ipv4: Check for mistakenly passed in non-IPv4 address
Check against mistakenly passing in IPv6 addresses (which would result
in an INADDR_ANY bind) or similar incompatible sockaddrs.

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-01 21:05:22 -07:00
Chris Metcalf 48bdf072c3 ip_options_compile: properly handle unaligned pointer
The current code takes an unaligned pointer and does htonl() on it to
make it big-endian, then does a memcpy().  The problem is that the
compiler decides that since the pointer is to a __be32, it is legal
to optimize the copy into a processor word store.  However, on an
architecture that does not handled unaligned writes in kernel space,
this produces an unaligned exception fault.

The solution is to track the pointer as a "char *" (which removes a bunch
of unpleasant casts in any case), and then just use put_unaligned_be32()
to write the value to memory.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@zippy.davemloft.net>
2011-05-31 15:11:02 -07:00
Linus Torvalds 10799db60c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  net: Kill ratelimit.h dependency in linux/net.h
  net: Add linux/sysctl.h includes where needed.
  net: Kill ether_table[] declaration.
  inetpeer: fix race in unused_list manipulations
  atm: expose ATM device index in sysfs
  IPVS: bug in ip_vs_ftp, same list heaad used in all netns.
  bug.h: Move ratelimit warn interfaces to ratelimit.h
  bonding: cleanup module option descriptions
  net:8021q:vlan.c Fix pr_info to just give the vlan fullname and version.
  net: davinci_emac: fix dev_err use at probe
  can: convert to %pK for kptr_restrict support
  net: fix ETHTOOL_SFEATURES compatibility with old ethtool_ops.set_flags
  netfilter: Fix several warnings in compat_mtw_from_user().
  netfilter: ipset: fix ip_set_flush return code
  netfilter: ipset: remove unused variable from type_pf_tdel()
  netfilter: ipset: Use proper timeout value to jiffies conversion
2011-05-27 11:16:27 -07:00
Eric Dumazet 686a7e32ca inetpeer: fix race in unused_list manipulations
Several crashes in cleanup_once() were reported in recent kernels.

Commit d6cc1d642d (inetpeer: various changes) added a race in
unlink_from_unused().

One way to avoid taking unused_peers.lock before doing the list_empty()
test is to catch 0->1 refcnt transitions, using full barrier atomic
operations variants (atomic_cmpxchg() and atomic_inc_return()) instead
of previous atomic_inc() and atomic_add_unless() variants.

We then call unlink_from_unused() only for the owner of the 0->1
transition.

Add a new atomic_add_unless_return() static helper

With help from Arun Sharma.

Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772

Reported-by: Arun Sharma <asharma@fb.com>
Reported-by: Maximilian Engelhardt <maxi@daemonizer.de>
Reported-by: Yann Dupont <Yann.Dupont@univ-nantes.fr>
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-27 13:39:11 -04:00
Linus Torvalds fce637e392 Merge branches 'core-fixes-for-linus' and 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  seqlock: Get rid of SEQLOCK_UNLOCKED

* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq: Remove smp_affinity_list when unregister irq proc
2011-05-26 12:19:11 -07:00
Veaceslav Falico 24cf3af3fe igmp: call ip_mc_clear_src() only when we have no users of ip_mc_list
In igmp_group_dropped() we call ip_mc_clear_src(), which resets the number
of source filters per mulitcast. However, igmp_group_dropped() is also
called on NETDEV_DOWN, NETDEV_PRE_TYPE_CHANGE and NETDEV_UNREGISTER, which
means that the group might get added back on NETDEV_UP, NETDEV_REGISTER and
NETDEV_POST_TYPE_CHANGE respectively, leaving us with broken source
filters.

To fix that, we must clear the source filters only when there are no users
in the ip_mc_list, i.e. in ip_mc_dec_group() and on device destroy.

Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 13:26:12 -04:00
Eric Dumazet c4dbe54ed7 seqlock: Get rid of SEQLOCK_UNLOCKED
All static seqlock should be initialized with the lockdep friendly
__SEQLOCK_UNLOCKED() macro.

Remove legacy SEQLOCK_UNLOCKED() macro.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/%3C1306238888.3026.31.camel%40edumazet-laptop%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-24 15:22:17 +02:00
Dan Rosenberg 71338aa7d0 net: convert %p usage to %pK
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 01:13:12 -04:00
Eric Dumazet 19a76fa959 net: ping: cleanups ping_v4_unhash()
net/ipv4/ping.c: In function ‘ping_v4_unhash’:
net/ipv4/ping.c:140:28: warning: variable ‘hslot’ set but not used

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-23 16:29:24 -04:00
Linus Torvalds 53ee7569ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  bnx2x: allow device properly initialize after hotplug
  bnx2x: fix DMAE timeout according to hw specifications
  bnx2x: properly handle CFC DEL in cnic flow
  bnx2x: call dev_kfree_skb_any instead of dev_kfree_skb
  net: filter: move forward declarations to avoid compile warnings
  pktgen: refactor pg_init() code
  pktgen: use vzalloc_node() instead of vmalloc_node() + memset()
  net: skb_trim explicitely check the linearity instead of data_len
  ipv4: Give backtrace in ip_rt_bug().
  net: avoid synchronize_rcu() in dev_deactivate_many
  net: remove synchronize_net() from netdev_set_master()
  rtnetlink: ignore NETDEV_RELEASE and NETDEV_JOIN event
  net: rename NETDEV_BONDING_DESLAVE to NETDEV_RELEASE
  bridge: call NETDEV_JOIN notifiers when add a slave
  netpoll: disable netpoll when enslave a device
  macvlan: Forward unicast frames in bridge mode to lowerdev
  net: Remove linux/prefetch.h include from linux/skbuff.h
  ipv4: Include linux/prefetch.h in fib_trie.c
  netlabel: Remove prefetches from list handlers.
  drivers/net: add prefetch header for prefetch users
  ...

Fixed up prefetch parts: removed a few duplicate prefetch.h includes,
fixed the location of the igb prefetch.h, took my version of the
skbuff.h code without the extra parentheses etc.
2011-05-23 08:39:24 -07:00
Paul Gortmaker 70c7160619 Add appropriate <linux/prefetch.h> include for prefetch users
After discovering that wide use of prefetch on modern CPUs
could be a net loss instead of a win, net drivers which were
relying on the implicit inclusion of prefetch.h via the list
headers showed up in the resulting cleanup fallout.  Give
them an explicit include via the following $0.02 script.

 =========================================
 #!/bin/bash
 MANUAL=""
 for i in `git grep -l 'prefetch(.*)' .` ; do
 	grep -q '<linux/prefetch.h>' $i
 	if [ $? = 0 ] ; then
 		continue
 	fi

 	(	echo '?^#include <linux/?a'
 		echo '#include <linux/prefetch.h>'
 		echo .
 		echo w
 		echo q
 	) | ed -s $i > /dev/null 2>&1
 	if [ $? != 0 ]; then
 		echo $i needs manual fixup
 		MANUAL="$i $MANUAL"
 	fi
 done
 echo ------------------- 8\<----------------------
 echo vi $MANUAL
 =========================================

Signed-off-by: Paul <paul.gortmaker@windriver.com>
[ Fixed up some incorrect #include placements, and added some
  non-network drivers and the fib_trie.c case    - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 21:41:57 -07:00
Dave Jones c378a9c019 ipv4: Give backtrace in ip_rt_bug().
Add a stack backtrace to the ip_rt_bug path for debugging

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 21:01:20 -04:00
David S. Miller 120a3d5c7c ipv4: Include linux/prefetch.h in fib_trie.c
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 20:53:43 -04:00
Linus Torvalds 06f4e926d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
  macvlan: fix panic if lowerdev in a bond
  tg3: Add braces around 5906 workaround.
  tg3: Fix NETIF_F_LOOPBACK error
  macvlan: remove one synchronize_rcu() call
  networking: NET_CLS_ROUTE4 depends on INET
  irda: Fix error propagation in ircomm_lmp_connect_response()
  irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
  irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
  be2net: Kill set but unused variable 'req' in lancer_fw_download()
  irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
  atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
  rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
  pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
  isdn: capi: Use pr_debug() instead of ifdefs.
  tg3: Update version to 3.119
  tg3: Apply rx_discards fix to 5719/5720
  ...

Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
2011-05-20 13:43:21 -07:00
Linus Torvalds eb04f2f04e Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
  Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
  net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
  batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
  batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
  batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
  net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
  net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
  net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
  perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
  perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
  net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
  net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
  net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
  security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
  net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
  net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
  net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
  ...
2011-05-19 18:14:34 -07:00
Micha Nelissen 3fb72f1e6e ipconfig wait for carrier
v3 -> v4: fix return boolean false instead of 0 for ic_is_init_dev

Currently the ip auto configuration has a hardcoded delay of 1 second.
When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
nfs root may not be found.

Remove the hardcoded delay, and wait for carrier on at least one network
device.

Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 17:13:04 -04:00
Changli Gao 75e308c894 net: ping: fix the coding style
The characters in a line should be no more than 80.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 16:17:51 -04:00
Changli Gao bb0cd2fb53 net: ping: make local functions static
As these functions are only used in this file.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 16:17:51 -04:00
David S. Miller a48eff1288 ipv4: Pass explicit destination address to rt_bind_peer().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:42:43 -04:00
David S. Miller ed2361e66e ipv4: Pass explicit destination address to rt_get_peer().
This will next trickle down to rt_bind_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:38:54 -04:00
David S. Miller 6bd023f3dd ipv4: Make caller provide flowi4 key to inet_csk_route_req().
This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:32:03 -04:00
David S. Miller 6882f933cc ipv4: Kill RT_CACHE_DEBUG
It's way past it's usefulness.  And this gets rid of a bunch
of stray ->rt_{dst,src} references.

Even the comment documenting the macro was inaccurate (stated
default was 1 when it's 0).

If reintroduced, it should be done properly, with dynamic debug
facilities.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:23:21 -04:00
David S. Miller 1d1652cbdb ipv4: Don't use enums as bitmasks in ip_fragment.c
Noticed by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-17 17:28:02 -04:00
Vasiliy Kulikov f56e03e8dc net: ping: fix build failure
If CONFIG_PROC_SYSCTL=n the building process fails:

    ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'

Moved inet_get_ping_group_range_net() to ping.c.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-17 14:16:58 -04:00
Eric Dumazet 5173cc0577 ipv4: more compliant RFC 3168 support
Commit 6623e3b24a (ipv4: IP defragmentation must be ECN aware) was an
attempt to not lose "Congestion Experienced" (CE) indications when
performing datagram defragmentation.

Stefanos Harhalakis raised the point that RFC 3168 requirements were not
completely met by this commit.

In particular, we MUST detect invalid combinations and eventually drop
illegal frames.

Reported-by: Stefanos Harhalakis <v13@v13.gr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 14:49:14 -04:00
David S. Miller c5be24ff62 ipv4: Trivial rt->rt_src conversions in net/ipv4/route.c
At these points we have a fully filled in value via the IP
header the form of ip_hdr(skb)->saddr

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 13:49:05 -04:00
Eric Dumazet 1a8218e962 net: ping: dont call udp_ioctl()
udp_ioctl() really handles UDP and UDPLite protocols.

1) It can increment UDP_MIB_INERRORS in case first_packet_length() finds
a frame with bad checksum.

2) It has a dependency on sizeof(struct udphdr), not applicable to
ICMP/PING

If ping sockets need to handle SIOCINQ/SIOCOUTQ ioctl, this should be
done differently.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-16 11:49:39 -04:00
Eric Dumazet 1b1cb1f78a net: ping: small changes
ping_table is not __read_mostly, since it contains one rwlock,
and is static to ping.c

ping_port_rover & ping_v4_lookup are static

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-15 01:22:21 -04:00
David S. Miller 7be799a70b ipv4: Remove rt->rt_dst reference from ip_forward_options().
At this point iph->daddr equals what rt->rt_dst would hold.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:31:02 -04:00
David S. Miller 8e36360ae8 ipv4: Remove route key identity dependencies in ip_rt_get_source().
Pass in the sk_buff so that we can fetch the necessary keys from
the packet header when working with input routes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:29:41 -04:00
David S. Miller 22f728f8f3 ipv4: Always call ip_options_build() after rest of IP header is filled in.
This will allow ip_options_build() to reliably look at the values of
iph->{daddr,saddr}

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:21:27 -04:00
David S. Miller 0374d9ceb0 ipv4: Kill spurious write to iph->daddr in ip_forward_options().
This code block executes when opt->srr_is_hit is set.  It will be
set only by ip_options_rcv_srr().

ip_options_rcv_srr() walks until it hits a matching nexthop in the SRR
option addresses, and when it matches one 1) looks up the route for
that nexthop and 2) on route lookup success it writes that nexthop
value into iph->daddr.

ip_forward_options() runs later, and again walks the SRR option
addresses looking for the option matching the destination of the route
stored in skb_rtable().  This route will be the same exact one looked
up for the nexthop by ip_options_rcv_srr().

Therefore "rt->rt_dst == iph->daddr" must be true.

All it really needs to do is record the route's source address in the
matching SRR option adddress.  It need not write iph->daddr again,
since that has already been done by ip_options_rcv_srr() as detailed
above.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 17:15:50 -04:00
Vasiliy Kulikov c319b4d76b net: ipv4: add IPPROTO_ICMP socket kind
This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges.  In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.

ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets.  Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

PATCH v3:
    - switched to flowi4.
    - minor changes to be consistent with raw sockets code.

PATCH v2:
    - changed ping_debug() to pr_debug().
    - removed CONFIG_IP_PING.
    - removed ping_seq_fops.owner field (unused for procfs).
    - switched to proc_net_fops_create().
    - switched to %pK in seq_printf().

PATCH v1:
    - fixed checksumming bug.
    - CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
    - minor cleanups.
    - introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13 16:08:13 -04:00
David S. Miller 72a8f97bf2 ipv4: Fix 'iph' use before set.
I swear none of my compilers warned about this, yet it is so
obvious.

> net/ipv4/ip_forward.c: In function 'ip_forward':
> net/ipv4/ip_forward.c:87: warning: 'iph' may be used uninitialized in this function

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 23:03:46 -04:00
David S. Miller def57687e9 ipv4: Elide use of rt->rt_dst in ip_forward()
No matter what kind of header mangling occurs due to IP options
processing, rt->rt_dst will always equal iph->daddr in the packet.

So we can safely use iph->daddr instead of rt->rt_dst here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:34:30 -04:00
David S. Miller c30883bdff ipv4: Simplify iph->daddr overwrite in ip_options_rcv_srr().
We already copy the 4-byte nexthop from the options block into
local variable "nexthop" for the route lookup.

Re-use that variable instead of memcpy()'ing again when assigning
to iph->daddr after the route lookup succeeds.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:30:58 -04:00
David S. Miller 10949550bd ipv4: Kill spurious opt->srr check in ip_options_rcv_srr().
All call sites conditionalize the call to ip_options_rcv_srr()
with a check of opt->srr, so no need to check it again there.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-12 19:26:57 -04:00
David S. Miller 3c709f8fb4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6
Conflicts:
	drivers/net/benet/be_main.c
2011-05-11 14:26:58 -04:00
Steffen Klassert 43a4dea4c9 xfrm: Assign the inner mode output function to the dst entry
As it is, we assign the outer modes output function to the dst entry
when we create the xfrm bundle. This leads to two problems on interfamily
scenarios. We might insert ipv4 packets into ip6_fragment when called
from xfrm6_output. The system crashes if we try to fragment an ipv4
packet with ip6_fragment. This issue was introduced with git commit
ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets
as needed). The second issue is, that we might insert ipv4 packets in
netfilter6 and vice versa on interfamily scenarios.

With this patch we assign the inner mode output function to the dst entry
when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner
mode is used and the right fragmentation and netfilter functions are called.
We switch then to outer mode with the output_finish functions.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:34 -07:00
Eric Dumazet 1fc19aff84 net: fix two lockdep splats
Commit e67f88dd12 (net: dont hold rtnl mutex during netlink dump
callbacks) switched rtnl protection to RCU, but we forgot to adjust two
rcu_dereference() lockdep annotations :

inet_get_link_af_size() or inet_fill_link_af() might be called with
rcu_read_lock or rtnl held, so use rcu_dereference_rtnl()
instead of rtnl_dereference()

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:01 -07:00
David S. Miller 8f01cb0827 ipv4: xfrm: Eliminate ->rt_src reference in policy code.
Rearrange xfrm4_dst_lookup() so that it works by calling a helper
function __xfrm_dst_lookup() that takes an explicit flow key storage
area as an argument.

Use this new helper in xfrm4_get_saddr() so we can fetch the selected
source address from the flow instead of from rt->rt_src

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:48 -07:00
David S. Miller 79ab053145 ipv4: udp: Eliminate remaining uses of rt->rt_src
We already track and pass around the correct flow key,
so simply use it in udp_send_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:47 -07:00
David S. Miller 9f6abb5f17 ipv4: icmp: Eliminate remaining uses of rt->rt_src
On input packets, rt->rt_src always equals ip_hdr(skb)->saddr

Anything that mangles or otherwise changes the IP header must
relookup the route found at skb_rtable().  Therefore this
invariant must always hold true.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:46 -07:00
David S. Miller 0a5ebb8000 ipv4: Pass explicit daddr arg to ip_send_reply().
This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 13:32:46 -07:00
David S. Miller f5fca60865 ipv4: Pass flow key down into ip_append_*().
This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:24:07 -07:00
David S. Miller 77968b7824 ipv4: Pass flow keys down into datagram packet building engine.
This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:24:06 -07:00
David S. Miller e474995f29 udp: Use flow key information instead of rt->rt_{src,dst}
We have two cases.

Either the socket is in TCP_ESTABLISHED state and connect() filled
in the inet socket cork flow, or we looked up the route here and
used an on-stack flow.

Track which one it was, and use it to obtain src/dst addrs.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 21:12:48 -07:00