Commit Graph

3597 Commits

Author SHA1 Message Date
Patrick McHardy 48f8ac2653 netfilter: nf_nat_sip: add TCP support
Add support for mangling TCP SIP packets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-11 12:29:38 +01:00
Patrick McHardy 010c0b9f34 netfilter: nf_nat: support mangling a single TCP packet multiple times
nf_nat_mangle_tcp_packet() can currently only handle a single mangling
per window because it only maintains two sequence adjustment positions:
the one before the last adjustment and the one after.

This patch makes sequence number adjustment tracking in
nf_nat_mangle_tcp_packet() optional and allows a helper to manually
update the offsets after the packet has been fully handled.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-11 12:27:09 +01:00
Patrick McHardy f5b321bd37 netfilter: nf_conntrack_sip: add TCP support
Add TCP support, which is mandated by RFC3261 for all SIP elements.

SIP over TCP is similar to UDP, except that messages are delimited
by Content-Length: headers and multiple messages may appear in one
packet.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-11 12:26:19 +01:00
Patrick McHardy 3b6b9fab42 netfilter: nf_conntrack_sip: pass data offset to NAT functions
When using TCP multiple SIP messages might be present in a single packet.
A following patch will parse them by setting the dptr to the beginning of
each message. The NAT helper needs to reload the dptr value after mangling
the packet however, so it needs to know the offset of the message to the
beginning of the packet.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-11 12:23:53 +01:00
Damian Lukowski 598856407d tcp: fix ICMP-RTO war
Make sure, that TCP has a nonzero RTT estimation after three-way
handshake. Currently, a listening TCP has a value of 0 for srtt,
rttvar and rto right after the three-way handshake is completed
with TCP timestamps disabled.
This will lead to corrupt RTO recalculation and retransmission
flood when RTO is recalculated on backoff reversion as introduced
in "Revert RTO on ICMP destination unreachable"
(f1ecd5d9e7).
This behaviour can be provoked by connecting to a server which
"responds first" (like SMTP) and rejecting every packet after
the handshake with dest-unreachable, which will lead to softirq
load on the server (up to 30% per socket in some tests).

Thanks to Ilpo Jarvinen for providing debug patches and to
Denys Fedoryshchenko for reporting and testing.

Changes since v3: Removed bad characters in patchfile.

Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-10 18:04:08 -08:00
Jan Engelhardt e3eaa9910b netfilter: xtables: generate initial table on-demand
The static initial tables are pretty large, and after the net
namespace has been instantiated, they just hang around for nothing.
This commit removes them and creates tables on-demand at runtime when
needed.

Size shrinks by 7735 bytes (x86_64).

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-02-10 17:50:47 +01:00
Jan Engelhardt 2b95efe7f6 netfilter: xtables: use xt_table for hook instantiation
The respective xt_table structures already have most of the metadata
needed for hook setup. Add a 'priority' field to struct xt_table so
that xt_hook_link() can be called with a reduced number of arguments.

So should we be having more tables in the future, it comes at no
static cost (only runtime, as before) - space saved:
6807373->6806555.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-02-10 17:13:33 +01:00
Jan Engelhardt 2b21e05147 netfilter: xtables: compact table hook functions (2/2)
The calls to ip6t_do_table only show minimal differences, so it seems
like a good cleanup to merge them to a single one too.
Space saving obtained by both patches: 6807725->6807373
("Total" column from `size -A`.)

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-02-10 17:03:53 +01:00
Jan Engelhardt 737535c5cf netfilter: xtables: compact table hook functions (1/2)
This patch combines all the per-hook functions in a given table into
a single function. Together with the 2nd patch, further
simplifications are possible up to the point of output code reduction.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-02-10 16:44:58 +01:00
Patrick McHardy 9ab99d5a43 Merge branch 'master' of /repos/git/net-next-2.6
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-10 14:17:10 +01:00
David S. Miller b1109bf085 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-09 11:44:44 -08:00
Daniel Mack 3ad2f3fbb9 tree-wide: Assorted spelling fixes
In particular, several occurances of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-09 11:13:56 +01:00
Patrick McHardy d696c7bdaa netfilter: nf_conntrack: fix hash resizing with namespaces
As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.

Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instanciating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-08 11:18:07 -08:00
Alexey Dobriyan 14c7dbe043 netfilter: xtables: compat out of scope fix
As per C99 6.2.4(2) when temporary table data goes out of scope,
the behaviour is undefined:

	if (compat) {
		struct foo tmp;
		...
		private = &tmp;
	}
	[dereference private]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-08 11:17:43 -08:00
Florian Westphal 7678037319 netfilter: fix build failure with CONNTRACK=y NAT=n
net/ipv4/netfilter/nf_defrag_ipv4.c: In function 'ipv4_conntrack_defrag':
net/ipv4/netfilter/nf_defrag_ipv4.c:62: error: implicit declaration of function 'nf_ct_is_template'

Signed-off-by: Florian Westphal <fwestphal@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-08 15:39:16 +01:00
Christoph Egger d088dde7b1 ipv4: obsolete config in kernel source (IP_ROUTE_PERVASIVE)
CONFIG_IP_ROUTE_PERVASIVE is missing a corresponding config
IP_ROUTE_PERVASIVE somewhere in KConfig (and missing it for ages
already) so it looks like some aging artefact no longer needed.

Therefor this patch kills of the only remaining reference to that
config Item removing the already unrechable code snipet.

Signed-off-by: Christoph Egger <siccegge@stud.informatik.uni-erlangen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-04 14:58:46 -08:00
Patrick McHardy b2a15a604d netfilter: nf_conntrack: support conntrack templates
Support initializing selected parameters of new conntrack entries from a
"conntrack template", which is a specially marked conntrack entry attached
to the skb.

Currently the helper and the event delivery masks can be initialized this
way.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-03 14:40:17 +01:00
Patrick McHardy add6746124 netfilter: add struct net * to target parameters
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-02-03 13:45:12 +01:00
Patrick McHardy d1c9ae6d1e ipv4: ip_fragment: fix unbalanced rcu_read_unlock()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-02 11:46:50 -08:00
Flavio Leitner c85bb41e93 igmp: fix ip_mc_sf_allow race [v5]
Almost all igmp functions accessing inet->mc_list are protected by
rtnl_lock(), but there is one exception which is ip_mc_sf_allow(),
so there is a chance of either ip_mc_drop_socket or ip_mc_leave_group
remove an entry while ip_mc_sf_allow is running causing a crash.

Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-02 07:32:29 -08:00
Alexey Dobriyan a92df25454 netns xfrm: ipcomp support
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-28 06:31:06 -08:00
David S. Miller 05ba712d7e Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-01-28 06:12:38 -08:00
Stephen Hemminger f81074f861 tcp_probe: avoid modulus operation and wrap fix
By rounding up the buffer size to power of 2, several expensive
modulus operations can be avoided.  This patch also solves a bug where
the gap need when ring gets full was not being accounted for.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-25 15:47:50 -08:00
Alexey Dobriyan d7c7544c3d netns xfrm: deal with dst entries in netns
GC is non-existent in netns, so after you hit GC threshold, no new
dst entries will be created until someone triggers cleanup in init_net.

Make xfrm4_dst_ops and xfrm6_dst_ops per-netns.
This is not done in a generic way, because it woule waste
(AF_MAX - 2) * sizeof(struct dst_ops) bytes per-netns.

Reorder GC threshold initialization so it'd be done before registering
XFRM policies.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-24 22:47:53 -08:00
Shan Wei e9017b5518 IP: Send an ICMP "Fragment Reassembly Timeout" message when enabling connection track
No matter whether connection track is enabled, an end host should send 
an ICMPv4 "Fragment Reassembly Timeout" message when defrag timeout. 
The reasons are following two points:  

1. RFC 792 says:
   >>>> >> > >   If a host reassembling a fragmented datagram cannot complete the
   >>>> >> > >   reassembly due to missing fragments within its time limit it
   >>>> >> > >   discards the datagram, and it may send a time exceeded message.
   >>>> >> > > 
   >>>> >> > >   If fragment zero is not available then no time exceeded need be
   >>>> >> > >   sent at all.
   >>>> >> > > 
   >>>> >> > > Read more: http://www.faqs.org/rfcs/rfc792.html#ixzz0aOXRD7Wp

2. Patrick McHardy also agrees with this opinion.   :-)   
   About the discussion of this opinion, refer to http://patchwork.ozlabs.org/patch/41649

The patch fixed the problem like this:
When enabling connection track, fragments are received at PRE_ROUTING HOOK.
If they are failed to reassemble, ip_expire() will be called. 
Before sending an ICMP "Fragment Reassembly Timeout" message, 
the patch searches router table to get the destination entry only for host type.

The patch has been tested on both host type and route type.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-23 01:57:42 -08:00
Alexey Dobriyan e754834e65 icmp: move icmp_err_convert[] to .rodata
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-23 01:21:28 -08:00
Alexey Dobriyan 5833929cc2 net: constify MIB name tables
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-23 01:21:27 -08:00
David S. Miller 51c24aaaca Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-01-23 00:31:06 -08:00
David S. Miller 6be325719b Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2010-01-22 22:45:46 -08:00
Alexey Dobriyan 477781477a netfiltr: ipt_CLUSTERIP: simplify seq_file codeA
Pass "struct clusterip_config" itself to seq_file iterators
and save one dereference. Proc entry itself isn't interesting.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-01-22 22:21:18 +01:00
Roel Kluin b4ced2b768 netlink: With opcode INET_DIAG_BC_S_LE dport was compared in inet_diag_bc_run()
The s-port should be compared.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-19 14:12:20 -08:00
Octavian Purdila 6d955180b2 ipv4: allow warming up the ARP cache with request type gratuitous ARP
If the per device ARP_ACCEPT option is enable, currently we only allow
creating new ARP cache entries for response type gratuitous ARP.

Allowing gratuitous ARP to create new ARP entries (not only to update
existing ones) is useful when we want to avoid unnecessary delays for
the first packet of a stream.

This patch allows request type gratuitous ARP to create new ARP cache
entries as well. This is useful when we want to populate the ARP cache
entries for a large number of hosts on the same LAN.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-19 02:12:34 -08:00
Alexey Dobriyan f54e9367f8 netfilter: xtables: add struct xt_mtdtor_param::net
Add ->net to match destructor list like ->net in constructor list.

Make sure it's set in ebtables/iptables/ip6tables, this requires to
propagate netns up to *_unregister_table().

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-01-18 08:25:47 +01:00
Alexey Dobriyan a83d8e8d09 netfilter: xtables: add struct xt_mtchk_param::net
Some complex match modules (like xt_hashlimit/xt_recent) want netns
information at constructor and destructor time. We propably can play
games at match destruction time, because netns can be passed in object,
but I think it's cleaner to explicitly pass netns.

Add ->net, make sure it's set from ebtables/iptables/ip6tables code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-01-18 08:21:13 +01:00
Alexey Dobriyan 0a931acfd1 ipv4: don't remove /proc/net/rt_acct
/proc/net/rt_acct is not created if NET_CLS_ROUTE=n.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:24:49 -08:00
Alexey Dobriyan 2c8c1e7297 net: spread __net_init, __net_exit
__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:16:02 -08:00
Octavian Purdila 72659ecce6 tcp: account SYN-ACK timeouts & retransmissions
Currently we don't increment SYN-ACK timeouts & retransmissions
although we do increment the same stats for SYN. We seem to have lost
the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
(commit 2248761e in the netdev-vger-cvs tree).

This patch fixes this issue. In the process we also rename the v4/v6
syn/ack retransmit functions for clarity. We also add a new
request_socket operations (syn_ack_timeout) so we can keep code in
inet_connection_sock.c protocol agnostic.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:09:39 -08:00
David S. Miller 71fceff0ea ipv4: Use less conflicting local var name in change_nexthops() loop macro.
As noticed by H Hartley Sweeten, since change_nexthops() uses 'nh'
as it's iterator variable, it can conflict with other existing
local vars.

Use "nexthop_nh" to avoid the conflict and make it easier to figure
out where this magic variable comes from.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-15 01:16:40 -08:00
Linus Torvalds 597d8c7178 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
  sky2: Fix oops in sky2_xmit_frame() after TX timeout
  Documentation/3c509: document ethtool support
  af_packet: Don't use skb after dev_queue_xmit()
  vxge: use pci_dma_mapping_error to test return value
  netfilter: ebtables: enforce CAP_NET_ADMIN
  e1000e: fix and commonize code for setting the receive address registers
  e1000e: e1000e_enable_tx_pkt_filtering() returns wrong value
  e1000e: perform 10/100 adaptive IFS only on parts that support it
  e1000e: don't accumulate PHY statistics on PHY read failure
  e1000e: call pci_save_state() after pci_restore_state()
  netxen: update version to 4.0.72
  netxen: fix set mac addr
  netxen: fix smatch warning
  netxen: fix tx ring memory leak
  tcp: update the netstamp_needed counter when cloning sockets
  TI DaVinci EMAC: Handle emac module clock correctly.
  dmfe/tulip: Let dmfe handle DM910x except for SPARC on-board chips
  ixgbe: Fix compiler warning about variable being used uninitialized
  netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq()
  mv643xx_eth: don't include cache padding in rx desc buffer size
  ...

Fix trivial conflict in drivers/scsi/cxgb3i/cxgb3i_offload.c
2010-01-12 20:53:29 -08:00
Stephen Hemminger d218d11133 tcp: Generalized TTL Security Mechanism
This patch adds the kernel portions needed to implement
RFC 5082 Generalized TTL Security Mechanism (GTSM).
It is a lightweight security measure against forged
packets causing DoS attacks (for BGP). 

This is already implemented the same way in BSD kernels.
For the necessary Quagga patch 
  http://www.gossamer-threads.com/lists/quagga/dev/17389

Description from Cisco
  http://www.cisco.com/en/US/docs/ios/12_3t/12_3t7/feature/guide/gt_btsh.html

It does add one byte to each socket structure, but I did
a little rearrangement to reuse a hole (on 64 bit), but it
does grow the structure on 32 bit

This should be documented on ip(4) man page and the Glibc in.h
file also needs update.  IPV6_MINHOPLIMIT should also be added
(although BSD doesn't support that).  

Only TCP is supported, but could also be added to UDP, DCCP, SCTP
if desired.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-11 16:28:01 -08:00
Joe Perches c299bd53aa netfilter: nf_nat_ftp: remove (*mangle[]) array and functions, use %pI4
These functions merely exist to format a buffer and call
nf_nat_mangle_tcp_packet.

Format the buffer and perform the call in nf_nat_ftp instead.

Use %pI4 for the IP address.

Saves ~600 bytes of text

old:
$ size net/ipv4/netfilter/nf_nat_ftp.o
   text	   data	    bss	    dec	    hex	filename
   2187	    160	    408	   2755	    ac3	net/ipv4/netfilter/nf_nat_ftp.o
new:
$ size net/ipv4/netfilter/nf_nat_ftp.o
   text    data     bss     dec     hex filename
   1532     112     288    1932     78c net/ipv4/netfilter/nf_nat_ftp.o

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-01-11 11:49:51 +01:00
David S. Miller d4a66e752d Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/benet/be_cmds.h
	include/linux/sysctl.h
2010-01-10 22:55:03 -08:00
Jesper Dangaard Brouer 65324144b5 net: RFC3069, private VLAN proxy arp support
This is to be used together with switch technologies, like RFC3069,
that where the individual ports are not allowed to communicate with
each other, but they are allowed to talk to the upstream router.  As
described in RFC 3069, it is possible to allow these hosts to
communicate through the upstream router by proxy_arp'ing.

This patch basically allow proxy arp replies back to the same
interface (from which the ARP request/solicitation was received).

Tunable per device via proc "proxy_arp_pvlan":
  /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan

This switch technology is known by different vendor names:
 - In RFC 3069 it is called VLAN Aggregation.
 - Cisco and Allied Telesyn call it Private VLAN.
 - Hewlett-Packard call it Source-Port filtering or port-isolation.
 - Ericsson call it MAC-Forced Forwarding (RFC Draft).

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:59:09 -08:00
Octavian Purdila 7ad6848c7e ip: fix mc_loop checks for tunnels with multicast outer addresses
When we have L3 tunnels with different inner/outer families
(i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
destination address, multicast packets will be loopbacked back to the
sending socket even if IP*_MULTICAST_LOOP is set to disabled.

The mc_loop flag is present in the family specific part of the socket
(e.g. the IPv4 or IPv4 specific part).  setsockopt sets the inner
family mc_loop flag. When the packet is pushed through the L3 tunnel
it will eventually be processed by the outer family which if different
will check the flag in a different part of the socket then it was set.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-06 20:37:01 -08:00
Julia Lawall 71c3ebfdb2 netfilter: SNMP NAT: correct the size argument to kzalloc
obj has type struct snmp_object **, not struct snmp_object *.  But indeed
it is not even clear why kmalloc is needed.  The memory is freed by the end
of the function, so the local variable of pointer type should be sufficient.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@disable sizeof_type_expr@
type T;
T **x;
@@

  x =
  <+...sizeof(
- T
+ *x
  )...+>
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-01-04 15:21:31 +01:00
Linus Torvalds c3bf4906fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (74 commits)
  Revert "b43: Enforce DMA descriptor memory constraints"
  iwmc3200wifi: fix array out-of-boundary access
  wl1251: timeout one too soon in wl1251_boot_run_firmware()
  mac80211: fix propagation of failed hardware reconfigurations
  mac80211: fix race with suspend and dynamic_ps_disable_work
  ath9k: fix missed error codes in the tx status check
  ath9k: wake hardware during AMPDU TX actions
  ath9k: wake hardware for interface IBSS/AP/Mesh removal
  ath9k: fix suspend by waking device prior to stop
  cfg80211: fix error path in cfg80211_wext_siwscan
  wl1271_cmd.c: cleanup char => u8
  iwlwifi: Storage class should be before const qualifier
  ath9k: Storage class should be before const qualifier
  cfg80211: fix race between deauth and assoc response
  wireless: remove remaining qual code
  rt2x00: Add USB ID for Linksys WUSB 600N rev 2.
  ath5k: fix SWI calibration interrupt storm
  mac80211: fix ibss join with fixed-bssid
  libertas: Remove carrier signaling from the scan code
  orinoco: fix GFP_KERNEL in orinoco_set_key with interrupts disabled
  ...
2009-12-30 12:37:35 -08:00
Jamal Hadi Salim 28f6aeea3f net: restore ip source validation
when using policy routing and the skb mark:
there are cases where a back path validation requires us
to use a different routing table for src ip validation than
the one used for mapping ingress dst ip.
One such a case is transparent proxying where we pretend to be
the destination system and therefore the local table
is used for incoming packets but possibly a main table would
be used on outbound.
Make the default behavior to allow the above and if users
need to turn on the symmetry via sysctl src_valid_mark

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-25 17:30:22 -08:00
laurent chavey 31d12926e3 net: Add rtnetlink init_rcvwnd to set the TCP initial receive window
Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.
The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short lived
TCP connections used for transaction type of traffic (i.e. http
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.
Support for setting initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.
The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connection to advertise larger TCP receive window
than the ones bounded by slow start.

Signed-off-by: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-23 14:13:30 -08:00
Krishna Kumar def87cf420 tcp: Slightly optimize tcp_sendmsg
Slightly optimize tcp_sendmsg since NETIF_F_SG is used many
times iteratively in the loop. The only other modification is
to change:
			} else if (i == MAX_SKB_FRAGS ||
				   (!i &&
				   !(sk->sk_route_caps & NETIF_F_SG))) {
	to:
			} else if (i == MAX_SKB_FRAGS || !sg) {

The reason why this change is correct: this code (other than
the MAX_SKB_FRAGS case) executes only due to the else part
of: "if (skb_tailroom(skb) > 0) {" - i.e. there was no space
in the skb to put the data inline. Hence SG is false is a
sufficient condition, and there is no way a fragment can be
added to the skb.

Changelog:
	- Added the above explanation for the change

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-23 14:13:29 -08:00
Krishna Kumar afeca340c0 tcp: Remove unrequired operations in tcp_push()
Remove unrequired operations in tcp_push()

Changelog:
	Removed a temporary skb variable from tcp_push()

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-23 14:13:28 -08:00
Krishna Kumar 12d50c46dc tcp: Remove check in __tcp_push_pending_frames
tcp_push checks tcp_send_head and calls __tcp_push_pending_frames,
which again checks tcp_send_head, and this unnecessary check is
done for every other caller of __tcp_push_pending_frames.

Remove tcp_send_head check in __tcp_push_pending_frames and add
the check to tcp_push_pending_frames. Other functions call
__tcp_push_pending_frames only when tcp_send_head would evaluate
to true.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-23 14:13:28 -08:00
Linus Torvalds 37c24b37fb Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: (42 commits)
  nfsd: remove pointless paths in file headers
  nfsd: move most of nfsfh.h to fs/nfsd
  nfsd: remove unused field rq_reffh
  nfsd: enable V4ROOT exports
  nfsd: make V4ROOT exports read-only
  nfsd: restrict filehandles accepted in V4ROOT case
  nfsd: allow exports of symlinks
  nfsd: filter readdir results in V4ROOT case
  nfsd: filter lookup results in V4ROOT case
  nfsd4: don't continue "under" mounts in V4ROOT case
  nfsd: introduce export flag for v4 pseudoroot
  nfsd: let "insecure" flag vary by pseudoflavor
  nfsd: new interface to advertise export features
  nfsd: Move private headers to source directory
  vfs: nfsctl.c un-used nfsd #includes
  lockd: Remove un-used nfsd headers #includes
  s390: remove un-used nfsd #includes
  sparc: remove un-used nfsd #includes
  parsic: remove un-used nfsd #includes
  compat.c: Remove dependence on nfsd private headers
  ...
2009-12-16 10:43:34 -08:00
David S. Miller 81e839efc2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2009-12-15 21:08:53 -08:00
David S. Miller bb5b7c1126 tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.
It creates a regression, triggering badness for SYN_RECV
sockets, for example:

[19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
[19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
[19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
[19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
[19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000

This is likely caused by the change in the 'estab' parameter
passed to tcp_parse_options() when invoked by the functions
in net/ipv4/tcp_minisocks.c

But even if that is fixed, the ->conn_request() changes made in
this patch series is fundamentally wrong.  They try to use the
listening socket's 'dst' to probe the route settings.  The
listening socket doesn't even have a route, and you can't
get the right route (the child request one) until much later
after we setup all of the state, and it must be done by hand.

This stuff really isn't ready, so the best thing to do is a
full revert.  This reverts the following commits:

f55017a93f
022c3f7d82
1aba721eba
cda42ebd67
345cda2fd6
dc343475ed
05eaade278
6a2a2d6bf8

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-15 20:56:42 -08:00
Patrick McHardy 8fa9ff6849 netfilter: fix crashes in bridge netfilter caused by fragment jumps
When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack
and a reassembly queue with the same fragment key already exists from
reassembling a similar packet received on a different device (f.i. with
multicasted fragments), the reassembled packet might continue on a different
codepath than where the head fragment originated. This can cause crashes
in bridge netfilter when a fragment received on a non-bridge device (and
thus with skb->nf_bridge == NULL) continues through the bridge netfilter
code.

Add a new reassembly identifier for packets originating from bridge
netfilter and use it to put those packets in insolated queues.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805

Reported-and-Tested-by: Chong Qiao <qiaochong@loongson.cn>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-12-15 16:59:59 +01:00
Eric Dumazet 5781b2356c udp: udp_lib_get_port() fix
Now we can have a large udp hash table, udp_lib_get_port() loop
should be converted to a do {} while (cond) form,
or we dont enter it at all if hash table size is exactly 65536.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-13 19:32:39 -08:00
David S. Miller 501706565b Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	include/net/tcp.h
2009-12-11 17:12:17 -08:00
Linus Torvalds 4ef58d4e2a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
  tree-wide: fix misspelling of "definition" in comments
  reiserfs: fix misspelling of "journaled"
  doc: Fix a typo in slub.txt.
  inotify: remove superfluous return code check
  hdlc: spelling fix in find_pvc() comment
  doc: fix regulator docs cut-and-pasteism
  mtd: Fix comment in Kconfig
  doc: Fix IRQ chip docs
  tree-wide: fix assorted typos all over the place
  drivers/ata/libata-sff.c: comment spelling fixes
  fix typos/grammos in Documentation/edac.txt
  sysctl: add missing comments
  fs/debugfs/inode.c: fix comment typos
  sgivwfb: Make use of ARRAY_SIZE.
  sky2: fix sky2_link_down copy/paste comment error
  tree-wide: fix typos "couter" -> "counter"
  tree-wide: fix typos "offest" -> "offset"
  fix kerneldoc for set_irq_msi()
  spidev: fix double "of of" in comment
  comment typo fix: sybsystem -> subsystem
  ...
2009-12-09 19:43:33 -08:00
Ilpo Järvinen 77722b177a tcp: fix retrans_stamp advancing in error cases
It can happen, that tcp_retransmit_skb fails due to some error.
In such cases we might end up into a state where tp->retrans_out
is zero but that's only because we removed the TCPCB_SACKED_RETRANS
bit from a segment but couldn't retransmit it because of the error
that happened. Therefore some assumptions that retrans_out checks
are based do not necessarily hold, as there still can be an old
retransmission but that is only visible in TCPCB_EVER_RETRANS bit.
As retransmission happen in sequential order (except for some very
rare corner cases), it's enough to check the head skb for that bit.

Main reason for all this complexity is the fact that connection dying
time now depends on the validity of the retrans_stamp, in particular,
that successive retransmissions of a segment must not advance
retrans_stamp under any conditions. It seems after quick thinking that
this has relatively low impact as eventually TCP will go into CA_Loss
and either use the existing check for !retrans_stamp case or send a
retransmission successfully, setting a new base time for the dying
timer (can happen only once). At worst, the dying time will be
approximately the double of the intented time. In addition,
tcp_packet_delayed() will return wrong result (has some cc aspects
but due to rarity of these errors, it's hardly an issue).

One of retrans_stamp clearing happens indirectly through first going
into CA_Open state and then a later ACK lets the clearing to happen.
Thus tcp_try_keep_open has to be modified too.

Thanks to Damian Lukowski <damian@tvk.rwth-aachen.de> for hinting
that this possibility exists (though the particular case discussed
didn't after all have it happening but was just a debug patch
artifact).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:56:12 -08:00
Damian Lukowski 2f7de5710a tcp: Stalling connections: Move timeout calculation routine
This patch moves retransmits_timed_out() from include/net/tcp.h
to tcp_timer.c, where it is used.

Reported-by: Frederic Leroy <fredo@starox.org>
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:56:11 -08:00
Eric Dumazet 2a8875e73f [PATCH] tcp: documents timewait refcnt tricks
Adds kerneldoc for inet_twsk_unhash() & inet_twsk_bind_unhash().

With help from Randy Dunlap.

Suggested-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:19:53 -08:00
Eric Dumazet 3cdaedae63 tcp: Fix a connect() race with timewait sockets
When we find a timewait connection in __inet_hash_connect() and reuse
it for a new connection request, we have a race window, releasing bind
list lock and reacquiring it in __inet_twsk_kill() to remove timewait
socket from list.

Another thread might find the timewait socket we already chose, leading to
list corruption and crashes.

Fix is to remove timewait socket from bind list before releasing the bind lock.

Note: This problem happens if sysctl_tcp_tw_reuse is set.

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:17:51 -08:00
Eric Dumazet 9327f7053e tcp: Fix a connect() race with timewait sockets
First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at same time the new socket is inserted in hash.

This makes sure timewait socket wont be found by a concurrent
writer in __inet_check_established()

Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:17:51 -08:00
David S. Miller 3dc789320e tcp: Remove runtime check that can never be true.
GCC even warns about it, as reported by Andrew Morton:

net/ipv4/tcp.c: In function 'do_tcp_getsockopt':
net/ipv4/tcp.c:2544: warning: comparison is always false due to limited range of data type

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:07:54 -08:00
Linus Torvalds d7fc02c7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
  mac80211: fix reorder buffer release
  iwmc3200wifi: Enable wimax core through module parameter
  iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
  iwmc3200wifi: Coex table command does not expect a response
  iwmc3200wifi: Update wiwi priority table
  iwlwifi: driver version track kernel version
  iwlwifi: indicate uCode type when fail dump error/event log
  iwl3945: remove duplicated event logging code
  b43: fix two warnings
  ipw2100: fix rebooting hang with driver loaded
  cfg80211: indent regulatory messages with spaces
  iwmc3200wifi: fix NULL pointer dereference in pmkid update
  mac80211: Fix TX status reporting for injected data frames
  ath9k: enable 2GHz band only if the device supports it
  airo: Fix integer overflow warning
  rt2x00: Fix padding bug on L2PAD devices.
  WE: Fix set events not propagated
  b43legacy: avoid PPC fault during resume
  b43: avoid PPC fault during resume
  tcp: fix a timewait refcnt race
  ...

Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
	kernel/sysctl_check.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv6/addrconf.c
	net/sctp/sysctl.c
2009-12-08 07:55:01 -08:00
Linus Torvalds 1557d33007 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
  security/tomoyo: Remove now unnecessary handling of security_sysctl.
  security/tomoyo: Add a special case to handle accesses through the internal proc mount.
  sysctl: Drop & in front of every proc_handler.
  sysctl: Remove CTL_NONE and CTL_UNNUMBERED
  sysctl: kill dead ctl_handler definitions.
  sysctl: Remove the last of the generic binary sysctl support
  sysctl net: Remove unused binary sysctl code
  sysctl security/tomoyo: Don't look at ctl_name
  sysctl arm: Remove binary sysctl support
  sysctl x86: Remove dead binary sysctl support
  sysctl sh: Remove dead binary sysctl support
  sysctl powerpc: Remove dead binary sysctl support
  sysctl ia64: Remove dead binary sysctl support
  sysctl s390: Remove dead sysctl binary support
  sysctl frv: Remove dead binary sysctl support
  sysctl mips/lasat: Remove dead binary sysctl support
  sysctl drivers: Remove dead binary sysctl support
  sysctl crypto: Remove dead binary sysctl support
  sysctl security/keys: Remove dead binary sysctl support
  sysctl kernel: Remove binary sysctl logic
  ...
2009-12-08 07:38:50 -08:00
Jiri Kosina d014d04386 Merge branch 'for-next' into for-linus
Conflicts:

	kernel/irq/chip.c
2009-12-07 18:36:35 +01:00
André Goddard Rosa af901ca181 tree-wide: fix assorted typos all over the place
That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:55 +01:00
Eric Dumazet 47e1c32306 tcp: fix a timewait refcnt race
After TCP RCU conversion, tw->tw_refcnt should not be set to 1 in
inet_twsk_alloc(). It allows a RCU reader to get this timewait socket,
while we not yet stabilized it.

Only choice we have is to set tw_refcnt to 0 in inet_twsk_alloc(),
then atomic_add() it later, once everything is done.

Location of this atomic_add() is tricky, because we dont want another
writer to find this timewait in ehash, while tw_refcnt is still zero !

Thanks to Kapil Dakhane tests and reports.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 16:17:44 -08:00
Eric Dumazet 13475a30b6 tcp: connect() race with timewait reuse
Its currently possible that several threads issuing a connect() find
the same timewait socket and try to reuse it, leading to list
corruptions.

Condition for bug is that these threads bound their socket on same
address/port of to-be-find timewait socket, and connected to same
target. (SO_REUSEADDR needed)

To fix this problem, we could unhash timewait socket while holding
ehash lock, to make sure lookups/changes will be serialized. Only
first thread finds the timewait socket, other ones find the
established socket and return an EADDRNOTAVAIL error.

This second version takes into account Evgeniy's review and makes sure
inet_twsk_put() is called outside of locked sections.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 16:17:43 -08:00
Eric Dumazet 49d0900787 tcp: diag: Dont report negative values for rx queue
Both netlink and /proc/net/tcp interfaces can report transient
negative values for rx queue.

ss ->
State   Recv-Q Send-Q  Local Address:Port  Peer Address:Port
ESTAB   -6     6       127.0.0.1:45956     127.0.0.1:3333 

netstat ->
tcp   4294967290      6 127.0.0.1:37784  127.0.0.1:3333 ESTABLISHED

This is because we dont lock socket while computing 
tp->rcv_nxt - tp->copied_seq,
and another CPU can update copied_seq before rcv_next in RX path.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 16:06:13 -08:00
David S. Miller 424eff9751 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2009-12-03 13:23:12 -08:00
Eric W. Biederman b099ce2602 net: Batch inet_twsk_purge
This function walks the whole hashtable so there is no point in
passing it a network namespace.  Instead I purge all timewait
sockets from dead network namespaces that I find.  If the namespace
is one of the once I am trying to purge I am guaranteed no new timewait
sockets can be formed so this will get them all.  If the namespace
is one I am not acting for it might form a few more but I will
call inet_twsk_purge again and  shortly to get rid of them.  In
any even if the network namespace is dead timewait sockets are
useless.

Move the calls of inet_twsk_purge into batch_exit routines so
that if I am killing a bunch of namespaces at once I will just
call inet_twsk_purge once and save a lot of redundant unnecessary
work.

My simple 4k network namespace exit test the cleanup time dropped from
roughly 8.2s to 1.6s.  While the time spent running inet_twsk_purge fell
to about 2ms.  1ms for ipv4 and 1ms for ipv6.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:23:47 -08:00
Eric W. Biederman 575f4cd5a5 net: Use rcu lookups in inet_twsk_purge.
While we are looking up entries to free there is no reason to take
the lock in inet_twsk_purge.  We have to drop locks and restart
occassionally anyway so adding a few more in case we get on the
wrong list because of a timewait move is no big deal.  At the
same time not taking the lock for long periods of time is much
more polite to the rest of the users of the hash table.

In my test configuration of killing 4k network namespaces
this change causes 4k back to back runs of inet_twsk_purge on an
empty hash table to go from roughly 20.7s to 3.3s, and the total
time to destroy 4k network namespaces goes from roughly 44s to
3.3s.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:23:47 -08:00
Eric W. Biederman e9c5158ac2 net: Allow fib_rule_unregister to batch
Refactor the code so fib_rules_register always takes a template instead
of the actual fib_rules_ops structure that will be used.  This is
required for network namespace support so 2 out of the 3 callers already
do this, it allows the error handling to be made common, and it allows
fib_rules_unregister to free the template for hte caller.

Modify fib_rules_unregister to use call_rcu instead of syncrhonize_rcu
to allw multiple namespaces to be cleaned up in the same rcu grace
period.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:22:55 -08:00
Patrick McHardy 8153a10c08 ipv4 05/05: add sysctl to accept packets with local source addresses
commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Dec 3 12:16:35 2009 +0100

    ipv4: add sysctl to accept packets with local source addresses

    Change fib_validate_source() to accept packets with a local source address when
    the "accept_local" sysctl is set for the incoming inet device. Combined with the
    previous patches, this allows to communicate between multiple local interfaces
    over the wire.

    Signed-off-by: Patrick McHardy <kaber@trash.net>

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:14:38 -08:00
Patrick McHardy 5adef18091 net 04/05: fib_rules: allow to delete local rule
commit d124356ce314fff22a047ea334379d5105b2d834
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Dec 3 12:16:35 2009 +0100

    net: fib_rules: allow to delete local rule

    Allow to delete the local rule and recreate it with a higher priority. This
    can be used to force packets with a local destination out on the wire instead
    of routing them to loopback. Additionally this patch allows to recreate rules
    with a priority of 0.

    Combined with the previous patch to allow oif classification, a socket can
    be bound to the desired interface and packets routed to the wire like this:

    # move local rule to lower priority
    ip rule add pref 1000 lookup local
    ip rule del pref 0

    # route packets of sockets bound to eth0 to the wire independant
    # of the destination address
    ip rule add pref 100 oif eth0 lookup 100
    ip route add default dev eth0 table 100

    Signed-off-by: Patrick McHardy <kaber@trash.net>

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:14:37 -08:00
David S. Miller e6b09ccada tcp: sysctl_tcp_cookie_size needs to be exported to modules.
Otherwise:

ERROR: "sysctl_tcp_cookie_size" [net/ipv6/ipv6.ko] undefined!
make[1]: *** [__modpost] Error 1

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:18:58 -08:00
David S. Miller f9a2e69e8b tcp: Fix warning on 64-bit.
net/ipv4/tcp_output.c: In function ‘tcp_make_synack’:
net/ipv4/tcp_output.c:2488: warning: cast from pointer to integer of different size

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:12:04 -08:00
William Allen Simpson 4957faade1 TCPCT part 1g: Responder Cookie => Initiator
Parse incoming TCP_COOKIE option(s).

Calculate <SYN,ACK> TCP_COOKIE option.

Send optional <SYN,ACK> data.

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

Requires:
   TCPCT part 1a: add request_values parameter for sending SYNACK
   TCPCT part 1b: generate Responder Cookie secret
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1d: define TCP cookie option, extend existing struct's
   TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1f: Initiator Cookie => Responder

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:26 -08:00
William Allen Simpson bd0388ae77 TCPCT part 1f: Initiator Cookie => Responder
Calculate and format <SYN> TCP_COOKIE option.

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

Requires:
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1d: define TCP cookie option, extend existing struct's

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:26 -08:00
William Allen Simpson e56fb50f2b TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
Provide per socket control of the TCP cookie option and SYN/SYNACK data.

This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.

Allocations have been rearranged to avoid requiring GFP_ATOMIC.

Requires:
   net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
   TCPCT part 1d: define TCP cookie option, extend existing struct's

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:25 -08:00
William Allen Simpson 435cf559f0 TCPCT part 1d: define TCP cookie option, extend existing struct's
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration).  There are no additions to
tcp_request_sock, and only 1 pointer in tcp_sock.

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.  This is more flexible and
less subject to user configuration error.  Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

These functions will also be used in subsequent patches that implement
additional features.

Requires:
   TCPCT part 1a: add request_values parameter for sending SYNACK
   TCPCT part 1b: generate Responder Cookie secret
   TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:25 -08:00
William Allen Simpson 519855c508 TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
Define sysctl (tcp_cookie_size) to turn on and off the cookie option
default globally, instead of a compiled configuration option.

Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
data values, retrieving variable cookie values, and other facilities.

Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h,
near its corresponding struct tcp_options_received (prior to changes).

This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

These functions will also be used in subsequent patches that implement
additional features.

Requires:
   net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED

Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:24 -08:00
William Allen Simpson da5c78c826 TCPCT part 1b: generate Responder Cookie secret
Define (missing) hash message size for SHA1.

Define hashing size constants specific to TCP cookies.

Add new function: tcp_cookie_generator().

Maintain global secret values for tcp_cookie_generator().

This is a significantly revised implementation of earlier (15-year-old)
Photuris [RFC-2522] code for the KA9Q cooperative multitasking platform.

Linux RCU technique appears to be well-suited to this application, though
neither of the circular queue items are freed.

These functions will also be used in subsequent patches that implement
additional features.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:23 -08:00
William Allen Simpson e6b4d11367 TCPCT part 1a: add request_values parameter for sending SYNACK
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission.  Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.

Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02 22:07:23 -08:00
David S. Miller ff9c38bba3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/ht.c
2009-12-01 22:13:38 -08:00
Eric W. Biederman 86de8a631e net: Simplify ipip pernet operations.
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 16:15:58 -08:00
Eric W. Biederman cfb8fbf229 net: Simplify ip_gre pernet operations.
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 16:15:57 -08:00
Eric W. Biederman a5ee155136 net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH
The motivation for an additional notifier in batched netdevice
notification (rt_do_flush) only needs to be called once per batch not
once per namespace.

For further batching improvements I need a guarantee that the
netdevices are unregistered in order allowing me to unregister an all
of the network devices in a network namespace at the same time with
the guarantee that the loopback device is really and truly
unregistered last.

Additionally it appears that we moved the route cache flush after
the final synchronize_net, which seems wrong and there was no
explanation.  So I have restored the original location of the final
synchronize_net.

Cc: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 16:15:50 -08:00
Patrick McHardy b2722b1c3a ip_fragment: also adjust skb->truesize for packets not owned by a socket
When a large packet gets reassembled by ip_defrag(), the head skb
accounts for all the fragments in skb->truesize. If this packet is
refragmented again, skb->truesize is not re-adjusted to reflect only
the head size since its not owned by a socket. If the head fragment
then gets recycled and reused for another received fragment, it might
exceed the defragmentation limits due to its large truesize value.

skb_recycle_check() explicitly checks for linear skbs, so any recycled
skb should reflect its true size in skb->truesize. Change ip_fragment()
to also adjust the truesize value of skbs not owned by a socket.

Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01 15:53:57 -08:00
Eric Dumazet 1fdf475aa1 tcp: tcp_disconnect() should clear window_clamp
NFS can reuse its TCP socket after calling tcp_disconnect().

We noticed window scaling was not negotiated in SYN packet of next
connection request.

Fix is to clear tp->window_clamp in tcp_disconnect().

Reported-by: Krzysztof Oledzki <ole@ans.pl>
Tested-by: Krzysztof Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-30 12:53:30 -08:00
David Ford bbf31bf18d ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPS
ipv4 ip_frag_reasm(), fully replace 'dev_net(dev)' with 'net', defined
previously patched into 2.6.29.

Between 2.6.28.10 and 2.6.29, net/ipv4/ip_fragment.c was patched,
changing from dev_net(dev) to container_of(...).  Unfortunately the goto
section (out_fail) on oversized packets inside ip_frag_reasm() didn't
get touched up as well.  Oversized IP packets cause a NULL pointer
dereference and immediate hang.

I discovered this running openvasd and my previous email on this is
titled:  NULL pointer dereference at 2.6.32-rc8:net/ipv4/ip_fragment.c:566

Signed-off-by: David Ford <david@blue-labs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29 23:02:22 -08:00
Joe Perches f64f9e7192 net: Move && and || to end of previous line
Not including net/atm/

Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29 16:55:45 -08:00
Martin Willi 8f8a088c21 xfrm: Use the user specified truncation length in ESP and AH
Instead of using the hardcoded truncation for authentication
algorithms, use the truncation length specified on xfrm_state.

Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:48:41 -08:00
Alexey Dobriyan a661c4199b net: convert /proc/net/rt_acct to seq_file
Rewrite statistics accumulation to be in terms of structure fields,
not raw u32 additions. Keep them in same order, though.

This is the last user of create_proc_read_entry() in net/,
please NAK all new ones as well as all new ->write_proc, ->read_proc and
create_proc_entry() users. Cc me if there are problems. :-)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:40:35 -08:00
Octavian Purdila 09ad9bc752 net: use net_eq to compare nets
Generated with the following semantic patch

@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)

@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)

applied over {include,net,drivers/net}.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:14:13 -08:00
Joe Perches 3666ed1c48 netfilter: net/ipv[46]/netfilter: Move && and || to end of previous line
Compile tested only.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-11-23 23:17:06 +01:00
Joe Perches 9d4fb27db9 net/ipv4: Move && and || to end of previous line
On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote:
> It should be of the form:
> 	if (x &&
> 	    y)
> 
> or:
> 	if (x && y)
> 
> Fix patches, rather than complaints, for existing cases where things
> do not follow this pattern are certainly welcome.

Also collapsed some multiple tabs to single space.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-23 10:41:23 -08:00
J. Bruce Fields 9b8b317d58 Merge commit 'v2.6.32-rc8' into HEAD 2009-11-23 12:34:58 -05:00
David S. Miller e994b7c901 tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL
That's extremely non-intuitive, noticed by William Allen Simpson.

And let's make the default be on, it's been suggested by a lot of
people so we'll give it a try.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-21 11:22:25 -08:00
Eric Dumazet f99189b186 netns: net_identifiers should be read_mostly
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18 05:03:25 -08:00
Octavian Purdila e2ce146848 ipv4: factorize cache clearing for batched unregister operations
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18 05:03:07 -08:00
Eric W. Biederman bb9074ff58 Merge commit 'v2.6.32-rc7'
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
2009-11-17 01:01:34 -08:00
David S. Miller a2bfbc072e Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/can/Kconfig
2009-11-17 00:05:02 -08:00
Eric Dumazet 2c1409a0a2 inetpeer: Optimize inet_getid()
While investigating for network latencies, I found inet_getid() was a
contention point for some workloads, as inet_peer_idlock is shared
by all inet_getid() users regardless of peers.

One way to fix this is to make ip_id_count an atomic_t instead
of __u16, and use atomic_add_return().

In order to keep sizeof(struct inet_peer) = 64 on 64bit arches
tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:46:58 -08:00
Eric Dumazet eec4df9885 ipv4: speedup inet_dump_ifaddr()
Stephen Hemminger a écrit :
> On Thu, 12 Nov 2009 15:11:36 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> When handling large number of netdevices, inet_dump_ifaddr()
>> is very slow because it has O(N^2) complexity.
>>
>> Instead of scanning one single list, we can use the NETDEV_HASHENTRIES
>> sub lists of the dev_index hash table, and RCU lookups.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> You might be able to make RCU critical section smaller by moving
> it into loop.
>

Indeed. But we dump at most one skb (<= 8192 bytes ?), so rcu_read_lock
holding time is small, unless we meet many netdevices without
addresses. I wonder if its really common...

Thanks

[PATCH net-next-2.6] ipv4: speedup inet_dump_ifaddr()

When handling large number of netdevices, inet_dump_ifaddr()
is very slow because it has O(N2) complexity.

Instead of scanning one single list, we can use the NETDEV_HASHENTRIES
sub lists of the dev_index hash table, and RCU lookups.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:46:55 -08:00
Eric Dumazet 6baff15037 igmp: Use next_net_device_rcu()
We need to use next_det_device_rcu() in RCU protected section.

We also can avoid in_dev_get()/in_dev_put() overhead (code size mainly)
in rcu_read_lock() sections.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:38:49 -08:00
William Allen Simpson bee7ca9ec0 net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
Define two symbols needed in both kernel and user space.

Remove old (somewhat incorrect) kernel variant that wasn't used in
most cases.  Default should apply to both RMSS and SMSS (RFC2581).

Replace numeric constants with defined symbols.

Stand-alone patch, originally developed for TCPCT.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 20:38:48 -08:00
Dan Carpenter d0490cfdf4 ipmr: missing dev_put() on error path in vif_add()
The other error paths in front of this one have a dev_put() but this one
got missed.

Found by smatch static checker.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Wang Chen <ellre923@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 19:56:54 -08:00
Ilpo Järvinen d792c1006f tcp: provide more information on the tcp receive_queue bugs
The addition of rcv_nxt allows to discern whether the skb
was out of place or tp->copied. Also catch fancy combination
of flags if necessary (sadly we might miss the actual causer
flags as it might have already returned).

Btw, we perhaps would want to forward copied_seq in
somewhere or otherwise we might have some nice loop with
WARN stuff within but where to do that safely I don't
know at this stage until more is known (but it is not
made significantly worse by this patch).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13 13:56:33 -08:00
Eric W. Biederman f8572d8f2a sysctl net: Remove unused binary sysctl code
Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.

In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.

Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-12 02:05:06 -08:00
stephen hemminger 61fbab77a8 IPV4: use rcu to walk list of devices in IGMP
This also needs to be optimized for large number of devices.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-10 22:27:12 -08:00
Eric Dumazet 30fff9231f udp: bind() optimisation
UDP bind() can be O(N^2) in some pathological cases.

Thanks to secondary hash tables, we can make it O(N)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-10 20:54:38 -08:00
David S. Miller d0e1e88d6e Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/can/usb/ems_usb.c
2009-11-08 23:00:54 -08:00
Eric Dumazet f6b8f32ca7 udp: multicast RX should increment SNMP/sk_drops counter in allocation failures
When skb_clone() fails, we should increment sk_drops and SNMP counters.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:10 -08:00
Eric Dumazet 1240d1373c ipv4: udp: Optimise multicast reception
UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.

Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.

It's also a base for a future RCU conversion of multicast recption.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:08 -08:00
Eric Dumazet 5051ebd275 ipv4: udp: optimize unicast RX path
We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.

If too many sockets are listed, we take a look at the secondary
(port, address) hash chain we added in previous patch.

We choose the shortest chain and proceed with a RCU lookup on the elected chain.

But, if we chose (port, address) chain, and fail to find a socket on given address,
 we must try another lookup on (port, INADDR_ANY) chain to find socket not bound
to a particular IP.

-> No extra cost for typical setups, where the first lookup will probabbly
be performed.

RCU lookups everywhere, we dont acquire spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:07 -08:00
Eric Dumazet 512615b6b8 udp: secondary hash on (local port, local address)
Extends udp_table to contain a secondary hash table.

socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node

udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)

udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)

Note : No spinlock lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:06 -08:00
Eric Dumazet d4cada4ae1 udp: split sk_hash into two u16 hashes
Union sk_hash with two u16 hashes for udp (no extra memory taken)

One 16 bits hash on (local port) value (the previous udp 'hash')

One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.

Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:05 -08:00
Eric Dumazet fdcc8aa953 udp: add a counter into udp_hslot
Adds a counter in udp_hslot to keep an accurate count
of sockets present in chain.

This will permit to upcoming UDP lookup algo to chose
the shortest chain when secondary hash is added.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08 20:53:04 -08:00
Eric W. Biederman 81adee47df net: Support specifying the network namespace upon device creation.
There is no good reason to not support userspace specifying the
network namespace during device creation, and it makes it easier
to create a network device and pass it to a child network namespace
with a well known name.

We have to be careful to ensure that the target network namespace
for the new device exists through the life of the call.  To keep
that logic clear I have factored out the network namespace grabbing
logic into rtnl_link_get_net.

In addtion we need to continue to pass the source network namespace
to the rtnl_link_ops.newlink method so that we can find the base
device source network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2009-11-08 00:53:51 -08:00
Herbert Xu 23ca0c989e ipip: Fix handling of DF packets when pmtudisc is OFF
RFC 2003 requires the outer header to have DF set if DF is set
on the inner header, even when PMTU discovery is off for the
tunnel.  Our implementation does exactly that.

For this to work properly the IPIP gateway also needs to engate
in PMTU when the inner DF bit is set.  As otherwise the original
host would not be able to carry out its PMTU successfully since
part of the path is only visible to the gateway.

Unfortunately when the tunnel PMTU discovery setting is off, we
do not collect the necessary soft state, resulting in blackholes
when the original host tries to perform PMTU discovery.

This problem is not reproducible on the IPIP gateway itself as
the inner packet usually has skb->local_df set.  This is not
correctly cleared (an unrelated bug) when the packet passes
through the tunnel, which allows fragmentation to occur.  For
hosts behind the IPIP gateway it is readily visible with a simple
ping.

This patch fixes the problem by performing PMTU discovery for
all packets with the inner DF bit set, regardless of the PMTU
discovery setting on the tunnel itself.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-06 20:33:40 -08:00
Patrick McHardy dee5817e88 netfilter: remove unneccessary checks from netlink notifiers
The NETLINK_URELEASE notifier is only invoked for bound sockets, so
there is no need to check ->pid again.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-11-06 17:04:00 +01:00
David S. Miller 230f9bb701 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/usb/cdc_ether.c

All CDC ethernet devices of type USB_CLASS_COMM need to use
'&mbm_info'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-06 00:55:55 -08:00
Jozsef Kadlecsik f9dd09c7f7 netfilter: nf_nat: fix NAT issue in 2.6.30.4+
Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e36).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value and thus resulted breaking active FTP.

The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be more harder and expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-06 00:43:42 -08:00
Eric Dumazet 69df9d5993 ip_frag: dont touch device refcount
When sending fragmentation expiration ICMP V4/V6 messages,
we can avoid touching device refcount, thanks to RCU

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 22:34:22 -08:00
Eric Paris c84b3268da net: check kern before calling security subsystem
Before calling capable(CAP_NET_RAW) check if this operations is on behalf
of the kernel or on behalf of userspace.  Do not do the security check if
it is on behalf of the kernel.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 22:18:18 -08:00
Eric Paris 3f378b6844 net: pass kern to net_proto_family create function
The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace.  This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 22:18:14 -08:00
Eric Paris 13f18aa05f net: drop capability from protocol definitions
struct can_proto had a capability field which wasn't ever used.  It is
dropped entirely.

struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 21:40:17 -08:00
Hannes Eder 76ac894080 netfilter: nf_nat_helper: tidy up adjust_tcp_sequence
The variable 'other_way' gets initialized but is not read afterwards,
so remove it.  Pass the right arguments to a pr_debug call.

While being at tidy up a bit and it fix this checkpatch warning:
  WARNING: suspect code indent for conditional statements

Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-11-05 15:51:19 +01:00
Gilad Ben-Yossef 6a2a2d6bf8 tcp: Use defaults when no route options are available
Trying to parse the option of a SYN packet that we have
no route entry for should just use global wide defaults
for route entry options.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 23:24:15 -08:00
Gilad Ben-Yossef 05eaade278 tcp: Do not call IPv4 specific func in tcp_check_req
Calling IPv4 specific inet_csk_route_req in tcp_check_req
is a bad idea and crashes machine on IPv6 connections, as reported
by Valdis Kletnieks

Also, all we are really interested in is the timestamp
option in the header, so calling tcp_parse_options()
with the "estab" set to false flag is an overkill as
it tries to parse half a dozen other TCP options.

We know whether timestamp should be enabled or not
using data from request_sock.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 23:24:14 -08:00
Eric Dumazet 9f9354b92d net: net/ipv4/devinet.c cleanups
As pointed by Stephen Rothwell, commit c6d14c84 added a warning :

net/ipv4/devinet.c: In function 'inet_select_addr':
net/ipv4/devinet.c:902: warning: label 'out' defined but not used

delete unused 'out' label and do some cleanups as well

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 22:05:10 -08:00
Eric Dumazet 9481721be1 netfilter: remove synchronize_net() calls in ip_queue/ip6_queue
nf_unregister_queue_handlers() already does a synchronize_rcu()
call, we dont need to do it again in callers.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-11-04 21:14:31 +01:00
Eric Dumazet c6d14c8456 net: Introduce for_each_netdev_rcu() iterator
Adds RCU management to the list of netdevices.

Convert some for_each_netdev() users to RCU version, if
it can avoid read_lock-ing dev_base_lock

Ie:
	read_lock(&dev_base_loack);
	for_each_netdev(net, dev)
		some_action();
	read_unlock(&dev_base_lock);

becomes :

	rcu_read_lock();
	for_each_netdev_rcu(net, dev)
		some_action();
	rcu_read_unlock();


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 05:43:23 -08:00
Eric Dumazet 685c794405 icmp: icmp_send() can avoid a dev_put()
We can avoid touching device refcount in icmp_send(),
using dev_get_by_index_rcu()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-01 23:55:10 -08:00
Eric Dumazet c148fc2e30 ipv4: inetdev_by_index() switch to RCU
Use dev_get_by_index_rcu() instead of __dev_get_by_index() and
dev_base_lock rwlock

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-01 23:55:09 -08:00
Herbert Xu 2e9526b352 gre: Fix dev_addr clobbering for gretap
Nathan Neulinger noticed that gretap devices get their MAC address
from the local IP address, which results in invalid MAC addresses
half of the time.

This is because gretap is still using the tunnel netdev ops rather
than the correct tap netdev ops struct.

This patch also fixes changelink to not clobber the MAC address
for the gretap case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Nathan Neulinger <nneul@mst.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-30 12:28:07 -07:00
Eric Dumazet 9d410c7960 net: fix sk_forward_alloc corruption
On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.

Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-30 12:25:12 -07:00
jamal b0c110ca8e net: Fix RPF to work with policy routing
Policy routing is not looked up by mark on reverse path filtering.
This fixes it.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 22:49:12 -07:00
David S. Miller 0519d83d83 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-10-29 21:28:59 -07:00
Cyrill Gorcunov 38bfd8f5be net,socket: introduce DECLARE_SOCKADDR helper to catch overflow at build time
proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when we implement new protocol support we should keep such
a detail in mind (which is easy to forget about).

Lets introduce DECLARE_SOCKADDR helper which check if
storage unit is not overfowed at build time.

Eventually inet_getname is switched to use DECLARE_SOCKADDR
(to show example of usage).

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 03:00:06 -07:00
roel kluin 65a1c4fffa net: Cleanup redundant tests on unsigned
optlen is unsigned so the `< 0' test is never true.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:39:54 -07:00
Gilad Ben-Yossef dc343475ed Allow disabling of DSACK TCP option per route
Add and use no DSCAK bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:48 -07:00
Gilad Ben-Yossef 345cda2fd6 Allow to turn off TCP window scale opt per route
Add and use no window scale bit in the features field.

Note that this is not the same as setting a window scale of 0
as would happen with window limit on route.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:47 -07:00
Gilad Ben-Yossef cda42ebd67 Allow disabling TCP timestamp options per route
Implement querying and acting upon the no timestamp bit in the feature
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:45 -07:00
Gilad Ben-Yossef 1aba721eba Add the no SACK route option feature
Implement querying and acting upon the no sack bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:44 -07:00
Gilad Ben-Yossef 022c3f7d82 Allow tcp_parse_options to consult dst entry
We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:41 -07:00
Gilad Ben-Yossef f55017a93f Only parse time stamp TCP option in time wait sock
Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29 01:28:39 -07:00