Commit Graph

4693 Commits

Author SHA1 Message Date
Shan Wei 8dcf01fc00 net: sock_diag_handler structs can be const
read only, so change it to const.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-25 20:46:59 -04:00
Eric Dumazet 38ba0a65fa net: skb_can_coalesce returns a boolean
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-24 00:18:02 -04:00
Eric Dumazet 783c175f90 tcp: tcp_try_coalesce returns a boolean
This clarifies code intention, as suggested by David.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:36:58 -04:00
David S. Miller f24001941c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")

The former moved around the sysctl register/unregister calls, the
later simply removed them.

With help from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:15:17 -04:00
Eric Dumazet 1402d36601 tcp: introduce tcp_try_coalesce
commit c8628155ec (tcp: reduce out_of_order memory use) took care of
coalescing tcp segments provided by legacy devices (linear skbs)

We extend this idea to fragged skbs, as their truesize can be heavy.

ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.

Use this coalescing strategy for receive queue too.

This contributes to reduce number of tcp collapses, at minimal cost, and
reduces memory overhead and packets drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:42:49 -04:00
Eric Dumazet da882c1f2e tcp: sk_add_backlog() is too agressive for TCP
While investigating TCP performance problems on 10Gb+ links, we found a
tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit
in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends
one ACK every two MSS segments.

A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing a too small backlog.

A TCP ACK, even being small, can consume nearly same truesize space than
outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
is fast to compute.

Performance results on netperf, single flow, receiver with disabled
GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
increments at sender.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:28:28 -04:00
Eric Dumazet f545a38f74 net: add a limit parameter to sk_add_backlog()
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.

No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:28:28 -04:00
David S. Miller ac807fa8e6 tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation.
net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock':
net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable]
net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock':
net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable]

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 03:21:58 -04:00
Neal Cardwell 900f65d361 tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()
This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().

Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).

When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 16:36:42 -04:00
Pavel Emelyanov b139ba4e90 tcp: Repair connection-time negotiated parameters
There are options, which are set up on a socket while performing
TCP handshake. Need to resurrect them on a socket while repairing.
A new sockoption accepts a buffer and parses it. The buffer should
be CODE:VALUE sequence of bytes, where CODE is standard option
code and VALUE is the respective value.

Only 4 options should be handled on repaired socket.

To read 3 out of 4 of these options the TCP_INFO sockoption can be
used. An ability to get the last one (the mss_clamp) was added by
the previous patch.

Now the restore. Three of these options -- timestamp_ok, mss_clamp
and snd_wscale -- are just restored on a coket.

The sack_ok flags has 2 issues. First, whether or not to do sacks
at all. This flag is just read and set back. No other sack  info is
saved or restored, since according to the standart and the code
dropping all sack-ed segments is OK, the sender will resubmit them
again, so after the repair we will probably experience a pause in
connection. Next, the fack bit. It's just set back on a socket if
the respective sysctl is set. No collected stats about packets flow
is preserved. As far as I see (plz, correct me if I'm wrong) the
fack-based congestion algorithm survives dropping all of the stats
and repairs itself eventually, probably losing the performance for
that period.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Pavel Emelyanov 5e6a3ce657 tcp: Report mss_clamp with TCP_MAXSEG option in repair mode
The mss_clamp is the only connection-time negotiated option which
cannot be obtained from the user space. Make the TCP_MAXSEG sockopt
report one in the repair mode.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Pavel Emelyanov c0e88ff0f2 tcp: Repair socket queues
Reading queues under repair mode is done with recvmsg call.
The queue-under-repair set by TCP_REPAIR_QUEUE option is used
to determine which queue should be read. Thus both send and
receive queue can be read with this.

Caller must pass the MSG_PEEK flag.

Writing to queues is done with sendmsg call and yet again --
the repair-queue option can be used to push data into the
receive queue.

When putting an skb into receive queue a zero tcp header is
appented to its head to address the tcp_hdr(skb)->syn and
the ->fin checks by the (after repair) tcp_recvmsg. These
flags flags are both set to zero and that's why.

The fin cannot be met in the queue while reading the source
socket, since the repair only works for closed/established
sockets and queueing fin packet always changes its state.

The syn in the queue denotes that the respective skb's seq
is "off-by-one" as compared to the actual payload lenght. Thus,
at the rcv queue refill we can just drop this flag and set the
skb's sequences to precice values.

When the repair mode is turned off, the write queue seqs are
updated so that the whole queue is considered to be 'already sent,
waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
protocol POV the send queue looks like it was sent, but the data
between the write_seq and snd_nxt is lost in the network.

This helps to avoid another sockoption for setting the snd_nxt
sequence. Leaving the whole queue in a 'not yet sent' state (as
it will be after sendmsg-s) will not allow to receive any acks
from the peer since the ack_seq will be after the snd_nxt. Thus
even the ack for the window probe will be dropped and the
connection will be 'locked' with the zero peer window.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Pavel Emelyanov ee9952831c tcp: Initial repair mode
This includes (according the the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ socoption

Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Pavel Emelyanov 370816aef0 tcp: Move code around
This is just the preparation patch, which makes the needed for
TCP repair code ready for use.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Pavel Emelyanov 4a17fd5229 sock: Introduce named constants for sk_reuse
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Eric W. Biederman a5347fe36b net: Delete all remaining instances of ctl_path
We don't use struct ctl_path anymore so delete the exported constants.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:22:30 -04:00
Eric W. Biederman ec8f23ce0f net: Convert all sysctl registrations to register_net_sysctl
This results in code with less boiler plate that is a bit easier
to read.

Additionally stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:22:30 -04:00
Eric W. Biederman f99e8f715a net: Convert nf_conntrack_proto to use register_net_sysctl
There isn't much advantage here except that strings paths are a bit
easier to read, and converting everything to them allows me to kill off
ctl_path.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:22:30 -04:00
Eric W. Biederman 8607ddb867 net ipv4: Convert devinet to use register_net_sysctl
Using an ascii path to register_net_sysctl as opposed to the slightly
awkward ctl_path allows for much simpler code.

We no longer need to malloc dev_name to keep it alive the length of our
sysctl register instead we can use a small temporary buffer on the
stack.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:22:30 -04:00
Eric W. Biederman 4e5ca78541 net ipv4: Remove the unneeded registration of an empty net/ipv4/neigh
sysctl no longer requires explicit creation of directories.  The neigh
directory is always populated with at least a default entry so this
won't cause any user visible changes.

Delete the ipv4_path and the ipv4_skeleton these are no longer needed.

Directly register the ipv4_route_table.

And since I am an idiot remove the header definitions that I should
have removed in the previous patch.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:21:18 -04:00
Eric W. Biederman 5dd3df105b net: Move all of the network sysctls without a namespace into init_net.
This makes it clearer which sysctls are relative to your current network
namespace.

This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.

This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:21:17 -04:00
Eric W. Biederman 4344475797 net: Kill register_sysctl_rotable
register_sysctl_rotable never caught on as an interesting way to
register sysctls.  My take on the situation is that what we want are
sysctls that we can only see in the initial network namespace.  What we
have implemented with register_sysctl_rotable are sysctls that we can
see in all of the network namespaces and can only change in the initial
network namespace.

That is a very silly way to go.  Just register the network sysctls
in the initial network namespace and we don't have any weird special
cases to deal with.

The sysctls affected are:
/proc/sys/net/ipv4/ipfrag_secret_interval
/proc/sys/net/ipv4/ipfrag_max_dist
/proc/sys/net/ipv6/ip6frag_secret_interval
/proc/sys/net/ipv6/mld_max_msf

I really don't expect anyone will miss them if they can't read them in a
child user namespace.

CC: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:21:17 -04:00
Eric Dumazet cbf8f7bb20 ipv4: dont drop packet in defrag but consume it
When defragmentation is finalized, we clone a packet and kfree_skb() it.

Call consume_skb() to not confuse dropwatch, since its not a drop.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-19 14:25:51 -04:00
Shan Wei 7426a5645f net: fix compile error of leaking kmemleak.h header
net/core/sysctl_net_core.c: In function ‘sysctl_core_init’:
net/core/sysctl_net_core.c:259: error: implicit declaration of function ‘kmemleak_not_leak’

with same error in net/ipv4/route.c

Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-19 00:11:39 -04:00
Eric Dumazet 22b4a4f22d tcp: fix retransmit of partially acked frames
Alexander Beregalov reported skb_over_panic errors and provided stack
trace.

I occurs commit a21d45726a (tcp: avoid order-1 allocations on wifi and
tx path) added a regression, when a retransmit is done after a partial
ACK.

tcp_retransmit_skb() tries to aggregate several frames if the first one
has enough available room to hold the following ones payload. This is
controlled by /proc/sys/net/ipv4/tcp_retrans_collapse tunable (default :
enabled)

Problem is we must make sure _pskb_trim_head() doesnt fool
skb_availroom() when pulling some bytes from skb (this pull is done when
receiver ACK part of the frame).

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Marc MERLIN <marc@merlins.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-18 16:52:45 -04:00
majianpeng 7f59388108 net/ipv4:Remove two memleak reports by kmemleak_not_leak.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-18 00:20:28 -04:00
Eric Dumazet 4d846f0239 tcp: fix tcp_grow_window() for large incoming frames
tcp_grow_window() has to grow rcv_ssthresh up to window_clamp, allowing
sender to increase its window.

tcp_grow_window() still assumes a tcp frame is under MSS, but its no
longer true with LRO/GRO.

This patch fixes one of the performance issue we noticed with GRO on.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-17 22:32:00 -04:00
Eric Dumazet 95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Daniel Baluta 5e73ea1a31 ipv4: fix checkpatch errors
Fix checkpatch errors of the following type:
	* ERROR: "foo * bar" should be "foo *bar"
	* ERROR: "(foo*)" should be "(foo *)"

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:37:19 -04:00
Vijay Subramanian a8cb05b238 tcp: Remove redundant code entering quickack mode
tcp_enter_quickack_mode() already calls tcp_incr_quickack() and sets
icsk->icsk_ack.ato  to TCP_ATO_MIN. This patch removes the duplication.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:29:02 -04:00
Alex Copot aacd9289af tcp: bind() use stronger condition for bind_conflict
We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.

We achieve this by adding a relaxation parameter to
inet_csk_bind_conflict. When 'relax' parameter is off
we return a conflict whenever the current searched
pair (addr, port) is not unique.

This tries to address the problems reported in patch:
	8d238b25b1
	Revert "tcp: bind() fix when many ports are bound"

Tests where ran for creating and binding(0) many sockets
on 100 IPs. The results are, on average:

	* 60000 sockets, 600 ports / IP:
		* 0.210 s, 620 (IP, port) duplicates without patch
		* 0.219 s, no duplicates with patch
	* 100000 sockets, 1000 ports / IP:
		* 0.371 s, 1720 duplicates without patch
		* 0.373 s, no duplicates with patch
	* 200000 sockets, 2000 ports / IP:
		* 0.766 s, 6900 duplicates without patch
		* 0.768 s, no duplicates with patch
	* 500000 sockets, 5000 ports / IP:
		* 2.227 s, 41500 duplicates without patch
		* 2.284 s, no duplicates with patch

Signed-off-by: Alex Copot <alex.mihai.c@gmail.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:28:55 -04:00
Eric Dumazet c72e118334 inet: makes syn_ack_timeout mandatory
There are two struct request_sock_ops providers, tcp and dccp.

inet_csk_reqsk_queue_prune() can avoid testing syn_ack_timeout being
NULL if we make it non NULL like syn_ack_timeout

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:24:26 -04:00
Eric Dumazet fd4f2cead6 tcp: RFC6298 supersedes RFC2988bis
Updates some comments to track RFC6298

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:24:26 -04:00
stephen hemminger 87b6d218f3 tunnel: implement 64 bits statistics
Convert the per-cpu statistics kept for GRE, IPIP, and SIT tunnels
to use 64 bit statistics.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 14:47:05 -04:00
Eric Dumazet 447167bf56 udp: intoduce udp_encap_needed static_key
Most machines dont use UDP encapsulation (L2TP)

Adds a static_key so that udp_queue_rcv_skb() doesnt have to perform a
test if L2TP never setup the encap_rcv on a socket.

Idea of this patch came after Simon Horman proposal to add a hook on TCP
as well.

If static_key is not yet enabled, the fast path does a single JMP .

When static_key is enabled, JMP destination is patched to reach the real
encap_type/encap_rcv logic, possibly adding cache misses.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: dev@openvswitch.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-13 13:39:37 -04:00
David S. Miller 011e3c6325 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-04-12 19:41:23 -04:00
Linus Torvalds 174808af90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix bluetooth userland regression reported by Keith Packard, from
    Gustavo Padovan.

 2) Revert ath9k PS idle change, from Sujith Manoharan.

 3) Correct default TCP memory limits (again), from Eric Dumazet.

 4) Fix tcp_rcv_rtt_update() accidental use of unscaled RTT, from Neal
    Cardwell.

 5) We made a facility for layers like wireless to say how much tailroom
    they need in the SKB for link layer stuff such as wireless
    encryption etc., but TCP works hard to fill every SKB out to the end
    defeating this specification.

    This leads to every TCP packet getting reallocated by the wireless
    code in order to have the right amount of tailroom available.

    Fix TCP to only fill SKBs out to the real amount of data area it
    asked for during the allocation, this way it won't eat into the
    slack added for the device's tailroom needs.

    Reported by Marc Merlin and fixed by Eric Dumazet.

 6) Leaks, endian bugs, and new device IDs in bluetooth from Santosh
    Nayak, João Paulo Rechi Vita, Cho, Yu-Chen, Andrei Emeltchenko,
    AceLan Kao, and Andrei Emeltchenko.

 7) OOPS on tty_close fix in bluetooth's hci_ldisc from Johan Hovold.

 8) netfilter erroneously scales TCP window twice, fix from Changli Gao.

 9) Memleak fix in wext-core from Julia Lawall.

10) Consistently handle invalid TCP packets in ipv4 vs.  ipv6 conntrack,
    from Jozsef Kadlecsik.

11) Validate IP header length properly in netfilter conntrack's
    ipv4_get_l4proto().

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
  NFC: Fix the LLCP Tx fragmentation loop
  rtlwifi: Add missing DMA buffer unmapping for PCI drivers
  rtlwifi: Preallocate USB read buffers and eliminate kalloc in read routine
  tcp: avoid order-1 allocations on wifi and tx path
  net: allow pskb_expand_head() to get maximum tailroom
  bridge: Do not send queries on multicast group leaves
  MAINTAINERS: Mark NATSEMI driver as orphan'd.
  tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample
  tcp: restore correct limit
  Revert "ath9k: fix going to full-sleep on PS idle"
  rt2x00: Fix rfkill_polling register function.
  bcma: fix build error on MIPS; implicit pcibios_enable_device
  netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net
  netfilter: nf_ct_ipv4: packets with wrong ihl are invalid
  netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently
  net/wireless/wext-core.c: add missing kfree
  rtlwifi: Fix oops on rate-control failure
  mac80211: Convert WARN_ON to WARN_ON_ONCE
  rtlwifi: rtl8192de: Fix firmware initialization
  nl80211: ensure interface is up in various APIs
  ...
2012-04-12 14:04:33 -07:00
Eric Dumazet a21d45726a tcp: avoid order-1 allocations on wifi and tx path
Marc Merlin reported many order-1 allocations failures in TX path on its
wireless setup, that dont make any sense with MTU=1500 network, and non
SG capable hardware.

After investigation, it turns out TCP uses sk_stream_alloc_skb() and
used as a convention skb_tailroom(skb) to know how many bytes of data
payload could be put in this skb (for non SG capable devices)

Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER +
sizeof(struct skb_shared_info) being above 2048)

Later, mac80211 layer need to add some bytes at the tail of skb
(IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is
available has to call pskb_expand_head() and request order-1
allocations.

This patch changes sk_stream_alloc_skb() so that only
sk->sk_prot->max_header bytes of headroom are reserved, and use a new
skb field, avail_size to hold the data payload limit.

This way, order-0 allocations done by TCP stack can leave more than 2 KB
of tailroom and no more allocation is performed in mac80211 layer (or
any layer needing some tailroom)

avail_size is unioned with mark/dropcount, since mark will be set later
in IP stack for output packets. Therefore, skb size is unchanged.

Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-11 10:11:12 -04:00
Linus Torvalds 94fb175c04 dmaengine-fixes for 3.4-rc3
1/ regression fix for Xen as it now trips over a broken assumption
    about the dma address size on 32-bit builds
 
 2/ new quirk for netdma to ignore dma channels that cannot meet
    netdma alignment requirements
 
 3/ fixes for two long standing issues in ioatdma (ring size overflow)
    and iop-adma (potential stack corruption)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJPhIfhAAoJEB7SkWpmfYgCguIQAL4qF+RC9/JggSHIjfOrYiPd
 yboV80GqqQHHBwy8hfZVUrIEPMebvD/xUIk6iUQNXR+6EA8Ln0jukvQMpWNnI+Cc
 TXgA5Ok70an4PD1MqnCsWyCJjsyPyhprbRHurxBcesf+y96POJxhING0rcKvft50
 mvYnbtrkYe9M9x3b8TBGc0JaTVeL29Ck3FtkTz4uUktbkhRNfCcfEd28NRQpf8MB
 vkjbjRGBQmGsnKxYCaEhlF1GPJyTlYjg4BBWtseJgb2R9s7tvJrkotFea/NmSfjq
 XCuVKjpiFp3YyJuxJERWdwqRWvyAZFfcYyZX440nG0b7GBgSn+T7A9XhUs8vMboi
 tLwoDfBbJDlKMaFpHex7Z6RtZZmVl3gWDNZTqpG44n4pabd4RPip04f0k7Wfs+cp
 tzU9hGAOvgsZ8w4/JgxH8YJOZbIGzbDGOA1IhWcbxIbmFTblMiFnV3TC7qfhoRbR
 8qtScIE7bUck2MYVlMMn9utd9tvKFa6HNgo41+f78/4+U7zQ/VrsbA/DWQct40R5
 5k+EEvyYFUzIXn79E0GVN5h4NHH5gfAs3MZ7jIgwgHedBp4Ki68XYKNu+pIV3YwG
 CFTPn1mVOXnCdt+fsjG5tL9Jecx1Mij6w3nWU93ZU6cHmC77YmU+DLxPIGuyR1a2
 EmpObwfq5peXzkgQpEsB
 =F3IR
 -----END PGP SIGNATURE-----

Merge tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine fixes from Dan Williams:

1/ regression fix for Xen as it now trips over a broken assumption
   about the dma address size on 32-bit builds

2/ new quirk for netdma to ignore dma channels that cannot meet
   netdma alignment requirements

3/ fixes for two long standing issues in ioatdma (ring size overflow)
   and iop-adma (potential stack corruption)

* tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  netdma: adding alignment check for NETDMA ops
  ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata
  ioat: ring size variables need to be 32bit to avoid overflow
  iop-adma: Corrected array overflow in RAID6 Xscale(R) test.
  ioat: fix size of 'completion' for Xen
2012-04-10 15:30:16 -07:00
Neal Cardwell 18a223e0b9 tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample
Fix a code path in tcp_rcv_rtt_update() that was comparing scaled and
unscaled RTT samples.

The intent in the code was to only use the 'm' measurement if it was a
new minimum.  However, since 'm' had not yet been shifted left 3 bits
but 'new_sample' had, this comparison would nearly always succeed,
leading us to erroneously set our receive-side RTT estimate to the 'm'
sample when that sample could be nearly 8x too high to use.

The overall effect is to often cause the receive-side RTT estimate to
be significantly too large (up to 40% too large for brief periods in
my tests).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-10 14:47:09 -04:00
Eric Dumazet 5fb84b1428 tcp: restore correct limit
Commit c43b874d5d (tcp: properly initialize tcp memory limits) tried
to fix a regression added in commits 4acb4190 & 3dc43e3,
but still get it wrong.

Result is machines with low amount of memory have too small tcp_rmem[2]
value and slow tcp receives : Per socket limit being 1/1024 of memory
instead of 1/128 in old kernels, so rcv window is capped to small
values.

Fix this to match comment and previous behavior.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-10 14:39:26 -04:00
David S. Miller 06eb4eafbd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-04-10 14:30:45 -04:00
Jozsef Kadlecsik 07153c6ec0 netfilter: nf_ct_ipv4: packets with wrong ihl are invalid
It was reported that the Linux kernel sometimes logs:

klogd: [2629147.402413] kernel BUG at net / netfilter /
nf_conntrack_proto_tcp.c: 447!
klogd: [1072212.887368] kernel BUG at net / netfilter /
nf_conntrack_proto_tcp.c: 392

ipv4_get_l4proto() in nf_conntrack_l3proto_ipv4.c and tcp_error() in
nf_conntrack_proto_tcp.c should catch malformed packets, so the errors
at the indicated lines - TCP options parsing - should not happen.
However, tcp_error() relies on the "dataoff" offset to the TCP header,
calculated by ipv4_get_l4proto().  But ipv4_get_l4proto() does not check
bogus ihl values in IPv4 packets, which then can slip through tcp_error()
and get caught at the TCP options parsing routines.

The patch fixes ipv4_get_l4proto() by invalidating packets with bogus
ihl value.

The patch closes netfilter bugzilla id 771.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-04-10 12:50:49 +02:00
Jozsef Kadlecsik 8430eac2f6 netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently
IPv6 conntrack marked invalid packets as INVALID and let the user
drop those by an explicit rule, while IPv4 conntrack dropped such
packets itself.

IPv4 conntrack is changed so that it marks INVALID packets and let
the user to drop them.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-04-10 00:38:34 +02:00
Eric Dumazet 35f9c09fe9 tcp: tcp_sendpages() should call tcp_push() once
commit 2f53384424 (tcp: allow splice() to build full TSO packets) added
a regression for splice() calls using SPLICE_F_MORE.

We need to call tcp_flush() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall.

Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
with different semantic.

For all sendpage() providers, its a transparent change. Only
sock_sendpage() and tcp_sendpages() can differentiate the two different
flags provided by pipe_to_sendpage()

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05 19:04:27 -04:00
Dave Jiang a2bd1140a2 netdma: adding alignment check for NETDMA ops
This is the fallout from adding memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use DMA engine that can handle byte align
ops.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2012-04-05 15:27:12 -07:00
RongQing.Li ce713ee5a1 net: replace continue with break to reduce unnecessary loop in xxx_xmarksources
The conditional which decides to skip inactive filters does not
change with the change of loop index, so it is unnecessary to
check them many times.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05 05:42:18 -04:00
Amir Vadai d4a968658c net/route: export symbol ip_tos2prio
Need to export this to enable drivers use rt_tos2priority()

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05 05:08:04 -04:00
Eric Dumazet 2f53384424 tcp: allow splice() to build full TSO packets
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03 17:35:43 -04:00
Linus Torvalds ed359a3b7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Provide device string properly for USB i2400m wimax devices, also
    don't OOPS when providing firmware string.  From Phil Sutter.

 2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.

 3) Add another device ID to USB zaurus driver, from Guan Xin.

 4) Loop index start in pool vector iterator is wrong causing MAC to not
    get configured in bnx2x driver, fix from Dmitry Kravkov.

 5) EQL driver assumes HZ=100, fix from Eric Dumazet.

 6) Now that skb_add_rx_frag() can specify the truesize increment
    separately, do so in f_phonet and cdc_phonet, also from Eric
    Dumazet.

 7) virtio_net accidently uses net_ratelimit() not only on the kernel
    warning but also the statistic bump, fix from Rick Jones.

 8) ip_route_input_mc() uses fixed init_net namespace, oops, use
    dev_net(dev) instead.  Fix from Benjamin LaHaise.

 9) dev_forward_skb() needs to clear the incoming interface index of the
    SKB so that it looks like a new incoming packet, also from Benjamin
    LaHaise.

10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
    5GHZ, fix from Stanislav Yakovlev.

11) Missing kmalloc() return value checks in orinoco, from Santosh
    Nayak.

12) ath9k doesn't check for HT capabilities in the right way, it is
    checking ht_supported instead of the ATH9K_HW_CAP_HT flag.  Fix from
    Sujith Manoharan.

13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
    instructions, from Feiran Zhuang.

14) Avoid infinite loop in GARP code when registering sysfs entries.
    From David Ward.

15) rose protocol uses memcpy instead of memcmp in a device address
    comparison, oops.  Fix from Daniel Borkmann.

16) Fix build of lpc_eth due to dev_hw_addr_rancom() interface being
    renamed to eth_hw_addr_random().  From Roland Stigge.

17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
    that ipv4 does.  Fix from Shmulik Ladkani.

18) via-rhine has an inverted bit test, causing suspend/resume
    regressions.  Fix from Andreas Mohr.

19) RIONET assumes 4K page size, fix from Akinobu Mita.

20) Initialization of imask register in sky2 is buggy, because bits are
    "or'd" into an uninitialized local variable.  Fix from Lino
    Sanfilippo.

21) Fix FCOE checksum offload handling, from Yi Zou.

22) Fix VLAN processing regression in e1000, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  sky2: dont overwrite settings for PHY Quick link
  tg3: Fix 5717 serdes powerdown problem
  net: usb: cdc_eem: fix mtu
  net: sh_eth: fix endian check for architecture independent
  usb/rtl8150 : Remove duplicated definitions
  rionet: fix page allocation order of rionet_active
  via-rhine: fix wait-bit inversion.
  ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
  net: lpc_eth: Fix rename of dev_hw_addr_random
  net/netfilter/nfnetlink_acct.c: use linux/atomic.h
  rose_dev: fix memcpy-bug in rose_set_mac_address
  Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
  net/garp: avoid infinite loop if attribute already exists
  x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
  bonding: emit event when bonding changes MAC
  mac80211: fix oper channel timestamp updation
  ath9k: Use HW HT capabilites properly
  MAINTAINERS: adding maintainer for ipw2x00
  net: orinoco: add error handling for failed kmalloc().
  net/wireless: ipw2x00: fix a typo in wiphy struct initilization
  ...
2012-04-02 17:53:39 -07:00