Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the protection of netns_frags.nqueues updates under the LRU_lock,
instead of the write lock. As they are located on the same cacheline,
and this is also needed when transitioning to use per hash bucket locking.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LRU list is protected by its own lock, since commit 3ef0eb0db4
(net: frag, move LRU list maintenance outside of rwlock), and
no-longer by a read_lock.
This makes it possible, to remove the inet_frag_queue, which is about
to be "evicted", from the LRU list head. This avoids the problem, of
several CPUs grabbing the same frag queue.
Note, cannot remove the inet_frag_lru_del() call in fq_unlink()
called by inet_frag_kill(), because inet_frag_kill() is also used in
other situations. Thus, we use list_del_init() to allow this
double list_del to work.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127
(IP_GRE: Fix IP-Identification).
Following patch fixes it so that identification is always
incremented.
Reported-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspection of upper layer protocol is considered harmful, especially
if it is about ARP or other stateful upper layer protocol; driver
cannot (and should not) have full state of them.
IPv4 over Firewire module used to inspect ARP (both in sending path
and in receiving path), and record peer's GUID, max packet size, max
speed and fifo address. This patch removes such inspection by extending
our "hardware address" definition to include other information as well:
max packet size, max speed and fifo. By doing this, The neighbour
module in networking subsystem can cache them.
Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
information will not be used in the driver when sending.
When a packet is being sent, the IP layer fills our pseudo header with
the extended "hardware address", including GUID and fifo. The driver
can look-up node-id (the real but rather volatile low-level address)
by GUID, and then the module can send the packet to the wire using
parameters provided in the extendedn hardware address.
This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
over IEEE1394 (RFC3146) share same "hardware address" format
in their address resolution protocols.
Here, extended "hardware address" is defined as follows:
union fwnet_hwaddr {
u8 u[16];
struct {
__be64 uniq_id; /* EUI-64 */
u8 max_rec; /* max packet size */
u8 sspd; /* max speed */
__be16 fifo_hi; /* hi 16bits of FIFO addr */
__be32 fifo_lo; /* lo 32bits of FIFO addr */
} __packed uc;
};
Note that Hardware address is declared as union, so that we can map full
IP address into this, when implementing MCAP (Multicast Cannel Allocation
Protocol) for IPv6, but IP and ARP subsystem do not need to know this
format in detail.
One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
is that 1394 ARP Request/Reply do not contain the target hardware address
field (aka ar$tha). This difference is handled in the ARP subsystem.
CC: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use common function get calculate rtnl_link_stats64 stats.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reuse common ip-tunneling code which is re-factored from GRE
module.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.
ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.
ipip.h header is renamed to ip_tunnel.h
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127
(IP_GRE: Fix IP-Identification).
Following patch fixes it so that identification is always
incremented.
Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
This reverts commit d6a8c36dd6.
Next commit makes this commit unnecessary.
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 10c0d7ed32.
Next commit makes this commit unnecessary.
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:
* Better performance in nfnetlink_queue by avoiding copy from the
packet to netlink message, from Eric Dumazet.
* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.
* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.
* A couple of sparse fixes for IPVS, from Julian Anastasov.
* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
myself.
* Allow to dump expectations per master conntrack via ctnetlink, from myself.
* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
Popescu.
* Remove nf_conntrack module a bit faster if netns are in use, from
Vladimir Davydov.
* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.
* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On SACK reneging the sender immediately retransmits and forces a
timeout but disables Eifel (undo). If the (buggy) receiver does not
drop any packet this can trigger a false slow-start retransmit storm
driven by the ACKs of the original packets. This can be detected with
undo and TCP timestamps.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch just moves some code arround to make the ip4_frag_ecn_table
and IPFRAG_ECN_* constants accessible from the other reassembly engines. I
also renamed ip4_frag_ecn_table to ip_frag_ecn_table.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch takes benefit of dev_addr_genid and dev_base_seq to check if a change
occurs during a netlink dump. If a change is detected, the flag NLM_F_DUMP_INTR
is set in the first message after the dump was interrupted.
Note that seq and prev_seq must be reset between each family in rtnl_dump_all()
because they are specific to each family.
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull to get the thermal netlink multicast group name fix, otherwise
the assertion added in net-next to netlink to detect that kind of bug
makes systems unbootable for some folks.
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem with TSO is the fact that tcp_tso_should_defer()
rearms the deferred timer, while it should not.
Current code leads to following bad bursty behavior :
20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119
20:11:24.484337 IP B > A: . ack 263721 win 1117
20:11:24.485086 IP B > A: . ack 265241 win 1117
20:11:24.485925 IP B > A: . ack 266761 win 1117
20:11:24.486759 IP B > A: . ack 268281 win 1117
20:11:24.487594 IP B > A: . ack 269801 win 1117
20:11:24.488430 IP B > A: . ack 271321 win 1117
20:11:24.489267 IP B > A: . ack 272841 win 1117
20:11:24.490104 IP B > A: . ack 274361 win 1117
20:11:24.490939 IP B > A: . ack 275881 win 1117
20:11:24.491775 IP B > A: . ack 277401 win 1117
20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119
20:11:24.492620 IP B > A: . ack 278921 win 1117
20:11:24.493448 IP B > A: . ack 280441 win 1117
20:11:24.494286 IP B > A: . ack 281961 win 1117
20:11:24.495122 IP B > A: . ack 283481 win 1117
20:11:24.495958 IP B > A: . ack 285001 win 1117
20:11:24.496791 IP B > A: . ack 286521 win 1117
20:11:24.497628 IP B > A: . ack 288041 win 1117
20:11:24.498459 IP B > A: . ack 289561 win 1117
20:11:24.499296 IP B > A: . ack 291081 win 1117
20:11:24.500133 IP B > A: . ack 292601 win 1117
20:11:24.500970 IP B > A: . ack 294121 win 1117
20:11:24.501388 IP B > A: . ack 295641 win 1117
20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119
While the expected behavior is more like :
20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119
20:19:49.260446 IP B > A: . ack 154281 win 1212
20:19:49.261282 IP B > A: . ack 155801 win 1212
20:19:49.262125 IP B > A: . ack 157321 win 1212
20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119
20:19:49.262958 IP B > A: . ack 158841 win 1212
20:19:49.263795 IP B > A: . ack 160361 win 1212
20:19:49.264628 IP B > A: . ack 161881 win 1212
20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119
20:19:49.265465 IP B > A: . ack 163401 win 1212
20:19:49.265886 IP B > A: . ack 164921 win 1212
20:19:49.266722 IP B > A: . ack 166441 win 1212
20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119
20:19:49.267559 IP B > A: . ack 167961 win 1212
20:19:49.268394 IP B > A: . ack 169481 win 1212
20:19:49.269232 IP B > A: . ack 171001 win 1212
20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to GRE tunnel, UDP tunnel should take care of IP header ID
too.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the previous discussion [1] on netdev list, DaveM insists
we should increase the IP header ID for each segmented packets.
This patch fixes it.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
1. http://marc.info/?t=136384172700001&r=1&w=2
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements F-RTO (foward RTO recovery):
When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted. Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.
The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss(). The basic version is not supported
because SACK enhanced version also works for non-SACK connections.
The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.
Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate all of TCP CA_Loss state processing in
tcp_fastretrans_alert() into a new function called tcp_process_loss().
This is to prepare the new F-RTO implementation in the next patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch series refactor the F-RTO feature (RFC4138/5682).
This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features. It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).
The new code implements newer F-RTO RFC5682 using CA_Loss processing
path. F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently. F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.
The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation. Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.
In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.
This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:
* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
complaining that this target does not work in the nat table, which is the
wrong table for it, from Florian Westphal.
* Fix possible use before initialization in the netns init path of several
conntrack protocol trackers (introduced recently while improving conntrack
netns support), from Gao Feng.
* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
by Eric Dumazet during the NFWS2013, patch from myself.
* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.
* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
not required anymore after change introduced in 3.7, again from Julian.
* Fix SYN looping in IPVS state sync if the backup is used a real server
in DR/TUN modes, this required a new /proc entry to disable the director
function when acting as backup, also from Julian.
* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
Paul Bolle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kconfig symbol IP_NF_QUEUE is unused since commit
d16cf20e2f ("netfilter: remove ip_queue
support"). Let's remove it too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is choosen somewhat
arbitrary and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.
If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.
I dropped the out of memory warning in the ipv4 fragment lookup path,
because we already get a warning by the slab allocator.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
LISTEN socket, and this socket is currently owned by the user, we
set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.
This is bad because if we clone the parent before it had a chance to
clear the flag, the child inherits the tsq_flags value, and next
tcp_release_cb() on the child will decrement sk_refcnt.
Result is that we might free a live TCP socket, as reported by
Dormando.
IPv4: Attempt to release TCP socket in state 1
Fix this issue by testing sk_state against TCP_LISTEN early, so that we
set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)
This bug was introduced in commit 563d34d057
(tcp: dont drop MTU reduction indications)
Reported-by: dormando <dormando@rydia.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.
As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:
Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.
before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 412ed94744.
The commit is wrong as tiph points to the outer IPv4 header which is
installed at ipgre_header() and not the inner one which is protocol dependant.
This commit broke succesfully opennhrp which use PF_PACKET socket with
ETH_P_NHRP protocol. Additionally ssl_addr is set to the link-layer
IPv4 address. This address is written by ipgre_header() to the skb
earlier, and this is the IPv4 header tiph should point to - regardless
of the inner protocol payload.
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
replace ip_fast_csum with csum_replace2 to save cpu cycles
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This uses PTR_RET instead of IS_ERR and PTR_ERR in order to increase
readability.
Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chrome OS team reported a crash on a Pixel ChromeBook in TCP stack :
https://code.google.com/p/chromium/issues/detail?id=182056
commit a21d45726a (tcp: avoid order-1 allocations on wifi and tx
path) did a poor choice adding an 'avail_size' field to skb, while
what we really needed was a 'reserved_tailroom' one.
It would have avoided commit 22b4a4f22d (tcp: fix retransmit of
partially acked frames) and this commit.
Crash occurs because skb_split() is not aware of the 'avail_size'
management (and should not be aware)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mukesh Agrawal <quiche@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.
This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
The basic idea is: Sender keeps track of TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.
TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.
PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.
TLP Algorithm
On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)
Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.
When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.
no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.
ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.
In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.
The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/netdev.c
Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed in order to detect if the timestamp option appears
more than once in a packet, to remove the option if the packet is
fragmented, etc. My previous change neglected to store the option
location when the router addresses were prespecified and Pointer >
Length. But now the option location is also stored when Flag is an
unrecognized value, to ensure these option handling behaviors are
still performed.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
With recent patches from Pravin, most tunnels can't use iptunnel_xmit()
any more, due to ip_select_ident() and skb->ip_summed. But we can just
move these operations out of iptunnel_xmit(), so that tunnels can
use it again.
This by the way fixes a bug in vxlan (missing nf_reset()) for net-next.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow IPIP to make use of tx-checksum offloading.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel_ip_select_ident() is more efficient when generating ip-header
id given inner packet is of ipv4 type.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.
This can be used by tunneling protocols like VXLAN.
CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier SG was unset if CSUM was not available for given device to
force skb copy to avoid sending inconsistent csum.
Commit c9af6db4c1 (net: Fix possible wrong checksum generation)
added explicit flag to force copy to fix this issue. Therefore
there is no need to link SG and CSUM, following patch kills this
link between there two features.
This patch is also required following patch in series.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In e337e24d66 (inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and
dccp_v4/6_request_recv_sock) I introduced the function
inet_csk_prepare_forced_close, which does a call to bh_unlock_sock().
This produces a sparse-warning.
This patch adds the missing __releases.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful to be able to get the initial state of all entries. The patch adds
the support for IPv4 and IPv6.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a router forwards a packet that contains the IPv4 timestamp option,
if there is no space left in the option for the router to add its own
timestamp, then the router increments the Overflow value in the option.
However, if the addresses of the routers are prespecified in the option,
then the overflow condition cannot happen: the option is structured so
that each prespecified router has a place to write its timestamp. Other
routers do not add a timestamp, so there will never be a lack of space.
This fix ensures that the Overflow value in the IPv4 timestamp option is
not incremented when the addresses of the routers are prespecified, even
if the Pointer value is greater than the Length value.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, DSCP is copying during encapsulation.
Copying the DSCP in IPsec tunneling may be a bit dangerous because packets with
different DSCP may get reordered relative to each other in the network and then
dropped by the remote IPsec GW if the reordering becomes too big compared to the
replay window.
It is possible to avoid this copy with netfilter rules, but it's very convenient
to be able to configure it for each SA directly.
This patch adds a toogle for this purpose. By default, it's not set to maintain
backward compatibility.
Field flags in struct xfrm_usersa_info is full, hence I add a new attribute.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull networking fixes from David Miller:
"A moderately sized pile of fixes, some specifically for merge window
introduced regressions although others are for longer standing items
and have been queued up for -stable.
I'm kind of tired of all the RDS protocol bugs over the years, to be
honest, it's way out of proportion to the number of people who
actually use it.
1) Fix missing range initialization in netfilter IPSET, from Jozsef
Kadlecsik.
2) ieee80211_local->tim_lock needs to use BH disabling, from Johannes
Berg.
3) Fix DMA syncing in SFC driver, from Ben Hutchings.
4) Fix regression in BOND device MAC address setting, from Jiri
Pirko.
5) Missing usb_free_urb in ISDN Hisax driver, from Marina Makienko.
6) Fix UDP checksumming in bnx2x driver for 57710 and 57711 chips,
fix from Dmitry Kravkov.
7) Missing cfgspace_lock initialization in BCMA driver.
8) Validate parameter size for SCTP assoc stats getsockopt(), from
Guenter Roeck.
9) Fix SCTP association hangs, from Lee A Roberts.
10) Fix jumbo frame handling in r8169, from Francois Romieu.
11) Fix phy_device memory leak, from Petr Malat.
12) Omit trailing FCS from frames received in BGMAC driver, from Hauke
Mehrtens.
13) Missing socket refcount release in L2TP, from Guillaume Nault.
14) sctp_endpoint_init should respect passed in gfp_t, rather than use
GFP_KERNEL unconditionally. From Dan Carpenter.
15) Add AISX AX88179 USB driver, from Freddy Xin.
16) Remove MAINTAINERS entries for drivers deleted during the merge
window, from Cesar Eduardo Barros.
17) RDS protocol can try to allocate huge amounts of memory, check
that the user's request length makes sense, from Cong Wang.
18) SCTP should use the provided KMALLOC_MAX_SIZE instead of it's own,
bogus, definition. From Cong Wang.
19) Fix deadlocks in FEC driver by moving TX reclaim into NAPI poll,
from Frank Li. Also, fix a build error introduced in the merge
window.
20) Fix bogus purging of default routes in ipv6, from Lorenzo Colitti.
21) Don't double count RTT measurements when we leave the TCP receive
fast path, from Neal Cardwell."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tcp: fix double-counted receiver RTT when leaving receiver fast path
CAIF: fix sparse warning for caif_usb
rds: simplify a warning message
net: fec: fix build error in no MXC platform
net: ipv6: Don't purge default router if accept_ra=2
net: fec: put tx to napi poll function to fix dead lock
sctp: use KMALLOC_MAX_SIZE instead of its own MAX_KMALLOC_SIZE
rds: limit the size allocated by rds_message_alloc()
MAINTAINERS: remove eexpress
MAINTAINERS: remove drivers/net/wan/cycx*
MAINTAINERS: remove 3c505
caif_dev: fix sparse warnings for caif_flow_cb
ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver
sctp: use the passed in gfp flags instead GFP_KERNEL
ipv[4|6]: correct dropwatch false positive in local_deliver_finish
l2tp: Restore socket refcount when sendmsg succeeds
net/phy: micrel: Disable asymmetric pause for KSZ9021
bgmac: omit the fcs
phy: Fix phy_device_free memory leak
bnx2x: Fix KR2 work-around condition
...
We should not update ts_recent and call tcp_rcv_rtt_measure_ts() both
before and after going to step5. That wastes CPU and double-counts the
receiver-side RTT sample.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I had a report recently of a user trying to use dropwatch to localise some frame
loss, and they were getting false positives. Turned out they were using a user
space SCTP stack that used raw sockets to grab frames. When we don't have a
registered protocol for a given packet, we record it as a drop, even if a raw
socket receieves the frame. We should only record the drop in the event a raw
socket doesnt exist to receive the frames
Tested by the reported successfully
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: William Reich <reich@ulticom.com>
Tested-by: William Reich <reich@ulticom.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: William Reich <reich@ulticom.com>
CC: eric.dumazet@gmail.com
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Pull networking fixes from David Miller:
1) ping_err() ICMP error handler looks at wrong ICMP header, from Li
Wei.
2) TCP socket hash function on ipv6 is too weak, from Eric Dumazet.
3) netif_set_xps_queue() forgets to drop mutex on errors, fix from
Alexander Duyck.
4) sum_frag_mem_limit() can deadlock due to lack of BH disabling, fix
from Eric Dumazet.
5) TCP SYN data is miscalculated in tcp_send_syn_data(), because the
amount of TCP option space was not taken into account properly in
this code path. Fix from yuchung Cheng.
6) MLX4 driver allocates device queues with the wrong size, from Kleber
Sacilotto.
7) sock_diag can access past the end of the sock_diag_handlers[] array,
from Mathias Krause.
8) vlan_set_encap_proto() makes incorrect assumptions about where
skb->data points, rework the logic so that it works regardless of
where skb->data happens to be. From Jesse Gross.
9) Fix gianfar build failure with NET_POLL enabled, from Paul
Gortmaker.
10) Fix Ipv4 ID setting and checksum calculations in GRE driver, from
Pravin B Shelar.
11) bgmac driver does:
int i;
for (i = 0; ...; ...) {
...
for (i = 0; ...; ...) {
effectively corrupting the outer loop index, use a seperate
variable for the inner loops. From Rafał Miłecki.
12) Fix suspend bugs in smsc95xx driver, from Ming Lei.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
usbnet: smsc95xx: rename FEATURE_AUTOSUSPEND
usbnet: smsc95xx: fix broken runtime suspend
usbnet: smsc95xx: fix suspend failure
bgmac: fix indexing of 2nd level loops
b43: Fix lockdep splat on module unload
Revert "ip_gre: propogate target device GSO capability to the tunnel device"
IP_GRE: Fix GRE_CSUM case.
VXLAN: Use tunnel_ip_select_ident() for tunnel IP-Identification.
IP_GRE: Fix IP-Identification.
net/pasemi: Fix missing coding style
vmxnet3: fix ethtool ring buffer size setting
vmxnet3: make local function static
bnx2x: remove dead code and make local funcs static
gianfar: fix compile fail for NET_POLL=y due to struct packing
vlan: adjust vlan_set_encap_proto() for its callers
sock_diag: Simplify sock_diag_handlers[] handling in __sock_diag_rcv_msg
sock_diag: Fix out-of-bounds access to sock_diag_handlers[]
vxlan: remove depends on CONFIG_EXPERIMENTAL
mlx4_en: fix allocation of CPU affinity reverse-map
mlx4_en: fix allocation of device tx_cq
...
Pull slave-dmaengine updates from Vinod Koul:
"This is fairly big pull by my standards as I had missed last merge
window. So we have the support for device tree for slave-dmaengine,
large updates to dw_dmac driver from Andy for reusing on different
architectures. Along with this we have fixes on bunch of the drivers"
Fix up trivial conflicts, usually due to #include line movement next to
each other.
* 'next' of git://git.infradead.org/users/vkoul/slave-dma: (111 commits)
Revert "ARM: SPEAr13xx: Pass DW DMAC platform data from DT"
ARM: dts: pl330: Add #dma-cells for generic dma binding support
DMA: PL330: Register the DMA controller with the generic DMA helpers
DMA: PL330: Add xlate function
DMA: PL330: Add new pl330 filter for DT case.
dma: tegra20-apb-dma: remove unnecessary assignment
edma: do not waste memory for dma_mask
dma: coh901318: set residue only if dma is in progress
dma: coh901318: avoid unbalanced locking
dmaengine.h: remove redundant else keyword
dma: of-dma: protect list write operation by spin_lock
dmaengine: ste_dma40: do not remove descriptors for cyclic transfers
dma: of-dma.c: fix memory leakage
dw_dmac: apply default dma_mask if needed
dmaengine: ioat - fix spare sparse complain
dmaengine: move drivers/of/dma.c -> drivers/dma/of-dma.c
ioatdma: fix race between updating ioat->head and IOAT_COMPLETION_PENDING
dw_dmac: add support for Lynxpoint DMA controllers
dw_dmac: return proper residue value
dw_dmac: fill individual length of descriptor
...
commit "ip_gre: allow CSUM capable devices to handle packets"
aa0e51cdda, broke GRE_CSUM case.
GRE_CSUM needs checksum computed for inner packet. Therefore
csum-calculation can not be offloaded if tunnel device requires
GRE_CSUM. Following patch fixes it by computing inner packet checksum
for GRE_CSUM type, for all other type of GRE devices csum is offloaded.
CC: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE-GSO generates ip fragments with id 0,2,3,4... for every
GSO packet, which is not correct. Following patch fixes it
by setting ip-header id unique id of fragments are allowed.
As Eric Dumazet suggested it is optimized by using inner ip-header
whenever inner packet is ipv4.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In fast open the sender unncessarily reduces the space available
for data in SYN by 12 bytes. This is because in the sender
incorrectly reserves space for TS option twice in tcp_send_syn_data():
tcp_mtu_to_mss() already accounts for TS option space. But it further
reserves MAX_TCP_OPTION_SPACE when computing the payload space.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we handle icmp errors in each transport protocol's err_handler,
for icmp protocols, that is ping_err. Since this handler only care
of those icmp errors triggered by echo request, errors triggered
by echo reply(which sent by kernel) are sliently ignored.
So wrap ping_err() with icmp_err() to deal with those icmp errors.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like its possible to open thousands of TCP IPv6
sessions on a server, all landing in a single slot of TCP hash
table. Incoming packets have to lookup sockets in a very
long list.
We should hash all bits from foreign IPv6 addresses, using
a salt and hash mix, not a simple XOR.
inet6_ehashfn() can also separately use the ports, instead
of xoring them.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should get 'type' and 'code' from the outer ICMP header.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers all
over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
If you need me to provide a merged tree to handle these resolutions,
please let me know.
Other than those patches, there's not much here, some minor fixes and
updates.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
=yWAQ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
"Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers
all over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
Other than those patches, there's not much here, some minor fixes and
updates"
Fix up trivial conflicts
* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
base: memory: fix soft/hard_offline_page permissions
drivercore: Fix ordering between deferred_probe and exiting initcalls
backlight: fix class_find_device() arguments
TTY: mark tty_get_device call with the proper const values
driver-core: constify data for class_find_device()
firmware: Ignore abort check when no user-helper is used
firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
firmware: Make user-mode helper optional
firmware: Refactoring for splitting user-mode helper code
Driver core: treat unregistered bus_types as having no devices
watchdog: Convert to devm_ioremap_resource()
thermal: Convert to devm_ioremap_resource()
spi: Convert to devm_ioremap_resource()
power: Convert to devm_ioremap_resource()
mtd: Convert to devm_ioremap_resource()
mmc: Convert to devm_ioremap_resource()
mfd: Convert to devm_ioremap_resource()
media: Convert to devm_ioremap_resource()
iommu: Convert to devm_ioremap_resource()
drm: Convert to devm_ioremap_resource()
...
commit 68c3316311 (v4 GRE: Add TCP segmentation offload for GRE)
introduced a bug in error path.
dst is attached to skb, so will be released when skb is freed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the vars ip_rt_gc_timeout is used only when
CONFIG_SYSCTL is selected.
move these vars into CONFIG_SYSCTL.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If device is not able to handle checksumming it will
be handled in dev_xmit
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contain updates for your net-next tree, they are:
* Fix (for just added) connlabel dependencies, from Florian Westphal.
* Add aliasing support for conntrack, thus users can either use -m state
or -m conntrack from iptables while using the same kernel module, from
Jozsef Kadlecsik.
* Some code refactoring for the CT target to merge common code in
revision 0 and 1, from myself.
* Add aliasing support for CT, based on patch from Jozsef Kadlecsik.
* Add one mutex per nfnetlink subsystem, from myself.
* Improved logging for packets that are dropped by helpers, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Connection tracking helpers have to drop packets under exceptional
situations. Currently, the user gets the following logging message
in case that happens:
nf_ct_%s: dropping packet ...
However, depending on the helper, there are different reasons why a
packet can be dropped.
This patch modifies the existing code to provide more specific
error message in the scope of each helper to help users to debug
the reason why the packet has been dropped, ie:
nf_ct_%s: dropping packet: reason ...
Thanks to Joe Perches for many formatting suggestions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.
this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.
It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.
[ Compute pkt_len before ip_local_out() invocation. -DaveM ]
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will be used in next GRE_GSO patch. This patch does
not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Steffen Klassert says:
====================
1) Remove a duplicated call to skb_orphan() in pf_key, from Cong Wang.
2) Prepare xfrm and pf_key for algorithms without pf_key support,
from Jussi Kivilinna.
3) Fix an unbalanced lock in xfrm_output_one(), from Li RongQing.
4) Add an IPsec state resolution packet queue to handle
packets that are send before the states are resolved.
5) xfrm4_policy_fini() is unused since 2.6.11, time to remove it.
From Michal Kubecek.
6) The xfrm gc threshold was configurable just in the initial
namespace, make it configurable in all namespaces. From
Michal Kubecek.
7) We currently can not insert policies with mark and mask
such that some flows would be matched from both policies.
Allow this if the priorities of these policies are different,
the one with the higher priority is used in this case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch cef401de7b (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.
Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.
tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket timestamp is a sum of the global tcp_time_stamp and
a per-socket offset.
A socket offset is added in places where externally visible
tcp timestamp option is parsed/initialized.
Connections in the SYN_RECV state are not supported, global
tcp_time_stamp is used for them, because repair mode doesn't support
this state. In a future it can be implemented by the similar way
as for TIME_WAIT sockets.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A timestamp can be set, only if a socket is in the repair mode.
This patch adds a new socket option TCP_TIMESTAMP, which allows to
get and set current tcp times stamp.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This functionality is used for restoring tcp sockets. A tcp timestamp
depends on how long a system has been running, so it's differ for each
host. The solution is to set a per-socket offset.
A per-socket offset for a TIME_WAIT socket is inherited from a proper
tcp socket.
tcp_request_sock doesn't have a timestamp offset, because the repair
mode for them are not implemented.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question. Thus, use the 'net-next' version.
Signed-off-by: David S. Miller <davem@davemloft.net>
We should call skb_share_check() before pskb_may_pull(), or we
can crash in pskb_expand_head()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Synchronize with 'net' in order to sort out some l2tp, wireless, and
ipv6 GRE fixes that will be built on top of in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are transients during normal FRTO procedure during which
the packets_in_flight can go to zero between write_queue state
updates and firing the resulting segments out. As FRTO processing
occurs during that window the check must be more precise to
not match "spuriously" :-). More specificly, e.g., when
packets_in_flight is zero but FLAG_DATA_ACKED is true the problematic
branch that set cwnd into zero would not be taken and new segments
might be sent out later.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xfrm gc threshold can be configured via xfrm{4,6}_gc_thresh
sysctl but currently only in init_net, other namespaces always
use the default value. This can substantially limit the number
of IPsec tunnels that can be effectively used.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Function xfrm4_policy_fini() is unused since xfrm4_fini() was
removed in 2.6.11.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
TCP Appropriate Byte Count was added by me, but later disabled.
There is no point in maintaining it since it is a potential source
of bugs and Linux already implements other better window protection
heuristics.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
All in-tree ipv4 protocol implementations are now namespace
aware. Therefore all the run-time checks are superfluous.
Reject registry of any non-namespace aware ipv4 protocol.
Eventually we'll remove prot->netns_ok and this registry
time check as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c
The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.
The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.
The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.
The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
As in del_timer() there has already placed a timer_pending() function
to check whether the timer to be deleted is pending or not, it's
unnecessary to check timer pending state again before del_timer() is
called.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and
tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various
reasons which are not currently tracked.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9dc274151a (tcp: fix ABC in tcp_slow_start())
uncovered a bug in FRTO code :
tcp_process_frto() is setting snd_cwnd to 0 if the number
of in flight packets is 0.
As Neal pointed out, if no packet is in flight we lost our
chance to disambiguate whether a loss timeout was spurious.
We should assume it was a proper loss.
Reported-by: Pasi Kärkkäinen <pasik@iki.fi>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 9dc274151a (tcp: fix ABC in tcp_slow_start()),
a nul snd_cwnd triggers an infinite loop in tcp_slow_start()
Avoid this infinite loop and log a one time error for further
analysis. FRTO code is suspected to cause this bug.
Reported-by: Pasi Kärkkäinen <pasik@iki.fi>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On receiving the SYN-ACK, Fast Open checks icsk_retransmit for SYN
retransmission to detect SYN/data drops. But if F-RTO is disabled,
icsk_retransmit is reset at step D of tcp_fastretrans_alert() (
under tcp_ack()) before tcp_rcv_fastopen_synack(). The fix is to use
total_retrans instead which accounts for SYN retransmission regardless
the use of F-RTO.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We drop a connection request if the accept backlog is full and there are
sufficient packets in the syn queue to warrant starting drops. Increment the
appropriate counters so this isn't silent, for accurate stats and help in
debugging.
This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the
counter LINUX_MIB_LISTENOVERFLOWS.
Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
A GRE tunnel can be configured so that outgoing tunnel packets inherit
the value of the TOS field from the inner IP header. In doing so, when
a non-IP packet is transmitted through the tunnel, the TOS field will
always be set to 0.
Instead, the user should be able to configure a different TOS value as
the fallback to use for non-IP packets. This is helpful when the non-IP
packets are all control packets and should be handled by routers outside
the tunnel as having Internet Control precedence. One example of this is
the NHRP packets that control a DMVPN-compatible mGRE tunnel; they are
encapsulated directly by GRE and do not contain an inner IP header.
Under the existing behavior, the IFLA_GRE_TOS parameter must be set to
'1' for the TOS value to be inherited. Now, only the least significant
bit of this parameter must be set to '1', and when a non-IP packet is
sent through the tunnel, the upper 6 bits of this same parameter will be
copied into the TOS field. (The ECN bits get masked off as before.)
This behavior is backwards-compatible with existing configurations and
iproute2 versions.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some usecase when lifetime of ipv4 addresses might be helpful.
For example:
1) initramfs networkmanager uses a DHCP daemon to learn network
configuration parameters
2) initramfs networkmanager addresses, routes and DNS configuration
3) initramfs networkmanager is requested to stop
4) initramfs networkmanager stops all daemons including dhclient
5) there are addresses and routes configured but no daemon running. If
the system doesn't start networkmanager for some reason, addresses and
routes will be used forever, which violates RFC 2131.
This patch is essentially a backport of ivp6 address lifetime mechanism
for ipv4 addresses.
Current "ip" tool supports this without any patch (since it does not
distinguish between ipv4 and ipv6 addresses in this perspective.
Also, this should be back-compatible with all current netlink users.
Reported-by: Pavel Šimerda <psimerda@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>