TCP sessions over IPv4 can get stuck if routers between endpoints
do not fragment packets but implement PMTU instead, and we are using
those routers because of an ICMP redirect.
Setup is as follows
MTU1 MTU2 MTU1
A--------B------C------D
with MTU1 > MTU2. A and D are endpoints, B and C are routers. B and C
implement PMTU and drop packets larger than MTU2 (for example because
DF is set on all packets). TCP sessions are initiated between A and D.
There is packet loss between A and D, causing frequent TCP
retransmits.
After the number of retransmits on a TCP session reaches tcp_retries1,
tcp calls dst_negative_advice() prior to each retransmit. This results
in route cache entries for the peer to be deleted in
ipv4_negative_advice() if the Path MTU is set.
If the outstanding data on an affected TCP session is larger than
MTU2, packets sent from the endpoints will be dropped by B or C, and
ICMP NEEDFRAG will be returned. A and D receive NEEDFRAG messages and
update PMTU.
Before the next retransmit, tcp will again call dst_negative_advice(),
causing the route cache entry (with correct PMTU) to be deleted. The
retransmitted packet will be larger than MTU2, causing it to be
dropped again.
This sequence repeats until the TCP session aborts or is terminated.
Problem is fixed by removing redirected route cache entries in
ipv4_negative_advice() only if the PMTU is expired.
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow fib_find_node() to be called either under rcu_read_lock()
protection or with RTNL held.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the debug files ended up wrongly in sysfs, because at that point
of time, debugfs didn't exist. Convert these files to use debugfs and
also seq_file. This patch converts all of these files at once and then
removes the exported symbol for the Bluetooth sysfs class.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When creating a high number of Bluetooth sockets (L2CAP, SCO
and RFCOMM) it is possible to scribble repeatedly on arbitrary
pages of memory. Ensure that the content of these sysfs files is
always less than one page. Even if this means truncating. The
files in question are scheduled to be moved over to debugfs in
the future anyway.
Based on initial patches from Neil Brown and Linus Torvalds
Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes a bug that allows to lose events when reliable
event delivery mode is used, ie. if NETLINK_BROADCAST_SEND_ERROR
and NETLINK_RECV_NO_ENOBUFS socket options are set.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ENOBUFS errors are reported to the socket via
netlink_set_err() even if NETLINK_RECV_NO_ENOBUFS is set. However,
that should not happen. This fixes this problem and it changes the
prototype of netlink_set_err() to return the number of sockets that
have set the NETLINK_RECV_NO_ENOBUFS socket option. This return
value is used in the next patch in these bugfix series.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under NET_DMA, data transfer can grind to a halt when userland issues a
large read on a socket with a high RCVLOWAT (i.e., 512 KB for both).
This appears to be because the NET_DMA design queues up lots of memcpy
operations, but doesn't issue or wait for them (and thus free the
associated skbs) until it is time for tcp_recvmesg() to return.
The socket hangs when its TCP window goes to zero before enough data is
available to satisfy the read.
Periodically issue asynchronous memcpy operations, and free skbs for ones
that have completed, to prevent sockets from going into zero-window mode.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a unaligned access in nla_get_be64() that was
introduced by myself in a17c859849.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A packet is marked as lost in case packets == 0, although nothing should be done.
This results in a too early retransmitted packet during recovery in some cases.
This small patch fixes this issue by returning immediately.
Signed-off-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de>
Signed-off-by: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
mfc_parent of cache entries is used to index into the vif_table and is
initialised from mfcctl->mfcc_parent. This can take values of to 2^16-1,
while the vif_table has only MAXVIFS (32) entries. The same problem
affects ip6mr.
Refuse invalid values to fix a potential out-of-bounds access. Unlike
the other validity checks, this is checked in ipmr_mfc_add() instead of
the setsockopt handler since its unused in the delete path and might be
uninitialized.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to adjust the next rx descriptor after each packet,
so do it only once at the end of the routine.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up some text output formatting.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recovery from PF reset works better when you shorten up the delay
until the watchdog task executes.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The counters in the 82599 Virtual Function are not clear on read. They
accumulate to the maximum value and then roll over. They are also not
cleared when the VF executes a soft reset, so it is possible they are
non-zero when the driver loads and starts. This has all been accounted
for in the code that keeps the stats up to date but there is one case
that is not. When the PF driver is reset the counters in the VF are
all reset to zero. This adds an additional accounting overhead into
the VF driver when the PF is reset under its feet. This patch adds
additional counters that are used by the VF driver to accumulate and
save stats after a PF reset has been detected. Prior to this patch
displaying the stats in the VF after the PF has reset would show
bogus data.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per Simon Horman's feedback set IXGBE_RSC_CB(skb)->dma to zero
after unmapping HWRSC DMA address to avoid double freeing.
Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently netdev_features_change is called before fcoe tx queues
setup is done, so this patch moves calling of netdev_features_change
after tx queues setup is done in ixgbe_init_interrupt_scheme, so
that real_num_tx_queues is updated correctly on each fcoe enable
or disable.
This allows additional fcoe queues updated correctly in vlan driver
for their correct queue selection.
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds RFC5082 checks for TTL on received ICMP packets.
It adds some security against spoofed ICMP packets
disrupting GTSM protected sessions.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the only path leading to ip6_dst_check makes an indirect call
through dst->ops, dst cannot be NULL in ip6_dst_check.
This patch removes this check in case it misleads people who
come across this code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xfrm_dst keeps a reference to ipv4 rtable entries on each
cached bundle. The only way to renew xfrm_dst when the underlying
route has changed, is to implement dst_check for this. This is
what ipv6 side does too.
The problems started after 87c1e12b5e
("ipsec: Fix bogus bundle flowi") which fixed a bug causing xfrm_dst
to not get reused, until that all lookups always generated new
xfrm_dst with new route reference and path mtu worked. But after the
fix, the old routes started to get reused even after they were expired
causing pmtu to break (well it would occationally work if the rtable
gc had run recently and marked the route obsolete causing dst_check to
get called).
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
TX checksum offload does not work properly when transmitting
UDP packets with 0, 1 or 2 bytes of data. This patch works
around the problem by calculating checksums for these packets
in the driver.
Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advanced Power Management is disabled for 82599 KX4 connections by
clearing GRC.APME bit, causing it to not wake the system from an
improper system shutdown. By default GRC.APME is enabled and software
is not supposed to clear these settings during adapter probe.
Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix 82599 link issues during driver load and unload test using multi-speed
10G & 1G fiber modules. When connected back to back sometime 82599 multispeed
fiber modules would link at 1G speed instead of 10G highest speed, due to a
race condition in autotry process involving Tx laser flapping. Move autotry
autoneg-37 tx laser flapping process from multispeed module init setup
to driver unload. This will alert the link partner to restart its
autotry process when it tries to establish the link with the link partner
Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing "ifenslave -d bond0 eth0", there is chance to get NULL
dereference in netif_receive_skb(), because dev->master suddenly becomes
NULL after we tested it.
We should use ACCESS_ONCE() to avoid this (or rcu_dereference())
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advance driver version number after some bug fix.
Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Temporary stop the RX IRQ, and disable (sync) tasklet or napi.
And restore it after finished the vlgrp pointer assignment.
Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix memory leak while receiving 8021q tagged packet which is not
registered by user.
Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the dummy LL interface to the LL interface change
introduced by commit daab433c03c15fd642c71c94eb51bdd3f32602c8.
This fixes the build failure occurring after that commit when
enabling ISDN_DRV_GIGASET but neither ISDN_I4L nor ISDN_CAPI.
Impact: bugfix
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found a locking problem in vhost_set_vring:
several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL,
VHOST_SET_VRING_ERR with the vq->mutex held.
Fix these up.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Laurent Chavey <chavey@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A thinko in code means we never trigger interrupt
mitigation. Fix this.
Reported-by: Juan Quintela <quintela@redhat.com>
Reported-by: Unai Uribarri <unai.uribarri@optenet.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Without CONFIG_BRIDGE_IGMP_SNOOPING,
BR_INPUT_SKB_CB(skb)->mrouters_only is not appropriately
initialized, so we can see garbage.
A clear option to fix this is to set it even without that
config, but we cannot optimize out the branch.
Let's introduce a macro that returns value of mrouters_only
and let it return 0 without CONFIG_BRIDGE_IGMP_SNOOPING.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
route: Fix caught BUG_ON during rt_secret_rebuild_oneshot()
Call rt_secret_rebuild can cause BUG_ON(timer_pending(&net->ipv4.rt_secret_timer)) in
add_timer as there is not any synchronization for call rt_secret_rebuild_oneshot()
for the same net namespace.
Also this issue affects to rt_secret_reschedule().
Thus use mod_timer enstead.
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found that one error path in netpoll_setup dereferences npinfo
even though it is NULL. Avoid that by adding new label and go to that
instead.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Borkmann <danborkmann@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: chavey@google.com
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So in the forward porting of various tipc packages, I was constantly
getting this lockdep warning everytime I used tipc-config to set a network
address for the protocol:
[ INFO: possible circular locking dependency detected ]
2.6.33 #1
tipc-config/1326 is trying to acquire lock:
(ref_table_lock){+.-...}, at: [<ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
but task is already holding lock:
(&(&entry->lock)->rlock#2){+.-...}, at: [<ffffffffa03150d5>] tipc_ref_lock+0x43/0x63 [tipc]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&(&entry->lock)->rlock#2){+.-...}:
[<ffffffff8107b508>] __lock_acquire+0xb67/0xd0f
[<ffffffff8107b78c>] lock_acquire+0xdc/0x102
[<ffffffff8145471e>] _raw_spin_lock_bh+0x3b/0x6e
[<ffffffffa03152b1>] tipc_ref_acquire+0xe8/0x11b [tipc]
[<ffffffffa031433f>] tipc_createport_raw+0x78/0x1b9 [tipc]
[<ffffffffa031450b>] tipc_createport+0x8b/0x125 [tipc]
[<ffffffffa030f221>] tipc_subscr_start+0xce/0x126 [tipc]
[<ffffffffa0308fb2>] process_signal_queue+0x47/0x7d [tipc]
[<ffffffff81053e0c>] tasklet_action+0x8c/0xf4
[<ffffffff81054bd8>] __do_softirq+0xf8/0x1cd
[<ffffffff8100aadc>] call_softirq+0x1c/0x30
[<ffffffff810549f4>] _local_bh_enable_ip+0xb8/0xd7
[<ffffffff81054a21>] local_bh_enable_ip+0xe/0x10
[<ffffffff81454d31>] _raw_spin_unlock_bh+0x34/0x39
[<ffffffffa0308eb8>] spin_unlock_bh.clone.0+0x15/0x17 [tipc]
[<ffffffffa0308f47>] tipc_k_signal+0x8d/0xb1 [tipc]
[<ffffffffa0308dd9>] tipc_core_start+0x8a/0xad [tipc]
[<ffffffffa01b1087>] 0xffffffffa01b1087
[<ffffffff8100207d>] do_one_initcall+0x72/0x18a
[<ffffffff810872fb>] sys_init_module+0xd8/0x23a
[<ffffffff81009b42>] system_call_fastpath+0x16/0x1b
-> #0 (ref_table_lock){+.-...}:
[<ffffffff8107b3b2>] __lock_acquire+0xa11/0xd0f
[<ffffffff8107b78c>] lock_acquire+0xdc/0x102
[<ffffffff81454836>] _raw_write_lock_bh+0x3b/0x6e
[<ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
[<ffffffffa03141ee>] tipc_deleteport+0x40/0x119 [tipc]
[<ffffffffa0316e35>] release+0xeb/0x137 [tipc]
[<ffffffff8139dbf4>] sock_release+0x1f/0x6f
[<ffffffff8139dc6b>] sock_close+0x27/0x2b
[<ffffffff811116f6>] __fput+0x12a/0x1df
[<ffffffff811117c5>] fput+0x1a/0x1c
[<ffffffff8110e49b>] filp_close+0x68/0x72
[<ffffffff8110e552>] sys_close+0xad/0xe7
[<ffffffff81009b42>] system_call_fastpath+0x16/0x1b
Finally decided I should fix this. Its a straightforward inversion,
tipc_ref_acquire takes two locks in this order:
ref_table_lock
entry->lock
while tipc_deleteport takes them in this order:
entry->lock (via tipc_port_lock())
ref_table_lock (via tipc_ref_discard())
when the same entry is referenced, we get the above warning. The fix is equally
straightforward. Theres no real relation between the entry->lock and the
ref_table_lock (they just are needed at the same time), so move the entry->lock
aquisition in tipc_ref_acquire down, after we unlock ref_table_lock (this is
safe since the ref_table_lock guards changes to the reference table, and we've
already claimed a slot there. I've tested the below fix and confirmed that it
clears up the lockdep issue
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes UDP socket refcnt bugs in the pppol2tp driver.
A bug can cause a kernel stack trace when a tunnel socket is closed.
A way to reproduce the issue is to prepare the UDP socket for L2TP (by
opening a tunnel pppol2tp socket) and then close it before any L2TP
sessions are added to it. The sequence is
Create UDP socket
Create tunnel pppol2tp socket to prepare UDP socket for L2TP
pppol2tp_connect: session_id=0, peer_session_id=0
L2TP SCCRP control frame received (tunnel_id==0)
pppol2tp_recv_core: sock_hold()
pppol2tp_recv_core: sock_put
L2TP ZLB control frame received (tunnel_id=nnn)
pppol2tp_recv_core: sock_hold()
pppol2tp_recv_core: sock_put
Close tunnel management socket
pppol2tp_release: session_id=0, peer_session_id=0
Close UDP socket
udp_lib_close: BUG
The addition of sock_hold() in pppol2tp_connect() solves the problem.
For data frames, two sock_put() calls were added to plug a refcnt leak
per received data frame. The ref that is grabbed at the top of
pppol2tp_recv_core() must always be released, but this wasn't done for
accepted data frames or data frames discarded because of bad UDP
checksums. This leak meant that any UDP socket that had passed L2TP
data traffic (i.e. L2TP data frames, not just L2TP control frames)
using pppol2tp would not be released by the kernel.
WARNING: at include/net/sock.h:435 udp_lib_unhash+0x117/0x120()
Pid: 1086, comm: openl2tpd Not tainted 2.6.33-rc1 #8
Call Trace:
[<c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<c101b871>] ? warn_slowpath_common+0x71/0xd0
[<c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<c101b8e3>] ? warn_slowpath_null+0x13/0x20
[<c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<c11598a7>] ? sk_common_release+0x17/0x90
[<c11a5e33>] ? inet_release+0x33/0x60
[<c11577b0>] ? sock_release+0x10/0x60
[<c115780f>] ? sock_close+0xf/0x30
[<c106e542>] ? __fput+0x52/0x150
[<c106b68e>] ? filp_close+0x3e/0x70
[<c101d2e2>] ? put_files_struct+0x62/0xb0
[<c101eaf7>] ? do_exit+0x5e7/0x650
[<c1081623>] ? mntput_no_expire+0x13/0x70
[<c106b68e>] ? filp_close+0x3e/0x70
[<c101eb8a>] ? do_group_exit+0x2a/0x70
[<c101ebe1>] ? sys_exit_group+0x11/0x20
[<c10029b0>] ? sysenter_do_call+0x12/0x26
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ensures the PHY correctly completes its reset before
setting register values.
Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a driver for SMSC's LAN7500 family of USB 2.0
to gigabit ethernet adapters. It's loosely based on the smsc95xx
driver but the device registers for LAN7500 are completely different.
Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes following warning introduced by commit
12bac0d9f4 ("proc: warn on non-existing
proc entries"):
WARNING: at /work/mips-linux/make/linux/fs/proc/generic.c:316 __xlate_proc_name+0xe0/0xe8()
name 'RBHMA4X00/RTL8019'
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found that one error path (when alloc_skb fails) in netdev_tx
omits to unlock hw_priv->hwlock. Fix that.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Tristram Ha <Tristram.Ha@micrel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct a potential array overrun due to an off by one error in the
range check on the CAPI CONNECT_REQ CIPValue parameter.
Found and reported by Dan Carpenter using smatch.
Impact: bugfix
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>