Commit Graph

3864 Commits

Author SHA1 Message Date
WANG Cong de185ab46c ipv6: restore the behavior of ipv6_sock_ac_drop()
It is possible that the interface is already gone after joining
the list of anycast on this interface as we don't hold a refcount
for the device, in this case we are safe to ignore the error.

What's more important, for API compatibility we should not
change this behavior for applications even if it were correct.

Fixes: commit a9ed4a2986 ("ipv6: fix rtnl locking in setsockopt for anycast and multicast")
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-07 16:10:07 -07:00
Nicolas Dichtel e7478dfc46 ipv6: use addrconf_get_prefix_route() to remove peer addr
addrconf_get_prefix_route() ensures to get the right route in the right table.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:13:24 -07:00
Nicolas Dichtel f24062b07d ipv6: fix a refcnt leak with peer addr
There is no reason to take a refcnt before deleting the peer address route.
It's done some lines below for the local prefix route because
inet6_ifa_finish_destroy() will release it at the end.
For the peer address route, we want to free it right now.

This bug has been introduced by commit
caeaba7900 ("ipv6: add support of peer address").

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:13:24 -07:00
Sabrina Dubroca a9ed4a2986 ipv6: fix rtnl locking in setsockopt for anycast and multicast
Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
triggers the assertion in addrconf_join_solict()/addrconf_leave_solict()

ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
calling ipv6_dev_mc_inc/dec.

This patch moves ASSERT_RTNL() up a level in the call stack.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 11:52:28 -07:00
Pablo Neira 41ad82f7f8 netfilter: fix missing dependencies in NETFILTER_XT_TARGET_LOG
make defconfig reports:

warning: (NETFILTER_XT_TARGET_LOG) selects NF_LOG_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NETFILTER_ADVANCED)

Fixes: d79a61d netfilter: NETFILTER_XT_TARGET_LOG selects NF_LOG_*
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 13:59:54 -07:00
David S. Miller abccc5878a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
pull request: Netfilter/IPVS fixes for net

The following patchset contains seven Netfilter fixes for your net
tree, they are:

1) Make the NAT infrastructure independent of x_tables, some users are
   already starting to test nf_tables with NAT without enabling x_tables.
   Without this patch for Kconfig, there's a superfluous dependency
   between NAT and x_tables.
2) Allow to use 0 in the cgroup match, the kernel rejects with -EINVAL
   with no good reason. From Daniel Borkmann.

3) Select CONFIG_NF_NAT from the nf_tables NAT expression, this also
   resolves another NAT dependency with x_tables.

4) Use HAVE_JUMP_LABEL instead of CONFIG_JUMP_LABEL in the Netfilter hook
   code as elsewhere in the kernel to resolve toolchain problems, from
   Zhouyi Zhou.

5) Use iptunnel_handle_offloads() to set up tunnel encapsulation
   depending on the offload capabilities, reported by Alex Gartrell
   patch from Julian Anastasov.

6) Fix wrong family when registering the ip_vs_local_reply6() hook,
   also from Julian.

7) Select the NF_LOG_* symbols from NETFILTER_XT_TARGET_LOG. Rafał
   Miłecki reported that when jumping from 3.16 to 3.17-rc, his log
   target is not selected anymore due to changes in the previous
   development cycle to accomodate the full logging support for
   nf_tables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 13:56:30 -07:00
Benjamin Block 793c3b4000 net: ipv6: fib: don't sleep inside atomic lock
The function fib6_commit_metrics() allocates a piece of memory in mode
GFP_KERNEL while holding an atomic lock from higher up in the stack, in
the function __ip6_ins_rt(). This produces the following BUG:

> BUG: sleeping function called from invalid context at mm/slub.c:1250
> in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: dhcpcd
> 2 locks held by dhcpcd/2909:
>  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81978e67>] rtnl_lock+0x17/0x20
>  #1:  (&tb->tb6_lock){++--+.}, at: [<ffffffff81a6951a>] ip6_route_add+0x65a/0x800
> CPU: 1 PID: 2909 Comm: dhcpcd Not tainted 3.17.0-rc1 #1
> Hardware name: ASUS All Series/Q87T, BIOS 0216 10/16/2013
>  0000000000000008 ffff8800c8f13858 ffffffff81af135a 0000000000000000
>  ffff880212202430 ffff8800c8f13878 ffffffff810f8d3a ffff880212202c98
>  0000000000000010 ffff8800c8f138c8 ffffffff8121ad0e 0000000000000001
> Call Trace:
>  [<ffffffff81af135a>] dump_stack+0x4e/0x68
>  [<ffffffff810f8d3a>] __might_sleep+0x10a/0x120
>  [<ffffffff8121ad0e>] kmem_cache_alloc_trace+0x4e/0x190
>  [<ffffffff81a6bcd6>] ? fib6_commit_metrics+0x66/0x110
>  [<ffffffff81a6bcd6>] fib6_commit_metrics+0x66/0x110
>  [<ffffffff81a6cbf3>] fib6_add+0x883/0xa80
>  [<ffffffff81a6951a>] ? ip6_route_add+0x65a/0x800
>  [<ffffffff81a69535>] ip6_route_add+0x675/0x800
>  [<ffffffff81a68f2a>] ? ip6_route_add+0x6a/0x800
>  [<ffffffff81a6990c>] inet6_rtm_newroute+0x5c/0x80
>  [<ffffffff8197cf01>] rtnetlink_rcv_msg+0x211/0x260
>  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
>  [<ffffffff81119708>] ? lock_release_holdtime+0x28/0x180
>  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
>  [<ffffffff8197ccf0>] ? __rtnl_unlock+0x20/0x20
>  [<ffffffff819a989e>] netlink_rcv_skb+0x6e/0xd0
>  [<ffffffff81978ee5>] rtnetlink_rcv+0x25/0x40
>  [<ffffffff819a8e59>] netlink_unicast+0xd9/0x180
>  [<ffffffff819a9600>] netlink_sendmsg+0x700/0x770
>  [<ffffffff81103735>] ? local_clock+0x25/0x30
>  [<ffffffff8194e83c>] sock_sendmsg+0x6c/0x90
>  [<ffffffff811f98e3>] ? might_fault+0xa3/0xb0
>  [<ffffffff8195ca6d>] ? verify_iovec+0x7d/0xf0
>  [<ffffffff8194ec3e>] ___sys_sendmsg+0x37e/0x3b0
>  [<ffffffff8111ef15>] ? trace_hardirqs_on_caller+0x185/0x220
>  [<ffffffff81af979e>] ? mutex_unlock+0xe/0x10
>  [<ffffffff819a55ec>] ? netlink_insert+0xbc/0xe0
>  [<ffffffff819a65e5>] ? netlink_autobind.isra.30+0x125/0x150
>  [<ffffffff819a6520>] ? netlink_autobind.isra.30+0x60/0x150
>  [<ffffffff819a84f9>] ? netlink_bind+0x159/0x230
>  [<ffffffff811f989a>] ? might_fault+0x5a/0xb0
>  [<ffffffff8194f25e>] ? SYSC_bind+0x7e/0xd0
>  [<ffffffff8194f8cd>] __sys_sendmsg+0x4d/0x80
>  [<ffffffff8194f912>] SyS_sendmsg+0x12/0x20
>  [<ffffffff81afc692>] system_call_fastpath+0x16/0x1b

Fixing this by replacing the mode GFP_KERNEL with GFP_ATOMIC.

Signed-off-by: Benjamin Block <bebl@mageta.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 10:54:49 -07:00
Pablo Neira Ayuso 8993cf8edf netfilter: move NAT Kconfig switches out of the iptables scope
Currently, the NAT configs depend on iptables and ip6tables. However,
users should be capable of enabling NAT for nft without having to
switch on iptables.

Fix this by adding new specific IP_NF_NAT and IP6_NF_NAT config
switches for iptables and ip6tables NAT support. I have also moved
the original NF_NAT_IPV4 and NF_NAT_IPV6 configs out of the scope
of iptables to make them independent of it.

This patch also adds NETFILTER_XT_NAT which selects the xt_nat
combo that provides snat/dnat for iptables. We cannot use NF_NAT
anymore since nf_tables can select this.

Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-08-18 21:55:54 +02:00
Neal Cardwell 4fab907195 tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().

Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Shmulik Ladkani bc8fc7b8f8 sit: Fix ipip6_tunnel_lookup device matching criteria
As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"),
when looking up a tunnel, tunnel's underlying interface (t->parms.link)
is verified to match incoming traffic's ingress device.

However the comparison was incorrectly based on skb->dev->iflink.

Instead, dev->ifindex should be used, which correctly represents the
interface from which the IP stack hands the ipip6 packets.

This allows setting up sit tunnels bound to vlan interfaces (otherwise
incoming ipip6 traffic on the vlan interface was dropped due to
ipip6_tunnel_lookup match failure).

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14 14:38:54 -07:00
Linus Torvalds 33caee3992 Merge branch 'akpm' (patchbomb from Andrew Morton)
Merge incoming from Andrew Morton:
 - Various misc things.
 - arch/sh updates.
 - Part of ocfs2.  Review is slow.
 - Slab updates.
 - Most of -mm.
 - printk updates.
 - lib/ updates.
 - checkpatch updates.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
  checkpatch: update $declaration_macros, add uninitialized_var
  checkpatch: warn on missing spaces in broken up quoted
  checkpatch: fix false positives for --strict "space after cast" test
  checkpatch: fix false positive MISSING_BREAK warnings with --file
  checkpatch: add test for native c90 types in unusual order
  checkpatch: add signed generic types
  checkpatch: add short int to c variable types
  checkpatch: add for_each tests to indentation and brace tests
  checkpatch: fix brace style misuses of else and while
  checkpatch: add --fix option for a couple OPEN_BRACE misuses
  checkpatch: use the correct indentation for which()
  checkpatch: add fix_insert_line and fix_delete_line helpers
  checkpatch: add ability to insert and delete lines to patch/file
  checkpatch: add an index variable for fixed lines
  checkpatch: warn on break after goto or return with same tab indentation
  checkpatch: emit a warning on file add/move/delete
  checkpatch: add test for commit id formatting style in commit log
  checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
  checkpatch: improve "no space after cast" test
  checkpatch: allow multiple const * types
  ...
2014-08-06 21:14:42 -07:00
Ken Helias 1d023284c3 list: fix order of arguments for hlist_add_after(_rcu)
All other add functions for lists have the new item as first argument
and the position where it is added as second argument.  This was changed
for no good reason in this function and makes using it unnecessary
confusing.

The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Dmitry Popov 9ea88a1530 tcp: md5: check md5 signature without socket lock
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 16:00:20 -07:00
David S. Miller d247b6ab3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/Makefile
	net/ipv6/sysctl_net_ipv6.c

Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.

In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 18:46:26 -07:00
Willem de Bruijn 09c2d251b7 net-timestamp: add key to disambiguate concurrent datagrams
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Hannes Frederic Sowa 166bd890a3 ipv6: data of fwmark_reflect sysctl needs to be updated on netns construction
Fixes: e110861f86 ("net: add a sysctl to reflect the fwmark on replies")
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 16:16:54 -07:00
Nikolay Aleksandrov d4ad4d22e7 inet: frags: use kmem_cache for inet_frag_queue
Use kmem_cache to allocate/free inet_frag_queue objects since they're
all the same size per inet_frags user and are alloced/freed in high volumes
thus making it a perfect case for kmem_cache.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:31:31 -07:00
Nikolay Aleksandrov 2e404f632f inet: frags: use INET_FRAG_EVICTED to prevent icmp messages
Now that we have INET_FRAG_EVICTED we might as well use it to stop
sending icmp messages in the "frag_expire" functions instead of
stripping INET_FRAG_FIRST_IN from their flags when evicting.
Also fix the comment style in ip6_expire_frag_queue().

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:31:31 -07:00
Nikolay Aleksandrov 06aa8b8a03 inet: frags: rename last_in to flags
The last_in field has been used to store various flags different from
first/last frag in so give it a more descriptive name: flags.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:31:31 -07:00
Nikolay Aleksandrov d2373862b3 inet: frags: use INC_STATS_BH in the ipv6 reassembly code
Softirqs are already disabled so no need to do it again, thus let's be
consistent and use the IP6_INC_STATS_BH variant.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:31:31 -07:00
Duan Jiong 4330487acf net: use inet6_iif instead of IP6CB()->iif
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 22:37:06 -07:00
Duan Jiong 7304fe4681 net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver()
and icmpv6_notify() do some valid checks on packet's length, but then some
protocols check packet's length redaudantly. So remove those duplicated
statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
function icmp_socket_deliver() and icmpv6_notify() respectively.

In addition, add missed counter in udp6/udplite6 when socket is NULL.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 22:04:18 -07:00
David S. Miller ccda4a77f3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2014-07-30

This is the last pull request for ipsec-next before I'll be
off for two weeks starting on friday. David, can you please
take urgent ipsec patches directly into net/net-next during
this time?

1) Error handling simplifications for vti and vti6.
   From Mathias Krause.

2) Remove a duplicate semicolon after a return statement.
   From Christoph Paasch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 20:05:54 -07:00
David S. Miller f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
WANG Cong a317a2f19d ipv6: fail early when creating netdev named all or default
We create a proc dir for each network device, this will cause
conflicts when the devices have name "all" or "default".

Rather than emitting an ugly kernel warning, we could just
fail earlier by checking the device name.

Reported-by: Stephane Chazelas <stephane.chazelas@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 11:43:50 -07:00
Eric Dumazet 04ca6973f7 ip: make IP identifiers less predictable
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28 18:46:34 -07:00
Nikolay Aleksandrov 1bab4c7507 inet: frag: set limits and make init_net's high_thresh limit global
This patch makes init_net's high_thresh limit to be the maximum for all
namespaces, thus introducing a global memory limit threshold equal to the
sum of the individual high_thresh limits which are capped.
It also introduces some sane minimums for low_thresh as it shouldn't be
able to drop below 0 (or > high_thresh in the unsigned case), and
overall low_thresh should not ever be above high_thresh, so we make the
following relations for a namespace:
init_net:
 high_thresh - max(not capped), min(init_net low_thresh)
 low_thresh - max(init_net high_thresh), min (0)

all other namespaces:
 high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
 low_thresh = max(namespace's high_thresh), min(0)

The major issue with having low_thresh > high_thresh is that we'll
schedule eviction but never evict anything and thus rely only on the
timers.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal ab1c724f63 inet: frag: use seqlock for hash rebuild
rehash is rare operation, don't force readers to take
the read-side rwlock.

Instead, we only have to detect the (rare) case where
the secret was altered while we are trying to insert
a new inetfrag queue into the table.

If it was changed, drop the bucket lock and recompute
the hash to get the 'new' chain bucket that we have to
insert into.

Joint work with Nikolay Aleksandrov.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal e3a57d18b0 inet: frag: remove periodic secret rebuild timer
merge functionality into the eviction workqueue.

Instead of rebuilding every n seconds, take advantage of the upper
hash chain length limit.

If we hit it, mark table for rebuild and schedule workqueue.
To prevent frequent rebuilds when we're completely overloaded,
don't rebuild more than once every 5 seconds.

ipfrag_secret_interval sysctl is now obsolete and has been marked as
deprecated, it still can be changed so scripts won't be broken but it
won't have any effect. A comment is left above each unused secret_timer
variable to avoid confusion.

Joint work with Nikolay Aleksandrov.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal 3fd588eb90 inet: frag: remove lru list
no longer used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal 434d305405 inet: frag: don't account number of fragment queues
The 'nqueues' counter is protected by the lru list lock,
once thats removed this needs to be converted to atomic
counter.  Given this isn't used for anything except for
reporting it to userspace via /proc, just remove it.

We still report the memory currently used by fragment
reassembly queues.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal b13d3cbfb8 inet: frag: move eviction of queues to work queue
When the high_thresh limit is reached we try to toss the 'oldest'
incomplete fragment queues until memory limits are below the low_thresh
value.  This happens in softirq/packet processing context.

This has two drawbacks:

1) processors might evict a queue that was about to be completed
by another cpu, because they will compete wrt. resource usage and
resource reclaim.

2) LRU list maintenance is expensive.

But when constantly overloaded, even the 'least recently used' element is
recent, so removing 'lru' queue first is not 'fairer' than removing any
other fragment queue.

This moves eviction out of the fast path:

When the low threshold is reached, a work queue is scheduled
which then iterates over the table and removes the queues that exceed
the memory limits of the namespace. It sets a new flag called
INET_FRAG_EVICTED on the evicted queues so the proper counters will get
incremented when the queue is forcefully expired.

When the high threshold is reached, no more fragment queues are
created until we're below the limit again.

The LRU list is now unused and will be removed in a followup patch.

Joint work with Nikolay Aleksandrov.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:35 -07:00
Florian Westphal 86e93e470c inet: frag: move evictor calls into frag_find function
First step to move eviction handling into a work queue.

We lose two spots that accounted evicted fragments in MIB counters.

Accounting will be restored since the upcoming work-queue evictor
invokes the frag queue timer callbacks instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:35 -07:00
Florian Westphal fb3cfe6e75 inet: frag: remove hash size assumptions from callers
hide actual hash size from individual users: The _find
function will now fold the given hash value into the required range.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:35 -07:00
Florian Westphal 36c7778218 inet: frag: constify match, hashfn and constructor arguments
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:35 -07:00
Li RongQing ac3d2e5a9e ipv6: remove obsolete comment in ip6_append_data()
After 11878b40e[net-timestamp: SOCK_RAW and PING timestamping], this comment
becomes obsolete since the codes check not only UDP socket, but also RAW sock;
and the codes are clear, not need the comments

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24 23:47:04 -07:00
Himangi Saraogi 56ec0fb10c neigh: remove exceptional & on function name
In this file, function names are otherwise used as pointers without &.

A simplified version of the Coccinelle semantic patch that makes this
change is as follows:

// <smpl>
@r@
identifier f;
@@

f(...) { ... }

@@
identifier r.f;
@@

- &f
+ f
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24 23:23:31 -07:00
Sorin Dumitru 274f482d33 sock: remove skb argument from sk_rcvqueues_full
It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits).

Signed-off-by: Sorin Dumitru <sorin@returnze.ro>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-23 13:23:06 -07:00
David S. Miller 8fd90bb889 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/infiniband/hw/cxgb4/device.c

The cxgb4 conflict was simply overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-22 00:44:59 -07:00
David S. Miller a8138f42d4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains updates for your net-next tree,
they are:

1) Use kvfree() helper function from x_tables, from Eric Dumazet.

2) Remove extra timer from the conntrack ecache extension, use a
   workqueue instead to redeliver lost events to userspace instead,
   from Florian Westphal.

3) Removal of the ulog targets for ebtables and iptables. The nflog
   infrastructure superseded this almost 9 years ago, time to get rid
   of this code.

4) Replace the list of loggers by an array now that we can only have
   two possible non-overlapping logger flavours, ie. kernel ring buffer
   and netlink logging.

5) Move Eric Dumazet's log buffer code to nf_log to reuse it from
   all of the supported per-family loggers.

6) Consolidate nf_log_packet() as an unified interface for packet logging.
   After this patch, if the struct nf_loginfo is available, it explicitly
   selects the logger that is used.

7) Move ip and ip6 logging code from xt_LOG to the corresponding
   per-family loggers. Thus, x_tables and nf_tables share the same code
   for packet logging.

8) Add generic ARP packet logger, which is used by nf_tables. The
   format aims to be consistent with the output of xt_LOG.

9) Add generic bridge packet logger. Again, this is used by nf_tables
   and it routes the packets to the real family loggers. As a result,
   we get consistent logging format for the bridge family. The ebt_log
   logging code has been intentionally left in place not to break
   backward compatibility since the logging output differs from xt_LOG.

10) Update nft_log to explicitly request the required family logger when
    needed.

11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families.
    Allowing selection between netlink and kernel buffer ring logging.

12) Several fixes coming after the netfilter core logging changes spotted
    by robots.

13) Use IS_ENABLED() macros whenever possible in the netfilter tree,
    from Duan Jiong.

14) Removal of a couple of unnecessary branch before kfree, from Fabian
    Frederick.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-20 21:01:43 -07:00
David Held 2dc41cff75 udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.
Many multicast sources can have the same port which can result in a very
large list when hashing by port only. Hash by address and port instead
if this is the case. This makes multicast more similar to unicast.

On a 24-core machine receiving from 500 multicast sockets on the same
port, before this patch 80% of system CPU was used up by spin locking
and only ~25% of packets were successfully delivered.

With this patch, all packets are delivered and kernel overhead is ~8%
system CPU on spinlocks.

Signed-off-by: David Held <drheld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 23:29:52 -07:00
David Held 5cf3d46192 udp: Simplify __udp*_lib_mcast_deliver.
Switch to using sk_nulls_for_each which shortens the code and makes it
easier to update.

Signed-off-by: David Held <drheld@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 23:29:52 -07:00
Jerry Chu c3caf1192f net-gre-gro: Fix a bug that breaks the forwarding path
Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 14:45:26 -07:00
David S. Miller 1a98c69af1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 14:09:34 -07:00
Willem de Bruijn 11878b40ed net-timestamp: SOCK_RAW and PING timestamping
Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW:

Add the necessary sock_tx_timestamp calls to the datapath for RAW
sockets (ping sockets already had these calls).

Fix the IP output path to pass the timestamp flags on the first
fragment also for these sockets. The existing code relies on
transhdrlen != 0 to indicate a first fragment. For these sockets,
that assumption does not hold.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221

Tested SOCK_RAW on IPv4 and IPv6, not PING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:32:45 -07:00
Fabian Frederick 66635dc121 ipv6: remove unnecessary break after return
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:27:01 -07:00
Fabian Frederick d518825eab netfilter: remove unnecessary break after return
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:27:00 -07:00
Tom Gundersen c835a67733 net: set name_assign_type in alloc_netdev()
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
all users to pass NET_NAME_UNKNOWN.

Coccinelle patch:

@@
expression sizeof_priv, name, setup, txqs, rxqs, count;
@@

(
-alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
+alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
|
-alloc_netdev_mq(sizeof_priv, name, setup, count)
+alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
|
-alloc_netdev(sizeof_priv, name, setup)
+alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
)

v9: move comments here from the wrong commit

Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:12:48 -07:00
Himangi Saraogi e3f0b86b99 ipv6: Use BUG_ON
The semantic patch that makes this transformation is as follows:

// <smpl>
@@ expression e; @@
-if (e) BUG();
+BUG_ON(e);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-11 15:06:38 -07:00
Himangi Saraogi 8242fc3392 net: ipv6: Use BUG_ON
The semantic patch that makes the transformation is as follows:

// <smpl>
@@ expression e; @@
-if (e) BUG();
+BUG_ON(e);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-11 15:06:38 -07:00