Commit Graph

39336 Commits

Author SHA1 Message Date
Florian Fainelli f37db85d0c net: dsa: Set a "dsa" device_type
Provide a device_type information for slave network devices created by
DSA, this is useful for user-space application to easily locate/search
for devices of a specific kind.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:25:16 -07:00
Linus Torvalds 101688f534 NFS client bugfixes for Linux 4.3
Highlights include:
 
 Stable patches:
 - fix v4.2 SEEK on files over 2 gigs
 - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
 - Fix recovery of recalled read delegations
 
 Bugfixes:
 - Fix a case where NFSv4 fails to send CLOSE after a server reboot
 - Fix sunrpc to wait for connections to complete before retrying
 - Fix sunrpc races between transport connect/disconnect and shutdown
 - Fix an infinite loop when layoutget fail with BAD_STATEID
 - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
 - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
 - Fix layoutreturn/close ordering issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWBWKdAAoJEGcL54qWCgDy2rUP/iIWUQSpUPfKKw7xquQUQe4j
 ci4nFxpJ/zhKj1u7x3wrkxZAXcEooYo+ZJ7ayzROKcfQL/sUSWGbSLdr3mqrQynv
 b0SDmnJK9V+CdBQrA+Jp5UGQxcumpMxsAfqVznT0qkf/wDp44DCVgDz5Aj8cRbWU
 6xPfMgVLEnXiId9IgKqg3sJ2NmvMZXuI9sHM6hp6OzRmQDjTcx+LgRz7tnQHgaEk
 zGz8R6eDm3OA0wfApqZwJ6JY793HsDdy30W9L0Yi2PVGXfzwoEB8AqgLVwSDIY1B
 5hG5zn3tg9PSz9vhJ7M2h4AgFHdB3w3XGdJUafwqZEeqEIagw1iFCWlMyo/lE2dG
 G7oob9Jiiwxjc3RDWn2wGaafymrrWZwl2nYzC4O3UvJ3hVJ0mEl1iJagK1m8LzfN
 fmnP7tTyPuoOXkzDogZ0YI3FrngO6430PoR2hUPkS1yce/a+IV0HQEmXbSDSwN80
 1d9zyC9TnPj6rFjZeaGxGK17BpkC0oIQCPq4OSJB4396wzAwMqoJjJVVWWeAK4UC
 PxzoXqAAaBFguSsDbuBMcXgiuUw/7DIZ/pdzsWSiCFgocgF5ZdJdieCNtGk0nbLM
 37R7HCauF93JDrkpUMKPnLXScb2IbEh31pFtKzptJYKwMxEiScXXiP3NE9hfX65i
 2zLkl2aBvd154RvVKNbp
 =GdeV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - fix v4.2 SEEK on files over 2 gigs
   - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
   - Fix recovery of recalled read delegations

  Bugfixes:
   - Fix a case where NFSv4 fails to send CLOSE after a server reboot
   - Fix sunrpc to wait for connections to complete before retrying
   - Fix sunrpc races between transport connect/disconnect and shutdown
   - Fix an infinite loop when layoutget fail with BAD_STATEID
   - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
   - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
   - Fix layoutreturn/close ordering issues"

* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS41: make close wait for layoutreturn
  NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
  NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
  NFSv4: Recovery of recalled read delegations is broken
  NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
  NFS: Do cleanup before resetting pageio read/write to mds
  SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
  SUNRPC: Lock the transport layer on shutdown
  nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
  SUNRPC: Ensure that we wait for connections to complete before retrying
  SUNRPC: drop null test before destroy functions
  nfs: fix v4.2 SEEK on files over 2 gigs
  SUNRPC: Fix races between socket connection and destroy code
  nfs: fix pg_test page count calculation
  Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-25 11:33:52 -07:00
Chuck Lever bb6c96d728 xprtrdma: Replace global lkey with lkey local to PD
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 10:46:51 -04:00
Russell King 9861f72074 net: fix net_device refcounting
of_find_net_device_by_node() uses class_find_device() internally to
lookup the corresponding network device.  class_find_device() returns
a reference to the embedded struct device, with its refcount
incremented.

Add a comment to the definition in net/core/net-sysfs.c indicating the
need to drop this refcount, and fix the DSA code to drop this refcount
when the OF-generated platform data is cleaned up and freed.  Also
arrange for the ref to be dropped when handling errors.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:53 -07:00
Russell King e496ae690b net: dsa: fix of_mdio_find_bus() device refcount leak
Current users of of_mdio_find_bus() leak a struct device refcount, as
they fail to clean up the reference obtained inside class_find_device().

Fix the DSA code to properly refcount the returned MDIO bus by:
1. taking a reference on the struct device whenever we assign it to
   pd->chip[x].host_dev.
2. dropping the reference when we overwrite the existing reference.
3. dropping the reference when we free the data structure.
4. dropping the initial reference we obtained after setting up the
   platform data structure, or on failure.

In step 2 above, where we obtain a new MDIO bus, there is no need to
take a reference on it as we would only have to drop it immediately
after assignment again, iow:

	put_device(cd->host_dev);	/* drop original assignment ref */
	cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
	put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:52 -07:00
Jiri Pirko f623ab7f51 switchdev: reduce transaction phase enum down to a boolean
Now, since we have only 2 values for transaction phase, just use bool.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko 79a62eb22a dsa: use prepare/commit switchdev transaction helpers
The enum is going to disappear, use the helpers instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko 9f6467cf22 switchdev: remove "ABORT" transaction phase
No longer used by drivers, as transaction queue with item destructors
takes care of abort phase internally in switchdev code. So kill it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko f8db83486e switchdev: move transaction phase enum under transaction structure
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko 7ea6eb3f56 switchdev: introduce transaction item queue for attr_set and obj_add
Now, the memory allocation in prepare/commit state is done separatelly
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Abort item destruction
is handled as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko 69f5df491e switchdev: rename "trans" to "trans_ph".
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Pablo Neira Ayuso c3456026ad Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.4

please consider these bug fixes and extensive clean-ups of IPVS
from Eric Biederman for v4.4.

His excellent description of the changes, which is part of an even larger
set of clean-up work, is as follows:

  I am gradually working my way through the netfilter stack passing struct
  down into the netfilter hooks and from the netfilter hooks and from there
  down into the functions that actually care.  This removes the need for
  netfilter functions to guess how to figure out how to compute which
  network namespace they are in and instead provides a simple and reliable
  method to do so.

  The cleanups stand on their own but this is part of a larger effort to
  have routes with an output device that is not in the current network
  namespace.

  The IPVS code has been a bit more of a challenge than most.  Just passing
  struct net through to where it is needed did not feel clean to me.  The
  practical issue is that the ipvs code in most places actually wants
  struct netns_ipvs and not struct net.

  So as part of this process I have turned the relationship between struct
  net and the structs netns_ipvs, ip_vs_conn_param, ip_vs_conn, and
  ip_vs_service inside out.  I have modified the ipvs functions to take a
  struct netns_ipvs not a struct net.  The net is code with fewer
  conversions from one type of structure to another.  I did wind up adding
  a struct netns_ipvs parameter to quite a few functions that did not have
  it before so I could pass the structure down from the netfilter hooks to
  where it is actually needed to avoid guessing.

  I have broken up the work in a bunch of small patches so there is at
  least a chance and reviewing that each step I took is correct.  The
  series compiles at each step so bisecting it should not be a problem if
  something weird comes up.

  The first two changes in this series are actually bug fixes.  The first
  is a compile fix for a bug in sctp that came in, in the last round of
  ipvs changes merged into nf-next.  The second fixes an older bug where in
  pathological circumstances the wrong network namespace could be used when
  a proc file is written to.

  The rest of the patchset is a bunch of boring changes getting pushing
  struct netns_ipvs (and by extension ipvs->net) where it needs to be.
  Either by replacing struct net pointers or adding new struct netns_ipvs
  pointers.  With a handful of other minor cleanups (like removing
  skb_net).

I have decided include the bug fixes in this pull request. Patch one
relates to a bug that was added to nf-next recently and is thus not
applicable to nf . Patch two could arguably be promoted to a fix for v4.3
and stable though it does not appear to be severe enough to warrant that
course of action; let me know if you would like me to reconsider.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-25 01:38:58 +02:00
Matt Bennett 17a10c9215 ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
Currently error log messages in ip6_tnl_err are printed at 'warn'
level. This is different to other tunnel types which don't print
any messages. These log messages don't provide any information that
couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 16:08:37 -07:00
David S. Miller deccbe80be Just two small fixes:
* VHT MCS mask array overrun, reported by Dan Carpenter
  * reset CQM history to always get a notification, from Sara Sharon
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJWAWENAAoJEDBSmw7B7bqr7agQAI/IVkuKlzu24XFg6PbaUNcL
 cD+jZplmg3OeeUf/Gtv7KVLAbRlo1X+qIiuquewzGhqAOGSm3KmypQmKQy7Y1gMO
 Nq4uJwdkJpo2aQtQH97jXwWfk40d+BqSXNbnpobvzrMWaMXHO+DT18Icc3iJqO9l
 cQOUTNplDMTpXgjHXjf6296MVP30lbUnoI6BPJGb3LS3BvfgX8KYR4XGlRCQ5R1F
 4gZE0IokY+PQmbZvBzXD3dBGuwnztezX8g25IytgFs+UDyeg8xn/OI+Q3e4HMpqD
 4SeNpD3AD1jxEedpr9trCaLCm8NQdnfU4LDE0y5qaTC2cWPTw1qZs4alNlbD+L3L
 odjiP1aU2X9rWTf78/WcJwn+aufBNGz20M9at5IWkUCi88taSGWzZAviRhhGpuzd
 4c2v6DLcaAqtTTGhhN6CUE4f9+JzeICgDjqSmdR90caMAtOPFjS2ZpbMstKVxO1V
 fjJ4Ys/VRsySrSIji/ryPNUSigeNhEJm9jImObcKhsEv6aqdIbYzdDWekbSVDtvB
 nJQJqlI1RlNn+HO+VXvPDIBuSpOXddVO6142yLQtte4ReL9NbZFU0lx7U5N6JoIk
 QfXbC9eD/LxdRONxLvakblWWbm+FflCaTiEi9dL+zQeAAY3Zgyfi2vo4bHdwi3XL
 GMNNo05q5WSFGkfktb8v
 =4Ui5
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two small fixes:
 * VHT MCS mask array overrun, reported by Dan Carpenter
 * reset CQM history to always get a notification, from Sara Sharon
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:36:20 -07:00
Matt Bennett a46496ce38 ip6_gre: Reduce log level in ip6gre_err() to debug
Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:28:43 -07:00
Wilson Kok 41fc014332 fib_rules: fix fib rule dumps across multiple skbs
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:21:54 -07:00
Eric Dumazet d8ed625044 tcp: factorize sk_txhash init
Neal suggested to move sk_txhash init into tcp_create_openreq_child(),
called both from IPv4 and IPv6.

This opportunity was missed in commit 58d607d3e5 ("tcp: provide
skb->hash to synack packets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:52:30 -07:00
WANG Cong d8aecb1011 net: revert "net_sched: move tp->root allocation into fw_init()"
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:33:30 -07:00
Jiri Benc b194f30c61 lwtunnel: remove source and destination UDP port config option
The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.

As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.

Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).

If in the future we rework ARP handling to allow port specification, the
attributes can be added back.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:37 -07:00
Jiri Benc 63d008a4e9 ipv4: send arp replies to the correct tunnel
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.

We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.

The only thing to do is to "reverse" the ip_tunnel_info.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:36 -07:00
Jiri Benc 38cf595b19 ipv6: remove unused neigh parameter from ndisc functions
Since commit 12fd84f438 ("ipv6: Remove unused neigh argument for
icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na
and ndisc_send_ns is unused.

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:26:08 -07:00
Jiri Benc 92c14d9b5e genetlink: simplify genl_notify
The genl_notify function has too many arguments for no real reason - all
callers use genl_info to get them anyway. Just pass the genl_info down to
genl_notify.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:25:23 -07:00
Herbert Xu da314c9923 netlink: Replace rhash_portid with bound
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09 ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:07:08 -07:00
Eric W. Biederman 57781c1cee ipvs: Pass ipvs into ip_vs_gather_frags
This will be needed later when the network namespace guessing is
removed from ip_defrag.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 9cfdd75b7c ipvs: Remove skb_sknet
This function adds no real value and it obscures what the code is doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 7d1f88eca0 ipvs: Pass ipvs not net to ip_vs_protocol_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 69f390934b ipvs: Remove net argument from ip_vs_tcp_conn_listen
The argument is unnecessary and in practice confusing,
and has caused the callers to do all manner of silly things.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman a43d1a6b97 ipvs: Pass ipvs through ip_vs_route_me_harder into sysctl_snat_reroute
This removes the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 7b5f689a2c ipvs: Pass ipvs into ip_vs_out_icmp and ip_vs_out_icmp_v6
This removes the need to compute ipvs with the hack "net_ipvs(skb_net(skb))"

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 6f2bcea991 ipvs: Pass ipvs into ip_vs_in_icmp and ip_vs_in_icmp_v6
With ipvs passed into ip_vs_in_icmp and ip_vs_in_icmp_v6
they no longer need to call the hack that is skb_net.

Additionally ipvs_in_icmp no longer needs to call dev_net(skb->dev)
and can use the ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 6e385bb3ef ipvs: Pass ipvs into ip_vs_in
Derive ipvs from state->net in the callers of ip_vs_in and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 1b75097dd7 ipvs: Pass ipvs into ip_vs_out
Derive ipvs from state->net in the callers of ip_vs_out and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 2300f0451e ipvs: Pass ipvs not net into sysctl_nat_icmp_send
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 51efbcbbb2 ipvs: Simplify ipvs and net access in ip_vs_leave
Stop using the hack skb_net(skb) to compute the network namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 5703294874 ipvs: Wrap sysctl_cache_bypass and remove ifdefs in ip_vs_leave
With sysctl_cache_bypass now a compile time constant the compiler can
figue out that it can elimiate all of the code that depends on
sysctl_cache_bypass being true.

Also remove the duplicate computation of net previously necessitated
by #ifdef CONFIG_SYSCTL

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 7c08d78e6f ipvs: Better derivation of ipvs in ip_vs_in_stats and ip_vs_out_stats
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 20868a40d0 ipvs: Pass ipvs into ensure_mtu_is adequate
This allows two different ways for computing/guessing net to be
removed from ensure_mtu_is_adequate.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman f5745f8ae6 ipvs: Pass ipvs into __ip_vs_get_out_rt_v6
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman ecfe87b884 ipvs: Pass ipvs into __ip_vs_get_out_rt
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 361c3f5293 ipvs: Better derivation of ipvs in ip_vs_tunnel_xmit
Don't use "net_ipvs(skb_net(skb))" as skb_net is a bad hack.  Instead
use cp->ipvs and ipvs->net for the net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman d8f44c335a ipvs: Pass ipvs into .conn_schedule and ip_vs_try_to_schedule
This moves the hack "net_ipvs(skb_net(skb))" up one level where it
will be easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 2f3edc6a5b ipvs: Pass ipvs not net into ip_vs_conn_net_init and ip_vs_conn_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman d889717aaf ipvs: Pass ipvs not net into ip_vs_conn_net_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 754b81a357 ipvs: Pass ipvs not net to ip_vs_conn_hashkey
Use the address of struct netns_ipvs in the hash not the address of
struct net.  Both addresses are equally valid candidates and by using
the address of struct netns_ipvs there becomes no need deal with
struct net in this part of the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 0cf705c8c2 ipvs: Pass ipvs into conn_out_get
Move the hack of relying on "net_ipvs(skb_net(skb))" to derive the
ipvs up a layer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman ab16197642 ipvs: Pass ipvs into .conn_in_get and ip_vs_conn_in_get_proto
Stop relying on "net_ipvs(skb_net(skb))" to derive the ipvs as
skb_net is a hack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman f5099dd4d9 ipvs: Pass ipvs into ip_vs_conn_fill_param_proto
Move the ugly hack net_ipvs(skb_net(skb)) up a layer in the call stack
so it is easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 1281a9c2d1 ipvs: Pass ipvs not net into init_netns and exit_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman c70bd6800a ipvs: Pass ipvs not net into [un]register_ip_vs_proto_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman b5dd212cc1 ipvs: Pass ipvs not net into ip_vs_app_net_init and ip_vs_app_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 09858708e6 ipvs: Pass ipvs not net into ip_vs_app_inc_release
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 9f8128a56e ipvs: Pass ipvs not net to register_ip_vs_app and unregister_ip_vs_app
Also move the tests for net_ipvs being NULL into __ip_vs_ftp_init
and __ip_vs_ftp_exit.  The only places where they possibly make
sense.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 3250dc9c52 ipvs: Pass ipvs not net to register_ip_vs_app_inc
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman a080ce38a0 ipvs: Pass ipvs not net into ip_vs_app_inc_new
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 19648918fb ipvs: Pass ipvs not net into register_app and unregister_app
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman a4dd0360c6 ipvs: Pass ipvs not net to ip_vs_estimator_net_init and ip_vs_estimator_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 70a131a2c8 ipvs: Pass ipvs not net to estimation_timer
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 3d99376689 ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 8b8237a581 ipvs: Pass ipvs not net to ip_vs_control_net_(init|cleanup)_sysctl
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 423b55954d ipvs: Pass ipvs not net to ip_vs_random_drop_entry
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 0f34d54bf4 ipvs: Pass ipvs not net to ip_vs_start_estimator aned ip_vs_stop_estimator
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman cacd1e60f1 ipvs: Pass ipvs not net to ip_vs_genl_set_config
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman ebea1f7c0b ipvs: Pass ipvs not net to ip_vs_sync_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 802cb43703 ipvs: Pass ipvs not net to ip_vs_sync_net_init
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 1fc12004d2 ipvs: Pass ipvs not net to ip_vs_proc_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 4f30665bac ipvs: Pass ipvs not net to ip_vs_proc_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman b61a8c1a40 ipvs: Pass ipvs not net to ip_vs_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 72e9481e28 ipvs: Pass ipvs not net to ip_vs_sync_conn_v0
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 7d537f3ab7 ipvs: Pass ipvs not net to ip_vs_process_message
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 37b68e6ded ipvs: Store ipvs not net in struct ip_vs_sync_thread_data
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of tinfo->net to access tinfo->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman fd124e2f8b ipvs: Pass ipvs not net to make_receive_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 68c76b6aa0 ipvs: Pass ipvs not net to make_send_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman b3cf3cbfb5 ipvs: Pass ipvs not net to stop_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 6ac121d710 ipvs: Pass ipvs not net to start_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman df04ffb766 ipvs: Pass ipvs not net to ip_vs_genl_del_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman d8443c5f2b ipvs: Pass ipvs not net to ip_vs_genl_new_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 34c2f5146c ipvs: Pass ipvs not net to ip_vs_genl_find_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 613fb830b7 ipvs: Pass ipvs not net to ip_vs_genl_parse_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman af5403419d ipvs: Pass ipvs not net to __ip_vs_get_timeouts
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman 08fff4c357 ipvs: Pass ipvs not net to __ip_vs_get_dest_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman b2876b7773 ipvs: Pass ipvs not net to __ip_vs_get_service_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman f1faa1e749 ipvs: Pass ipvs not net to ip_vs_set_timeout
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 18d6ade63c ipvs: Pass ipvs not net to ip_vs_proto_data_get
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman a47b430080 ipvs: Cache ipvs in ip_vs_in_icmp and ip_vs_in_icmp_v6
Storte the value of net_ipvs in a variable named ipvs so that when
there are more users struct netns_ipvs in ip_vs_in_cmp and
ip_vs_in_icmp_v6 they won't need to compute the value again.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman c60856c687 ipvs: Pass ipvs not net to ip_vs_zero_all
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 56d2169b77 ipvs: Pass ipvs not net to ip_vs_service_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman ef7c599d91 ipvs: Pass ipvs not net to ip_vs_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 5060bd8307 ipvs: Pass ipvs not net to ip_vs_add_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman cd58278bd4 ipvs: Cache ipvs in ip_vs_genl_set_cmd
Compute ipvs early in ip_vs_genl_set_cmd and use the cached value to
access ipvs->sync_state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 8e743f1b45 ipvs: Pass ipvs not net to ip_vs_dest_trash_expire
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 79ac82e0aa ipvs: Pass ipvs not net to __ip_vs_del_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 6c0e14f507 ipvs: Pass ipvs not net to ip_vs_trash_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman dc2add6f2e ipvs: Pass ipvs not net to ip_vs_find_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 48aed1b029 ipvs: Pass ipvs not net to ip_vs_has_real_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 0a4fd6ce92 ipvs: Pass ipvs not net to ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman bb2e2a8c95 ipvs: Pass ipvs not net to __ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman ba61f39034 ipvs: Pass ipvs not net to ip_vs_svc_hashkey
Use the address of ipvs not the address of net when computing the
hash value.  This removes an unncessary dependency on struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 1ed8b94780 ipvs: Pass ipvs not net to __ip_vs_svc_fwm_find
ipvs is what the code actually wants to use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman f6510b245e ipvs: Pass ipvs not net to ip_vs_svc_fwm_hashkey
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 3109d2f2d1 ipvs: Store ipvs not net in struct ip_vs_service
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

In functions where we are searching for an svc and filtering by net
filter by ipvs instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 19913dec1b ipvs: Pass ipvs not net to ip_vs_fill_conn
ipvs is what is actually desired so change the parameter and the modify
the callers to pass struct netns_ipvs.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman e64e2b460c ipvs: Store ipvs not net in struct ip_vs_conn_param
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

When lookup up struct ip_vs_conn in a hash table replace comparisons
of cp->net with comparisons of cp->ipvs which is possible
now that ipvs is present in ip_vs_conn_param.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 58dbc6f260 ipvs: Store ipvs not net in struct ip_vs_conn
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of conn->net to access conn->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman d484fc3812 ipvs: Use state->net in the ipvs forward functions
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 717e917ddf ipvs: Don't use current in proc_do_defense_mode
Instead store ipvs in extra2 so that proc_do_defense_mode can easily
find the ipvs that it's value is associated with.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 1daea8ed16 ipvs: Hoist computation of ipvs earlier in sctp_conn_schedule
The addition of sysctl_sloppy_sctp in sctp_conn_schedule resulted
in a use of ipvs before it was computed.  Hoist the computation of
ipvs earlier to avoid this problem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:32 +09:00
Siva Mannem dcd45e0649 bridge: don't age externally added FDB entries
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Premkumar Jonnala <pjonnala@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
Scott Feldman a79e88d9fb bridge: define some min/max/default ageing time constants
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
David Woodhouse d3869efe7a Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:33:55 -07:00
Neil Horman 2d8bff1269 netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0				CPU1
ndo_tx_timeout			napi_poll_dev
 napi_disable			 poll_one_napi
  test_and_set_bit (ret 0)
				  test_bit (ret 1)
   reset adapter		   napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware  (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do.  The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable.  The disabled napi state is actually identical to the scheduled
state for a given napi instance.  The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access.  In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy.  netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit.  That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
	Remove a trailing whtiespace
	Resubmit with proper subject prefix

V3)
	Clean up spacing nits

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:32:50 -07:00
Daniel Borkmann 5cf8ca0e47 cls_bpf: further limit exec opcodes subset
Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.

Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann ef146fa40c cls_bpf: make binding to classid optional
The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.

Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).

An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann bf007d1c75 cls_bpf: also dump TCA_BPF_FLAGS
In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.

Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:01 -07:00
Daniel Borkmann 927ab1d764 sched, bpf: let stack handle !IFF_UP devs on bpf_clone_redirect
Similarly as already the case in bpf_redirect()/skb_do_redirect()
pair, let the stack deal with devs that are !IFF_UP.

dev_forward_skb() as well as dev_queue_xmit() will free the skb
and increment drop counter internally in such cases, so we can
spare the condition in bpf_clone_redirect().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:25:51 -07:00
Eric Dumazet 675ee231d9 tcp: add proper TS val into RST packets
RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:24:07 -07:00
Tom Herbert 644d0e6569 ipv6 Use get_hash_from_flowi6 for rt6 hash
In rt6_info_hash_nhsfn replace the custom hashing over flowi6 that is
using xor with a call to common function get_hash_from_flowi6.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:21:09 -07:00
Neil Armstrong fbd03513bf net: dsa: Fix Marvell Egress Trailer check
The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:37:03 -07:00
Jesse Gross ae5f2fb1d5 openvswitch: Zero flows on allocation.
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:33:41 -07:00
Andrea Arcangeli ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
David S. Miller 99cb99aa05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:

1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
   proc knob to enable/disable it. By default is off to retain the old
   behaviour. Patchset from Alex Gartrell.

I'm also including what Alex originally said for the record:

"The configuration of ipvs at Facebook is relatively straightforward.  All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie.  For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).

The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance.  With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery.  Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.

To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp.  If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."

2) Add another proc entry to ignore tunneled packets to avoid routing loops
   from IPVS, also from Alex.

3) Fifteen patches from Eric Biederman to:

* Stop passing nf_hook_ops as parameter to the hook and use the state hook
  object instead all around the netfilter code, so only the private data
  pointer is passed to the registered hook function.

* Now that we've got state->net, propagate the netns pointer to netfilter hook
  clients to avoid its computation over and over again. A good example of how
  this has been simplified is the former TEE target (now nf_dup infrastructure)
  since it has killed the ugly pick_net() function.

There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 13:11:43 -07:00
Sara Sharon babc305e21 mac80211: reset CQM history upon reconfiguration
The current behavior of notifying CQM events is inconsistent:
Upon first configuration there is a cqm event with the current
status according to threshold configured, regardless of signal
stability.
When there is reconfiguration no event is sent unless there is
a significant change to the signal level according to the new
configuration.

Since the current reconfiguration behavior might cause missing
CQM events in case the current signal did not change but is on
the other side of the new threshold, fix that by resetting the
stored signal level upon reconfiguration.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:22:50 +02:00
Johannes Berg 2df1b131b5 mac80211: fix VHT MCS mask array overrun
The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
Split the loops to handle this correctly.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:19:48 +02:00
Eric Dumazet 29c6852602 inet: fix races in reqsk_queue_hash_req()
Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Eric Dumazet ed2e923945 tcp/dccp: fix timewait races in timer handling
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Yuchung Cheng f9b9958229 tcp: send loss probe after 1s if no RTT available
This patch makes TLP to use 1 sec timer by default when RTT is
not available due to SYN/ACK retransmission or SYN cookies.

Prior to this change, the lack of RTT prevents TLP so the first
data packets sent can only be recovered by fast recovery or RTO.
If the fast recovery fails to trigger the RTO is 3 second when
SYN/ACK is retransmitted. With this patch we can trigger fast
recovery in 1sec instead.

Note that we need to check Fast Open more properly. A Fast Open
connection could be (accepted then) closed before it receives
the final ACK of 3WHS so the state is FIN_WAIT_1. Without the
new check, TLP will retransmit FIN instead of SYN/ACK.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Yuchung Cheng 0f1c28ae74 tcp: usec resolution SYN/ACK RTT
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.

This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.

For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)

For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.

If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().

One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Ursula Braun 91e60eb60b s390/iucv: do not use arrays as argument
The iucv code uses arrays as arguments. Even though this does not
really cause a problem, it could be misleading, since the compiler
turns array arguments into just a pointer argument. To be more
precise this patch changes the array arguments into pointers.

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:03:04 -07:00
David S. Miller 5dcd246107 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-09-18

Here's the first bluetooth-next pull request for the 4.4 kernel:

 - ieee802154 cleanups & fixes
 - debugfs support for the at86rf230 driver
 - Support for quirky (seemingly counterfeit) CSR Bluetooth controllers
 - Power management and device config improvements for Intel controllers
 - Fix for devices with incorrect advertising data length
 - Fix for closing HCI user channel socket

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:00:44 -07:00
Herbert Xu 1f770c0a09 netlink: Fix autobind race condition that leads to zero port ID
The commit c0bb07df7d ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:55:31 -07:00
Nicolas Dichtel bc22a0e2ea iptunnel: make rx/tx bytes counters consistent
This was already done a long time ago in
commit 64194c31a0 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).

Before the patch the gre header was included on tx.

After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/gre 10.16.0.249 peer 10.16.0.121
    RX: bytes  packets  errors  dropped overrun mcast
    84         1        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    84         1        0       0       0       0

Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:36:22 -07:00
David S. Miller ac81374493 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patch contains Netfilter fixes for your net tree, they are:

1) nf_log_unregister() should only set to NULL the logger that is being
   unregistered, instead of everything else. Patch from Florian Westphal.

2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
   This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
   Also from Florian.

3) Use existing match/target extensions in the internal nft_compat extension
   lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).

4) Wait for rcu grace period before leaving nf_log_unregister().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:32:20 -07:00
Erik Hugne 4e3ae00100 tipc: reinitialize pointer after skb linearize
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Tamás Végh <tamas.vegh@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:31:20 -07:00
Ksenija Stanojevic 22a3f9a204 rxrpc: Replace get_seconds with ktime_get_seconds
Replace time_t type and get_seconds function which are not y2038 safe
on 32-bit systems. Function ktime_get_seconds use monotonic instead of
real time and therefore will not cause overflow.

Signed-off-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 21:53:56 -07:00
Nikola Forró 0315e38270 net: Fix behaviour of unreachable, blackhole and prohibit routes
Man page of ip-route(8) says following about route types:

  unreachable - these destinations are unreachable.  Packets are dis‐
  carded and the ICMP message host unreachable is generated.  The local
  senders get an EHOSTUNREACH error.

  blackhole - these destinations are unreachable.  Packets are dis‐
  carded silently.  The local senders get an EINVAL error.

  prohibit - these destinations are unreachable.  Packets are discarded
  and the ICMP message communication administratively prohibited is
  generated.  The local senders get an EACCES error.

In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.

In both address families all three route types now behave consistently
with documentation.

Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 21:45:08 -07:00
Trond Myklebust 4b0ab51db3 SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
Under all conditions, it should be quite sufficient just to mark
the socket as disconnected. It will then be closed by the
transport shutdown or reconnect code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:38:35 -04:00
Trond Myklebust 79234c3db6 SUNRPC: Lock the transport layer on shutdown
Avoid all races with the connect/disconnect handlers by taking the
transport lock.

Reported-by:"Suzuki K. Poulose" <suzuki.poulose@arm.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:37:43 -04:00
Eric W. Biederman 0a031ac5c0 netfilter: Use nf_ct_net instead of dev_net(out) in nf_nat_masquerade_ipv6
Use nf_ct_net(ct) instead of guessing that the netdevice out can
reliably report the network namespace the conntrack operation is
happening in.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:28 +02:00
Eric W. Biederman c7af6483b9 netfilter: Pass net into nf_xfrm_me_harder
Instead of calling dev_net on a likley looking network device
pass state->net into nf_xfrm_me_harder.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:22 +02:00
Eric W. Biederman 06198b34a3 netfilter: Pass priv instead of nf_hook_ops to netfilter hooks
Only pass the void *priv parameter out of the nf_hook_ops.  That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:16 +02:00
Eric W. Biederman 176971b338 ipvs: Read hooknum from state rather than ops->hooknum
This should be more cache efficient as state is more likely to be in
core, and the netfilter core will stop passing in ops soon.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:10 +02:00
Eric W. Biederman a31f1adc09 netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple
As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:04 +02:00
Eric W. Biederman a4ffe319ae act_connmark: Remember the struct net instead of guessing it.
Stop guessing the struct net instead of remember it.  Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:59:31 +02:00
Eric W. Biederman 206e8c0075 netfilter: Pass net to nf_dup_ipv4 and nf_dup_ipv6
This allows them to stop guessing the network namespace with pick_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:59:11 +02:00
Eric W. Biederman 88182a0e0c netfilter: nf_tables: Use pkt->net instead of computing net from the passed net_devices
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:49 +02:00
Eric W. Biederman 686c9b5080 netfilter: x_tables: Use par->net instead of computing from the passed net devices
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:25 +02:00
Eric W. Biederman 156c196f60 netfilter: x_tables: Pass struct net in xt_action_param
As xt_action_param lives on the stack this does not bloat any
persistent data structures.

This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:14 +02:00
Eric W. Biederman 6aa187f21c netfilter: nf_tables: kill nft_pktinfo.ops
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum

This simplifies the code, makes it more readable, and likely reduces
cache line misses.  Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:01 +02:00
Eric W. Biederman 082a758f04 inet netfilter: Prefer state->hook to ops->hooknum
The values of nf_hook_state.hook and nf_hook_ops.hooknum must be the
same by definition.

We are more likely to access the fields in nf_hook_state over the
fields in nf_hook_ops so with a little luck this results in
fewer cache line misses, and slightly more consistent code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:51 +02:00
Eric W. Biederman 6cb8ff3f1a inet netfilter: Remove hook from ip6t_do_table, arp_do_table, ipt_do_table
The values of ops->hooknum and state->hook are guaraneted to be equal
making the hook argument to ip6t_do_table, arp_do_table, and
ipt_do_table is unnecessary. Remove the unnecessary hook argument.

In the callers use state->hook instead of ops->hooknum for clarity and
to reduce the number of cachelines the callers touch.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:43 +02:00
Eric W. Biederman 97b59c3a91 netfilter: ebtables: Simplify the arguments to ebt_do_table
Nearly everything thing of interest to ebt_do_table is already present
in nf_hook_state.  Simplify ebt_do_table by just passing in the skb,
nf_hook_state, and the table.  This make the code easier to read and
maintenance easier.

To support this create an nf_hook_state on the stack in ebt_broute
(the only caller without a nf_hook_state already available).  This new
nf_hook_state adds no new computations to ebt_broute, but does use a
few more bytes of stack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:35 +02:00