Commit Graph

710686 Commits

Author SHA1 Message Date
David Howells be080a6f43 afs: Overhaul permit caching
Overhaul permit caching in AFS by making it per-vnode and sharing permit
lists where possible.

When most of the fileserver operations are called, they return a status
structure indicating the (revised) details of the vnode or vnodes involved
in the operation.  This includes the access mark derived from the ACL
(named CallerAccess in the protocol definition file).  This is cacheable
and if the ACL changes, the server will tell us that it is breaking the
callback promise, at which point we can discard the currently cached
permits.

With this patch, the afs_permits structure has, at the end, an array of
{ key, CallerAccess } elements, sorted by key pointer.  This is then cached
in a hash table so that it can be shared between vnodes with the same
access permits.

Permit lists can only be shared if they contain the exact same set of
key->CallerAccess mappings.

Note that that table is global rather than being per-net_ns.  If the keys
in a permit list cross net_ns boundaries, there is no problem sharing the
cached permits, since the permits are just integer masks.

Since permit lists pin keys, the permit cache also makes it easier for a
future patch to find all occurrences of a key and remove them by means of
setting the afs_permits::invalidated flag and then clearing the appropriate
key pointer.  In such an event, memory barriers will need adding.

Lastly, the permit caching is skipped if the server has sent either a
vnode-specific or an entire-server callback since the start of the
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells c435ee3455 afs: Overhaul the callback handling
Overhaul the AFS callback handling by the following means:

 (1) Don't give up callback promises on vnodes that we are no longer using,
     rather let them just expire on the server or let the server break
     them.  This is actually more efficient for the server as the callback
     lookup is expensive if there are lots of extant callbacks.

 (2) Only give up the callback promises we have from a server when the
     server record is destroyed.  Then we can just give up *all* the
     callback promises on it in one go.

 (3) Servers can end up being shared between cells if cells are aliased, so
     don't add all the vnodes being backed by a particular server into a
     big FID-indexed tree on that server as there may be duplicates.

     Instead have each volume instance (~= superblock) register an interest
     in a server as it starts to make use of it and use this to allow the
     processor for callbacks from the server to find the superblock and
     thence the inode corresponding to the FID being broken by means of
     ilookup_nowait().

 (4) Rather than iterating over the entire callback list when a mass-break
     comes in from the server, maintain a counter of mass-breaks in
     afs_server (cb_seq) and make afs_validate() check it against the copy
     in afs_vnode.

     It would be nice not to have to take a read_lock whilst doing this,
     but that's tricky without using RCU.

 (5) Save a ref on the fileserver we're using for a call in the afs_call
     struct so that we can access its cb_s_break during call decoding.

 (6) Write-lock around callback and status storage in a vnode and read-lock
     around getattr so that we don't see the status mid-update.

This has the following consequences:

 (1) Data invalidation isn't seen until someone calls afs_validate() on a
     vnode.  Unfortunately, we need to use a key to query the server, but
     getting one from a background thread is tricky without caching loads
     of keys all over the place.

 (2) Mass invalidation isn't seen until someone calls afs_validate().

 (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
     Could this be replaced with rcu_read_lock() since inodes are destroyed
     under RCU conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells d0676a1678 afs: Rename struct afs_call server member to cm_server
Rename the server member of struct afs_call to cm_server as we're only
going to be using it for incoming calls for the Cache Manager service.
This makes it easier to differentiate from the pointer to the target server
for the client, which will point to a different structure to allow for
callback handling.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells 03dc2cfca5 afs: Fix the afs_uuid struct to make the char-sized fields signed
In AFS's encoding of a UUID, the eight 'char' fields are all signed, so
represent them with __s8 rather than __u8.  This makes the compiler
sign-extend them correctly when XDR-encoding them.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells f4b3526d83 afs: Connect up the CB.ProbeUuid
The handler for the CB.ProbeUuid operation in the cache manager is
implemented, but isn't listed in the switch-statement of operation
selection, so won't be used.  Fix this by adding it.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells 33cd7f2b76 afs: Potentially return call->reply[0] from afs_make_call()
If call->ret_reply0 is set, return call->reply[0] on success.  Change the
return type of afs_make_call() to long so that this can be passed back
without bit loss and then cast to a pointer if required.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 97e3043ad8 afs: Condense afs_call's reply{,2,3,4} into an array
Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an
array.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells f780c8ea0e afs: Consolidate abort_to_error translators
The AFS abort code space is shared across all services, so there's no need
for separate abort_to_error translators for each service.

Consolidate them into a single function and remove the function pointers
for them.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 3838d3ecde afs: Allow IPv6 address specification of VL servers
Allow VL server specifications to be given IPv6 addresses as well as IPv4
addresses, for example as:

	echo add foo.org 1111:2222:3333:0:4444:5555:6666:7777 >/proc/fs/afs/cells

Note that ':' is the expected separator for separating IPv4 addresses, but
if a ',' is detected or no '.' is detected in the string, the delimiter is
switched to ','.

This also works with DNS AFSDB or SRV record strings fetched by upcall from
userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 4d9df9868f afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr
Keep and pass sockaddr_rxrpc addresses around rather than keeping and
passing in_addr addresses to allow for the use of IPv6 and non-standard
port numbers in future.

This also allows the port and service_id fields to be removed from the
afs_call struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells ad6a942a9e afs: Update the cache index structure
Update the cache index structure in the following ways:

 (1) Don't use the volume name followed by the volume type as levels in the
     cache index.  Volumes can be renamed.  Use the volume ID instead.

 (2) Don't store the VLDB data for a volume in the tree.  If the volume
     database should be cached locally, then it should be done in a separate
     tree.

 (3) Expand the volume ID stored in the cache to 64 bits.

 (4) Expand the file/vnode ID stored in the cache to 96 bits.

 (5) Increment the cache structure version number to 1.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 91a90380ef afs: Add some protocol defs
Add some protocol definitions, including max field lengths, flag defs, an
XDR-encoded UUID def, more VL operation IDs and more fileserver abort
codes.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 9ed900b116 afs: Push the net ns pointer to more places
Push the network namespace pointer to more places in AFS, including the
afs_server structure (which doesn't hold a ref on the netns).

In particular, afs_put_cell() now takes requires a net ns parameter so that
it can safely alter the netns after decrementing the cell usage count - the
cell will be deallocated by a background thread after being cached for a
period, which means that it's not safe to access it after reducing its
usage count.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells 49566f6f06 afs: Note the cell in the superblock info also
Keep a reference to the cell in the superblock info structure in addition
to the volume and net pointers.  This will make it easier to clean up in a
future patch in which afs_put_volume() will need the cell pointer.

Whilst we're at it, make the cell and volume getting functions return a
pointer to the object got to make the call sites look neater.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells 59fa1c4a9f afs: Fix server reaping
Fix server reaping and make sure it's all done before we start trying to
purge cells, given that servers currently pin cells.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells e3b2ffe0f0 afs: Close the rxrpc socket only after purging the servers
Close the rxrpc socket only after we've purged the server records (and also
cell and volume records which might refer to servers) so that we can give
up the callbacks on each server.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells f044c8847b afs: Lay the groundwork for supporting network namespaces
Lay the groundwork for supporting network namespaces (netns) to the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.

The following changes have been made:

 (1) Store the netns in the superblock info.  This will be obtained from
     the mounter's nsproxy on a manual mount and inherited from the parent
     superblock on an automount.

 (2) The cell list is made per-netns.  It can be viewed through
     /proc/net/afs/cells and also be modified by writing commands to that
     file.

 (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
     This is unset by default.

 (4) The 'rootcell' module parameter, which sets a cell and VL server list
     modifies the init net namespace, thereby allowing an AFS root fs to be
     theoretically used.

 (5) The volume location lists and the file lock manager are made
     per-netns.

 (6) The AF_RXRPC socket and associated I/O bits are made per-ns.

The various workqueues remain global for the moment.

Changes still to be made:

 (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
     from the old name.

 (2) A per-netns subsys needs to be registered for AFS into which it can
     store its per-netns data.

 (3) Rather than the AF_RXRPC socket being opened on module init, it needs
     to be opened on the creation of a superblock in that netns.

 (4) The socket needs to be closed when the last superblock using it is
     destroyed and all outstanding client calls on it have been completed.
     This prevents a reference loop on the namespace.

 (5) It is possible that several namespaces will want to use AFS, in which
     case each one will need its own UDP port.  These can either be set
     through /proc/net/afs/cm_port or the kernel can pick one at random.
     The init_ns gets 7001 by default.

Other issues that need resolving:

 (1) The DNS keyring needs net-namespacing.

 (2) Where do upcalls go (eg. DNS request-key upcall)?

 (3) Need something like open_socket_in_file_ns() syscall so that AFS
     command line tools attempting to operate on an AFS file/volume have
     their RPC calls go to the right place.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells 5e4def2038 Pass mode to wait_on_atomic_t() action funcs and provide default actions
Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int throughout.

Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.

Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.

[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>
2017-11-13 15:38:16 +00:00
David Howells 81445e63e6 Merge remote-tracking branch 'tip/timers/core' into afs-next
These AFS patches need the timer_reduce() patch from timers/core.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:36:33 +00:00
David S. Miller ede372dcae Merge branch 'net-improve-the-process-of-redirect-and-toobig-for-ipv6-tunnels'
Xin Long says:

====================
net: improve the process of redirect and toobig for ipv6 tunnels

Now let's say there are 3 kinds of icmp packets to process for tunnels,
toobig(needfrag), redirect, others, their process should be:

 - toobig(needfrag)
   update the lower dst's pmtu by route cache, also update sk dst's pmtu
   if possible, or it will be fine if sk dst pmtu will get updated on tx
   path.

 - redirect
   update the lower dst's gw by route cache and return, no need to send
   this redirect packet to user sk.

 - others
   send the packet to user's sk, or it will also be fine to use err_count
   to count it and report fail link on tx path.

All ipv4 tunnels basically follow this while some of ipv6 tunnels are
doing in different ways, like ip6gre and ip6_tunnels update tnl dev's
mtu instead of updating lower dst pmtu, no redirect process on their
err_handlers, which doesn't make any sense and even causes performance
problems.

This patchset is to improve the process of redirect and toobig for ip6gre
ip4ip6, ip6ip6 tunnels, as in ipv4 tunnels.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:06 +09:00
Xin Long 77552cfa39 ip6_tunnel: clean up ip4ip6 and ip6ip6's err_handlers
This patch is to remove some useless codes of redirect and fix some
indents on ip4ip6 and ip6ip6's err_handlers.

Note that redirect icmp packet is already processed in ip6_tnl_err,
the old redirect codes in ip4ip6_err actually never worked even
before this patch. Besides, there's no need to send redirect to
user's sk, it's for lower dst, so just remove it in this patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:05 +09:00
Xin Long b00f543240 ip6_tunnel: process toobig in a better way
The same improvement in "ip6_gre: process toobig in a better way"
is needed by ip4ip6 and ip6ip6 as well.

Note that ip4ip6 and ip6ip6 will also update sk dst pmtu in their
err_handlers. Like I said before, gre6 could not do this as it's
inner proto is not certain. But for all of them, sk dst pmtu will
be updated in tx path if in need.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:05 +09:00
Xin Long 383c1f8875 ip6_tunnel: add the process for redirect in ip6_tnl_err
The same process for redirect in "ip6_gre: add the process for redirect
in ip6gre_err" is needed by ip4ip6 and ip6ip6 as well.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:05 +09:00
Xin Long fe1a4ca0a2 ip6_gre: process toobig in a better way
Now ip6gre processes toobig icmp packet by setting gre dev's mtu in
ip6gre_err, which would cause few things not good:

  - It couldn't set mtu with dev_set_mtu due to it's not in user context,
    which causes route cache and idev->cnf.mtu6 not to be updated.

  - It has to update sk dst pmtu in tx path according to gredev->mtu for
    ip6gre, while it updates pmtu again according to lower dst pmtu in
    ip6_tnl_xmit.

  - To change dev->mtu by toobig icmp packet is not a good idea, it should
    only work on pmtu.

This patch is to process toobig by updating the lower dst's pmtu, as later
sk dst pmtu will be updated in ip6_tnl_xmit, the same way as in ip4gre.

Note that gre dev's mtu will not be updated any more, it doesn't make any
sense to change dev's mtu after receiving a toobig packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:05 +09:00
Xin Long 929fc03275 ip6_gre: add the process for redirect in ip6gre_err
This patch is to add redirect icmp packet process for ip6gre by
calling ip6_redirect() in ip6gre_err(), as in vti6_err.

Prior to this patch, there's even no route cache generated after
receiving redirect.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:44:05 +09:00
Zhu Yanjun 0d728b844c forcedeth: remove redudant assignments in xmit
In xmit process, the variables are set many times. In fact,
it is enough for these variables to be set once.
After a long time test, the throughput performance is better
than before.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:40:19 +09:00
David S. Miller 6afce19623 NFC 4.15 pull request
This is the NFC pull request for 4.15. We have:
 
 - A new netlink command for explicitly deactivating NFC targets
 - i2c constification for all NFC drivers
 - One NFC device allocation error path fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaBjUrAAoJEIqAPN1PVmxK6iYP/iAbkuRGwBYsbKaIxJQiJDKi
 i5z0VUHyaMXCcFA9tl2d5pR0Zj6jv+9uJa/9iIX3+EvCasO1zt3s77eBSM9p4TJe
 WcDVwMEmBa40XHwBvQK/LGlAwSXo5QCw0tUgUz2KSiybBB6KWnzR4grfyDew+lw+
 PPv82d5h8jdz+cPt4leKG1f5DpfZbCVAju0VKEgYMY+0lBtyaDoN3Z/FR6p8Mi7t
 nc0miDXskw9UsSxXBF5J2OsvZOHCY5ToFFIiYlKanvrWydAOlXxlcrl71QlN+naa
 0pdGUtmn2Bkap9+CiW8Ap/zDUyUOZ/C/Mv+aHQdZG87kf3YaXsfUYjPaoDozrbMW
 InUJCLjy9labHMuwTaWb1SDs+Adliu4W9H5bpDZsWY5mmuJTumFM7SaXHSoCN2FG
 +cs5idFM6ugT5UxAh4xOwpHmvUavIvV/A/bfUNksKJhJMWlwWCHuTRdZi5Pv1mDb
 mw2tRQMRoCtszZgI0XEkVwFd6DoFZvkMYH35H1DVIq7fajcQJo8nU2eMv4PW4mkB
 8bp0b+nnOe+QegCur7DVBol297S8j6s0L5wbMurB5lBT3/WpibT6+iCGtBpZ943W
 5nmEdZZDiUAs6hX+Pw8KX42vBVgce6xzx6fJi+2J55SAfcGNuGotS25B9IYLkjda
 /atRXwmeHX2kNWYzisqD
 =Oql5
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.15 pull request

This is the NFC pull request for 4.15. We have:

- A new netlink command for explicitly deactivating NFC targets
- i2c constification for all NFC drivers
- One NFC device allocation error path fix
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:39:12 +09:00
David S. Miller fd9080a39b Merge branch 'Openvswitch-meter-action'
Andy Zhou says:

====================
Openvswitch meter action

This patch series is the first attempt to add openvswitch
meter support. We have previously experimented with adding
metering support in nftables. However 1) It was not clear
how to expose a named nftables object cleanly, and 2)
the logic that implements metering is quite small, < 100 lines
of code.

With those two observations, it seems cleaner to add meter
support in the openvswitch module directly.

---

    v1(RFC)->v2:  remove unused code improve locking
		  and other review comments
    v2 -> v3:     rebase
    v3 -> v4:     fix undefined "__udivdi3" references on 32 bit builds.
                  use div_u64() instead.
    v4 -> v5:     rebase
====================

Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:37:08 +09:00
Andy Zhou cd8a6c3369 openvswitch: Add meter action support
Implements OVS kernel meter action support.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:37:07 +09:00
Andy Zhou 96fbc13d7e openvswitch: Add meter infrastructure
OVS kernel datapath so far does not support Openflow meter action.
This is the first stab at adding kernel datapath meter support.
This implementation supports only drop band type.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:37:07 +09:00
Andy Zhou 9602c01e57 openvswitch: export get_dp() API.
Later patches will invoke get_dp() outside of datapath.c. Export it.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:37:07 +09:00
Andy Zhou 5794040647 openvswitch: Add meter netlink definitions
Meter has its own netlink family. Define netlink messages and attributes
for communicating with the user space programs.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:37:07 +09:00
David S. Miller aef1e0d5dd Merge branch 'dsa-b53-Support-prepended-Broadcom-tags'
Florian Fainelli says:

====================
net: dsa: b53: Support prepended Broadcom tags

This patch series adds support for prepended 4-bytes Broadcom tags that we
already support. This type of tag will typically be used when interfaced to
a SoC like BCM58xx (NorthStar Plus) which supports a Flow Accelerator (WIP).
In that case, we need to support a slightly different tagging format.

The first patch does a bit of re-factoring and passes a port index to
the get_tag_protocol() function since at least two different drivers need
that type of information (mt7530, b53) to support tagging or not.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:34:55 +09:00
Florian Fainelli 1160603960 net: dsa: b53: Support prepended Broadcom tags
On BCM58xx devices (Northstar Plus), there is an accelerator attached to
port 8 which would only work if we use prepended Broadcom tags. Resolve
that difference in our get_tag_protocol() function by setting the
appropriate tagging protocol in that case. We need to change
b53_brcm_hdr_setup() a little bit now since we can deal with two types
of Broadcom tags.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:34:54 +09:00
Florian Fainelli b74b70c449 net: dsa: Support prepended Broadcom tag
Add a new type: DSA_TAG_PROTO_PREPEND which allows us to support for the
4-bytes Broadcom tag that we already support, but in a format where it
is pre-pended to the packet instead of located between the MAC SA and
the Ethertyper (DSA_TAG_PROTO_BRCM).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:34:54 +09:00
Florian Fainelli f7c39e3d1e net: dsa: tag_brcm: Prepare for supporting prepended tag
In preparation for supporting the same Broadcom tag format, but instead
of inserted between the MAC SA and EtherType, prepended to the Ethernet
frame, restructure the code a little bit to make that possible and take
an offset parameter.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:34:54 +09:00
Florian Fainelli 5ed4e3eb02 net: dsa: Pass a port to get_tag_protocol()
A number of drivers want to check whether the configured CPU port is a
possible configuration for enabling tagging, pass down the CPU port
number so they verify that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:34:54 +09:00
Andrew Morton ee9d3429c0 net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue
gcc-4.4.4 (at lest) has issues with initializers and anonymous unions:

net/sched/sch_red.c: In function 'red_dump_offload':
net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c: In function 'red_dump_stats':
net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer
net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast

Work around this.

Fixes: 602f3baf22 ("net_sch: red: Add offload ability to RED qdisc")
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:33:07 +09:00
Slava Shwartsman a1b8714593 net/mlx4: Use Kconfig flag to remove support of old gen2 Mellanox devices
Since Mellanox focus is on newer adapters, we would like to have the
ability to disable the support for old gen2 adapters.

This can be done by turning off the MLX4_CORE_GEN2 Kconfig flag.
We keep it turned on by default.

Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:27:51 +09:00
Jason A. Donenfeld 0642840b8b af_netlink: ensure that NLMSG_DONE never fails in dumps
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

  nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.

This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:17:13 +09:00
David S. Miller 907a4425f7 Merge branch 'netem-add-nsec-scheduling-and-slot-feature'
Dave Taht says:

====================
netem: add nsec scheduling and slot feature

This patch series converts netem away from the old "ticks" interface and
userspace API, and adds support for a new "slot" feature intended to
emulate bursty macs such as WiFi and LTE better.

Changes since v2:
Use u64 for packet_len_sched_time()
Use simpler max(time_to_send,q->slot.slot_next)

Changes since v1:
Always pass new nanosecond APIs to userspace
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Dave Taht 836af83b54 netem: support delivering packets in delayed time slots
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
         slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
         limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
         slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Dave Taht 99803171ef netem: add uapi to express delay and jitter in nanoseconds
netem userspace has long relied on a horrible /proc/net/psched hack
to translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead, in well defined nanoseconds,
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition where reducing a tick to nsec
equivalence would constrain the max delay in prior versions of
netem to only 4.3 seconds.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Dave Taht 112f9cb656 netem: convert to qdisc_watchdog_schedule_ns
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Francesco Ruggeri 338d182fa5 ipv6: try not to take rtnl_lock in ip6mr_sk_done
Avoid traversing the list of mr6_tables (which requires the
rtnl_lock) in ip6mr_sk_done(), when we know in advance that
a match will not be found.
This can happen when rawv6_close()/ip6mr_sk_done() is invoked
on non-mroute6 sockets.
This patch helps reduce rtnl_lock contention when destroying
a large number of net namespaces, each having a non-mroute6
raw socket.

v2: same patch, only fixed subject line and expanded comment.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:13:04 +09:00
Colin Ian King 07842561a8 net: realtek: r8169: remove redundant assignment to giga_ctrl
The variable giga_ctrl is being assigned to zero however this is
never read and hence the assignment is redundant, so remove it.
Cleans up clang warning:

drivers/net/ethernet/realtek/r8169.c:1978:3: warning: Value stored
to 'giga_ctrl' is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:02:33 +09:00
Egil Hjelmeland 4b33709d89 net: dsa: lan9303: Documentation: Add missing word "Mbps"
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 09:59:29 +09:00
Egil Hjelmeland 30482e4e28 net: dsa: lan9303: Fix lan9303_alr_del_port()
Fix embarrassing bug in lan9303_alr_del_port(): Instead of zeroing
entr->mac_addr, I destroyed the next cache entry. Affected .port_fdb_del and
.port_mdb_del.

Fixes: 0620427ea0 ("net: dsa: lan9303: Add fdb/mdb manipulation")
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 09:59:07 +09:00
David Howells b24591e2fc timers: Add a function to start/reduce a timer
Add a function, similar to mod_timer(), that will start a timer if it isn't
running and will modify it if it is running and has an expiry time longer
than the new time.  If the timer is running with an expiry time that's the
same or sooner, no change is made.

The function looks like:

	int timer_reduce(struct timer_list *timer, unsigned long expires);

This can be used by code such as networking code to make it easier to share
a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
the rxrpc_call struct will maintain a number of timeouts:

	unsigned long	ack_at;
	unsigned long	resend_at;
	unsigned long	ping_at;
	unsigned long	expect_rx_by;
	unsigned long	expect_req_by;
	unsigned long	expect_term_by;

each of which is set independently of the others.  With timer reduction
available, when the code needs to set one of the timeouts, it only needs to
look at that timeout and then call timer_reduce() to modify the timer,
starting it or bringing it forward if necessary.  There is no need to refer
to the other timeouts to see which is earliest and no need to take any lock
other than, potentially, the timer lock inside timer_reduce().

Note, that this does not protect against concurrent invocations of any of
the timer functions.

As an example, the expect_rx_by timeout above, which terminates a call if
we don't get a packet from the server within a certain time window, would
be set something like this:

	unsigned long now = jiffies;
	unsigned long expect_rx_by = now + packet_receive_timeout;
	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
	timer_reduce(&call->timer, expect_rx_by);

The timer service code (which might, say, be in a work function) would then
check all the timeouts to see which, if any, had triggered, deal with
those:

	t = READ_ONCE(call->ack_at);
	if (time_after_eq(now, t)) {
		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
		set_bit(RXRPC_CALL_EV_ACK, &call->events);
	}

and then restart the timer if necessary by finding the soonest timeout that
hasn't yet passed and then calling timer_reduce().

The disadvantage of doing things this way rather than comparing the timers
each time and calling mod_timer() is that you *will* take timer events
unless you can finish what you're doing and delete the timer in time.

The advantage of doing things this way is that you don't need to use a lock
to work out when the next timer should be set, other than the timer's own
lock - which you might not have to take.

[ tglx: Fixed weird formatting and adopted it to pending changes ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keyrings@vger.kernel.org
Cc: linux-afs@lists.infradead.org
Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
2017-11-12 15:10:27 +01:00
Arnd Bergmann df27067e60 pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
__getnstimeofday() is a rather odd interface, with a number of quirks:

- The caller may come from NMI context, but the implementation is not NMI safe,
  one way to get there from NMI is

      NMI handler:
        something bad
          panic()
            kmsg_dump()
              pstore_dump()
                 pstore_record_init()
                   __getnstimeofday()

- The calling conventions are different from any other timekeeping functions,
  to deal with returning an error code during suspended timekeeping.

Address the above issues by using a completely different method to get the
time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
when timekeeping is suspended: it returns the time at which it got
suspended. As Thomas Gleixner explained, this is safe, as
ktime_get_real_fast_ns() does not call into the clocksource driver that
might be suspended.

The result can easily be transformed into a timespec structure. Since
ktime_get_real_fast_ns() was not exported to modules, add the export.

The pstore behavior for the suspended case changes slightly, as it now
stores the timestamp at which timekeeping was suspended instead of storing
a zero timestamp.

This change is not addressing y2038-safety, that's subject to a more
complex follow up patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Colin Cross <ccross@android.com>
Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
2017-11-12 15:05:52 +01:00