Commit Graph

42883 Commits

Author SHA1 Message Date
Paul Gortmaker d3fc0353f7 ipv4: af_inet: make it explicitly non-modular
The Makefile controlling compilation of this file is obj-y,
meaning that it currently is never being built as a module.

Since MODULE_ALIAS is a no-op for non-modular code, we can simply
remove the MODULE_ALIAS_NETPROTO variant used here.

We replace module.h with kmod.h since the file does make use of
request_module() in order to load other modules from here.

We don't have to worry about init.h coming in via the removed
module.h since the file explicitly includes init.h already.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:44:26 -07:00
Jon Paul Maloy 1fc07f3e15 tipc: reset all unicast links when broadcast send link fails
In test situations with many nodes and a heavily stressed system, we have
observed that the broadcast transmission link may fail due to an
excessive number of retransmissions of the same packet. In such
situations we need to reset all unicast links to all peers in order to
reset and re-synchronize the broadcast link.

In this commit, we add a new function tipc_bearer_reset_all() to be used
in such situations. The function scans across all bearers and resets all
links belonging to them.
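A minimal userspace sketch of the "scan all bearers, reset every link" loop described above. The structures, the MAX_* sizes and the model_* functions are hypothetical simplifications, not TIPC's real bearer or link types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: real TIPC bearers and links
 * carry far more state than this. */
#define MAX_BEARERS 3
#define MAX_LINKS   4

struct model_link {
	bool in_use;
	bool up;
	unsigned int resets;
};

struct model_bearer {
	bool enabled;
	struct model_link links[MAX_LINKS];
};

/* Reset one unicast link: take it down and count the reset. */
static void model_link_reset(struct model_link *l)
{
	l->up = false;
	l->resets++;
}

/* Scan across all bearers and reset every link that belongs to them. */
static void model_bearer_reset_all(struct model_bearer *bearers, size_t n)
{
	for (size_t b = 0; b < n; b++) {
		if (!bearers[b].enabled)
			continue;
		for (size_t i = 0; i < MAX_LINKS; i++)
			if (bearers[b].links[i].in_use)
				model_link_reset(&bearers[b].links[i]);
	}
}
```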

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
Jon Paul Maloy a71eb72035 tipc: ensure correct broadcast send buffer release when peer is lost
After a new receiver peer has been added to the broadcast transmission
link, we allow immediate transmission of new broadcast packets, trusting
that the new peer will not accept the packets until it has received the
previously sent unicast broadcast initialization message. In the same
way, the sender must not accept any acknowledgments until it has itself
received the broadcast initialization from the peer, as well as
confirmation of the reception of its own initialization message.

Furthermore, when a receiver peer goes down, the sender has to produce
the missing acknowledgments from the lost peer locally, in order to ensure
correct release of the buffers that were expected to be acknowledged by
the said peer.

In a highly stressed system we have observed that contact with a peer
may come up and be lost before the above-mentioned broadcast
initialization and confirmation have been received. This leads to the
locally produced acknowledgments being rejected, and the unacknowledged
buffers lingering in the broadcast link transmission queue until it
fills up and the link goes into permanent congestion.

In this commit, we remedy this by temporarily setting the corresponding
broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
state to true before we issue the local acknowledgments. This ensures that
those acknowledgments will always be accepted. The mentioned state values
are restored immediately afterwards when the link is reset.
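The ordering of this fix can be modeled in a few lines. In this hypothetical sketch, acknowledgments are accepted only when the receive state is ESTABLISHED and bc_peer_is_up is true; the peer-lost path forces both, issues the local acknowledgment, and then resets. All names are illustrative, not the real tipc_link fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the real state lives in TIPC's link structures. */
enum model_state { MODEL_RESET, MODEL_ESTABLISHED };

struct model_bclink {
	enum model_state rcv_state;   /* broadcast receive link state */
	bool bc_peer_is_up;
	unsigned int acked;           /* highest acknowledged packet */
};

/* Acknowledgments are only accepted in the fully initialized state. */
static bool model_accept_ack(struct model_bclink *l, unsigned int ack)
{
	if (l->rcv_state != MODEL_ESTABLISHED || !l->bc_peer_is_up)
		return false;
	if (ack > l->acked)
		l->acked = ack;
	return true;
}

/* Peer lost: force the state so the locally produced acknowledgment is
 * accepted, then reset the link, which restores the state values. */
static void model_peer_lost(struct model_bclink *l, unsigned int last_sent)
{
	l->rcv_state = MODEL_ESTABLISHED;
	l->bc_peer_is_up = true;
	model_accept_ack(l, last_sent);
	l->rcv_state = MODEL_RESET;
	l->bc_peer_is_up = false;
}
```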

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
Jon Paul Maloy 2d18ac4ba7 tipc: extend broadcast link initialization criteria
At first contact between two nodes, an endpoint might sometimes have
time to send out a LINK_PROTOCOL/STATE packet before it has received
the broadcast initialization packet from the peer, i.e., before it has
received a valid broadcast packet number to add to the 'bc_ack' field
of the protocol message.

This means that the peer endpoint will receive a protocol packet with an
invalid broadcast acknowledgment value of 0. Under unlucky circumstances
this may lead to the original, already received acknowledgment value being
overwritten, so that the whole broadcast link goes stale after a while.

We fix this by delaying the setting of the link field 'bc_peer_is_up'
until we know that the peer really has received our own broadcast
initialization message. The latter is always sent out as the first
unicast message on a link, and always with sequence number 1. Because
of this, we only need to look for a non-zero unicast acknowledgment value
in the arriving STATE messages, and once that is confirmed we know we
are safe and can set the mentioned field. Before this moment, we must
ignore all broadcast acknowledgments from the peer.
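A hypothetical sketch of the receive-side rule: because the broadcast initialization message is always unicast sequence number 1, any non-zero unicast acknowledgment proves the peer received it, and broadcast acknowledgments are ignored until then. The struct and function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a link receiving LINK_PROTOCOL/STATE messages. */
struct model_link {
	bool bc_peer_is_up;
	unsigned int bc_acked;   /* last accepted broadcast acknowledgment */
};

struct model_state_msg {
	unsigned int ack;        /* unicast acknowledgment */
	unsigned int bc_ack;     /* broadcast acknowledgment */
};

static void model_rcv_state(struct model_link *l,
			    const struct model_state_msg *m)
{
	/* Our broadcast init is unicast seqno 1, so any non-zero unicast
	 * ack means the peer has received it. */
	if (!l->bc_peer_is_up && m->ack != 0)
		l->bc_peer_is_up = true;

	/* Until then, ignore broadcast acks (they may be an invalid 0). */
	if (!l->bc_peer_is_up)
		return;
	l->bc_acked = m->bc_ack;
}
```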

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
Soheil Hassas Yeganeh 779f1edec6 sock: ignore SCM_RIGHTS and SCM_CREDENTIALS in __sock_cmsg_send
Sergei Trofimovich reported that PulseAudio sends SCM_CREDENTIALS
as a control message on TCP sockets. Since __sock_cmsg_send does not
support SCM_RIGHTS and SCM_CREDENTIALS, it returns an error and
hence breaks PulseAudio over TCP.

SCM_RIGHTS and SCM_CREDENTIALS are sent on the SOL_SOCKET layer,
but they semantically belong to SOL_UNIX. Since all
cmsg-processing functions, including sock_cmsg_send, ignore control
messages of other layers, it is best to ignore SCM_RIGHTS
and SCM_CREDENTIALS for consistency (and also to fix PulseAudio
over TCP).
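A simplified sketch of the dispatch described above: the two SOL_UNIX-style types fall through to "ignore" instead of the error path. The enum values are hypothetical stand-ins for the real cmsg types (handled types such as SO_TIMESTAMPING are represented by MODEL_SO_*), and the function is not the kernel's __sock_cmsg_send.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for SOL_SOCKET control message types. */
enum model_cmsg_type {
	MODEL_SO_MARK,
	MODEL_SO_TIMESTAMPING,
	MODEL_SCM_RIGHTS,
	MODEL_SCM_CREDENTIALS,
	MODEL_OTHER,
};

/* Returns 0 if the message is handled or ignored, -EINVAL if rejected. */
static int model_sock_cmsg_send(enum model_cmsg_type type)
{
	switch (type) {
	case MODEL_SO_MARK:
	case MODEL_SO_TIMESTAMPING:
		return 0;        /* processed at this layer */
	case MODEL_SCM_RIGHTS:
	case MODEL_SCM_CREDENTIALS:
		return 0;        /* semantically SOL_UNIX: ignore, don't fail */
	default:
		return -EINVAL;
	}
}
```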

Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 14:32:44 -07:00
Julian Anastasov 80610229ef ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space
Vegard Nossum reported a crash in fib_dump_info
when nh_dev is NULL and fib_nhs == 1:

Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
RIP: 0033:[<00000000602b3d18>]
RSP: 0000000062623890  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
Stack:
 626238f0 960226a02 00000400 000000fe
 62623910 600afca7 62623970 62623a48
 62468038 00000018 00000000 00000000
Call Trace:
 [<602b3e93>] rtmsg_fib+0xd3/0x190
 [<602b6680>] fib_table_insert+0x260/0x500
 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
 [<60267079>] netlink_rcv_skb+0xc9/0xe0
 [<60250d4b>] rtnetlink_rcv+0x3b/0x50
 [<60265400>] netlink_unicast+0x1a0/0x2c0
 [<60265e47>] netlink_sendmsg+0x3f7/0x470
 [<6021dc9a>] sock_sendmsg+0x3a/0x90
 [<6021e0d0>] ___sys_sendmsg+0x300/0x360
 [<6021fa64>] __sys_sendmsg+0x54/0xa0
 [<6021fac0>] SyS_sendmsg+0x10/0x20
 [<6001ea68>] handle_syscall+0x88/0x90
 [<600295fd>] userspace+0x3fd/0x500
 [<6001ac55>] fork_handler+0x85/0x90

$ addr2line -e vmlinux -i 0x602b3d18
include/linux/inetdevice.h:222
net/ipv4/fib_semantics.c:1264

The problem happens when RTNH_F_LINKDOWN is provided from user space
when creating routes that do not use the flag; it was caught with a
netlink fuzzer.

Currently, the kernel allows user space to set both flags in nh_flags
and fib_flags, but this is not intentional: the assumption was that
they are never set. Fix this by rejecting both flags with EINVAL.
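A small sketch of the validation described above. The flag values mirror RTNH_F_DEAD (1) and RTNH_F_LINKDOWN (16) from <linux/rtnetlink.h>, redefined under MODEL_ names so the example builds without kernel headers; model_check_nh_flags() is a hypothetical helper, not the function the patch touches.

```c
#include <assert.h>
#include <errno.h>

/* Same values as RTNH_F_DEAD and RTNH_F_LINKDOWN in the uapi header. */
#define MODEL_RTNH_F_DEAD      1u
#define MODEL_RTNH_F_LINKDOWN 16u

/* Reject kernel-internal nexthop flags when supplied by user space. */
static int model_check_nh_flags(unsigned int flags)
{
	if (flags & (MODEL_RTNH_F_DEAD | MODEL_RTNH_F_LINKDOWN))
		return -EINVAL;
	return 0;
}
```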

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Fixes: 0eeb075fad ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:41:09 -07:00
Eric Dumazet 75ff39ccc1 tcp: make challenge acks less predictable
Yue Cao claims that the current host rate limiting of challenge ACKs
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.

This patch increases the default limit from 100 to 1000 and adds
some randomization, so that an attacker can no longer hijack
sessions without spending a considerable number of probes.

Based on initial analysis and patch from Linus.

Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.

v2: randomize the count of challenge acks per second, not the period.
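A minimal model of "randomize the count per second, not the period": when a new second starts, the budget is set to about half the limit plus a random value below the limit, so it lands in [limit/2, 1.5 * limit). The exact formula here is an assumption for illustration, and the caller passes `rnd`, assumed uniformly random in [0, limit), to keep the sketch deterministic and testable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-wide challenge ACK budget. */
struct model_challenge {
	uint32_t limit;    /* e.g. 1000, the new default */
	uint64_t second;   /* second the current budget applies to */
	uint32_t count;    /* challenge ACKs left in this second */
};

/* rnd: caller-supplied random value, assumed to lie in [0, limit). */
static bool model_allow_challenge_ack(struct model_challenge *c,
				      uint64_t now_sec, uint32_t rnd)
{
	if (now_sec != c->second) {
		/* New second: randomize the count, keep the period fixed. */
		uint32_t half = (c->limit + 1) >> 1;
		c->second = now_sec;
		c->count = half + rnd;
	}
	if (c->count == 0)
		return false;
	c->count--;
	return true;
}
```

With limit 1000 the per-second budget varies between 500 and 1499, so an observer can no longer infer other connections' activity from a fixed count.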

Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
Reported-by: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:33:35 -07:00
Simon Horman aa9667e7f6 tunnels: correct conditional build of MPLS and IPv6
Using a combination of #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.

This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.
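The "unregister directly on error" pattern can be sketched in userspace. The MODEL_HAVE_* switches stand in for the IPv6 and MPLS build options, and model_add_protocol()/model_del_protocol() are hypothetical stand-ins for the kernel's protocol registration calls, with a failure-injection knob added for testing. Each failure branch undoes exactly what was registered before it, so no label can end up unused and no unwind step can be compiled out.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the IPv6 / MPLS build-time options. */
#define MODEL_HAVE_IPV6 1
#define MODEL_HAVE_MPLS 1

enum { P_IPIP, P_IPV6, P_MPLS, P_COUNT };

static bool model_registered[P_COUNT];
static int model_fail_on = -1;   /* protocol whose registration fails */

static int model_add_protocol(int p)
{
	if (p == model_fail_on)
		return -1;
	model_registered[p] = true;
	return 0;
}

static void model_del_protocol(int p)
{
	model_registered[p] = false;
}

static int model_tunnel4_init(void)
{
	if (model_add_protocol(P_IPIP))
		return -1;
#if MODEL_HAVE_IPV6
	if (model_add_protocol(P_IPV6)) {
		model_del_protocol(P_IPIP);
		return -1;
	}
#endif
#if MODEL_HAVE_MPLS
	if (model_add_protocol(P_MPLS)) {
		model_del_protocol(P_IPIP);
#if MODEL_HAVE_IPV6
		model_del_protocol(P_IPV6);
#endif
		return -1;
	}
#endif
	return 0;
}
```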

Fixes: 8afe97e5d4 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:27:06 -07:00
Xin Long 8dbdf1f5b0 sctp: implement prsctp PRIO policy
The prsctp PRIO policy abandons lower-priority chunks when the asoc
doesn't have enough send buffer space, so that the current chunk with
higher priority can be queued successfully.

As with the TTL/RTX policies, we set the chunk's prsctp_param to its
priority, taken from sinfo->sinfo_timetolive in sctp_set_prsctp_policy().
So if the PRIO policy is enabled, msg->expires_at won't work.

asoc->sent_cnt_removable records how many chunks are candidates for
removal. If the priority policy is enabled, we increase
sent_cnt_removable when a chunk is queued into the out_queue, and
decrease it when the chunk is moved to the abandon_queue or dequeued
and freed.

In sctp_sendmsg, we check whether there is enough send buffer space for
the current msg and whether sent_cnt_removable is non-zero. If space is
short and removable chunks exist, sctp_prune_prsctp tries to abandon
chunks from the retransmit/transmitted queue and to free chunks from the
out_queue, in the right order, until the abandoned plus freed size
exceeds msg_len - sctp_wfree. Space from abandoned chunks is reclaimed
only later: we have to wait until a FORWARD TSN is sent, the SACK is
received, and the chunks are actually freed.
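A flattened sketch of the pruning loop: walk the queued chunks and drop PRIO-policy chunks whose priority is lower than the new message's until enough bytes are released. It assumes, for illustration, that a larger prsctp_param value means lower priority, and it ignores the split between abandoned and directly freed chunks. struct model_chunk and model_prune_prio() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chunk model; larger prsctp_param = lower priority here. */
struct model_chunk {
	unsigned int prsctp_param;
	size_t len;
	bool prio_policy;
	bool dropped;
};

/* Drop lower-priority PRIO chunks until at least 'needed' bytes are
 * released. Returns the number of bytes released. */
static size_t model_prune_prio(struct model_chunk *q, size_t n,
			       unsigned int new_prio, size_t needed)
{
	size_t freed = 0;

	for (size_t i = 0; i < n && freed < needed; i++) {
		if (!q[i].prio_policy || q[i].dropped ||
		    q[i].prsctp_param <= new_prio)
			continue;   /* not eligible, or not lower priority */
		q[i].dropped = true;
		freed += q[i].len;
	}
	return freed;
}
```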

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:39 -07:00
Xin Long 01aadb3af6 sctp: implement prsctp RTX policy
prsctp RTX policy is a policy to abandon chunks when they are
retransmitted beyond the max count.

This patch uses sent_count to count how many times a chunk has
been sent, and prsctp_param holds the max rtx count, which comes from
sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So, as with the
TTL policy, if the RTX policy is enabled, msg->expires_at won't
work.

Then, in sctp_chunk_abandoned, this patch abandons the chunk if
chunk->sent_count is greater than chunk->prsctp_param.
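The RTX check reduces to a single comparison. This hypothetical model keeps only the two fields the text mentions plus a policy flag; it is not the kernel's struct sctp_chunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical chunk model for the RTX policy. */
struct model_chunk {
	bool rtx_policy;
	unsigned int sent_count;     /* times this chunk has been sent */
	unsigned int prsctp_param;   /* max rtx count from sinfo_timetolive */
};

static bool model_chunk_abandoned(const struct model_chunk *c)
{
	return c->rtx_policy && c->sent_count > c->prsctp_param;
}
```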

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:39 -07:00
Xin Long a6c2f79287 sctp: implement prsctp TTL policy
The prsctp TTL policy abandons chunks when they expire at a specific
time in the local stack. It is similar to expires_at in struct
sctp_datamsg.

This patch uses sinfo->sinfo_timetolive to set the specific time for
TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at.
So if prsctp_enable or TTL policy is not enabled, msg->expires_at
still works as before.
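A sketch of the TTL check, assuming the expiry is stored as an absolute tick count in prsctp_param. The comparison follows the wraparound-safe style of the kernel's time_after() macro; the tick unit and the model_* names are assumptions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical chunk model for the TTL policy. */
struct model_chunk {
	bool ttl_policy;
	unsigned long prsctp_param;   /* absolute expiry time, in ticks */
};

/* Wraparound-safe "a is after b", in the style of time_after(). */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void model_set_ttl(struct model_chunk *c, unsigned long now,
			  unsigned long ttl)
{
	c->ttl_policy = true;
	c->prsctp_param = now + ttl;   /* may wrap; the comparison handles it */
}

static bool model_ttl_expired(const struct model_chunk *c, unsigned long now)
{
	return c->ttl_policy && model_time_after(now, c->prsctp_param);
}
```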

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:39 -07:00
Xin Long 826d253d57 sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt
This patch adds SCTP_PR_ASSOC_STATUS to the sctp sockopts. It is used
to dump the prsctp statistics from the asoc. The prsctp statistics
include abandoned_sent/unsent from the asoc. abandoned_sent is the
count of packets dropped from the retransmit/transmitted queue, and
abandoned_unsent is the count of packets dropped from the out_queue
according to the policy.

Note: another option for dumping prsctp statistics described in the RFC
is SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics
of each stream. However, Linux does not yet have per-stream statistics;
that requires RFC 6525 to be implemented. As per-stream prsctp
statistics have to be based on per-stream statistics, we will delay
this until RFC 6525 is implemented in Linux.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:39 -07:00
Xin Long f959fb442c sctp: add SCTP_DEFAULT_PRINFO into sctp sockopt
This patch adds SCTP_DEFAULT_PRINFO to the sctp sockopts. It is used
to set/get the default parameters of the SCTP Partially Reliable
policies, which include three policies (ttl, rtx, prio) and their values.

If policy params are set in sndinfo, those params are still applied to
the chunks instead of the default params.

In this patch, we use bits 5-8 of sp/asoc->default_flags
to store the prsctp policy, and reuse asoc->default_timetolive
to store its value. This means that if we enable and set a prsctp
policy, the prior ttl timeout in sctp no longer works.
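A sketch of packing the policy into bits 5-8 of a flags word while reusing the timetolive field for the policy value. The mask (0x00F0, i.e. bits 5-8 counting from 1) and the policy constants are illustrative assumptions derived from the description above, not copied from the uapi header. Setting a policy clears only those bits, so unrelated flags in the low bits survive.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encoding; real header values may differ. */
#define MODEL_PR_MASK  0x00F0u
#define MODEL_PR_NONE  0x0000u
#define MODEL_PR_TTL   0x0010u
#define MODEL_PR_RTX   0x0020u
#define MODEL_PR_PRIO  0x0030u

struct model_defaults {
	uint32_t default_flags;
	uint32_t default_timetolive;   /* TTL, max rtx count, or priority */
};

static void model_set_prinfo(struct model_defaults *d, uint32_t policy,
			     uint32_t value)
{
	/* Replace only the policy bits; keep every other flag. */
	d->default_flags = (d->default_flags & ~MODEL_PR_MASK) |
			   (policy & MODEL_PR_MASK);
	d->default_timetolive = value;
}

static uint32_t model_pr_policy(const struct model_defaults *d)
{
	return d->default_flags & MODEL_PR_MASK;
}
```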

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:38 -07:00
Xin Long 28aa4c26fc sctp: add SCTP_PR_SUPPORTED on sctp sockopt
According to section 4.5 of RFC 7496, prsctp_enable should be per asoc.
We add prsctp_enable to both asoc and ep, and replace the places that
used net.sctp->prsctp_enable with asoc->prsctp_enable.

ep->prsctp_enable is initialized with net.sctp->prsctp_enable, and
asoc->prsctp_enable is initialized with ep->prsctp_enable. Its value
can also be modified through the SCTP_PR_SUPPORTED sockopt.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:25:38 -07:00
Chuck Lever a4e187d83d NFS: Don't drop CB requests with invalid principals
Before commit 778be232a2 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.

Fixes: 778be232a2 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 0533b13072 svc: Avoid garbage replies when pc_func() returns rpc_drop_reply
If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway, consisting of a
single word that holds the value RPC_DROP_REPLY (in network
byte-order, of course). This is a nonsense RPC message.

Fixes: 9e701c6109 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 65b80179f9 xprtrdma: No direct data placement with krb5i and krb5p
Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.
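A decision sketch of the rule above: small messages go inline; larger ones use Read/Write chunks only when the security flavor permits direct data placement, otherwise the whole message travels as a Long Call or Long Reply. The enums and functions are hypothetical and simplify xprtrdma's real chunk-selection logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the concepts in the text. */
enum model_sec  { MODEL_SEC_SYS, MODEL_SEC_KRB5, MODEL_SEC_KRB5I, MODEL_SEC_KRB5P };
enum model_xfer { MODEL_INLINE, MODEL_CHUNKS, MODEL_LONG };

/* krb5i/krb5p guarantee integrity or privacy: no direct data placement. */
static bool model_ddp_allowed(enum model_sec sec)
{
	return sec != MODEL_SEC_KRB5I && sec != MODEL_SEC_KRB5P;
}

static enum model_xfer model_choose(enum model_sec sec, size_t len,
				    size_t inline_max)
{
	if (len <= inline_max)
		return MODEL_INLINE;
	if (model_ddp_allowed(sec))
		return MODEL_CHUNKS;   /* Read/Write chunks for data items */
	return MODEL_LONG;             /* whole message via Long Call/Reply */
}
```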

On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.

Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 64695bde6c xprtrdma: Clean up fixup_copy_count accounting
fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.

And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever cfabe2c634 xprtrdma: Update only specific fields in private receive buffer
Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever cb0ae1fbb2 xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.

The problem is that rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
   if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting
comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 80414abc28 xprtrdma: rpcrdma_inline_fixup() overruns the receive page list
When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.
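A flattened sketch of the bound: copy at most page_len bytes into the page list and report how much is left for the tail iovec. Treating the page list as one contiguous array is a simplifying assumption; model_copy_to_pages() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most page_len bytes into 'pages'; return the number of bytes
 * that must go to the tail iovec instead. */
static size_t model_copy_to_pages(char *pages, size_t page_len,
				  const char *src, size_t remaining)
{
	size_t n = remaining < page_len ? remaining : page_len;

	memcpy(pages, src, n);
	return remaining - n;
}
```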

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 5ab8142839 xprtrdma: Chunk list encoders no longer share one rl_segments array
Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.

However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.

Thus the rl_segments array can now be used solely for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.

This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.

This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 9d6b040978 xprtrdma: Place registered MWs on a per-req list
Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.

ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.

This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.

As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
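A minimal sketch of the per-request list: each MW is pushed onto the request's list as it is registered, and unmapping simply pops entries until the list is empty, with no sparse array walk. The singly linked list and the field name registered_mws are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request list of registered MWs. */
struct model_mw {
	struct model_mw *next;
	int registered;
};

struct model_req {
	struct model_mw *registered_mws;
};

/* Called as each MW is registered. */
static void model_push_mw(struct model_req *req, struct model_mw *mw)
{
	mw->registered = 1;
	mw->next = req->registered_mws;
	req->registered_mws = mw;
}

/* Unmap: pull MWs off the list; returns how many were released. */
static int model_unmap_all(struct model_req *req)
{
	int n = 0;
	struct model_mw *mw;

	while ((mw = req->registered_mws) != NULL) {
		req->registered_mws = mw->next;
		mw->next = NULL;
		mw->registered = 0;
		n++;
	}
	return n;
}
```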

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 2ffc871a57 xprtrdma: Release orphaned MRs immediately
Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever e2ac236c0b xprtrdma: Allocate MRs on demand
Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.

Commit 94f58c58c0 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.

Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.

This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.

FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
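A counting model of on-demand allocation: the pool starts with 32 MRs, matching the figure above, and grows by a batch only when the free list is empty. The batch size of 32 and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_INITIAL_MRS 32
#define MODEL_MR_BATCH    32   /* growth step: an assumption */

struct model_mr_pool {
	size_t total;   /* MRs allocated for this transport */
	size_t free;    /* MRs currently on the free list */
};

static void model_pool_init(struct model_mr_pool *p)
{
	p->total = MODEL_INITIAL_MRS;
	p->free = MODEL_INITIAL_MRS;
}

/* Take an MR; allocate another batch only when the free list is empty. */
static void model_get_mr(struct model_mr_pool *p)
{
	if (p->free == 0) {
		p->total += MODEL_MR_BATCH;
		p->free += MODEL_MR_BATCH;
	}
	p->free--;
}

static void model_put_mr(struct model_mr_pool *p)
{
	p->free++;
}
```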

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever a54d4059e5 xprtrdma: Chunk list encoders must not return zero
Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 7a89f9c626 xprtrdma: Honor ->send_request API contract
Commit c93c62231c ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.

Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.

Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 3d4cf35bd4 xprtrdma: Reply buffer exhaustion can be catastrophic
Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.

Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.
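A hypothetical sketch of the allocation rule: a request is handed out only together with a reply buffer, and if no reply buffer is available the function returns NULL so the caller waits and retries. The pools and field names are illustrative, not xprtrdma's real buffer management.

```c
#include <assert.h>
#include <stddef.h>

struct model_rep { int id; };
struct model_req { struct model_rep *reply; };

struct model_pool {
	struct model_req *reqs[4];
	size_t nreqs;
	struct model_rep *reps[4];
	size_t nreps;
};

static struct model_req *model_buffer_get(struct model_pool *p)
{
	struct model_req *req;

	/* Without a rep, no receive could be posted to catch the reply. */
	if (p->nreqs == 0 || p->nreps == 0)
		return NULL;          /* caller waits and tries again */
	req = p->reqs[--p->nreqs];
	req->reply = p->reps[--p->nreps];
	return req;
}
```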

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever b54054ca55 xprtrdma: Clean up device capability detection
Clean up: Move device capability detection into memreg-specific
source files.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever a473018cfe xprtrdma: Remove rpcrdma_map_one() and friends
Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.

This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
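Using the approximate figures above, the saving per transport connection works out to roughly:

```latex
3500~\text{bytes} \times 128~\text{credits} = 448{,}000~\text{bytes} \approx 437.5~\text{KiB}
```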

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 2dc3a69de0 xprtrdma: Remove ALLPHYSICAL memory registration mode
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.

ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.

ALLPHYSICAL exposes the client to server bugs, including:
 o base/bounds errors causing data outside the i/o buffer to be
   accessed
 o RDMA access after reply causing data corruption and/or integrity
   fail

ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.

ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 42fe28f607 xprtrdma: Do not leak an MW during a DMA map failure
Based on code audit.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 505bbe64dd xprtrdma: Refactor MR recovery work queues
I found that commit ead3f26e35 ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.

The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.

Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever fcdfb968a7 xprtrdma: Use scatterlist for DMA mapping and unmapping under FMR
The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.

However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.

This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (i.e., until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).

The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.

Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <sagi@lightbits.io>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 88975ebed5 xprtrdma: Rename fields in rpcrdma_fmr
Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever d48b1d2950 xprtrdma: Move init and release helpers
Clean up: Moving these helpers in a separate patch makes later
patches more readable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 564471d2f2 xprtrdma: Create common scatterlist fields in rpcrdma_mw
Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.

One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever 38f1932e60 xprtrdma: Remove FMRs from the unmap list after unmapping
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.

Since commit 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync() has passed more than one FMR to ib_unmap_fmr(), but
it does not remove the FMRs from that list once the call is
complete.

I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().
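A minimal userspace sketch of that defensive cleanup, assuming a simplified list modeled on <linux/list.h> (these helpers are stand-ins, not the kernel API). The hypothetical unmap_all() plays the role of ib_unmap_fmr(): like it, it walks the list without unlinking anything, so the caller removes each entry afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly linked list, modeled on <linux/list.h>. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink n and point it at itself so a second unlink is harmless. */
static void list_del_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static int list_empty(const struct list_node *h) { return h->next == h; }

/* Hypothetical FMR with its list linkage embedded. */
struct fmr { struct list_node list; int mapped; };

#define fmr_of(p) ((struct fmr *)((char *)(p) - offsetof(struct fmr, list)))

/* Stand-in for ib_unmap_fmr(): processes every entry but,
 * like the real verb, leaves the list linkage untouched. */
static void unmap_all(struct list_node *head)
{
	for (struct list_node *p = head->next; p != head; p = p->next)
		fmr_of(p)->mapped = 0;
}

/* Caller-side cleanup: unmap, then remove each FMR from the list. */
static void unmap_and_unlink(struct list_node *head)
{
	unmap_all(head);
	while (!list_empty(head))
		list_del_init(head->next);
}
```

Because each unlinked node is left pointing at itself, a stray second unlink cannot corrupt a neighbouring list.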

Fixes: 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Michal Kubeček a612769774 udp: prevent bugcheck if filter truncates packet too much
If a socket filter truncates a UDP packet below the length of the UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
the kernel is configured that way) can easily be triggered by an unprivileged
user; this was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8
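The failure mode can be pictured with a small sketch: stripping a fixed 8-byte UDP header must first check that the filter left at least that many bytes. The struct pkt view and pull_udp_header() are hypothetical illustrations of the length check, not the code the fix actually touches.

```c
#include <assert.h>
#include <stddef.h>

#define UDP_HDR_LEN 8 /* size of the fixed UDP header */

/* Hypothetical packet view: data pointer plus remaining length. */
struct pkt { const unsigned char *data; size_t len; };

/* Strip the UDP header only if the (possibly filter-trimmed) packet is
 * still long enough; return -1 so the caller can drop the packet instead
 * of hitting an assertion such as the BUG_ON in skb_pull_rcsum(). */
static int pull_udp_header(struct pkt *p)
{
	if (p->len < UDP_HDR_LEN)
		return -1;
	p->data += UDP_HDR_LEN;
	p->len -= UDP_HDR_LEN;
	return 0;
}
```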

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 12:43:15 -07:00
David S. Miller 7d32eb8781 Here are a couple batman-adv bugfix patches, all by Sven Eckelmann:

Merge tag 'batadv-net-for-davem-20160708' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are a couple batman-adv bugfix patches, all by Sven Eckelmann:

 - Fix possible NULL pointer dereference for vlan_insert_tag (two patches)

 - Fix reference handling in some features, which may lead to reference
   leaks or invalid memory access (four patches)

 - Fix speedy join: DHCP packets handled by the gateway feature should
   be sent with 4-address unicast instead of 3-address unicast to make
   speedy join work. This fixes/speeds up DHCP assignment for clients
   which join a mesh for the first time. (one patch)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 12:28:44 -07:00
Toby DiPasquale c2b9b4fee8 netfilter: nf_conntrack_h323: fix off-by-one in DecodeQ931
This patch corrects an off-by-one error in the DecodeQ931 function in
the nf_conntrack_h323 module. This error could result in reading off
the end of a Q.931 frame.

Signed-off-by: Toby DiPasquale <toby@cbcg.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:32:45 +02:00
Pablo Neira Ayuso c080b460df Merge tag 'ipvs-for-v4.8' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
IPVS Updates for v4.8

Please consider these enhancements to IPVS. This alters the behaviour
of the "least connection" schedulers such that pre-established connections
are included in the active connection count. This avoids overloading
servers when a large number of new connections arrive in a short space of
time - e.g. when clients reconnect after a node or network failure.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:16:34 +02:00
Pablo Neira Ayuso 42a5576913 netfilter: nf_tables: get rid of possible_net_t from set and basechain
We can pass the netns pointer as a parameter to the functions that need to
gain access to it. As for basechains, I didn't find any remaining users of
this field, so let's remove it there too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:16:04 +02:00
Liping Zhang 3f8b61b7f9 netfilter: nft_ct: make byte/packet expr more friendly
If we want to use the ct packets expression and add a rule like the following:
  # nft add rule filter input ct packets gt 1 counter

We will find that no packets hit it, because
nf_conntrack_acct is disabled by default, so it will
not work until we enable it manually via
"echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct".

This is not user-friendly, so, as xt_connbytes does, enable
nf_conntrack_acct automatically if the user wants to use the
ct bytes/packets expression.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:16:02 +02:00
Hangbin Liu 47c7445625 netfilter: physdev: physdev-is-out should not work with OUTPUT chain
physdev_mt() checks skb->nf_bridge first, which is allocated in
br_nf_pre_routing(). So if we want to use --physdev-out and --physdev-is-out,
we need to match in the FORWARD or POSTROUTING chain. physdev_mt_check()
only checked --physdev-out and missed --physdev-is-out. Fix it and update the
debug message to make it clearer.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Marcelo R Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:16:01 +02:00
Florian Westphal 870190a9ec netfilter: nat: convert nat bysrc hash to rhashtable
It used a fixed-size bucket list plus a single lock to protect add/del.

Unlike the main conntrack table we only need to add and remove keys.
Convert it to rhashtable to get table autosizing and per-bucket locking.

The maximum number of entries is -- as before -- tied to the number of
conntracks, so we do not need another upper limit.

The change does not handle rhashtable_remove_fast() errors; the only possible
"error" is -ENOENT, and that can legitimately happen,
e.g. because the nat module was inserted at a later time and no src manip
took place yet.

Tested with http-client-benchmark + httpterm with DNAT and SNAT rules
in place.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 12:07:57 +02:00
Pablo Neira Ayuso 4edfa9d0bf Merge tag 'ipvs-fixes2-for-v4.7' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:

====================
Second Round of IPVS Fixes for v4.7

The fix from Quentin Armitage allows the backup sync daemon to
be bound to a link-local mcast IPv6 address as is already the case
for IPv4.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:58:33 +02:00
Florian Westphal 7c96643519 netfilter: move nat hlist_head to nf_conn
The nat extension structure is 32 bytes in size on x86_64:

struct nf_conn_nat {
        struct hlist_node          bysource;             /*     0    16 */
        struct nf_conn *           ct;                   /*    16     8 */
        union nf_conntrack_nat_help help;                /*    24     4 */
        int                        masq_index;           /*    28     4 */
        /* size: 32, cachelines: 1, members: 4 */
        /* last cacheline: 32 bytes */
};

The hlist is needed to quickly check for possible tuple collisions
when installing a new nat binding. Storing this in the extension
area has two drawbacks:

1. We need a ct backpointer to get the conntrack struct from the extension.
2. When reallocation of extension area occurs we need to fixup the bysource
   hash head via hlist_replace_rcu.

We can avoid both by placing the hlist_head in nf_conn and placing nf_conn in
the bysource hash rather than the extension.

We can also remove the ->move support; no other extension needs it.
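The first drawback is the classic back-pointer problem. When the hash linkage is embedded in the parent object itself, the parent can be recovered from the node with pointer arithmetic, as the kernel's container_of() does, so no separate ct field is needed. A minimal userspace sketch with hypothetical struct names:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct hnode { struct hnode *next; };

/* Hypothetical conntrack entry with the hash linkage embedded in it,
 * rather than in a separately allocated extension area. */
struct conn {
	int id;
	struct hnode bysource;
};

/* Recover the owning conn from a node found on a hash chain. */
static struct conn *conn_from_node(struct hnode *n)
{
	return container_of(n, struct conn, bysource);
}
```

An extension area is allocated separately from nf_conn, so this offset trick cannot reach the conntrack from there; that is why the backpointer was needed.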

Moving the entire nat extension into nf_conn would be possible as well but
then we would have to add yet another callback for deletion from the bysource
hash table rather than just using nat extension ->destroy hook for this.

nf_conn size doesn't increase due to alignment; a follow-up patch replaces
hlist_node with a single pointer.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:47:50 +02:00
Florian Westphal 242922a027 netfilter: conntrack: simplify early_drop
We don't need to acquire the bucket lock during early drop; we can
use lockless traversal just like ____nf_conntrack_find.

The timer deletion serves as a synchronization point: if another CPU
attempts to evict the same entry, only one will succeed with timer deletion.
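As a rough analogy, the single-winner property can be modeled with an atomic exchange: whichever caller flips the pending flag from true to false wins, and every later caller sees false. This sketch assumes a hypothetical struct entry and is not the conntrack code, which relies on timer deletion reporting whether the timer was still pending.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical entry whose pending timer is modeled as an atomic flag. */
struct entry { atomic_bool timer_pending; };

/* Returns true only for the one caller that actually deactivated the
 * "timer"; that caller is the one allowed to evict the entry. */
static bool try_evict(struct entry *e)
{
	return atomic_exchange(&e->timer_pending, false);
}
```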

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:46:22 +02:00
Liping Zhang 8786a9716d netfilter: nf_ct_helper: unlink helper again when hash resize happen
From: Liping Zhang <liping.zhang@spreadtrum.com>

Similar to ctnl_untimeout, when a hash resize has happened, we should
try the unhelp operation again starting from bucket #0.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:44:34 +02:00
Liping Zhang 474803d37e netfilter: cttimeout: unlink timeout obj again when hash resize happen
Imagine the following situation: nf_conntrack_htable_size is now 4096, we
are doing ctnl_untimeout, and we are iterating over bucket #3000.

Meanwhile, another user tries to reduce the hash size to 2048, so all nf_conn
entries are moved to the new hashtable. When this hash resize operation
finishes, we still try to iterate from bucket #3000, find nothing to do and
just return.

We may miss unlinking some timeout objects, and later we will end up with
invalid references to timeout objects that are already gone.

So when we find that a hash resize has happened, try to unlink timeout
objects from bucket #0 again.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:39:08 +02:00
Liping Zhang 64b87639c9 netfilter: conntrack: fix race between nf_conntrack proc read and hash resize
When we do "cat /proc/net/nf_conntrack" while resizing the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, a race
happens because the reader can observe a newly allocated hash but the old
size (or vice versa). This leads to an oops like the following:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
  IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
  Call Trace:
  [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
  [<ffffffff81261a1c>] seq_read+0x2cc/0x390
  [<ffffffff812a8d62>] proc_reg_read+0x42/0x70
  [<ffffffff8123bee7>] __vfs_read+0x37/0x130
  [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
  [<ffffffff8123cf75>] vfs_read+0x95/0x140
  [<ffffffff8123e475>] SyS_read+0x55/0xc0
  [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4

It is very easy to reproduce this kernel crash.
1. Open one shell and run the following commands:
  while : ; do
    echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
  done
2. Open more shells and run the following commands:
  while : ; do
    cat /proc/net/nf_conntrack
  done
3. Just wait a moment; the oops will happen soon.

The solution in this patch is based on Florian's commit 5e3c61f981
("netfilter: conntrack: fix lookup race during hash resize"). It also
adds a wrapper function, nf_conntrack_get_ht(), to get the hash and hsize,
as suggested by Florian Westphal.
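The idea can be sketched with a seqcount-style reader: the writer makes a sequence counter odd while it swaps the table and size, and the reader retries until it has read both values under the same even counter value. The struct ht_state, ht_publish() and ht_get() names are hypothetical; the kernel uses its own seqcount primitives rather than C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical hash table state published by a resizer. The sequence
 * counter is odd while an update is in progress (seqcount-style). */
struct ht_state {
	atomic_uint seq;
	int *_Atomic hash;
	_Atomic size_t size;
};

/* Writer: make seq odd, publish the new pointer and size, make it even. */
static void ht_publish(struct ht_state *s, int *hash, size_t size)
{
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->hash, hash);
	atomic_store(&s->size, size);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader: retry until hash and size come from the same update. */
static void ht_get(struct ht_state *s, int **hash, size_t *size)
{
	unsigned int start;

	do {
		while ((start = atomic_load(&s->seq)) & 1)
			; /* writer in progress, spin */
		*hash = atomic_load(&s->hash);
		*size = atomic_load(&s->size);
	} while (atomic_load(&s->seq) != start);
}
```

With sequentially consistent atomics, a reader that sees the same even sequence value before and after its loads cannot have paired a new hash pointer with an old size.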

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:38:57 +02:00
Thierry Escande d85a301c26 NFC: digital: Fix RTOX supervisor PDU handling
When the target needs more time to process the received PDU, it sends a
Response Timeout Extension (RTOX) PDU.

When the initiator receives an RTOX PDU, it must reply with an RTOX PDU
and extend the current rwt value with the formula:
 rwt_int = rwt * rtox

This patch takes care of the rtox value passed by the target in the RTOX
PDU and extends the timeout for the next response accordingly.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 02:02:03 +02:00
Thierry Escande 1a09c56f54 NFC: digital: Add support for NFC DEP Response Waiting Time
When sending an ATR_REQ, the initiator must wait for the ATR_RES at
least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than
'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This
gives a timeout value between 1237 ms and 1337 ms. This patch defines
DIGITAL_ATR_RES_RWT as 1337, the timeout value used for the ATR_REQ
command.

For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)'
and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by
the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value
of the TO field in the ATR_RES response and is in the range between 0
and 14. This patch declares a mapping table for wt values and gives RWT
max values between 100 ms and 5049 ms.

This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in
target mode, to 8.
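Plugging numbers into the RWT formula, assuming the standard 13.56 MHz carrier for f(c), gives roughly 302 µs for wt = 0, about 1237 ms for wt = 12 and about 4949 ms for wt = 14; the 100 ms to 5049 ms range quoted above corresponds to these values plus roughly 100 ms of margin. The helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* NFC carrier frequency f(c) in Hz (13.56 MHz), assumed here. */
#define NFC_FC_HZ 13560000ULL

/* Base Response Waiting Time in microseconds:
 * RWT = (256 * 16 / f(c)) * 2^wt, computed in integer arithmetic. */
static uint64_t nfc_dep_rwt_us(unsigned int wt)
{
	assert(wt <= 14); /* TO field range given in the text above */
	return ((256ULL * 16) << wt) * 1000000ULL / NFC_FC_HZ;
}
```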

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 02:01:14 +02:00
Thierry Escande e200f008ac NFC: digital: Free supervisor PDUs
This patch frees the RTOX resp sk_buff in initiator mode. It also makes
use of the free_resp exit point for ATN supervisor PDUs in both
initiator and target mode.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 02:00:26 +02:00
Thierry Escande e073eb6797 NFC: digital: Rework ACK PDU handling in initiator mode
With this patch, ACK PDU sk_buffs are now freed and the code has been
refactored for better error handling.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 01:59:37 +02:00
Thierry Escande 482333b277 NFC: digital: Fix ACK & NACK PDUs handling in target mode
When the target receives a NACK PDU, it re-sends the last sent PDU.

ACK PDUs are received by the target as a reply from the initiator to
chained I-PDUs. There are 3 cases to handle:
- If the target has previously received 1 or more ATN PDUs and the PNI
  in the ACK PDU is equal to the target PNI - 1, then it means that the
  initiator did not receive the last issued PDU from the target. In
  this case it re-sends this PDU.
- If the target has received 1 or more ATN PDUs but the ACK PNI is not
  the target PNI - 1, then this means that this ACK is the reply to the
  previous chained I-PDU sent by the target. The target did not receive
  it on the first attempt and it is being re-sent by the initiator. The
  process continues as usual.
- No ATN PDU received before this ACK PDU. This is the reply of a
  chained I-PDU. The target keeps on processing its chained I-PDU.
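The three cases reduce to a single test on the ATN count and the PNI, assuming the 2-bit PNI of NFC-DEP so that "target PNI - 1" wraps modulo 4. A sketch with a hypothetical helper, not the actual digital_tg_* code:

```c
#include <assert.h>
#include <stdint.h>

/* The PNI is a 2-bit counter, so arithmetic on it wraps modulo 4. */
#define PNI(x) ((uint8_t)((x) & 0x3))

enum ack_action {
	ACK_RESEND_LAST_PDU, /* initiator missed our last PDU */
	ACK_CONTINUE_CHAIN,  /* keep processing the chained I-PDU */
};

/* Hypothetical decision helper for an ACK received in target mode. */
static enum ack_action tg_classify_ack(unsigned int atn_count,
				       uint8_t ack_pni, uint8_t tg_pni)
{
	if (atn_count > 0 && PNI(ack_pni) == PNI(tg_pni - 1))
		return ACK_RESEND_LAST_PDU;
	/* Cases 2 and 3: chaining proceeds as usual. */
	return ACK_CONTINUE_CHAIN;
}
```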

The code has been refactored to avoid too many indentation levels.

Also, ACK and NACK PDUs were not freed. This is now fixed.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 01:58:46 +02:00
Thierry Escande f23a9868b1 NFC: digital: Fix target DEP_REQ I-PDU handling after ATN PDU
When the initiator sends a DEP_REQ I-PDU, the target device may not
reply in a timely manner. In this case the initiator device must send an
attention PDU (ATN) and if the recipient replies with an ATN PDU in
return, then the last I-PDU must be sent again by the initiator.

This patch fixes how the target handles I-PDUs received after an ATN PDU
has been received.

There are 2 possible cases:
- The target has received the initial DEP_REQ and sends back the DEP_RES
  but the initiator did not receive it. In this case, after the
  initiator has sent an ATN PDU and the target has replied to it (with an ATN
  as well), the initiator sends the saved skb of the initial DEP_REQ
  again and the target replies with the saved skb of the initial
  DEP_RES.
- Or the target did not even receive the initial DEP_REQ. In this case,
  after the ATN PDUs exchange, the initiator sends the saved skb and the
  target simply passes it up, just as usual.

This behavior is controlled using the atn_count and the PNI field of the
digital device structure.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 01:57:50 +02:00
Thierry Escande e8e7f42175 NFC: digital: Remove useless call to skb_reserve()
When allocating chained I-PDUs, there is no need to call skb_reserve()
since it's already done by digital_alloc_skb() and contains enough room
for the driver head and tail data.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 01:56:45 +02:00
Thierry Escande 1d984c2e03 NFC: digital: Fix handling of saved PDU sk_buff pointers
This patch fixes the way an I-PDU is saved in case it needs to be sent
again. It is now copied using pskb_copy() and not simply referenced
using skb_get() since it could be modified by the driver.
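The difference between taking a reference and making a copy is easy to see with a toy refcounted buffer: a driver that edits the shared buffer in place also changes what the saved reference points at, while a private copy stays intact. struct buf, buf_get() and buf_copy() are hypothetical stand-ins for sk_buff, skb_get() and pskb_copy(); the real pskb_copy() copies only the header and leaves paged data shared.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted buffer standing in for an sk_buff. */
struct buf {
	int refcnt;
	size_t len;
	unsigned char data[32];
};

/* Share the same buffer (like skb_get()): later edits are visible. */
static struct buf *buf_get(struct buf *b)
{
	b->refcnt++;
	return b;
}

/* Private copy (like pskb_copy()): later edits to b do not affect it. */
static struct buf *buf_copy(const struct buf *b)
{
	struct buf *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	memcpy(c, b, sizeof(*c));
	c->refcnt = 1;
	return c;
}
```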

digital_in_send_saved_skb() and digital_tg_send_saved_skb() still get a
reference on the saved skb which is re-sent but release it if the send
operation fails. That way the caller doesn't have to take care of the skb
reference in case of error.

RTOX supervisor PDUs must not be saved, as this could overwrite a previously
saved I-PDU that should be re-sent later on.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11 01:55:42 +02:00
Eric Dumazet 95556a8838 dccp: avoid deadlock in dccp_v4_ctl_send_reset
In the prep work I did before enabling BH while handling socket backlog,
I missed two points in DCCP:

1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH is
blocked. This is no longer always true.

2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
  IP_INC_STATS()

A similar fix was done for TCP, in commit 47dcc20a39
("ipv4: tcp: ip_send_unicast_reply() is not BH safe")

Fixes: 7309f8821f ("dccp: do not assume DCCP code is non preemptible")
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:14:17 -04:00
Eric Dumazet 927265bc6c ipv6: do not abuse GFP_ATOMIC in inet6_netconf_notify_devconf()
All inet6_netconf_notify_devconf() callers are in process context,
so we can use GFP_KERNEL allocations if we take care not to hold
an rwlock when it is not needed in ip6mr (we hold RTNL there).

Fixes: d67b8c616b ("netconf: advertise mc_forwarding status")
Fixes: f3a1bfb11c ("rtnl/ipv6: use netconf msg to advertise forwarding status")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:13:20 -04:00
Eric Dumazet fa17806cde ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf()
inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.

If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and won't touch precious memory
reserves either and risk dropping incoming packets.

inet_netconf_get_devconf() can also use GFP_KERNEL allocation.

Fixes: edc9e74893 ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e5511106f ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:12:25 -04:00
Jesper Dangaard Brouer 1db19db7f5 net: tracepoint napi:napi_poll add work and budget
An important piece of information for the napi_poll tracepoint is
the work done (packets processed) by the napi_poll() call. Add
both the work done and the budget, as they are related.

Handle the trace_napi_poll() parameter change in dropwatch/drop_monitor
and in the Python perf script netdev-times.py in a backward-compatible way,
as Python fortunately supports optional parameters.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:05:02 -04:00
Simon Horman 407f31be9d mpls: allow routes on ipip and sit devices
Allow MPLS routes on IPIP and SIT devices now that they
support forwarding MPLS packets.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Simon Horman 1b69e7e6c4 ipip: support MPLS over IPv4
Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of the existing support for IPv4 over IPv4 and is based on the
multiple inner-protocol support in the SIT driver.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Simon Horman 49dbe7ae21 sit: support MPLS over IPv4
Extend the SIT driver to support MPLS over IPv4. This implementation
extends existing support for IPv6 over IPv4 and IPv4 over IPv4.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Simon Horman 8afe97e5d4 tunnels: support MPLS over IPv4 tunnels
Extend tunnel support to MPLS over IPv4.  The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.
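On the wire, the differentiation comes down to the protocol number in the outer IPv4 header. A hypothetical helper mapping the inner payload's ethertype to that number (values from the IANA protocol registry and the usual ETH_P_* ethertypes) might look like this; the drivers themselves are structured differently:

```c
#include <assert.h>
#include <stdint.h>

/* Map the inner payload's ethertype to the protocol number carried in
 * the outer IPv4 header; returns -1 for payloads not tunneled here. */
static int outer_ip_proto(uint16_t inner_ethertype)
{
	switch (inner_ethertype) {
	case 0x0800: return 4;   /* IPv4 over IPv4 (IPIP) */
	case 0x86DD: return 41;  /* IPv6 over IPv4 (SIT) */
	case 0x8847: return 137; /* MPLS unicast over IPv4 (MPLS-in-IP) */
	default:     return -1;
	}
}
```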

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Nikolay Aleksandrov a65056ecf4 net: bridge: extend MLD/IGMP query stats
As suggested, this patch adds support for the different versions of MLD
and IGMP query types. Since the user-visible structure is still in net-next,
we can augment it instead of adding netlink attributes.
The distinction between the different IGMP/MLD query types is done as
suggested in Section 7.1, RFC 3376 [1] and Section 8.1, RFC 3810 [2] based
on query payload size and code for IGMP. Since all IGMP packets go through
multicast_rcv() and it uses ip_mc_check_igmp/ipv6_mc_check_mld we can be
sure that at least the ip/ipv6 header can be directly used.

[1] https://tools.ietf.org/html/rfc3376#section-7
[2] https://tools.ietf.org/html/rfc3810#section-8.1
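The RFC rules referenced above can be written as two small classifiers: IGMP queries are told apart by message length and, for 8-octet messages, by whether Max Resp Code is zero, while MLD queries are told apart by length alone. The function names are hypothetical, and lengths exclude the IP/IPv6 header:

```c
#include <assert.h>
#include <stddef.h>

/* RFC 3376, Section 7.1: IGMP query version from length and code.
 * Returns 0 for queries that must be silently ignored. */
static int igmp_query_version(size_t len, unsigned char max_resp_code)
{
	if (len == 8)
		return max_resp_code == 0 ? 1 : 2;
	if (len >= 12)
		return 3;
	return 0; /* e.g. a 10-octet query */
}

/* RFC 3810, Section 8.1: MLD query version from length alone. */
static int mld_query_version(size_t len)
{
	if (len == 24)
		return 1;
	if (len >= 28)
		return 2;
	return 0; /* e.g. a 26-octet query */
}
```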

Suggested-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:40:09 -04:00
Marcel Holtmann ca8bee5dde Bluetooth: Rename HCI_BREDR into HCI_PRIMARY
The HCI_BREDR naming is confusing since it actually stands for Primary
Bluetooth Controller, which is a term that has been used in the latest
standard. However, from a legacy point of view there really have only
been Basic Rate (BR) and Enhanced Data Rate (EDR). Recent versions of
Bluetooth introduced Low Energy (LE) and made this terminology a little
confusing, since Dual Mode Controllers include both BR/EDR and LE. To
simplify this, the name HCI_PRIMARY stands for the Primary Controller,
which can be a single mode or dual mode controller.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-07-09 21:37:13 +03:00
Marcel Holtmann e14dbe7203 Bluetooth: Remove controller device attributes
The controller device attributes are not used and expose no valuable
information.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-07-09 21:37:11 +03:00
Marcel Holtmann 2a0be13986 Bluetooth: Remove connection link attributes
The connection link attributes are not used and expose no valuable
information.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-07-09 21:37:08 +03:00
Marcelo Ricardo Leitner f1533cce60 sctp: fix panic when sending auth chunks
When we introduced GSO support, if auth was in use, the auth chunk was
left queued on the packet even after the final segment was generated.
Later on, sctp_transmit_packet() calls sctp_packet_reset(), which zeroed
the packet length without accounting for this leftover. This caused the
next packet to use more space, due to the chunk still being queued, but
that space was never allocated, as its size wasn't accounted for.

The fix is to only queue it back when we know that we are going to
generate another segment.

Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 00:08:21 -04:00
Vivien Didelot d390238c4f net: dsa: initialize the routing table
The routing table of every switch in a tree is currently initialized to
all zeros. This is an issue since 0 is a valid port number.

Add a DSA_RTABLE_NONE=-1 constant to initialize the signed values of the
routing table pointing to other switches.

This fixes the device mapping of the mv88e6xxx driver where the port
pointing to the switch itself and to non-existent switches was wrongly
configured to be 0. It is now set to the expected 0xf value.
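A minimal sketch of the initialization and of the translation into the switch's "no port" value. DSA_RTABLE_NONE comes from the text above; DSA_MAX_SWITCHES (with the value 4 here), HW_PORT_NONE and the helper names are assumptions made for the example:

```c
#include <assert.h>
#include <stdint.h>

#define DSA_MAX_SWITCHES 4    /* tree size assumed for the sketch */
#define DSA_RTABLE_NONE  (-1) /* "no route": 0 is a valid port number */
#define HW_PORT_NONE     0xf  /* value the switch expects for "no port" */

/* rtable[dev] holds the local port leading to switch dev. */
static void rtable_init(int8_t rtable[DSA_MAX_SWITCHES])
{
	for (int i = 0; i < DSA_MAX_SWITCHES; i++)
		rtable[i] = DSA_RTABLE_NONE;
}

/* Translate a routing-table entry into a device-mapping value. */
static uint8_t rtable_to_hw(int8_t port)
{
	return port == DSA_RTABLE_NONE ? HW_PORT_NONE : (uint8_t)port;
}
```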

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-08 23:59:49 -04:00
David S. Miller 5b58d83617 Two more fixes:

Merge tag 'mac80211-for-davem-2016-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two more fixes:
 * handle allocation failures in new(ish) A-MSDU decapsulation
 * don't leak memory on nl80211 ACL parse errors
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-08 23:53:41 -04:00
David S. Miller cc3baecb21 RxRPC rewrite

Merge tag 'rxrpc-rewrite-20160706' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Improve conn/call lookup and fix call number generation [ver #3]

I've fixed a couple of patch descriptions and excised the patch that
duplicated the connections list for reconsideration at a later date.

For reference, the excised patch is sitting on the rxrpc-experimental
branch of my git tree, based on top of the rxrpc-rewrite branch.  Diffing
it against yesterday's tag shows no differences.

Would you prefer the patch set to be emailed afresh instead of a git-pull
request?

David
---
Here's the next part of the AF_RXRPC rewrite.  The two main purposes of
this set are to fix the call number handling and to make use of RCU when
looking up the connection or call to pass a received packet to.

Important changes in this set include:

 (1) Avoidance of placing stack data into SG lists in rxkad so that kernel
     stacks can become vmalloc'd (Herbert Xu).

 (2) Calls cease pinning the connection they used as soon as possible,
     which allows the connection to be discarded sooner and allows the call
     channel on that connection to be reused earlier.

 (3) Make each call channel on a connection have a separate and independent
     call number space rather than having a shared number space for the
     connection.  Call numbers should increment monotonically per channel
     on the client, and the server should ignore a call with a lower call
     number for that channel than the latest it has seen.  The RESPONSE
     packet sets the minimum values of each call ID counter on a
     connection.

 (4) Look up calls by indexing the channel array on a connection rather
     than by keeping calls in an rbtree on that connection.  Also look up
     calls using the channel array rather than using a hashtable.

     The call hashtable can then be removed.

 (5) Call terminal statuses are cached in the channel array for the last
     call.  It is assumed that if we, the server, have seen call N, then the
     client no longer cares about call N-1 on the same channel.

     This will allow retransmission of the terminal status in future
     without the need to keep the rxrpc_call struct around.

 (6) Peer lookups are moved out of common connection handling code and into
     service connection handling code as client connections (a) must point
     to a peer before they can be used and (b) are looked up by a
     machine-unique connection ID directly, so we only need to look up the
     peer first if we're going to deal with a service call.

 (7) The reference count on a connection is held elevated by 1 whilst it is
     alive (i.e. idle, unused connections have a refcount of 1).  The reaper
     will attempt to change the refcount from 1->0 and skip if this cannot
     be done, whilst lookups only increment the refcount if it's non-zero.

     This makes the implementation of RCU lookups easier as we don't have
     to get a ref on the connection or a lock on the connection list to
     prevent a connection being reaped whilst we're contemplating queueing
     a packet that initiates a new service call upon it.

     If we need to get a connection, but there's a dead connection in the
     tree, we use rb_replace_node() to replace the dead one with a new one.

 (8) Use a seqlock to validate the walk over the service connection rbtree
     attached to a peer when it's being walked in RCU mode.

 (9) Make the incoming call/connection packet handling code use RCU mode
     and locks and make it only take a reference if the call/connection
     gets queued on a workqueue.
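Item (8) relies on a sequence counter to detect a concurrent modification during a lockless walk. The following is a minimal userspace sketch of that read/retry pattern using C11 atomics. The struct and function names are hypothetical, it assumes a single writer (writers are expected to be serialised by a lock), and it is not the kernel's seqlock_t API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical data protected by a sequence counter: `seq` is odd while a
 * writer is mid-update and even otherwise. */
struct seq_pair {
    atomic_uint seq;
    atomic_int a, b;   /* must be observed as a consistent pair */
};

/* Single writer assumed (serialise writers with a lock). */
static void seq_pair_write(struct seq_pair *p, int a, int b)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed); /* odd: update in progress */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release); /* even: update complete */
}

/* One lockless attempt; returns true if *a and *b form a consistent snapshot. */
static bool seq_pair_try_read(struct seq_pair *p, int *a, int *b)
{
    unsigned s0 = atomic_load_explicit(&p->seq, memory_order_acquire);

    *a = atomic_load_explicit(&p->a, memory_order_relaxed);
    *b = atomic_load_explicit(&p->b, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    unsigned s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    return !(s0 & 1) && s0 == s1;
}

/* Repeat the read until no writer raced with it. */
static void seq_pair_read(struct seq_pair *p, int *a, int *b)
{
    while (!seq_pair_try_read(p, a, b))
        ;   /* a writer was active or finished in between: walk again */
}
```

A reader that finds an odd count, or a count that changed between its two loads, simply repeats the walk. This is what lets the common-case lookup proceed without taking the lock.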

The intention is that the next set will introduce the connection lifetime
management and capacity limits to prevent clients from overloading the
server.

There are some fixes too:

 (1) Verifying that a packet coming in to a client connection came from the
     expected source.

 (2) Fix handling of connection failure in client call creation: we don't
     reinitialise the list linkage block, so a second attempt to unlink the
     failed connection oopses, and we also don't set the state correctly,
     which causes an assertion failure.

 (3) New service calls were being added to the socket's accept queue under
     the wrong lock.

Changes:

 (V2) In rxrpc_find_service_conn_rcu() initialised the sequence number to 0.

      Fixed the RCU handling in conn_service.c by introducing and using
      rb_replace_node_rcu() as an RCU-safe alternative in
      rxrpc_publish_service_conn().

      Modified and used rcu_dereference_raw() to avoid RCU sparse warnings
      in rxrpc_find_service_conn_rcu().

      Added in some missing RCU dereference wrappers.  It seems to be
      necessary to turn on CONFIG_PROVE_RCU_REPEATEDLY as well as
      CONFIG_SPARSE_RCU_POINTER to get the static __rcu annotation checking
      to happen.

      Fixed some other sparse warnings, including a missing ntohs() in
      jumbo packet processing.

 (V3) Fixed some commit descriptions.

      Excised the patch that duplicated the connection list to separate out
      the procfs list for reconsideration at a later date.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-08 23:52:12 -04:00
Florian Westphal bba7eb5d9b hfsc: reduce hfsc_sched to 14 cachelines
hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
cachelines by placing level after filter_cnt (covering 4 byte hole) and
reducing period/nactive/flags to u32 (period is just a counter,
incremented when class becomes active -- 2**32 is plenty for this
purpose, also, long is only 32bit wide on 32bit platforms anyway).

cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
member is already u32, so no precision is lost there either.
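The arithmetic behind the change, assuming 64-byte cache lines: 920 bytes spans ceil(920/64) = 15 lines, so fitting in 14 requires shrinking the struct to at most 896 bytes. The sketch below uses hypothetical structs (not the real hfsc_sched) to show how moving a u32 next to another u32 fills a padding hole. The quoted sizes assume a typical x86-64 ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: each u32 is followed by 4 bytes of padding on x86-64. */
struct with_hole {
    uint64_t a;
    uint32_t filter_cnt;  /* 4-byte hole follows, so `b` is 8-byte aligned */
    uint64_t b;
    uint32_t level;       /* 4 bytes of tail padding follow */
};

/* Same members, with `level` moved into the former hole. */
struct hole_filled {
    uint64_t a;
    uint32_t filter_cnt;
    uint32_t level;
    uint64_t b;
};

/* Number of 64-byte cache lines spanned by an aligned object of `size` bytes. */
static size_t cachelines(size_t size)
{
    return (size + 63) / 64;
}
```

On x86-64, sizeof(struct with_hole) is 32 and sizeof(struct hole_filled) is 24, while cachelines(920) is 15 and cachelines(896) is 14.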

Cc: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-08 23:08:39 -04:00
Florian Westphal c8607e0200 netfilter: nft_ct: fix expiration getter
We need to compute timeout.expires - jiffies, not the other way around.
Add a helper; another patch can then change more places in the
conntrack code where we currently open-code this.

This will allow us to change only one place later, when we remove the
per-ct timer.
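A minimal sketch of such a helper, with plain unsigned longs standing in for jiffies and a hypothetical name. Computing expires - now in unsigned arithmetic and then interpreting the result as signed keeps the answer correct across counter wraparound, in the same spirit as the kernel's time_after() idiom. Clamping an already-expired timeout to zero is an assumption of this sketch.

```c
#include <limits.h>

/* Time remaining until `expires`, as seen at `now` (hypothetical helper).
 * Note the order: expires - now, not now - expires. */
static long remaining_timeout(unsigned long expires, unsigned long now)
{
    /* Unsigned subtraction wraps cleanly; the signed view gives the distance. */
    long delta = (long)(expires - now);

    return delta > 0 ? delta : 0;   /* already expired: report 0 */
}
```

For example, remaining_timeout(1000, 900) is 100, and remaining_timeout(5, ULONG_MAX - 4) is 10 even though the counter wrapped in between.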

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-08 14:55:14 +02:00
Alexander Aring 9e262f5037 6lowpan: ndisc: set invalid unicast short addr to unspec
When receiving neighbour information with a short address option field, we
should check the complete range of invalid short addresses and map all of
them to a single invalid value, the unspecified address. This address is
also used when a new neighbour entry is first created, to indicate that no
short address is set.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 13:23:12 +02:00
Alexander Aring 0ea0b9af9b ieee802154: 6lowpan: fix intra pan id check
The RIOT-OS stack does send intra-PAN frames but doesn't set the intra-PAN
flag inside the MAC header. This seems to be valid, if inefficient, frame
addressing. Anyway, this patch adds a new function for intra-PAN
addressing, which reports intra-PAN if either the intra-PAN flag is set or
the source and destination PAN IDs are the same. The newly introduced
function will be used to check for intra-PAN addressing in 6lowpan.
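The check described above can be sketched as follows. The struct is a hypothetical, simplified view of the 802.15.4 addressing fields rather than the kernel's header type, and the function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the addressing fields in a MAC header. */
struct mac_hdr_addrs {
    bool intra_pan;       /* intra-PAN (PAN ID compression) flag */
    uint16_t dest_pan_id;
    uint16_t src_pan_id;  /* carried explicitly when intra_pan is false */
};

/* Intra-PAN if the flag says so, or if both PAN IDs were sent and match. */
static bool is_intra_pan(const struct mac_hdr_addrs *h)
{
    return h->intra_pan || h->dest_pan_id == h->src_pan_id;
}
```

The second condition covers stacks that send identical source and destination PAN IDs instead of setting the flag.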

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 13:23:12 +02:00
Denis Kenzior 83871f8ccd Bluetooth: Fix hci_sock_recvmsg return value
If recvmsg is called with a destination buffer that is too small to
receive the contents of skb in its entirety, the return value from
recvmsg was inconsistent with common SOCK_SEQPACKET or SOCK_DGRAM
semantics.

If the destination buffer provided by userspace is too small (e.g. len <
copied), then the MSG_TRUNC flag is set and copied is returned.  Instead, it
should return the length of the message, which is consistent with how
other datagram based sockets act.  Quoting 'man recv':

"All three calls return the length of the message on successful
completion.  If a message is too long to fit in the supplied buffer,
excess bytes may be discarded depending on the type of socket the
message is received from."

and

"MSG_TRUNC (since Linux 2.2)

    For raw (AF_PACKET), Internet datagram (since Linux
    2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
    (since Linux 3.4) sockets: return the real length of the packet
    or datagram, even when it was longer than the passed buffer."
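The semantics quoted above can be observed from userspace on Linux. This sketch uses an AF_UNIX datagram socket pair as a stand-in (it is not a Bluetooth socket, and the helper name is hypothetical): it sends one datagram and receives it into a 4-byte buffer with MSG_TRUNC, so recv() returns the full datagram length rather than the number of bytes copied.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `len` bytes over an AF_UNIX datagram pair and receive them into a
 * 4-byte buffer with MSG_TRUNC. Returns recv()'s result, or -1 on setup
 * failure. */
static ssize_t recv_len_with_trunc(size_t len)
{
    int sv[2];
    char out[64] = {0}, in[4];
    ssize_t n = -1;

    if (len > sizeof(out) || socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (send(sv[0], out, len, 0) == (ssize_t)len)
        n = recv(sv[1], in, sizeof(in), MSG_TRUNC);  /* full length, not 4 */
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

recv_len_with_trunc(10) should return 10 even though only 4 bytes fit in the buffer. That is the behaviour the patch brings to the Bluetooth sockets.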

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Denis Kenzior b5f34f9420 Bluetooth: Fix bt_sock_recvmsg return value
If recvmsg is called with a destination buffer that is too small to
receive the contents of skb in its entirety, the return value from
recvmsg was inconsistent with common SOCK_SEQPACKET or SOCK_DGRAM
semantics.

If the destination buffer provided by userspace is too small (e.g. len <
copied), then the MSG_TRUNC flag is set and copied is returned.  Instead, it
should return the length of the message, which is consistent with how
other datagram based sockets act.  Quoting 'man recv':

"All three calls return the length of the message on successful
completion.  If a message is too long to fit in the supplied buffer,
excess bytes may be discarded depending on the type of socket the
message is received from."

and

"MSG_TRUNC (since Linux 2.2)

    For raw (AF_PACKET), Internet datagram (since Linux
    2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
    (since Linux 3.4) sockets: return the real length of the packet
    or datagram, even when it was longer than the passed buffer."

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Alexander Aring 1c5bf998b3 ieee802154: allow netns create of lowpan interface
This patch reverts commit f9d1ce8f81 ("ieee802154: fix netns settings").
The lowpan interface needs to be created inside the net namespace where
the wpan interface is available, and the wpan interface's namespace can
only be changed beforehand via nl802154. Without this patch it's not
possible to create a lowpan interface for a wpan interface which isn't
inside the init_net namespace.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Alexander Aring 66e5c2672c ieee802154: add netns support
This patch adds netns support for the 802.15.4 subsystem. Most parts are
copied and pasted from the wireless subsystem, and the userspace API is
identical.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Alexander Aring 966be9e790 6lowpan: ndisc: add missing 802.15.4 only check
This patch adds a missing check to handle short address parsing for
802.15.4 6LoWPAN only.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Alexander Aring 929946a471 6lowpan: ndisc: fix double read unlock
This patch removes a double read unlock in the code that accesses
neighbour private data.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Andy Lutomirski a4770e1117 Bluetooth: Switch SMP to crypto_cipher_encrypt_one()
SMP does ECB crypto on stack buffers.  This is complicated and
fragile, and it will not work if the stack is virtually allocated.

Switch to the crypto_cipher interface, which is simpler and safer.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Tested-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-07-08 12:20:57 +02:00
Michal Kubecek be2cef4990 ipvs: count pre-established TCP states as active
Some users observed that the "least connection" distribution algorithm
doesn't handle bursts of TCP connections from reconnecting clients well
after a node or network failure.

This is because the algorithm counts an active connection as worth 256
inactive ones, whereas for TCP, "active" only means TCP connections in the
ESTABLISHED state. In case of a connection burst, new connections are
handled before previous ones have finished the three-way handshake, so
all of them are still counted as "inactive", i.e. cheap ones. They become
"active" quickly, but by that time all of them have already been assigned
to one real server (or a few), resulting in a highly unbalanced distribution.

Address this by counting the "pre-established" states as "active".
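To see why counting handshaking connections as inactive skews the distribution, here is a small simulation. The structs and functions are hypothetical and this is not IPVS code. It only uses the weighting stated above (one active connection costs as much as 256 inactive ones) and picks the server with the lowest cost. Server 0 starts with `busy` established connections and server 1 with none, then a burst arrives before any new connection completes its handshake.

```c
#include <stddef.h>

/* Hypothetical per-server connection counters. */
struct real_server {
    unsigned active;
    unsigned inactive;
};

/* Least-connection cost: one active connection weighs as much as 256 inactive. */
static unsigned lc_overhead(const struct real_server *s)
{
    return s->active * 256 + s->inactive;
}

/* Pick the server with the lowest cost; the first one wins ties. */
static size_t lc_pick(const struct real_server *s, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (lc_overhead(&s[i]) < lc_overhead(&s[best]))
            best = i;
    return best;
}

/* Schedule `burst` new connections and return how many land on server 1.
 * count_handshake_as_active selects how the not-yet-established
 * connections are accounted while the burst is being scheduled. */
static unsigned burst_to_idle(unsigned busy, unsigned burst,
                              int count_handshake_as_active)
{
    struct real_server s[2] = { { busy, 0 }, { 0, 0 } };
    unsigned to_idle = 0;

    for (unsigned i = 0; i < burst; i++) {
        size_t k = lc_pick(s, 2);

        if (count_handshake_as_active)
            s[k].active++;
        else
            s[k].inactive++;
        to_idle += (k == 1);
    }
    return to_idle;
}
```

With a burst of 300, burst_to_idle(1, 300, 0) returns 278, so the idle server takes nearly everything. burst_to_idle(1, 300, 1), which counts the pre-established connections as active, returns 150.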

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-07-07 20:30:52 +02:00
Quentin Armitage 3777ed688f ipvs: fix bind to link-local mcast IPv6 address in backup
When using HEAD from
https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/,
the command:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
    --mcast-group ff02::1:81
fails with the error message:
Argument list too long

whereas both:
ipvsadm --start-daemon master --mcast-interface eth0.60 \
    --mcast-group ff02::1:81
and:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
    --mcast-group 224.0.0.81
are successful.

The error message "Argument list too long" isn't helpful. The error occurs
because an IPv6 address is given in backup mode.

The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c,
since it fails to set the interface on the address or the socket before
calling inet6_bind() (via sock->ops->bind), where the test
'if (!sk->sk_bound_dev_if)' failed.

Setting sock->sk->sk_bound_dev_if on the socket before calling
inet6_bind() resolves the issue.
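The underlying rule can be checked from userspace. An IPv6 address with link-local scope, such as the ff02::1:81 multicast group above, only has meaning together with an interface, so the interface must be known before binding to it. This sketch uses the standard IN6_IS_ADDR_LINKLOCAL and IN6_IS_ADDR_MC_LINKLOCAL macros; the function name is hypothetical and it does not reproduce make_receive_sock().

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* True if `text` is an IPv6 address with link-local scope (unicast fe80::/10
 * or multicast ff02::/16), i.e. one that needs an interface before bind().
 * Returns false for other addresses and for text that is not IPv6. */
static bool ipv6_needs_interface(const char *text)
{
    struct in6_addr a;

    if (inet_pton(AF_INET6, text, &a) != 1)
        return false;
    return IN6_IS_ADDR_LINKLOCAL(&a) || IN6_IS_ADDR_MC_LINKLOCAL(&a);
}
```

This is why the IPv4 group 224.0.0.81 worked in backup mode while ff02::1:81 needed the bound device to be set first.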

Fixes: d33288172e ("ipvs: add more mcast parameters for the sync daemon")
Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-07-07 20:21:32 +02:00
Thomas Gleixner f3438bc781 timers, net/ipv4/inet: Initialize connection request timers as pinned
Pinned timers must carry the pinned attribute in the timer structure
itself, so convert the code to the new API.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.617891430@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:06 +02:00
David S. Miller a90a6e55f3 One more set of new features:
  * beacon report (for radio measurement) support in cfg80211/mac80211
  * hwsim: allow wmediumd in namespaces
  * mac80211: extend 160MHz workaround to CSA IEs
  * mesh: properly encrypt group-addressed privacy action frames
  * mesh: allow setting peer AID
  * first steps for MU-MIMO monitor mode
  * along with various other cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXfSCuAAoJEGt7eEactAAdaugP/ilrcELQRsIN5ZXCAKYZuwXV
 T01JPwgOaWL9ILu7h1SfG/+j9kzMnyk4WmRpeoj2FGNcyfG2AvULWSLpQJ2abwgQ
 8o/emuLinQwRENevaMUTRSOE0HkXoFPCbbq37+a2i6bAv1QSYY3A0xvWpcU5fZ4D
 7CKYDYPBAdMXYwEwy1g4nYWfDAYqS4rthr3l3rS1Cy7Q2T1ZlMlD90GjD7oeQAEw
 orKulhkkDSzvxfvZCYTzXmUoBQE8sNXGDD+OFsJyowyt+ugM/xan+2tmhCaHSnda
 HpdCS2aRj779UBn9cOfELjffTNpS++PM6KFd8ZDaPcJSMginn/BAHTOeNfNUfL0Q
 +Enu59I82qMDzbG2z1Qezzjv7OTzyEvyvYzNbLOqljTBSBklDa3rHwhyk+g1oVBH
 +4xX1Vk5QBLde+Q0NS0gTkcqOQK8KT5+HEqiUfgLSNDETN0lSGsKbtvMfU/ikE1Y
 aLRkTp7nzUd03qjIFLS6RMf7JjucWWzH1ZXTHvbpDFAG7riOhYRD3Sw+0e7madTd
 +HXjH9dnOGnGPDL+FyDwtW6iclYwNjcIPQiNdOjwWfMA2Wmr7iq+aFUptRCwQTHB
 WtgJ3f8OHax2JXcm4grYfxELZip5vbWJJHUC84Drvmzw3X7FRITf+OEWjdNOzsRD
 Fc7w5ceThh9Id1BvcH2+
 =R1qP
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
One more set of new features:
 * beacon report (for radio measurement) support in cfg80211/mac80211
 * hwsim: allow wmediumd in namespaces
 * mac80211: extend 160MHz workaround to CSA IEs
 * mesh: properly encrypt group-addressed privacy action frames
 * mesh: allow setting peer AID
 * first steps for MU-MIMO monitor mode
 * along with various other cleanups and improvements
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 22:32:15 -07:00
James Morris d011a4d861 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/selinux into next 2016-07-07 10:15:34 +10:00
David S. Miller 30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
David S. Miller ae3e4562e2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next,
they are:

1) Don't use userspace datatypes in bridge netfilter code, from
   Tobin Harding.

2) Iterate only once over the expectation table when removing the
   helper module, instead of once per-netns, from Florian Westphal.

3) Extra sanitization in xt_hook_ops_alloc() to return an error in case
   we ever pass zero hooks.

4) Handle NFPROTO_INET from the logging core infrastructure, from
   Liping Zhang.

5) Autoload loggers when the TRACE target is used from rules; this doesn't
   change the behaviour if the user has already selected nfnetlink_log
   as the preferred way to print tracing logs, also from Liping Zhang.

6) Create conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging
   fields by cache lines; this increases the size of each entry by 11%.
   From Florian Westphal.

7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.

8) Remove useless defensive check in nf_logger_find_get() from Shivani
   Bhardwaj.

9) Remove the zone extension and place the zone in the conntrack object
   instead, since it is always included in the hashing and we expect more
   intensive use of zones now that containers are in place. Also from
   Florian Westphal.

10) Owner match now works from any namespace, from Eric Bierdeman.

11) Make sure we only reply with TCP reset to TCP traffic from
    nf_reject_ipv4, patch from Liping Zhang.

12) Introduce --nflog-size to indicate the number of network packet bytes
    that are copied to userspace via the log message, from Vishwanath Pai.
    This obsoletes --nflog-range, which was designed to achieve this but
    has never worked.

13) Introduce generic macros for nf_tables object generation masks.

14) Use generation mask in table, chain and set objects in nf_tables.
    This fixes interference between the ongoing preparation phase of
    the commit protocol and object listings running at the same time.
    This update is introduced in three patches, one per object.

15) Check if the object is active in the next generation for element
    deactivation in the rbtree implementation, given that deactivation
    happens from the commit phase path, we have to observe the future
    status of the object.

16) Support for deletion of just added elements in the hash set type.

17) Allow resizing the hashtable from a /proc entry, not only from the
    obscure /sys entry that maps to the module parameter, from Florian
    Westphal.

18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
    anymore since we tear down the ruleset whenever the netdevice
    goes away.

19) Support for matching inverted set lookups, from Arturo Borrero.

20) Simplify the iptables_mangle_hook() by removing a superfluous
    extra branch.

21) Introduce ether_addr_equal_masked() and use it from the netfilter
    codebase, from Joe Perches.

22) Remove references to "Use netfilter MARK value as routing key"
    from the Netfilter Kconfig description, given that this toggle
    has not existed for 10 years, from Moritz Sichert.

23) Introduce generic NF_INVF() and use it from the xtables codebase,
    from Joe Perches.

24) Setting the logger to NONE via /proc was not working unless explicit
    nul-termination was included in the string. This fix seems to
    leave the former behaviour in place, so we don't break backward
    compatibility.
====================
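Items 21 and 23 introduce two small helpers. The sketch below re-creates their general shape in userspace for illustration; it is not copied from the kernel headers, and struct match_info and INV_SRCIP are hypothetical. NF_INVF() XORs a match result with a rule's inversion flag, and a masked MAC comparison only considers the bits set in the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical match info: only the inversion flags matter here. */
struct match_info {
    uint8_t invflags;
};
#define INV_SRCIP 0x01   /* hypothetical "! --source" inversion flag */

/* Flip `boolean` (expected to be 0 or 1) when `flag` is set in invflags. */
#define NF_INVF(ptr, flag, boolean) \
    ((boolean) ^ !!((ptr)->invflags & (flag)))

/* True if two 6-byte MAC addresses agree on every bit that is set in mask. */
static bool ether_addr_equal_masked(const uint8_t *addr1, const uint8_t *addr2,
                                    const uint8_t *mask)
{
    for (int i = 0; i < 6; i++)
        if ((addr1[i] ^ addr2[i]) & mask[i])
            return false;
    return true;
}
```

Centralising the XOR in one macro is what lets each match write its condition once and have "!" in the rule handled uniformly.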

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 09:15:15 -07:00
Sven Eckelmann d1fe176ca5 batman-adv: Fix speedy join in gateway client mode
Speedy join only works when the received packet is either a broadcast or a
4addr unicast packet. Thus packets converted from broadcast to unicast via
the gateway handling code have to be converted to 4addr packets to allow
the receiving gateway server to add the sender address as a temporary entry
to the translation table.

Not doing so will make the batman-adv gateway server drop the DHCP response
in many situations because it doesn't yet have the TT entry for the
destination of the DHCP response.

Fixes: 371351731e ("batman-adv: change interface_rx to get orig node")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-06 16:03:40 +02:00
Masashi Honma 7d27a0ba7a cfg80211: Add mesh peer AID setting API
Previously, mesh power management only worked with the kernel MPM.
Because a user space MPM did not report the mesh peer AID to the kernel,
the kernel could not identify the peer's bit in the TIM element. So this
patch adds a mesh peer AID setting API.
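For context, here is a sketch of how an AID selects a bit in a TIM element, following the IEEE 802.11 encoding. Bits 1-7 of the Bitmap Control field hold N1/2, the Partial Virtual Bitmap starts at octet N1 of the full traffic indication bitmap, and AID n maps to bit n % 8 of octet n / 8. The function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the TIM advertises buffered traffic for `aid`. `pvb` is the
 * Partial Virtual Bitmap of `pvb_len` octets. Bit 0 of bitmap_ctrl (the
 * group-addressed traffic indicator) is ignored here. */
static bool tim_has_traffic(uint8_t bitmap_ctrl, const uint8_t *pvb,
                            size_t pvb_len, uint16_t aid)
{
    size_t n1 = bitmap_ctrl & 0xfe;   /* (N1/2) << 1 == N1 */
    size_t octet = aid / 8;

    if (octet < n1 || octet >= n1 + pvb_len)
        return false;                 /* outside the transmitted window: bit is 0 */
    return pvb[octet - n1] & (1u << (aid % 8));
}
```

Without knowing the peer's AID, there is no way to pick the right bit, which is the gap the new API closes for user space MPM.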

Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 15:04:52 +02:00
Johannes Berg 92b3a28a2b mac80211: parse wide bandwidth channel switch IE with workaround
Continuing the workaround implemented in commit 23665aaf91
("mac80211: Interoperability workaround for 80+80 and 160 MHz channels"),
use the same code to parse the Wide Bandwidth Channel Switch element
by converting it to a VHT Operation element, since the spec also just
refers to that for parsing semantics, particularly with the workaround.

While at it, remove some dead code - the IEEE80211_STA_DISABLE_40MHZ
flag can never be set at this point since it's checked earlier and the
wide_bw_chansw_ie pointer is set to NULL if it's set.
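Here is a sketch of the interoperability encoding that the workaround decodes, assuming the usual reading. A VHT Operation channel width of 1 with a non-zero second center frequency segment index (CCFS1) means 160 MHz when the two segment indices are 8 apart and 80+80 MHz when they are further apart. The enum and function are hypothetical, and other channel width values are not handled.

```c
#include <stdlib.h>

/* Hypothetical result type for this sketch. */
enum vht_width { WIDTH_OTHER, WIDTH_80, WIDTH_160, WIDTH_80P80 };

static enum vht_width vht_oper_width(unsigned chan_width, unsigned ccfs0,
                                     unsigned ccfs1)
{
    if (chan_width != 1)
        return WIDTH_OTHER;           /* 20/40 MHz or deprecated encodings: not handled */
    if (ccfs1 == 0)
        return WIDTH_80;              /* only one segment given: plain 80 MHz */

    int diff = abs((int)ccfs1 - (int)ccfs0);

    if (diff == 8)
        return WIDTH_160;             /* ccfs1 is the center of the 160 MHz channel */
    if (diff > 8)
        return WIDTH_80P80;           /* two separate 80 MHz segments */
    return WIDTH_80;
}
```

For example, segment indices 42 and 50 decode as 160 MHz, while 42 and 155 decode as 80+80 MHz.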

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:55:04 +02:00
Johannes Berg 7d10f6b179 mac80211: report failure to start (partial) scan as scan abort
Rather than reporting the scan as having completed, report it as
being aborted.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:54:38 +02:00
Avraham Stern 7947d3e075 mac80211: Add support for beacon report radio measurement
Add the following to support beacon report radio measurement
with the measurement mode field set to passive or active:
1. Propagate the required scan duration to the device
2. Report the scan start time (in terms of TSF)
3. Report each BSS's detection time (also in terms of TSF)

TSF times refer to the BSS that the interface that requested the
scan is connected to.

Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
[changed ath9k/10k, at76c59x-usb, iwlegacy, wl1251 and wlcore to match
the new API]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:53:19 +02:00
Avraham Stern 1d76250bd3 nl80211: support beacon report scanning
Beacon report radio measurement requires reporting observed BSSs
on the channels specified in the beacon request. If the measurement
mode is set to passive or active, it requires actually performing a
scan (passive or active, accordingly), and reporting the time that
the scan was started and the time each beacon/probe was received
(both in terms of TSF of the BSS of the requesting AP). If the
request mode is table, this information is optional.
In addition, the radio measurement request specifies the channel
dwell time for the measurement.

In order to use scan for beacon report when the mode is active or
passive, add a parameter to scan request that specifies the
channel dwell time, and add scan start time and beacon received time
to scan results information.

Supporting beacon report is required for Multi Band Operation (MBO).

Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:51:31 +02:00
Aviya Erenfeld c6e6a0c8be nl80211: Add API to support VHT MU-MIMO air sniffer
Add an API to support a VHT MU-MIMO air sniffer.
In MU-MIMO there are parallel frames on the air, while the HW
has only one RX.
Add the capability to sniff one of the MU-MIMO parallel frames by
giving the sniffer additional information so it knows which
of the parallel frames it should follow.

Add the attribute NL80211_ATTR_MU_MIMO_GROUP_DATA for specifying
a MU-MIMO group ID in order to monitor packets from that group
using VHT MU-MIMO.
Also add the attribute NL80211_ATTR_MU_MIMO_FOLLOW_ADDR for passing
a MAC address to monitor mode.
That option will be used by the VHT MU-MIMO air sniffer to follow a
station according to its MAC address using VHT MU-MIMO.

Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:46:04 +02:00
Johannes Berg f89e07d4cf mac80211: agg-rx: refuse ADDBA Request with timeout update
The current implementation of handling an ADDBA Request while a session
is already active with the peer is wrong - in case the peer is using
the existing session's dialog token, this should be treated as an update
to the session, which can update the timeout value.

We don't really have a good way of supporting that, so reject it, but
implement the behaviour the spec requires: "Even if the updated
ADDBA Request frame is not accepted, the original Block ACK setup
remains active." (802.11-2012 10.5.4)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-07-06 14:44:14 +02:00
Gregory Greenman 16a910a672 cfg80211: handle failed skb allocation
Handle the case when dev_alloc_skb returns NULL.

Cc: stable@vger.kernel.org
Fixes: 2b67f944f8 ("cfg80211: reuse existing page fragments in A-MSDU rx")
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-07-06 13:52:18 +02:00
Purushottam Kushwaha 6e8ef84222 nl80211: Move ACL parsing later to avoid a possible memory leak
If PBSS is not supported, the acl_data is leaked (when parse_acl_data
has succeeded). Fix this by moving the ACL parsing later.

Cc: stable@vger.kernel.org
Fixes: 34d505193b ("cfg80211: basic support for PBSS network type")
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-07-06 13:09:02 +02:00
David Howells d440a1ce5d rxrpc: Kill off the call hash table
The call hash table is now no longer used as calls are looked up directly
by channel slot on the connection, so kill it off.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 11:23:54 +01:00
David Howells 8496af50eb rxrpc: Use RCU to access a peer's service connection tree
Move to using RCU access to a peer's service connection tree when routing
an incoming packet.  This is done using a seqlock to trigger retrying of
the tree walk if a change happened.

Further, we no longer get a ref on the connection looked up in the
data_ready handler unless we queue the connection's work item - and then
only if the refcount > 0.

Note that I'm avoiding the use of a hash table for service connections
because each service connection is addressed by a 62-bit number
(constructed from epoch and connection ID >> 2) that would allow the client
to engage in bucket stuffing, given knowledge of the hash algorithm.
Peers, however, are hashed as the network address is less controllable by
the client.  The total number of peers will also be limited in a future
commit.
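The 62-bit identifier mentioned above can be sketched as the 32-bit epoch concatenated with the upper 30 bits of the 32-bit connection ID, since the low two bits of the ID select the call channel. The helper is hypothetical. Per this message, the series walks an rbtree of such identifiers rather than hashing them.

```c
#include <stdint.h>

/* Hypothetical helper: 32 epoch bits followed by the 30 connection bits of
 * the CID (the low 2 bits are the channel number and are dropped). */
static uint64_t service_conn_key(uint32_t epoch, uint32_t cid)
{
    return ((uint64_t)epoch << 30) | (cid >> 2);
}
```

Because the client chooses both the epoch and the connection ID, it controls all 62 bits, which is what would make a hash table keyed this way vulnerable to bucket stuffing.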

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:51:14 +01:00
David Howells 1291e9d108 rxrpc: Move data_ready peer lookup into rxrpc_find_connection()
Move the peer lookup done in input.c by data_ready into
rxrpc_find_connection().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:51:14 +01:00
David Howells e8d70ce177 rxrpc: Prune the contents of the rxrpc_conn_proto struct
Prune the contents of the rxrpc_conn_proto struct.  Most of the fields aren't
used anymore.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:51:14 +01:00
David Howells 001c112249 rxrpc: Maintain an extra ref on a conn for the cache list
Overhaul the usage count accounting for the rxrpc_connection struct to make
it easier to implement RCU access from the data_ready handler.

The problem is that currently we're using a lock to prevent the garbage
collector from trying to clean up a connection that we're contemplating
unidling.  We could just stick incoming packets on the connection we find,
but we've then got a problem that we may race when dispatching a work item
to process it as we need to give that a ref to prevent the rxrpc_connection
struct from disappearing in the meantime.

Further, incoming packets may get discarded if attached to an
rxrpc_connection struct that is going away.  Whilst this is not a total
disaster - the client will presumably resend - it would delay processing of
the call.  This would affect the AFS client filesystem's service manager
operation.

To this end:

 (1) We now maintain an extra count on the connection usage count whilst it
     is on the connection list.  This means it is not in use when its
     refcount is 1.

 (2) When trying to reuse an old connection, we only increment the refcount
     if it is greater than 0.  If it is 0, we replace it in the tree with a
     new candidate connection.

 (3) Two connection flags are added to indicate whether or not a connection
     is in the local's client connection tree (used by sendmsg) or the
     peer's service connection tree (used by data_ready).  This makes sure
     that we don't try and remove a connection if it got replaced.

     The flags are tested under lock with the removal operation to prevent
     the reaper from killing the rxrpc_connection struct whilst someone
     else is trying to effect a replacement.

     This could probably be alleviated by using memory barriers between the
     flag set/test and the rb_tree ops.  The rb_tree op would still need to
     be under the lock, however.

 (4) When trying to reap an old connection, we try to flip the usage count
     from 1 to 0.  If it's not 1 at that point, then it must've come back
     to life temporarily and we ignore it.
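The scheme in points (1), (2) and (4) can be sketched with C11 atomics. The names are hypothetical and this is not the rxrpc code. A lookup only takes a reference if the count is non-zero, and the reaper only reclaims a connection whose count it can flip from exactly 1 to 0.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection: the list holds one ref, so 1 means "cached but
 * idle" and 0 means "being reaped". */
struct conn {
    atomic_int usage;
};

/* Lookup path: take a ref only if the connection is not already dead. */
static bool conn_get_not_zero(struct conn *c)
{
    int old = atomic_load(&c->usage);

    while (old != 0)
        if (atomic_compare_exchange_weak(&c->usage, &old, old + 1))
            return true;   /* on failure, `old` is reloaded and rechecked */
    return false;          /* dead: caller replaces it with a new connection */
}

/* Reaper: only an idle connection (usage == 1) may be flipped to 0. */
static bool conn_try_reap(struct conn *c)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&c->usage, &expected, 0);
}

/* Drop a ref taken by conn_get_not_zero(). */
static void conn_put(struct conn *c)
{
    atomic_fetch_sub(&c->usage, 1);
}
```

If the reaper's 1->0 exchange fails, someone revived the connection in the meantime, so it is simply skipped, as point (4) describes.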

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:50:04 +01:00
David Howells d991b4a32f rxrpc: Move peer lookup from call-accept to new-incoming-conn
Move the lookup of a peer from a call that's being accepted into the
function that creates a new incoming connection.  This will allow us to
avoid incrementing the peer's usage count in some cases in future.

Note that I haven't bothered to integrate rxrpc_get_addr_from_skb() with
rxrpc_extract_addr_from_skb() as I'm going to delete the former in the very
near future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:49:57 +01:00
David Howells 7877a4a4bd rxrpc: Split service connection code out into its own file
Split the service-specific connection code out into its own file.  The
client-specific code has already been split out.  This will leave just the
common code in the original file.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:49:35 +01:00
David Howells c6d2b8d764 rxrpc: Split client connection code out into its own file
Split the client-specific connection code out into its own file.  It will
behave somewhat differently from the service-specific connection code, so
it makes sense to separate them.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:52 +01:00
David Howells a1399f8bb0 rxrpc: Call channels should have separate call number spaces
Each channel on a connection has a separate, independent number space from
which to allocate callNumber values.  It is entirely possible, for example,
to have a connection with four active calls, each with call number 1.

Note that the callNumber values for any particular channel don't have to
start at 1, but they are supposed to increment monotonically for that
channel from a client's perspective and may not be reused once the call
number is transmitted (until the epoch cycles all the way back round).

Currently, however, call numbers are allocated on a per-connection basis
and, further, are held in an rb-tree.  The rb-tree is redundant as the four
channel pointers in the rxrpc_connection struct are entirely capable of
pointing to all the calls currently in progress on a connection.

To this end, make the following changes:

 (1) Handle call number allocation independently per channel.

 (2) Get rid of the conn->calls rb-tree.  This is overkill as a connection
     may have a maximum of four calls in progress at any one time.  Use the
     pointers in the channels[] array instead, indexed by the channel
     number from the packet.

 (3) For each channel, save the result of the last call that was in
     progress on that channel in conn->channels[] so that the final ACK or
     ABORT packet can be replayed if necessary.  Any call earlier than that
     is just ignored.  If we've seen the next call number in a packet, the
     last one is most definitely defunct.

 (4) When generating a RESPONSE packet for a connection, the call number
     counter for each channel must be included in it.

 (5) When parsing a RESPONSE packet for a connection, the call number
     counters contained therein should be used to set the minimum expected
     call numbers on each channel.
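A minimal C sketch of points (1) and (2), assuming a simplified connection struct. `struct sketch_conn`, `sketch_next_call_number()` and `sketch_find_call()` are hypothetical names, not the rxrpc code: each channel keeps its own counter, and lookup indexes the channel array instead of walking an rb-tree.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAXCALLS 4 /* a connection carries at most four channels */

/* Hypothetical, simplified connection: one call-number counter per channel. */
struct sketch_conn {
	uint32_t call_counter[SKETCH_MAXCALLS]; /* last call number issued per channel */
	void *call[SKETCH_MAXCALLS];            /* active call per channel, or NULL */
};

/* Allocate the next call number on one channel, independently of the others.
 * Returns 0 (used here as a sketch-only "no number" value) if the channel is
 * invalid or its number space is exhausted, in which case the connection
 * would have to be retired rather than wrapped. */
static uint32_t sketch_next_call_number(struct sketch_conn *conn, unsigned int chan)
{
	if (chan >= SKETCH_MAXCALLS || conn->call_counter[chan] == UINT32_MAX)
		return 0;
	return ++conn->call_counter[chan];
}

/* Find the call for an incoming packet by its channel number, not via a tree. */
static void *sketch_find_call(const struct sketch_conn *conn, unsigned int chan)
{
	return chan < SKETCH_MAXCALLS ? conn->call[chan] : NULL;
}
```

With this layout, four active calls can all carry call number 1, one per channel, exactly as described above.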

To do in future commits:

 (1) Replay terminal packets based on the last call stored in
     conn->channels[].

 (2) Connections should be retired before the callNumber space on any
     channel runs out.

 (3) A server is expected to disregard or reject any new incoming call that
     has a call number less than the current call number counter.  The call
     number counter for that channel must be advanced to the new call
     number.

     Note that the server cannot just require that the next call that it
     sees on a channel be exactly the call number counter + 1 because then
     there's a scenario that could cause a problem: The client transmits a
     packet to initiate a connection, the network goes out, the server
     sends an ACK (which gets lost), the client sends an ABORT (which also
     gets lost); the network then reconnects, the client then reuses the
     call number for the next call (it doesn't know the server already saw
     the call number), but the server thinks it already has the first
     packet of this call (it doesn't know that the client doesn't know that
     it saw the call number the first time).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:52 +01:00
David Howells 30b515f4d1 rxrpc: Access socket accept queue under right lock
The socket's accept queue (socket->acceptq) should be accessed under
socket->call_lock, not under the connection lock.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells dee46364ce rxrpc: Add RCU destruction for connections and calls
Add RCU destruction for connections and calls as the RCU lookup from the
transport socket data_ready handler is going to come along shortly.

Whilst we're at it, move the cleanup workqueue flushing and RCU barrierage
into the destruction code for the objects that need it (locals and
connections) and add the extra RCU barrier required for connection cleanup.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells e653cfe49c rxrpc: Release a call's connection ref on call disconnection
When a call is disconnected, clear the call's pointer to the connection and
release the associated ref on that connection.  This means that the call no
longer pins the connection and the connection can be discarded even before
the call is.

As the code currently stands, the call struct is effectively pinned by
userspace until userspace has enacted a recvmsg() to retrieve the final
call state, as sk_buffs on the receive queue pin the call to which they're
related because:

 (1) The rxrpc_call struct contains the userspace ID that recvmsg() has to
     include in the control message buffer to indicate which call is being
     referred to.  This ID must remain valid until the terminal packet is
     completely read and must be invalidated immediately at that point as
     userspace is entitled to immediately reuse it.

 (2) The final ACK to the reply to a client call isn't sent until the last
     data packet is entirely read (it's probably worth altering this in
     future to send the ACK as soon as all the data has been received).

This change requires a bit of rearrangement to make sure that the call
isn't going to try and access the connection again after protocol
completion:

 (1) Delete the error link earlier when we're releasing the call.  Possibly
     network errors should be distributed via connections at the cost of
     adding in an access to the rxrpc_connection struct.

 (2) Remove the call from the connection's call tree before disconnecting
     the call.  The call tree needs to be removed anyway and incoming
     packets delivered by channel pointer instead.

 (3) The release call event should be considered last after all other
     events have been processed so that we don't need access to the
     connection again.

 (4) Move the channel_lock taking from rxrpc_release_call() to
     rxrpc_disconnect_call() where it will be required in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells d1e858c5a3 rxrpc: Fix handling of connection failure in client call creation
If rxrpc_connect_call() fails during the creation of a client connection,
there are two bugs that we can hit that need fixing:

 (1) The call state should be moved to RXRPC_CALL_DEAD before the call
     cleanup phase is invoked.  If not, this can cause an assertion failure
     later.

 (2) call->link should be reinitialised after being deleted in
     rxrpc_new_client_call() - which otherwise leads to a failure later
     when the call cleanup attempts to delete the link again.
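Why point (2) matters can be shown with a minimal userspace list modelled on the kernel's list_head; the `sketch_` names are hypothetical. The kernel's plain list_del() poisons the node's pointers, so a second delete faults, whereas re-initialising the node after unlinking (as list_del_init() does) makes a repeated delete a harmless no-op.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list in the style of the kernel's list_head. */
struct sketch_node {
	struct sketch_node *next, *prev;
};

static void sketch_list_init(struct sketch_node *n)
{
	n->next = n->prev = n;
}

static bool sketch_list_empty(const struct sketch_node *n)
{
	return n->next == n;
}

static void sketch_list_add(struct sketch_node *n, struct sketch_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n, then point it back at itself so that a later delete of the
 * same node only touches n and leaves the list intact. */
static void sketch_list_del_init(struct sketch_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sketch_list_init(n);
}
```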

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells 2c4579e4b1 rxrpc: Move usage count getting into rxrpc_queue_conn()
Rather than calling rxrpc_get_connection() manually before calling
rxrpc_queue_conn(), do it inside the queue wrapper.

This allows us to do some important fixes:

 (1) If the usage count is 0, do nothing.  This prevents connections from
     being reanimated once they're dead.

 (2) If rxrpc_queue_work() fails because the work item is already queued,
     retract the usage count increment which would otherwise be lost.

 (3) Don't take a ref on the connection in the work function.  By passing
     the ref through the work item, this is unnecessary.  Doing it in the
     work function is too late anyway.  Previously, connection-directed
     packets held a ref on the connection, but that's not really the best
     idea.

And another useful change:

 (*) We don't need to take a refcount on the connection in the data_ready
     handler unless we invoke the connection's work item.  We're using RCU
     there so that's otherwise redundant.
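Fixes (1) to (3) amount to a "get unless zero, queue, retract on failure" pattern. The sketch below assumes a hypothetical `struct sketch_qconn` in which an atomic flag stands in for the workqueue's pending bit; it is an illustration of the pattern, not the rxrpc_queue_conn() code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection: usage count plus a flag standing in for the
 * workqueue "already pending" state. */
struct sketch_qconn {
	atomic_int usage;
	atomic_bool queued;
};

/* Increment usage unless it is zero, so a dead object is never reanimated. */
static bool sketch_get_unless_zero(atomic_int *usage)
{
	int old = atomic_load(usage);

	while (old != 0) {
		/* On failure, old is reloaded and the zero check is repeated. */
		if (atomic_compare_exchange_weak(usage, &old, old + 1))
			return true;
	}
	return false;
}

/* Stand-in for queue_work(): returns false if the item was already queued. */
static bool sketch_queue_work(struct sketch_qconn *conn)
{
	return !atomic_exchange(&conn->queued, true);
}

/* Take a ref on behalf of the work item; give it back if queueing failed. */
static bool sketch_queue_conn(struct sketch_qconn *conn)
{
	if (!sketch_get_unless_zero(&conn->usage))
		return false;                      /* (1) dead: do nothing */
	if (!sketch_queue_work(conn)) {
		atomic_fetch_sub(&conn->usage, 1); /* (2) already queued: retract */
		return false;
	}
	return true;                               /* (3) work item now owns the ref */
}
```

The retract in (2) cannot drop the count to zero here, because the already-queued work item still holds its own reference.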

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells eb9b9d2275 rxrpc: Check that the client conns cache is empty before module removal
Check that the client conns cache is empty before module removal and bug if
not, listing any offending connections that are still present.  Unfortunately,
if there are connections still around, then the transport socket is still
unexpectedly open and active, so we can't just unallocate the connections.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells bba304db34 rxrpc: Turn connection #defines into enums and put outside struct def
Turn the connection event and state #define lists into enums and move them
outside of the struct definition.

Whilst we're at it, change _SERVER to _SERVICE in those identifiers and add
EV_ into the event name to distinguish them from flags and states.

Also add a symbol indicating the number of states and use that in the state
text array.
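The "number of states" symbol pattern looks roughly like this in C. The state names below are hypothetical and only loosely follow the commit's _SERVICE naming; the point is that the trailing enumerator sizes the text array, so adding a state without a name leaves a NULL slot that the lookup can detect.

```c
/* Hypothetical connection states; the trailing enumerator counts them. */
enum sketch_conn_state {
	SKETCH_CONN_UNUSED,
	SKETCH_CONN_CLIENT,
	SKETCH_CONN_SERVICE_UNSECURED,
	SKETCH_CONN_SERVICE_CHALLENGING,
	SKETCH_CONN_SERVICE,
	SKETCH_CONN_REMOTELY_ABORTED,
	SKETCH_CONN_LOCALLY_ABORTED,
	SKETCH_CONN__NR_STATES /* must stay last */
};

/* The array length comes from the enum; designated initialisers tie each
 * name to its state even if the enum is reordered. */
static const char *const sketch_conn_state_names[SKETCH_CONN__NR_STATES] = {
	[SKETCH_CONN_UNUSED]              = "Unused",
	[SKETCH_CONN_CLIENT]              = "Client",
	[SKETCH_CONN_SERVICE_UNSECURED]   = "SvUnsec",
	[SKETCH_CONN_SERVICE_CHALLENGING] = "SvChall",
	[SKETCH_CONN_SERVICE]             = "SvSecure",
	[SKETCH_CONN_REMOTELY_ABORTED]    = "RemAbort",
	[SKETCH_CONN_LOCALLY_ABORTED]     = "LocAbort",
};

/* Return a printable name, or "?" for out-of-range or unnamed states. */
static const char *sketch_conn_state_name(enum sketch_conn_state s)
{
	if ((unsigned int)s >= SKETCH_CONN__NR_STATES || !sketch_conn_state_names[s])
		return "?";
	return sketch_conn_state_names[s];
}
```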

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells 5acbee4648 rxrpc: Provide queuing helper functions
Provide queueing helper functions so that the queueing of local and
connection objects can be fixed later.

The issue is that a ref on the object needs to be passed to the work queue,
but the act of queueing the object may fail because the object is already
queued.  Testing the queuedness of an object beforehand doesn't work
because there can be a race with someone else trying to queue it.  What
will have to be done is to adjust the refcount depending on the result of
the queue operation.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:05 +01:00
Herbert Xu a263629da5 rxrpc: Avoid using stack memory in SG lists in rxkad
rxkad uses stack memory in SG lists which would not work if stacks were
allocated from vmalloc memory.  In fact, in most cases this isn't even
necessary as the stack memory ends up getting copied over to kmalloc
memory.

This patch eliminates all the unnecessary stack memory uses by supplying
the final destination directly to the crypto API.  In two instances where a
temporary buffer is actually needed, we also switch to using a scratch area in
the rxrpc_call struct (only one DATA packet will be being secured or
verified at a time).

Finally, there is no need to split a split-page buffer into two SG entries
so code dealing with that has been removed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:05 +01:00
David Howells 689f4c646d rxrpc: Check the source of a packet to a client conn
When looking up a client connection to which to route a packet, we need to
check that the packet came from the correct source so that a peer can't try
to muck around with another peer's connection.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:05 +01:00
David Howells 88b99d0b7a rxrpc: Fix some sparse errors
Fix the following sparse errors:

../net/rxrpc/conn_object.c:77:17: warning: incorrect type in assignment (different base types)
../net/rxrpc/conn_object.c:77:17:    expected restricted __be32 [usertype] call_id
../net/rxrpc/conn_object.c:77:17:    got unsigned int [unsigned] [usertype] call_id
../net/rxrpc/conn_object.c:84:21: warning: restricted __be32 degrades to integer
../net/rxrpc/conn_object.c:86:26: warning: restricted __be32 degrades to integer
../net/rxrpc/conn_object.c:357:15: warning: incorrect type in assignment (different base types)
../net/rxrpc/conn_object.c:357:15:    expected restricted __be32 [usertype] epoch
../net/rxrpc/conn_object.c:357:15:    got unsigned int [unsigned] [usertype] epoch
../net/rxrpc/conn_object.c:369:21: warning: restricted __be32 degrades to integer
../net/rxrpc/conn_object.c:371:26: warning: restricted __be32 degrades to integer
../net/rxrpc/conn_object.c:411:21: warning: restricted __be32 degrades to integer
../net/rxrpc/conn_object.c:413:26: warning: restricted __be32 degrades to integer

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:05 +01:00
Thierry Escande 3cc952dbf1 NFC: digital: Abort last command when dep link goes down
With this patch, the Digital Protocol layer aborts the last issued
command when the dep link goes down. That way it does not have to wait
for the driver to reply with a timeout error before sending a new
command (i.e. a start poll command if constant polling is on).

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:26:52 +02:00
Thierry Escande af66df0f53 NFC: digital: Set the command pending flag
There is a flag in the command structure indicating that this command is
pending. It was checked before sending the command, to avoid sending the
same command twice, but it was never actually set. This is now fixed.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:10:34 +02:00
Thierry Escande 82e5795286 NFC: digital: Call pending command callbacks at device unregister
With this patch, when freeing the command queue in the module unregister
function, the callbacks of the commands still queued are called with an
ENODEV error. This gives the command issuer a chance to free any
memory it may have allocated.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:09:47 +02:00
Thierry Escande 3f89fea35f NFC: digital: Rework error handling in DEP_RES response
The Digital Protocol stack used to send a NACK frame regardless of the
error type it received in digital_in_recv_dep_res(). It should actually
only send a NACK frame on CRC or parity check errors, or on any
transmission error if a NACK frame was previously sent. Existing drivers
report these kinds of issues with EIO errors, so this patch restricts
sending NACK frames to EIO errors. All other errors will be reported to
the upper layers.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:08:57 +02:00
Thierry Escande b77693447d NFC: digital: Fix a memory leak in NFC-F listening mode
When configured as a target listening for a SENSF_REQ poll command, an
nfcid2 array was needlessly allocated, leading to a memory leak. The
nfcid2 is sent by the target in the SENSF_RES reply.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:03:08 +02:00
Thierry Escande 256f3ee3d1 NFC: llcp: Fix 2 memory leaks
Once copied into the sk_buff data area using llcp_add_tlv(), the
allocated TLVs must be freed.

With this patch nfc_llcp_send_connect() and nfc_llcp_send_cc() don't
return immediately on success and now free the allocated TLVs.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:02:06 +02:00
Thierry Escande de9e5aeb4f NFC: llcp: Fix usage of llcp_add_tlv()
In functions using llcp_add_tlv(), an skb pointer could be set to NULL
and then reused afterward.

With this patch, the skb pointer returned by llcp_add_tlv() is ignored
since it can only be the passed skb pointer or NULL when the passed TLV
is NULL. There is also no need to check for the TLV pointer as this is
done by llcp_add_tlv().

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-06 10:02:06 +02:00
Martin KaFai Lau 903ce4abdf ipv6: Fix mem leak in rt6i_pcpu
It was first reported and reproduced by Petr (thanks!) in
https://bugzilla.kernel.org/show_bug.cgi?id=119581

free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy().

However, after fixing a deadlock bug in
commit 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt"),
free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL.

It is worth noting that rt6i_pcpu is protected by table->tb6_lock.

kmemleak somehow did not report it.  We nailed it down by
observing the pcpu entries in /proc/vmallocinfo (first suggested
by Hannes, thanks!).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Fixes: 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Tested-by: Petr Novopashenniy <pety@rusnet.ru>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 14:09:23 -07:00
Vegard Nossum ab58298cf4 net: fix decnet rtnexthop parsing
dn_fib_count_nhs() could enter an infinite loop if nhp->rtnh_len == 0
(i.e. if userspace passes a malformed netlink message).

Let's use the helpers from net/nexthop.h which take care of all this
stuff. We can do exactly the same as e.g. fib_count_nexthops() and
fib_get_nhs() from net/ipv4/fib_semantics.c.

This fixes the softlockup for me.
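A userspace sketch of the bounded walk, assuming a struct with the same layout as struct rtnexthop (redeclared under a hypothetical `sketch_` name so it builds without kernel headers). The validity check mirrors the three conditions of the kernel's rtnh_ok(), and the counting loop follows the shape of fib_count_nexthops(), which to my understanding rejects leftover bytes by returning 0. A zero rtnh_len now ends the loop instead of spinning forever.

```c
/* Same layout as struct rtnexthop in <linux/rtnetlink.h>. */
struct sketch_rtnexthop {
	unsigned short rtnh_len;
	unsigned char  rtnh_flags;
	unsigned char  rtnh_hops;
	int            rtnh_ifindex;
};

#define SKETCH_NLA_ALIGN(len) (((len) + 3) & ~3) /* netlink 4-byte alignment */

/* Room for a header, a length covering at least the header, no overrun. */
static int sketch_rtnh_ok(const struct sketch_rtnexthop *rtnh, int remaining)
{
	return remaining >= (int)sizeof(*rtnh) &&
	       rtnh->rtnh_len >= sizeof(*rtnh) &&
	       rtnh->rtnh_len <= remaining;
}

/* Count nexthops; returns 0 if the attribute is malformed. */
static int sketch_count_nhs(const void *attr, int remaining)
{
	const struct sketch_rtnexthop *rtnh = attr;
	int nhs = 0;

	while (sketch_rtnh_ok(rtnh, remaining)) {
		int totlen = SKETCH_NLA_ALIGN(rtnh->rtnh_len);

		nhs++;
		remaining -= totlen;
		rtnh = (const void *)((const char *)rtnh + totlen);
	}
	/* Leftover bytes imply an invalid nexthop configuration. */
	return remaining > 0 ? 0 : nhs;
}
```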

Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 14:08:47 -07:00
Ido Schimmel 2a4501ae18 neigh: Send a notification when DELAY_PROBE_TIME changes
When the data plane is offloaded the traffic doesn't go through the
networking stack. Therefore, after first resolving a neighbour the NUD
state machine will transition it from REACHABLE to STALE until it's
finally deleted by the garbage collector.

To prevent such situations the offloading driver should notify the NUD
state machine on any neighbours that were recently used. The driver's
polling interval should be set so that the NUD state machine can
function as if the traffic wasn't offloaded.

Currently, there are no in-tree drivers that can report confirmation for
a neighbour, but only 'used' indication. Therefore, the polling interval
should be set according to DELAY_FIRST_PROBE_TIME, as a neighbour will
transition from REACHABLE state to DELAY (instead of STALE) if "a packet
was sent within the last DELAY_FIRST_PROBE_TIME seconds" (RFC 4861).

Send a netevent whenever the DELAY_FIRST_PROBE_TIME changes - either via
netlink or sysctl - so that offloading drivers can correctly set their
polling interval.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 09:06:29 -07:00
Jiri Pirko 18bfb924f0 net: introduce default neigh_construct/destroy ndo calls for L2 upper devices
L2 upper devices need to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and using them in
team, bond, bridge and vlan.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 09:06:28 -07:00
Jiri Pirko 503eebc265 net: add dev arg to ndo_neigh_construct/destroy
As the following patch will allow upper devices to propagate the call down
to lower devices, we need to add a dev argument here and not rely on n->dev.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 09:06:28 -07:00
Pavel Tikhomirov c6ac37d8d8 netfilter: nf_log: fix error on write NONE to logger choice sysctl
It is hard to unbind nf-logger:

  echo NONE > /proc/sys/net/netfilter/nf_log/0
  bash: echo: write error: No such file or directory

  sysctl -w net.netfilter.nf_log.0=NONE
  sysctl: setting key "net.netfilter.nf_log.0": No such file or directory
  net.netfilter.nf_log.0 = NONE

You need to explicitly send '\0', for instance:

  echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0

That seems strange, so fix it using proc_dostring.

Now it works fine:
   modprobe nfnetlink_log
   echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   nfnetlink_log
   echo NONE > /proc/sys/net/netfilter/nf_log/0
   cat /proc/sys/net/netfilter/nf_log/0
   NONE
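The difference comes from how the written buffer is terminated: `echo` appends a newline, so a raw comparison against "NONE" fails. The sketch below copies the write up to the first '\n' or '\0', in the spirit of proc_dostring(), before comparing; `sketch_copy_sysctl_string()` and `sketch_is_none()` are hypothetical helpers, not the nf_log code.

```c
#include <stddef.h>
#include <string.h>

/* Copy a sysctl write into dst, stopping at '\n', '\0', the end of the input
 * or maxlen - 1 bytes, and always NUL-terminate (maxlen must be >= 1). */
static void sketch_copy_sysctl_string(char *dst, size_t maxlen,
				      const char *src, size_t srclen)
{
	size_t len = 0;

	while (len < srclen && len + 1 < maxlen &&
	       src[len] != '\0' && src[len] != '\n') {
		dst[len] = src[len];
		len++;
	}
	dst[len] = '\0';
}

/* Returns 1 if the written value asks to unbind the logger. */
static int sketch_is_none(const char *src, size_t srclen)
{
	char buf[32];

	sketch_copy_sysctl_string(buf, sizeof(buf), src, srclen);
	return strcmp(buf, "NONE") == 0;
}
```

With this normalisation, "NONE\n" from a plain `echo` and "NONE\0\n" from `echo -e` are treated the same.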

v2: add missed error check for proc_dostring

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-05 14:57:57 +02:00
Sven Eckelmann cbef1e1020 batman-adv: Free last_bonding_candidate on release of orig_node
The orig_ifinfo reference counter for last_bonding_candidate in
batadv_orig_node has to be reduced when an originator node is released.
Otherwise the orig_ifinfo is leaked and the reference counter of the netdevice
is not reduced correctly.

Fixes: f3b3d90189 ("batman-adv: add bonding again")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:43:58 +02:00
Sven Eckelmann 15c2ed753c batman-adv: Fix reference leak in batadv_find_router
The replacement of last_bonding_candidate in batadv_orig_node has to be an
atomic operation. Otherwise it is possible that the reference counter of a
batadv_orig_ifinfo which is no longer the last_bonding_candidate is reduced
when the new candidate is added. This can either lead to an invalid memory
access or to reference leaks which make it impossible to remove an
interface which was added to batman-adv.
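A minimal sketch of an atomic replacement, assuming a hypothetical originator struct protected by a pthread mutex (batman-adv itself uses kernel locking and kref). Reading the old pointer and storing the new one happen in one critical section, so the reference that gets dropped always belongs to the object that was actually replaced.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for batadv_orig_ifinfo. */
struct sketch_ifinfo {
	atomic_int refcount;
};

/* Hypothetical originator holding one reference on its bonding candidate. */
struct sketch_orig_node {
	pthread_mutex_t lock; /* serialises candidate swaps */
	struct sketch_ifinfo *last_bonding_candidate;
};

/* The caller must already hold a reference on next (if non-NULL). */
static void sketch_set_candidate(struct sketch_orig_node *orig,
				 struct sketch_ifinfo *next)
{
	struct sketch_ifinfo *old;

	if (next)
		atomic_fetch_add(&next->refcount, 1); /* ref owned by orig_node */

	pthread_mutex_lock(&orig->lock);
	old = orig->last_bonding_candidate;
	orig->last_bonding_candidate = next;
	pthread_mutex_unlock(&orig->lock);

	if (old)
		atomic_fetch_sub(&old->refcount, 1); /* release exactly the replaced one */
}
```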

Fixes: f3b3d90189 ("batman-adv: add bonding again")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:43:52 +02:00
Sven Eckelmann 3db0decf11 batman-adv: Fix non-atomic bla_claim::backbone_gw access
The pointer batadv_bla_claim::backbone_gw can be changed at any time.
Therefore, access to it must be protected to ensure that two functions
accessing the same backbone_gw are actually accessing the same object. This is
especially important when the crc_lock is used or when the backbone_gw of a
claim is exchanged.

Not doing so leads to invalid memory access and/or reference leaks.

Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Fixes: 5a1dd8a477 ("batman-adv: lock crc access in bridge loop avoidance")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:43:21 +02:00
Sven Eckelmann 33fbb1f3db batman-adv: Fix orig_node_vlan leak on orig_node_release
batadv_orig_node_new uses batadv_orig_node_vlan_new to allocate a new
batadv_orig_node_vlan and add it to batadv_orig_node::vlan_list. References
to this list also have to be cleaned up when the batadv_orig_node is removed.

Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:43:10 +02:00
Sven Eckelmann 60154a1e04 batman-adv: Avoid nullptr dereference in dat after vlan_insert_tag
vlan_insert_tag can return NULL on errors. The distributed arp table code
therefore has to check the return value of vlan_insert_tag for NULL before
it can safely operate on this pointer.

Fixes: be1db4f661 ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:40:01 +02:00
Sven Eckelmann 10c78f5854 batman-adv: Avoid nullptr dereference in bla after vlan_insert_tag
vlan_insert_tag can return NULL on errors. The bridge loop avoidance code
therefore has to check the return value of vlan_insert_tag for NULL before
it can safely operate on this pointer.

Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-05 12:40:01 +02:00
David S. Miller b77af26a79 This feature patchset includes the following changes:
 - Cleanup work by Markus Pargmann and Sven Eckelmann (six patches)

 - Initial Netlink support by Matthias Schiffer (two patches)

 - Throughput Meter implementation by Antonio Quartulli, a kernel-space
   traffic generator to estimate link speeds. This feature is useful on
   low-end WiFi APs where running iperf or netperf from userspace
   gives wrong results due to heavy userspace/kernelspace overhead.
   (two patches)

 - API clean-up work by Antonio Quartulli (one patch)

Merge tag 'batadv-next-for-davem-20160704' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature patchset includes the following changes:

 - Cleanup work by Markus Pargmann and Sven Eckelmann (six patches)

 - Initial Netlink support by Matthias Schiffer (two patches)

 - Throughput Meter implementation by Antonio Quartulli, a kernel-space
   traffic generator to estimate link speeds. This feature is useful on
   low-end WiFi APs where running iperf or netperf from userspace
   gives wrong results due to heavy userspace/kernelspace overhead.
   (two patches)

 - API clean-up work by Antonio Quartulli (one patch)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 23:33:59 -07:00
Jiri Pirko 7ce856aaaf mlxsw: spectrum: Add couple of lower device helper functions
Add functions that iterate over lower devices and find port device.
As a dependency, add the netdev_for_each_all_lower_dev and
netdev_for_each_all_lower_dev_rcu macros with the
netdev_all_lower_get_next and netdev_all_lower_get_next_rcu helpers.

Also, add functions to return the mlxsw struct according to the lower device
found, and the mlxsw_port struct with a reference to the lower device.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 18:25:15 -07:00
Vegard Nossum 3dad5424ad RDS: fix rds_tcp_init() error path
If register_pernet_subsys() fails, we shouldn't try to call
unregister_pernet_subsys().
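The fix follows the usual goto-based unwinding pattern: each error label undoes only the steps that succeeded. The functions below are hypothetical stand-ins for register_pernet_subsys(), unregister_pernet_subsys() and a later init step, not the rds_tcp_init() code.

```c
/* Hypothetical state so the unwinding can be observed. */
static int sketch_registered;
static int sketch_unregister_calls;

/* Stand-in for register_pernet_subsys(): 0 on success. */
static int sketch_register_pernet(int fail)
{
	if (fail)
		return -1;
	sketch_registered = 1;
	return 0;
}

/* Stand-in for unregister_pernet_subsys(). */
static void sketch_unregister_pernet(void)
{
	sketch_unregister_calls++;
	sketch_registered = 0;
}

/* Stand-in for a later initialisation step that may fail. */
static int sketch_later_step(int fail)
{
	return fail ? -1 : 0;
}

static int sketch_init(int fail_register, int fail_later)
{
	int ret;

	ret = sketch_register_pernet(fail_register);
	if (ret)
		goto out;		/* nothing registered, nothing to undo */

	ret = sketch_later_step(fail_later);
	if (ret)
		goto out_pernet;	/* undo only the registration */

	return 0;

out_pernet:
	sketch_unregister_pernet();
out:
	return ret;
}
```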

Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Cc: stable@vger.kernel.org
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 16:09:49 -07:00
Daniel Borkmann 13c5c240f7 bpf: add bpf_get_hash_recalc helper
If skb_clear_hash() was invoked due to mangling of relevant headers and
a BPF program needs skb->hash later on, we can add a helper to trigger hash
recalculation via bpf_get_hash_recalc().

The helper will return the newly retrieved hash directly, but later access
can also be done via skb context again through skb->hash directly (inline)
without needing to call the helper once more.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 16:08:40 -07:00
John Fastabend 0967f24459 net: pktgen: support injecting packets for qdisc testing
Add another xmit_mode to pktgen to allow testing xmit functionality
of qdiscs. The new mode "queue_xmit" injects packets at
__dev_queue_xmit() so that qdisc is called.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 16:07:34 -07:00
Jamal Hadi Salim 61cc535de3 net sched actions: skbedit convert to use more modern nla_put_xxx
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 15:11:14 -07:00
Jamal Hadi Salim ff202ee1ed net sched actions: skbedit add support for mod-ing skb pkt_type
Extremely useful for setting packet type to host so I don't
have to modify the dst MAC address using pedit (which requires
that I know the MAC address).

Example usage:
tc filter add dev eth0 parent ffff: protocol ip pref 9 u32 \
match ip src 5.5.5.5/32 \
flowid 1:5 action skbedit ptype host

This will tag all packets incoming from 5.5.5.5 with type
PACKET_HOST

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 15:11:14 -07:00
Jamal Hadi Salim 8b10cab64c net: simplify and make pkt_type_ok() available for other users
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 15:11:13 -07:00
Greg Kroah-Hartman 67417f9c26 Merge 4.7-rc6 into tty-next
We want the tty/serial fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-04 08:17:08 -07:00
Antonio Quartulli 29824a55c0 batman-adv: split routing API data structure in subobjects
The routing API data structure contains several function
pointers that can easily be grouped together based on the
component they work with.

Split the API into subobjects in order to improve definition readability.

At the same time, remove the "bat_" prefix from the API object and
its field names. These are batman-adv private structs and there is no
need to always prepend such a prefix, which only makes function invocations
much longer.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-04 12:37:19 +02:00
Antonio Quartulli 33a3bb4a33 batman-adv: throughput meter implementation
The throughput meter module is a simple, kernel-space replacement for
throughput measurement tools like iperf and netperf. It is intended to
approximate TCP behaviour.

It is invoked through batctl: the protocol is connection oriented, with
cumulative acknowledgment and a dynamic-size sliding window.

The test *can* be interrupted by batctl. A receiver side timeout avoids
waiting indefinitely for sender packets: after one second of inactivity, the
receiver aborts the ongoing test.
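The sender/receiver behaviour described above can be sketched as follows. This is a generic illustration of cumulative ACKs, a growing sliding window and a one-second receiver timeout, assuming byte-based sequence numbers and additive window growth; `struct sketch_tp_sender` and its functions are hypothetical and not the batman-adv tp_meter implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender state; sequence numbers count bytes (assumption). */
struct sketch_tp_sender {
	uint32_t last_acked; /* everything below this is cumulatively acked */
	uint32_t next_seq;   /* next byte to send */
	uint32_t window;     /* current window in bytes, <= max_window */
	uint32_t max_window;
};

/* May another len bytes be sent without exceeding the window? */
static bool sketch_tp_can_send(const struct sketch_tp_sender *s, uint32_t len)
{
	return s->next_seq - s->last_acked + len <= s->window;
}

static void sketch_tp_sent(struct sketch_tp_sender *s, uint32_t len)
{
	s->next_seq += len;
}

/* A cumulative ACK covers every byte below ack_seq. Stale ACKs and ACKs
 * beyond what was sent are ignored; progress grows the window. */
static void sketch_tp_recv_ack(struct sketch_tp_sender *s, uint32_t ack_seq,
			       uint32_t grow)
{
	if ((int32_t)(ack_seq - s->last_acked) <= 0 ||
	    (int32_t)(ack_seq - s->next_seq) > 0)
		return;
	s->last_acked = ack_seq;
	if (grow > s->max_window - s->window)
		s->window = s->max_window;
	else
		s->window += grow;
}

/* Receiver side: abort after one second without sender packets. */
static bool sketch_tp_receiver_timed_out(uint64_t now_ms, uint64_t last_packet_ms)
{
	return now_ms - last_packet_ms >= 1000;
}
```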

Based on a prototype from Edo Monticelli <montik@autistici.org>

Signed-off-by: Antonio Quartulli <antonio.quartulli@open-mesh.com>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-04 12:37:18 +02:00
Antonio Quartulli f50ca95a69 batman-adv: return netdev status in the TX path
Return the proper netdev TX status along the TX path so that the tp_meter
can understand when the queue is full and should stop sending packets.

Signed-off-by: Antonio Quartulli <antonio.quartulli@open-mesh.com>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-04 12:37:18 +02:00
Matthias Schiffer 5da0aef5e9 batman-adv: add netlink command to query generic mesh information files
BATADV_CMD_GET_MESH_INFO is used to query basic information about a
batman-adv softif (name, index and MAC address for both the softif and
the primary hardif; routing algorithm; batman-adv version).

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Reduce the number of changes to
BATADV_CMD_GET_MESH_INFO, add missing kerneldoc, add policy for attributes]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-04 12:37:17 +02:00
Matthias Schiffer 09748a22f4 batman-adv: add generic netlink family for batman-adv
debugfs is currently severely broken virtually everywhere in the kernel
where files are dynamically added and removed (see
http://lkml.iu.edu/hypermail/linux/kernel/1506.1/02196.html for some
details). In addition to that, debugfs is not namespace-aware.

Instead of adding new debugfs entries, the whole infrastructure should be
moved to netlink. This will fix the long-standing problem of large buffers
for debug tables and hard-to-parse text files.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Strip down patch to only add genl family,
add missing kerneldoc]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-07-04 12:37:17 +02:00
Thierry Escande 806bfe31c9 NFC: llcp: Use dynamic debug for hex dump
LLCP skb tx and rx functions now use print_hex_dump_debug() making
these verbose traces controllable using dynamic debug.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-04 12:26:27 +02:00
Thierry Escande 7854a44526 NFC: digital: Add a delay between poll cycles
This replaces the polling work struct with a delayed work struct and adds
a 10 ms delay between two poll cycles. This avoids flooding the device
with 'switch off'/'switch on' commands.

Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-04 12:26:27 +02:00
Denys Vlasenko f86dec94e3 NFC: hci: delete unused nfc_llc_get_rx_head_tail_room()
It used to be EXPORTed, but then EXPORT usage was cleaned up
(in 2012), without noticing that the function has no users at all
(and curiously, never had any users).

Delete it.

While at it, remove non-static "inline" hints on nearby functions:
these hints don't work across compilation units anyway,
and these functions are not used in their .c file, thus they are
never inlined. IOW: "inline" here does not help in any way.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Samuel Ortiz <sameo@linux.intel.com>
CC: Christophe Ricard <christophe.ricard@gmail.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-04 12:14:05 +02:00
Joe Perches c37a2dfa67 netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF
netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with an invflags member.

$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))

Consolidate these macros into a single NF_INVF macro.

Miscellanea:

o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-03 10:55:07 +02:00
Or Gerlitz 08f4b5918b net/devlink: Add E-Switch mode control
Add the commands to set and show the mode of the SRIOV E-Switch. Two modes
are supported:

* legacy: operating in the "old" L2 based mode (DMAC --> VF vport)

* switchdev: the E-Switch is referred to as a whitebox switch configured
using standard tools such as tc, bridge, openvswitch, etc. To allow
working with these tools, a VF representor netdevice is created for each
VF by the E-Switch manager vendor device driver instance (e.g. the PF).

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02 14:40:40 -04:00
David S. Miller 3ea00443f1 This feature patchset includes the following changes:
Merge tag 'batadv-next-for-davem-20160701' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature patchset includes the following changes:

 - two patches with minimal clean up work by Antonio Quartulli and
   Simon Wunderlich

 - eight patches of B.A.T.M.A.N. V, API and documentation clean
   up work, by Antonio Quartulli and Marek Lindner

 - Andrew Lunn fixed the skb priority adoption when forwarding
   fragmented packets (two patches)

 - Multicast optimization support is now enabled for bridges which
   comes with some protocol updates, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 17:05:00 -04:00
Richard Alpe 55e77a3e82 tipc: fix nl compat regression for link statistics
Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
of the link name were left out.

This left the output of tipc-config -ls looking something like:
Link statistics:
dcast-link
1:data0-1.1.2:data0
1:data0-1.1.3:data0

Also, for the record, the patch that introduced this regression
claims "Sending the whole object out can cause a leak", which isn't
very likely as this is a compat layer, where the data we are parsing
is generated by us and we know the string to be NULL-terminated. But
you can of course never be too secure.

Fixes: 5d2be1422e (tipc: fix an infoleak in tipc_nl_compat_link_dump)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:47:38 -04:00
Sowmini Varadhan 11bb62f7c0 RDS: Do not send a pong to an incoming ping with 0 src port
RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originator's src port. However, if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:18 -04:00
Sowmini Varadhan 8315011ad6 RDS: TCP: Simplify reconnect to avoid duelling reconnnect attempts
When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan b04e8554f7 RDS: TCP: Hooks to set up a single connection path
This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan 2da43c4a1b RDS: TCP: make receive path use the rds_conn_path
The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan ea3b1ea539 RDS: TCP: make ->sk_user_data point to a rds_conn_path
The socket callbacks should all operate on a struct rds_conn_path,
in preparation for an MP-capable RDS-TCP.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan afb4164d91 RDS: TCP: Refactor connection destruction to handle multiple paths
A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.

For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.

For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan 02105b2ccd RDS: TCP: Make rds_tcp_connection track the rds_conn_path
The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan 26e4e6bb68 RDS: TCP: Remove dead logic around c_passive in rds-tcp
The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Sowmini Varadhan 226f7a7d97 RDS: Rework path specific indirections
Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:45:17 -04:00
Martin KaFai Lau 4a482f34af cgroup: bpf: Add bpf_skb_in_cgroup_proto
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2.  It is similar to the
feature added in netfilter:
commit c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")

The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.

The bpf verifier is modified to ensure that BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:32:13 -04:00
WANG Cong 82a31b9231 net_sched: fix mirrored packets checksum
Similar to commit 9b368814b3 ("net: fix bridge multicast packet checksum validation"),
we need to fix up the checksum for CHECKSUM_COMPLETE when
pushing an skb on the RX path. Otherwise we get similar splats.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:19:34 -04:00
David S. Miller eb70db8756 packet: Use symmetric hash for PACKET_FANOUT_HASH.
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.

The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.

But there are no users of PACKET_FANOUT_HASH who want an asymmetric
hash; they all want a symmetric one.

Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports. This hash is not
installed into the skb and does not override the normal skb hash, so
this change has no effect whatsoever on the rest of the stack.

Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:07:50 -04:00
Daniel Borkmann 113214be7f bpf: refactor bpf_prog_get and type check into helper
Since bpf_prog_get() and the program type check are used in a couple of places,
refactor them into a small helper function that we can make use of. Since
the non-RO prog->aux part is not used in performance-critical paths and
program destruction via RCU is very unlikely when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check. However, not taking the ref at all (as we are inside the fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:00:47 -04:00
Moritz Sichert f1504307b9 netfilter: Remove references to obsolete CONFIG_IP_ROUTE_FWMARK
This option was removed in commit 47dcf0cb10 ("[NET]: Rethink mark field
in struct flowi").

Signed-off-by: Moritz Sichert <moritz+linux@sichert.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01 16:37:07 +02:00
Joe Perches 4ae89ad924 etherdevice.h & bridge: netfilter: Add and use ether_addr_equal_masked
The masked Ethernet address comparison is duplicated here,
so make it a separate function instead.

Miscellanea:

o Neaten alignment of FWINV macro uses to make it clearer for the reader

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01 16:37:06 +02:00
Pablo Neira Ayuso 468b021b94 netfilter: x_tables: simplify ip{6}table_mangle_hook()
No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01 16:37:02 +02:00
Florian Westphal 9cc1c73ad6 netfilter: conntrack: avoid integer overflow when resizing
The table size computation can overflow, so we might allocate a very small
table when the bucket count is high on a 32-bit platform.

Note: resize is only possible from init_netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01 16:02:33 +02:00
Jason Wang 08294a26e1 net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
This patch introduces a new event, NETDEV_CHANGE_TX_QUEUE_LEN, which
will be triggered when tx_queue_len is changed. It could be used by net
devices that want to do some processing at that time. An example is tun,
which may want to resize its tx array when tx_queue_len is changed.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:32:17 -04:00
Michal Soltys 33ef84a77d net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds the missing
cl->cl_vtoff.

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:03:43 -04:00
Michal Soltys ab12cb4742 net/sched/sch_hfsc.c: go passive after vt update
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.

Otherwise its cl_vt will be one packet behind and its parent's cvtmax might
not be updated either.

One possible side effect is that if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - the comparison made in init_vf() will be affected (same
period).

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:03:43 -04:00
Michal Soltys 2354f056f6 net/sched/sch_hfsc.c: remove leftover dlist and droplist
This is an update to:
commit a09ceb0e08 ("sched: remove qdisc->drop")

That commit removed qdisc->drop, but left behind dlist and droplist,
which no longer serve any meaningful purpose.

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:03:43 -04:00
Michal Soltys d1d0fc5e4c net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()
The condition can only succeed on wrong configurations.

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:03:43 -04:00
Michal Soltys 12d0ad3be9 net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline
Realtime scheduling implemented in HFSC uses the head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).

Thanks to the peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head, storing it temporarily until the real
dequeue happens.

But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().

Adding a peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has a realtime curve defined - fixes that
one corner case. The first hfsc_dequeue() will use that peeked packet,
just as every subsequent hfsc_dequeue() call uses the packet peeked by
the previous call.

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:03:43 -04:00
Eric Dumazet 19689e38ec tcp: md5: use kmalloc() backed scratch areas
Some arches have virtually mapped kernel stacks, or soon will.

tcp_md5_hash_header() uses an automatic variable to copy the tcp header
before mangling th->check and calling the crypto function, which might
be problematic on such arches.

David says that using percpu storage is also problematic on non-SMP
builds.

Just use kmalloc() to allocate scratch areas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 04:02:55 -04:00
David Howells ac5d26836c rxrpc: Fix processing of authenticated/encrypted jumbo packets
When a jumbo packet is being split up and processed, the crypto checksum
for each split-out packet is in the jumbo header and needs placing in the
reconstructed packet header.

When the code was changed to keep the stored copy of the packet header in
host byte order, this reconstruction was missed.

Found with sparse with CF=-D__CHECK_ENDIAN__:

    ../net/rxrpc/input.c:479:33: warning: incorrect type in assignment (different base types)
    ../net/rxrpc/input.c:479:33:    expected unsigned short [unsigned] [usertype] _rsvd
    ../net/rxrpc/input.c:479:33:    got restricted __be16 [addressable] [usertype] _rsvd

Fixes: 0d12f8a402 ("rxrpc: Keep the skb private record of the Rx header in host byte order")
Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-01 08:35:02 +01:00
Shmulik Ladkani fedbb6b4ff ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).

However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).

OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
 - In case of local sockets, sk is same as skb->sk;
 - In case of a udp tunnel, sk is the tunneling socket.

Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 09:02:48 -04:00
Mateusz Bajorski 153380ec4b fib_rules: Added NLM_F_EXCL support to fib_nl_newrule
When adding a rule with the NLM_F_EXCL flag, check whether the same rule
already exists. If it does, exit with -EEXIST.

This is already implemented in iproute2:
        if (cmd == RTM_NEWRULE) {
                req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
                req.r.rtm_type = RTN_UNICAST;
        }

Tested ipv4 and ipv6 with net-next linux on qemu x86

expected behavior after patch:
localhost ~ # ip rule
0:    from all lookup local
32766:    from all lookup main
32767:    from all lookup default
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
RTNETLINK answers: File exists
localhost ~ # ip rule
0:    from all lookup local
1005:    from 10.46.177.97 lookup 104
32766:    from all lookup main
32767:    from all lookup default

There was already a thread regarding this, but I don't see any changes
merged and the problem still occurs.
https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain

Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 08:23:19 -04:00
Andrey Vagin b1ed4c4fa9 tcp: add an ability to dump and restore window parameters
We found that sometimes a restored tcp socket doesn't work.

The reason for this bug is incorrect window parameters: in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).

Data from the send queue is sent only if there is enough space in the
window, so when we restore unacked data, we need to expand the window to
fit this data.

This was done in the first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"

Then Alexey recommended restoring the window parameters instead of
adjusting them according to the data in the send queue. This sounds
reasonable.

rcv_wnd has to be restored, because it was reported to the other side
and the offered window is never shrunk.
One of the reasons why we need to restore snd_wnd was described above.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 08:15:31 -04:00
Nikolay Aleksandrov 1080ab95e3 net: bridge: add support for IGMP/MLD stats and export them via netlink
This patch adds stats support for the IGMP/MLD types currently used by the
bridge. The stats are per-port (plus one stat per-bridge) and per-direction
(RX/TX). The stats are exported over netlink via the new linkxstats API
(RTM_GETSTATS). In order to minimize the performance impact, a new option
is used to enable/disable the stats - multicast_stats_enabled, similar to
the recent vlan stats. Also, in order to avoid multiple IGMP/MLD type
lookups and checks, we make use of the current "igmp" member of the bridge
private skb->cb region to record the type on Rx (both host-generated and
external packets pass through multicast_rcv()). We can do that since the igmp
member was used as a boolean and all the valid IGMP/MLD types are positive
values. The normal bridge fast-path is not affected at all; the only
affected paths are the flooding ones, and since we make use of the IGMP/MLD
type, we can quickly determine if the packet should be counted using
cache-hot data (cb's igmp member). We add counters for:
* IGMP Queries
* IGMP Leaves
* IGMP v1/v2/v3 reports

* MLD Queries
* MLD Leaves
* MLD v1/v2 reports

These are invaluable when monitoring or debugging complex multicast setups
with bridges.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 06:18:24 -04:00
Nikolay Aleksandrov 80e73cc563 net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute,
which allows exporting per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in, instead of having to dump all ports and search
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 06:15:04 -04:00
Michal Kazior 59a7c828d7 mac80211: fix fq lockdep warnings
Some lockdep assertions were not fulfilled and
resulted in a kernel warning/call trace if the driver
used intermediate software queues (e.g. ath10k).

Existing code sequences should've guaranteed safety
but it's always good to be extra careful.

The call trace could look like this:

 [ 237.335805] ------------[ cut here ]------------
 [ 237.335852] WARNING: CPU: 3 PID: 1921 at include/net/fq_impl.h:22 fq_flow_dequeue+0xed/0x140 [mac80211]
 [ 237.335855] Modules linked in: ath10k_pci(E-) ath10k_core(E) ath(E) mac80211(E) cfg80211(E)
 [ 237.335913] CPU: 3 PID: 1921 Comm: rmmod Tainted: G        W   E   4.7.0-rc4-wt-ath+ #1377
 [ 237.335916] Hardware name: Hewlett-Packard HP ProBook 6540b/1722, BIOS 68CDD Ver. F.04 01/27/2010
 [ 237.335918]  00200286 00200286 eff85dac c14151e2 f901574e 00000000 eff85de0 c1081075
 [ 237.335928]  c1ab91f0 00000003 00000781 f901574e 00000016 f8fbabad f8fbabad 00000016
 [ 237.335938]  eb24ff60 00000000 ef3886c0 eff85df4 c10810ba 00000009 00000000 00000000
 [ 237.335948] Call Trace:
 [ 237.335953]  [<c14151e2>] dump_stack+0x76/0xb4
 [ 237.335957]  [<c1081075>] __warn+0xe5/0x100
 [ 237.336002]  [<f8fbabad>] ? fq_flow_dequeue+0xed/0x140 [mac80211]
 [ 237.336046]  [<f8fbabad>] ? fq_flow_dequeue+0xed/0x140 [mac80211]
 [ 237.336053]  [<c10810ba>] warn_slowpath_null+0x2a/0x30
 [ 237.336095]  [<f8fbabad>] fq_flow_dequeue+0xed/0x140 [mac80211]
 [ 237.336137]  [<f8fbc67a>] fq_flow_reset.constprop.56+0x2a/0x90 [mac80211]
 [ 237.336180]  [<f8fbc79a>] fq_reset.constprop.59+0x2a/0x50 [mac80211]
 [ 237.336222]  [<f8fc04e8>] ieee80211_txq_teardown_flows+0x38/0x40 [mac80211]
 [ 237.336258]  [<f8f7c1a4>] ieee80211_unregister_hw+0xe4/0x120 [mac80211]
 [ 237.336275]  [<f933f536>] ath10k_mac_unregister+0x16/0x50 [ath10k_core]
 [ 237.336292]  [<f934592d>] ath10k_core_unregister+0x3d/0x90 [ath10k_core]
 [ 237.336301]  [<f85f8836>] ath10k_pci_remove+0x36/0xa0 [ath10k_pci]
 [ 237.336307]  [<c1470388>] pci_device_remove+0x38/0xb0
 ...

Fixes: 5caa328e38 ("mac80211: implement codel on fair queuing flows")
Fixes: fa962b9212 ("mac80211: implement fair queueing per txq")
Tested-by: Kalle Valo <kvalo@qca.qualcomm.com>
Reported-by: Kalle Valo <kvalo@qca.qualcomm.com>
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-06-30 12:07:44 +02:00
Bob Copeland efc401f49a mac80211: use common cleanup for user/!user_mpm
We've accumulated a couple of different fixes now to mesh_sta_cleanup()
due to the different paths that user_mpm and !user_mpm cases take -- one
fix to flush nexthop paths and one to fix the counting.

The only caller of mesh_plink_deactivate() is mesh_sta_cleanup(), so we
can push the user_mpm checks down into there in order to share more
code.

In doing so, we can remove an extra call to mesh_path_flush_by_nexthop()
and the (unnecessary) call to mesh_accept_plinks_update().  This will
also ensure the powersaving state code gets called in the user_mpm case.

The only cleanup tasks we need to avoid when MPM is in user-space
are sending the peering frames and stopping the plink timer, so wrap
those in the appropriate check.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-06-30 12:06:41 +02:00
Masashi Honma 46f6b06050 mac80211: Encrypt "Group addressed privacy" action frames
Previously, action frames sent to a group address were not encrypted. But
[1] "Table 8-38 Category values" indicates that "Mesh" and "Multihop" category
action frames should be encrypted (Group addressed privacy == yes), and the
encryption key should be the MGTK ([1] 10.13 Group addressed robust management
frame procedures). So this patch modifies the code to comply with the spec.

[1] IEEE Std 802.11-2012

Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-06-30 12:06:20 +02:00
Dan Carpenter 49708e3772 mac80211: silence an uninitialized variable warning
We normally return an uninitialized value, but no one checks it so it
doesn't matter.  Anyway, let's silence the static checker warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-06-30 12:06:19 +02:00
Arnd Bergmann f151d9db4c nl80211: improve nl80211_parse_mesh_config type checking
When building a kernel with W=1, the nl80211.c file causes a number of
warnings, all about the same problem:

net/wireless/nl80211.c: In function 'nl80211_parse_mesh_config':
net/wireless/nl80211.c:5287:103: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5290:96: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5293:124: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5295:148: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5298:106: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5305:116: error: comparison is always false due to limited range of data type [-Werror=type-limits]

The problem is that gcc does not notice that the check is generated
by a macro, so it complains about comparing an unsigned type against 0.

I've tried to come up with a way to rephrase that code in a way that
avoids the warnings and otherwise improves the code as well.

This uses a set of new helper functions that perform the range checking,
and should provide slightly better type safety than the older patch,
at the expense of adding 44 lines to the code. Binary code size is
basically unchanged though (20 bytes added to 126561 bytes .text).

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2016-06-30 12:06:18 +02:00
Daniel Borkmann d2485c4242 bpf: add bpf_skb_change_type helper
This work adds a helper for changing skb->pkt_type in a controlled way.
We only allow a subset of possible values and can extend that in future
should other use cases come up. Doing this as a helper has the advantage
that errors can be handled gracefully and the helper thus kept extensible.

It's a write counterpart to the pkt_type member we can already read from
the struct __sk_buff context. The major use case is to change incoming skbs to
PACKET_HOST in a programmatic way instead of having to recirculate via
redirect(..., BPF_F_INGRESS), for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:54:40 -04:00