Commit Graph

1605 Commits

Author SHA1 Message Date
Jon Maloy 3e5cf362c3 tipc: introduce ioctl for fetching node identity
After the introduction of a 128-bit node identity it may be difficult
for a user to correlate between this identity and the generated node
hash address.

We now try to make this easier by introducing a new ioctl() call for
fetching a node identity by using the hash value as key. This will
be particularly useful when we extend some of the commands in the
'tipc' tool, but we also expect regular user applications to need
this feature.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 11:05:41 -04:00
Jon Maloy 7dbc73e612 tipc: fix bug in function tipc_nl_node_dump_monitor
Commit 36a50a989e ("tipc: fix infinite loop when dumping link monitor
summary") intended to fix a problem with user tool looping when max
number of bearers are enabled.

Unfortunately, the wrong version of the commit was posted, so the
problem was not solved at all.

This commit adds the missing part.

Fixes: 36a50a989e ("tipc: fix infinite loop when dumping link monitor summary")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 11:03:56 -04:00
David S. Miller e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
GhantaKrishnamurthy MohanKrishna 682cd3cf94 tipc: confgiure and apply UDP bearer MTU on running links
Currently, we have option to configure MTU of UDP media. The configured
MTU takes effect on the links going up after that moment. I.e, a user
has to reset bearer to have new value applied across its links. This is
confusing and disturbing on a running cluster.

We now introduce the functionality to change the default UDP bearer MTU
in struct tipc_bearer. Additionally, the links are updated dynamically,
without any need for a reset, when bearer value is changed. We leverage
the existing per-link functionality and the design being symetrical to
the confguration of link tolerance.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna 901271e040 tipc: implement configuration of UDP media MTU
In previous commit, we changed the default emulated MTU for UDP bearers
to 14k.

This commit adds the functionality to set/change the default value
by configuring new MTU for UDP media. UDP bearer(s) have to be disabled
and enabled back for the new MTU to take effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna a4dfa72d0a tipc: set default MTU for UDP media
Currently, all bearers are configured with MTU value same as the
underlying L2 device. However, in case of bearers with media type
UDP, higher throughput is possible with a fixed and higher emulated
MTU value than adapting to the underlying L2 MTU.

In this commit, we introduce a parameter mtu in struct tipc_media
and a default value is set for UDP. A default value of 14k
was determined by experimentation and found to have a higher throughput
than 16k. MTU for UDP bearers are assigned the above set value of
media MTU.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
Tung Nguyen 36a50a989e tipc: fix infinite loop when dumping link monitor summary
When configuring the number of used bearers to MAX_BEARER and issuing
command "tipc link monitor summary", the command enters infinite loop
in user space.

This issue happens because function tipc_nl_node_dump_monitor() returns
the wrong 'prev_bearer' value when all potential monitors have been
scanned.

The correct behavior is to always try to scan all monitors until either
the netlink message is full, in which case we return the bearer identity
of the affected monitor, or we continue through the whole bearer array
until we can return MAX_BEARERS. This solution also caters for the case
where there may be gaps in the bearer array.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 13:48:43 -04:00
Jon Maloy be47e41d77 tipc: fix use-after-free in tipc_nametbl_stop
When we delete a service item in tipc_nametbl_stop() we loop over
all service ranges in the service's RB tree, and for each service
range we loop over its pertaining publications while calling
tipc_service_remove_publ() for each of them.

However, tipc_service_remove_publ() has the side effect that it also
removes the comprising service range item when there are no publications
left. This leads to a "use-after-free" access when the inner loop
continues to the next iteration, since the range item holding the list
we are looping no longer exists.

We fix this by moving the delete of the service range item outside
the said function. Instead, we now let the two functions calling it
test if the list is empty and perform the removal when that is the
case.

Reported-by: syzbot+d64b64afc55660106556@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 13:48:43 -04:00
Eric Dumazet c6404122cb tipc: fix possible crash in __tipc_nl_net_set()
syzbot reported a crash in __tipc_nl_net_set() caused by NULL dereference.

We need to check that both TIPC_NLA_NET_NODEID and TIPC_NLA_NET_NODEID_W1
are present.

We also need to make sure userland provided u64 attributes.

Fixes: d50ccc2d39 ("tipc: add 128-bit node identifier")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:08:18 -04:00
Eric Dumazet ec518f21cb tipc: add policy for TIPC_NLA_NET_ADDR
Before syzbot/KMSAN bites, add the missing policy for TIPC_NLA_NET_ADDR

Fixes: 27c2141672 ("tipc: add net set to new netlink api")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:08:18 -04:00
Jon Maloy 335b929b28 tipc: fix missing initializer in tipc_sendmsg()
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up tipc_node_get_mtu() as an unitilalized variable.

We fix this by intializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:55:38 -04:00
Jon Maloy c3317f4db8 tipc: fix unbalanced reference counter
When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting the tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: commit 218527fe27 ("tipc: replace name table service range
array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:46:10 -04:00
Cong Wang e41f054847 tipc: use the right skb in tipc_sk_fill_sock_diag()
Commit 4b2e6877b8 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
tried to fix the crash but failed, the crash is still 100% reproducible
with it.

In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
we should use NETLINK_CB(cb->skb).sk here.

Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com>
Fixes: 4b2e6877b8 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
Fixes: c30b70deb5 (tipc: implement socket diagnostics for AF_TIPC)
Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 12:34:29 -04:00
GhantaKrishnamurthy MohanKrishna 4b2e6877b8 tipc: Fix namespace violation in tipc_sk_fill_sock_diag
To fetch UID info for socket diagnostics, we determine the
namespace of user context using tipc socket instance. This
may cause namespace violation, as the kernel will remap based
on UID.

We fix this by fetching namespace info using the calling userspace
netlink socket.

Fixes: c30b70deb5 (tipc: implement socket diagnostics for AF_TIPC)
Reported-by: syzbot+326e587eff1074657718@syzkaller.appspotmail.com
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:54:35 -04:00
Jon Maloy b714295abc tipc: Fix missing list initializations in struct tipc_subscription
When an item of struct tipc_subscription is created, we fail to
initialize the two lists aggregated into the struct. This has so far
never been a problem, since the items are just added to a root
object by list_add(), which does not require the addee list to be
pre-initialized. However, syzbot is provoking situations where this
addition fails, whereupon the attempted removal if the item from
the list causes a crash.

This problem seems to always have been around, despite that the code
for creating this object was rewritten in commit 242e82cc95 ("tipc:
collapse subscription creation functions"), which is still in net-next.

We fix this for that commit by initializing the two lists properly.

Fixes: 242e82cc95 ("tipc: collapse subscription creation functions")
Reported-by: syzbot+0bb443b74ce09197e970@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:33:52 -04:00
Jon Maloy 7494cfa6d3 tipc: avoid possible string overflow
gcc points out that the combined length of the fixed-length inputs to
l->name is larger than the destination buffer size:

net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
into a region of size between 26 and 58 [-Werror=format-overflow=]
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
(assuming 75) into a destination of size 60
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of
a link name is:
max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
Since we also need space for a trailing zero we now set MAX_LINK_NAME
to 68.

Just to be on the safe side we also replace the sprintf() call with
snprintf().

Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address
hash values")
Reported-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy 37922ea4a3 tipc: permit overlapping service ranges in name table
With the new RB tree structure for service ranges it becomes possible to
solve an old problem; - we can now allow overlapping service ranges in
the table.

When inserting a new service range to the tree, we use 'lower' as primary
key, and when necessary 'upper' as secondary key.

Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.

These changes guarantee that a well-formed publication/withdrawal item
from a peer node never will be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy f20889f72b tipc: refactor name table translate function
The function tipc_nametbl_translate() function is ugly and hard to
follow. This can be improved somewhat by introducing a stack variable
for holding the publication list to be used and re-ordering the if-
clauses for selection of algorithm.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy 218527fe27 tipc: replace name table service range array with rb tree
The current design of the binding table has an unnecessary memory
consuming and complex data structure. It aggregates the service range
items into an array, which is expanded by a factor two every time it
becomes too small to hold a new item. Furthermore, the arrays never
shrink when the number of ranges diminishes.

We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.

This, along with a few name changes, improves both readability and
volume of the code, as well as reducing memory consumption and hopefully
improving cache hit rate.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Kirill Tkhai 2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Wei Yongjun e1a22d13eb tipc: tipc_node_create() can be static
Fixes the following sparse warning:

net/tipc/node.c:336:18: warning:
 symbol 'tipc_node_create' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:50:22 -04:00
Wei Yongjun c76f2481b6 tipc: fix error handling in tipc_udp_enable()
Release alloced resource before return from the error handling
case in tipc_udp_enable(), otherwise will cause memory leak.

Fixes: 52dfae5c85 ("tipc: obtain node identity from interface by default")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:50:02 -04:00
kbuild test robot da18ab32d7 tipc: tipc_disc_addr_trial_msg() can be static
Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address hash values")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Jon Maloy jon.maloy@ericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 21:21:43 -04:00
Jon Maloy 52dfae5c85 tipc: obtain node identity from interface by default
Selecting and explicitly configuring a TIPC node identity may be
unwanted in some cases.

In this commit we introduce a default setting if the identity has not
been set at the moment the first bearer is enabled. We do this by
using a raw copy of a unique identifier from the used interface: MAC
address in the case of an L2 bearer, IPv4/IPv6 address in the case
of a UDP bearer.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy 25b0b9c4e8 tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.

We do this as follows:
- We don't apply the generated address immediately to the node, but do
  instead initiate a 1 sec trial period to allow other cluster members
  to discover and handle such collisions.

- During the trial period the node periodically sends out a new type
  of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
  to all the other nodes in the cluster.

- When a node is receiving such a message, it must check that the
  presented 32-bit identifier either is unused, or was used by the very
  same peer in a previous session. In both cases it accepts the request
  by not responding to it.

- If it finds that the same node has been up before using a different
  address, it responds with a DSC_TRIAL_FAIL_MSG containing that
  address.

- If it finds that the address has already been taken by some other
  node, it generates a new, unused address and returns it to the
  requester.

- During the trial period the requesting node must always be prepared
  to accept a failure message, i.e., a message where a peer suggests a
  different (or equal)  address to the one tried. In those cases it
  must apply the suggested value as trial address and restart the trial
  period.

This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy d50ccc2d39 tipc: add 128-bit node identifier
We add a 128-bit node identity, as an alternative to the currently used
32-bit node address.

For the sake of compatibility and to minimize message header changes
we retain the existing 32-bit address field. When not set explicitly by
the user, this field will be filled with a hash value generated from the
much longer node identity, and be used as a shorthand value for the
latter.

We permit either the address or the identity to be set by configuration,
but not both, so when the address value is set by a legacy user the
corresponding 128-bit node identity is generated based on the that value.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy 23fd3eace0 tipc: remove direct accesses to own_addr field in struct tipc_net
As a preparation to changing the addressing structure of TIPC we replace
all direct accesses to the tipc_net::own_addr field with the function
dedicated for this, tipc_own_addr().

There are no changes to program logics in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy b89afb116c tipc: allow closest-first lookup algorithm when legacy address is configured
The removal of an internal structure of the node address has an unwanted
side effect.
- Currently, if a user is sending an anycast message with destination
  domain 0, the tipc_namebl_translate() function will use the 'closest-
  first' algorithm to first look for a node local destination, and only
  when no such is found, will it resort to the cluster global 'round-
  robin' lookup algorithm.
- Current users can get around this, and enforce unconditional use of
  global round-robin by indicating a destination as Z.0.0 or Z.C.0.
- This option disappears when we make the node address flat, since the
  lookup algorithm has no way of recognizing this case. So, as long as
  there are node local destinations, the algorithm will always select
  one of those, and there is nothing the sender can do to change this.

We solve this by eliminating the 'closest-first' option, which was never
a good idea anyway, for non-legacy users, but only for those. To
distinguish between legacy users and non-legacy users we introduce a new
flag 'legacy_addr_format' in struct tipc_core, to be set when the user
configures a legacy-style Z.C.N node address. Hence, when a legacy user
indicates a zero lookup domain 'closest-first' is selected, and in all
other cases we use 'round-robin'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy 2026364149 tipc: remove restrictions on node address values
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format, - it is sub-divided
into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.

However, the 'zone' and 'cluster' levels have in reality never been
fully implemented,and never will be. The result of this has been
that the first 20 bits the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.

In the following commits, we will need to be able to connect between
nodes which are using the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to node identity, -it is from now on only a 32-bit integer with no
assumed internal structure.

Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.

The rules for accepting discovery requests/responses from neighboring
nodes now become:

- If the user is using legacy address format on both peers, reception
  of discovery messages is subject to the legacy lookup domain check
  in addition to the cluster id check.

- Otherwise, the discovery request/response is always accepted, provided
  both peers have the same network id.

This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy b39e465e56 tipc: some cleanups in the file discover.c
To facilitate the coming changes in the neighbor discovery functionality
we make some renaming and refactoring of that code. The functional changes
in this commit are trivial, e.g., that we move the message sending call in
tipc_disc_timeout() outside the spinlock protected region.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:17 -04:00
Jon Maloy cb30a63384 tipc: refactor function tipc_enable_bearer()
As a preparation for the next commits we try to reduce the footprint of
the function tipc_enable_bearer(), while hopefully making is simpler to
follow.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:17 -04:00
GhantaKrishnamurthy MohanKrishna 872619d8cf tipc: step sk->sk_drops when rcv buffer is full
Currently when tipc is unable to queue a received message on a
socket, the message is rejected back to the sender with error
TIPC_ERR_OVERLOAD. However, the application on this socket
has no knowledge about these discards.

In this commit, we try to step the sk_drops counter when tipc
is unable to queue a received message. Export sk_drops
using tipc socket diagnostics.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:37 -04:00
GhantaKrishnamurthy MohanKrishna c30b70deb5 tipc: implement socket diagnostics for AF_TIPC
This commit adds socket diagnostics capability for AF_TIPC in netlink
family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).

The following are key design considerations:
- config TIPC_DIAG has default y, like INET_DIAG.
- only requests with flag NLM_F_DUMP is supported (dump all).
- tipc_sock_diag_req message is introduced to send filter parameters.
- the response attributes are of TLV, some nested.

To avoid exposing data structures between diag and tipc modules and
avoid code duplication, the following additions are required:
- export tipc_nl_sk_walk function to reuse socket iterator.
- export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
- create a sock_diag response message in __tipc_add_sock_diag defined
  in diag.c and use the above exported tipc_sk_fill_sock_diag
  to fill response.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:35 -04:00
GhantaKrishnamurthy MohanKrishna dfde331e75 tipc: modify socket iterator for sock_diag
The current socket iterator function tipc_nl_sk_dump, handles socket
locks and calls __tipc_nl_add_sk for each socket.
To reuse this logic in sock_diag implementation, we do minor
modifications to make these functions generic as described below.

In this commit, we add a two new functions __tipc_nl_sk_walk,
__tipc_nl_add_sk_info and modify tipc_nl_sk_dump, __tipc_nl_add_sk
accordingly.

In __tipc_nl_sk_walk we:
1. acquire and release socket locks
2. for each socket, execute the specified callback function

In __tipc_nl_add_sk we:
- Move the netlink attribute insertion to __tipc_nl_add_sk_info.

tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument.

sock_diag will use these generic functions in a later commit.

There is no functional change in this commit.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:29 -04:00
Jon Maloy e50e73e107 tipc: some name changes
We rename some lists and fields in struct publication both to make
the naming more consistent and to better reflect their roles. We
also update the descriptions of those lists.

node_list -> local_publ
cluster_list -> all_publ
pport_list -> binding_sock
ref -> port

There are no functional changes in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:11:46 -04:00
Jon Maloy 935439cc48 tipc: merge two lists in struct publication
The size of struct publication can be reduced further. Membership in
lists 'nodesub_list' and 'local_list' is mutually exlusive, in that
remote publications use the former and local publications the latter.
We replace the two lists with one single, named 'binding_node' which
reflects what it really is.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:11:46 -04:00
Jon Maloy ba765ec637 tipc: remove zone_list member in struct publication
As a further consequence of the previous commits, we can also remove
the member 'zone_list 'in struct name_info and struct publication.
Instead, we now let the member cluster_list take over the role a
container of all publications of a given <type,lower, upper>.
We also remove the counters for the size of those lists, since
they don't serve any purpose.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:11:46 -04:00
Jon Maloy 64a52b26d5 tipc: remove zone publication list in name table
As a consequence of the previous commit we nan now eliminate zone scope
related lists in the name table. We start with name_table::publ_list[3],
which can now be replaced with two lists, one for node scope publications
and one for cluster scope publications.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:11:46 -04:00
Jon Maloy 928df1880e tipc: obsolete TIPC_ZONE_SCOPE
Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
aspects handled the same way, both on the publishing node and on the
receiving nodes.

Despite previous ambitions to the contrary, this is never going to change,
so we take the conseqeunce of this and obsolete TIPC_ZONE_SCOPE and related
macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
remain compatible with users and remote nodes still using ZONE_SCOPE.

Furthermore, the non-formalized scope value 0 has always been permitted
for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
We now permit it even as binding scope, but for compatibility reasons we
choose to not change the value of TIPC_CLUSTER_SCOPE.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 17:11:46 -04:00
Kirill Tkhai afbbc374ab net: Convert tipc_net_ops
TIPC looks concentrated in itself, and other pernet_operations
seem not touching its entities.

tipc_net_ops look pernet-divided, and they should be safe to
be executed in parallel for several net the same time.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13 11:24:56 -04:00
Gustavo A. R. Silva c9efb15a99 tipc: bcast: use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 12:18:00 -05:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Jon Maloy 1b22bcad7e tipc: correct initial value for group congestion flag
In commit 60c2530696 ("tipc: fix race between poll() and
setsockopt()") we introduced a pointer from struct tipc_group to the
'group_is_connected' flag in struct tipc_sock, so that this field can
be checked without dereferencing the group pointer of the latter struct.

The initial value for this flag is correctly set to 'false' when a
group is created, but we miss the case when no group is created at
all, in which case the initial value should be 'true'. This has the
effect that SOCK_RDM/DGRAM sockets sending datagrams never receive
POLLOUT if they request so.

This commit corrects this bug.

Fixes: 60c2530696 ("tipc: fix race between poll() and setsockopt()")
Reported-by: Hoang Le <hoang.h.le@dektek.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:46:03 -05:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Paolo Abeni 26736a08ee tipc: don't call sock_release() in atomic context
syzbot reported a scheduling while atomic issue at netns
destruction time:

BUG: sleeping function called from invalid context at net/core/sock.c:2769
in_atomic(): 1, irqs_disabled(): 0, pid: 85, name: kworker/u4:3
5 locks held by kworker/u4:3/85:
  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<00000000c9792deb>]
process_one_work+0xaaf/0x1af0 kernel/workqueue.c:2084
  #1:  (net_cleanup_work){+.+.}, at: [<00000000adc12e2a>]
process_one_work+0xb01/0x1af0 kernel/workqueue.c:2088
  #2:  (net_sem){++++}, at: [<000000009ccb5669>] cleanup_net+0x23f/0xd20
net/core/net_namespace.c:494
  #3:  (net_mutex){+.+.}, at: [<00000000a92767d9>] cleanup_net+0xa7d/0xd20
net/core/net_namespace.c:496
  #4:  (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>]
spin_lock_bh include/linux/spinlock.h:315 [inline]
  #4:  (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>]
tipc_topsrv_stop+0x231/0x610 net/tipc/topsrv.c:685
CPU: 0 PID: 85 Comm: kworker/u4:3 Not tainted 4.16.0-rc1+ #230
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6128
  __might_sleep+0x95/0x190 kernel/sched/core.c:6081
  lock_sock_nested+0x37/0x110 net/core/sock.c:2769
  lock_sock include/net/sock.h:1463 [inline]
  tipc_release+0x103/0xff0 net/tipc/socket.c:572
  sock_release+0x8d/0x1e0 net/socket.c:594
  tipc_topsrv_stop+0x3c0/0x610 net/tipc/topsrv.c:696
  tipc_exit_net+0x15/0x40 net/tipc/core.c:96
  ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:148
  cleanup_net+0x6ba/0xd20 net/core/net_namespace.c:529
  process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2113
  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
  kthread+0x33c/0x400 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:429

This is caused by tipc_topsrv_stop() releasing the listener socket
with the idr lock held. This changeset addresses the issue moving
the release operation outside such lock.

Reported-and-tested-by: syzbot+749d9d87c294c00ca856@syzkaller.appspotmail.com
Fixes: 0ef897be12 ("tipc: separate topology server listener socket from subcsriber sockets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by:  ///jon
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:38:50 -05:00
Jon Maloy 96c252bf1c tipc: fix bug on error path in tipc_topsrv_kern_subscr()
In commit cc1ea9ffadf7 ("tipc: eliminate struct tipc_subscriber") we
re-introduced an old bug on the error path in the function
tipc_topsrv_kern_subscr(). We now re-introduce the correction too.

Reported-by: syzbot+f62e0f2a0ef578703946@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:38:08 -05:00
Jon Maloy 026321c6d0 tipc: rename tipc_server to tipc_topsrv
We rename struct tipc_server to struct tipc_topsrv. This reflect its now
specialized role as topology server. Accoringly, we change or add function
prefixes to make it clearer which functionality those belong to.

There are no functional changes in this commit.

Acked-by: Ying.Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:34 -05:00
Jon Maloy 0ef897be12 tipc: separate topology server listener socket from subcsriber sockets
We move the listener socket to struct tipc_server and give it its own
work item. This makes it easier to follow the code, and entails some
simplifications in the reception code in subscriber sockets.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:34 -05:00
Jon Maloy 5c45ab24ac tipc: make struct tipc_server private for server.c
In order to narrow the interface and dependencies between the topology
server and the subscription/binding table functionality we move struct
tipc_server inside the file server.c. This requires some code
adaptations in other files, but those are mostly minor.

The most important change is that we have to move the start/stop
functions for the topology server to server.c, where they logically
belong anyway.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:34 -05:00
Jon Maloy da0a75e86a tipc: some prefix changes
Since we now have removed struct tipc_subscriber from the code, and
only struct tipc_subscription remains, there is no longer need for long
and awkward prefixes to distinguish between their pertaining functions.

We now change all tipc_subscrp_* prefixes to tipc_sub_*. This is
a purely cosmetic change.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:34 -05:00
Jon Maloy 242e82cc95 tipc: collapse subscription creation functions
After the previous changes it becomes logical to collapse the two-level
creation of subscription instances into one. We do that here.

We also rename the creation and deletion functions for more consistency.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:34 -05:00
Jon Maloy 8985ecc7c1 tipc: simplify endianness handling in topology subscriber
Because of the requirement for total distribution transparency, users
send subscriptions and receive topology events in their own host format.
It is up to the topology server to determine this format and do the
correct conversions to and from its own host format when needed.

Until now, this has been handled in a rather non-transparent way inside
the topology server and subscriber code, leading to unnecessary
complexity when creating subscriptions and issuing events.

We now improve this situation by adding two new macros, tipc_sub_read()
and tipc_evt_write(). Both those functions calculate the need for
conversion internally before performing their respective operations.
Hence, all handling of such conversions become transparent to the rest
of the code.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:33 -05:00
Jon Maloy 414574a0af tipc: simplify interaction between subscription and topology connection
The message transmission and reception in the topology server is more
generic than is currently necessary. By basing the funtionality on the
fact that we only send items of type struct tipc_event and always
receive items of struct tipc_subcr we can make several simplifications,
and also get rid of some unnecessary dynamic memory allocations.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:33 -05:00
Jon Maloy df79d040dc tipc: eliminate struct tipc_subscriber
It is unnecessary to keep two structures, struct tipc_conn and struct
tipc_subscriber, with a one-to-one relationship and still with different
life cycles. The fact that the two often run in different contexts, and
still may access each other via direct pointers constitutes an additional
hazard, something we have experienced at several occasions, and still
see happening.

We have identified at least two remaining problems that are easier to
fix if we simplify the topology server data structure somewhat.

- When there is a race between a subscription up/down event and a
  timeout event, it is fully possible that the former might be delivered
  after the latter, leading to confusion for the receiver.

- The function tipc_subcrp_timeout() is executing in interrupt context,
  while the following call chain is at least theoretically possible:
  tipc_subscrp_timeout()
    tipc_subscrp_send_event()
      tipc_conn_sendmsg()
        conn_put()
          tipc_conn_kref_release()
            sock_release(sock)

I.e., we end up calling a function that might try to sleep in
interrupt context. To eliminate this, we need to ensure that the
tipc_conn structure and the socket, as well as the subscription
instances, only are deleted in work queue context, i.e., after the
timeout event really has been sent out.

We now remove this unnecessary complexity, by merging data and
functionality of the subscriber structure into struct tipc_conn
and the associated file server.c. We thereafter add a spinlock and
a new 'inactive' state to the subscription structure. Using those,
both problems described above can be easily solved.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:33 -05:00
Jon Maloy c901d26d4a tipc: remove unnecessary function pointers
Interaction between the functionality in server.c and subscr.c is
done via function pointers installed in struct server. This makes
the code harder to follow, and doesn't serve any obvious purpose.

Here, we replace the function pointers with direct function calls.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:33 -05:00
Jon Maloy 27469b7352 tipc: remove redundant code in topology server
The socket handling in the topology server is unnecessarily generic.
It is prepared to handle both SOCK_RDM, SOCK_DGRAM and SOCK_STREAM
type sockets, as well as the only socket type which is really used,
SOCK_SEQPACKET.

We now remove this redundant code to make the code more readable.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:26:33 -05:00
Jon Maloy 37c64cf63b tipc: apply bearer link tolerance on running links
Currently, the default link tolerance set in struct tipc_bearer only
has effect on links going up after that moment. I.e., a user has to
reset all the node's links across that bearer to have the new value
applied. This is too limiting and disturbing on a running cluster to
be useful.

We now change this so that also already existing links are updated
dynamically, without any need for a reset, when the bearer value is
changed. We leverage the already existing per-link functionality
for this to achieve the wanted effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 15:22:24 -05:00
Ying Xue ed4ffdfec2 tipc: Fix missing RTNL lock protection during setting link properties
Currently when user changes link properties, TIPC first checks if
user's command message contains media name or bearer name through
tipc_media_find() or tipc_bearer_find() which is protected by RTNL
lock. But when tipc_nl_compat_link_set() conducts the checking with
the two functions, it doesn't hold RTNL lock at all, as a result,
the following complaints were reported:

audit: type=1400 audit(1514679888.244:9): avc:  denied  { write } for
pid=3194 comm="syzkaller021477" path="socket:[11143]" dev="sockfs"
ino=11143 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
tclass=netlink_generic_socket permissive=1
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>

=============================
WARNING: suspicious RCU usage
4.15.0-rc5+ #152 Not tainted
-----------------------------
net/tipc/bearer.c:177 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syzkaller021477/3194:
  #0:  (cb_lock){++++}, at: [<00000000d20133ea>] genl_rcv+0x19/0x40
net/netlink/genetlink.c:634
  #1:  (genl_mutex){+.+.}, at: [<00000000fcc5d1bc>] genl_lock
net/netlink/genetlink.c:33 [inline]
  #1:  (genl_mutex){+.+.}, at: [<00000000fcc5d1bc>] genl_rcv_msg+0x115/0x140
net/netlink/genetlink.c:622

stack backtrace:
CPU: 1 PID: 3194 Comm: syzkaller021477 Not tainted 4.15.0-rc5+ #152
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
  tipc_bearer_find+0x2b4/0x3b0 net/tipc/bearer.c:177
  tipc_nl_compat_link_set+0x329/0x9f0 net/tipc/netlink_compat.c:729
  __tipc_nl_compat_doit net/tipc/netlink_compat.c:288 [inline]
  tipc_nl_compat_doit+0x15b/0x660 net/tipc/netlink_compat.c:335
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1119 [inline]
  tipc_nl_compat_recv+0x112f/0x18f0 net/tipc/netlink_compat.c:1201
  genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:599
  genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
  netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
  netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
  sock_sendmsg_nosec net/socket.c:636 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:646
  sock_write_iter+0x31a/0x5d0 net/socket.c:915
  call_write_iter include/linux/fs.h:1772 [inline]
  new_sync_write fs/read_write.c:469 [inline]
  __vfs_write+0x684/0x970 fs/read_write.c:482
  vfs_write+0x189/0x510 fs/read_write.c:544
  SYSC_write fs/read_write.c:589 [inline]
  SyS_write+0xef/0x220 fs/read_write.c:581
  do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
  do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
  entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:129

In order to correct the mistake, __tipc_nl_compat_doit() has been
protected by RTNL lock, which means the whole operation of setting
bearer/media properties is under RTNL protection.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reported-by: syzbot <syzbot+6345fd433db009b29413@syzkaller.appspotmail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:33 -05:00
Ying Xue 5631f65dec tipc: Introduce __tipc_nl_net_set
Introduce __tipc_nl_net_set() which doesn't hold RTNL lock.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:33 -05:00
Ying Xue 07ffb22357 tipc: Introduce __tipc_nl_media_set
Introduce __tipc_nl_media_set() which doesn't hold RTNL lock.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:32 -05:00
Ying Xue 93532bb1d4 tipc: Introduce __tipc_nl_bearer_set
Introduce __tipc_nl_bearer_set() which doesn't holding RTNL lock.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:32 -05:00
Ying Xue 45cf7edfbc tipc: Introduce __tipc_nl_bearer_enable
Introduce __tipc_nl_bearer_enable() which doesn't hold RTNL lock.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:32 -05:00
Ying Xue d59d8b77ab tipc: Introduce __tipc_nl_bearer_disable
Introduce __tipc_nl_bearer_disable() which doesn't hold RTNL lock.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:32 -05:00
Ying Xue e5d1a1eec0 tipc: Refactor __tipc_nl_compat_doit
As preparation for adding RTNL to make (*cmd->transcode)() and
(*cmd->transcode)() constantly protected by RTNL lock, we move out of
memory allocations existing between them as many as possible so that
the time of holding RTNL can be minimized in __tipc_nl_compat_doit().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:46:32 -05:00
Denys Vlasenko 9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Linus Torvalds a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Hoang Le 55b3280d1e tipc: fix skb truesize/datasize ratio control
In commit d618d09a68 ("tipc: enforce valid ratio between skb truesize
and contents") we introduced a test for ensuring that the condition
truesize/datasize <= 4 is true for a received buffer. Unfortunately this
test has two problems.

- Because of the integer arithmetics the test
  if (skb->truesize / buf_roundup_len(skb) > 4) will miss all
  ratios [4 < ratio < 5], which was not the intention.
- The buffer returned by skb_copy() inherits skb->truesize of the
  original buffer, which doesn't help the situation at all.

In this commit, we change the ratio condition and replace skb_copy()
with a call to skb_copy_expand() to finally get this right.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-08 15:30:40 -05:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds 1ed2d76e02 Merge branch 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull kern_recvmsg reduction from Al Viro:
 "kernel_recvmsg() is a set_fs()-using wrapper for sock_recvmsg(). In
  all but one case that is not needed - use of ITER_KVEC for ->msg_iter
  takes care of the data and does not care about set_fs(). The only
  exception is svc_udp_recvfrom() where we want cmsg to be store into
  kernel object; everything else can just use sock_recvmsg() and be done
  with that.

  A followup converting svc_udp_recvfrom() away from set_fs() (and
  killing kernel_recvmsg() off) is *NOT* in here - I'd like to hear what
  netdev folks think of the approach proposed in that followup)"

* 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  tipc: switch to sock_recvmsg()
  smc: switch to sock_recvmsg()
  ipvs: switch to sock_recvmsg()
  mISDN: switch to sock_recvmsg()
  drbd: switch to sock_recvmsg()
  lustre lnet_sock_read(): switch to sock_recvmsg()
  cfs2: switch to sock_recvmsg()
  ncpfs: switch to sock_recvmsg()
  dlm: switch to sock_recvmsg()
  svc_recvfrom(): switch to sock_recvmsg()
2018-01-30 18:59:03 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Jon Maloy 60c2530696 tipc: fix race between poll() and setsockopt()
Letting tipc_poll() dereference a socket's pointer to struct tipc_group
entails a race risk, as the group item may be deleted in a concurrent
tipc_sk_join() or tipc_sk_leave() thread.

We now move the 'open' flag in struct tipc_group to struct tipc_sock,
and let the former retain only a pointer to the moved field. This will
eliminate the race risk.

Reported-by: syzbot+799dafde0286795858ac@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:12:21 -05:00
David S. Miller c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00
Jon Maloy e88f2be832 tipc: fix race condition at topology server receive
We have identified a race condition during reception of socket
events and messages in the topology server.

- The function tipc_close_conn() is releasing the corresponding
  struct tipc_subscriber instance without considering that there
  may still be items in the receive work queue. When those are
  scheduled, in the function tipc_receive_from_work(), they are
  using the subscriber pointer stored in struct tipc_conn, without
  first checking if this is valid or not. This will sometimes
  lead to crashes, as the next call of tipc_conn_recvmsg() will
  access the now deleted item.
  We fix this by making the usage of this pointer conditional on
  whether the connection is active or not. I.e., we check the condition
  test_bit(CF_CONNECTED) before making the call tipc_conn_recvmsg().

- Since the two functions may be running on different cores, the
  condition test described above is not enough. tipc_close_conn()
  may come in between and delete the subscriber item after the condition
  test is done, but before tipc_conn_recv_msg() is finished. This
  happens less frequently than the problem described above, but leads
  to the same symptoms.

  We fix this by using the existing sk_callback_lock for mutual
  exclusion in the two functions. In addition, we have to move
  a call to tipc_conn_terminate() outside the mentioned lock to
  avoid deadlock.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16 14:42:41 -05:00
Jon Maloy e9a034456a tipc: fix bug during lookup of multicast destination nodes
In commit 232d07b74a ("tipc: improve groupcast scope handling") we
inadvertently broke non-group multicast transmission when changing the
parameter 'domain' to 'scope' in the function
tipc_nametbl_lookup_dst_nodes(). We missed to make the corresponding
change in the calling function, with the result that the lookup always
fails.

A closer anaysis reveals that this parameter is not needed at all.
Non-group multicast is hard coded to use CLUSTER_SCOPE, and in the
current implementation this will be delivered to all matching
destinations except those which are published with NODE_SCOPE on other
nodes. Since such publications never will be visible on the sending node
anyway, it makes no sense to discriminate by scope at all.

We now remove this parameter altogether.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-15 14:27:13 -05:00
Cong Wang 59b36613e8 tipc: fix a memory leak in tipc_nl_node_get_link()
When tipc_node_find_by_name() fails, the nlmsg is not
freed.

While on it, switch to a goto label to properly
free it.

Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-15 13:45:50 -05:00
Jon Maloy febafc8455 tipc: fix a potental access after delete in tipc_sk_join()
In commit d12d2e12ce "tipc: send out join messages as soon as new
member is discovered") we added a call to the function tipc_group_join()
without considering the case that the preceding tipc_sk_publish() might
have failed, and the group item already deleted.

We fix this by returning from tipc_sk_join() directly after the
failed tipc_sk_publish.

Reported-by: syzbot+e3eeae78ea88b8d6d858@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-15 13:43:51 -05:00
Jon Maloy eb929a91b2 tipc: improve poll() for group member socket
The current criteria for returning POLLOUT from a group member socket is
too simplistic. It basically returns POLLOUT as soon as the group has
external destinations, something obviously leading to a lot of spinning
during destination congestion situations. At the same time, the internal
congestion handling is unnecessarily complex.

We now change this as follows.

- We introduce an 'open' flag in  struct tipc_group. This flag is used
  only to help poll() get the setting of POLLOUT right, and *not* for
  congeston handling as such. This means that a user can choose to
  ignore an  EAGAIN for a destination and go on sending messages to
  other destinations in the group if he wants to.

- The flag is set to false every time we return EAGAIN on a send call.

- The flag is set to true every time any member, i.e., not necessarily
  the member that caused EAGAIN, is removed from the small_win list.

- We remove the group member 'usr_pending' flag. The size of the send
  window and presence in the 'small_win' list is sufficient criteria
  for recognizing congestion.

This solution seems to be a reasonable compromise between 'anycast',
which is normally not waiting for POLLOUT for a specific destination,
and the other three send modes, which are.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy 232d07b74a tipc: improve groupcast scope handling
When a member joins a group, it also indicates a binding scope. This
makes it possible to create both node local groups, invisible to other
nodes, as well as cluster global groups, visible everywhere.

In order to avoid that different members end up having permanently
differing views of group size and memberhip, we must inhibit locally
and globally bound members from joining the same group.

We do this by using the binding scope as an additional separator between
groups. I.e., a member must ignore all membership events from sockets
using a different scope than itself, and all lookups for message
destinations must require an exact match between the message's lookup
scope and the potential target's binding scope.

Apart from making it possible to create local groups using the same
identity on different nodes, a side effect of this is that it now also
becomes possible to create a cluster global group with the same identity
across the same nodes, without interfering with the local groups.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy 8348500f80 tipc: add option to suppress PUBLISH events for pre-existing publications
Currently, when a user is subscribing for binding table publications,
he will receive a PUBLISH event for all already existing matching items
in the binding table.

However, a group socket making a subscriptions doesn't need this initial
status update from the binding table, because it has already scanned it
during the join operation. Worse, the multiplicatory effect of issuing
mutual events for dozens or hundreds group members within a short time
frame put a heavy load on the topology server, with the end result that
scale out operations on a big group tend to take much longer than needed.

We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server
subscriptions, so that this initial avalanche of events is suppressed.
This change, along with the previous commit, significantly improves the
range and speed of group scale out operations.

We keep the new option internal for the tipc driver, at least for now.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy d12d2e12ce tipc: send out join messages as soon as new member is discovered
When a socket is joining a group, we look up in the binding table to
find if there are already other members of the group present. This is
used for being able to return EAGAIN instead of EHOSTUNREACH if the
user proceeds directly to a send attempt.

However, the information in the binding table can be used to directly
set the created member in state MBR_PUBLISHED and send a JOIN message
to the peer, instead of waiting for a topology PUBLISH event to do this.
When there are many members in a group, the propagation time for such
events can be significant, and we can save time during the join
operation if we use the initial lookup result fully.

In this commit, we eliminate the member state MBR_DISCOVERED which has
been the result of the initial lookup, and do instead go directly to
MBR_PUBLISHED, which initiates the setup.

After this change, the tipc_member FSM looks as follows:

     +-----------+
---->| PUBLISHED |-----------------------------------------------+
PUB- +-----------+                                 LEAVE/WITHRAW |
LISH       |JOIN                                                 |
           |     +-------------------------------------------+   |
           |     |                            LEAVE/WITHDRAW |   |
           |     |                +------------+             |   |
           |     |   +----------->|  PENDING   |---------+   |   |
           |     |   |msg/maxactv +-+---+------+  LEAVE/ |   |   |
           |     |   |              |   |       WITHDRAW |   |   |
           |     |   |   +----------+   |                |   |   |
           |     |   |   |revert/maxactv|                |   |   |
           |     |   |   V              V                V   V   V
           |   +----------+  msg  +------------+       +-----------+
           +-->|  JOINED  |------>|   ACTIVE   |------>|  LEAVING  |--->
           |   +----------+       +--- -+------+ LEAVE/+-----------+DOWN
           |        A   A               |      WITHDRAW A   A    A   EVT
           |        |   |               |RECLAIM        |   |    |
           |        |   |REMIT          V               |   |    |
           |        |   |== adv   +------------+        |   |    |
           |        |   +---------| RECLAIMING |--------+   |    |
           |        |             +-----+------+  LEAVE/    |    |
           |        |                   |REMIT   WITHDRAW   |    |
           |        |                   |< adv              |    |
           |        |msg/               V            LEAVE/ |    |
           |        |adv==ADV_IDLE+------------+   WITHDRAW |    |
           |        +-------------|  REMITTED  |------------+    |
           |                      +------------+                 |
           |PUBLISH                                              |
JOIN +-----------+                                LEAVE/WITHDRAW |
---->|  JOINING  |-----------------------------------------------+
     +-----------+

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy c2b22bcf2e tipc: simplify group LEAVE sequence
After the changes in the previous commit the group LEAVE sequence
can be simplified.

We now let the arrival of a LEAVE message unconditionally issue a group
DOWN event to the user. When a topology WITHDRAW event is received, the
member, if it still there, is set to state LEAVING, but we only issue a
group DOWN event when the link to the peer node is gone, so that no
LEAVE message is to be expected.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy 7ad32bcb78 tipc: create group member event messages when they are needed
In the current implementation, a group socket receiving topology
events about other members just converts the topology event message
into a group event message and stores it until it reaches the right
state to issue it to the user. This complicates the code unnecessarily,
and becomes impractical when we in the coming commits will need to
create and issue membership events independently.

In this commit, we change this so that we just notice the type and
origin of the incoming topology event, and then drop the buffer. Only
when it is time to actually send a group event to the user do we
explicitly create a new message and send it upwards.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy 0233493a5f tipc: adjustment to group member FSM
Analysis reveals that the member state MBR_QURANTINED in reality is
unnecessary, and can be replaced by the state MBR_JOINING at all
occurrencs.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy 4ea5dab541 tipc: let group member stay in JOINED mode if unable to reclaim
We handle a corner case in the function tipc_group_update_rcv_win().
During extreme pessure it might happen that a message receiver has all
its active senders in RECLAIMING or REMITTED mode, meaning that there
is nobody to reclaim advertisements from if an additional sender tries
to go active.

Currently we just set the new sender to ACTIVE anyway, hence at least
theoretically opening up for a receiver queue overflow by exceeding the
MAX_ACTIVE limit. The correct solution to this is to instead add the
member to the pending queue, while letting the oldest member in that
queue revert to JOINED state.

In this commit we refactor the code for handling message arrival from
a JOINED member, both to make it more comprehensible and to cover the
case described above.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy 8d5dee21f6 tipc: a couple of cleanups
- We remove the 'reclaiming' member list in struct tipc_group, since
  it doesn't serve any purpose.

- We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv().

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
David S. Miller a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Yang Shi f4803f1b73 net: tipc: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by TIPC at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08 20:59:25 -05:00
Jon Maloy d84d1b3b6b tipc: simplify small window members' sorting algorithm
We simplify the sorting algorithm in tipc_update_member(). We also make
the remaining conditional call to this function unconditional, since the
same condition now is tested for inside the said function.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-05 13:37:03 -05:00
Jon Maloy 38266ca17c tipc: some clarifying name changes
We rename some functions and variables, to make their purpose clearer.

- tipc_group::congested -> tipc_group::small_win. Members in this list
  are not necessarily (and typically) congested. Instead, they may
  *potentially* be subject to congestion because their send window is
  less than ADV_IDLE, and therefore need to be checked during message
  transmission.

- tipc_group_is_receiver() -> tipc_group_is_sender(). This socket will
  accept messages coming from members fulfilling this condition, i.e.,
  they are senders from this member's viewpoint.

- tipc_group_is_enabled() -> tipc_group_is_receiver(). Members
  fulfilling this condition will accept messages sent from the current
  socket, i.e., they are receivers from its viewpoint.

There are no functional changes in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-05 13:37:03 -05:00
Jon Maloy f9c935db80 tipc: fix problems with multipoint-to-point flow control
In commit 04d7b574b2 ("tipc: add multipoint-to-point flow control") we
introduced a protocol for preventing buffer overflow when many group
members try to simultaneously send messages to the same receiving member.

Stress test of this mechanism has revealed a couple of related bugs:

- When the receiving member receives an advertisement REMIT message from
  one of the senders, it will sometimes prematurely activate a pending
  member and send it the remitted advertisement, although the upper
  limit for active senders has been reached. This leads to accumulation
  of illegal advertisements, and eventually to messages being dropped
  because of receive buffer overflow.

- When the receiving member leaves REMITTED state while a received
  message is being read, we miss to look at the pending queue, to
  activate the oldest pending peer. This leads to some pending senders
  being starved out, and never getting the opportunity to profit from
  the remitted advertisement.

We fix the former in the function tipc_group_proto_rcv() by returning
directly from the function once it becomes clear that the remitting
peer cannot leave REMITTED state at that point.

We fix the latter in the function tipc_group_update_rcv_win() by looking
up and activate the longest pending peer when it becomes clear that the
remitting peer now can leave REMITTED state.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 21:52:07 -05:00
David S. Miller 6bb8824732 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky.  The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29 15:42:26 -05:00
Parthasarathy Bhuvaragan 517d7c79bd tipc: fix hanging poll() for stream sockets
In commit 42b531de17 ("tipc: Fix missing connection request
handling"), we replaced unconditional wakeup() with condtional
wakeup for clients with flags POLLIN | POLLRDNORM | POLLRDBAND.

This breaks the applications which do a connect followed by poll
with POLLOUT flag. These applications are not woken when the
connection is ESTABLISHED and hence sleep forever.

In this commit, we fix it by including the POLLOUT event for
sockets in TIPC_CONNECTING state.

Fixes: 42b531de17 ("tipc: Fix missing connection request handling")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:15:26 -05:00
Tommi Rantala 642a8439dd tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path
Calling tipc_mon_delete() before the monitor has been created will oops.
This can happen in tipc_enable_bearer() error path if tipc_disc_create()
fails.

[   48.589074] BUG: unable to handle kernel paging request at 0000000000001008
[   48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc]
[   48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0
[   48.592230] Oops: 0000 [#1] SMP KASAN
[   48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G    B            4.15.0-rc4-pc64-dirty #5
[   48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc]
[   48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282
[   48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000
[   48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297
[   48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533
[   48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050
[   48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020
[   48.608228] FS:  00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000
[   48.610189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0
[   48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.615038] Call Trace:
[   48.615635]  tipc_enable_bearer+0x415/0x5e0 [tipc]
[   48.620623]  tipc_nl_bearer_enable+0x1ab/0x200 [tipc]
[   48.625118]  genl_family_rcv_msg+0x36b/0x570
[   48.631233]  genl_rcv_msg+0x5a/0xa0
[   48.631867]  netlink_rcv_skb+0x1cc/0x220
[   48.636373]  genl_rcv+0x24/0x40
[   48.637306]  netlink_unicast+0x29c/0x350
[   48.639664]  netlink_sendmsg+0x439/0x590
[   48.642014]  SYSC_sendto+0x199/0x250
[   48.649912]  do_syscall_64+0xfd/0x2c0
[   48.650651]  entry_SYSCALL64_slow_path+0x25/0x25
[   48.651843] RIP: 0033:0x7fc6859848e3
[   48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3
[   48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003
[   48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c
[   48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010
[   48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000
[   48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668
[   48.664073] CR2: 0000000000001008
[   48.664576] ---[ end trace e811818d54d5ce88 ]---

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 10:55:00 -05:00
Tommi Rantala 19142551b2 tipc: error path leak fixes in tipc_enable_bearer()
Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
cleanup with bearer_disable() if tipc_mon_create() fails.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 10:54:59 -05:00
Jon Maloy 3a33a19bf8 tipc: fix memory leak of group member when peer node is lost
When a group member receives a member WITHDRAW event, this might have
two reasons: either the peer member is leaving the group, or the link
to the member's node has been lost.

In the latter case we need to issue a DOWN event to the user right away,
and let function tipc_group_filter_msg() perform delete of the member
item. However, in this case we miss to change the state of the member
item to MBR_LEAVING, so the member item is not deleted, and we have a
memory leak.

We now separate better between the four sub-cases of a WITHRAW event
and make sure that each case is handled correctly.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 13:06:36 -05:00
Jon Maloy 0a3d805c9c tipc: base group replicast ack counter on number of actual receivers
In commit 2f487712b8 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") we introduced a mechanism that requires the first
(replicated) broadcast sent after a unicast to be acknowledged by all
receivers before permitting sending of the next (true) broadcast.

The counter for keeping track of the number of acknowledges to expect
is based on the tipc_group::member_cnt variable. But this misses that
some of the known members may not be ready for reception, and will never
acknowledge the message, either because they haven't fully joined the
group or because they are leaving the group. Such members are identified
by not fulfilling the condition tested for in the function
tipc_group_is_enabled().

We now set the counter for the actual number of acks to receive at the
moment the message is sent, by just counting the number of recipients
satisfying the tipc_group_is_enabled() test.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 13:00:04 -05:00
David S. Miller fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Jon Maloy bb25c3855a tipc: remove joining group member from congested list
When we receive a JOIN message from a peer member, the message may
contain an advertised window value ADV_IDLE that permits removing the
member in question from the tipc_group::congested list. However, since
the removal has been made conditional on that the advertised window is
*not* ADV_IDLE, we miss this case. This has the effect that a sender
sometimes may enter a state of permanent, false, broadcast congestion.

We fix this by unconditinally removing the member from the congested
list before calling tipc_member_update(), which might potentially sort
it into the list again.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 14:56:48 -05:00
Jon Maloy 3db0960117 tipc: fix list sorting bug in function tipc_group_update_member()
When, during a join operation, or during message transmission, a group
member needs to be added to the group's 'congested' list, we sort it
into the list in ascending order, according to its current advertised
window size. However, we miss the case when the member is already on
that list. This will have the result that the member, after the window
size has been decremented, might be at the wrong position in that list.
This again may have the effect that we during broadcast and multicast
transmissions miss the fact that a destination is not yet ready for
reception, and we end up sending anyway. From this point on, the
behavior during the remaining session is unpredictable, e.g., with
underflowing window sizes.

We now correct this bug by unconditionally removing the member from
the list before (re-)sorting it in.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 14:10:03 -05:00
Jon Maloy 3f42f5fe31 tipc: remove leaving group member from all lists
A group member going into state LEAVING should never go back to any
other state before it is finally deleted. However, this might happen
if the socket needs to send out a RECLAIM message during this interval.
Since we forget to remove the leaving member from the group's 'active'
or 'pending' list, the member might be selected for reclaiming, change
state to RECLAIMING, and get stuck in this state instead of being
deleted. This might lead to suppression of the expected 'member down'
event to the receiver.

We fix this by removing the member from all lists, except the RB tree,
at the moment it goes into state LEAVING.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 13:16:40 -05:00
Jon Maloy 234833991e tipc: fix lost member events bug
Group messages are not supposed to be returned to sender when the
destination socket disappears. This is done correctly for regular
traffic messages, by setting the 'dest_droppable' bit in the header.
But we forget to do that in group protocol messages. This has the effect
that such messages may sometimes bounce back to the sender, be perceived
as a legitimate peer message, and wreak general havoc for the rest of
the session. In particular, we have seen that a member in state LEAVING
may go back to state RECLAIMED or REMITTED, hence causing suppression
of an otherwise expected 'member down' event to the user.

We fix this by setting the 'dest_droppable' bit even in group protocol
messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 13:16:40 -05:00
David S. Miller c30abd5e40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-16 22:11:55 -05:00
Jon Maloy c545a945d0 tipc: eliminate potential memory leak
In the function tipc_sk_mcast_rcv() we call refcount_dec(&skb->users)
on received sk_buffers. Since the reference counter might hit zero at
this point, we have a potential memory leak.

We fix this by replacing refcount_dec() with kfree_skb().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 13:44:36 -05:00
Tom Herbert 97a6ec4ac0 rhashtable: Change rhashtable_walk_start to return void
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
like this is common:

       ret = rhashtable_walk_start(rhiter);
       if (ret && ret != -EAGAIN)
               goto out;

Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.

This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in mulitple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11 09:58:38 -05:00
David S. Miller 51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
Jon Maloy a7d5f107b4 tipc: fix memory leak in tipc_accept_from_sock()
When the function tipc_accept_from_sock() fails to create an instance of
struct tipc_subscriber it omits to free the already created instance of
struct tipc_conn instance before it returns.

We fix that with this commit.

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:52:41 -05:00
Cong Wang 672ecbe1c9 tipc: fix a null pointer deref on error path
In tipc_topsrv_kern_subscr() when s->tipc_conn_new() fails
we call tipc_close_conn() to clean up, but in this case
calling conn_put() is just enough.

This fixes the folllowing crash:

 kasan: GPF could be caused by NULL-ptr deref or user memory access
 general protection fault: 0000 [#1] SMP KASAN
 Dumping ftrace buffer:
    (ftrace buffer empty)
 Modules linked in:
 CPU: 0 PID: 3085 Comm: syzkaller064164 Not tainted 4.15.0-rc1+ #137
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 task: 00000000c24413a5 task.stack: 000000005e8160b5
 RIP: 0010:__lock_acquire+0xd55/0x47f0 kernel/locking/lockdep.c:3378
 RSP: 0018:ffff8801cb5474a8 EFLAGS: 00010002
 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff85ecb400
 RBP: ffff8801cb547830 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000000 R11: ffffffff87489d60 R12: ffff8801cd2980c0
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020
 FS:  00000000014ee880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffee2426e40 CR3: 00000001cb85a000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:4004
  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:175
  spin_lock_bh include/linux/spinlock.h:320 [inline]
  tipc_subscrb_subscrp_delete+0x8f/0x470 net/tipc/subscr.c:201
  tipc_subscrb_delete net/tipc/subscr.c:238 [inline]
  tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:316
  tipc_close_conn+0x171/0x270 net/tipc/server.c:204
  tipc_topsrv_kern_subscr+0x724/0x810 net/tipc/server.c:514
  tipc_group_create+0x702/0x9c0 net/tipc/group.c:184
  tipc_sk_join net/tipc/socket.c:2747 [inline]
  tipc_setsockopt+0x249/0xc10 net/tipc/socket.c:2861
  SYSC_setsockopt net/socket.c:1851 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1830
  entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 14c04493cb ("tipc: add ability to order and receive topology events in driver")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:51:59 -05:00
David S. Miller 7cda4cee13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 10:44:19 -05:00
Al Viro bc4802736d tipc: switch to sock_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-02 20:38:10 -05:00
Jon Maloy 4c94cc2d3d tipc: fall back to smaller MTU if allocation of local send skb fails
When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassemble the
buffer chain as if the message were arriving from a remote node.

At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.

Acked-by: Ying Xue <ying.xue@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-01 15:21:25 -05:00
Tommi Rantala c7799c067c tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()
Remove the second tipc_rcv() call in tipc_udp_recv(). We have just
checked that the bearer is not up, and calling tipc_rcv() with a bearer
that is not up leads to a TIPC div-by-zero crash in
tipc_node_calculate_timer(). The crash is rare in practice, but can
happen like this:

  We're enabling a bearer, but it's not yet up and fully initialized.
  At the same time we receive a discovery packet, and in tipc_udp_recv()
  we end up calling tipc_rcv() with the not-yet-initialized bearer,
  causing later the div-by-zero crash in tipc_node_calculate_timer().

Jon Maloy explains the impact of removing the second tipc_rcv() call:
  "link setup in the worst case will be delayed until the next arriving
   discovery messages, 1 sec later, and this is an acceptable delay."

As the tipc_rcv() call is removed, just leave the function via the
rcu_out label, so that we will kfree_skb().

[   12.590450] Own node address <1.1.1>, network identity 1
[   12.668088] divide error: 0000 [#1] SMP
[   12.676952] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.2-dirty #1
[   12.679225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   12.682095] task: ffff8c2a761edb80 task.stack: ffffa41cc0cac000
[   12.684087] RIP: 0010:tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc]
[   12.686486] RSP: 0018:ffff8c2a7fc838a0 EFLAGS: 00010246
[   12.688451] RAX: 0000000000000000 RBX: ffff8c2a5b382600 RCX: 0000000000000000
[   12.691197] RDX: 0000000000000000 RSI: ffff8c2a5b382600 RDI: ffff8c2a5b382600
[   12.693945] RBP: ffff8c2a7fc838b0 R08: 0000000000000001 R09: 0000000000000001
[   12.696632] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c2a5d8949d8
[   12.699491] R13: ffffffff95ede400 R14: 0000000000000000 R15: ffff8c2a5d894800
[   12.702338] FS:  0000000000000000(0000) GS:ffff8c2a7fc80000(0000) knlGS:0000000000000000
[   12.705099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.706776] CR2: 0000000001bb9440 CR3: 00000000bd009001 CR4: 00000000003606e0
[   12.708847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   12.711016] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   12.712627] Call Trace:
[   12.713390]  <IRQ>
[   12.714011]  tipc_node_check_dest+0x2e8/0x350 [tipc]
[   12.715286]  tipc_disc_rcv+0x14d/0x1d0 [tipc]
[   12.716370]  tipc_rcv+0x8b0/0xd40 [tipc]
[   12.717396]  ? minmax_running_min+0x2f/0x60
[   12.718248]  ? dst_alloc+0x4c/0xa0
[   12.718964]  ? tcp_ack+0xaf1/0x10b0
[   12.719658]  ? tipc_udp_is_known_peer+0xa0/0xa0 [tipc]
[   12.720634]  tipc_udp_recv+0x71/0x1d0 [tipc]
[   12.721459]  ? dst_alloc+0x4c/0xa0
[   12.722130]  udp_queue_rcv_skb+0x264/0x490
[   12.722924]  __udp4_lib_rcv+0x21e/0x990
[   12.723670]  ? ip_route_input_rcu+0x2dd/0xbf0
[   12.724442]  ? tcp_v4_rcv+0x958/0xa40
[   12.725039]  udp_rcv+0x1a/0x20
[   12.725587]  ip_local_deliver_finish+0x97/0x1d0
[   12.726323]  ip_local_deliver+0xaf/0xc0
[   12.726959]  ? ip_route_input_noref+0x19/0x20
[   12.727689]  ip_rcv_finish+0xdd/0x3b0
[   12.728307]  ip_rcv+0x2ac/0x360
[   12.728839]  __netif_receive_skb_core+0x6fb/0xa90
[   12.729580]  ? udp4_gro_receive+0x1a7/0x2c0
[   12.730274]  __netif_receive_skb+0x1d/0x60
[   12.730953]  ? __netif_receive_skb+0x1d/0x60
[   12.731637]  netif_receive_skb_internal+0x37/0xd0
[   12.732371]  napi_gro_receive+0xc7/0xf0
[   12.732920]  receive_buf+0x3c3/0xd40
[   12.733441]  virtnet_poll+0xb1/0x250
[   12.733944]  net_rx_action+0x23e/0x370
[   12.734476]  __do_softirq+0xc5/0x2f8
[   12.734922]  irq_exit+0xfa/0x100
[   12.735315]  do_IRQ+0x4f/0xd0
[   12.735680]  common_interrupt+0xa2/0xa2
[   12.736126]  </IRQ>
[   12.736416] RIP: 0010:native_safe_halt+0x6/0x10
[   12.736925] RSP: 0018:ffffa41cc0cafe90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff4d
[   12.737756] RAX: 0000000000000000 RBX: ffff8c2a761edb80 RCX: 0000000000000000
[   12.738504] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   12.739258] RBP: ffffa41cc0cafe90 R08: 0000014b5b9795e5 R09: ffffa41cc12c7e88
[   12.740118] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[   12.740964] R13: ffff8c2a761edb80 R14: 0000000000000000 R15: 0000000000000000
[   12.741831]  default_idle+0x2a/0x100
[   12.742323]  arch_cpu_idle+0xf/0x20
[   12.742796]  default_idle_call+0x28/0x40
[   12.743312]  do_idle+0x179/0x1f0
[   12.743761]  cpu_startup_entry+0x1d/0x20
[   12.744291]  start_secondary+0x112/0x120
[   12.744816]  secondary_startup_64+0xa5/0xa5
[   12.745367] Code: b9 f4 01 00 00 48 89 c2 48 c1 ea 02 48 3d d3 07 00
00 48 0f 47 d1 49 8b 0c 24 48 39 d1 76 07 49 89 14 24 48 89 d1 31 d2 48
89 df <48> f7 f1 89 c6 e8 81 6e ff ff 5b 41 5c 5d c3 66 90 66 2e 0f 1f
[   12.747527] RIP: tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc] RSP: ffff8c2a7fc838a0
[   12.748555] ---[ end trace 1399ab83390650fd ]---
[   12.749296] Kernel panic - not syncing: Fatal exception in interrupt
[   12.750123] Kernel Offset: 0x13200000 from 0xffffffff82000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   12.751215] Rebooting in 60 seconds..

Fixes: c9b64d492b ("tipc: add replicast peer discovery")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-01 15:14:22 -05:00
Al Viro ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Jon Maloy 2e724dca77 tipc: eliminate access after delete in group_filter_msg()
KASAN revealed another access after delete in group.c. This time
it found that we read the header of a received message after the
buffer has been released.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-27 14:44:45 -05:00
Jon Maloy e0e853ac03 tipc: fix access of released memory
When the function tipc_group_filter_msg() finds that a member event
indicates that the member is leaving the group, it first deletes the
member instance, and then purges the message queue being handled
by the call. But the message queue is an aggregated field in the
just deleted item, leading the purge call to access freed memory.

We fix this by swapping the order of the two actions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-21 20:22:03 +09:00
Jon Maloy d618d09a68 tipc: enforce valid ratio between skb truesize and contents
The socket level flow control is based on the assumption that incoming
buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
where the latter value is rounded off upwards to the nearest 1k number.
This does empirically hold true for the device drivers we know, but we
cannot trust that it will always be so, e.g., in a system with jumbo
frames and very small packets.

We now introduce a check for this condition at packet arrival, and if
we find it to be false, we copy the packet to a new, smaller buffer,
where the condition will be true. We expect this to affect only a small
fraction of all incoming packets, if at all.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16 10:49:00 +09:00
Jon Maloy 8d6e79d3ce tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.

To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.

To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.

It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 15:36:05 +09:00
David S. Miller 2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Jon Maloy fa36882682 tipc: eliminate unnecessary probing
The neighbor monitor employs a threshold, default set to 32 peer nodes,
where it activates the "Overlapping Neighbor Monitoring" algorithm.
Below that threshold, monitoring is full-mesh, and no "domain records"
are passed between the nodes.

Because of this, a node never received a peer's ack that it has received
the most recent update of the own domain. Hence, the field 'acked_gen'
in struct tipc_monitor_state remains permamently at zero, whereas the
own domain generation is incremented for each added or removed peer.

This has the effect that the function tipc_mon_get_state() always sets
the field 'probing' in struct tipc_monitor_state true, again leading the
tipc_link_timeout() of the link in question to always send out a probe,
even when link->silent_intv_count is zero.

This is functionally harmless, but leads to some unncessary probing,
which can easily be eliminated by setting the 'probing' field of the
said struct correctly in such cases.

At the same time, we explictly invalidate the sent domain records when
the algorithm is not activated. This will eliminate any risk that an
invalid domain record might be inadverently accepted by the peer.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 15:48:46 +09:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Kees Cook 31b102bb50 net: tipc: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: tipc-discussion@lists.sourceforge.net
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 12:38:45 +09:00
Cong Wang e233df0157 tipc: fix a dangling pointer
tsk->group is set to grp earlier, but we forget to unset it
after grp is freed.

Fixes: 75da2163db ("tipc: introduce communication groups")
Reported-by: syzkaller bot
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 17:46:53 +09:00
Jon Maloy f65163fed0 tipc: eliminate KASAN warning
The following warning was reported by syzbot on Oct 24. 2017:
KASAN: slab-out-of-bounds Read in tipc_nametbl_lookup_dst_nodes

This is a harmless bug, but we still want to get rid of the warning,
so we swap the two conditions in question.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 09:38:34 +09:00
Jon Maloy 0d5fcebf3c tipc: refactor tipc_sk_timeout() function
The function tipc_sk_timeout() is more complex than necessary, and
even seems to contain an undetected bug. At one of the occurences
where we renew the timer we just order it with (HZ / 20), instead
of (jiffies + HZ / 20);

In this commit we clean up the function.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:36:35 +01:00
Jon Maloy cb4dc41eaa tipc: fix broken tipc_poll() function
In commit ae236fb208 ("tipc: receive group membership events via
member socket") we broke the tipc_poll() function by checking the
state of the receive queue before the call to poll_sock_wait(), while
relying that state afterwards, when it might have changed.

We restore this in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 12:27:05 +01:00
Dan Carpenter c75e427d93 tipc: checking for NULL instead of IS_ERR()
The tipc_alloc_conn() function never returns NULL, it returns error
pointers, so I have fixed the check.

Fixes: 14c04493cb ("tipc: add ability to order and receive topology events in driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 08:34:00 +01:00
Jon Maloy 36c0a9dfc6 tipc: fix rebasing error
In commit 2f487712b8 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") there was introduced a last-minute rebasing
error that broke non-group communication.

We fix this here.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:28:36 +01:00
Jon Maloy 04d7b574b2 tipc: add multipoint-to-point flow control
We already have point-to-multipoint flow control within a group. But
we even need the opposite; -a scheme which can handle that potentially
hundreds of sources may try to send messages to the same destination
simultaneously without causing buffer overflow at the recipient. This
commit adds such a mechanism.

The algorithm works as follows:

- When a member detects a new, joining member, it initially set its
  state to JOINED and advertises a minimum window to the new member.
  This window is chosen so that the new member can send exactly one
  maximum sized message, or several smaller ones, to the recipient
  before it must stop and wait for an additional advertisement. This
  minimum window ADV_IDLE is set to 65 1kB blocks.

- When a member receives the first data message from a JOINED member,
  it changes the state of the latter to ACTIVE, and advertises a larger
  window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
  continue sending with minimal disturbances to the data flow.

- The active members are kept in a dedicated linked list. Each time a
  message is received from an active member, it will be moved to the
  tail of that list. This way, we keep a record of which members have
  been most (tail) and least (head) recently active.

- There is a maximum number (16) of permitted simultaneous active
  senders per receiver. When this limit is reached, the receiver will
  not advertise anything immediately to a new sender, but instead put
  it in a PENDING state, and add it to a corresponding queue. At the
  same time, it will pick the least recently active member, send it an
  advertisement RECLAIM message, and set this member to state
  RECLAIMING.

- The reclaimee member has to respond with a REMIT message, meaning that
  it goes back to a send window of ADV_IDLE, and returns its unused
  advertised blocks beyond that value to the reclaiming member.

- When the reclaiming member receives the REMIT message, it unlinks
  the reclaimee from its active list, resets its state to JOINED, and
  notes that it is now back at ADV_IDLE advertised blocks to that
  member. If there are still unread data messages sent out by
  reclaimee before the REMIT, the member goes into an intermediate
  state REMITTED, where it stays until the said messages have been
  consumed.

- The returned advertised blocks can now be re-advertised to the
  pending member, which is now set to state ACTIVE and added to
  the active member list.

- To be proactive, i.e., to minimize the risk that any member will
  end up in the pending queue, we start reclaiming resources already
  when the number of active members exceeds 3/4 of the permitted
  maximum.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy a3bada7066 tipc: guarantee delivery of last broadcast before DOWN event
The following scenario is possible:
- A user sends a broadcast message, and thereafter immediately leaves
  the group.
- The LEAVE message, following a different path than the broadcast,
  arrives ahead of the broadcast, and the sending member is removed
  from the receiver's list.
- The broadcast message arrives, but is dropped because the sender
  now is unknown to the receipient.

We fix this by sequence numbering membership events, just like ordinary
unicast messages. Currently, when a JOIN is sent to a peer, it contains
a synchronization point, - the sequence number of the next sent
broadcast, in order to give the receiver a start synchronization point.
We now let even LEAVE messages contain such an "end synchronization"
point, so that the recipient can delay the removal of the sending member
until it knows that all messages have been received.

The received synchronization points are added as sequence numbers to the
generated membership events, making it possible to handle them almost
the same way as regular unicasts in the receiving filter function. In
particular, a DOWN event with a too high sequence number will be kept
in the reordering queue until the missing broadcast(s) arrive and have
been delivered.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 399574d419 tipc: guarantee delivery of UP event before first broadcast
The following scenario is possible:
- A user joins a group, and immediately sends out a broadcast message
  to its members.
- The broadcast message, following a different data path than the
  initial JOIN message sent out during the joining procedure, arrives
  to a receiver before the latter..
- The receiver drops the message, since it is not ready to accept any
  messages until the JOIN has arrived.

We avoid this by treating group protocol JOIN messages like unicast
messages.
- We let them pass through the recipient's multicast input queue, just
  like ordinary unicasts.
- We force the first following broadacst to be sent as replicated
  unicast and being acknowledged by the recipient before accepting
  any more broadcast transmissions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 2f487712b8 tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:

- Each time a unicast is sent, we set a the broadcast method for the
  socket to "replicast" and "mandatory". This forces the first
  subsequent broadcast message to follow the same network and data path
  as the preceding unicast to a destination, hence preventing it from
  overtaking the latter.

- In order to make the 'same data path' statement above true, we let
  group unicasts pass through the multicast link input queue, instead
  of as previously through the unicast link input queue.

- In the first broadcast following a unicast, we set a new header flag,
  requiring all recipients to immediately acknowledge its reception.

- During the period before all the expected acknowledges are received,
  the socket refuses to accept any more broadcast attempts, i.e., by
  blocking or returning EAGAIN. This period should typically not be
  longer than a few microseconds.

- When all acknowledges have been received, the sending socket will
  open up for subsequent broadcasts, this time giving the link layer
  freedom to itself select the best transmission method.

- The forced and/or abrupt transmission method changes described above
  may lead to broadcasts arriving out of order to the recipients. We
  remedy this by introducing code that checks and if necessary
  re-orders such messages at the receiving end.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy b87a5ea31c tipc: guarantee group unicast doesn't bypass group broadcast
Group unicast messages don't follow the same path as broadcast messages,
and there is a high risk that unicasts sent from a socket might bypass
previously sent broadcasts from the same socket.

We fix this by letting all unicast messages carry the sequence number of
the next sent broadcast from the same node, but without updating this
number at the receiver. This way, a receiver can check and if necessary
re-order such messages before they are added to the socket receive buffer.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 5b8dddb637 tipc: introduce group multicast messaging
The previously introduced message transport to all group members is
based on the tipc multicast service, but is logically a broadcast
service within the group, and that is what we call it.

We now add functionality for sending messages to all group members
having a certain identity. Correspondingly, we call this feature 'group
multicast'. The service is using unicast when only one destination is
found, otherwise it will use the bearer broadcast service to transfer
the messages. In the latter case, the receiving members filter arriving
messages by looking at the intended destination instance. If there is
no match, the message will be dropped, while still being considered
received and read as seen by the flow control mechanism.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy ee106d7f94 tipc: introduce group anycast messaging
In this commit, we make it possible to send connectionless unicast
messages to any member corresponding to the given member identity,
when there is more than one such member. The sender must use a
TIPC_ADDR_NAME address to achieve this effect.

We also perform load balancing between the destinations, i.e., we
primarily select one which has advertised sufficient send window
to not cause a block/EAGAIN delay, if any. This mechanism is
overlayed on the always present round-robin selection.

Anycast messages are subject to the same start synchronization
and flow control mechanism as group broadcast messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 27bd9ec027 tipc: introduce group unicast messaging
We now make it possible to send connectionless unicast messages
within a communication group. To send a message, the sender can use
either a direct port address, aka port identity, or an indirect port
name to be looked up.

This type of messages are subject to the same start synchronization
and flow control mechanism as group broadcast messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy b7d4263551 tipc: introduce flow control for group broadcast messages
We introduce an end-to-end flow control mechanism for group broadcast
messages. This ensures that no messages are ever lost because of
destination receive buffer overflow, with minimal impact on performance.
For now, the algorithm is based on the assumption that there is only one
active transmitter at any moment in time.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy ae236fb208 tipc: receive group membership events via member socket
Like with any other service, group members' availability can be
subscribed for by connecting to be topology server. However, because
the events arrive via a different socket than the member socket, there
is a real risk that membership events my arrive out of synch with the
actual JOIN/LEAVE action. I.e., it is possible to receive the first
messages from a new member before the corresponding JOIN event arrives,
just as it is possible to receive the last messages from a leaving
member after the LEAVE event has already been received.

Since each member socket is internally also subscribing for membership
events, we now fix this problem by passing those events on to the user
via the member socket. We leverage the already present member synch-
ronization protocol to guarantee correct message/event order. An event
is delivered to the user as an empty message where the two source
addresses identify the new/lost member. Furthermore, we set the MSG_OOB
bit in the message flags to mark it as an event. If the event is an
indication about a member loss we also set the MSG_EOR bit, so it can
be distinguished from a member addition event.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 31c82a2d9d tipc: add second source address to recvmsg()/recvfrom()
With group communication, it becomes important for a message receiver to
identify not only from which socket (identfied by a node:port tuple) the
message was sent, but also the logical identity (type:instance) of the
sending member.

We fix this by adding a second instance of struct sockaddr_tipc to the
source address area when a message is read. The extra address struct
is filled in with data found in the received message header (type,) and
in the local member representation struct (instance.)

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 75da2163db tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.

We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.

A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.

Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy a80ae5306a tipc: improve destination linked list
We often see a need for a linked list of destination identities,
sometimes containing a port number, sometimes a node identity, and
sometimes both. The currently defined struct u32_list is not generic
enough to cover all cases, so we extend it to contain two u32 integers
and rename it to struct tipc_dest_list.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy f70d37b796 tipc: add new function for sending multiple small messages
We see an increasing need to send multiple single-buffer messages
of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
Instead of looping over the send queue and sending each buffer
individually, as we do now, we add a new help function
tipc_node_distr_xmit() to do this.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 64ac5f5977 tipc: refactor function filter_rcv()
In the following commits we will need to handle multiple incoming and
rejected/returned buffers in the function socket.c::filter_rcv().
As a preparation for this, we generalize the function by handling
buffer queues instead of individual buffers. We also introduce a
help function tipc_skb_reject(), and rename filter_rcv() to
tipc_sk_filter_rcv() in line with other functions in socket.c.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 38077b8ef8 tipc: add ability to obtain node availability status from other files
In the coming commits, functions at the socket level will need the
ability to read the availability status of a given node. We therefore
introduce a new function for this purpose, while renaming the existing
static function currently having the wanted name.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 23998835be tipc: improve address sanity check in tipc_connect()
The address given to tipc_connect() is not completely sanity checked,
under the assumption that this will be done later in the function
__tipc_sendmsg() when the address is used there.

However, the latter functon will in the next commits serve as caller
to several other send functions, so we want to move the corresponding
sanity check there to the beginning of that function, before we possibly
need to grab the address stored by tipc_connect(). We must therefore
be able to trust that this address already has been thoroughly checked.

We do this in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 14c04493cb tipc: add ability to order and receive topology events in driver
As preparation for introducing communication groups, we add the ability
to issue topology subscriptions and receive topology events from kernel
space. This will make it possible for group member sockets to keep track
of other group members.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy a9e2971b8c tipc: Unclone message at secondary destination lookup
When a bundling message is received, the function tipc_link_input()
calls function tipc_msg_extract() to unbundle all inner messages of
the bundling message before adding them to input queue.

The function tipc_msg_extract() just clones all inner skb for all
inner messagges from the bundling skb. This means that the skb
headroom of an inner message overlaps with the data part of the
preceding message in the bundle.

If the message in question is a name addressed message, it may be
subject to a secondary destination lookup, and eventually be sent out
on one of the interfaces again. But, since what is perceived as headroom
by the device driver in reality is the last bytes of the preceding
message in the bundle, the latter will be overwritten by the MAC
addresses of the L2 header. If the preceding message has not yet been
consumed by the user, it will evenually be delivered with corrupted
contents.

This commit fixes this by uncloning all messages passing through the
function tipc_msg_lookup_dest(), hence ensuring that the headroom
is always valid when the message is passed on.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08 21:13:23 -07:00
Jon Maloy 3382605fd8 tipc: correct initialization of skb list
We change the initialization of the skb transmit buffer queues
in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
initialize their spinlocks. This is needed because we may, during
error conditions, need to call skb_queue_purge() on those queues
further down the stack.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08 21:13:23 -07:00
Parthasarathy Bhuvaragan aad06212d3 tipc: use only positive error codes in messages
In commit e3a77561e7 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, its inserted
into the 4 bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

In this commit, we set only positive error code.

Fixes: e3a77561e7 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 04:03:35 +01:00
Kleber Sacilotto de Souza 8e0deed924 tipc: remove unnecessary call to dev_net()
The net device is already stored in the 'net' variable, so no need to call
dev_net() again.

Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:25:52 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Parthasarathy Bhuvaragan d55c60eba0 tipc: permit bond slave as bearer
For a bond slave device as a tipc bearer, the dev represents the bond
interface and orig_dev represents the slave in tipc_l2_rcv_msg().
Since we decode the tipc_ptr from bonding device (dev), we fail to
find the bearer and thus tipc links are not established.

In this commit, we register the tipc protocol callback per device and
look for tipc bearer from both the devices.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 15:07:33 -07:00
Parthasarathy Bhuvaragan 991ca84daa tipc: context imbalance at node read unlock
If we fail to find a valid bearer in tipc_node_get_linkname(),
node_read_unlock() is called without holding the node read lock.

This commit fixes this error.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Parthasarathy Bhuvaragan 60d1d93664 tipc: reassign pointers after skb reallocation / linearization
In tipc_msg_reverse(), we assign skb attributes to local pointers
in stack at startup. This is followed by skb_linearize() and for
cloned buffers we perform skb relocation using pskb_expand_head().
Both these methods may update the skb attributes and thus making
the pointers incorrect.

In this commit, we fix this error by ensuring that the pointers
are re-assigned after any of these skb operations.

Fixes: 29042e19f2 ("tipc: let function tipc_msg_reverse() expand header
when needed")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Parthasarathy Bhuvaragan 27163138b4 tipc: perform skb_linearize() before parsing the inner header
In tipc_rcv(), we linearize only the header and usually the packets
are consumed as the nodes permit direct reception. However, if the
skb contains tunnelled message due to fail over or synchronization
we parse it in tipc_node_check_state() without performing
linearization. This will cause link disturbances if the skb was
non linear.

In this commit, we perform linearization for the above messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Bob Peterson 6c7e983b22 tipc: Fix tipc_sk_reinit handling of -EAGAIN
In 9dbbfb0ab6 function tipc_sk_reinit
had additional logic added to loop in the event that function
rhashtable_walk_next() returned -EAGAIN. No worries.

However, if rhashtable_walk_start returns -EAGAIN, it does "continue",
and therefore skips the call to rhashtable_walk_stop(). That has
the effect of calling rcu_read_lock() without its paired call to
rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
may not be apparent for a while, especially since resize events may
be rare. But the comments to rhashtable_walk_start() state:

 * ...Note that we take the RCU lock in all
 * cases including when we return an error.  So you must always call
 * rhashtable_walk_stop to clean up.

This patch replaces the continue with a goto and label to ensure a
matching call to rhashtable_walk_stop().

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 14:02:26 -07:00
Arvind Yadav 042a90106b net: tipc: constify genl_ops
genl_ops are not supposed to change at runtime. All functions
working with genl_ops provided by <net/genetlink.h> work with
const genl_ops. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:31:38 -07:00
Ying Xue fd849b7c41 tipc: fix a race condition of releasing subscriber object
No matter whether a request is inserted into workqueue as a work item
to cancel a subscription or to delete a subscription's subscriber
asynchronously, the work items may be executed in different workers.
As a result, it doesn't mean that one request which is raised prior to
another request is definitely handled before the latter. By contrast,
if the latter request is executed before the former request, below
error may happen:

[  656.183644] BUG: spinlock bad magic on CPU#0, kworker/u8:0/12117
[  656.184487] general protection fault: 0000 [#1] SMP
[  656.185160] Modules linked in: tipc ip6_udp_tunnel udp_tunnel 9pnet_virtio 9p 9pnet virtio_net virtio_pci virtio_ring virtio [last unloaded: ip6_udp_tunnel]
[  656.187003] CPU: 0 PID: 12117 Comm: kworker/u8:0 Not tainted 4.11.0-rc7+ #6
[  656.187920] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  656.188690] Workqueue: tipc_rcv tipc_recv_work [tipc]
[  656.189371] task: ffff88003f5cec40 task.stack: ffffc90004448000
[  656.190157] RIP: 0010:spin_bug+0xdd/0xf0
[  656.190678] RSP: 0018:ffffc9000444bcb8 EFLAGS: 00010202
[  656.191375] RAX: 0000000000000034 RBX: ffff88003f8d1388 RCX: 0000000000000000
[  656.192321] RDX: ffff88003ba13708 RSI: ffff88003ba0cd08 RDI: ffff88003ba0cd08
[  656.193265] RBP: ffffc9000444bcd0 R08: 0000000000000030 R09: 000000006b6b6b6b
[  656.194208] R10: ffff8800bde3e000 R11: 00000000000001b4 R12: 6b6b6b6b6b6b6b6b
[  656.195157] R13: ffffffff81a3ca64 R14: ffff88003f8d1388 R15: ffff88003f8d13a0
[  656.196101] FS:  0000000000000000(0000) GS:ffff88003ba00000(0000) knlGS:0000000000000000
[  656.197172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  656.197935] CR2: 00007f0b3d2e6000 CR3: 000000003ef9e000 CR4: 00000000000006f0
[  656.198873] Call Trace:
[  656.199210]  do_raw_spin_lock+0x66/0xa0
[  656.199735]  _raw_spin_lock_bh+0x19/0x20
[  656.200258]  tipc_subscrb_subscrp_delete+0x28/0xf0 [tipc]
[  656.200990]  tipc_subscrb_rcv_cb+0x45/0x260 [tipc]
[  656.201632]  tipc_receive_from_sock+0xaf/0x100 [tipc]
[  656.202299]  tipc_recv_work+0x2b/0x60 [tipc]
[  656.202872]  process_one_work+0x157/0x420
[  656.203404]  worker_thread+0x69/0x4c0
[  656.203898]  kthread+0x138/0x170
[  656.204328]  ? process_one_work+0x420/0x420
[  656.204889]  ? kthread_create_on_node+0x40/0x40
[  656.205527]  ret_from_fork+0x29/0x40
[  656.206012] Code: 48 8b 0c 25 00 c5 00 00 48 c7 c7 f0 24 a3 81 48 81 c1 f0 05 00 00 65 8b 15 61 ef f5 7e e8 9a 4c 09 00 4d 85 e4 44 8b 4b 08 74 92 <45> 8b 84 24 40 04 00 00 49 8d 8c 24 f0 05 00 00 eb 8d 90 0f 1f
[  656.208504] RIP: spin_bug+0xdd/0xf0 RSP: ffffc9000444bcb8
[  656.209798] ---[ end trace e2a800e6eb0770be ]---

In above scenario, the request of deleting subscriber was performed
earlier than the request of canceling a subscription although the
latter was issued before the former, which means tipc_subscrb_delete()
was called before tipc_subscrp_cancel(). As a result, when
tipc_subscrb_subscrp_delete() called by tipc_subscrp_cancel() was
executed to cancel a subscription, the subscription's subscriber
refcnt had been decreased to 1. After tipc_subscrp_delete() where
the subscriber was freed because its refcnt was decremented to zero,
but the subscriber's lock had to be released, as a consequence, panic
happened.

By contrast, if we increase subscriber's refcnt before
tipc_subscrb_subscrp_delete() is called in tipc_subscrp_cancel(),
the panic issue can be avoided.

Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:25:02 -07:00
Parthasarathy Bhuvaragan 458be024ef tipc: remove subscription references only for pending timers
In commit, 139bb36f75 ("tipc: advance the time of deleting
subscription from subscriber->subscrp_list"), we delete the
subscription from the subscribers list and from nametable
unconditionally. This leads to the following bug if the timer
running tipc_subscrp_timeout() in another CPU accesses the
subscription list after the subscription delete request.

[39.570] general protection fault: 0000 [#1] SMP
::
[39.574] task: ffffffff81c10540 task.stack: ffffffff81c00000
[39.575] RIP: 0010:tipc_subscrp_timeout+0x32/0x80 [tipc]
[39.576] RSP: 0018:ffff88003ba03e90 EFLAGS: 00010282
[39.576] RAX: dead000000000200 RBX: ffff88003f0f3600 RCX: 0000000000000101
[39.577] RDX: dead000000000100 RSI: 0000000000000201 RDI: ffff88003f0d7948
[39.578] RBP: ffff88003ba03ea0 R08: 0000000000000001 R09: ffff88003ba03ef8
[39.579] R10: 000000000000014f R11: 0000000000000000 R12: ffff88003f0d7948
[39.580] R13: ffff88003f0f3618 R14: ffffffffa006c250 R15: ffff88003f0f3600
[39.581] FS:  0000000000000000(0000) GS:ffff88003ba00000(0000) knlGS:0000000000000000
[39.582] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39.583] CR2: 00007f831c6e0714 CR3: 000000003d3b0000 CR4: 00000000000006f0
[39.584] Call Trace:
[39.584]  <IRQ>
[39.585]  call_timer_fn+0x3d/0x180
[39.585]  ? tipc_subscrb_rcv_cb+0x260/0x260 [tipc]
[39.586]  run_timer_softirq+0x168/0x1f0
[39.586]  ? sched_clock_cpu+0x16/0xc0
[39.587]  __do_softirq+0x9b/0x2de
[39.587]  irq_exit+0x60/0x70
[39.588]  smp_apic_timer_interrupt+0x3d/0x50
[39.588]  apic_timer_interrupt+0x86/0x90
[39.589] RIP: 0010:default_idle+0x20/0xf0
[39.589] RSP: 0018:ffffffff81c03e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[39.590] RAX: 0000000000000000 RBX: ffffffff81c10540 RCX: 0000000000000000
[39.591] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[39.592] RBP: ffffffff81c03e68 R08: 0000000000000000 R09: 0000000000000000
[39.593] R10: ffffc90001cbbe00 R11: 0000000000000000 R12: 0000000000000000
[39.594] R13: ffffffff81c10540 R14: 0000000000000000 R15: 0000000000000000
[39.595]  </IRQ>
::
[39.603] RIP: tipc_subscrp_timeout+0x32/0x80 [tipc] RSP: ffff88003ba03e90
[39.604] ---[ end trace 79ce94b7216cb459 ]---

Fixes: 139bb36f75 ("tipc: advance the time of deleting subscription from subscriber->subscrp_list")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:25:02 -07:00
David S. Miller e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Jon Paul Maloy 40501f90ed tipc: don't reset stale broadcast send link
When the broadcast send link after 100 attempts has failed to
transfer a packet to all peers, we consider it stale, and reset
it. Thereafter it needs to re-synchronize with the peers, something
currently done by just resetting and re-establishing all links to
all peers. This has turned out to be overkill, with potentially
unwanted consequences for the remaining cluster.

A closer analysis reveals that this can be done much simpler. When
this kind of failure happens, for reasons that may lie outside the
TIPC protocol, it is typically only one peer which is failing to
receive and acknowledge packets. It is hence sufficient to identify
and reset the links only to that peer to resolve the situation, without
having to reset the broadcast link at all. This solution entails a much
lower risk of negative consequences for the own node as well as for
the overall cluster.

We implement this change in this commit.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21 13:37:45 -07:00
Eric Dumazet 5bfd37b4de tipc: fix use-after-free
syszkaller reported use-after-free in tipc [1]

When msg->rep skb is freed, set the pointer to NULL,
so that caller does not free it again.

[1]

==================================================================
BUG: KASAN: use-after-free in skb_push+0xd4/0xe0 net/core/skbuff.c:1466
Read of size 8 at addr ffff8801c6e71e90 by task syz-executor5/4115

CPU: 1 PID: 4115 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x24e/0x340 mm/kasan/report.c:409
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
 skb_push+0xd4/0xe0 net/core/skbuff.c:1466
 tipc_nl_compat_recv+0x833/0x18f0 net/tipc/netlink_compat.c:1209
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512e9
RSP: 002b:00007f3bc8184c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004512e9
RDX: 0000000000000020 RSI: 0000000020fdb000 RDI: 0000000000000006
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b5e76
R13: 00007f3bc8184b48 R14: 00000000004b5e86 R15: 0000000000000000

Allocated by task 4115:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc_node+0x13d/0x750 mm/slab.c:3651
 __alloc_skb+0xf1/0x740 net/core/skbuff.c:219
 alloc_skb include/linux/skbuff.h:903 [inline]
 tipc_tlv_alloc+0x26/0xb0 net/tipc/netlink_compat.c:148
 tipc_nl_compat_dumpit+0xf2/0x3c0 net/tipc/netlink_compat.c:248
 tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
 tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 4115:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3503 [inline]
 kmem_cache_free+0x77/0x280 mm/slab.c:3763
 kfree_skbmem+0x1a1/0x1d0 net/core/skbuff.c:622
 __kfree_skb net/core/skbuff.c:682 [inline]
 kfree_skb+0x165/0x4c0 net/core/skbuff.c:699
 tipc_nl_compat_dumpit+0x36a/0x3c0 net/tipc/netlink_compat.c:260
 tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
 tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe

The buggy address belongs to the object at ffff8801c6e71dc0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 208 bytes inside of
 224-byte region [ffff8801c6e71dc0, ffff8801c6e71ea0)
The buggy address belongs to the page:
page:ffffea00071b9c40 count:1 mapcount:0 mapping:ffff8801c6e71000 index:0x0
flags: 0x200000000000100(slab)
raw: 0200000000000100 ffff8801c6e71000 0000000000000000 000000010000000c
raw: ffffea0007224a20 ffff8801d98caf48 ffff8801d9e79040 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801c6e71d80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 ffff8801c6e71e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8801c6e71e80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
                         ^
 ffff8801c6e71f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8801c6e71f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov  <dvyukov@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 15:59:29 -07:00
Jon Paul Maloy 59a361bc6f tipc: avoid inheriting msg_non_seq flag when message is returned
In the function msg_reverse(), we reverse the header while trying to
reuse the original buffer whenever possible. Those rejected/returned
messages are always transmitted as unicast, but the msg_non_seq field
is not explicitly set to zero as it should be.

We have seen cases where multicast senders set the message type to
"NOT dest_droppable", meaning that a multicast message shorter than
one MTU will be returned, e.g., during receive buffer overflow, by
reusing the original buffer. This has the effect that even the
'msg_non_seq' field is inadvertently inherited by the rejected message,
although it is now sent as a unicast message. This again leads the
receiving unicast link endpoint to steer the packet toward the broadcast
link receive function, where it is dropped. The affected unicast link is
thereafter (after 100 failed retransmissions) declared 'stale' and
reset.

We fix this by unconditionally setting the 'msg_non_seq' flag to zero
for all rejected/returned messages.

Reported-by: Canh Duc Luu <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 11:20:36 -07:00
Jon Paul Maloy fed5f5718c tipc: accept PACKET_MULTICAST packets
On L2 bearers, the TIPC broadcast function is sending out packets using
the corresponding L2 broadcast address. At reception, we filter such
packets under the assumption that they will also be delivered as
broadcast packets.

This assumption doesn't always hold true. Under high load, we have seen
that a switch may convert the destination address and deliver the packet
as a PACKET_MULTICAST, something leading to inadvertently dropped
packets and a stale and reset broadcast link.

We fix this by extending the reception filtering to accept packets of
type PACKET_MULTICAST.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 11:19:25 -07:00
Jon Paul Maloy ed43594aed tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before  initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:38:06 -07:00
Reshetova, Elena 41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Jia-Ju Bai 343eba69c6 net: tipc: Fix a sleep-in-atomic bug in tipc_msg_reverse
The kernel may sleep under a rcu read lock in tipc_msg_reverse, and the
function call path is:
tipc_l2_rcv_msg (acquire the lock by rcu_read_lock)
  tipc_rcv
    tipc_sk_rcv
      tipc_msg_reverse
        pskb_expand_head(GFP_KERNEL) --> may sleep
tipc_node_broadcast
  tipc_node_xmit_skb
    tipc_node_xmit
      tipc_sk_rcv
        tipc_msg_reverse
          pskb_expand_head(GFP_KERNEL) --> may sleep

To fix it, "GFP_KERNEL" is replaced with "GFP_ATOMIC".

Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 18:20:38 -04:00
Jon Paul Maloy 844cf763fb tipc: make macro tipc_wait_for_cond() smp safe
The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
to fulfil its task. The latter, in turn, is evaluating the stated
condition outside the socket lock context. This is problematic if
the condition is accessing non-trivial data structures which may be
altered by incoming interrupts, as is the case with the cong_links()
linked list, used by socket to keep track of the current set of
congested links. We sometimes see crashes when this list is accessed
by a condition function at the same time as a SOCK_WAKEUP interrupt
is removing an element from the list.

We fix this by expanding selected parts of sk_wait_event() into the
outer macro, while ensuring that all evaluations of a given condition
are performed under socket lock protection.

Fixes: commit 365ad353c2 ("tipc: reduce risk of user starvation during link congestion")
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 22:19:30 -04:00
Linus Torvalds 8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Jon Paul Maloy ec8a09fbbe tipc: refactor function tipc_sk_recv_stream()
We try to make this function more readable by improving variable names
and comments, using more stack variables, and doing some smaller changes
to the logics. We also rename the function to make it consistent with
naming conventions used elsewhere in the code.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Jon Paul Maloy e9f8b10101 tipc: refactor function tipc_sk_recvmsg()
We try to make this function more readable by improving variable names
and comments, plus some minor changes to the logics.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Parthasarathy Bhuvaragan c1be775628 tipc: close the connection if protocol messages contain errors
When a socket is shutting down, we notify the peer node about the
connection termination by reusing an incoming message if possible.
If the last received message was a connection acknowledgment
message, we reverse this message and set the error code to
TIPC_ERR_NO_PORT and send it to peer.

In tipc_sk_proto_rcv(), we never check for message errors while
processing the connection acknowledgment or probe messages. Thus
this message performs the usual flow control accounting and leaves
the session hanging.

In this commit, we terminate the connection when we receive such
error messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 4e0df4951e tipc: improve error validations for sockets in CONNECTING state
Until now, the checks for sockets in CONNECTING state was based on
the assumption that the incoming message was always from the
peer's accepted data socket.

However an application using a non-blocking socket sends an implicit
connect, this socket which is in CONNECTING state can receive error
messages from the peer's listening socket. As we discard these
messages, the application socket hangs as there due to inactivity.
In addition to this, there are other places where we process errors
but do not notify the user.

In this commit, we process such incoming error messages and notify
our users about them using sk_state_change().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 42b531de17 tipc: Fix missing connection request handling
In filter_connect, we use waitqueue_active() to check for any
connections to wakeup. But waitqueue_active() is missing memory
barriers while accessing the critical sections, leading to
inconsistent results.

In this commit, we replace this with an SMP safe wq_has_sleeper()
using the generic socket callback sk_data_ready().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
David S. Miller b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Parthasarathy Bhuvaragan 05ff837897 tipc: fix socket flow control accounting error at tipc_recv_stream
Until now in tipc_recv_stream(), we update the received
unacknowledged bytes based on a stack variable and not based on the
actual message size.
If the user buffer passed at tipc_recv_stream() is smaller than the
received skb, the size variable in stack differs from the actual
message size in the skb. This leads to a flow control accounting
error causing permanent congestion.

In this commit, we fix this accounting error by always using the
size of the incoming message.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:38 -04:00
Parthasarathy Bhuvaragan 3364d61c92 tipc: fix socket flow control accounting error at tipc_send_stream
Until now in tipc_send_stream(), we return -1 when the socket
encounters link congestion even if the socket had successfully
sent partial data. This is incorrect as the application resends
the same the partial data leading to data corruption at
receiver's end.

In this commit, we return the partially sent bytes as the return
value at link congestion.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:37 -04:00
Pan Bian 78302fd405 tipc: check return value of nlmsg_new
Function nlmsg_new() will return a NULL pointer if there is no enough
memory, and its return value should be checked before it is used.
However, in function tipc_nl_node_get_monitor(), the validation of the
return value of function nlmsg_new() is missed. This patch fixes the
bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 15:51:30 -04:00
Johannes Berg fe52145f91 netlink: pass extended ACK struct where available
This is an add-on to the previous patch that passes the extended ACK
structure where it's already available by existing genl_info or extack
function arguments.

This was done with this spatch (with some manual adjustment of
indentation):

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, info->extack)
...
}

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, info->extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse(A, B, C, D, E, NULL)
+nla_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_validate(A, B, C, D, NULL)
+nlmsg_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate(A, B, C, D, NULL)
+nla_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate_nested(A, B, C, NULL)
+nla_validate_nested(A, B, C, extack)
...>
}

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Erik Hugne 66bc1e8d5d tipc: allow rdm/dgram socketpairs
for socketpairs using connectionless transport, we cache
the respective node local TIPC portid to use in subsequent
calls to send() in the socket's private data.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Erik Hugne 70b03759e9 tipc: add support for stream/seqpacket socketpairs
sockets A and B are connected back-to-back, similar to what
AF_UNIX does.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Ying Xue 7efea60dcf tipc: adjust the policy of holding subscription kref
When a new subscription object is inserted into name_seq->subscriptions
list, it's under name_seq->lock protection; when a subscription is
deleted from the list, it's also under the same lock protection;
similarly, when accessing a subscription by going through subscriptions
list, the entire process is also protected by the name_seq->lock.

Therefore, if subscription refcount is increased before it's inserted
into subscriptions list, and its refcount is decreased after it's
deleted from the list, it will be unnecessary to hold refcount at all
before accessing subscription object which is obtained by going through
subscriptions list under name_seq->lock protection.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:33 -07:00
Ying Xue 139bb36f75 tipc: advance the time of deleting subscription from subscriber->subscrp_list
After a subscription object is created, it's inserted into its
subscriber subscrp_list list under subscriber lock protection,
similarly, before it's destroyed, it should be first removed from
its subscriber->subscrp_list. Since the subscription list is
accessed with subscriber lock, all the subscriptions are valid
during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
remove subscription get/put and the extra subscriber unlock/lock.

After this change, the subscriptions refcount cleanup is very simple
and does not access any lock.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:32 -07:00
Ying Xue 557d054c01 tipc: fix nametbl deadlock at tipc_nametbl_unsubscribe
Until now, tipc_nametbl_unsubscribe() is called at subscriptions
reference count cleanup. Usually the subscriptions cleanup is
called at subscription timeout or at subscription cancel or at
subscriber delete.

We have ignored the possibility of this being called from other
locations, which causes deadlock as we try to grab the
tn->nametbl_lock while holding it already.

   CPU1:                             CPU2:
----------                     ----------------
tipc_nametbl_publish
spin_lock_bh(&tn->nametbl_lock)
tipc_nametbl_insert_publ
tipc_nameseq_insert_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

   CPU1:                              CPU2:
----------                     ----------------
tipc_nametbl_stop
spin_lock_bh(&tn->nametbl_lock)
tipc_purge_publications
tipc_nameseq_remove_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

In this commit, we advance the calling of tipc_nametbl_unsubscribe()
from the refcount cleanup to the intended callers.

Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:59:16 -07:00
David Howells cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Ingo Molnar 174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Jon Paul Maloy 681a55d717 tipc: move premature initilalization of stack variables
In the function tipc_rcv() we initialize a couple of stack variables
from the message header before that same header has been validated.
In rare cases when the arriving header is non-linar, the validation
function itself may linearize the buffer by calling skb_may_pull(),
while the wrongly initialized stack fields are not updated accordingly.

We fix this in this commit.

Reported-by: Matthew Wong <mwong@sonusnet.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 11:42:54 -05:00
Herbert Xu 40f9f43970 tipc: Fix tipc_sk_reinit race conditions
There are two problems with the function tipc_sk_reinit.  Firstly
it's doing a manual walk over an rhashtable.  This is broken as
an rhashtable can be resized and if you manually walk over it
during a resize then you may miss entries.

Secondly it's missing memory barriers as previously the code used
spinlocks which provide the barriers implicitly.

This patch fixes both problems.

Fixes: 07f6c4bc04 ("tipc: convert tipc reference table to...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 12:28:35 -05:00
David S. Miller 4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Dan Carpenter a08ef4768f tipc: uninitialized return code in tipc_setsockopt()
We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.

Fixes: 01fd12bb18 ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:41:34 -05:00
Parthasarathy Bhuvaragan 35e22e49a5 tipc: fix cleanup at module unload
In tipc_server_stop(), we iterate over the connections with limiting
factor as server's idr_in_use. We ignore the fact that this variable
is decremented in tipc_close_conn(), leading to premature exit.

In this commit, we iterate until the we have no connections left.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 4c887aa65d tipc: ignore requests when the connection state is not CONNECTED
In tipc_conn_sendmsg(), we first queue the request to the outqueue
followed by the connection state check. If the connection is not
connected, we should not queue this message.

In this commit, we reject the messages if the connection state is
not CF_CONNECTED.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 9dc3abdd1f tipc: fix nametbl_lock soft lockup at module exit
Commit 333f796235 ("tipc: fix a race condition leading to
subscriber refcnt bug") reveals a soft lockup while acquiring
nametbl_lock.

Before commit 333f796235, we call tipc_conn_shutdown() from
tipc_close_conn() in the context of tipc_topsrv_stop(). In that
context, we are allowed to grab the nametbl_lock.

Commit 333f796235, moved tipc_conn_release (renamed from
tipc_conn_shutdown) to the connection refcount cleanup. This allows
either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup.

Since tipc_exit_net() first calls tipc_topsrv_stop() and then
tipc_nametble_withdraw() increases the chances for the later to
perform the connection cleanup.

The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
when it performs the tipc_conn_kref_release() as it tries to grab
nametbl_lock again while holding it already.
tipc_nametbl_withdraw() grabs nametbl_lock
  tipc_nametbl_remove_publ()
    tipc_subscrp_report_overlap()
      tipc_subscrp_send_event()
        tipc_conn_sendmsg()
          << if (con->flags != CF_CONNECTED) we do conn_put(),
             triggering the cleanup as refcount=0. >>
          tipc_conn_kref_release
            tipc_sock_release
              tipc_conn_release
                tipc_subscrb_delete
                  tipc_subscrp_delete
                    tipc_nametbl_unsubscribe << Soft Lockup >>

The previous changes in this series fixes the race conditions fixed
by commit 333f796235. Hence we can now revert the commit.

Fixes: 333f796235 ("tipc: fix a race condition leading to subscriber refcnt bug")
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan fc0adfc8fd tipc: fix connection refcount error
Until now, the generic server framework maintains the connection
id's per subscriber in server's conn_idr. At tipc_close_conn, we
remove the connection id from the server list, but the connection is
valid until we call the refcount cleanup. Hence we have a window
where the server allocates the same connection to an new subscriber
leading to inconsistent reference count. We have another refcount
warning we grab the refcount in tipc_conn_lookup() for connections
with flag with CF_CONNECTED not set. This usually occurs at shutdown
when the we stop the topology server and withdraw TIPC_CFG_SRV
publication thereby triggering a withdraw message to subscribers.

In this commit, we:
1. remove the connection from the server list at recount cleanup.
2. grab the refcount for a connection only if CF_CONNECTED is set.

Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan d094c4d5f5 tipc: add subscription refcount to avoid invalid delete
Until now, the subscribers keep track of the subscriptions using
reference count at subscriber level. At subscription cancel or
subscriber delete, we delete the subscription only if the timer
was pending for the subscription. This approach is incorrect as:
1. del_timer() is not SMP safe, if on CPU0 the check for pending
   timer returns true but CPU1 might schedule the timer callback
   thereby deleting the subscription. Thus when CPU0 is scheduled,
   it deletes an invalid subscription.
2. We export tipc_subscrp_report_overlap(), which accesses the
   subscription pointer multiple times. Meanwhile the subscription
   timer can expire thereby freeing the subscription and we might
   continue to access the subscription pointer leading to memory
   violations.

In this commit, we introduce subscription refcount to avoid deleting
an invalid subscription.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan 93f955aad4 tipc: fix nametbl_lock soft lockup at node/link events
We trigger a soft lockup as we grab nametbl_lock twice if the node
has a pending node up/down or link up/down event while:
- we process an incoming named message in tipc_named_rcv() and
  perform an tipc_update_nametbl().
- we have pending backlog items in the name distributor queue
  during a nametable update using tipc_nametbl_publish() or
  tipc_nametbl_withdraw().

The following are the call chain associated:
tipc_named_rcv() Grabs nametbl_lock
   tipc_update_nametbl() (publish/withdraw)
     tipc_node_subscribe()/unsubscribe()
       tipc_node_write_unlock()
          << lockup occurs if an outstanding node/link event
             exits, as we grabs nametbl_lock again >>

tipc_nametbl_withdraw() Grab nametbl_lock
  tipc_named_process_backlog()
    tipc_update_nametbl()
      << rest as above >>

The function tipc_node_write_unlock(), in addition to releasing the
lock processes the outstanding node/link up/down events. To do this,
we need to grab the nametbl_lock again leading to the lockup.

In this commit we fix the soft lockup by introducing a fast variant of
node_unlock(), where we just release the lock. We adapt the
node_subscribe()/node_unsubscribe() to use the fast variants.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Jon Paul Maloy 01fd12bb18 tipc: make replicast a user selectable option
If the bearer carrying multicast messages supports broadcast, those
messages will be sent to all cluster nodes, irrespective of whether
these nodes host any actual destinations socket or not. This is clearly
wasteful if the cluster is large and there are only a few real
destinations for the message being sent.

In this commit we extend the eligibility of the newly introduced
"replicast" transmit option. We now make it possible for a user to
select which method he wants to be used, either as a mandatory setting
via setsockopt(), or as a relative setting where we let the broadcast
layer decide which method to use based on the ratio between cluster
size and the message's actual number of destination nodes.

In the latter case, a sending socket must stick to a previously
selected method until it enters an idle period of at least 5 seconds.
This eliminates the risk of message reordering caused by method change,
i.e., when changes to cluster size or number of destinations would
otherwise mandate a new method to be used.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy a853e4c6d0 tipc: introduce replicast as transport option for multicast
TIPC multicast messages are currently carried over a reliable
'broadcast link', making use of the underlying media's ability to
transport packets as L2 broadcast or IP multicast to all nodes in
the cluster.

When the used bearer is lacking that ability, we can instead emulate
the broadcast service by replicating and sending the packets over as
many unicast links as needed to reach all identified destinations.
We now introduce a new TIPC link-level 'replicast' service that does
this.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy 2ae0b8af1f tipc: add functionality to lookup multicast destination nodes
As a further preparation for the upcoming 'replicast' functionality,
we add some necessary structs and functions for looking up and returning
a list of all nodes that host destinations for a given multicast message.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:16 -05:00
Jon Paul Maloy 9999974a83 tipc: add function for checking broadcast support in bearer
As a preparation for the 'replicast' functionality we are going to
introduce in the next commits, we need the broadcast base structure to
store whether bearer broadcast is available at all from the currently
used bearer or bearers.

We do this by adding a new function tipc_bearer_bcast_support() to
the bearer layer, and letting the bearer selection function in
bcast.c use this to give a new boolean field, 'bcast_support' the
appropriate value.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:15 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Parthasarathy Bhuvaragan 57d5f64d83 tipc: allocate user memory with GFP_KERNEL flag
Until now, we allocate memory always with GFP_ATOMIC flag.
When the system is under memory pressure and a user tries to send,
the send fails due to low memory. However, the user application
can wait for free memory if we allocate it using GFP_KERNEL flag.

In this commit, we use allocate memory with GFP_KERNEL for all user
allocation.

Reported-by: Rune Torgersen <runet@innovsys.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:31:53 -05:00
Jon Paul Maloy 365ad353c2 tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.

This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.

In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.

The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 4d8642d896 tipc: modify struct tipc_plist to be more versatile
During multicast reception we currently use a simple linked list with
push/pop semantics to store port numbers.

We now see a need for a more generic list for storing values of type
u32. We therefore make some modifications to this list, while replacing
the prefix 'tipc_plist_' with 'u32_'. We also add a couple of new
functions which will come to use in the next commits.

Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 8c44e1af16 tipc: unify tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() functions
The functions tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() are very
similar. The latter function is also called from two locations, and
there will be more in the coming commits, which will all need to test on
different conditions.

Instead of making yet another duplicates of the function, we now
introduce a new macro tipc_wait_for_cond() where the wakeup condition
can be stated as an argument to the call. This macro replaces all
current and future uses of the two functions, which can now be
eliminated.

Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 693c56491f tipc: don't send FIN message from connectionless socket
In commit 6f00089c73 ("tipc: remove SS_DISCONNECTING state") the
check for socket type is in the wrong place, causing a closing socket
to always send out a FIN message even when the socket was never
connected. This is normally harmless, since the destination node for
such messages most often is zero, and the message will be dropped, but
it is still a wrong and confusing behavior.

We fix this in this commit.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
Linus Torvalds 9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Al Viro cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Michal Kubeček 3de81b7588 tipc: check minimum bearer MTU
Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:03:20 -05:00
Jon Paul Maloy 9590112241 tipc: fix link statistics counter errors
In commit e4bf4f7696 ("tipc: simplify packet sequence number
handling") we changed the internal representation of the packet
sequence number counters from u32 to u16, reflecting what is really
sent over the wire.

Since then some link statistics counters have been displaying incorrect
values, partially because the counters meant to be used as sequence
number snapshots are now used as direct counters, stored as u32, and
partially because some counter updates are just missing in the code.

In this commit we correct this in two ways. First, we base the
displayed packet sent/received values on direct counters instead
of as previously a calculated difference between current sequence
number and a snapshot. Second, we add the missing updates of the
counters.

This change is compatible with the current netlink API, and requires
no changes to the user space tools.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:35:55 -05:00
David S. Miller 0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Jon Paul Maloy 6998cc6ec2 tipc: resolve connection flow control compatibility problem
In commit 10724cc7bb ("tipc: redesign connection-level flow control")
we replaced the previous message based flow control with one based on
1k blocks. In order to ensure backwards compatibility the mechanism
falls back to using message as base unit when it senses that the peer
doesn't support the new algorithm. The default flow control window,
i.e., how many units can be sent before the sender blocks and waits
for an acknowledge (aka advertisement) is 512. This was tested against
the previous version, which uses an acknowledge frequency of on ack per
256 received message, and found to work fine.

However, we missed the fact that versions older than Linux 3.15 use an
acknowledge frequency of 512, which is exactly the limit where a 4.6+
sender will stop and wait for acknowledge. This would also work fine if
it weren't for the fact that if the first sent message on a 4.6+ server
side is an empty SYNACK, this one is also is counted as a sent message,
while it is not counted as a received message on a legacy 3.15-receiver.
This leads to the sender always being one step ahead of the receiver, a
scenario causing the sender to block after 512 sent messages, while the
receiver only has registered 511 read messages. Hence, the legacy
receiver is not trigged to send an acknowledge, with a permanently
blocked sender as result.

We solve this deadlock by simply allowing the sender to send one more
message before it blocks, i.e., by a making minimal change to the
condition used for determining connection congestion.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 21:38:16 -05:00
Jon Paul Maloy d876a4d2af tipc: improve sanity check for received domain records
In commit 35c55c9877 ("tipc: add neighbor monitoring framework") we
added a data area to the link monitor STATE messages under the
assumption that previous versions did not use any such data area.

For versions older than Linux 4.3 this assumption is not correct. In
those version, all STATE messages sent out from a node inadvertently
contain a 16 byte data area containing a string; -a leftover from
previous RESET messages which were using this during the setup phase.
This string serves no purpose in STATE messages, and should no be there.

Unfortunately, this data area is delivered to the link monitor
framework, where a sanity check catches that it is not a correct domain
record, and drops it. It also issues a rate limited warning about the
event.

Since such events occur much more frequently than anticipated, we now
choose to remove the warning in order to not fill the kernel log with
useless contents. We also make the sanity check stricter, to further
reduce the risk that such data is inavertently admitted.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
Jon Paul Maloy f79675563a tipc: fix compatibility bug in link monitoring
commit 817298102b ("tipc: fix link priority propagation") introduced a
compatibility problem between TIPC versions newer than Linux 4.6 and
those older than Linux 4.4. In versions later than 4.4, link STATE
messages only contain a non-zero link priority value when the sender
wants the receiver to change its priority. This has the effect that the
receiver resets itself in order to apply the new priority. This works
well, and is consistent with the said commit.

However, in versions older than 4.4 a valid link priority is present in
all sent link STATE messages, leading to cyclic link establishment and
reset on the 4.6+ node.

We fix this by adding a test that the received value should not only
be valid, but also differ from the current value in order to cause the
receiving link endpoint to reset.

Reported-by: Amar Nv <amar.nv005@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Jon Paul Maloy 51b9a31c42 tipc: eliminate obsolete socket locking policy description
The comment block in socket.c describing the locking policy is
obsolete, and does not reflect current reality. We remove it in this
commit.

Since the current locking policy is much simpler and follows a
mainstream approach, we see no need to add a new description.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:15:41 -05:00
Alexey Dobriyan c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
WANG Cong d9dc8b0f8b net: fix sleeping for sk_wait_event()
Similar to commit 14135f30e3 ("inet: fix sleeping inside inet_wait_for_connect()"),
sk_wait_event() needs to fix too, because release_sock() is blocking,
it changes the process state back to running after sleep, which breaks
the previous prepare_to_wait().

Switch to the new wait API.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 13:17:21 -05:00
Parthasarathy Bhuvaragan f40acbaf42 tipc: remove SS_CONNECTED sock state
In this commit, we replace references to sock->state SS_CONNECTE
with sk_state TIPC_ESTABLISHED.

Finally, the sock->state is no longer explicitly used by tipc.
The FSM below is for various types of connection oriented sockets.

Stream Server Listening Socket:
+-----------+       +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+       +-------------+

Stream Server Data Socket:
+-----------+       +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+       +------------------+
                          ^   |
                          |   |
                          |   v
                    +--------------------+
                    | TIPC_DISCONNECTING |
                    +--------------------+

Stream Socket Client:
+-----------+       +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+       +-----------------+      |
                            |                |
                            |                |
                            v                |
                    +------------------+     |
                    | TIPC_ESTABLISHED |     |
                    +------------------+     |
                          ^   |              |
                          |   |              |
                          |   v              |
                    +--------------------+   |
                    | TIPC_DISCONNECTING |<--+
                    +--------------------+

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 99a2088981 tipc: create TIPC_CONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_CONNECTING
by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 6f00089c73 tipc: remove SS_DISCONNECTING state
In this commit, we replace the references to SS_DISCONNECTING with
the combination of sk_state TIPC_DISCONNECTING and flags set in
sk_shutdown.
We introduce a new function _tipc_shutdown(), which provides
the common code required by tipc_release() and tipc_shutdown().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 9fd4b070f6 tipc: create TIPC_DISCONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
sk_state. TIPC_DISCONNECTING is replacing the socket connection status
update using SS_DISCONNECTING.
TIPC_DISCONNECTING is set for connection oriented sockets at:
- tipc_shutdown()
- connection probe timeout
- when we receive an error message on the connection.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 438adcaf0d tipc: create TIPC_OPEN as a new sk_state
In this commit, we create a new tipc socket state TIPC_OPEN in
sk_state. We primarily replace the SS_UNCONNECTED sock->state with
TIPC_OPEN.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 8ea642ee9a tipc: create TIPC_ESTABLISHED as a new sk_state
Until now, tipc maintains probing state for connected sockets in
tsk->probing_state variable.

In this commit, we express this information as socket states and
this remove the variable. We set probe_unacked flag when a probe
is sent out and reset it if we receive a reply. Instead of the
probing state TIPC_CONN_OK, we create a new state TIPC_ESTABLISHED.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 0c288c8692 tipc: create TIPC_LISTEN as a new sk_state
Until now, tipc maintains the socket state in sock->state variable.
This is used to maintain generic socket states, but in tipc
we overload it and save tipc socket states like TIPC_LISTEN.
Other protocols like TCP, UDP store protocol specific states
in sk->sk_state instead.

In this commit, we :
- declare a new tipc state TIPC_LISTEN, that replaces SS_LISTEN
- Create a new function tipc_set_state(), to update sk->sk_state.
- TIPC_LISTEN state is maintained in sk->sk_state.
- replace references to SS_LISTEN with TIPC_LISTEN.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan c752023aab tipc: remove socket state SS_READY
Until now, tipc socket state SS_READY declares that the socket is a
connectionless socket.

In this commit, we remove the state SS_READY and replace it with a
condition which returns true for datagram / connectionless sockets.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 360aab6b49 tipc: remove probing_intv from tipc_sock
Until now, probing_intv is a variable in struct tipc_sock but is
always set to a constant CONN_PROBING_INTERVAL. The socket
connection is probed based on this value.

In this commit, we remove this variable and setup the socket
timer based on the constant CONN_PROBING_INTERVAL.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan d6fb7e9c99 tipc: remove tsk->connected from tipc_sock
Until now, we determine if a socket is connected or not based on
tsk->connected, which is set once when the probing state is set
to TIPC_CONN_OK. It is unset when the sock->state is updated from
SS_CONNECTED to any other state.

In this commit, we remove connected variable from tipc_sock and
derive socket connection status from the following condition:
sock->state == SS_CONNECTED => tsk->connected

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 87227fe7e4 tipc: remove tsk->connected for connectionless sockets
Until now, for connectionless sockets the peer information during
connect is stored in tsk->peer and a connection state is set in
tsk->connected. This is redundant.

In this commit, for connectionless sockets we update:
- __tipc_sendmsg(), when the destination is NULL the peer existence
  is determined by tsk->peer.family, instead of tsk->connected.
- tipc_connect(), remove set/unset of tsk->connected.
Hence tsk->connected is no longer used for connectionless sockets.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan aeda16b6ae tipc: rename tsk->remote to tsk->peer for consistent naming
Until now, the peer information for connect is stored in tsk->remote
but the rest of code uses the name peer for peer/remote.

In this commit, we rename tsk->remote to tsk->peer to align with
naming convention followed in the rest of the code.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan ba8aebe943 tipc: rename struct tipc_skb_cb member handle to bytes_read
In this commit, we rename handle to bytes_read indicating the
purpose of the member.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan cb5da847af tipc: set kern=0 in sk_alloc() during tipc_accept()
Until now, tipc_accept() calls sk_alloc() with kern=1. This is
incorrect as the data socket's owner is the user application.
Thus for these accepted data sockets the network namespace
refcount is skipped.

In this commit, we fix this by setting kern=0.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 4891d8fe16 tipc: wakeup sleeping users at disconnect
Until now, in filter_connect() when we terminate a connection due to
an error message from peer, we set the socket state to DISCONNECTING.

The socket is notified about this broken connection using EPIPE when
a user tries to send a message. However if a socket was waiting on a
poll() while the connection is being terminated, we fail to wakeup
that socket.

In this commit, we wakeup sleeping sockets at connection termination.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 7cf87fa278 tipc: return early for non-blocking sockets at link congestion
Until now, in stream/mcast send() we pass the message to the link
layer even when the link is congested and add the socket to the
link's wakeup queue. This is unnecessary for non-blocking sockets.
If a socket is set to non-blocking and sends multicast with zero
back off time while receiving EAGAIN, we exhaust the memory.

In this commit, we return immediately at stream/mcast send() for
non-blocking sockets.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Jon Paul Maloy 06bd2b1ed0 tipc: fix broadcast link synchronization problem
In commit 2d18ac4ba7 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.

We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.

We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.

To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.

This minor protocol update is fully backwards compatible.

Reported-by: John Thompson <thompa.atl@gmail.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:21:09 -04:00
Johannes Berg 56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg c90c39dab3 genetlink: introduce and use genl_family_attrbuf()
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:08 -04:00
Dan Carpenter 7307616245 tipc: info leak in __tipc_nl_add_udp_addr()
We should clear out the padding and unused struct members so that we
don't expose stack information to userspace.

Fixes: fdb3accc2c ('tipc: add the ability to get UDP options via netlink')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:10:01 -04:00
Wei Yongjun c20cb81193 tipc: fix possible memory leak in tipc_udp_enable()
'ub' is malloced in tipc_udp_enable() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: ba5aa84a2d ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:28:32 -04:00
David S. Miller b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Jon Paul Maloy e0a05ebe26 tipc: send broadcast nack directly upon sequence gap detection
Because of the risk of an excessive number of NACK messages and
retransissions, receivers have until now abstained from sending
broadcast NACKS directly upon detection of a packet sequence number
gap. We have instead relied on such gaps being detected by link
protocol STATE message exchange, something that by necessity delays
such detection and subsequent retransmissions.

With the introduction of unicast NACK transmission and rate control
of retransmissions we can now remove this limitation. We now allow
receiving nodes to send NACKS immediately, while coordinating the
permission to do so among the nodes in order to avoid NACK storms.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:25 -07:00
Jon Paul Maloy 7c4a54b963 tipc: rate limit broadcast retransmissions
As cluster sizes grow, so does the amount of identical or overlapping
broadcast NACKs generated by the packet receivers. This often leads to
'NACK crunches' resulting in huge numbers of redundant retransmissions
of the same packet ranges.

In this commit, we introduce rate control of broadcast retransmissions,
so that a retransmitted range cannot be retransmitted again until after
at least 10 ms. This reduces the frequency of duplicate, redundant
retransmissions by an order of magnitude, while having a significant
positive impact on overall throughput and scalability.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
Jon Paul Maloy 02d11ca200 tipc: transfer broadcast nacks in link state messages
When we send broadcasts in clusters of more 70-80 nodes, we sometimes
see the broadcast link resetting because of an excessive number of
retransmissions. This is caused by a combination of two factors:

1) A 'NACK crunch", where loss of broadcast packets is discovered
   and NACK'ed by several nodes simultaneously, leading to multiple
   redundant broadcast retransmissions.

2) The fact that the NACKS as such also are sent as broadcast, leading
   to excessive load and packet loss on the transmitting switch/bridge.

This commit deals with the latter problem, by moving sending of
broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
bits in word 8 of the said message for this purpose, and introduce a
new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
backwards compatible.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
Parthasarathy Bhuvaragan d2f394dc48 tipc: fix random link resets while adding a second bearer
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.

When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.

When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).

Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.

In this commit, we adjust the size of name distributor message so that
they can be tunnelled.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 10:12:26 -07:00
David S. Miller 6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Richard Alpe 832629ca5c tipc: add UDP remoteip dump to netlink API
When using replicast a UDP bearer can have an arbitrary amount of
remote ip addresses associated with it. This means we cannot simply
add all remote ip addresses to an existing bearer data message as it
might fill the message, leaving us with a truncated message that we
can't safely resume. To handle this we introduce the new netlink
command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be
called when the bearer data message has the
TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than
one remote ip (replicast).

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe fdb3accc2c tipc: add the ability to get UDP options via netlink
Add UDP bearer options to netlink bearer get message. This is used by
the tipc user space tool to display UDP options.

The UDP bearer information is passed using either a sockaddr_in or
sockaddr_in6 structs. This means the user space receiver should
intermediately store the retrieved data in a large enough struct
(sockaddr_strage) before casting to the proper IP version type.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00