Unfortunately, I introduced a tremendously stupid bug into
genlmsg_multicast() when doing all those multicast group
changes: it adjusts the group number, but then passes it
to genlmsg_multicast_netns() which does that again.
Somehow, my tests failed to catch this, so add a warning
into genlmsg_multicast_netns() and remove the offending
group ID adjustment.
Also add a warning to the similar code in other functions
so people who misuse them are more loudly warned.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.
This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.
Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.
Also document these changes in include/linux/net.h as suggested by David
Miller.
Changes since RFC:
Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.
With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
msg->msg_name = NULL
".
This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.
Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.
Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because the only have one group.
This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.
At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no reason to have the family pointer there since it
can just be passed internally where needed, so remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no users of this API remaining, and we'll soon
change group registration to be static (like ops are now)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The quota code is abusing the genetlink API and is using
its family ID as the multicast group ID, which is invalid
and may belong to somebody else (and likely will.)
Make the quota code use the correct API, but since this
is already used as-is by userspace, reserve a family ID
for this code and also reserve that group ID to not break
userspace assumptions.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The drop monitor code is abusing the genetlink API and is
statically using the generic netlink multicast group 1, even
if that group belongs to somebody else (which it invariably
will, since it's not reserved.)
Make the drop monitor code use the proper APIs to reserve a
group ID, but also reserve the group id 1 in generic netlink
code to preserve the userspace API. Since drop monitor can
be a module, don't clear the bit for it on unregistration.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameter is just 'group', not 'groups', fix the documentation typo.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse pointed out that the new flags variable I had added
shadowed an existing one, rename the new one to avoid that,
making the code clearer.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the ops assignment is just two variables rather than a
long list iteration etc., there's no reason to separately export
__genl_register_family() and __genl_register_family_with_ops().
Unify the two functions into __genl_register_family() and make
genl_register_family_with_ops() call it after assigning the ops.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow making the ops array const by not modifying the ops
flags on registration but rather only when ops are sent
out in the family information.
No users are updated yet except for the pre_doit/post_doit
calls in wireless (the only ones that exist now.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a linked list, use an array. This reduces
the data size needed by the users of genetlink, for example
in wireless (net/wireless/nl80211.c) on 64-bit it frees up
over 1K of data space.
Remove the attempted sending of CTRL_CMD_NEWOPS ctrl event
since genl_ctrl_event(CTRL_CMD_NEWOPS, ...) only returns
-EINVAL anyway, therefore no such event could ever be sent.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
genl_register_ops() is still needed for internal registration,
but is no longer available to users of the API.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix finer-grained control and let only a whitelist of allowed netlink
protocols pass, in our case related to networking. If later on, other
subsystems decide they want to add their protocol as well to the list
of allowed protocols they shall simply add it. While at it, we also
need to tell what protocol is in use otherwise BPF_S_ANC_PROTOCOL can
not pick it up (as it's not filled out).
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c
The conflicts were minor:
1) sit.c changes overlap with change to ip_tunnel_xmit() signature.
2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.
3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink dump operations take module as parameter to hold
reference for entire netlink dump duration.
Currently it holds ref only on genl module which is not correct
when we use ops registered to genl from another module.
Following patch adds module pointer to genl_ops so that netlink
can hold ref count on it.
CC: Jesse Gross <jesse@nicira.com>
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of genl-family with parallel ops off, dumpif() callback
is expected to run under genl_lock, But commit def3117493
(genl: Allow concurrent genl callbacks.) changed this behaviour
where only first dumpit() op was called under genl-lock.
For subsequent dump, only nlk->cb_lock was taken.
Following patch fixes it by defining locked dumpit() and done()
callback which takes care of genl-locking.
CC: Jesse Gross <jesse@nicira.com>
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/iwlwifi/pcie/trans.c
include/linux/inetdevice.h
The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.
The iwlwifi conflict is a context overlap.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 58ad436fcf.
It turns out that the change introduced a potential deadlock
by causing a locking dependency with netlink's cb_mutex. I
can't seem to find a way to resolve this without doing major
changes to the locking, so revert this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch stores struct netlink_callback in netlink_sock
to avoid allocating and freeing it on every netlink dump msg.
Only one dump operation is allowed for a given socket at a time
therefore we can safely convert cb pointer to cb struct inside
netlink_sock.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dumping generic netlink families, only the first dump call
is locked with genl_lock(), which protects the list of families,
and thus subsequent calls can access the data without locking,
racing against family addition/removal. This can cause a crash.
Fix it - the locking needs to be conditional because the first
time around it's already locked.
A similar bug was reported to me on an old kernel (3.4.47) but
the exact scenario that happened there is no longer possible,
on those kernels the first round wasn't locked either. Looking
at the current code I found the race described above, which had
also existed on the old kernel.
Cc: stable@vger.kernel.org
Reported-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variable ptr is being assigned, but never used, so just remove it.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, it is not possible to use neither NLM_F_EXCL nor
NLM_F_REPLACE from genetlink. This is due to this checking in
genl_family_rcv_msg:
if (nlh->nlmsg_flags & NLM_F_DUMP)
NLM_F_DUMP is NLM_F_MATCH|NLM_F_ROOT. Thus, if NLM_F_EXCL or
NLM_F_REPLACE flag is set, genetlink believes that you're
requesting a dump and it calls the .dumpit callback.
The solution that I propose is to refine this checking to
make it stricter:
if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP)
And given the combination NLM_F_REPLACE and NLM_F_EXCL does
not make sense to me, it removes the ambiguity.
There was a patch that tried to fix this some time ago (0ab03c2
netlink: test for all flags of the NLM_F_DUMP composite) but it
tried to resolve this ambiguity in *all* existing netlink subsystems,
not only genetlink. That patch was reverted since it broke iproute2,
which is using NLM_F_ROOT to request the dump of the routing cache.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since (c05cdb1 netlink: allow large data transfers from user-space),
netlink splats if it invokes skb_clone on large netlink skbs since:
* skb_shared_info was not correctly initialized.
* skb->destructor is not set in the cloned skb.
This was spotted by trinity:
[ 894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
[ 894.991034] IP: [<ffffffff81a212c4>] skb_clone+0x24/0xc0
[...]
[ 894.991034] Call Trace:
[ 894.991034] [<ffffffff81ad299a>] nl_fib_input+0x6a/0x240
[ 894.991034] [<ffffffff81c3b7e6>] ? _raw_read_unlock+0x26/0x40
[ 894.991034] [<ffffffff81a5f189>] netlink_unicast+0x169/0x1e0
[ 894.991034] [<ffffffff81a601e1>] netlink_sendmsg+0x251/0x3d0
Fix it by:
1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
that sets our special skb->destructor in the cloned skb. Moreover, handle
the release of the large cloned skb head area in the destructor path.
2) not allowing large skbuffs in the netlink broadcast path. I cannot find
any reasonable use of the large data transfer using netlink in that path,
moreover this helps to skip extra skb_clone handling.
I found two more netlink clients that are cloning the skbs, but they are
not in the sendmsg path. Therefore, the sole client cloning that I found
seems to be the fib frontend.
Thanks to Eric Dumazet for helping to address this issue.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to the networking receive path with ptype_all taps, we add
the possibility to register netdevices that are for ARPHRD_NETLINK to
the netlink subsystem, so that those can be used for netlink analyzers
resp. debuggers. We do not offer a direct callback function as out-of-tree
modules could do crap with it. Instead, a netdevice must be registered
properly and only receives a clone, managed by the netlink layer. Symbols
are exported as GPL-only.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/Kconfig
drivers/net/xen-netback/netback.c
net/batman-adv/bat_iv_ogm.c
net/wireless/nl80211.c
The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.
The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().
Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.
The nl80211 conflict is a little more involved. In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported. Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.
However, the dump handlers to not use this logic. Instead they have
to explicitly do the locking. There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so. So I fixed that whilst handling the overlapping changes.
To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit da12c90e09
"netlink: Add compare function for netlink_table"
only set compare at the time we create kernel netlink,
and reset compare to NULL at the time we finially
release netlink socket, but netlink_lookup wants
the compare exist always.
So we should set compare after we allocate nl_table,
and never reset it. make comapre exist all the time.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return the error if something went wrong instead of unconditionally
returning 0.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we know, netlink sockets are private resource of
net namespace, they can communicate with each other
only when they in the same net namespace. this works
well until we try to add namespace support for other
subsystems which use netlink.
Don't like ipv4 and route table.., it is not suited to
make these subsytems belong to net namespace, Such as
audit and crypto subsystems,they are more suitable to
user namespace.
So we must have the ability to make the netlink sockets
in same user namespace can communicate with each other.
This patch adds a new function pointer "compare" for
netlink_table, we can decide if the netlink sockets can
communicate with each other through this netlink_table
self-defined compare function.
The behavior isn't changed if we don't provide the compare
function for netlink_table.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I can hit ENOBUFS in the sendmsg() path with a large batch that is
composed of many netlink messages. Here that limit is 8 MBytes of
skbuff data area as kmalloc does not manage to get more than that.
While discussing atomic rule-set for nftables with Patrick McHardy,
we decided to put all rule-set updates that need to be applied
atomically in one single batch to simplify the existing approach.
However, as explained above, the existing netlink code limits us
to a maximum of ~20000 rules that fit in one single batch without
hitting ENOBUFS. iptables does not have such limitation as it is
using vmalloc.
This patch adds netlink_alloc_large_skb() which is only used in
the netlink_sendmsg() path. It uses alloc_skb if the memory
requested is <= one memory page, that should be the common case
for most subsystems, else vmalloc for higher memory allocations.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet spotted that we have to check skb->head instead
of skb->data as skb->head points to the beginning of the
data area of the skbuff. Similarly, we have to initialize the
skb->head pointer, not skb->data in __alloc_skb_head.
After this fix, netlink crashes in the release path of the
sk_buff, so let's fix that as well.
This bug was introduced in (0ebd0ac net: add function to
allocate sk_buff head without data area).
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, in menuconfig, Netlink's new mmaped IO is the very first
entry under the ``Networking support'' item and comes even before
``Networking options'':
[ ] Netlink: mmaped IO
Networking options --->
...
Lets move this into ``Networking options'' under netlink's Kconfig,
since this might be more appropriate. Introduced by commit ccdfcc398
(``netlink: mmaped netlink: ring setup'').
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f9c2288837 (netlink:
implement memory mapped recvmsg) increamented skb->users
ref count twice for a dump op which does not look right.
Following patch fixes that.
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'attrbuf' is malloced in genl_family_rcv_msg() when family->maxattr &&
family->parallel_ops, thus should be freed before leaving from the error
handling cases, otherwise it will cause memory leak.
Introduced by commit def3117493
(genl: Allow concurrent genl callbacks.)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
All genl callbacks are serialized by genl-mutex. This can become
bottleneck in multi threaded case.
Following patch adds an parameter to genl_family so that a
particular family can get concurrent netlink callback without
genl_lock held.
New rw-sem is used to protect genl callback from genl family unregister.
in case of parallel_ops genl-family read-lock is taken for callbacks and
write lock is taken for register or unregistration for any family.
In case of locked genl family semaphore and gel-mutex is locked for
any openration.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depending of the kernel configuration (CONFIG_UIDGID_STRICT_TYPE_CHECKS), we can
get the following errors:
net/netlink/af_netlink.c: In function ‘netlink_queue_mmaped_skb’:
net/netlink/af_netlink.c:663:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kuid_t’
net/netlink/af_netlink.c:664:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kgid_t’
net/netlink/af_netlink.c: In function ‘netlink_ring_set_copied’:
net/netlink/af_netlink.c:693:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kuid_t’
net/netlink/af_netlink.c:694:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kgid_t’
We must use the helpers to get the uid and gid, and also take care of user_ns.
Fix suggested by Eric W. Biederman <ebiederm@xmission.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/netlink/diag.c: In function 'sk_diag_put_rings_cfg':
net/netlink/diag.c:28:17: error: 'struct netlink_sock' has no member named 'pg_vec_lock'
net/netlink/diag.c:29:29: error: 'struct netlink_sock' has no member named 'rx_ring'
net/netlink/diag.c:31:30: error: 'struct netlink_sock' has no member named 'tx_ring'
net/netlink/diag.c:33:19: error: 'struct netlink_sock' has no member named 'pg_vec_lock'
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add flow control for memory mapped RX. Since user-space usually doesn't
invoke recvmsg() when using memory mapped I/O, flow control is performed
in netlink_poll(). Dumps are allowed to continue if at least half of the
ring frames are unused.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for mmap'ed recvmsg(). To allow the kernel to construct messages
into the mapped area, a dataless skb is allocated and the data pointer is
set to point into the ring frame. This means frames will be delivered to
userspace in order of allocation instead of order of transmission. This
usually doesn't matter since the order is either not determinable by
userspace or message creation/transmission is serialized. The only case
where this can have a visible difference is nfnetlink_queue. Userspace
can't assume mmap'ed messages have ordered IDs anymore and needs to check
this if using batched verdicts.
For non-mapped sockets, nothing changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for mmap'ed sendmsg() to netlink. Since the kernel validates
received messages before processing them, the code makes sure userspace
can't modify the message contents after invoking sendmsg(). To do that
only a single mapping of the TX ring is allowed to exist and the socket
must not be shared. If either of these two conditions does not hold, it
falls back to copying.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helper functions for looking up mmap'ed frame headers, reading and
writing their status, allocating skbs with mmap'ed data areas and a poll
function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for mmap'ed RX and TX ring setup and teardown based on the
af_packet.c code. The following patches will use this to add the real
mmap'ed receive and transmit functionality.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For mmap'ed I/O a netlink specific skb destructor needs to be invoked
after the final kfree_skb() to clean up state. This doesn't work currently
since the skb's ownership is transfered to the receiving socket using
skb_set_owner_r(), which orphans the skb, thereby invoking the destructor
prematurely.
Since netlink doesn't account skbs to the originating socket, there's no
need to orphan the skb. Add a netlink specific skb_set_owner_r() variant
that does not orphan the skb and use a netlink specific destructor to
call sock_rfree().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>