Commit Graph

4891 Commits

Author SHA1 Message Date
Li RongQing 2e7b162c5e netfilter: nf_conntrack: ensure that CONNTRACK_LOCKS is power of 2
CONNTRACK_LOCKS is divisor when computer array index, if it is power of
2, compiler will optimize modulo operation as bitwise AND, or else
modulo will lower performance.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:46 +01:00
Li RongQing a9f5e78c40 netfilter: nf_tables: check the result of dereferencing base_chain->stats
Check the result of dereferencing base_chain->stats, instead of result
of this_cpu_ptr with NULL.

base_chain->stats maybe be changed to NULL when a chain is updated and a
new NULL counter can be attached.

And we do not need to check returning of this_cpu_ptr since
base_chain->stats is from percpu allocator if it is non-NULL,
this_cpu_ptr returns a valid value.

And fix two sparse error by replacing rcu_access_pointer and
rcu_dereference with READ_ONCE under rcu_read_lock.

Thanks for Eric's help to finish this patch.

Fixes: 009240940e ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:34:24 +01:00
Xin Long f52a40fb41 ipvs: get sctphdr by sctphoff in sctp_csum_check
sctp_csum_check() is called by sctp_s/dnat_handler() where it calls
skb_make_writable() to ensure sctphdr to be linearized.

So there's no need to get sctphdr by calling skb_header_pointer()
in sctp_csum_check().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:28:44 +01:00
Li RongQing 11d4dd0b20 netfilter: convert the proto argument from u8 to u16
The proto in struct xt_match and struct xt_target is u16, when
calling xt_check_target/match, their proto argument is u8,
and will cause truncation, it is harmless to ip packet, since
ip proto is u8

if a etable's match/target has proto that is u16, will cause
the check failure.

and convert be16 to short in bridge/netfilter/ebtables.c

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:28:43 +01:00
wenxu 3e511d5652 netfilter: nft_tunnel: Add dst_cache support
The metadata_dst does not initialize the dst_cache field, this causes
problems to ip_md_tunnel_xmit() since it cannot use this cache, hence,
Triggering a route lookup for every packet.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:25:06 +01:00
Florian Westphal be0502a3f2 netfilter: conntrack: tcp: only close if RST matches exact sequence
TCP resets cause instant transition from established to closed state
provided the reset is in-window.  Endpoints that implement RFC 5961
require resets to match the next expected sequence number.
RST segments that are in-window (but that do not match RCV.NXT) are
ignored, and a "challenge ACK" is sent back.

Main problem for conntrack is that its a middlebox, i.e.  whereas an end
host might have ACK'd SEQ (and would thus accept an RST with this
sequence number), conntrack might not have seen this ACK (yet).

Therefore we can't simply flag RSTs with non-exact match as invalid.

This updates RST processing as follows:

1. If the connection is in a state other than ESTABLISHED, nothing is
   changed, RST is subject to normal in-window check.

2. If the RSTs sequence number either matches exactly RCV.NXT,
   connection state moves to CLOSE.

3. The same applies if the RST sequence number aligns with a previous
   packet in the same direction.

In all other cases, the connection remains in ESTABLISHED state.
If the normal-in-window check passes, the timeout will be lowered
to that of CLOSE.

If the peer sends a challenge ack, connection timeout will be reset.

If the challenge ACK triggers another RST (RST was valid after all),
this 2nd RST will match expected sequence and conntrack state changes to
CLOSE.

If no challenge ACK is received, the connection will time out after
CLOSE seconds (10 seconds by default), just like without this patch.

Packetdrill test case:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive a segment.
0.210 < P. 1:1001(1000) ack 1 win 46
0.210 > . 1:1(0) ack 1001

// Application writes 1000 bytes.
0.250 write(4, ..., 1000) = 1000
0.250 > P. 1:1001(1000) ack 1001

// First reset, old sequence. Conntrack (correctly) considers this
// invalid due to failed window validation (regardless of this patch).
0.260 < R  2:2(0) ack 1001 win 260

// 2nd reset, but too far ahead sequence.  Same: correctly handled
// as invalid.
0.270 < R 99990001:99990001(0) ack 1001 win 260

// in-window, but not exact sequence.
// Current Linux kernels might reply with a challenge ack, and do not
// remove connection.
// Without this patch, conntrack state moves to CLOSE.
// With patch, timeout is lowered like CLOSE, but connection stays
// in ESTABLISHED state.
0.280 < R 1010:1010(0) ack 1001 win 260

// Expect challenge ACK
0.281 > . 1001:1001(0) ack 1001 win 501

// With or without this patch, RST will cause connection
// to move to CLOSE (sequence number matches)
// 0.282 < R 1001:1001(0) ack 1001 win 260

// ACK
0.300 < . 1001:1001(0) ack 1001 win 257

// more data could be exchanged here, connection
// is still established

// Client closes the connection.
0.610 < F. 1001:1001(0) ack 1001 win 260
0.650 > . 1001:1001(0) ack 1002

// Close the connection without reading outstanding data
0.700 close(4) = 0

// so one more reset.  Will be deemed acceptable with patch as well:
// connection is already closing.
0.701 > R. 1001:1001(0) ack 1002 win 501
// End packetdrill test case.

With patch, this generates following conntrack events:
   [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]

Without patch, first RST moves connection to close, whereas socket state
does not change until FIN is received.
   [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:19:31 +01:00
Andrea Claudi f25a9b8515 ipvs: change some data types from int to bool
Change the data type of the following variables from int to bool
across ipvs code:

  - found
  - loop
  - need_full_dest
  - need_full_svc
  - payload_csum

Also change the following functions to use bool full_entry param
instead of int:

  - ip_vs_genl_parse_dest()
  - ip_vs_genl_parse_service()

This patch does not change any functionality but makes the source
code slightly easier to read.

Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:19:04 +01:00
Pablo Neira Ayuso 123f89c8aa netfilter: nft_set_hash: remove nft_hash_key()
hashtable is never used for 2-byte keys, remove nft_hash_key().

Fixes: e240cd0df4 ("netfilter: nf_tables: place all set backends in one single module")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:32 +01:00
Pablo Neira Ayuso a01cbae57e netfilter: nft_set_hash: bogus element self comparison from deactivation path
Use the element from the loop iteration, not the same element we want to
deactivate otherwise this branch always evaluates true.

Fixes: 6c03ae210c ("netfilter: nft_set_hash: add non-resizable hashtable implementation")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:31 +01:00
Pablo Neira Ayuso 3b02b0adc2 netfilter: nft_set_hash: fix lookups with fixed size hash on big endian
Call jhash_1word() for the 4-bytes key case from the insertion and
deactivation path, otherwise big endian arch set lookups fail.

Fixes: 446a8268b7 ("netfilter: nft_set_hash: add lookup variant for fixed size hashtable")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:31 +01:00
Li RongQing 35acfbab6e netfilter: remove unneeded switch fall-through
Empty case is fine and does not switch fall-through

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:03:59 +01:00
Florian Westphal cc16921351 netfilter: conntrack: avoid same-timeout update
No need to dirty a cache line if timeout is unchanged.
Also, WARN() is useless here: we crash on 'skb->len' access
if skb is NULL.

Last, ct->timeout is u32, not 'unsigned long' so adapt the
function prototype accordingly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:58:21 +01:00
Florian Westphal d2c5c103b1 netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.h
The l3proto name is gone, its header file is the last trace.
While at it, also remove nf_nat_core.h, its very small and all users
include nf_nat.h too.

before:
   text    data     bss     dec     hex filename
  22948    1612    4136   28696    7018 nf_nat.ko

after removal of l3proto register/unregister functions:
   text	   data	    bss	    dec	    hex	filename
  22196	   1516	   4136	  27848	   6cc8 nf_nat.ko

checkpatch complains about overly long lines, but line breaks
do not make things more readable and the line length gets smaller
here, not larger.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:54:08 +01:00
Florian Westphal d6c4c8ffb5 netfilter: nat: remove l3proto struct
All l3proto function pointers have been removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:57 +01:00
Florian Westphal dac3fe7259 netfilter: nat: remove csum_recalc hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:47 +01:00
Florian Westphal 03fe5efc4c netfilter: nat: remove csum_update hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:35 +01:00
Florian Westphal 2e666b229d netfilter: nat: remove l3 manip_pkt hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:05 +01:00
Florian Westphal 14cb1a6e29 netfilter: nat: remove nf_nat_l4proto.h
after ipv4/6 nat tracker merge, there are no external callers, so
make last function static and remove the header.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:52:47 +01:00
Florian Westphal 3bf195ae60 netfilter: nat: merge nf_nat_ipv4,6 into nat core
before:
   text    data     bss     dec     hex filename
  16566    1576    4136   22278    5706 nf_nat.ko
   3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko
   3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko

after:
   text    data     bss     dec     hex filename
  22948    1612    4136   28696    7018 nf_nat.ko

... with ipv4/v6 nat now provided directly via nf_nat.ko.

Also changes:
       ret = nf_nat_ipv4_fn(priv, skb, state);
       if (ret != NF_DROP && ret != NF_STOLEN &&
into
	if (ret != NF_ACCEPT)
		return ret;

everywhere.

The nat hooks never should return anything other than
ACCEPT or DROP (and the latter only in rare error cases).

The original code uses multi-line ANDing including assignment-in-if:
        if (ret != NF_DROP && ret != NF_STOLEN &&
           !(IPCB(skb)->flags & IPSKB_XFRM_TRANSFORMED) &&
            (ct = nf_ct_get(skb, &ctinfo)) != NULL) {

I removed this while moving, breaking those in separate conditionals
and moving the assignments into extra lines.

checkpatch still generates some warnings:
 1. Overly long lines (of moved code).
    Breaking them is even more ugly. so I kept this as-is.
 2. use of extern function declarations in a .c file.
    This is necessary evil, we must call
    nf_nat_l3proto_register() from the nat core now.
    All l3proto related functions are removed later in this series,
    those prototypes are then removed as well.

v2: keep empty nf_nat_ipv6_csum_update stub for CONFIG_IPV6=n case.
v3: remove IS_ENABLED(NF_NAT_IPV4/6) tests, NF_NAT_IPVx toggles
    are removed here.
v4: also get rid of the assignments in conditionals.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:55 +01:00
Florian Westphal 096d09067a netfilter: nat: move nlattr parse and xfrm session decode to core
None of these functions calls any external functions, moving them allows
to avoid both the indirection and a need to export these symbols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:42 +01:00
Florian Westphal d1aca8ab31 netfilter: nat: merge ipv4 and ipv6 masquerade functionality
Before:
   text	   data	    bss	    dec	    hex	filename
  13916	   1412	   4128	  19456	   4c00	nf_nat.ko
   4510	    968	      4	   5482	   156a	nf_nat_ipv4.ko
   5146	    944	      8	   6098	   17d2	nf_nat_ipv6.ko

After:
   text	   data	    bss	    dec	    hex	filename
  16566	   1576	   4136	  22278	   5706	nf_nat.ko
   3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko
   3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko

... so no drastic changes in combined size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:24 +01:00
Florian Tham 4283428e49 netfilter: nf_conntrack_amanda: add support for STATE streams
The Amanda CONNECT command has been updated to establish an optional
fourth connection [0]. Previously, a CONNECT command would look like:

    CONNECT DATA port0 MESG port1 INDEX port2

nf_conntrack_amanda analyses the CONNECT command string in order to
learn the port numbers of the related DATA, MESG and INDEX streams. As
of amanda v3.4, the CONNECT command can advertise an additional port:

    CONNECT DATA port0 MESG port1 INDEX port2 STATE port3

The new STATE stream is not handled, thus the connection on the STATE
port cannot be established.

The patch adds support for STATE streams to the amanda conntrack helper.

I tested with max_expected = 3, leaving the other patch hunks
unmodified. Amanda reports "connection refused" and aborts. After I set
max_expected to 4, the backup completes successfully.

[0] 3b8384fc9f (diff-711e502fc81a65182c0954765b42919eR456)

Signed-off-by: Florian Tham <tham@fidion.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:46:39 +01:00
Pablo Neira Ayuso b8e2040063 netfilter: nft_compat: use .release_ops and remove list of extension
Add .release_ops, that is called in case of error at a later stage in
the expression initialization path, ie. .select_ops() has been already
set up operations and that needs to be undone. This allows us to unwind
.select_ops from the error path, ie. release the dynamic operations for
this extension.

Moreover, allocate one single operation instead of recycling them, this
comes at the cost of consuming a bit more memory per rule, but it
simplifies the infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:41:24 +01:00
David S. Miller 375ca548f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easily resolvable overlapping change conflicts, one in
TCP and one in the eBPF verifier.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20 00:34:07 -08:00
David S. Miller 8bbed40f10 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for you net-next
tree:

1) Missing NFTA_RULE_POSITION_ID netlink attribute validation,
   from Phil Sutter.

2) Restrict matching on tunnel metadata to rx/tx path, from wenxu.

3) Avoid indirect calls for IPV6=y, from Florian Westphal.

4) Add two indirections to prepare merger of IPV4 and IPV6 nat
   modules, from Florian Westphal.

5) Broken indentation in ctnetlink, from Colin Ian King.

6) Patches to use struct_size() from netfilter and IPVS,
   from Gustavo A. R. Silva.

7) Display kernel splat only once in case of racing to confirm
   conntrack from bridge plus nfqueue setups, from Chieh-Min Wang.

8) Skip checksum validation for layer 4 protocols that don't need it,
   patch from Alin Nastac.

9) Sparse warning due to symbol that should be static in CLUSTERIP,
   from Wei Yongjun.

10) Add new toggle to disable SDP payload translation when media
    endpoint is reachable though the same interface as the signalling
    peer, from Alin Nastac.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-18 11:38:30 -08:00
Alin Nastac a3419ce335 netfilter: nf_conntrack_sip: add sip_external_media logic
When enabled, the sip_external_media logic will leave SDP
payload untouched when it detects that interface towards INVITEd
party is the same with the one towards media endpoint.

The typical scenario for this logic is when a LAN SIP agent has more
than one IP address (uses a different address for media streams than
the one used on signalling stream) and it also forwards calls to a
voice mailbox located on the WAN side. In such case sip_direct_media
must be disabled (so normal calls could be handled by the SIP
helper), but media streams that are not traversing this router must
also be excluded from address translation (e.g. call forwards).

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16 10:49:12 +01:00
Andrea Claudi c93a49b976 ipvs: fix warning on unused variable
When CONFIG_IP_VS_IPV6 is not defined, build produced this warning:

net/netfilter/ipvs/ip_vs_ctl.c:899:6: warning: unused variable ‘ret’ [-Wunused-variable]
  int ret = 0;
      ^~~

Fix this by moving the declaration of 'ret' in the CONFIG_IP_VS_IPV6
section in the same function.

While at it, drop its unneeded initialisation.

Fixes: 098e13f5b2 ("ipvs: fix dependency on nf_defrag_ipv6")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16 10:41:42 +01:00
David S. Miller 3313da8188 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The netfilter conflicts were rather simple overlapping
changes.

However, the cls_tcindex.c stuff was a bit more complex.

On the 'net' side, Cong is fixing several races and memory
leaks.  Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.

What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU.  I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15 12:38:38 -08:00
Pablo Neira Ayuso 23b7ca4f74 netfilter: nf_tables: fix flush after rule deletion in the same batch
Flush after rule deletion bogusly hits -ENOENT. Skip rules that have
been already from nft_delrule_by_chain() which is always called from the
flush path.

Fixes: cf9dc09d09 ("netfilter: nf_tables: fix missing rules flushing per table")
Reported-by: Phil Sutter <phil@nwl.cc>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-15 15:32:32 +01:00
Pablo Neira Ayuso 753c111f65 netfilter: nft_compat: use-after-free when deleting targets
Fetch pointer to module before target object is released.

Fixes: 29e3880109 ("netfilter: nf_tables: fix use-after-free when deleting compat expressions")
Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-13 18:14:54 +01:00
Andrea Claudi 098e13f5b2 ipvs: fix dependency on nf_defrag_ipv6
ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
but lacks proper Kconfig dependencies and does not explicitly
request defrag features.

As a result, if netfilter hooks are not loaded, when IPv6 fragmented
packet are handled by ipvs only the first fragment makes through.

Fix it properly declaring the dependency on Kconfig and registering
netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().

Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:24:01 +01:00
Chieh-Min Wang 13f5251fd1 netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
For bridge(br_flood) or broadcast/multicast packets, they could clone
skb with unconfirmed conntrack which break the rule that unconfirmed
skb->_nfct is never shared.  With nfqueue running on my system, the race
can be easily reproduced with following warning calltrace:

[13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
[13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
[13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
[13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
[13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
[13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
[13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
[13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
[13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
[13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
[13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
[13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
[13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
[13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
[13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
[13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
[13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
[13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
[13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
[13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
[13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)

The original code just triggered the warning but do nothing. It will
caused the shared conntrack moves to the dying list and the packet be
droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

- Reproduce steps:

+----------------------------+
|          br0(bridge)       |
|                            |
+-+---------+---------+------+
  | eth0|   | eth1|   | eth2|
  |     |   |     |   |     |
  +--+--+   +--+--+   +---+-+
     |         |          |
     |         |          |
  +--+-+     +-+--+    +--+-+
  | PC1|     | PC2|    | PC3|
  +----+     +----+    +----+

iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

ps: Our nfq userspace program will set mark on packets whose connection
has already been processed.

PC1 sends broadcast packets simulated by hping3:

hping3 --rand-source --udp 192.168.1.255 -i u100

- Broadcast racing flow chart is as follow:

br_handle_frame
  BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
  // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
  br_handle_frame_finish
    // check if this packet is broadcast
    br_flood_forward
      br_flood
        list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
          maybe_deliver
            deliver_clone
              skb = skb_clone(skb)
              __br_forward
                BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                // queue in our nfq and received by our userspace program
                // goto __nf_conntrack_confirm with process context on CPU 1
    br_pass_frame_up
      BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
      // goto __nf_conntrack_confirm with softirq context on CPU 0

Because conntrack confirm can happen at both INPUT and POSTROUTING
stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
conntrack could race on different core.

This patch fixes a repeating kernel splat, now it is only displayed
once.

Signed-off-by: Chieh-Min Wang <chiehminw@synology.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:14:51 +01:00
Gustavo A. R. Silva 6ca64ef37d netfilter: xt_recent: Use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

size = sizeof(struct foo) + count * sizeof(void *);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);
instance = alloc(size, GFP_KERNEL)

Notice that, in this case, variable sz is not necessary, hence it is
removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Gustavo A. R. Silva 6fde9df6b7 ipvs: Use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Colin Ian King 48ab807c79 netfilter: conntrack: fix indentation issue
A statement in an if block is not indented correctly. Fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Francesco Ruggeri 8d29d16d21 netfilter: compat: initialize all fields in xt_init
If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
time, the following panic can be caused by running

% ebtables -t broute -F BROUTING

from a 32-bit user level on a 64-bit kernel. This patch replaces
kmalloc_array with kcalloc when allocating xt.

[  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
[  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
[  474.693838] Oops: 0000 [#1] SMP
[  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
[  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
[  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
[  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
[  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
[  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
[  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
[  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
[  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
[  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
[  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
[  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
[  474.802052] Call Trace:
[  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
[  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
[  474.815605]  ? try_module_get+0x37/0x42
[  474.819716]  compat_nf_setsockopt+0x4f/0x6d
[  474.824172]  compat_ip_setsockopt+0x7e/0x8c
[  474.828641]  compat_raw_setsockopt+0x16/0x3a
[  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
[  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
[  474.843343]  ? __check_object_size+0x76/0x19a
[  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
[  474.853276]  do_fast_syscall_32+0xaf/0xf6
[  474.857548]  entry_SYSENTER_compat+0x6b/0x7a

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:35:18 +01:00
David S. Miller a655fe9f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:00:17 -08:00
Florian Westphal 947e492c0f netfilter: nft_compat: don't use refcount_inc on newly allocated entry
When I moved the refcount to refcount_t type I missed the fact that
refcount_inc() will result in use-after-free warning with
CONFIG_REFCOUNT_FULL=y builds.

The correct fix would be to init the reference count to 1 at allocation
time, but, unfortunately we cannot do this, as we can't undo that
in case something else fails later in the batch.

So only solution I see is to special-case the 'new entry' condition
and replace refcount_inc() with a "delayed" refcount_set(1) in this case,
as done here.

The .activate callback can be removed to simplify things, we only
need to make sure that deactivate() decrements/unlinks the entry
from the list at end of transaction phase (commit or abort).

Fixes: 12c44aba66 ("netfilter: nft_compat: use refcnt_t type for nft_xt reference count")
Reported-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05 14:10:33 +01:00
Florian Westphal ac02bcf9cc netfilter: ipv6: avoid indirect calls for IPV6=y case
indirect calls are only needed if ipv6 is a module.
Add helpers to abstract the v6ops indirections and use them instead.

fragment, reroute and route_input are kept as indirect calls.
The first two are not not used in hot path and route_input is only
used by bridge netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 18:21:12 +01:00
Pablo Neira Ayuso f6ac858589 netfilter: nf_tables: unbind set in rule from commit path
Anonymous sets that are bound to rules from the same transaction trigger
a kernel splat from the abort path due to double set list removal and
double free.

This patch updates the logic to search for the transaction that is
responsible for creating the set and disable the set list removal and
release, given the rule is now responsible for this. Lookup is reverse
since the transaction that adds the set is likely to be at the tail of
the list.

Moreover, this patch adds the unbind step to deliver the event from the
commit path.  This should not be done from the worker thread, since we
have no guarantees of in-order delivery to the listener.

This patch removes the assumption that both activate and deactivate
callbacks need to be provided.

Fixes: cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 17:29:17 +01:00
wenxu a46c52d9f2 netfilter: nft_tunnel: Add NFTA_TUNNEL_MODE options
nft "tunnel" expr match both the tun_info of RX and TX. This patch
provide the NFTA_TUNNEL_MODE to individually match the tun_info of
RX or TX.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:39:58 +01:00
Martynas Pumputis 4e35c1cb94 netfilter: nf_nat: skip nat clash resolution for same-origin entries
It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.

To illustrate this, consider the following simplified scenario:

1. Packet A and B are sent at the same time from two different threads
   by same UDP socket.  No matching conntrack entry exists yet.
   Both packets cause allocation of a new conntrack entry.
2. get_unique_tuple gets called for A.  No clashing entry found.
   conntrack entry for A is added to main conntrack table.
3. get_unique_tuple is called for B and will find that the reply
   tuple of B is already taken by A.
   It will allocate a new UDP source port for B to resolve the clash.
4. conntrack entry for B cannot be added to main conntrack table
   because its ORIGINAL direction is clashing with A and the REPLY
   directions of A and B are not the same anymore due to UDP source
   port reallocation done in step 3.

This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.

[ Florian: simplify patch to not use .allow_clash setting
  and always ignore identical flows ]

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:26:11 +01:00
David S. Miller eaf2a47f40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-29 21:18:54 -08:00
Florian Westphal 0604628bb0 netfilter: nf_tables: add NFTA_RULE_POSITION_ID to nla_policy
Fixes: 75dd48e2e4 ("netfilter: nf_tables: Support RULE_ID reference in new rule")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-29 16:59:06 +01:00
David S. Miller 343917b410 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree:

1) Introduce a hashtable to speed up object lookups, from Florian Westphal.

2) Make direct calls to built-in extension, also from Florian.

3) Call helper before confirming the conntrack as it used to be originally,
   from Florian.

4) Call request_module() to autoload br_netfilter when physdev is used
   to relax the dependency, also from Florian.

5) Allow to insert rules at a given position ID that is internal to the
   batch, from Phil Sutter.

6) Several patches to replace conntrack indirections by direct calls,
   and to reduce modularization, from Florian. This also includes
   several follow up patches to deal with minor fallout from this
   rework.

7) Use RCU from conntrack gre helper, from Florian.

8) GRE conntrack module becomes built-in into nf_conntrack, from Florian.

9) Replace nf_ct_invert_tuplepr() by calls to nf_ct_invert_tuple(),
   from Florian.

10) Unify sysctl handling at the core of nf_conntrack, from Florian.

11) Provide modparam to register conntrack hooks.

12) Allow to match on the interface kind string, from wenxu.

13) Remove several exported symbols, not required anymore now after
    a bit of de-modulatization work has been done, from Florian.

14) Remove built-in map support in the hash extension, this can be
    done with the existing userspace infrastructure, from laura.

15) Remove indirection to calculate checksums in IPVS, from Matteo Croce.

16) Use call wrappers for indirection in IPVS, also from Matteo.

17) Remove superfluous __percpu parameter in nft_counter, patch from
    Luc Van Oostenryck.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 17:34:38 -08:00
Florian Westphal 83f529281d netfilter: ipv4: remove useless export_symbol
Only one caller; place it where needed and get rid of the EXPORT_SYMBOL.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:58 +01:00
Cong Wang ac088a88b5 netfilter: conntrack: fix error path in nf_conntrack_pernet_init()
When nf_ct_netns_get() fails, it should clean up itself,
its caller doesn't need to call nf_conntrack_fini_net().

nf_conntrack_init_net() is called after registering sysctl
and proc, so its cleanup function should be called before
unregistering sysctl and proc.

Fixes: ba3fbe6636 ("netfilter: nf_conntrack: provide modparam to always register conntrack hooks")
Fixes: b884fa4617 ("netfilter: conntrack: unify sysctl handling")
Reported-and-tested-by: syzbot+fcee88b2d87f0539dfe9@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:55 +01:00
Luc Van Oostenryck dd03b1ad26 netfilter: nft_counter: remove wrong __percpu of nft_counter_resest()'s arg
nft_counter_rest() has its first argument declared as
	struct nft_counter_percpu_priv __percpu *priv
but this structure is not percpu (it only countains
a member 'counter' which is, correctly, a pointer to a
percpu struct nft_counter).

So, remove the '__percpu' from the argument's declaration.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:43 +01:00
Matteo Croce 6ecd754883 ipvs: use indirect call wrappers
Use the new indirect call wrappers in IPVS when calling the TCP or UDP
protocol specific functions.
This avoids an indirect calls in IPVS, and reduces the performance
impact of the Spectre mitigation.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:15:58 +01:00
Matteo Croce fe19a8fea7 ipvs: avoid indirect calls when calculating checksums
The function pointer ip_vs_protocol->csum_check is only used in protocol
specific code, and never in the generic one.
Remove the function pointer from struct ip_vs_protocol and call the
checksum functions directly.
This reduces the performance impact of the Spectre mitigation, and
should give a small improvement even with RETPOLINES disabled.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:15:58 +01:00
Fernando Fernandez Mancera 1a6a0951fc netfilter: nfnetlink_osf: add missing fmatch check
When we check the tcp options of a packet and it doesn't match the current
fingerprint, the tcp packet option pointer must be restored to its initial
value in order to do the proper tcp options check for the next fingerprint.

Here we can see an example.
Assumming the following fingerprint base with two lines:

S10:64:1:60:M*,S,T,N,W6:      Linux:3.0::Linux 3.0
S20:64:1:60:M*,S,T,N,W7:      Linux:4.19:arch:Linux 4.1

Where TCP options are the last field in the OS signature, all of them overlap
except by the last one, ie. 'W6' versus 'W7'.

In case a packet for Linux 4.19 kicks in, the osf finds no matching because the
TCP options pointer is updated after checking for the TCP options in the first
line.

Therefore, reset pointer back to where it should be.

Fixes: 11eeef41d5 ("netfilter: passive OS fingerprint xtables match")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:09:11 +01:00
ZhangXiaoxu 53ab60baa1 ipvs: Fix signed integer overflow when setsockopt timeout
There is a UBSAN bug report as below:
UBSAN: Undefined behaviour in net/netfilter/ipvs/ip_vs_ctl.c:2227:21
signed integer overflow:
-2147483647 * 1000 cannot be represented in type 'int'

Reproduce program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/socket.h>

	#define IPPROTO_IP 0
	#define IPPROTO_RAW 255

	#define IP_VS_BASE_CTL		(64+1024+64)
	#define IP_VS_SO_SET_TIMEOUT	(IP_VS_BASE_CTL+10)

	/* The argument to IP_VS_SO_GET_TIMEOUT */
	struct ipvs_timeout_t {
		int tcp_timeout;
		int tcp_fin_timeout;
		int udp_timeout;
	};

	int main() {
		int ret = -1;
		int sockfd = -1;
		struct ipvs_timeout_t to;

		sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
		if (sockfd == -1) {
			printf("socket init error\n");
			return -1;
		}

		to.tcp_timeout = -2147483647;
		to.tcp_fin_timeout = -2147483647;
		to.udp_timeout = -2147483647;

		ret = setsockopt(sockfd,
				 IPPROTO_IP,
				 IP_VS_SO_SET_TIMEOUT,
				 (char *)(&to),
				 sizeof(to));

		printf("setsockopt return %d\n", ret);
		return ret;
	}

Return -EINVAL if the timeout value is negative or max than 'INT_MAX / HZ'.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-24 13:38:54 +01:00
Florian Westphal e2f7cc72cb netfilter: conntrack: fix bogus port values for other l4 protocols
We must only extract l4 proto information if we can track the layer 4
protocol.

Before removal of pkt_to_tuple callback, the code to extract port
information was only reached for TCP/UDP/LITE/DCCP/SCTP.

The other protocols were handled by the indirect call, and the
'generic' tracker took care of other protocols that have no notion
of 'ports'.

After removal of the callback we must be more strict here and only
init port numbers for those protocols that have ports.

Fixes: df5e162908 ("netfilter: conntrack: remove pkt_to_tuple callback")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22 23:28:40 +01:00
Florian Westphal 81e01647fd netfilter: conntrack: fix IPV6=n builds
Stephen Rothwell reports:
 After merging the netfilter-next tree, today's linux-next build
 (powerpc ppc64_defconfig) failed like this:

 ERROR: "nf_conntrack_invert_icmpv6_tuple" [nf_conntrack.ko] undefined!
 ERROR: "nf_conntrack_icmpv6_packet" [nf_conntrack.ko] undefined!
 ERROR: "nf_conntrack_icmpv6_init_net" [nf_conntrack.ko] undefined!
 ERROR: "icmpv6_pkt_to_tuple" [nf_conntrack.ko] undefined!
 ERROR: "nf_ct_gre_keymap_destroy" [nf_conntrack.ko] undefined!

icmpv6 related errors are due to lack of IS_ENABLED(CONFIG_IPV6) (no
icmpv6 support is builtin if kernel has CONFIG_IPV6=n), the
nf_ct_gre_keymap_destroy error is due to lack of PROTO_GRE check.

Fixes: a47c540481 ("netfilter: conntrack: handle builtin l4proto packet functions via direct calls")
Fixes: e2e48b4716 ("netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls")
Fixes: 197c4300ae ("netfilter: conntrack: remove invert_tuple callback")
Fixes: 2a389de86e ("netfilter: conntrack: remove l4proto init and get_net callbacks")
Fixes: e56894356f ("netfilter: conntrack: remove l4proto destroy hook")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22 23:28:24 +01:00
Laura Garcia Liebana 0123a75e1d Revert "netfilter: nft_hash: add map lookups for hashing operations"
A better way to implement this from userspace has been found without
specific code in the kernel side, revert this.

Fixes: b9ccc07e3f ("netfilter: nft_hash: add map lookups for hashing operations")
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:59:47 +01:00
Florian Westphal 472caa6918 netfilter: nat: un-export nf_nat_used_tuple
Not used since 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:59:45 +01:00
wenxu 0fb4d21956 netfilter: nft_meta: Add NFT_META_I/OIFKIND meta type
In the ip_rcv the skb goes through the PREROUTING hook first, then kicks
in vrf device and go through the same hook again. When conntrack dnat
works with vrf, there will be some conflict with rules because the
packet goes through the hook twice with different nf status.

ip link add user1 type vrf table 1
ip link add user2 type vrf table 2
ip l set dev tun1 master user1
ip l set dev tun2 master user2

nft add table firewall
nft add chain firewall zones { type filter hook prerouting  priority - 300 \; }
nft add rule firewall zones counter ct zone set iif map { "tun1" : 1, "tun2" : 2 }
nft add chain firewall rule-1000-ingress
nft add rule firewall rule-1000-ingress ct zone 1 tcp dport 22 ct state new counter accept
nft add rule firewall rule-1000-ingress counter drop
nft add chain firewall rule-1000-egress
nft add rule firewall rule-1000-egress tcp dport 22 ct state new counter drop
nft add rule firewall rule-1000-egress counter accept

nft add chain firewall rules-all { type filter hook prerouting priority - 150 \; }
nft add rule firewall rules-all ip daddr vmap { "2.2.2.11" : jump rule-1000-ingress }
nft add rule firewall rules-all ct zone vmap { 1 : jump rule-1000-egress }

nft add rule firewall dnat-all ct zone vmap { 1 : jump dnat-1000 }
nft add rule firewall dnat-1000 ip daddr 2.2.2.11 counter dnat to 10.0.0.7

For a package with ip daddr 2.2.2.11 and tcp dport 22, first time accept in the
rule-1000-ingress and dnat to 10.0.0.7. Then second time the packet goto the wrong
chain rule-1000-egress which leads the packet drop

With this patch, userspace can add the 'don't re-do entire ruleset for
vrf' policy itself via:

nft add rule firewall rules-all meta iifkind "vrf" counter accept

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:58:20 +01:00
Pablo Neira Ayuso ba3fbe6636 netfilter: nf_conntrack: provide modparam to always register conntrack hooks
The connection tracking hooks can be optionally registered per netns
when conntrack is specifically invoked from the ruleset since
0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed
by ruleset"). Then, since 4d3a57f23d ("netfilter: conntrack: do not
enable connection tracking unless needed"), the default behaviour is
changed to always register them on demand.

This patch provides a toggle that allows users to always register them.
Without this toggle, in order to use conntrack for statistics
collection, you need a dummy rule that refers to conntrack, eg.

        iptables -I INPUT -m state --state NEW

This patch allows users to restore the original behaviour via modparam,
ie. always register connection tracking, eg.

        modprobe nf_conntrack enable_hooks=1

Hence, no dummy rule is required.

Reported-by: Laura Garcia <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 4a60dc748d netfilter: conntrack: remove nf_ct_l4proto_find_get
Its now same as __nf_ct_l4proto_find(), so rename that to
nf_ct_l4proto_find and use it everywhere.

It never returns NULL and doesn't need locks or reference counts.

Before this series:
302824  net/netfilter/nf_conntrack.ko
 21504  net/netfilter/nf_conntrack_proto_gre.ko

  text	   data	    bss	    dec	    hex	filename
  6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko

After:
294864  net/netfilter/nf_conntrack.ko
  text	   data	    bss	    dec	    hex	filename
106979	  19557	    240	 126776	  1ef38	nf_conntrack.ko

so, even with builtin gre, total size got reduced.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal e56894356f netfilter: conntrack: remove l4proto destroy hook
Only one user (gre), add a direct call and remove this facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 2a389de86e netfilter: conntrack: remove l4proto init and get_net callbacks
Those were needed we still had modular trackers.
As we don't have those anymore, prefer direct calls and remove all
the (un)register infrastructure associated with this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 70aed4647c netfilter: conntrack: remove sysctl registration helpers
After previous patch these are not used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal b884fa4617 netfilter: conntrack: unify sysctl handling
Due to historical reasons, all l4 trackers register their own
sysctls.

This leads to copy&pasted boilerplate code, that does exactly same
thing, just with different data structure.

Place all of this in a single file.

This allows to remove the various ctl_table pointers from the ct_netns
structure and reduces overall code size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 303e0c5589 netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups
after removal of the packet and invert function pointers, several
places do not need to lookup the l4proto structure anymore.

Remove those lookups.
The function nf_ct_invert_tuplepr becomes redundant, replace
it with nf_ct_invert_tuple everywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal edf0338dab netfilter: conntrack: remove pernet l4 proto register interface
No used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 44fb87f635 netfilter: conntrack: remove remaining l4proto indirect packet calls
Now that all l4trackers are builtin, no need to use a mix of direct and
indirect calls.
This removes the last two users: gre and the generic l4 protocol
tracker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal b184356d0a netfilter: conntrack: remove module owner field
No need to get/put module owner reference, none of these can be removed
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 197c4300ae netfilter: conntrack: remove invert_tuple callback
Only used by icmp(v6).  Prefer a direct call and remove this
function from the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal df5e162908 netfilter: conntrack: remove pkt_to_tuple callback
GRE is now builtin, so we can handle it via direct call and
remove the callback.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 751fc301ec netfilter: conntrack: remove net_id
No users anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 22fc4c4c9f netfilter: conntrack: gre: switch module to be built-in
This makes the last of the modular l4 trackers 'bool'.

After this, all infrastructure to handle dynamic l4 protocol registration
becomes obsolete and can be removed in followup patches.

Old:
302824 net/netfilter/nf_conntrack.ko
 21504 net/netfilter/nf_conntrack_proto_gre.ko

New:
313728 net/netfilter/nf_conntrack.ko

Old:
   text	   data	    bss	    dec	    hex	filename
   6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
 108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko
New:
 112095	  21381	    240	 133716	  20a54	nf_conntrack.ko

The size increase is only temporary.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 202e651cd4 netfilter: conntrack: gre: convert rwlock to rcu
We can use gre.  Lock is only needed when a new expectation is added.

In case a single spinlock proves to be problematic we can either add one
per netns or use an array of locks combined with net_hash_mix() or similar
to pick the 'correct' one.

But given this is only needed for an expectation rather than per packet
a single one should be ok.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal e2e48b4716 netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls
rather than handling them via indirect call, use a direct one instead.
This leaves GRE as the last user of this indirect call facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal a47c540481 netfilter: conntrack: handle builtin l4proto packet functions via direct calls
The l4 protocol trackers are invoked via indirect call: l4proto->packet().

With one exception (gre), all l4trackers are builtin, so we can make
.packet optional and use a direct call for most protocols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Phil Sutter 75dd48e2e4 netfilter: nf_tables: Support RULE_ID reference in new rule
To allow for a batch to contain rules in arbitrary ordering, introduce
NFTA_RULE_POSITION_ID attribute which works just like NFTA_RULE_POSITION
but contains the ID of another rule within the same batch. This helps
iptables-nft-restore handling dumps with mixed insert/append commands
correctly.

Note that NFTA_RULE_POSITION takes precedence over
NFTA_RULE_POSITION_ID, so if the former is present, the latter is
ignored.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 8e2f311a68 netfilter: physdev: relax br_netfilter dependency
Following command:
  iptables -D FORWARD -m physdev ...
causes connectivity loss in some setups.

Reason is that iptables userspace will probe kernel for the module revision
of the physdev patch, and physdev has an artificial dependency on
br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
is loaded).

This causes the "phydev" module to be loaded, which in turn enables the
"call-iptables" infrastructure.

bridged packets might then get dropped by the iptables ruleset.

The better fix would be to change the "call-iptables" defaults to 0 and
enforce explicit setting to 1, but that breaks backwards compatibility.

This does the next best thing: add a request_module call to checkentry.
This was a stray '-D ... -m physdev' won't activate br_netfilter
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 827318feb6 netfilter: conntrack: remove helper hook again
place them into the confirm one.

Old:
 hook (300): ipv4/6_help() first call helper, then seqadj.
 hook (INT_MAX): confirm

Now:
 hook (INT_MAX): confirm, first call helper, then seqadj, then confirm

Not having the extra call is noticeable in bechmarks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 10870dd89e netfilter: nf_tables: add direct calls for all builtin expressions
With CONFIG_RETPOLINE its faster to add an if (ptr == &foo_func)
check and and use direct calls for all the built-in expressions.

~15% improvement in pathological cases.

checkpatch doesn't like the X macro due to the embedded return statement,
but the macro has a very limited scope so I don't think its a problem.

I would like to avoid bugs of the form
  If (e->ops->eval == (unsigned long)nft_foo_eval)
	 nft_bar_eval();

and open-coded if ()/else if()/else cascade, thus the macro.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 4d44175aa5 netfilter: nf_tables: handle nft_object lookups via rhltable
Instead of linear search, use rhlist interface to look up the objects.
This fixes rulesets with thousands of named objects (quota, counters and
the like).

We only use a single table for this and consider the address of the
table we're doing the lookup in as a part of the key.

This reduces restore time of a sample ruleset with ~20k named counters
from 37 seconds to 0.8 seconds.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal d152159b89 netfilter: nf_tables: prepare nft_object for lookups via hashtable
Add a 'key' structure for object, so we can look them up by name + table
combination (the name can be the same in each table).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:32 +01:00
Florian Westphal b2e3d68d12 netfilter: nft_compat: destroy function must not have side effects
The nft_compat destroy function deletes the nft_xt object from a list.
This isn't allowed anymore. Destroy functions are called asynchronously,
i.e. next batch can find the object that has a pending ->destroy()
invocation:

cpu0                       cpu1
 worker
   ->destroy               for_each_entry()
	                     if (x == ...
			        return x->ops;
     list_del(x)
     kfree_rcu(x)
                           expr->ops->... // ops was free'd

To resolve this, the list_del needs to occur before the transaction
mutex gets released.  nf_tables has a 'deactivate' hook for this
purpose, so use that to unlink the object from the list.

Fixes: 0935d55884 ("netfilter: nf_tables: asynchronous release")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 02:29:42 +01:00
Florian Westphal cf52572ebb netfilter: nft_compat: make lists per netns
There are two problems with nft_compat since the netlink config
plane uses a per-netns mutex:

1. Concurrent add/del accesses to the same list
2. accesses to a list element after it has been free'd already.

This patch fixes the first problem.

Freeing occurs from a work queue, after transaction mutexes have been
released, i.e., it still possible for a new transaction (even from
same net ns) to find the to-be-deleted expression in the list.

The ->destroy functions are not allowed to have any such side effects,
i.e. the list_del() in the destroy function is not allowed.

This part of the problem is solved in the next patch.
I tried to make this work by serializing list access via mutex
and by moving list_del() to a deactivate callback, but
Taehee spotted following race on this approach:

  NET #0                          NET #1
   >select_ops()
   ->init()
                                   ->select_ops()
   ->deactivate()
   ->destroy()
      nft_xt_put()
       kfree_rcu(xt, rcu_head);
                                   ->init() <-- use-after-free occurred.

Unfortunately, we can't increment reference count in
select_ops(), because we can't undo the refcount increase in
case a different expression fails in the same batch.

(The destroy hook will only be called in case the expression
 was initialized successfully).

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 02:29:42 +01:00
Florian Westphal 12c44aba66 netfilter: nft_compat: use refcnt_t type for nft_xt reference count
Using standard integer type was fine while all operations on it were
guarded by the nftnl subsys mutex.

This isn't true anymore:
1. transactions are guarded only by a pernet mutex, so concurrent
   rule manipulation in different netns is racy
2. the ->destroy hook runs from a work queue after the transaction
   mutex has been released already.

cpu0                           cpu1 (net 1)        cpu2 (net 2)
 kworker
    nft_compat->destroy        nft_compat->init    nft_compat->init
      if (--nft_xt->ref == 0)   nft_xt->ref++        nft_xt->ref++

Switch to refcount_t.  Doing this however only fixes a minor aspect,
nft_compat also performs linked-list operations in an unsafe way.

This is addressed in the next two patches.

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Fixes: 0935d55884 ("netfilter: nf_tables: asynchronous release")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 02:29:41 +01:00
David S. Miller 72f6d4d14c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This is the first batch of Netfilter fixes for your net tree:

1) Fix endless loop in nf_tables rules netlink dump, from Phil Sutter.

2) Reference counter leak in object from the error path, from Taehee Yoo.

3) Selective rule dump requires table and chain.

4) Fix DNAT with nft_flow_offload reverse route lookup, from wenxu.

5) Use GFP_KERNEL_ACCOUNT in vmalloc allocation from ebtables, from
   Shakeel Butt.

6) Set ifindex from route to fix interaction with VRF slave device,
   also from wenxu.

7) Use nfct_help() to check for conntrack helper, IPS_HELPER status
   flag is only set from explicit helpers via -j CT, from Henry Yen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15 13:31:46 -08:00
Henry Yen 2314e87974 netfilter: nft_flow_offload: fix checking method of conntrack helper
This patch uses nfct_help() to detect whether an established connection
needs conntrack helper instead of using test_bit(IPS_HELPER_BIT,
&ct->status).

The reason is that IPS_HELPER_BIT is only set when using explicit CT
target.

However, in the case that a device enables conntrack helper via command
"echo 1 > /proc/sys/net/netfilter/nf_conntrack_helper", the status of
IPS_HELPER_BIT will not present any change, and consequently it loses
the checking ability in the context.

Signed-off-by: Henry Yen <henry.yen@mediatek.com>
Reviewed-by: Ryder Lee <ryder.lee@mediatek.com>
Tested-by: John Crispin <john@phrozen.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-14 12:50:59 +01:00
wenxu 10f4e76587 netfilter: nft_flow_offload: fix interaction with vrf slave device
In the forward chain, the iif is changed from slave device to master vrf
device. Thus, flow offload does not find a match on the lower slave
device.

This patch uses the cached route, ie. dst->dev, to update the iif and
oif fields in the flow entry.

After this patch, the following example works fine:

 # ip addr add dev eth0 1.1.1.1/24
 # ip addr add dev eth1 10.0.0.1/24
 # ip link add user1 type vrf table 1
 # ip l set user1 up
 # ip l set dev eth0 master user1
 # ip l set dev eth1 master user1

 # nft add table firewall
 # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
 # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
 # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
 # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-11 00:55:37 +01:00
wenxu a799aea098 netfilter: nft_flow_offload: Fix reverse route lookup
Using the following example:

	client 1.1.1.7 ---> 2.2.2.7 which dnat to 10.0.0.7 server

The first reply packet (ie. syn+ack) uses an incorrect destination
address for the reverse route lookup since it uses:

	daddr = ct->tuplehash[!dir].tuple.dst.u3.ip;

which is 2.2.2.7 in the scenario that is described above, while this
should be:

	daddr = ct->tuplehash[dir].tuple.src.u3.ip;

that is 10.0.0.7.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-09 23:25:02 +01:00
Pablo Neira Ayuso 715849ab31 netfilter: nf_tables: selective rule dump needs table to be specified
Table needs to be specified for selective rule dumps per chain.

Fixes: 241faeceb8 ("netfilter: nf_tables: Speed up selective rule dumps")
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-08 23:31:18 +01:00
Taehee Yoo b91d903688 netfilter: nf_tables: fix leaking object reference count
There is no code that decreases the reference count of stateful objects
in error path of the nft_add_set_elem(). this causes a leak of reference
count of stateful objects.

Test commands:
   $nft add table ip filter
   $nft add counter ip filter c1
   $nft add map ip filter m1 { type ipv4_addr : counter \;}
   $nft add element ip filter m1 { 1 : c1 }
   $nft add element ip filter m1 { 1 : c1 }
   $nft delete element ip filter m1 { 1 }
   $nft delete counter ip filter c1

Result:
   Error: Could not process rule: Device or resource busy
   delete counter ip filter c1
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the second 'nft add element ip filter m1 { 1 : c1 }', the reference
count of the 'c1' is increased then it tries to insert into the 'm1'. but
the 'm1' already has same element so it returns -EEXIST.
But it doesn't decrease the reference count of the 'c1' in the error path.
Due to a leak of the reference count of the 'c1', the 'c1' can't be
removed by 'nft delete counter ip filter c1'.

Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-08 23:31:17 +01:00
Phil Sutter 310529e663 netfilter: nf_tables: Fix for endless loop when dumping ruleset
__nf_tables_dump_rules() stores the current idx value into cb->args[0]
before returning to caller. With multiple chains present, cb->args[0] is
therefore updated after each chain's rules have been traversed. This
though causes the final nf_tables_dump_rules() run (which should return
an skb->len of zero since no rules are left to dump) to continue dumping
rules for each but the first chain. Fix this by moving the cb->args[0]
update to nf_tables_dump_rules().

With no final action to be performed anymore in
__nf_tables_dump_rules(), drop 'out_unfinished' jump label and 'rc'
variable - instead return the appropriate value directly.

Fixes: 241faeceb8 ("netfilter: nf_tables: Speed up selective rule dumps")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-08 23:01:13 +01:00
Masahiro Yamada e9666d10a5 jump_label: move 'asm goto' support test to Kconfig
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:

  #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
  # define HAVE_JUMP_LABEL
  #endif

We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match to the real kernel capability.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
2019-01-06 09:46:51 +09:00
Linus Torvalds 43d86ee8c6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several fixes here. Basically split down the line between newly
  introduced regressions and long existing problems:

   1) Double free in tipc_enable_bearer(), from Cong Wang.

   2) Many fixes to nf_conncount, from Florian Westphal.

   3) op->get_regs_len() can throw an error, check it, from Yunsheng
      Lin.

   4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
      driver, from Scott Wood.

   5) Inifnite loop in fib_empty_table(), from Yue Haibing.

   6) Use after free in ax25_fillin_cb(), from Cong Wang.

   7) Fix socket locking in nr_find_socket(), also from Cong Wang.

   8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.

   9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.

  10) Fix ptr_ring wrap during queue swap, from Cong Wang.

  11) Missing shutdown callback in hinic driver, from Xue Chaojing.

  12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
      Brivio.

  13) BPF out of bounds speculation fixes from Daniel Borkmann"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
  ipv6: Consider sk_bound_dev_if when binding a socket to an address
  ipv6: Fix dump of specific table with strict checking
  bpf: add various test cases to selftests
  bpf: prevent out of bounds speculation on pointer arithmetic
  bpf: fix check_map_access smin_value test when pointer contains offset
  bpf: restrict unknown scalars of mixed signed bounds for unprivileged
  bpf: restrict stack pointer arithmetic for unprivileged
  bpf: restrict map value pointer arithmetic for unprivileged
  bpf: enable access to ax register also from verifier rewrite
  bpf: move tmp variable into ax register in interpreter
  bpf: move {prev_,}insn_idx into verifier env
  isdn: fix kernel-infoleak in capi_unlocked_ioctl
  ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
  net/hamradio/6pack: use mod_timer() to rearm timers
  net-next/hinic:add shutdown callback
  net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
  ip: validate header length on virtual device xmit
  tap: call skb_probe_transport_header after setting skb->dev
  ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
  net: rds: remove unnecessary NULL check
  ...
2019-01-03 12:53:47 -08:00
Florian Westphal a007232066 netfilter: nf_conncount: fix argument order to find_next_bit
Size and 'next bit' were swapped, this bug could cause worker to
reschedule itself even if system was idle.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:22 +01:00
Pablo Neira Ayuso c80f10bc97 netfilter: nf_conncount: speculative garbage collection on empty lists
Instead of removing a empty list node that might be reintroduced soon
thereafter, tentatively place the empty list node on the list passed to
tree_nodes_free(), then re-check if the list is empty again before erasing
it from the tree.

[ Florian: rebase on top of pending nf_conncount fixes ]

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:22 +01:00
Pablo Neira Ayuso 2f971a8f42 netfilter: nf_conncount: move all list iterations under spinlock
Two CPUs may race to remove a connection from the list, the existing
conn->dead will result in a use-after-free. Use the per-list spinlock to
protect list iterations.

As all accesses to the list now happen while holding the per-list lock,
we no longer need to delay free operations with rcu.

Joint work with Florian.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:21 +01:00
Florian Westphal df4a902509 netfilter: nf_conncount: merge lookup and add functions
'lookup' is always followed by 'add'.
Merge both and make the list-walk part of nf_conncount_add().

This also avoids one unneeded unlock/re-lock pair.

Extra care needs to be taken in count_tree, as we only hold rcu
read lock, i.e. we can only insert to an existing tree node after
acquiring its lock and making sure it has a nonzero count.

As a zero count should be rare, just fall back to insert_tree()
(which acquires tree lock).

This issue and its solution were pointed out by Shawn Bohrer
during patch review.

Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:20 +01:00
Florian Westphal e8cfb372b3 netfilter: nf_conncount: restart search when nodes have been erased
Shawn Bohrer reported a following crash:
 |RIP: 0010:rb_erase+0xae/0x360
 [..]
 Call Trace:
  nf_conncount_destroy+0x59/0xc0 [nf_conncount]
  cleanup_match+0x45/0x70 [ip_tables]
  ...

Shawn tracked this down to bogus 'parent' pointer:
Problem is that when we insert a new node, then there is a chance that
the 'parent' that we found was also passed to tree_nodes_free() (because
that node was empty) for erase+free.

Instead of trying to be clever and detect when this happens, restart
the search if we have evicted one or more nodes.  To prevent frequent
restarts, do not perform gc on the second round.

Also, unconditionally schedule the gc worker.
The condition

  gc_count > ARRAY_SIZE(gc_nodes))

cannot be true unless tree grows very large, as the height of the tree
will be low even with hundreds of nodes present.

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reported-by: Shawn Bohrer <sbohrer@cloudflare.com>
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:20 +01:00
Florian Westphal f7fcc98dfc netfilter: nf_conncount: split gc in two phases
The lockless workqueue garbage collector can race with packet path
garbage collector to delete list nodes, as it calls tree_nodes_free()
with the addresses of nodes that might have been free'd already from
another cpu.

To fix this, split gc into two phases.

One phase to perform gc on the connections: From a locking perspective,
this is the same as count_tree(): we hold rcu lock, but we do not
change the tree, we only change the nodes' contents.

The second phase acquires the tree lock and reaps empty nodes.
This avoids a race condition of the garbage collection vs.  packet path:
If a node has been free'd already, the second phase won't find it anymore.

This second phase is, from locking perspective, same as insert_tree().

The former only modifies nodes (list content, count), latter modifies
the tree itself (rb_erase or rb_insert).

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:19 +01:00
Florian Westphal 4cd273bb91 netfilter: nf_conncount: don't skip eviction when age is negative
age is signed integer, so result can be negative when the timestamps
have a large delta.  In this case we want to discard the entry.

Instead of using age >= 2 || age < 0, just make it unsigned.

Fixes: b36e4523d4 ("netfilter: nf_conncount: fix garbage collection confirm race")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:18 +01:00
Shawn Bohrer c78e7818f1 netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS
Most of the time these were the same value anyway, but when
CONFIG_LOCKDEP was enabled we would use a smaller number of locks to
reduce overhead.  Unfortunately having two values is confusing and not
worth the complexity.

This fixes a bug where tree_gc_worker() would only GC up to
CONNCOUNT_LOCK_SLOTS trees which meant when CONFIG_LOCKDEP was enabled
not all trees would be GCed by tree_gc_worker().

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:18 +01:00
Kangjie Lu eb8950861c netfilter: nf_tables: fix a missing check of nla_put_failure
If nla_nest_start() may fail. The fix checks its return value and goes
to nla_put_failure if it fails.

Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:44:51 +01:00
Arun KS ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Arun KS 3d6357de8a mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic.  With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables.  Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior.  There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
David S. Miller c3e5336925 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Support for destination MAC in ipset, from Stefano Brivio.

2) Disallow all-zeroes MAC address in ipset, also from Stefano.

3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
   introduce protocol version number 7, from Jozsef Kadlecsik.
   A follow up patch to fix ip_set_byindex() is also included
   in this batch.

4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.

5) Statify nf_flow_table_iterate(), from Taehee Yoo.

6) Use nf_flow_table_iterate() to simplify garbage collection in
   nf_flow_table logic, also from Taehee Yoo.

7) Don't use _bh variants of call_rcu(), rcu_barrier() and
   synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.

8) Remove NFC_* cache definition from the old caching
   infrastructure.

9) Remove layer 4 port rover in NAT helpers, use random port
   instead, from Florian Westphal.

10) Use strscpy() in ipset, from Qian Cai.

11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
    random port is allocated by default, from Xiaozhou Liu.

12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.

13) Limit port allocation selection routine in NAT to avoid
    softlockup splats when most ports are in use, from Florian.

14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
    from Yafang Shao.

15) Direct call to nf_nat_l4proto_unique_tuple() instead of
    indirection, from Florian Westphal.

16) Several patches to remove all layer 4 NAT indirections,
    remove nf_nat_l4proto struct, from Florian Westphal.

17) Fix RTP/RTCP source port translation when SNAT is in place,
    from Alin Nastac.

18) Selective rule dump per chain, from Phil Sutter.

19) Revisit CLUSTERIP target, this includes a deadlock fix from
    netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
    and disallow mismatching IP address and MAC address.
    Patchset from Taehee Yoo.

20) Update UDP timeout to stream after 2 seconds, from Florian.

21) Shrink UDP established timeout to 120 seconds like TCP timewait.

22) Sysctl knobs to set GRE timeouts, from Yafang Shao.

23) Move seq_print_acct() to conntrack core file, from Florian.

24) Add enum for conntrack sysctl knobs, also from Florian.

25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
    and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
    As a side effect, shrink netns_ct structure by removing obsolete
    sysctl anchors, also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 18:20:26 -08:00
Florian Westphal fc3893fd5c netfilter: conntrack: remove empty pernet fini stubs
after moving sysctl handling into single place, the init functions
can't fail anymore and some of the fini functions are empty.

Remove them and change return type to void.
This also simplifies error unwinding in conntrack module init path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:54 +01:00
Florian Westphal cb2833ed00 netfilter: conntrack: merge ecache and timestamp sysctl tables with main one
Similar to previous change, this time for eache and timestamp.
Unlike helper and acct, these can be disabled at build time, so they
need ifdef guards.

Next patch will remove a few (now obsolete) functions.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:51 +01:00
Florian Westphal d912dec124 netfilter: conntrack: merge acct and helper sysctl table with main one
Needless copy&paste, just handle all in one.  Next patch will handle
acct and timestamp, which have similar functions.

Intentionally leaves cruft behind, will be cleaned up in a followup
patch.

The obsolete sysctl pointers in netns_ct struct are left in place and
removed in a single change, as changes to netns trigger rebuild of
almost all files.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:48 +01:00
Florian Westphal 4a65798a94 netfilter: conntrack: add mnemonics for sysctl table
Its a bit hard to see what table[3] really lines up with, so add
human-readable mnemonics and use them for initialisation.

This makes it easier to see e.g. which sysctls are not exported to
unprivileged userns.

objdiff shows no changes.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:44 +01:00
Florian Westphal 4b216e21cf netfilter: conntrack: un-export seq_print_acct
Only one caller, just place it where its needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:39 +01:00
Yafang Shao a0badcc665 netfilter: conntrack: register sysctl table for gre
This patch adds two sysctl knobs for GRE:

	net.netfilter.nf_conntrack_gre_timeout = 30
	net.netfilter.nf_conntrack_gre_timeout_stream = 180

Update the Documentation as well.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:25 +01:00
Florian Westphal 294304e4c5 netfilter: conntrack: udp: set stream timeout to 2 minutes
We have no explicit signal when a UDP stream has terminated, peers just
stop sending.

For suspected stream connections a timeout of two minutes is sane to keep
NAT mapping alive a while longer.

It matches tcp conntracks 'timewait' default timeout value.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:48:46 +01:00
Florian Westphal d535c8a69c netfilter: conntrack: udp: only extend timeout to stream mode after 2s
Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.

Add a two second grace period where we continue to use the ordinary
30-second default timeout.  Its enough for DNS request/response traffic,
even if two request/reply packets are involved.

ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:48:38 +01:00
David S. Miller 2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
Florian Westphal 2294be0f11 net: use skb_sec_path helper in more places
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type

same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal 7af8f4ca31 net: move secpath_exist helper to sk_buff.h
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal c4b0e771f9 netfilter: avoid using skb->nf_bridge directly
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Phil Sutter 241faeceb8 netfilter: nf_tables: Speed up selective rule dumps
If just a table name was given, nf_tables_dump_rules() continued over
the list of tables even after a match was found. The simple fix is to
exit the loop if it reached the bottom and ctx->table was not NULL.

When iterating over the table's chains, the same problem as above
existed. But worse than that, if a chain name was given the hash table
wasn't used to find the corresponding chain. Fix this by introducing a
helper function iterating over a chain's rules (and taking care of the
cb->args handling), then introduce a shortcut to it if a chain name was
given.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 00:08:53 +01:00
Alin Nastac 8294059931 netfilter: nf_nat_sip: fix RTP/RTCP source port translations
Each media stream negotiation between 2 SIP peers will trigger creation
of 4 different expectations (2 RTP and 2 RTCP):
 - INVITE will create expectations for the media packets sent by the
   called peer
 - reply to the INVITE will create expectations for media packets sent
   by the caller

The dport used by these expectations usually match the ones selected
by the SIP peers, but they might get translated due to conflicts with
another expectation. When such event occur, it is important to do
this translation in both directions, dport translation on the receiving
path and sport translation on the sending path.

This commit fixes the sport translation when the peer requiring it is
also the one that starts the media stream. In this scenario, first media
stream packet is forwarded from LAN to WAN and will rely on
nf_nat_sip_expected() to do the necessary sport translation. However, the
expectation matched by this packet does not contain the necessary information
for doing SNAT, this data being stored in the paired expectation created by
the sender's SIP message (INVITE or reply to it).

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:43:58 +01:00
Florian Westphal 5cbabeec1e netfilter: nat: remove nf_nat_l4proto struct
This removes the (now empty) nf_nat_l4proto struct, all its instances
and all the no longer needed runtime (un)register functionality.

nf_nat_need_gre() can be axed as well: the module that calls it (to
load the no-longer-existing nat_gre module) also calls other nat core
functions. GRE nat is now always available if kernel is built with it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:31 +01:00
Florian Westphal faec18dbb0 netfilter: nat: remove l4proto->manip_pkt
This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.

nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.

Next patch will remove the empty files and the nf_nat_l4proto
struct.

nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:29 +01:00
Florian Westphal 76b90019e0 netfilter: nat: remove l4proto->nlattr_to_range
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so
just call it directly.

The important difference is that we'll now also call it for
protocols that we don't support (i.e., nf_nat_proto_unknown did
not provide .nlattr_to_range).

However, there should be no harm, even icmp provided this callback.
If we don't implement a specific l4nat for this, nothing would make
use of this information, so adding a big switch/case construct listing
all supported l4protocols seems a bit pointless.

This change leaves a single function pointer in the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:23 +01:00
Florian Westphal fe2d002099 netfilter: nat: remove l4proto->in_range
With exception of icmp, all of the l4 nat protocols set this to
nf_nat_l4proto_in_range.

Get rid of this and just check the l4proto in the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:14 +01:00
Florian Westphal 40e786bd29 netfilter: nat: fold in_range indirection into caller
No need for indirections here, we only support ipv4 and ipv6
and the called functions are very small.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:09 +01:00
Florian Westphal 203f2e7820 netfilter: nat: remove l4proto->unique_tuple
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple.
The static-save of old incarnation of resolved key in gre and icmp is
removed as well, just use the prandom based offset like the others.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:04 +01:00
Florian Westphal 716b23c19e netfilter: nat: un-export nf_nat_l4proto_unique_tuple
almost all l4proto->unique_tuple implementations just call this helper,
so make ->unique_tuple() optional and call its helper directly if the
l4proto doesn't override it.

This is an intermediate step to get rid of ->unique_tuple completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:57 +01:00
Florian Westphal 912da924a2 netfilter: remove NF_NAT_RANGE_PROTO_RANDOM support
Historically this was net_random() based, and was then converted to
a hash based algorithm (private boot seed + hash of endpoint addresses)
due to concerns of leaking net_random() bits.

RANDOM_FULLY mode was added later to avoid problems with hash
based mode (see commit 34ce324019,
"netfilter: nf_nat: add full port randomization support" for details).

Just make prandom_u32() the default search starting point and get rid of
->secure_port() altogether.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:36 +01:00
Yafang Shao df7043bed4 netfilter: remove unused parameters in nf_ct_l4proto_[un]register_sysctl()
These parameters aren't used now.
So remove them.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:30 +01:00
Florian Westphal a504b703bb netfilter: nat: limit port clash resolution attempts
In case almost or all available ports are taken, clash resolution can
take a very long time, resulting in soft lockup.

This can happen when many to-be-natted hosts connect to same
destination:port (e.g. a proxy) and all connections pass the same SNAT.

Pick a random offset in the acceptable range, then try ever smaller
number of adjacent port numbers, until either the limit is reached or a
useable port was found.  This results in at most 248 attempts
(128 + 64 + 32 + 16 + 8, i.e. 4 restarts with new search offset)
instead of 64000+,

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:08 +01:00
Xiaozhou Liu b635cbf68f netfilter: nat: remove unnecessary 'else if' branch
Since a pseudo-random starting point is used in finding a port in
the default case, that 'else if' branch above is no longer a necessity.
So remove it to simplify code.

Signed-off-by: Xiaozhou Liu <liuxiaozhou@bytedance.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:31:03 +01:00
Qian Cai 00ec3ab060 netfilter: ipset: replace a strncpy() with strscpy()
To make overflows as obvious as possible and to prevent code from blithely
proceeding with a truncated string. This also has a side-effect to fix a
compilation warning when using GCC 8.2.1.

net/netfilter/ipset/ip_set_core.c: In function 'ip_set_sockfn_get':
net/netfilter/ipset/ip_set_core.c:2027:3: warning: 'strncpy' writing 32 bytes into a region of size 2 overflows the destination [-Wstringop-overflow=]

Signed-off-by: Qian Cai <cai@gmx.us>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-14 00:04:08 +01:00
Florent Fourcot 8e350ce1f7 netfilter: ipset: fix ip_set_byindex function
New function added by "Introduction of new commands and protocol
version 7" is not working, since we return skb2 to user

Signed-off-by: Victorien Molle <victorien.molle@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-14 00:02:26 +01:00
Taehee Yoo d4e7df1656 netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node()
rbnode in insert_tree() is rcu protected pointer.
So, in order to handle this pointer, _rcu function should be used.
rb_link_node_rcu() is a rcu version of rb_link_node().

Fixes: 34848d5c89 ("netfilter: nf_conncount: Split insert and traversal")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-13 01:14:58 +01:00
Florian Westphal 542fbda0f0 netfilter: nat: can't use dst_hold on noref dst
The dst entry might already have a zero refcount, waiting on rcu list
to be free'd.  Using dst_hold() transitions its reference count to 1, and
next dst release will try to free it again -- resulting in a double free:

  WARNING: CPU: 1 PID: 0 at include/net/dst.h:239 nf_xfrm_me_harder+0xe7/0x130 [nf_nat]
  RIP: 0010:nf_xfrm_me_harder+0xe7/0x130 [nf_nat]
  Code: 48 8b 5c 24 60 65 48 33 1c 25 28 00 00 00 75 53 48 83 c4 68 5b 5d 41 5c c3 85 c0 74 0d 8d 48 01 f0 0f b1 0a 74 86 85 c0 75 f3 <0f> 0b e9 7b ff ff ff 29 c6 31 d2 b9 20 00 48 00 4c 89 e7 e8 31 27
  Call Trace:
  nf_nat_ipv4_out+0x78/0x90 [nf_nat_ipv4]
  nf_hook_slow+0x36/0xd0
  ip_output+0x9f/0xd0
  ip_forward+0x328/0x440
  ip_rcv+0x8a/0xb0

Use dst_hold_safe instead and bail out if we cannot take a reference.

Fixes: a4c2fd7f78 ("net: remove DST_NOCACHE flag")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-13 00:58:22 +01:00
Pan Bian 708abf74dd netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel
In the error handling block, nla_nest_cancel(skb, atd) is called to
cancel the nest operation. But then, ipset_nest_end(skb, atd) is
unexpected called to end the nest operation. This patch calls the
ipset_nest_end only on the branch that nla_nest_cancel is not called.

Fixes: 45040978c8 ("netfilter: ipset: Fix set:list type crash when flush/dump set in parallel")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-13 00:55:50 +01:00
Florian Westphal 530aad7701 netfilter: seqadj: re-load tcp header pointer after possible head reallocation
When adjusting sack block sequence numbers, skb_make_writable() gets
called to make sure tcp options are all in the linear area, and buffer
is not shared.

This can cause tcp header pointer to get reallocated, so we must
reaload it to avoid memory corruption.

This bug pre-dates git history.

Reported-by: Neel Mehta <nmehta@google.com>
Reported-by: Shane Huntley <shuntley@google.com>
Reported-by: Heather Adkins <argv@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-07 10:54:23 +01:00
Taehee Yoo 4c05ec4738 netfilter: nf_tables: fix suspicious RCU usage in nft_chain_stats_replace()
basechain->stats is rcu protected data which is updated from
nft_chain_stats_replace(). This function is executed from the commit
phase which holds the pernet nf_tables commit mutex - not the global
nfnetlink subsystem mutex.

Test commands to reproduce the problem are:
   %iptables-nft -I INPUT
   %iptables-nft -Z
   %iptables-nft -Z

This patch uses RCU calls to handle basechain->stats updates to fix a
splat that looks like:

[89279.358755] =============================
[89279.363656] WARNING: suspicious RCU usage
[89279.368458] 4.20.0-rc2+ #44 Tainted: G        W    L
[89279.374661] -----------------------------
[89279.379542] net/netfilter/nf_tables_api.c:1404 suspicious rcu_dereference_protected() usage!
[...]
[89279.406556] 1 lock held by iptables-nft/5225:
[89279.411728]  #0: 00000000bf45a000 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1f/0x70 [nf_tables]
[89279.424022] stack backtrace:
[89279.429236] CPU: 0 PID: 5225 Comm: iptables-nft Tainted: G        W    L    4.20.0-rc2+ #44
[89279.430135] Call Trace:
[89279.430135]  dump_stack+0xc9/0x16b
[89279.430135]  ? show_regs_print_info+0x5/0x5
[89279.430135]  ? lockdep_rcu_suspicious+0x117/0x160
[89279.430135]  nft_chain_commit_update+0x4ea/0x640 [nf_tables]
[89279.430135]  ? sched_clock_local+0xd4/0x140
[89279.430135]  ? check_flags.part.35+0x440/0x440
[89279.430135]  ? __rhashtable_remove_fast.constprop.67+0xec0/0xec0 [nf_tables]
[89279.430135]  ? sched_clock_cpu+0x126/0x170
[89279.430135]  ? find_held_lock+0x39/0x1c0
[89279.430135]  ? hlock_class+0x140/0x140
[89279.430135]  ? is_bpf_text_address+0x5/0xf0
[89279.430135]  ? check_flags.part.35+0x440/0x440
[89279.430135]  ? __lock_is_held+0xb4/0x140
[89279.430135]  nf_tables_commit+0x2555/0x39c0 [nf_tables]

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-04 01:37:13 +01:00
Florian Westphal 6ed5943f87 netfilter: nat: remove l4 protocol port rovers
This is a leftover from days where single-cpu systems were common:
Store last port used to resolve a clash to use it as a starting point when
the next conflict needs to be resolved.

When we have parallel attempt to connect to same address:port pair,
its likely that both cores end up computing the same "available" port,
as both use same starting port, and newly used ports won't become
visible to other cores until the conntrack gets confirmed later.

One of the cores then has to drop the packet at insertion time because
the chosen new tuple turns out to be in use after all.

Lets simplify this: remove port rover and use a pseudo-random starting
point.

Note that this doesn't make netfilter default to 'fully random' mode;
the 'rover' was only used if NAT could not reuse source port as-is.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:42 +01:00
Paul E. McKenney c8d1da4000 netfilter: Replace call_rcu_bh(), rcu_barrier_bh(), and synchronize_rcu_bh()
Now that call_rcu()'s callback is not invoked until after bh-disable
regions of code have completed (in addition to explicitly marked
RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_bh().  Similarly, rcu_barrier() can be used in place of
rcu_barrier_bh() and synchronize_rcu() in place of synchronize_rcu_bh().
This commit therefore makes these changes.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:23 +01:00
David S. Miller e561bb29b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 22:10:54 -08:00
Taehee Yoo ca08987885 netfilter: nf_tables: deactivate expressions in rule replecement routine
There is no expression deactivation call from the rule replacement path,
hence, chain counter is not decremented. A few steps to reproduce the
problem:

   %nft add table ip filter
   %nft add chain ip filter c1
   %nft add chain ip filter c1
   %nft add rule ip filter c1 jump c2
   %nft replace rule ip filter c1 handle 3 accept
   %nft flush ruleset

<jump c2> expression means immediate NFT_JUMP to chain c2.
Reference count of chain c2 is increased when the rule is added.

When rule is deleted or replaced, the reference counter of c2 should be
decreased via nft_rule_expr_deactivate() which calls
nft_immediate_deactivate().

Splat looks like:
[  214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Modules linked in: nf_tables nfnetlink
[  214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44
[  214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[  214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8
[  214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202
[  214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0
[  214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78
[  214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000
[  214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba
[  214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6
[  214.398983] FS:  0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000
[  214.398983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0
[  214.398983] Call Trace:
[  214.398983]  ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables]
[  214.398983]  ? __kasan_slab_free+0x145/0x180
[  214.398983]  ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables]
[  214.398983]  ? kfree+0xdb/0x280
[  214.398983]  nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables]
[ ... ]

Fixes: bb7b40aecb ("netfilter: nf_tables: bogus EBUSY in chain deletions")
Reported by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-28 10:56:40 +01:00
Taehee Yoo 53ca0f2fec netfilter: nf_conncount: remove wrong condition check routine
All lists that reach the tree_nodes_free() function have both zero
counter and true dead flag. The reason for this is that lists to be
release are selected by nf_conncount_gc_list() which already decrements
the list counter and sets on the dead flag. Therefore, this if statement
in tree_nodes_free() is unnecessary and wrong.

Fixes: 31568ec09e ("netfilter: nf_conncount: fix list_del corruption in conn_free")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:48:12 +01:00
Taehee Yoo 584eab291c netfilter: add missing error handling code for register functions
register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:35:19 +01:00
Florian Westphal 89259088c1 netfilter: nfnetlink_cttimeout: fetch timeouts for udplite and gre, too
syzbot was able to trigger the WARN in cttimeout_default_get() by
passing UDPLITE as l4protocol.  Alias UDPLITE to UDP, both use
same timeout values.

Furthermore, also fetch GRE timeouts.  GRE is a bit more complicated,
as it still can be a module and its netns_proto_gre struct layout isn't
visible outside of the gre module. Can't move timeouts around, it
appears conntrack sysctl unregister assumes net_generic() returns
nf_proto_net, so we get crash. Expose layout of netns_proto_gre instead.

A followup nf-next patch could make gre tracker be built-in as well
if needed, its not that large.

Last, make the WARN() mention the missing protocol value in case
anything else is missing.

Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com
Fixes: 8866df9264 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-26 10:25:20 +01:00
Xin Long 2a31e4bd9a ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf
ip_vs_dst_event is supposed to clean up all dst used in ipvs'
destinations when a net dev is going down. But it works only
when the dst's dev is the same as the dev from the event.

Now with the same priority but late registration,
ip_vs_dst_notifier is always called later than ipv6_dev_notf
where the dst's dev is set to lo for NETDEV_DOWN event.

As the dst's dev lo is not the same as the dev from the event
in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work.
Also as these dst have to wait for dest_trash_timer to clean
them up. It would cause some non-permanent kernel warnings:

  unregister_netdevice: waiting for br0 to become free. Usage count = 3

To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf
by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5.

Note that for ipv4 route fib_netdev_notifier doesn't set dst's
dev to lo in NETDEV_DOWN event, so this fix is only needed when
IP_VS_IPV6 is defined.

Fixes: 7a4f0761fc ("IPVS: init and cleanup restructuring")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-26 10:23:42 +01:00
Taehee Yoo b4e955e9f3 netfilter: xt_hashlimit: fix a possible memory leak in htable_create()
In the htable_create(), hinfo is allocated by vmalloc()
So that if error occurred, hinfo should be freed.

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-17 19:13:58 +01:00
Florian Westphal 29e3880109 netfilter: nf_tables: fix use-after-free when deleting compat expressions
nft_compat ops do not have static storage duration, unlike all other
expressions.

When nf_tables_expr_destroy() returns, expr->ops might have been
free'd already, so we need to store next address before calling
expression destructor.

For same reason, we can't deref match pointer after nft_xt_put().

This can be easily reproduced by adding msleep() before
nft_match_destroy() returns.

Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-13 09:57:33 +01:00
Taehee Yoo 0fbcc5b568 netfilter: xt_RATEEST: remove netns exit routine
xt_rateest_net_exit() was added to check whether rules are flushed
successfully. but ->net_exit() callback is called earlier than
->destroy() callback.
So that ->net_exit() callback can't check that.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 iptables -t mangle -I PREROUTING -p udp \
	   --dport 1111 -j RATEEST --rateest-name ap \
	   --rateest-interval 250ms --rateest-ewma 0.5s
   %ip netns del vm1

splat looks like:
[  668.813518] WARNING: CPU: 0 PID: 87 at net/netfilter/xt_RATEEST.c:210 xt_rateest_net_exit+0x210/0x340 [xt_RATEEST]
[  668.813518] Modules linked in: xt_RATEEST xt_tcpudp iptable_mangle bpfilter ip_tables x_tables
[  668.813518] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc7+ #21
[  668.813518] Workqueue: netns cleanup_net
[  668.813518] RIP: 0010:xt_rateest_net_exit+0x210/0x340 [xt_RATEEST]
[  668.813518] Code: 00 48 8b 85 30 ff ff ff 4c 8b 23 80 38 00 0f 85 24 01 00 00 48 8b 85 30 ff ff ff 4d 85 e4 4c 89 a5 58 ff ff ff c6 00 f8 74 b2 <0f> 0b 48 83 c3 08 4c 39 f3 75 b0 48 b8 00 00 00 00 00 fc ff df 49
[  668.813518] RSP: 0018:ffff8801156c73f8 EFLAGS: 00010282
[  668.813518] RAX: ffffed0022ad8e85 RBX: ffff880118928e98 RCX: 5db8012a00000000
[  668.813518] RDX: ffff8801156c7428 RSI: 00000000cb1d185f RDI: ffff880115663b74
[  668.813518] RBP: ffff8801156c74d0 R08: ffff8801156633c0 R09: 1ffff100236440be
[  668.813518] R10: 0000000000000001 R11: ffffed002367d852 R12: ffff880115142b08
[  668.813518] R13: 1ffff10022ad8e81 R14: ffff880118928ea8 R15: dffffc0000000000
[  668.813518] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
[  668.813518] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  668.813518] CR2: 0000563aa69f4f28 CR3: 0000000105a16000 CR4: 00000000001006f0
[  668.813518] Call Trace:
[  668.813518]  ? unregister_netdevice_many+0xe0/0xe0
[  668.813518]  ? xt_rateest_net_init+0x2c0/0x2c0 [xt_RATEEST]
[  668.813518]  ? default_device_exit+0x1ca/0x270
[  668.813518]  ? remove_proc_entry+0x1cd/0x390
[  668.813518]  ? dev_change_net_namespace+0xd00/0xd00
[  668.813518]  ? __init_waitqueue_head+0x130/0x130
[  668.813518]  ops_exit_list.isra.10+0x94/0x140
[  668.813518]  cleanup_net+0x45b/0x900
[  668.813518]  ? net_drop_ns+0x110/0x110
[  668.813518]  ? swapgs_restore_regs_and_return_to_usermode+0x3c/0x80
[  668.813518]  ? save_trace+0x300/0x300
[  668.813518]  ? lock_acquire+0x196/0x470
[  668.813518]  ? lock_acquire+0x196/0x470
[  668.813518]  ? process_one_work+0xb60/0x1de0
[  668.813518]  ? _raw_spin_unlock_irq+0x29/0x40
[  668.813518]  ? _raw_spin_unlock_irq+0x29/0x40
[  668.813518]  ? __lock_acquire+0x4500/0x4500
[  668.813518]  ? __lock_is_held+0xb4/0x140
[  668.813518]  process_one_work+0xc13/0x1de0
[  668.813518]  ? pwq_dec_nr_in_flight+0x3c0/0x3c0
[  668.813518]  ? set_load_weight+0x270/0x270
[ ... ]

Fixes: 3427b2ab63 ("netfilter: make xt_rateest hash table per net")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-13 09:57:29 +01:00
Taehee Yoo b966098769 netfilter: nf_flow_table: simplify nf_flow_offload_gc_step()
nf_flow_offload_gc_step() and nf_flow_table_iterate() are very similar.
so that many duplicate code can be removed.
After this patch, nf_flow_offload_gc_step() is simple callback function of
nf_flow_table_iterate() like nf_flow_table_do_cleanup().

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 16:41:09 +01:00
Taehee Yoo 49de9c090f netfilter: nf_flow_table: make nf_flow_table_iterate() static
nf_flow_table_iterate() is local function, make it static.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 16:40:55 +01:00
Florian Westphal 447750f281 netfilter: nf_tables: don't use position attribute on rule replacement
Its possible to set both HANDLE and POSITION when replacing a rule.
In this case, the rule at POSITION gets replaced using the
userspace-provided handle.  Rule handles are supposed to be generated
by the kernel only.

Duplicate handles should be harmless, however better disable this "feature"
by only checking for the POSITION attribute on insert operations.

Fixes: 5e94846686 ("netfilter: nf_tables: add insert operation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 16:13:35 +01:00