/*
 * IPVS         An implementation of the IP virtual server support for the
 *              LINUX operating system.  IPVS is now implemented as a module
 *              over the Netfilter framework. IPVS can be used to build a
 *              high-performance and highly available server based on a
 *              cluster of servers.
 *
 * Authors:     Wensong Zhang <wensong@linuxvirtualserver.org>
 *              Peter Kese <peter.kese@ijs.si>
 *              Julian Anastasov <ja@ssi.bg>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 * The IPVS code for kernel 2.2 was done by Wensong Zhang and Peter Kese,
 * with changes/fixes from Julian Anastasov, Lars Marowsky-Bree, Horms
 * and others. Much of the code here is taken from the IP MASQ code of
 * kernel 2.2.
 *
 * Changes:
 *
 */
#define KMSG_COMPONENT "IPVS"
#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt

#include <linux/interrupt.h>
#include <linux/in.h>
#include <linux/net.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/proc_fs.h>		/* for proc_net_* */
#include <linux/slab.h>
#include <linux/seq_file.h>
#include <linux/jhash.h>
#include <linux/random.h>

#include <net/net_namespace.h>
#include <net/ip_vs.h>

#ifndef CONFIG_IP_VS_TAB_BITS
#define CONFIG_IP_VS_TAB_BITS	12
#endif

/*
 *  Connection hash size. Default is what was selected at compile time.
 */
static int ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
module_param_named(conn_tab_bits, ip_vs_conn_tab_bits, int, 0444);
MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size");

/* size and mask values */
int ip_vs_conn_tab_size __read_mostly;
static int ip_vs_conn_tab_mask __read_mostly;

/*
 *  Connection hash table: for input and output packets lookups of IPVS
 */
static struct hlist_head *ip_vs_conn_tab __read_mostly;

/*  SLAB cache for IPVS connections */
static struct kmem_cache *ip_vs_conn_cachep __read_mostly;

/*  counter for no client port connections */
static atomic_t ip_vs_conn_no_cport_cnt = ATOMIC_INIT(0);

/* random value for IPVS connection hash */
static unsigned int ip_vs_conn_rnd __read_mostly;

/*
 *  Fine locking granularity for big connection hash table
 */
#define CT_LOCKARRAY_BITS	5
#define CT_LOCKARRAY_SIZE	(1<<CT_LOCKARRAY_BITS)
#define CT_LOCKARRAY_MASK	(CT_LOCKARRAY_SIZE-1)

struct ip_vs_aligned_lock
{
	spinlock_t	l;
} __attribute__((__aligned__(SMP_CACHE_BYTES)));

/* lock array for conn table */
static struct ip_vs_aligned_lock
__ip_vs_conntbl_lock_array[CT_LOCKARRAY_SIZE] __cacheline_aligned;

static inline void ct_write_lock_bh(unsigned int key)
{
	spin_lock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l);
}

static inline void ct_write_unlock_bh(unsigned int key)
{
	spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l);
}

/*
 *	Returns hash value for IPVS connection entry
 */
static unsigned int ip_vs_conn_hashkey(struct net *net, int af, unsigned int proto,
				       const union nf_inet_addr *addr,
				       __be16 port)
{
#ifdef CONFIG_IP_VS_IPV6
	if (af == AF_INET6)
		return (jhash_3words(jhash(addr, 16, ip_vs_conn_rnd),
				     (__force u32)port, proto, ip_vs_conn_rnd) ^
			((size_t)net>>8)) & ip_vs_conn_tab_mask;
#endif
	return (jhash_3words((__force u32)addr->ip, (__force u32)port, proto,
			     ip_vs_conn_rnd) ^
		((size_t)net>>8)) & ip_vs_conn_tab_mask;
}

static unsigned int ip_vs_conn_hashkey_param(const struct ip_vs_conn_param *p,
					     bool inverse)
{
	const union nf_inet_addr *addr;
	__be16 port;

	if (p->pe_data && p->pe->hashkey_raw)
		return p->pe->hashkey_raw(p, ip_vs_conn_rnd, inverse) &
			ip_vs_conn_tab_mask;

	if (likely(!inverse)) {
		addr = p->caddr;
		port = p->cport;
	} else {
		addr = p->vaddr;
		port = p->vport;
	}

	return ip_vs_conn_hashkey(p->net, p->af, p->protocol, addr, port);
}

static unsigned int ip_vs_conn_hashkey_conn(const struct ip_vs_conn *cp)
{
	struct ip_vs_conn_param p;

	ip_vs_conn_fill_param(ip_vs_conn_net(cp), cp->af, cp->protocol,
			      &cp->caddr, cp->cport, NULL, 0, &p);

	if (cp->pe) {
		p.pe = cp->pe;
		p.pe_data = cp->pe_data;
		p.pe_data_len = cp->pe_data_len;
	}

	return ip_vs_conn_hashkey_param(&p, false);
}

/*
 *	Hashes ip_vs_conn in ip_vs_conn_tab by netns,proto,addr,port.
 *	returns bool success.
 */
static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
{
	unsigned int hash;
	int ret;

	if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
		return 0;

	/* Hash by protocol, client address and port */
	hash = ip_vs_conn_hashkey_conn(cp);

	ct_write_lock_bh(hash);
	spin_lock(&cp->lock);

	if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
		cp->flags |= IP_VS_CONN_F_HASHED;
		atomic_inc(&cp->refcnt);
		hlist_add_head_rcu(&cp->c_list, &ip_vs_conn_tab[hash]);
		ret = 1;
	} else {
		pr_err("%s(): request for already hashed, called from %pF\n",
		       __func__, __builtin_return_address(0));
		ret = 0;
	}

	spin_unlock(&cp->lock);
	ct_write_unlock_bh(hash);

	return ret;
}

/*
 *	UNhashes ip_vs_conn from ip_vs_conn_tab.
 *	returns bool success. Caller should hold conn reference.
 */
static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
{
	unsigned int hash;
	int ret;

	/* unhash it and decrease its reference counter */
	hash = ip_vs_conn_hashkey_conn(cp);

	ct_write_lock_bh(hash);
	spin_lock(&cp->lock);

	if (cp->flags & IP_VS_CONN_F_HASHED) {
		hlist_del_rcu(&cp->c_list);
		cp->flags &= ~IP_VS_CONN_F_HASHED;
		atomic_dec(&cp->refcnt);
		ret = 1;
	} else
		ret = 0;

	spin_unlock(&cp->lock);
	ct_write_unlock_bh(hash);

	return ret;
}

/* Try to unlink ip_vs_conn from ip_vs_conn_tab.
 * returns bool success.
 */
static inline bool ip_vs_conn_unlink(struct ip_vs_conn *cp)
{
	unsigned int hash;
	bool ret;

	hash = ip_vs_conn_hashkey_conn(cp);

	ct_write_lock_bh(hash);
	spin_lock(&cp->lock);

	if (cp->flags & IP_VS_CONN_F_HASHED) {
		ret = false;
		/* Decrease refcnt and unlink conn only if we are last user */
		if (atomic_cmpxchg(&cp->refcnt, 1, 0) == 1) {
			hlist_del_rcu(&cp->c_list);
			cp->flags &= ~IP_VS_CONN_F_HASHED;
			ret = true;
		}
	} else
		ret = atomic_read(&cp->refcnt) ? false : true;

	spin_unlock(&cp->lock);
	ct_write_unlock_bh(hash);

	return ret;
}
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
/*
 *  Gets ip_vs_conn associated with supplied parameters in the ip_vs_conn_tab.
 *  Called for pkts coming from OUTside-to-INside.
 *	p->caddr, p->cport: pkt source address (foreign host)
 *	p->vaddr, p->vport: pkt dest address (load balancer)
 */
static inline struct ip_vs_conn *
__ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
{
	unsigned int hash;
	struct ip_vs_conn *cp;

	hash = ip_vs_conn_hashkey_param(p, false);

	rcu_read_lock();

	hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
		if (p->cport == cp->cport && p->vport == cp->vport &&
		    cp->af == p->af &&
		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
		    ip_vs_addr_equal(p->af, p->vaddr, &cp->vaddr) &&
		    ((!p->cport) ^ (!(cp->flags & IP_VS_CONN_F_NO_CPORT))) &&
		    p->protocol == cp->protocol &&
		    ip_vs_conn_net_eq(cp, p->net)) {
			if (!__ip_vs_conn_get(cp))
				continue;
			/* HIT */
			rcu_read_unlock();
			return cp;
		}
	}

	rcu_read_unlock();

	return NULL;
}
struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
{
	struct ip_vs_conn *cp;

	cp = __ip_vs_conn_in_get(p);
	if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt)) {
		struct ip_vs_conn_param cport_zero_p = *p;

		cport_zero_p.cport = 0;
		cp = __ip_vs_conn_in_get(&cport_zero_p);
	}

	IP_VS_DBG_BUF(9, "lookup/in %s %s:%d->%s:%d %s\n",
		      ip_vs_proto_name(p->protocol),
		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
		      cp ? "hit" : "not hit");

	return cp;
}
ipvs: API change to avoid rescan of IPv6 exthdr

Reduce the number of times we scan/skip the IPv6 exthdrs. This patch
contains a lot of API changes. This is done to avoid repeating the scan
for the IPv6 headers, via ipv6_find_hdr(), which is called by
ip_vs_fill_iph_skb().

Finding the IPv6 headers is done as early as possible, and the result is
passed on as a pointer "struct ip_vs_iphdr *" to the affected functions.
This patch reduces/removes 19 calls to ip_vs_fill_iph_skb().

Notice, I have chosen not to change the API of the function pointer
"(*schedule)" (in struct ip_vs_scheduler), as it can be used by external
schedulers via {un,}register_ip_vs_scheduler. Only 4 out of 10
schedulers use info from ip_vs_iphdr *, and when they do, they are only
interested in iph->{s,d}addr.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-26 14:07:17 +02:00

static int
ip_vs_conn_fill_param_proto(int af, const struct sk_buff *skb,
			    const struct ip_vs_iphdr *iph,
			    int inverse, struct ip_vs_conn_param *p)
{
	__be16 _ports[2], *pptr;
	struct net *net = skb_net(skb);

	pptr = frag_safe_skb_hp(skb, iph->len, sizeof(_ports), _ports, iph);
	if (pptr == NULL)
		return 1;

	if (likely(!inverse))
		ip_vs_conn_fill_param(net, af, iph->protocol, &iph->saddr,
				      pptr[0], &iph->daddr, pptr[1], p);
	else
		ip_vs_conn_fill_param(net, af, iph->protocol, &iph->daddr,
				      pptr[1], &iph->saddr, pptr[0], p);
	return 0;
}
struct ip_vs_conn *
ip_vs_conn_in_get_proto(int af, const struct sk_buff *skb,
			const struct ip_vs_iphdr *iph, int inverse)
{
	struct ip_vs_conn_param p;

	if (ip_vs_conn_fill_param_proto(af, skb, iph, inverse, &p))
		return NULL;

	return ip_vs_conn_in_get(&p);
}
EXPORT_SYMBOL_GPL(ip_vs_conn_in_get_proto);
ipvs: Fix IPv4 FWMARK virtual services

This fixes the use of fwmarks to denote IPv4 virtual services, which was
unfortunately broken as a result of the integration of IPv6 support into
IPVS, which was included in 2.6.28.

The problem arises because fwmarks are stored in the 4th octet of a
union nf_inet_addr .all; however, in the case of IPv4 only the first
octet, corresponding to .ip, is assigned and compared. In other words,
using .all = { 0, 0, 0, htonl(svc->fwmark) } always results in a value
of 0 (32 bits) being stored for IPv4. This means that one fwmark can be
used, as it ends up being mapped to 0, but things break down when
multiple fwmarks are used, as they all end up being mapped to 0.

As fwmarks are 32 bits, a reasonable fix seems to be to just store the
fwmark in .ip, and to compare and store .ip when fwmarks are used.

This patch makes the assumption that in calls to ip_vs_ct_in_get() and
ip_vs_sched_persist(), if the proto parameter is IPPROTO_IP, then we are
dealing with an fwmark. I believe this is valid, as ip_vs_in() does
fairly strict filtering on the protocol, and IPPROTO_IP should not be
used in these calls unless explicitly passed when making these calls for
fwmarks in ip_vs_sched_persist().

Tested-by: Fabien Duchêne <fabien.duchene@student.uclouvain.be>
Cc: Joseph Mack NA3T <jmack@wm7d.net>
Cc: Julius Volz <julius.volz@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-06 17:02:29 +02:00

/* Get reference to connection template */
struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p)
{
	unsigned int hash;
	struct ip_vs_conn *cp;

	hash = ip_vs_conn_hashkey_param(p, false);

	rcu_read_lock();

	hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
		if (unlikely(p->pe_data && p->pe->ct_match)) {
			if (!ip_vs_conn_net_eq(cp, p->net))
				continue;
			if (p->pe == cp->pe && p->pe->ct_match(p, cp)) {
				if (__ip_vs_conn_get(cp))
					goto out;
			}
			continue;
		}

		if (cp->af == p->af &&
		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
		    /* protocol should only be IPPROTO_IP if
		     * p->vaddr is a fwmark */
		    ip_vs_addr_equal(p->protocol == IPPROTO_IP ? AF_UNSPEC :
				     p->af, p->vaddr, &cp->vaddr) &&
		    p->vport == cp->vport && p->cport == cp->cport &&
		    cp->flags & IP_VS_CONN_F_TEMPLATE &&
		    p->protocol == cp->protocol &&
		    ip_vs_conn_net_eq(cp, p->net)) {
			if (__ip_vs_conn_get(cp))
				goto out;
		}
	}
	cp = NULL;

  out:
	rcu_read_unlock();

	IP_VS_DBG_BUF(9, "template lookup/in %s %s:%d->%s:%d %s\n",
		      ip_vs_proto_name(p->protocol),
		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
		      cp ? "hit" : "not hit");

	return cp;
}
/* Gets ip_vs_conn associated with supplied parameters in the ip_vs_conn_tab.
 * Called for pkts coming from inside-to-OUTside.
 *	p->caddr, p->cport: pkt source address (inside host)
 *	p->vaddr, p->vport: pkt dest address (foreign host) */
struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
{
	unsigned int hash;
	struct ip_vs_conn *cp, *ret = NULL;

	/*
	 *	Check for "full" addressed entries
	 */
	hash = ip_vs_conn_hashkey_param(p, true);

	rcu_read_lock();

	hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
		if (p->vport == cp->cport && p->cport == cp->dport &&
		    cp->af == p->af &&
		    ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
		    ip_vs_addr_equal(p->af, p->caddr, &cp->daddr) &&
		    p->protocol == cp->protocol &&
		    ip_vs_conn_net_eq(cp, p->net)) {
			if (!__ip_vs_conn_get(cp))
				continue;
			/* HIT */
			ret = cp;
			break;
		}
	}

	rcu_read_unlock();

	IP_VS_DBG_BUF(9, "lookup/out %s %s:%d->%s:%d %s\n",
		      ip_vs_proto_name(p->protocol),
		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
		      ret ? "hit" : "not hit");

	return ret;
}
struct ip_vs_conn *
ip_vs_conn_out_get_proto(int af, const struct sk_buff *skb,
			 const struct ip_vs_iphdr *iph, int inverse)
{
	struct ip_vs_conn_param p;

	if (ip_vs_conn_fill_param_proto(af, skb, iph, inverse, &p))
		return NULL;

	return ip_vs_conn_out_get(&p);
}
EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
/*
 *      Put back the conn and restart its timer with its timeout
 */
void ip_vs_conn_put(struct ip_vs_conn *cp)
{
	unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ?
		0 : cp->timeout;
	mod_timer(&cp->timer, jiffies+t);

	__ip_vs_conn_put(cp);
}
/*
 *	Fill a no_client_port connection with a client port number
 */
void ip_vs_conn_fill_cport(struct ip_vs_conn *cp, __be16 cport)
{
	if (ip_vs_conn_unhash(cp)) {
		spin_lock_bh(&cp->lock);
		if (cp->flags & IP_VS_CONN_F_NO_CPORT) {
			atomic_dec(&ip_vs_conn_no_cport_cnt);
			cp->flags &= ~IP_VS_CONN_F_NO_CPORT;
			cp->cport = cport;
		}
		spin_unlock_bh(&cp->lock);

		/* hash on new dport */
		ip_vs_conn_hash(cp);
	}
}
/*
 *	Bind a connection entry with the corresponding packet_xmit.
 *	Called by ip_vs_conn_new.
 */
static inline void ip_vs_bind_xmit(struct ip_vs_conn *cp)
{
	switch (IP_VS_FWD_METHOD(cp)) {
	case IP_VS_CONN_F_MASQ:
		cp->packet_xmit = ip_vs_nat_xmit;
		break;

	case IP_VS_CONN_F_TUNNEL:
		cp->packet_xmit = ip_vs_tunnel_xmit;
		break;

	case IP_VS_CONN_F_DROUTE:
		cp->packet_xmit = ip_vs_dr_xmit;
		break;

	case IP_VS_CONN_F_LOCALNODE:
		cp->packet_xmit = ip_vs_null_xmit;
		break;

	case IP_VS_CONN_F_BYPASS:
		cp->packet_xmit = ip_vs_bypass_xmit;
		break;
	}
}
#ifdef CONFIG_IP_VS_IPV6
static inline void ip_vs_bind_xmit_v6(struct ip_vs_conn *cp)
{
	switch (IP_VS_FWD_METHOD(cp)) {
	case IP_VS_CONN_F_MASQ:
		cp->packet_xmit = ip_vs_nat_xmit_v6;
		break;

	case IP_VS_CONN_F_TUNNEL:
		cp->packet_xmit = ip_vs_tunnel_xmit_v6;
		break;

	case IP_VS_CONN_F_DROUTE:
		cp->packet_xmit = ip_vs_dr_xmit_v6;
		break;

	case IP_VS_CONN_F_LOCALNODE:
		cp->packet_xmit = ip_vs_null_xmit;
		break;

	case IP_VS_CONN_F_BYPASS:
		cp->packet_xmit = ip_vs_bypass_xmit_v6;
		break;
	}
}
#endif
static inline int ip_vs_dest_totalconns(struct ip_vs_dest *dest)
{
	return atomic_read(&dest->activeconns)
		+ atomic_read(&dest->inactconns);
}
/*
 *	Bind a connection entry with a virtual service destination
 *	Called just after a new connection entry is created.
 */
static inline void
ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
{
	unsigned int conn_flags;
	__u32 flags;

	/* if dest is NULL, then return directly */
	if (!dest)
		return;

	/* Increase the refcnt counter of the dest */
	ip_vs_dest_hold(dest);

	conn_flags = atomic_read(&dest->conn_flags);
	if (cp->protocol != IPPROTO_UDP)
		conn_flags &= ~IP_VS_CONN_F_ONE_PACKET;
	flags = cp->flags;
	/* Bind with the destination and its corresponding transmitter */
	if (flags & IP_VS_CONN_F_SYNC) {
		/* if the connection is not template and is created
		 * by sync, preserve the activity flag.
		 */
		if (!(flags & IP_VS_CONN_F_TEMPLATE))
			conn_flags &= ~IP_VS_CONN_F_INACTIVE;
		/* connections inherit forwarding method from dest */
		flags &= ~(IP_VS_CONN_F_FWD_MASK | IP_VS_CONN_F_NOOUTPUT);
	}
	flags |= conn_flags;
	cp->flags = flags;
	cp->dest = dest;

	IP_VS_DBG_BUF(7, "Bind-dest %s c:%s:%d v:%s:%d "
		      "d:%s:%d fwd:%c s:%u conn->flags:%X conn->refcnt:%d "
		      "dest->refcnt:%d\n",
		      ip_vs_proto_name(cp->protocol),
		      IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
		      IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
		      IP_VS_DBG_ADDR(cp->af, &cp->daddr), ntohs(cp->dport),
		      ip_vs_fwd_tag(cp), cp->state,
		      cp->flags, atomic_read(&cp->refcnt),
		      atomic_read(&dest->refcnt));

	/* Update the connection counters */
	if (!(flags & IP_VS_CONN_F_TEMPLATE)) {
		/* It is a normal connection, so modify the counters
		 * according to the flags, later the protocol can
		 * update them on state change
		 */
		if (!(flags & IP_VS_CONN_F_INACTIVE))
			atomic_inc(&dest->activeconns);
		else
			atomic_inc(&dest->inactconns);
	} else {
		/* It is a persistent connection/template, so increase
		   the persistent connection counter */
		atomic_inc(&dest->persistconns);
	}

	if (dest->u_threshold != 0 &&
	    ip_vs_dest_totalconns(dest) >= dest->u_threshold)
		dest->flags |= IP_VS_DEST_F_OVERLOAD;
}
/*
 * Check if there is a destination for the connection, if so
 * bind the connection to the destination.
 */
void ip_vs_try_bind_dest(struct ip_vs_conn *cp)
{
	struct ip_vs_dest *dest;

	rcu_read_lock();
	dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
			       cp->dport, &cp->vaddr, cp->vport,
			       cp->protocol, cp->fwmark, cp->flags);
	if (dest) {
		struct ip_vs_proto_data *pd;

		spin_lock_bh(&cp->lock);
		if (cp->dest) {
			spin_unlock_bh(&cp->lock);
			rcu_read_unlock();
			return;
		}

		/* Applications work depending on the forwarding method
		 * but better to reassign them always when binding dest */
		if (cp->app)
			ip_vs_unbind_app(cp);

		ip_vs_bind_dest(cp, dest);
		spin_unlock_bh(&cp->lock);

		/* Update its packet transmitter */
		cp->packet_xmit = NULL;
#ifdef CONFIG_IP_VS_IPV6
		if (cp->af == AF_INET6)
			ip_vs_bind_xmit_v6(cp);
		else
#endif
			ip_vs_bind_xmit(cp);

		pd = ip_vs_proto_data_get(ip_vs_conn_net(cp), cp->protocol);
		if (pd && atomic_read(&pd->appcnt))
			ip_vs_bind_app(cp, pd->pp);
	}
	rcu_read_unlock();
}
/*
 *	Unbind a connection entry with its VS destination
 *	Called by the ip_vs_conn_expire function.
 */
static inline void ip_vs_unbind_dest(struct ip_vs_conn *cp)
{
	struct ip_vs_dest *dest = cp->dest;

	if (!dest)
		return;

	IP_VS_DBG_BUF(7, "Unbind-dest %s c:%s:%d v:%s:%d "
		      "d:%s:%d fwd:%c s:%u conn->flags:%X conn->refcnt:%d "
		      "dest->refcnt:%d\n",
		      ip_vs_proto_name(cp->protocol),
		      IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
		      IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
		      IP_VS_DBG_ADDR(cp->af, &cp->daddr), ntohs(cp->dport),
		      ip_vs_fwd_tag(cp), cp->state,
		      cp->flags, atomic_read(&cp->refcnt),
		      atomic_read(&dest->refcnt));

	/* Update the connection counters */
	if (!(cp->flags & IP_VS_CONN_F_TEMPLATE)) {
		/* It is a normal connection, so decrease the inactconns
		   or activeconns counter */
		if (cp->flags & IP_VS_CONN_F_INACTIVE) {
			atomic_dec(&dest->inactconns);
		} else {
			atomic_dec(&dest->activeconns);
		}
	} else {
		/* It is a persistent connection/template, so decrease
		   the persistent connection counter */
		atomic_dec(&dest->persistconns);
	}

	if (dest->l_threshold != 0) {
		if (ip_vs_dest_totalconns(dest) < dest->l_threshold)
			dest->flags &= ~IP_VS_DEST_F_OVERLOAD;
	} else if (dest->u_threshold != 0) {
		if (ip_vs_dest_totalconns(dest) * 4 < dest->u_threshold * 3)
			dest->flags &= ~IP_VS_DEST_F_OVERLOAD;
	} else {
		if (dest->flags & IP_VS_DEST_F_OVERLOAD)
			dest->flags &= ~IP_VS_DEST_F_OVERLOAD;
	}

	ip_vs_dest_put(dest);
}
static int expire_quiescent_template(struct netns_ipvs *ipvs,
				     struct ip_vs_dest *dest)
{
#ifdef CONFIG_SYSCTL
	return ipvs->sysctl_expire_quiescent_template &&
		(atomic_read(&dest->weight) == 0);
#else
	return 0;
#endif
}
/*
 *	Checking if the destination of a connection template is available.
 *	If available, return 1, otherwise invalidate this connection
 *	template and return 0.
 */
int ip_vs_check_template(struct ip_vs_conn *ct)
{
	struct ip_vs_dest *dest = ct->dest;
	struct netns_ipvs *ipvs = net_ipvs(ip_vs_conn_net(ct));

	/*
	 * Checking the dest server status.
	 */
	if ((dest == NULL) ||
	    !(dest->flags & IP_VS_DEST_F_AVAILABLE) ||
	    expire_quiescent_template(ipvs, dest)) {
		IP_VS_DBG_BUF(9, "check_template: dest not available for "
			      "protocol %s s:%s:%d v:%s:%d "
			      "-> d:%s:%d\n",
			      ip_vs_proto_name(ct->protocol),
			      IP_VS_DBG_ADDR(ct->af, &ct->caddr),
			      ntohs(ct->cport),
			      IP_VS_DBG_ADDR(ct->af, &ct->vaddr),
			      ntohs(ct->vport),
			      IP_VS_DBG_ADDR(ct->af, &ct->daddr),
			      ntohs(ct->dport));

		/*
		 * Invalidate the connection template
		 */
		if (ct->vport != htons(0xffff)) {
			if (ip_vs_conn_unhash(ct)) {
				ct->dport = htons(0xffff);
				ct->vport = htons(0xffff);
				ct->cport = 0;
				ip_vs_conn_hash(ct);
			}
		}

		/*
		 * Simply decrease the refcnt of the template,
		 * don't restart its timer.
		 */
		__ip_vs_conn_put(ct);
		return 0;
	}
	return 1;
}
static void ip_vs_conn_rcu_free(struct rcu_head *head)
{
	struct ip_vs_conn *cp = container_of(head, struct ip_vs_conn,
					     rcu_head);

	ip_vs_pe_put(cp->pe);
	kfree(cp->pe_data);
	kmem_cache_free(ip_vs_conn_cachep, cp);
}
static void ip_vs_conn_expire(unsigned long data)
{
	struct ip_vs_conn *cp = (struct ip_vs_conn *)data;
	struct net *net = ip_vs_conn_net(cp);
	struct netns_ipvs *ipvs = net_ipvs(net);

	/*
	 *	do I control anybody?
	 */
	if (atomic_read(&cp->n_control))
		goto expire_later;

	/* Unlink conn if not referenced anymore */
	if (likely(ip_vs_conn_unlink(cp))) {
		/* delete the timer if it is activated by other users */
		del_timer(&cp->timer);

		/* does anybody control me? */
		if (cp->control)
			ip_vs_control_del(cp);

		if (cp->flags & IP_VS_CONN_F_NFCT) {
			/* Do not access conntracks during subsys cleanup
			 * because nf_conntrack_find_get can not be used after
			 * conntrack cleanup for the net.
			 */
			smp_rmb();
			if (ipvs->enable)
				ip_vs_conn_drop_conntrack(cp);
		}

		if (unlikely(cp->app != NULL))
			ip_vs_unbind_app(cp);
		ip_vs_unbind_dest(cp);
		if (cp->flags & IP_VS_CONN_F_NO_CPORT)
			atomic_dec(&ip_vs_conn_no_cport_cnt);
		call_rcu(&cp->rcu_head, ip_vs_conn_rcu_free);
		atomic_dec(&ipvs->conn_count);
		return;
	}

  expire_later:
	IP_VS_DBG(7, "delayed: conn->refcnt=%d conn->n_control=%d\n",
		  atomic_read(&cp->refcnt),
		  atomic_read(&cp->n_control));

	atomic_inc(&cp->refcnt);
	cp->timeout = 60*HZ;

	if (ipvs->sync_state & IP_VS_STATE_MASTER)
		ip_vs_sync_conn(net, cp, sysctl_sync_threshold(ipvs));

	ip_vs_conn_put(cp);
}
/* Modify timer, so that it expires as soon as possible.
 * Can be called without reference only if under RCU lock.
 */
void ip_vs_conn_expire_now(struct ip_vs_conn *cp)
{
	/* Using mod_timer_pending will ensure the timer is not
	 * modified after the final del_timer in ip_vs_conn_expire.
	 */
	if (timer_pending(&cp->timer) &&
	    time_after(cp->timer.expires, jiffies))
		mod_timer_pending(&cp->timer, jiffies);
}
/*
 *	Create a new connection entry and hash it into the ip_vs_conn_tab
 */
struct ip_vs_conn *
ip_vs_conn_new(const struct ip_vs_conn_param *p,
	       const union nf_inet_addr *daddr, __be16 dport, unsigned int flags,
	       struct ip_vs_dest *dest, __u32 fwmark)
{
	struct ip_vs_conn *cp;
	struct netns_ipvs *ipvs = net_ipvs(p->net);
	struct ip_vs_proto_data *pd = ip_vs_proto_data_get(p->net,
							   p->protocol);

	cp = kmem_cache_alloc(ip_vs_conn_cachep, GFP_ATOMIC);
	if (cp == NULL) {
		IP_VS_ERR_RL("%s(): no memory\n", __func__);
		return NULL;
	}

	INIT_HLIST_NODE(&cp->c_list);
	setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp);
	ip_vs_conn_net_set(cp, p->net);
	cp->af = p->af;
	cp->protocol = p->protocol;
	ip_vs_addr_set(p->af, &cp->caddr, p->caddr);
	cp->cport = p->cport;
	ip_vs_addr_set(p->af, &cp->vaddr, p->vaddr);
	cp->vport = p->vport;
	/* ipvs: Fix IPv4 FWMARK virtual services
	 *
	 * This fixes the use of fwmarks to denote IPv4 virtual services,
	 * which was unfortunately broken as a result of the integration of
	 * IPv6 support into IPVS, included in 2.6.28.
	 *
	 * The problem arises because fwmarks are stored in the 4th octet of
	 * a union nf_inet_addr .all; however, in the case of IPv4 only the
	 * first octet, corresponding to .ip, is assigned and compared.
	 * In other words, using .all = { 0, 0, 0, htonl(svc->fwmark) }
	 * always results in a value of 0 (32 bits) being stored for IPv4.
	 * This means that one fwmark can be used, as it ends up being mapped
	 * to 0, but things break down when multiple fwmarks are used, as
	 * they all end up being mapped to 0.
	 *
	 * As fwmarks are 32 bits, a reasonable fix seems to be to just store
	 * the fwmark in .ip, and to compare and store .ip when fwmarks are
	 * used.
	 *
	 * This patch makes the assumption that in calls to ip_vs_ct_in_get()
	 * and ip_vs_sched_persist(), if the proto parameter is IPPROTO_IP
	 * then we are dealing with an fwmark. I believe this is valid, as
	 * ip_vs_in() does fairly strict filtering on the protocol and
	 * IPPROTO_IP should not be used in these calls unless explicitly
	 * passed when making these calls for fwmarks in
	 * ip_vs_sched_persist().
	 *
	 * Tested-by: Fabien Duchêne <fabien.duchene@student.uclouvain.be>
	 * Cc: Joseph Mack NA3T <jmack@wm7d.net>
	 * Cc: Julius Volz <julius.volz@gmail.com>
	 * Signed-off-by: Simon Horman <horms@verge.net.au>
	 * Signed-off-by: David S. Miller <davem@davemloft.net>
	 */
	/* proto should only be IPPROTO_IP if d_addr is a fwmark */
	ip_vs_addr_set(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af,
		       &cp->daddr, daddr);
	cp->dport = dport;
	cp->flags = flags;
	cp->fwmark = fwmark;
	if (flags & IP_VS_CONN_F_TEMPLATE && p->pe) {
		ip_vs_pe_get(p->pe);
		cp->pe = p->pe;
		cp->pe_data = p->pe_data;
		cp->pe_data_len = p->pe_data_len;
	} else {
		cp->pe = NULL;
		cp->pe_data = NULL;
		cp->pe_data_len = 0;
	}
	spin_lock_init(&cp->lock);

	/*
	 * Mark the entry as referenced by the current thread before hashing
	 * it in the table, so that another thread running
	 * ip_vs_random_dropentry cannot drop this entry.
	 */
	atomic_set(&cp->refcnt, 1);

	cp->control = NULL;
	atomic_set(&cp->n_control, 0);
	atomic_set(&cp->in_pkts, 0);

	cp->packet_xmit = NULL;
	cp->app = NULL;
	cp->app_data = NULL;
	/* reset struct ip_vs_seq */
	cp->in_seq.delta = 0;
	cp->out_seq.delta = 0;

	atomic_inc(&ipvs->conn_count);
	if (flags & IP_VS_CONN_F_NO_CPORT)
		atomic_inc(&ip_vs_conn_no_cport_cnt);

	/* Bind the connection with a destination server */
	cp->dest = NULL;
	ip_vs_bind_dest(cp, dest);

	/* Set its state and timeout */
	cp->state = 0;
	cp->old_state = 0;
	cp->timeout = 3*HZ;
	cp->sync_endtime = jiffies & ~3UL;

	/* Bind its packet transmitter */
#ifdef CONFIG_IP_VS_IPV6
	if (p->af == AF_INET6)
		ip_vs_bind_xmit_v6(cp);
	else
#endif
		ip_vs_bind_xmit(cp);

	if (unlikely(pd && atomic_read(&pd->appcnt)))
		ip_vs_bind_app(cp, pd->pp);

	/*
	 * Allow conntrack to be preserved. By default, conntrack
	 * is created and destroyed for every packet.
	 * Sometimes keeping conntrack can be useful for
	 * IP_VS_CONN_F_ONE_PACKET too.
	 */
	if (ip_vs_conntrack_enabled(ipvs))
		cp->flags |= IP_VS_CONN_F_NFCT;

	/* Hash it in the ip_vs_conn_tab finally */
	ip_vs_conn_hash(cp);

	return cp;
}

/*
 *	/proc/net/ip_vs_conn entries
 */
#ifdef CONFIG_PROC_FS
struct ip_vs_iter_state {
	struct seq_net_private	p;
	struct hlist_head	*l;
};

static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
{
	int idx;
	struct ip_vs_conn *cp;
	struct ip_vs_iter_state *iter = seq->private;

	/* IPVS: Allow boot time change of hash size
	 *
	 * I was very frustrated about the fact that I have to recompile the
	 * kernel to change the hash size. So, I created this patch.
	 * If IPVS is built-in you can append ip_vs.conn_tab_bits=?? to the
	 * kernel command line, or, if you built IPVS as modules, you can add
	 * "options ip_vs conn_tab_bits=??".
	 * To keep everything backward compatible, you can still select the
	 * size at compile time, and that will be used as the default.
	 *
	 * It has been about a year since this patch was originally posted
	 * and subsequently dropped on the basis of insufficient test data.
	 * Mark Bergsma has provided the following test results, which seem
	 * to strongly support the need for larger hash table sizes:
	 *
	 * We do however run into the same problem with the default setting
	 * (2^12 = 4096 entries), as most of our LVS balancers handle around
	 * a million connections/SLAB entries at any point in time (around
	 * 100-150 kpps load). With only 4096 hash table entries this implies
	 * that each entry consists of a linked list of 256 connections *on
	 * average*.
	 *
	 * To provide some statistics, I did an oprofile run on a 2.6.31
	 * kernel, with both the default 4096 table size, and the same kernel
	 * recompiled with IP_VS_CONN_TAB_BITS set to 18 (2^18 = 262144
	 * entries). I built a quick test setup with a part of
	 * Wikimedia/Wikipedia's live traffic mirrored by the switch to the
	 * test host.
	 *
	 * With the default setting, at ~120 kpps packet load we saw a
	 * typical %si CPU usage of around 30-35%, and oprofile reported a
	 * hot spot in ip_vs_conn_in_get:
	 *
	 * samples  %        image name  app name  symbol name
	 * 1719761  42.3741  ip_vs.ko    ip_vs.ko  ip_vs_conn_in_get
	 * 302577    7.4554  bnx2        bnx2      /bnx2
	 * 181984    4.4840  vmlinux     vmlinux   __ticket_spin_lock
	 * 128636    3.1695  vmlinux     vmlinux   ip_route_input
	 * 74345     1.8318  ip_vs.ko    ip_vs.ko  ip_vs_conn_out_get
	 * 68482     1.6874  vmlinux     vmlinux   mwait_idle
	 *
	 * After loading the recompiled kernel with 2^18 entries, %si CPU
	 * usage dropped in half to around 12-18%, and oprofile looks much
	 * healthier, with only 7% spent in ip_vs_conn_in_get:
	 *
	 * samples  %        image name  app name  symbol name
	 * 265641   14.4616  bnx2        bnx2      /bnx2
	 * 143251    7.7986  vmlinux     vmlinux   __ticket_spin_lock
	 * 140661    7.6576  ip_vs.ko    ip_vs.ko  ip_vs_conn_in_get
	 * 94364     5.1372  vmlinux     vmlinux   mwait_idle
	 * 86267     4.6964  vmlinux     vmlinux   ip_route_input
	 *
	 * [ horms@verge.net.au: trivial up-port and minor style fixes ]
	 *
	 * Signed-off-by: Catalin(ux) M. BOIE <catab@embedromix.ro>
	 * Cc: Mark Bergsma <mark@wikimedia.org>
	 * Signed-off-by: Simon Horman <horms@verge.net.au>
	 * Signed-off-by: Patrick McHardy <kaber@trash.net>
	 */
	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
			/* __ip_vs_conn_get() is not needed by
			 * ip_vs_conn_seq_show and ip_vs_conn_sync_seq_show
			 */
			if (pos-- == 0) {
				iter->l = &ip_vs_conn_tab[idx];
				return cp;
			}
		}
		cond_resched_rcu();
	}

	return NULL;
}

static void *ip_vs_conn_seq_start(struct seq_file *seq, loff_t *pos)
	__acquires(RCU)
{
	struct ip_vs_iter_state *iter = seq->private;

	iter->l = NULL;
	rcu_read_lock();
	return *pos ? ip_vs_conn_array(seq, *pos - 1) : SEQ_START_TOKEN;
}

static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct ip_vs_conn *cp = v;
	struct ip_vs_iter_state *iter = seq->private;
	struct hlist_node *e;
	struct hlist_head *l = iter->l;
	int idx;

	++*pos;
	if (v == SEQ_START_TOKEN)
		return ip_vs_conn_array(seq, 0);

	/* more on same hash chain? */
	e = rcu_dereference(hlist_next_rcu(&cp->c_list));
	if (e)
		return hlist_entry(e, struct ip_vs_conn, c_list);

	idx = l - ip_vs_conn_tab;
	while (++idx < ip_vs_conn_tab_size) {
		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
			iter->l = &ip_vs_conn_tab[idx];
			return cp;
		}
		cond_resched_rcu();
	}
	iter->l = NULL;
	return NULL;
}

static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v)
	__releases(RCU)
{
	rcu_read_unlock();
}

static int ip_vs_conn_seq_show(struct seq_file *seq, void *v)
{

	if (v == SEQ_START_TOKEN)
		seq_puts(seq,
   "Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expires PEName PEData\n");
	else {
		const struct ip_vs_conn *cp = v;
		struct net *net = seq_file_net(seq);
		char pe_data[IP_VS_PENAME_MAXLEN + IP_VS_PEDATA_MAXLEN + 3];
		size_t len = 0;

		if (!ip_vs_conn_net_eq(cp, net))
			return 0;
		if (cp->pe_data) {
			pe_data[0] = ' ';
			len = strlen(cp->pe->name);
			memcpy(pe_data + 1, cp->pe->name, len);
			pe_data[len + 1] = ' ';
			len += 2;
			len += cp->pe->show_pe_data(cp, pe_data + len);
		}
		pe_data[len] = '\0';

#ifdef CONFIG_IP_VS_IPV6
		if (cp->af == AF_INET6)
			seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X "
				"%pI6 %04X %-11s %7lu%s\n",
				ip_vs_proto_name(cp->protocol),
				&cp->caddr.in6, ntohs(cp->cport),
				&cp->vaddr.in6, ntohs(cp->vport),
				&cp->daddr.in6, ntohs(cp->dport),
				ip_vs_state_name(cp->protocol, cp->state),
				(cp->timer.expires-jiffies)/HZ, pe_data);
		else
#endif
			seq_printf(seq,
				"%-3s %08X %04X %08X %04X"
				" %08X %04X %-11s %7lu%s\n",
				ip_vs_proto_name(cp->protocol),
				ntohl(cp->caddr.ip), ntohs(cp->cport),
				ntohl(cp->vaddr.ip), ntohs(cp->vport),
				ntohl(cp->daddr.ip), ntohs(cp->dport),
				ip_vs_state_name(cp->protocol, cp->state),
				(cp->timer.expires-jiffies)/HZ, pe_data);
	}
	return 0;
}

static const struct seq_operations ip_vs_conn_seq_ops = {
	.start = ip_vs_conn_seq_start,
	.next  = ip_vs_conn_seq_next,
	.stop  = ip_vs_conn_seq_stop,
	.show  = ip_vs_conn_seq_show,
};

static int ip_vs_conn_open(struct inode *inode, struct file *file)
{
	return seq_open_net(inode, file, &ip_vs_conn_seq_ops,
			    sizeof(struct ip_vs_iter_state));
}

static const struct file_operations ip_vs_conn_fops = {
	.owner	 = THIS_MODULE,
	.open    = ip_vs_conn_open,
	.read    = seq_read,
	.llseek  = seq_lseek,
	.release = seq_release_net,
};

static const char *ip_vs_origin_name(unsigned int flags)
{
	if (flags & IP_VS_CONN_F_SYNC)
		return "SYNC";
	else
		return "LOCAL";
}

static int ip_vs_conn_sync_seq_show(struct seq_file *seq, void *v)
{

	if (v == SEQ_START_TOKEN)
		seq_puts(seq,
   "Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Origin Expires\n");
	else {
		const struct ip_vs_conn *cp = v;
		struct net *net = seq_file_net(seq);

		if (!ip_vs_conn_net_eq(cp, net))
			return 0;

#ifdef CONFIG_IP_VS_IPV6
		if (cp->af == AF_INET6)
			seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X %pI6 %04X %-11s %-6s %7lu\n",
				ip_vs_proto_name(cp->protocol),
				&cp->caddr.in6, ntohs(cp->cport),
				&cp->vaddr.in6, ntohs(cp->vport),
				&cp->daddr.in6, ntohs(cp->dport),
				ip_vs_state_name(cp->protocol, cp->state),
				ip_vs_origin_name(cp->flags),
				(cp->timer.expires-jiffies)/HZ);
		else
#endif
			seq_printf(seq,
				"%-3s %08X %04X %08X %04X "
				"%08X %04X %-11s %-6s %7lu\n",
				ip_vs_proto_name(cp->protocol),
				ntohl(cp->caddr.ip), ntohs(cp->cport),
				ntohl(cp->vaddr.ip), ntohs(cp->vport),
				ntohl(cp->daddr.ip), ntohs(cp->dport),
				ip_vs_state_name(cp->protocol, cp->state),
				ip_vs_origin_name(cp->flags),
				(cp->timer.expires-jiffies)/HZ);
	}
	return 0;
}

static const struct seq_operations ip_vs_conn_sync_seq_ops = {
	.start = ip_vs_conn_seq_start,
	.next  = ip_vs_conn_seq_next,
	.stop  = ip_vs_conn_seq_stop,
	.show  = ip_vs_conn_sync_seq_show,
};

static int ip_vs_conn_sync_open(struct inode *inode, struct file *file)
{
	return seq_open_net(inode, file, &ip_vs_conn_sync_seq_ops,
			    sizeof(struct ip_vs_iter_state));
}

static const struct file_operations ip_vs_conn_sync_fops = {
	.owner	 = THIS_MODULE,
	.open    = ip_vs_conn_sync_open,
	.read    = seq_read,
	.llseek  = seq_lseek,
	.release = seq_release_net,
};

#endif


/*
 *	Randomly drop connection entries before running out of memory
 */
static inline int todrop_entry(struct ip_vs_conn *cp)
{
	/*
	 * The drop rate array needs tuning for real environments.
	 * Called from timer bh only => no locking
	 */
	static const char todrop_rate[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
	static char todrop_counter[9] = {0};
	int i;

	/* if the conn entry hasn't lasted for 60 seconds, don't drop it.
	   This will leave enough time for normal connection to get
	   through. */
	if (time_before(cp->timeout + jiffies, cp->timer.expires + 60*HZ))
		return 0;

	/* Don't drop the entry if its number of incoming packets is not
	   located in [0, 8] */
	i = atomic_read(&cp->in_pkts);
	if (i > 8 || i < 0)
		return 0;

	if (!todrop_rate[i])
		return 0;
	if (--todrop_counter[i] > 0)
		return 0;

	todrop_counter[i] = todrop_rate[i];
	return 1;
}

/* Called from keventd and must protect itself from softirqs */
void ip_vs_random_dropentry(struct net *net)
{
	int idx;
	struct ip_vs_conn *cp, *cp_c;

	rcu_read_lock();
	/*
	 * Randomly scan 1/32 of the whole table every second
	 */
	for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) {
		unsigned int hash = net_random() & ip_vs_conn_tab_mask;

		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
			if (cp->flags & IP_VS_CONN_F_TEMPLATE)
				/* connection template */
				continue;
			if (!ip_vs_conn_net_eq(cp, net))
				continue;
			if (cp->protocol == IPPROTO_TCP) {
				switch(cp->state) {
				case IP_VS_TCP_S_SYN_RECV:
				case IP_VS_TCP_S_SYNACK:
					break;

				case IP_VS_TCP_S_ESTABLISHED:
					if (todrop_entry(cp))
						break;
					continue;

				default:
					continue;
				}
			} else if (cp->protocol == IPPROTO_SCTP) {
				switch (cp->state) {
				case IP_VS_SCTP_S_INIT1:
				case IP_VS_SCTP_S_INIT:
					break;
				case IP_VS_SCTP_S_ESTABLISHED:
					if (todrop_entry(cp))
						break;
					continue;
				default:
					continue;
				}
			} else {
				if (!todrop_entry(cp))
					continue;
			}

			IP_VS_DBG(4, "del connection\n");
			ip_vs_conn_expire_now(cp);
			cp_c = cp->control;
			/* cp->control is valid only with reference to cp */
			if (cp_c && __ip_vs_conn_get(cp)) {
				IP_VS_DBG(4, "del conn template\n");
				ip_vs_conn_expire_now(cp_c);
				__ip_vs_conn_put(cp);
			}
		}
		cond_resched_rcu();
	}
	rcu_read_unlock();
}

/*
 *	Flush all the connection entries in the ip_vs_conn_tab
 */
static void ip_vs_conn_flush(struct net *net)
{
	int idx;
	struct ip_vs_conn *cp, *cp_c;
	struct netns_ipvs *ipvs = net_ipvs(net);

flush_again:
	rcu_read_lock();
|
|
|
for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2013-03-21 10:58:10 +01:00
|
|
|
hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
|
2011-01-03 14:44:57 +01:00
|
|
|
if (!ip_vs_conn_net_eq(cp, net))
|
|
|
|
continue;
|
2005-04-17 00:20:36 +02:00
|
|
|
IP_VS_DBG(4, "del connection\n");
|
|
|
|
ip_vs_conn_expire_now(cp);
|
2013-03-21 10:58:10 +01:00
|
|
|
cp_c = cp->control;
|
|
|
|
/* cp->control is valid only with reference to cp */
|
|
|
|
if (cp_c && __ip_vs_conn_get(cp)) {
|
2005-04-17 00:20:36 +02:00
|
|
|
IP_VS_DBG(4, "del conn template\n");
|
2013-03-21 10:58:10 +01:00
|
|
|
ip_vs_conn_expire_now(cp_c);
|
|
|
|
__ip_vs_conn_put(cp);
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
}
|
2013-05-22 07:50:32 +02:00
|
|
|
cond_resched_rcu();
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
2013-05-22 07:50:32 +02:00
|
|
|
rcu_read_unlock();
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
/* the counter may be not NULL, because maybe some conn entries
|
|
|
|
are run by slow timer handler or unhashed but still referred */
|
2011-01-03 14:44:57 +01:00
|
|
|
if (atomic_read(&ipvs->conn_count) != 0) {
|
2005-04-17 00:20:36 +02:00
|
|
|
schedule();
|
|
|
|
goto flush_again;
|
|
|
|
}
|
|
|
|
}
|
2011-01-03 14:44:42 +01:00
|
|
|
/*
|
|
|
|
* per netns init and exit
|
|
|
|
*/
|
2011-05-01 18:50:16 +02:00
|
|
|
int __net_init ip_vs_conn_net_init(struct net *net)
|
2011-01-03 14:44:42 +01:00
|
|
|
{
|
2011-01-03 14:44:57 +01:00
|
|
|
struct netns_ipvs *ipvs = net_ipvs(net);
|
|
|
|
|
|
|
|
atomic_set(&ipvs->conn_count, 0);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2013-02-18 02:34:54 +01:00
|
|
|
proc_create("ip_vs_conn", 0, net->proc_net, &ip_vs_conn_fops);
|
|
|
|
proc_create("ip_vs_conn_sync", 0, net->proc_net, &ip_vs_conn_sync_fops);
|
2011-01-03 14:44:42 +01:00
|
|
|
return 0;
|
|
|
|
}

void __net_exit ip_vs_conn_net_cleanup(struct net *net)
{
	/* flush all the connection entries first */
	ip_vs_conn_flush(net);
	remove_proc_entry("ip_vs_conn", net->proc_net);
	remove_proc_entry("ip_vs_conn_sync", net->proc_net);
}

int __init ip_vs_conn_init(void)
{
	int idx;

	/* Compute size and mask */
	ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
	ip_vs_conn_tab_mask = ip_vs_conn_tab_size - 1;

	/*
	 * Allocate the connection hash table and initialize its list heads
	 */
	ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size * sizeof(*ip_vs_conn_tab));
	if (!ip_vs_conn_tab)
		return -ENOMEM;

	/* Allocate ip_vs_conn slab cache */
	ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
					      sizeof(struct ip_vs_conn), 0,
					      SLAB_HWCACHE_ALIGN, NULL);
	if (!ip_vs_conn_cachep) {
		vfree(ip_vs_conn_tab);
		return -ENOMEM;
	}

	pr_info("Connection hash table configured "
		"(size=%d, memory=%ldKbytes)\n",
		ip_vs_conn_tab_size,
		(long)(ip_vs_conn_tab_size * sizeof(struct list_head)) / 1024);
	IP_VS_DBG(0, "Each connection entry needs %Zd bytes at least\n",
		  sizeof(struct ip_vs_conn));

	for (idx = 0; idx < ip_vs_conn_tab_size; idx++)
		INIT_HLIST_HEAD(&ip_vs_conn_tab[idx]);

	for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++)
		spin_lock_init(&__ip_vs_conntbl_lock_array[idx].l);

	/* calculate the random value for connection hash */
	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));

	return 0;
}

void ip_vs_conn_cleanup(void)
{
	/* Wait for all ip_vs_conn_rcu_free() callbacks to complete */
	rcu_barrier();
	/* Release the empty cache */
	kmem_cache_destroy(ip_vs_conn_cachep);
	vfree(ip_vs_conn_tab);
}