linux/net
Evgeniy Polyakov a9d8f9110d inet: Allowing more than 64k connections and heavily optimize bind(0) time.
With a simple extension to the binding mechanism, which allows binding more
than 64k sockets (or a smaller number, depending on sysctl parameters),
we have to traverse the whole bind hash table to find an empty bucket.
While this is not a problem for, say, 32k connections, bind()
completion time grows exponentially (since after each successful binding
we have to traverse one more bucket to find an empty one), even if we start
each time from a random offset inside the hash table.

So, when the hash table is full and we want to add another socket, we have
to traverse the whole table no matter what, so effectively this is
the worst-case performance, and it is constant.
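
The scan cost described above can be illustrated with a small userspace
model. This is only a sketch, not the kernel code: the flat `used` array and
the helper name `probes_until_free` are assumptions made for illustration.
It counts how many slots are examined when scanning from a given start
offset for a free slot, wrapping around at the end of the table.

```c
#include <stddef.h>

/*
 * Hypothetical model: used[i] != 0 means slot i is occupied.
 * Scans at most n slots starting at `start`, wrapping around.
 * Returns the number of slots examined; *found receives the index
 * of the first free slot, or -1 if every slot is occupied.
 */
static size_t probes_until_free(const unsigned char *used, size_t n,
                                size_t start, long *found)
{
    size_t probes = 0;

    *found = -1;
    for (size_t i = 0; i < n; i++) {
        size_t slot = (start + i) % n;

        probes++;
        if (!used[slot]) {
            *found = (long)slot;
            break;
        }
    }
    return probes;
}
```

With a full table the loop always examines all n slots, which is the
constant worst case mentioned above.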

The attached picture shows bind() time as a function of the number of
already bound sockets.

The green area corresponds to the usual process of binding to port zero,
which turns on kernel port selection as described above. The red area is the
bind process when the number of reuse-bound sockets is not limited by 64k (or
sysctl parameters). It shows the same exponential growth (hidden by the green
area) before the number of ports reaches the sysctl limit.

At this point the bind hash table has exactly one reuse-enabled socket in
each bucket, but it is possible that they have different addresses. The
kernel actually selects the first port to try at random, so at the beginning
bind will take roughly constant time, but over time the number of ports to
check after the random start will increase. That growth is exponential,
but because of the random selection above, not every port selection
will necessarily take longer than the previous one. So we have to consider
the area below the curve in the graph (if you could zoom in, you would find
many different times plotted there), which means one area can hide another.

The blue area corresponds to the port selection optimization.

This is a rather simple design approach: the hash table now maintains an
(imprecise and racily updated) count of currently bound sockets, and when
that count becomes greater than a predefined value (I use the maximum
port range defined by sysctls), we stop traversing the whole bind hash
table and just stop at the first matching bucket after the random start. This
limit roughly corresponds to the case when the bind hash table is full and
we have turned on the mechanism that allows binding more reuse-enabled
sockets, so it does not change the behaviour of other sockets.
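
The early exit can be sketched as a simplified userspace model. It is not
the patch itself: the struct `bind_table`, the helpers `pick_port` and
`bind_port`, and the tiny `MODEL_PORTS` range are all hypothetical names
chosen for illustration, and locking, addresses and network namespaces are
left out.

```c
#include <stddef.h>

#define MODEL_PORTS 8  /* hypothetical tiny port range for illustration */

struct bind_table {
    int owners[MODEL_PORTS];    /* sockets bound to each port; 0 = free */
    int fastreuse[MODEL_PORTS]; /* 1 if every owner asked for reuse */
    int bsockets;               /* approximate count of bound sockets */
};

/*
 * Choose a port, starting the scan at `start`. A free port wins at once.
 * Otherwise, for a reuse-enabled socket, remember the reusable port with
 * the fewest owners; once bsockets exceeds the port range, return the
 * first reusable port found instead of scanning the whole table.
 * Returns the port index or -1; *probes receives the slots examined.
 */
static int pick_port(const struct bind_table *t, int start, int reuse,
                     int *probes)
{
    int smallest_port = -1, smallest_size = -1;
    int rover = start % MODEL_PORTS;

    *probes = 0;
    for (int remaining = MODEL_PORTS; remaining > 0; remaining--) {
        (*probes)++;
        if (t->owners[rover] == 0)
            return rover;
        if (reuse && t->fastreuse[rover] &&
            (smallest_size == -1 || t->owners[rover] < smallest_size)) {
            smallest_size = t->owners[rover];
            smallest_port = rover;
            if (t->bsockets > MODEL_PORTS)
                return smallest_port; /* table is over-full: stop early */
        }
        rover = (rover + 1) % MODEL_PORTS;
    }
    return smallest_port;
}

/* Record a socket on `port` and update the approximate counter. */
static void bind_port(struct bind_table *t, int port, int reuse)
{
    if (t->owners[port] == 0)
        t->fastreuse[port] = reuse;
    else if (!reuse)
        t->fastreuse[port] = 0;
    t->owners[port]++;
    t->bsockets++;
}
```

In this model a socket that does not request reuse still scans the whole
table and fails when no port is free, which mirrors the statement that the
behaviour of other sockets is unchanged.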

Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
Tested-by: Denys Fedoryschenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-21 14:34:31 -08:00
9p net/9p: fid->fid is used uninitialized 2009-01-19 16:20:15 -08:00
802
8021q vlan: add neigh_setup 2009-01-08 10:50:20 -08:00
appletalk appletalk: remove unneeded stubs 2009-01-21 14:02:18 -08:00
atm lec: convert to net_device_ops 2009-01-21 14:02:00 -08:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2008-12-28 12:49:40 -08:00
bluetooth bluetooth: driver API update 2009-01-07 17:23:17 -08:00
bridge netfilter 05/09: ebtables: fix inversion in match code 2009-01-12 21:18:35 -08:00
can can: fix slowpath issue in hrtimer callback function 2009-01-14 21:06:55 -08:00
core gro: Fix merging of paged packets 2009-01-20 14:44:03 -08:00
dcb DCB: fix kfree(skb) 2009-01-04 17:29:21 -08:00
dccp dccp: Debugging functions for feature negotiation 2009-01-21 14:34:05 -08:00
decnet
dsa dsa: convert to net_device_ops (v2) 2009-01-06 16:45:26 -08:00
econet
ethernet
ipv4 inet: Allowing more than 64k connections and heavily optimize bind(0) time. 2009-01-21 14:34:31 -08:00
ipv6 gro: Fix handling of complete checksums in IPv6 2009-01-20 14:44:01 -08:00
ipx
irda tty: Fix an ircomm warning and note another bug 2009-01-02 10:19:43 -08:00
iucv s390: remove s390_root_dev_*() 2009-01-06 10:44:34 -08:00
key
lapb
llc
mac80211 mac80211: more kernel-doc fixes 2009-01-16 17:08:23 -05:00
netfilter netfilter: ctnetlink: fix scheduling while atomic 2009-01-21 12:19:49 -08:00
netlabel netlabel: Update kernel configuration API 2008-12-31 12:54:11 -05:00
netlink genetlink: export genl_unregister_mc_group() 2009-01-07 10:00:17 -08:00
netrom netrom: convert to net_device_ops 2009-01-21 14:02:02 -08:00
packet
phonet phonet: update to net_device_ops 2009-01-07 17:24:34 -08:00
rfkill net/rfkill/rfkill.c: fix unused rfkill_led_trigger() warning 2009-01-04 17:11:24 -08:00
rose rose: convert to network_device_ops 2009-01-21 14:02:04 -08:00
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2008-12-28 12:49:40 -08:00
sched pkt_sched: sch_htb: Break all htb_do_events() after 2 jiffies 2009-01-12 21:54:40 -08:00
sctp fix similar typos to successfull 2009-01-08 08:31:15 -08:00
sunrpc SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
tipc net/tipc/bcast.h: use ARRAY_SIZE 2009-01-11 00:06:33 -08:00
unix introduce new LSM hooks where vfsmount is available. 2008-12-31 18:07:37 -05:00
wanrouter
wimax wimax: testing for rfkill support should also test for CONFIG_RFKILL_MODULE 2009-01-08 11:08:01 -08:00
wireless cfg80211: Fix parsed country IE info for 5 GHz 2009-01-16 17:08:24 -05:00
x25
xfrm Revert "xfrm: For 32/64 compatability wrt. xfrm_usersa_info" 2009-01-20 09:49:51 -08:00
compat.c
Kconfig wimax: Makefile, Kconfig and docbook linkage for the stack 2009-01-07 10:00:17 -08:00
Makefile wimax: Makefile, Kconfig and docbook linkage for the stack 2009-01-07 10:00:17 -08:00
nonet.c
socket.c [CVE-2009-0029] System call wrappers part 22 2009-01-14 14:15:27 +01:00
sysctl_net.c
TUNABLE