linux/net/sunrpc
Greg Banks 59a252ff8c knfsd: avoid overloading the CPU scheduler with enormous load averages
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem timeout at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:41 -04:00
..
auth_gss rpc: add service field to new upcall 2008-12-23 16:19:56 -05:00
xprtrdma rpc/rdma: goto instead of copypaste 2008-12-14 23:19:48 -08:00
auth_generic.c SUNRPC: Fix a performance regression in the RPC authentication code 2008-11-20 13:17:40 -08:00
auth_null.c NFSv4: Don't use cred->cr_ops->cr_name in nfs4_proc_setclientid() 2008-04-19 16:54:53 -04:00
auth_unix.c SUNRPC: Use GFP_NOFS when allocating credentials 2008-07-09 12:08:48 -04:00
auth.c Merge branch 'devel' into next 2008-12-30 16:51:43 -05:00
cache.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
clnt.c Merge branch 'devel' into next 2008-12-30 16:51:43 -05:00
Kconfig sunrpc: fix rdma dependencies 2009-02-03 15:20:13 -08:00
Makefile SUNRPC: Add a generic RPC credential 2008-03-14 13:42:38 -04:00
rpc_pipe.c zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
rpcb_clnt.c net: replace NIPQUAD() in net/*/ 2008-10-31 00:54:56 -07:00
sched.c SUNRPC: Tighten up the task locking rules in __rpc_execute() 2009-03-10 20:33:16 -04:00
socklib.c
stats.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
sunrpc_syms.c SUNRPC: Move exported symbol definitions after function declaration part 2 2008-02-01 17:01:24 -05:00
svc_xprt.c knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-03-18 17:38:41 -04:00
svc.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
svcauth_unix.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
svcauth.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
svcsock.c SUNRPC: The sunrpc server code should not be used by out-of-tree modules 2009-01-07 17:18:42 -05:00
sysctl.c sunrpc: fix possible overrun on read of /proc/sys/sunrpc/transports 2008-09-01 14:24:24 -04:00
timer.c SUNRPC: add EXPORT_SYMBOL_GPL for generic transport functions 2007-10-09 17:17:36 -04:00
xdr.c SUNRPC: Convert the xdr helpers and rpc_pipefs to EXPORT_SYMBOL_GPL 2008-12-23 15:21:31 -05:00
xprt.c SUNRPC: xprt_connect() don't abort the task if the transport isn't bound 2009-03-11 14:09:39 -04:00
xprtsock.c SUNRPC: xprt_connect() don't abort the task if the transport isn't bound 2009-03-11 14:09:39 -04:00