linux/kernel
Eric Dumazet 5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
..
debug kernel/debug: Make use of KGDB_REASON_NMI 2012-07-31 08:16:43 -05:00
events perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled 2012-09-04 17:29:53 +02:00
gcov gcov: disable CONSTRUCTORS for UML 2011-07-26 16:49:45 -07:00
irq Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-08-03 10:56:44 -07:00
power Revert "NMI watchdog: fix for lockup detector breakage on resume" 2012-08-08 20:49:45 +02:00
sched sched: Fix kernel-doc warnings in kernel/sched/fair.c 2012-09-04 14:30:49 +02:00
time sched: Add missing call to calc_load_exit_idle() 2012-09-04 14:30:29 +02:00
trace tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1 2012-08-17 15:19:46 -04:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage 2012-03-23 13:18:57 +01:00
Kconfig.preempt locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage 2012-03-23 13:18:57 +01:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-06-01 10:34:35 -07:00
acct.c Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-01-08 12:19:57 -08:00
async.c [SCSI] async: make async_synchronize_full() flush all work regardless of domain 2012-07-20 09:07:37 +01:00
audit.c netlink: Rename pid to portid to avoid confusion 2012-09-10 15:30:41 -04:00
audit.h audit: remove AUDIT_SETUP_CONTEXT as it isn't used 2012-01-17 16:16:57 -05:00
audit_tree.c audit: clean up refcounting in audit-tree 2012-08-15 12:55:22 +02:00
audit_watch.c get rid of kern_path_parent() 2012-07-14 16:35:02 +04:00
auditfilter.c audit: allow interfield comparison in audit rules 2012-01-17 16:17:01 -05:00
auditsc.c seccomp: remove duplicated failure logging 2012-04-14 11:13:20 +10:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c userns: Teach inode_capable to understand inodes whose uids map to other namespaces. 2012-05-15 14:59:24 -07:00
cgroup.c Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2012-07-24 17:47:44 -07:00
cgroup_freezer.c cgroup: convert all non-memcg controllers to the new cftype interface 2012-04-01 12:09:55 -07:00
compat.c new helper: sigsuspend() 2012-05-21 23:52:30 -04:00
configs.c kernel/configs.c: include MODULE_*() when CONFIG_IKCONFIG_PROC=n 2011-07-25 20:57:15 -07:00
cpu.c mm/hotplug: correctly setup fallback zonelists when creating new pgdat 2012-07-31 18:42:44 -07:00
cpu_pm.c kernel/cpu_pm.c: fix various typos 2012-05-31 17:49:27 -07:00
cpuset.c cpusets: Remove/update outdated comments 2012-07-24 13:53:28 +02:00
crash_dump.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
cred.c keys: kill task_struct->replacement_session_keyring 2012-05-23 22:11:41 -04:00
delayacct.c KVM: Steal time implementation 2011-07-14 12:59:14 +03:00
dma.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
elfcore.c elf coredump: add extended numbering support 2010-03-06 11:26:46 -08:00
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c net: use a per task frag allocator 2012-09-24 16:31:37 -04:00
extable.c extable: Skip sorting if sorted at build time. 2012-04-19 15:06:55 -07:00
fork.c net: use a per task frag allocator 2012-09-24 16:31:37 -04:00
freezer.c PM / Freezer: Remove references to TIF_FREEZE in comments 2012-03-04 23:08:54 +01:00
futex.c futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi() 2012-07-24 16:02:57 +02:00
futex_compat.c futex: Mark get_robust_list as deprecated 2012-03-29 11:37:17 +02:00
groups.c userns: Convert in_group_p and in_egroup_p to use kgid_t 2012-05-03 03:29:33 -07:00
hrtimer.c hrtimer: Update hrtimer base offsets each hrtimer_interrupt 2012-07-11 23:34:39 +02:00
hung_task.c hung task debugging: Inject NMI when hung and going to panic 2012-04-25 12:39:25 +02:00
irq_work.c irq_work: fix compile failure on tile from missing include 2012-04-13 13:15:16 -04:00
itimer.c itimer: Use printk_once instead of WARN_ONCE 2012-04-10 11:00:30 +02:00
jump_label.c static keys: Inline the static_key_enabled() function 2012-02-28 20:01:08 +01:00
kallsyms.c vsprintf: fix %ps on non symbols when using kallsyms 2012-05-29 16:22:32 -07:00
kcmp.c syscalls, x86: add __NR_kcmp syscall 2012-05-31 17:49:32 -07:00
kexec.c kdump: append newline to the last lien of vmcoreinfo note 2012-07-30 17:25:20 -07:00
kfifo.c [media] kernel:kfifo: export __kfifo_max_r symbol 2012-04-11 18:24:37 -03:00
kmod.c kmod: avoid deadlock from recursive kmod call 2012-07-30 17:25:20 -07:00
kprobes.c kprobes: return proper error code from register_kprobe() 2012-03-05 15:49:42 -08:00
ksysfs.c kernel: ksysfs.c is implicitly using stat.h 2011-10-31 09:20:13 -04:00
kthread.c kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed 2012-07-22 10:15:28 -07:00
latencytop.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
lglock.c brlocks/lglocks: turn into functions 2012-05-29 23:28:41 -04:00
lockdep.c lockdep: Add CPU-idle/offline warning to lockdep-RCU splat 2012-02-21 09:06:06 -08:00
lockdep_internals.h lockdep: No need to disable preemption in debug atomic ops 2010-05-04 05:38:16 +02:00
lockdep_proc.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
lockdep_states.h
module.c Guard check in module loader against integer overflow 2012-05-23 22:28:53 +09:30
mutex-debug.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c sched/rt: Use schedule_preempt_disabled() 2012-03-01 10:28:03 +01:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
nsproxy.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
padata.c padata: Fix cpu hotplug 2012-03-29 19:52:46 +08:00
panic.c panic: fix a possible deadlock in panic() 2012-07-30 17:25:13 -07:00
params.c params: replace printk(KERN_<LVL>...) with pr_<lvl>(...) 2012-05-04 17:28:18 -07:00
pid.c net ip6 flowlabel: Make owner a union of struct pid * and kuid_t 2012-08-14 21:49:25 -07:00
pid_namespace.c pidns: Export free_pid_ns 2012-08-14 21:49:35 -07:00
posix-cpu-timers.c [S390] cputime: add sparse checking and cleanup 2011-12-15 14:56:19 +01:00
posix-timers.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
printk.c printk: Fix calculation of length used to discard records 2012-08-12 21:25:50 +03:00
profile.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
ptrace.c userns: Convert ptrace, kill, set_priority permission checks to work with kuids and kgids 2012-05-03 03:28:51 -07:00
range.c range: fix bogus misuse of module.h to get printk() 2011-10-31 09:20:11 -04:00
rcu.h rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit() 2012-02-21 09:06:12 -08:00
rcupdate.c rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations 2012-07-02 12:34:23 -07:00
rcutiny.c rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU 2012-07-02 12:34:25 -07:00
rcutiny_plugin.h rcu: Fix code-style issues involving "else" 2012-07-06 06:01:48 -07:00
rcutorture.c rcu: Fix broken strings in RCU's source code. 2012-07-06 06:01:49 -07:00
rcutree.c rcu: Fix code-style issues involving "else" 2012-07-06 06:01:48 -07:00
rcutree.h Merge branches 'bigrtm.2012.07.04a', 'doctorture.2012.07.02a', 'fixes.2012.07.06a' and 'fnh.2012.07.02a' into HEAD 2012-07-06 05:59:30 -07:00
rcutree_plugin.h rcu: Fix code-style issues involving "else" 2012-07-06 06:01:48 -07:00
rcutree_trace.c rcu: Fix broken strings in RCU's source code. 2012-07-06 06:01:49 -07:00
relay.c splice: fix racy pipe->buffers uses 2012-06-13 21:16:42 +02:00
res_counter.c rescounters: add res_counter_uncharge_until() 2012-05-29 16:22:27 -07:00
resource.c resource: make sure requested range is included in the root range 2012-07-30 17:25:21 -07:00
rtmutex-debug.c lockdep, rtmutex, bug: Show taint flags on error 2011-12-06 08:16:49 +01:00
rtmutex-debug.h
rtmutex-tester.c rtmutex-tester: convert sysdev_class to a regular subsystem 2011-12-14 14:54:22 -08:00
rtmutex.c Revert "rcu: Permit rt_mutex_unlock() with irqs disabled" 2011-12-11 10:33:18 -08:00
rtmutex.h
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rwsem.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
seccomp.c seccomp: fix build warnings when there is no CONFIG_SECCOMP_FILTER 2012-04-18 12:24:52 +10:00
semaphore.c semaphore: fix improper comment reference to mutex 2012-04-05 17:15:55 -07:00
signal.c signal: make sure we don't get stopped with pending task_work 2012-07-22 23:57:54 +04:00
smp.c smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]() 2012-06-05 17:27:14 +02:00
smpboot.c smpboot, idle: Fix comment mismatch over idle_threads_init() 2012-05-24 22:58:08 +02:00
smpboot.h smpboot: Remove leftover declaration 2012-06-11 15:07:52 +02:00
softirq.c mm: allow PF_MEMALLOC from softirq context 2012-07-31 18:42:45 -07:00
spinlock.c locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage 2012-03-23 13:18:57 +01:00
srcu.c rcu: Implement per-domain single-threaded call_srcu() state machine 2012-04-30 10:48:25 -07:00
stacktrace.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
stop_machine.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
sys.c kernel/sys.c: avoid argv_free(NULL) 2012-07-30 17:25:13 -07:00
sys_ni.c syscalls, x86: add __NR_kcmp syscall 2012-05-31 17:49:32 -07:00
sysctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
sysctl_binary.c mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads 2012-07-31 18:42:40 -07:00
task_work.c task_work: add a scheduling point in task_work_run() 2012-08-21 09:11:44 -07:00
taskstats.c netlink: Rename pid to portid to avoid confusion 2012-09-10 15:30:41 -04:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c time: Remove bogus comments 2012-03-15 18:17:55 -07:00
timeconst.pl
timer.c alpha: take a bunch of syscalls into osf_sys.c 2012-08-19 08:41:19 -07:00
tracepoint.c static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() 2012-02-24 10:05:59 +01:00
tsacct.c [S390] cputime: add sparse checking and cleanup 2011-12-15 14:56:19 +01:00
uid16.c userns: Convert setting and getting uid and gid system calls to use kuid and kgid 2012-05-03 03:28:41 -07:00
up.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
user-return-notifier.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
user.c userns: Silence silly gcc warning. 2012-05-19 15:44:40 -06:00
user_namespace.c userns: Store uid and gid values in struct cred with kuid_t and kgid_t types 2012-05-03 03:28:38 -07:00
utsname.c userns: Use cred->user_ns instead of cred->user->user_ns 2012-04-07 16:55:51 -07:00
utsname_sysctl.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
wait.c lockdep/waitqueues: Add better annotation 2011-12-21 10:07:39 +01:00
watchdog.c Revert "NMI watchdog: fix for lockup detector breakage on resume" 2012-08-08 20:49:45 +02:00
workqueue.c workqueue: fix possible idle worker depletion across CPU hotplug 2012-09-10 10:05:54 -07:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00