linux/fs
Anton Blanchard b8af67e268 fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices
We are seeing a large regression in database performance on recent
kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
number of threads write to different regions of the file at the same time.

A simple test case is below.  I haven't defined DEVICE since getting it
wrong will destroy your data :) On an 3 disk LVM with a 64k chunk size we
see about 17MB/sec and only a few threads in IO wait:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  3      0 16170  656 2259  0  0 86 14  0
 0  2      0 16704  695 2408  0  0 92  8  0
 0  2      0 17308  744 2653  0  0 86 14  0
 0  2      0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (!ret)
                ret = err;
        mutex_unlock(&mapping->host->i_mutex);

commit 148f948ba8 (vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
some explanation of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

Thanks Jan for such a good commit message!  As well as dropping i_mutex,
Christoph suggests we should remove the call to sync_blockdev():

> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync

The patch below incorporates both suggestions. With it the testcase improves
from 17MB/s to 68M/sec:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  7      0 65536 1000 3878  0  0 70 30  0
 0 34      0 69632 1016 3921  0  1 46 53  0
 0 57      0 69632 1000 3921  0  0 55 45  0
 0 53      0 69640  754 4111  0  0 81 19  0

Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	b = malloc(BUFSIZE + 1024);
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;
		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-24 11:31:26 -07:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs 2010-04-05 13:42:54 -07:00
adfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
affs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
afs AFS: Don't pass error value to page_cache_release() in error handling 2010-04-21 12:27:43 -07:00
autofs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
autofs4 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
befs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
bfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
btrfs Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2010-04-12 18:37:04 -07:00
cachefiles include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2010-04-14 18:45:31 -07:00
cifs Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 2010-04-08 11:58:14 -07:00
coda include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
configfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
cramfs
debugfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
devpts include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
dlm include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ecryptfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 2010-04-19 14:20:32 -07:00
efs
exofs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
exportfs nfs: new subdir Documentation/filesystems/nfs 2009-10-27 19:34:04 -04:00
ext2 ext2: symlink must be handled via filesystem specific operation 2010-04-12 21:11:25 +02:00
ext3 ext3: symlink must be handled via filesystem specific operation 2010-04-12 21:11:39 +02:00
ext4 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fat Merge branch 'master' into export-slabh 2010-04-05 11:37:28 +09:00
freevxfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fscache fs-cache: order the debugfs stats correctly 2010-04-07 08:38:05 -07:00
fuse include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
gfs2 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hfsplus include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hostfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hpfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hppfs hppfs can use existing proc_mnt, no need for do_kern_mount() in there 2010-03-03 14:08:00 -05:00
hugetlbfs Untangling ima mess, part 1: alloc_file() 2009-12-16 12:16:47 -05:00
isofs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
jbd include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
jbd2 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
jffs2 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
jfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 2010-04-21 12:30:07 -07:00
lockd include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
logfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs 2010-04-21 12:31:12 -07:00
minix include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ncpfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
nfs NFSv4: fix delegated locking 2010-04-12 07:55:15 -04:00
nfs_common include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
nfsd include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
nilfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 2010-04-12 18:34:25 -07:00
nls Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2009-09-30 09:31:14 -07:00
notify include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ntfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ocfs2 include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
omfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
openpromfs
partitions include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
proc pagemap: fix pfn calculation for hugepage 2010-04-07 08:38:04 -07:00
qnx4 fs/qnx4: decrement sizeof size in strncmp 2010-02-04 11:55:46 +01:00
quota quota: Convert __DQUOT_PARANOIA symbol to standard config option 2010-04-20 18:25:25 +02:00
ramfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
reiserfs reiserfs: fix corruption during shrinking of xattrs 2010-04-24 11:31:24 -07:00
romfs fix leak in romfs_fill_super() 2010-01-26 22:22:26 -05:00
smbfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
squashfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
sysfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
sysv pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
ubifs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
udf udf: add speciffic ->setattr callback 2010-04-08 15:35:20 +02:00
ufs ufs: make solaris fsck happy 2010-03-12 15:52:35 -08:00
xfs xfs: don't warn on EAGAIN in inode reclaim 2010-04-16 13:51:44 -05:00
Kconfig Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2010-03-19 09:43:06 -07:00
Kconfig.binfmt
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2010-03-19 09:43:06 -07:00
aio.c aio: remove unused field 2009-12-16 07:20:13 -08:00
anon_inodes.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
attr.c fs: use rlimit helpers 2010-03-06 11:26:29 -08:00
bad_inode.c
binfmt_aout.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_elf.c coredump: pass mm->flags as a coredump parameter for consistency 2010-03-06 11:26:46 -08:00
binfmt_elf_fdpic.c FDPIC: For-loop in elf_core_vma_data_size() is incorrect 2010-03-24 16:43:29 -07:00
binfmt_em86.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_flat.c uclinux: error message when FLAT reloc symbol is invalid, v2 2010-04-21 13:28:49 +10:00
binfmt_misc.c
binfmt_script.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_som.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
bio-integrity.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
bio.c Merge branch 'master' into for-linus 2010-03-19 08:05:10 +01:00
block_dev.c fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices 2010-04-24 11:31:26 -07:00
buffer.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-03-12 16:04:50 -08:00
char_dev.c
compat.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
compat_binfmt_elf.c elf coredump: replace ELF_CORE_EXTRA_* macros by functions 2010-03-06 11:26:45 -08:00
compat_ioctl.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
dcache.c fix race in d_splice_alias() 2010-03-03 14:13:08 -05:00
dcookies.c
direct-io.c dio: fix use-after-free 2009-12-17 04:52:13 -05:00
drop_caches.c
eventfd.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
eventpoll.c anonfd: Allow making anon files read-only 2009-12-22 12:27:34 -05:00
exec.c coredump: suppress uid comparison test if core output files are pipes 2010-03-06 11:26:46 -08:00
fcntl.c fs: use rlimit helpers 2010-03-06 11:26:29 -08:00
fifo.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
file.c fs: use rlimit helpers 2010-03-06 11:26:29 -08:00
file_table.c vfs: take f_lock on modifying f_mode after open time 2010-03-06 11:26:25 -08:00
filesystems.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fs-writeback.c Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block 2010-04-09 11:50:29 -07:00
fs_struct.c
generic_acl.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
inode.c dquot: move dquot initialization responsibility into the filesystem 2010-03-05 00:20:30 +01:00
internal.h Take vfsmount_lock to fs/internal.h 2010-03-03 14:07:59 -05:00
ioctl.c Cleanup generic block based fiemap 2010-04-23 10:39:48 -07:00
ioprio.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
libfs.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
locks.c Merge branch 'for-next' into for-linus 2010-03-08 16:55:37 +01:00
mbcache.c
mpage.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
namei.c Restore LOOKUP_DIRECTORY hint handling in final lookup on open() 2010-03-26 12:41:05 -04:00
namespace.c vfs: add NOFOLLOW flag to umount(2) 2010-03-03 14:08:00 -05:00
nfsctl.c Switch may_open() and break_lease() to passing O_... 2010-03-03 13:00:21 -05:00
no-block.c
open.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pipe.c fs: no games with DCACHE_UNHASHED 2009-12-17 10:51:40 -05:00
pnode.c Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source() 2010-03-03 13:00:22 -05:00
pnode.h VFS: Clean up shared mount flag propagation 2010-03-03 14:07:55 -05:00
posix_acl.c
read_write.c do_sync_read/write() should set kiocb.ki_nbytes to be consistent 2010-03-24 16:43:29 -07:00
read_write.h
readdir.c
select.c Add generic sys_old_select() 2010-03-12 15:52:32 -08:00
seq_file.c seq_file: fix new kernel-doc warnings 2010-03-07 15:48:26 -08:00
signalfd.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
splice.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
stack.c VFS/fsstack: handle 32-bit smp + preempt + large files in fsstack_copy_inode_size 2009-12-17 10:58:17 -05:00
stat.c Add unlocked version of inode_add_bytes() function 2009-12-23 13:33:54 +01:00
super.c Mirror MS_KERNMOUNT in ->mnt_flags 2010-03-03 14:08:00 -05:00
sync.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
timerfd.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
utimes.c
xattr.c sanitize xattr handler prototypes 2009-12-16 12:16:49 -05:00
xattr_acl.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00