12756 Commits

Author SHA1 Message Date
Ingo Molnar 72c26c9a26 Merge branch 'linus' into tracing/blktrace
Conflicts:
	block/blktrace.c

Semantic merge:
	kernel/trace/blktrace.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 09:00:35 +01:00
Linus Torvalds ba95fd47d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
  block: fix booting from partitioned md array
  block: revert part of 18ce3751cc
  cciss: PCI power management reset for kexec
  paride/pg.c: xs(): &&/|| confusion
  fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
  block: fix bad definition of BIO_RW_SYNC
  bsg: Fix sense buffer bug in SG_IO
2009-02-18 18:33:04 -08:00
Ingo Molnar f04b30de3c inotify: fix GFP_KERNEL related deadlock
Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep
assert:

[ 1093.677775]
[ 1093.677781] =================================
[ 1093.680031] [ INFO: inconsistent lock state ]
[ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] ---------------------------------
[ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1093.680031]  (&inode->inotify_mutex){+.+.?.}, at: [<c0205942>] inotify_inode_is_dead+0x20/0x80
[ 1093.680031] {RECLAIM_FS-ON-W} state was registered at:
[ 1093.680031]   [<c01696b9>] mark_held_locks+0x43/0x5b
[ 1093.680031]   [<c016baa4>] lockdep_trace_alloc+0x6c/0x6e
[ 1093.680031]   [<c01cf8b0>] kmem_cache_alloc+0x20/0x150
[ 1093.680031]   [<c040d0ec>] idr_pre_get+0x27/0x6c
[ 1093.680031]   [<c02056e3>] inotify_handle_get_wd+0x25/0xad
[ 1093.680031]   [<c0205f43>] inotify_add_watch+0x7a/0x129
[ 1093.680031]   [<c020679e>] sys_inotify_add_watch+0x20f/0x250
[ 1093.680031]   [<c010389e>] sysenter_do_call+0x12/0x35
[ 1093.680031]   [<ffffffff>] 0xffffffff
[ 1093.680031] irq event stamp: 60417
[ 1093.680031] hardirqs last  enabled at (60417): [<c018d5f5>] call_rcu+0x53/0x59
[ 1093.680031] hardirqs last disabled at (60416): [<c018d5b9>] call_rcu+0x17/0x59
[ 1093.680031] softirqs last  enabled at (59656): [<c0146229>] __do_softirq+0x157/0x16b
[ 1093.680031] softirqs last disabled at (59651): [<c0106293>] do_softirq+0x74/0x15d
[ 1093.680031]
[ 1093.680031] other info that might help us debug this:
[ 1093.680031] 2 locks held by kswapd0/308:
[ 1093.680031]  #0:  (shrinker_rwsem){++++..}, at: [<c01b0502>] shrink_slab+0x36/0x189
[ 1093.680031]  #1:  (&type->s_umount_key#4){+++++.}, at: [<c01e6d77>] shrink_dcache_memory+0x110/0x1fb
[ 1093.680031]
[ 1093.680031] stack backtrace:
[ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] Call Trace:
[ 1093.680031]  [<c016947a>] valid_state+0x12a/0x13d
[ 1093.680031]  [<c016954e>] mark_lock+0xc1/0x1e9
[ 1093.680031]  [<c016a5b4>] ? check_usage_forwards+0x0/0x3f
[ 1093.680031]  [<c016ab74>] __lock_acquire+0x2c6/0xac8
[ 1093.680031]  [<c01688d9>] ? register_lock_class+0x17/0x228
[ 1093.680031]  [<c016b3d3>] lock_acquire+0x5d/0x7a
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c08824c4>] __mutex_lock_common+0x3a/0x4cb
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c08829ed>] mutex_lock_nested+0x2e/0x36
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c0205942>] inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c01e6672>] dentry_iput+0x90/0xc2
[ 1093.680031]  [<c01e67a3>] d_kill+0x21/0x45
[ 1093.680031]  [<c01e6a46>] __shrink_dcache_sb+0x27f/0x355
[ 1093.680031]  [<c01e6dc5>] shrink_dcache_memory+0x15e/0x1fb
[ 1093.680031]  [<c01b05ed>] shrink_slab+0x121/0x189
[ 1093.680031]  [<c01b0d12>] kswapd+0x39f/0x561
[ 1093.680031]  [<c01ae499>] ? isolate_pages_global+0x0/0x233
[ 1093.680031]  [<c0157eae>] ? autoremove_wake_function+0x0/0x43
[ 1093.680031]  [<c01b0973>] ? kswapd+0x0/0x561
[ 1093.680031]  [<c0157daf>] kthread+0x41/0x82
[ 1093.680031]  [<c0157d6e>] ? kthread+0x0/0x82
[ 1093.680031]  [<c01043ab>] kernel_thread_helper+0x7/0x10

inotify_handle_get_wd() does idr_pre_get() which does a
kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under
extreme MM pressure.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: MinChan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:56 -08:00
Bill Nottingham 2db69a9340 vt: Declare PIO_CMAP/GIO_CMAP as compatible ioctls.
Otherwise, these don't work when called from 32-bit userspace on 64-bit
kernels.

Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:56 -08:00
Peter Zijlstra ada723dcd6 fs/super.c: add lockdep annotation to s_umount
Li Zefan said:

Thread 1:
  for ((; ;))
  {
      mount -t cpuset xxx /mnt > /dev/null 2>&1
      cat /mnt/cpus > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

Thread 2:
  for ((; ;))
  {
      mount -t cpuset xxx /mnt > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

(Note: It is irrelevant which cgroup subsys is used.)

After a while a lockdep warning showed up:

=============================================
[ INFO: possible recursive locking detected ]
2.6.28 #479
---------------------------------------------
mount/13554 is trying to acquire lock:
 (&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321

but task is already holding lock:
 (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321

other info that might help us debug this:
1 lock held by mount/13554:
 #0:  (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321

stack backtrace:
Pid: 13554, comm: mount Not tainted 2.6.28-mc #479
Call Trace:
 [<c044ad2e>] validate_chain+0x4c6/0xbbd
 [<c044ba9b>] __lock_acquire+0x676/0x700
 [<c044bb82>] lock_acquire+0x5d/0x7a
 [<c049d888>] ? sget+0x5e/0x321
 [<c061b9b8>] down_write+0x34/0x50
 [<c049d888>] ? sget+0x5e/0x321
 [<c049d888>] sget+0x5e/0x321
 [<c045a2e7>] ? cgroup_set_super+0x0/0x3e
 [<c045959f>] ? cgroup_test_super+0x0/0x2f
 [<c045bcea>] cgroup_get_sb+0x98/0x2e7
 [<c045cfb6>] cpuset_get_sb+0x4a/0x5f
 [<c049dfa4>] vfs_kern_mount+0x40/0x7b
 [<c049e02d>] do_kern_mount+0x37/0xbf
 [<c04af4a0>] do_mount+0x5c3/0x61a
 [<c04addd2>] ? copy_mount_options+0x2c/0x111
 [<c04af560>] sys_mount+0x69/0xa0
 [<c0403251>] sysenter_do_call+0x12/0x31

The cause is that, after alloc_super() and the subsequent retry, an old entry
in the list fs_supers is found, so grab_super(old) is called, but both
functions take the s_umount lock:

struct super_block *sget(...)
{
	...
retry:
	spin_lock(&sb_lock);
	if (test) {
		list_for_each_entry(old, &type->fs_supers, s_instances) {
			if (!test(old, data))
				continue;
			if (!grab_super(old))  <--- 2nd: down_write(&old->s_umount);
				goto retry;
			if (s)
				destroy_super(s);
			return old;
		}
	}
	if (!s) {
		spin_unlock(&sb_lock);
		s = alloc_super(type);   <--- 1st: down_write(&s->s_umount)
		if (!s)
			return ERR_PTR(-ENOMEM);
		goto retry;
	}
	...
}

It seems like a false positive, and it seems that the VFS, not cgroup, needs
to be fixed.

Peter said:

We can simply put the new s_umount instance in a different subclass; lockdep
doesn't particularly care about subclass order.

If there's any issue with the callers of sget() assuming the s_umount lock
being of subclass 0, then there is another annotation we can use to fix
that, but let's not bother with that if this is sufficient.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:55 -08:00
Nick Piggin 1cf6e7d83b mm: task dirty accounting fix
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

Additionally, there is some inconsistency about when task_dirty_inc is
called.  It is used for dirty balancing, however it even gets called for
__set_page_dirty_no_writeback.

So rather than increment it in a set_page_dirty wrapper, move it down to
exactly where the dirty page accounting stats are incremented.

Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:54 -08:00
Davide Libenzi 610d18f412 timerfd: add flags check
As requested by Michael, add a missing check for valid flags in
timerfd_settime(), and make it return EINVAL in case some extra bits are
set.
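The check described here amounts to masking off the known flag bits and rejecting anything left over. A minimal user-space sketch of that idea follows; the names `SKETCH_TIMER_ABSTIME`, `SKETCH_VALID_FLAGS` and `sketch_check_flags()` are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names: one known flag bit and the mask of all valid bits. */
#define SKETCH_TIMER_ABSTIME (1 << 0)
#define SKETCH_VALID_FLAGS   (SKETCH_TIMER_ABSTIME)

/* Return 0 if only known bits are set, -EINVAL if any extra bit is set. */
static int sketch_check_flags(int flags)
{
	if (flags & ~SKETCH_VALID_FLAGS)
		return -EINVAL;
	return 0;
}

/* Usage: known bits pass, an unknown bit is rejected. */
static void sketch_flags_demo(void)
{
	assert(sketch_check_flags(0) == 0);
	assert(sketch_check_flags(SKETCH_TIMER_ABSTIME) == 0);
	assert(sketch_check_flags(SKETCH_TIMER_ABSTIME | (1 << 5)) == -EINVAL);
}
```

Rejecting unknown bits up front is what lets user space probe for flag support later: a new flag fails with EINVAL on an old kernel instead of being silently ignored.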

Michael said:
If this is to be any use to userland apps that want to check flag
support (perhaps it is too late already), then the sooner we get it
into the kernel the better: 2.6.29 would be good; earlier stables as
well would be even better.

[akpm@linux-foundation.org: remove unused TFD_FLAGS_SET]
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>		[2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:53 -08:00
Eric Biederman 8f19d47293 seq_file: properly cope with pread
Currently seq_read assumes that the offset passed to it is always the
offset it passed to user space.  With pread this assumption is broken and
we do the wrong thing.

To solve this I introduce an offset cache inside of struct seq_file so we
know where our logical file position is.  Then in seq_read if we try to
read from another offset we reset our data structures and attempt to go to
the offset user space wanted.
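The offset-cache idea can be illustrated outside the kernel. The sketch below is a toy model, not the seq_file code: `struct sketch_seq`, its `read_pos` and `resets` fields, and `sketch_seq_read()` are all hypothetical names, and "resetting our data structures" is reduced to moving a cursor within a fixed string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a reader over generated text. */
struct sketch_seq {
	const char *data;     /* full logical file contents */
	size_t len;           /* length of data */
	size_t read_pos;      /* cached offset where the next sequential read starts */
	unsigned int resets;  /* how often the reader had to reposition */
};

/* Read up to size bytes at *ppos, repositioning if *ppos is unexpected. */
static size_t sketch_seq_read(struct sketch_seq *m, char *buf, size_t size,
			      size_t *ppos)
{
	size_t n;

	if (*ppos != m->read_pos) {
		/* pread-style access: not where the last read left off. */
		m->read_pos = *ppos;
		m->resets++;
	}
	if (m->read_pos >= m->len)
		return 0;
	n = m->len - m->read_pos;
	if (n > size)
		n = size;
	memcpy(buf, m->data + m->read_pos, n);
	m->read_pos += n;
	*ppos = m->read_pos;
	return n;
}

/* Usage: sequential reads keep the cache valid, a pread-style read resets it. */
static void sketch_seq_demo(void)
{
	struct sketch_seq m = { "abcdefgh", 8, 0, 0 };
	char buf[4];
	size_t pos = 0, off = 1;

	assert(sketch_seq_read(&m, buf, 3, &pos) == 3 && memcmp(buf, "abc", 3) == 0);
	assert(sketch_seq_read(&m, buf, 3, &pos) == 3 && memcmp(buf, "def", 3) == 0);
	assert(m.resets == 0);

	assert(sketch_seq_read(&m, buf, 2, &off) == 2 && memcmp(buf, "bc", 2) == 0);
	assert(m.resets == 1);
}
```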

[akpm@linux-foundation.org: restore FMODE_PWRITE]
[pjt@google.com: seq_open needs its fmode opened up to take advantage of this]
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Turner <pjt@google.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:53 -08:00
Jens Axboe 78f707bfc7 block: revert part of 18ce3751cc
The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has submitted other dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour).

---- test case ----

/* Assumed headers; ROWS is not defined here and must be supplied at
 * build time, e.g. -DROWS=<n>. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	int fdes, i;
	FILE *fp;
	struct timeval start;
	struct timeval end;
	struct timeval res;

	gettimeofday(&start, NULL);
	for (i = 0; i < ROWS; i++) {
		fp = fopen("test_file", "a");
		if (!fp)
			return 1;
		fprintf(fp, "Some Text Data\n");
		fdes = fileno(fp);
		fsync(fdes);
		fclose(fp);
	}
	gettimeofday(&end, NULL);

	timersub(&end, &start, &res);
	fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
			(long)(res.tv_sec*1000000 + res.tv_usec)/1000);

	return 0;
}

-------------------

Thanks to Sean.White@APCC.com for tracking down this performance
regression and providing a test case.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Subhash Peddamallu a60e78e57a fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
When freeing from the bio pool, use the right pointer to account for
bs->front_pad instead of the bio pointer.
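The underlying rule is that an object handed out at an offset inside an allocation must be freed through the allocation's base address, not through the object pointer. A minimal user-space sketch, assuming a plain malloc()-backed allocator in place of the mempool; `struct sketch_obj` and the `sketch_obj_*` helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_obj { int value; };

/*
 * Allocate front_pad bytes of caller-private space followed by the object,
 * and return a pointer to the object. front_pad is assumed to be a multiple
 * of the object's alignment.
 */
static struct sketch_obj *sketch_obj_alloc(size_t front_pad)
{
	char *base = malloc(front_pad + sizeof(struct sketch_obj));

	if (!base)
		return NULL;
	return (struct sketch_obj *)(base + front_pad);
}

/* Recover the address the allocator originally returned. */
static void *sketch_obj_base(struct sketch_obj *obj, size_t front_pad)
{
	return (char *)obj - front_pad;
}

/* Free through the base address; freeing obj itself would be a bug. */
static void sketch_obj_free(struct sketch_obj *obj, size_t front_pad)
{
	if (obj)
		free(sketch_obj_base(obj, front_pad));
}

/* Usage: the object sits exactly front_pad bytes past the base. */
static void sketch_obj_demo(void)
{
	const size_t pad = 16;
	struct sketch_obj *obj = sketch_obj_alloc(pad);

	assert(obj != NULL);
	obj->value = 42;
	assert((char *)obj - (char *)sketch_obj_base(obj, pad) == (ptrdiff_t)pad);
	assert(obj->value == 42);
	sketch_obj_free(obj, pad);
}
```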

Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Linus Torvalds 48c0d9ece3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
  Btrfs: make a lockdep class for the extent buffer locks
  Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
  Btrfs: remove unused code in split_state()
  Btrfs: remove btrfs_init_path
  Btrfs: balance_level checks !child after access
  Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
  Btrfs: don't clean old snapshots on sync(1)
  Btrfs: use larger metadata clusters in ssd mode
  Btrfs: process mount options on mount -o remount,
  Btrfs: make sure all pending extent operations are complete
2009-02-17 14:19:14 -08:00
Linus Torvalds 3512a79dbc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
  ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
  ext4: Initialize preallocation list_head's properly
  ext4: Fix lockdep warning
  ext4: Fix to read empty directory blocks correctly in 64k
  jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
  Revert "ext4: wait on all pending commits in ext4_sync_fs()"
  jbd2: Fix return value of jbd2_journal_start_commit()
2009-02-17 14:05:05 -08:00
Al Viro 1a88b5364b Fix incomplete __mntput locking
Getting this wrong caused

	WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()

due to optimistically checking cpu_writer->mnt outside the spinlock.

Here's what we really want:
 * we know that nobody will set cpu_writer->mnt to mnt from now on
 * all changes to that sucker are done under cpu_writer->lock
 * we want the laziest equivalent of
	spin_lock(&cpu_writer->lock);
	if (likely(cpu_writer->mnt != mnt)) {
		spin_unlock(&cpu_writer->lock);
		continue;
	}
	/* do stuff */
  that would make sure we won't miss earlier setting of ->mnt done by
  another CPU.

Anyway, for now we just move the spin_lock() earlier and move the test
into the properly locked region.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-17 14:02:08 -08:00
Dan Carpenter 090542641d ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
This was found through a code checker (http://repo.or.cz/w/smatch.git/). 
It looks like you might be able to trigger the error by trying to migrate 
a readonly file system.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 20:02:19 -05:00
Aneesh Kumar K.V 2acf2c261b ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks.  This is needed
so that we can make large contiguous block requests.  If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range.  That causes an attempt to take the page lock of a lower
index page while holding the page lock of a higher index page, which can
cause a deadlock with another writeback thread.

The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.

http://bugzilla.kernel.org/show_bug.cgi?id=12579

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:42:58 -05:00
Aneesh Kumar K.V d794bf8e09 ext4: Initialize preallocation list_head's properly
When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists.  Otherwise, with list debug enabled, we will get list
corruption warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:31:16 -05:00
Ingo Molnar 1c511f740f Merge branches 'tracing/ftrace', 'tracing/ring-buffer', 'tracing/sysprof', 'tracing/urgent' and 'linus' into tracing/core
2009-02-13 10:25:18 +01:00
Yan Zheng 2456242530 Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction.  This adds
it to all the places that were missing it.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 14:14:53 -05:00
Chris Mason 4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock ordering.

The lockdep magic came from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Julia Lawall 3f3420df50 Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.

The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@

(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
     when != if (...) { <+...x...+> }
x->f = E
...>
(
 return \(0\|<+...x...+>\|ptr\);
|
 return@p2 ...;
)

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:16:03 -05:00
Qinghuang Feng a48ddf08ba Btrfs: remove unused code in split_state()
These two lines are not used, remove them.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:25:23 -05:00
Jeff Mahoney e00f730865 Btrfs: remove btrfs_init_path
btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:11:25 -05:00
Jeff Mahoney 7951f3cefb Btrfs: balance_level checks !child after access
The BUG_ON() is in the wrong spot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:06:15 -05:00
Yan Zheng b335b0034e Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
btrfs_releasepage may call kmem_cache_alloc indirectly and pass it the
same GFP flags it received, so __GFP_HIGHMEM can end up being used with
the slab allocator.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 10:06:04 -05:00
Chris Mason e1df36d2f1 Btrfs: don't clean old snapshots on sync(1)
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.

This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view.  The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed.  A new ioctl will be added
for this purpose instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:45:08 -05:00
Chris Mason 536ac8ae86 Btrfs: use larger metadata clusters in ssd mode
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, larger metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:41:38 -05:00
Chris Mason b288052e17 Btrfs: process mount options on mount -o remount,
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:37:35 -05:00
Josef Bacik eb09967089 Btrfs: make sure all pending extent operations are complete
There's a slight problem with finish_current_insert: if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list and we have no extents to insert.  This makes sure that all of
the pending work is truly done before we return.  I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-12 09:27:38 -05:00
Carsten Otte 0e4a9b5928 ext2/xip: refuse to change xip flag during remount with busy inodes
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around.  According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:

commit 70688e4dd1
Author: Nick Piggin <npiggin@suse.de>
Date:   Mon Apr 28 02:13:02 2008 -0700

    xip: support non-struct page backed memory

In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x.  During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork).  Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.

Both parent and child process crash some time later due to inconsistencies
in their address spaces.  The issue seems to be timing sensitive; various
attempts to recreate it have failed.

This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated.  This patch keeps users from running into
that issue.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:36 -08:00
Jan Kara 02ac597c9b ext3: revert "ext3: wait on all pending commits in ext3_sync_fs"
This reverts commit c87591b719.

Since journal_start_commit() is now fixed to return 1 when we started a
transaction commit, there's some transaction waiting to be committed or
there's a transaction already committing, we don't need to call
ext3_force_commit() in ext3_sync_fs().  Furthermore, ext3_force_commit()
can unnecessarily create a sync transaction, which is expensive, so it's
worthwhile to remove it when we can.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Jan Kara 8fe4cd0dc5 jbd: fix return value of journal_start_commit()
journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit.  But it returns 0 if we
raced with somebody queueing the transaction commit as well.  This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.
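The intended return-value contract can be stated as a small toy model. This is not the jbd code: `struct sketch_journal`, its fields, and `sketch_start_commit()` are hypothetical, locking is omitted, and "queueing a commit" is reduced to recording a transaction ID. The point is only that the racing case, where somebody else has already queued the commit, still reports 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, lock-free model of the journal state relevant here. */
struct sketch_journal {
	bool has_running;            /* a transaction is open for updates */
	bool has_committing;         /* a transaction is being written out */
	unsigned int running_tid;
	unsigned int committing_tid;
	unsigned int commit_request; /* tid whose commit was requested, 0 = none */
};

/*
 * Return 1 whenever some transaction is committing or queued for commit,
 * storing its tid in *ptid when ptid is non-NULL. Return 0 only when the
 * journal is idle.
 */
static int sketch_start_commit(struct sketch_journal *j, unsigned int *ptid)
{
	if (j->has_running) {
		/* Queue the commit; harmless if somebody else already did. */
		j->commit_request = j->running_tid;
		if (ptid)
			*ptid = j->running_tid;
		return 1;
	}
	if (j->has_committing) {
		if (ptid)
			*ptid = j->committing_tid;
		return 1;
	}
	return 0;
}

/* Usage: a commit queued earlier by another caller must still yield 1. */
static void sketch_commit_demo(void)
{
	struct sketch_journal raced = { true, false, 7, 0, 7 };
	struct sketch_journal idle = { false, false, 0, 0, 0 };
	unsigned int tid = 0;

	assert(sketch_start_commit(&raced, &tid) == 1 && tid == 7);
	assert(sketch_start_commit(&idle, NULL) == 0);
}
```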

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mike Snitzer <snitzer@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Mel Gorman 5a6fe12595 Do not account for the address space used by hugetlbfs using VM_ACCOUNT
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.

Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommitting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occur; otherwise it is too easy for applications to be SIGKILLed.

As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.

With commit fc8744adc8, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, and can trigger an
OOM storm when hugepage pools are too small, or lockups and corrupted counters
otherwise. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-10 10:48:42 -08:00
Aneesh Kumar K.V ba4439165f ext4: Fix lockdep warning
We should not call ext4_mb_add_n_trim while holding alloc_semp.

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.29-rc4-git1-dirty #124
    ---------------------------------------------
    ffsb/3116 is trying to acquire lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

    but task is already holding lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

http://bugzilla.kernel.org/show_bug.cgi?id=12672

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 11:14:34 -05:00
Wei Yongjun 7be2baaa03 ext4: Fix to read empty directory blocks correctly in 64k
The rec_len field in the directory entry is 16 bits, so there was a
problem representing rec_len for filesystems with a 64k block size in
the case where the directory entry takes the entire 64k block.
Unfortunately, there were two schemes that were proposed; one where
all zeros meant 65536 and one where all ones (65535) meant 65536.
E2fsprogs used 0, whereas the kernel used 65535.  Oops.  Fortunately
this case happens extremely rarely, with the most common case being
the lost+found directory, created by mke2fs.

So we will be liberal in what we accept, and accept both encodings,
but we will continue to encode 65536 as 65535.  This will require a
change in e2fsprogs, but fortunately ext4 filesystems normally
have the dir_index feature enabled, which precludes having a
completely empty directory block.
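The "liberal in what we accept" rule can be written down as a pair of conversion helpers. This is a sketch under stated assumptions: the names `SKETCH_MAX_REC_LEN`, `sketch_rec_len_from_disk()` and `sketch_rec_len_to_disk()` are hypothetical, and byte-order conversion of the on-disk field is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_REC_LEN 65535u   /* all ones in the 16-bit field */

/* Decode: accept both 0 and 65535 as meaning a full 64k block. */
static uint32_t sketch_rec_len_from_disk(uint16_t dlen)
{
	if (dlen == 0 || dlen == SKETCH_MAX_REC_LEN)
		return 65536u;
	return dlen;
}

/* Encode: always store a 64k record length as 65535. */
static uint16_t sketch_rec_len_to_disk(uint32_t len)
{
	assert(len <= 65536u);
	if (len == 65536u)
		return SKETCH_MAX_REC_LEN;
	return (uint16_t)len;
}

/* Usage: both historical encodings decode to 65536, and it re-encodes as 65535. */
static void sketch_rec_len_demo(void)
{
	assert(sketch_rec_len_from_disk(0) == 65536u);
	assert(sketch_rec_len_from_disk(65535u) == 65536u);
	assert(sketch_rec_len_to_disk(65536u) == 65535u);
	assert(sketch_rec_len_from_disk(sketch_rec_len_to_disk(12)) == 12);
}
```

Treating 65535 as an alias for 65536 loses nothing in this model on the assumption that real record lengths are multiples of four, so 65535 never appears as a genuine length.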

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 09:53:42 -05:00
Jan Kara 7f5aa21508 jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing out the
suspicious code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
2009-02-10 11:15:34 -05:00
Ingo Molnar f9915bfef3 Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/core
2009-02-10 13:25:42 +01:00
Jan Kara 9eddacf9e9 Revert "ext4: wait on all pending commits in ext4_sync_fs()"
This undoes commit 14ce0cb411.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore,
ext4_force_commit() can unnecessarily create a sync transaction, which is
expensive, so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
2009-02-10 06:46:05 -05:00
Jan Kara c88ccea314 jbd2: Fix return value of jbd2_journal_start_commit()
The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones): 

   In the case of a data=ordered umount with pending long symlinks
   which are delayed due to a long list of other I/O on the backing
   block device, this causes the buffer associated with the long
   symlinks to not be moved to the inode dirty list in the second
   phase of fsync_super.  Then, before they can be dirtied again,
   kjournald exits, seeing the UMOUNT flag and the dirty pages are
   never written to the backing block device, causing long symlink
   corruption and exposing new or previously freed block data to
   userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
                /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.
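
The intended return-value contract can be sketched with a small user-space
model.  This is not the kernel code: struct journal_model and
start_commit_model() are hypothetical stand-ins, and all locking is
omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for the journal state. */
struct journal_model {
	bool has_running;     /* a transaction is open and not yet committing */
	bool has_committing;  /* a transaction is currently being committed */
	bool commit_queued;   /* a commit of the running transaction was requested */
	unsigned int running_tid;
	unsigned int committing_tid;
};

/*
 * Returns 1 whenever some transaction is committing or queued for commit,
 * storing its ID in *ptid when ptid is non-NULL; returns 0 otherwise.
 */
int start_commit_model(struct journal_model *j, unsigned int *ptid)
{
	assert(j != NULL);
	if (j->has_running) {
		/* Someone else may have queued the commit first; still report 1. */
		j->commit_queued = true;
		if (ptid)
			*ptid = j->running_tid;
		return 1;
	}
	if (j->has_committing) {
		if (ptid)
			*ptid = j->committing_tid;
		return 1;
	}
	return 0;
}
```

In this model the result does not depend on who queued the commit, only on
whether there is something for the caller to wait on.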

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
2009-02-10 11:27:46 -05:00
Linus Torvalds 4c098bcd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: don't use spin_is_contended
2009-02-09 14:00:16 -08:00
Chris Mason 284b066af4 Btrfs: don't use spin_is_contended
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot.  The idea was to avoid
expensive searches in the tree unless the lock was actually contended.

But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.

In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-09 16:22:03 -05:00
Linus Torvalds 896abeb743 Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux:
  lockd: fix regression in lockd's handling of blocked locks
2009-02-09 10:30:19 -08:00
J. Bruce Fields 9d9b87c121 lockd: fix regression in lockd's handling of blocked locks
If a client requests a blocking lock, is denied, then requests it again,
then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP
set, because we've already queued a block and don't need the locks code
to do it again.

But that means vfs_lock_file() will return -EAGAIN instead of
FILE_LOCK_DENIED.  So we still need to translate that -EAGAIN return
into a nlm_lck_blocked error in this case, and put ourselves back on
lockd's block list.

The bug was introduced by bde74e4bc6 "locks: add special return
value for asynchronous locks".

Thanks to Frank van Maarseveen for the report; his original test
case was essentially

	for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done

Tested-by: Frank van Maarseveen <frankvm@frankvm.com>
Reported-by: Frank van Maarseveen <frankvm@frankvm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-02-09 13:19:46 -05:00
Ingo Molnar 4ad476e11f Merge commit 'v2.6.29-rc4' into tracing/core 2009-02-09 10:32:48 +01:00
Cornelia Huck 766ccb9ed4 async: Rename _special -> _domain for clarity.
Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:11 -08:00
Linus Torvalds ccfef64621 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  CRED: Fix SUID exec regression
2009-02-06 18:52:55 -08:00
Linus Torvalds ae1a25da84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits)
  Btrfs: Make sure dir is non-null before doing S_ISGID checks
  Btrfs: Fix memory leak in cache_drop_leaf_ref
  Btrfs: don't return congestion in write_cache_pages as often
  Btrfs: Only prep for btree deletion balances when nodes are mostly empty
  Btrfs: fix btrfs_unlock_up_safe to walk the entire path
  Btrfs: change btrfs_del_leaf to drop locks earlier
  Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
  Btrfs: Don't try to compress pages past i_size
  Btrfs: join the transaction in __btrfs_setxattr
  Btrfs: Handle SGID bit when creating inodes
  Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
  Btrfs: Change btree locking to use explicit blocking points
  Btrfs: hash_lock is no longer needed
  Btrfs: disable leak debugging checks in extent_io.c
  Btrfs: sort references by byte number during btrfs_inc_ref
  Btrfs: async threads should try harder to find work
  Btrfs: selinux support
  Btrfs: make btrfs acls selectable
  Btrfs: Catch missed bios in the async bio submission thread
  Btrfs: fix readdir on 32 bit machines
  ...
2009-02-06 18:37:22 -08:00
Tyler Hicks fd9fc842bb eCryptfs: Regression in unencrypted filename symlinks
The addition of filename encryption caused a regression in unencrypted
filename symlink support.  ecryptfs_copy_filename() is used when dealing
with unencrypted filenames and it reported that the new, copied filename
was a character longer than it should have been.

This caused the return value of readlink() to count the NUL byte of the
symlink target.  Most applications don't care about the extra NUL byte,
but a version control system (bzr) helped in discovering the bug.
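
The expected readlink() behaviour can be checked from user space with a
sketch like the following.  The helper names and the temporary directory
template are arbitrary choices for illustration; on a correct filesystem
the reported length equals strlen() of the target, with no terminating
NUL counted:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Creates a symlink to `target` in a fresh temporary directory, reads it
 * back, and returns the byte count reported by readlink(), or -1 on failure.
 */
long readlink_reported_length(const char *target)
{
	char dir[] = "/tmp/readlink-demo-XXXXXX";
	char link_path[64];
	char buf[4096];
	ssize_t n;

	assert(target != NULL);
	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(link_path, sizeof(link_path), "%s/link", dir);
	if (symlink(target, link_path) != 0) {
		rmdir(dir);
		return -1;
	}
	n = readlink(link_path, buf, sizeof(buf));
	unlink(link_path);
	rmdir(dir);
	return (long)n;
}

/* Returns 1 when readlink() reported exactly strlen(target) bytes. */
int readlink_excludes_nul(const char *target)
{
	return readlink_reported_length(target) == (long)strlen(target);
}
```

To exercise a particular filesystem, point the directory template at a
path on that mount instead of /tmp.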

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 18:36:40 -08:00
Linus Torvalds 1d87b0d388 Merge branch 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland
* 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
  elf core dump: fix get_user use
2009-02-06 18:10:04 -08:00
Roland McGrath 92dc07b1f9 elf core dump: fix get_user use
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.

Checking for VM_READ optimizes out the case where get_user() would fail
anyway.  The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
2009-02-06 17:34:07 -08:00
David Howells 0bf2f3aec5 CRED: Fix SUID exec regression
The patch:

	commit a6f76f23d2
	CRED: Make execve() take advantage of copy-on-write credentials

moved the point at which the 'safeness' of a SUID/SGID exec is determined to
before de_thread() is called.  This means that LSM_UNSAFE_SHARE is now
calculated incorrectly.  This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made, all of which are true for threads created by the
pthread library.

However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_struct
(CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

We do have to be careful because CLONE_THREAD does not imply FS or FILES.

We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
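
The counting rule can be illustrated with a small user-space model.
struct share_model and shared_outside_group() are hypothetical names, not
kernel identifiers; the model only captures the comparison between a
structure's usage count and the number of sharers inside our own thread
group:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one shared structure (e.g. an fs_struct). */
struct share_model {
	int users;          /* total usage count of the structure */
	int group_sharers;  /* threads in our thread group using it, including us */
};

/*
 * The structure counts as unsafely shared only if some user remains that
 * is not one of our own threads, i.e. one that de_thread() will not kill.
 */
bool shared_outside_group(const struct share_model *s)
{
	assert(s != NULL && s->group_sharers >= 1 && s->users >= s->group_sharers);
	return s->users > s->group_sharers;
}
```

A process with one extra pthread has users == 2 and group_sharers == 2,
so nothing is shared outside the group.  A third user from outside the
thread group makes the comparison true.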

This can be tested with the attached pair of programs.  Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user.  If
successful, you should see something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=0 suid=0
	SUCCESS - Correct effective user ID

and if unsuccessful, something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=4043 suid=4043
	ERROR - Incorrect effective user ID!

The non-root user ID you see will depend on the user you run as.

[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static void *thread_func(void *arg)
{
	while (1) {}
}

int main(int argc, char **argv)
{
	pthread_t tid;
	uid_t uid, euid, suid;

	printf("--TEST1--\n");
	getresuid(&uid, &euid, &suid);
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

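	/* pthread_create() returns an error number on failure and does not set errno */
	if (pthread_create(&tid, NULL, thread_func, NULL) != 0) {
		fprintf(stderr, "pthread_create failed\n");
		exit(1);
	}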

	printf("exec ./test2\n");
	execlp("./test2", "test2", NULL);
	perror("./test2");
	_exit(1);
}

[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uid_t uid, euid, suid;

	getresuid(&uid, &euid, &suid);
	printf("--TEST2--\n");
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

	if (euid != 0) {
		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
		exit(1);
	}
	printf("SUCCESS - Correct effective user ID\n");
	exit(0);
}

[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2

test1: test1.c
	gcc $(CFLAGS) -o test1 test1.c -lpthread

test2: test2.c
	gcc $(CFLAGS) -o test2 test2.c
	sudo chown root:root test2
	sudo chmod +s test2

Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-07 08:46:18 +11:00