Commit Graph

25047 Commits

Author SHA1 Message Date
Sage Weil 2151937d7c ceph: fix rasize reporting by ceph_show_options
Fix typo.

Reported-by: mowang da <whooya.xxl@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-02 09:27:54 -08:00
Linus Torvalds 0a4ebed781 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (31 commits)
  ocfs2: avoid unaligned access to dqc_bitmap
  ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
  ocfs2: honor O_(D)SYNC flag in fallocate
  ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
  ocfs2: send correct UUID to cleancache initialization
  ocfs2: Commit transactions in error cases -v2
  ocfs2: make direntry invalid when deleting it
  fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
  ocfs2: Avoid livelock in ocfs2_readpage()
  ocfs2: serialize unaligned aio
  ocfs2: Implement llseek()
  ocfs2: Fix ocfs2_page_mkwrite()
  ocfs2: Add comment about orphan scanning
  ocfs2: Clean up messages in the fs
  ocfs2/cluster: Cluster up now includes network connections too
  ocfs2/cluster: Add new function o2net_fill_node_map()
  ocfs2/cluster: Fix output in file elapsed_time_in_ms
  ocfs2/dlm: dlmlock_remote() needs to account for remastery
  ocfs2/dlm: Take inflight reference count for remotely mastered resources too
  ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery()
  ...
2011-12-01 14:55:34 -08:00
Akinobu Mita 939255798a ocfs2: avoid unaligned access to dqc_bitmap
The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
but not 64-bit aligned.  The dqc_bitmap is accessed by ocfs2_set_bit(),
ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit().  These
are wrapper macros for ext2_*_bit() which need to take an unsigned long
aligned address (though some architectures are able to handle unaligned
address correctly)

So some 64bit architectures may not be able to access the dqc_bitmap
correctly.

This avoids such unaligned access by using another wrapper functions for
ext2_*_bit().  The code is taken from fs/ext4/mballoc.c which also need to
handle unaligned bitmap access.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-12-01 14:39:32 -08:00
Trond Myklebust 111d489f0f NFSv4.1: Ensure that we handle _all_ SEQUENCE status bits.
Currently, the code assumes that the SEQUENCE status bits are mutually
exclusive. They are not...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 2.6.34]
2011-12-01 16:37:42 -05:00
Trond Myklebust 4f38e4aadc NFSv4: Don't error if we handled it in nfs4_recovery_handle_error
If we handled an error condition, then nfs4_recovery_handle_error should
return '0' so that the state recovery thread can continue.
Also ensure that nfs4_check_lease() continues to abort if we haven't got
any credentials by having it return ENOKEY (which is not handled).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-12-01 16:31:34 -05:00
Linus Torvalds b930c26416 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix meta data raid-repair merge problem
  Btrfs: skip allocation attempt from empty cluster
  Btrfs: skip block groups without enough space for a cluster
  Btrfs: start search for new cluster at the beginning
  Btrfs: reset cluster's max_size when creating bitmap
  Btrfs: initialize new bitmaps' list
  Btrfs: fix oops when calling statfs on readonly device
  Btrfs: Don't error on resizing FS to same size
  Btrfs: fix deadlock on metadata reservation when evicting a inode
  Fix URL of btrfs-progs git repository in docs
  btrfs scrub: handle -ENOMEM from init_ipath()
2011-12-01 08:28:53 -08:00
Jan Schmidt f4a8e6563e Btrfs: fix meta data raid-repair merge problem
Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.

The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.

This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-01 09:30:36 -05:00
Alexandre Oliva be064d1139 Btrfs: skip allocation attempt from empty cluster
If we don't have a cluster, don't bother trying to allocate from it,
jumping right away to the attempt to allocate a new cluster.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva 425d83156c Btrfs: skip block groups without enough space for a cluster
We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing.  Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva 1b22bad779 Btrfs: start search for new cluster at the beginning
Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva b78d09bceb Btrfs: reset cluster's max_size when creating bitmap
The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk.  We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva f2d0f6765d Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation.  For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
2011-11-30 18:46:06 +01:00
Li Zefan b772a86ea6 Btrfs: fix oops when calling statfs on readonly device
To reproduce this bug:

  # dd if=/dev/zero of=img bs=1M count=256
  # mkfs.btrfs img
  # losetup -r /dev/loop1 img
  # mount /dev/loop1 /mnt
  OOPS!!

It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

To fix this, instead of checking write-only devices, we check all open
deivces:

  # df -h /dev/loop1
  Filesystem            Size  Used Avail Use% Mounted on
  /dev/loop1            250M   28K  238M   1% /mnt

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-11-30 18:46:05 +01:00
Mike Fleetwood ece7d20e8b Btrfs: Don't error on resizing FS to same size
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed.  User app GParted trips
over this error.  Allow it by bypassing the shrink or grow operation.

Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
2011-11-30 18:46:04 +01:00
Miao Xie aa38a711a8 Btrfs: fix deadlock on metadata reservation when evicting a inode
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
   start transaction
        |
	v
   reserve meta-data space
	|
	v
   flush delay allocation -> iput inode -> evict inode
	^					|
	|					v
   wait for delay allocation flush <- reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-11-30 18:46:03 +01:00
Dan Carpenter 26bdef541d btrfs scrub: handle -ENOMEM from init_ipath()
init_ipath() can return an ERR_PTR(-ENOMEM).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2011-11-30 18:46:01 +01:00
Christoph Hellwig 4c393a6059 xfs: fix attr2 vs large data fork assert
With Dmitry fsstress updates I've seen very reproducible crashes in
xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that
the attributes would not fit inline into the inode after removing an
attribute.  It turns out that we were operating on an inode with lots
of delalloc extents, and thus an if_bytes values for the data fork that
is larger than biggest possible on-disk storage for it which utterly
confuses the code near the end of xfs_attr_shortform_bytesfit.

Fix this by always allowing the current attribute fork, like we already
do for the attr1 format, given that delalloc conversion will take care
for moving either the data or attribute area out of line if it doesn't
fit at that point - or making the point moot by merging extents at this
point.

Also document the function better, and clean up some loose bits.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2011-11-29 13:03:12 -06:00
Christoph Hellwig 4dd2cb4a28 xfs: force buffer writeback before blocking on the ilock in inode reclaim
If we are doing synchronous inode reclaim we block the VM from making
progress in memory reclaim.  So if we encouter a flush locked inode
promote it in the delwri list and wake up xfsbufd to write it out now.
Without this we can get hangs of up to 30 seconds during workloads hitting
synchronous inode reclaim.

The scheme is copied from what we do for dquot reclaims.

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Ben Myers <bpm@sgi.com>
2011-11-29 12:06:14 -06:00
Linus Torvalds 883381d9f1 Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix racy use-after-free in ext4_end_io_dio()
2011-11-29 08:59:12 -08:00
Marcos Paulo de Souza 786228ab30 writeback: Fix issue on make htmldocs
Document the @reason parameter to make "make htmldocs" happy.

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-29 15:50:28 +08:00
Christoph Hellwig fa8b18edd7 xfs: validate acl count
This prevents in-memory corruption and possible panics if the on-disk
ACL is badly corrupted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2011-11-28 22:14:24 -06:00
Linus Torvalds cb3599926e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: pass allocated memory region back to caller
2011-11-28 11:27:57 -08:00
Tejun Heo 4c81f045c0 ext4: fix racy use-after-free in ext4_end_io_dio()
ext4_end_io_dio() queues io_end->work and then clears iocb->private;
however, io_end->work calls aio_complete() which frees the iocb
object.  If that slab object gets reallocated, then ext4_end_io_dio()
can end up clearing someone else's iocb->private, this use-after-free
can cause a leak of a struct ext4_io_end_t structure.

Detected and tested with slab poisoning.

[ Note: Can also reproduce using 12 fio's against 12 file systems with the
  following configuration file:

  [global]
  direct=1
  ioengine=libaio
  iodepth=1
  bs=4k
  ba=4k
  size=128m

  [create]
  filename=${TESTDIR}
  rw=write

  -- tytso ]

Google-Bug-Id: 5354697
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Kent Overstreet <koverstreet@google.com>
Tested-by: Kent Overstreet <koverstreet@google.com>
Cc: stable@kernel.org
2011-11-24 19:22:24 -05:00
Linus Torvalds de7badf1ad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: Extend array bounds for all filename chars
  eCryptfs: Flush file in vma close
  eCryptfs: Prevent file create race condition
2011-11-23 14:28:13 -08:00
Tyler Hicks 0f751e641a eCryptfs: Extend array bounds for all filename chars
From mhalcrow's original commit message:

    Characters with ASCII values greater than the size of
    filename_rev_map[] are valid filename characters.
    ecryptfs_decode_from_filename() will access kernel memory beyond
    that array, and ecryptfs_parse_tag_70_packet() will then decrypt
    those characters. The attacker, using the FNEK of the crafted file,
    can then re-encrypt the characters to reveal the kernel memory past
    the end of the filename_rev_map[] array. I expect low security
    impact since this array is statically allocated in the text area,
    and the amount of memory past the array that is accessible is
    limited by the largest possible ASCII filename character.

This patch solves the issue reported by mhalcrow but with an
implementation suggested by Linus to simply extend the length of
filename_rev_map[] to 256. Characters greater than 0x7A are mapped to
0x00, which is how invalid characters less than 0x7A were previously
being handled.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Michael Halcrow <mhalcrow@google.com>
Cc: stable@kernel.org
2011-11-23 15:43:53 -06:00
Tyler Hicks 32001d6fe9 eCryptfs: Flush file in vma close
Dirty pages weren't being written back when an mmap'ed eCryptfs file was
closed before the mapping was unmapped. Since f_ops->flush() is not
called by the munmap() path, the lower file was simply being released.
This patch flushes the eCryptfs file in the vm_ops->close() path.

https://launchpad.net/bugs/870326

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@kernel.org [2.6.39+]
2011-11-23 15:40:09 -06:00
Tyler Hicks b59db43ad4 eCryptfs: Prevent file create race condition
The file creation path prematurely called d_instantiate() and
unlock_new_inode() before the eCryptfs inode info was fully
allocated and initialized and before the eCryptfs metadata was written
to the lower file.

This could result in race conditions in subsequent file and inode
operations leading to unexpected error conditions or a null pointer
dereference while attempting to use the unallocated memory.

https://launchpad.net/bugs/813146

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@kernel.org
2011-11-23 15:39:38 -06:00
Linus Torvalds 2db1125d51 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mount_subtree() pointless use-after-free
  iio: fix a leak due to improper use of anon_inode_getfd()
  microblaze: bury asm/namei.h
2011-11-22 13:19:21 -08:00
Al Viro d31da0f0ba mount_subtree() pointless use-after-free
d'oh... we'd carefully pinned mnt->mnt_sb down, dropped mnt and attempt
to grab s_umount on mnt->mnt_sb.  The trouble is, *mnt might've been
overwritten by now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-22 12:31:21 -05:00
Linus Torvalds e25ba0ce03 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Revert pnfs ugliness from the generic NFS read code path
  SUNRPC: destroy freshly allocated transport in case of sockaddr init error
  NFS: Fix a regression in the referral code
  nfs: move nfs_file_operations declaration to bottom of file.c (try #2)
  nfs: when attempting to open a directory, fall back on normal lookup (try #5)
2011-11-22 08:54:15 -08:00
Linus Torvalds af36d15f58 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove free-space-cache.c WARN during log replay
  Btrfs: sectorsize align offsets in fiemap
  Btrfs: clear pages dirty for io and set them extent mapped
  Btrfs: wait on caching if we're loading the free space cache
  Btrfs: prefix resize related printks with btrfs:
  btrfs: fix stat blocks accounting
  Btrfs: avoid unnecessary bitmap search for cluster setup
  Btrfs: fix to search one more bitmap for cluster setup
  btrfs: mirror_num should be int, not u64
  btrfs: Fix up 32/64-bit compatibility for new ioctls
  Btrfs: fix barrier flushes
  Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
2011-11-22 08:53:40 -08:00
Linus Torvalds f8f5ed7c99 Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix up a undefined error in ext4_free_blocks in debugging code
  ext4: add blk_finish_plug in error case of writepages.
  ext4: Remove kernel_lock annotations
  ext4: ignore journalled data options on remount if fs has no journal
2011-11-21 12:11:37 -08:00
Linus Torvalds c292fe4aae Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: Allocate larger oid buffer in request msgs
  ceph: initialize root dentry
  ceph: fix iput race when queueing inode work
2011-11-21 12:11:13 -08:00
Chris Mason 24a7031396 Btrfs: remove free-space-cache.c WARN during log replay
The log replay code only partially loads block groups, since
the block group caching code is able to detect and deal with
extents the logging code has pinned down.

While the logging code is pinning down block groups, there is
a bogus WARN_ON we're hitting if the code wasn't able to find
an extent in the cache.  This commit removes the warning because
it can happen any time there isn't a valid free space cache
for that block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-21 14:57:33 -05:00
Yongqiang Yang 6e58ad69ef ext4: fix up a undefined error in ext4_free_blocks in debugging code
sbi is not defined, so let ext4_free_blocks use EXT4_SB(sb) instead
when EXT4FS_DEBUG is defined.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
2011-11-21 12:09:19 -05:00
David Howells dd179946db VFS: Log the fact that we've given ELOOP rather than creating a loop
To prevent an NFS server from being used to create a directory loop in an NFS
superblock on the client, the following patch was committed:

	commit 1836750115
	Author: Al Viro <viro@zeniv.linux.org.uk>
	Date:   Tue Jul 12 21:42:24 2011 -0400
	Subject: fix loop checks in d_materialise_unique()

This causes ELOOP to be reported to anyone trying to access the dentry that
would otherwise cause the kernel to complete the loop.

However, no indication is given to the caller as to why an operation that ought
to work doesn't.  The fault is with the kernel, which doesn't want to try and
solve the problem as it gets horrendously messy if there's another mountpoint
somewhere in the trees being spliced that can't be moved[*].

[*] The real problem is that we don't handle the excision of a subtree that
gets moved _out_ of what we can see.  This can happen on the server where a
directory is merely moved between two other dirs on the same filesystem, but
where destination dir is not accessible by the client.

So, given the choice to return ELOOP rather than trying to reconfigure the
dentry tree, we should give the caller some indication of why they aren't being
allowed to make what should be a legitimate request and log a message.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-20 23:04:27 -05:00
Josef Bacik 4d479cf010 Btrfs: sectorsize align offsets in fiemap
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop.  This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned.  With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:17 -05:00
Josef Bacik f7d61dcd68 Btrfs: clear pages dirty for io and set them extent mapped
When doing the io_ctl helpers to clean up the free space cache stuff I stopped
using our normal prepare_pages stuff, which means I of course forgot to do
things like set the pages extent mapped, which will cause us all sorts of
wonderful propblems.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:17 -05:00
Josef Bacik 291c7d2f57 Btrfs: wait on caching if we're loading the free space cache
We've been hitting panics when running xfstest 13 in a loop for long periods of
time.  And actually this problem has always existed so we've been hitting these
things randomly for a while.  Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go.  Then we get another thread that comes
in and tries to allocate from that block group.  Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation.  We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish.  The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.

The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED.  That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:16 -05:00
Arnd Hannemann 5bb1468238 Btrfs: prefix resize related printks with btrfs:
For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.

This patch prefixes those messages with "btrfs:" like other btrfs
related printks.

Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:16 -05:00
David Sterba fadc0d8be4 btrfs: fix stat blocks accounting
Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:15 -05:00
Li Zefan 52621cb6ed Btrfs: avoid unnecessary bitmap search for cluster setup
setup_cluster_no_bitmap() searches all the extents and bitmaps starting
from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
from offset are in the bitmaps list, so it's sufficient to search from
this list in setup_cluser_bitmap().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:15 -05:00
Li Zefan 0f0fbf1d0e Btrfs: fix to search one more bitmap for cluster setup
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.

In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:14 -05:00
Jan Schmidt 32240a913d btrfs: mirror_num should be int, not u64
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:14 -05:00
Jeff Mahoney 745c4d8e16 btrfs: Fix up 32/64-bit compatibility for new ioctls
This patch casts to unsigned long before casting to a pointer and fixes
 the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:13 -05:00
Chris Mason 387125fc72 Btrfs: fix barrier flushes
When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.

But, we have two bugs in the way these are sent down.  When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.

In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.

Both of these bugs can cause corruptions on power failures.  We fix it
with some new code to send down empty barriers to all devices before
writing the first super.

Alexandre Oliva found the multi-device bug.  Arne Jansen did the async
barrier loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
2011-11-20 07:21:14 -05:00
Al Viro f1fd306a91 minixfs: kill manual hweight(), simplify
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-19 11:13:28 -05:00
Josh Boyer 016e8d44bc fs/minix: Verify bitmap block counts before mounting
Newer versions of MINIX can create filesystems that allocate an extra
bitmap block.  Mounting of this succeeds, but doing a statfs call will
result in an oops in count_free because of a negative number being used
for the bh index.

Avoid this by verifying the number of allocated blocks at mount time,
erroring out if there are not enough and make statfs ignore the extras
if there are too many.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=18792

Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-19 11:13:26 -05:00
Linus Torvalds 208f6f6068 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  new helper: mount_subtree()
  switch create_mnt_ns() to saner calling conventions, fix double mntput() in nfs
  btrfs: fix double mntput() in mount_subvol()
2011-11-19 06:06:39 -05:00
Linus Torvalds ab5c5f639b Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  MAINTAINERS: update XFS maintainer entry
  xfs: use doalloc flag in xfs_qm_dqattach_one()
2011-11-19 06:05:17 -05:00
Linus Torvalds 15bd1cfb30 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block:
  block: add missed trace_block_plug
  paride: fix potential information leak in pg_read()
  bio: change some signed vars to unsigned
  block: avoid unnecessary plug list flush
  cciss: auto engage SCSI mid layer at driver load time
  loop: cleanup set_status interface
  include/linux/bio.h: use a static inline function for bio_integrity_clone()
  loop: prevent information leak after failed read
  block: Always check length of all iov entries in blk_rq_map_user_iov()
  The Windows driver .inf disables ASPM on all cciss devices. Do the same.
  backing-dev: ensure wakeup_timer is deleted
  block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"
2011-11-18 09:34:35 -02:00
Kees Cook f6f8285132 pstore: pass allocated memory region back to caller
The buf_lock cannot be held while populating the inodes, so make the backend
pass forward an allocated and filled buffer instead. This solves the following
backtrace. The effect is that "buf" is only ever used to notify the backends
that something was written to it, and shouldn't be used in the read path.

To replace the buf_lock during the read path, isolate the open/read/close
loop with a separate mutex to maintain serialized access to the backend.

Note that is is up to the pstore backend to cope if the (*write)() path is
called in the middle of the read path.

[   59.691019] BUG: sleeping function called from invalid context at .../mm/slub.c:847
[   59.691019] in_atomic(): 0, irqs_disabled(): 1, pid: 1819, name: mount
[   59.691019] Pid: 1819, comm: mount Not tainted 3.0.8 #1
[   59.691019] Call Trace:
[   59.691019]  [<810252d5>] __might_sleep+0xc3/0xca
[   59.691019]  [<810a26e6>] kmem_cache_alloc+0x32/0xf3
[   59.691019]  [<810b53ac>] ? __d_lookup_rcu+0x6f/0xf4
[   59.691019]  [<810b68b1>] alloc_inode+0x2a/0x64
[   59.691019]  [<810b6903>] new_inode+0x18/0x43
[   59.691019]  [<81142447>] pstore_get_inode.isra.1+0x11/0x98
[   59.691019]  [<81142623>] pstore_mkfile+0xae/0x26f
[   59.691019]  [<810a2a66>] ? kmem_cache_free+0x19/0xb1
[   59.691019]  [<8116c821>] ? ida_get_new_above+0x140/0x158
[   59.691019]  [<811708ea>] ? __init_rwsem+0x1e/0x2c
[   59.691019]  [<810b67e8>] ? inode_init_always+0x111/0x1b0
[   59.691019]  [<8102127e>] ? should_resched+0xd/0x27
[   59.691019]  [<8137977f>] ? _cond_resched+0xd/0x21
[   59.691019]  [<81142abf>] pstore_get_records+0x52/0xa7
[   59.691019]  [<8114254b>] pstore_fill_super+0x7d/0x91
[   59.691019]  [<810a7ff5>] mount_single+0x46/0x82
[   59.691019]  [<8114231a>] pstore_mount+0x15/0x17
[   59.691019]  [<811424ce>] ? pstore_get_inode.isra.1+0x98/0x98
[   59.691019]  [<810a8199>] mount_fs+0x5a/0x12d
[   59.691019]  [<810b9174>] ? alloc_vfsmnt+0xa4/0x14a
[   59.691019]  [<810b9474>] vfs_kern_mount+0x4f/0x7d
[   59.691019]  [<810b9d7e>] do_kern_mount+0x34/0xb2
[   59.691019]  [<810bb15f>] do_mount+0x5fc/0x64a
[   59.691019]  [<810912fb>] ? strndup_user+0x2e/0x3f
[   59.691019]  [<810bb3cb>] sys_mount+0x66/0x99
[   59.691019]  [<8137b537>] sysenter_do_call+0x12/0x26

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-11-17 12:58:07 -08:00
Jan Kara 249ec93c01 ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
Since ocfs2 has no ->write_inode method, there's no point in calling
write_inode_now() from ocfs2_cleanup_delete_inode().  Use
filemap_write_and_wait() instead. This helps us to cleanup inode writing
interfaces...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 02:18:57 -08:00
Mark Fasheh df295d4a4b ocfs2: honor O_(D)SYNC flag in fallocate
We need to sync the transaction which updates i_size if the file is marked
as needing sync semantics.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 02:15:58 -08:00
Xiaowei.Hu 0393afea31 ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
With indexed_dir enabled, ocfs2 maintains a list of dirblocks having
space.

The credit calculation in ocfs2_link_credits() did not correctly account
for adding an entry that exactly fills a dirblock that triggers removing
that dirblock by changing the pointer in the previous block in the list.
The credit calculation did not account for that previous block.

To expose, do:

mkfs.ocfs2 -b 512 -M local /dev/sdX
mount /dev/sdX /ocfs2
mkdir /ocfs2/linkdir
touch /ocfs2/linkdir/file1
for i in `seq 1 29` ; do link /ocfs2/linkdir/file1
/ocfs2/linkdir/linklinklinklinklinklink$i; done
rm -f /ocfs2/linkdir/linklinklinklinklinklink10
sleep 8
link /ocfs2/linkdir/file1
/ocfs2/linkdir/linklinklinklinklinklinkaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Note:
The link names have been crafted for a 512 byte blocksize. Reproducing
with a larger blocksize will require longer (or more) links. The sleep
is important. We want jbd2 to commit the transaction so that the missing
block does not piggy back on account of the previous transaction.

Signed-off-by: XiaoweiHu <xiaowei.hu at oracle.com>
Reviewed-by: WengangWang <wen.gang.wang at oracle.com>
Reviewed-by: Sunil.Mushran <sunil.mushran at oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 01:46:48 -08:00
Dan Magenheimer e41d33af7c ocfs2: send correct UUID to cleancache initialization
ocfs2: Fix cleancache initialization call to correctly pass uuid

As reported by Steven Whitehouse in https://lkml.org/lkml/2011/5/27/221
the ocfs2 volume UUID is incorrectly passed to cleancache.
As a result, shared-ephemeral tmem pools will not actually
be created; instead they will be private (unshared) which
misses out on a major benefit of tmem.

Reported-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 01:46:48 -08:00
Wengang Wang b8a0ae579f ocfs2: Commit transactions in error cases -v2
There are three cases found that in error cases, journal transactions are not
committed nor aborted. We should take care of these case by committing the
transactions. Otherwise, there would left a journal handle which will lead to
, in same process context, the comming ocfs2_start_trans() gets wrong credits.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 01:46:47 -08:00
Wengang Wang 8298524803 ocfs2: make direntry invalid when deleting it
When we deleting a direntry from a directory, if it's the first in a block we
invalid it by setting inode to 0; otherwise, we merge the deleted one to the
prior and contiguous direntry. And we don't truncate directories.

There is a problem for the later case since inode is not set to 0.
This problem happens when the caller passes a file position as parameter to
ocfs2_dir_foreach_blk(). If the position happens to point to a stale(not
the first, deleted in betweens of ocfs2_dir_foreach_blk()s) direntry, we are
not able to recognize its staleness. So that we treat it as a live one wrongly.

The fix is to set inode to 0 in both cases indicating the direntry is stale.
This won't introduce additional IOs.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 01:46:46 -08:00
Julia Lawall fc9f899483 fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
Memory allocated using kmem_cache_zalloc should be freed using
kmem_cache_free, not kfree.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x,e,e1,e2;
@@

x = kmem_cache_zalloc(e1,e2)
... when != x = e
?-kfree(x)
+kmem_cache_free(e1,x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-11-17 01:46:46 -08:00
Al Viro ea441d1104 new helper: mount_subtree()
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).

btrfs and nfs switched to it instead of open-coding the sucker.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 22:00:34 -05:00
Al Viro c133449587 switch create_mnt_ns() to saner calling conventions, fix double mntput() in nfs
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:12:14 -05:00
Al Viro 8d514bbf37 btrfs: fix double mntput() in mount_subvol()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:06:09 -05:00
Dan Carpenter 121f099412 bio: change some signed vars to unsigned
This is just a cleanup patch to silence a static checker warning.

The problem is that we cap "nr_iovecs" so it can't be larger than
"UIO_MAXIOV" but we don't check for negative values.  It turns out this is
prevented at other layers, but logically it doesn't make sense to have
negative nr_iovecs so making it unsigned is nicer.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:50 +01:00
Mitsuo Hayasaka db3e74b582 xfs: use doalloc flag in xfs_qm_dqattach_one()
The doalloc arg in xfs_qm_dqattach_one() is a flag that indicates
whether a new area to handle quota information will be allocated
if needed. Originally, it was passed to xfs_qm_dqget(), but has
been removed by the following commit (probably by mistake):

	commit 8e9b6e7fa4
	Author: Christoph Hellwig <hch@lst.de>
	Date:   Sun Feb 8 21:51:42 2009 +0100

	xfs: remove the unused XFS_QMOPT_DQLOCK flag

As the result, xfs_qm_dqget() called from xfs_qm_dqattach_one()
never allocates the new area even if it is needed.

This patch gives the doalloc arg to xfs_qm_dqget() in
xfs_qm_dqattach_one() to fix this problem.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2011-11-15 14:45:09 -06:00
Dan Carpenter bc5b8a9003 hfs: add sanity check for file name length
On a corrupted file system the ->len field could be wrong leading to
a buffer overflow.

Reported-and-acked-by: Clement LECIGNE <clement.lecigne@netasq.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-15 14:29:42 -02:00
Liu Bo f1ebcc74d5 Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases to lead to tree corruptions:

1) multi-thread snapshots can commit serveral snapshots in a transaction,
   and this may change the src root when processing the following pending
   snapshots, which lead to the former snapshots corruptions;

2) the free inode cache was changing the roots when it root the cache,
   which lead to corruptions.

This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-15 09:53:28 -05:00
Linus Torvalds c1f4246716 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: rename the option to nospace_cache
  Btrfs: handle bio_add_page failure gracefully in scrub
  Btrfs: fix deadlock caused by the race between relocation
  Btrfs: only map pages if we know we need them when reading the space cache
  Btrfs: fix orphan backref nodes
  Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
  Btrfs: fix unreleased path in btrfs_orphan_cleanup()
  Btrfs: fix no reserved space for writing out inode cache
  Btrfs: fix nocow when deleting the item
  Btrfs: tweak the delayed inode reservations again
  Btrfs: rework error handling in btrfs_mount()
  Btrfs: close devices on all error paths in open_ctree()
  Btrfs: avoid null dereference and leaks when bailing from open_ctree()
  Btrfs: fix subvol_name leak on error in btrfs_mount()
  Btrfs: fix memory leak in btrfs_parse_early_options()
  Btrfs: fix our reservations for updating an inode when completing io
  Btrfs: fix oops on NULL trans handle in btrfs_truncate
  btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
2011-11-11 23:47:06 -02:00
Linus Torvalds 53e3ccfd15 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix force shutdown handling in xfs_end_io
  xfs: constify xfs_item_ops
  xfs: Fix possible memory corruption in xfs_readlink
2011-11-11 23:37:17 -02:00
Sage Weil 774ac21da7 ceph: initialize root dentry
Set up d_fsdata on the root dentry.  This fixes a NULL pointer dereference
in ceph_d_prune on umount.  It also means we can eventually strip out all
of the conditional checks on d_fsdata because it is now set unconditionally
(prior to setting up the d_ops).

Fix the ceph_d_prune debug print while we're here.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-11 09:50:17 -08:00
David Sterba 8965593e41 btrfs: rename the option to nospace_cache
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.

The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11 10:14:57 -05:00
Arne Jansen 69f4cb526b Btrfs: handle bio_add_page failure gracefully in scrub
Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11 08:17:10 -05:00
Miao Xie 62f30c5462 Btrfs: fix deadlock caused by the race between relocation
We can not do flushable reservation for the relocation when we create snapshot,
because it may make the transaction commit task and the flush task wait for
each other and the deadlock happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Josef Bacik 2f120c05e6 Btrfs: only map pages if we know we need them when reading the space cache
People have been running into a warning when loading space cache because the
page is already mapped when trying to read in a bitmap.  The way we read in
entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
maps the entries if it needs to, and if it hits the end of the page it simply
unmaps the page.  That way we can unconditionally unmap the io_ctl before
reading in the bitmap and we should stop hitting these warnings.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 76b9e23d25 Btrfs: fix orphan backref nodes
If the root node of a fs/file tree is in the block group that is
being relocated, but the others are not in the other block groups.
when we create a snapshot for this tree between the relocation tree
creation ends and ->create_reloc_tree is set to 0, Btrfs will create
some backref nodes that are the lowest nodes of the backrefs cache.
But we forget to add them into ->leaves list of the backref cache
and deal with them, and at last, they will triggered BUG_ON().

  kernel BUG at fs/btrfs/relocation.c:239!

This patch fixes it by adding them into ->leaves list of backref cache.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 61b520a9d0 Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 3254c87618 Btrfs: fix unreleased path in btrfs_orphan_cleanup()
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.

Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie ba38eb4de3 Btrfs: fix no reserved space for writing out inode cache
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().

WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
...
Call Trace:
 [<ffffffff8104df86>] warn_slowpath_common+0x80/0x98
 [<ffffffff8104dfb3>] warn_slowpath_null+0x15/0x17
 [<ffffffffa0369c60>] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
 [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
 [<ffffffffa035c040>] __btrfs_cow_block+0x118/0x3b5 [btrfs]
 [<ffffffffa035c7ba>] btrfs_cow_block+0x103/0x14e [btrfs]
 [<ffffffffa035e4c4>] btrfs_search_slot+0x249/0x6a4 [btrfs]
 [<ffffffffa036d086>] btrfs_lookup_inode+0x2a/0x8a [btrfs]
 [<ffffffffa03788b7>] btrfs_update_inode+0xaa/0x141 [btrfs]
 [<ffffffffa036d7ec>] btrfs_save_ino_cache+0xea/0x202 [btrfs]
 [<ffffffffa03a761e>] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
 [<ffffffffa0373867>] commit_fs_roots+0xaa/0x158 [btrfs]
 [<ffffffffa03746a6>] btrfs_commit_transaction+0x405/0x731 [btrfs]
 [<ffffffff810690df>] ? wake_up_bit+0x25/0x25
 [<ffffffffa039d652>] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
 [<ffffffffa0381c5f>] btrfs_sync_file+0x16a/0x198 [btrfs]
 [<ffffffff81122806>] ? mntput+0x21/0x23
 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
 [<ffffffff8112d170>] vfs_fsync+0x17/0x19
 [<ffffffff8112d316>] do_fsync+0x29/0x3e
 [<ffffffff8112d348>] sys_fsync+0xb/0xf
 [<ffffffff81468352>] system_call_fastpath+0x16/0x1b

Sometimes it causes BUG_ON() in the reservation code of the delayed inode
is triggered.

So we must reserve enough space for inode cache.

Note: If we can not reserve the enough space for inode cache, we will
give up writing out it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Miao Xie 924cd8fbe4 Btrfs: fix nocow when deleting the item
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Chris Mason f7d572188b Merge branch 'mount-fixes' of git://github.com/idryomov/btrfs-unstable into integration 2011-11-10 20:42:53 -05:00
Chris Mason 2115133f8b Btrfs: tweak the delayed inode reservations again
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.

This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:39:08 -05:00
Trond Myklebust 62e4a76987 NFS: Revert pnfs ugliness from the generic NFS read code path
pNFS-specific code belongs in the pnfs layer. It should not be
hijacking generic NFS read or write code paths.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-10 14:50:26 -05:00
Linus Torvalds 5e442a493f Revert "proc: fix races against execve() of /proc/PID/fd**"
This reverts commit aa6afca5bc.

It escalates of some of the google-chrome SELinux problems with ptrace
("Check failed: pid_ > 0.  Did not find zygote process"), and Andrew
says that it is also causing mystery lockdep reports.

Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com>
Requested-by: James Morris <jmorris@namei.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-09 18:16:00 -05:00
Ilya Dryomov 04d21a244f Btrfs: rework error handling in btrfs_mount()
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.

Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:39 +02:00
Ilya Dryomov 586e46e281 Btrfs: close devices on all error paths in open_ctree()
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree().  fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov 4d34b27895 Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Fix bugs introduced by 6c41761f.  Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info().  Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.

Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov f23c8af8ca Btrfs: fix subvol_name leak on error in btrfs_mount()
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:

mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>

So free subvol_name string on error.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov a90e8b6fb8 Btrfs: fix memory leak in btrfs_parse_early_options()
Don't leak subvol_name string in case multiple subvol= options are
given.  "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Josef Bacik 7fd2ae21a4 Btrfs: fix our reservations for updating an inode when completing io
People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode.  The problem with this is we don't explicitly save space for updating
the inode when doing delalloc.  This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.

Then came the delayed inode stuff.  This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again.  This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.

So here we are, we're doing everything right and being screwed for it.  So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation.  If we need it we can steal it in
the delayed inode stuff.  If we have already stolen it try and do a normal
metadata reservation.  If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.

With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 15:47:34 -05:00
Chris Mason 917c16b2b6 Btrfs: fix oops on NULL trans handle in btrfs_truncate
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle.  The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 14:49:59 -05:00
Christoph Hellwig 810627d9a6 xfs: fix force shutdown handling in xfs_end_io
Ensure ioend->io_error gets propagated back to e.g. AIO completions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-11-08 10:48:23 -06:00
Christoph Hellwig 272e42b215 xfs: constify xfs_item_ops
The log item ops aren't nessecarily the biggest exploit vector, but marking
them const is easy enough.  Also remove the unused xfs_item_ops_t typedef
while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-11-08 10:48:23 -06:00
Carlos Maiolino b52a360b2a xfs: Fix possible memory corruption in xfs_readlink
Fixes a possible memory corruption when the link is larger than
MAXPATHLEN and XFS_DEBUG is not enabled. This also remove the
S_ISLNK assert, since the inode mode is checked previously in
xfs_readlink_by_handle() and via VFS.

Updated to address concerns raised by Ben Hutchings about the loose
attention paid to 32- vs 64-bit values, and the lack of handling a
potentially negative pathlen value:
 - Changed type of "pathlen" to be xfs_fsize_t, to match that of
   ip->i_d.di_size
 - Added checking for a negative pathlen to the too-long pathlen
   test, and generalized the message that gets reported in that case
   to reflect the change
As a result, if a negative pathlen were encountered, this function
would return EFSCORRUPTED (and would fail an assertion for a debug
build)--just as would a too-long pathlen.

Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-11-08 10:48:23 -06:00
Al Viro a3fbbde70a VFS: we need to set LOOKUP_JUMPED on mountpoint crossing
Mountpoint crossing is similar to following procfs symlinks - we do
not get ->d_revalidate() called for dentry we have arrived at, with
unpleasant consequences for NFS4.

Simple way to reproduce the problem in mainline:

    cat >/tmp/a.c <<'EOF'
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdio.h>
    main()
    {
            struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
            if (fcntl(0, F_SETLK, &fl))
                    perror("setlk");
    }
    EOF
    cc /tmp/a.c -o /tmp/test

then on nfs4:

    mount --bind file1 file2
    /tmp/test < file1		# ok
    /tmp/test < file2		# spews "setlk: No locks available"...

What happens is the missing call of ->d_revalidate() after mountpoint
crossing and that's where NFS4 would issue OPEN request to server.

The fix is simple - treat mountpoint crossing the same way we deal with
following procfs-style symlinks.  I.e.  set LOOKUP_JUMPED...

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 14:58:06 -08:00
Trond Myklebust a6f498a891 NFS: Fix a regression in the referral code
Fix a regression that was introduced by commit
0c2e53f11a (NFS: Remove the unused
"lookupfh()" version of nfs4_proc_lookup()).

In the case where the lookup gets an NFS4ERR_MOVED, we want to return
the result of nfs4_get_referral(). Instead, that value is getting
clobbered by the call to nfs4_handle_exception()...

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-07 16:20:00 -05:00
slyich@gmail.com 45ea6095c8 btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:

Complete reproducer under usermode linux (discovered on real
machine):

    bdev=/dev/ubda
    btr_root=/btr
    /mkfs.btrfs $bdev
    mount $bdev $btr_root
    mkdir $btr_root/subvols/
    cd $btr_root/subvols/
    /btrfs su cr foo
    /btrfs su cr bar
    mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
    umount $btr_root/subvols/bar

which gives

device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------

INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948:  [<6008c522>] print_trailer+0xe2/0x130
70b31978:  [<6008c5aa>] object_err+0x3a/0x50
70b319a8:  [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0:  [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8:  [<6008e5c1>] __slab_free+0x221/0x2d0

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-07 16:08:01 -05:00
Al Viro 50e696308c vfs: d_invalidate() should leave mountpoints alone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:54:10 -08:00
Linus Torvalds ff4d7fa8c3 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Cleanup byte-range locking code style
  CIFS: Simplify setlk error handling for mandatory locking
2011-11-07 09:56:22 -08:00
Linus Torvalds e0d65113a7 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (226 commits)
  mtd: tests: annotate as DANGEROUS in Kconfig
  mtd: tests: don't use mtd0 as a default
  mtd: clean up usage of MTD_DOCPROBE_ADDRESS
  jffs2: add compr=lzo and compr=zlib options
  jffs2: implement mount option parsing and compression overriding
  mtd: nand: initialize ops.mode
  mtd: provide an alias for the redboot module name
  mtd: m25p80: don't probe device which has status of 'disabled'
  mtd: nand_h1900 never worked
  mtd: Add DiskOnChip G3 support
  mtd: m25p80: add EON flash EN25Q32B into spi flash id table
  mtd: mark block device queue as non-rotational
  mtd: r852: make r852_pm_ops static
  mtd: m25p80: add support for at25df321a spi data flash
  mtd: mxc_nand: preset_v1_v2: unlock all NAND flash blocks
  mtd: nand: switch `check_pattern()' to standard `memcmp()'
  mtd: nand: invalidate cache on unaligned reads
  mtd: nand: do not scan bad blocks with NAND_BBT_NO_OOB set
  mtd: nand: wait to set BBT version
  mtd: nand: scrub BBT on ECC errors
  ...

Fix up trivial conflicts:
 - arch/arm/mach-at91/board-usb-a9260.c
	Merged into board-usb-a926x.c
 - drivers/mtd/maps/lantiq-flash.c
	add_mtd_partitions -> mtd_device_register vs changed to use
	mtd_device_parse_register.
2011-11-07 09:11:16 -08:00
Linus Torvalds cf5e15fbd7 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix the dark space calculation
  UBIFS: introduce a helper to dump scanning info
2011-11-07 08:52:19 -08:00
Namjae Jeon 3c1fcb2c24 ext4: add blk_finish_plug in error case of writepages.
blk_finish_plug is needed in error case of writepages.

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 11:01:13 -05:00
Richard Weinberger 2397256d62 ext4: Remove kernel_lock annotations
The BKL is gone, these annotations are useless.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:50:09 -05:00
Theodore Ts'o eb513689c9 ext4: ignore journalled data options on remount if fs has no journal
This avoids a confusing failure in the init scripts when the
/etc/fstab has data=writeback or data=journal but the file system does
not have a journal.  So check for this case explicitly, and warn the
user that we are ignoring the (pointless, since they have no journal)
data=* mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:47:42 -05:00
Linus Torvalds 6a6662ced4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
  Btrfs: check for a null fs root when writing to the backup root log
  Btrfs: fix race during transaction joins
  Btrfs: fix a potential btrfs_bio leak on scrub fixups
  Btrfs: rename btrfs_bio multi -> bbio for consistency
  Btrfs: stop leaking btrfs_bios on readahead
  Btrfs: stop the readahead threads on failed mount
  Btrfs: fix extent_buffer leak in the metadata IO error handling
  Btrfs: fix the new inspection ioctls for 32 bit compat
  Btrfs: fix delayed insertion reservation
  Btrfs: ClearPageError during writepage and clean_tree_block
  Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
  Btrfs: make a delayed_block_rsv for the delayed item insertion
  Btrfs: add a log of past tree roots
  btrfs: separate superblock items out of fs_info
  Btrfs: use the global reserve when truncating the free space cache inode
  Btrfs: release metadata from global reserve if we have to fallback for unlink
  Btrfs: make sure to flush queued bios if write_cache_pages waits
  Btrfs: fix extent pinning bugs in the tree log
  Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
  Btrfs: don't wait as long for more batches during SSD log commit
  ...
2011-11-06 20:03:41 -08:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds 208bca0860 Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Add a 'reason' to wb_writeback_work
  writeback: send work item to queue_io, move_expired_inodes
  writeback: trace event balance_dirty_pages
  writeback: trace event bdi_dirty_ratelimit
  writeback: fix ppc compile warnings on do_div(long long, unsigned long)
  writeback: per-bdi background threshold
  writeback: dirty position control - bdi reserve area
  writeback: control dirty pause time
  writeback: limit max dirty pause time
  writeback: IO-less balance_dirty_pages()
  writeback: per task dirty rate limit
  writeback: stabilize bdi->dirty_ratelimit
  writeback: dirty rate control
  writeback: add bg_threshold parameter to __bdi_update_bandwidth()
  writeback: dirty position control
  writeback: account per-bdi accumulated dirtied pages
2011-11-06 19:02:23 -08:00
Linus Torvalds 5d5a8d2d9d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph/super.c: quiet sparse noise
  ceph/mds_client.c: quiet sparse noise
  ceph: use new D_COMPLETE dentry flag
  ceph: clear parent D_COMPLETE flag when on dentry prune
2011-11-06 17:28:44 -08:00
Chris Mason 7c7e82a77f Btrfs: check for a null fs root when writing to the backup root log
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 18:50:56 -05:00
Chris Mason d43317dcd0 Btrfs: fix race during transaction joins
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:26:19 -05:00
Ilya Dryomov 56d2a48f81 Btrfs: fix a potential btrfs_bio leak on scrub fixups
In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:29 -05:00
Ilya Dryomov 21ca543efc Btrfs: rename btrfs_bio multi -> bbio for consistency
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:21 -05:00
Ilya Dryomov 9510dc4c62 Btrfs: stop leaking btrfs_bios on readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:08 -05:00
Chris Mason 306c8b68c8 Btrfs: stop the readahead threads on failed mount
If we don't stop them, they linger around corrupting
memory by using pointers to freed things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:09:41 -05:00
Chris Mason c674e04e1c Btrfs: fix extent_buffer leak in the metadata IO error handling
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:09:10 -05:00
Chris Mason 740c3d226c Btrfs: fix the new inspection ioctls for 32 bit compat
The new ioctls to follow backrefs are not clean for 32/64 bit
compat.  This reworks them for u64s everywhere.  They are brand new, so
there are no problems with changing the interface now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:08:49 -05:00
Chris Mason 806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
Chris Mason 531f4b1ae5 Merge branch 'for-chris' of git://github.com/sensille/linux into integration
Conflicts:
	fs/btrfs/ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:05:08 -05:00
Josef Bacik c06a0e120a Btrfs: fix delayed insertion reservation
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid.  It's
not the delayed insertion stuffs fault, it's all just stupid.  When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space.  This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter.  Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation.  So if
trans->bytes_reserved is 0 then try to do a normal reservation.  If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.

The other stupid thing we do is not reserve space for the inode when writing to
the thing.  Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter.  But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway.  This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case.  So again the delayed insertion stuff bites us in the ass.  So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve.  Hopefully this won't bite us in the ass, but I've said that before.

With this patch stress.sh no longer spits out those stupid warnings (famous last
words).  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:20 -05:00
Chris Mason bf0da8c183 Btrfs: ClearPageError during writepage and clean_tree_block
Failure testing was tripping up over stale PageError bits in
metadata pages.  If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.

During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.

This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit.  In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:20 -05:00
Josef Bacik 663350ac38 Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Because of the overcommit stuff I had to make it so that we committed the
transaction all the time in reserve_metadata_bytes in case we had overcommitted
because of delayed items.  This was because previously we had no way of knowing
how much space was reserved for delayed items.  Now that we have the
delayed_block_rsv we can check it to see if committing the transaction would get
us anywhere.  This patch breaks out the committing logic into a helper function
that will check to see if committing the transaction would free enough space for
us to get anything done.  With this patch xfstests 83 goes from taking 445
seconds to taking 28 seconds on my box.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:19 -05:00
Josef Bacik 6d668dda0c Btrfs: make a delayed_block_rsv for the delayed item insertion
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff.  It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff.  So instead create a seperate block rsv for the delayed
insertion stuff.  This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:18 -05:00
Chris Mason af31f5e5b8 Btrfs: add a log of past tree roots
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.

It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:15 -05:00
David Sterba 6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Josef Bacik c8174313a8 Btrfs: use the global reserve when truncating the free space cache inode
We no longer use the orphan block rsv for holding the reservation for truncating
the inode, so instead use the global block rsv and check to make sure it has
enough space for us to truncate the space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:50 -05:00
Josef Bacik 5a77d76c24 Btrfs: release metadata from global reserve if we have to fallback for unlink
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem.  I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction().  The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right.  My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation.  So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly.  With this patch xfstest 83 doesn't emit warnings
about leaking space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:49 -05:00
Chris Mason 01d658f2ca Btrfs: make sure to flush queued bios if write_cache_pages waits
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.

Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:48 -05:00
Chris Mason e688b7252f Btrfs: fix extent pinning bugs in the tree log
The tree log had two important bugs that could cause corruptions after a
crash.  Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.

This allowed a future metadata write to overwrite the tree log data.  It
is fixed by adding a new variant of freeing reserved extents that always
pins them.  Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.

During tree log replay, we do a pass through the tree log and pin all
the extents we find.  This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay.  The problem
is the free space cache isn't honoring these pinned extents.  So the
allocator can end up handing them out, leading to all kinds of problems
during replay.

The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-11-06 03:03:48 -05:00
Chris Mason 1eae31e918 Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
btrfs_remove_free_space needs to make sure to set ret back to a
valid return value after setting it to EAGAIN, otherwise we return
it to the callers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:47 -05:00
Chris Mason cd354ad613 Btrfs: don't wait as long for more batches during SSD log commit
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger.  This helps improve performance on rotating
disks, but on SSDs it adds latencies.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:47 -05:00
Sage Weil 15a2015fbc ceph: fix iput race when queueing inode work
If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.

That is, instead of

	if (queue_work(...))
		ihold();

we need to do

	ihold();
	if (!queue_work(...))
		iput();

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 22:06:31 -07:00
H Hartley Sweeten 0c6d4b4e22 ceph/super.c: quiet sparse noise
Quiet the sparse noise:

warning: symbol 'create_fs_client' was not declared. Should it be static?
warning: symbol 'destroy_fs_client' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:12 -07:00
H Hartley Sweeten 7fd7d101ff ceph/mds_client.c: quiet sparse noise
Quiet the following sparse noise:

warning: symbol 'get_nonsnap_parent' was not declared. Should it be static?
warning: symbol 'done_closing_sessions' was not declared. Should it be static?

Local functions don't need external visability. Make them static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:11 -07:00
Sage Weil c6ffe10015 ceph: use new D_COMPLETE dentry flag
We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy.  Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:10 -07:00
Dan McGee 5c8a0fbba5 VFS: fix statfs() automounter semantics regression
No one in their right mind would expect statfs() to not work on a
automounter managed mount point. Fix it.

[ I'm not sure about the "no one in their right mind" part.  It's not
  mounted, and you didn't ask for it to be mounted.  But nobody will
  really care, and this probably makes it match previous semantics, so..
      - Linus ]

This mirrors the fix made to the quota code in 815d405cef.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-04 18:15:59 -07:00
Linus Torvalds 3d0a8d10cf Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
  virtio-blk: use ida to allocate disk index
  hpsa: add small delay when using PCI Power Management to reset for kump
  cciss: add small delay when using PCI Power Management to reset for kump
  xen/blkback: Fix two races in the handling of barrier requests.
  xen/blkback: Check for proper operation.
  xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
  xen/blkback: Report VBD_WSECT (wr_sect) properly.
  xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
  xen-blkfront: plug device number leak in xlblk_init() error path
  xen-blkfront: If no barrier or flush is supported, use invalid operation.
  xen-blkback: use kzalloc() in favor of kmalloc()+memset()
  xen-blkback: fixed indentation and comments
  xen-blkfront: fix a deadlock while handling discard response
  xen-blkfront: Handle discard requests.
  xen-blkback: Implement discard requests ('feature-discard')
  xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
  drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
  drivers/block/loop.c: emit uevent on auto release
  drivers/block/cpqarray.c: use pci_dev->revision
  loop: always allow userspace partitions and optionally support automatic scanning
  ...

Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-04 17:22:14 -07:00
Linus Torvalds b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Linus Torvalds 044595d4e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: Add an option to set dev block size to 4K
2011-11-04 16:48:37 -07:00
Jeff Layton 0486958f57 nfs: move nfs_file_operations declaration to bottom of file.c (try #2)
...a remove a set of forward declarations.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-04 16:39:11 -04:00
Jeff Layton 1788ea6e3b nfs: when attempting to open a directory, fall back on normal lookup (try #5)
commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return
from an OPEN call. Prior to that patch, that caused the client to fall
back to doing a normal lookup. When that patch went in, the code began
returning that error to userspace. The d_revalidate codepath however
never had the corresponding change, so it was still possible to end up
with a NULL ctx->state pointer after that.

That patch caused a regression. When we attempt to open a directory that
does not have a cached dentry, that open now errors out with EISDIR. If
you attempt the same open with a cached dentry, it will succeed.

Fix this by reverting the change in nfs_atomic_lookup and allowing
attempts to open directories to fall back to a normal lookup

Also, add a NFSv4-specific f_ops->open routine that just returns
-ENOTDIR. This should never be called if things are working properly,
but if it ever is, then the dprintk may help in debugging.

To facilitate this, a new file_operations field is also added to the
nfs_rpc_ops struct.

Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-04 16:39:04 -04:00
Linus Torvalds 6736c04799 Merge branch 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
* 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (25 commits)
  nfs: set vs_hidden on nfs4_callback_version4 (try #2)
  pnfs-obj: Support for RAID5 read-4-write interface.
  pnfs-obj: move to ore 03: Remove old raid engine
  pnfs-obj: move to ore 02: move to ORE
  pnfs-obj: move to ore 01: ore_layout & ore_components
  pnfs-obj: Rename objlayout_io_state => objlayout_io_res
  pnfs-obj: Get rid of objlayout_{alloc,free}_io_state
  pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist
  pnfs-obj: Remove redundant EOF from objlayout_io_state
  nfs: Remove unused variable from write.c
  nfs: Fix unused variable warning from file.c
  NFS: Remove no-op less-than-zero checks on unsigned variables.
  NFS: Clean up nfs4_xdr_dec_secinfo()
  NFS: Fix documenting comment for nfs_create_request()
  NFS4: fix cb_recallany decode error
  nfs4: serialize layoutcommit
  SUNRPC: remove rpcbind clients destruction on module cleanup
  SUNRPC: remove rpcbind clients creation during service registering
  NFSd: call svc rpcbind cleanup explicitly
  SUNRPC: cleanup service destruction
  ...
2011-11-04 12:27:43 -07:00
Jeff Layton 6070295efc nfs: set vs_hidden on nfs4_callback_version4 (try #2)
This service should not be registered with or unregistered from rpcbind.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-04 09:00:09 -04:00
Pavel Shilovsky a88b470773 CIFS: Cleanup byte-range locking code style
Reorder parms of cifs_lock_init, trivially simplify getlk code and
remove extra {} in cifs_lock_add_if.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-11-04 00:53:21 -05:00
Pavel Shilovsky 161ebf9fcc CIFS: Simplify setlk error handling for mandatory locking
Now we allocate a lock structure at first, then we request to the server
and save the lock if server returned OK though void function - it prevents
the situation when we locked a file on the server and then return -ENOMEM
from setlk.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-11-04 00:53:15 -05:00
Linus Torvalds 1a67a573b8 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  cifs: Assume passwords are encoded according to iocharset (try #2)
  CIFS: Fix the VFS brlock cache usage in posix locking case
  [CIFS] Update cifs version to 1.76
  CIFS: Remove extra mutex_unlock in cifs_lock_add_if
2011-11-03 21:07:58 -07:00
Sage Weil b58dc4100b ceph: clear parent D_COMPLETE flag when on dentry prune
When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag
on the parent dentry.  Do this for the live and snapshotted namespaces. Do
not bother for the .snap dir contents, since we do not cache that.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-03 09:23:49 -07:00
Trond Myklebust 31cbecb4ab Merge branch 'osd-devel' into nfs-for-next 2011-11-02 23:56:40 -04:00
Boaz Harrosh 278c023a99 pnfs-obj: Support for RAID5 read-4-write interface.
The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:09 -04:00
Boaz Harrosh 04291b628c pnfs-obj: move to ore 03: Remove old raid engine
Finally remove all the old raid engine, which is by now
dead code.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:08 -04:00
Boaz Harrosh eecfc6312a pnfs-obj: move to ore 02: move to ORE
In this patch we are actually moving to the ORE.
(Object Raid Engine).

objio_state holds a pointer to an ore_io_state. Once
we have an ore_io_state at hand we can call the ore
for reading/writing. We register on the done path
to kick off the nfs io_done mechanism.

Again for Ease of reviewing the old code is "#if 0"
but is not removed so the diff command works better.
The old code will be removed in the next patch.

fs/exofs/Kconfig::ORE is modified to also be auto-included
if PNFS_OBJLAYOUT is set. Since we now depend on ORE.
(See comments in fs/exofs/Kconfig)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:08 -04:00
Boaz Harrosh af4f5b54bc pnfs-obj: move to ore 01: ore_layout & ore_components
For Ease of reviewing I split the move to ore into 3 parts
	move to ore 01: ore_layout & ore_components
	move to ore 02: move to ORE
	move to ore 03: Remove old raid engine

This patch modifies the objio_lseg, layout-segment level
and devices and components arrays to use the ORE types.

Though it will be removed soon, also the raid engine
is modified to actually compile, possibly run, with
the new types. So it is the same old raid engine but
with some new ORE types.

For Ease of reviewing, some of the old code is
"#if 0" but is not removed so the diff command works
better. The old code will be removed in the 3rd patch.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:07 -04:00
Boaz Harrosh e2e04355d9 pnfs-obj: Rename objlayout_io_state => objlayout_io_res
* All instances of objlayout_io_state => objlayout_io_res
* All instances of state => oir;
* All instances of ol_state => oir;

Big but nothing to it

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:06 -04:00
Boaz Harrosh 96218556b0 pnfs-obj: Get rid of objlayout_{alloc,free}_io_state
This is part of moving objio_osd to use the ORE.

objlayout_io_state had two functions:
1. It was used in the error reporting mechanism at layout_return.
   This function is kept intact.
   (Later patch will rename objlayout_io_state => objlayout_io_res)
2. Carrier of rw io members into the objio_read/write_paglist API.
   This is removed in this patch.

The {r,w}data received from NFS are passed directly to the
objio_{read,write}_paglist API. The io_engine is now allocating
it's own IO state as part of the read/write. The minimal
functionality that was part of the generic allocation is passed
to the io_engine.

So part of this patch is rename of:
	ios->ol_state.foo => ios->foo

At objlayout_{read,write}_done an objlayout_io_state is passed that
denotes the result of the IO. (Hence the later name change).
If the IO is successful objlayout calls an objio_free_result() API
immediately (Which for objio_osd causes the release of the io_state).
If the IO ended in an error it is hanged onto until reported in
layout_return and is released later through the objio_free_result()
API. (All this is not new just renamed and cleaned)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:05 -04:00
Boaz Harrosh e6c40fe3f4 pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist
objlayout driver was always returning PNFS_ATTEMPTED from it's
read/write_pagelist operations. Even on error. Fix that.

Start by establishing an error return API from io-engine, by
not returning ssize_t (length-or-error) but returning "int"
0=OK, 0>Error. And clean up all return types in io-engine.

Then if io-engine returned error return PNFS_NOT_ATTEMPTED
to generic layer. (With a dprint)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:03 -04:00
Boaz Harrosh 4cdc685c7d pnfs-obj: Remove redundant EOF from objlayout_io_state
The EOF calculation was done on .read_pagelist(), cached
in objlayout_io_state->eof, and set in objlayout_read_done()
into nfs_read_data->res.eof.

So set it directly into nfs_read_data->res.eof and avoid
the extra member.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:56:00 -04:00
Rakib Mullick 2b72c9ccd2 nfs: Remove unused variable from write.c
When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning.

	fs/nfs/write.c: In function ‘nfs_writeback_done’:
	fs/nfs/write.c:1246:21: warning: unused variable ‘server’

 Remove the variable 'server' to fix the above warning.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:50:24 -04:00
Rakib Mullick 6f276e49fd nfs: Fix unused variable warning from file.c
Fix the following unused variable warning.

fs/nfs/file.c: In function ‘nfs_file_release’:
fs/nfs/file.c:140:17: warning: unused variable ‘dentry’
fs/nfs/file.c: In function ‘nfs_file_read’:
fs/nfs/file.c:237:9: warning: unused variable ‘count’

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 23:49:09 -04:00
Linus Torvalds 06ef93e1b8 Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linux
* 'for-3.2' of git://linux-nfs.org/~bfields/linux:
  nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access
2011-11-02 16:54:36 -07:00
Linus Torvalds 092f4c56c1 Merge branch 'akpm' (Andrew's incoming - part two)
Says Andrew:

 "60 patches.  That's good enough for -rc1 I guess.  I have quite a lot
  of detritus to be rechecked, work through maintainers, etc.

 - most of the remains of MM
 - rtc
 - various misc
 - cgroups
 - memcg
 - cpusets
 - procfs
 - ipc
 - rapidio
 - sysctl
 - pps
 - w1
 - drivers/misc
 - aio"

* akpm: (60 commits)
  memcg: replace ss->id_lock with a rwlock
  aio: allocate kiocbs in batches
  drivers/misc/vmw_balloon.c: fix typo in code comment
  drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
  w1: disable irqs in critical section
  drivers/w1/w1_int.c: multiple masters used same init_name
  drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
  drivers/power/ds2780_battery.c: add a nolock function to w1 interface
  drivers/power/ds2780_battery.c: create central point for calling w1 interface
  w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
  pps gpio client: add missing dependency
  pps: new client driver using GPIO
  pps: default echo function
  include/linux/dma-mapping.h: add dma_zalloc_coherent()
  sysctl: make CONFIG_SYSCTL_SYSCALL default to n
  sysctl: add support for poll()
  RapidIO: documentation update
  drivers/net/rionet.c: fix ethernet address macros for LE platforms
  RapidIO: fix potential null deref in rio_setup_device()
  RapidIO: add mport driver for Tsi721 bridge
  ...
2011-11-02 16:07:27 -07:00
Jeff Moyer 080d676de0 aio: allocate kiocbs in batches
In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path.  The reason
is that we take it for every I/O submitted (see __aio_get_req).  Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.

In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by .56% (average of 3 runs).  The command I used to
test this was:

   aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:03 -07:00
Lucas De Marchi f1ecf06854 sysctl: add support for poll()
Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries.  This adds a infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.

[akpm@linux-foundation.org: s/declare/define/ for definitions]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:02 -07:00
Vasiliy Kulikov aa6afca5bc proc: fix races against execve() of /proc/PID/fd**
fd* files are restricted to the task's owner, and other users may not get
direct access to them.  But one may open any of these files and run any
setuid program, keeping opened file descriptors.  As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked.  It makes it possible to violate
procfs permission model.

Reading fdinfo/* may disclosure current fds' position and flags, reading
directory contents of fdinfo/ and fd/ may disclosure the number of opened
files by the target task.  This information is not sensible per se, but it
can reveal some private information (like length of a password stored in a
file) under certain conditions.

Used existing (un)lock_trace functions to check for ptrace_may_access(),
but instead of using EPERM return code from it use EACCES to be consistent
with existing proc_pid_follow_link()/proc_pid_readlink() return code.  If
they differ, attacker can guess what fds exist by analyzing stat() return
code.  Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
readdir() and lookup() for fd/ and fdinfo/.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:00 -07:00
Pavel Emelyanov 887df07891 procfs: report EISDIR when reading sysctl dirs in proc
On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:07:00 -07:00
Phillip Lougher 434a964daa hfs: fix hfs_find_init() sb->ext_tree NULL ptr oops
Clement Lecigne reports a filesystem which causes a kernel oops in
hfs_find_init() trying to dereference sb->ext_tree which is NULL.

This proves to be because the filesystem has a corrupted MDB extent
record, where the extents file does not fit into the first three extents
in the file record (the first blocks).

In hfs_get_block() when looking up the blocks for the extent file
(HFS_EXT_CNID), it fails the first blocks special case, and falls
through to the extent code (which ultimately calls hfs_find_init())
which is in the process of being initialised.

Hfs avoids this scenario by always having the extents b-tree fitting
into the first blocks (the extents B-tree can't have overflow extents).

The fix is to check at mount time that the B-tree fits into first
blocks, i.e.  fail if HFS_I(inode)->alloc_blocks >=
HFS_I(inode)->first_blocks

Note, the existing commit 47f365eb57 ("hfs: fix oops on mount with
corrupted btree extent records") becomes subsumed into this as a special
case, but only for the extents B-tree (HFS_EXT_CNID), it is perfectly
acceptable for the catalog B-Tree file to grow beyond three extents,
with the remaining extent descriptors in the extents overfow.

This fixes CVE-2011-2203

Reported-by: Clement LECIGNE <clement.lecigne@netasq.com>
Signed-off-by: Phillip Lougher <plougher@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:59 -07:00
Namjae Jeon 3069083cc8 isofs: add readpages support
Use mpage_readpages() instead of multiple calls to isofs_readpage() to
reduce the CPU utilization and make performance higher.

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:59 -07:00
Richard Weinberger 0620d9193c ramfs: remove module leftovers
Since ramfs is hard-selected to "y", the module leftovers make no sense.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:58 -07:00
Jiri Kosina a3defbe5c3 binfmt_elf: fix PIE execution with randomization disabled
The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().

Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.

Based on original patch by H.J. Lu and Josh Boyer.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:58 -07:00
Linus Torvalds d211858837 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue:
  vfs: add d_prune dentry operation
  vfs: protect i_nlink
  filesystems: add set_nlink()
  filesystems: add missing nlink wrappers
  logfs: remove unnecessary nlink setting
  ocfs2: remove unnecessary nlink setting
  jfs: remove unnecessary nlink setting
  hypfs: remove unnecessary nlink setting
  vfs: ignore error on forced remount
  readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
  vfs: fix dentry leak in simple_fill_super()
2011-11-02 11:41:01 -07:00
Phillip Lougher 7657cacf47 Squashfs: Add an option to set dev block size to 4K
This commit adds an option to set the device block size used to 4K.

By default Squashfs sets the device block size (sb_min_blocksize) to 1K
or the smallest block size supported by the block device (if larger).
This, because blocks are packed together and unaligned in Squashfs,
should reduce latency.

This, however, gives poor performance on MTD NAND devices where
the optimal I/O size is 4K (even though the devices can support
smaller block sizes).

Using a 4K device block size may also improve overall I/O
performance for some file access patterns (e.g. sequential
accesses of files in filesystem order) on all media.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2011-11-02 17:25:50 +00:00
Linus Torvalds f1f8935a5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
  jbd2: Unify log messages in jbd2 code
  jbd/jbd2: validate sb->s_first in journal_get_superblock()
  ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
  ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
  ext4: fix a typo in struct ext4_allocation_context
  ext4: Don't normalize an falloc request if it can fit in 1 extent.
  ext4: remove comments about extent mount option in ext4_new_inode()
  ext4: let ext4_discard_partial_buffers handle unaligned range correctly
  ext4: return ENOMEM if find_or_create_pages fails
  ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
  ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
  ext4: optimize locking for end_io extent conversion
  ext4: remove unnecessary call to waitqueue_active()
  ext4: Use correct locking for ext4_end_io_nolock()
  ext4: fix race in xattr block allocation path
  ext4: trace punch_hole correctly in ext4_ext_map_blocks
  ext4: clean up AGGRESSIVE_TEST code
  ext4: move variables to their scope
  ext4: fix quota accounting during migration
  ext4: migrate cleanup
  ...
2011-11-02 10:06:20 -07:00
Linus Torvalds 34116645d9 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Cleanup metadata flags handling
  udf: Skip mirror metadata FE loading when metadata FE is ok
  ext3: Allow quota file use root reservation
  udf: Remove web reference from UDF MAINTAINERS entry
  quota: Drop path reference on error exit from quotactl
  udf: Neaten udf_debug uses
  udf: Neaten logging output, use vsprintf extension %pV
  udf: Convert printks to pr_<level>
  udf: Rename udf_warning to udf_warn
  udf: Rename udf_error to udf_err
  udf: Promote some debugging messages to udf_error
  ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY.
  udf: Add readpages support for udf.
  ext3/balloc.c: local functions should be static
  ext2: fix the outdated comment in ext2_nfs_get_inode()
  ext3: remove deprecated oldalloc
  fs/ext3/balloc.c: delete useless initialization
  fs/ext2/balloc.c: delete useless initialization
  ext3: fix message in ext3_remount for rw-remount case
  ext3: Remove i_mutex from ext3_sync_file()

Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h
2011-11-02 10:05:22 -07:00
Linus Torvalds de0a5345a5 Merge branch 'for-linus' of git://github.com/richardweinberger/linux
* 'for-linus' of git://github.com/richardweinberger/linux: (90 commits)
  um: fix ubd cow size
  um: Fix kmalloc argument order in um/vdso/vma.c
  um: switch to use of drivers/Kconfig
  UserModeLinux-HOWTO.txt: fix a typo
  UserModeLinux-HOWTO.txt: remove ^H characters
  um: we need sys/user.h only on i386
  um: merge delay_{32,64}.c
  um: distribute exports to where exported stuff is defined
  um: kill system-um.h
  um: generic ftrace.h will do...
  um: segment.h is x86-only and needed only there
  um: asm/pda.h is not needed anymore
  um: hw_irq.h can go generic as well
  um: switch to generic-y
  um: clean Kconfig up a bit
  um: a couple of missing dependencies...
  um: kill useless argument of free_chan() and free_one_chan()
  um: unify ptrace_user.h
  um: unify KSTK_...
  um: fix gcov build breakage
  ...
2011-11-02 09:45:39 -07:00
Al Viro 548fd1e8db um: kill useless include of user.h
everything in USER_OBJ gets it via -include user.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
2011-11-02 14:15:13 +01:00
Sage Weil f0023bc617 vfs: add d_prune dentry operation
This adds a d_prune dentry operation that is called by the VFS prior to
pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
are about to unhash and kill the dentry.

This will be used by Ceph to maintain a flag indicating whether the
complete contents of a directory are contained in the dcache, allowing it
to satisfy lookups and readdir without addition server communication.

Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
DCACHE_OP_ bits.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Miklos Szeredi a78ef704a8 vfs: protect i_nlink
Prevent direct modification of i_nlink by making it const and adding a
non-const __i_nlink alias.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Miklos Szeredi bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Miklos Szeredi 6d6b77f163 filesystems: add missing nlink wrappers
Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-11-02 12:53:43 +01:00
Miklos Szeredi dd2a981f46 logfs: remove unnecessary nlink setting
alloc_inode() initializes i_nlink to 1.  Remove unnecessary
re-initialization.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Joern Engel <joern@logfs.org>
CC: Prasad Joshi <prasadjoshi.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Miklos Szeredi a7732b05f7 ocfs2: remove unnecessary nlink setting
alloc_inode() initializes i_nlink to 1.  Remove unnecessary
re-initialization.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Joel Becker <jlbec@evilplan.org>
CC: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Miklos Szeredi c818b682c9 jfs: remove unnecessary nlink setting
alloc_inode() initializes i_nlink to 1.  Remove unnecessary
re-initialization.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Miklos Szeredi 2833eb2b46 vfs: ignore error on forced remount
On emergency remount we want to force MS_RDONLY on the super block
even if ->remount_fs() failed for some reason.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Andy Whitcroft 1fa1e7f615 readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
Since the commit below which added O_PATH support to the *at() calls, the
error return for readlink/readlinkat for the empty pathname has switched
from ENOENT to EINVAL:

  commit 65cfc67223
  Author: Al Viro <viro@zeniv.linux.org.uk>
  Date:   Sun Mar 13 15:56:26 2011 -0400

    readlinkat(), fchownat() and fstatat() with empty relative pathnames

This is both unexpected for userspace and makes readlink/readlinkat
inconsistant with all other interfaces; and inconsistant with our stated
return for these pathnames.

As the readlinkat call does not have a flags parameter we cannot use the
AT_EMPTY_PATH approach used in the other calls.  Therefore expose whether
the original path is infact entry via a new user_path_at_empty() path
lookup function.  Use this to determine whether to default to EINVAL or
ENOENT for failures.

Addresses http://bugs.launchpad.net/bugs/817187

[akpm@linux-foundation.org: remove unused getname_flags()]
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Konstantin Khlebnikov 32096ea1aa vfs: fix dentry leak in simple_fill_super()
put dentry if inode allocation failed, d_genocide() cannot release it

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:42 +01:00
Eryu Guan f2a44523b2 jbd2: Unify log messages in jbd2 code
Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the
same time other jbd2 code prints with "JBD: " prefix. Unify the prefix
to "JBD2: ".

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01 19:09:18 -04:00
Eryu Guan 8762202dd0 jbd/jbd2: validate sb->s_first in journal_get_superblock()
I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when
mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3
image has s_first = 0 in journal superblock, and the 0 is passed to
journal->j_head in journal_reset(), then to blocknr in
cleanup_journal_tail(), in the end the J_ASSERT failed.

So validate s_first after reading journal superblock from disk in
journal_get_superblock() to ensure s_first is valid.

The following script could reproduce it:

fstype=ext3
blocksize=1024
img=$fstype.img
offset=0
found=0
magic="c0 3b 39 98"

dd if=/dev/zero of=$img bs=1M count=8
mkfs -t $fstype -b $blocksize -F $img
filesize=`stat -c %s $img`
while [ $offset -lt $filesize ]
do
        if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then
                echo "Found journal: $offset"
                found=1
                break
        fi
        offset=`echo "$offset+$blocksize" | bc`
done

if [ $found -ne 1 ];then
        echo "Magic \"$magic\" not found"
        exit 1
fi

dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1

mkdir -p ./mnt
mount -o loop $img ./mnt

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01 19:04:59 -04:00
Yongqiang Yang bf52c6f7af ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
The variable 'block' is removed by commit 750c9c47, so use the
replacement ex_ee_block instead.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01 18:59:26 -04:00
Yongqiang Yang 32de675690 ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
This patch fixes a syntax error which omits a comma. Besides this,
logical block number is unsigend 32 bits, so printk should use %u
instead %d.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-01 18:56:41 -04:00
Benny Halevy fc0d14fe2d nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-11-01 18:06:43 -04:00
Linus Torvalds 1c39865151 Merge branch 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
* 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: make pstore write function return normal success/fail value
  pstore: change mutex locking to spin_locks
  pstore: defer inserting OOPS entries into pstore
2011-11-01 10:52:29 -07:00
Eric W. Biederman f6d90b4f9c sysfs: Make sysfs_rename safe with sysfs_dirents in rbtrees.
In sysfs_rename we need to remove the optimization of not calling
sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent
directory is not changing.  This optimization is no longer valid now
that sysfs dirents are stored in an rbtree sorted by name.

Move the assignment of s_ns before the call of sysfs_link_sibling.  With
no sysfs_dirent fields changing after the call of sysfs_link_sibling
this allows sysfs_link_sibling to take any of the directory entries into
account when it builds the rbtrees, and s_ns looks like a prime canidate
to be used in the rbtree in the future.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Greg KH <gregkh@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-01 09:16:14 -07:00
Nelson Elhage d8805e633e epoll: fix spurious lockdep warnings
epoll can acquire recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd.  This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings.  Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.

Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:

--------------------8<--------------------

int main(void) {
   int e1, e2;
   struct epoll_event evt = {
       .events = EPOLLIN
   };

   e1 = epoll_create1(0);
   e2 = epoll_create1(0);
   epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
   return 0;
}
--------------------8<--------------------

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Tested-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nelson Elhage <nelhage@nelhage.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:57 -07:00
Andy Shevchenko 0a90e0f101 fat: follow rename pack_hex_byte() to hex_byte_pack()
There is no functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:57 -07:00
Joe Perches b9075fa968 treewide: use __printf not __attribute__((format(printf,...)))
Standardize the style for compiler based printf format verification.
Standardized the location of __printf too.

Done via script and a little typing.

$ grep -rPl --include=*.[ch] -w "__attribute__" * | \
  grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
  xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

[akpm@linux-foundation.org: revert arch bits]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:54 -07:00
Pavel Emelyanov d70ef97baf fs/pipe.c: add ->statfs callback for pipefs
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:51 -07:00
Tao Ma 72a2ebd8bc fs/buffer.c: add device information for error output in __find_get_block_slow()
On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.

If the device information is given, we can know the name of the sick
volume.  Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.

[1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:49 -07:00
Mikulas Patocka 09f363c736 vmscan: fix shrinker callback bug in fs/super.c
The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:49 -07:00
Akinobu Mita 798248206b lib/string.c: introduce memchr_inv()
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.

The function name and prototype are stolen from logfs and the
implementation is from SLUB.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:47 -07:00
Mel Gorman 966dbde2c2 ext4: warn if direct reclaim tries to writeback pages
Direct reclaim should never writeback pages.  Warn if an attempt is made.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:46 -07:00
Mel Gorman 94054fa3fc xfs: warn if direct reclaim tries to writeback pages
Direct reclaim should never writeback pages.  For now, handle the
situation and warn about it.  Ultimately, this will be a BUG_ON.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:46 -07:00
Christoph Lameter bc3e53f682 mm: distinguish between mlocked and pinned pages
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them seperately.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:46 -07:00
Robert P. J. Day f5fc870da2 tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious.
Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:45 -07:00
David Rientjes c9f01245b6 oom: remove oom_disable_count
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy.  The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing.  The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the
   victim is waiting on that thread to exit and will actually cause
   future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just
   because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:45 -07:00