Commit Graph

274 Commits

Author SHA1 Message Date
Al Viro e407699ef5 btrfs, nfs, apparmor: don't pull mnt_namespace.h for no reason...
it's not needed anymore; we used to, back when we had to do
mount_subtree() by hand, complete with put_mnt_ns() in it.
No more...  Apparmor didn't need it since the __d_path() fix.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:38 -05:00
Linus Torvalds c9a7fe9672 Merge branches 'for-linus' and 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: unplug every once and a while
  Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
  Btrfs: only set cache_generation if we setup the block group
  Btrfs: don't panic if orphan item already exists
  Btrfs: fix leaked space in truncate
  Btrfs: fix how we do delalloc reservations and how we free reservations on error
  Btrfs: deal with enospc from dirtying inodes properly
  Btrfs: fix num_workers_starting bug and other bugs in async thread
  BTRFS: Establish i_ops before calling d_instantiate
  Btrfs: add a cond_resched() into the worker loop
  Btrfs: fix ctime update of on-disk inode
  btrfs: keep orphans for subvolume deletion
  Btrfs: fix inaccurate available space on raid0 profile
  Btrfs: fix wrong disk space information of the files
  Btrfs: fix wrong i_size when truncating a file to a larger size
  Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror

* 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: lower the dirty balance poll interval
2011-12-16 12:15:50 -08:00
Chris Mason 567a45e917 Merge branch 'for-chris' of http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration
Conflicts:
	fs/btrfs/inode.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 13:43:49 -05:00
Josef Bacik 22c44fe65a Btrfs: deal with enospc from dirtying inodes properly
Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC.  This needs to be fixed in a few areas

1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.

2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
setting ->setattr for symlinks, even though we should have been.  This catches
one of the cases where we were getting errors in mark_inode_dirty.

3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty.  This lets us return errors properly for truncate
and chown/anything related to setattr.

4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one.  The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.

With this patch xfstests 83 complains a handful of times instead of hundreds of
times.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:21 -05:00
Miao Xie 39fb26c398 Btrfs: fix inaccurate available space on raid0 profile
When we use raid0 as the data profile, df command may show us a very
inaccurate value of the available space, which may be much less than the
real one. It may make the users puzzled. Fix it by changing the calculation
of the available space, and making it be more similar to a fake chunk
allocation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:36 -05:00
Linus Torvalds b930c26416 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix meta data raid-repair merge problem
  Btrfs: skip allocation attempt from empty cluster
  Btrfs: skip block groups without enough space for a cluster
  Btrfs: start search for new cluster at the beginning
  Btrfs: reset cluster's max_size when creating bitmap
  Btrfs: initialize new bitmaps' list
  Btrfs: fix oops when calling statfs on readonly device
  Btrfs: Don't error on resizing FS to same size
  Btrfs: fix deadlock on metadata reservation when evicting a inode
  Fix URL of btrfs-progs git repository in docs
  btrfs scrub: handle -ENOMEM from init_ipath()
2011-12-01 08:28:53 -08:00
Li Zefan b772a86ea6 Btrfs: fix oops when calling statfs on readonly device
To reproduce this bug:

  # dd if=/dev/zero of=img bs=1M count=256
  # mkfs.btrfs img
  # losetup -r /dev/loop1 img
  # mount /dev/loop1 /mnt
  OOPS!!

It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

To fix this, instead of checking write-only devices, we check all open
deivces:

  # df -h /dev/loop1
  Filesystem            Size  Used Avail Use% Mounted on
  /dev/loop1            250M   28K  238M   1% /mnt

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-11-30 18:46:05 +01:00
Al Viro ea441d1104 new helper: mount_subtree()
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).

btrfs and nfs switched to it instead of open-coding the sucker.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 22:00:34 -05:00
Al Viro c133449587 switch create_mnt_ns() to saner calling conventions, fix double mntput() in nfs
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:12:14 -05:00
Al Viro 8d514bbf37 btrfs: fix double mntput() in mount_subvol()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:06:09 -05:00
David Sterba 8965593e41 btrfs: rename the option to nospace_cache
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.

The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11 10:14:57 -05:00
Ilya Dryomov 04d21a244f Btrfs: rework error handling in btrfs_mount()
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.

Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:39 +02:00
Ilya Dryomov f23c8af8ca Btrfs: fix subvol_name leak on error in btrfs_mount()
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:

mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>

So free subvol_name string on error.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov a90e8b6fb8 Btrfs: fix memory leak in btrfs_parse_early_options()
Don't leak subvol_name string in case multiple subvol= options are
given.  "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
slyich@gmail.com 45ea6095c8 btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:

Complete reproducer under usermode linux (discovered on real
machine):

    bdev=/dev/ubda
    btr_root=/btr
    /mkfs.btrfs $bdev
    mount $bdev $btr_root
    mkdir $btr_root/subvols/
    cd $btr_root/subvols/
    /btrfs su cr foo
    /btrfs su cr bar
    mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
    umount $btr_root/subvols/bar

which gives

device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------

INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948:  [<6008c522>] print_trailer+0xe2/0x130
70b31978:  [<6008c5aa>] object_err+0x3a/0x50
70b319a8:  [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0:  [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8:  [<6008e5c1>] __slab_free+0x221/0x2d0

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-07 16:08:01 -05:00
Chris Mason af31f5e5b8 Btrfs: add a log of past tree roots
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.

It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:15 -05:00
David Sterba 6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
David Sterba f9d9ef62cd btrfs: do not allow mounting non-subvolumes via subvol option
There's a missing test whether the path passed to subvol=path option
during mount is a real subvolume, allowing any directory located in
default subovlume to be passed and accepted for mount.

(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume

(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'

$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap

The problem I see is that this makes a false impression of snapshotting the
given subvolume but in fact snapshots the default one: a user expects outcome
like ID 263 but in fact gets ID 262 .

This patch makes mount fail with EINVAL with a message in syslog.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-24 14:43:25 +02:00
Ilya Dryomov 5f524444c3 Btrfs: fix a bug when opening seed devices
Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:20:36 +02:00
Jeff Liu 83c8c9bde0 btrfs: trivial fix, a potential memory leak in btrfs_parse_early_options()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2011-10-20 18:10:38 +02:00
Josef Bacik 73bc187680 Btrfs: introduce mount option no_space_cache
Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option.  Before we check the super blocks cache generation field and if it was
populated we always turned space caching on.  Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it get's turned off.  Then we check the mount option all the places we do
the caching work instead of checking the super's cache generation.  This makes
things more consistent and lets us turn space caching off.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:51 -04:00
Josef Bacik 830c4adbd0 Btrfs: fix how we mount subvol=<whatever>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default.  This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:29 -04:00
Josef Bacik ba5b8958da Btrfs: use d_obtain_alias when mounting subvol/subvolid
Currently what we do is just wrong.  We either

1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue.  Also we don't check the return value of d_splice_alias which isn't good
either.

or

2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.

So use d_obtain_alias().  In the case that we already have the inode/dentry in
cache we will get the correct dentry.  If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:29 -04:00
Linus Torvalds 1acc9309eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  btrfs: fix oops when doing space balance
  Btrfs: don't panic if we get an error while balancing V2
  btrfs: add missing options displayed in mount output
2011-07-08 23:25:45 -07:00
David Sterba 0942caa373 btrfs: add missing options displayed in mount output
There are three missed mount options settable by user which are not
currently displayed in mount output.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-06 18:46:43 -04:00
Linus Torvalds 8397345172 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
  lmLogOpen() broken failure exit
  usb: remove bad dput after dentry_unhash
  more conservative S_NOSEC handling
2011-06-07 18:36:59 -07:00
Linus Torvalds e6ece70732 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  btrfs: fix uninitialized variable warning
  btrfs: add helper for fs_info->closing
  Btrfs: add mount -o inode_cache
  btrfs: scrub: add explicit plugging
  btrfs: use btrfs_ino to access inode number
  Btrfs: don't save the inode cache if we are deleting this root
  btrfs: false BUG_ON when degraded
  Btrfs: don't save the inode cache in non-FS roots
  Btrfs: make sure we don't overflow the free space cache crc page
  Btrfs: fix uninit variable in the delayed inode code
  btrfs: scrub: don't reuse bios and pages
  Btrfs: leave spinning on lookup and map the leaf
  Btrfs: check for duplicate entries in the free space cache
  Btrfs: don't try to allocate from a block group that doesn't have enough space
  Btrfs: don't always do readahead
  Btrfs: try not to sleep as much when doing slow caching
  Btrfs: kill BTRFS_I(inode)->block_group
  Btrfs: don't look at the extent buffer level 3 times in a row
  Btrfs: map the node block when looking for readahead targets
  Btrfs: set range_start to the right start in count_range_bits
  ...
2011-06-05 06:17:23 +09:00
Chris Mason 4b9465cb9e Btrfs: add mount -o inode_cache
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:47 -04:00
Al Viro 9e1f1de02c more conservative S_NOSEC handling
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.

AFAICS, the only reasonably sane way to deal with that is
	* new superblock flag; unless set, S_NOSEC is not going to be set.
	* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.

It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-03 18:24:58 -04:00
Linus Torvalds a0c3061093 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
  Btrfs: use the device_list_mutex during write_dev_supers
  Btrfs: setup free ino caching in a more asynchronous way
  btrfs scrub: don't coalesce pages that are logically discontiguous
  Btrfs: return -ENOMEM in clear_extent_bit
  Btrfs: add mount -o auto_defrag
  Btrfs: using rcu lock in the reader side of devices list
  Btrfs: drop unnecessary device lock
  Btrfs: fix the race between remove dev and alloc chunk
  Btrfs: fix the race between reading and updating devices
  Btrfs: fix bh leak on __btrfs_open_devices path
  Btrfs: fix unsafe usage of merge_state
  Btrfs: allocate extent state and check the result properly
  fs/btrfs: Add missing btrfs_free_path
  Btrfs: check return value of btrfs_inc_extent_ref()
  Btrfs: return error to caller if read_one_inode() fails
  Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
  Btrfs: return error code to caller when btrfs_del_item fails
  Btrfs: return error code to caller when btrfs_previous_item fails
  btrfs: fix typo 'testeing' -> 'testing'
  btrfs: typo: 'btrfS' -> 'btrfs'
  ...
2011-05-27 13:57:12 -07:00
Chris Mason 4cb5300bc8 Btrfs: add mount -o auto_defrag
This will detect small random writes into files and
queue the up for an auto defrag process.  It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-26 17:52:15 -04:00
Dan Magenheimer 90a887c9a2 btrfs: add cleancache support
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs.  Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted.  Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:56 -06:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Julia Lawall b083916638 fs/btrfs: Add missing btrfs_free_path
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@

x = btrfs_alloc_path@p1(...)
...  when != btrfs_free_path(x,...)
     when != if (...) { ... btrfs_free_path(x,...) ...}
     when != x = ra
if(...) { ... when != x = rb
     when forall
     when != btrfs_free_path(x,...)
 \(return <+...x...+>; \| return@p2...; \) }

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Chris Mason aa2dfb372a Merge branch 'allocator' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:36:34 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Miao Xie 16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Arne Jansen bcd53741cc btrfs: move btrfs_cmp_device_free_bytes to super.c
this function won't be used here anymore, so move it super.c where it is
used for df-calculation
2011-05-13 15:36:05 +02:00
David Sterba 306e16ce13 btrfs: rename variables clashing with global function names
reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Xin Zhong e15d054242 Btrfs: fix subvolume mount by name problem when default mount subvolume is set
We create two subvolumes (meego_root and meego_home) in
btrfs root directory. And set meego_root as default mount
subvolume. After we remount btrfs, meego_root is mounted
to top directory by default. Then when we try to mount
meego_home (subvol=meego_home) to a subdirectory, it failed.
The problem is when default mount subvolume is set to
meego_root, we search meego_home in meego_root but can not find
it. So the solution is to add a new mount option (subvolrootid)
to specify subvol id of root and search subvol name in it. For
our case, now we can use "-o subvolrootid=0,subvol=meego_home)
to mount meego_home.

Detail information can be found in meego bugzilla:
https://bugs.meego.com/show_bug.cgi?id=15055

Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:26:50 -04:00
Tsutomu Itoh 200da64e0b Btrfs: fix /proc/mounts info.
Some mount options are not displayed by /proc/mounts.
This patch displays the option such as compress_type by /proc/mounts.

Ex.
  [before]
    $ mount | grep sdc2
    /dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
    $ cat /proc/mounts | grep sdc2
    /dev/sdc2 /test12 btrfs rw,relatime,compress 0 0

  [after]
    $ mount | grep sdc2
    /dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
    $ cat /proc/mounts | grep sdc2
    /dev/sdc2 /test12 btrfs rw,relatime,compress=lzo,space_cache 0 0

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:19:41 -04:00
liubo 1abe9b8a13 Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
              dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
              dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
        btrfs_ordered_extent_add
        btrfs_ordered_extent_remove
        btrfs_ordered_extent_start
        btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
        btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
        __extent_writepage
        btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
        btrfs_inode_new
        btrfs_inode_request
        btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
        btrfs_sync_file
        btrfs_sync_fs

These show sync arguments.

6) transaction:
        btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:33 -04:00
Linus Torvalds 4660ba63f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix fiemap bugs with delalloc
  Btrfs: set FMODE_EXCL in btrfs_device->mode
  Btrfs: make btrfs_rm_device() fail gracefully
  Btrfs: Avoid accessing unmapped kernel address
  Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
  Btrfs: allow balance to explicitly allocate chunks as it relocates
  Btrfs: put ENOSPC debugging under a mount option
2011-02-25 14:03:39 -08:00
Chris Mason 91435650c2 Btrfs: put ENOSPC debugging under a mount option
ENOSPC in btrfs is getting to the point where the extra debugging isn't
required.  I've put it under mount -o enospc_debug just in case someone
is having difficult problems.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-16 15:28:36 -05:00
Linus Torvalds cb5520f02c Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
  Btrfs: Fix page count calculation
  btrfs: Drop __exit attribute on btrfs_exit_compress
  btrfs: cleanup error handling in btrfs_unlink_inode()
  Btrfs: exclude super blocks when we read in block groups
  Btrfs: make sure search_bitmap finds something in remove_from_bitmap
  btrfs: fix return value check of btrfs_start_transaction()
  btrfs: checking NULL or not in some functions
  Btrfs: avoid uninit variable warnings in ordered-data.c
  Btrfs: catch errors from btrfs_sync_log
  Btrfs: make shrink_delalloc a little friendlier
  Btrfs: handle no memory properly in prepare_pages
  Btrfs: do error checking in btrfs_del_csums
  Btrfs: use the global block reserve if we cannot reserve space
  Btrfs: do not release more reserved bytes to the global_block_rsv than we need
  Btrfs: fix check_path_shared so it returns the right value
  btrfs: check return value of btrfs_start_ioctl_transaction() properly
  btrfs: fix return value check of btrfs_join_transaction()
  fs/btrfs/inode.c: Add missing IS_ERR test
  btrfs: fix missing break in switch phrase
  btrfs: fix several uncheck memory allocations
  ...
2011-02-07 14:06:18 -08:00
Tsutomu Itoh 98d5dc13e7 btrfs: fix return value check of btrfs_start_transaction()
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:17:27 -05:00
Chris Mason eab49bec41 Merge branch 'bug-fixes' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-28 16:24:59 -05:00
Tero Roponen 3f3d0bc0df Btrfs: Free correct pointer after using strsep
We must save and free the original kstrdup()'ed pointer
because strsep() modifies its first argument.

Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:11 +08:00
Ian Kent bdc924bb4c Btrfs: Fix memory leak on finding existing super
We missed a memory deallocation in commit 450ba0ea.

If an existing super block is found at mount and there is no
error condition then the pre-allocated tree_root and fs_info
are no not used and are not freeded.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:07 +08:00
Linus Torvalds eee2a817df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  Btrfs: forced readonly mounts on errors
  btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
  Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
  btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
  btrfs: check NULL or not
  btrfs: Don't pass NULL ptr to func that may deref it.
  btrfs: mount failure return value fix
  btrfs: Mem leak in btrfs_get_acl()
  btrfs: fix wrong free space information of btrfs
  btrfs: make the chunk allocator utilize the devices better
  btrfs: restructure find_free_dev_extent()
  btrfs: fix wrong calculation of stripe size
  btrfs: try to reclaim some space when chunk allocation fails
  btrfs: fix wrong data space statistics
  fs/btrfs: Fix build of ctree
  Btrfs: fix off by one while setting block groups readonly
  Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
  Btrfs: Add readonly snapshots support
  Btrfs: Refactor btrfs_ioctl_snap_create()
  btrfs: Extract duplicate decompress code
  ...
2011-01-17 14:43:43 -08:00