Commit Graph

469435 Commits

Author SHA1 Message Date
Qu Wenruo e6c4efd87a btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:14:46 -07:00
Liu Bo 4d1a40c66b Btrfs: fix up bounds checking in lseek
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END
allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't
prepare for that and convert the negative @offset into unsigned type,
so we get (end < start) warning.

[ 1269.835374] ------------[ cut here ]------------
[ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
[ 1269.838816] BTRFS: end < start 4094 18446744073709551615
[ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G        W      3.16.0+ #306
[ 1269.858229] Call Trace:
[ 1269.858612]  [<ffffffff81801a69>] dump_stack+0x4e/0x68
[ 1269.858952]  [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
[ 1269.859416]  [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
[ 1269.859929]  [<ffffffff813b0fbd>] insert_state+0x11d/0x140
[ 1269.860409]  [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
[ 1269.860805]  [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
[ 1269.861697]  [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
[ 1269.862168]  [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
[ 1269.862620]  [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
[ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This assumes that btrfs starts finding DATA/HOLE from the beginning of file
if the assigned @offset is negative.

Also we add alignment for lock_extent_bits 's range.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:46:30 -07:00
Miao Xie f612496bca Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:02 -07:00
Miao Xie 8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie 28e1cc7d1b Btrfs: Set real mirror number for read operation on RAID0/5/6
We need real mirror number for RAID0/5/6 when reading data, or if read error
happens, we would pass 0 as the number of the mirror on which the io error
happens. It is wrong and would cause the filesystem read the data from the
corrupted mirror again.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:00 -07:00
Miao Xie 1203b6813e Btrfs: modify clean_io_failure and make it suit direct io
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:59 -07:00
Miao Xie ffdd2018dd Btrfs: modify repair_io_failure and make it suit direct io
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:58 -07:00
Miao Xie 2fe6303e7c Btrfs: split bio_readpage_error into several functions
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:56 -07:00
Miao Xie 454ff3de42 Btrfs: Cleanup unused variant and argument of IO failure handlers
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:55 -07:00
Miao Xie 6c387ab20d Btrfs: fix missing error handler if submiting re-read bio fails
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:54 -07:00
Miao Xie c1dc08967f Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:53 -07:00
Miao Xie dc380aea5f Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:52 -07:00
Miao Xie 23ea8e5a07 Btrfs: load checksum data once when submitting a direct read io
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:50 -07:00
Miao Xie c3929c3624 Btrfs: modify rw_devices counter under chunk_mutex context
rw_devices counter is often used to tune the profile when doing chunk allocation,
so we should modify it under the chunk_mutex context to avoid getting wrong
chunk profile.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:49 -07:00
Miao Xie 5f37583569 Btrfs: move the missing device to its own fs device list
For a missing device, we don't know it belong to which fs before we read its
fsid from the chunk tree. So we add them into the current fs device list at first.
When we get its fsid, we should move them to their own fs device list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:48 -07:00
Miao Xie 416d7b802a Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs
When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:47 -07:00
Miao Xie 82372bc816 Btrfs: make the logic of source device removing more clear
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:46 -07:00
Miao Xie 67a2c45ee7 Btrfs: fix use-after-free problem of the device during device replace
The problem is:
	Task0(device scan task)		Task1(device replace task)
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
	check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:44 -07:00
Miao Xie adbbb8631b Btrfs: fix unprotected device list access when cloning fs devices
We can build a new filesystem based a seed filesystem, and we need clone
the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list which was being changed when we clone the fs devices.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:43 -07:00
Miao Xie 2196d6e8a7 Btrfs: Fix misuse of chunk mutex
There were several problems about chunk mutex usage:
- Lock chunk mutex when updating metadata. It would cause the nested
  deadlock because updating metadata might need allocate new chunks
  that need acquire chunk mutex. We remove chunk mutex at this case,
  because b-tree lock and other lock mechanism can help us.
- ABBA deadlock occured between device_list_mutex and chunk_mutex.
  When we update device status, we must acquire device_list_mutex at the
  beginning, and then we might get chunk_mutex during the device status
  update because we need allocate new chunks for metadata COW. But at
  most place, we acquire chunk_mutex at first and then acquire device list
  mutex. We need change the lock order.
- Some place we needn't acquire chunk_mutex. For example we needn't get
  chunk_mutex when we free a empty seed fs_devices structure.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:42 -07:00
Miao Xie 15484377f5 Btrfs: fix unprotected device list access when getting the fs information
When we get the fs information, we forgot to acquire the mutex of device list,
it might cause the problem we might access a device that was removed. Fix
it by acquiring the device list mutex.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:41 -07:00
Miao Xie fe48a5c00f Btrfs: fix unprotected system chunk array insertion
We didn't protect the system chunk array when we added a new
system chunk into it, it would cause the array be corrupted
if someone remove/add some system chunk into array at the same
time. Fix it by chunk lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:40 -07:00
Miao Xie 7cc8e58d53 Btrfs: fix unprotected device's variants on 32bits machine
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:38 -07:00
Miao Xie 1c1161870c Btrfs: update free_chunk_space during allocting a new chunk
We should update free_chunk_space in time when we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space data to calculate the reserved
space, if we don't update it in time, we would consider the disk space which
has be allocated as free space, and would use it to do overcommit reservation.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:37 -07:00
Miao Xie 43530c46cc Btrfs: fix unprotected device->bytes_used update
We should update device->bytes_used in the lock context of
chunk_mutex, or we would get wrong data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:36 -07:00
Miao Xie 5d778aaeb0 Btrfs: Fix wrong free_chunk_space assignment during removing a device
During removing a device, we have modified free_chunk_space when we
shrink the device, so we needn't assign a new value to it after
the device shrink. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:35 -07:00
Miao Xie ce7213c70c Btrfs: fix wrong device bytes_used in the super block
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:34 -07:00
Miao Xie 935e5cc935 Btrfs: fix wrong disk size when writing super blocks
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:33 -07:00
Miao Xie 1c43366d3b Btrfs: fix unprotected assignment of the target device
We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:31 -07:00
Miao Xie c7662111c7 Btrfs: cleanup double assignment of device->bytes_used when device replace finishes
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:30 -07:00
Miao Xie 90180da42c Btrfs: cleanup unused num_can_discard in fs_devices
The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:29 -07:00
Li RongQing 82f70d62f7 btrfs: remove the wrong comments
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:28 -07:00
Filipe Manana a2cc11db24 Btrfs: fix directory recovery from fsync log
When replaying a directory from the fsync log, if a directory entry
exists both in the fs/subvol tree and in the log, the directory's inode
got its i_size updated incorrectly, accounting for the dentry's name
twice.

Reproducer, from a test for xfstests:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    touch $SCRATCH_MNT/foo
    sync

    touch $SCRATCH_MNT/bar
    xfs_io -c "fsync" $SCRATCH_MNT
    xfs_io -c "fsync" $SCRATCH_MNT/bar

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
    [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"

    _unmount_flakey
    _check_scratch_fs $FLAKEY_DEV

The filesystem check at the end failed with the message:
"root 5 root dir 256 error".

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:27 -07:00
Liu Bo 25ce459c1a Btrfs: fix loop writing of async reclaim
One of my tests shows that when we really don't have space to reclaim via
flush_space and also run out of space, this async reclaim work loops on adding
itself into the workqueue and keeps writing something to disk according to
iostat's results, and these writes mainly comes from commit_transaction which
writes super_block.  This's unacceptable as it can be bad to disks, especially
memeory storages.

This adds a check to avoid the above situation.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:25 -07:00
Josef Bacik dc046b10c8 Btrfs: make fiemap not blow when you have lots of snapshots
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared.  This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you.  So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have.  This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:24 -07:00
Filipe Manana 78a017a2c9 Btrfs: add missing compression property remove in btrfs_ioctl_setflags
The behaviour of a 'chattr -c' consists of getting the current flags,
clearing the FS_COMPR_FL bit and then sending the result to the set
flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
passed to the ioctl. This results in the compression property not being
cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
was set in the received flags.

Reproducer:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt && cd /mnt
    $ mkdir a
    $ chattr +c a
    $ touch a/file
    $ lsattr a/file
    --------c------- a/file
    $ chattr -c a
    $ touch a/file2
    $ lsattr a/file2
    --------c------- a/file2
    $ lsattr -d a
    ---------------- a

Reported-by: Andreas Schneider <asn@cryptomilk.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:23 -07:00
Qu Wenruo 12b894cb28 btrfs: Fix a deadlock in btrfs_dev_replace_finishing()
btrfs-transacion:5657
[stack snip]
btrfs_bio_map()
    btrfs_bio_counter_inc_blocked()
        percpu_counter_inc(&fs_info->bio_counter)  ###bio_counter > 0(A)
        __btrfs_bio_map()
            btrfs_dev_replace_lock()
                mutex_lock(dev_replace->lock)	   ###wait mutex(B)

btrfs:32612
[stack snip]
btrfs_dev_replace_start()
    btrfs_dev_replace_lock()
	mutex_lock(dev_replace->lock)		   ###hold mutex(B)
    btrfs_dev_replace_finishing()
        btrfs_rm_dev_replace_blocked()
            wait until percpu_counter_sum == 0	   ###wait on bio_counter(A)

This bug can be triggered quite easily by the following test script:
http://pastebin.com/MQmb37Cy

This patch will fix the ABBA problem by calling
btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked().

The consistency of btrfs devices list and their superblocks is protected
by device_list_mutex, not btrfs_dev_replace_lock/unlock().
So it is safe the move btrfs_dev_replace_unlock() before
btrfs_rm_dev_replace_blocked().

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:22 -07:00
Liu Bo a583c02664 Btrfs: cleanup the same name in end_bio_extent_readpage
We've defined a 'offset' out of bio_for_each_segment_all.

This is just a clean rename, no function changes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:20 -07:00
Mark Fasheh 0b4699dcb6 btrfs: don't go readonly on existing qgroup items
btrfs_drop_snapshot() leaves subvolume qgroup items on disk after
completion. This can cause problems with snapshot creation. If a new
snapshot tries to claim the deleted subvolumes id, btrfs will get -EEXIST
from add_qgroup_item() and go read-only. The following commands will
reproduce this problem (assume btrfs is on /dev/sda and is mounted at
/btrfs)

mkfs.btrfs -f /dev/sda
mount -t btrfs /dev/sda /btrfs/
btrfs quota enable /btrfs/
btrfs su sna /btrfs/ /btrfs/snap
btrfs su de /btrfs/snap
sleep 45
umount /btrfs/
mount -t btrfs /dev/sda /btrfs/

We can fix this by catching -EEXIST in add_qgroup_item() and
initializing the existing items. We have the problem of orphaned
relation items being on disk from an old snapshot but that is outside
the scope of this patch.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:19 -07:00
Liu Bo b7831b20f3 Btrfs: show real function name in btrfs workqueue tracepoint
Use %pf instead of %p, just same as kernel workqueue tracepoints.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:18 -07:00
Filipe Manana 2a39e59802 Btrfs: shrink further sizeof(struct extent_buffer)
The map_start and map_len fields aren't used anywhere, so just remove
them. On a x86_64 system, this reduced sizeof(struct extent_buffer)
from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
now fit into a page instead of 13.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:17 -07:00
Filipe Manana 4395e0c4da Btrfs: send, lower mem requirements for processing xattrs
Maximum xattr size can be up to nearly the leaf size. For an fs with a
leaf size larger than the page size, using kmalloc requires allocating
multiple pages that are contiguous, which might not be possible if
there's heavy memory fragmentation. Therefore fallback to vmalloc if
we fail to allocate with kmalloc. Also start with a smaller buffer size,
since xattr values typically are smaller than a page.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:16 -07:00
David Sterba f87c4318af btrfs: remove stale define after removing ordered operations
Last user removed in commit "btrfs: disable strict file flushes for
renames and truncates" (8d875f95da).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:15 -07:00
Filipe Manana 2000552396 Btrfs: improve free space cache management and space allocation
While under random IO, a block group's free space cache eventually reaches
a state where it has a mix of extent entries and bitmap entries representing
free space regions.

As later free space regions are returned to the cache, some of them are merged
with existing extent entries if they are contiguous with them. But others are
not merged, because despite the existence of adjacent free space regions in
the cache, the merging doesn't happen because the existing free space regions
are represented in bitmap extents. Even when new free space regions are merged
with existing extent entries (enlarging the free space range they represent),
we create chances of having after an enlarged region that is contiguous with
some other region represented in a bitmap entry.

Both clustered and non-clustered space allocation work by iterating over our
extent and bitmap entries and skipping any that represents a region smaller
then the allocation request (and giving preference to extent entries before
bitmap entries). By having a contiguous free space region that is represented
by 2 (or more) entries (mix of extent and bitmap entries), we end up not
satisfying an allocation request with a size larger than the size of any of
the entries but no larger than the sum of their sizes. Making the caller assume
we're under a ENOSPC condition or force it to allocate multiple smaller space
regions (as we do for file data writes), which adds extra overhead and more
chances of causing fragmentation due to the smaller regions being all spread
apart from each other (more likely when under concurrency).

For example, if we have the following in the cache:

* extent entry representing free space range: [128Mb - 256Kb, 128Mb[

* bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
  representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
  space in this 128Mb area is marked as free

An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb,
would fail before, despite the existence of such contiguous free space area in the
cache. The caller could only allocate up to 768Kb of space at once and later another
256Kb (or vice-versa). In between each smaller allocation request, another task
working on a different file/inode might come in and take that space, preventing the
former task of getting a contiguous 1Mb region of free space.

Therefore this change implements the ability to move free space from bitmap
entries into existing and new free space regions represented with extent
entries. This is done when a space region is added to the cache.

A test was added to the sanity tests that explains in detail the issue too.

Some performance test results with compilebench on a 4 cores machine, with
32Gb of ram and using an HDD follow.

Test: compilebench -D /mnt -i 30 -r 1000 --makej

Before this change:

   intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
   compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
   read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
   delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)

After this change:

   intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
   compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
   read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
   delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:13 -07:00
Anand Jain 3c1dbdf54a btrfs: rename total_bytes to avoid confusion
we are assigning number_devices to the total_bytes,
that's very confusing for a moment

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:12 -07:00
Anand Jain de4c296f63 btrfs: fix typo in the log message
there is no matching open parenthesis for the closing parenthesis

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:11 -07:00
Anand Jain b2efedca68 btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev()
seed fs devices don't participate as rw_device, so don't increment
rw_devices when the device being handled belongs to a seed fs.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:10 -07:00
Anand Jain 8bef8401a0 btrfs: fix memory leak when there is no more seed device
When we replace all the seed device in the system there is
no point in just keeping the btrfs_fs_devices with out
any device

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:09 -07:00
Anand Jain 94d5f0c2ae btrfs: update sprout seed pointer when seed fs is relinquished
We are not updating sprout fs seed pointer when all seed device
is replaced. This patch will check if all seed device has been
replaced and then update the sprout pointer accordingly.

Same reproducer as in the previous patch would apply here.
And notice that btrfs_close_device will check if seed fs is
present and spits out the error with out this patch.

int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
{
::
                seed_devices = fs_devices->seed;
::
        while (seed_devices) {
                fs_devices = seed_devices;
                seed_devices = fs_devices->seed;
                __btrfs_close_devices(fs_devices);
                free_fs_devices(fs_devices);
        }

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:08 -07:00
Anand Jain 63dd86fa79 btrfs: fix rw_devices miss match after seed replace
reproducer:
    reproducer:
    mount /dev/sdb /btrfs
    btrfs dev add /dev/sdc /btrfs
    btrfs rep start -B /dev/sdb /dev/sdd /btrfs
    umount /btrfs

WARNING: CPU: 0 PID: 3882 at fs/btrfs/volumes.c:892 __btrfs_close_devices+0x1c8/0x200 [btrfs]()

which is

        WARN_ON(fs_devices->rw_devices);

   The problem here is that we did not add one to the rw_devices when
   we replace the seed device with a writable device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:06 -07:00