Commit Graph

461 Commits

Author SHA1 Message Date
Chris Mason 9627aeee3e Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs into for-linus 2014-12-02 18:42:03 -08:00
Filipe Manana 495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana 04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Miao Xie 4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie 2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie af8e2d1df9 Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Zhao Lei 6de6565075 Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Zhao Lei f90523d1aa Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Qu Wenruo 084b6e7c76 btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:

[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1702.986799]  Possible interrupt unsafe locking scenario:

[ 1702.987681]        CPU0                    CPU1
[ 1702.988137]        ----                    ----
[ 1702.988598]   lock(&fs_info->dev_replace.lock);
[ 1702.989069]                                local_irq_disable();
[ 1702.989534]                                lock(&delayed_node->mutex);
[ 1702.990038]                                lock(&found->groups_sem);
[ 1702.990494]   <Interrupt>
[ 1702.990938]     lock(&delayed_node->mutex);
[ 1702.991407]
 *** DEADLOCK ***

It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.

To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.

Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 05:55:38 -08:00
Linus Torvalds 90d0c376f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The largest set of changes here come from Miao Xie.  He's cleaning up
  and improving read recovery/repair for raid, and has a number of
  related fixes.

  I've merged another set of fsync fixes from Filipe, and he's also
  improved the way we handle metadata write errors to make sure we force
  the FS readonly if things go wrong.

  Otherwise we have a collection of fixes and cleanups.  Dave Sterba
  gets a cookie for removing the most lines (thanks Dave)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
  btrfs: Fix compile error when CONFIG_SECURITY is not set.
  Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
  btrfs: Make btrfs handle security mount options internally to avoid losing security label.
  Btrfs: send, don't delay dir move if there's a new parent inode
  btrfs: add more superblock checks
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: be aware of btree inode write errors to avoid fs corruption
  Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
  btrfs: fix shadow warning on cmp
  Btrfs: fix compilation errors under DEBUG
  Btrfs: fix crash of btrfs_release_extent_buffer_page
  Btrfs: add missing end_page_writeback on submit_extent_page failure
  btrfs: Fix the wrong condition judgment about subset extent map
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: cleanup error handling in build_backref_tree
  btrfs: move checks for DUMMY_ROOT into a helper
  btrfs: new define for the inline extent data start
  btrfs: kill extent_buffer_page helper
  btrfs: drop constant param from btrfs_release_extent_buffer_page
  btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
  ...
2014-10-11 08:03:52 -04:00
Fabian Frederick 1b6e44690d Btrfs: fix compilation errors under DEBUG
bi_sector and bi_size moved to bi_iter since commit 4f024f3797
("block: Abstract out bvec iterator")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Josef Bacik 47ab2a6c68 Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:21 -07:00
Chris Mason 0f23ae74f5 Revert "Btrfs: device_list_add() should not update list when mounted"
This reverts commit b96de000bc.

This commit is triggering failures to mount by subvolume id in some
configurations.  The main problem is how many different ways this
scanning function is used, both for scanning while mounted and
unmounted.  A proper cleanup is too big for late rcs.

For now, just revert the commit and we'll put a better fix into a later
merge window.

Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:49:05 -07:00
Miao Xie 28e1cc7d1b Btrfs: Set real mirror number for read operation on RAID0/5/6
We need real mirror number for RAID0/5/6 when reading data, or if read error
happens, we would pass 0 as the number of the mirror on which the io error
happens. It is wrong and would cause the filesystem read the data from the
corrupted mirror again.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:00 -07:00
Miao Xie c3929c3624 Btrfs: modify rw_devices counter under chunk_mutex context
rw_devices counter is often used to tune the profile when doing chunk allocation,
so we should modify it under the chunk_mutex context to avoid getting wrong
chunk profile.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:49 -07:00
Miao Xie 5f37583569 Btrfs: move the missing device to its own fs device list
For a missing device, we don't know it belong to which fs before we read its
fsid from the chunk tree. So we add them into the current fs device list at first.
When we get its fsid, we should move them to their own fs device list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:48 -07:00
Miao Xie 416d7b802a Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs
When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:47 -07:00
Miao Xie 82372bc816 Btrfs: make the logic of source device removing more clear
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:46 -07:00
Miao Xie 67a2c45ee7 Btrfs: fix use-after-free problem of the device during device replace
The problem is:
	Task0(device scan task)		Task1(device replace task)
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
	check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:44 -07:00
Miao Xie adbbb8631b Btrfs: fix unprotected device list access when cloning fs devices
We can build a new filesystem based a seed filesystem, and we need clone
the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list which was being changed when we clone the fs devices.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:43 -07:00
Miao Xie 2196d6e8a7 Btrfs: Fix misuse of chunk mutex
There were several problems about chunk mutex usage:
- Lock chunk mutex when updating metadata. It would cause the nested
  deadlock because updating metadata might need allocate new chunks
  that need acquire chunk mutex. We remove chunk mutex at this case,
  because b-tree lock and other lock mechanism can help us.
- ABBA deadlock occured between device_list_mutex and chunk_mutex.
  When we update device status, we must acquire device_list_mutex at the
  beginning, and then we might get chunk_mutex during the device status
  update because we need allocate new chunks for metadata COW. But at
  most place, we acquire chunk_mutex at first and then acquire device list
  mutex. We need change the lock order.
- Some place we needn't acquire chunk_mutex. For example we needn't get
  chunk_mutex when we free a empty seed fs_devices structure.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:42 -07:00
Miao Xie fe48a5c00f Btrfs: fix unprotected system chunk array insertion
We didn't protect the system chunk array when we added a new
system chunk into it, it would cause the array be corrupted
if someone remove/add some system chunk into array at the same
time. Fix it by chunk lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:40 -07:00
Miao Xie 7cc8e58d53 Btrfs: fix unprotected device's variants on 32bits machine
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:38 -07:00
Miao Xie 1c1161870c Btrfs: update free_chunk_space during allocting a new chunk
We should update free_chunk_space in time when we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space data to calculate the reserved
space, if we don't update it in time, we would consider the disk space which
has be allocated as free space, and would use it to do overcommit reservation.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:37 -07:00
Miao Xie 43530c46cc Btrfs: fix unprotected device->bytes_used update
We should update device->bytes_used in the lock context of
chunk_mutex, or we would get wrong data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:36 -07:00
Miao Xie 5d778aaeb0 Btrfs: Fix wrong free_chunk_space assignment during removing a device
During removing a device, we have modified free_chunk_space when we
shrink the device, so we needn't assign a new value to it after
the device shrink. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:35 -07:00
Miao Xie ce7213c70c Btrfs: fix wrong device bytes_used in the super block
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:34 -07:00
Miao Xie 935e5cc935 Btrfs: fix wrong disk size when writing super blocks
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:33 -07:00
Miao Xie 1c43366d3b Btrfs: fix unprotected assignment of the target device
We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:31 -07:00
Miao Xie 90180da42c Btrfs: cleanup unused num_can_discard in fs_devices
The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:29 -07:00
Anand Jain 3c1dbdf54a btrfs: rename total_bytes to avoid confusion
we are assigning number_devices to the total_bytes,
that's very confusing for a moment

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:12 -07:00
Anand Jain b2efedca68 btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev()
seed fs devices don't participate as rw_device, so don't increment
rw_devices when the device being handled belongs to a seed fs.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:10 -07:00
Anand Jain 8bef8401a0 btrfs: fix memory leak when there is no more seed device
When we replace all the seed device in the system there is
no point in just keeping the btrfs_fs_devices with out
any device

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:09 -07:00
Anand Jain 94d5f0c2ae btrfs: update sprout seed pointer when seed fs is relinquished
We are not updating sprout fs seed pointer when all seed device
is replaced. This patch will check if all seed device has been
replaced and then update the sprout pointer accordingly.

Same reproducer as in the previous patch would apply here.
And notice that btrfs_close_device will check if seed fs is
present and spits out the error with out this patch.

int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
{
::
                seed_devices = fs_devices->seed;
::
        while (seed_devices) {
                fs_devices = seed_devices;
                seed_devices = fs_devices->seed;
                __btrfs_close_devices(fs_devices);
                free_fs_devices(fs_devices);
        }

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:08 -07:00
Anand Jain 25e8e9113d btrfs: replace seed device followed by unmount causes kernel WARNING
reproducer:
mount /dev/sdb /btrfs
btrfs dev add /dev/sdc /btrfs
btrfs rep start -B /dev/sdb /dev/sdd /btrfs
umount /btrfs

WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]()
::

__btrfs_close_devices()
::
        WARN_ON(fs_devices->open_devices);

After the seed device has been replaced the new target device
is no more a seed device. So we need to update the device
numbers in the fs_devices as pointed by the fs_info.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:05 -07:00
Anand Jain d51908ce4e btrfs: preparatory to make btrfs_rm_dev_replace_srcdev() seed aware
There is no logical change in this patch, just a preparatory patch,
so that changes can be easily reasoned.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:04 -07:00
Filipe Manana f98de9b9c0 Btrfs: make btrfs_search_forward return with nodes unlocked
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:02 -07:00
Miao Xie 443f24fee7 Btrfs: cleanup unused latest_devid and latest_trans in fs_devices
The member variants - latest_devid and latest_trans - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:49 -07:00
Miao Xie addc3fa74e Btrfs: Fix the problem that the dirty flag of dev stats is cleared
The io error might happen during writing out the device stats, and the
device stats information and dirty flag would be update at that time,
but the current code didn't consider this case, just clear the dirty
flag, it would cause that we forgot to write out the new device stats
information. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:46 -07:00
HIMANGI SARAOGI 14586651ed Btrfs: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:34 -07:00
Miao Xie d20983b40e Btrfs: fix writing data into the seed filesystem
If we mounted a seed filesystem with degraded option, and then added a new
device into the seed filesystem, then we found adding device failed because
of the IO failure.

Steps to reproduce:
 # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1>
 # btrfstune -S 1 <dev0>
 # mount <dev0> -o degraded <mnt>
 # btrfs device add -f <dev2> <mnt>

It is because the original didn't set the chunk on the seed device to be
read-only if the degraded flag was set. It was introduced by patch f48b90756,
which fixed the problem the raid1 filesystem became read-only after one device
of it was missing. But this fix method was not right, we should set the read-only
flag according to the number of the missing devices, not the degraded mount
option, if the number of the missing devices is less than the max error number
that the profile of the chunk tolerates, we don't set it to be read-only.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:27 -07:00
David Sterba 962a298f35 btrfs: kill the key type accessor helpers
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:12 -07:00
Liu Bo 9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
Miao Xie 7df69d3e94 Btrfs: Fix wrong device size when we are resizing the device
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:18 -07:00
Miao Xie ff61d17c63 Btrfs: Fix the problem that the replace destroys the seed filesystem
The seed filesystem was destroyed by the device replace, the reproduce
method is:
 # mkfs.btrfs -f <dev0>
 # btrfstune -S 1 <dev0>
 # mount <dev0> <mnt>
 # btrfs device add <dev1> <mnt>
 # umount <mnt>
 # mount <dev1> <mnt>
 # btrfs replace start -f <dev0> <dev2> <mnt>
 # umount <mnt>
 # mount <dev0> <mnt>

It is because we erase the super block on the seed device. It is wrong,
we should not change anything on the seed device.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:16 -07:00
Miao Xie 3a7d55c84c Btrfs: fix wrong missing device counter decrease
The missing devices are accounted by its own fs device, for example
the missing devices in seed filesystem will be accounted by the fs device
of the seed filesystem, not by the new filesystem which is based on
the seed filesystem, so when we remove the missing device in the
seed filesystem, we should decrease the counter of its own fs device.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:10 -07:00
Miao Xie 69611ac810 Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:32 -07:00
Anand Jain 77bdae4d13 btrfs: check generation as replace duplicates devid+uuid
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.

 device_list_add() function is called in the following threads,
	mount device option
	mount argument
	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
 they have been unit tested to work fine with this patch.

 If the user knows what he is doing and really want to pair with
 replaced disk (which is not a standard operation), then he should
 first clear the kernel btrfs device list in the memory by doing
 the module unload/load and followed with the mount -o device option.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:30 -07:00
Anand Jain b96de000bc Btrfs: device_list_add() should not update list when mounted
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.

Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.

In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.

Which is to note that old device is neither closed nor gracefully
removed from the btrfs.

The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.

reproducer:

devmgt[1] detach /dev/sdc

replace the missing disk /dev/sdc

btrfs rep start -f 1 /dev/sde /btrfs
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
        Total devices 2 FS bytes used 32.00KiB
        devid    1 size 958.94MiB used 115.88MiB path /dev/sde
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

make /dev/sdc to reappear

devmgt attach host2

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
        Total devices 2 FS bytes used 32.00KiB^M
        devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.

[1] github.com/anajain/devmgt.git

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:28 -07:00
Eric Sandeen 0bfaa9c5cb btrfs: test for valid bdev before kobj removal in btrfs_rm_device
commit 99994cd btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00