Commit Graph

953 Commits

Author SHA1 Message Date
Linus Torvalds a682e00354 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull md updates from Shaohua Li:
 "Mainly fixes bugs and improves performance:

   - Improve scalability for raid1 from Coly

   - Improve raid5-cache read performance, disk efficiency and IO
     pattern from Song and me

   - Fix a race condition of disk hotplug for linear from Coly

   - A few cleanup patches from Ming and Byungchul

   - Fix a memory leak from Neil

   - Fix WRITE SAME IO failure from me

   - Add doc for raid5-cache from me"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
  md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
  md/raid1: handle flush request correctly
  md/linear: shutup lockdep warnning
  md/raid1: fix a use-after-free bug
  RAID1: avoid unnecessary spin locks in I/O barrier code
  RAID1: a new I/O barrier implementation to remove resync window
  md/raid5: Don't reinvent the wheel but use existing llist API
  md: fast clone bio in bio_clone_mddev()
  md: remove unnecessary check on mddev
  md/raid1: use bio_clone_bioset_partial() in case of write behind
  md: fail if mddev->bio_set can't be created
  block: introduce bio_clone_bioset_partial()
  md: disable WRITE SAME if it fails in underlayer disks
  md/raid5-cache: exclude reclaiming stripes in reclaim check
  md/raid5-cache: stripe reclaim only counts valid stripes
  MD: add doc for raid5-cache
  Documentation: move MD related doc into a separate dir
  md: ensure md devices are freed before module is unloaded.
  md/r5cache: improve journal device efficiency
  md/r5cache: enable chunk_aligned_read with write back cache
  ...
2017-02-24 14:42:19 -08:00
Jens Axboe 818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Ming Lei d7a1030839 md: fast clone bio in bio_clone_mddev()
Firstly, bio_clone_mddev() is used in raid normal I/O and isn't
used in the resync I/O path.

Secondly, all the direct access to the bvec table in raid happens in
resync I/O, except for write-behind in raid1, where we still
use bio_clone() to allocate a new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:24:54 -08:00
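A minimal sketch of the change the commit above describes, assuming the
4.10-era block API (the wrapper name is illustrative; the patch itself
calls bio_clone_fast() directly):

    #include <linux/bio.h>

    /*
     * bio_clone_fast() shares the source's bvec table instead of copying
     * it the way bio_clone() does. Per the commit message this is safe
     * because normal raid I/O never touches the cloned bvec table.
     */
    static struct bio *md_clone_bio(struct bio *src, gfp_t gfp,
                                    struct bio_set *bs)
    {
        return bio_clone_fast(src, gfp, bs);
    }
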
Ming Lei ed7ef732ca md: remove unnecessary check on mddev
mddev is never NULL and neither is ->bio_set, so
remove the check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:24:13 -08:00
Ming Lei 10273170fd md: fail if mddev->bio_set can't be created
The current behaviour is to fall back to allocating
bios from 'fs_bio_set'; that isn't correct,
because it might cause a deadlock.

So this patch simply returns failure if mddev->bio_set
can't be created.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:23:24 -08:00
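A hedged sketch of the new behaviour, assuming the 4.10-era bioset_create()
API (the surrounding function is illustrative):

    #include <linux/bio.h>

    /* Called during array setup, e.g. from md_run(). */
    static int md_setup_bio_set(struct mddev *mddev)
    {
        if (mddev->bio_set)
            return 0;
        mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
        if (!mddev->bio_set)
            return -ENOMEM; /* was: silently fall back to fs_bio_set */
        return 0;
    }
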
NeilBrown 9356863c94 md: ensure md devices are freed before module is unloaded.
Commit cbd1998377 ("md: Fix unfortunate interaction with evms")
changed mddev_put() so that it would not destroy an md device while
->ctime was non-zero.

Unfortunately, we didn't make sure to clear ->ctime when unloading
the module, so it is possible for an md device to remain after
module unload.  An attempt to open such a device will trigger
an invalid memory reference in:
  get_gendisk -> kobj_lookup -> exact_lock -> get_disk

when trying to access disk->fops, which was in the module that has
been removed.

So ensure we clear ->ctime in md_exit(), and explain how that is useful,
as it isn't immediately obvious when looking at the code.

Fixes: cbd1998377 ("md: Fix unfortunate interaction with evms")
Tested-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:53 -08:00
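A rough sketch of the fix described above; for_each_mddev() is the real
md.c iterator, but the rest of the loop body is abbreviated and
illustrative:

    static void __exit md_exit(void)
    {
        struct mddev *mddev;
        struct list_head *tmp;

        for_each_mddev(mddev, tmp) {
            /*
             * A non-zero ->ctime stops mddev_put() from destroying the
             * device (see cbd1998377), so clear it at module unload to
             * guarantee no mddev outlives the module's disk->fops.
             */
            mddev->ctime = 0;
            mddev_put(mddev);
        }
    }
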
Jan Kara dc3b17cc8b block: Use pointer to backing_dev_info from request_queue
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:20:48 -07:00
Song Liu a85dd7b8df md/r5cache: flush data only stripes in r5l_recovery_log()
For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. Moreover, the write-through/write-back mode
setting isn't persistent across array restarts, so we always start the array
in write-through mode. However, if recovery found data-only stripes before the
shutdown (from a previous write-back session), it is not safe to start the
array in write-through mode, as write-through mode cannot handle stripes with
data in the write-back cache. To solve this problem, we flush all data-only
stripes in r5l_recovery_log(). When r5l_recovery_log() returns, the array
starts with an empty cache in write-through mode.

This logic is implemented in r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes to get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in step 4 will be woken up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resources
required by the thread have been initialized.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:15 -08:00
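A hedged sketch of that five-step sequence; the field and helper names
follow the raid5-cache code as far as I can tell, but treat the function
as illustrative rather than the exact implementation:

    static void r5c_flush_data_only_stripes_sketch(struct r5conf *conf)
    {
        conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;    /* 1 */
        r5c_flush_cache(conf, INT_MAX);                               /* 2 */
        md_wakeup_thread(conf->mddev->thread);                        /* 3 */
        wait_event(conf->wait_for_quiescent,                          /* 4 */
                   atomic_read(&conf->active_stripes) == 0);
        conf->log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH; /* 5 */
    }
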
Shaohua Li 20737738d3 Merge branch 'md-next' into md-linus 2016-12-13 12:40:15 -08:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so it makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support for SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Support for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Shaohua Li 2953079c69 md: separate flags for superblock changes
The mddev->flags are used for different purposes. There are a lot of
places where we check/change the flags without masking unrelated flags, so we
could check/change unrelated flags by accident. These usages are mostly for
superblock writes, so separate out the superblock-related flags. This should
make the code clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:01:47 -08:00
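A sketch of the split; the bit names below match the post-patch kernel,
while the helper itself is illustrative:

    /*
     * Superblock-change bits move into their own field (mddev->sb_flags),
     * so tests and clears can no longer touch unrelated mddev->flags bits.
     */
    enum mddev_sb_flags {
        MD_SB_CHANGE_DEVS,      /* some device status has changed */
        MD_SB_CHANGE_CLEAN,     /* transition to or from 'clean' */
        MD_SB_CHANGE_PENDING,   /* superblock update in progress */
    };

    static void mark_sb_dirty(struct mddev *mddev)
    {
        set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); /* was &mddev->flags */
        md_wakeup_thread(mddev->thread);
    }
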
Shaohua Li 82a301cb0e md: MD_RECOVERY_NEEDED is set for mddev->recovery
Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.")

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:00:43 -08:00
NeilBrown e2342ca832 md: fix refcount problem on mddev when stopping array.
md_open() gets a counted reference on an mddev using mddev_find().
If it ends up returning an error, it must drop this reference.

There are two error paths where the reference is not dropped.
One only happens if the process is signalled at an awkward time,
which is quite unlikely.
The other was introduced recently in commit af8d8e6f0.

Change the code to ensure we drop the reference when returning an error,
and make it harder to re-introduce this sort of bug in the future.

Reported-by: Marc Smith <marc.smith@mcc.edu>
Fixes: af8d8e6f03 ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05 17:11:03 -08:00
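A hedged sketch of the shape of the fix: funnel every failure in md_open()
through a single exit that drops the reference taken by mddev_find()
(details of the real function are omitted):

    static int md_open_sketch(struct block_device *bdev, fmode_t mode)
    {
        struct mddev *mddev = mddev_find(bdev->bd_dev);
        int err = 0;

        if (!mddev)
            return -ENODEV;         /* no reference taken yet */

        if (test_bit(MD_CLOSING, &mddev->flags))
            err = -EBUSY;           /* previously returned without a put */

        if (err)
            mddev_put(mddev);       /* every error path drops the ref */
        return err;
    }
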
Shaohua Li 034e33f5ed md: stop write should stop journal reclaim
__md_stop_writes currently doesn't stop the raid5-cache reclaim thread. It's
possible the reclaim thread is still running and doing writes, which
doesn't match what __md_stop_writes should do. The extra ->quiesce()
call should not harm any raid types. For raid5-cache, this will
guarantee we reclaim all caches before we update superblock.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:25 -08:00
Shaohua Li ce1ccd079f raid5-cache: suspend reclaim thread instead of shutdown
There is a mechanism to suspend a kernel thread. Use it instead of playing
the create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:06 -08:00
NeilBrown 46533ff7fe md: Use REQ_FAILFAST_* on metadata writes where appropriate
This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state.  i.e. if marking a device Faulty would cause some
data to be inaccessible, the device's status is left as
non-Faulty.  This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success.  We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:11:33 -08:00
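A sketch of the check on a metadata write; MD_FAILFAST and the
FailFast/LastDev rdev bits are the names this series introduces, and the
helper is illustrative:

    static void md_super_write_flags(struct md_rdev *rdev, struct bio *bio)
    {
        /*
         * Only set failfast when losing this device would leave the
         * array functional; never be retry-averse on the last device.
         */
        if (test_bit(FailFast, &rdev->flags) &&
            !test_bit(LastDev, &rdev->flags))
            bio->bi_opf |= MD_FAILFAST;
    }
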
NeilBrown 688834e6ae md/failfast: add failfast flag for md to be used by some personalities.
This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error.  The underlying driver will be directed not to retry requests
that result in failures.  There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail; it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device.  No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 08:58:17 -08:00
Shaohua Li 504634f60f md: add blktrace event for writes to superblock
Superblock write is an expensive operation. With raid5-cache, it can be called
regularly. Add tracing to help performance debugging.

Signed-off-by: Shaohua Li <shli@fb.com>
Cc: NeilBrown <neilb@suse.com>
2016-11-18 09:47:57 -08:00
NeilBrown 6119e6792b md: remove md_super_wait() call after bitmap_flush()
bitmap_flush() finishes with bitmap_update_sb(), and that finishes
with write_page(..., 1), so write_page() will wait for all writes
to complete.  So there is no point calling md_super_wait()
immediately afterwards.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-09 17:14:28 -08:00
NeilBrown 060b0689f5 md: perform async updates for metadata where possible.
When adding devices to, or removing devices from, an array we need to
update the metadata.  However we don't need to do it synchronously as
data integrity doesn't depend on these changes being recorded
instantly.  So avoid the synchronous call to md_update_sb and just set
a flag so that the thread will do it.

This can reduce the number of updates performed when lots of devices
are being added or removed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:23 -08:00
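A sketch of the change: record that the superblock is dirty and let the md
thread write it, instead of calling md_update_sb() synchronously (flag
names are the pre-split ones in use at the time):

    static void request_sb_update(struct mddev *mddev)
    {
        set_bit(MD_CHANGE_DEVS, &mddev->flags); /* was md_update_sb(mddev, 0) */
        md_wakeup_thread(mddev->thread);        /* the thread batches updates */
    }
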
NeilBrown 9d48739ef1 md: change all printk() to pr_err() or pr_warn() etc.
1/ using pr_debug() for a number of messages reduces the noise of
   md, but still allows them to be enabled when needed.
2/ try to be consistent in the usage of pr_err() and pr_warn(), and
   document the intention
3/ When strings have been split onto multiple lines, rejoin into
   a single string.
   The cost of having lines > 80 chars is less than the cost of not
   being able to easily search for a particular message.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
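An illustrative before/after for points 2 and 3 (the message text is just
an example):

    /* before: explicit level macro, string split across lines */
    printk(KERN_WARNING
           "md: cannot register extra attributes for %s\n",
           mdname(mddev));

    /* after: pr_warn() carries the level, string rejoined for grepping */
    pr_warn("md: cannot register extra attributes for %s\n", mdname(mddev));
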
NeilBrown 7f0f0d87fa md: fix some issues with alloc_disk_sb()
1/ don't print a warning if allocation fails.
 page_alloc() does that already.
2/ always check return status for error.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
Tomasz Majchrzak 91a6c4aded md: wake up personality thread after array state update
When a raid1/raid10 array fails to write to one of the drives, the request
is added to bio_end_io_list and finished by the personality thread. The
thread doesn't handle it as long as the MD_CHANGE_PENDING flag is set. In
case of external metadata this flag is cleared, however the thread is
not woken up. It causes the request to be blocked for a few seconds (until
another action on the array wakes up the thread) or to get stuck
indefinitely.

Wake up personality thread once MD_CHANGE_PENDING has been cleared.
Moving the 'restart_array' call to after the flag is cleared is not a
solution, because in read-write mode the call doesn't wake up the thread.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
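A minimal sketch of the fix; the surrounding function is illustrative:

    static void external_metadata_updated(struct mddev *mddev)
    {
        /*
         * Once MD_CHANGE_PENDING is cleared, kick the personality thread
         * so requests parked on bio_end_io_list are finished promptly.
         */
        if (test_and_clear_bit(MD_CHANGE_PENDING, &mddev->flags))
            md_wakeup_thread(mddev->thread);
    }
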
Tomasz Majchrzak dcbcb48650 md: don't fail an array if there are unacknowledged bad blocks
If the external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report the disk via sysfs as faulty. Such a
situation can still be handled, so the disk just has to be blocked for a
moment. This makes it consistent with kernel state, as the corresponding rdev
flag is also not set.

When the disk is being unblocked there are a few cases:
1. The disk has been in blocked and faulty state; it is being unblocked but
still remains in faulty state. The metadata handler will remove it from the
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear the Blocked flag because it was probably not set anyway (if it was,
it is case 1). The BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:20 -08:00
Tomasz Majchrzak 35b785f769 md: add bad block support for external metadata
Add a new rdev flag which the external metadata handler can use to switch
on/off bad block support. If a new bad block is encountered, notify it via the
rdev 'unacknowledged_bad_blocks' sysfs file. If a bad block has been
cleared, notify the update via the rdev 'bad_blocks' sysfs file.

When bad block support is being removed, just clear the rdev flag. It is
not necessary to reset the badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unblocked.

Simplify the state_show function by adding a separator at the end of each
string and overwriting the last separator with a newline.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:20 -08:00
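A sketch of the state_show() simplification from the last paragraph; the
flags shown are only a subset and the function name is illustrative:

    static ssize_t state_show_sketch(struct md_rdev *rdev, char *page)
    {
        size_t len = 0;

        /* every state appends its own trailing separator... */
        if (test_bit(Faulty, &rdev->flags))
            len += sprintf(page + len, "faulty,");
        if (test_bit(Blocked, &rdev->flags))
            len += sprintf(page + len, "blocked,");
        if (!len)
            len = sprintf(page, "-,");
        /* ...and the last separator is overwritten with a newline */
        page[len - 1] = '\n';
        return len;
    }
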
Christoph Hellwig 70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
NeilBrown 1217e1d199 md: be careful not to leak internal curr_resync value into metadata.
mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.

 1 - means that the array is trying to start a resync, but has yielded
     to another array which shares physical devices, and also needs to
     start a resync
 2 - means the array is trying to start resync, but has found another
     array which shares physical devices and has already started resync.

 3 - means that resync has commenced, but it is possible that nothing
     has actually been resynced yet.

It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint.  In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.

There are two places where this value is propagated into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 2, but will
allow 3 to leak through.

Change them to only propagate the value if it is > 3.

As this can cause an array to fail, the patch is suitable for -stable.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:05 -07:00
Tomasz Majchrzak 16f889499a md: report 'write_pending' state when array in sync
If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting the MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared, as the array state is reported as
'clean'. A read request to a bad block in a RAID5 array gets stuck, as it
is waiting for a flag to be cleared - as per commit c3cce6cda1
("md/raid5: ensure device failure recorded before write request
returns.").

The meaning of the MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd71 ("md: resolve confusion of
MD_CHANGE_CLEAN"); however, the MD_CHANGE_PENDING flag has been used in
personality error handlers since then, and it doesn't fully comply with its
initial purpose. It was supposed to notify that a write request is about
to start; now it is also used to request a metadata update.
Initially (in md_allow_write, md_write_start) the MD_CHANGE_PENDING flag
was set and in_sync was set to 0 at the same time. Error handlers
just set the flag without modifying the in_sync value. The sysfs array state
is a single value, so it now reports 'clean' when the MD_CHANGE_PENDING flag
is set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.

Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:19 -07:00
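A sketch of the reorder in array_state_show(); checking the pending flag
ahead of in_sync is the whole fix (surrounding states omitted):

    /*
     * write_pending now wins over clean, so userspace learns a metadata
     * update is expected even while in_sync == 1.
     */
    if (test_bit(MD_CHANGE_PENDING, &mddev->flags))
        st = write_pending;
    else if (mddev->in_sync)
        st = clean;
    else
        st = active;
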
Shaohua Li bb086a89a4 md: set rotational bit
if all disks in an array are non-rotational, set the array
non-rotational.

This only works for arrays with all disks populated at startup. Support
for disk hotadd/hotremove could be added later if necessary.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-03 10:20:27 -07:00
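A hedged sketch of the startup check, assuming the 4.8-era queue-flag API
(the helper name is illustrative):

    static void md_set_rotational(struct mddev *mddev)
    {
        struct md_rdev *rdev;
        bool nonrot = true;

        /* the array is non-rotational only if every member disk is */
        rdev_for_each(rdev, mddev)
            if (rdev->bdev &&
                !blk_queue_nonrot(bdev_get_queue(rdev->bdev)))
                nonrot = false;

        if (nonrot)
            queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mddev->queue);
        else
            queue_flag_clear_unlocked(QUEUE_FLAG_NONROT, mddev->queue);
    }
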
Shaohua Li 90bcf13381 md: fix a potential deadlock
lockdep reports a potential deadlock. Fix this by dropping the mutex
before md_import_device().

[ 1137.126601] ======================================================
[ 1137.127013] [ INFO: possible circular locking dependency detected ]
[ 1137.127013] 4.8.0-rc4+ #538 Not tainted
[ 1137.127013] -------------------------------------------------------
[ 1137.127013] mdadm/16675 is trying to acquire lock:
[ 1137.127013]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013]
but task is already holding lock:
[ 1137.127013]  (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
[ 1137.127013]
which lock already depends on the new lock.

[ 1137.127013]
the existing dependency chain (in reverse order) is:
[ 1137.127013]
-> #1 (detected_devices_mutex){+.+.+.}:
[ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013]        [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
[ 1137.127013]        [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
[ 1137.127013]        [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
[ 1137.127013]        [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
[ 1137.127013]        [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
[ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[ 1137.127013]        [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
[ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013]        [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013]        [<ffffffff81244307>] blkdev_get+0x227/0x350
[ 1137.127013]        [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
[ 1137.127013]        [<ffffffff81a46d65>] lock_rdev+0x35/0x80
[ 1137.127013]        [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
[ 1137.127013]        [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
[ 1137.127013]        [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
[ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
other info that might help us debug this:

[ 1137.127013]  Possible unsafe locking scenario:

[ 1137.127013]        CPU0                    CPU1
[ 1137.127013]        ----                    ----
[ 1137.127013]   lock(detected_devices_mutex);
[ 1137.127013]                                lock(&bdev->bd_mutex);
[ 1137.127013]                                lock(detected_devices_mutex);
[ 1137.127013]   lock(&bdev->bd_mutex);
[ 1137.127013]
 *** DEADLOCK ***

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21 09:09:44 -07:00
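A sketch of the shape of the fix in autostart_arrays(): release
detected_devices_mutex around md_import_device(), which acquires bd_mutex
via lock_rdev() (pop_detected_device() is a hypothetical helper):

    static void autostart_arrays_sketch(void)
    {
        struct md_rdev *rdev;

        mutex_lock(&detected_devices_mutex);
        while (!list_empty(&all_detected_devices)) {
            dev_t dev = pop_detected_device();      /* hypothetical helper */

            mutex_unlock(&detected_devices_mutex);  /* don't hold it across */
            rdev = md_import_device(dev, 0, 90);    /* the bd_mutex chain   */
            mutex_lock(&detected_devices_mutex);
            if (!IS_ERR(rdev))
                list_add(&rdev->same_set, &pending_raid_disks);
        }
        mutex_unlock(&detected_devices_mutex);
    }
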
Guoqing Jiang c20c33f0e2 md-cluster: clean related infos of cluster
cluster_info and bitmap_info.nodes also need to be
cleared when the array is stopped.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21 09:09:44 -07:00
Guoqing Jiang af8d8e6f03 md: changes for MD_STILL_CLOSED flag
When stopping a clustered raid while it is pending on resync,
the MD_STILL_CLOSED flag could be cleared since a udev rule
is triggered to open the mddev. So obviously the array can't
be stopped soon and returns EBUSY.

	mdadm -Ss          md-raid-arrays.rules
  set MD_STILL_CLOSED          md_open()
	... ... ...          clear MD_STILL_CLOSED
	do_md_stop

We make the changes below to resolve this issue:

1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
   when stopping the array, i.e. it means we are stopping the array.
2. let md_open return early if CLOSING is set, so no
   other thread will open the array while one thread is trying
   to close it.
3. no need to clear the CLOSING bit in md_open because 1 has
   ensured the bit is cleared; then we also don't need to
   test the CLOSING bit in do_md_stop.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21 09:09:44 -07:00
Guoqing Jiang e566aef12a md-cluster: call md_kick_rdev_from_array once ack failed
new_disk_ack could return failure if WAITING_FOR_NEWDISK
is not set, so we need to kick the dev from the array in case
the failure happened.

And we missed checking err before calling new_disk_ack; otherwise
we could kick an rdev which isn't in the array. Thanks for the
reminder from Shaohua.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21 09:09:44 -07:00
Guoqing Jiang 47a7b0d888 md-cluster: make md-cluster also can work when compiled into kernel
md-cluster is compiled as a module by default;
if it is built into the kernel instead, then we can't
make md-cluster work.

[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)

Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions")
Cc: stable@vger.kernel.org (v4.1+)
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-08 11:11:27 -07:00
Linus Torvalds 86a1679860 Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "This includes several bug fixes:

   - Alexey Obitotskiy fixed a hang for a faulty raid5 array with
     external management

   - Song Liu fixed two raid5 journal related bugs

   - Tomasz Majchrzak fixed a bad block recording issue and an
     accounting issue for raid10

   - ZhengYuan Liu fixed an accounting issue for raid5

   - I fixed a potential race condition and memory leak with DIF/DIX
     enabled

   - other trivial fixes"

* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid5: avoid unnecessary bio data set
  raid5: fix memory leak of bio integrity data
  raid10: record correct address of bad block
  md-cluster: fix error return code in join()
  r5cache: set MD_JOURNAL_CLEAN correctly
  md: don't print the same repeated messages about delayed sync operation
  md: remove obsolete ret in md_start_sync
  md: do not count journal as spare in GET_ARRAY_INFO
  md: Prevent IO hold during accessing to faulty raid5 array
  MD: hold mddev lock to change bitmap location
  raid5: fix incorrectly counter of conf->empty_inactive_list_nr
  raid10: increment write counter after bio is split
2016-08-30 11:24:04 -07:00
Song Liu 486b0f7bcd r5cache: set MD_JOURNAL_CLEAN correctly
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, the MD_JOURNAL_CLEAN flag is only set when the journal
device is present.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-24 10:21:50 -07:00
Artur Paszkiewicz c622ca543b md: don't print the same repeated messages about delayed sync operation
This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"

It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
   partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
   writing to their md/sync_action attribute files. One operation should
   start and one should be delayed and a message like the above will be
   printed.
3. Issue a write to the third array. Each write will cause 2 copies of
   the message to be printed.

This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.

Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-17 10:22:08 -07:00
Guoqing Jiang 207efcd2b5 md: remove obsolete ret in md_start_sync
The ret is not needed anymore since we have already
moved resync_start into md_do_sync in commit 41a9a0d.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-17 10:22:07 -07:00
Song Liu b347af816a md: do not count journal as spare in GET_ARRAY_INFO
GET_ARRAY_INFO counts the journal as a spare (spare_disks), which is not
accurate. This patch fixes this.

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-16 18:34:15 -07:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokenness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Shaohua Li 3f35e210ed Merge branch 'mymd/for-next' into mymd/for-linus 2016-07-28 09:34:14 -07:00
Shaohua Li 5d8817833c MD: fix null pointer deference
The md device might not have a personality (for example, a ddf raid array). The
issue was introduced by 8430e7e0af9a15 ("md: disconnect device from personality
before trying to remove it").

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-28 09:06:34 -07:00
Tomasz Majchrzak 573275b58e md: add missing sysfs_notify on array_state update
Changeset 6791875e2e added an early return from a function, so there is no
sysfs notification for the 'active' and 'clean' state changes.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:28:39 -07:00
Alexey Obitotskiy 4cb9da7d9c Fix kernel module refcount handling
md loads raidX modules and increments the module refcount each time the level
changes, but does not decrement it. You are unable to unload the raid0
module after a reshape because a raid0 reshape changes the level to raid4
and back to raid0.

Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:17:31 -07:00
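A hedged sketch of the fix in level_store(): release the reference held on
the old personality's module once the array has switched (context
abbreviated and illustrative):

    /* oldpers is the personality the array is switching away from */
    module_put(oldpers->owner);  /* was leaked on every level change */
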
Arnd Bergmann 0e3ef49eda md: use seconds granularity for error logging
The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.

There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:00:47 -07:00
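A sketch of the conversion; ktime_get_seconds() returns monotonic seconds
as a time64_t, and the helper is illustrative:

    #include <linux/timekeeping.h>

    /* was: struct timespec last_read_error (only .tv_sec ever read) */
    static void note_read_error(struct md_rdev *rdev)
    {
        rdev->last_read_error = ktime_get_seconds(); /* monotonic, 64-bit */
    }
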
NeilBrown d787be4092 md: reduce the number of synchronize_rcu() calls when multiple devices fail.
Every time a device is removed with ->hot_remove_disk(), a synchronize_rcu() call is made,
which can delay several milliseconds in some cases.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices is removed all at once - these delays can add up to be very inconvenient.

As failure is not reversible, we can check for that first, setting a
separate flag if it is found, and then call synchronize_rcu() once for
all the flagged devices.  The ->hot_remove_disk() function can then skip the
synchronize_rcu() step if the flag is set.

Fix build error (Shaohua).
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:22 -07:00
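A hedged sketch of the batching: flag every failed device first, pay for a
single grace period, and let ->hot_remove_disk() skip its own
synchronize_rcu() when the flag is set (the helper name is illustrative;
RemoveSynchronized is the flag this patch adds):

    static void flag_failed_rdevs(struct mddev *mddev)
    {
        struct md_rdev *rdev;
        bool remove_some = false;

        rdev_for_each(rdev, mddev)
            if (test_bit(Faulty, &rdev->flags) && rdev->raid_disk >= 0) {
                set_bit(RemoveSynchronized, &rdev->flags);
                remove_some = true;
            }
        if (remove_some)
            synchronize_rcu(); /* once for all devices, not once each */
    }
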
NeilBrown 8430e7e0af md: disconnect device from personality before trying to remove it.
When the HOT_REMOVE_DISK ioctl is used to remove a device, we
call remove_and_add_spares() which will remove it from the personality
if possible.  This improves the chances that the removal will succeed.

When writing "remove" to dev-XX/state, we don't.  So that can fail more easily.

So add the remove_and_add_spares() call to the "remove" handling.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:12 -07:00
Xiao Ni 4ba1e78891 MD: Update superblock when err == 0 in size_store
This is a simple check before updating the superblock. It should update
the superblock when update_size returns 0.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:11 -07:00
Cong Wang 5b1f5bc332 md: use a mutex to protect a global list
We saw a list corruption in the list all_detected_devices:

 WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9()
 list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0).
 Modules linked in: ahci libahci libata sd_mod scsi_mod
 CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1
 Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013
 Workqueue: events_unbound async_run_entry_fn
  0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48
  0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828
  ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0
 Call Trace:
  [<ffffffff81502872>] dump_stack+0x4d/0x63
  [<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb
  [<ffffffff812ad02c>] ? __list_add+0x3c/0xa9
  [<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff812ad02c>] __list_add+0x3c/0xa9
  [<ffffffff81406f28>] md_autodetect_dev+0x41/0x62
  [<ffffffff81285862>] rescan_partitions+0x25f/0x29d
  [<ffffffff81506372>] ? mutex_lock+0x13/0x31
  [<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd
  [<ffffffff811a0b91>] blkdev_get+0x5f/0x294
  [<ffffffff81377ceb>] ? put_device+0x17/0x19
  [<ffffffff8128227c>] ? disk_put_part+0x12/0x14
  [<ffffffff812836f3>] add_disk+0x29d/0x407
  [<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64
  [<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod]
  [<ffffffff81083177>] async_run_entry_fn+0x72/0x12c
  [<ffffffff8107c44c>] process_one_work+0x198/0x2ce
  [<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb
  [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
  [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
  [<ffffffff81080d9c>] kthread+0xae/0xb6
  [<ffffffff81080000>] ? param_array_set+0x40/0xfa
  [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61
  [<ffffffff81508152>] ret_from_fork+0x42/0x70
  [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61

I suspect it is because there is no lock protecting this
global list; autostart_arrays() is called in the ioctl() path,
where there is no lock.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-09 09:37:23 -07:00
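A sketch of the fix: guard the global list with a mutex so
md_autodetect_dev() (partition rescan) and autostart_arrays() (ioctl path)
can no longer race on it:

    static DEFINE_MUTEX(detected_devices_mutex);

    void md_autodetect_dev(dev_t dev)
    {
        struct detected_devices_node *node;

        node = kzalloc(sizeof(*node), GFP_KERNEL);
        if (!node)
            return;
        node->dev = dev;
        mutex_lock(&detected_devices_mutex);
        list_add_tail(&node->list, &all_detected_devices);
        mutex_unlock(&detected_devices_mutex);
    }
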
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00