Commit Graph

4370 Commits

Author SHA1 Message Date
Mikulas Patocka 0a83df6c8c dm crypt: increase mempool reserve to better support swapping
Increase mempool size from 16 to 64 entries.  This increase improves
swap on dm-crypt performance.

When swapping to dm-crypt, all available memory is temporarily exhausted
and dm-crypt can only use the mempool reserve.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-15 09:23:14 -04:00
Mike Snitzer 802934b2cf dm round robin: do not use this_cpu_ptr() without having preemption disabled
Use local_irq_save() to disable preemption before calling
this_cpu_ptr().

Reported-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Fixes: b0b477c7e0 ("dm round robin: use percpu 'repeat_count' and 'current_path'")
Cc: stable@vger.kernel.org # 4.6+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-15 09:23:14 -04:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Alexey Obitotskiy 11367799f3 md: Prevent IO hold during accessing to faulty raid5 array
After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.

For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.

Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.

Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-05 22:03:10 -07:00
Shaohua Li d9dd26b20c MD: hold mddev lock to change bitmap location
Changing the location changes a lot of things. Holding the lock to avoid race.
This makes the .quiesce called with mddev lock hold too.

Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-05 22:02:40 -07:00
Heinz Mauelshagen 2a034ec197 dm raid: fix use of wrong status char during resynchronization
During a resynchronization, device status char 'a' is output on the raid
status line for every device of a RAID set.  It changes from 'a' to 'A'
(unless device failure) when the resynchronization completes.

Interrupting and restarting a resynchronization, by reloading the DM
table, erroneously lead to status char 'A'.

Fix this by avoiding setting the MD_RECOVERY_REQUESTED flag in
raid_preresume().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-04 10:05:30 -04:00
Heinz Mauelshagen b2a4872a45 dm raid: constructor fails on non-zero incompat_features
When lvm2 userspace requests a RaidLV repair, it sets the rebuild
constructor flag on the new replacement DataLVs but does not clear the
respective MetaLVs.  Hence the superblock that is loaded from such new
MetaLVs may have a non-zero incompat_features member and the constructor
will fail with false-positive on incompat_features.

Solve by initializing the incompat_features member properly.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-03 12:36:54 -04:00
Heinz Mauelshagen f15f64d65b dm raid: fix processing of max_recovery_rate constructor flag
__CTR_FLAG_MIN_RECOVERY_RATE was used instead of __CTR_FLAG_MAX_RECOVERY_RATE
thus causing max_recovery_rate to be rejected in case min_recovery_rate
was already set.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-03 10:30:52 -04:00
Mike Snitzer eaf9a7361f dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING
Otherwise, there is potential for both DMF_SUSPENDED* and
DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is
definitely _not_ a valid state.

This fix, in conjuction with "dm rq: fix the starting and stopping of
blk-mq queues", addresses the potential for request-based DM multipath's
__multipath_map() to see !dm_noflush_suspending() during suspend.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-08-02 16:21:37 -04:00
Mike Snitzer 7d9595d848 dm rq: fix the starting and stopping of blk-mq queues
Improve dm_stop_queue() to cancel any requeue_work.  Also, have
dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED
for the blk-mq request_queue.

On suspend dm_stop_queue() handles stopping the blk-mq request_queue
BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point
there is still a race that is allowing block/blk-mq.c to call ->queue_rq
against a hctx that it really shouldn't.  Add a check to
dm_mq_queue_rq() that guards against this rarity (albeit _not_
race-free).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels
2016-08-02 16:21:36 -04:00
Mike Snitzer 1814f2e3fb dm mpath: add locking to multipath_resume and must_push_back
Multiple flags were being tested without locking.  Protect against
non-atomic bit changes in m->flags by holding m->lock (while testing or
setting the queue_if_no_path related flags).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-08-02 16:21:34 -04:00
Mike Snitzer 99f3c90d0d dm flakey: error READ bios during the down_interval
When the corrupt_bio_byte feature was introduced it caused READ bios to
no longer be errored with -EIO during the down_interval.  This had to do
with the complexity of needing to submit READs if the corrupt_bio_byte
feature was used.

Fix it so READ bios are properly errored with -EIO; doing so early in
flakey_map() as long as there isn't a match for the corrupt_bio_byte
feature.

Fixes: a3998799fb ("dm flakey: add corrupt_bio_byte feature")
Reported-by: Akira Hayakawa <ruby.wktk@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-08-02 16:08:59 -04:00
ZhengYuan Liu ff00d3b4e5 raid5: fix incorrectly counter of conf->empty_inactive_list_nr
The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.

Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-01 20:18:21 -07:00
Tomasz Majchrzak 9b622e2bbc raid10: increment write counter after bio is split
md pending write counter must be incremented after bio is split,
otherwise it gets decremented too many times in end bio callback and
becomes negative.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-30 14:09:30 -07:00
Linus Torvalds 867900b5ec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 - A bunch of patches from Neil Brown to fix RCU usage
 - Two performance improvement patches from Tomasz Majchrzak
 - Alexey Obitotskiy fixes module refcount issue
 - Arnd Bergmann fixes time granularity
 - Cong Wang fixes a list corruption issue
 - Guoqing Jiang fixes a deadlock in md-cluster
 - A null pointer deference fix from me
 - Song Liu fixes misuse of raid6 rmw
 - Other trival/cleanup fixes from Guoqing Jiang and Xiao Ni

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
  MD: fix null pointer deference
  raid10: improve random reads performance
  md: add missing sysfs_notify on array_state update
  Fix kernel module refcount handling
  md: use seconds granularity for error logging
  md: reduce the number of synchronize_rcu() calls when multiple devices fail.
  md: be extra careful not to take a reference to a Faulty device.
  md/multipath: add rcu protection to rdev access in multipath_status.
  md/raid5: add rcu protection to rdev accesses in raid5_status.
  md/raid5: add rcu protection to rdev accesses in want_replace
  md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
  md/raid1: add rcu protection to rdev in fix_read_error
  md/raid1: small code cleanup in end_sync_write
  md/raid1: small cleanup in raid1_end_read/write_request
  md/raid10: simplify print_conf a little.
  md/raid10: minor code improvement in fix_read_error()
  md/raid10: add rcu protection to rdev access during reshape.
  md/raid10: add rcu protection to rdev access in raid10_sync_request.
  md/raid10: add rcu protection in raid10_status.
  md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
  ...
2016-07-28 18:04:39 -07:00
Linus Torvalds f0c98ebc57 libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement either
    ADR, or provide one or more flush addresses per nvdimm. ADR
    (Asynchronous DRAM Refresh) flushes data in posted write buffers to the
    memory controller on a power-fail event. Flush addresses are defined in
    ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
    "Flush Hint Address Structure". A flush hint is an mmio address that
    when written and fenced assures that all previous posted writes
    targeting a given dimm have been flushed to media.
 
 2/ On-demand ARS (address range scrub):
    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices.  When latent errors are detected we re-scrub the media
    to refresh the bad block list, userspace can also request a re-scrub at
    any time.
 
 3/ Support for the Microsoft DSM (device specific method) command format.
 
 4/ Support for EDK2/OVMF virtual disk device memory ranges.
 
 5/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
 NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
 ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
 trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
 KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
 qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
 WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
 DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
 b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
 cN3edFVIdxvZeMgM5Ubq
 =xCBG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

 - Replace pcommit with ADR / directed-flushing.

   The pcommit instruction, which has not shipped on any product, is
   deprecated.  Instead, the requirement is that platforms implement
   either ADR, or provide one or more flush addresses per nvdimm.

   ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
   to the memory controller on a power-fail event.

   Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
   Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
   A flush hint is an mmio address that when written and fenced assures
   that all previous posted writes targeting a given dimm have been
   flushed to media.

 - On-demand ARS (address range scrub).

   Linux uses the results of the ACPI ARS commands to track bad blocks
   in pmem devices.  When latent errors are detected we re-scrub the
   media to refresh the bad block list, userspace can also request a
   re-scrub at any time.

 - Support for the Microsoft DSM (device specific method) command
   format.

 - Support for EDK2/OVMF virtual disk device memory ranges.

 - Various fixes and cleanups across the subsystem.

* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
  libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
  nfit: do an ARS scrub on hitting a latent media error
  nfit: move to nfit/ sub-directory
  nfit, libnvdimm: allow an ARS scrub to be triggered on demand
  libnvdimm: register nvdimm_bus devices with an nd_bus driver
  pmem: clarify a debug print in pmem_clear_poison
  x86/insn: remove pcommit
  Revert "KVM: x86: add pcommit support"
  nfit, tools/testing/nvdimm/: unify shutdown paths
  libnvdimm: move ->module to struct nvdimm_bus_descriptor
  nfit: cleanup acpi_nfit_init calling convention
  nfit: fix _FIT evaluation memory leak + use after free
  tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
  tools/testing/nvdimm: add virtual ramdisk range
  acpi, nfit: treat virtual ramdisk SPA as pmem region
  pmem: kill __pmem address space
  pmem: kill wmb_pmem()
  libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
  fs/dax: remove wmb_pmem()
  libnvdimm, pmem: flush posted-write queues on shutdown
  ...
2016-07-28 17:38:16 -07:00
Shaohua Li 3f35e210ed Merge branch 'mymd/for-next' into mymd/for-linus 2016-07-28 09:34:14 -07:00
Shaohua Li 5d8817833c MD: fix null pointer deference
The md device might not have personality (for example, ddf raid array). The
issue is introduced by 8430e7e0af9a15(md: disconnect device from personality
before trying to remove it)

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-28 09:06:34 -07:00
Linus Torvalds f7e6816994 - initially based on Jens' 'for-4.8/core' (given all the flag churn) and
later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits
   that DM depends on to provide its DAX support
 
 - clean up the bio-based vs request-based DM core code by moving the
   request-based DM core code out to dm-rq.[hc]
 
 - reinstate bio-based support in the DM multipath target (done with the
   idea that fast storage like NVMe over Fabrics could benefit) -- while
   preserving support for request_fn and blk-mq request-based DM mpath
 
 - SCSI and DM multipath persistent reservation fixes that were
   coordinated with Martin Petersen.
 
 - the DM raid target saw the most extensive change this cycle; it now
   provides reshape and takeover support (by layering ontop of the
   corresponding MD capabilities)
 
 - DAX support for DM core and the linear, stripe and error targets
 
 - A DM thin-provisioning block discard vs allocation race fix that
   addresses potential for corruption
 
 - A stable fix for DM verity-fec's block calculation during decode
 
 - A few cleanups and fixes to DM core and various targets
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXkRZmAAoJEMUj8QotnQNat2wH/i4LpkoGI5tI6UhyKWxRkzJp
 vKaJ0zuZ2Ez73DucJujNuvaiyHq1IjHD5pfr8JQO3E8ygDkRC2KjF2O8EXp0Has6
 U1uLahQej72MAs0ZJTpvfE+JiY6qyIl4K+xxuPmYm2f2S5TWTIgOetYjJQmcMlQo
 Y8zFfcDYn4Dv5rMdvDT4+1ePETxq74wcBwTxyW3OAbHE1f0JjsUGdMKzXB1iTWcM
 VjLjWI//ETfFdIlDO0w2Qbd90aLUjmTR2k67RGnbPj5kNUNikv/X6iiY32KERR/0
 vMiiJ7JS+a44P7FJqCMoAVM/oBYFiSNpS4LYevOgHb0G0ikF8kaSeqBPC6sMYvg=
 =uYt9
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - initially based on Jens' 'for-4.8/core' (given all the flag churn)
   and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX
   commits that DM depends on to provide its DAX support

 - clean up the bio-based vs request-based DM core code by moving the
   request-based DM core code out to dm-rq.[hc]

 - reinstate bio-based support in the DM multipath target (done with the
   idea that fast storage like NVMe over Fabrics could benefit) -- while
   preserving support for request_fn and blk-mq request-based DM mpath

 - SCSI and DM multipath persistent reservation fixes that were
   coordinated with Martin Petersen.

 - the DM raid target saw the most extensive change this cycle; it now
   provides reshape and takeover support (by layering ontop of the
   corresponding MD capabilities)

 - DAX support for DM core and the linear, stripe and error targets

 - a DM thin-provisioning block discard vs allocation race fix that
   addresses potential for corruption

 - a stable fix for DM verity-fec's block calculation during decode

 - a few cleanups and fixes to DM core and various targets

* tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (73 commits)
  dm: allow bio-based table to be upgraded to bio-based with DAX support
  dm snap: add fake origin_direct_access
  dm stripe: add DAX support
  dm error: add DAX support
  dm linear: add DAX support
  dm: add infrastructure for DAX support
  dm thin: fix a race condition between discarding and provisioning a block
  dm btree: fix a bug in dm_btree_find_next_single()
  dm raid: fix random optimal_io_size for raid0
  dm raid: address checkpatch.pl complaints
  dm: call PR reserve/unreserve on each underlying device
  sd: don't use the ALL_TG_PT bit for reservations
  dm: fix second blk_delay_queue() parameter to be in msec units not jiffies
  dm raid: change logical functions to actually return bool
  dm raid: use rdev_for_each in status
  dm raid: use rs->raid_disks to avoid memory leaks on free
  dm raid: support delta_disks for raid1, fix table output
  dm raid: enhance reshape check and factor out reshape setup
  dm raid: allow resize during recovery
  dm raid: fix rs_is_recovering() to allow for lvextend
  ...
2016-07-26 17:12:11 -07:00
Toshi Kani b5ab4a9ba5 dm: allow bio-based table to be upgraded to bio-based with DAX support
Allow table type DM_TYPE_BIO_BASED to extend with DM_TYPE_DAX_BIO_BASED
since DM_TYPE_DAX_BIO_BASED supports bio-based requests.

This is needed to allow a snapshot of an LV with DAX support to be
removed.  One of the intermediate table reloads that lvm2 does switches
from DM_TYPE_BIO_BASED to DM_TYPE_DAX_BIO_BASED.  No known reason to
disallow this so...

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:52 -04:00
Toshi Kani f6e629bd23 dm snap: add fake origin_direct_access
dax-capable mapped-device is marked as DM_TYPE_DAX_BIO_BASED,
which supports both dax and bio-based operations.  dm-snap
needs to work with dax-capable device when bio-based operation
is used.

Add fake origin_direct_access() to origin device so that its
origin device is also marked as DM_TYPE_DAX_BIO_BASED for
dax-capable device.  This allows to extend target's DM table.
dm-snap works normally when bio-based operation is used.

dm-snap does not support dax operation, and mount with dax
option to a target device or snapshot device fails.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:51 -04:00
Toshi Kani beec25b457 dm stripe: add DAX support
Change dm-stripe to implement direct_access function,
stripe_direct_access(), which maps bdev and sector and
calls direct_access function of its physical target device.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:51 -04:00
Mike Snitzer f8df1fdf18 dm error: add DAX support
Allow the error target to replace an existing DAX-enabled target.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:50 -04:00
Toshi Kani 84b22f8378 dm linear: add DAX support
Change dm-linear to implement direct_access function,
linear_direct_access(), which maps sector and calls direct_access
function of its physical target device.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:49 -04:00
Toshi Kani 545ed20e6d dm: add infrastructure for DAX support
Change mapped device to implement direct_access function,
dm_blk_direct_access(), which calls a target direct_access function.
'struct target_type' is extended to have target direct_access interface.
This function limits direct accessible size to the dm_target's limit
with max_io_len().

Add dm_table_supports_dax() to iterate all targets and associated block
devices to check for DAX support.  To add DAX support to a DM target the
target must only implement the direct_access function.

Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped
device supports DAX and is bio based.  This new type is used to assure
that all target devices have DAX support and remain that way after
QUEUE_FLAG_DAX is set in mapped device.

At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting
DM_TYPE_DAX_BIO_BASED to the type.  Any subsequent table load to the
mapped device must have the same type, or else it fails per the check in
table_load().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 23:49:49 -04:00
Christoph Hellwig ed996a52c8 block: simplify and cleanup bvec pool handling
Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array.  Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:02 -06:00
Christoph Hellwig 70246286e9 block: get rid of bio_rw and READA
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:01 -06:00
Joe Thornber 2a0fbffb1e dm thin: fix a race condition between discarding and provisioning a block
The discard passdown was being issued after the block was unmapped,
which meant the block could be reprovisioned whilst the passdown discard
was still in flight.

We can only identify unshared blocks (safe to do a passdown a discard
to) once they're unmapped and their ref count hits zero.  Block ref
counts are now used to guard against concurrent allocation of these
blocks that are being discarded.  So now we unmap the block, issue
passdown discards, and the immediately increment ref counts for regions
that have been discarded via passed down (this is safe because
allocation occurs within the same thread).  We then decrement ref counts
once the passdown discard IO is complete -- signaling these blocks may
now be allocated.

This fixes the potential for corruption that was reported here:
https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html

Reported-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 12:43:35 -04:00
Joe Thornber e7e0f73047 dm btree: fix a bug in dm_btree_find_next_single()
dm_btree_find_next_single() can short-circuit the search for a block
with a return of -ENODATA if all entries are higher than the search key
passed to lower_bound().

This hasn't been a problem because of the way the btree has been used by
DM thinp.  But it must be fixed now in preparation for fixing the race
in DM thinp's handling of simultaneous block discard vs allocation.
Otherwise, once that fix is in place, some of the blocks in a discard
would not be unmapped as expected.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-20 12:43:34 -04:00
Tomasz Majchrzak 0e5313e2d4 raid10: improve random reads performance
RAID10 random read performance is lower than expected due to excessive spinlock
utilisation which is required mostly for rebuild/resync. Simplify allow_barrier
as it's in IO path and encounters a lot of unnecessary congestion.

As lower_barrier just takes a lock in order to decrement a counter, convert
counter (nr_pending) into atomic variable and remove the spin lock. There is
also a congestion for wake_up (it uses lock internally) so call it only when
it's really needed. As wake_up is not called constantly anymore, ensure process
waiting to raise a barrier is notified when there are no more waiting IOs.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 15:20:28 -07:00
Tomasz Majchrzak 573275b58e md: add missing sysfs_notify on array_state update
Changeset 6791875e2e has added early return from a function so there is no
sysfs notification for 'active' and 'clean' state change.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:28:39 -07:00
Alexey Obitotskiy 4cb9da7d9c Fix kernel module refcount handling
md loads raidX modules and increments module refcount each time level
has changed but does not decrement it. You are unable to unload raid0
module after reshape because raid0 reshape changes level to raid4
and back to raid0.

Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:17:31 -07:00
Arnd Bergmann 0e3ef49eda md: use seconds granularity for error logging
The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.

There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19 11:00:47 -07:00
Heinz Mauelshagen 89d3d9a1e3 dm raid: fix random optimal_io_size for raid0
raid_io_hints() was retrieving the number of data stripes used for the
calculation of io_opt from struct r5conf, which is not defined for raid0
mappings.

Base the calculation on the in-core raid_set structure instead.

Also, adjust to use to_bytes() for the sector -> bytes conversion
throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-19 11:37:08 -04:00
Heinz Mauelshagen 094f394df6 dm raid: address checkpatch.pl complaints
Use 'unsigned int' where appropriate.
Return negative errors.
Correct an indentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-19 11:37:07 -04:00
Christoph Hellwig 9c72bad1f3 dm: call PR reserve/unreserve on each underlying device
So far we tried to rely on the SCSI 'all target ports' bit to register
all path, but for many setups this didn't work properly as the different
paths are seen as separate initiators to the target instead of multiple
ports of the same initiator.  Because of that we'll stop setting the
'all target ports' bit in SCSI, and let device mapper handle iterating
over the device for each path and register them manually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:35 -04:00
Tahsin Erdogan bd9f55ea1c dm: fix second blk_delay_queue() parameter to be in msec units not jiffies
Commit d548b34b06 ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceaccab ("block:
remove per-queue plugging") did.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: d548b34b06 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms")
Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c
2016-07-18 15:37:34 -04:00
Heinz Mauelshagen d7ccc2e2a0 dm raid: change logical functions to actually return bool
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:33 -04:00
Heinz Mauelshagen 326824099f dm raid: use rdev_for_each in status
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:33 -04:00
Heinz Mauelshagen ffeeac7515 dm raid: use rs->raid_disks to avoid memory leaks on free
Also makes code more consistent throughout.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:32 -04:00
Heinz Mauelshagen 7a7c330fc2 dm raid: support delta_disks for raid1, fix table output
Add "delta_disks" constructor argument support to raid1 to allow for
consistent userspace disk addition/removal handling.

Fix raid_status() to report all raid disks with status and table output
on disk adding reshapes, not just the ones listed on the mddev; optimize
its rebuild and writemostly output.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:31 -04:00
Heinz Mauelshagen 469b304b58 dm raid: enhance reshape check and factor out reshape setup
Enhance rs_reshape_requested() check function to be more transparent and
fix its raid10 check.

Streamline the constructor by factoring out reshaping preparation into
fucntion rs_prepare_reshape().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:31 -04:00
Heinz Mauelshagen 2a5556c2a8 dm raid: allow resize during recovery
Resizing a RAID set during recovery can be allowed, because the MD
resynchronization thread will either stop any ongoing recovery in case
of shrinking below the current recovery position or carry on recovery
to the new size if the set is growing.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:30 -04:00
Heinz Mauelshagen 345a6cdc25 dm raid: fix rs_is_recovering() to allow for lvextend
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:29 -04:00
Heinz Mauelshagen 37f10be150 dm raid: fix rebuild and catch bogus sync/resync flags
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:28 -04:00
Heinz Mauelshagen b1956dc4fa dm raid: fix ctr memory leaks on error paths
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:28 -04:00
Heinz Mauelshagen 65359ee6b1 dm raid: fix typo in write_mostly flag
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:27 -04:00
Heinz Mauelshagen 4348309a8b dm raid: also reject size change during recovery
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:26 -04:00
Heinz Mauelshagen f6895fd505 dm raid: fix new superblock/bitmap creation on disk addition
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:26 -04:00
Heinz Mauelshagen 2527b56e0d dm raid: add comments and fix typos
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:25 -04:00
Heinz Mauelshagen fbe6365bb4 dm raid: fix raid10 device size error on out-of-place reshape
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:24 -04:00
Heinz Mauelshagen 2d92a3c2a4 dm raid: prohibit 'nosync' on new raid6 and reject resize during reshape
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:24 -04:00
Heinz Mauelshagen 4dff2f1e26 dm raid: clarify and fix recovery
Add function rs_setup_recovery() to allow for defined setup of RAID set
recovery in the constructor.

Will be called with dev_sectors={0, rdev->sectors, MaxSectors} to
recover a new or enforced sync, grown or not to be synhronized RAID set
respectively.

Prevents recovery on raid0, which doesn't support it.

Enforces recovery on raid6 to ensure properly defined Syndromes
mandatory for that MD personality are being created.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:23 -04:00
Heinz Mauelshagen 0095dbc98b dm raid: fix rs_set_capacity on growing reshape
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:22 -04:00
Heinz Mauelshagen 9d9d939c80 dm raid: make rs_set_capacity to work on shrinking reshape
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:22 -04:00
Heinz Mauelshagen 6ee0bae9c8 dm raid: enhance comments in takeover checks
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:21 -04:00
Heinz Mauelshagen ae3c6cfff9 dm raid: remove bogus comment and fix comment typos
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:20 -04:00
Heinz Mauelshagen 75dd3b9ecb dm raid: more restricting data_offset value checks
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:19 -04:00
Heinz Mauelshagen 5fa146b25b dm raid: reject too many write_mostly devices
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:19 -04:00
Heinz Mauelshagen 0a7b818892 dm raid: the sync_page_io() metadata_op argument is bool
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:18 -04:00
Heinz Mauelshagen 0d851d14b8 dm raid: prohibit to pass in both sync and nosync ctr flags
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:17 -04:00
Heinz Mauelshagen ff4a88bf1c dm raid: avoid superfluous memory barriers on static metadata
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-18 15:37:17 -04:00
Mike Snitzer 7193a9defc dm rq: check kthread_run return for .request_fn request-based DM
Check return value of kthread_run() in dm_old_init_request_queue().

Reported-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-06 09:06:37 -04:00
Yijing Wang 89b920e003 bcache: Remove redundant block_size assignment
We have assigned sb->block_size before the switch,
so remove the redundant one.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Eric Wheeler <bcache@lists.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:50 -06:00
Yijing Wang 7abc70d700 bcache: update document info
There is no return in continue_at(), update the documentation.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:49 -06:00
Yijing Wang c50d4d5dd3 bcache: Remove redundant parameter for cache_alloc()
Cache_sb is not used in cache_alloc, and we have copied
sb info to cache->sb already, remove it.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:47 -06:00
Sami Tolvanen 602d1657c6 dm verity fec: fix block calculation
do_div was replaced with div64_u64 at some point, causing a bug with
block calculation due to incompatible semantics of the two functions.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-01 23:29:08 -04:00
Bart Van Assche 028b39e314 dm ioctl: Simplify parameter buffer management code
Merge the two DM_PARAMS_[KV]MALLOC flags into a single flag.

Doing so avoids the crashes seen with previous attempts to consolidate
buffer management to use kvfree() without first flagging that memory had
actually been allocated.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-01 10:54:11 -04:00
Bart Van Assche 350b539328 dm crypt: Fix sparse complaints
Avoid that sparse complains about assigning a __le64 value to a u64
variable.  Remove the (u64) casts since these are superfluous.  This
patch does not change the behavior of the source code.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-07-01 10:53:21 -04:00
Arnd Bergmann 68c1c4d5ea dm raid: don't use 'const' in function return
A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:

drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]

This changes the return type to plain 'int'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f0685 ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-16 12:09:54 -04:00
Heinz Mauelshagen 6e20902e8f dm raid: fix failed takeover/reshapes by keeping raid set frozen
Superblock updates where bogus causing some takovers/reshapes to fail.

Introduce new runtime flag (RT_FLAG_KEEP_RS_FROZEN) to keep a raid set
frozen when a layout change was requested.  Userpace will immediately
reload the table w/o the flags requesting such change once they made it
to the superblocks and any change of recovery/reshape offsets has to be
avoided until after read.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 18:52:14 -04:00
Heinz Mauelshagen 4257e085e2 dm raid: support to change bitmap region size
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 18:52:13 -04:00
Heinz Mauelshagen 9dbd1aa3a8 dm raid: add reshaping support to the target
Add bool functions rs_is_recovering and rs_is_reshaping()
to test for ongoing recovery/reshaping respectively in order
to reject respective requests on ongoing ones.

Remove ctr array size check, because ti->len and array
sectors will differ during disk addition/removal reshape.

Use __is_raid10_near() rather than type string compare.

Introduce rs_check_reshape() and rs_start_reshape(),
use the former in the ctr to reject bogus rehsape requests
and the latter in preresume to actually start a reshape.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 18:52:12 -04:00
Heinz Mauelshagen 40ba37e564 dm raid: add prerequisite functions and definitions for reshaping
Add rs_is_reshapable(), rs_data_stripes(), rs_reshape_requested(),
rs_set_dev_and_array_sectors() and rs_adjust_data_offsets()

Remove superfluous check for reshape message

Correct runtime bit definitions to be incremental

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 18:52:11 -04:00
Heinz Mauelshagen a30cbc0d1c dm raid: inverse check for flags from invalid to valid flags
It is more intuitive to manage each raid level's features in terms of
what is supported rather than what isn't supported.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:25:02 -04:00
Mike Snitzer e6ca5e1a03 dm raid: various code cleanups
Renamed functions and variables with leading single underscore to have a
double underscore.  Renamed some functions to have better names.  Folded
functions that were split out without reason.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:25:01 -04:00
Mike Snitzer bfcee0e312 dm raid: rename functions that alloc and free struct raid_set
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:25:01 -04:00
Mike Snitzer 4286325b4b dm raid: remove all the bitops wrappers
Removes obfuscation that is of little value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:25:00 -04:00
Mike Snitzer bb91a63fcc dm raid: rename _in_range to __within_range
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:59 -04:00
Mike Snitzer ef9b85a651 dm raid: add missing "dm-raid0" module alias
Also update module description to "raid0/1/10/4/5/6 target"

Reported by Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:59 -04:00
Mike Snitzer 3fa6cf3821 dm raid: rename _argname_by_flag to dm_raid_arg_name_by_flag
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:58 -04:00
Mike Snitzer 9b6e542329 dm raid: bump to v1.9.0 and make the extended SB feature flag reflect it
No idea what Heinz was doing with the versioning but upstream commit
4c9971ca6a ("dm raid: make sure no feature flags are set in metadata")
bumped to 1.8.0 already.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:57 -04:00
Mike Snitzer bd83a4c4f8 dm raid: remove ti_error_* wrappers
There ti_error_* wrappers added very little.  No other DM target has
ever gone to such lengths to wrap setting ti->error.

Also fixes some NULL derefences via rs->ti->error.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:57 -04:00
Mike Snitzer 43157840fd dm raid: tabify appropriate whitespace
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:56 -04:00
Heinz Mauelshagen 3a1c1ef2fd dm raid: enhance status interface and fixup takeover/raid0
The target's status interface has to provide the new 'data_offset' value
to allow userspace to retrieve the kernels offset to the data on each
raid device of a raid set.  This is the base for out-of-place reshaping
required to not write over any data during reshaping (e.g. change
raid6_zr -> raid6_nc):

 - add rs_set_cur() to be able to start up existing array in case of no
   takeover; use in ctr on takeover check

 - enhance raid_status()

 - add supporting functions to get resync/reshape progress and raid
   device status chars

 - fixup rebuild table line output race, which does miss to emit
   'rebuild N' on fully synced/rebuild devices, because it is relying on
   the transient 'In_sync' raid device flag

 - add new status line output for 'data_offset', which'll later be used
   for out-of-place reshaping

 - fixup takeover not working for all levels

 - fixup raid0 message interface oops caused by missing checks
   for the md threads, which don't exist in case of raid0

 - remove ALL_FREEZE_FLAGS not needed for takeover

 - adjust comments

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:55 -04:00
Heinz Mauelshagen ecbfb9f118 dm raid: add raid level takeover support
Add raid level takeover support allowing arbitrary takeovers between
raid levels supported by md personalities (i.e. raid0, raid1/10 and
raid4/5/6):

 - add rs_config_{backup|restore} function to allow for temporary
   storing ctr requested layout changes and restore them for takeover
   conersion decision after the superblocks got loaded and analyzed

 - add members to store layout to 'struct raid_set' (not mandatory
   for takeover but needed for reshape in later patch)

 - add rebuild_disks bitfield to 'struct raid_set' and set bits in ctr
   to use in setting up takeover (base to address a 'rebuild' related
   raid_status() table line bug and needed as well for reshape in future
   patch)

 - add runtime flags and respective manipulation functions to be able to
   control e.g. wrting of superlocks to the preresume function on
   takeover and (later) reshape

 - add functions to detect takeover, check it's valid (mandatory here to
   avoid failing on md_run()), setup for it and use in the ctr; those
   will be likely moved out once reshaping gets added to simplify the
   ctr

 - start raid set readonly in ctr and switch to readwrite, optionally
   updating superblocks, in preresume in order to allow suspend to
   quiesce any active table before (which involves superblock updates);
   this ensures the proper sequence of writing the current and any new
   takeover(/reshape) metadata

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:24:46 -04:00
Heinz Mauelshagen 7b34df74d2 dm raid: enhance super_sync() to support new superblock members
Add transferring the new takeover/reshape related superblock
members introduced to the super_sync() function:

 - add/move supporting functions

 - add failed devices bitfield transfer functions to retrieve the
   bitfield from superblock format or update it in the superblock

 - add code to transfer all new members

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:09:35 -04:00
Heinz Mauelshagen 4763e543a6 dm raid: add new reshaping/raid10 format table line options to parameter parser
Support the follwoing arguments in the ctr parameter parser:

 - add 'delta_disks', 'data_offset' taking int and sector respectively

 - 'raid10_use_near_sets' bool argument to optionally select
   near sets with supporting raid10 mappings

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:09:34 -04:00
Heinz Mauelshagen 33e53f0685 dm raid: introduce extended superblock and new raid types to support takeover/reshaping
Add new members to the dm-raid superblock and new raid types to support
takeover/reshape.

Add all necessary members needed to support takeover and reshape in one
go -- aiming to limit the amount of changes to the superblock layout.

This is a larger patch due to the new superblock members, their related
flags, validation of both and involved API additions/changes:

 - add additional members to keep track of:
   - state about forward/backward reshaping
   - reshape position
   - new level, layout, stripe size and delta disks
   - data offset to current and new data for out-of-place reshapes
   - failed devices bitfield extensions to keep track of max raid devices

 - adjust super_validate() to cope with new superblock members

 - adjust super_init_validation() to cope with new superblock members

 - add definitions for ctr flags supporting delta disks etc.

 - add new raid types (raid6_n_6 etc.)

 - add new raid10 supporting function API (_is_raid10_*())

 - adjust to changed raid10 supporting function API

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-06-14 17:09:32 -04:00
NeilBrown d787be4092 md: reduce the number of synchronize_rcu() calls when multiple devices fail.
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.

As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices.  Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.

fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:22 -07:00
NeilBrown f5b67ae86e md: be extra careful not to take a reference to a Faulty device.
It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.

Some places take a new reference after only check In_sync.  This should
be safe as the two are changed together.  However to make the code more
obviously safe, add checks for 'Faulty' as well.

Note: the actual rule is:
  Never increment nr_pending if  Faulty is set and Blocked is clear,
  never clear Faulty, and never set Blocked without holding a reference
  through nr_pending.

fix build error (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:21 -07:00
NeilBrown 40cf2123c5 md/multipath: add rcu protection to rdev access in multipath_status.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:21 -07:00
NeilBrown 5fd133511d md/raid5: add rcu protection to rdev accesses in raid5_status.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:20 -07:00
NeilBrown 3f232d6a95 md/raid5: add rcu protection to rdev accesses in want_replace
Being in the middle of resync is no longer protection against failed
rdevs disappearing.  So add rcu protection.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:19 -07:00
NeilBrown e50d399232 md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:19 -07:00
NeilBrown 707a6a420c md/raid1: add rcu protection to rdev in fix_read_error
Since remove_and_add_spares() was added to hot_remove_disk() it has
been possible for an rdev to be hot-removed while fix_read_error()
was running, so we need to be more careful, and take a reference to
the rdev while performing IO.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:18 -07:00
NeilBrown 854abd7584 md/raid1: small code cleanup in end_sync_write
'mirror' is only used to find 'rdev', several times.
So just find 'rdev' once, and use it instead.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:17 -07:00
NeilBrown e5872d58f5 md/raid1: small cleanup in raid1_end_read/write_request
Both functions use conf->mirrors[mirror].rdev several times, so
improve readability by storing this in a local variable.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:17 -07:00
NeilBrown 4056ca51a2 md/raid10: simplify print_conf a little.
'tmp' is only ever used to extract 'tmp->rdev', so just use 'rdev' directly.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:16 -07:00
NeilBrown d683c8e0f7 md/raid10: minor code improvement in fix_read_error()
rdev already holds conf->mirrors[d].rdev, so no need to load it again.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:16 -07:00