4965 Commits

Author SHA1 Message Date
Linus Torvalds 3645e6d0dc Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "This update mainly fixes bugs:

   - Make raid5 ppl support several ppl from Pawel

   - Several raid5-cache bug fixes from Song

   - Bitmap fixes from Neil and Me

   - One raid1/10 regression fix since 4.12 from Me

   - Other small fixes and cleanup"

* tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/bitmap: disable bitmap_resize for file-backed bitmaps.
  raid5-ppl: Recovery support for multiple partial parity logs
  md: Runtime support for multiple ppls
  md/raid0: attach correct cgroup info in bio
  lib/raid6: align AVX512 constants to 512 bits, not bytes
  raid5: remove raid5_build_block
  md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
  md: replace seq_release_private with seq_release
  md: notify about new spare disk in the container
  md/raid1/10: reset bio allocated from mempool
  md/raid5: release/flush io in raid5_do_work()
  md/bitmap: copy correct data for bitmap super
2017-09-07 12:41:48 -07:00
Linus Torvalds a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device removal. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Linus Torvalds 80cee03bf1 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Here is the crypto update for 4.14:

  API:
   - Defer scompress scratch buffer allocation to first use.
   - Add __crypto_xor that takes separate src and dst operands.
   - Add ahash multiple registration interface.
   - Revamped aead/skcipher algif code to fix async IO properly.

  Drivers:
   - Add non-SIMD fallback code path on ARM for SVE.
   - Add AMD Security Processor framework for ccp.
   - Add support for RSA in ccp.
   - Add XTS-AES-256 support for CCP version 5.
   - Add support for PRNG in sun4i-ss.
   - Add support for DPAA2 in caam.
   - Add ARTPEC crypto support.
   - Add Freescale RNGC hwrng support.
   - Add Microchip / Atmel ECC driver.
   - Add support for STM32 HASH module"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: af_alg - get_page upon reassignment to TX SGL
  crypto: cavium/nitrox - Fix an error handling path in 'nitrox_probe()'
  crypto: inside-secure - fix an error handling path in safexcel_probe()
  crypto: rockchip - Don't dequeue the request when device is busy
  crypto: cavium - add release_firmware to all return case
  crypto: sahara - constify platform_device_id
  MAINTAINERS: Add ARTPEC crypto maintainer
  crypto: axis - add ARTPEC-6/7 crypto accelerator driver
  crypto: hash - add crypto_(un)register_ahashes()
  dt-bindings: crypto: add ARTPEC crypto
  crypto: algif_aead - fix comment regarding memory layout
  crypto: ccp - use dma_mapping_error to check map error
  lib/mpi: fix build with clang
  crypto: sahara - Remove leftover from previous used spinlock
  crypto: sahara - Fix dma unmap direction
  crypto: af_alg - consolidation of duplicate code
  crypto: caam - Remove unused dentry members
  crypto: ccp - select CONFIG_CRYPTO_RSA
  crypto: ccp - avoid uninitialized variable warning
  crypto: serpent - improve __serpent_setkey with UBSAN
  ...
2017-09-06 15:17:17 -07:00
NeilBrown e8a27f836f md/bitmap: disable bitmap_resize for file-backed bitmaps.
bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.

The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger.  Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.

So this patch disables the resizing of arrays which have
file-backed bitmaps.  This is better than crashing.

Reported-by: Zhilong Liu <zlliu@suse.com>
Fixes: d60b479d17 ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Cc: stable@vger.kernel.org (v3.5+).
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-31 22:57:03 -07:00
Pawel Baldysiak 675dc2ccc2 raid5-ppl: Recovery support for multiple partial parity logs
Search the PPL buffer to find the latest PPL header (the one with the
largest generation number) and use it for recovery. The PPL entry format
and recovery algorithm are the same as for the single-PPL approach.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28 07:45:49 -07:00
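
The selection rule in the commit above, taking the header with the largest generation number, can be sketched as follows. This is a minimal userspace sketch, not the kernel's recovery code: `struct log_header`, its `valid` field, and `find_latest_header()` are hypothetical names, validation is assumed to happen elsewhere, and generation-counter wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified header: only what the selection rule needs. */
struct log_header {
	uint32_t generation; /* larger means written later */
	int valid;           /* nonzero if the header passed validation (assumed) */
};

/*
 * Return the index of the valid header with the largest generation,
 * or -1 if no header is valid.
 */
int find_latest_header(const struct log_header *hdrs, size_t count)
{
	int latest = -1;

	for (size_t i = 0; i < count; i++) {
		if (!hdrs[i].valid)
			continue;
		if (latest < 0 || hdrs[i].generation > hdrs[latest].generation)
			latest = (int)i;
	}
	return latest;
}
```

An invalid header is skipped even if its generation number is the largest, so recovery in this sketch falls back to the newest header that can actually be trusted.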
Pawel Baldysiak ddc088238c md: Runtime support for multiple ppls
Increase the PPL area to 1MB and use it as a circular buffer to store PPLs.
The entry with the highest generation number is the latest one. If the PPL
to be written is larger than the space left in the buffer, rewind the
buffer to the start (don't wrap it).

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28 07:45:48 -07:00
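
The "rewind, don't wrap" rule from the commit above can be illustrated with a small offset calculation. This is a sketch under stated assumptions: `struct log_area` and `log_area_reserve()` are hypothetical names, and every entry is assumed to be no larger than the whole area.

```c
#include <stddef.h>

/* Hypothetical log area descriptor; field names are illustrative only. */
struct log_area {
	size_t size;     /* total bytes in the area */
	size_t next_off; /* offset where the next entry would start */
};

/*
 * Reserve len bytes and return the offset at which the entry is written.
 * An entry is never split across the end of the area: if it does not fit
 * in the remaining space, the write position rewinds to offset 0.
 * Assumes len <= a->size.
 */
size_t log_area_reserve(struct log_area *a, size_t len)
{
	size_t off;

	if (a->next_off + len > a->size)
		a->next_off = 0; /* rewind, don't wrap */

	off = a->next_off;
	a->next_off += len;
	return off;
}
```

With a 100-byte area and 40-byte entries, the first two entries land at offsets 0 and 40; the third does not fit in the remaining 20 bytes, so it is placed at offset 0 again rather than being split.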
Bart Van Assche 1c23484c35 dm mpath: do not lock up a CPU with requeuing activity
When using the block layer in single queue mode, get_request()
returns ERR_PTR(-EAGAIN) if the queue is dying and the REQ_NOWAIT
flag has been passed to get_request(). Avoid that the kernel
reports soft lockup complaints in this case due to continuous
requeuing activity.

Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28 09:58:27 -04:00
Bart Van Assche 604407890e dm: fix printk() rate limiting code
Using the same rate limiting state for different kinds of messages
is wrong because this can cause a high frequency message to suppress
a report of a low frequency message. Hence use a unique rate limiting
state per message type.

Fixes: 71a16736a1 ("dm: use local printk ratelimit")
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28 09:58:27 -04:00
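
Why a shared rate-limiting state is a problem can be shown with a toy model. This is only a sketch: `struct rl_state` and `rl_allow()` are hypothetical, and the time window that a real rate limiter uses is left out, so the state is just a burst budget.

```c
/* Hypothetical, simplified rate-limit state: a burst budget, no time window. */
struct rl_state {
	int burst;   /* messages allowed before suppression starts */
	int printed; /* messages already allowed */
};

/* Return 1 if a message may be emitted, 0 if it is suppressed. */
int rl_allow(struct rl_state *rs)
{
	if (rs->printed >= rs->burst)
		return 0;
	rs->printed++;
	return 1;
}
```

If a frequent message and a rare message share one `rl_state`, the frequent one can use up the whole budget and the rare one is suppressed. Giving each message type its own state keeps the rare message visible no matter how noisy the other one is.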
Bart Van Assche 68515cc721 dm mpath: retry BLK_STS_RESOURCE errors
Retry requests instead of failing them if an out-of-memory error occurs
or the block driver below dm-mpath is busy.  This restores the v4.12
behavior of noretry_error(), namely that -ENOMEM results in a retry.

Fixes: 2a842acab1 ("block: introduce new block status code type")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28 09:58:26 -04:00
Bart Van Assche 54385bf75c dm: fix the second dec_pending() argument in __split_and_process_bio()
Detected by sparse.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28 09:36:19 -04:00
Shaohua Li 8a8e6f84ad md/raid0: attach correct cgroup info in bio
The discard bio doesn't carry the original bio's cgroup info. A normal bio
is cloned, so it is fine.

Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:48 -07:00
Guoqing Jiang 27a4ff8f49 raid5: remove raid5_build_block
raid5_build_block is now only called to set the
sector of r5dev, so raid5_compute_blocknr can be
used directly for that purpose.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:47 -07:00
Song Liu a72cbf83b0 md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
In r5c_journal_mode_show(), it is necessary to call mddev_lock()
before accessing conf and conf->log. Otherwise, the conf->log
may change (and become NULL).

Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:46 -07:00
Cihangir Akturk 26e13043b7 md: replace seq_release_private with seq_release
Since commit f15146380d ("fs: seq_file - add event counter to simplify
poll() support"), md.c no longer uses the private field of
struct seq_file, but seq_release_private() is still used to
release the allocated seq_file. This seems to have been
overlooked. Convert it to use seq_release() instead of
seq_release_private().

Signed-off-by: Cihangir Akturk <cakturk@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:45 -07:00
Alexey Obitotskiy 5492c46e94 md: notify about new spare disk in the container
For external metadata arrays, spare disks are added to containers
first. mdadm keeps monitoring /proc/mdstat output and, when a spare disk is
available, moves it from the container to the array. The problem is
that there is no notification of a new spare disk in the container, so mdadm
waits a long time (until timeout) before it takes action.

Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:45 -07:00
Shaohua Li 208410b546 md/raid1/10: reset bio allocated from mempool
Data allocated from a mempool doesn't always get initialized; this happens when
the data is reused instead of freshly allocated. In the raid1/10 case, we must
reinitialize the bios.

Reported-by: Jonathan G. Underwood <jonathan.underwood@gmail.com>
Fixes: f0250618361d ("md: raid10: don't use bio's vec table to manage resync pages")
Fixes: 98d30c5812c3 ("md: raid1: don't use bio's vec table to manage resync pages")
Cc: stable@vger.kernel.org (4.12+)
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:44 -07:00
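
The reuse problem described in the commit above can be reproduced with a toy pool. This is a sketch, not the kernel mempool API: `struct tiny_pool`, `pool_get()`, `pool_put()`, and `request_reset()` are hypothetical names, and the pool holds a single reserved element to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pooled object. */
struct request {
	int flags;
	int nr_pages;
};

/* Hypothetical one-slot pool standing in for a reserve of preallocated objects. */
struct tiny_pool {
	struct request reserve;
	int busy;
};

/* Hand out the reserved element; its contents are whatever the last user left. */
struct request *pool_get(struct tiny_pool *p)
{
	if (p->busy)
		return NULL;
	p->busy = 1;
	return &p->reserve;
}

/* Return the element to the pool without clearing it. */
void pool_put(struct tiny_pool *p, struct request *r)
{
	(void)r;
	p->busy = 0;
}

/* What the caller must do before reusing a pooled object. */
void request_reset(struct request *r)
{
	memset(r, 0, sizeof(*r));
}
```

An object obtained a second time from this pool still holds the `flags` and `nr_pages` values of its previous user until `request_reset()` is called, which is the kind of stale state the commit guards against by reinitializing the bios.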
Song Liu 9c72a18e46 md/raid5: release/flush io in raid5_do_work()
In raid5, there are scenarios where some IOs are deferred to a later
time, and some IOs need a flush to complete. To make sure we make
progress with these IOs, we need to call the following functions:

    flush_deferred_bios(conf);
    r5l_flush_stripe_to_raid(conf->log);

Both of these functions are called in raid5d(), but are missing from
raid5_do_work(). As a result, they are not called
when multi-threading (group_thread_cnt > 0) is enabled. This patch
adds calls to these functions to raid5_do_work().

Note for stable branches:

  r5l_flush_stripe_to_raid(conf->log) is needed for 4.4+
  flush_deferred_bios(conf) is only needed for 4.11+

Cc: stable@vger.kernel.org (4.4+)
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24 10:06:19 -07:00
Shaohua Li 8031c3ddc7 md/bitmap: copy correct data for bitmap super
The raid5 cache could write the bitmap superblock before the bitmap superblock
is initialized. The bitmap superblock is less than 512B. The current code
only copies the superblock to a new page and writes the whole 512B, which
zeroes the data after the superblock. Unfortunately that data could include
the bitmap, which we should preserve. This patch makes the superblock read a 4k
chunk, and we always copy the 4k of data to the new page, so the superblock write
puts the old data back on disk and the bitmap is unchanged.

Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24 10:04:54 -07:00
Christoph Hellwig 74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig 10433d04b8 raid5: remove a call to get_start_sect
The block layer always remaps partitions before calling into the
->make_request methods of drivers.  Thus the call to get_start_sect in
in_chunk_boundary will always return 0 and can be removed.

Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:49 -06:00
Shaohua Li afc1f55ca4 MD: not clear ->safemode for external metadata array
->safemode should be triggered by mdadm for external metadata arrays, otherwise
the array's state confuses mdadm.

Fixes: 33182d15c6bf ("md: always clear ->safemode when md_check_recovery gets the mddev lock.")
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-11 20:42:06 -07:00
Jens Axboe d62e26b3ff block: pass in queue to inflight accounting
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:16 -06:00
Christoph Hellwig 62d20aa6a0 dm-crypt: don't mess with BIP_BLOCK_INTEGRITY
This flag is never set right after calling bio_integrity_alloc,
so don't clear it and confuse the reader.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 10:00:31 -06:00
Song Liu a9501d7421 md/r5cache: fix io_unit handling in r5l_log_endio()
In r5l_log_endio(), once log->io_list_lock is released, the io unit
may be accessed (or even freed) by other threads. Current code
doesn't handle the io_unit properly, which leads to potential race
conditions.

This patch solves this race condition by:

1. Add a pending_stripe count for flush_payload. Multiple flush_payloads
   are counted as only one pending_stripe. Flag has_flush_payload is
   added to show whether the io unit has flush_payload;
2. In r5l_log_endio(), check flags has_null_flush and
   has_flush_payload with log->io_list_lock held. After the lock
   is released, this IO unit is only accessed when we know the
   pending_stripe counter cannot be zeroed by other threads.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08 07:42:37 -07:00
Song Liu b44886c54a md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_set
In r5c_journal_mode_set(), it is necessary to call mddev_lock()
before accessing conf and conf->log. Otherwise, the conf->log
may change (and become NULL).

Shaohua: fix unlock in failure cases

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08 07:42:36 -07:00
NeilBrown 81fe48e9aa md: fix test in md_write_start()
md_write_start() needs to clear the in_sync flag if it is set, or if
there might be a race with set_in_sync() such that the latter will
set it very soon.  In the latter case it is sufficient to take the
spinlock to synchronize with set_in_sync(), and then set the flag
if needed.

The current test is incorrect.
It should be:
  if "flag is set" or "race is possible"

"flag is set" is trivially "mddev->in_sync".
"race is possible" should be tested by "mddev->sync_checkers".

If sync_checkers is 0, then there can be no race.  set_in_sync() will
wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period,
and as md_write_start() holds the rcu_read_lock(), set_in_sync() will
be sure to see the update to writes_pending.

If sync_checkers is > 0, there could be race.  If md_write_start()
happened entirely between
		if (!mddev->in_sync &&
		    percpu_ref_is_zero(&mddev->writes_pending)) {
and
			mddev->in_sync = 1;
in set_in_sync(), then it would not see that in_sync had been set,
and set_in_sync() would not see that writes_pending had been
incremented.

This bug means that in_sync is sometimes not set when it should be.
Consequently there is a small chance that the array will be marked as
"clean" when in fact it is inconsistent.

Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08 07:42:36 -07:00
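
The corrected test from the commit above, "flag is set" or "race is possible", can be written as a small predicate. This is a sketch with hypothetical names (`struct array_state`, `write_start_needs_lock()`, `write_start_locked()`); locking, RCU, and the per-cpu writes_pending counter are deliberately left out.

```c
/* Hypothetical stand-in for the two fields the test looks at. */
struct array_state {
	int in_sync;       /* the "flag is set" half of the test */
	int sync_checkers; /* nonzero means a race with set_in_sync() is possible */
};

/* Return 1 if the slow path (take the lock and recheck) is needed. */
int write_start_needs_lock(const struct array_state *s)
{
	return s->in_sync || s->sync_checkers;
}

/*
 * Slow path, assumed to run with the lock held: clear in_sync if it is set.
 * Returns 1 if the flag was cleared, 0 if there was nothing to do.
 */
int write_start_locked(struct array_state *s)
{
	if (!s->in_sync)
		return 0;
	s->in_sync = 0;
	return 1;
}
```

When both fields are zero the fast path is safe and no lock is taken. In every other case the lock is taken and the flag is rechecked under it, which covers the window where set_in_sync() is about to set the flag.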
NeilBrown 33182d15c6 md: always clear ->safemode when md_check_recovery gets the mddev lock.
If ->safemode == 1, md_check_recovery() will try to get the mddev lock
and perform various other checks.
If mddev->in_sync is zero, it will call set_in_sync, and clear
->safemode.  However if mddev->in_sync is not zero, ->safemode will not
be cleared.

When md_check_recovery() drops the mddev lock, the thread is woken
up again.  Normally it would just check if there was anything else to
do, find nothing, and go to sleep.  However as ->safemode was not
cleared, it will take the mddev lock again, then wake itself up
when unlocking.

This results in an infinite loop, repeatedly calling
md_check_recovery(), which RCU or the soft-lockup detector
will eventually complain about.

Prior to commit 4ad23a9764 ("MD: use per-cpu counter for
writes_pending"), safemode would only be set to one when the
writes_pending counter reached zero, and would be cleared again
when writes_pending is incremented.  Since that patch, safemode
is set more freely, but is not reliably cleared.

So in md_check_recovery() clear ->safemode before checking ->in_sync.

Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (4.12+)
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reported-by: David R <david@unsolicited.net>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08 07:42:35 -07:00
Ard Biesheuvel 45fe93dff2 crypto: algapi - make crypto_xor() take separate dst and src arguments
There are quite a number of occurrences in the kernel of the pattern

  if (dst != src)
          memcpy(dst, src, walk.total % AES_BLOCK_SIZE);
  crypto_xor(dst, final, walk.total % AES_BLOCK_SIZE);

or

  crypto_xor(keystream, src, nbytes);
  memcpy(dst, keystream, nbytes);

where crypto_xor() is preceded or followed by a memcpy() invocation
that is only there because crypto_xor() uses its output parameter as
one of the inputs. To avoid having to add new instances of this pattern
in the arm64 code, which will be refactored to implement non-SIMD
fallbacks, add an alternative implementation called crypto_xor_cpy(),
taking separate input and output arguments. This removes the need for
the separate memcpy().

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-08-04 09:27:15 +08:00
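
The difference between the two calling patterns in the commit above can be shown with byte-wise stand-ins. This is a sketch only: `xor_in_place()` and `xor_cpy()` are hypothetical userspace helpers that mirror the idea of crypto_xor() and crypto_xor_cpy(), not the kernel implementations.

```c
#include <stddef.h>
#include <string.h>

/* Two-operand form: dst ^= src. The output doubles as one of the inputs. */
void xor_in_place(unsigned char *dst, const unsigned char *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* Three-operand form: dst = src1 ^ src2. dst may alias either source. */
void xor_cpy(unsigned char *dst, const unsigned char *src1,
	     const unsigned char *src2, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src1[i] ^ src2[i];
}
```

With the two-operand form a caller that wants `dst = src ^ keystream` has to `memcpy()` one input into `dst` first and then XOR the other into it. The three-operand form produces the same bytes in a single pass with no extra copy.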
Linus Torvalds 9583f1c99f Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "This fixes several bugs, three of them are marked for stable:

   - an initialization issue fixed by Ming

   - a bio clone race issue fixed by me

   - an async tx flush issue fixed by Ofer

   - other cleanups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: fix warnning for UP case
  md/raid5: add thread_group worker async_tx_issue_pending_all
  md: simplify code with bio_io_error
  md/raid1: fix writebehind bio clone
  md: raid1-10: move raid1/raid10 common code into raid1-10.c
  md: raid1/raid10: initialize bvec table via bio_add_page()
  md: remove 'idx' from 'struct resync_pages'
2017-07-28 12:24:21 -07:00
Linus Torvalds 1731a47444 Merge tag 'for-4.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - a few DM integrity fixes that improve performance. One addresses
   inefficiencies in the on-disk journal device layout. Another
   makes use of the block layer's on-stack plugging when writing the
   journal.

 - a dm-bufio fix for the blk_status_t conversion that went in during
   the merge window.

 - a few DM raid fixes that address correctness when suspending the
   device, plus a fix for the validation that occurs during device
   activation.

 - a couple DM zoned target fixes. Important one being the fix to not
   use GFP_KERNEL in the IO path due to concerns about deadlock in
   low-memory conditions (e.g. swap over a DM zoned device, etc).

 - a DM DAX device fix to make sure dm_dax_flush() is called if the
   underlying DAX device is operating as a write cache.

* tag 'for-4.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm, dax: Make sure dm_dax_flush() is called if device supports it
  dm verity fec: fix GFP flags used with mempool_alloc()
  dm zoned: use GFP_NOIO in I/O path
  dm zoned: remove test for impossible REQ_OP_FLUSH conditions
  dm raid: bump target version
  dm raid: avoid mddev->suspended access
  dm raid: fix activation check in validate_raid_redundancy()
  dm raid: remove WARN_ON() in raid10_md_layout_to_format()
  dm bufio: fix error code in dm_bufio_write_dirty_buffers()
  dm integrity: test for corrupted disk format during table load
  dm integrity: WARN_ON if variables representing journal usage get out of sync
  dm integrity: use plugging when writing the journal
  dm integrity: fix inefficient allocation of journal space
2017-07-28 12:17:17 -07:00
Vivek Goyal 273752c9ff dm, dax: Make sure dm_dax_flush() is called if device supports it
Currently dm_dax_flush() is not being called, even if underlying dax
device supports write cache, because DAXDEV_WRITE_CACHE is not being
propagated up to the DM dax device.

If the underlying dax device supports write cache, set
DAXDEV_WRITE_CACHE on the DM dax device.  This will cause dm_dax_flush()
to be called.

Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-26 15:55:44 -04:00
NeilBrown 34c96507e8 dm verity fec: fix GFP flags used with mempool_alloc()
mempool_alloc() cannot fail for GFP_NOIO allocation, so there is no
point testing for failure.

One place the code tested for failure was passing "0" as the GFP
flags.  This is most unusual and is probably meant to be GFP_NOIO,
so that is changed.

Also, allocation from ->extra_pool and ->prealloc_pool are repeated
before releasing the previous allocation.  This can deadlock if the code
is servicing a write under high memory pressure.  To avoid deadlocks,
change these to use GFP_NOWAIT and leave the error handling in place.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-26 15:55:44 -04:00
Damien Le Moal 4218a95546 dm zoned: use GFP_NOIO in I/O path
Use GFP_NOIO for memory allocations in the I/O path.  Other memory
allocations in the initialization path can use GFP_KERNEL.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-26 15:55:43 -04:00
Shaohua Li ed9b66d218 MD: fix warnning for UP case
spin_is_locked always returns 0 for the UP case, so ignore it

Reported-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-25 15:18:13 -07:00
Mikulas Patocka edbe9597ac dm zoned: remove test for impossible REQ_OP_FLUSH conditions
The value REQ_OP_FLUSH is only used by the block code for
request-based devices.

Remove the tests for REQ_OP_FLUSH from the bio-based dm-zoned-target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 15:12:17 -04:00
Heinz Mauelshagen ac6a318888 dm raid: bump target version
Bump dm-raid target version to 1.12.1 to reflect that commit cc27b0c78c
("md: fix deadlock between mddev_suspend() and md_write_start()") is
available.

This version change allows userspace to detect that the MD fix is available.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 14:54:20 -04:00
Heinz Mauelshagen 0cf352e5a0 dm raid: avoid mddev->suspended access
Use runtime flag to ensure that an mddev gets suspended/resumed just once.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 14:54:19 -04:00
Heinz Mauelshagen f4af3f82da dm raid: fix activation check in validate_raid_redundancy()
During growing reshapes (i.e. stripes being added to a raid set), the
new stripe images are not in-sync and not part of the raid set until
the reshape is started.

LVM2 has to request multiple table reloads involving superblock updates
in order to reflect the proper size of SubLVs in the cluster.  As a result,
validate_raid_redundancy() fails before a stripe-adding reshape starts:
it checks the total number of devices against the number of rebuild
ones rather than the actual ones in the raid set (as retrieved from the
superblock), so the raid4/5/6/10 redundancy checks fail.

E.g. converting a raid5 from 3 stripes to 7 stripes (raid5 only allows for
a maximum of 1 device to fail) requests +4 delta disks, which causes 4 devices
to rebuild during reshaping and thus fails activation.

To fix this, move validate_raid_redundancy() to get access to the
current raid_set members.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 14:54:19 -04:00
Heinz Mauelshagen bbac1e06a4 dm raid: remove WARN_ON() in raid10_md_layout_to_format()
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 14:54:18 -04:00
Dan Carpenter edc11d49f8 dm bufio: fix error code in dm_bufio_write_dirty_buffers()
We should be returning normal negative error codes here.  The "a"
variable comes from &c->async_write_error which is a blk_status_t
converted to a regular error code.

In the current code, the blk_status_t gets propagated back to
pool_create() and eventually results in an Oops.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 10:11:15 -04:00
Mikulas Patocka bc86a41e96 dm integrity: test for corrupted disk format during table load
If the dm-integrity superblock was corrupted in such a way that the
journal_sections field was zero, the integrity target would deadlock
because it would wait forever for free space in the journal.

Detect this situation and refuse to activate the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 7eada909bf ("dm: add integrity target")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 10:11:14 -04:00
Mikulas Patocka aa03a91ffa dm integrity: WARN_ON if variables representing journal usage get out of sync
If this WARN_ON triggers it speaks to programmer error, and likely
implies corruption, but no released kernel should trigger it.  This
WARN_ON serves to assist DM integrity developers as changes are
made/tested in the future.

BUG_ON is excessive for catching programmer error. If a user or
developer would like warnings to trigger a panic, they can enable that
via /proc/sys/kernel/panic_on_warn

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 10:11:13 -04:00
Ofer Heifetz 7e96d55963 md/raid5: add thread_group worker async_tx_issue_pending_all
Since the thread_group worker and the raid5d kthread are not in sync,
if a worker writes a stripe before raid5d does, the requests will be
left waiting for issue_pending.

The issue was observed when building raid5 with ext4: in some build
runs jbd2 would hang while requests sat in the HW engine waiting to be
issued.

Fix this by adding a call to async_tx_issue_pending_all() in
raid5_do_work().

Signed-off-by: Ofer Heifetz <oferh@marvell.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-24 07:49:15 -07:00
Guoqing Jiang 6308d8e3d4 md: simplify code with bio_io_error
bio_io_error() sets bi_status to BLK_STS_IOERR and calls
bio_endio(), so we can use it directly.

As Shaohua mentioned, there are also two places in
raid5.c that can use bio_io_error().

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21 13:16:52 -07:00
Shaohua Li 16d56e2fcc md/raid1: fix writebehind bio clone
After a bio is submitted, we should not clone it, as its bi_iter might
be invalidated by the driver. This is the case for behind_master_bio.
In certain situations, we could dispatch behind_master_bio immediately
for the first disk and then clone it for other disks.

https://bugzilla.kernel.org/show_bug.cgi?id=196383

Reported-and-tested-by: Markus <m4rkusxxl@web.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fix: 841c1316c7da(md: raid1: improve write behind)
Cc: stable@vger.kernel.org (4.12+)
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21 12:47:20 -07:00
Ming Lei be453e7761 md: raid1-10: move raid1/raid10 common code into raid1-10.c
No functional change, just move 'struct resync_pages' and related
helpers into raid1-10.c

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21 12:47:20 -07:00
Ming Lei fb0eb5df09 md: raid1/raid10: initialize bvec table via bio_add_page()
We will support multipage bvec soon, so initialize the bvec
table in the standard way instead of writing the
table directly. Otherwise it won't work any more once
multipage bvec is enabled.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21 12:47:20 -07:00
Ming Lei 022e510fcb md: remove 'idx' from 'struct resync_pages'
bio_add_page() won't fail for a resync bio, and the page index for each
bio is the same, so remove it.

More importantly, the 'idx' of 'struct resync_pages' is initialized in
the mempool allocator function. That is wrong: mempool is only
responsible for allocation, so we can't use it for initialization.

Suggested-by: NeilBrown <neilb@suse.com>
Reported-by: NeilBrown <neilb@suse.com>
Reported-and-tested-by: Patrick <dto@gmx.net>
Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages)
Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages)
Cc: stable@vger.kernel.org (4.12+)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21 12:47:20 -07:00
Mikulas Patocka a7c3e62bdc dm integrity: use plugging when writing the journal
When copying data from the journal to the appropriate place, we submit
many IOs.  Some of these IOs could go to adjacent areas.  Use on-stack
plugging so that adjacent IOs get merged during submission.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-19 19:02:39 -04:00
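
The effect that the "use plugging when writing the journal" message above
relies on can be sketched with a toy model. This is Python for illustration
only, not kernel code: merge_adjacent is a hypothetical helper, and IOs are
modelled as (start_sector, n_sectors) pairs. The real block layer applies
further merge limits that this sketch ignores; it only shows why holding a
batch of IOs before submission lets adjacent ones be combined.

```python
def merge_adjacent(ios):
    """Merge IOs given as (start_sector, n_sectors) when they touch.

    Models what a plug makes possible: IOs are held in a batch, so an
    IO that starts exactly where the previous one ends can be folded
    into it before anything is sent to the device.
    """
    merged = []
    for start, length in sorted(ios):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + length)
        else:
            merged.append((start, length))
    return merged


# Four 8-sector IOs, three of which are adjacent on disk.
batch = [(0, 8), (8, 8), (64, 8), (16, 8)]
print(merge_adjacent(batch))  # [(0, 24), (64, 8)]
```

Without the batch, each of the four IOs would be submitted on its own; with
it, the device sees two.
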
Mikulas Patocka 9dd59727db dm integrity: fix inefficient allocation of journal space
When using a block size greater than 512 bytes, the dm-integrity target
allocates journal space inefficiently.  It allocates one journal entry
for each 512-byte chunk of data, fills an entry for each block of data
and leaves the remaining entries unused.

This issue doesn't cause data corruption, but all the unused journal
entries degrade performance severely.

For example, with 4k blocks and an 8k bio, it would allocate 16 journal
entries but only use 2 entries.  The remaining 14 entries were left
unused.

Fix this by adding the missing 'log2_sectors_per_block' shifts that are
required to have each journal entry map to a full block.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 7eada909bf ("dm: add integrity target")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-19 18:44:18 -04:00
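
The arithmetic in the "fix inefficient allocation of journal space" message
above can be checked with a small model. This is a Python sketch, not the
kernel code: both function names are hypothetical, and only the 512-byte
sector size and the log2_sectors_per_block shift come from the message. It
assumes the block size is a power-of-two multiple of 512 bytes.

```python
SECTOR_SIZE = 512  # bytes per sector, as in the commit message


def journal_entries_before_fix(bio_bytes):
    """One journal entry per 512-byte sector (the inefficient behaviour)."""
    return bio_bytes // SECTOR_SIZE


def journal_entries_after_fix(bio_bytes, block_bytes):
    """One journal entry per block, via a log2_sectors_per_block shift."""
    sectors_per_block = block_bytes // SECTOR_SIZE
    # Exact log2 because sectors_per_block is assumed to be a power of two.
    log2_sectors_per_block = sectors_per_block.bit_length() - 1
    return (bio_bytes // SECTOR_SIZE) >> log2_sectors_per_block


# The example from the message: 4k blocks and an 8k bio.
allocated = journal_entries_before_fix(8192)
used = journal_entries_after_fix(8192, 4096)
print(allocated, used, allocated - used)  # 16 2 14
```
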