Commit Graph

4183 Commits

Author SHA1 Message Date
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Guoqing Jiang f9a67b1182 md/bitmap: clear bitmap if bitmap_create failed
If bitmap_create returns an error, we need to call
either bitmap_destroy or bitmap_free to do clean up,
and the selection is based on mddev->bitmap is set
or not.

And the sysfs_put(bitmap->sysfs_can_clear) is moved
from bitmap_destroy to bitmap_free, and the comment
of bitmap_create is changed as well.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-01 13:05:50 -07:00
Shaohua Li ed3b98c71c MD: add rdev reference for super write
Xiao Ni reported below crash:
[26396.335146] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
[26396.342990] IP: [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod]
[26396.349449] PGD 0
[26396.351468] Oops: 0002 [#1] SMP
[26396.354898] Modules linked in: ext4 mbcache jbd2 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_td
[26396.408404] CPU: 5 PID: 3261 Comm: loop0 Not tainted 4.5.0 #1
[26396.414140] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.2.2 09/15/2014
[26396.421608] task: ffff8808339be680 ti: ffff8808365f4000 task.ti: ffff8808365f4000
[26396.429074] RIP: 0010:[<ffffffffa0425b00>]  [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod]
[26396.437952] RSP: 0018:ffff8808365f7c38  EFLAGS: 00010046
[26396.443252] RAX: ffffffffa0425ae0 RBX: ffff8804336a7900 RCX: ffffe8f9f7b41198
[26396.450371] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804336a7900
[26396.457489] RBP: ffff8808365f7c50 R08: 0000000000000005 R09: 00001801e02ce3d7
[26396.464608] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[26396.471728] R13: ffff8808338d9a00 R14: 0000000000000000 R15: ffff880833f9fe00
[26396.478849] FS:  00007f9e5066d740(0000) GS:ffff880237b40000(0000) knlGS:0000000000000000
[26396.486922] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[26396.492656] CR2: 00000000000002a8 CR3: 00000000019ea000 CR4: 00000000000006e0
[26396.499775] Stack:
[26396.501781]  ffff8804336a7900 0000000000000000 0000000000000000 ffff8808365f7c68
[26396.509199]  ffffffff81308cd0 ffff8804336a7900 ffff8808365f7ca8 ffffffff81310637
[26396.516618]  00000000a0233a00 ffff880833f9fe00 0000000000000000 ffff880833fb0000
[26396.524038] Call Trace:
[26396.526485]  [<ffffffff81308cd0>] bio_endio+0x40/0x60
[26396.531529]  [<ffffffff81310637>] blk_update_request+0x87/0x320
[26396.537439]  [<ffffffff8131a20a>] blk_mq_end_request+0x1a/0x70
[26396.543261]  [<ffffffff81313889>] blk_flush_complete_seq+0xd9/0x2a0
[26396.549517]  [<ffffffff81313ccf>] flush_end_io+0x15f/0x240
[26396.554993]  [<ffffffff8131a22a>] blk_mq_end_request+0x3a/0x70
[26396.560815]  [<ffffffff8131a314>] __blk_mq_complete_request+0xb4/0xe0
[26396.567246]  [<ffffffff8131a35c>] blk_mq_complete_request+0x1c/0x20
[26396.573506]  [<ffffffffa04182df>] loop_queue_work+0x6f/0x72c [loop]
[26396.579764]  [<ffffffff81697844>] ? __schedule+0x2b4/0x8f0
[26396.585242]  [<ffffffff810a7812>] kthread_worker_fn+0x52/0x170
[26396.591065]  [<ffffffff810a77c0>] ? kthread_create_on_node+0x1a0/0x1a0
[26396.597582]  [<ffffffff810a7238>] kthread+0xd8/0xf0
[26396.602453]  [<ffffffff810a7160>] ? kthread_park+0x60/0x60
[26396.607929]  [<ffffffff8169bdcf>] ret_from_fork+0x3f/0x70
[26396.613319]  [<ffffffff810a7160>] ? kthread_park+0x60/0x60

md_super_write() and corresponding md_super_wait() generally are called
with reconfig_mutex locked, which prevents disk disappears. There is one
case this rule is broken. write_sb_page of bitmap.c doesn't hold the
mutex. next_active_rdev does increase rdev reference, but it decreases
the reference too early (eg, before IO finish). disk can disappear at
the window. We unconditionally increase rdev reference in
md_super_write() to avoid the race.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31 10:04:18 -07:00
Wei Fang 466ad29223 md: fix a trivial typo in comments
Fix a trivial typo in md_ioctl().

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31 10:04:18 -07:00
Wei Fang 816b0acf3d md:raid1: fix a dead loop when read from a WriteMostly disk
If first_bad == this_sector when we get the WriteMostly disk
in read_balance(), valid disk will be returned with zero
max_sectors. It'll lead to a dead loop in make_request(), and
OOM will happen because of endless allocation of struct bio.

Since we can't get data from this disk in this case, so
continue for another disk.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-31 10:04:17 -07:00
Linus Torvalds 4526b710c1 Merge tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "This update mainly fixes bugs.

   - a raid5 discard related fix from Jes
   - a MD multipath bio clone fix from Ming
   - raid1 error handling deadlock fix from Nate and corresponding
     raid10 fix from myself
   - a raid5 stripe batch fix from Neil
   - a patch from Sebastian to avoid unnecessary uevent
   - several cleanup/debug patches"

* tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/raid5: Cleanup cpu hotplug notifier
  raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang
  raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang
  md: fix typos for stipe
  md/bitmap: remove redundant return in bitmap_checkpage
  md/raid1: remove unnecessary BUG_ON
  md: multipath: don't hardcopy bio in .make_request path
  md/raid5: output stripe state for debug
  md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list
  Update MD git tree URL
  md/bitmap: remove redundant check
  MD: warn for potential deadlock
  md: Drop sending a change uevent when stopping
  RAID5: revert e9e4c377e2 to fix a livelock
  RAID5: check_reshape() shouldn't call mddev_suspend
  md/raid5: Compare apples to apples (or sectors to sectors)
2016-03-21 14:18:10 -07:00
Linus Torvalds 237045fc3c Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for this merge window.  It sits
  on top of for-4.6/core, that was just sent out.

  This contains:

   - A set of fixes for lightnvm.  One from Alan, fixing an overflow,
     and the rest from the usual suspects, Javier and Matias.

   - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd
     for correct usage of the signed 64-bit divider.

   - A set of bug fixes for the Micron mtip32xx, from Asai.

   - A fix for the brd discard handling from Bart.

   - Update the maintainers entry for cciss, since that hardware has
     transferred ownership.

   - Three bug fixes for bcache from Eric Wheeler.

   - Set of fixes for xen-blk{back,front} from Jan and Konrad.

   - Removal of the cpqarray driver.  It has been disabled in Kconfig
     since 2013, and we were initially scheduled to remove it in 3.15.

   - Various updates and fixes for NVMe, with the most important being:

        - Removal of the per-device NVMe thread, replacing that with a
          watchdog timer instead. From Christoph.

        - Exposing the namespace WWID through sysfs, from Keith.

        - Set of cleanups from Ming Lin.

        - Logging the controller device name instead of the underlying
          PCI device name, from Sagi.

        - And a bunch of fixes and optimizations from the usual suspects
          in this area"

* 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits)
  NVMe: Expose ns wwid through single sysfs entry
  drivers:block: cpqarray clean up
  brd: Fix discard request processing
  cpqarray: remove it from the kernel
  cciss: update MAINTAINERS
  NVMe: Remove unused sq_head read in completion path
  bcache: fix cache_set_flush() NULL pointer dereference on OOM
  bcache: cleaned up error handling around register_cache()
  bcache: fix race of writeback thread starting before complete initialization
  NVMe: Create discard zero quirk white list
  nbd: use correct div_s64 helper
  mtip32xx: remove unneeded variable in mtip_cmd_timeout()
  lightnvm: generalize rrpc ppa calculations
  lightnvm: remove struct nvm_dev->total_blocks
  lightnvm: rename ->nr_pages to ->nr_sects
  lightnvm: update closed list outside of intr context
  xen/blback: Fit the important information of the thread in 17 characters
  lightnvm: fold get bb tbl when using dual/quad plane mode
  lightnvm: fix up nonsensical configure overrun checking
  xen-blkback: advertise indirect segment support earlier
  ...
2016-03-18 17:13:31 -07:00
Anna-Maria Gleixner 1d034e68e2 md/raid5: Cleanup cpu hotplug notifier
The raid456_cpu_notify() hotplug callback lacks handling of the
CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
buffer is leaked.

Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
to free the scratch buffer.

CC: Shaohua Li <shli@kernel.org>
CC: linux-raid@vger.kernel.org
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17 14:30:15 -07:00
Shaohua Li 23ddba80eb raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang
This is the raid10 counterpart of the bug fixed by Nate
(raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang)

Fixes: 95af587e95(md/raid10: ensure device failure recorded before write request returns)
Cc: stable@vger.kernel.org (V4.3+)
Cc: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17 14:27:01 -07:00
Nate Dailey ccfc7bf1f0 raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang
If raid1d is handling a mix of read and write errors, handle_read_error's
call to freeze_array can get stuck.

This can happen because, though the bio_end_io_list is initially drained,
writes can be added to it via handle_write_finished as the retry_list
is processed. These writes contribute to nr_pending but are not included
in nr_queued.

If a later entry on the retry_list triggers a call to handle_read_error,
freeze array hangs waiting for nr_pending == nr_queued+extra. The writes
on the bio_end_io_list aren't included in nr_queued so the condition will
never be satisfied.

To prevent the hang, include bio_end_io_list writes in nr_queued.

There's probably a better way to handle decrementing nr_queued, but this
seemed like the safest way to avoid breaking surrounding code.

I'm happy to supply the script I used to repro this hang.

Fixes: 55ce74d4bfe1b(md/raid1: ensure device failure recorded before write request returns.)
Cc: stable@vger.kernel.org (v4.3+)
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17 14:24:51 -07:00
Linus Torvalds 70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
Bryn M. Reeves 98dbc9c6c6 dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
An "old" (.request_fn) DM 'struct request' stores a pointer to the
associated 'struct dm_rq_target_io' in rq->special.

dm_requeue_original_request(), previously named
dm_requeue_unmapped_original_request(), called dm_unprep_request() to
reset rq->special to NULL.  But rq_end_stats() would go on to hit a NULL
pointer deference because its call to tio_from_request() returned NULL.

Fix this by calling rq_end_stats() _before_ dm_unprep_request()

Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: e262f34741 ("dm stats: add support for request-based DM devices")
Cc: stable@vger.kernel.org # 4.2+
2016-03-14 17:04:34 -04:00
Guoqing Jiang d85326cf86 md: fix typos for stipe
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:36:10 -07:00
Guoqing Jiang c6f0b9f195 md/bitmap: remove redundant return in bitmap_checkpage
The "return 0" is not needed since bitmap_checkpage
will finally return 0 for the case.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:36:07 -07:00
Guoqing Jiang b3c95b425e md/raid1: remove unnecessary BUG_ON
Since bitmap_start_sync will not return until
sync_blocks is not less than PAGE_SIZE>>9, so
the BUG_ON is not needed anymore.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:35:58 -07:00
Ming Lei fafcde3ac1 md: multipath: don't hardcopy bio in .make_request path
Inside multipath_make_request(), multipath maps the incoming
bio into low level device's bio, but it is totally wrong to
copy the bio into mapped bio via '*mapped_bio = *bio'. For
example, .__bi_remaining is kept in the copy, especially if
the incoming bio is chained to via bio splitting, so .bi_end_io
can't be called for the mapped bio at all in the completing path
in this kind of situation.

This patch fixes the issue by using clone style.

Cc: stable@vger.kernel.org (v3.14+)
Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:32:26 -07:00
Mike Snitzer c3667cc619 dm thin: consistently return -ENOSPC if pool has run out of data space
Commit 0a927c2f02 ("dm thin: return -ENOSPC when erroring retry list due
to out of data space") was a step in the right direction but didn't go
far enough.

Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when
the pool runs of of data space.  This fixes cell_error() and
error_retry_list() to not blindly return -EIO.

We cannot rely on the 'error_if_no_space' feature flag since it is
transient (in that it can be reset once space is added, plus it only
controls whether errors are issued, it doesn't reflect whether the
pool is actually out of space).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11 16:15:22 -05:00
Mike Snitzer 843f0f2e8f dm cache: bump the target version
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:12 -05:00
Joe Thornber d14fcf3dd7 dm cache: make sure every metadata function checks fail_io
Otherwise operations may be attempted that will only ever go on to crash
(since the metadata device is either missing or unreliable if 'fail_io'
is set).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-03-10 17:12:12 -05:00
Mike Snitzer 3f0680402c dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:11 -05:00
Mike Snitzer 7dd85bb0e9 dm cache policy smq: clarify that mq registration failure was for 'mq'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:11 -05:00
Mike Snitzer c80914e81e dm: return error if bio_integrity_clone() fails in clone_bio()
clone_bio() now checks if bio_integrity_clone() returned an error rather
than just drop it on the floor.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:10 -05:00
Joe Thornber 2eae9e4489 dm thin metadata: don't issue prefetches if a transaction abort has failed
If a transaction abort has failed then we can no longer use the metadata
device.  Typically this happens if the superblock is unreadable.

This fix addresses a crash seen during metadata device failure testing.

Fixes: 8a01a6af75 ("dm thin: prefetch missing metadata pages")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:09 -05:00
DingXiang 4df2bf466a dm snapshot: disallow the COW and origin devices from being identical
Otherwise loading a "snapshot" table using the same device for the
origin and COW devices, e.g.:

echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap

will trigger:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1958.989655] PGD 0
[ 1958.991903] Oops: 0000 [#1] SMP
...
[ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G          IO    4.5.0-rc5.snitm+ #150
...
[ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
[ 1959.091865] RIP: 0010:[<ffffffffa040efba>]  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.104295] RSP: 0018:ffff88032a957b30  EFLAGS: 00010246
[ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
[ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
[ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
[ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
[ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
[ 1959.150021] FS:  00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
[ 1959.159047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
[ 1959.173415] Stack:
[ 1959.175656]  ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
[ 1959.183946]  ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
[ 1959.192233]  ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
[ 1959.200521] Call Trace:
[ 1959.203248]  [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
[ 1959.211986]  [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot]
[ 1959.219469]  [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod]
[ 1959.226656]  [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod]
[ 1959.234328]  [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod]
[ 1959.241129]  [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
[ 1959.248607]  [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod]
[ 1959.255307]  [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20
[ 1959.261816]  [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1959.268615]  [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0
[ 1959.274637]  [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100
[ 1959.281726]  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[ 1959.288814]  [<ffffffff81216449>] SyS_ioctl+0x79/0x90
[ 1959.294450]  [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71
...
[ 1959.323277] RIP  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.333090]  RSP <ffff88032a957b30>
[ 1959.336978] CR2: 0000000000000098
[ 1959.344121] ---[ end trace b049991ccad1169e ]---

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
Cc: stable@vger.kernel.org
Signed-off-by: Ding Xiang <dingxiang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:09 -05:00
Joe Thornber 9ed84698fd dm cache: make the 'mq' policy an alias for 'smq'
smq seems to be performing better than the old mq policy in all
situations, as well as using a quarter of the memory.

Make 'mq' an alias for 'smq' when choosing a cache policy.  The tunables
that were present for the old mq are faked, and have no effect.  mq
should be considered deprecated now.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:08 -05:00
Bob Liu e233d800a9 dm: drop unnecessary assignment of md->queue
md->queue and q are the same thing in dm_old_init_request_queue() and
dm_mq_init_request_queue().

Also drop the temporary 'struct request_queue *q' in
dm_old_init_request_queue().

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:07 -05:00
Mike Snitzer 032482fda4 dm: reorder 'struct mapped_device' members to fix alignment and holes
Saves 16 bytes by eliminating 4 4byte holes but more importantly:
numerous members that crossed cachelines were fixed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:07 -05:00
Mike Snitzer 1d3aa6f683 dm: remove dummy definition of 'struct dm_table'
Change the map pointer in 'struct mapped_device' from 'struct dm_table
__rcu *' to 'void __rcu *' to avoid the need for the dummy definition.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:06 -05:00
Mike Snitzer 115485e83f dm: add 'dm_numa_node' module parameter
Allows user to control which NUMA node the memory for DM device
structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set)
is allocated from.

Defaults to NUMA_NO_NODE (-1).  Allowable range is from -1 until the
last online NUMA node id.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:06 -05:00
Mike Snitzer 29f929b52d dm thin metadata: remove needless newline from subtree_dec() DMERR message
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:05 -05:00
Mike Snitzer ec31f3f78a dm mpath: cleanup reinstate_path() et al based on code review
fail_path() will print a "Failing path ..." message but reinstate_path()
doesn't print a "Reinstating path ...".  Add that message to
reinstate_path() to add symmetry and aid system debugging.

Remove reinstate_path()'s check for the path_selector providing
.reinstate_path hook.  All path selectors provide this and any future
ones must too.

activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but
pg_init_done() doesn't expicitly handle it in its swicth statement.  Add
SCSI_DH_DEV_OFFLINED to the default case.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:04 -05:00
Shaohua Li fb3229d5cd md/raid5: output stripe state for debug
Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be
quite convenient if we know the stripe state. This is what this patch does.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 10:08:38 -08:00
NeilBrown 550da24f8d md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list
break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.

It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.

However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another.  This does occasionally happen.

md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this could becomes incorrect and will never again return to zero.

md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero.  So when the above mention race
happens, those stripe_heads become blocked and never progress,
resulting is write to the array handing.

So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
Tested-by: Tom Weber <linux@junkyard.4t2.com>
Fixes: 1b956f7a8f ("md/raid5: be more selective about distributing flags across batch.")
Cc: stable@vger.kernel.org (v4.1 and later)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 09:31:41 -08:00
Eric Wheeler f8b11260a4 bcache: fix cache_set_flush() NULL pointer dereference on OOM
When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asyncronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
2016-03-08 09:19:10 -07:00
Eric Wheeler 9b299728ed bcache: cleaned up error handling around register_cache()
Fix null pointer dereference by changing register_cache() to return an int
instead of being void.  This allows it to return -ENOMEM or -ENODEV and
enables upper layers to handle the OOM case without NULL pointer issues.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521

Fixes this error:
  gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register

  bcache: register_cache() error opening sdh2: cannot allocate memory
  BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
  IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
  PGD 120dff067 PUD 1119a3067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in: veth ip6table_filter ip6_tables
  (...)
  CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  Workqueue: events cache_set_flush [bcache]
  task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
  RIP: 0010:[<ffffffffc05a7e8d>]  [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
2016-03-08 09:19:08 -07:00
Eric Wheeler 07cc6ef8ed bcache: fix race of writeback thread starting before complete initialization
The bch_writeback_thread might BUG_ON in read_dirty() if
dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed
its related initialization.  This patch downs the dc->writeback_lock until
after initialization is complete, thus preventing bch_writeback_thread
from proceeding prematurely.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-08 09:17:30 -07:00
Eric Engestrom c97e0602bc md/bitmap: remove redundant check
daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does
the same thing.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-07 09:30:16 -08:00
Shaohua Li 70d9798b95 MD: warn for potential deadlock
The personality thread shouldn't call mddev_suspend(). Because
mddev_suspend() will for all IO finish, but IO is handled in personality
thread, so this could cause deadlock. To trigger this early, add a
warning if mddev_suspend() is called from personality thread.

Suggested-by: NeilBrown <neilb@suse.com>
Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:57 -08:00
Sebastian Parschauer 399146b80e md: Drop sending a change uevent when stopping
When stopping an MD device, then its device node /dev/mdX may still
exist afterwards or it is recreated by udev. The next open() call
can lead to creation of an inoperable MD device. The reason for
this is that a change event (KOBJ_CHANGE) is sent to udev which
races against the remove event (KOBJ_REMOVE) from md_free().
So drop sending the change event.

A change is likely also required in mdadm as many versions send the
change event to udev as well.

Neil mentioned the change event is a workaround for old kernel
Commit: 934d9c23b4 ("md: destroy partitions and notify udev when md array is stopped.")
new mdadm can handle device remove now, so this isn't required any more.

Cc: NeilBrown <neilb@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:56 -08:00
Shaohua Li 6ab2a4b806 RAID5: revert e9e4c377e2 to fix a livelock
Revert commit
e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe)

The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
  max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
  wait_for_stripe[0]
The system live locks. The problem is active_stripes isn't a per-hash
count. Revert the patch makes the live lock go away.

Cc: stable@vger.kernel.org (v4.2+)
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:56 -08:00
Shaohua Li 27a353c026 RAID5: check_reshape() shouldn't call mddev_suspend
check_reshape() is called from raid5d thread. raid5d thread shouldn't
call mddev_suspend(), because mddev_suspend() waits for all IO finish
but IO is handled in raid5d thread, we could easily deadlock here.

This issue is introduced by
738a273 ("md/raid5: fix allocation of 'scribble' array.")

Cc: stable@vger.kernel.org (v4.1+)
Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:11 -08:00
Jes Sorensen e7597e69de md/raid5: Compare apples to apples (or sectors to sectors)
'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.

Fixes: 620125f2bf ("MD: raid5 trim support")
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25 16:38:53 -08:00
Mike Snitzer 9f54cec553 dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:44 -05:00
Mike Snitzer be7d31cca8 dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:43 -05:00
Mike Snitzer b0b477c7e0 dm round robin: use percpu 'repeat_count' and 'current_path'
Now that dm-mpath core is lockless in the per-IO fast path it is
critical, for performance, to have the .select_path hook
(rr_select_path) also be as lockless as possible.

The new percpu members of 'struct selector' allow for lockless support
of 'repeat_count' governed repeat use of a previously selected path.  If
a path fails while it is 'current_path' the worst case is concurrent IO
might be mapped to the failed path until the .fail_path hook
(rr_fail_path) is called.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:42 -05:00
Mike Snitzer 90a4323ccf dm path selector: remove 'repeat_count' return from .select_path hook
If a path selector has any use for a repeat_count it should be handled
locally and not depend on the dm-mpath core to be concerned with it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:42 -05:00
Mike Snitzer 9659f81144 dm mpath: push path selector locking down to path selectors
Proper locking of the lists used by the path selectors should be handled
within the selectors (relying on dm-mpath.c code's use of the m->lock
spinlock was reckless).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:41 -05:00
Mike Snitzer 21136f89d7 dm mpath: remove repeat_count support from multipath core
Preparation for making __multipath_map() avoid taking the m->lock
spinlock -- in favor of using RCU locking.

repeat_count was primarily for bio-based DM multipath's benefit.  There
is really no need for it anymore now that DM multipath is request-based.
As such, repeat_count > 1 is no longer honored and a warning is
displayed if the user attempts to use a value > 1.  This is a temporary
change for the round-robin path-selector (as a later commit will restore
its support for repeat_count > 1).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer 7943bd6dd3 dm mpath: remove unnecessary casts in front of ti->private
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer 78ce23b518 dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directly
There isn't any need to support both old .request_fn and blk-mq paths
in the blk-mq specific portion of __multipath_map().  Call
blk_mq_alloc_request() directly rather than use blk_get_request().

Similarly, call blk_mq_free_request(), rather than blk_put_request(), in
multipath_release_clone().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer 2eff1924e1 dm mpath: cleanup 'struct dm_mpath_io' management code
Refactor and rename existing interfaces to be more specific and
self-documenting.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer 8637a6bf14 dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io'
Allow the multipath target to avoid making small allocations for each
'struct dm_mpath_io' that is needed for each request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:38 -05:00
Mike Snitzer 591ddcfc4b dm: allow immutable request-based targets to use blk-mq pdu
This will allow DM multipath to use a portion of the blk-mq pdu space
for target data (e.g. struct dm_mpath_io).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer 30187e1d48 dm: rename target's per_bio_data_size to per_io_data_size
Request-based DM will also make use of per_bio_data_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer eca7ee6dc0 dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DM
Rename various methods to have either a "dm_old" or "dm_mq" prefix.
Improve code comments to assist with understanding the duality of code
that handles both "dm_old" and "dm_mq" cases.

It is no much easier to quickly look at the code and _know_ that a given
method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:33 -05:00
Mike Snitzer c5248f79f3 dm: remove support for stacking dm-mq on .request_fn device(s)
Remove all fiddley code that propped up this support for a blk-mq
request-queue ontop of all .request_fn devices.

Testing has proven this niche request-based dm-mq mode to be buggy, when
testing fault tolerance with DM multipath, and there is no point trying
to preserve it.

Should help improve efficiency of pure dm-mq code and make code
maintenance less delicate.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:46 -05:00
Mike Snitzer 818c5f3bef dm: fix a couple locking issues with use of block interfaces
old_stop_queue() was checking blk_queue_stopped() without holding the
q->queue_lock.

dm_requeue_original_request() needed to check blk_queue_stopped(), with
q->queue_lock held, before calling blk_mq_kick_requeue_list().  And a
side-effect of that change is start_queue() must also call
blk_mq_kick_requeue_list().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:09 -05:00
Mike Snitzer 1c357a1e86 dm: allocate blk_mq_tag_set rather than embed in mapped_device
The blk_mq_tag_set is only needed for dm-mq support.  There is point
wasting space in 'struct mapped_device' for non-dm-mq devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
2016-02-22 12:07:14 -05:00
Mike Snitzer faad87df4b dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params
Allow user to change these values via module params or sysfs.

'dm_mq_nr_hw_queues' defaults to 1 (max 32).

'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too
small under moderate sized workloads -- the dm-multipath device would
continuously block waiting for tags (requests) to become available).
The maximum is BLK_MQ_MAX_DEPTH (currently 10240).

Keep in mind the total number of pre-allocated requests per
request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth'
(currently 2048).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 12:07:10 -05:00
Mike Snitzer c91852ff08 dm: optimize dm_request_fn()
DM multipath is the only request-based DM target -- which only supports
tables with a single target that is immutable.  Leverage this fact in
dm_request_fn().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Mike Snitzer 16f122661d dm: optimize dm_mq_queue_rq()
DM multipath is the only dm-mq target.  But that aside, request-based DM
only supports tables with a single target that is immutable.  Leverage
this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
the mapped_device when the table was made active.  This saves the need
to even take the read-side of the SRCU via dm_{get,put}_live_table.

If the active DM table does not have an immutable target (e.g. "error"
target was swapped in) then fallback to the slow-path where the target
is looked up from the live table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Mike Snitzer f083b09b78 dm: set DM_TARGET_WILDCARD feature on "error" target
The DM_TARGET_WILDCARD feature indicates that the "error" target may
replace any target; even immutable targets.  This feature will be useful
to preserve the ability to replace the "multipath" target even once it
is formally converted over to having the DM_TARGET_IMMUTABLE feature.

Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that
.map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined
in the target_type.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:21 -05:00
Mike Snitzer e522c03905 dm: cleanup dm_any_congested()
The request-based DM support for checking queue congestion doesn't
require access to the live DM table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00
Mike Snitzer ae6ad75e5c dm: remove unused dm_get_rq_mapinfo()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00
Mike Snitzer 6acfe68bac dm: fix excessive dm-mq context switching
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-22 11:04:40 -05:00
Mike Snitzer 956a402580 dm: fix sparse "unexpected unlock" warnings in ioctl code
Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it
do the dm_{get,put}_live_table() rather than split those operations.

The dm_grab_bdev_for_ioctl() callers only care about the block_device
associated with a singleton DM device so there isn't any need to retain
a reference to the live DM table.  It is sufficient to:
1) dm_get_live_table()
2) bdgrab() the bdev associated with the singleton table's target
3) dm_put_live_table()
4) bdput() the bdev

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:51 -05:00
Mike Snitzer 664820265d dm: do not return target from dm_get_live_table_for_ioctl()
None of the callers actually used the returned target.
Also, just reuse bdev pointer passed to dm_blk_ioctl().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:51 -05:00
Mike Snitzer 4328daa2e7 dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths
Using request-based DM mpath configured with the following stacking
(.request_fn DM mpath ontop of scsi-mq paths):

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

'struct dm_rq_target_io' would leak if a request is requeued before a
blk-mq clone is allocated (or fails to allocate).  free_rq_tio()
wasn't being called.

kmemleak reported:

unreferenced object 0xffff8800b90b98c0 (size 112):
  comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
  hex dump (first 32 bytes):
    00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff  ..\,....@.......
    e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00  ...4............
  backtrace:
    [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0
    [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20
    [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170
    [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod]
    [<ffffffff812fbd78>] blk_peek_request+0x168/0x290
    [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod]
    [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40
    [<ffffffff812f9585>] blk_delay_work+0x25/0x40
    [<ffffffff81096fff>] process_one_work+0x14f/0x3d0
    [<ffffffff81097715>] worker_thread+0x125/0x4b0
    [<ffffffff8109ce88>] kthread+0xd8/0xf0
    [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff

crash> struct -o dm_rq_target_io
struct dm_rq_target_io {
    ...
}
SIZE: 112

Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices")
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:50 -05:00
Shaohua Li 9ea064158f Merge branch 'mymd/for-next' into mymd/for-linus 2016-02-03 15:43:59 -08:00
Herbert Xu bbdb23b5d6 dm crypt: Use skcipher and ahash
This patch replaces uses of ablkcipher with skcipher, and the long
obsolete hash interface with ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:35:48 +08:00
Shaohua Li fc2561ec0a md-cluster: delete useless code
page->index already considers node offset. The node_offset calculation
in write_sb_page is useless and confusion.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-24 18:13:37 -08:00
Shaohua Li 4ac7a65f80 md-cluster: fix missing memory free
There are several places we allocate dlm_lock_resource, but not free it.

leave() need free a lock resource too (from Guoqing)
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-24 18:13:18 -08:00
Linus Torvalds 641203549a Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for 4.5, with the exception of
  NVMe, which is in a separate branch and will be posted after this one.

  This pull request contains:

   - A set of bcache stability fixes, which have been acked by Kent.
     These have been used and tested for more than a year by the
     community, so it's about time that they got in.

   - A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
     and Markus Elfring, Oleg Drokin.

   - A set of fixes for xen blkback/front from the usual suspects, (Bob,
     Konrad) as well as community based fixes from Kiri, Julien, and
     Peng.

   - A 2038 time fix for sx8 from Shraddha, with a fix from me.

   - A small mtip32xx cleanup from Zhu Yanjun.

   - A null_blk division fix from Arnd"

* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
  null_blk: use sector_div instead of do_div
  mtip32xx: restrict variables visible in current code module
  xen/blkfront: Fix crash if backend doesn't follow the right states.
  xen/blkback: Fix two memory leaks.
  xen/blkback: make st_ statistics per ring
  xen/blkfront: Handle non-indirect grant with 64KB pages
  xen-blkfront: Introduce blkif_ring_get_request
  xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
  xen/blkback: Free resources if connect_ring failed.
  xen/blocks: Return -EXX instead of -1
  xen/blkback: make pool of persistent grants and free pages per-queue
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkfront: correct setting for xen_blkif_max_ring_order
  xen/blkfront: make persistent grants pool per-queue
  xen/blkfront: Remove duplicate setting of ->xbdev.
  xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkfront: split per device io_lock
  ...
2016-01-21 18:19:38 -08:00
Shaohua Li 849674e4fb MD: rename some functions
These short function names are hard to search. Rename them to make vim happy.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-20 13:52:20 -08:00
Linus Torvalds 3c28c9ccaf md updates for 4.5
Mostly clustered-raid1 and raid5 journal updates.
 one Y2038 fix and other minor stuff.
 
 One patch removes me from the MAINTAINERS file and adds a record of
 my md maintainership to Credits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWmJEhAAoJEDnsnt1WYoG5raQQAI9lBrHO+Q8C8RImPsemLX0X
 ypjH38XUwEwKNYYfsCVI7PKAqCl7r8ITzY054gKsU0iHAfqLQlEN8aMz0v0fJQhg
 Msb7utrEMQ0UERwNcc+3J78ffFAdWrkHVd64Ley0h/pizFPSlL0K2RuIGTBc9sGX
 Hz2Ci11Ch7FdK7C/Zl7I6tK1pkthu3hBXYEZyg1GngRRhZEJj2U7mBmy1E37NA72
 o66B5r5FSlnIA8MAo/EAViCxtMJKBPRWU/WnkMhOJ1Yyw/FwMpbM2prLBLtYFqwF
 OLZOLuDUHY5HxdX2U+3R0hBzF78aozcH6od60SWg7wOmI/IkXYiYFujlxMd132FE
 OT+aa+UHHDEkATTSyt98OmxIkQ8uqKiNsSYqBk9lpNAPtmEbhqRX4RAOdrqP0G83
 DX7iyZpAK4YhB4BkJxMtNdSIOnss1TwfOdKyvoBZYmY6bTKh7p+dpw4cvIjV4VDi
 p6+BUQdJQ7mHRLV9QI4IuG52AJO8cRGc1OVvqLEMzO8uZlpyxX9nJrSqeP/dKKfa
 pJ5pYssilXEeKCDnODGqSRdt9aU4ENDW/oIkAW2U3cnSHUwBoMLF1WJ+M3Atbm+s
 i3/iDp26SnSiHM+DVHije5v0OGOroYdJwKDIFWToElcfc9Q5IDHU+KP8oeuPqqOS
 WA08l+zj+ahfP7Yu1DUC
 =gl+r
 -----END PGP SIGNATURE-----

Merge tag 'md/4.5' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Mostly clustered-raid1 and raid5 journal updates.  one Y2038 fix and
  other minor stuff.

  One patch removes me from the MAINTAINERS file and adds a record of my
  md maintainership to Credits"

Many thanks to Neil, who has been around for a _looong_ time.

* tag 'md/4.5' of git://neil.brown.name/md: (26 commits)
  md/raid: only permit hot-add of compatible integrity profiles
  Remove myself as MD Maintainer, and add to Credits.
  raid5-cache: handle journal hotadd in quiesce
  MD: add journal with array suspended
  md: set MD_HAS_JOURNAL in correct places
  md: Remove 'ready' field from mddev.
  md: remove unnecesary md_new_event_inintr
  raid5: allow r5l_io_unit allocations to fail
  raid5-cache: use a mempool for the metadata block
  raid5-cache: use a bio_set
  raid5-cache: add journal hot add/remove support
  drivers: md: use ktime_get_real_seconds()
  md: avoid warning for 32-bit sector_t
  raid5-cache: free meta_page earlier
  raid5-cache: simplify r5l_move_io_unit_list
  md: update comment for md_allow_write
  md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
  md-cluster: Protect communication with mutexes
  md-cluster: Defer MD reloading to mddev->thread
  md-cluster: update the documentation
  ...
2016-01-15 12:28:00 -08:00
Linus Torvalds 7d1fc01afc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  floppy: make local variable non-static
  exynos: fixes an incorrect header guard
  dt-bindings: fixes some incorrect header guards
  cpufreq-dt: correct dead link in documentation
  cpufreq: ARM big LITTLE: correct dead link in documentation
  treewide: Fix typos in printk
  Documentation: filesystem: Fix typo in fs/eventfd.c
  fs/super.c: use && instead of & for warn_on condition
  Documentation: fix sysfs-ptp
  lib: scatterlist: fix Kconfig description
2016-01-14 17:04:19 -08:00
Linus Torvalds d080827f85 libnvdimm for 4.5
1/ Media error handling: The 'badblocks' implementation that originated
    in md-raid is up-levelled to a generic capability of a block device.
    This initial implementation is limited to being consulted in the pmem
    block-i/o path.  Later, 'badblocks' will be consulted when creating
    dax mappings.
 
 2/ Raw block device dax: For virtualization and other cases that want
    large contiguous mappings of persistent memory, add the capability to
    dax-mmap a block device directly.
 
 3/ Increased /dev/mem restrictions: Add an option to treat all io-memory
    as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is
    actively using an address range.  This behavior is controlled via the
    new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the
    existing "iomem=relaxed" kernel command line option.
 
 4/ Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
    block device shutdown crash fix, and other small libnvdimm fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWlrhjAAoJEB7SkWpmfYgCFbAQALKsQfFwT6JFS+zlPgiNpbqw
 2VMNKEH0AfGYGj96mT02j2q+vSUmXLMIDMTsbe0sDdtwFZtQbFmhmryzPWUVppSu
 KGTlLPW8vuEhQVs91+UI3BQKkvpi0+tbR8hPOh9W6QhjpRT+lyHFKnsNR5HZy5wB
 K4/VMaT5ffd5/pXRTjkYiPQYTwWyfcvNjICj0YtqhPvOwS031m77JpFsWJ8HSpEX
 K99VlzNUPMXd1pYkHmFNXWw52fhRGNhwAEomLeKMdQfKms+KnbKp8BOSA0aCqU8E
 kpujQcilDXJwykFQZOFI3Z5Dxvrv8lxFTU8HRMBvo3ESzfTWjfqcvyjGOjDUcruw
 ihESFSJtdZzhrBiMnf9RRqSpMFJvAT8MVT6Q4D3mZUHCMPbUqFJsQjMPt9hEH3ho
 4F0D2lesOCkubUKFTZmjMoDb+szuKbVhYK8TeFVVEhizinc/Aj0NKuazJqi+CXB/
 xh0ER4ZxD8wvzqFFWvS5UvR1G9I5fr7+3jGRUrqGLHlSdeXP9dkEg28ao3QbWk3x
 1dPOen6ZqQ9WJ/E7eGmXbVEz2R4Xd79hMXQzdQwmKDk/KbxRoAp7hyU8BslAyrBf
 HCdmVt+RAgrxZYfFRXuLhqwEBThJnNrgZA3qu74FUpkpFg6xRUu1bAYBiF7N+bFi
 82b5UbMkveBTtkXjJoiR
 =7V5r
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has appeared in -next and independently received a
  build success notification from the kbuild robot.  The 'for-4.5/block-
  dax' topic branch was rebased over the weekend to drop the "block
  device end-of-life" rework that Al would like to see re-implemented
  with a notifier, and to address bug reports against the badblocks
  integration.

  There is pending feedback against "libnvdimm: Add a poison list and
  export badblocks" received last week.  Linda identified some localized
  fixups that we will handle incrementally.

  Summary:

   - Media error handling: The 'badblocks' implementation that
     originated in md-raid is up-levelled to a generic capability of a
     block device.  This initial implementation is limited to being
     consulted in the pmem block-i/o path.  Later, 'badblocks' will be
     consulted when creating dax mappings.

   - Raw block device dax: For virtualization and other cases that want
     large contiguous mappings of persistent memory, add the capability
     to dax-mmap a block device directly.

   - Increased /dev/mem restrictions: Add an option to treat all
     io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
     while a driver is actively using an address range.  This behavior
     is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
     overridden by the existing "iomem=relaxed" kernel command line
     option.

   - Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
     block device shutdown crash fix, and other small libnvdimm fixes"

* tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
  block: kill disk_{check|set|clear|alloc}_badblocks
  libnvdimm, pmem: nvdimm_read_bytes() badblocks support
  pmem, dax: disable dax in the presence of bad blocks
  pmem: fail io-requests to known bad blocks
  libnvdimm: convert to statically allocated badblocks
  libnvdimm: don't fail init for full badblocks list
  block, badblocks: introduce devm_init_badblocks
  block: clarify badblocks lifetime
  badblocks: rename badblocks_free to badblocks_exit
  libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
  libnvdimm: Add a poison list and export badblocks
  nfit_test: Enable DSMs for all test NFITs
  md: convert to use the generic badblocks code
  block: Add badblock management for gendisks
  badblocks: Add core badblock management code
  block: fix del_gendisk() vs blkdev_ioctl crash
  block: enable dax for raw block devices
  block: introduce bdev_file_inode()
  restrict /dev/mem to idle io memory ranges
  arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
  ...
2016-01-13 19:15:14 -08:00
Dan Williams 1501efadc5 md/raid: only permit hot-add of compatible integrity profiles
It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue.  Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.

The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a6 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.

This policy of disallowing changes the integrity profile once one has
been established is shared with DM.

Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]

[   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[   59.078302] md: data integrity enabled on md0
[..]
[   90.489209] md0: incompatible integrity profile for pmem1m
[..]
[  205.671277] md: super_written gets error=-5
[  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[  205.677386] md/raid1:md0: Operation continuing on 1 devices.
[  205.683037] RAID1 conf printout:
[  205.684699]  --- wd:1 rd:2
[  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
[  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
[  205.691717] md: recovery of RAID array md0

Fixes: c7bfced9a6 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: <stable@vger.kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:57 +11:00
Shaohua Li 16a43f6a65 raid5-cache: handle journal hotadd in quiesce
Handle journal hotadd in quiesce to avoid creating duplicated threads.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Shaohua Li 87d4d91616 MD: add journal with array suspended
Hot add journal disk in recovery thread context brings a lot of trouble
as IO could be running. Unlike spare disk hot add, adding journal disk
with array suspended makes more sense and implmentation is much easier.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Shaohua Li a62ab49eb5 md: set MD_HAS_JOURNAL in correct places
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized.
This is to avoid the flags set too early in journal disk hotadd.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Linus Torvalds 33caf82acf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of stuff.  That probably should've been 5 or 6 separate
  branches, but by the time I'd realized how large and mixed that bag
  had become it had been too close to -final to play with rebasing.

  Some fs/namei.c cleanups there, memdup_user_nul() introduction and
  switching open-coded instances, burying long-dead code, whack-a-mole
  of various kinds, several new helpers for ->llseek(), assorted
  cleanups and fixes from various people, etc.

  One piece probably deserves special mention - Neil's
  lookup_one_len_unlocked().  Similar to lookup_one_len(), but gets
  called without ->i_mutex and tries to avoid ever taking it.  That, of
  course, means that it's not useful for any directory modifications,
  but things like getting inode attributes in nfds readdirplus are fine
  with that.  I really should've asked for moratorium on lookup-related
  changes this cycle, but since I hadn't done that early enough...  I
  *am* asking for that for the coming cycle, though - I'm going to try
  and get conversion of i_mutex to rwsem with ->lookup() done under lock
  taken shared.

  There will be a patch closer to the end of the window, along the lines
  of the one Linus had posted last May - mechanical conversion of
  ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
  inode_is_locked()/inode_lock_nested().  To quote Linus back then:

    -----
    |    This is an automated patch using
    |
    |        sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    |        sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    |        sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[     ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    |        sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    |        sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    |    with a very few manual fixups
    -----

  I'm going to send that once the ->i_mutex-affecting stuff in -next
  gets mostly merged (or when Linus says he's about to stop taking
  merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  nfsd: don't hold i_mutex over userspace upcalls
  fs:affs:Replace time_t with time64_t
  fs/9p: use fscache mutex rather than spinlock
  proc: add a reschedule point in proc_readfd_common()
  logfs: constify logfs_block_ops structures
  fcntl: allow to set O_DIRECT flag on pipe
  fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
  fs: xattr: Use kvfree()
  [s390] page_to_phys() always returns a multiple of PAGE_SIZE
  nbd: use ->compat_ioctl()
  fs: use block_device name vsprintf helper
  lib/vsprintf: add %*pg format specifier
  fs: use gendisk->disk_name where possible
  poll: plug an unused argument to do_poll
  amdkfd: don't open-code memdup_user()
  cdrom: don't open-code memdup_user()
  rsxx: don't open-code memdup_user()
  mtip32xx: don't open-code memdup_user()
  [um] mconsole: don't open-code memdup_user_nul()
  [um] hostaudio: don't open-code memdup_user()
  ...
2016-01-12 17:11:47 -08:00
Linus Torvalds 03891f9c85 - The most significant set of changes this cycle is the Forward Error
Correction (FEC) support that has been added to the DM verity target.
   Google uses DM verity on all Android devices and it is believed that
   this FEC support will enable DM verity to recover from storage
   failures seen since DM verity was first deployed as part of Android.
 
 - A stable fix for a race in the destruction of DM thin pool's workqueue
 
 - A stable fix for hung IO if a DM snapshot copy hit an error
 
 - A few small cleanups in DM core and DM persistent data.
 
 - A couple DM thinp range discard improvements (address atomicity of
   finding a range and the efficiency of discarding a partially mapped
   thin device)
 
 - Add ability to debug DM bufio leaks by recording stack trace when a
   buffer is allocated.  Upon detected leak the recorded stack is dumped.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWlAf0AAoJEMUj8QotnQNaHPIH/2rvnzg71RsP/7IRI/5DHETP
 ubxKhKd7tfqwJEjuQvhiYB1Ubo+gvXEuT51C2G1ug2QzHsjymmE14q/60ElB7+/U
 ++bGisWvqm4ZqWWM9yffqbESzNOfNTn7dLduaxGeLxVG3zVLfzQRfSPOqhk1FiIv
 H35v0Xx/j1NAHQtcocVYzG4P5BwfgmeyuYmUq8BklHNlwa3drBKnMZfIlF4u2216
 Z3K7d+5nLpSsPyejzpQlByHTUt/eVy1Y2ZBgudWITaP5DAcUQwHyLZI4k3skmMiK
 O/xLZ54aeKI9NhtEwH8s8jOd3b7Kvw/oAw5nfPj7jmIDF3if8U2HCU6KgfBVwwU=
 =fOsS
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The most significant set of changes this cycle is the Forward Error
   Correction (FEC) support that has been added to the DM verity target.

   Google uses DM verity on all Android devices and it is believed that
   this FEC support will enable DM verity to recover from storage
   failures seen since DM verity was first deployed as part of Android.

 - A stable fix for a race in the destruction of DM thin pool's
   workqueue

 - A stable fix for hung IO if a DM snapshot copy hit an error

 - A few small cleanups in DM core and DM persistent data.

 - A couple DM thinp range discard improvements (address atomicity of
   finding a range and the efficiency of discarding a partially mapped
   thin device)

 - Add ability to debug DM bufio leaks by recording stack trace when a
   buffer is allocated.  Upon detected leak the recorded stack is
   dumped.

* tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: fix hung bios when copy error occurs
  dm thin: bump thin and thin-pool target versions
  dm thin: fix race condition when destroying thin pool workqueue
  dm space map metadata: remove unused variable in brb_pop()
  dm verity: add ignore_zero_blocks feature
  dm verity: add support for forward error correction
  dm verity: factor out verity_for_bv_block()
  dm verity: factor out structures and functions useful to separate object
  dm verity: move dm-verity.c to dm-verity-target.c
  dm verity: separate function for parsing opt args
  dm verity: clean up duplicate hashing code
  dm btree: factor out need_insert() helper
  dm bufio: use BUG_ON instead of conditional call to BUG
  dm bufio: store stacktrace in buffers to help find buffer leaks
  dm bufio: return NULL to improve code clarity
  dm block manager: cleanup code that prints stacktrace
  dm: don't save and restore bi_private
  dm thin metadata: make dm_thin_find_mapped_range() atomic
  dm thin metadata: speed up discard of partially mapped volumes
2016-01-11 22:25:00 -08:00
Dan Williams d3b407fb3f badblocks: rename badblocks_free to badblocks_exit
For symmetry with badblocks_init() make it clear that this path only
destroys incremental allocations of a badblocks instance, and does not
free the badblocks instance itself.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Vishal Verma fc974ee2bf md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Mikulas Patocka 385277bfb5 dm snapshot: fix hung bios when copy error occurs
When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.

The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception.  complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).

The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.

If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not.  This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.

Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful.  A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store.  Also, remove commit_callback now that
it is merely a wrapper around pending_complete.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-01-08 20:03:05 -05:00
Mike Snitzer 1c2e54e1ed dm thin: bump thin and thin-pool target versions
Commit 3d5f6733 ("dm thin metadata: speed up discard of partially mapped
volumes"), or some other dm-thinp change during the Linux 4.5
development window, really should've bumped these target versions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-01-06 20:59:40 -05:00
NeilBrown 274d8cbde1 md: Remove 'ready' field from mddev.
This field is always set in tandem with ->pers, and when it is tested
->pers is also tested.  So ->ready is not needed.

It was needed once, but code rearrangement and locking changes have
removed that needed.

Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-07 11:01:14 +11:00
Guoqing Jiang bb9ef71646 md: remove unnecesary md_new_event_inintr
md_new_event had removed sysfs_notify since 'commit 72a23c211e
("Make sure all changes to md/sync_action are notified.")', so we
can use md_new_event and delete md_new_event_inintr.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-07 11:01:14 +11:00
Christoph Hellwig 5036c39020 raid5: allow r5l_io_unit allocations to fail
And propagate the error up the stack so we can add the stripe
to no_stripes_list and retry our log operation later.  This avoids
blocking raid5d due to reclaim, an it allows to get rid of the
deadlock-prone GFP_NOFAIL allocation.

shli: add missing mempool_destroy()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:12 +11:00
Christoph Hellwig e8deb63810 raid5-cache: use a mempool for the metadata block
We only have a limited number in flight, so use a page based mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:08 +11:00
Christoph Hellwig c38d29b33b raid5-cache: use a bio_set
This allows us to make guaranteed forward progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:04 +11:00
Shaohua Li f6b6ec5cfa raid5-cache: add journal hot add/remove support
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:57 +11:00
Deepa Dinamani 9ebc6ef188 drivers: md: use ktime_get_real_seconds()
get_seconds() API is not y2038 safe on 32 bit systems and the API
is deprecated. Replace it with calls to ktime_get_real_seconds()
API instead. Change mddev structure types to time64_t accordingly.

32 bit signed timestamps will overflow in the year 2038.

Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.

Clamp the tim64_t timestamps to positive values with a max of U32_MAX
when returning from GET_ARRAY_INFO ioctl to accommodate above changes
in the data type of timestamps to unsigned int.

v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.

Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types. Remove the truncation of the bits while
writing to or reading from these from the disk.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:53 +11:00
Arnd Bergmann 3312c951ef md: avoid warning for 32-bit sector_t
When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which
means we cannot have devices with more than 2TB, and the code that
is trying to handle compatibility support for large devices in
md version 0.90 is meaningless but also causes a compile-time warning:

drivers/md/md.c: In function 'super_90_load':
drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
drivers/md/md.c: In function 'super_90_rdev_size_change':
drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]

This adds a check for CONFIG_LBDAF to avoid even getting into this
code path, and also adds an explicit cast to let the compiler know
it doesn't have to warn about the truncation.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:48 +11:00
Christoph Hellwig ad66d445ee raid5-cache: free meta_page earlier
Once the I/O completed we don't need the meta page anymore.  As the iounits
can live on for a long time this reduces memory pressure a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:43 +11:00
Christoph Hellwig 3848c0bcb0 raid5-cache: simplify r5l_move_io_unit_list
It's only used for one kind of move, so make that explicit.  Also clean
up the code a bit by using list_for_each_safe.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:34 +11:00
Guoqing Jiang abf3508d8f md: update comment for md_allow_write
MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after
commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"),
so make the change accordingly.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:26 +11:00
Guoqing Jiang e19508fa4d md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
1. fix unbalanced parentheses.
2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY
   will be cleared after set it in add_new_disk.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:21 +11:00
Guoqing Jiang 8b9277c814 md-cluster: Protect communication with mutexes
Communication can happen through multiple threads. It is possible that
one thread steps over another threads sequence. So, we use mutexes to
protect both the send and receive sequences.

Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK.
Communication is locked with bit manipulation in order to allow
"lock and hold" for the add operation. In case of an add operation,
if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set.
When md_update_sb() calls metadata_update_start(), it checks
(in a single statement to avoid races), if the communication
is already locked. If yes, it merely returns zero, else it
locks the token lockresource.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:17 +11:00