Commit Graph

3335 Commits

Author SHA1 Message Date
NeilBrown 9c4bdf697c md/raid6: avoid data corruption during recovery of double-degraded RAID6
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then.  In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
2014-08-18 14:49:46 +10:00
NeilBrown a40687ff73 md/raid5: avoid livelock caused by non-aligned writes.
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.

In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed.  Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.

This bug is unlikely as most writes are page aligned.  However as it
can lead to a livelock it is suitable for -stable.  It was introduced
in 3.16.

Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-18 14:49:41 +10:00
Linus Torvalds ba368991f6 . Allow the thin target to paired with any size external origin; also
allow thin snapshots to be larger than the external origin.
 
 . Add support for quickly loading a repetitive pattern into the
   dm-switch target.
 
 . Use per-bio data in the dm-crypt target instead of always using a
   mempool for each allocation.  Required switching to kmalloc alignment
   for the bio slab.
 
 . Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
 
 . Fix the dm-cache and dm-thin targets' export of the minimum_io_size to
   match the data block size -- this fixes an issue where mkfs.xfs would
   improperly infer raid striping was in place on the underlying storage.
 
 . Small cleanups in dm-io, dm-mpath and dm-cache
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE
 ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK
 9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov
 w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw
 kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m
 olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw=
 =y/3r
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

 - Allow the thin target to paired with any size external origin; also
   allow thin snapshots to be larger than the external origin.

 - Add support for quickly loading a repetitive pattern into the
   dm-switch target.

 - Use per-bio data in the dm-crypt target instead of always using a
   mempool for each allocation.  Required switching to kmalloc alignment
   for the bio slab.

 - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

 - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
   to match the data block size -- this fixes an issue where mkfs.xfs
   would improperly infer raid striping was in place on the underlying
   storage.

 - Small cleanups in dm-io, dm-mpath and dm-cache

* tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm table: propagate QUEUE_FLAG_NO_SG_MERGE
  dm switch: efficiently support repetitive patterns
  dm switch: factor out switch_region_table_read
  dm cache: set minimum_io_size to cache's data block size
  dm thin: set minimum_io_size to pool's data block size
  dm crypt: use per-bio data
  block: use kmalloc alignment for bio slab
  dm table: make dm_table_supports_discards static
  dm cache metadata: use dm-space-map-metadata.h defined size limits
  dm cache: fail migrations in the do_worker error path
  dm cache: simplify deferred set reference count increments
  dm thin: relax external origin size constraints
  dm thin: switch to an atomic_t for tracking pending new block preparations
  dm mpath: eliminate pg_ready() wrapper
  dm io: simplify dec_count and sync_io
2014-08-14 09:17:56 -06:00
Linus Torvalds d429a3639c Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:
 "Nothing out of the ordinary here, this pull request contains:

   - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
     and Surbhi Palande.  No new features, just a lot of fixes.

   - The usual round of drbd updates from Andreas Gruenbacher, Lars
     Ellenberg, and Philipp Reisner.

   - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
     has taken it one step further and added support for actually using
     more than one queue.

   - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
     compliment the the default behavior of adding to the tail of the
     queue.  From Douglas Gilbert"

* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
  bcache: Drop unneeded blk_sync_queue() calls
  bcache: add mutex lock for bch_is_open
  bcache: Correct printing of btree_gc_max_duration_ms
  bcache: try to set b->parent properly
  bcache: fix memory corruption in init error path
  bcache: fix crash with incomplete cache set
  bcache: Fix more early shutdown bugs
  bcache: fix use-after-free in btree_gc_coalesce()
  bcache: Fix an infinite loop in journal replay
  bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
  bcache: bcache_write tracepoint was crashing
  bcache: fix typo in bch_bkey_equal_header
  bcache: Allocate bounce buffers with GFP_NOWAIT
  bcache: Make sure to pass GFP_WAIT to mempool_alloc()
  bcache: fix uninterruptible sleep in writeback thread
  bcache: wait for buckets when allocating new btree root
  bcache: fix crash on shutdown in passthrough mode
  bcache: fix lockdep warnings on shutdown
  bcache allocator: send discards with correct size
  bcache: Fix to remove the rcu_sched stalls.
  ...
2014-08-14 09:10:21 -06:00
Linus Torvalds 2213d7c29a md updates for 3.17
Most interesting is that md devices (major == 9) with
 minor numbers of 512 or more will no longer be created
 simply by opening a block device file.  They can only
 be created by writing to
    /sys/module/md_mod/parameters/new_array
 The 'auto-create-on-open' semantic is cumbersome and we
 need to start moving away from it.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU+hOmDnsnt1WYoG5AQJxZxAAjCJAOHiqIBeFg2auJdZF5u5dgqshJQk3
 gMpl96Arazf5YXJNTMRoNsfgPg/XjG/9MDS9UoyzgAsQTRi90UYjq+gcSODdHGcF
 pVFxdrPf3Ja+cnlniW3aCuosv143c9N06IUDOm5TuY7nioQQA7gVS6HEpctmLbgi
 R5Q2jQ5GeSVhWKsowYBdzbso3y/QAH0zS5AS4tW3c+l7YMie3gfBqNHTHiQby9I1
 NwEUaWrU13jUHT6M4CWIp+wqgORXVqFNHx1dxYefTEgIdB1Bfo15MgDorG1sReF7
 I+H1zvc/QpvEQ/WWJ6Cg5gItVZqof6SPljnQHkt0kqtu6fDJD3wYBDYKE8VgP9k1
 9zJUibyOOqDFAxrEkVy6XnQ50bYV2uNPFSKpw3jFLXpapRzL602NRWvgHGLkPEbS
 TofriiykOzYGxKkIhIWtiqzOaOhPMo+Z771WGRjoF4ZOAALYRnItvC/FFoRKCHeZ
 xTOp2xnKwAX6vHsrIHpJP+0/no2R8kzkwnSFSCgoRdraFGxfqY2N32svZaXFUt0c
 FlaAktWPtM+gaVzzJzEnm4kfruvVRvPV9zyVa1ro990E+00X5eGHi7g7mPmTQ33q
 gxQK/d/jL22RoOXppfaqj2lApQsQJ6F1rMyuvUwLK+gyFacMQyGtRfHqJo5JZKRT
 rBFNoOcGh3A=
 =v8Er
 -----END PGP SIGNATURE-----

Merge tag 'md/3.17' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Most interesting is that md devices (major == 9) with minor numbers of
  512 or more will no longer be created simply by opening a block device
  file.  They can only be created by writing to

      /sys/module/md_mod/parameters/new_array

  The 'auto-create-on-open' semantic is cumbersome and we need to start
  moving away from it"

* tag 'md/3.17' of git://neil.brown.name/md:
  md: don't allow bitmap file to be added to raid0/linear.
  md/raid0: check for bitmap compatability when changing raid levels.
  md: Recovery speed is wrong
  md: disable probing for md devices 512 and over.
  md/raid1,raid10: always abort recover on write error.
2014-08-11 07:02:35 -07:00
Jeff Moyer 200612ec33 dm table: propagate QUEUE_FLAG_NO_SG_MERGE
Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE.  This gets set by
default in blk_mq_init_queue for mq-enabled devices.  The effect of
the flag is to bypass the SG segment merging.  Instead, the
bio->bi_vcnt is used as the number of hardware segments.

With a device mapper target on top of a device with
QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
than a driver is prepared to handle.  I ran into this when backporting
the virtio_blk mq support.  It triggerred this BUG_ON, in
virtio_queue_rq:

        BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

The queue's max is set here:
        blk_queue_max_segments(q, vblk->sg_elems-2);

Basically, what happens is that a bio is built up for the dm device
(which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
bio_add_page.  That path will call into __blk_recalc_rq_segments, so
what you end up with is bi_phys_segments being much smaller than bi_vcnt
(and bi_vcnt grows beyond the maximum sg elements).  Then, when the bio
is submitted, it gets cloned.  When the cloned bio is submitted, it will
end up in blk_recount_segments, here:

        if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
                bio->bi_phys_segments = bio->bi_vcnt;

and now we've set bio->bi_phys_segments to a number that is beyond what
was registered as queue_max_segments by the driver.

The right way to fix this is to propagate the queue flag up the stack.

The rules for propagating the flag are simple:
- if the flag is set for any underlying device, it must be set for the
  upper device
- consequently, if the flag is not set for any underlying device, it
  should not be set for the upper device.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.16+
2014-08-10 20:54:49 -04:00
NeilBrown d66b1b395a md: don't allow bitmap file to be added to raid0/linear.
An array can only accept a bitmap if it will call bitmap_daemon_work
periodically, which means it needs a thread running.

If there is no thread, don't allow a bitmap to be added.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 15:43:20 +10:00
NeilBrown a8461a61c2 md/raid0: check for bitmap compatability when changing raid levels.
If an array has a bitmap, then it cannot be converted to raid0.

Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 15:33:17 +10:00
Xiao Ni ac7e50a383 md: Recovery speed is wrong
When we calculate the speed of recovery, the numerator that contains
the recovery done sectors.  It's need to subtract the sectors which
don't finish recovery.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 12:11:25 +10:00
Linus Torvalds 98959948a7 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
   from Frederic Weisbecker.

  This necessiated quite some background infrastructure rework,
  including:

   * Clean up some irq-work internals
   * Implement remote irq-work
   * Implement nohz kick on top of remote irq-work
   * Move full dynticks timer enqueue notification to new kick
   * Move multi-task notification to new kick
   * Remove unecessary barriers on multi-task notification

 - Remove proliferation of wait_on_bit() action functions and allow
   wait_on_bit_action() functions to support a timeout.  (Neil Brown)

 - Another round of sched/numa improvements, cleanups and fixes.  (Rik
   van Riel)

 - Implement fast idling of CPUs when the system is partially loaded,
   for better scalability.  (Tim Chen)

 - Restructure and fix the CPU hotplug handling code that may leave
   cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
   cpu.  (Kirill Tkhai)

 - Robustify the sched topology setup code.  (Peterz Zijlstra)

 - Improve sched_feat() handling wrt.  static_keys (Jason Baron)

 - Misc fixes.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  sched/fair: Fix 'make xmldocs' warning caused by missing description
  sched: Use macro for magic number of -1 for setparam
  sched: Robustify topology setup
  sched: Fix sched_setparam() policy == -1 logic
  sched: Allow wait_on_bit_action() functions to support a timeout
  sched: Remove proliferation of wait_on_bit() action functions
  sched/numa: Revert "Use effective_load() to balance NUMA loads"
  sched: Fix static_key race with sched_feat()
  sched: Remove extra static_key*() function indirection
  sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
  sched: Transform resched_task() into resched_curr()
  sched/deadline: Kill task_struct->pi_top_task
  sched: Rework check_for_tasks()
  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
  sched/fair: Disable runtime_enabled on dying rq
  sched/numa: Change scan period code to match intent
  sched/numa: Rework best node setting in task_numa_migrate()
  sched/numa: Examine a task move when examining a task swap
  sched/numa: Simplify task_numa_compare()
  sched/numa: Use effective_load() to balance NUMA loads
  ...
2014-08-04 16:23:30 -07:00
Kent Overstreet 0781c8748c bcache: Drop unneeded blk_sync_queue() calls
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.

Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
2014-08-04 15:23:04 -07:00
Jianjian Huo 789d21dbd9 bcache: add mutex lock for bch_is_open
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.

Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
2014-08-04 15:23:04 -07:00
Surbhi Palande 5b25abade2 bcache: Correct printing of btree_gc_max_duration_ms
time_stats::btree_gc_max_duration_mc is not bit shifted by 8

Fixes BUG #138

Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov 2452cc8906 bcache: try to set b->parent properly
bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
2014-08-04 15:23:04 -07:00
Slava Pestov c9a78332b4 bcache: fix memory corruption in init error path
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.

Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
2014-08-04 15:23:04 -07:00
Slava Pestov bf0c55c986 bcache: fix crash with incomplete cache set
Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
2014-08-04 15:23:04 -07:00
Kent Overstreet d83353b319 bcache: Fix more early shutdown bugs
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov 400ffaa2ac bcache: fix use-after-free in btree_gc_coalesce()
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.

This regression was introduced in 2d7f9531.

Change-Id: I76564d7257800583214376b4bacf236cda90c89c
2014-08-04 15:23:04 -07:00
Kent Overstreet 6b708de64a bcache: Fix an infinite loop in journal replay
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.

Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov 913dc33fb2 bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
'b' was NULL.

Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
2014-08-04 15:23:03 -07:00
Slava Pestov 60ae81eee8 bcache: bcache_write tracepoint was crashing
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov 8e09480806 bcache: fix typo in bch_bkey_equal_header
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Kent Overstreet 501d52a90c bcache: Allocate bounce buffers with GFP_NOWAIT
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
2014-08-04 15:23:03 -07:00
Kent Overstreet bcf090e004 bcache: Make sure to pass GFP_WAIT to mempool_alloc()
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov 9e5c353510 bcache: fix uninterruptible sleep in writeback thread
There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov c5aa4a3157 bcache: wait for buckets when allocating new btree root
Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov a664d0f05a bcache: fix crash on shutdown in passthrough mode
We never started the writeback thread in this case, so don't stop it.
2014-08-04 15:23:03 -07:00
Slava Pestov e5112201c1 bcache: fix lockdep warnings on shutdown 2014-08-04 15:23:03 -07:00
Slava Pestov 8b326d3a2a bcache allocator: send discards with correct size 2014-08-04 15:23:03 -07:00
Surbhi Palande dbd810ab67 bcache: Fix to remove the rcu_sched stalls.
while loop was executing infinitely.
This fix ends the while loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet 9aa61a992a bcache: Fix a journal replay bug
journal replay wansn't validating pointers with bch_extent_invalid() before
derefing, fixed

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet 5b1016e62f bcache: Fix a bug when detaching
After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Mikulas Patocka 56b1ebf2d9 dm switch: efficiently support repetitive patterns
Add support for quickly loading a repetitive pattern into the
dm-switch target.

In the "set_regions_mappings" message, the user may now use "Rn,m" as
one of the arguments.  "n" and "m" are hexadecimal numbers.  The "Rn,m"
argument repeats the last "n" arguments in the following "m" slots.

For example:
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2

Requested-by: Jay Wang <jwang@nimblestorage.com>
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:37 -04:00
Mikulas Patocka 99eb1908e6 dm switch: factor out switch_region_table_read
Move code that reads the table to a switch_region_table_read.
It will be needed for the next commit.  No functional change.

Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer b02465308f dm cache: set minimum_io_size to cache's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size.  This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer fdfb4c8c1a dm thin: set minimum_io_size to pool's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the thin-pool's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update pool_io_hints() to set both minimum_io_size and optimal_io_size
to the thin-pool's data block size.  This fixes an issue reported where
mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
on a normal linear LV of comparable size, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1003227

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00
Mikulas Patocka 298a9fa08a dm crypt: use per-bio data
Change dm-crypt so that it uses auxiliary data allocated with the bio.

Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it).  It
previously only used mempool allocations.

Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.

This patch changes it so that the first dm_crypt_io and ablkcipher_request
are allocated with the bio (using target per_bio_data_size option).  If
the request needs additional values, they are allocated from the mempool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00
Mikulas Patocka a7ffb6a533 dm table: make dm_table_supports_discards static
The function dm_table_supports_discards is only called from
dm-table.c:dm_table_set_restrictions().  So move it above
dm_table_set_restrictions and make it static.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:34 -04:00
Mike Snitzer 895b47d798 dm cache metadata: use dm-space-map-metadata.h defined size limits
Commit 7d48935e cleaned up the persistent-data's space-map-metadata
limits by elevating them to dm-space-map-metadata.h.  Update
dm-cache-metadata to use these same limits.

The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
sizeof the disk_bitmap_header.  So the supported maximum metadata size
is a bit smaller (reduced from 33423360 to 33292800 sectors).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber 304affaa88 dm cache: fail migrations in the do_worker error path
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:33 -04:00
Joe Thornber 8c081b52c6 dm cache: simplify deferred set reference count increments
Factor out inc_and_issue and inc_ds helpers to simplify deferred set
reference count increments.  Also cleanup cache_map to consistently call
cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.

No functional change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:32 -04:00
Joe Thornber e5aea7b49f dm thin: relax external origin size constraints
Track the size of any external origin.  Previously the external origin's
size had to be a multiple of the thin-pool's block size, that is no
longer a requirement.  In addition, snapshots that are larger than the
external origin are now supported.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:32 -04:00
Joe Thornber 50f3c3efdd dm thin: switch to an atomic_t for tracking pending new block preparations
Previously we used separate boolean values to track quiescing and
copying actions.  By switching to an atomic_t we can support blocks that
need a partial copy and partial zero.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:31 -04:00
Mike Snitzer 6afbc01d75 dm mpath: eliminate pg_ready() wrapper
pg_ready() is not comprehensive in its logic and only serves to
obfuscate code.  Replace pg_ready() with the appropriate logic in
multipath_map().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:31 -04:00
Joe Thornber 97e7cdf12b dm io: simplify dec_count and sync_io
Remove the io struct off the stack in sync_io() and allocate it from
the mempool like is done in async_io().

dec_count() now always calls a callback function and always frees the io
struct back to the mempool (so sync_io and async_io share this pattern).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:30 -04:00
Anssi Hannula 44fa816bb7 dm cache: fix race affecting dirty block count
nr_dirty is updated without locking, causing it to drift so that it is
non-zero (either a small positive integer, or a very large one when an
underflow occurs) even when there are no actual dirty blocks.  This was
due to a race between the workqueue and map function accessing nr_dirty
in parallel without proper protection.

People were seeing under runs due to a race on increment/decrement of
nr_dirty, see: https://lkml.org/lkml/2014/6/3/648

Fix this by using an atomic_t for nr_dirty.

Reported-by: roma1390@gmail.com
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-08-01 12:25:22 -04:00
Greg Thelen d8c712ea47 dm bufio: fully initialize shrinker
1d3d4437ea ("vmscan: per-node deferred work") added a flags field to
struct shrinker assuming that all shrinkers were zero filled.  The dm
bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
in flags.  So far the only defined flags bit is SHRINKER_NUMA_AWARE.
But there are proposed patches which add other bits to shrinker.flags
(e.g. memcg awareness).

Rather than simply initializing the shrinker, this patch uses kzalloc()
when allocating the dm_bufio_client to ensure that the embedded shrinker
and any other similar structures are zeroed.

This fixes theoretical over aggressive shrinking of dm bufio objects.
If the uninitialized dm_bufio_client.shrinker.flags contains
SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
each numa node rather than just once.  This has been broken since 3.12.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
2014-08-01 12:07:21 -04:00
NeilBrown af5628f05d md: disable probing for md devices 512 and over.
The way md devices are traditionally created in the kernel
is simply to open the device with the desired major/minor number.

This can be problematic as some support tools, notably udev and
programs run by udev, can open a device just to see what is there, and
find that it has created something.  It is easy for a race to cause
udev to open an md device just after it was destroy, causing it to
suddenly re-appear.

For some time we have had an alternate way to create md devices
  echo md_somename > /sys/modules/md_mod/paramaters/new_array

This will always use a minor number of 512 or higher, which mdadm
normally avoids.
Using this makes the creation-by-opening unnecessary, but does
not disable it, so it is still there to cause problems.

This patch disable probing for devices with a major of 9 (MD_MAJOR)
and a minor of 512 and up.  This devices created by writing to
new_array cannot be re-created by opening the node in /dev.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-07-31 13:54:54 +10:00
NeilBrown 2446dba03f md/raid1,raid10: always abort recover on write error.
Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).

This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices).  In this case
the bitmap bit will be cleared, but it really shouldn't.

The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.

If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.

As the bug can result in data corruption the patch is suitable for
-stable.  For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.

Original-from: jiao hui <jiaohui@bwstor.com.cn>
Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
2014-07-31 10:16:52 +10:00
Ingo Molnar ca5bc6cd5d Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:03:00 +02:00