Commit Graph

3248 Commits

Author SHA1 Message Date
Joe Thornber 5a32083d03 dm: take care to copy the space map roots before locking the superblock
In theory copying the space map root can fail, but in practice it never
does because we're careful to check what size buffer is needed.

But make certain we're able to copy the space map roots before
locking the superblock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed
2014-03-27 16:56:23 -04:00
Joe Thornber a9d45396f5 dm transaction manager: fix corruption due to non-atomic transaction commit
The persistent-data library used by dm-thin, dm-cache, etc is
transactional.  If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.

Atomicity when committing a transaction is achieved by:

a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
   disk.

This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.

Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.

As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush().  Now
the unlocking must be done separately.

This issue was discovered by forcing io errors at the crucial time
using dm-flakey.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Heinz Mauelshagen 64ab346a36 dm cache: remove remainder of distinct discard block size
Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.

Completely remove all code that enabled a distinction between the
cache block size and discard block size.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Mike Snitzer d132cc6d9e dm cache: prevent corruption caused by discard_block_size > cache_block_size
If the discard block size is larger than the cache block size we will
not properly quiesce IO to a region that is about to be discarded.  This
results in a race between a cache migration where no copy is needed, and
a write to an adjacent cache block that's within the same large discard
block.

Workaround this by limiting the discard_block_size to cache_block_size.
Also limit the max_discard_sectors to cache_block_size.

A more comprehensive fix that introduces range locking support in the
bio_prison and proper quiescing of a discard range that spans multiple
cache blocks is already in development.

Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Joe Thornber 428e469864 dm bitset: only flush the current word if it has been dirtied
This change offers a big performance boost for dm-era.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Joe Thornber eec40579d8 dm: add era target
dm-era is a target that behaves similar to the linear target.  In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'.  Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.

Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.

dm-era is primarily expected to be paired with the dm-cache target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
John Sheu cb85114956 bcache: remove nested function usage
Uninlined nested functions can cause crashes when using ftrace, as they don't
follow the normal calling convention and confuse the ftrace function graph
tracer as it examines the stack.

Also, nested functions are supported as a gcc extension, but may fail on other
compilers (e.g. llvm).

Signed-off-by: John Sheu <john.sheu@gmail.com>
2014-03-18 12:39:28 -07:00
Kent Overstreet 3a2fd9d509 bcache: Kill bucket->gc_gen
gc_gen was a temporary used to recalculate last_gc, but since we only need
bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can just update
last_gc directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:24:54 -07:00
Kent Overstreet 2531d9ee61 bcache: Kill unused freelist
This was originally added as at optimization that for various reasons isn't
needed anymore, but it does add a lot of nasty corner cases (and it was
responsible for some recently fixed bugs). Just get rid of it now.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:36 -07:00
Kent Overstreet 0a63b66db5 bcache: Rework btree cache reserve handling
This changes the bucket allocation reserves to use _real_ reserves - separate
freelists - instead of watermarks, which if nothing else makes the current code
saner to reason about and is going to be important in the future when we add
support for multiple btrees.

It also adds btree_check_reserve(), which checks (and locks) the reserves for
both bucket allocation and memory allocation for btree nodes; the old code just
kinda sorta assumed that since (e.g. for btree node splits) it had the root
locked and that meant no other threads could try to make use of the same
reserve; this technically should have been ok for memory allocation (we should
always have a reserve for memory allocation (the btree node cache is used as a
reserve and we preallocate it)), but multiple btrees will mean that locking the
root won't be sufficient anymore, and for the bucket allocation reserve it was
technically possible for the old code to deadlock.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet 56b30770b2 bcache: Kill btree_io_wq
With the locking rework in the last patch, this shouldn't be needed anymore -
btree_node_write_work() only takes b->write_lock which is never held for very
long.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet 2a285686c1 bcache: btree locking rework
Add a new lock, b->write_lock, which is required to actually modify - or write -
a btree node; this lock is only held for short durations.

This means we can write out a btree node without taking b->lock, which _is_ held
for long durations - solving a deadlock when btree_flush_write() (from the
journalling code) is called with a btree node locked.

Right now just occurs in bch_btree_set_root(), but with an upcoming journalling
rework is going to happen a lot more.

This also turns b->lock is now more of a read/intent lock instead of a
read/write lock - but not completely, since it still blocks readers. May turn it
into a real intent lock at some point in the future.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet 05335cff9f bcache: Fix a race when freeing btree nodes
This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the
bucket on the unused freelist, where it can be reused right away without any
ordering requirements. It would be better to wait on at least a journal write to
go down before reusing the bucket. bch_btree_set_root() does this, and inserting
into non leaf nodes is completely synchronous so we should be ok, but future
patches are just going to get rid of the unused freelist - it was needed in the
past for various reasons but shouldn't be anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:34 -07:00
Kent Overstreet 4fe6a81670 bcache: Add a real GC_MARK_RECLAIMABLE
This means the garbage collection code can better check for data and metadata
pointers to the same buckets.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet c13f3af924 bcache: Add bch_keylist_init_single()
This will potentially save us an allocation when we've got inode/dirent bkeys
that don't fit in the keylist's inline keys.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet 1575402052 bcache: Improve priority_stats
Break down data into clean data/dirty data/metadata.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet 7159b1ad3d bcache: Better alloc tracepoints
Change the invalidate tracepoint to indicate how much data we're invalidating,
and change the alloc tracepoints to indicate what offset they're for.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet 3f5e0a34da bcache: Kill dead cgroup code
This hasn't been used or even enabled in ages.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Nicholas Swenson 3f6ef38110 bcache: stop moving_gc marking buckets that can't be moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-03-18 12:22:34 -07:00
Kent Overstreet 10d9dcf6ee bcache: Fix moving_pred()
Avoid a potential null pointer deref (e.g. from check keys for cache misses)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:34 -07:00
Nicholas Swenson da415a096f bcache: Fix moving_gc deadlocking with a foreground write
Deadlock happened because a foreground write slept, waiting for a bucket
to be allocated. Normally the gc would mark buckets available for invalidation.
But the moving_gc was stuck waiting for outstanding writes to complete.
These writes used the bcache_wq, the same queue foreground writes used.

This fix gives moving_gc its own work queue, so it was still finish moving
even if foreground writes are stuck waiting for allocation. It also makes
work queue a parameter to the data_insert path, so moving_gc can use its
workqueue for writes.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet 90db6919f5 bcache: Fix discard granularity
blk_stack_limits() doesn't like a discard granularity of 0.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet 487dded86e bcache: Fix another bug recovering from unclean shutdown
The on disk bucket gens are allowed to be out of date, when we reuse buckets
that didn't have any live data in them. To deal with this, the initial gc has to
update the bucket gen when we find a pointer gen newer than the bucket's gen.

Unfortunately we weren't doing this for pointers in the journal that we're about
to replay.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet 0bd143fd80 bcache: Fix a bug recovering from unclean shutdown
The code to fixup incorrect bucket prios incorrectly did not skip btree node
freeing keys

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:32 -07:00
Kent Overstreet 27201cfdaa bcache: Fix a journalling reclaim after recovery bug
On recovery we weren't correctly keeping track of what journal buckets had open
journal entries, thus it was possible for them to be overwritten until we'd
written all new journal entries.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:21:48 -07:00
Kent Overstreet 65ddf45a31 bcache: Fix a null ptr deref in journal replay
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 19:01:03 -07:00
Kent Overstreet 4fa03402cd bcache: Fix a lockdep splat in an error path
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 18:59:09 -07:00
Heinz Mauelshagen e893fba90c dm cache: fix access beyond end of origin device
In order to avoid wasting cache space a partial block at the end of the
origin device is not cached.  Unfortunately, the check for such a
partial block at the end of the origin device was flawed.

Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:

- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset

Otherwise, users can experience errors like the following:

 attempt to access beyond end of device
 dm-5: rw=0, want=20971520, limit=20971456
 ...
 device-mapper: cache: promotion failed; couldn't copy block

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:52:00 -04:00
Heinz Mauelshagen 8b9d966665 dm cache: fix truncation bug when copying a block to/from >2TB fast device
During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits.  The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().

Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.

Here is an example of how this bug manifests on an ext4 filesystem:

 EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
 JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-12 13:49:27 -04:00
Joe Thornber cebc2de44d dm space map metadata: fix refcount decrement below 0 which caused corruption
This has been a relatively long-standing issue that wasn't nailed down
until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014,
see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html

From that report:
  "When decreasing the reference count of a metadata block with its
  reference count equals 3, we will call dm_btree_remove() to remove
  this enrty from the B+tree which keeps the reference count info in
  metadata device.

  The B+tree will try to rebalance the entry of the child nodes in each
  node it traversed, and the rebalance process contains the following
  steps.

  (1) Finding the corresponding children in current node (shadow_current(s))
  (2) Shadow the children block (issue BOP_INC)
  (3) redistribute keys among children, and free children if necessary (issue BOP_DEC)

  Since the update of a metadata block's reference count could be
  recursive, we will stash these reference count update operations in
  smm->uncommitted and then process them in a FILO fashion.

  The problem is that step(3) could free the children which is created
  in step(2), so the BOP_DEC issued in step(3) will be carried out
  before the BOP_INC issued in step(2) since these BOPs will be
  processed in FILO fashion. Once the BOP_DEC from step(3) tries to
  decrease the reference count of newly shadow block, it will report
  failure for its reference equals 0 before decreasing. It looks like we
  can solve this issue by processing these BOPs in a FIFO fashion
  instead of FILO."

Commit 5b564d80 ("dm space map: disallow decrementing a reference count
below zero") changed the code to report an error for this temporary
refcount decrement below zero.  So what was previously a harmless
invalid refcount became a hard failure due to the new error path:

 device-mapper: space map common: unable to decrement a reference count below 0
 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
 device-mapper: thin: 253:6: switching pool to read-only mode

This bug is in dm persistent-data code that is common to the DM thin and
cache targets.  So any users of those targets should apply this fix.

Fix this by applying recursive space map operations in FIFO order rather
than FILO.

Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801

Reported-by: Apollon Oikonomopoulos <apoikos@debian.org>
Reported-by: edwillam1007@gmail.com
Reported-by: Teng-Feng Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-03-07 12:02:47 -05:00
Joe Thornber 738211f70a dm thin: fix noflush suspend IO queueing
i) by the time DM core calls the postsuspend hook the dm_noflush flag
has been cleared.  So the old thin_postsuspend did nothing.  We need to
use the presuspend hook instead.

ii) There was a race between bios leaving DM core and arriving in the
deferred queue.

thin_presuspend now sets a 'requeue' flag causing all bios destined for
that thin to be requeued back to DM core.  Then it requeues all held IO,
and all IO on the deferred queue (destined for that thin).  Finally
postsuspend clears the 'requeue' flag.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:59 -05:00
Joe Thornber 18adc57779 dm thin: fix deadlock in __requeue_bio_list
The spin lock in requeue_io() was held for too long, allowing deadlock.
Don't worry, due to other issues addressed in the following "dm thin:
fix noflush suspend IO queueing" commit, this code was never called.

Fix this by taking the spin lock for a much shorter period of time.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:58 -05:00
Joe Thornber 3e1a069909 dm thin: fix out of data space handling
Ideally a thin pool would never run out of data space; the low water
mark would trigger userland to extend the pool before we completely run
out of space.  However, many small random IOs to unprovisioned space can
consume data space at an alarming rate.  Adjust your low water mark if
you're frequently seeing "out-of-data-space" mode.

Before this fix, if data space ran out the pool would be put in
PM_READ_ONLY mode which also aborted the pool's current metadata
transaction (data loss for any changes in the transaction).  This had a
side-effect of needlessly compromising data consistency.  And retry of
queued unserviceable bios, once the data pool was resized, could
initiate changes to potentially inconsistent pool metadata.

Now when the pool's data space is exhausted transition to a new pool
mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
may not be allocated.  This allows users to remove thin volumes or
discard data to recover data space.

The pool is no longer put in PM_READ_ONLY mode in response to the pool
running out of data space.  And PM_READ_ONLY mode no longer aborts the
pool's current metadata transaction.  Also, set_pool_mode() will now
notify userspace when the pool mode is changed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05 15:26:58 -05:00
Mike Snitzer 07f2b6e038 dm thin: ensure user takes action to validate data and metadata consistency
If a thin metadata operation fails the current transaction will abort,
whereby causing potential for IO layers up the stack (e.g. filesystems)
to have data loss.  As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
thin metadata's superblock which:
1) requires the user verify the thin metadata is consistent (e.g. use
   thin_check, etc)
2) suggests the user verify the thin data is consistent (e.g. use fsck)

The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
to run thin_repair.

On metadata operation failure: abort current metadata transaction, set
pool in read-only mode, and now set the needs_check flag.

As part of this change, constraints are introduced or relaxed:
* don't allow a pool to transition to write mode if needs_check is set
* don't allow data or metadata space to be resized if needs_check is set
* if a thin pool's metadata space is exhausted: the kernel will now
  force the user to take the pool offline for repair before the kernel
  will allow the metadata space to be extended.

Also, update Documentation to include information about when the thin
provisioning target commits metadata, how it handles metadata failures
and running out of space.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2014-03-05 15:25:35 -05:00
Mike Snitzer cdc2b41584 dm thin: synchronize the pool mode during suspend
Commit b5330655 ("dm thin: handle metadata failures more consistently")
increased potential for the pool's mode to be changed in response to
metadata operation failures.

When the pool mode is changed it isn't synchronized with the mode in
pool_features stored in the target's context (ti->private) that is used
as the basis for (re)establishing the pool mode during resume via
bind_control_target.

It is important that we synchronize the pool mode when it is changed
otherwise the pool may experience and unexpected mode transition on the
next resume (especially if there was no new table load).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-03-04 11:17:51 -05:00
Mikulas Patocka 2c945820ca dm snapshot: fix metadata corruption
Commit 55494bf294 ("dm snapshot: use dm-bufio") broke snapshots.
Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
involved reading exception areas one by one into ps->area and inserting
those exceptions into the hash table.  Commit 55494bf294 changed
it so that dm-bufio with prefetch is used to load exceptions in batchs.
Exceptions are loaded correctly, but ps->area is left uninitialized.
When a new exception is allocated, it is stored in this uninitialized
ps->area which will be written to the disk.  This causes metadata
corruption.

Fix this corruption by copying the last area that was read via dm-bufio
into ps->area.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-03 17:58:13 -05:00
Mike Snitzer c64d240df3 dm: fix Kconfig indentation
Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.

Doing so fixes indentation for other DM config options.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-03 17:31:07 -05:00
Greg Kroah-Hartman aa074c1c80 Merge 3.14-rc5 into char-misc-next
We want these fixes in here as well.
2014-03-02 19:53:09 -08:00
Heinz Mauelshagen 14f398ca2f dm cache mq: fix memory allocation failure for large cache devices
The memory allocated for the multiqueue policy's hash table doesn't need
to be physically contiguous.  Use vzalloc() instead of kzalloc().
Fedora has been carrying this fix since 10/10/2013.

Failure seen during creation of a 10TB cached device with a 2048 sector
block size and 411GB cache size:

 dmsetup: page allocation failure: order:9, mode:0x10c0d0
 CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3
 Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a       12/30/2011
  000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928
  ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928
  ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0
 Call Trace:
  [<ffffffff81387ab4>] dump_stack+0x19/0x1b
  [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124
  [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e
  [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e
  [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f
  [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88
  [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b
  [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq]
  [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq]
  [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache]
  [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache]
  [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache]
  [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod]
  [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod]
  [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod]
  [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod]
  [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod]
  [<ffffffff81101181>] vfs_ioctl+0x21/0x34
  [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4
  [<ffffffff810f4d2e>] ? ____fput+0x9/0xb
  [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92
  [<ffffffff81101a68>] SyS_ioctl+0x52/0x82
  [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-28 12:18:29 -05:00
Heinz Mauelshagen e0d849fad7 dm cache: fix truncation bug when mapping I/O to >2TB fast device
When remapping a block to the cache's fast device that is larger than
2TB we must not truncate the destination sector to 32bits.  The 32bit
temporary result of from_cblock() was being overflowed in
remap_to_cache() due to the logical left shift.

Use an intermediate 64bit type to store the 32bit from_cblock() result
to fix the overflow.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-28 09:23:02 -05:00
Mike Snitzer 7d48935eff dm thin: allow metadata space larger than supported to go unused
It was always intended that a user could provide a thin metadata device
that is larger than the max supported by the on-disk format.  The extra
space would just go unused.

Unfortunately that never worked.  If the user attempted to use a larger
metadata device on creation they would get an error like the following:

 device-mapper: space map common: space map too large
 device-mapper: transaction manager: couldn't create metadata space map
 device-mapper: thin metadata: tm_create_with_sm failed
 device-mapper: table: 252:17: thin-pool: Error creating metadata object
 device-mapper: ioctl: error adding target to table

Fix this by allowing the initial metadata space map creation to cap its
size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).

Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
the sizeof the disk_bitmap_header.  So the supported maximum metadata
size is a bit smaller (reduced from 33423360 to 33292800 sectors).

Lastly, remove the "excess space will not be used" warning message from
get_metadata_dev_size(); it resulted in printing the warning multiple
times.  Factor out warn_if_metadata_device_too_big(), call it from
pool_ctr() and maybe_resize_metadata_dev().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-27 11:49:08 -05:00
Hannes Reinecke a1989b3300 dm mpath: fix stalls when handling invalid ioctls
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not.  So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately.  This fix resolves numerous instances of:

 udevd[]: worker [] unexpectedly returned with status 0x0100

that have been seen during testing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-02-26 09:44:44 -05:00
Kent Overstreet dabb443340 bcache: Fix a shutdown bug
Shutdown wasn't cancelling/waiting on journal_write_work()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:42:49 -08:00
Kent Overstreet 1b4eaf3d38 bcache: Fix flash_dev_cache_miss() for real this time
The code was using sectors to count the number of sectors it was zeroing... but
then it passed it to bio_advance()... after it had been set to 0. Amusing...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:41:11 -08:00
Mike Snitzer 1acacc0784 dm thin: fix the error path for the thin device constructor
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:

 device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
 device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.

Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-02-24 11:41:18 -05:00
Kent Overstreet 85cbe1f88c bcache: Fix another compiler warning on m68k
Use a bigger hammer this time

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org>
2014-02-18 08:55:05 -08:00
Greg Kroah-Hartman ba4b60e85d Merge 3.14-rc3 into char-misc-next
We need the fixes here for future mei and other patches.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-18 08:09:40 -08:00
Mikulas Patocka f3a44fe060 dm raid1: fix immutable biovec related BUG when retrying read bio
When restoring bi_end_io, increase bi_remaining before retrying the bio
to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-02-18 10:48:57 -05:00
Mikulas Patocka d73f990729 dm io: fix I/O to multiple destinations
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") broke dm-mirror due to dm-io breakage.

dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
DM_IO_VMA) that iterate over pages where the I/O should be performed.

The switch to immutable biovecs changed the DM_IO_BVEC iterator to
DM_IO_BIO.  Before this change the iterator stored the pointer to a bio
vector in the dpages structure.  The iterator incremented the pointer in
the dpages structure as it advanced over the pages.  After the immutable
biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
the dpages structure and uses bio_advance to change the bio as it
advances.

The problem is that the function dispatch_io stores the content of the
dpages structure into the variable old_pages and restores it before
issuing I/O to each of the devices.  Before the change, the statement
"*dp = old_pages;" restored the iterator to its starting position.
After the change, struct dpages holds a pointer to the bio, thus the
statement "*dp = old_pages;" doesn't restore the iterator.

Consequently, in the context of dm-mirror: only the first mirror leg is
written correctly, the kernel locks up when trying to write the other
mirror legs because the number of sectors to write in the where->count
variable doesn't match the number of sectors returned by the iterator.

This patch fixes the bug by partially reverting the original patch - it
changes the code so that struct dpages holds a pointer to the bio vector,
so that the statement "*dp = old_pages;" restores the iterator correctly.

The field "context_u" holds the offset from the beginning of the current
bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-02-17 11:00:05 -05:00
Mike Snitzer 4d1662a30d dm thin: avoid metadata commit if a pool's thin devices haven't changed
Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit.  This commit occurs regardless of whether
any thin devices have made changes.

Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().

Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
2014-02-17 11:00:05 -05:00
Mike Snitzer 80ae49aaed dm cache: do not add migration to completed list before unhooking bio
When completing an overwrite bio, in overwrite_endio(), the associated
migration should not be added to the 'completed_migrations' until the
bio's fields are restored with dm_unhook_bio().

Otherwise, do_worker() can race to process 'completed_migrations' before
dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
unlikely to cause any problems given the current code but should be
fixed on the basis of correctness.

Also, the cache's spinlock only needs to be held when manipulating the
'completed_migrations' list -- other changes don't need protection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-17 11:00:05 -05:00
Mike Snitzer c6eda5e81c dm cache: move hook_info into common portion of per_bio_data structure
Commit c9d28d5d ("dm cache: promotion optimisation for writes")
incorrectly placed the 'hook_info' member in the writethrough-only
portion of the per_bio_data structure.

Given that the overwrite optimization may be used for writeback the
'hook_info' member must be placed above the 'cache' member of the
per_bio_data structure.  Any members above 'cache' are available from
both writeback and writethrough modes' per_bio_data structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-02-17 11:00:05 -05:00
Linus Torvalds bd3813d52d Two bugfixes for md
both tagged for -stable
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAUvxBNjnsnt1WYoG5AQLymhAAnKznI2YhFVqK21mpo1l2JDSkwxqIBvBZ
 hcW24zF6dNU4cJFmRQqOeL2AkzHWSqX4/J/DGXvI9wFll1CkdNs+UVQJ12Pod3gK
 gTDmqRCe/x+bQxrOR5VfyKv0slia12vn9mqfDd2mX41wcr7ceHsdHbemPhgIcUCC
 WLERQi9Yn/Eb2+rltTzZ3XaHwIlIozqZ0yRZ6wH45iyuk+uiholEJjJp8LOWpzTe
 rKE4s5qd1NAAJsrMHZ11mZWq/4VtgYJ3AcWVXVWqBPxmlI0FnBPU/KVpJkAcrVjB
 N6tqmR1/nHcrGlaOgWSS6UfNGVMe3L2HJpaIdjTM65Tdb+WFpEPevTy9qYsLC3Ic
 zV/KmErUtSFMJKYBr9YyRnSpXtnSDo8BeRsWJm9ZaA5UV9yUVBNwWDFNFP/Bkqze
 v4wLMRj54U5fjRZBq/PaFbk/A2nDCkGHC4uZCgJ+Mwhoo6rxpho/oKBjBBlmpw3q
 4Q0yWgZ8F/ZWFUrGzi1TY3tdYrl3yCOpZ3l5aRTtTqlU3aVShIIiKCKDvs2v8l6h
 C5igUbnW5BtsMMCOwdULc/lHgN3vMbJEA+7YdmeouDEY5QAk0O6nxan3y+cbtC5u
 F+++tkWzSQZJRGhdAxdAXsABYfHiR7Wnft96+iMpnQYbm35CdYYwlOhhl0iI/+Ec
 FcpDXOz9faA=
 =J3I5
 -----END PGP SIGNATURE-----

Merge tag 'md/3.14-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Two bugfixes for md

  both tagged for -stable"

* tag 'md/3.14-fixes' of git://neil.brown.name/md:
  md/raid5: Fix CPU hotplug callback registration
  md/raid1: restore ability for check and repair to fix read errors.
2014-02-14 12:48:16 -08:00
Oleg Nesterov 789b5e0315 md/raid5: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the code in raid5 this way to fix the deadlock with callback
registration.

Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-02-13 13:46:45 +11:00
David Fries ac8f73305e connector: add portid to unicast in addition to broadcasting
This allows replying only to the requestor portid while still
supporting broadcasting.  Pass 0 to portid for the previous behavior.

Signed-off-by: David Fries <David@Fries.net>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-07 15:40:17 -08:00
NeilBrown 1877db7558 md/raid1: restore ability for check and repair to fix read errors.
commit 30bc9b5387
    md/raid1: fix bio handling problems in process_checks()

Move the bio_reset() to a point before where BIO_UPTODATE is checked,
so that check now always report that the bio is uptodate, even if it is not.

This causes process_check() to sometimes treat read-errors as
successful matches so the good data isn't written out.

This patch preserves the flag until it is needed.

Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
an even worse bug).  So suitable for any -stable since 3.10.

Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: stable@vger.kernel.org (3.10+)
Fixed: 30bc9b5387
Signed-off-by: NeilBrown <neilb@suse.de>
2014-02-05 12:26:04 +11:00
Jens Axboe 96d2e8b5e2 Merge branch 'bcache-for-3.14' of git://evilpiepirate.org/~kent/linux-bcache into for-linus 2014-01-30 12:57:55 -07:00
Linus Torvalds 53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Nicholas Swenson e3b4825b85 bcache: bugfix - gc thread now gets woken when cache is full
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-29 13:06:42 -08:00
Kent Overstreet 3572324af0 bcache: Minor fixes from kbuild robot
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-29 13:06:41 -08:00
Darrick J. Wong 9471744767 bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:

BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
	     GC_SECTORS_USED(g) + KEY_SIZE(k),
	     (1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));

In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits.  Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field.  In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.

Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED.  Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2014-01-29 13:06:15 -08:00
Linus Torvalds 5c85121bf6 md updates for 3.14
All bug fixes, two tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAUuG8WTnsnt1WYoG5AQKW7A//V93TYUiUAG6zNUFrNZjuoXP0ym3jlpkH
 eIIFdcV7rr0Irgtd8+s9cW8Cjsbq3d/vMbFlwP1Co32mnCnFFojeKtCvM9GqkYrH
 o4Zr1nVAVYzKO4awByK3wBT9WbzEc/XlDgYQIpZExIYeZdzLOm6HyvlRbcE86Ug5
 QoGYOUlLu4LUZmFgB9zQ7JM0GACV5pS1afSObtACj2t2x5GVHNU84u+M+D8urPXO
 wnf+AIAzquh5F+8MX+DxmMEUaSzUHf8fXOM3jYVbzPI71SpaHssL4SwBn+j4I/8/
 SCSqeIh7qMSuqUy63/iHKCy5qAgNuRdL9fYlOTkpxzHm81Ddj8u7fySsApggVOa2
 yeKTkSRlsMFeu+LiGKNi/fINVxboaoYJVZ2DTNtKxSuW2VL2aPNz1Qjq4QnR3nSI
 LpaB3VeVKdMsH8Em1a8cgZWcjo5YFAcNtUnJq2fvj9VZ3SJNw4ZoKDL+l718iGao
 xIwAXMSafAHQVAAaNVFkwrea13TeOyxikY5Ra4vWfm+Fw8TzmYq5DqO0zaILwdAJ
 2FnNj2/2y3hk2K7qBcEvjjEakxPlTwzrzxZMfJDRMuQLqvrjbXiMGOnWzgl1D/9x
 4/uPjeFZLG7byxmIyg4Y83NkPgkWnRPpGK98r26pUH1UgnRF0a5aUFXQk7rsQrU6
 noRkZ9EPD8s=
 =d81E
 -----END PGP SIGNATURE-----

Merge tag 'md/3.14' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "All bug fixes, two tagged for -stable"

* tag 'md/3.14' of git://neil.brown.name/md:
  md/raid5: close recently introduced race in stripe_head management.
  md/raid5: fix long-standing problem with bitmap handling on write failure.
  md: check command validity early in md_ioctl().
  md: ensure metadata is writen after raid level change.
  md/raid10: avoid fullsync when not necessary.
  md: allow a partially recovered device to be hot-added to an array.
  md: Change handling of save_raid_disk and metadata update during recovery.
2014-01-24 17:41:50 -08:00
Linus Torvalds fe41c2c018 A set of device-mapper changes for 3.14.
A lot of attention was paid to improving the thin-provisioning target's
 handling of metadata operation failures and running out of space.  A new
 'error_if_no_space' feature was added to allow users to error IOs rather
 than queue them when either the data or metadata space is exhausted.
 
 Additional fixes/features include:
 - a few fixes to properly support thin metadata device resizing
 - a solution for reliably waiting for a DM device's embedded kobject to
   be released before destroying the device
 - old dm-snapshot is updated to use the dm-bufio interface to take
   advantage of readahead capabilities that improve snapshot activation
 - new dm-cache target tunables to control how quickly data is promoted
   to the cache (fast) device
 - improved write efficiency of cluster mirror target by combining
   userspace flush and mark requests
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS4GClAAoJEMUj8QotnQNacdEH/2ES5k5itUQRY9jeI+u2zYNP
 vdsRTYf+97+B3jpRvpWbMt4kxT2tjaQbkxJ+iKRHy2MBLFUgq8ruH1RS/Q5VbDeg
 6i6ol8mpNxhlvo/KTMxXqRcWDSxShiMfhz2lXC2bJ7M4sP/iiH85s4Pm4YQ59jpd
 OIX7qj36m/cV/le9YQbexJEEsaj+3genbzL26wyyvtG/rT9fWnXa7clj2gqTdToG
 YCEBCRf5FH9X6W/Oc50nMw5n2dt/MRmPre/MAlOjemeaosB0WJiKaswM25rnvHp0
 JnhxQ2K2C5KIKAWIfwPOImdb9zWW7p1dIRLsS8nHBUQr0BF5VRkmvpnYH4qBtcc=
 =e7e0
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper changes from Mike Snitzer:
 "A lot of attention was paid to improving the thin-provisioning
  target's handling of metadata operation failures and running out of
  space.  A new 'error_if_no_space' feature was added to allow users to
  error IOs rather than queue them when either the data or metadata
  space is exhausted.

  Additional fixes/features include:
   - a few fixes to properly support thin metadata device resizing
   - a solution for reliably waiting for a DM device's embedded kobject
     to be released before destroying the device
   - old dm-snapshot is updated to use the dm-bufio interface to take
     advantage of readahead capabilities that improve snapshot
     activation
   - new dm-cache target tunables to control how quickly data is
     promoted to the cache (fast) device
   - improved write efficiency of cluster mirror target by combining
     userspace flush and mark requests"

* tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
  dm log userspace: allow mark requests to piggyback on flush requests
  dm space map metadata: fix bug in resizing of thin metadata
  dm cache: add policy name to status output
  dm thin: fix pool feature parsing
  dm sysfs: fix a module unload race
  dm snapshot: use dm-bufio prefetch
  dm snapshot: use dm-bufio
  dm snapshot: prepare for switch to using dm-bufio
  dm snapshot: use GFP_KERNEL when initializing exceptions
  dm cache: add block sizes and total cache blocks to status output
  dm btree: add dm_btree_find_lowest_key
  dm space map metadata: fix extending the space map
  dm space map common: make sure new space is used during extend
  dm: wait until embedded kobject is released before destroying a device
  dm: remove pointless kobject comparison in dm_get_from_kobject
  dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
  dm cache policy mq: introduce three promotion threshold tunables
  dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
  dm thin: fix set_pool_mode exposed pool operation races
  dm thin: eliminate the no_free_space flag
  ...
2014-01-22 20:17:48 -08:00
Dongmao Zhang 5066a4df1f dm log userspace: allow mark requests to piggyback on flush requests
In the cluster evironment, cluster write has poor performance because
userspace_flush() has to contact a userspace program (cmirrord) for
clear/mark/flush requests.  But both mark and flush requests require
cmirrord to communicate the message to all the cluster nodes for each
flush call.  This behaviour is really slow.

To address this we now merge mark and flush requests together to reduce
the kernel-userspace-kernel time.  We allow a new directive,
"integrated_flush" that can be used to instruct the kernel log code to
combine flush and mark requests when directed by userspace.  If not
directed by userspace (due to an older version of the userspace code
perhaps), the kernel will function as it did previously - preserving
backwards compatibility.  Additionally, flush requests are performed
lazily when only clear requests exist.

Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-21 23:46:27 -05:00
Linus Torvalds f075e0f699 Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The bulk of changes are cleanups and preparations for the upcoming
  kernfs conversion.

   - cgroup_event mechanism which is and will be used only by memcg is
     moved to memcg.

   - pidlist handling is updated so that it can be served by seq_file.

     Also, the list is not sorted if sane_behavior.  cgroup
     documentation explicitly states that the file is not sorted but it
     has been for quite some time.

   - All cgroup file handling now happens on top of seq_file.  This is
     to prepare for kernfs conversion.  In addition, all operations are
     restructured so that they map 1-1 to kernfs operations.

   - Other cleanups and low-pri fixes"

* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
  cgroup: trivial style updates
  cgroup: remove stray references to css_id
  doc: cgroups: Fix typo in doc/cgroups
  cgroup: fix fail path in cgroup_load_subsys()
  cgroup: fix missing unlock on error in cgroup_load_subsys()
  cgroup: remove for_each_root_subsys()
  cgroup: implement for_each_css()
  cgroup: factor out cgroup_subsys_state creation into create_css()
  cgroup: combine css handling loops in cgroup_create()
  cgroup: reorder operations in cgroup_create()
  cgroup: make for_each_subsys() useable under cgroup_root_mutex
  cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
  cgroup: unify pidlist and other file handling
  cgroup: replace cftype->read_seq_string() with cftype->seq_show()
  cgroup: attach cgroup_open_file to all cgroup files
  cgroup: generalize cgroup_pidlist_open_file
  cgroup: unify read path so that seq_file is always used
  cgroup: unify cgroup_write_X64() and cgroup_write_string()
  cgroup: remove cftype->read(), ->read_map() and ->write()
  hugetlb_cgroup: convert away from cftype->read()
  ...
2014-01-21 17:51:34 -08:00
NeilBrown 7da9d450ab md/raid5: close recently introduced race in stripe_head management.
As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.

However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock.  This can lead to races and list
corruption.

So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.

Note that we still increment ->count without device lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.
This is safe because if the stripe_head can be found by
get_free_stripe, then the hash lock assures us the no-one else could
possibly be calling release_stripe() at the same time.

Fixes: 566c09c534
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-22 11:45:03 +11:00
Joe Thornber fca028438f dm space map metadata: fix bug in resizing of thin metadata
This bug was introduced in commit 7e664b3dec ("dm space map metadata:
fix extending the space map").

When extending a dm-thin metadata volume we:

- Switch the space map into a simple bootstrap mode, which allocates
  all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
  entries
- Commit changes to disk
- Switch back out of bootstrap mode.

But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.

The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit.  This would cause the disk commit to fail, and thinp
to enter read_only mode.  The metadata was not damaged (thin_check
passed).

The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space.  In practise this loop only
runs twice.

With this fix the following device mapper testsuite test passes:
 dmtest run --suite thin-provisioning -n thin_remove_works_after_resize

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # depends on commit 7e664b3dec
2014-01-21 12:15:01 -05:00
Linus Torvalds d3bad75a6d Driver core / sysfs patches for 3.14-rc1
Here's the big driver core and sysfs patch set for 3.14-rc1.
 
 There's a lot of work here moving sysfs logic out into a "kernfs" to
 allow other subsystems to also have a virtual filesystem with the same
 attributes of sysfs (handle device disconnect, dynamic creation /
 removal  as needed / unneeded, etc.  This is primarily being done for
 the cgroups filesystem, but the goal is to also move debugfs to it when
 it is ready, solving all of the known issues in that filesystem as well.
 The code isn't completed yet, but all should be stable now (there is a
 big section that was reverted due to problems found when testing.)
 
 There's also some other smaller fixes, and a driver core addition that
 allows for a "collection" of objects, that the DRM people will be using
 soon (it's in this tree to make merges after -rc1 easier.)
 
 All of this has been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
 UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
 =0pc0
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core / sysfs patches from Greg KH:
 "Here's the big driver core and sysfs patch set for 3.14-rc1.

  There's a lot of work here moving sysfs logic out into a "kernfs" to
  allow other subsystems to also have a virtual filesystem with the same
  attributes of sysfs (handle device disconnect, dynamic creation /
  removal as needed / unneeded, etc)

  This is primarily being done for the cgroups filesystem, but the goal
  is to also move debugfs to it when it is ready, solving all of the
  known issues in that filesystem as well.  The code isn't completed
  yet, but all should be stable now (there is a big section that was
  reverted due to problems found when testing)

  There's also some other smaller fixes, and a driver core addition that
  allows for a "collection" of objects, that the DRM people will be
  using soon (it's in this tree to make merges after -rc1 easier)

  All of this has been in linux-next with no reported issues"

* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
  kernfs: associate a new kernfs_node with its parent on creation
  kernfs: add struct dentry declaration in kernfs.h
  kernfs: fix get_active failure handling in kernfs_seq_*()
  Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
  Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
  Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
  Revert "kernfs: remove KERNFS_REMOVED"
  Revert "kernfs: restructure removal path to fix possible premature return"
  Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
  Revert "kernfs: remove kernfs_addrm_cxt"
  Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
  Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
  Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
  Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
  Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
  Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
  Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
  Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
  kernfs: remove unnecessary NULL check in __kernfs_remove()
  drivers/base: provide an infrastructure for componentised subsystems
  ...
2014-01-20 15:49:44 -08:00
Mike Snitzer 2e68c4e6ca dm cache: add policy name to status output
The cache's policy may have been established using the "default" alias,
which is currently the "mq" policy but the default policy may change in
the future.  It is useful to know exactly which policy is being used.

Add a 'real' member to the dm_cache_policy_type structure and have the
"default" dm_cache_policy_type point to the real "mq"
dm_cache_policy_type.  Update dm_cache_policy_get_name() to check if
real is set, if so report the name of the real policy (not the alias).

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-16 13:44:11 -05:00
Mike Snitzer 74aa45c33c dm thin: fix pool feature parsing
Commit 787a996cb2 ("dm thin: add error_if_no_space feature")
mistakenly forgot to increase the number of feature args supported.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-15 21:16:24 -05:00
NeilBrown 9f97e4b128 md/raid5: fix long-standing problem with bitmap handling on write failure.
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices.  However if the write wasn't fully successful we
should not clear the bit.  If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.

This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing.  But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.

So: set the flag when a write fails.

This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.

Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-16 09:35:38 +11:00
Nicolas Schichan cb335f88eb md: check command validity early in md_ioctl().
Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.

This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.

Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process when the mddev is currently being opened by another process:

Process 1				Process 2
---------				---------

md_alloc()
  mddev_find()
  -> returns a new mddev with
     hold_active == UNTIL_IOCTL
  add_disk()
  -> sends KOBJ_ADD uevent

					(sees KOBJ_ADD uevent for device)
                    			md_open()
                    			md_ioctl(INVALID_IOCTL)
                    			-> returns ENODEV and clears
                       			   mddev->hold_active
                    			md_release()
                      			md_put()
                      			-> deletes the mddev as
                         		   hold_active is 0

md_open()
  mddev_find()
  -> returns a newly
    allocated mddev with
    mddev->gendisk == NULL
-> returns with ERESTARTSYS
   (kernel restarts the open syscall)

Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-16 08:55:00 +11:00
Linus Torvalds 1a60864fc1 md: half a dozen bug fixes for 3.13
All of these fix real bugs the people have hit, and are tagged
 for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAUtYZqznsnt1WYoG5AQK50g//XuqVR/esIpGR+knf+1sD3Zk85Vp33kGL
 2UfbQbi40q/uLjBhJhOSkx/sYGw1Eo255vNX+yMVjYT9F+xbhI8vlLfecqx5Fk5J
 M+JH1sM7E2T79boFLoOBGSl/qppsQsPHa3p87FmFHQrrAuEMIbFiP98MnQjdSiv4
 Cu9cAR7x7njepHeMXBFiV7URaYtCHAXR9iMdkebkKIFlfND8w2QYD+LWo3SzBKs9
 jTrSBJRpXLHE+bZLOQPhAryb7nWkcT1R7N0vsVMQKcq1o6ZiRNnk/B9xNtV34hkc
 5zwTPe/d5AsV6Tsxg0dSs7xcBn/A+F5lg8fzdOhyE1F13COmB7sepjPTMPAy/oP1
 zjyPwnnWkHMDUW2usf3aqPMt+LGMofRCJHXjkqpMgIWQ96SQUY8F9PPxchkUCsx/
 A38I+vXl2jGDHh/DFSduef3sDOF6TYyKyLteJftyny96dc1RutrZSbHPdrkDz1YQ
 6zcyvpv0FexiXITrLg70FG8fnRMK91ZfHrmuzVP7tpm2TyeIfDriLhTAIXAcXHOT
 l22a1bNj4shFfztnD0CbH6nY/iJM7ov0x5+IyG5/iYbipon02MenQeV9km6JVwQb
 OCGHYCTswiFSduX1E1ru52dHXifbANWgzcUH0sjGQ0YZNmxvPRBWDjB1H2J1auzW
 J8T10qimw1w=
 =uvyl
 -----END PGP SIGNATURE-----

Merge tag 'md/3.13-fixes' of git://neil.brown.name/md

Pull late md fixes from Neil Brown:
 "Half a dozen md bug fixes.

  All of these fix real bugs the people have hit, and are tagged for
  -stable.  Sorry they are late ....  Christmas holidays and all that.
  Hopefully they can still squeak into 3.13"

* tag 'md/3.13-fixes' of git://neil.brown.name/md:
  md: fix problem when adding device to read-only array with bitmap.
  md/raid10: fix bug when raid10 recovery fails to recover a block.
  md/raid5: fix a recently broken BUG_ON().
  md/raid1: fix request counting bug in new 'barrier' code.
  md/raid10: fix two bugs in handling of known-bad-blocks.
  md/raid5: Fix possible confusion when multiple write errors occur.
2014-01-15 15:07:36 +07:00
Mikulas Patocka 2995fa78e4 dm sysfs: fix a module unload race
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.

The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).

To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.

The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-01-14 23:23:04 -05:00
Mikulas Patocka 55b082e614 dm snapshot: use dm-bufio prefetch
This patch modifies dm-snapshot so that it prefetches the buffers when
loading the exceptions.

The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS
macro.  The current value for DM_PREFETCH_CHUNKS (12) was found to
provide the best performance on a single 15k SCSI spindle.  In the
future we may modify this default or make it configurable.

Also, introduce the function dm_bufio_set_minimum_buffers to setup
bufio's number of internal buffers before freeing happens.  dm-bufio may
hold more buffers if enough memory is available.  There is no guarantee
that the specified number of buffers will be available - if you need a
guarantee, use the argument reserved_buffers for
dm_bufio_client_create.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14 23:23:03 -05:00
Mikulas Patocka 55494bf294 dm snapshot: use dm-bufio
Use dm-bufio for initial loading of the exceptions.
Introduce a new function dm_bufio_forget that frees the given buffer.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14 23:23:02 -05:00
Mikulas Patocka 2cadabd512 dm snapshot: prepare for switch to using dm-bufio
Change the functions get_exception, read_exception and insert_exceptions
so that ps->area is passed as an argument.

This patch doesn't change any functionality, but it refactors the code
to allow for a cleaner switch over to using dm-bufio.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14 13:38:32 -05:00
Mikulas Patocka 119bc54736 dm snapshot: use GFP_KERNEL when initializing exceptions
The list of initial exceptions is loaded in the target constructor.  We
are allowed to allocate memory with GFP_KERNEL at this point.  So,
change alloc_completed_exception to use GFP_KERNEL when being called
from the constructor.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14 11:18:16 -05:00
NeilBrown 830778a180 md: ensure metadata is writen after raid level change.
level_store() currently does not make sure the metadata is
updates to reflect the new raid level.  It simply sets MD_CHANGE_DEVS.

Any level with a ->thread will quickly notice this and update the
metadata.  However RAID0 and Linear do not have a thread so no
metadata update happens until the array is stopped.  At that point the
metadata is written.

This is later that we would like.  While the delay doesn't risk any
data it can cause confusion.  So if there is no md thread, immediately
update the metadata after a level change.

Reported-by: Richard Michael <rmichael@edgeofthenet.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:21 +11:00
NeilBrown 0b59bb6422 md/raid10: avoid fullsync when not necessary.
This is the raid10 equivalent of

commit 4f0a5e012c
    MD RAID1: Further conditionalize 'fullsync'

If a device in a newly assembled array is not fully recovered we
currently do a fully resync by don't need to.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:21 +11:00
NeilBrown 7eb418851f md: allow a partially recovered device to be hot-added to an array.
When adding a new device into an array it is normally important to
clear any stale data from ->recovery_offset else the new device may
not be recovered properly.

However when re-adding a device which is known to be nearly in-sync,
this is not needed and can be detrimental.  The (bitmap-based)
resync will still happen, and further recovery is only needed from
where-ever it was already up to.

So if save_raid_disk is set, signifying a re-add, don't clear
->recovery_offset.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:21 +11:00
NeilBrown f466722ca6 md: Change handling of save_raid_disk and metadata update during recovery.
Since commit d70ed2e4fa
   MD: Allow restarting an interrupted incremental recovery.

we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences.  This patch
changes things to make them work better.

At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".

Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>".  So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.

After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count".  That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point.  So this
was an improvement.

However updates some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
record that the array has all devices present (even though some are
recovering).  So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously it doesn't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.

It really is wrong to not update all the metadata together.  Do that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata.  i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening.  We
now add another one to say that it is a bitmap-based recovery.

With this we can remove the code that disables the write-out of
metadata on some devices.

So this patch:
 - moves the setting of 'saved_raid_disk' from add_new_disk to
   the validate_super methods.  This makes sure it is always set
   properly, both when adding a new device to an array, and when
   assembling an array from a collection of devices.
 - Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
   used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
   bitmap-based recovery is allowed.
   This is only present in v1.x metadata. v0.90 doesn't support
   devices which are in the middle of recovery at all.
 - Only skips writing metadata to Faulty devices.

 - Also allows rdev state to be set to "-insync" via sysfs.
   This can be used for external-metadata arrays.  When the
   'role' is set the device is assumed to be in-sync.  If, after
   setting the role, we set the state to "-insync", the role is
   moved to saved_raid_disk which effectively says the device is
   partly in-sync with that slot and needs a bitmap recovery.

Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:21 +11:00
NeilBrown 8313b8e57f md: fix problem when adding device to read-only array with bitmap.
If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.

If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.

However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update.  We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.

This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag.  If that is set,
then the device will not be added to a read-only array.

Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:08 +11:00
NeilBrown e8b8491585 md/raid10: fix bug when raid10 recovery fails to recover a block.
commit e875ecea26
    md/raid10 record bad blocks as needed during recovery.

added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.

So move the freeing of r10bio down where it is safe.

Cc: stable@vger.kernel.org (v3.1+)
Fixes: e875ecea26
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:08 +11:00
NeilBrown 5af9bef72c md/raid5: fix a recently broken BUG_ON().
commit 6d183de407
    md/raid5: fix newly-broken locking in get_active_stripe.

simplified a BUG_ON, but removed too much so now it sometimes fires
when it shouldn't.

When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
special list while multiple stripe_heads are collected, or it might
not be on any list, even a 'free' list when the refcount is zero.  As
long as STRIPE_EXPANDING is set, it will be found and added back to a
list eventually.

So both of the BUG_ONs which test for the ->lru being empty or not
need to avoid the case where STRIPE_EXPANDING is set.

The patch which broke this was marked for -stable, so this patch needs
to be applied to any branch that received 6d183de4

Fixes: 6d183de407
Cc: stable@vger.kernel.org (any release to which above was applied)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:07 +11:00
NeilBrown 41a336e011 md/raid1: fix request counting bug in new 'barrier' code.
The new iobarrier implementation in raid1 (which keeps normal writes
and resync activity separate) counts every request what is not before
the current resync point in either next_window_requests or
current_window_requests.
It flags that the request is counted by setting ->start_next_window.

allow_barrier follows this model exactly and decrements one of the
*_window_requests if and only if ->start_next_window is set.

However wait_barrier(), which increments *_window_requests uses a
slightly different test for setting -.start_next_window (which is set
from the return value of this function).
So there is a possibility of the counts getting out of sync, and this
leads to the resync hanging.

So change wait_barrier() to return a non-zero value in exactly the
same cases that it increments *_window_requests.

But was introduced in 3.13-rc1.

Reported-by: Bruno Wolff III <bruno@wolff.to>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
Fixes: 79ef3a8aa1
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:07 +11:00
NeilBrown b50c259e25 md/raid10: fix two bugs in handling of known-bad-blocks.
If we discover a bad block when reading we split the request and
potentially read some of it from a different device.

The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
   seen by comparison with raid1.c

This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.

Cc: stable@vger.kernel.org (v3.1+)
Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:07 +11:00
NeilBrown 1cc03eb932 md/raid5: Fix possible confusion when multiple write errors occur.
commit 5d8c71f9e5
    md: raid5 crash during degradation

Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.

commit 14a75d3e07
    md/raid5: preferentially read from replacement device if possible.

Fixed the same bug if a more effective way, so we can now revert
the original commit.

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
Fixes: 5d8c71f9e5
Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14 16:44:07 +11:00
Hugh Dickins b3ff8a2f95 cgroup: remove stray references to css_id
Trivial: remove the few stray references to css_id, which itself
was removed in v3.13's 2ff2a7d03b "cgroup: kill css_id".

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-13 10:48:18 -05:00
Mike Snitzer 6a388618f1 dm cache: add block sizes and total cache blocks to status output
Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...

Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices.  Adding <#total cache blocks>
provides useful context for how much of the cache is used.

Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status.  But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.

While updating the status documentation in cache.txt spaces were
tabify'd.

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-10 10:24:33 -05:00
Joe Thornber f164e6900f dm btree: add dm_btree_find_lowest_key
dm_btree_find_lowest_key is the reciprocal of dm_btree_find_highest_key.
Factor out common code for dm_btree_find_{highest,lowest}_key.

dm_btree_find_lowest_key is needed for an upcoming DM target, as such it
is best to get this interface in place.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-09 16:29:17 -05:00
Kent Overstreet 9dd6358a21 bcache: Fix auxiliary search trees for key size > cacheline size
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:15 -08:00
Kent Overstreet 3b3e9e50dd bcache: Don't return -EINTR when insert finished
We need to return -EINTR after a split because we invalidated iterators
(and freed the btree node) - but if we were finished inserting, we don't
want to redo the traversal.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet e0a985a4b1 bcache: Improve bucket_prio() calculation
When deciding what order to reuse buckets we take into account both the bucket's
priority (which indicates lru order) and also the amount of live data in that
bucket. The way they were scaled together wasn't as correct as it could be...
this patch improves and documents it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Nicholas Swenson 3bdad1e40d bcache: Add bch_bkey_equal_header()
Checks if two keys have equivalent header fields.
(good enough for replacement or merging)

Used in bch_bkey_try_merge, and replacing a key
in the btree.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Nicholas Swenson 0f49cf3d83 bcache: update bch_bkey_try_merge
Added generic header checks to bch_bkey_try_merge,
which then calls the bkey specific function

Removed extraneous checks from bch_extent_merge

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet 829a60b905 bcache: Move insert_fixup() to btree_keys_ops
Now handling overlapping extents/keys is a method that's specific to what the
btree node contains.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:14 -08:00
Kent Overstreet 89ebb4a28b bcache: Convert sorting to btree_keys
More work to disentangle various code from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00
Kent Overstreet dc9d98d621 bcache: Convert debug code to btree_keys
More work to disentangle various code from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08 13:05:13 -08:00