linux

Commit Graph

Author	SHA1	Message	Date
Mike Snitzer	a5664dad7e	dm ioctl: make bio or request based device type immutable Determine whether a mapped device is bio-based or request-based when loading its first (inactive) table and don't allow that to be changed later. This patch performs different device initialisation in each of the two cases. (We don't think it's necessary to add code to support changing between the two types.) Allowed md->type transitions: DM_TYPE_NONE to DM_TYPE_BIO_BASED DM_TYPE_NONE to DM_TYPE_REQUEST_BASED We now prevent table_load from replacing the inactive table with a conflicting type of table even after an explicit table_clear. Introduce 'type_lock' into the struct mapped_device to protect md->type and to prepare for the next patch that will change the queue initialization and allocate memory while md->type_lock is held. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> drivers/md/dm-ioctl.c \| 15 +++++++++++++++ drivers/md/dm.c \| 37 ++++++++++++++++++++++++++++++------- drivers/md/dm.h \| 5 +++++ include/linux/dm-ioctl.h \| 4 ++-- 4 files changed, 52 insertions(+), 9 deletions(-)	2010-08-12 04:14:01 +01:00
Mikulas Patocka	708e929513	dm: skip second flush on bio unsupported error When processing barriers, skip the second flush if processing the bio failed with -EOPNOTSUPP. This can happen with discard+barrier requests. If the device doesn't support discard, there would be two useless SYNCHRONIZE CACHE commands. The first dm_flush cannot be so easily optimized out, so we leave it there. Previously, -EOPNOTSUPP could be received in dec_pending only with empty barriers and we ignored that error, assuming the device not supporting cache flushes has cache always consistent. With the addition of discard barriers, this -EOPNOTSUPP can also be generated by discards and we must record it in md->barrier_error for process_barrier. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:00 +01:00
Tomohiro Kusumi	87c961cb74	dm snapshot: persistent use define for disk header chunk size This patch fixes hard-coded value for the size of a chunk that includes disk header for persistent snapshot. It should be changed to existing macro NUM_SNAPSHOT_HDR_CHUNKS instead of using hard-coded value 1. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:59 +01:00
Julia Lawall	a9c88f2ebc	dm crypt: use kstrdup Use kstrdup when the goal of an allocation is to copy a string into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to; expression flag,E1,E2; statement S; @@ - to = kmalloc(strlen(from) + 1,flag); + to = kstrdup(from, flag); ... when != \(from = E1 \\| to = E1 \) if (to==NULL \|\| ...) S ... when != \(from = E2 \\| to = E2 \) - strcpy(to, from); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:58 +01:00
Arnd Bergmann	402ab352c2	dm ioctl: use nonseekable_open The dm control device does not implement read/write, so it has no use for seeking. Using no_llseek prevents falling back to default_llseek, which requires the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:57 +01:00
Kiyoshi Ueda	3f77316de0	dm: separate device deletion from dm_put This patch separates the device deletion code from dm_put() to make sure the deletion happens in the process context. By this patch, device deletion always occurs in an ioctl (process) context and dm_put() can be called in interrupt context. As a result, the request-based dm's bad dm_put() usage pointed out by Mikulas below disappears. http://marc.info/?l=dm-devel&m=126699981019735&w=2 Without this patch, I confirmed there is a case to crash the system: dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt()) Some more backgrounds and details: In request-based dm, a device opener can remove a mapped_device while the last request is still completing, because bios in the last request complete first and then the device opener can close and remove the mapped_device before the last request completes: CPU0 CPU1 ================================================================= <<INTERRUPT>> blk_end_request_all(clone_rq) blk_update_request(clone_rq) bio_endio(clone_bio) == end_clone_bio blk_update_request(orig_rq) bio_endio(orig_bio) <<I/O completed>> dm_blk_close() dev_remove() dm_put(md) <<Free md>> blk_finish_request(clone_rq) .... dm_end_request(clone_rq) free_rq_clone(clone_rq) blk_end_request_all(orig_rq) rq_completed(md) So request-based dm used dm_get()/dm_put() to hold md for each I/O until its request completion handling is fully done. However, the final dm_put() can call the device deletion code which must not be run in interrupt context and may cause kernel panic. To solve the problem, this patch moves the device deletion code, dm_destroy(), to predetermined places that is actually deleting the mapped_device in ioctl (process) context, and changes dm_put() just to decrement the reference count of the mapped_device. 
By this change, dm_put() can be used in any context and the symmetric model below is introduced: dm_create(): create a mapped_device dm_destroy(): destroy a mapped_device dm_get(): increment the reference count of a mapped_device dm_put(): decrement the reference count of a mapped_device dm_destroy() waits for all references of the mapped_device to disappear, then deletes the mapped_device. dm_destroy() uses active waiting with msleep(1), since deleting the mapped_device isn't performance-critical task. And since at this point, nobody opens the mapped_device and no new reference will be taken, the pending counts are just for racing completing activity and will eventually decrease to zero. For the unlikely case of the forced module unload, dm_destroy_immediate(), which doesn't wait and forcibly deletes the mapped_device, is also introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f" may be stuck and never return. And now, because the mapped_device is deleted at this point, subsequent accesses to the mapped_device may cause NULL pointer references. Cc: stable@kernel.org Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:56 +01:00
Kiyoshi Ueda	98f332855e	dm ioctl: release _hash_lock between devices in remove_all This patch changes dm_hash_remove_all() to release _hash_lock when removing a device. After removing the device, dm_hash_remove_all() takes _hash_lock and searches the hash from scratch again. This patch is a preparation for the next patch, which changes device deletion code to wait for md reference to be 0. Without this patch, the wait in the next patch may cause AB-BA deadlock: CPU0 CPU1 ----------------------------------------------------------------------- dm_hash_remove_all() down_write(_hash_lock) table_status() md = find_device() dm_get(md) <increment md->holders> dm_get_live_or_inactive_table() dm_get_inactive_table() down_write(_hash_lock) <in the md deletion code> <wait for md->holders to be 0> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:55 +01:00
Kiyoshi Ueda	abdc568b05	dm: prevent access to md being deleted This patch prevents access to mapped_device which is being deleted. Currently, even after a mapped_device has been removed from the hash, it could be accessed through idr_find() using minor number. That could cause a race and NULL pointer reference below: CPU0 CPU1 ------------------------------------------------------------------ dev_remove(param) down_write(_hash_lock) dm_lock_for_deletion(md) spin_lock(_minor_lock) set_bit(DMF_DELETING) spin_unlock(_minor_lock) __hash_remove(hc) up_write(_hash_lock) dev_status(param) md = find_device(param) down_read(_hash_lock) __find_device_hash_cell(param) dm_get_md(param->dev) md = dm_find_md(dev) spin_lock(_minor_lock) md = idr_find(MINOR(dev)) spin_unlock(_minor_lock) dm_put(md) free_dev(md) dm_get(md) up_read(_hash_lock) __dev_status(md, param) dm_put(md) This patch fixes such problems. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:54 +01:00
Peter Rajnoha	856a6f1dbd	dm ioctl: return uevent flag after rename All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so that userspace knows whether or not to wait for a uevent to be processed before continuing. The dm rename ioctl sets this flag but was not structured to return it to userspace. This patch restructures the rename ioctl processing to behave like the other ioctls that return data and so fixes this. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:53 +01:00
Alasdair G Kergon	094ea9a071	dm ioctl: make __dev_status void __dev_status() cannot fail so make it void and simplify callers. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:52 +01:00
Peter Rajnoha	6be5449401	dm ioctl: remove __dev_status from geometry and target message Remove useless __dev_status call while processing an ioctl that sets up device geometry and target message. The data is not returned to userspace so there is no point collecting it and in the case of target_message it is collected before processing the message so if it did return it might be stale. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:52 +01:00
Mikulas Patocka	c241104506	dm snapshot: test chunk size against both origin and snapshot Validate chunk size against both origin and snapshot sector size Don't allow chunk size smaller than either origin or snapshot logical sector size. Reading or writing data not aligned to sector size is not allowed and causes immediate errors. This requires us to open the origin before initialising the exception store and to export dm_snap_origin. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:51 +01:00
Mikulas Patocka	1e5554c842	dm snapshot: iterate origin and cow devices Iterate both origin and snapshot devices iterate_devices method should call the callback for all the devices where the bio may be remapped. Thus, snapshot_iterate_devices should call the callback for both snapshot and origin underlying devices because it remaps some bios to the snapshot and some to the origin. snapshot_iterate_devices called the callback only for the origin device. This led to badly calculated device limits if snapshot and origin were placed on different types of disks. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:50 +01:00
Alasdair G Kergon	6bbf79a140	dm mpath: fix NULL pointer dereference when path parameters missing multipath_ctr() forgets to return an error after detecting missing path parameters. Fix this. Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:49 +01:00
Linus Torvalds	3d30701b58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (24 commits) md: clean up do_md_stop md: fix another deadlock with removing sysfs attributes. md: move revalidate_disk() back outside open_mutex md/raid10: fix deadlock with unaligned read during resync md/bitmap: separate out loading a bitmap from initialising the structures. md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. md/bitmap: optimise scanning of empty bitmaps. md/bitmap: clean up plugging calls. md/bitmap: reduce dependence on sysfs. md/bitmap: white space clean up and similar. md/raid5: export raid5 unplugging interface. md/plug: optionally use plugger to unplug an array during resync/recovery. md/raid5: add simple plugging infrastructure. md/raid5: export is_congested test raid5: Don't set read-ahead when there is no queue md: add support for raising dm events. md: export various start/stop interfaces md: split out md_rdev_init md: be more careful setting MD_CHANGE_CLEAN md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk ...	2010-08-10 15:38:19 -07:00
NeilBrown	fd8aa2c181	Merge git://git.infradead.org/users/dwmw2/libraid-2.6 into for-linus	2010-08-10 10:02:33 +10:00
David Woodhouse	2144381da4	Merge branch 'async' of macbook:git/btrfs-unstable Conflicts: drivers/md/Makefile lib/raid6/unroll.pl	2010-08-09 10:36:44 +01:00
NeilBrown	6e17b02764	md: clean up do_md_stop There is only one error exit from do_md_stop, so make that more explicit and discard the 'err' variable. Also drop the 'revalidate' variable by moving the unlock calls around. Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:22:45 +10:00
NeilBrown	bb4f1e9d0e	md: fix another deadlock with removing sysfs attributes. Moving the deletion of sysfs attributes from reconfig_mutex to open_mutex didn't really help, as a process can try to take open_mutex while holding reconfig_mutex, so the same deadlock can happen, just requiring one more process to be involved in the chain. It looks like I cannot easily use locking to wait for the sysfs deletion to complete, so don't. The only things that we cannot do while the deletions are still pending are other things which can change the sysfs namespace: run, takeover, stop. Each of these can fail with -EBUSY. So set a flag while doing a sysfs deletion, and fail run, takeover, stop if that flag is set. This is suitable for 2.6.35.x Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:21:27 +10:00
Dan Williams	147e0b6a63	md: move revalidate_disk() back outside open_mutex Commit `b821eaa5` "md: remove ->changed and related code" moved revalidate_disk() under open_mutex, and lockdep noticed. [ INFO: possible circular locking dependency detected ] 2.6.32-mdadm-locking #1 ------------------------------------------------------- mdadm/3640 is trying to acquire lock: (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90 but task is already holding lock: (&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod] which lock already depends on the new lock. It is suitable for 2.6.35.x Cc: <stable@kernel.org> Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:20:17 +10:00
Arnd Bergmann	6e9624b8ca	block: push down BKL into .open and .release The open and release block_device_operations are currently called with the BKL held. In order to change that, we must first make sure that all drivers that currently rely on this have no regressions. This blindly pushes the BKL into all .open and .release operations for all block drivers to prepare for the next step. The drivers can subsequently replace the BKL with their own locks or remove it completely when it can be shown that it is not needed. The functions blkdev_get and blkdev_put are the only remaining users of the big kernel lock in the block layer, besides a few uses in the ioctl code, none of which need to serialize with blkdev_{get,put}. Most of these two functions is also under the protection of bdev->bd_mutex, including the actual calls to ->open and ->release, and the common code does not access any global data structures that need the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:25:34 +02:00
FUJITA Tomonori	00fff26539	block: remove q->prepare_flush_fn completely This removes q->prepare_flush_fn completely (changes the blk_queue_ordered API). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:24:15 +02:00
FUJITA Tomonori	144d6ed551	dm: stop using q->prepare_flush_fn use REQ_FLUSH flag instead. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:24:14 +02:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This makes it easier to trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superfluous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
Christoph Hellwig	33659ebbae	block: remove wrappers for request type/flags Remove all the trivial wrappers for the cmd_type and cmd_flags fields in struct requests. This allows much easier grepping for different request types instead of unwinding through macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:17:56 +02:00
NeilBrown	51e9ac7703	md/raid10: fix deadlock with unaligned read during resync If the 'bio_split' path in raid10-read is used while resync/recovery is happening it is possible to deadlock. Fix this by elevating ->nr_waiting for the duration of both parts of the split request. This fixes a bug that has been present since 2.6.22 but has only started manifesting recently for unknown reasons. It is suitable for any -stable since then. Reported-by: Justin Bronder <jsbronder@gentoo.org> Tested-by: Justin Bronder <jsbronder@gentoo.org> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-08-07 21:17:00 +10:00
NeilBrown	69e51b449d	md/bitmap: separate out loading a bitmap from initialising the structures. dm makes this distinction between ->ctr and ->resume, so we need to too. Also get the new bitmap_load to clear out the bitmap first, as this is most consistent with the dm suspend/resume approach Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:34 +10:00
NeilBrown	e384e58549	md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. This allows md/raid5 to fully work as a dm target. Normally md uses a 'filemap' which contains a list of pages of bits each of which may be written separately. dm-log uses an all-or-nothing approach to writing the log, so when using a dm-log, ->filemap is NULL and the flags normally stored in filemap_attr are stored in ->logattrs instead. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:34 +10:00
NeilBrown	ef42567335	md/bitmap: optimise scanning of empty bitmaps. A bitmap is stored as one page per 2048 bits. If none of the bits are set, the page is not allocated. When bitmap_get_counter finds that a page isn't allocated, it just reports that one bit's worth of space isn't flagged, rather than reporting that 2048 bits' worth of space are unflagged. This can cause searches for flagged bits (e.g. bitmap_close_sync) to do more work than is really necessary. So change bitmap_get_counter (when creating) to report a number of blocks that more accurately reports the range of the device for which no counter currently exists. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:32 +10:00
NeilBrown	b63d7c2e29	md/bitmap: clean up plugging calls. 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under arrays with no queue attached. 2/ Don't bother plugging the queue when we set a bit in the bitmap. The reason for this was to encourage as many bits as possible to get set before we unplug and write stuff out. However every personality already plugs the queue after bitmap_startwrite either directly (raid1/raid10) or by setting STRIPE_BIT_DELAY which causes the queue to be plugged later (raid5). Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:32 +10:00
NeilBrown	5ff5afffe6	md/bitmap: reduce dependence on sysfs. For dm-raid45 we will want to use bitmaps in dm-targets which don't have entries in sysfs, so cope with the mddev not living in sysfs. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:31 +10:00
NeilBrown	ac2f40be46	md/bitmap: white space clean up and similar. Fixes some whitespace problems Fixed some checkpatch.pl complaints. Replaced kmalloc ... memset(0), with kzalloc Fixed an unlikely memory leak on an error path. Reformatted a number of 'if/else' sets, sometimes replacing goto with an else clause. Removed some old comments and commented-out code. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:07:22 +10:00
NeilBrown	9f7c222001	md/raid5: export raid5 unplugging interface. Also remove remaining accesses to ->queue and ->gendisk when ->queue is NULL (As it is in a DM target). Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:10 +10:00
NeilBrown	252ac5221a	md/plug: optionally use plugger to unplug an array during resync/recovery. If an array doesn't have a 'queue' then md_do_sync cannot unplug it. In that case it will have a 'plugger', so make that available to the mddev, and use it to unplug the array if needed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	2ac8740151	md/raid5: add simple plugging infrastructure. md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	11d8a6e371	md/raid5: export is_congested test the dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:29 +10:00
NeilBrown	4a5add4995	raid5: Don't set read-ahead when there is no queue dm-raid456 does not provide a 'queue' for raid5 to use, so we must make raid5 stop depending on the queue. First: read_ahead dm handles read-ahead adjustment fully in userspace, so simply don't do any readahead adjustments if there is no queue. Also re-arrange code slightly so all the accesses to ->queue are together. Finally, move the blk_queue_merge_bvec function into the 'if' as the ->split_io setting in dm-raid456 has the same effect. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	768a418db1	md: add support for raising dm events. dm uses scheduled work to raise events to user-space. So allow md device to have work_structs and schedule them on an error. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	390ee602a1	md: export various start/stop interfaces export entry points for starting and stopping md arrays. This will be used by a module to make md/raid5 work under dm. Also stop calling md_stop_writes from md_stop, as that won't work well with dm - it will want to call the two separately. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	e8bb9a839a	md: split out md_rdev_init This functionality will be needed separately in a subsequent patch, so split it into its own exported function. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	676e42d896	md: be more careful setting MD_CHANGE_CLEAN When MD_CHANGE_CLEAN is set we might block in md_write_start. So we should only set it when fairly sure that something will clear it. There are two places where it is set so as to encourage a metadata update to record the progress of resync/recovery. This should only be done if the internal metadata update mechanisms are in use, which can be tested by inspecting '->persistent'. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	f4be6b43f1	md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, use the address of the mddev formatted in HEX to provide a unique name. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:26 +10:00
NeilBrown	c41d4ac40d	md/raid5: factor out code for changing size of stripe cache. Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-21 13:28:15 +10:00
NeilBrown	00bcb4ac7e	md: reduce dependence on sysfs. We will want md devices to live as dm targets where sysfs is not visible. So allow md to not connect to sysfs. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-21 13:27:53 +10:00
NeilBrown	3424bf6a77	md/raid5: don't include 'spare' drives when reshaping to fewer devices. There are few situations where it would make any sense to add a spare when reducing the number of devices in an array, but it is conceivable: A 6 drive RAID6 with two missing devices could be reshaped to a 5 drive RAID6, and a spare could become available just in time for the reshape, but not early enough to have been recovered first. 'freezing' recovery can make this easy to do without any races. However doing such a thing is a bad idea. md will not record the partially-recovered state of the 'spare' and when the reshape finished it will think that the spare is still spare. Easiest way to avoid this confusion is to simply disallow it. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:36:04 +10:00
NeilBrown	2f11588249	md/raid5: add a missing 'continue' in a loop. As the comment says, the tail of this loop only applies to devices that are not fully in sync, so if In_sync was set, we should avoid the rest of the loop. This bug will hardly ever cause an actual problem. The worst it can do is allow an array to be assembled that is dirty and degraded, which is not generally a good idea (without warning the sysadmin first). This will only happen if the array is RAID4 or a RAID5/6 in an intermediate state during a reshape and so has one drive that is all 'parity' - no data - while some other device has failed. This is certainly possible, but not at all common. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:49 +10:00
NeilBrown	415e72d034	md/raid5: Allow recovered part of partially recovered devices to be in-sync During a recovery or reshape the early part of some devices might be in-sync while the later parts are not. When we know we are looking at an early part it is good to treat that part as in-sync for stripe calculations. This is particularly important for a reshape which suffers device failure. Treating the data as in-sync can mean the difference between data-safety and data-loss. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:39 +10:00
NeilBrown	674806d62f	md/raid5: More careful check for "has array failed". When we are reshaping an array, the device failure combinations that cause us to decide that the array has failed are more subtle. In particular, any 'spare' will be fully in-sync in the section of the array that has already been reshaped, thus failures that affect only that section are less critical. So encode this subtlety in a new function and call it as appropriate. The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6 conversion where the last two devices failed. This resulted in: good good good good incomplete good good failed failed while converting a 5-drive RAID6 to 8 drive RAID5 The incomplete device causes the whole array to look bad, but as it was actually good for the section that had been converted to 8-drives, all the data was actually safe. Reported-by: Terry Morris <tbmorris@tbmorris.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:27 +10:00
NeilBrown	70fffd0bfa	md: Don't update ->recovery_offset when reshaping an array to fewer devices. When an array is reshaped to have fewer devices, the reshape proceeds from the end of the devices to the beginning. If a device happens to be non-In_sync (which is possible but rare) we would normally update the ->recovery_offset as the reshape progresses. However that would be wrong as the recovery_offset records that the early part of the device is in_sync, while in fact it would only be the later part that is in_sync, and in any case the offset number would be measured from the wrong end of the device. Relatedly, if after a reshape a spare is discovered to not be recovered all the way to the end, do not allow spare_active to incorporate it in the array. This becomes relevant in the following sample scenario: A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined operation. The RAID5->RAID6 conversion will cause a 5th drive to be included as a spare, then the 5drive -> 6drive reshape will effectively rebuild that spare as it progresses. The 6th drive is treated as in_sync the whole time as there is never any case that we might consider reading from it, but must not because there is no valid data. If we interrupt this reshape part-way through and reverse it to return to a 5-drive RAID6 (or even a 4-drive RAID5), we don't want to update the recovery_offset - as that would be wrong - and we don't want to include that spare as active in the 5-drive RAID6 when the reversed reshape completes, as it will be mostly out-of-sync still. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:18 +10:00
NeilBrown	e4e11e385d	md/raid5: avoid oops when number of devices is reduced then increased. The entries in the stripe_cache maintained by raid5 are enlarged when we increase the number of devices in the array, but not shrunk when we reduce the number of devices. So if entries are added after reducing the number of devices, we must ensure to initialise the whole entry, not just the part that is currently relevant. Otherwise if we enlarge the array again, we will reference uninitialised values. As grow_buffers/shrink_buffer now want to use a count that is stored explicitly in the raid_conf, they should get it from there rather than being passed it as a parameter. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:02 +10:00
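
Illustrative sketch for commit 3f77316de0 ("dm: separate device deletion from dm_put"): the message describes a symmetric model in which dm_get()/dm_put() only adjust a reference count, and dm_destroy() waits for the count to reach zero before deleting the device. The userspace C below is a minimal model of that idea, not the kernel code. Every name in it (toy_md, toy_create, toy_get, toy_put, toy_holders, toy_destroy) is hypothetical, and it assumes the count starts at zero and that a 1 ms nanosleep() poll stands in for the kernel's msleep(1).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for struct mapped_device; not the kernel type. */
struct toy_md {
    atomic_int holders; /* outstanding references, modelled on md->holders */
};

/* Create an object with no outstanding references (an assumption of this sketch). */
struct toy_md *toy_create(void)
{
    struct toy_md *md = malloc(sizeof(*md));
    if (md)
        atomic_init(&md->holders, 0);
    return md;
}

/* Take a reference; safe from any context because it never frees. */
void toy_get(struct toy_md *md)
{
    atomic_fetch_add(&md->holders, 1);
}

/* Drop a reference; only decrements, so it never runs deletion code. */
void toy_put(struct toy_md *md)
{
    int old = atomic_fetch_sub(&md->holders, 1);
    assert(old > 0); /* a put without a matching get is a bug */
}

/* Read the current reference count. */
int toy_holders(struct toy_md *md)
{
    return atomic_load(&md->holders);
}

/* Wait until every reference is gone, then free the object.
 * Polling is acceptable here because deletion is not performance-critical. */
void toy_destroy(struct toy_md *md)
{
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };

    while (atomic_load(&md->holders) > 0)
        nanosleep(&one_ms, NULL);
    free(md);
}
```

In this model only toy_destroy() frees memory, so a completion path that merely calls toy_put() can never trigger deletion, which is the property the commit message is after for interrupt context.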


1722 Commits