Commit Graph

467 Commits

Author SHA1 Message Date
NeilBrown 80268ee927 md: Don't try to set an array to 'read-auto' if it is already in that state.
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.

Calling do_md_stop to set the array readonly when it is already readonly
returns an error.  So make sure not to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
NeilBrown ea43ddd849 md: Allow metadata_version to be updated for externally managed metadata.
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.

Normally changes to metadata_version are not permitted while the array
is active.  Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:11 +11:00
Chris Webb 7d3c6f8717 md: Fix rdev_size_store with size == 0
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.

This fixes a regression introduced by
 commit d7027458d6

Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:11 +11:00
Tejun Heo 074a7aca7a block: move stats from disk to part0
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
  is not part0.  ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similary.  It handles part0 stats
  automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_fligh() is implemented which automatically updates
  part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
  Combined with the above changes, this makes NULL special case
  handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
  stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00
Tejun Heo 0762b8bde9 block: always set bdev->bd_part
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0.  This patch makes bdev->bd_part always set so that code
paths don't have to differenciate common handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00
Tejun Heo ed9e198234 block: implement and use {disk|part}_to_dev()
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev.  To make sure no
user is left behind, rename generic devices fields to __dev.

This is in preparation of unifying partition 0 handling with other
partitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:07 +02:00
NeilBrown 9744197c3d md: Don't wait UNINTERRUPTIBLE for other resync to finish
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.

This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.

So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-09-19 11:49:54 +10:00
NeilBrown 271f5a9b8f Remove invalidate_partition call from do_md_stop.
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).

However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that updates requires taking the mddev
lock - which is already held.

This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.

All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev.  Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.

So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.

So remove it.

This patch possibly addresses the regression recored in
   http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
   http://bugzilla.kernel.org/show_bug.cgi?id=11452

though it isn't yet clear how it ever worked.


Signed-off-by: NeilBrown <neilb@suse.de>
2008-09-01 12:32:52 +10:00
Dan Williams 56ac36d722 md: cancel check/repair requests when recovery is needed
If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening.  Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-08-07 10:02:47 -07:00
NeilBrown c89a8eee61 Allow faulty devices to be removed from a readonly array.
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device.  Then it can be removed
by a request from user space.

The first step is currently not performed for read-only arrays,
so the second step can never succeed.

So allow readonly arrays to remove failed devices (which aren't
blocked).

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown dba034eef2 Fail safely when trying to grow an array with a write-intent bitmap.
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.

For now, simply reject any request to change the size of an array
which has a bitmap.  mdadm can remove the bitmap and add a new one
after the array has changed size.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown 2b25000bf5 Restore force switch of md array to readonly at reboot time.
A recent patch allowed do_md_stop to know whether it was being called
via an ioctl or not, and thus where to allow for an extra open file
descriptor when checking if it is in use.
This broke then switch to readonly performed by the shutdown notifier,
which needs to work even when the array is still (apparently) active
(as md doesn't get told when the filesystem becomes readonly).

So restore this feature by pretending that there can be lots of
file descriptors open, but we still want do_md_stop to switch to
readonly.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:31 +10:00
NeilBrown 19052c0e85 Make writes to md/safe_mode_delay immediately effective.
If we reduce the 'safe_mode_delay', it could still wait for the old
delay to completely expire before doing anything about safe_mode.
Thus the effect if the change is delayed.

To make the effect more immediate, run the timeout function
immediately if the delay was reduced.  This may cause it to run
slightly earlier that required, but that is the safer option.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:31 +10:00
Dan Williams e542713529 md: do not count blocked devices as spares
remove_and_add_spares() assumes that failed devices have been hot-removed
from the array.  Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-28 17:52:44 -07:00
Dan Williams d8e64406a0 md: delay notification of 'active_idle' to the recovery thread
sysfs_notify might sleep, so do not call it from md_safemode_timeout.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23 13:09:48 -07:00
NeilBrown 4b80991c6c md: Protect access to mddev->disks list using RCU
All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock.  However there are three places where
the list is walked without any locking.  If a reconfig happens at this
time, havoc (and oops) can ensue.

So use RCU to protect these accesses:
  - wrap them in rcu_read_{,un}lock()
  - use list_for_each_entry_rcu
  - add to the list with list_add_rcu
  - delete from the list with list_del_rcu
  - delay the 'free' with call_rcu rather than schedule_work

Note that export_rdev did a list_del_init on this list.  In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe.  It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
  - after unbind_rdev_from_array, in which case the delete has
     already been done,
  - after bind_rdev_to_array fails, in which case the delete isn't needed.
  - before the device has been put on a list at all (e.g. in
      add_new_disk where reading the superblock fails).
  - and in autorun devices after a failure when the device is on a
      different list.

So remove the list_del_init call from export_rdev, and add it back
immediately before the called to export_rdev for that last case.

Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates).  In these cases rcu is not needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:25 +10:00
NeilBrown f2ea68cf42 md: only count actual openers as access which prevent a 'stop'
Open isn't the only thing that increments ->active.  e.g. reading
/proc/mdstat will increment it briefly.  So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device it open.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:25 +10:00
Andre Noll f233ea5c9e md: Make mddev->array_size sector-based.
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:22 +10:00
Andre Noll 15f4a5fdf3 md: Make super_type->rdev_size_change() take sector-based sizes.
Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:42:12 +10:00
Andre Noll d07bd3bcc4 md: Fix check for overlapping devices.
The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.

This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:42:07 +10:00
Neil Brown d7027458d6 md: Tidy up rdev_size_store a bit:
- used strict_strtoull in place of simple_strtoull
 - use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
 - don't adjust mddev->size to fit, rather reject changes which make
   rdev->size smaller than mddev->size

Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing.  But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that.  The previous way invites confusion.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:22:18 +10:00
Andre Noll 0f420358e3 md: Turn rdev->sb_offset into a sector-based quantity.
Rename it to sb_start to make sure all users have been converted.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll b73df2d3d6 md: Make calc_dev_sboffset() return a sector count.
As BLOCK_SIZE_BITS is 10 and

	MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),

the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll e7debaa495 md: Replace calc_dev_size() by calc_num_sectors().
Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll d71f9f88d7 md: Make update_size() take the number of sectors.
Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.

This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:22 +10:00
Neil Brown df5b20cf68 md: Better control of when do_md_stop is allowed to stop the array.
do_md_stop check the number of active users before allowing the array
to be stopped.
Two problems:
  1/ it assumes the request is coming through an open file descriptor
     (via ioctl) so it allows for that.  This is not always the case.
  2/ it doesn't do the check it the array hasn't been activated.
     This is not good for cases when we use an inactive array to hold
     some devices in a container.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:22 +10:00
Andre Noll 26ef379f53 md: get_disk_info(): Don't convert between signed and unsigned and back.
The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:21 +10:00
Andre Noll 80fab1d77b md: Simplify restart_array().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:21 +10:00
Andre Noll ebc2433728 md: alloc_disk_sb(): Return proper error value.
If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Andre Noll ce0c8e05f8 md: Simplify sb_equal().
The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Andre Noll 05710466c9 md: Simplify uuid_equal().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Andre Noll 35020f1a06 md: sb_equal(): Fix misleading printk.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:53:20 +10:00
Andre Noll 7f6ce76928 md: Fix a typo in the comment to cmd_match().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:53:00 +10:00
Andre Noll 910d8cb3f4 md: Fix typo in array_state comment.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:45 +10:00
Andre Noll 9687a60c78 md: sync_speed_show(): Trivial cleanups.
- Remove superfluous parentheses.
- Make format string match the type of the variable that is printed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:26 +10:00
Andre Noll 13e53df354 md: do_md_run(): Fix misleading error message.
In case pers->run() succeeds but creating the bitmap fails, we
print an error message stating that pers->run() has failed.

Print this message only if pers->run() really failed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:15 +10:00
Andre Noll 2f9618ce63 md: md_getgeo(): Move comment to proper position.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:00 +10:00
Andre Noll bb57fc64b2 md: md_ioctl(): Fix misleading indentation.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:51:29 +10:00
Dan Williams b5470dc5fc md: resolve external metadata handling deadlock in md_allow_write
md_allow_write() marks the metadata dirty while holding mddev->lock and then
waits for the write to complete.  For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.

Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN.  The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.

md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.

[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-06-30 17:18:19 -07:00
Chris Webb 0cd17fec98 Support changing rdev size on running arrays.
From: Chris Webb <chris@arachsys.com>

Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.

Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:46 +10:00
Neil Brown 526647320e Make sure all changes to md/dev-XX/state are notified
The important state change happens during an interrupt
in md_error.  So just set a flag there and call sysfs_notify
later in process context.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:44 +10:00
Neil Brown a99ac97113 Make sure all changes to md/degraded are notified.
When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:43 +10:00
Neil Brown 72a23c211e Make sure all changes to md/sync_action are notified.
When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.

To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flags - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:41 +10:00
Neil Brown 0fd62b861e Make sure all changes to md/array_state are notified.
Changes in md/array_state could be of interest to a monitoring
program.  So make sure all changes trigger a notification.

Exceptions:
   changing active_idle to active is not reported because it
      is frequent and not interesting.
   changing active to active_idle is only reported on arrays
      with externally managed metadata, as it is not interesting
      otherwise.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:36 +10:00
Neil Brown c7d0c941ae Don't reject HOT_REMOVE_DISK request for an array that is not yet started.
There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
it not actually active.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:34 +10:00
Neil Brown 199050ea1f rationalise return value for ->hot_add_disk method.
For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:33 +10:00
Neil Brown 6c2fce2ef6 Support adding a spare to a live md array with external metadata.
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:31 +10:00
Neil Brown 8ed0a5216a Enable setting of 'offset' and 'size' of a hot-added spare.
offset_store and rdev_size_store allow control of the region of a
device which is to be using in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery.  So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:29 +10:00
Neil Brown 1a0fd49773 Don't try to make md arrays dirty if that is not meaningful.
Arrays personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.

Such arrays types are detected by ->sync_request being NULL.  If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:27 +10:00
Neil Brown f48ed53838 Close race in md_probe
There is a possible race in md_probe.  If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.

So extend the range of protection of disks_mutex slightly to
avoid this possibility.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:26 +10:00
Neil Brown 5e96ee65c8 Allow setting start point for requested check/repair
This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:24 +10:00
Neil Brown 9bbbca3a0e Fix error paths if md_probe fails.
md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it alway returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded.  This means checking
that mdev->gendisk is non-NULL.

cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:17 +10:00
Dan Williams a6d8113a98 md: fix uninitialized use of mddev->recovery_wait
If an array was created with --assume-clean we will oops when trying to
set ->resync_max.

Fix this by initializing ->recovery_wait in mddev_find.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:08 -07:00
NeilBrown dfc7064500 md: restart recovery cleanly after device failure.
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
  - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not be be checkpointed.
  - We remove spares and then re-added them which loses important state
    information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed.  If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error.  So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded).  Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue:  If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available,  do we want to:
 1/ complete the current rebuild, then go back and rebuild the new spare or
 2/ restart the rebuild from the start and rebuild both devices in
    parallel.

Both options can be argued for.  The code currently takes option 2 as
  a/ this requires least code change
  b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Bernd Schubert 90b08710e4 md: allow parallel resync of md-devices.
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed.  In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
 So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.

Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Dan Williams 4f54b0e948 md: notify userspace on 'stop' events
This additional notification to 'array_state' is needed to allow the
monitor application to learn about stop events via sysfs.  The
sysfs_notify("sync_action") call that comes at the end of do_md_stop()
(via md_new_event) is insufficient since the 'sync_action' attribute has
been removed by this point.

(Seems like a sysfs-notify-on-removal patch is a better fix.  Currently
removal updates the event count but does not wake up waiters)

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
NeilBrown 09a44cc150 md: notify userspace on 'write-pending' changes to array_state
When an array enters write pending, 'array_state' changes, so we must be
sure to sysfs_notify.

Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
be cleared as that might not happen.  So explicity test for the bits that
we are really interested in.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Christoph Hellwig 6bcfd60186 md: kill file_path wrapper
Kill the trivial and rather pointless file_path wrapper around d_path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Dan Williams 6bfe0b4990 md: support blocking writes to an array on device failure
Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
 userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:33 -07:00
Dan Williams 11e2ede022 md: prevent duplicates in bind_rdev_to_array
Found when trying to reassemble an active externally managed array.  Without
this check we hit the more noisy "sysfs duplicate" warning in the later call
to kobject_add.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:33 -07:00
Dan Williams 242b363e22 md: remove a stray command from a copy and paste error in resync_start_store
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:33 -07:00
NeilBrown 648b629ed4 md: fix up switching md arrays between read-only and read-write
When setting an array to 'readonly' or to 'active' via sysfs, we must make the
appropriate set_disk_ro call too.

Also when switching to "read_auto" (which is like readonly, but blocks on the
first write so that metadata can be marked 'dirty') we need to be more careful
about what state we are changing from.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:32 -07:00
NeilBrown 31a59e3425 md: fix 'safemode' handling for external metadata.
'safemode' relates to marking an array as 'clean' if there has been no write
traffic for a while (a couple of seconds), to reduce the chance of the array
being found dirty on reboot.

->safemode is set to '1' when there have been no write for a while, and it
gets set to '0' when the superblock is updates with the 'clean' flag set.

This requires a few fixes for 'external' metadata:
 - When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
 - when we write to an array that has 'safemode' set (there must have been
        some delay in updating the metadata), we need to clear safemode.
 - Don't try to update external metadata in md_check_recovery for safemode
        transitions - it won't work.

Also, don't try to support "immediate safe mode" (safemode==2) for external
metadata, it cannot really work (the safemode timeout can be set very low if
this is really needed).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:32 -07:00
NeilBrown d897dbf914 md: reinitialise more mddev fields in do_md_stop.
I keep finding problems where an mddev gets reused and some fields has a value
from a previous usage that confuses the new usage.  So clear all fields that
could possible need clearing when calling do_md_stop.

Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:32 -07:00
NeilBrown 8377bc8080 md: skip all metadata update processing when using external metadata.
All the metadata update processing for external metadata is on in user-space
or through the sysfs interfaces, so make "md_update_sb" a no-op in that case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:32 -07:00
Dan Williams 6a51830e14 md: fix use after free when removing rdev via sysfs
rdev->mddev is no longer valid upon return from entry->store() when the
'remove' command is given.

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:32 -07:00
Linus Torvalds bd5d435a96 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: Skip I/O merges when disabled
  block: add large command support
  block: replace sizeof(rq->cmd) with BLK_MAX_CDB
  ide: use blk_rq_init() to initialize the request
  block: use blk_rq_init() to initialize the request
  block: rename and export rq_init()
  block: no need to initialize rq->cmd with blk_get_request
  block: no need to initialize rq->cmd in prepare_flush_fn hook
  block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
  block/elevator.c:elv_rq_merge_ok() mustn't be inline
  block: make queue flags non-atomic
  block: add dma alignment and padding support to blk_rq_map_kern
  unexport blk_max_pfn
  ps3disk: Remove superfluous cast
  block: make rq_init() do a full memset()
  relay: fix splice problem
2008-04-29 08:18:03 -07:00
Denis V. Lunev c7705f3449 drivers: use non-racy method for proc entries creation (2)
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:22 -07:00
Nick Piggin 75ad23bc0f block: make queue flags non-atomic
We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:33 +02:00
Harvey Harrison 9a7b2b0f36 md: fix integer as NULL pointer warnings in md.c
drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer

Add some braces to match the else-block as well.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:42 -07:00
Nick Andrew fdefa4d87e RAID: remove trailing space from printk line
drivers/md/*.[ch] contains only one more printk line with a trailing space.
Remove it.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
2008-04-21 22:42:58 +00:00
NeilBrown 0e82989d95 md: remove the 'super' sysfs attribute from devices in an 'md' array
Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).

(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)

So just remove it altogether.  It isn't really useful.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19 18:53:35 -07:00
NeilBrown 52720ae77d md: fix formatting error in /proc/mdstat
If an md array is "auto-read-only", then this appears in /proc/mdstat as

   /dev/md0: active(auto-read-only)

whereas if it is truely readonly, it appears as

   /dev/md0: active (read-only)

The difference being a space.

One program known to parse this file expects the space and gets badly
confused.  It will be fixed, but it would be best if what the kernel generates
is more consistent too.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:19 -07:00
NeilBrown 27c529bb8e md: lock access to rdev attributes properly
When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes.  We
currently do that when we change an attribute, but not when we read an
attribute.  We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions.  We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain.  So take a copy of rdev->mddev for use at the end of the
function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown 2515619823 md: make sure a reshape is started when device switches to read-write
A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only.  So whenever it mark it not read-only, it is important to
wake up thread resync thread.  There is one place we didn't do this.

The problem manifests if the start_ro module parameters is set, and a raid5
array that is in the middle of a reshape (restripe) is started.  The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write).  So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown d0fae18f1b md: clean up irregularity with raid autodetect
When a raid1 array is stopped, all components currently get added to the list
for auto-detection.  However we should really only add components that were
found by autodetection in the first place.  So add a flag to record that
information, and use it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown a1801f858e md: guard against possible bad array geometry in v1 metadata
Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:17 -08:00
Jan Blunck c32c2f63a9 d_path: Make seq_path() use a struct path argument
seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:08 -08:00
NeilBrown 73c34431c7 md: change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.
Finish ITERATE_ to for_each conversion.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown d089c6af10 md: change ITERATE_RDEV to rdev_for_each
As this is more in line with common practice in the kernel.  Also swap the
args around to be more like list_for_each.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown 29ac4aa3fc md: change INTERATE_MDDEV to for_each_mddev
As this is more consistent with kernel style.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown 20a49ff679 md: change a few 'int' to 'size_t' in md
As suggested by Andrew Morton.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown 177a99b23e md: fix use-after-free bug when dropping an rdev from an md array
Due to possible deadlock issues we need to use a schedule work to kobject_del
an 'rdev' object from a different thread.

A recent change means that kobject_add no longer gets a refernce, and
kobject_del doesn't put a reference.  Consequently, we need to explicitly hold
a reference to ensure that the last reference isn't dropped before the
scheduled work get a chance to call kobject_del.

Also, rename delayed_delete to md_delayed_delete to that it is more obvious in
a stack trace which code is to blame.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown a17184a911 md: allow an md array to appear with 0 drives if it has external metadata
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:19 -08:00
NeilBrown ca38805945 md: lock address when changing attributes of component devices
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
NeilBrown c5d79adba7 md: allow devices to be shared between md arrays
Currently, a given device is "claimed" by a particular array so that it cannot
be used by other arrays.

This is not ideal for DDF and other metadata schemes which have their own
partitioning concept.

So for externally managed metadata, just claim the device for md in general,
require that "offset" and "size" are set properly for each device, and make
sure that if a device is included in different arrays then the active sections
do not overlap.

This involves adding another flag to the rdev which makes it awkward to set
"->flags = 0" to clear certain flags.  So now clear flags explicitly by name
when we want to clear things.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
NeilBrown 1ec4a9398d md: set and test the ->persistent flag for md devices more consistently
If you try to start an array for which the number of raid disks is listed as
zero, md will currently try to read metadata off any devices that have been
given.  This was done because the value of raid_disks is used to signal
whether array details have been provided by userspace (raid_disks > 0) or must
be read from the devices (raid_disks == 0).

However for an array without persistent metadata (or with externally managed
metadata) this is the wrong thing to do.  So we add a test in do_md_run to
give an error if raid_disks is zero for non-persistent arrays.

This requires that mddev->persistent is set corrently at this point, which it
currently isn't for in-kernel autodetected arrays.

So set ->persistent for autodetect arrays, and remove the settign in
super_*_validate which is now redundant.

Also clear ->persistent when stopping an array so it is consistently zero when
starting an array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
NeilBrown c620727779 md: allow a maximum extent to be set for resyncing
This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
NeilBrown c303da6d71 md: give userspace control over removing failed devices when external metdata in use
When a device fails, we must not allow an further writes to the array until
the device failure has been recorded in array metadata.  When metadata is
managed externally, this requires some synchronisation...

Allow/require userspace to explicitly remove failed devices from active
service in the array by writing 'none' to the 'slot' attribute.  If this
reduces the number of failed devices to 0, the write block will automatically
be lowered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
NeilBrown e691063a61 md: support 'external' metadata for md arrays
- Add a state flag 'external' to indicate that the metadata is managed
  externally (by user-space) so important changes need to be
  left of user-space to handle.
  Alternates are non-persistant ('none') where there is no stable metadata -
  after the  array is stopped there is no record of it's status - and
  internal which can be version 0.90 or version 1.x
  These are selected by writing to the 'metadata' attribute.

- move the updating of superblocks (sync_sbs) to after we have checked if
  there are any superblocks or not.

- New array state 'write_pending'.  This means that the metadata records
  the array as 'clean', but a write has been requested, so the metadata has
  to be updated to record a 'dirty' array before the write can continue.
  This change is reported to md by writing 'active' to the array_state
  attribute.

- tidy up marking of sb_dirty:
   - don't set sb_dirty when resync finishes as md_check_recovery
     calls md_update_sb when the sync thread finishes anyway.
   - Don't set sb_dirty in multipath_run as the array might not be dirty.
   - don't mark superblock dirty when switching to 'clean' if there
     is no internal superblock (if external, userspace can choose to
     update the superblock whenever it chooses to).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:18 -08:00
Greg Kroah-Hartman c10997f657 Kobject: convert drivers/* from kobject_unregister() to kobject_put()
There is no need for kobject_unregister() anymore, thanks to Kay's
kobject cleanup changes, so replace all instances of it with
kobject_put().


Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:40 -08:00
Greg Kroah-Hartman f9cb074bff Kobject: rename kobject_init_ng() to kobject_init()
Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman b2d6db5878 Kobject: rename kobject_add_ng() to kobject_add()
Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman 649316b25b Kobject: convert drivers/md/md.c to use kobject_init/add_ng()
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:37 -08:00
Kay Sievers edfaa7c365 Driver core: convert block from raw kobjects to core devices
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.

  /sys/class/block
  |-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  |-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
  |-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
  |-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
  |-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
  |-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
  |-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
  |-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
  `-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

  /sys/block/
  |-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  `-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman 3830c62fef Kobject: change drivers/md/md.c to use kobject_init_and_add
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.

Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:29 -08:00
Alan D. Brunelle 2ad8b1ef11 Add UNPLUG traces to all appropriate places
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-09 13:41:32 +01:00
Pavel Emelyanov ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Iustin Pop d7f3d291a0 md: expose the degraded status of an assembled array through sysfs
The 'degraded' attribute is useful to quickly determine if the array is
degraded, instead of parsing 'mdadm -D' output or relying on the other
techniques (number of working devices against number of defined devices,
etc.).  The md code already keeps track of this attribute, so it's useful to
export it.

Signed-off-by: Iustin Pop <iusty@k1024.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
NeilBrown 2b12ab6d33 md: 'sync_action' in sysfs returns wrong value for readonly arrays
When an array is started read-only, MD_RECOVERY_NEEDED can be set but no
recovery will be running.  This causes 'sync_action' to report the wrong
value.

We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a
small gap after requesting a sync action, where 'sync_action' would still
report the old value.

So make sure that for a read-only array, 'sync_action' always returns 'idle'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Michael J. Evans 4d936ec1fd md: software Raid autodetect dev list not array
In current release kernels the md module (Software RAID) uses a static
array (dev_t[128]) to store partition/device info temporarily for
autostart.

I discovered this (and that the devices are added as disks/partitions are
discovered at boot) while I was debugging why only one of my MD arrays would
come up whole, while all the others were short a disk.

I eventually discovered that it was enumerating through all of 9 of my 11 hds
(2 had only 4 partitions apiece) while the other 9 have 15 partitions (I
wanted 64 per drive...).  The last partition of the 8th drive in my 9 drive
raid 5 sets wasn't added, thus making the final md array short both a parity
and data disk, and it was started later, elsewhere.

This patch replaces that static array with a list.

[akpm@linux-foundation.org: removed unused var]
Signed-off-by: Michael J. Evans <mjevans1983@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Jens Axboe fd5d806266 block: convert blkdev_issue_flush() to use empty barriers
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:02 +02:00
Greg Kroah-Hartman 19c38de88a kobjects: fix up improper use of the kobject name field
A number of different drivers incorrect access the kobject name field
directly.  This is not correct as the name might not be in the array.
Use the proper accessor function instead.
2007-10-12 14:51:02 -07:00
NeilBrown 6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Jens Axboe 165125e1e4 [BLOCK] Get rid of request_queue_t typedef
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-24 09:28:11 +02:00
NeilBrown 4ad1366376 md: change bitmap_unplug and others to void functions
bitmap_unplug only ever returns 0, so it may as well be void.  Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.

write_page returns an error which is not consistently checked.  It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.

When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.

bitmap_update_sb returns an error that is never checked.

So make these 'void' and be consistent about checking the bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown f0d76d70bc md: check that internal bitmap does not overlap other data
We current completely trust user-space to set up metadata describing an
consistant array.  In particlar, that the metadata, data, and bitmap do not
overlap.

But userspace can be buggy, and it is better to report an error than corrupt
data.  So put in some appropriate checks.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown 713f6ab18b md: improve the is_mddev_idle test fix
Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing
we want to do with them is a signed comparison, and fix up the comment which
had become quite wrong.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown df968c4e8d md: improve message about invalid superblock during autodetect
People try to use raid auto-detect with version-1 superblocks (which is not
supported) and get confused when they are told they have an invalid
superblock.

So be more explicit, and say it it is not a valid v0.90 superblock.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
Rafael J. Wysocki 8314418629 Freezer: make kernel threads nonfreezable by default
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves.  This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.

It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.

The patch causes all kernel threads to be nonfreezable by default (ie.  to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE.  It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear.  Additionally, it updates documentation to
describe the freezing of tasks more accurately.

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Dan Williams 685784aaf3 xor: make 'xor_blocks' a library routine for use with async_tx
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise.  Xor support is
implemented using the raid5 xor routines.  For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2007-07-13 08:06:14 -07:00
NeilBrown a778b73ff7 md: fix bug with linear hot-add and elsewhere
Adding a drive to a linear array seems to have stopped working, due to changes
elsewhere in md, and insufficient ongoing testing...

So the patch to make linear hot-add work in the first place introduced a
subtle bug elsewhere that interracts poorly with older version of mdadm.

This fixes it all up.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
NeilBrown 435b71be20 md: improve the is_mddev_idle test
During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they are.
This test in done in is_mddev_idle, and it is somewhat fragile - it
sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io).  This has problems because total sectors gets
updated when a request completes, while resync io gets updated when the
request is submitted.  The time difference can cause large differenced
between the two which do not actually imply non-resync activity.  The test
currently allows for some fuzz (+/- 4096) but there are some cases when it
is not enough.

The test currently looks for any (non-fuzz) difference, either positive or
negative.  This clearly is not needed.  Any non-sync activity will cause
the total sectors to grow faster than the sync_io count (never slower) so
we only need to look for a positive differences.

If we do this then the amount of in-flight sync io will never cause the
appearance of non-sync IO.  Once enough non-sync IO to worry about starts
happening, resync will be slowed down and the measurements will thus be
more precise (as there is less in-flight) and control of resync will still
be suitably responsive.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:37 -07:00
Linus Torvalds 44ce6294d0 Revert "md: improve partition detection in md array"
This reverts commit 5b479c91da.

Quoth Neil Brown:

  "It causes an oops when auto-detecting raid arrays, and it doesn't
   seem easy to fix.

   The array may not be 'open' when do_md_run is called, so
   bdev->bd_disk might be NULL, so bd_set_size can oops.

   This whole approach of opening an md device before it has been
   assembled just seems to get more and more painful.  I think I'm going
   to have to come up with something clever to provide both backward
   comparability with usage expectation, and sane integration into the
   rest of the kernel."

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 18:51:36 -07:00
NeilBrown 5b479c91da md: improve partition detection in md array
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.

However that doesn't happen until the array is opened, which is later
than some people would like.

So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.

This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown 08a02ecd28 md: allow reshape_position for md arrays to be set via sysfs
"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).

When it is set, the number of drives, layout and chunksize can have
two possible values, an old an a new.

So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown 4d167f0937 md: stop using csum_partial for checksum calculation in md
If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it.  We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.

So replace it with C code to do the same thing.  Speed is not crucial here, so
something simple and correct is best.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown e11e93facc md: move test for whether level supports bitmap to correct place
We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.

With the test in the wrong place, a badly created array will confuse md rather
an produce sensible errors.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Martin Peschke c3f94b40e1 md: cleanup: use seq_release_private() where appropriate
We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Ahmed S. Darwish 50511da3da drivers/md.c: Use ARRAY_SIZE macro when appropriate
Use ARRAY_SIZE macro already defined in kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Peter Zijlstra f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
NeilBrown 5792a2856a [PATCH] md: avoid a deadlock when removing a device from an md array via sysfs
A device can be removed from an md array via e.g.
  echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock
since
  commit e7b0d26a86

With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
NeilBrown 5e55e2f5fc [PATCH] md: convert compile time warnings into runtime warnings
...  still not sure why we need this ....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
NeilBrown 041ae52e26 [PATCH] md: clear the congested_fn when stopping a raid5
If this mddev and queue got reused for another array that doesn't register a
congested_fn, this function would get called incorretly.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:14 -07:00
NeilBrown b4c4c7b809 [PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error
An error always aborts any resync/recovery/reshape on the understanding that
it will immediately be restarted if that still makes sense.  However a reshape
currently doesn't get restarted.  With this patch it does.

To avoid restarting when it is not possible to do work, we call into the
personality to check that a reshape is ok, and strengthen raid5_check_reshape
to fail if there are too many failed devices.

We also break some code out into a separate function: remove_and_add_spares as
the indent level for that code was getting crazy.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown d1b5380c7f [PATCH] md: clean out unplug and other queue function on md shutdown
The mddev and queue might be used for another array which does not set these,
so they need to be cleared.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown 7dd5e7c3db [PATCH] md: move warning about creating a raid array on partitions of the one device
md tries to warn the user if they e.g.  create a raid1 using two partitions of
the same device, as this does not provide true redundancy.

However it also warns if a raid0 is created like this, and there is nothing
wrong with that.

At the place where the warning is currently printer, we don't necessarily know
what level the array will be, so move the warning from the point where the
device is added to the point where the array is started.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
Eric W. Biederman 0b4d414714 [PATCH] sysctl: remove insert_at_head from register_sysctl
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name.  Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman ff1d28efc5 [PATCH] sysctl: md: remove unnecessary insert_at_head flag
The sysctls used by the md driver are have unique binary numbers so remove the
insert_at_head flag as it serves no useful purpose.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:55 -08:00
Arjan van de Ven fa027c2a0a [PATCH] mark struct file_operations const 4
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

[akpm@sdl.org: dvb fix]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:45 -08:00
NeilBrown 2a2275d630 [PATCH] md: fix potential memalloc deadlock in md
If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.

This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.

For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.

So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
NeilBrown 1031be7a5f [PATCH] md: make sure the events count in an md array never returns to zero
Now that we sometimes step the array events count backwards (when
transitioning dirty->clean where nothing else interesting has happened - so
that we don't need to write to spares all the time), it is possible for the
event count to return to zero, which is potentially confusing and triggers and
MD_BUG.

We could possibly remove the MD_BUG, but is just as easy, and probably safer,
to make sure we never return to zero.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:59 -08:00
NeilBrown 3f9d7b0d81 [PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md
While developing more functionality in mdadm I found some bugs in md...

- When we remove a device from an inactive array (write 'remove' to
  the 'state' sysfs file - see 'state_store') would should not
  update the superblock information - as we may not have
  read and processed it all properly yet.

- initialise all raid_disk entries to '-1' else the 'slot sysfs file
  will claim '0' for all devices in an array before the array is
  started.

- all '\n' not to be present at the end of words written to
  sysfs files

- when we use SET_ARRAY_INFO to set the md metadata version,
  set the flag to say that there is persistant metadata.

- allow GET_BITMAP_FILE to be called on an array that hasn't
  been started yet.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
NeilBrown 1757128438 [PATCH] md: assorted md and raid1 one-liners
Fix few bugs that meant that:
  - superblocks weren't alway written at exactly the right time (this
    could show up if the array was not written to - writting to the array
    causes lots of superblock updates and so hides these errors).

  - restarting device recovery after a clean shutdown (version-1 metadata
    only) didn't work as intended (or at all).

1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
   The body of this if takes one of two branches depending on whether
   MD_RECOVERY_SYNC is set, so testing it in the clause of the if
   is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
   metadata only) make sure a full recovery (not just as guided by
   bitmaps) does get done.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Jeff Garzik fdee8ae449 [PATCH] MD: conditionalize some code
The autorun code is only used if this module is built into the static
kernel image.  Adjust #ifdefs accordingly.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
NeilBrown 0d4ca600fc [PATCH] md: tidy up device-change notification when an md array is stopped
An md array can be stopped leaving all the setting still in place, or it can
torn down and destroyed.  set_capacity and other change notifications only
happen in the latter case, but should happen in both.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Josef Sipek c649bb9c55 [PATCH] struct path: convert md
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:47 -08:00
NeilBrown d63a5a74de [PATCH] lockdep: avoid lockdep warning in md
md_open takes ->reconfig_mutex which causes lockdep to complain.  This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen.  However that causes bigger problems than a deadlock
and should be fixed independently.

So we flag the lock in md_open as a nested lock.  This requires defining
mutex_lock_interruptible_nested.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:39 -08:00
Peter Zijlstra 2e7b651df1 [PATCH] remove the old bd_mutex lockdep annotation
Remove the old complex and crufty bd_mutex annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Nigel Cunningham 7dfb71030f [PATCH] Add include/linux/freezer.h and move definitions from sched.h
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki 4b438a23fb [PATCH] md: do not freeze md threads for suspend
If there's a swap file on a software RAID, it should be possible to use this
file for saving the swsusp's suspend image.  Also, this file should be
available to the memory management subsystem when memory is being freed before
the suspend image is created.

For the above reasons it seems that md_threads should not be frozen during the
suspend and the appended patch makes this happen, but then there is the
question if they don't cause any data to be written to disks after the suspend
image has been created, provided that all filesystems are frozen at that time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:24 -08:00
NeilBrown 2f47130361 [PATCH] md: change ONLINE/OFFLINE events to a single CHANGE event
It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons
(not least of which being that udev understands it already).

So remove the recently added KOBJ_OFFLINE (no-one is likely to care anyway)
and change the ONLINE to a CHANGE event

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
NeilBrown 7870db4c7f [PATCH] md: send online/offline uevents when an md array starts/stops
This allows udev to do something intelligent when an array becomes
available.

Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:55 -08:00
NeilBrown 01ab5662f5 [PATCH] md: simplify checking of available size when resizing an array
When "mdadm --grow --size=xxx" is used to resize an array (use more or less of
each device), we check the new siza against the available space in each
device.

We already have that number recorded in rdev->size, so calculating it is
pointless (and wrong in one obscure case).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
NeilBrown 2b6e845986 [PATCH] md: fix bug where spares don't always get rebuilt properly when they become live
If save_raid_disk is >= 0, then the device could be a device that is already
in sync that is being re-added.  So we need to default this value to -1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
NeilBrown 1c05b4bc22 [PATCH] md: endian annotation for v1 superblock access
Includes a couple of bugfixes found by sparse.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:05 -07:00
Akinobu Mita e24650c2e7 [PATCH] md: fix /proc/mdstat refcounting
I have seen mdadm oops after successfully unloading md module.

This patch privents from unloading md module while
mdadm is polling /proc/mdstat.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Akinbou Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
NeilBrown 5842730de1 [PATCH] md: fix bug where new drives added to an md array sometimes don't sync properly
This fixes a bug introduced in 2.6.18.

If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools)
then it will be included in the array without any resync happening.

It has been submitted for 2.6.18.1.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:41 -07:00
Eric Sesterhenn 52e5f9d1cf BUG_ON cleanup for drivers/md/
This changes two if() BUG(); usages to BUG_ON(); so people
can disable it safely.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:33:23 +02:00
NeilBrown 3a0f5bbb1a [PATCH] md: add error reporting to superblock write failure
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:19 -07:00
NeilBrown e8703fe1f5 [PATCH] md: remove MAX_MD_DEVS which is an arbitrary limit
Once upon a time we needed to fixed limit to the number of md devices,
probably because we preallocated some array.  This need no longer exists, but
we still have an arbitrary limit.

So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
part of a device number.

Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
remove MD_THREAD_NAME_MAX which hasn't been used for a while.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown 61df9d91e9 [PATCH] md: make messages about resync/recovery etc more specific
It is possible to request a 'check' of an md/raid array where the whole array
is read and consistancies are reported.

This uses the same mechanisms as 'resync' and so reports in the kernel logs
that a resync is being started.  This understandably confuses/worries people.

Also the text in /proc/mdstat suggests a 'resync' is happen when it is just a
check.

This patch changes those messages to be more specific about what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
Paul Clements 9b1d1dac18 [PATCH] md: new sysfs interface for setting bits in the write-intent-bitmap
Add a new sysfs interface that allows the bitmap of an array to be dirtied.
The interface is write-only, and is used as follows:

echo "1000" > /sys/block/md2/md/bitmap

(dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
bitmaps of array md2)

echo "1000-2000" > /sys/block/md1/md/bitmap

(dirty the bits for chunks 1000-2000 in md1's bitmap)

This is useful, for example, in cluster environments where you may need to
combine two disjoint bitmaps into one (following a server failure, after a
secondary server has taken over the array).  By combining the bitmaps on
the two servers, a full resync can be avoided (This was discussed on the
list back on March 18, 2005, "[PATCH 1/2] md bitmap bug fixes" thread).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown 850b2b420c [PATCH] md: replace magic numbers in sb_dirty with well defined bit flags
Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
   Some device state has changed requiring superblock update
   on all devices.
MD_CHANGE_CLEAN
   The array has transitions from 'clean' to 'dirty' or back,
   requiring a superblock update on active devices, but possibly
   not on spares
MD_CHANGE_PENDING
   A superblock update is underway.

We wait for an update to complete by waiting for all flags to be clear.  A
flag can be set at any time, even during an update, without risk that the
change will be lost.

Stop exporting md_update_sb - isn't needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
Adrian Bunk fbedac04fa [PATCH] md: the scheduled removal of the START_ARRAY ioctl for md
This patch contains the scheduled removal of the START_ARRAY ioctl for md.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
NeilBrown 8469219596 [PATCH] md: avoid backward event updates in md superblock when degraded.
If we
  - shut down a clean array,
  - restart with one (or more) drive(s) missing
  - make some changes
  - pause, so that they array gets marked 'clean',
the event count on the superblock of included drives
will be the same as that of the removed drives.
So adding the removed drive back in will cause it
to be included with no resync.

To avoid this, we only update the eventcount backwards when the array
is not degraded.  In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating-backwards is valuable.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:31 -07:00
Andrew Morton d0a0a5ee7a [PATCH] md: fix oops in error-handling
During early MD setup (superblock reading), we don't have a personality yet.
But the error-handling code tries to dereference mddev->pers.  Fix.

Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:17 -07:00
NeilBrown 67463acb64 [PATCH] md: require CAP_SYS_ADMIN for (re-)configuring md devices via sysfs
The ioctl requires CAP_SYS_ADMIN, so sysfs should too.  Note that we don't
require CAP_SYS_ADMIN for reading attributes even though the ioctl does.
There is no reason to limit the read access, and much of the information is
already available via /proc/mdstat

Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:17 -07:00
NeilBrown 80ca3a44f5 [PATCH] md: unify usage of symbolic names for perms
Some places we use number (0660) someplaces names (S_IRUGO).  Change all
numbers to be names, and change 0655 to be what it should be.

Also make some formatting more consistent.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:17 -07:00
NeilBrown ff4e8d9a9f [PATCH] md: fix resync speed calculation for restarted resyncs
We introduced 'io_sectors' recently so we could count the sectors that causes
io during resync separate from sectors which didn't cause IO - there can be a
difference if a bitmap is being used to accelerate resync.

However when a speed is reported, we find the number of sectors processed
recently by subtracting an oldish io_sectors count from a current
'curr_resync' count.  This is wrong because curr_resync counts all sectors,
not just io sectors.

So, add a field to mddev to store the curren io_sectors separately from
curr_resync, and use that in the calculations.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:16 -07:00
NeilBrown 0b8c9de05c [PATCH] md: delay starting md threads until array is completely setup
When an array is started we start one or two threads (two if there is a
reshape or recovery that needs to be completed).

We currently start these *before* the array is completely set up and in
particular before queue->queuedata is set.  If the thread actually starts
very quickly on another CPU, we can end up dereferencing queue->queuedata
and oops.

This patch also makes sure we don't try to start a recovery if a reshape is
being restarted.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:16 -07:00
NeilBrown 31b65a0d38 [PATCH] md: set desc_nr correctly for version-1 superblocks
This has to be done in ->load_super, not ->validate_super

Without this, hot-adding devices to an array doesn't always
work right - though there is a work around in mdadm-2.5.2 to
make this less of an issue.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:16 -07:00
Ingo Molnar 663d440eaa [PATCH] lockdep: annotate blkdev nesting
Teach special (recursive) locking code to the lock validator.

Effects on non-lockdep kernels:

- the introduction of the following function variants:

  extern struct block_device *open_partition_by_devnum(dev_t, unsigned);

  extern int blkdev_put_partition(struct block_device *);

  static int
  blkdev_get_whole(struct block_device *bdev, mode_t mode, unsigned flags);

 which on non-lockdep are the same as open_by_devnum(), blkdev_put()
 and blkdev_get().

- a subclass parameter to do_open(). [unused on non-lockdep]

- a subclass parameter to __blkdev_put(), which is a new internal
  function for the main blkdev_put*() functions. [parameter unused
  on non-lockdep kernels, except for two sanity check WARN_ON()s]

these functions carry no semantical difference - they only express
object dependencies towards the lockdep subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Greg Kroah-Hartman ce7b0f46bb [PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed
And remove the now unneeded number field.
Also fixes all drivers that set these fields.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26 12:25:08 -07:00
Greg Kroah-Hartman ff23eca3e8 [PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree
Also fixes up all files that #include it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26 12:25:08 -07:00
Greg Kroah-Hartman 8ab5e4c15b [PATCH] devfs: Remove devfs_remove() function from the kernel tree
Removes the devfs_remove() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26 12:25:07 -07:00
Greg Kroah-Hartman 1a715c5cf9 [PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree
Removes the devfs_mk_bdev() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26 12:25:06 -07:00
Greg Kroah-Hartman 95dc112a57 [PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree
Removes the devfs_mk_dir() function and all callers of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26 12:25:06 -07:00
Adrian Bunk 0538195424 [PATCH] drivers/md/md.c: make code static
Make needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:40 -07:00
NeilBrown f655675b3f [PATCH] md: Allow the write_mostly flag to be set via sysfs
It appears in /sys/mdX/md/dev-YYY/state
and can be set or cleared by writing 'writemostly' or '-writemostly'
respectively.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:40 -07:00
NeilBrown a94213b1fa [PATCH] md: Allow resync_start to be set and queried via sysfs
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:40 -07:00
NeilBrown d4dbd0250e [PATCH] md: Allow raid 'layout' to be read and set via sysfs
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown 45dc2de1e5 [PATCH] md: Allow rdev state to be set via sysfs
The md/dev-XXX/state file can now be written:

 "faulty" simulates an error on the device
 "remove" removes the device from the array (if it is not busy)

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown 9e653b6342 [PATCH] md: Set/get state of array via sysfs
This allows the state of an md/array to be directly controlled via sysfs and
adds the ability to stop and array without tearing it down.

Array states/settings:

 clear
     No devices, no size, no level
     Equivalent to STOP_ARRAY ioctl
 inactive
     May have some settings, but array is not active
        all IO results in error
     When written, doesn't tear down array, but just stops it
 suspended (not supported yet)
     All IO requests will block. The array can be reconfigured.
     Writing this, if accepted, will block until array is quiescent
 readonly
     no resync can happen.  no superblocks get written.
     write requests fail
 read-auto
     like readonly, but behaves like 'clean' on a write request.

 clean - no pending writes, but otherwise active.
     When written to inactive array, starts without resync
     If a write request arrives then
       if metadata is known, mark 'dirty' and switch to 'active'.
       if not known, block and switch to write-pending
     If written to an active array that has pending writes, then fails.
 active
     fully active: IO and resync can be happening.
     When written to inactive array, starts with resync

 write-pending (not supported yet)
     clean, but writes are blocked waiting for 'active' to be written.

 active-idle
     like active, but no writes have been seen for a while (100msec).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown 4254376914 [PATCH] md: Don't write dirty/clean update to spares - leave them alone
- record the 'event' count on each individual device (they
  might sometimes be slightly different now)
- add a new value for 'sb_dirty': '3' means that the super
  block only needs to be updated to record a clean<->dirty
  transition.
- Prefer odd event numbers for dirty states and even numbers
  for clean states
- Using all the above, don't update the superblock on
  a spare device if the update is just doing a clean-dirty
  transition.  To accomodate this, a transition from
  dirty back to clean might now decrement the events counter
  if nothing else has changed.

The net effect of this is that spare drives will not see any IO requests
during normal running of the array, so they can go to sleep if that is what
they want to do.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown 07d84d109d [PATCH] md: Allow re-add to work on array without bitmaps
When an array has a bitmap, a device can be removed and re-added and only
blocks changes since the removal (as recorded in the bitmap) will be resynced.

It should be possible to do a similar thing to arrays without bitmaps.  i.e.
if a device is removed and re-added and *no* changes have been made in the
interim, then the add should not require a resync.

This patch allows that option.  This means that when assembling an array one
device at a time (e.g.  during device discovery) the array can be enabled
read-only as soon as enough devices are available, but extra devices can still
be added without causing a resync.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown acc55e2201 [PATCH] md/bitmap: tidy up i_writecount handling in md/bitmap
md/bitmap modifies i_writecount of a bitmap file to make sure that no-one else
writes to it.  The reverting of the change is sometimes done twice, and there
is one error path where it is omitted.

This patch tidies that up.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:38 -07:00
NeilBrown d7375ab324 [PATCH] md/bitmap: fix online removal of file-backed bitmaps
When "mdadm --grow /dev/mdX --bitmap=none" is used to remove a filebacked
bitmap, the bitmap was disconnected from the array, but the file wasn't closed
(until the array was stopped).

The file also wasn't closed if adding the bitmap file failed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:38 -07:00
Adrian Bunk 5e56341d02 [PATCH] md: make md_print_devices() static
This patch makes the needlessly global md_print_devices() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 7c7546ccf6 [PATCH] md: allow a linear array to have drives added while active
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 5fd6c1dce0 [PATCH] md: allow checkpoint of recovery with version-1 superblock
For a while we have had checkpointing of resync.  The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.

Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown a8a55c387d [PATCH] md: remove nuisance message at shutdown
At shutdown, we switch all arrays to read-only, which creates a message for
every instantiated array, even those which aren't actually active.

So remove the message for non-active arrays.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 16f17b39f3 [PATCH] md: increase the delay before marking metadata clean, and make it configurable
When a md array has been idle (no writes) for 20msecs it is marked as 'clean'.
 This delay turns out to be too short for some real workloads.  So increase it
to 200msec (the time to update the metadata should be a tiny fraction of that)
and make it sysfs-configurable.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 9443a1d1f7 [PATCH] md: remove useless ioctl warning
This warning was slightly useful back in 2.2 days, but is more an annoyance
now.  It makes it awkward to add new ioctls (that we we are likely to do that
in the current climate, but it is possible).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:36 -07:00
NeilBrown c331eb04b9 [PATCH] md: Fix badness in sysfs_notify caused by md_new_event
From: NeilBrown <neilb@suse.de>

If an error is reported by a drive in a RAID array (which is done via
bi_end_io - in interrupt context), we call md_error and md_new_event which
calls sysfs_notify.  However sysfs_notify grabs a mutex and so cannot be
called in interrupt context.

This patch just creates a variant of md_new_event which avoids the sysfs
call, and uses that.  A better fix for later is to arrange for the event to
be called from user-context.

Note: avoiding the sysfs call isn't a problem as an error will not, by
itself, modify the sync_action attribute.  (We do still need to
wake_up(&md_event_waiters) as an error by itself will modify /proc/mdstat).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:11 -07:00
Neil Brown c71d48877e [PATCH] Unlock md devices when stopping them on reboot.
otherwise we get nasty messages about locks not being released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-26 11:52:11 -07:00
NeilBrown 2adc7d47c4 [PATCH] md: Fix inverted test for 'repair' directive.
We should be able to write 'repair' to /sys/block/mdX/md/sync_action,
however due to and inverted test, that always given EINVAL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:17 -07:00
Ingo Molnar 5dc5cf7dd2 [PATCH] md: locking fix
- fix mddev_lock() usage bugs in md_attr_show() and md_attr_store().
  [they did not anticipate the possibility of getting a signal]

- remove mddev_lock_uninterruptible() [unused]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:04 -07:00
NeilBrown 4508a7a734 [PATCH] sysfs: Allow sysfs attribute files to be pollable
It works like this:
  Open the file
  Read all the contents.
  Call poll requesting POLLERR or POLLPRI (so select/exceptfds works)
  When poll returns,
     close the file and go to top of loop.
   or lseek to start of file and go back to the 'read'.

Events are signaled by an object manager calling
   sysfs_notify(kobj, dir, attr);

If the dir is non-NULL, it is used to find a subdirectory which
contains the attribute (presumably created by sysfs_create_group).

This has a cost of one int  per attribute, one wait_queuehead per kobject,
one int per open file.

The name "sysfs_notify" may be confused with the inotify
functionality.  Maybe it would be nice to support inotify for sysfs
attributes as well?

This patch also uses sysfs_notify to allow /sys/block/md*/md/sync_action
to be pollable

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14 11:41:24 -07:00
NeilBrown 926ce2d8a7 [PATCH] md: Remove some code that can sleep from under a spinlock
And remove the comments that were put in inplace of a fix too....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:01 -08:00
NeilBrown df5b89b323 [PATCH] md: Convert reconfig_sem to reconfig_mutex
... being careful that mutex_trylock is inverted wrt down_trylock

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
Arjan van de Ven 48c9c27b8b [PATCH] sem2mutex: drivers/md
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown 8ddeeae51f [PATCH] md: Fix md grow/size code to correctly find the maximum available space
An md array can be asked to change the amount of each device that it is using,
and in particular can be asked to use the maximum available space.  This
currently only works if the first device is not larger than the rest.  As
'size' gets changed and so 'fit' becomes wrong.  So check if a 'fit' is
required early and don't corrupt it.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown e464eafdb4 [PATCH] md: Support suspending of IO to regions of an md array
This allows user-space to access data safely.  This is needed for raid5
reshape as user-space needs to take a backup of the first few stripes before
allowing reshape to commence.

It will also be useful in cluster-aware raid1 configurations so that all
cluster members can leave a section of the array untouched while a
resync/recovery happens.

A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
Note that only one range can be suspended at a time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown 16484bf596 [PATCH] md: Make 'reshape' a possible sync_action action
This allows reshape to be triggerred via sysfs (which is the only way to start
it happening).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown 63c70c4f3a [PATCH] md: Split reshape handler in check_reshape and start_reshape
check_reshape checks validity and does things that can be done instantly -
like adding devices to raid1.  start_reshape initiates a restriping process to
convert the whole array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown f67055780c [PATCH] md: Checkpoint and allow restart of raid5 reshape
We allow the superblock to record an 'old' and a 'new' geometry, and a
position where any conversion is up to.  The geometry allows for changing
chunksize, layout and level as well as number of devices.

When using verion-0.90 superblock, we convert the version to 0.91 while the
conversion is happening so that an old kernel will refuse the assemble the
array.  For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape and restart the
reshape process if needed.  If the reshape stopped at an awkward time (like
when updating the first stripe) we refuse to assemble the array, and let
user-space worry about it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown 292695531a [PATCH] md: Final stages of raid5 expand code
This patch adds raid5_reshape and end_reshape which will start and finish the
reshape processes.

raid5_reshape is only enabled in CONFIG_MD_RAID5_RESHAPE is set, to discourage
accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.

and Make sure that you have backups, just in case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown ccfcc3c10b [PATCH] md: Core of raid5 resize process
This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocated new stripe_heads for the next chunk-wide-stripe in the target
geometry, marking them STRIPE_EXPANDING.

Then it finds which stripe heads in the old geometry can provide data needed
by these and marks them STRIPE_EXPAND_SOURCE.  This causes stripe_handle to
read all blocks on those stripes.

Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
needed are copied into the corresponding STRIPE_EXPANDING stripe_head.  Once a
STRIPE_EXPANDING stripe_head is full, it is marks STRIPE_EXPAND_READY and then
is written out and released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00