Commit Graph

1127 Commits

Author SHA1 Message Date
Dan Williams 80c3a6ce4b md: add 'size' as a personality method
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested.  For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-03-31 14:57:49 +11:00
Atsushi SAKAI 93ed05e2a5 md: fix typo in FSF address
Hello,

 I found a typo Bosto"m" in FSF address.
And I am checking around linux source code.
Here is the only place which uses Bosto"m" (not Boston).

Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:37 +11:00
NeilBrown fc9739c6d6 md: add takeover support for converting raid6 back into raid5
If a raid6 is still in the layout that comes from converting raid5
into a raid6. this will allow us to convert it back again.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:20 +11:00
NeilBrown e9d4758f6e md: add takeover support for raid4 -> raid5 conversion.
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:57:09 +11:00
NeilBrown b354603527 md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5.
2-drive raid5's aren't very interesting.  But if you are converting
a raid1 into a raid5, you will at least temporarily have one.  And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.

layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:56:41 +11:00
NeilBrown d562b0c431 md: add ->takeover method for raid5 to be able to take over raid1
The RAID1 must have two drives and be a suitable size to
be a multiple of a chunksize that isn't too small.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 245f46c2c2 md: add ->takeover method to support changing the personality managing an array
Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 409c57f380 md: enable suspend/resume of md devices.
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown e0cf8f045b md: md_unregister_thread should cope with being passed NULL
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first.  It is safer
to put the check inside md_unregister_thread itself.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 91adb56473 md/raid5: refactor raid5 "run"
.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:39 +11:00
NeilBrown 34817e8c39 md: make sure new_level, new_chunksize, new_layout always have sensible values.
When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.

So fix this up.

A subsequent patch will BUG_ON if these things aren't consistent.


Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown 67cc2b8165 md/raid5: finish support for DDF/raid6
DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.

This requires unfortunately complex loops...

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown 99c0fb5f92 md/raid5: Add support for new layouts for raid5 and raid6.
DDF uses different layouts for P and Q blocks than current md/raid6
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last to will allow online level conversion.

Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown 911d4ee853 md/raid5: simplify raid5_compute_sector interface
Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
a 'struct stripe_head *' and fill in the relevant fields.  This is
more extensible.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown d0dabf7e57 md/raid6: remove expectation that Q device is immediately after P device.
Code currently assumes that the devices in a raid6 stripe are
  0 1 ... N-1 P Q
in some rotated order.  We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation.  We still assume that the data disks
are roughly in-order.  However P and Q can be inserted anywhere within
that order.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown 112bf8970d md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument
This similar to the recent change to get_active_stripe.
There is no functional change, just come rearrangement to make
future patches cleaner.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
NeilBrown b5663ba405 md/raid5: simplify interface for init_stripe and get_active_stripe
Rather than passing 'pd_idx' and 'disks' to these functions, just pass
'previous' which tells whether to use the 'previous' or 'current'
geometry during a reshape, and let init_stripe calculate
disks and pd_idx and anything else it might need.

This is not a substantial simplification and even adds a division.
However we will shortly be adding more complexity to init_stripe
to handle more interesting 'reshape' activities, and without this
change, the interface to these functions would get very complex.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:39:38 +11:00
Andre Noll dd8ac336c1 md: Represent raid device size in sectors.
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.

All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
Andre Noll 58c0fed400 md: Make mddev->size sector-based.
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown 575a80fa4f md: be more consistent about setting WriteMostly flag when adding a drive to an array
When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from:  the metadata on the disk or the
flags passed through the IOCTL.

For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).

This makes it awkward to clear, and is at best inconsistent.

As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.


Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown 97e4f42d62 md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash
Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.

One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed.  And we need to know what has completed to record
how much is recovered.  So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.

When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date.  So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number.  But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.


Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown 43b2e5d86d md: move md_k.h from include/linux/raid/ to drivers/md/
It really is nicer to keep related code together..

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown bff61975b3 md: move lots of #include lines out of .h files and into .c
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
NeilBrown 8b2b5c217c md: move LEVEL_* definition from md_k.h to md_u.h
.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.

Lastly move mdp_major in.  It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:03 +11:00
Christoph Hellwig ef740c372d md: move headers out of include/linux/raid/
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away.  md.h is left where it is for now as there
are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:03 +11:00
Christoph Hellwig 2a40a8aed0 cleanup drivers/md/Makefile
Use the -y variables instead of the old -objs so we can easily add
conditional objects to the modules.  Also always use += to add
subobjects to avoid problems when placing additional objects in
some place in the file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
Christoph Hellwig 3dbd8c2e3f md: stop defining MAJOR_NR
MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
Martin K. Petersen 3f9d99c12a MD data integrity support
md: Add support for data integrity to MD

If all subdevices support the same protection format the MD device is
flagged as integrity capable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
NeilBrown 355a43e641 md: write bitmap information to devices that are undergoing recovery.
When we add some spares to an array and start recovery, and we have
a bitmap which is stored 'internally' on all devices, we call
bitmap_write_all to make sure the bitmap is correct on the new
device(s).
However that doesn't work as write_sb_page only writes to
'In_sync' devices, and devices undergoing recovery are not
'In_sync' until recovery finishes.

So extend write_sb_page (actually next_active_rdev) to include devices
that are under recovery.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
NeilBrown d0a4bb4927 md: never clear bit from the write-intent bitmap when the array is degraded.
It is safe to clear a bit from the write-intent bitmap for a raid1
if we know the data has been written to all devices, which is
what the current test does.

But it is not always safe to update the 'events_cleared' counter in
that case.  This is because one request could complete successfully
after some other request has partially failed.

So simply disable the clearing and updating of events_cleared whenever
the array is degraded.  This might end up not clearing some bits that
could safely be cleared, but it is safest approach.

Note that the bug fixed here did not risk corrupting data by letting
the array get out-of-sync.  Rather it meant that when a device is
removed and re-added to the array, it might incorrectly require a full
recovery rather than just recovering based on the bitmap.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
NeilBrown 1187cf0a3c md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE
md currently insists that the chunk size used for write-intent
bitmaps (the amount of data that corresponds to one chunk)
be at least one page.

The reason for this restriction is lost in the mists of time,
but a review of the code (and a vague memory) suggests that the only
problem would be related to resync.  Resync tries very hard to
work in multiples of a page, but also needs to sync with units
of a bitmap_chunk too.

This connection comes out in the bitmap_start_sync call.

So change bitmap_start_sync to always work in multiples of a page.
If the bitmap chunk size is less that one page, we flag multiple
chunks as 'syncing' and generally make them all appear to the
resync routines like one chunk.

All other code either already works with data ranges that could
span multiple chunks, or explicitly only cares about a single chunk.

Signed-off-by: Neil Brown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
NeilBrown eea1bf384e md: Fix is_mddev_idle test (again).
There are two problems with is_mddev_idle.

1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
   rest are 'long'.
   So if sync_io were to wrap on a 64bit host, the value of
   curr_events would go very negative suddenly, and take a very
   long time to return to positive.

   So do all calculations as 'int'.  That gives us plenty of precision
   for what we need.

2/ To initialise rdev->last_events we simply call is_mddev_idle, on
   the assumption that it will make sure that last_events is in a
   suitable range.  It used to do this, but now it does not.
   So now we need to be more explicit about initialisation.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:27:02 +11:00
NeilBrown 73d5c38a95 md: avoid races when stopping resync.
There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed.  So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-25 13:18:47 +11:00
NeilBrown 78200d45cd md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.
For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
The both walk through the device addresses in order.

For raid10 it can be quite different.  resync follows the 'array'
address, and makes sure all copies are the same.  Recover walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particularly, it can cause bitmap-directed recovery of a raid10 to
not recovery some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-25 13:18:47 +11:00
NeilBrown 09b4068a7f md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.
When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.

However if we choose to skip a chunk as it isn't flag, the code
currently skips the whole raid10-chunk, thus it might not recovery
some blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.



Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-25 13:18:47 +11:00
Jens Axboe 93dbb39350 block: fix bad definition of BIO_RW_SYNC
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:00 +01:00
NeilBrown de01dfadf2 md: Ensure an md array never has too many devices.
Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-06 18:02:46 +11:00
Andre Noll 852c8bf484 md: Fix a bug in linear.c causing which_dev() to return the wrong device.
ab5bd5cbc8 introduced the following
bug in linear software raid for large arrays on 32 bit machines:

which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.

Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.

The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.

As reported by Simon Kirby, this leads to errors such as

	linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464

Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.

Cc: stable@kernel.org
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-06 15:10:52 +11:00
NeilBrown 4706b349f4 md: Allow read error in a single drive raid1 to be passed up.
If a raid1 only has a single working device and gets a read error, 
we choose to simply return that error up to the filesystem (or whatever)
rather than failing the whole array.

However the codes doesn't quite do that.  We attempt a readbalance
which allocates the same drive, so we retry the read - indefinitely. 

Instead:  If read_balance in the error case chooses the same drive that just
failed, treat it as a failure and don't retry.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-02-06 15:06:47 +11:00
NeilBrown 4044ba58dd md: don't retry recovery of raid1 that fails due to error on source drive.
If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.

So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.

Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:11 +11:00
NeilBrown efeb53c0e5 md: Allow md devices to be created by name.
Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.

Also, creating md devices by opening the device special file is a bit
awkward.

So this patch provides a new option for creating and naming devices.

Writing a name such as "md_home" to
    /sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created.  It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.

md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.

The name of the array must start 'md_' to avoid conflict with
other devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:10 +11:00
NeilBrown d3374825ce md: make devices disappear when they are no longer needed.
Currently md devices, once created, never disappear until the module
is unloaded.  This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed.  However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed.  However this has a
complication.
Between the call
   __blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
   __blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created.  We need to
ensure that this reference can not be used.  i.e. the ->open must
fail.

So:
 1/  in md_probe we set a flag in the mddev (hold_active) which
     indicates that the array should be treated as active, even
     though there are no references, and no appearance of activity.
     This is cleared by md_release when the device is closed if it
     is no longer needed.
     This ensures that the gendisk will survive between md_probe and
     md_open.

 2/  In md_open we check if the mddev we expect to open matches
     the gendisk that we did open.
     If there is a mismatch we return -ERESTARTSYS and modify
     __blkdev_get to retry from the top in that case.
     In the -ERESTARTSYS sys case we make sure to wait until
     the old gendisk (that we succeeded in opening) is really gone so
     we loop at most once.

Some udev configurations will always open an md device when it first
appears.   If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it.  This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).

As an array can be stopped by writing to a sysfs attribute
  echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects.  This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().



Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:10 +11:00
NeilBrown a21d15042d md: centralise all freeing of an 'mddev' in 'md_free'
md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:09 +11:00
NeilBrown 8b76539823 md: move allocation of ->queue from mddev_find to md_probe
It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk.  It should allocate the
queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:08 +11:00
Cheng Renquan cd2ac9321c md: need another print_sb for mdp_superblock_1
md_print_devices is called in two code path: MD_BUG(...), and md_ioctl
with PRINT_RAID_DEBUG.  it will dump out all in use md devices
information;

However, it wrongly processed two types of superblock in one:

The header file <linux/raid/md_p.h> has defined two types of superblock,
struct mdp_superblock_s (typedefed with mdp_super_t) according to md with
metadata 0.90, and struct mdp_superblock_1 according to md with metadata
1.0 and later,

These two types of superblock are very different,

The md_print_devices code processed them both in mdp_super_t, that would
lead to wrong informaton dump like:

	[ 6742.345877]
	[ 6742.345887] md:	**********************************
	[ 6742.345890] md:	* <COMPLETE RAID STATE PRINTOUT> *
	[ 6742.345892] md:	**********************************
	[ 6742.345896] md1: <ram7><ram6><ram5><ram4>
	[ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
	[ 6742.345909] md: rdev superblock:
	[ 6742.345914] md:  SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d
	[ 6742.345918] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
	[ 6742.345922] md:     UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
	[ 6742.345924]      D  0:  DISK<N:0,(1,8),R:0,S:6>
	[ 6742.345930]      D  1:  DISK<N:1,(1,10),R:1,S:6>
	[ 6742.345933]      D  2:  DISK<N:2,(1,12),R:2,S:6>
	[ 6742.345937]      D  3:  DISK<N:3,(1,14),R:3,S:6>
	[ 6742.345942] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
	...
	[ 6742.346058] md0: <ram3><ram2><ram1><ram0>
	[ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
	[ 6742.346070] md: rdev superblock:
	[ 6742.346073] md:  SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c
	[ 6742.346077] md:     L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
	[ 6742.346081] md:     UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
	[ 6742.346084]      D  0:  DISK<N:-1,(-1,-1),R:-1,S:-1>
	[ 6742.346089]      D  1:  DISK<N:-1,(-1,-1),R:-1,S:-1>
	[ 6742.346092]      D  2:  DISK<N:-1,(-1,-1),R:-1,S:-1>
	[ 6742.346096]      D  3:  DISK<N:-1,(-1,-1),R:-1,S:-1>
	[ 6742.346102] md:     THIS:  DISK<N:0,(0,0),R:0,S:0>
	...
	[ 6742.346219] md:	**********************************
	[ 6742.346221]

Here md1 is metadata 0.90.0, and md0 is metadata 1.2

After some more code to distinguish these two types of superblock, in this patch,

it will generate dump information like:

	[ 7906.755790]
	[ 7906.755799] md:	**********************************
	[ 7906.755802] md:	* <COMPLETE RAID STATE PRINTOUT> *
	[ 7906.755804] md:	**********************************
	[ 7906.755808] md1: <ram7><ram6><ram5><ram4>
	[ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
	[ 7906.755821] md: rdev superblock (MJ:0):
	[ 7906.755826] md:  SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3
	[ 7906.755830] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
	[ 7906.755834] md:     UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
	[ 7906.755836]      D  0:  DISK<N:0,(1,8),R:0,S:6>
	[ 7906.755842]      D  1:  DISK<N:1,(1,10),R:1,S:6>
	[ 7906.755845]      D  2:  DISK<N:2,(1,12),R:2,S:6>
	[ 7906.755849]      D  3:  DISK<N:3,(1,14),R:3,S:6>
	[ 7906.755855] md:     THIS:  DISK<N:3,(1,14),R:3,S:6>
	...
	[ 7906.755972] md0: <ram3><ram2><ram1><ram0>
	[ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
	[ 7906.755984] md: rdev superblock (MJ:1):
	[ 7906.755989] md:  SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd>
	[ 7906.755990] md:    Name: "DG5:0" CT:1226410480
	[ 7906.755998] md:       L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
	[ 7906.755999] md:     Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
	[ 7906.756001] md:       (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
	[ 7906.756003] md:         (MaxDev:384)
	...
	[ 7906.756113] md:	**********************************
	[ 7906.756116]

this md0 (metadata 1.2) information dumping is exactly according to struct
mdp_superblock_1.

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:08 +11:00
Cheng Renquan 159ec1fc06 md: use list_for_each_entry macro directly
The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.

But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.

In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:08 +11:00
Andre Noll ccacc7d2cf md: raid0: make hash_spacing and preshift sector-based.
This patch renames the hash_spacing and preshift members of struct
raid0_private_data to spacing and sector_shift respectively and
changes the semantics as follows:

We always have spacing = 2 * hash_spacing. In case
sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
while sector_shift = preshift = 0 otherwise.

Note that the values of nb_zone and zone are unaffected by these changes
because in the sector_div() preceeding the assignement of these two
variables both arguments double.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:08 +11:00
Andre Noll 83838ed878 md: raid0: Represent the size of strip zones in sectors.
This completes the block -> sector conversion of struct strip_zone.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:07 +11:00
Andre Noll 0825b87a7d md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's.
This patch consists only of these trivial changes.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:07 +11:00
Andre Noll 6b8796cc3d md: raid0 create_strip_zones(): Make two local variables sector-based.
current_offset and curr_zone_offset stored the corresponding offsets
as 1K quantities. Rename them to current_start and curr_zone_start
to match the naming of struct strip_zone and store the offsets as
sector counts.

Also, add KERN_INFO to the printk() affected by this change to make
checkpatch happy.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:07 +11:00