Commit Graph

523 Commits

Author SHA1 Message Date
Jingoo Han
bb8e0e84b3 block: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete.  Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:56 -07:00
Josh Durgin
da6a6b6397 rbd: fix error handling from rbd_snap_name()
rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the
format of the image. The format 1 version returns NULL on error, which
is handled by the caller. The format 2 version returns an ERR_PTR,
which the caller of rbd_snap_name() does not expect.

Fortunately this is unlikely to occur in practice because
rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit
similar errors to rbd_snap_name() (like the snapshot not existing) and
return early, so rbd_snap_name() would not hit an error unless the
snapshot was removed between the two calls or memory was exhausted.

Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error
can be propagated, and it is consistent with rbd_dev_v2_snap_name().
Handle the ERR_PTR in the only rbd_snap_name() caller.

Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:44 -07:00
Josh Durgin
efadc98aab rbd: ignore unmapped snapshots that no longer exist
This prevents erroring out while adding a device when a snapshot
unrelated to the current mapping is deleted between reading the
snapshot context and reading the snapshot names. If the mapped
snapshot name is not found an error still occurs as usual.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:32 -07:00
Josh Durgin
9875201e10 rbd: fix use-after free of rbd_dev->disk
Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, check whether RBD_DEV_FLAG_REMOVING is set before
updating the disk size in rbd_dev_refresh(). In order to prevent a
race where rbd_dev_refresh() is already revalidating the disk when
rbd_remove() is called, move the call to rbd_bus_del_dev() after the
watch is unregistered and all notifies are complete. It's safe to
defer deleting this structure because no new requests can be submitted
once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be
opened.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:25 -07:00
Josh Durgin
20e0af67ce rbd: make rbd_obj_notify_ack() synchronous
The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used
asynchronously with no tracking of when the notify ack completes, so
it may still be in progress when the osd_client is shut down.  This
results in a BUG() since the osd client assumes no requests are in
flight when it stops. Since all notifies are flushed before the
osd_client is stopped, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.

Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect
its new synchronous nature.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:02 -07:00
Josh Durgin
9abc59908e rbd: complete notifies before cleaning up osd_client and rbd_dev
To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already be unregistered with the osd_client.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:15:57 -07:00
Linus Torvalds
6cccc7d301 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This includes both the first pile of Ceph patches (which I sent to
  torvalds@vger, sigh) and a few new patches that add support for
  fscache for Ceph.  That includes a few fscache core fixes that David
  Howells asked go through the Ceph tree.  (Thanks go to Milosz Tanski
  for putting this feature together)

  This first batch of patches (included here) had (has) several
  important RBD bug fixes, hole punch support, several different
  cleanups in the page cache interactions, improvements in the truncate
  code (new truncate mutex to avoid shenanigans with i_mutex), and a
  series of fixes in the synchronous striping read/write code.

  On top of that is a random collection of small fixes all across the
  tree (error code checks and error path cleanup, obsolete wq flags,
  etc)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
  ceph: use d_invalidate() to invalidate aliases
  ceph: remove ceph_lookup_inode()
  ceph: trivial buildbot warnings fix
  ceph: Do not do invalidate if the filesystem is mounted nofsc
  ceph: page still marked private_2
  ceph: ceph_readpage_to_fscache didn't check if marked
  ceph: clean PgPrivate2 on returning from readpages
  ceph: use fscache as a local presisent cache
  fscache: Netfs function for cleanup post readpages
  FS-Cache: Fix heading in documentation
  CacheFiles: Implement interface to check cache consistency
  FS-Cache: Add interface to check consistency of a cached object
  rbd: fix null dereference in dout
  rbd: fix buffer size for writes to images with snapshots
  libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
  rbd: fix I/O error propagation for reads
  ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  ceph: allow sync_read/write return partial successed size of read/write.
  ceph: fix bugs about handling short-read for sync read mode.
  ceph: remove useless variable revoked_rdcache
  ...
2013-09-09 09:13:22 -07:00
Josh Durgin
c35455791c rbd: fix null dereference in dout
The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always derefences it. Move this to another dout() protected
by a check that order is non-NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:51 -07:00
Josh Durgin
03507db631 rbd: fix buffer size for writes to images with snapshots
rbd_osd_req_create() needs to know the snapshot context size to create
a buffer large enough to send it with the message front. It gets this
from the img_request, which was not set for the obj_request yet. This
resulted in trying to write past the end of the front payload, hitting
this BUG:

libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

Fix this by associating the obj_request with its img_request
immediately after it's created, before the osd request is created.

Fixes: http://tracker.ceph.com/issues/5760
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:46 -07:00
Josh Durgin
17c1cc1d92 rbd: fix I/O error propagation for reads
When a request returns an error, the driver needs to report the entire
extent of the request as completed.  Writes already did this, since
they always set xferred = length, but reads were skipping that step if
an error other than -ENOENT occurred.  Instead, rbd would end up
passing 0 xferred to blk_end_request(), which would always report
needing more data.  This resulted in an assert failing when more data
was required by the block layer, but all the object requests were
done:

[ 1868.719077] rbd: obj_request read result -108 xferred 0
[ 1868.719077]
[ 1868.719518] end_request: I/O error, dev rbd1, sector 0
[ 1868.719739]
[ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
[ 1868.719739]
[ 1868.719739]   rbd_assert(more ^ (which == img_request->obj_request_count));

Without this assert, reads that hit errors would hang forever, since
the block layer considered them incomplete.

Fixes: http://tracker.ceph.com/issues/5647
CC: stable@vger.kernel.org  # v3.10
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:06:10 -07:00
Greg Kroah-Hartman
b15a21ddda rbd: convert bus code to use bus_groups
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead.  This converts the RBD bus code to use the
correct field.

Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Acked-by: Alex Elder <elder@linaro.org>
Cc: <ceph-devel@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-27 22:07:49 -07:00
Jingoo Han
a158073c43 block: rbd: use NULL instead of 0
The local variables such as 'bio_list', and 'pages' are pointers;
thus, use NULL instead of 0 to fix the following sparse warnings.

drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer
drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:40 -07:00
Sage Weil
e976cad0f0 rbd: fix a couple warnings
gcc isn't quite smart enough and generates these warnings:

drivers/block/rbd.c: In function 'rbd_img_request_fill':
drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here
drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]

even though they are initialized for their respective code paths.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:50 -07:00
Alex Elder
d552c6191b rbd: take a little credit
Add a name to the list of authors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:44 -07:00
Alex Elder
cfbf6377b6 rbd: use rwsem to protect header updates
Updating an image header needs to be protected to ensure it's
done consistently.  However distinct headers can be updated
concurrently without a problem.  Instead of using the global
control lock to serialize headder updates, just rely on the header
semaphore.  (It's already used, this just moves it out to cover
a broader section of the code.)

That leaves the control mutex protecting only the creation of rbd
clients, so rename it.

This resolves:
    http://tracker.ceph.com/issues/5222

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:43 -07:00
Alex Elder
1ba0f1e797 rbd: don't hold ctl_mutex to get/put device
When an rbd device is first getting mapped, its device registration
is protected the control mutex.  There is no need to do that though,
because the device has already been assigned an id that's guaranteed
to be unique.

An unmap of an rbd device won't proceed if the device has a non-zero
open count or is already being unmapped.  So there's no need to hold
the control mutex in that case either.

Finally, an rbd device can't be opened if it is being removed, and
it won't go away if there is a non-zero open count.  So here too
there's no need to hold the control mutex while getting or putting a
reference to an rbd device's Linux device structure.

Drop the mutex calls in these cases.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:42 -07:00
Alex Elder
82a442d239 rbd: protect against concurrent unmaps
Make sure two concurrent unmap operations on the same rbd device
won't collide, by only proceeding with the removal and cleanup of a
device if is not already underway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
751cc0e3cf rbd: set removing flag while holding list lock
When unmapping a device, its id is supplied, and that is used to
look up which rbd device should be unmapped.  Looking up the
device involves searching the rbd device list while holding
a spinlock that protects access to that list.

Currently all of this is done under protection of the control lock,
but that protection is going away soon.  To ensure the rbd_dev is
still valid (still on the list) while setting its REMOVING flag, do
so while still holding the list lock.  To do so, get rid of
__rbd_get_dev(), and open code what it did in the one place it
was used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
08f75463c1 rbd: protect against duplicate client creation
If more than one rbd image has the same ceph cluster configuration
(same options, same set of monitors, same keys) they normally share
a single rbd client.

When an image is getting mapped, rbd looks to see if an existing
client can be used, and creates a new one if not.

The lookup and creation are not done under a common lock though, so
mapping two images concurrently could lead to duplicate clients
getting set up needlessly.  This isn't a major problem, but it's
wasteful and different from what's intended.

This patch fixes that by using the control mutex to protect
both the lookup and (if needed) creation of the client.  It
was previously used just when creating.

This resolves:
    http://tracker.ceph.com/issues/3094

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:39 -07:00
Alex Elder
3b5cf2a2f1 rbd: clean up a few things in the refresh path
This includes a few relatively small fixes I found while examining
the code that refreshes image information.

This resolves:
    http://tracker.ceph.com/issues/5040

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:38 -07:00
Alex Elder
e215605417 rbd: flush dcache after zeroing page data
Neither zero_bio_chain() nor zero_pages() contains a call to flush
caches after zeroing a portion of a page.  This can cause problems
on architectures that have caches that allow virtual address
aliasing.

This resolves:
    http://tracker.ceph.com/issues/4777

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:37 -07:00
Alex Elder
912c317d46 rbd: drop original request earlier for existence check
The reference to the original request dropped at the end of
rbd_img_obj_exists_callback() corresponds to the reference taken
in rbd_img_obj_exists_submit() to account for the stat request
referring to it.  Move the put of that reference up right after
clearing that pointer to make its purpose more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-01 09:52:02 -07:00
Geert Uytterhoeven
491205a8b4 rbd: Use min_t() to fix comparison of distinct pointer types warning
drivers/block/rbd.c: In function ‘zero_pages’:
drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast

Remove the hackish casts and use min_t() to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Linus Torvalds
bd2931b5cf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This is a recently spotted regression in the snapshot behavior...

  It turns out several tests weren't being run in the nightlies so this
  took a while to spot"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: send snapshot context with writes
2013-06-29 10:31:15 -07:00
Josh Durgin
d2d1f17a0d rbd: send snapshot context with writes
Sending the right snapshot context with each write is required for
snapshots to work. Due to the ordering of calls, the snapshot context
is never set for any requests. This causes writes to the current
version of the image to be reflected in all snapshots, which are
supposed to be read-only.

This happens because rbd_osd_req_format_write() sets the snapshot
context based on obj_request->img_request. At this point, however,
obj_request->img_request has not been set yet, to the snapshot context
is set to NULL. Fix this by moving rbd_img_obj_request_add(), which
sets obj_request->img_request, before the osd request formatting
calls.

This resolves:
    http://tracker.ceph.com/issues/5465

Reported-by: Karol Jurak <karol.jurak@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-06-27 05:55:29 -07:00
Linus Torvalds
78750f1908 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes another problem with using v2 images on 3.10 due to the
  order in which fields are read from the image header.

  Hopefully this is the last one"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fetch object order before using it
2013-06-26 08:47:46 -10:00
Josh Durgin
1617e40c1e rbd: fetch object order before using it
rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by compariing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-25 12:27:31 -07:00
Linus Torvalds
7ecba6f2f3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes a problem preventing the kernel and userland librbd
  libraries from sharing data with the new format 2 images"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: use the correct length for format 2 object names
2013-06-21 06:27:40 -10:00
Josh Durgin
3a96d5cd7b rbd: use the correct length for format 2 object names
Format 2 objects use 16 characters for the object name suffix to be
able to express the full 64-bit range of object numbers. Format 1
images only use 12 characters for this. Using 12-character names for
format 2 caused userspace and kernel rbd clients to read differently
named objects, which made an image written by one client look empty to
the other client.

CC: stable@vger.kernel.org  # 3.9+
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-13 08:46:15 -07:00
Linus Torvalds
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Alex Elder
3abef3b358 rbd: fix cleanup in rbd_add()
Bjorn Helgaas pointed out that a recent commit introduced a
use-after-free condition in an error path for rbd_add().
He correctly stated:

    I think b536f69a3a "rbd: set up devices only for mapped images"
    introduced a use-after-free error in rbd_add():
	...
    If rbd_dev_device_setup() returns an error, we call
    rbd_dev_image_release(), which ultimately kfrees rbd_dev.
    Then we call rbd_dev_destroy(), which references fields in
    the already-freed rbd_dev struct before kfreeing it again.

The simple fix is to return the error code after the call to
rbd_dev_image_release().

Closer examination revealed that there's no need to clean up
rbd_opts in that function, so fix that too.

Update some other comments that have also become out of date.

Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:10 -05:00
Alex Elder
7262cfca43 rbd: don't destroy ceph_opts in rbd_add()
Whether rbd_client_create() successfully creates a new client or
not, it takes responsibility for getting the ceph_opts structure
it's passed destroyed.  If successful, the structure becomes
associated with the created client; if not, rbd_client_create()
will destroy it.

Previously, rbd_get_client() would call ceph_destroy_options()
if rbd_get_client() failed, and that meant it got called twice.
That led freeing various pointers more than once, which is never a
good idea.

This resolves:
    http://tracker.ceph.com/issues/4559

Cc: stable@vger.kernel.org # 3.8+
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:03 -05:00
Linus Torvalds
109c3c0292 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "Yes, this is a much larger pull than I would like after -rc1.  There
  are a few things included:

   - a few fixes for leaks and incorrect assertions
   - a few patches fixing behavior when mapped images are resized
   - handling for cloned/layered images that are flattened out from
     underneath the client

  The last bit was non-trivial, and there is some code movement and
  associated cleanup mixed in.  This was ready and was meant to go in
  last week but I missed the boat on Friday.  My only excuse is that I
  was waiting for an all clear from the testing and there were many
  other shiny things to distract me.

  Strictly speaking, handling the flatten case isn't a regression and
  could wait, so if you like we can try to pull the series apart, but
  Alex and I would much prefer to have it all in as it is a case real
  users will hit with 3.10."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits)
  rbd: re-submit flattened write request (part 2)
  rbd: re-submit write request for flattened clone
  rbd: re-submit read request for flattened clone
  rbd: detect when clone image is flattened
  rbd: reference count parent requests
  rbd: define parent image request routines
  rbd: define rbd_dev_unparent()
  rbd: don't release write request until necessary
  rbd: get parent info on refresh
  rbd: ignore zero-overlap parent
  rbd: support reading parent page data for writes
  rbd: fix parent request size assumption
  libceph: init sent and completed when starting
  rbd: kill rbd_img_request_get()
  rbd: only set up watch for mapped images
  rbd: set mapping read-only flag in rbd_add()
  rbd: support reading parent page data
  rbd: fix an incorrect assertion condition
  rbd: define rbd_dev_v2_header_info()
  rbd: get rid of trivial v1 header wrappers
  ...
2013-05-15 13:36:19 -07:00
Alex Elder
638f5abed3 rbd: re-submit flattened write request (part 2)
Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.

Kill off some redundant assertions.

This completes the resolution for:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:46 -05:00
Alex Elder
bbea1c1a31 rbd: re-submit write request for flattened clone
Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case.  (See the previous commit for more
reasoning about why this is appropriate.)

Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
02c74fbad9 rbd: re-submit read request for flattened clone
If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.

The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:

    rbd client                      parent image/osd
    ----------                      ----------------
    original object ENOENT;
        issue parent read
                                    respond to parent read
                                    child image flattened
    original image header refresh
             <--- original object written independently here
    parent read response received

Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
392a9dad7e rbd: detect when clone image is flattened
A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated.  Once this is complete, the
clone's association with the parent is abolished.

Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly.  We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.

There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away.  Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.

We leverage the fact that a zero parent overlap renders an image
effectively unlayered.  We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.

This and the next few patches resolve:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
a2acd00e79 rbd: reference count parent requests
Keep a reference count for uses of the parent information for an rbd
device.

An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap).  Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.

The initial reference is dropped when the image gets torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
e93f315235 rbd: define parent image request routines
Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects.  For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
fb65d2284c rbd: define rbd_dev_unparent()
Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
8785b1d487 rbd: don't release write request until necessary
Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read.  The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().

Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request.  We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
642a25375f rbd: get parent info on refresh
Get parent info for format 2 images on every refresh (rather than
just during the initial probe).  This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened).  Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)

Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered.  It will soon be possible for
a layered image to become unlayered while a request is in flight.

This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created.  The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.

This and the next several patches are related to:
    http://tracker.ceph.com/issues/3763

NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist.  This is not really correct, and is a scenario that needs
to be addressed.  For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:33 -05:00
Alex Elder
70cf49cfc7 rbd: ignore zero-overlap parent
An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all.  Detect this case and treat
such an image as non-layered.  Issue a warning to be sure the user
knows what's going on.

This resolves:
    http://tracker.ceph.com/issues/5028

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:41 -05:00
Alex Elder
b91f09f17b rbd: support reading parent page data for writes
Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data.  But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.

This is handling the same kind of issue as was resolved by this
commit:
    5b2ab72d  rbd: support reading parent page data

This resolves:
    http://tracker.ceph.com/issues/5027

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:40 -05:00
Alex Elder
ebda6408f2 rbd: fix parent request size assumption
The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.

That is not necessarily the case.  The parent overlap could reduce
the request size below that.  To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request.  Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.

This resolves:
    http://tracker.ceph.com/issues/5038

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:09:01 -05:00
Alex Elder
c48f3f86e2 rbd: kill rbd_img_request_get()
Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:17:00 -05:00
Alex Elder
1f3ef78861 rbd: only set up watch for mapped images
Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.

We can't set up the watch request until we know the name of the
header object though.  So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.

Change the second parameter to rbd_dev_header_watch_sync() be
Boolean while we're at it.

This resolves:
    http://tracker.ceph.com/issues/4941

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:55 -05:00
Alex Elder
7ce4eef7b5 rbd: set mapping read-only flag in rbd_add()
The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.

So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.

This resolves:
    http://tracker.ceph.com/issues/4940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:50 -05:00
Alex Elder
5b2ab72d36 rbd: support reading parent page data
Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data.  But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.

Fortunately, it's not hard to add support for page data.

This resolves:
    http://tracker.ceph.com/issues/4939

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:25 -05:00
Alex Elder
91c6febb38 rbd: fix an incorrect assertion condition
In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object.  But
assertion was looking at the parent image order rather than the
original one, and these values can differ.

Fix that.

This resolves:
    http://tracker.ceph.com/issues/4938

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:10 -05:00
Alex Elder
2df3fac758 rbd: define rbd_dev_v2_header_info()
This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info().  While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively.  So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().

Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.

Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:52 -05:00
Alex Elder
99a41ebcee rbd: get rid of trivial v1 header wrappers
Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.

Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:46 -05:00
Alex Elder
30d60ba2f2 rbd: simplify rbd_dev_v1_probe()
An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe().  Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.

Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.

This highlights that are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.

Now rbd_dev_v1_probe() *is* a trivial wrapper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:41 -05:00
Alex Elder
662518b128 rbd: update in-core header directly
Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values.  This way there's no need to pass a
temporary buffer and then copy all the results in.  Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.

Note that this means we need to take the header semaphore at the
point we update things.  So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.

As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:37 -05:00
Alex Elder
bb23e37acb rbd: refactor rbd_header_from_disk()
This rearranges rbd_header_from_disk so that it:
    - allocates the snapshot context right away
    - keeps results in local variables, not changing the passed-in
      header until it's known we'll succeed
    - does initialization of set-once fields in a header only if
      they have not already been set

The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:33 -05:00
Alex Elder
46578dcdca rbd: zero format 1 header structure earlier
The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it.  Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe().  The latter already has
a zeroed header structure so zeroing it isn't necessary there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:28 -05:00
Alex Elder
f35a4dee14 rbd: set the mapping size and features later
Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up.  Set the capacity of the
disk after that.

Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:00 -05:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Alex Elder
51344a38ba rbd: always set read-only flag in rbd_add()
Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image.  At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set().  To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
6d80b130d5 rbd: kill rbd_dev_clear_mapping()
This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
8f4b7d9821 rbd: don't look up snapshot id in rbd_dev_mapping_set()
Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.

That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe().  If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.

Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Alex Elder
c734b79655 rbd: don't print warning if not mapping a parent
The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image.  Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image.  A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.

Currently, whenever an image with the LAYERED feature gets mapped, a
warning about the use of layered images gets printed.  But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Alex Elder
29334ba49c rbd: kill rbd_update_mapping_size()
Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:39 -05:00
Alex Elder
00a653e216 rbd: update capacity in rbd_dev_refresh()
When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.

There is no need to set the capacity while holding the header
semaphore.  Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.

Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:30 -05:00
Alex Elder
e627db085e rbd: revalidate only for mapping size changes
This commit:
    d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size.  This works well, as long as
the the rbd device doesn't map a snapshot.

A snapshot will never change size.  However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.

The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot.  This patch
corrects that.

Update the warning message shown in the event of error, and move
it into the callers.

This resolves:
    http://tracker.ceph.com/issues/4911

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:40:48 -05:00
Alex Elder
49ece55428 rbd: fix leak of format 2 snapshot context
When rbd_dev_v2_refresh() is called, the rbd device already has a
snapshot context associated with it.  But that never gets freed,
the pointer just gets overwritten.

Fix this by dropping the rbd device's reference to the snapshot
context before overwriting the pointer.

Because ceph_put_snap_context() already handles for a null pointer
we don't need to check for that (for the probe case, where no
context has yet been assigned).

This resolves:
    http://tracker.ceph.com/issues/4912

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:38:30 -05:00
Linus Torvalds
292088ee03 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "A couple of fixes + getting rid of __blkdev_put() return value"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  proc: Use PDE attribute setting accessor functions
  make blkdev_put() return void
  block_device_operations->release() should return void
  mtd_blktrans_ops->release() should return void
  hfs: SMP race on directory close()
2013-05-07 15:14:53 -07:00
Al Viro
db2a144bed block_device_operations->release() should return void
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:21 -04:00
Alex Elder
b5b09be30c rbd: fix image request leak on parent read
When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted.  When that completes, the callback routine was
not releasing that parent image request.  Fix that.

The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.

This resolves:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 12:15:28 -05:00
Alex Elder
78c2a44aae rbd: allocate image object names with a slab allocator
The names of objects used for image object requests are always fixed
size.  So create a slab cache to manage them.  Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder
868311b1eb rbd: allocate object requests with a slab allocator
Create a slab cache to manage rbd_obj_request allocation.  We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder
f907ad5596 rbd: allocate name separate from obj_request
The next patch will define a slab allocator for a object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder
1c2a9dfe21 rbd: allocate image requests with a slab allocator
Create a slab cache to manage rbd_img_request allocation.  Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor)

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder
30d1cff817 rbd: use binary search for snapshot lookup
Use bsearch(3) to make snapshot lookup by id more efficient.  (There
could be thousands of snapshots, and conceivably many more.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:17 -05:00
Alex Elder
15228ede7d rbd: clear EXISTS flag if mapped snapshot disappears
This functionality inadvertently disappeared in the last patch.

Image snapshots can get removed at just about any time.  In
particular it can disappear even if it is in use by an rbd
client as a mapped image.

The rbd client deals with such a disappearance by responding to new
requests with ENXIO.  This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.

This patch (re-)implements the clearing of that flag.

Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context.  If it is not, clear the
flag.

It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.

This resolves:
    http://tracker.ceph.com/issues/4880

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:57:03 -05:00
Alex Elder
33dca39f5c rbd: kill off the snapshot list
We no longer use the snapshot list for anything.  When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh.  The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.

So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.

This resolves:
    http://tracker.ceph.com/issues/4868

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:22 -07:00
Alex Elder
2ad3d7167e rbd: define rbd_snap_size() and rbd_snap_features()
This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.

Define rbd_snap_id_by_name() to look up a snapshot id given its
name.  This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.

Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id.  All this gets done
in rbd_dev_mapping_set().

As a result, snap_by_name() is no longer needed, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:20 -07:00
Alex Elder
54cac61fb6 rbd: use snap_id not index to look up snap info
In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.

This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().

This means the following functions take a snapshot id where they
previously used an index value:
    rbd_dev_snap_info()
    rbd_dev_v1_snap_info()
    rbd_dev_v2_snap_info()

A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:19 -07:00
Alex Elder
9682fc6d3a rbd: look up snapshot name in names buffer
Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.

Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name().  Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:18 -07:00
Alex Elder
dedc81ea84 rbd: drop obj_request->version
Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:17 -07:00
Alex Elder
e2a58ee55b rbd: drop rbd_obj_method_sync() version parameter
Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:16 -07:00
Alex Elder
cc4a38bdd5 rbd: more version parameter removal
Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:15 -07:00
Alex Elder
7097f8df6e rbd: get rid of some version parameters
Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out.  The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that.  As a result, these parameters are never actually
needed or used, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:14 -07:00
Alex Elder
b21ebdddeb rbd: stop tracking header object version
The rbd code takes care to maintain the version of the header
object.  This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.

The mechanism was never fully implemented, however.  And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.

The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both).  There is therefore no need to
maintain the header's object version any more, so stop doing so.

We'll be able to simplify some more rbd code in the next few patches
as a result of this.

This resolves:
    http://tracker.ceph.com/issues/3952

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:13 -07:00
Alex Elder
cb75223d2b rbd: snap names are pointer to constant data
Make explicit that snapshot names don't change by making functions
return and take parameters that that point to const qualified data.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:12 -07:00
Alex Elder
a3fbe5d447 rbd: don't revalidate so much
Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called.  This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion.  Although a an event like this *might*
indicate the image has changed size, most of the time it does not.

Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:11 -07:00
Alex Elder
96882f55c4 rbd: fix up the layering warning message
A warning gets spewed for any image being probed, including parent
images.  Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.

Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead.  People experimenting with layering
will know what the message means.

This is part of:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:10 -07:00
Alex Elder
812164f8c3 ceph: use ceph_create_snap_context()
Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:09 -07:00
Alex Elder
b536f69a3a rbd: set up devices only for mapped images
Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().

A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.

This resolves:
    http://tracker.ceph.com/issues/4774

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:07 -07:00
Alex Elder
8ad42cd0c0 rbd: don't have device release destroy rbd_dev
Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it.  Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:05 -07:00
Alex Elder
6fd48b3be9 rbd: define rbd_dev_unprobe()
Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).

Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.

This means that:
    - rbd_dev_device_release() gets called when the last device
      reference gets dropped;
    - that undoes everything done by the rbd_dev_device_setup() call
      at the end of rbd_dev_image_probe() (and nothing more), ending
      by calling rbd_dev_image_release(); and
    - rbd_dev_image_release() undoes everything else done by
      rbd_dev_image_probe() (and this includes a call to
      rbd_dev_unprobe().

This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:04 -07:00
Alex Elder
200a6a8be5 rbd: don't destroy rbd_dev in device release function
Rename rbd_dev_probe_finish() to be rbd_dev_device_setup().  Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).

Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.

Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:03 -07:00
Alex Elder
79ab7558aa rbd: drop module later
Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:02 -07:00
Alex Elder
b644de2ba0 rbd: set up watch in rbd_dev_image_probe()
Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish().  Move
it all the way up to before doing the initial probe.  This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.

This resolves:
    http://tracker.ceph.com/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:01 -07:00
Alex Elder
96f03e08f9 rbd: don't bother checking whether order changes
When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally.  This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.

An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check.  The order really can't
ever change anyway--this was just a sanity check.  So get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:00 -07:00
Alex Elder
0d8189e175 rbd: don't clean up watch in device release function
Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped.  Instead, tear
it down when removing the device.  If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.

All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent().  For now, ignore
any errors that occur in this case.

Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:59 -07:00
Alex Elder
332bb12db9 rbd: define rbd_header_name()
Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:58 -07:00
Alex Elder
9bb81c9be9 rbd: move more initialization into rbd_dev_image_probe()
Move a block of initialization related to the "ceph-side" of an rbd
image out of rbd_dev_probe_finish() and into rbd_dev_image_probe().

Add appropriate error handling to clean things up in the event any
of these new functions return an error.

We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and
rbd_dev_probe_parent() all clean up after themselves before they
return an error, so no special cleanup is required except when an
earlier call succeeds.  Since rbd_dev_spec_update() only updates the
spec field (whose cleanup will be handled by dropping the last
reference to the spec) there is no cleanup action associatied with
that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:57 -07:00
Alex Elder
5de10f3b0c rbd: probe for the parent earlier
Probe for a parent device earlier in rbd_dev_probe_finish(), before
starting to set up the Linux side of the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:56 -07:00
Alex Elder
2e93bf9e46 rbd: remove parent devices on probe error
When an error occurs while finishing probing a device it is assumed
that parent devices get cleaned up when deleting a device.  They
don't.  Add a call to clean them up.  Note that this means the
parent spec will already be cleaned up so it doesn't have to be
in one of the rbd_add() error paths.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:55 -07:00