Commit Graph

549 Commits

Author SHA1 Message Date
Christoph Hellwig 7ad18afad0 rbd: convert to blk-mq
This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparism available was an old
3.10 kernel which gave 80000iops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: context, blk_mq_init_queue() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-02-19 14:27:42 +03:00
Ilya Dryomov cf32bd9c86 rbd: do not treat standalone as flatten
If the clone is resized down to 0, it becomes standalone.  If such
resize is carried over while an image is mapped we would detect this
and call rbd_dev_parent_put() which means "let go of all parent state,
including the spec(s) of parent images(s)".  This leads to a mismatch
between "rbd info" and sysfs parent fields, so a fix is in order.

    # rbd create --image-format 2 --size 1 foo
    # rbd snap create foo@snap
    # rbd snap protect foo@snap
    # rbd clone foo@snap bar
    # DEV=$(rbd map bar)
    # rbd resize --allow-shrink --size 0 bar
    # rbd resize --size 1 bar
    # rbd info bar | grep parent
            parent: rbd/foo@snap

Before:

    # cat /sys/bus/rbd/devices/0/parent
    (no parent image)

After:

    # cat /sys/bus/rbd/devices/0/parent
    pool_id 0
    pool_name rbd
    image_id 10056b8b4567
    image_name foo
    snap_id 2
    snap_name snap
    overlap 0

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19 13:31:39 +03:00
Ilya Dryomov 73e39e4dba rbd: fix error paths in rbd_dev_refresh()
header_rwsem should be released on errors.  Also remove useless
rbd_dev->mapping.size != rbd_dev->header.image_size test.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:38 +03:00
Rickard Strandqvist 3a25cf43e0 rbd: nuke copy_token()
It's been largely superseded by dup_token() and unused for over
2 years, identified by cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[idryomov@redhat.com: changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:38 +03:00
Ilya Dryomov e69b8d414f rbd: drop parent_ref in rbd_dev_unprobe() unconditionally
This effectively reverts the last hunk of 392a9dad7e ("rbd: detect
when clone image is flattened").

The problem with parent_overlap != 0 condition is that it's possible
and completely valid to have an image with parent_overlap == 0 whose
parent state needs to be cleaned up on unmap.  The next commit, which
drops the "clone image now standalone" logic, opens up another window
of opportunity to hit this, but even without it

    # cat parent-ref.sh
    #!/bin/bash
    rbd create --image-format 2 --size 1 foo
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd resize --allow-shrink --size 0 bar
    rbd resize --size 1 bar
    DEV=$(rbd map bar)
    rbd unmap $DEV

leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
hanging around.

My thinking behind calling rbd_dev_parent_put() unconditionally is that
there shouldn't be any requests in flight at that point in time as we
are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
by flatten is delayed by in-flight requests, it will have finished by
the time we reach rbd_dev_unprobe() caused by unmap, thus turning
unconditional rbd_dev_parent_put() into a no-op.

Fixes: http://tracker.ceph.com/issues/10352

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-01-28 16:12:02 +03:00
Ilya Dryomov ae43e9d05e rbd: fix rbd_dev_parent_get() when parent_overlap == 0
The comment for rbd_dev_parent_get() said

    * We must get the reference before checking for the overlap to
    * coordinate properly with zeroing the parent overlap in
    * rbd_dev_v2_parent_info() when an image gets flattened.  We
    * drop it again if there is no overlap.

but the "drop it again if there is no overlap" part was missing from
the implementation.  This lead to absurd parent_ref values for images
with parent_overlap == 0, as parent_ref was incremented for each
img_request and virtually never decremented.

Fix this by leveraging the fact that refresh path calls
rbd_dev_v2_parent_info() under header_rwsem and use it for read in
rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
they'd pair with now and I suspect we are in a pretty miserable
situation as far as proper locking goes regardless.

Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-01-28 16:11:51 +03:00
Ilya Dryomov 7e868b6eff rbd: don't treat CEPH_OSD_OP_DELETE as extent op
CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such.  This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-12-17 20:09:51 +03:00
SF Markus Elfring e96a650a81 ceph, rbd: delete unnecessary checks before two function calls
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:50 +03:00
Jan Kara a8d4205623 rbd: Fix error recovery in rbd_obj_read_sync()
When we fail to allocate page vector in rbd_obj_read_sync() we just
basically ignore the problem and continue which will result in an oops
later. Fix the problem by returning proper error.

CC: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@inktank.com>
CC: ceph-devel@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226882
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-10-30 13:11:51 +03:00
Ilya Dryomov f5ee37bd31 rbd: use a single workqueue for all devices
Using one queue per device doesn't make much sense given that our
workfn processes "devices" and not "requests".  Switch to a single
workqueue for all devices.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-30 13:11:25 +03:00
Ilya Dryomov 792c3a9149 rbd: rbd workqueues need a resque worker
Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.

Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:57:05 -07:00
Josh Durgin b76f82398c rbd: set the remaining discard properties to enable support
max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:37 +04:00
Josh Durgin d3246fb0da rbd: use helpers to handle discard for layered images correctly
Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes.  Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:36 +04:00
Josh Durgin 3b434a2aff rbd: extract a method for adding object operations
rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.

For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.

op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:35 +04:00
Josh Durgin 1c220881e3 rbd: make discard trigger copy-on-write
Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:34 +04:00
Josh Durgin d0265de7c3 rbd: tolerate -ENOENT for discard operations
Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:33 +04:00
Josh Durgin bef95455a4 rbd: fix snapshot context reference count for discards
Discards take a reference to the snapshot context of an image when
they are created.  This reference needs to be cleaned up when the
request is done just as it is for regular writes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:32 +04:00
Josh Durgin 3c5df89367 rbd: read image size for discard check safely
In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:32 +04:00
Guangliang Zhao 90e98c5229 rbd: initial discard bits from Guangliang Zhao
This patch add the discard support for rbd driver.

There are three types operation in the driver:
1. The objects would be removed if they completely contained
   within the discard range.
2. The objects would be truncated if they partly contained within
   the discard range, and align with their boundary.
3. Others would be zeroed.

A discard request from blkdev_issue_discard() is defined which
REQ_WRITE and REQ_DISCARD both marked and no data, so we must
check the REQ_DISCARD first when getting the request type.

This resolve:
	http://tracker.ceph.com/issues/190

[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
  commits by Josh Durgin for refinements and fixes which weren't
  folded in to preserve authorship. ]

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-10-14 21:03:31 +04:00
Guangliang Zhao 6d2940c881 rbd: extend the operation type
It could only handle the read and write operations now,
extend it for the coming discard support.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-10-14 21:03:30 +04:00
Guangliang Zhao c622d22615 rbd: skip the copyup when an entire object writing
It need to copyup the parent's content when layered writing,
but an entire object write would overwrite it, so skip it.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-10-14 21:03:29 +04:00
Ilya Dryomov 70d045f660 rbd: add img_obj_request_simple() helper
To clarify the conditions and make it easier to add new ones.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-10-14 21:03:28 +04:00
Josh Durgin 4e752f0ab0 rbd: access snapshot context and mapping size safely
These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized.  They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.

Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-10-14 21:03:27 +04:00
Ilya Dryomov 7dd440c9e0 rbd: do not return -ERANGE on auth failures
Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors.  (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-10-14 21:03:26 +04:00
Wei Yongjun 255939e783 rbd: fix error return code in rbd_dev_device_setup()
Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.

Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
2014-09-10 11:59:06 +04:00
Ilya Dryomov 58d1362b50 rbd: avoid format-security warning inside alloc_workqueue()
drivers/block/rbd.c: In function ‘rbd_dev_device_setup’:
drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-09-10 11:59:06 +04:00
Ilya Dryomov 9584d50826 rbd: remove extra newlines from rbd_warn() messages
rbd_warn() string should be a single line - rbd_warn() appends \n.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-08-07 16:01:09 +04:00
Ilya Dryomov 7a716aac01 rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
Now that rbd_img_request_create() is called from work functions, no
need to use GFP_ATOMIC.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-08-07 16:00:52 +04:00
Ilya Dryomov bc1ecc65a2 rbd: rework rbd_request_fn()
While it was never a good idea to sleep in request_fn(), commit
34c6bc2c91 ("locking/mutexes: Add extra reschedule point") made it
a *bad* idea.  mutex_lock() since 3.15 may reschedule *before* putting
task on the mutex wait queue, which for tasks in !TASK_RUNNING state
means block forever.  request_fn() may be called with !TASK_RUNNING on
the way to schedule() in io_schedule().

Offload request handling to a workqueue, one per rbd device, to avoid
calling blocking primitives from rbd_request_fn().

Fixes: http://tracker.ceph.com/issues/8818

Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Tested-by: Eric Eastman <eric0e@aol.com>
Tested-by: Greg Wilson <greg.wilson@keepertech.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-08-07 14:56:20 +04:00
Ilya Dryomov 4d9b67cddd rbd: take snap_id into account when reading in parent info
If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image.  Not doing so may
in particular result in us returning zeros instead of user data:

    # cat overlap-snap.sh
    #!/bin/bash
    rbd create --size 10 --image-format 2 foo
    FOO_DEV=$(rbd map foo)
    dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
    echo "Base image"
    dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd snap create foo@snap
    rbd snap protect foo@snap
    rbd clone foo@snap bar
    rbd snap create bar@snap
    BAR_DEV=$(rbd map bar@snap)
    echo "Snapshot"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
    rbd resize --allow-shrink --size 4 bar
    echo "Snapshot after base image resize"
    dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd

    # ./overlap-snap.sh
    Base image
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
    Snapshot
    0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K"%`4(E..6.
    Resizing image: 100% complete...done.
    Snapshot after base image resize
    0000000: e781 e33b d34b 2225 0000 0000 0000 0000  ...;.K"%........

Even though bar@snap is taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes.  Fix it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:16:50 +04:00
Ilya Dryomov e8f59b595d rbd: do not read in parent info before snap context
Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in.  This is wrong, because we may need to look at the
the parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit.  (When
mapping a snapshot, all we got is its name and we need the snap context
to translate that name into an id to know which parent info to look
for.)

The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in.  The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained.  The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:16:14 +04:00
Ilya Dryomov 5ff1108ccc rbd: update mapping size only on refresh
There is no sense in trying to update the mapping size before it's even
been set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:15:58 +04:00
Ilya Dryomov 52bb1f9bed rbd: harden rbd_dev_refresh() and callers a bit
Recently discovered watch/notify problems showed that we really can't
ignore errors in anything refresh related.  Alas, currently there is
not much we can do in response to those errors, except print warnings.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:15:44 +04:00
Ilya Dryomov 0407759971 rbd: split rbd_dev_spec_update() into two functions
rbd_dev_spec_update() has two modes of operation, with nothing in
common between them.  Split it into two functions, one for each mode
and make our expectations more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:15:35 +04:00
Ilya Dryomov 7626eb7d82 rbd: remove unnecessary asserts in rbd_dev_image_probe()
spec->image_id assert doesn't buy us much and image_format is asserted
in rbd_dev_header_name() and rbd_dev_header_info() anyway.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:15:18 +04:00
Ilya Dryomov a720ae0901 rbd: introduce rbd_dev_header_info()
A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:15:00 +04:00
Ilya Dryomov ff96128fb0 rbd: show the entire chain of parent images
Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent
images.  While at it, kernel sprintf() doesn't return negative values,
casting to unsigned long long is no longer necessary and there is no
good reason to split into multiple sprintf() calls.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-25 13:14:27 +04:00
Himangi Saraogi 7d5079aa8b rbd: use rbd_segment_name_free() instead of kfree()
Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree. The helper rbd_segment_name_free does the job here.
Its position is shifted above the calling function.

The Coccinelle semantic patch that detects this change is as follows:

// <smpl>
@@
expression x,E,c;
@@

 x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
 ... when != x = E
     when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-07-24 11:57:27 +04:00
Ilya Dryomov fbba11b3be rbd: do not leak image_id in rbd_dev_v2_parent_info()
image_id is leaked if the parent happens to have been recorded already.
Fix it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:46 +04:00
Ilya Dryomov 76756a51e2 rbd: use rbd_obj_watch_request_helper() helper
Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix
rbd_dev_header_unwatch_sync() to destroy watch_request structures
before queuing watch-remove message while at it.  This mistake slipped
into commit b30a01f2a3 ("rbd: fix osd_request memory leak in
__rbd_dev_header_watch_sync()") and could lead to "image still in use"
errors on image removal.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:45 +04:00
Ilya Dryomov bb040aa03c rbd: add rbd_obj_watch_request_helper() helper
In the past, rbd_dev_header_watch_sync() used to handle both watch and
unwatch requests and was entangled and leaky.  Commit b30a01f2a3
("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()")
split it into two separate functions.  This commit cleanly abstracts
the common bits, relying on the fixed rbd_obj_request_wait().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:45 +04:00
Ilya Dryomov 71c20a066f rbd: rbd_obj_request_wait() should cancel the request if interrupted
rbd_obj_request_wait() should cancel the underlying OSD request if
interrupted.  Otherwise libceph will hold onto it indefinitely, causing
assert failures or leaking the original object request.

This also adds an rbd wrapper around ceph_osdc_cancel_request() to
match rbd_obj_request_submit() and rbd_obj_request_wait().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:45 +04:00
Ilya Dryomov 9638556a27 rbd: handle parent_overlap on writes correctly
The following check in rbd_img_obj_request_submit()

    rbd_dev->parent_overlap <= obj_request->img_offset

allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object.  This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data.  Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:

    rbd_data.<id>.0000000000000001
     ---------------------|----------------------|------------
    | should be copyup'ed | should be zeroed out | write ...
     ---------------------|----------------------|------------
   4M                    5M                     6M
                    parent_overlap    obj_request->img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.

Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-23 12:55:37 +04:00
Josh Durgin 22001f619f rbd: only set disk to read-only once
rbd_open(), called every time the device is opened, calls
set_device_ro().  There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-10 18:09:25 -07:00
Josh Durgin 77f33c0373 rbd: move calls that may sleep out of spin lock range
get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if theye are called while a spin lock is held.

Move the acquisition and release of rbd_dev->lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-10 18:09:25 -07:00
Guangliang Zhao 131fd9f6fc rbd: add ioctl for rbd
When running the following commands:
    [root@ceph0 mnt]# blockdev --setro /dev/rbd1
    [root@ceph0 mnt]# blockdev --getro /dev/rbd1
    0

The block setro didn't take effect, it is because
the rbd doesn't support ioctl of block driver.

This resolves:
	http://tracker.ceph.com/issues/6265

Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-10 18:09:24 -07:00
Ilya Dryomov ffe312cf31 rbd: fix ida/idr memory leak
ida_destroy() needs to be called on module exit to release ida caches.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-06 09:30:00 +08:00
Alex Elder 0f2d5be792 rbd: use reference counts for image requests
Each image request contains a reference count, but to date it has
not actually been used.  (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.

Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function.  That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if the object request if
appropriate.  If that "done" flag is set, the object request is
passed to rbd_obj_request_complete().

In rbd_obj_request_complete(), requests are processed in sequential
order.  So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on.  Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on.  If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.

There is a race here, however.  The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down.  And this can happen *before* that object has
itself entered rbd_img_obj_end_request().  As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.

All that's necessary to avoid this is to properly count references
to the image requests.  We tear down an image request's object
requests all at once--only when the entire image request has
completed.  So there's no need for an image request to count
references for its object requests.  However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback().  In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.

So the fix is to add a reference to an image request for
each of its object requests.  The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback().  That is emphasized by
getting the reference right after assigning that as the image
object's callback function.  The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06 09:29:59 +08:00
Ilya Dryomov b30a01f2a3 rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
osd_request, along with r_request and r_reply messages attached to it
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist.  This is because lingering requests are special and get
an extra ref in the reply path.  Fix it by unregistering linger request
on the error path and split __rbd_dev_header_watch_sync() into two
functions to make it maintainable.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06 09:29:59 +08:00
Ilya Dryomov 30ba1f0202 rbd: make sure we have latest osdmap on 'rbd map'
Given an existing idle mapping (img1), mapping an image (img2) in
a newly created pool (pool2) fails:

    $ ceph osd pool create pool1 8 8
    $ rbd create --size 1000 pool1/img1
    $ sudo rbd map pool1/img1
    $ ceph osd pool create pool2 8 8
    $ rbd create --size 1000 pool2/img2
    $ sudo rbd map pool2/img2
    rbd: sysfs write failed
    rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client.  The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-06-06 09:29:58 +08:00