TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bits one
to 16 bits.
Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Pull Ceph updates from Sage Weil:
"There is the long-awaited discard support for RBD (Guangliang Zhao,
Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
(Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
performance and debugging improvements (Yan, Zheng, John Spray), and a
smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
ceph: fix divide-by-zero in __validate_layout()
rbd: rbd workqueues need a resque worker
libceph: ceph-msgr workqueue needs a resque worker
ceph: fix bool assignments
libceph: separate multiple ops with commas in debugfs output
libceph: sync osd op definitions in rados.h
libceph: remove redundant declaration
ceph: additional debugfs output
ceph: export ceph_session_state_name function
ceph: include the initial ACL in create/mkdir/mknod MDS requests
ceph: use pagelist to present MDS request data
libceph: reference counting pagelist
ceph: fix llistxattr on symlink
ceph: send client metadata to MDS
ceph: remove redundant code for max file size verification
ceph: remove redundant io_iter_advance()
ceph: move ceph_find_inode() outside the s_mutex
ceph: request xattrs if xattr_version is zero
rbd: set the remaining discard properties to enable support
rbd: use helpers to handle discard for layered images correctly
...
Fix some coccinelle warnings:
fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Current code set new file/directory's initial ACL in a non-atomic
manner.
Client first sends request to MDS to create new file/directory, then set
the initial ACL after the new file/directory is successfully created.
The fix is include the initial ACL in create/mkdir/mknod MDS requests.
So MDS can handle creating file/directory and setting the initial ACL in
one request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Current code uses page array to present MDS request data. Pages in the
array are allocated/freed by caller of ceph_mdsc_do_request(). If request
is interrupted, the pages can be freed while they are still being used by
the request message.
The fix is use pagelist to present MDS request data. Pagelist is
reference counted.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
this allow pagelist to present data that may be sent multiple times.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax,
which includes additional client metadata to allow
the MDS to report on clients by user-sensible names
like hostname.
Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Both ceph_update_writeable_page and ceph_setattr will verify file size
with max size ceph supported.
There are two caller for ceph_update_writeable_page, ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
chance to change file size when mmap. Likewise we have already verified the size
in inode_change_ok when we call ceph_setattr.
So let's remove the redundant code for max file size verification.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex
may cause deadlock. (the freeing inode is waiting for OSD read reply, but
dispatch thread is blocked by the s_mutex)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Following sequence of events can happen.
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply
doesn't contain xattrs because MDS didn't receive the cap release
message and thought client already has up-to-data xattrs.
The fix is force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
we may corrupt waiting list if a request in the waiting list is kicked.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
So the recovering MDS does not need to fetch these ununsed inodes during
cache rejoin. This may reduce MDS recovery time.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.
Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries. Using d_invalidate ensures that paths
to mount points will not be dropped.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull Ceph updates from Sage Weil:
"There is a lot of refactoring and hardening of the libceph and rbd
code here from Ilya that fix various smaller bugs, and a few more
important fixes with clone overlap. The main fix is a critical change
to the request_fn handling to not sleep that was exposed by the recent
mutex changes (which will also go to the 3.16 stable series).
Yan Zheng has several fixes in here for CephFS fixing ACL handling,
time stamps, and request resends when the MDS restarts.
Finally, there are a few cleanups from Himangi Saraogi based on
Coccinelle"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
rbd: remove extra newlines from rbd_warn() messages
rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
rbd: rework rbd_request_fn()
ceph: fix kick_requests()
ceph: fix append mode write
ceph: fix sizeof(struct tYpO *) typo
ceph: remove redundant memset(0)
rbd: take snap_id into account when reading in parent info
rbd: do not read in parent info before snap context
rbd: update mapping size only on refresh
rbd: harden rbd_dev_refresh() and callers a bit
rbd: split rbd_dev_spec_update() into two functions
rbd: remove unnecessary asserts in rbd_dev_image_probe()
rbd: introduce rbd_dev_header_info()
rbd: show the entire chain of parent images
ceph: replace comma with a semicolon
rbd: use rbd_segment_name_free() instead of kfree()
ceph: check zero length in ceph_sync_read()
ceph: reset r_resend_mds after receiving -ESTALE
...
There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.
This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.
It isn't really a serious problem, but it would really be clearer if we
reserved DCACHE_DISCONNECTED for those cases where it's actually needed.
In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
manually in 3a0dfa6a12 "Btrfs: unset DCACHE_DISCONNECTED when mounting
default subvol", and this replaces that workaround.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__do_request() may unregister the request. So we should update
iterator 'p' before calling __do_request()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
generic_write_checks() may update 'pos', so we need to pass 'pos'
to ceph_sync_write() and ceph_sync_direct_write();
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
xattrs array of pointers is allocated with kcalloc() - no need to
memset() it to 0 right after that.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Replace a comma between expression statements by a semicolon. This changes
the semantics of the code, but given the current indentation appears to be
what is intended.
A simplified version of the Coccinelle semantic patch that performs this
transformation is as follows:
// <smpl>
@r@
expression e1,e2;
@@
e1
-,
+;
e2;
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Pull Ceph updates from Sage Weil:
"This has a mix of bug fixes and cleanups.
Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT
check when a second rbd image is mapped and a couple memory leaks.
Zheng fixes several issues with fragmented directories and multiple
MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's
patches fix setting and unsetting RBD images read-only.
Naturally there are several other cleanups mixed in for good measure"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
rbd: only set disk to read-only once
rbd: move calls that may sleep out of spin lock range
rbd: add ioctl for rbd
ceph: use truncate_pagecache() instead of truncate_inode_pages()
ceph: include time stamp in every MDS request
rbd: fix ida/idr memory leak
rbd: use reference counts for image requests
rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
rbd: make sure we have latest osdmap on 'rbd map'
libceph: add ceph_monc_wait_osdmap()
libceph: mon_get_version request infrastructure
libceph: recognize poolop requests in debugfs
ceph: refactor readpage_nounlock() to make the logic clearer
mds: check cap ID when handling cap export message
ceph: remember subtree root dirfrag's auth MDS
ceph: introduce ceph_fill_fragtree()
ceph: handle cap import atomically
ceph: pre-allocate ceph_cap struct for ceph_add_cap()
ceph: update inode fields according to issued caps
rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
...
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
Replace seq_printf where possible.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the last pr_warning callsites in fs branch
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently modified the client/MDS protocol to include a timestamp in the
client request. This allows ctime updates to follow the client's clock
in most cases, which avoids subtle problems when clocks are out of sync
and timestamps are updated sometimes by the MDS clock (for most requests)
and sometimes by the client clock (for cap writeback).
Signed-off-by: Sage Weil <sage@inktank.com>
If the return value of ceph_osdc_readpages() is not negative,
it is certainly greater than or equal to zero.
Remove the useless condition judgment and redundant braces.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
handle following sequence of events:
- mds0 exports an inode to mds1. client receives the cap import
message from mds1. caps from mds0 are removed while handling
the cap import message.
- mds1 exports an inode to mds0. client receives the cap export
message from mds1. handle_cap_export() adds placeholder caps
for mds0
- client receives the first cap export message (for exporting
inode from mds0 to mds1)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Move the code that update the i_fragtree into a separate function.
Also add simple probabilistic test to decide whether the i_fragtree
should be updated
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
cap import messages are processed by both handle_cap_import() and
handle_cap_grant(). These two functions are not executed in the same
atomic context, so they can races with cap release.
The fix is make handle_cap_import() not release the i_ceph_lock when
it returns. Let handle_cap_grant() release the lock after it finishes
its job.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handle cap import/export.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Cap message and request reply from non-auth MDS may carry stale
information (corresponding locks are in LOCK states) even they
have the newest inode version. So client should update inode fields
according to issued caps.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
cap grant/revoke message from non-auth MDS can update inode's size
and truncate_seq/truncate_size. (the message arrives before auth
MDS's cap trunc message)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
posix_acl_xattr_set() already does the check, and it's the only
way to feed in an ACL from userspace.
So the check here is useless, remove it.
Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Currently, the fl_owner isn't set for flock locks. Some filesystems use
byte-range locks to simulate flock locks and there is a common idiom in
those that does:
fl->fl_owner = (fl_owner_t)filp;
fl->fl_start = 0;
fl->fl_end = OFFSET_MAX;
Since flock locks are generally "owned" by the open file description,
move this into the common flock lock setup code. The fl_start and fl_end
fields are already set appropriately, so remove the unneeded setting of
that in flock ops in those filesystems as well.
Finally, the lease code also sets the fl_owner as if they were owned by
the process and not the open file description. This is incorrect as
leases have the same ownership semantics as flock locks. Set them the
same way. The lease code doesn't actually use the fl_owner value for
anything, so this is more for consistency's sake than a bugfix.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (Staging portion)
Acked-by: J. Bruce Fields <bfields@fieldses.org>