linux

Commit Graph

Author	SHA1	Message	Date
Sage Weil	d52f847a84	ceph: rewrite msgpool using mempool_t Since we don't need to maintain large pools of messages, we can just use the standard mempool_t. We maintain a msgpool 'wrapper' because we need the mempool_t* in the alloc function, and mempool gives us only pool_data. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:18 -07:00
Cheng Renquan	640ef79d27	ceph: use ceph_sb_to_client instead of ceph_client ceph_sb_to_client and ceph_client are really identical, we need to dump one; while function ceph_client is confusing with "struct ceph_client", ceph_sb_to_client's definition is more clear; so we'd better switch all call to ceph_sb_to_client. -static inline struct ceph_client ceph_client(struct super_block sb) -{ - return sb->s_fs_info; -} Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:17 -07:00
Cheng Renquan	2d06eeb877	ceph: handle kzalloc() failure Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:16 -07:00
Sage Weil	7c315c552c	ceph: drop unnecessary msgpool for mon_client subscribe_ack Preallocate a single message to reuse instead. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:16 -07:00
Sage Weil	6694d6b95c	ceph: drop unnecessary msgpool for mon_client auth_reply Preallocate a single reply message that we can reuse instead. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:15 -07:00
Sage Weil	3143edd3a1	ceph: clean up statfs Avoid unnecessary msgpool. Preallocate reply. Fix use-after-free race. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:15 -07:00
Sage Weil	6f46cb2935	ceph: fix theoretically possible double-put on connection This would only trigger if we bailed out before resetting r_con_filling_msg because the server reply was corrupt (oversized). Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:14 -07:00
Dan Carpenter	c7708075f1	ceph: cleanup: remove dead code "xattr" is never NULL here. We took care of that in the previous if statement block. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:14 -07:00
Sage Weil	104648ad3f	ceph: reduce build_path debug output Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:13 -07:00
Yehuda Sadeh	31459fe4b2	ceph: use __page_cache_alloc and add_to_page_cache_lru Following Nick Piggin patches in btrfs, pagecache pages should be allocated with __page_cache_alloc, so they obey pagecache memory policies. Also, using add_to_page_cache_lru instead of using a private pagevec where applicable. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:12 -07:00
Stephen Rothwell	f553069e5d	ceph: update for removal of kref_set Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:12 -07:00
Sage Weil	21b667f69b	ceph: simplify page setup for incoming data Drop largely useless helper __prepare_pages(), and simplify sanity checks. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:11 -07:00
Sage Weil	81a6cf2d30	ceph: invalidate affected dentry leases on aborted requests If we abort a request, we return to caller, but the request may still complete. And if we hold the dir FILE_EXCL bit, we may not release a lease when sending a request. A simple un-tar, control-c, un-tar again will reproduce the bug (manifested as a 'Cannot open: File exists'). Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so we don't have valid (but incorrect) leases. Do the same, consistently, at other sites where I_COMPLETE is similarly cleared. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 10:25:45 -07:00
Sage Weil	b4556396fa	ceph: fix race between aborted requests and fill_trace When we abort requests we need to prevent fill_trace et al from doing anything that relies on locks held by the VFS caller. This fixes a race between the reply handler and the abort code, ensuring that continue holding the dir mutex until the reply handler completes. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 10:25:45 -07:00
Sage Weil	e1518c7c0a	ceph: clean up mds reply, error handling We would occasionally BUG out in the reply handler because r_reply was nonzero, due to a race with ceph_mdsc_do_request temporarily setting r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong in the EIO case. Clean up by consistently using r_err for errors and r_reply for messages. Also fix the abort logic to trigger consistently for all errors that return to the caller early (e.g., EIO from timeout case). If an abort races with a reply, use the result from the reply. Also fix locking for r_err, r_reply update in the reply handler. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 10:25:44 -07:00
Sage Weil	e84346b726	ceph: preserve seq # on requeued messages after transient transport errors If the tcp connection drops and we reconnect to reestablish a stateful session (with the mds), we need to resend previously sent (and possibly received) messages with the _same_ seq # so that they can be dropped on the other end if needed. Only assign a new seq once after the message is queued. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 21:20:38 -07:00
Sage Weil	f818a73674	ceph: fix cap removal races The iterate_session_caps helper traverses the session caps list and tries to grab an inode reference. However, the __ceph_remove_cap was clearing the inode backpointer _before_ removing itself from the session list, causing a null pointer dereference. Clear cap->ci under protection of s_cap_lock to avoid the race, and to tightly couple the list and backpointer state. Use a local flag to indicate whether we are releasing the cap, as cap->session may be modified by a racing thread in iterate_session_caps. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 20:56:31 -07:00
Sage Weil	45c6ceb547	ceph: zero unused message header, footer fields We shouldn't leak any prior memory contents to other parties. And random data, particularly in the 'version' field, can cause problems down the line. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 15:17:40 -07:00
Sage Weil	9abf82b8bc	ceph: fix locking for waking session requests after reconnect The session->s_waiting list is protected by mdsc->mutex, not s_mutex. This was causing (rare) s_waiting list corruption. Fix errors paths too, while we're here. A more thorough cleanup of this function is coming soon. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 09:53:57 -07:00
Sage Weil	d85b705663	ceph: resubmit requests on pg mapping change (not just primary change) OSD requests need to be resubmitted on any pg mapping change, not just when the pg primary changes. Resending only when the primary changes results in occasional 'hung' requests during osd cluster recovery or rebalancing. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 09:53:56 -07:00
Sage Weil	04d000eb35	ceph: fix open file counting on snapped inodes when mds returns no caps It's possible the MDS will not issue caps on a snapped inode, in which case an open request may not __ceph_get_fmode(), botching the open file counting. (This is actually a server bug, but the client shouldn't BUG out in this case.) Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 09:53:55 -07:00
Sage Weil	0ceed5db32	ceph: unregister osd request on failure The osd request wasn't being unregistered when the osd returned a failure code, even though the result was returned to the caller. This would cause it to eventually time out, and then crash the kernel when it tried to resend the request using a stale page vector. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 09:53:18 -07:00
Sage Weil	54ad023ba8	ceph: don't use writeback_control in writepages completion The ->writepages writeback_control is not still valid in the writepages completion. We were touching it solely to adjust pages_skipped when there was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials), causing an oops in the writeback code shortly thereafter. Updating pages_skipped on error isn't correct anyway, so let's just rip out this (clearly broken) code to pass the wbc to the completion. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-05 21:31:40 -07:00
Sage Weil	5dfc589a84	ceph: unregister bdi before kill_anon_super releases device name Unregister and destroy the bdi in put_super, after mount is r/o, but before put_anon_super releases the device name. For symmetry, bdi_destroy in destroy_client (we bdi_init in create_client). Only set s_bdi if bdi_register succeeds, since we use it to decide whether to bdi_unregister. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-04 16:14:46 -07:00
Sage Weil	b0930f8d38	ceph: remove bad auth_x kmem_cache It's useless, since our allocations are already a power of 2. And it was allocated per-instance (not globally), which caused a name collision when we tried to mount a second file system with auth_x enabled. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:25 -07:00
Sage Weil	7ff899da02	ceph: fix lockless caps check The __ variant requires caller to hold i_lock. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:25 -07:00
Sage Weil	ea1409f961	ceph: clear dir complete, invalidate dentry on replayed rename If a rename operation is resent to the MDS following an MDS restart, the client does not get a full reply (containing the resulting metadata) back. In that case, a ceph_rename() needs to compensate by doing anything useful that fill_inode() would have, like d_move(). It also needs to invalidate the dentry (to workaround the vfs_rename_dir() bug) and clear the dir complete flag, just like fill_trace(). Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:25 -07:00
Sage Weil	5c6a2cdb4f	ceph: fix direct io truncate offset truncate_inode_pages_range wants the end offset to align with the last byte in a page. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:25 -07:00
Sage Weil	ae18756b9f	ceph: discard incoming messages with bad seq # We can get old message seq #'s after a tcp reconnect for stateful sessions (i.e., the MDS). If we get a higher seq #, that is an error, and we shouldn't see any bad seq #'s for stateless (mon, osd) connections. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Sage Weil	684be25c52	ceph: fix seq counting for skipped messages Increment in_seq even when the message is skipped for some reason. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Sage Weil	d45d0d970f	ceph: add missing #includes Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Sage Weil	0b0c06d147	ceph: fix leaked spinlock during mds reconnect Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:23 -07:00
Sage Weil	c8f16584ac	ceph: print more useful version info on module load Decouple the client version from the server side. Print relevant protocol and map version info instead. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:23 -07:00
Sage Weil	91dee39eeb	ceph: fix snap realm splits The snap realm split was checking i_snap_realm, not the list_head, to determine if an inode belonged in the new realm. The check always failed, which meant we always moved the inode, corrupting the old realm's list and causing various crashes. Also wait to release old realm reference to avoid possibility of use after free. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:23 -07:00
Sage Weil	c10f5e12ba	ceph: clear dir complete on d_move d_move() reorders the d_subdirs list, breaking the readdir result caching. Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on rename. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:22 -07:00
Linus Torvalds	96e35b40c0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use separate class for ceph sockets' sk_lock ceph: reserve one more caps space when doing readdir ceph: queue_cap_snap should always queue dirty context ceph: fix dentry reference leak in dcache readdir ceph: decode v5 of osdmap (pool names) [protocol change] ceph: fix ack counter reset on connection reset ceph: fix leaked inode ref due to snap metadata writeback race ceph: fix snap context reference leaks ceph: allow writeback of snapped pages older than 'oldest' snapc ceph: fix dentry rehashing on virtual .snap dir	2010-04-14 18:45:31 -07:00
Sage Weil	a6a5349d17	ceph: use separate class for ceph sockets' sk_lock Use a separate class for ceph sockets to prevent lockdep confusion. Because ceph sockets only get passed kernel pointers, there is no dependency from sk_lock -> mmap_sem. If we share the same class as other sockets, lockdep detects a circular dependency from mmap_sem (page fault) -> fs mutex -> sk_lock -> mmap_sem because dependencies are noted from both ceph and user contexts. Using a separate class prevents the sk_lock(ceph) -> mmap_sem dependency and makes lockdep happy. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-13 14:07:07 -07:00
Yehuda Sadeh	e1e4dd0caa	ceph: reserve one more caps space when doing readdir We were missing space for the directory cap. The result was a BUG at fs/ceph/caps.c:2178. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-13 12:28:54 -07:00
Sage Weil	fc837c8f04	ceph: queue_cap_snap should always queue dirty context This simplifies the calling convention, and fixes a bug where we queue a capsnap with a context other than i_head_snapc (the one that matches the dirty pages). The result was a BUG at fs/ceph/caps.c:2178 on writeback completion when a capsnap matching the writeback snapc could not be found. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-13 12:28:31 -07:00
Sage Weil	f5b066287c	ceph: fix dentry reference leak in dcache readdir When filldir returned an error (e.g. buffer full for a large directory), we would leak a dentry reference, causing an oops on umount. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-12 14:25:51 -07:00
Sage Weil	2844a76a25	ceph: decode v5 of osdmap (pool names) [protocol change] Teach the client to decode an updated format for the osdmap. The new format includes pool names, which will be useful shortly. Get this change in earlier rather than later. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-09 15:50:58 -07:00
Sage Weil	0e0d5e0c4b	ceph: fix ack counter reset on connection reset If in_seq_acked isn't reset along with in_seq, we don't ack received messages until we reach the old count, consuming gobs memory on the other end of the connection and introducing a large delay when those messages are eventually deleted. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-02 16:07:19 -07:00
Sage Weil	819ccbfa44	ceph: fix leaked inode ref due to snap metadata writeback race We create a ceph_cap_snap if there is dirty cap metadata (for writeback to mds) OR dirty pages (for writeback to osd). It is thus possible that the metadata has been written back to the MDS but the OSD data has not when the cap_snap is created. This results in a cap_snap with dirty(caps) == 0. The problem is that cap writeback to the MDS isn't necessary, and a FLUSHSNAP cap op gets no ack from the MDS. This leaves the cap_snap attached to the inode along with its inode reference. Fix the problem by dropping the cap_snap if it becomes 'complete' (all pages written out) and dirty(caps) == 0 in ceph_put_wrbuffer_cap_refs(). Also, BUG() in __ceph_flush_snaps() if we encounter a cap_snap with dirty(caps) == 0. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-01 09:34:38 -07:00
Sage Weil	6298a33757	ceph: fix snap context reference leaks The get_oldest_context() helper takes a reference to the returned snap context, but most callers weren't dropping that reference. Fix them. Also drop the unused locked __get_oldest_context() variant. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-01 09:34:37 -07:00
Sage Weil	80e755fede	ceph: allow writeback of snapped pages older than 'oldest' snapc On snap deletion, we don't regenerate ceph_cap_snaps for inodes with dirty pages because deletion does not affect metadata writeback. However, we did run into problems when we went to write back the pages because the 'oldest' snapc is determined by the oldest cap_snap, and that may be the newer snapc that reflects the deletion. This caused confusion and an infinite loop in ceph_update_writeable_page(). Change the snapc checks to allow writeback of any snapc that is equal to OR older than the 'oldest' snapc. When there are no cap_snaps, we were also using the realm's latest snapc for writeback, which complicates ceph_put_wrbufffer_cap_refs(). Instead, use i_head_snapc, the most snapc used for the most recent ('head') data. This makes the writeback snapc (ceph_osd_request.r_snapc) _always_ match a capsnap or i_head_snapc. Also, in writepags_finish(), drop the snapc referenced by the _page_ and do not assume it matches the request snapc (it may not anymore). Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-01 09:34:36 -07:00
Sage Weil	9358c6d4c0	ceph: fix dentry rehashing on virtual .snap dir If a lookup fails on the magic .snap directory, we bind it to a magic snap directory inode in ceph_lookup_finish(). That code assumes the dentry is unhashed, but a recent server-side change started returning NULL leases on lookup failure, causing the .snap dentry to be hashed and NULL by ceph_fill_trace(). This causes dentry hash chain corruption, or a dies when d_rehash() includes BUG_ON(!d_unhashed(entry)); So, avoid processing the NULL dentry lease if it the dentry matches the snapdir name in ceph_fill_trace(). That allows the lookup completion to properly bind it to the snapdir inode. BUG there if dentry is hashed to be sure. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-30 13:55:22 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Sage Weil	94aa8ae13d	ceph: fix use after free on mds __unregister_request There was a use after free in __unregister_request that would trigger whenever the request map held the last reference. This appears to have triggered an oops during 'umount -f' when requests are being torn down. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-28 21:23:56 -07:00
Sage Weil	393f662096	ceph: fix possible double-free of mds request reference Clear pointer to mds request after dropping the reference to ensure we don't drop it again, as there is at least one error path through this function that does not reset fi->last_readdir to a new value. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-23 07:47:06 -07:00
Sage Weil	d96d60498f	ceph: fix session check on mds reply Fix a broken check that a reply came back from the same MDS we sent the request to. I don't think a case that actually triggers this would ever come up in practice, but it's clearly wrong and easy to fix. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-23 07:47:05 -07:00

1 2 3 4 5 ...

275 Commits