Commit Graph

92 Commits

Author SHA1 Message Date
Yan, Zheng eb13e832f8 ceph: use fl->fl_file as owner identifier of flock and posix lock
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:11 -07:00
Sage Weil 45195e42c7 ceph: add acl, noacl options for cephfs mount
Make the 'acl' option dependent on having ACL support compiled in.  Make
the 'noacl' option work even without it so that one can always ask it to
be off and not error out on mount when it is not supported.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-02-17 12:37:12 -08:00
Ilya Dryomov 12b4629a9f libceph: all features fields must be u64
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64.  (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:08 +02:00
Guangliang Zhao 7221fe4c2e ceph: add acl for cephfs
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
2013-12-31 20:32:01 +02:00
Yan, Zheng 9f12bd119e ceph: drop unconnected inodes
Positve dentry and corresponding inode are always accompanied in MDS reply.
So no need to keep inode in the cache after dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:16 -08:00
Milosz Tanski 99ccbd229c ceph: use fscache as a local presisent cache
Adding support for fscache to the Ceph filesystem. This would bring it to on
par with some of the other network filesystems in Linux (like NFS, AFS, etc...)

In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-09-06 16:50:11 +00:00
Sasha Levin 5446429630 ceph: avoid accessing invalid memory
when mounting ceph with a dev name that starts with a slash, ceph
would attempt to access the character before that slash. Since we
don't actually own that byte of memory, we would trigger an
invalid access:

[   43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff
[   43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[   43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060
[   43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   43.503006] Dumping ftrace buffer:
[   43.503596]    (ftrace buffer empty)
[   43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G        W    3.10.0-sasha #1129
[   43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000
[   43.505608] RIP: 0010:[<ffffffff818f3884>]  [<ffffffff818f3884>] parse_mount_options$
[   43.506552] RSP: 0018:ffff880fa3413d08  EFLAGS: 00010286
[   43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000
[   43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000
[   43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0
[   43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0
[   43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90
[   43.509792] FS:  00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$
[   43.509792] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0
[   43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   43.509792] Stack:
[   43.509792]  0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98
[   43.509792]  0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000
[   43.509792]  ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72
[   43.509792] Call Trace:
[   43.509792]  [<ffffffff818f3c72>] ceph_mount+0xa2/0x390
[   43.509792]  [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0
[   43.509792]  [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0
[   43.509792]  [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20
[   43.509792]  [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100
[   43.509792]  [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0
[   43.509792]  [<ffffffff812a2e8d>] do_mount+0x15d/0x210
[   43.509792]  [<ffffffff81220e55>] ? strndup_user+0x45/0x60
[   43.509792]  [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0
[   43.509792]  [<ffffffff83fd816c>] tracesys+0xdd/0xe2
[   43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $
[   43.509792] RIP  [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[   43.509792]  RSP <ffff880fa3413d08>
[   43.509792] CR2: ffff880fa3a97fff
[   43.509792] ---[ end trace 22469cd81e93af51 ]---

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Sage Weil <sage@inktan.com>
2013-07-03 15:32:55 -07:00
Alex Elder 3bf53337af ceph: set up page array mempool with correct size
In create_fs_client() a memory pool is set up be used for arrays of
pages that might be needed in ceph_writepages_start() if memory is
tight.  There are two problems with the way it's initialized:
    - The size provided is the number of pages we want in the
      array, but it should be the number of bytes required for
      that many page pointers.
    - The number of pages computed can end up being 0, while we
      will always need at least one page.

This patch fixes both of these problems.

This resolves the two simple problems defined in:
    http://tracker.ceph.com/issues/4603

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:50 -07:00
Eric W. Biederman 7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
Sage Weil 92a49fb0f7 ceph: fix statvfs fr_size
Different versions of glibc are broken in different ways, but the short of
it is that for the time being, frsize should == bsize, and be used as the
multiple for the blocks, free, and available fields.  This mirrors what is
done for NFS.  The previous reporting of the page size for frsize meant
that newer glibc and df would report a very small value for the fs size.

Fixes http://tracker.ceph.com/issues/3793.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-02-22 15:31:00 -08:00
Linus Torvalds 40889e8d9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few different groups of commits here.  The largest is
  Alex's ongoing work to enable the coming RBD features (cloning,
  striping).  There is some cleanup in libceph that goes along with it.

  Cyril and David have fixed some problems with NFS reexport (leaking
  dentries and page locks), and there is a batch of patches from Yan
  fixing problems with the fs client when running against a clustered
  MDS.  There are a few bug fixes mixed in for good measure, many of
  which will be going to the stable trees once they're upstream.

  My apologies for the late pull.  There is still a gremlin in the rbd
  map/unmap code and I was hoping to include the fix for that as well,
  but we haven't been able to confirm the fix is correct yet; I'll send
  that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
  rbd: get rid of rbd_{get,put}_dev()
  libceph: register request before unregister linger
  libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
  libceph: init event->node in ceph_osdc_create_event()
  libceph: init osd->o_node in create_osd()
  libceph: report connection fault with warning
  libceph: socket can close in any connection state
  rbd: don't use ENOTSUPP
  rbd: remove linger unconditionally
  rbd: get rid of RBD_MAX_SEG_NAME_LEN
  libceph: avoid using freed osd in __kick_osd_requests()
  ceph: don't reference req after put
  rbd: do not allow remove of mounted-on image
  libceph: Unlock unprocessed pages in start_read() error path
  ceph: call handle_cap_grant() for cap import message
  ceph: Fix __ceph_do_pending_vmtruncate
  ceph: Don't add dirty inode to dirty list if caps is in migration
  ceph: Fix infinite loop in __wake_requests
  ceph: Don't update i_max_size when handling non-auth cap
  bdi_register: add __printf verification, fix arg mismatch
  ...
2012-12-20 14:00:13 -08:00
Joe Perches d2cc4dde92 bdi_register: add __printf verification, fix arg mismatch
__printf is useful to verify format and arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:07 -06:00
Sage Weil 83aff95eb9 libceph: remove 'osdtimeout' option
This would reset a connection with any OSD that had an outstanding
request that was taking more than N seconds.  The idea was that if the
OSD was buggy, the client could compensate by resending the request.

In reality, this only served to hide server bugs, and we haven't
actually seen such a bug in quite a while.  Moreover, the userspace
client code never did this.

More importantly, often the request is taking a long time because the
OSD is trying to recover, or overloaded, and killing the connection
and retrying would only make the situation worse by giving the OSD
more work to do.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:06 -06:00
Linus Torvalds 7035cdf36d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The bulk of this pull is a series from Alex that refactors and cleans
  up the RBD code to lay the groundwork for supporting the new image
  format and evolving feature set.  There are also some cleanups in
  libceph, and for ceph there's fixed validation of file striping
  layouts and a bugfix in the code handling a shrinking MDS cluster."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
  ceph: avoid 32-bit page index overflow
  ceph: return EIO on invalid layout on GET_DATALOC ioctl
  rbd: BUG on invalid layout
  ceph: propagate layout error on osd request creation
  libceph: check for invalid mapping
  ceph: convert to use le32_add_cpu()
  ceph: Fix oops when handling mdsmap that decreases max_mds
  rbd: update remaining header fields for v2
  rbd: get snapshot name for a v2 image
  rbd: get the snapshot context for a v2 image
  rbd: get image features for a v2 image
  rbd: get the object prefix for a v2 rbd image
  rbd: add code to get the size of a v2 rbd image
  rbd: lay out header probe infrastructure
  rbd: encapsulate code that gets snapshot info
  rbd: add an rbd features field
  rbd: don't use index in __rbd_add_snap_dev()
  rbd: kill create_snap sysfs entry
  rbd: define rbd_dev_image_id()
  rbd: define some new format constants
  ...
2012-10-08 06:38:18 +09:00
Kirill A. Shutemov 8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Alex Elder c98f533c94 ceph: let path portion of mount "device" be optional
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped.  As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.

That is, an entry in /etc/fstab like:
    cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.

Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.

RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers.  (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)

According to those standards, no host specification will ever
contain a '/' character.  As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path.  If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space.  We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.

This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:

    "device" will look like:
        <server_spec>[,<server_spec>...]:[<path>]
    where
        <server_spec> is <ip>[:<port>]
        <path> is optional, but if present must begin with '/'

This addresses http://tracker.newdream.net/issues/2919

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Linus Torvalds cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Sage Weil 1fe60e51a3 libceph: move feature bits to separate header
This is simply cleanup that will keep things more closely synced with the
userland code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 16:23:22 -07:00
David Howells 9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
Linus Torvalds 56b59b429b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates for 3.4-rc1 from Sage Weil:
 "Alex has been busy.  There are a range of rbd and libceph cleanups,
  especially surrounding device setup and teardown, and a few critical
  fixes in that code.  There are more cleanups in the messenger code,
  virtual xattrs, a fix for CRC calculation/checks, and lots of other
  miscellaneous stuff.

  There's a patch from Amon Ott to make inos behave a bit better on
  32-bit boxes, some decode check fixes from Xi Wang, and network
  throttling fix from Jim Schutt, and a couple RBD fixes from Josh
  Durgin.

  No new functionality, just a lot of cleanup and bug fixing."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
  rbd: move snap_rwsem to the device, rename to header_rwsem
  ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
  libceph: isolate kmap() call in write_partial_msg_pages()
  libceph: rename "page_shift" variable to something sensible
  libceph: get rid of zero_page_address
  libceph: only call kernel_sendpage() via helper
  libceph: use kernel_sendpage() for sending zeroes
  libceph: fix inverted crc option logic
  libceph: some simple changes
  libceph: small refactor in write_partial_kvec()
  libceph: do crc calculations outside loop
  libceph: separate CRC calculation from byte swapping
  libceph: use "do" in CRC-related Boolean variables
  ceph: ensure Boolean options support both senses
  libceph: a few small changes
  libceph: make ceph_tcp_connect() return int
  libceph: encapsulate some messenger cleanup code
  libceph: make ceph_msgr_wq private
  libceph: encapsulate connection kvec operations
  libceph: move prepare_write_banner()
  ...
2012-03-28 10:01:29 -07:00
Alex Elder cffaba15cd ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature.  For all those that don't offer this, add
a new option so that they do.

Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 3ce6cd1233 ceph: avoid repeatedly computing the size of constant vxattr names
All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time.  So there's no need to compute their length every
time any file's attribute is listed.

Record the length of each string and use it when needed to determine
the space need to represent them.  In addition, compute the
aggregate size of strings in each table just once at initialization
time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Al Viro 48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Linus Torvalds 1a52bb0b68 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: ensure prealloc_blob is in place when removing xattr
  rbd: initialize snap_rwsem in rbd_add()
  ceph: enable/disable dentry complete flags via mount option
  vfs: export symbol d_find_any_alias()
  ceph: always initialize the dentry in open_root_dentry()
  libceph: remove useless return value for osd_client __send_request()
  ceph: avoid iput() while holding spinlock in ceph_dir_fsync
  ceph: avoid useless dget/dput in encode_fh
  ceph: dereference pointer after checking for NULL
  crush: fix force for non-root TAKE
  ceph: remove unnecessary d_fsdata conditional checks
  ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
2012-01-13 10:29:21 -08:00
Sage Weil a40dc6cc2e ceph: enable/disable dentry complete flags via mount option
Enable/disable use of the dentry dir 'complete' flag via a mount option.
This lets the admin control whether ceph uses the dcache to satisfy
negative lookups or readdir when it has the entire directory contents in
its cache.

This is purely a performance optimization; correctness is guaranteed
whether it is enabled or not.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12 11:00:40 -08:00
Alex Elder d46cfba536 ceph: always initialize the dentry in open_root_dentry()
When open_root_dentry() gets a dentry via d_obtain_alias() it does
not get initialized.  If the dentry obtained came from the cache,
this is OK.  But if not, the result is an improperly initialized
dentry.

To fix this, call ceph_init_dentry() regardless of which path
produced the dentry.  That function returns immediately for a dentry
that is already initialized, it is safe to use either way.

(Credit to Sage, who suggested this fix.)

Signed-off-by: Alex Elder <aelder@sgi.com>
2012-01-11 16:28:25 -08:00
Al Viro 3c5184ef12 ceph: d_alloc_root() may fail
... and ceph_init_dentry(NULL) will oops

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09 16:36:12 -05:00
Al Viro 34c80b1d93 vfs: switch ->show_options() to struct dentry *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:19:54 -05:00
Sage Weil 2151937d7c ceph: fix rasize reporting by ceph_show_options
Fix typo.

Reported-by: mowang da <whooya.xxl@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-02 09:27:54 -08:00
Sage Weil 774ac21da7 ceph: initialize root dentry
Set up d_fsdata on the root dentry.  This fixes a NULL pointer dereference
in ceph_d_prune on umount.  It also means we can eventually strip out all
of the conditional checks on d_fsdata because it is now set unconditionally
(prior to setting up the d_ops).

Fix the ceph_d_prune debug print while we're here.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-11 09:50:17 -08:00
H Hartley Sweeten 0c6d4b4e22 ceph/super.c: quiet sparse noise
Quiet the sparse noise:

warning: symbol 'create_fs_client' was not declared. Should it be static?
warning: symbol 'destroy_fs_client' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:12 -07:00
Noah Watkins 80db8bea6a ceph: replace leading spaces with tabs
Trivial formatting fix.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil 6ab00d465a libceph: create messenger with client
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil 83817e35cb ceph: rename rsize -> rasize
It controls readahead.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Noah Watkins 259a187ade ceph: fix memory leak
kfree does not clean up indirect allocations in
ceph_fs_client and ceph_options (e.g. snapdir_name).

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-22 13:06:59 -07:00
Yehuda Sadeh e985222743 ceph: set up readahead size when rsize is not passed
This should improve the default read performance, as without it
readahead is practically disabled.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-07-26 11:29:14 -07:00
Greg Farnum 8f04d42276 ceph: report f_bfree based on kb_avail rather than diffing.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
2011-07-26 11:27:06 -07:00
Tommi Virtanen 8323c3aa74 ceph: Move secret key parsing earlier.
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebew key management with the
kernel key retention service.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:16 -07:00
Sage Weil 80456f8672 ceph: move readahead default to fs/ceph from libceph
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:23 -07:00
Yehuda Sadeh ad1fee96cb ceph: add ino32 mount option
The ino32 mount option forces the ceph fs to report 32 bit
ino values.  This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-03-21 12:24:22 -07:00
Sage Weil 50aac4fec5 ceph: fix cap_wanted_delay_{min,max} mount option initialization
These were initialized to 0 instead of the default, fallout from the RBD
refactor in 3d14c5d2b6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:22 -08:00
Tejun Heo 01e6acc4ea ceph: fsc->*_wq's aren't used in memory reclaim path
fsc->*_wq's aren't depended upon during memory reclaim.  Convert to
alloc_workqueue() w/o WQ_MEM_RECLAIM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Sage Weil 14303d20f3 ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS.  This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests.  This will result in a forward
and degrade performance.  It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Al Viro a7f9fb205a convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:18 -04:00
Yehuda Sadeh 3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00
Sage Weil e9d1774431 ceph: do not ignore osd_idle_ttl mount option
Actually apply the mount option to the mount_args struct.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 12:56:57 -07:00
Sage Weil 2d9c98ae97 ceph: make ->sync_fs not wait if wait==0
The ->sync_fs() super op only needs to wait if wait is true.  Otherwise,
just get some dirty cap writeback started.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil a8b763a9b3 ceph: use %pU to print uuid (fsid)
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil c309f0ab26 ceph: clean up fsid mount option
Specify the fsid mount option in hex, not via the major/minor u64 hackery we had
before.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00