copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.
Convert some obvious users. *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case. Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull in an OrangeFS branch containing miscellaneous improvements.
- clean up debugfs globals
- remove dead code in sysfs
- reorganize duplicated sysfs attribute structs
- consolidate sysfs show and store functions
- remove duplicated sysfs_ops structures
- describe organization of sysfs
- make devreq_mutex static
- g_orangefs_stats -> orangefs_stats for consistency
- rename most remaining global variables
Only op_timeout_secs, slot_timeout_secs, and hash_table_size are left
because they are exposed as module parameters. All other global
variables have the orangefs_ prefix.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Mostly this is moving code into orangefs-debugfs.c so that globals turn
into static globals.
Then gossip_debug_mask is renamed orangefs_gossip_debug_mask but keeps
global visibility, so it can be used from a macro.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
This is a new userspace operation, which will be done if the client-core
version is greater than or equal to 2.9.6. This will provide a way to
implement optional features and to determine which features are
supported by the client-core. If the client-core version is older than
2.9.6, no optional features are supported and the op will not be done.
The intent is to allow protocol extensions without relying on the
client-core's current behavior of ignoring what it doesn't understand.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
The client reports its version to the kernel on startup. We already test
that it is above the minimum version. Now we record it in a global
variable so code elsewhere can consult it before making a request the
client may not understand.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
In orangefs_inode_getxattr(), an fsuid is written to dmesg. The kuid is
converted to a userspace uid via from_kuid(current_user_ns(), [...]), but
since dmesg is global, init_user_ns should be used here instead.
In copy_attributes_from_inode(), op_alloc() and fill_default_sys_attrs(),
upcall structures are populated with uids/gids that have been mapped into
the caller's namespace. However, those upcall structures are read by
another process (the userspace filesystem driver), and that process might
be running in another namespace. This effectively lets any user spoof its
uid and gid as seen by the userspace filesystem driver.
To fix the second issue, I just construct the opcall structures with
init_user_ns uids/gids and require the filesystem server to run in the
init namespace. Since orangefs is full of global state anyway (as the error
message in DUMP_DEVICE_ERROR explains, there can only be one userspace
orangefs filesystem driver at once), that shouldn't be a problem.
[
Why does orangefs even exist in the kernel if everything does upcalls into
userspace? What does orangefs do that couldn't be done with the FUSE
interface? If there is no good answer to those questions, I'd prefer to see
orangefs kicked out of the kernel. Can that be done for something that
shipped in a release?
According to commit f7ab093f74 ("Orangefs: kernel client part 1"), they
even already have a FUSE daemon, and the only rational reason (apart from
"but most of our users report preferring to use our kernel module instead")
given for not wanting to use FUSE is one "in-the-works" feature that could
probably be integated into FUSE instead.
]
This patch has been compile-tested.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
* remove from the list _before_ orangefs_unmount() - request_mutex
in the latter will make sure that nothing observed in the loop in
ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
of loop
* on removal, keep the forward pointer and zero the back one. That
way we can drop and regain the spinlock in the loop body (again,
ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
rest of the list.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Al Viro has cleaned up the way ops are processed and waited for,
now orangefs.txt has an overview of how it works. Several recent
related commits have added to the comments in the code as well.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
* turn all those list_del(&op->list) into list_del_init()
* don't pick ops that are already given up in control device
->read()/->write_iter().
* have orangefs_clean_interrupted_operation() notice if op is currently
being copied to/from daemon (by said ->read()/->write_iter()) and
wait for that to finish.
* when we are done copying to/from daemon and find that it had been
given up while we were doing that, wake the waiting ..._clean_interrupted_...
As the result, we are guaranteed that orangefs_clean_interrupted_operation(op)
doesn't return until nobody else can see op. Moreover, we don't need to play
with op refcounts anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
... and clean the end of control device ->write_iter() while we are at it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
new waiting-for-slot logics:
* make request for slot wait for bufmap to be set up if it
comes before it's installed *OR* while it's running down
* make closing control device wait for all slots to be freed
* waiting itself rewritten to (open-coded) analogues of wait_event_...
primitives - we would need wait_event_locked() and, pardon an obscenely
long name, wait_event_interruptible_exclusive_timeout_locked().
* we never wait for more than slot_timeout_secs in total and,
if during the wait the daemon goes away, we only allow
ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS for it to come back.
* (cosmetical) bitmap is used instead of an array of zeroes and ones
* old (and only reached if we are about to corrupt memory) waiting
for daemon restart in service_operation() removed.
[Martin's fixes folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Make cancels reuse the aborted read/write op, to make sure they do not
fail on lack of memory.
Don't issue a cancel unless the daemon has seen our read/write, has not
replied and isn't being shut down.
If cancel *is* issued, don't wait for it to complete; stash the slot
in there and just have it freed when cancel is finally replied to or
purged (and delay dropping the reference until then, obviously).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
There were two just alike, making it hard maybe to tell which one
you were looking at in syslog... so I changed it a little by adding
some extra interesting tidbits to it...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
... we are not going to get woken up anyway, so it's just going to time out
and whine.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
All timeouts are in _seconds_, so all calls are of form
MSECS_TO_JIFFIES(n * 1000), which is a convoluted way to
spell n * HZ.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
* create with refcount 1
* make op_release() decrement and free if zero (i.e. old put_op()
has become that).
* mark when submitter has given up waiting; from that point nobody
else can move between the lists, change state, etc.
* have daemon read/write_iter grab a reference when picking op
and *always* give it up in the end
* don't put into hash until we know it's been successfully passed to
daemon
* move op->lock _lower_ than htab_in_progress_lock (and make sure
to take it in purge_inprogress_ops())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Until now, orangefs_devreq_write_iter has just been a wrapper for
the old-fashioned orangefs_devreq_writev... linux would call
.write_iter with "struct kiocb *iocb" and "struct iov_iter *iter"
and .write_iter would just:
return pvfs2_devreq_writev(iocb->ki_filp,
iter->iov,
iter->nr_segs,
&iocb->ki_pos);
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This only changes the names of things, so there is no functional change.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
There was previously MAX_ALIGNED_DEV_REQ_(UP|DOWN)SIZE macros which
evaluated to MAX_DEV_REQ_(UP|DOWN)SIZE+8. As it is unclear what this is
for, other than creating a situation where we accept more data than we
can parse, it is removed.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
AV dislikes many parts of orangefs_devreq_writev. Besides making
orangefs_devreq_writev more easily readable and better commented,
this patch makes an effort to address some of the problems:
> The 5th is quietly ignored unless trailer_size is positive and
> status is zero. If trailer_size > 0 && status == 0, you verify that
> the length of the 5th segment is no more than trailer_size and copy
> it to vmalloc'ed buffer. Without bothering to zero the rest of that
> buffer out.
It was just wrong to allow a 5th segment that is not exactly equal to
trailer_size. Now that that's fixed, there's nothing to zero out in
the vmalloced buffer - it is exactly the right size to hold the
5th segment.
> Another API bogosity: when the 5th segment is present, successful writev()
> returns the sum of sizes of the first 4.
Added size of 5th segment to writev return...
> if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine
> and proceed with garbage.
If 4th segment isn't exactly sizeof(struct pvfs2_downcall_s), whine and fail.
> if the 32bit value 4 bytes into op->downcall is zero and 64bit
> value following it is non-zero, the latter is interpreted as the size of
> trailer data.
The latter is what userspace claimed was the length of the trailer data.
The kernel module now compares it to the trailer iovec's iov_len as a
sanity check.
> if there's no trailer, the 5th segment (if present) is completely ignored.
Whine and fail if there should be no trailer, yet a 5th segment is present.
> if vmalloc fails, act as if status (32bit at offset 5 into
> op->downcall) had been -ENOMEM and don't look at the 5th segment at all.
whine and fail with -ENOMEM.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>