This is an example about how to add eventfd support to the current KAIO code,
in order to enable KAIO to post readiness events to a pollable fd (hence
compatible with POSIX select/poll). The KAIO code simply signals the eventfd
fd when events are ready, and this triggers a POLLIN in the fd. This patch
uses a reserved for future use member of the struct iocb to pass an eventfd
file descriptor, that KAIO will use to post events every time a request
completes. At that point, an aio_getevents() will return the completed result
to a struct io_event. I made a quick test program to verify the patch, and it
runs fine here:
http://www.xmailserver.org/eventfd-aio-test.c
The test program uses poll(2), but it'd, of course, work with select and epoll
too.
This can allow to schedule both block I/O and other poll-able devices
requests, and wait for results using select/poll/epoll. In a typical
scenario, an application would submit KAIO request using aio_submit(), and
will also use epoll_ctl() on the whole other class of devices (that with the
addition of signals, timers and user events, now it's pretty much complete),
and then would:
epoll_wait(...);
for_each_event {
if (curr_event_is_kaiofd) {
aio_getevents();
dispatch_aio_events();
} else {
dispatch_epoll_event();
}
}
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only). It can be used instead of pipe(2) in all cases where those
would simply be used to signal events. Their kernel overhead is much lower
than pipes, and they do not consume two fds. When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it. The API is:
int eventfd(unsigned int count);
The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).
But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.
The read(2) function reads the __u64 counter value, and reset the internal
value to zero. If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that has never
been retired - unlickely, but possible).
The write(2) call writes an __u64 count value, and adds it to the current
counter. The eventfd fd supports O_NONBLOCK also.
On the kernel side, we have:
struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);
The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post an event
to userspace. The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:
http://www.xmailserver.org/eventfd-bench.c
This is the eventfd-based version of pipetest-4 (pipe(2) based):
http://www.xmailserver.org/pipetest-4.c
Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.
[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements the necessary compat code for the timerfd system call.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces a new system call for timers events delivered though
file descriptors. This allows timer event to be used with standard POSIX
poll(2), select(2) and read(2). As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.
The system call is defined as:
int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
s new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.
If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.
If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
tv_nsec == 0), this is the period at which the following ticks should be
generated.
The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.
The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
returned.
The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.
A quick test program, shows timerfd working correctly on my amd64 box:
http://www.xmailserver.org/timerfd-test.c
[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements the necessary compat code for the signalfd system call.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch add an anonymous inode source, to be used for files that need
and inode only in order to create a file*. We do not care of having an
inode for each file, and we do not even care of having different names in
the associated dentries (dentry names will be same for classes of file*).
This allow code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make autofs container-friendly by caching struct pid reference rather than
pid_t and using pid_nr() to retreive a task's pid_t.
ChangeLog:
- Fix Eric Biederman's comments - Use find_get_pid() to hold a
reference to oz_pgrp and release while unmounting; separate out
changes to autofs and autofs4.
- Fix Cedric's comments: retain old prototype of parse_options()
and move necessary change to its caller.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: containers@lists.osdl.org
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix coding style errors (extra spaces, long lines) in autofs and autofs4 files
being modified for container/pidspace issues.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
attach_pid() currently takes a pid_t and then uses find_pid() to find the
corresponding struct pid. Sometimes we already have the struct pid. We can
then skip find_pid() if attach_pid() were to take a struct pid parameter.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up massive code duplication between mpage_writepages() and
generic_writepages().
The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.
Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.
The upcoming page writeback support in fuse will also want this.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused argument in is_pmbr_valid()
Remove unneeded initialization of local variable legacy_mbr
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't enable SYSV68 partition table support on all m68k boxes by default,
only on Motorola VME boards.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the statfs() op for AFS.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a couple of problems with unlinking AFS files.
(1) The parent directory wasn't being updated properly between unlink() and
the following lookup().
It seems that, for some reason, invalidate_remote_inode() wasn't
discarding the directory contents correctly, so this patch calls
invalidate_inode_pages2() instead on non-regular files.
(2) afs_vnode_deleted_remotely() should handle vnodes that don't have a
source server recorded without oopsing.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following bug was uncovered by compiling with '-W' flag:
CC [M] fs/afs/write.o
fs/afs/write.c: In function âafs_write_back_from_locked_pageâ:
fs/afs/write.c:398: warning: comparison of unsigned expression >= 0 is always true
Loop variable 'n' is unsigned, so wraps around happily as far as I can
see. Trival fix attached (compile tested only).
Signed-off-by: Mika Kukkonen <mikukkon@iki.fi>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generalise the handling of MTD-specific superblocks so that JFFS2 and ROMFS
can both share it.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Hi,
I have been working on some code that detects abnormal events based on audit
system events. One kind of event that we currently have no visibility for is
when a program terminates due to segfault - which should never happen on a
production machine. And if it did, you'd want to investigate it. Attached is a
patch that collects these events and sends them into the audit system.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Handle the edge cases for POSIX message queue auditing. Collect inode
info when opening an existing mq, and for send/receive operations. Remove
audit_inode_update() as it has really evolved into the equivalent of
audit_inode().
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Collect inode info for the remaining xattr syscalls that operate on a file
descriptor. These don't call a path_lookup variant, so they aren't covered by
the general audit hook.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In 9d6a8c5c21 we changed posix_test_lock
to modify its single file_lock argument instead of taking separate input
and output arguments. This makes it no longer safe to set the output
lock's fl_type to F_UNLCK before looking for a conflict, since that
means searching for a conflict against a lock with type F_UNLCK.
This fixes a regression which causes F_GETLK to incorrectly report no
conflict on most filesystems (including any filesystem that doesn't do
its own locking).
Also fix posix_lock_to_flock() to copy the lock type. This isn't
strictly necessary, since the caller already does this; but it seems
less likely to cause confusion in the future.
Thanks to Doug Chapman for the bug report.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Acked-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A small regression appears to have been introduced in the recent patch
"cleanup compat ioctl handling", which was included in Linus' tree after
2.6.20.
siocdevprivate_ioctl() is no longer defined if CONFIG_NET is undefined,
whereas previously it was a dummy function in this case.
This causes compilation with CONFIG_COMPAT but without CONFIG_NET to fail.
fs/compat_ioctl.c: In function `compat_sys_ioctl':
fs/compat_ioctl.c:3571: warning: implicit declaration of function `siocdevprivate_ioctl'
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Further fixes for AFS write support:
(1) The afs_send_pages() outer loop must do an extra iteration if it ends
with 'first == last' because 'last' is inclusive in the page set
otherwise it fails to send the last page and complete the RxRPC op under
some circumstances.
(2) Similarly, the outer loop in afs_pages_written_back() must also do an
extra iteration if it ends with 'first == last', otherwise it fails to
clear PG_writeback on the last page under some circumstances.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AFS write support fixes:
(1) Support large files using the 64-bit file access operations if available
on the server.
(2) Use kmap_atomic() rather than kmap() in afs_prepare_page().
(3) Don't do stuff in afs_writepage() that's done by the caller.
[akpm@linux-foundation.org: fix right shift count >= width of type]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it more useful for debugging purposes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The maximum size of an NFSv4 SETATTR compound reply should include the
GETATTR operation that we send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The check for nfs_attribute_timeout(dir) in nfs_check_verifier is
redundant: nfs_lookup_revalidate() will already call nfs_revalidate_inode()
on the parent dir when necessary.
The only case where this is not done is the case of a negative dentry. Fix
this case by moving up the revalidation code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
dentry verifiers are always set to the parent directory's
cache_change_attribute. There is no reason to be testing for anything other
than equality when we're trying to find out if the dentry has been checked
since the last time the directory was modified.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.infradead.org/mtd-2.6: (21 commits)
[MTD] [CHIPS] Remove MTD_OBSOLETE_CHIPS (jedec, amd_flash, sharp)
[MTD] Delete allegedly obsolete "bank_size" field of mtd_info.
[MTD] Remove unnecessary user space check from mtd.h.
[MTD] [MAPS] Remove flash maps for no longer supported 405LP boards
[MTD] [MAPS] Fix missing printk() parameter in physmap_of.c MTD driver
[MTD] [NAND] platform NAND driver: add driver
[MTD] [NAND] platform NAND driver: update header
[JFFS2] Simplify and clean up jffs2_add_tn_to_tree() some more.
[JFFS2] Remove another bogus optimisation in jffs2_add_tn_to_tree()
[JFFS2] Remove broken insert_point optimisation in jffs2_add_tn_to_tree()
[JFFS2] Remember to calculate overlap on nodes which replace older nodes
[JFFS2] Don't advance c->wbuf_ofs to next eraseblock after wbuf flush
[MTD] [NAND] at91_nand.c: CMDLINE_PARTS support
[MTD] [NAND] Tidy up handling of page number in nand_block_bad()
[MTD] block2mtd_paramline[] mustn't be __initdata
[MTD] [NAND] Support multiple chips in CAFÉ driver
[MTD] [NAND] Rename cafe.c to cafe_nand.c and remove the multi-obj magic
[MTD] [NAND] Use rslib for CAFÉ ECC
[RSLIB] Support non-canonical GF representations
[JFFS2] Remove dead file histo_mips.h
...
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use zero_user_page() instead of open-coding it.
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use zero_user_page() instead of open-coding it.
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use zero_user_page() instead of open-coding it.
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset(). There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones. Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call. The diffstat below shows the entire
patchset.
[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a lookup request arrives, nfsd uses information provided by userspace
(mountd) to find the right filesystem.
It then assumes that the same filehandle type as the incoming filehandle can
be used to create an outgoing filehandle.
However if mountd is buggy, or maybe just being creative, the filesystem may
not support that filesystem type, and the kernel could oops, particularly if
'ex_uuid' is NULL but a FSID_UUID* filehandle type is used.
So add some proper checking that the fsid version/type from the incoming
filehandle is actually supportable, and ignore that information if it isn't
supportable.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ decode_sattr and decode_sattr3 never return NULL, so remove
several checks for that. ditto for xdr_decode_hyper.
2/ replace some open coded XDR_QUADLEN calls with calls to
XDR_QUADLEN
3/ in decode_writeargs, simply an 'if' to use a single
calculation.
.page_len is the length of that part of the packet that did
not fit in the first page (the head).
So the length of the data part is the remainder of the
head, plus page_len.
3/ other minor cleanups.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kbuild directly interprets <modulename>-y as objects to build into a module,
no need to assign it to the old foo-objs variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to zero various parts of 'exp' before any 'goto out', otherwise when
we go to free the contents... we die.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the kernel calls svc_reserve to downsize the expected size of an RPC
reply, it fails to account for the possibility of a checksum at the end of
the packet. If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O
then you'll generally see messages similar to this in the server's ring
buffer:
RPC request reserved 164 but used 208
While I was never able to verify it, I suspect that this problem is also
the root cause of some oopses I've seen under these conditions:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726
This is probably also a problem for other sec= types and for NFSv4. The
large reserved size for NFSv4 compound packets seems to generally paper
over the problem, however.
This patch adds a wrapper for svc_reserve that accounts for the possibility
of a checksum. It also fixes up the appropriate callers of svc_reserve to
call the wrapper. For now, it just uses a hardcoded value that I
determined via testing. That value may need to be revised upward as things
change, or we may want to eventually add a new auth_op that attempts to
calculate this somehow.
Unfortunately, there doesn't seem to be a good way to reliably determine
the expected checksum length prior to actually calculating it, particularly
with schemes like spkm3.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv2 and NFSv3 servers do not handle WRITE requests for 0 bytes
correctly. The specifications indicate that the server should accept the
request, but it should mostly turn into a no-op. Currently, the server
will return an XDR decode error, which it should not.
Attached is a patch which addresses this issue. It also adds some boundary
checking to ensure that the request contains as much data as was requested
to be written. It also correctly handles an NFSv3 request which requests
to write more data than the server has stated that it is prepared to
handle. Previously, there was some support which looked like it should
work, but wasn't quite right.
Signed-off-by: Peter Staubach <staubach@redhat.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nfs4_acl_add_ace() can now be removed.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this). So we can unify
flush_work_keventd and flush_work.
Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.
(akpm: this is why the earlier patches bypassed maintainers)
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migrate AIO over to use flush_work().
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement support for writing to regular AFS files, including:
(1) write
(2) truncate
(3) fsync, fdatasync
(4) chmod, chown, chgrp, utime.
AFS writeback attempts to batch writes into as chunks as large as it can manage
up to the point that it writes back 65535 pages in one chunk or it meets a
locked page.
Furthermore, if a page has been written to using a particular key, then should
another write to that page use some other key, the first write will be flushed
before the second is allowed to take place. If the first write fails due to a
security error, then the page will be scrapped and reread before the second
write takes place.
If a page is dirty and the callback on it is broken by the server, then the
dirty data is not discarded (same behaviour as NFS).
Shared-writable mappings are not supported by this patch.
[akpm@linux-foundation.org: fix a bunch of warnings]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make some miscellaneous changes to the AFS filesystem:
(1) Assert RCU barriers on module exit to make sure RCU has finished with
callbacks in this module.
(2) Correctly handle the AFS server returning a zero-length read.
(3) Split out data zapping calls into one function (afs_zap_data).
(4) Rename some afs_file_*() functions to afs_*() where they apply to
non-regular files too.
(5) Be consistent about the presentation of volume ID:vnode ID in debugging
output.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since path_walk sets the total_link_count to 0 and calls link_path_walk, we
can just call path_walk directly.
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup using simple_read_from_buffer() in binfmt_misc, configfs, and sysfs.
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many files include the filename at the beginning, serveral used a wrong one.
Signed-off-by: Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
The text removed by the following patch refers to functionality that never
worked, to non-existing documentation file, and to mount options marked as
obsolete in the module.
Signed-off-by: Alexander E. Patrakov <patrakov@ums.usu.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Fix the misspellings of "propogate", "writting" and (oh, the shame
:-) "kenrel" in the source tree.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
/proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we
don't have any references to clear_refs_smap() in generic procfs code.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. to match what we do on write(). This way, people who write to files
by using [f]truncate + writable mmap have the same semantics as if they
were using the write() family of system calls.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Add lockdep support for XFS
[XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.
[XFS] Get rid of redundant "required" in msg.
[XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.
[XFS] Remove unused ilen variable and references.
[XFS] Fix to prevent the notorious 'NULL files' problem after a crash.
[XFS] Fix race condition in xfs_write().
[XFS] Fix uquota and oquota enforcement problems.
[XFS] propogate return codes from flush routines
[XFS] Fix quotaon syscall failures for group enforcement requests.
[XFS] Invalidate quotacheck when mounting without a quota type.
[XFS] reducing the number of random number functions.
[XFS] remove more misc. unused args
[XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it.
[XFS] The last argument "lsn" of xfs_trans_commit() is always called with
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
JFS: Fix race waking up jfsIO kernel thread
JFS: use __set_current_state()
Copy i_flags to jfs inode flags on write
JFS: document uid, gid, and umask mount options in jfs.txt
sb_read may return NULL, let's explicitly check it.
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make UDF work correctly for files larger than 1GB. As no extent can be
longer than (1<<30)-blocksize bytes, we have to create several extents if a
big hole is being created. As a side-effect, we now don't discard
preallocated blocks when creating a hole.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a few assertions into udf_discard_prealloc() to check that the file is
sane (mostly helps debugging further patches ;).
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make UDF use get_bh() instead of directly accessing b_count and use
brelse() instead of udf_release_data() which does just brelse()...
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a structure extent_position to store a position of an extent and
the corresponding buffer_head in one place.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use sector_t and loff_t for file offsets in UDF filesystem. Otherwise an
overflow may occur for long files. Also make inode_bmap() return offset in
the extent in number of blocks instead of number of bytes - for most
callers this is more convenient.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the atomic_t in struct nfs_server to atomic_long_t in anticipation
of machines that can handle 8+TB of (4K) pages under writeback.
However I suspect other things in NFS will start going *bang* by then.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement utimensat(2) which is an extension to futimesat(2) in that it
a) supports nano-second resolution for the timestamps
b) allows to selectively ignore the atime/mtime value
c) allows to selectively use the current time for either atime or mtime
d) supports changing the atime/mtime of a symlink itself along the lines
of the BSD lutimes(3) functions
For this change the internally used do_utimes() functions was changed to
accept a timespec time value and an additional flags parameter.
Additionally the sys_utime function was changed to match compat_sys_utime
which already use do_utimes instead of duplicating the work.
Also, the completely missing futimensat() functionality is added. We have
such a function in glibc but we have to resort to using /proc/self/fd/* which
not everybody likes (chroot etc).
Test application (the syscall number will need per-arch editing):
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <sys/time.h>
#include <stddef.h>
#include <syscall.h>
#define __NR_utimensat 280
#define UTIME_NOW ((1l << 30) - 1l)
#define UTIME_OMIT ((1l << 30) - 2l)
int
main(void)
{
int status = 0;
int fd = open("ttt", O_RDWR|O_CREAT|O_EXCL, 0666);
if (fd == -1)
error (1, errno, "failed to create test file \"ttt\"");
struct stat64 st1;
if (fstat64 (fd, &st1) != 0)
error (1, errno, "fstat failed");
struct timespec t[2];
t[0].tv_sec = 0;
t[0].tv_nsec = 0;
t[1].tv_sec = 0;
t[1].tv_nsec = 0;
if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0)
error (1, errno, "utimensat failed");
struct stat64 st2;
if (fstat64 (fd, &st2) != 0)
error (1, errno, "fstat failed");
if (st2.st_atim.tv_sec != 0 || st2.st_atim.tv_nsec != 0)
{
puts ("atim not reset to zero");
status = 1;
}
if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0)
{
puts ("mtim not reset to zero");
status = 1;
}
if (status != 0)
goto out;
t[0] = st1.st_atim;
t[1].tv_sec = 0;
t[1].tv_nsec = UTIME_OMIT;
if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0)
error (1, errno, "utimensat failed");
if (fstat64 (fd, &st2) != 0)
error (1, errno, "fstat failed");
if (st2.st_atim.tv_sec != st1.st_atim.tv_sec
|| st2.st_atim.tv_nsec != st1.st_atim.tv_nsec)
{
puts ("atim not set");
status = 1;
}
if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0)
{
puts ("mtim changed from zero");
status = 1;
}
if (status != 0)
goto out;
t[0].tv_sec = 0;
t[0].tv_nsec = UTIME_OMIT;
t[1] = st1.st_mtim;
if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0)
error (1, errno, "utimensat failed");
if (fstat64 (fd, &st2) != 0)
error (1, errno, "fstat failed");
if (st2.st_atim.tv_sec != st1.st_atim.tv_sec
|| st2.st_atim.tv_nsec != st1.st_atim.tv_nsec)
{
puts ("mtim changed from original time");
status = 1;
}
if (st2.st_mtim.tv_sec != st1.st_mtim.tv_sec
|| st2.st_mtim.tv_nsec != st1.st_mtim.tv_nsec)
{
puts ("mtim not set");
status = 1;
}
if (status != 0)
goto out;
sleep (2);
t[0].tv_sec = 0;
t[0].tv_nsec = UTIME_NOW;
t[1].tv_sec = 0;
t[1].tv_nsec = UTIME_NOW;
if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0)
error (1, errno, "utimensat failed");
if (fstat64 (fd, &st2) != 0)
error (1, errno, "fstat failed");
struct timeval tv;
gettimeofday(&tv,NULL);
if (st2.st_atim.tv_sec <= st1.st_atim.tv_sec
|| st2.st_atim.tv_sec > tv.tv_sec)
{
puts ("atim not set to NOW");
status = 1;
}
if (st2.st_mtim.tv_sec <= st1.st_mtim.tv_sec
|| st2.st_mtim.tv_sec > tv.tv_sec)
{
puts ("mtim not set to NOW");
status = 1;
}
if (symlink ("ttt", "tttsym") != 0)
error (1, errno, "cannot create symlink");
t[0].tv_sec = 0;
t[0].tv_nsec = 0;
t[1].tv_sec = 0;
t[1].tv_nsec = 0;
if (syscall(__NR_utimensat, AT_FDCWD, "tttsym", t, AT_SYMLINK_NOFOLLOW) != 0)
error (1, errno, "utimensat failed");
if (lstat64 ("tttsym", &st2) != 0)
error (1, errno, "lstat failed");
if (st2.st_atim.tv_sec != 0 || st2.st_atim.tv_nsec != 0)
{
puts ("symlink atim not reset to zero");
status = 1;
}
if (st2.st_mtim.tv_sec != 0 || st2.st_mtim.tv_nsec != 0)
{
puts ("symlink mtim not reset to zero");
status = 1;
}
if (status != 0)
goto out;
t[0].tv_sec = 1;
t[0].tv_nsec = 0;
t[1].tv_sec = 1;
t[1].tv_nsec = 0;
if (syscall(__NR_utimensat, fd, NULL, t, 0) != 0)
error (1, errno, "utimensat failed");
if (fstat64 (fd, &st2) != 0)
error (1, errno, "fstat failed");
if (st2.st_atim.tv_sec != 1 || st2.st_atim.tv_nsec != 0)
{
puts ("atim not reset to one");
status = 1;
}
if (st2.st_mtim.tv_sec != 1 || st2.st_mtim.tv_nsec != 0)
{
puts ("mtim not reset to one");
status = 1;
}
if (status == 0)
puts ("all OK");
out:
close (fd);
unlink ("ttt");
unlink ("tttsym");
return status;
}
[akpm@linux-foundation.org: add missing i386 syscall table entry]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes it so that simple_fill_super and get_sb_pseudo assign their
root inodes to be number 1. It also fixes up a couple of callers of
simple_fill_super that were passing in files arrays that had an index at
number 1, and adds a warning for any caller that sends in such an array.
It would have been nice to have made it so that it wasn't possible to make
such a collision, but some callers need to be able to control what inode
number their entries get, so I think this is the best that can be done.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The problems are:
- on filesystems w/o permanent inode numbers, i_ino values can be larger
than 32 bits, which can cause problems for some 32 bit userspace programs on
a 64 bit kernel. We can't do anything for filesystems that have actual
>32-bit inode numbers, but on filesystems that generate i_ino values on the
fly, we should try to have them fit in 32 bits. We could trivially fix this
by making the static counters in new_inode and iunique 32 bits, but...
- many filesystems call new_inode and assume that the i_ino values they are
given are unique. They are not guaranteed to be so, since the static
counter can wrap. This problem is exacerbated by the fix for #1.
- after allocating a new inode, some filesystems call iunique to try to get
a unique i_ino value, but they don't actually add their inodes to the
hashtable, and so they're still not guaranteed to be unique if that counter
wraps.
This patch set takes the simpler approach of simply using iunique and hashing
the inodes afterward. Christoph H. previously mentioned that he thought that
this approach may slow down lookups for filesystems that currently hash their
inodes.
The questions are:
1) how much would this slow down lookups for these filesystems?
2) is it enough to justify adding more infrastructure to avoid it?
What might be best is to start with this approach and then only move to using
IDR or some other scheme if these extra inodes in the hashtable prove to be
problematic.
I've done some cursory testing with this patch and the overhead of hashing and
unhashing the inodes with pipefs is pretty low -- just a few seconds of system
time added on to the creation and destruction of 10 million pipes (very
similar to the overhead that the IDR approach would add).
The hard thing to measure is what effect this has on other filesystems. I'm
open to ways to try and gauge this.
Again, I've only converted pipefs as an example. If this approach is
acceptable then I'll start work on patches to convert other filesystems.
With a pretty-much-worst-case microbenchmark provided by Eric Dumazet
<dada1@cosmosbay.com>:
hashing patch (pipebench):
sys 1m15.329s
sys 1m16.249s
sys 1m17.169s
unpatched (pipebench):
sys 1m9.836s
sys 1m12.541s
sys 1m14.153s
Which works out to 1.05642174294555027017. So ~5-6% slowdown.
This patch:
When a 32-bit program that was not compiled with large file offsets does a
stat and gets a st_ino value back that won't fit in the 32 bit field, glibc
(correctly) generates an EOVERFLOW error. We can't do anything about fs's
with larger permanent inode numbers, but when we generate them on the fly, we
ought to try and have them fit within a 32 bit field.
This patch takes the first step toward this by making the static counters in
these two functions be 32 bits.
[jlayton@redhat.com: mention that it's only the case for 32bit, non-LFS stat]
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When elf loader fails to map executable (due to memory shortage or because
binary is malformed), it can return 0. Normally, this is invisible because
process is killed with SIGKILL and it never returns to user space.
But if exec() is called from kernel thread (hotplug, whatever)
consequences are more interesting and vary depending on architecture.
i386. Nothing especially interesting, execve() just returns
with "success" :-)
x86_64. Fake zero frame is used on way to caller, RSP/RIP are loaded
with zeros, ergo... double fault.
ia64. Similar to i386, but r32...r95 are corrupted. Sometimes it
oopses due to return to zero PC, sometimes it sees NaT in
rXX and oopses due to NaT consumption.
Signed-off-by: Alexey Kuznetsov <alexey@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup using simple_read_from_buffer() in procfs.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It appears that a minor thinko occurred in udf_rmdir and the
(already-cleared) link count on the directory that is being removed was
being decremented instead of the link count on its parent directory. This
gives rise to lots of kernel messages similar to:
UDF-fs warning (device loop1): udf_rmdir: empty directory has nlink != 2 (8)
when removing directory trees. No other ill effects have been observed but
I guess it could theoretically result in the link count overflowing on a
very long-lived, much modified directory.
Signed-off-by: Stephen Mollett <molletts@yahoo.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you compile and run the below test case in an msdos or vfat directory on
an x86-64 system with -m32 you'll get garbage in the kernel_dirent struct
followed by a SIGSEGV.
The patch fixes this.
Reported and initial fix by Bart Oldeman
#include <sys/types.h>
#include <sys/ioctl.h>
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
struct kernel_dirent {
long d_ino;
long d_off;
unsigned short d_reclen;
char d_name[256]; /* We must not include limits.h! */
};
#define VFAT_IOCTL_READDIR_BOTH _IOR('r', 1, struct kernel_dirent [2])
#define VFAT_IOCTL_READDIR_SHORT _IOR('r', 2, struct kernel_dirent [2])
int main(void)
{
int fd = open(".", O_RDONLY);
struct kernel_dirent de[2];
while (1) {
int i = ioctl(fd, VFAT_IOCTL_READDIR_BOTH, (long)de);
if (i == -1) break;
if (de[0].d_reclen == 0) break;
printf("SFN: reclen=%2d off=%d ino=%d, %-12s",
de[0].d_reclen, de[0].d_off, de[0].d_ino, de[0].d_name);
if (de[1].d_reclen)
printf("\tLFN: reclen=%2d off=%d ino=%d, %s",
de[1].d_reclen, de[1].d_off, de[1].d_ino, de[1].d_name);
printf("\n");
}
return 0;
}
Signed-off-by: Bart Oldeman <bartoldeman@users.sourceforge.net>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ext2-specific i_flags. Hence, when someone sets these flags via a different
interface than ioctl, they are stored correctly.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems that the recent Windows changed specification, and it's
undocumented. Windows doesn't update ->free_clusters correctly.
This patch doesn't use ->free_clusters by default. (instead, add "usefree"
for forcing to use it)
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Juergen Beisert <juergen127@kreuzholzen.de>
Cc: Andreas Schwab <schwab@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the thread failed to create the subsequent wait_event will hang forever.
This is likely to happen if kernel hits max_threads limit.
Will be critical for virtualization systems that limit the number of tasks
and kernel memory usage within the container.
(akpm: JBD should be converted fully to the kthread API: kthread_should_stop()
and kthread_stop()).
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a missing check for CAP_SYS_ADMIN in do_change_type().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from
i_flags to EXT3_I(inode)->i_flags when inode is written to disk. The same
thing is done on GETFLAGS ioctl.
Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly propagated
into the filesystem (especially, lsattr did not show them and users were
wondering...).
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ext3-specific i_flags. Hence, when someone sets these flags via a
different interface than ioctl, they are stored correctly.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many places in the kernel where the construction like
foo = list_entry(head->next, struct foo_struct, list);
are used.
The code might look more descriptive and neat if using the macro
list_first_entry(head, type, member) \
list_entry((head)->next, type, member)
Here is the macro itself and the examples of its usage in the generic code.
If it will turn out to be useful, I can prepare the set of patches to
inject in into arch-specific code, drivers, networking, etc.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A while back, Christoph mentioned that he thought that iunique ought to be
cleaned up to use a more conventional loop construct. This patch does that,
turning the strange goto loop into a do/while.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
notify_change() already calls security_inode_setattr() before
calling iop->setattr.
Alan sayeth
This is a behaviour change on all of these and limits some behaviour of
existing established security modules
When inode_change_ok is called it has side effects. This includes
clearing the SGID bit on attribute changes caused by chmod. If you make
this change the results of some rulesets may be different before or after
the change is made.
I'm not saying the change is wrong but it does change behaviour so that
needs looking at closely (ditto all other attribute twiddles)
Signed-off-by: Steve Beattie <sbeattie@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
notify_change() already calls security_inode_setattr() before
calling iop->setattr.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can save some lines of code by using seq_release_private().
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge all compat ioctl handling into compat_ioctl.c instead of splitting it
over compat.c and compat_ioctl.c. This also allows to get rid of ioctl32.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Looks-good-to: Andi Kleen <ak@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for the Motorola sysv68 disk partition (slices in motorola
doc).
Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that there is no arch-specific compat ioctl handling left there is not
point in having a separate copat_ioctl.h, so merge it into compat_ioctl.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
linux/module.h
-> linux/elf.h
-> asm-i386/elf.h
-> linux/utsname.h
-> linux/sched.h
Noticeably cut the number of files which are rebuild upon touching sched.h
and cut down pulled junk from every module.h inclusion.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kallsyms_lookup() can go iterating over modules list unprotected which is OK
for emergency situations (oops), but not OK for regular stuff like
/proc/*/wchan.
Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol
name into caller-supplied buffer or return -ERANGE. All copying is done with
module_mutex held, so...
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several kallsyms_lookup() pass dummy arguments but only need, say, module's
name. Make kallsyms_lookup() accept NULLs where possible.
Also, makes picture clearer about what interfaces are needed for all symbol
resolving business.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a binary format is unregistered and re-registered, register_binfmt
fails with -EBUSY. The reason is that unregister_binfmt does not set
fmt->next to NULL, and seeing (fmt->next != NULL), register_binfmt fails
with -EBUSY.
One can find his way around by explicitly setting fmt->next to NULL after
unregistering, but that is kind of unclean (one should better be using only
the interfaces, and not the interal members, isn't it?)
Attached one-liner can fix it.
Signed-off-by: Kalash Nainwal <kalash.nainwal@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove_inode_dquot_ref() can now become static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
REISER_FS /proc option needs to depend on PROC_FS.
fs/reiserfs/procfs.c: In function 'show_super':
fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'max_hash_collisions'
fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'breads'
fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'bread_miss'
fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key'
fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_fs_changed'
fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_restarted'
fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'insert_item_restarted'
fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'paste_into_item_restarted'
fs/reiserfs/procfs.c:138: error: 'reiserfs_proc_info_data_t' has no member named 'cut_from_item_restarted'
fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_solid_item_restarted'
fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_item_restarted'
fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaked_oid'
fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaves_removable'
fs/reiserfs/procfs.c: In function 'show_per_level':
fs/reiserfs/procfs.c:184: error: 'reiserfs_proc_info_data_t' has no member named 'balance_at'
fs/reiserfs/procfs.c:185: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_read_at'
fs/reiserfs/procfs.c:186: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_fs_changed'
fs/reiserfs/procfs.c:187: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_restarted'
fs/reiserfs/procfs.c:188: error: 'reiserfs_proc_info_data_t' has no member named 'free_at'
fs/reiserfs/procfs.c:189: error: 'reiserfs_proc_info_data_t' has no member named 'items_at'
fs/reiserfs/procfs.c:190: error: 'reiserfs_proc_info_data_t' has no member named 'can_node_be_removed'
fs/reiserfs/procfs.c:191: error: 'reiserfs_proc_info_data_t' has no member named 'lnum'
fs/reiserfs/procfs.c:192: error: 'reiserfs_proc_info_data_t' has no member named 'rnum'
fs/reiserfs/procfs.c:193: error: 'reiserfs_proc_info_data_t' has no member named 'lbytes'
fs/reiserfs/procfs.c:194: error: 'reiserfs_proc_info_data_t' has no member named 'rbytes'
fs/reiserfs/procfs.c:195: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors'
fs/reiserfs/procfs.c:196: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors_restart'
fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_l_neighbor'
fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_r_neighbor'
fs/reiserfs/procfs.c: In function 'show_bitmap':
fs/reiserfs/procfs.c:224: error: 'reiserfs_proc_info_data_t' has no member named 'free_block'
fs/reiserfs/procfs.c:225: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:226: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:227: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:228: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:229: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c: In function 'show_journal':
fs/reiserfs/procfs.c:384: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:385: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:386: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:387: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:388: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:389: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:390: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:391: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:392: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:393: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:394: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_init':
fs/reiserfs/procfs.c:504: warning: implicit declaration of function '__PINFO'
fs/reiserfs/procfs.c:504: error: request for member 'lock' in something not a structure or union
fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_done':
fs/reiserfs/procfs.c:544: error: request for member 'lock' in something not a structure or union
fs/reiserfs/procfs.c:545: error: request for member 'exiting' in something not a structure or union
fs/reiserfs/procfs.c:546: error: request for member 'lock' in something not a structure or union
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eternal quest to make
while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done
while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done
while true; do modprobe xfs; rmmod xfs; done
work reliably continues and now kernel oopses in the following way:
BUG: unable to handle ... at virtual address 6b6b6b6b
EIP is at badness
process: cat
proc_oom_score
proc_info_read
sys_fstat64
vfs_read
proc_info_read
sys_read
Failing code is prefetch hidden in list_for_each_entry() in badness().
badness() is reachable from two points. One is proc_oom_score, another
is out_of_memory() => select_bad_process() => badness().
Second path grabs tasklist_lock, while first doesn't.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) Introduces a new method in 'struct dentry_operations'. This method
called d_dname() might be called from d_path() to build a pathname for
special filesystems. It is called without locks.
Future patches (if we succeed in having one common dentry for all
pipes/sockets) may need to change prototype of this method, but we now
use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);
2) Adds a dynamic_dname() helper function that eases d_dname() implementations
3) Defines d_dname method for sockets : No more sprintf() at socket
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
4) Defines d_dname method for pipes : No more sprintf() at pipe
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a
*nice* speedup on my Pentium(M) 1.6 Ghz :
3.090 s instead of 3.450 s
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for finding out the current file position, open flags and
possibly other info in the future.
These new entries are added:
/proc/PID/fdinfo/FD
/proc/PID/task/TID/fdinfo/FD
For each fd the information is provided in the following format:
pos: 1234
flags: 0100002
[bunk@stusta.de: make struct proc_fdinfo_file_operations static]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the order of fields of struct pid_entry (file fs/proc/base.c) in order
to avoid a hole on 64bit archs. (8 bytes saved per object)
Also change all pid_entry arrays to be const qualified, to make clear they
must not be modified.
Before (on x86_64) :
# size fs/proc/base.o
text data bss dec hex filename
15549 2192 0 17741 454d fs/proc/base.o
After :
# size fs/proc/base.o
text data bss dec hex filename
17229 176 0 17405 43fd fs/proc/base.o
Thats 336 bytes saved on kernel size on x86_64
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes. Issues:
- maps should not be world-readable, especially if programs expect any
kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
check the maps when %n is in a *printf call, and a setuid(getuid())
process wouldn't be able to read its own maps file. (For reference
see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
non-root applications that depend on access to the maps contents.
This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents. To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't have a routine called namei() anymore since at least 2.3.x, and
the comment is just totally out of sync with the current lookup logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inode->i_sb is always set, not need to check for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARN_ON(de && de->deleted); is sooo unreliable. Why?
proc_lookup remove_proc_entry
=========== =================
lock_kernel();
spin_lock(&proc_subdir_lock);
[find proc entry]
spin_unlock(&proc_subdir_lock);
spin_lock(&proc_subdir_lock);
[find proc entry]
proc_get_inode
==============
WARN_ON(de && de->deleted); ...
if (!atomic_read(&de->count))
free_proc_entry(de);
else
de->deleted = 1;
So, if you have some strange oops [1], and doesn't see this WARN_ON it means
nothing.
[1] try_module_get() of module which doesn't exist, two lines below
should suffice, or not?
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a slight problem with filesystem type representation in fuse
based filesystems.
From the kernel's view, there are just two filesystem types: fuse and
fuseblk. From the user's view there are lots of different filesystem
types. The user is not even much concerned if the filesystem is fuse based
or not. So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.
The current scheme is to encode the real filesystem type in the mount
source. So an sshfs mount looks like this:
sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...
This url-ish syntax works OK for sshfs and similar filesystems. However
for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
the kernel expects the mount source to be a real device name.
A possibly better scheme would be to encode the real type in the type
field as "type.subtype". So fuse mounts would look like this:
/dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...
This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Epoll is doing multiple passes over the ready set at the moment, because of
the constraints over the f_op->poll() call. Looking at the code again, I
noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).
This is a stress application that can be used to test the new code. It
spwans multiple thread and call epoll_wait() and epoll_ctl() from many
threads. Stress tested on my dual Opteron 254 w/out any problems.
http://www.xmailserver.org/totalmess.c
This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:
http://www.xmailserver.org/epwbench.c
[1] Considering that epoll must be thread-safe, there are five ways we can
be hit during an epoll_wait() transfer loop (ep_send_events()):
1) The epoll fd going away and calling ep_free
This just can't happen, since we did an fget() in sys_epoll_wait
2) An epoll_ctl(EPOLL_CTL_DEL)
This can't happen because epoll_ctl() gets ep->sem in write, and
we're holding it in read during ep_send_events()
3) An fd stored inside the epoll fd going away
This can't happen because in eventpoll_release_file() we get
ep->sem in write, and we're holding it in read during
ep_send_events()
4) Another epoll_wait() happening on another thread
They both can be inside ep_send_events() at the same time, we get
(splice) the ready-list under the spinlock, so each one will get
its own ready list. Note that an fd cannot be at the same time
inside more than one ready list, because ep_poll_callback() will
not re-queue it if it sees it already linked:
if (ep_is_linked(&epi->rdllink))
goto is_linked;
Another case that can happen, is two concurrent epoll_wait(),
coming in with a userspace event buffer of size, say, ten.
Suppose there are 50 event ready in the list. The first
epoll_wait() will "steal" the whole list, while the second, seeing
no events, will go to sleep. But at the end of ep_send_events() in
the first epoll_wait(), we will re-inject surplus ready fds, and we
will trigger the proper wake_up to the second epoll_wait().
5) ep_poll_callback() hitting us asyncronously
This is the tricky part. As I said above, the ep_is_linked() test
done inside ep_poll_callback(), will guarantee us that until the
item will result linked to a list, ep_poll_callback() will not try
to re-queue it again (read, write data on any of its members). When
we do a list_del() in ep_send_events(), the item will still satisfy
the ep_is_linked() test (whatever data is written in prev/next,
it'll never be its own pointer), so ep_poll_callback() will still
leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
that it'll become visible to ep_poll_callback(), but at the point
we're already past it.
[akpm@osdl.org: 80 cols]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- ext3_dx_find_entry() exit with out setting proper error pointer
- do_split() exit with out setting proper error pointer
it is realy painful because many callers contain folowing code:
de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
if (!(de))
return retval;
<<< WOW retval wasn't changed by do_split(), so caller failed
<<< but return SUCCESS :)
- Rearrange do_split() error path. Current error path is realy ugly, all
this up and down jump stuff doesn't make code easy to understand.
[dmonakhov@sw.ru: fix annoying fake error messages]
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
namespaces. But they have different code paths.
This patch merges all the nsproxy and its associated namespace copy/clone
handling (as much as possible). Posted on container list earlier for
feedback.
- Create a new nsproxy and its associated namespaces and pass it back to
caller to attach it to right process.
- Changed all copy_*_ns() routines to return a new copy of namespace
instead of attaching it to task->nsproxy.
- Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
- Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
just incase.
- Get rid of all individual unshare_*_ns() routines and make use of
copy_*_ns() instead.
[akpm@osdl.org: cleanups, warning fix]
[clg@fr.ibm.com: remove dup_namespaces() declaration]
[serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
[akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Tesarik discovered a problem in remove_arg_zero(). He writes:
When a script is loaded, load_script() replaces argv[0] with the
name of the interpreter and the filename passed to the exec syscall.
However, there is no guarantee that the length of the interpreter
name plus the length of the filename is greater than the length of
the original argv[0]. If the difference happens to cross a page boundary,
setup_arg_pages() will call put_dirty_page() [aka install_arg_page()]
with an address outside the VMA.
Therefore, remove_arg_zero() must free all pages which would be unused
after the argument is removed.
So, rewrite the remove_arg_zero function without gotos, with a few comments,
and with the commonly used explicit index/offset. This fixes the problem
and makes it easier to understand as well.
[a.p.zijlstra@chello.nl: add comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct the misspelling of the preprocessor check of a Kconfig option to refer
to CONFIG_REISERFS_PROC_INFO and not just the incorrect REISERFS_PROC_INFO.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sb_read may return NULL, let's explicitly check it. If so free new bitmap
blocks array, after this we may safely exit as it done above during bitmap
allocation.
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sb_read may return NULL, so let's explicitly check it.
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2
Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2
Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replacing (n & (n-1)) in the context of power of 2 checks with
is_power_of_2
Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, devpts doesn't generate an fsnotify event upon pts creation
because the regular vfs paths aren't involved. Deallocation, on the other
hand, correctly generates a nameremove event thanks to the d_delete()
invocation in devpts_pty_kill().
This patch adds the missing fsnotify_create() trigger in devpts_pty_new().
Signed-off-by: Florin Malita <fmalita@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add SEEK_MAX and use it to validate lseek arguments from userspace.
Signed-off-by: Chris Snook <csnook@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert magic numbers to SEEK_* values from fs.h
Signed-off-by: Chris Snook <csnook@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Teach the dentry slab shrinker to aggressively shrink parent dentries when
shrinking the dentry cache.
This is done to attempt to improve the situation where the dentry slab cache
gets a lot of internal fragmentation due to pages containing directory
dentries. It is expected that this change will cause some of those dentries
to be reaped earlier, and with less scanning.
Needs careful testing.
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The time shrink_dcache_parent() takes, grows quadratically with the depth
of the tree under 'parent'. This starts to get noticable at about 10,000.
These kinds of depths don't occur normally, and filesystems which invoke
shrink_dcache_parent() via d_invalidate() seem to have other depth
dependent timings, so it's not even easy to expose this problem.
However with FUSE it's easy to create a deep tree and d_invalidate()
will also get called. This can make a syscall hang for a very long
time.
This is the original discovery of the problem by Russ Cox:
http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826
The following patch fixes the quadratic behavior, by optionally allowing
prune_dcache() to prune ancestors of a dentry in one go, instead of doing
it one at a time.
Common code in dput() and prune_one_dentry() is extracted into a new helper
function d_kill().
shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use
the ancestry-pruner option. Only for shrink_dcache_memory() is this
behavior not desirable, so it keeps using the old algorithm.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This past week I was playing around with that pahole tool
(http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
of various struct in the kernel. I was surprised by the size of the
task_struct on x86_64, approaching 4K. I looked through the fields in
task_struct and found that a number of them were declared as "unsigned
long" rather than "unsigned int" despite them appearing okay as 32-bit
sized fields. On x86_64 "unsigned long" ends up being 8 bytes in size and
forces 8 byte alignment. Is there a reason there a reason they are
"unsigned long"?
The patch below drops the size of the struct from 3808 bytes (60 64-byte
cachelines) to 3760 bytes (59 64-byte cachelines). A couple other fields
in the task struct take a signficant amount of space:
struct thread_struct thread; 688
struct held_lock held_locks[30]; 1680
CONFIG_LOCKDEP is turned on in the .config
[akpm@linux-foundation.org: fix printk warnings]
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Taken from http://bugzilla.kernel.org/show_bug.cgi?id=5079
signed long ranges from -2.147.483.648 to 2.147.483.647 on x86 32bit
10000011110110100100111110111101 .. -2,082,844,739
10000011110110100100111110111101 .. 2,212,122,557 <- this currently gets
stored on the disk but when converting it to a 64bit signed long value it loses
its sign and becomes positive.
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: <linux-ext4@vger.kernel.org>
Andreas says:
This patch is now treating timestamps with the high bit set as negative
times (before Jan 1, 1970). This means we lose 1/2 of the possible range
of timestamps (lopping off 68 years before unix timestamp overflow -
now only 30 years away :-) to handle the extremely rare case of setting
timestamps into the distant past.
If we are only interested in fixing the underflow case, we could just
limit the values to 0 instead of storing negative values. At worst this
will skew the timestamp by a few hours for timezones in the far east
(files would still show Jan 1, 1970 in "ls -l" output).
That said, it seems 32-bit systems (mine at least) allow files to be set
into the past (01/01/1907 works fine) so it seems this patch is bringing
the x86_64 behaviour into sync with other kernels.
On the plus side, we have a patch that is ready to add nanosecond timestamps
to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so
this extends the maximum date to 2242.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/$PID/fd has r-x------ permissions, so if process does setuid(), it
will not be able to access /proc/*/fd/. This breaks fstatat() emulation
in glibc.
open("foo", O_RDONLY|O_DIRECTORY) = 4
setuid32(65534) = 0
stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied)
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Kirill Korotaev <dev@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
block_write_full_page() forgot to propagate ENPSOC into the address_space.
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup: setting an outstanding error on a mapping was open coded too many
times. Factor it out in mapping_set_error().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hostfs needed some style goodness.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows hostfs_setattr() to work on unlinked open files by calling
set_attr() (the userspace part) with the inode's fd.
Without this, applications that depend on doing attribute changes to unlinked
open files will fail.
It works by using the fd versions instead of the path ones (for example
fchmod() instead of chmod(), fchown() instead of chown()) when an fd is
available.
Signed-off-by: Alberto Bertogli <albertito@gmail.com>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet, thank you for disclosing this bug.
Readahead logic somehow fails to populate the page range with data.
It can be because
1) the readahead routine is not always called in the following lines of
fs/splice.c:
if (!loff || nr_pages > 1)
page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);
2) even called, page_cache_readahead() wont guarantee the pages are there.
It wont submit readahead I/O for pages already in the radix tree, or when
(ra_pages == 0), or after 256 cache hits.
In your case, it should be because of the retried reads, which lead to
excessive cache hits, and disables readahead at some time.
And that _one_ failure of readahead blocks the whole read process.
The application receives EAGAIN and retries the read, but
__generic_file_splice_read() refuse to make progress:
- in the previous invocation, it has allocated a blank page and inserted it
into the radix tree, but never has the chance to start I/O for it: the test
of SPLICE_F_NONBLOCK goes before that.
- in the retried invocation, the readahead code will neither get out of the
cache hit mode, nor will it submit I/O for an already existing page.
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA()
which means that the file could change from not-cached to cached and we
need to redo the direct I/O checks. We should also redo the direct I/O
checks when the file size changes regardless if O_APPEND is set or not.
SGI-PV: 963483
SGI-Modid: xfs-linux-melb:xfs-kern:28440a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The problem that has been addressed is that of synchronising updates of
the file size with writes that extend a file. Without the fix the update
of a file's size, as a result of a write beyond eof, is independent of
when the cached data is flushed to disk. Often the file size update would
be written to the filesystem log before the data is flushed to disk. When
a system crashes between these two events and the filesystem log is
replayed on mount the file's size will be set but since the contents never
made it to disk the file is full of holes. If some of the cached data was
flushed to disk then it may just be a section of the file at the end that
has holes.
There are existing fixes to help alleviate this problem, particularly in
the case where a file has been truncated, that force cached data to be
flushed to disk when the file is closed. If the system crashes while the
file(s) are still open then this flushing will never occur.
The fix that we have implemented is to introduce a second file size,
called the in-memory file size, that represents the current file size as
viewed by the user. The existing file size, called the on-disk file size,
is the one that get's written to the filesystem log and we only update it
when it is safe to do so. When we write to a file beyond eof we only
update the in- memory file size in the write operation. Later when the I/O
operation, that flushes the cached data to disk completes, an I/O
completion routine will update the on-disk file size. The on-disk file
size will be updated to the maximum offset of the I/O or to the value of
the in-memory file size if the I/O includes eof.
SGI-PV: 958522
SGI-Modid: xfs-linux-melb:xfs-kern:28322a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This change addresses a race in xfs_write() where, for direct I/O, the
flags need_i_mutex and need_flush are setup before the iolock is acquired.
The logic used to setup the flags may change between setting the flags and
acquiring the iolock resulting in these flags having incorrect values. For
example, if a file is not currently cached then need_i_mutex is set to
zero and then if the file is cached before the iolock is acquired we will
fail to do the flushinval before the direct write.
The flush (and also the call to xfs_zero_eof()) need to be done with the
iolock held exclusive so we need to acquire the iolock before checking for
cached data (or if the write begins after eof) to prevent this state from
changing. For direct I/O I've chosen to always acquire the iolock in
shared mode initially and if there is a need to promote it then drop it
and reacquire it.
There's also some other tidy-ups including removing the O_APPEND offset
adjustment since that work is done in generic_write_checks() (and we don't
use offset as an input parameter anywhere).
SGI-PV: 962170
SGI-Modid: xfs-linux-melb:xfs-kern:28319a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When uquota and oquota (gquota/pquota) are enabled for accounting both are
enforced if ether has enforcement active.
Conditions:
- Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled.
- Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled.
- The usage without enforce is reached at the soft limit.
Problems:
1. "repquota" shows all grace time even if no enforcement.
2. we cannot make a file over a hard limits even if no enforcement.
SGI-PV: 962291
SGI-Modid: xfs-linux-melb:xfs-kern:28272a
Signed-off-by: Kouta Ooizumi <k-ooizumi@tnes.nec.co.jp>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>