linux

Commit Graph

Author	SHA1	Message	Date
Davide Libenzi	395108880e	epoll keyed wakeups: make eventfd use keyed wakeups Introduce keyed event wakeups inside the eventfd code. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: William Lee Irwin III <wli@movementarian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:20 -07:00
Davide Libenzi	2dfa4eeab0	epoll keyed wakeups: teach epoll about hints coming with the wakeup key Use the events hint now sent by some devices, to avoid unnecessary wakeups for events that are of no interest for the caller. This code handles both devices that are sending keyed events, and the ones that are not (and event the ones that sometimes send events, and sometimes don't). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: William Lee Irwin III <wli@movementarian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:20 -07:00
Davide Libenzi	bcd0b235bf	eventfd: improve support for semaphore-like behavior People started using eventfd in a semaphore-like way where before they were using pipes. That is, counter-based resource access. Where a "wait()" returns immediately by decrementing the counter by one, if counter is greater than zero. Otherwise will wait. And where a "post(count)" will add count to the counter releasing the appropriate amount of waiters. If eventfd the "post" (write) part is fine, while the "wait" (read) does not dequeue 1, but the whole counter value. The problem with eventfd is that a read() on the fd returns and wipes the whole counter, making the use of it as semaphore a little bit more cumbersome. You can do a read() followed by a write() of COUNTER-1, but IMO it's pretty easy and cheap to make this work w/out extra steps. This patch introduces a new eventfd flag that tells eventfd to only dequeue 1 from the counter, allowing simple read/write to make it behave like a semaphore. Simple test here: http://www.xmailserver.org/eventfd-sem.c To be back-compatible with earlier kernels, userspace applications should probe for the availability of this feature via #ifdef EFD_SEMAPHORE fd = eventfd2 (CNT, EFD_SEMAPHORE); if (fd == -1 && errno == EINVAL) <fallback> #else <fallback> #endif Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: <linux-api@vger.kernel.org> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:20 -07:00
Tony Battersby	4f0989dbfa	epoll: use real type instead of void * eventpoll.c uses void * in one place for no obvious reason; change it to use the real type instead. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:20 -07:00
Tony Battersby	e057e15ff6	epoll: clean up ep_modify ep_modify() doesn't need to set event.data from within the ep->lock spinlock as the comment suggests. The only place event.data is used is ep_send_events_proc(), and this is protected by ep->mtx instead of ep->lock. Also update the comment for mutex_lock() at the top of ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not epoll_ctl(EPOLL_CTL_MOD). ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave(). Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Tony Battersby	d1bc90dd5d	epoll: remove unnecessary xchg xchg in ep_unregister_pollwait() is unnecessary because it is protected by either epmutex or ep->mtx (the same protection as ep_remove()). If xchg was necessary, it would be insufficient to protect against problems: if multiple concurrent calls to ep_unregister_pollwait() were possible then a second caller that returns without doing anything because nwait == 0 could return before the waitqueues are removed by the first caller, which looks like it could lead to problematic races with ep_poll_callback(). So remove xchg and add comments about the locking. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Tony Battersby	d030588282	epoll: remember the event if epoll_wait returns -EFAULT If epoll_wait returns -EFAULT, the event that was being returned when the fault was encountered will be forgotten. This is not a big deal since EFAULT will happen only if a buggy userspace program passes in a bad address, in which case what happens later usually doesn't matter. However, it is easy to remember the event for later, and this patch makes a simple change to do that. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Tony Battersby	abff55cee1	epoll: don't use current in irq context ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without dereferencing it) to detect callback recursion, but it may be called from irq context where the use of current is generally discouraged. It would be better to use get_cpu() and put_cpu() to detect the callback recursion. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Davide Libenzi	bb57c3edcd	epoll: remove debugging code Remove debugging code from epoll. There's no need for it to be included into mainline code. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Davide Libenzi	296e236e96	epoll: fix epoll's own poll (update) Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:19 -07:00
Davide Libenzi	5071f97ec6	epoll: fix epoll's own poll Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even though there are no actual ready monitored fds. The bug shows up if you add an epoll fd inside another fd container (poll, select, epoll). The problem is that callback-based wake ups used by epoll does not carry (patches will follow, to fix this) any information about the events that actually happened. So the callback code, since it can't call the file* ->poll() inside the callback, chains the file* into a ready-list. So, suppose you added an fd with EPOLLOUT only, and some data shows up on the fd, the file* mapped by the fd will be added into the ready-list (via wakeup callback). During normal epoll_wait() use, this condition is sorted out at the time we're actually able to call the file's f_op->poll(). Inside the old epoll's f_op->poll() though, only a quick check !list_empty(ready-list) was performed, and this could have led to reporting POLLIN even though no ready fds would show up at a following epoll_wait(). In order to correctly report the ready status for an epoll fd, the ready-list must be checked to see if any really available fd+event would be ready in a following epoll_wait(). Operation (calling f_op->poll() from inside f_op->poll()) that, like wake ups, must be handled with care because of the fact that epoll fds can be added to other epoll fds. Test code: / * epoll_test by Davide Libenzi (Simple code to test epoll internals) * Copyright (C) 2008 Davide Libenzi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA * * Davide Libenzi <davidel@xmailserver.org> * / #include <sys/types.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <signal.h> #include <limits.h> #include <poll.h> #include <sys/epoll.h> #include <sys/wait.h> #define EPWAIT_TIMEO (1 1000) #ifndef POLLRDHUP #define POLLRDHUP 0x2000 #endif #define EPOLL_MAX_CHAIN 100L #define EPOLL_TF_LOOP (1 << 0) struct epoll_test_cfg { long size; long flags; }; static int xepoll_create(int n) { int epfd; if ((epfd = epoll_create(n)) == -1) { perror("epoll_create"); exit(2); } return epfd; } static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event evt) { if (epoll_ctl(epfd, cmd, fd, evt) < 0) { perror("epoll_ctl"); exit(3); } } static void xpipe(int fds) { if (pipe(fds)) { perror("pipe"); exit(4); } } static pid_t xfork(void) { pid_t pid; if ((pid = fork()) == (pid_t) -1) { perror("pipe"); exit(5); } return pid; } static int run_forked_proc(int (proc)(void ), void data) { int status; pid_t pid; if ((pid = xfork()) == 0) exit((proc)(data)); if (waitpid(pid, &status, 0) != pid) { perror("waitpid"); return -1; } return WIFEXITED(status) ? WEXITSTATUS(status): -2; } static int check_events(int fd, int timeo) { struct pollfd pfd; fprintf(stdout, "Checking events for fd %d\n", fd); memset(&pfd, 0, sizeof(pfd)); pfd.fd = fd; pfd.events = POLLIN \| POLLOUT; if (poll(&pfd, 1, timeo) < 0) { perror("poll()"); return 0; } if (pfd.revents & POLLIN) fprintf(stdout, "\tPOLLIN\n"); if (pfd.revents & POLLOUT) fprintf(stdout, "\tPOLLOUT\n"); if (pfd.revents & POLLERR) fprintf(stdout, "\tPOLLERR\n"); if (pfd.revents & POLLHUP) fprintf(stdout, "\tPOLLHUP\n"); if (pfd.revents & POLLRDHUP) fprintf(stdout, "\tPOLLRDHUP\n"); return pfd.revents; } static int epoll_test_tty(void data) { int epfd, ifd = fileno(stdin), res; struct epoll_event evt; if (check_events(ifd, 0) != POLLOUT) { fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd); return 1; } epfd = xepoll_create(1); fprintf(stdout, "Created epoll fd (%d)\n", epfd); memset(&evt, 0, sizeof(evt)); evt.events = EPOLLIN; xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt); if (check_events(epfd, 0) & POLLIN) { res = epoll_wait(epfd, &evt, 1, 0); if (res == 0) { fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n", epfd); return 2; } } return 0; } static int epoll_wakeup_chain(void data) { struct epoll_test_cfg tcfg = data; int i, res, epfd, bfd, nfd, pfds[2]; pid_t pid; struct epoll_event evt; memset(&evt, 0, sizeof(evt)); evt.events = EPOLLIN; epfd = bfd = xepoll_create(1); for (i = 0; i < tcfg->size; i++) { nfd = xepoll_create(1); xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt); bfd = nfd; } xpipe(pfds); if (tcfg->flags & EPOLL_TF_LOOP) { xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt); / * If we're testing for loop, we want that the wakeup * triggered by the write to the pipe done in the child * process, triggers a fake event. So we add the pipe * read size with EPOLLOUT events. This will trigger * an addition to the ready-list, but no real events * will be there. The the epoll kernel code will proceed * in calling f_op->poll() of the epfd, triggering the * loop we want to test. / evt.events = EPOLLOUT; } xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt); / * The pipe write must come after the poll(2) call inside * check_events(). This tests the nested wakeup code in * fs/eventpoll.c:ep_poll_safewake() * By having the check_events() (hence poll(2)) happens first, * we have poll wait queue filled up, and the write(2) in the * child will trigger the wakeup chain. / if ((pid = xfork()) == 0) { sleep(1); write(pfds[1], "w", 1); exit(0); } res = check_events(epfd, 2000) & POLLIN; if (waitpid(pid, NULL, 0) != pid) { perror("waitpid"); return -1; } return res; } static int epoll_poll_chain(void data) { struct epoll_test_cfg tcfg = data; int i, res, epfd, bfd, nfd, pfds[2]; pid_t pid; struct epoll_event evt; memset(&evt, 0, sizeof(evt)); evt.events = EPOLLIN; epfd = bfd = xepoll_create(1); for (i = 0; i < tcfg->size; i++) { nfd = xepoll_create(1); xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt); bfd = nfd; } xpipe(pfds); if (tcfg->flags & EPOLL_TF_LOOP) { xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt); / * If we're testing for loop, we want that the wakeup * triggered by the write to the pipe done in the child * process, triggers a fake event. So we add the pipe * read size with EPOLLOUT events. This will trigger * an addition to the ready-list, but no real events * will be there. The the epoll kernel code will proceed * in calling f_op->poll() of the epfd, triggering the * loop we want to test. / evt.events = EPOLLOUT; } xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt); / * The pipe write mush come before the poll(2) call inside * check_events(). This tests the nested f_op->poll calls code in * fs/eventpoll.c:ep_eventpoll_poll() * By having the pipe write(2) happen first, we make the kernel * epoll code to load the ready lists, and the following poll(2) * done inside check_events() will test nested poll code in * ep_eventpoll_poll(). / if ((pid = xfork()) == 0) { write(pfds[1], "w", 1); exit(0); } sleep(1); res = check_events(epfd, 1000) & POLLIN; if (waitpid(pid, NULL, 0) != pid) { perror("waitpid"); return -1; } return res; } int main(int ac, char av) { int error; struct epoll_test_cfg tcfg; fprintf(stdout, "\n******* Testing TTY events\n"); error = run_forked_proc(epoll_test_tty, NULL); fprintf(stdout, error == 0 ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = 3; tcfg.flags = 0; fprintf(stdout, "\n****** Testing short wakeup chain\n"); error = run_forked_proc(epoll_wakeup_chain, &tcfg); fprintf(stdout, error == POLLIN ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = EPOLL_MAX_CHAIN; tcfg.flags = 0; fprintf(stdout, "\n****** Testing long wakeup chain (HOLD ON)\n"); error = run_forked_proc(epoll_wakeup_chain, &tcfg); fprintf(stdout, error == 0 ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = 3; tcfg.flags = 0; fprintf(stdout, "\n****** Testing short poll chain\n"); error = run_forked_proc(epoll_poll_chain, &tcfg); fprintf(stdout, error == POLLIN ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = EPOLL_MAX_CHAIN; tcfg.flags = 0; fprintf(stdout, "\n****** Testing long poll chain (HOLD ON)\n"); error = run_forked_proc(epoll_poll_chain, &tcfg); fprintf(stdout, error == 0 ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = 3; tcfg.flags = EPOLL_TF_LOOP; fprintf(stdout, "\n****** Testing loopy wakeup chain (HOLD ON)\n"); error = run_forked_proc(epoll_wakeup_chain, &tcfg); fprintf(stdout, error == 0 ? "****** OK\n": "****** FAIL (%d)\n", error); tcfg.size = 3; tcfg.flags = EPOLL_TF_LOOP; fprintf(stdout, "\n****** Testing loopy poll chain (HOLD ON)\n"); error = run_forked_proc(epoll_poll_chain, &tcfg); fprintf(stdout, error == 0 ? "****** OK\n": "******** FAIL (%d)\n", error); return 0; } Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:18 -07:00
Harvey Harrison	63cd885426	ntfs: remove private wrapper of endian helpers The base versions handle constant folding now and are shorter than these private wrappers, use them directly. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:18 -07:00
Eric Sandeen	c2d7543851	filesystem freeze: allow SysRq emergency thaw to thaw frozen filesystems Now that the filesystem freeze operation has been elevated to the VFS, and is just an ioctl away, some sort of safety net for unintentionally frozen root filesystems may be in order. The timeout thaw originally proposed did not get merged, but perhaps something like this would be useful in emergencies. For example, freeze /path/to/mountpoint may freeze your root filesystem if you forgot that you had that unmounted. I chose 'j' as the last remaining character other than 'h' which is sort of reserved for help (because help is generated on any unknown character). I've tested this on a non-root fs with multiple (nested) freezers, as well as on a system rendered unresponsive due to a frozen root fs. [randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: Takashi Sato <t-sato@yk.jp.nec.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:17 -07:00
KAMEZAWA Hiroyuki	327c0e9686	vmscan: fix it to take care of nodemask try_to_free_pages() is used for the direct reclaim of up to SWAP_CLUSTER_MAX pages when watermarks are low. The caller to alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to be used but this is not passed to try_to_free_pages(). This can lead to unnecessary reclaim of pages that are unusable by the caller and int the worst case lead to allocation failure as progress was not been make where it is needed. This patch passes the nodemask used for alloc_pages_nodemask() to try_to_free_pages(). Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:15 -07:00
Johannes Weiner	2678958e12	ramfs-nommu: use generic lru cache Instead of open-coding the lru-list-add pagevec batching when expanding a file mapping from zero, defer to the appropriate page cache function that also takes care of adding the page to the lru list. This is cleaner, saves code and reduces the stack footprint by 16 words worth of pagevec. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Howells <dhowells@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.com> Cc: MinChan Kim <minchan.kim@gmail.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:15 -07:00
Hugh Dickins	851a039cc5	mm: page_mkwrite change prototype to match fault: fix sysfs Fix warnings and return values in sysfs bin_page_mkwrite(), fixing fs/sysfs/bin.c: In function `bin_page_mkwrite': fs/sysfs/bin.c:250: warning: passing argument 2 of `bb->vm_ops->page_mkwrite' from incompatible pointer type fs/sysfs/bin.c: At top level: fs/sysfs/bin.c:280: warning: initialization from incompatible pointer type Expects to have my [PATCH next] sysfs: fix some bin_vm_ops errors Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Cc: "Eric W. Biederman" <ebiederm@aristanetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Nick Piggin	56a76f8275	fs: fix page_mkwrite error cases in core code and btrfs page_mkwrite is called with neither the page lock nor the ptl held. This means a page can be concurrently truncated or invalidated out from underneath it. Callers are supposed to prevent truncate races themselves, however previously the only thing they can do in case they hit one is to raise a SIGBUS. A sigbus is wrong for the case that the page has been invalidated or truncated within i_size (eg. hole punched). Callers may also have to perform memory allocations in this path, where again, SIGBUS would be wrong. The previous patch ("mm: page_mkwrite change prototype to match fault") made it possible to properly specify errors. Convert the generic buffer.c code and btrfs to return sane error values (in the case of page removed from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit without doing anything, and the fault will be retried properly). This fixes core code, and converts btrfs as a template/example. All other filesystems defining their own page_mkwrite should be fixed in a similar manner. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Nick Piggin	c2ec175c39	mm: page_mkwrite change prototype to match fault Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:14 -07:00
Ravikiran G Thirumalai	2584e51732	mm: reintroduce and deprecate rlimit based access for SHM_HUGETLB Allow non root users with sufficient mlock rlimits to be able to allocate hugetlb backed shm for now. Deprecate this though. This is being deprecated because the mlock based rlimit checks for SHM_HUGETLB is not consistent with mmap based huge page allocations. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:12 -07:00
Ravikiran G Thirumalai	8a0bdec194	mm: fix SHM_HUGETLB to work with users in hugetlb_shm_group Fix hugetlb subsystem so that non root users belonging to hugetlb_shm_group can actually allocate hugetlb backed shm. Currently non root users cannot even map one large page using SHM_HUGETLB when they belong to the gid in /proc/sys/vm/hugetlb_shm_group. This is because allocation size is verified against RLIMIT_MEMLOCK resource limit even if the user belongs to hugetlb_shm_group. This patch 1. Fixes hugetlb subsystem so that users with CAP_IPC_LOCK and users belonging to hugetlb_shm_group don't need to be restricted with RLIMIT_MEMLOCK resource limits 2. This patch also disables mlock based rlimit checking (which will be reinstated and marked deprecated in a subsequent patch). Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:12 -07:00
Edward Shishkin	e3a7cca1ef	vfs: add/use account_page_dirtied() Add a helper function account_page_dirtied(). Use that from two callsites. reiser4 adds a function which adds a third callsite. Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:12 -07:00
Alexey Dobriyan	0f043a81eb	proc tty: remove struct tty_operations::read_proc struct tty_operations::proc_fops took it's place and there is one less create_proc_read_entry() user now! Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:10 -07:00
Alexey Dobriyan	ae149b6bec	proc tty: add struct tty_operations::proc_fops Used for gradual switch of TTY drivers from using ->read_proc which helps with gradual switch from ->read_proc for the whole tree. As side effect, fix possible race condition when ->data initialized after PDE is hooked into proc tree. ->proc_fops takes precedence over ->read_proc. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:08 -07:00
Dmitri Vorobiev	ced117c73e	Remove two unneeded exports and make two symbols static in fs/mpage.c Commit `29a814d2ee` (vfs: add hooks for ext4's delayed allocation support) exported the following functions mpage_bio_submit() __mpage_writepage() for the benefit of ext4's delayed allocation support. Since commit `a1d6cc563b` (ext4: Rework the ext4_da_writepages() function), these functions are not used by the ext4 driver anymore. However, the now unnecessary exports still remain, and this patch removes those. Moreover, these two functions can become static again. The issue was spotted by namespacecheck. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-01 07:38:54 -04:00
Al Viro	47e4491b40	Cleanup after commit `585d3bc06f` fsync_bdev() export and a bunch of stubs for !CONFIG_BLOCK case had been left behind Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-01 07:07:16 -04:00
Al Viro	e5824c97a9	Trim includes of fdtable.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:28 -04:00
Al Viro	d9e66c7296	Don't crap into descriptor table in binfmt_som Same story as in binfmt_elf, except that in binfmt_som we actually forget to close the sucker. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:28 -04:00
Al Viro	d0f35dde6e	Trim includes in binfmt_elf Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:27 -04:00
Al Viro	e7b9b550f5	Don't mess with descriptor table in load_elf_binary() ... since we don't tell anyone which descriptor does the file get. We used to, but only in case of ELF binary with a.out loader and that stuff has been gone for a while. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:27 -04:00
Al Viro	5ad4e53bd5	Get rid of indirect include of fs_struct.h Don't pull it in sched.h; very few files actually need it and those can include directly. sched.h itself only needs forward declaration of struct fs_struct; Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:27 -04:00
Al Viro	ce3b0f8d5c	New helper - current_umask() current->fs->umask is what most of fs_struct users are doing. Put that into a helper function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	f1191b50ec	check_unsafe_exec() doesn't care about signal handlers sharing ... since we'll unshare sighand anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	498052bba5	New locking/refcounting for fs_struct * all changes of current->fs are done under task_lock and write_lock of old fs->lock * refcount is not atomic anymore (same protection) * its decrements are done when removing reference from current; at the same time we decide whether to free it. * put_fs_struct() is gone * new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do execve() and only subthreads share fs_struct. Cleared when finishing exec (success and failure alike). Makes CLONE_FS fail with -EAGAIN if set. * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread is in progress. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	3e93cd6718	Take fs_struct handling to new file (fs/fs_struct.c) Pure code move; two new helper functions for nfsd and daemonize (unshare_fs_struct() and daemonize_fs_struct() resp.; for now - the same code as used to be in callers). unshare_fs_struct() exported (for nfsd, as copy_fs_struct()/exit_fs() used to be), copy_fs_struct() and exit_fs() don't need exports anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	f8ef3ed2be	Get rid of bumping fs_struct refcount in pivot_root(2) Not because execve races with _that_ are serious - we really need a situation when final drop of fs_struct refcount is done by something that used to have it as current->fs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:25 -04:00
Chris Mason	d57e62b897	Btrfs: try to free metadata pages when we free btree blocks COW means we cycle though blocks fairly quickly, and once we free an extent on disk, it doesn't make much sense to keep the pages around. This commit tries to immediately free the page when we free the extent, which lowers our memory footprint significantly. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-31 14:27:58 -04:00
Chris Mason	5a3f23d515	Btrfs: add extra flushing for renames and truncates Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-31 14:27:58 -04:00
Boaz Harrosh	0d8fe329a8	fs: Add exofs to Kernel build - Add exofs to fs/Kconfig under "menu 'Miscellaneous filesystems'" - Add exofs to fs/Makefile Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:40 +03:00
Boaz Harrosh	214c8adb87	exofs: Documentation Added some documentation in exofs.txt, as well as a BUGS file. For further reading, operation instructions, example scripts and up to date infomation and code please see: http://open-osd.org Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:38 +03:00
Boaz Harrosh	8cf74b3936	exofs: export_operations implement export_operations and set in superblock. It is now posible to export exofs via nfs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:36 +03:00
Boaz Harrosh	ba9e5e98ca	exofs: super_operations and file_system_type This patch ties all operation vectors into a file system superblock and registers the exofs file_system_type at module's load time. * The file system control block (AKA on-disk superblock) resides in an object with a special ID (defined in common.h). Information included in the file system control block is used to fill the in-memory superblock structure at mount time. This object is created before the file system is used by mkexofs.c It contains information such as: - The file system's magic number - The next inode number to be allocated Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:34 +03:00
Boaz Harrosh	e6af00f1d1	exofs: dir_inode and directory operations implementation of directory and inode operations. * A directory is treated as a file, and essentially contains a list of <file name, inode #> pairs for files that are found in that directory. The object IDs correspond to the files' inode numbers and are allocated using a 64bit incrementing global counter. * Each file's control block (AKA on-disk inode) is stored in its object's attributes. This applies to both regular files and other types (directories, device files, symlinks, etc.). Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:31 +03:00
Boaz Harrosh	beaec07ba6	exofs: address_space_operations OK Now we start to read and write from osd-objects. We try to collect at most contiguous pages as possible in a single write/read. The first page index is the object's offset. TODO: In 64-bit a single bio can carry at most 128 pages. Add support of chaining multiple bios Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:29 +03:00
Boaz Harrosh	982980d753	exofs: symlink_inode and fast_symlink_inode operations Generic implementation of symlink ops. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:27 +03:00
Boaz Harrosh	e806271916	exofs: file and file_inode operations implementation of the file_operations and inode_operations for regular data files. Most file_operations are generic vfs implementations except: - exofs_truncate will truncate the OSD object as well - Generic file_fsync is not good for none_bd devices so open code it - The default for .flush in Linux is todo nothing so call exofs_fsync on the file. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:24 +03:00
Boaz Harrosh	b14f8ab284	exofs: Kbuild, Headers and osd utils This patch includes osd infrastructure that will be used later by the file system. Also the declarations of constants, on disk structures, and prototypes. And the Kbuild+Kconfig files needed to build the exofs module. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2009-03-31 19:44:20 +03:00
Felix Blyakher	1aacc064e0	Revert "xfs: increase the maximum number of supported ACL entries" This reverts commit `8b11217173`. Premature unintended commit. Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-31 00:23:37 -05:00
NeilBrown	bff61975b3	md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
Felix Blyakher	5123bc35d2	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-03-30 22:17:44 -05:00
Felix Blyakher	930861c4e6	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-03-30 22:08:33 -05:00
Linus Torvalds	d17abcd541	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: oprofile: Thou shalt not call __exit functions from __init functions cpumask: remove the now-obsoleted pcibus_to_cpumask(): generic cpumask: remove cpumask_t from core cpumask: convert rcutorture.c cpumask: use new cpumask_ functions in core code. cpumask: remove references to struct irqaction's mask field. cpumask: use mm_cpumask() wrapper: kernel/fork.c cpumask: use set_cpu_active in init/main.c cpumask: remove node_to_first_cpu cpumask: fix seq_bitmap_*() functions. cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL	2009-03-30 18:00:26 -07:00
Linus Torvalds	cf2f7d7c90	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: Revert "proc: revert /proc/uptime to ->read_proc hook" proc 2/2: remove struct proc_dir_entry::owner proc 1/2: do PDE usecounting even for ->read_proc, ->write_proc proc: fix sparse warnings in pagemap_read() proc: move fs/proc/inode-alloc.txt comment into a source file	2009-03-30 16:06:04 -07:00
Jeff Mahoney	3a355cc61d	reiserfs: xattr_create is unused with xattrs disabled This patch ifdefs xattr_create when xattrs aren't enabled. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 14:28:58 -07:00
Alexey Dobriyan	a9caa3de24	Revert "proc: revert /proc/uptime to ->read_proc hook" This reverts commit `6c87df37dc`. proc files implemented through seq_file do pread(2) now. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-03-31 01:14:58 +04:00
Alexey Dobriyan	99b7623380	proc 2/2: remove struct proc_dir_entry::owner Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy as correctly noted at bug #12454. Someone can lookup entry with NULL ->owner, thus not pinning enything, and release it later resulting in module refcount underflow. We can keep ->owner and supply it at registration time like ->proc_fops and ->data. But this leaves ->owner as easy-manipulative field (just one C assignment) and somebody will forget to unpin previous/pin current module when switching ->owner. ->proc_fops is declared as "const" which should give some thoughts. ->read_proc/->write_proc were just fixed to not require ->owner for protection. rmmod'ed directories will be empty and return "." and ".." -- no harm. And directories with tricky enough readdir and lookup shouldn't be modular. We definitely don't want such modular code. Removing ->owner will also make PDE smaller. So, let's nuke it. Kudos to Jeff Layton for reminding about this, let's say, oversight. http://bugzilla.kernel.org/show_bug.cgi?id=12454 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-03-31 01:14:44 +04:00
Alexey Dobriyan	3dec7f59c3	proc 1/2: do PDE usecounting even for ->read_proc, ->write_proc struct proc_dir_entry::owner is going to be removed. Now it's only necessary to protect PDEs which are using ->read_proc, ->write_proc hooks. However, ->owner assignments are racy and make it very easy for someone to switch ->owner on live PDE (as some subsystems do) without fixing refcounts and so on. http://bugzilla.kernel.org/show_bug.cgi?id=12454 So, ->owner is on death row. Proxy file operations exist already (proc_file_operations), just bump usecount when necessary. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-03-31 01:14:27 +04:00
Milind Arun Choudhary	09729a9919	proc: fix sparse warnings in pagemap_read() fs/proc/task_mmu.c:696:12: warning: cast removes address space of expression fs/proc/task_mmu.c:696:9: warning: incorrect type in assignment (different address spaces) fs/proc/task_mmu.c:696:9: expected unsigned long long [noderef] [usertype] <asn:1>out fs/proc/task_mmu.c:696:9: got unsigned long long [usertype] <noident> fs/proc/task_mmu.c:697:12: warning: cast removes address space of expression fs/proc/task_mmu.c:697:9: warning: incorrect type in assignment (different address spaces) fs/proc/task_mmu.c:697:9: expected unsigned long long [noderef] [usertype] <asn:1>end fs/proc/task_mmu.c:697:9: got unsigned long long [usertype] <noident> fs/proc/task_mmu.c:723:12: warning: cast removes address space of expression fs/proc/task_mmu.c:723:26: error: subtraction of different types can't work (different address spaces) fs/proc/task_mmu.c:725:24: error: subtraction of different types can't work (different address spaces) Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-03-31 01:14:22 +04:00
Randy Dunlap	1681bc30f2	proc: move fs/proc/inode-alloc.txt comment into a source file so that people will realize that it exists and can update it as needed. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-03-31 01:13:12 +04:00
Linus Torvalds	e1c5024828	Merge branch 'reiserfs-updates' from Jeff Mahoney * reiserfs-updates: (35 commits) reiserfs: rename [cn]_* variables reiserfs: rename p_._ variables reiserfs: rename p_s_tb to tb reiserfs: rename p_s_inode to inode reiserfs: rename p_s_bh to bh reiserfs: rename p_s_sb to sb reiserfs: strip trailing whitespace reiserfs: cleanup path functions reiserfs: factor out buffer_info initialization reiserfs: add atomic addition of selinux attributes during inode creation reiserfs: use generic readdir for operations across all xattrs reiserfs: journaled xattrs reiserfs: use generic xattr handlers reiserfs: remove i_has_xattr_dir reiserfs: make per-inode xattr locking more fine grained reiserfs: eliminate per-super xattr lock reiserfs: simplify xattr internal file lookups/opens reiserfs: Clean up xattrs when REISERFS_FS_XATTR is unset reiserfs: remove IS_PRIVATE helpers reiserfs: remove link detection code ... Fixed up conflicts manually due to: - quota name cleanups vs variable naming changes: fs/reiserfs/inode.c fs/reiserfs/namei.c fs/reiserfs/stree.c fs/reiserfs/xattr.c - exported include header cleanups include/linux/reiserfs_fs.h	2009-03-30 12:33:01 -07:00
Jeff Mahoney	ee93961be1	reiserfs: rename [cn]_* variables This patch renames n_, c_, etc variables to something more sane. This is the sixth in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:40 -07:00
Jeff Mahoney	d68caa9530	reiserfs: rename p_._ variables This patch is a simple s/p_._//g to the reiserfs code. This is the fifth in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:40 -07:00
Jeff Mahoney	a063ae1792	reiserfs: rename p_s_tb to tb This patch is a simple s/p_s_tb/tb/g to the reiserfs code. This is the fourth in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:40 -07:00
Jeff Mahoney	995c762ea4	reiserfs: rename p_s_inode to inode This patch is a simple s/p_s_inode/inode/g to the reiserfs code. This is the third in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	ad31a4fc03	reiserfs: rename p_s_bh to bh This patch is a simple s/p_s_bh/bh/g to the reiserfs code. This is the second in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	a9dd364358	reiserfs: rename p_s_sb to sb This patch is a simple s/p_s_sb/sb/g to the reiserfs code. This is the first in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	0222e6571c	reiserfs: strip trailing whitespace This patch strips trailing whitespace from the reiserfs code. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	3cd6dbe6fe	reiserfs: cleanup path functions This patch cleans up some redundancies in the reiserfs tree path code. decrement_bcount() is essentially the same function as brelse(), so we use that instead. decrement_counters_in_path() is exactly the same function as pathrelse(), so we kill that and use pathrelse() instead. There's also a bit of cleanup that makes the code a bit more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	fba4ebb5f0	reiserfs: factor out buffer_info initialization This is the first in a series of patches to make balance_leaf() not quite so insane. This patch factors out the open coded initializations of buffer_info structures and defines a few initializers for the 4 cases they're used. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	57fe60df62	reiserfs: add atomic addition of selinux attributes during inode creation Some time ago, some changes were made to make security inode attributes be atomically written during inode creation. ReiserFS fell behind in this area, but with the reworking of the xattr code, it's now fairly easy to add. The following patch adds the ability for security attributes to be added automatically during inode creation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:39 -07:00
Jeff Mahoney	a41f1a4715	reiserfs: use generic readdir for operations across all xattrs The current reiserfs xattr implementation open codes reiserfs_readdir and frees the path before calling the filldir function. Typically, the filldir function is something that modifies the file system, such as a chown or an inode deletion that also require reading of an inode associated with each direntry. Since the file system is modified, the path retained becomes invalid for the next run. In addition, it runs backwards in attempt to minimize activity. This is clearly suboptimal from a code cleanliness perspective as well as performance-wise. This patch implements a generic reiserfs_for_each_xattr that uses the generic readdir and a specific filldir routine that simply populates an array of dentries and then performs a specific operation on them. When all files have been operated on, it then calls the operation on the directory itself. The result is a noticable code reduction and better performance. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	0ab2621ebd	reiserfs: journaled xattrs Deadlocks are possible in the xattr code between the journal lock and the xattr sems. This patch implements journalling for xattr operations. The benefit is twofold: * It gets rid of the deadlock possibility by always ensuring that xattr write operations are initiated inside a transaction. * It corrects the problem where xattr backing files aren't considered any differently than normal files, despite the fact they are metadata. I discussed the added journal load with Chris Mason, and we decided that since xattrs (versus other journal activity) is fairly rare, the introduction of larger transactions to support journaled xattrs wouldn't be too big a deal. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	48b32a3553	reiserfs: use generic xattr handlers Christoph Hellwig had asked me quite some time ago to port the reiserfs xattrs to the generic xattr interface. This patch replaces the reiserfs-specific xattr handling code with the generic struct xattr_handler. However, since reiserfs doesn't split the prefix and name when accessing xattrs, it can't leverage generic_{set,get,list,remove}xattr without needlessly reconstructing the name on the back end. Update 7/26/07: Added missing dput() to deletion path. Update 8/30/07: Added missing mark_inode_dirty when i_mode is used to represent an ACL and no previous ACL existed. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	8ecbe550a1	reiserfs: remove i_has_xattr_dir With the changes to xattr root locking, the i_has_xattr_dir flag is no longer needed. This patch removes it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	8b6dd72a44	reiserfs: make per-inode xattr locking more fine grained The per-inode locking can be made more fine-grained to surround just the interaction with the filesystem itself. This really only applies to protecting reads during a write, since concurrent writes are barred with inode->i_mutex at the vfs level. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	d984561b32	reiserfs: eliminate per-super xattr lock With the switch to using inode->i_mutex locking during lookups/creation in the xattr root, the per-super xattr lock is no longer needed. This patch removes it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:38 -07:00
Jeff Mahoney	6c17675e1e	reiserfs: simplify xattr internal file lookups/opens The xattr file open/lookup code is needlessly complex. We can use vfs-level operations to perform the same work, and also simplify the locking constraints. The locking advantages will be exploited in future patches. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	a72bdb1cd2	reiserfs: Clean up xattrs when REISERFS_FS_XATTR is unset The current reiserfs xattr implementation will not clean up old xattr files if files are deleted when REISERFS_FS_XATTR is unset. This results in inaccessible lost files, wasting space. This patch compiles in basic xattr knowledge, such as how to delete them and change ownership for quota tracking. If the file system has never used xattrs, then the operation is quite fast: it returns immediately when it sees there is no .reiserfs_priv directory. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	6dfede6963	reiserfs: remove IS_PRIVATE helpers There are a number of helper functions for marking a reiserfs inode private that were leftover from reiserfs did its own thing wrt to private inodes. S_PRIVATE has been in the kernel for some time, so this patch removes the helpers and uses IS_PRIVATE instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	010f5a21a3	reiserfs: remove link detection code Early in the reiserfs xattr development, there was a plan to use hardlinks to save disk space for identical xattrs. That code never materialized and isn't going to, so this patch removes the detection code. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	ec6ea56b2f	reiserfs: xattr reiserfs_get_page takes offset instead of index This patch changes reiserfs_get_page to take an offset rather than an index since no callers calculate the index differently. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	f437c529e3	reiserfs: small variable cleanup This patch removes the xinode and mapping variables from reiserfs_xattr_{get,set}. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	0030b64570	reiserfs: use reiserfs_error() This patch makes many paths that are currently using warnings to handle the error. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:37 -07:00
Jeff Mahoney	1e5e59d431	reiserfs: introduce reiserfs_error() Although reiserfs can currently handle severe errors such as journal failure, it cannot handle less severe errors like metadata i/o failure. The following patch adds a reiserfs_error() function akin to the one in ext3. Subsequent patches will use this new error handler to handle errors more gracefully in general. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	32e8b10629	reiserfs: rearrange journal abort This patch kills off reiserfs_journal_abort as it is never called, and combines __reiserfs_journal_abort_{soft,hard} into one function called reiserfs_abort_journal, which performs the same work. It is silent as opposed to the old version, since the message was always issued after a regular 'abort' message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	c3a9c2109f	reiserfs: rework reiserfs_panic ReiserFS panics can be somewhat inconsistent. In some cases: * a unique identifier may be associated with it * the function name may be included * the device may be printed separately This patch aims to make warnings more consistent. reiserfs_warning() prints the device name, so printing it a second time is not required. The function name for a warning is always helpful in debugging, so it is now automatically inserted into the output. Hans has stated that every warning should have a unique identifier. Some cases lack them, others really shouldn't have them. reiserfs_warning() now expects an id associated with each message. In the rare case where one isn't needed, "" will suffice. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	78b6513d28	reiserfs: add locking around error buffer The formatting of the error buffer is race prone. It uses static buffers for both formatting and output. While overwriting the error buffer can product garbled output, overwriting the format buffer with incompatible % directives can cause crashes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	cacbe3d7ad	reiserfs: prepare_error_buf wrongly consumes va_arg vsprintf will consume varargs on its own. Skipping them manually results in garbage in the error buffer, or Oopses in the case of pointers. This patch removes the advancement and fixes a number of bugs where crashes were observed as side effects of a regular error report. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	45b03d5e8e	reiserfs: rework reiserfs_warning ReiserFS warnings can be somewhat inconsistent. In some cases: * a unique identifier may be associated with it * the function name may be included * the device may be printed separately This patch aims to make warnings more consistent. reiserfs_warning() prints the device name, so printing it a second time is not required. The function name for a warning is always helpful in debugging, so it is now automatically inserted into the output. Hans has stated that every warning should have a unique identifier. Some cases lack them, others really shouldn't have them. reiserfs_warning() now expects an id associated with each message. In the rare case where one isn't needed, "" will suffice. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:36 -07:00
Jeff Mahoney	1d889d9958	reiserfs: make some warnings informational In several places, reiserfs_warning is used when there is no warning, just a notice. This patch changes some of them to indicate that the message is merely informational. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:35 -07:00
Jeff Mahoney	a5437152ee	reiserfs: use more consistent printk formatting The output format between a warning/error/panic/info/etc changes with which one is used. The following patch makes the messages more internally consistent, but also more consistent with other Linux filesystems. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:35 -07:00
Jeff Mahoney	eba0030559	reiserfs: use buffer_info for leaf_paste_entries This patch makes leaf_paste_entries more consistent with respect to the other leaf operations. Using buffer_info instead of buffer_head directly allows us to get a superblock pointer for use in error handling. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:35 -07:00
Jeff Mahoney	600ed41675	reiserfs: audit transaction ids to always be unsigned ints This patch fixes up the reiserfs code such that transaction ids are always unsigned ints. In places they can currently be signed ints or unsigned longs. The former just causes an annoying clm-2200 warning and may join a transaction when it should wait. The latter is just for correctness since the disk format uses a 32-bit transaction id. There aren't any runtime problems that result from it not wrapping at the correct location since the value is truncated correctly even on big endian systems. The 0 value might make it to disk, but the mount-time checks will bump it to 10 itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:35 -07:00
Jeff Mahoney	702d21c6f6	reiserfs: add support for mount count incrementing The following patch adds the fields for tracking mount counts and last fsck timestamps to the superblock. It also increments the mount count on every read-write mount. Reiserfsprogs 3.6.21 added support for these fields. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-30 12:16:35 -07:00
Linus Torvalds	2d25ee36c8	Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6 * 'bkl-removal' of git://git.lwn.net/linux-2.6: Fix a lockdep warning in fasync_helper() Add a missing unlock_kernel() in raw_open()	2009-03-30 11:31:47 -07:00
Linus Torvalds	b80e0d2716	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix fuse_file_lseek returning with lock held	2009-03-30 10:08:10 -07:00
Linus Torvalds	ffd1428514	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: needs crc32_le jfs: Fix error handling in metapage_writepage() jfs: return f_fsid for statfs(2) jfs: remove xtLookupList() jfs: clean up a dangling comment	2009-03-30 10:02:36 -07:00
Dan Carpenter	5291658d87	fuse: fix fuse_file_lseek returning with lock held This bug was found with smatch (http://repo.or.cz/w/smatch.git/). If we return directly the inode->i_mutex lock doesn't get released. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-03-30 17:26:24 +02:00
Jonathan Corbet	4a6a449969	Fix a lockdep warning in fasync_helper() Lockdep gripes if file->f_lock is taken in a no-IRQ situation, since that is not always the case. We don't really want to disable IRQs for every acquisition of f_lock; instead, just move it outside of fasync_lock. Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-30 08:00:24 -06:00
Rusty Russell	af76aba00f	cpumask: fix seq_bitmap_*() functions. 1) seq_bitmap_list() should take a const. 2) All the seq_bitmap should use cpumask_bits(). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-30 22:05:11 +10:30
Christoph Hellwig	27174203f5	xfs: cleanup uuid handling The uuid table handling should not be part of a semi-generic uuid library but in the XFS code using it, so move those bits to xfs_mount.c and refactor the whole glob to make it a proper abstraction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-30 10:21:31 +02:00
Christoph Hellwig	1a5902c5d2	xfs: remove m_attroffset With the upcoming v3 inodes the default attroffset needs to be calculated for each specific inode, so we can't cache it in the superblock anymore. Also replace the assert for wrong inode sizes with a proper error check also included in non-debug builds. Note that the ENOSYS return for that might seem odd, but that error is returned by xfs_mount_validate_sb for all theoretically valid but not supported filesystem geometries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>	2009-03-29 19:26:46 +02:00
Malcolm Parsons	9da096fd13	xfs: fix various typos Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-03-29 09:55:42 +02:00
Hisashi Hifumi	bddaafa11a	xfs: pagecache usage optimization Hi. I introduced "is_partially_uptodate" aops for XFS. A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. I did a performance test using the sysbench. #sysbench --num-threads=4 --max-requests=100000 --test=fileio --file-num=1 \ --file-block-size=8K --file-total-size=1G --file-test-mode=rndrw \ --file-fsync-freq=0 --file-rw-ratio=0.5 run -2.6.29-rc6 Test execution summary: total time: 123.8645s total number of events: 100000 total time taken by event execution: 442.4994 per-request statistics: min: 0.0000s avg: 0.0044s max: 0.3387s approx. 95 percentile: 0.0118s -2.6.29-rc6-patched Test execution summary: total time: 108.0757s total number of events: 100000 total time taken by event execution: 417.7505 per-request statistics: min: 0.0000s avg: 0.0042s max: 0.3217s approx. 95 percentile: 0.0118s arch: ia64 pagesize: 16k blocksize: 4k Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:53:38 +02:00
Christoph Hellwig	6447c36209	xfs: remove m_litino With the upcoming v3 inodes the inode data/attr area size needs to be calculated for each specific inode, so we can't cache it in the superblock anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:14 +02:00
Christoph Hellwig	a19d9f887d	xfs: kill ino64 mount option The ino64 mount option adds a fixed offset to 32bit inode numbers to bring them into the 64bit range. There's no need for this kind of debug tool given that it's easy to produce real 64bit inode numbers for testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:08 +02:00
Christoph Hellwig	a0b0b8a5b3	xfs: kill mutex_t typedef People continue to complain about this for weird reasons, but there's really no point in keeping this typedef for a couple of users anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-03-29 09:51:00 +02:00
Hugh Dickins	7c2c7d9930	fix setuid sometimes wouldn't check_unsafe_exec() also notes whether the fs_struct is being shared by more threads than will get killed by the exec, and if so sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid. But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient use of get_fs_struct(), which also raises that sharing count. This might occasionally cause a setuid program not to change euid, in the same way as happened with files->count (check_unsafe_exec also looks at sighand->count, but /proc doesn't raise that one). We'd prefer exec not to unshare fs_struct: so fix this in procfs, replacing get_fs_struct() by get_fs_path(), which does path_get while still holding task_lock, instead of raising fs->count. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org ___ fs/proc/base.c \| 50 +++++++++++++++-------------------------------- 1 file changed, 16 insertions(+), 34 deletions(-) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-28 17:30:00 -07:00
Hugh Dickins	e426b64c41	fix setuid sometimes doesn't Joe Malicki reports that setuid sometimes doesn't: very rarely, a setuid root program does not get root euid; and, by the way, they have a health check running lsof every few minutes. Right, check_unsafe_exec() notes whether the files_struct is being shared by more threads than will get killed by the exec, and if so sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid. But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient use of get_files_struct(), which also raises that sharing count. There's a rather simple fix for this: exec's check on files->count has been redundant ever since 2.6.1 made it unshare_files() (except while compat_do_execve() omitted to do so) - just remove that check. [Note to -stable: this patch will not apply before 2.6.29: earlier releases should just remove the files->count line from unsafe_exec().] Reported-by: Joe Malicki <jmalicki@metacarta.com> Narrowed-down-by: Michael Itz <mitz@metacarta.com> Tested-by: Joe Malicki <jmalicki@metacarta.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-28 17:30:00 -07:00
Hugh Dickins	53e9309e01	compat_do_execve should unshare_files 2.6.26's commit `fd8328be87` "sanitize handling of shared descriptor tables in failing execve()" moved the unshare_files() from flush_old_exec() and several binfmts to the head of do_execve(); but forgot to make the same change to compat_do_execve(), leaving a CLONE_FILES files_struct shared across exec from a 32-bit process on a 64-bit kernel. It's arguable whether the files_struct really ought to be unshared across exec; but 2.6.1 made that so to stop the loading binary's fd leaking into other threads, and a 32-bit process on a 64-bit kernel ought to behave in the same way as 32 on 32 and 64 on 64. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-28 17:30:00 -07:00
Chuck Lever	3c8c45dfab	NFS: Simplify logic to compare socket addresses in client.c Callback requests from IPv4 servers are now always guaranteed to be AF_INET, and never mapped IPv4 AF_INET6 addresses. Both nfs_match_client() and nfs_find_client() can now share the same address comparison logic, so fold them together. We can also dispense with of most of the conditional compilation in here. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 16:51:04 -04:00
Trond Myklebust	d188262d60	Merge commit '9f4c899c0d90e1b51b6864834f3877b47c161a0e' into devel	2009-03-28 16:50:58 -04:00
Chuck Lever	f738f51703	NFS: Start PF_INET6 callback listener only if IPv6 support is available Apparently a lot of people need to disable IPv6 completely on their distributor-built systems, which have CONFIG_IPV6_MODULE enabled at build time. They do this by blacklisting the ipv6.ko module. This causes the creation of the NFSv4 callback service listener to fail if CONFIG_IPV6_MODULE is set, but the module cannot be loaded. Now that the kernel's PF_INET6 RPC listeners are completely separate from PF_INET listeners, we can always start PF_INET. Then the NFS client can try to start a PF_INET6 listener, but it isn't required to be available. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 16:02:43 -04:00
Chuck Lever	eb16e90778	lockd: Start PF_INET6 listener only if IPv6 support is available Apparently a lot of people need to disable IPv6 completely on their distributor-built systems, which have CONFIG_IPV6_MODULE enabled at build time. They do this by blacklisting the ipv6.ko module. This causes the creation of the lockd service listener to fail if CONFIG_IPV6_MODULE is set, but the module cannot be loaded. Now that the kernel's PF_INET6 RPC listeners are completely separate from PF_INET listeners, we can always start PF_INET. Then lockd can try to start PF_INET6, but it isn't required to be available. Note this has the added benefit that NLM callbacks from AF_INET6 servers will never come from AF_INET remotes. We no longer have to worry about matching mapped IPv4 addresses to AF_INET when comparing addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 16:01:16 -04:00
Chuck Lever	26298caaca	NFS: Revert creation of IPv6 listeners for lockd and NFSv4 callbacks We're about to convert over to using separate PF_INET and PF_INET6 listeners, instead of a single PF_INET6 listener that also receives AF_INET requests and maps them to AF_INET6. Clear the way by removing the logic in lockd and the NFSv4 callback server that creates an AF_INET6 service listener. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 15:55:06 -04:00
Chuck Lever	49a9072f29	SUNRPC: Remove @family argument from svc_create() and svc_create_pooled() Since an RPC service listener's protocol family is specified now via svc_create_xprt(), it no longer needs to be passed to svc_create() or svc_create_pooled(). Remove that argument from the synopsis of those functions, and remove the sv_family field from the svc_serv struct. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 15:54:48 -04:00
Chuck Lever	9652ada3fb	SUNRPC: Change svc_create_xprt() to take a @family argument The sv_family field is going away. Pass a protocol family argument to svc_create_xprt() instead of extracting the family from the passed-in svc_serv struct. Again, as this is a listener socket and not an address, we make this new argument an "int" protocol family, instead of an "sa_family_t." Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 15:54:36 -04:00
Chuck Lever	adbbe92956	NFSD: If port value written to /proc/fs/nfsd/portlist is invalid, return EINVAL Make sure port value read from user space by write_ports is valid before passing it to svc_find_xprt(). If it wasn't, the writer would get ENOENT instead of EINVAL. Noticed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-28 15:53:42 -04:00
Theodore Ts'o	06705bff91	ext4: Regularize mount options Add support for using the mount options "barrier" and "nobarrier", and "auto_da_alloc" and "noauto_da_alloc", which is more consistent than "barrier=<0\|1>" or "auto_da_alloc=<0\|1>". Most other ext3/ext4 mount options use the foo/nofoo naming convention. We allow the old forms of these mount options for backwards compatibility. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-28 10:59:57 -04:00
Theodore Ts'o	512a004382	ext3: Use WRITE_SYNC for commits which are caused by fsync() If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz>	2009-03-27 22:14:27 -04:00
Theodore Ts'o	a64c8610bd	block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks When doing synchronous writes because wbc->sync_mode is set to WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we don't unduly block system calls such as fsync(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz>	2009-03-27 22:14:10 -04:00
Theodore Ts'o	e7c9e3e99a	ext4: fix locking typo in mballoc which could cause soft lockup hangs Smatch (http://repo.or.cz/w/smatch.git/) complains about the locking in ext4_mb_add_n_trim() from fs/ext4/mballoc.c 4438 list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order], 4439 pa_inode_list) { 4440 spin_lock(&tmp_pa->pa_lock); 4441 if (tmp_pa->pa_deleted) { 4442 spin_unlock(&pa->pa_lock); 4443 continue; 4444 } Brown paper bag time... Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-03-27 19:43:21 -04:00
Dan Carpenter	a7b19448dd	ext4: fix typo which causes a memory leak on error path This was found by smatch (http://repo.or.cz/w/smatch.git/) Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-03-27 19:42:54 -04:00
Linus Torvalds	3ae5080f4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits) fs: avoid I_NEW inodes Merge code for single and multiple-instance mounts Remove get_init_pts_sb() Move common mknod_ptmx() calls into caller Parse mount options just once and copy them to super block Unroll essentials of do_remount_sb() into devpts vfs: simple_set_mnt() should return void fs: move bdev code out of buffer.c constify dentry_operations: rest constify dentry_operations: configfs constify dentry_operations: sysfs constify dentry_operations: JFS constify dentry_operations: OCFS2 constify dentry_operations: GFS2 constify dentry_operations: FAT constify dentry_operations: FUSE constify dentry_operations: procfs constify dentry_operations: ecryptfs constify dentry_operations: CIFS constify dentry_operations: AFS ...	2009-03-27 16:23:12 -07:00
Felix Blyakher	8b11217173	xfs: increase the maximum number of supported ACL entries With big installation current 25 maximum number of supported ACL entries is not enough any more. Increase the limit to 100. Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-27 17:28:43 -05:00
Linus Torvalds	2c9e15a011	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits) ext2: Zero our b_size in ext2_quota_read() trivial: fix typos/grammar errors in fs/Kconfig quota: Coding style fixes quota: Remove superfluous inlines quota: Remove uppercase aliases for quota functions. nfsd: Use lowercase names of quota functions jfs: Use lowercase names of quota functions udf: Use lowercase names of quota functions ufs: Use lowercase names of quota functions reiserfs: Use lowercase names of quota functions ext4: Use lowercase names of quota functions ext3: Use lowercase names of quota functions ext2: Use lowercase names of quota functions ramfs: Remove quota call vfs: Use lowercase names of quota functions quota: Remove dqbuf_t and other cleanups quota: Remove NODQUOT macro quota: Make global quota locks cacheline aligned quota: Move quota files into separate directory ext4: quota reservation for delayed allocation ...	2009-03-27 14:48:34 -07:00
Linus Torvalds	805de022b1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: fix length calculation in compat code dlm: ignore cancel on granted lock dlm: clear defunct cancel state dlm: replace idr with hash table for connections dlm: comment typo fixes dlm: use ipv6_addr_copy dlm: Change rwlock which is only used in write mode to a spinlock	2009-03-27 14:48:07 -07:00
Jan Kara	86db97c87f	jbd2: Update locking coments Update information about locking in JBD2 revoke code. Inconsistency in comments found by Lin Tan <tammy000@gmail.com>. CC: Lin Tan <tammy000@gmail.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 17:20:40 -04:00
Aneesh Kumar K.V	cc0fb9ad7d	ext4: Rename pa_linear to pa_type Impact: code cleanup This patch rename pa_linear to pa_type and add MB_INODE_PA and MB_GROUP_PA to indicate inode and group prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 17:16:58 -04:00
Thiemo Nagel	fe2c8191fa	ext4: add checks of block references for non-extent inodes Check block references in the inode and indorect blocks for non-extent inodes to make sure they are valid, and flag an error if they are invalid. Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-31 08:36:10 -04:00
Nick Piggin	aabb8fdb41	fs: avoid I_NEW inodes To be on the safe side, it should be less fragile to exclude I_NEW inodes from inode list scans by default (unless there is an important reason to have them). Normally they will get excluded (eg. by zero refcount or writecount etc), however it is a bit fragile for list walkers to know exactly what parts of the inode state is set up and valid to test when in I_NEW. So along these lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc checks with them too -- this shouldn't be a problem should it?) Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:05 -04:00
Sukadev Bhattiprolu	1bd7903560	Merge code for single and multiple-instance mounts new_pts_mount() (including the get_sb_nodev()), shares a lot of code with init_pts_mount(). The only difference between them is the 'test-super' function passed into sget(). Move all common code into devpts_get_sb() and remove the new_pts_mount() and init_pts_mount() functions, Changelog[v3]: [Serge Hallyn]: Remove unnecessary printk()s Changelog[v2]: (Christoph Hellwig): Merge code in 'do_pts_mount()' into devpts_get_sb() Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Tested-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	289f00e225	Remove get_init_pts_sb() With mknod_ptmx() moved to devpts_get_sb(), init_pts_mount() becomes a wrapper around get_init_pts_sb(). Remove get_init_pts_sb() and fold code into init_pts_mount(). Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	945cf2c79f	Move common mknod_ptmx() calls into caller We create 'ptmx' node in both single-instance and multiple-instance mounts. So devpts_get_sb() can call mknod_ptmx() once rather than have both modes calling mknod_ptmx() separately. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	482984f06d	Parse mount options just once and copy them to super block Since all the mount option parsing is done in devpts, we could do it just once and pass it around in devpts functions and eventually store it in the super block. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	fdbf534866	Unroll essentials of do_remount_sb() into devpts On remount, devpts fs only needs to parse the mount options. Users cannot directly create/dirty files in /dev/pts so the MS_RDONLY flag and shrinking the dcache does not really apply to devpts. So effectively on remount, devpts only parses the mount options and updates these options in its super block. As such, we could replace do_remount_sb() call with a direct parse_mount_options(). Doing so enables subsequent patches to avoid parsing the mount options twice and simplify the code. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Sukadev Bhattiprolu	a3ec947c85	vfs: simple_set_mnt() should return void simple_set_mnt() is defined as returning 'int' but always returns 0. Callers assume simple_set_mnt() never fails and don't properly cleanup if it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev() should: up_write(sb->s_unmount); deactivate_super(sb); if simple_set_mnt() fails. Since simple_set_mnt() never fails, would be cleaner if it did not return anything. [akpm@linux-foundation.org: fix build] Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Nick Piggin	585d3bc06f	fs: move bdev code out of buffer.c Move some block device related code out from buffer.c and put it in block_dev.c. I'm trying to move non-buffer_head code out of buffer.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	3ba13d179e	constify dentry_operations: rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	296c2d8663	constify dentry_operations: configfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	ee1ec32903	constify dentry_operations: sysfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	ad28b4ef19	constify dentry_operations: JFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	d8fba0ffe5	constify dentry_operations: OCFS2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	92cecbbfa3	constify dentry_operations: GFS2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	ce6cdc474a	constify dentry_operations: FAT Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	4269590a72	constify dentry_operations: FUSE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	d72f71eb0e	constify dentry_operations: procfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	5a3fd05a9b	constify dentry_operations: ecryptfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	4fd03e84d8	constify dentry_operations: CIFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	79be57cc7f	constify dentry_operations: AFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00
Al Viro	08f11513fa	constify dentry_operations: autofs, autofs4 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00

1 2 3 4 5 ...

13349 Commits