linux

Commit Graph

Author	SHA1	Message	Date
Nelson Elhage	d8805e633e	epoll: fix spurious lockdep warnings epoll can acquire recursively acquire ep->mtx on multiple "struct eventpoll"s at once in the case where one epoll fd is monitoring another epoll fd. This is perfectly OK, since we're careful about the lock ordering, but it causes spurious lockdep warnings. Annotate the recursion using mutex_lock_nested, and add a comment explaining the nesting rules for good measure. Recent versions of systemd are triggering this, and it can also be demonstrated with the following trivial test program: --------------------8<-------------------- int main(void) { int e1, e2; struct epoll_event evt = { .events = EPOLLIN }; e1 = epoll_create1(0); e2 = epoll_create1(0); epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt); return 0; } --------------------8<-------------------- Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:57 -07:00
Andy Shevchenko	0a90e0f101	fat: follow rename pack_hex_byte() to hex_byte_pack() There is no functional change. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:57 -07:00
Joe Perches	b9075fa968	treewide: use __printf not __attribute__((format(printf,...))) Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=.[ch] -w "__attribute__" \| \ grep -vP "^(tools\|scripts\|include/linux/compiler-gcc.h)" \| \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s$\s\(\sformat\s\(\sprintf\s,\s(.+)\s,\s(.+)\s$\s\)\s\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:54 -07:00
Pavel Emelyanov	d70ef97baf	fs/pipe.c: add ->statfs callback for pipefs Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire pipfs up so that the statfs succeeds. This is required by checkpoint-restart in the userspace to make it possible to distinguish pipes from fifos. When we dump information about task's open files we use the /proc/pid/fd directoy's symlinks and the fact that opening any of them gives us exactly the same dentry->inode pair as the original process has. Now if a task we're dumping has opened pipe and fifo we need to detect this and act accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is the most precise way. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:51 -07:00
Tao Ma	72a2ebd8bc	fs/buffer.c: add device information for error output in __find_get_block_slow() On the ext4 mailing list[1], we got some report about errors in __find_get_block_slow(), but the information is very limited. If the device information is given, we can know the name of the sick volume. Futhermore, we can get the corresponding status of that block(group, inode block etc) by analyzing the disk layout. [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2 Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:49 -07:00
Mikulas Patocka	09f363c736	vmscan: fix shrinker callback bug in fs/super.c The callback must not return -1 when nr_to_scan is zero. Fix the bug in fs/super.c and add this requirement to the callback specification. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:49 -07:00
Akinobu Mita	798248206b	lib/string.c: introduce memchr_inv() memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:47 -07:00
Mel Gorman	966dbde2c2	ext4: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. Warn if an attempt is made. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Mel Gorman	94054fa3fc	xfs: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Christoph Lameter	bc3e53f682	mm: distinguish between mlocked and pinned pages Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Robert P. J. Day	f5fc870da2	tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. Add the leading word "tmpfs" to the Kconfig string to make it blindingly obvious that this selection refers to tmpfs. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:45 -07:00
David Rientjes	c9f01245b6	oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:45 -07:00
Christopher Yeoh	fcf634098c	Cross Memory Attach The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:44 -07:00
Andrew Morton	fc360bd9cd	/proc/self/numa_maps: restore "huge" tag for hugetlb vmas The display of the "huge" tag was accidentally removed in `29ea2f698` ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:44 -07:00
Linus Torvalds	97d2eb13a0	Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix double-free of page vector ceph: fix 32-bit ino numbers libceph: force resend of osd requests if we skip an osdmap ceph: use kernel DNS resolver ceph: fix ceph_monc_init memory leak ceph: let the set_layout ioctl set single traits Revert "ceph: don't truncate dirty pages in invalidate work thread" ceph: replace leading spaces with tabs libceph: warn on msg allocation failures libceph: don't complain on msgpool alloc failures libceph: always preallocate mon connection libceph: create messenger with client ceph: document ioctls ceph: implement (optional) max read size ceph: rename rsize -> rasize ceph: make readpages fully async	2011-10-28 16:42:18 -07:00
Linus Torvalds	f362f98e7c	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits) leases: fix write-open/read-lease race nfs: drop unnecessary locking in llseek ext4: replace cut'n'pasted llseek code with generic_file_llseek_size vfs: add generic_file_llseek_size vfs: do (nearly) lockless generic_file_llseek direct-io: merge direct_io_walker into __blockdev_direct_IO direct-io: inline the complete submission path direct-io: separate map_bh from dio direct-io: use a slab cache for struct dio direct-io: rearrange fields in dio/dio_submit to avoid holes direct-io: fix a wrong comment direct-io: separate fields only used in the submission path from struct dio vfs: fix spinning prevention in prune_icache_sb vfs: add a comment to inode_permission() vfs: pass all mask flags check_acl and posix_acl_permission vfs: add hex format for MAY_* flag values vfs: indicate that the permission functions take all the MAY_* flags compat: sync compat_stats with statfs. vfs: add "device" tag to /proc/self/mountstats cleanup: vfs: small comment fix for block_invalidatepage ... Fix up trivial conflict in fs/gfs2/file.c (llseek changes)	2011-10-28 10:49:34 -07:00
Linus Torvalds	f793f29611	Merge http://sucs.org/~rohan/git/gfs2-3.0-nmw * http://sucs.org/~rohan/git/gfs2-3.0-nmw: (24 commits) GFS2: Move readahead of metadata during deallocation into its own function GFS2: Remove two unused variables GFS2: Misc fixes GFS2: rewrite fallocate code to write blocks directly GFS2: speed up delete/unlink performance for large files GFS2: Fix off-by-one in gfs2_blk2rgrpd GFS2: Clean up ->page_mkwrite GFS2: Correctly set goal block after allocation GFS2: Fix AIL flush issue during fsync GFS2: Use cached rgrp in gfs2_rlist_add() GFS2: Call do_strip() directly from recursive_scan() GFS2: Remove obsolete assert GFS2: Cache the most recently used resource group in the inode GFS2: Make resource groups "append only" during life of fs GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added GFS2: Clean up gfs2_create GFS2: Use ->dirty_inode() GFS2: Fix bug trap and journaled data fsync GFS2: Fix inode allocation error path ...	2011-10-28 10:44:50 -07:00
Linus Torvalds	dabcbb1bae	Merge branch '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6 * '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits) Fix build break when freezer not configured Add definition for share encryption CIFS: Make cifs_push_locks send as many locks at once as possible CIFS: Send as many mandatory unlock ranges at once as possible CIFS: Implement caching mechanism for posix brlocks CIFS: Implement caching mechanism for mandatory brlocks CIFS: Fix DFS handling in cifs_get_file_info CIFS: Fix error handling in cifs_readv_complete [CIFS] Fixup trivial checkpatch warning [CIFS] Show nostrictsync and noperm mount options in /proc/mounts cifs, freezer: add wait_event_freezekillable and have cifs use it cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters cifs: tune bdi.ra_pages in accordance with the rsize cifs: allow for larger rsize= options and change defaults cifs: convert cifs_readpages to use async reads cifs: add cifs_async_readv cifs: fix protocol definition for READ_RSP cifs: add a callback function to receive the rest of the frame cifs: break out 3rd receive phase into separate function cifs: find mid earlier in receive codepath ...	2011-10-28 10:43:32 -07:00
Linus Torvalds	5619a69396	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits) xfs: add AIL pushing tracepoints xfs: put in missed fix for merge problem xfs: do not flush data workqueues in xfs_flush_buftarg xfs: remove XFS_bflush xfs: remove xfs_buf_target_name xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks xfs: clean up xfs_ioerror_alert xfs: clean up buffer allocation xfs: remove buffers from the delwri list in xfs_buf_stale xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF xfs: remove XFS_BUF_FINISH_IOWAIT xfs: remove xfs_get_buftarg_list xfs: fix buffer flushing during unmount xfs: optimize fsync on directories xfs: reduce the number of log forces from tail pushing xfs: Don't allocate new buffers on every call to _xfs_buf_find xfs: simplify xfs_trans_ijoin* again xfs: unlock the inode before log force in xfs_change_file_space xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata ...	2011-10-28 10:31:42 -07:00
J. Bruce Fields	f3c7691e8d	leases: fix write-open/read-lease race In setlease, we use i_writecount to decide whether we can give out a read lease. In open, we break leases before incrementing i_writecount. There is therefore a window between the break lease and the i_writecount increment when setlease could add a new read lease. This would leave us with a simultaneous write open and read lease, which shouldn't happen. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	79835a710d	nfs: drop unnecessary locking in llseek This makes NFS follow the standard generic_file_llseek locking scheme. Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	4cce0e28b9	ext4: replace cut'n'pasted llseek code with generic_file_llseek_size This gives ext4 the benefits of unlocked llseek. Cc: tytso@mit.edu Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	5760495a87	vfs: add generic_file_llseek_size Add a generic_file_llseek variant to the VFS that allows passing in the maximum file size of the file system, instead of always using maxbytes from the superblock. This can be used to eliminate some cut'n'paste seek code in ext4. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	ef3d0fd27e	vfs: do (nearly) lockless generic_file_llseek The i_mutex lock use of generic _file_llseek hurts. Independent processes accessing the same file synchronize over a single lock, even though they have no need for synchronization at all. Under high utilization this can cause llseek to scale very poorly on larger systems. This patch does some rethinking of the llseek locking model: First the 64bit f_pos is not necessarily atomic without locks on 32bit systems. This can already cause races with read() today. This was discussed on linux-kernel in the past and deemed acceptable. The patch does not change that. Let's look at the different seek variants: SEEK_SET: Doesn't really need any locking. If there's a race one writer wins, the other loses. For 32bit the non atomic update races against read() stay the same. Without a lock they can also happen against write() now. The read() race was deemed acceptable in past discussions, and I think if it's ok for read it's ok for write too. => Don't need a lock. SEEK_END: This behaves like SEEK_SET plus it reads the maximum size too. Reading the maximum size would have the 32bit atomic problem. But luckily we already have a way to read the maximum size without locking (i_size_read), so we can just use that instead. Without i_mutex there is no synchronization with write() anymore, however since the write() update is atomic on 64bit it just behaves like another racy SEEK_SET. On non atomic 32bit it's the same as SEEK_SET. => Don't need a lock, but need to use i_size_read() SEEK_CUR: This has a read-modify-write race window on the same file. One could argue that any application doing unsynchronized seeks on the same file is already broken. But for the sake of not adding a regression here I'm using the file->f_lock to synchronize this. Using this lock is much better than the inode mutex because it doesn't synchronize between processes. => So still need a lock, but can use a f_lock. This patch implements this new scheme in generic_file_llseek. I dropped generic_file_llseek_unlocked and changed all callers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	847cc6371b	direct-io: merge direct_io_walker into __blockdev_direct_IO This doesn't change anything for the compiler, but hch thought it would make the code clearer. I moved the reference counting into its own little inline. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	ba253fbf6d	direct-io: inline the complete submission path Add inlines to all the submission path functions. While this increases code size it also gives gcc a lot of optimization opportunities in this critical hotpath. In particular -- together with some other changes -- this allows gcc to get rid of the unnecessary clearing of sdio at the beginning and optimize the messy parameter passing. Any non inlining of a function which takes a sdio parameter would break this optimization because they cannot be done if the address of a structure is taken. Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off. This gives about 2.2% improvement on a large database benchmark with a high IOPS rate. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	18772641db	direct-io: separate map_bh from dio Only a single b_private field in the map_bh buffer head is needed after the submission path. Move map_bh separately to avoid storing this information in the long term slab. This avoids the weird 104 byte hole in struct dio_submit which also needed to be memseted early. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:57 +02:00
Andi Kleen	6e8267f532	direct-io: use a slab cache for struct dio A direct slab call is slightly faster than kmalloc and can be better cached per CPU. It also avoids rounding to the next kmalloc slab. In addition this enforces cache line alignment for struct dio to avoid any false sharing. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:57 +02:00
Andi Kleen	0dc2bc49be	direct-io: rearrange fields in dio/dio_submit to avoid holes Fix most problems reported by pahole. There is still a weird 104 byte hole after map_bh. I'm not sure what causes this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Andi Kleen	cde1ecb324	direct-io: fix a wrong comment There's nothing on the stack, even before my changes. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Andi Kleen	eb28be2b4c	direct-io: separate fields only used in the submission path from struct dio This large, but largely mechanic, patch moves all fields in struct dio that are only used in the submission path into a separate on stack data structure. This has the advantage that the memory is very likely cache hot, which is not guaranteed for memory fresh out of kmalloc. This also gives gcc more optimization potential because it can easier determine that there are no external aliases for these variables. The sdio initialization is a initialization now instead of memset. This allows gcc to break sdio into individual fields and optimize away unnecessary zeroing (after all the functions are inlined) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Christoph Hellwig	62a3ddef61	vfs: fix spinning prevention in prune_icache_sb We need to move the inode to the end of the list to actually make the spinning prevention explained in the comment above it work. With a plain list_move it will simply stay in place as we're always reclaiming from the head of the list. Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:55 +02:00
Andreas Gruenbacher	948409c74d	vfs: add a comment to inode_permission() Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:55 +02:00
Andreas Gruenbacher	d124b60a83	vfs: pass all mask flags check_acl and posix_acl_permission Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Andreas Gruenbacher	8fd90c8d1d	vfs: indicate that the permission functions take all the MAY_* flags Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Eric W. Biederman	1448c721e4	compat: sync compat_stats with statfs. This was found by inspection while tracking a similar bug in compat_statfs64, that has been fixed in mainline since decemeber. - This fixes a bug where not all of the f_spare fields were cleared on mips and s390. - Add the f_flags field to struct compat_statfs - Copy f_flags to userspace in case someone cares. - Use __clear_user to copy the f_spare field to userspace to ensure that all of the elements of f_spare are cleared. On some architectures f_spare is has 5 ints and on some architectures f_spare only has 4 ints. Which makes the previous technique of clearing each int individually broken. I don't expect anyone actually uses the old statfs system call anymore but if they do let them benefit from having the compat and the native version working the same. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:53 +02:00
Bryan Schumaker	a877ee03ac	vfs: add "device" tag to /proc/self/mountstats nfsiostat was failing to find mounted filesystems on kernels after 2.6.38 because of changes to show_vfsstat() by commit `c7f404b40a`. This patch adds back the "device" tag before the nfs server entry so scripts can parse the mountstats file correctly. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> CC: stable@kernel.org [>=2.6.39] Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 13:55:08 +02:00
Wang Sheng-Hui	814e1d25a5	cleanup: vfs: small comment fix for block_invalidatepage The patch is aganist 3.1-rc3. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 13:55:08 +02:00
Steve French	96814ecb40	Add definition for share encryption Samba supports a setfs info level to negotiate encrypted shares. This patch adds the defines so we recognize this info level. Later patches will add the enablement for it. Acked-by: Jeremy Allison <jra@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-27 16:53:31 -05:00
Boaz Harrosh	60325f0c6e	fs/Makefile: Stupid typo breakage of exofs inclusion In my last patch I did a stupid mistake and broke the exofs compilation completely. Fix it ASAP. Instead of obj-y I did obj-$(y) Really Really sorry. Me totally blushing :-{\| Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-27 08:36:51 +02:00
Linus Torvalds	c28cfd60e4	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: (21 commits) ore: Enable RAID5 mounts exofs: Support for RAID5 read-4-write interface. ore: RAID5 Write ore: RAID5 read fs/Makefile: Always inspect exofs/ ore: Make ore_calc_stripe_info EXPORT_SYMBOL ore/exofs: Change ore_check_io API ore/exofs: Define new ore_verify_layout ore: Support for partial component table ore: Support for short read/writes exofs: Support for short read/writes ore: Remove check for ios->kern_buff in _prepare_for_striping to later ore: cleanup: Embed an ore_striping_info inside ore_io_state ore: Only IO one group at a time (API change) ore/exofs: Change the type of the devices array (API change) ore: Make ore_striping_info and ore_calc_stripe_info public exofs: Remove unused data_map member from exofs_sb_info exofs: Rename struct ore_components comps => oc exofs/super.c: local functions should be static exofs/ore.c: local functions should be static ...	2011-10-26 21:33:50 +02:00
Linus Torvalds	39adff5f69	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) time, s390: Get rid of compile warning dw_apb_timer: constify clocksource name time: Cleanup old CONFIG_GENERIC_TIME references that snuck in time: Change jiffies_to_clock_t() argument type to unsigned long alarmtimers: Fix error handling clocksource: Make watchdog reset lockless posix-cpu-timers: Cure SMP accounting oddities s390: Use direct ktime path for s390 clockevent device clockevents: Add direct ktime programming function clockevents: Make minimum delay adjustments configurable nohz: Remove "Switched to NOHz mode" debugging messages proc: Consider NO_HZ when printing idle and iowait times nohz: Make idle/iowait counter update conditional nohz: Fix update_ts_time_stat idle accounting cputime: Clean up cputime_to_usecs and usecs_to_cputime macros alarmtimers: Rework RTC device selection using class interface alarmtimers: Add try_to_cancel functionality alarmtimers: Add more refined alarm state tracking alarmtimers: Remove period from alarm structure alarmtimers: Remove interval cap limit hack ...	2011-10-26 17:15:03 +02:00
Linus Torvalds	e33bae14fd	Merge branch 'for-linus' of git://github.com/ericvh/linux * 'for-linus' of git://github.com/ericvh/linux: 9p: fix 9p.txt to advertise msize instead of maxdata net/9p: Convert net/9p protocol dumps to tracepoints fs/9p: change an int to unsigned int fs/9p: Cleanup option parsing in 9p 9p: move dereference after NULL check fs/9p: inode file operation is properly initialized init_special_inode fs/9p: Update zero-copy implementation in 9p	2011-10-26 14:20:53 +02:00
Sage Weil	3395734067	libceph: fix double-free of page vector ceph_release_page_vector() kfrees the vector; we shouldn't do it here too. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:17 -07:00
Amon Ott	3310f7541f	ceph: fix 32-bit ino numbers Fix 32-bit ino generation to not always be 1. Signed-off-by: Amon Ott <a.ott@m-privacy.de>	2011-10-25 16:10:17 -07:00
Greg Farnum	a35eca958a	ceph: let the set_layout ioctl set single traits Previously we were validating the passed-in stripe unit, object size, and stripe count against each other (and not testing most other stuff). Instead, make sure that the composed previous layout and new values are valid, and only send the new values to the MDS. This lets users change the pool without setting the whole layout, for instance. Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>	2011-10-25 16:10:16 -07:00
Sage Weil	83eaea22bd	Revert "ceph: don't truncate dirty pages in invalidate work thread" This reverts commit `c9af9fb68e`. We need to block and truncate all pages in order to reliably invalidate them. Otherwise, we could: - have some uptodate pages in the cache - queue an invalidate - write(2) locks some pages - invalidate_work skips them - write(2) only overwrites part of the page - page now dirty and uptodate -> partial leakage of invalidated data It's not entirely clear why we started skipping locked pages in the first place. I just ran this through fsx and didn't see any problems. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Noah Watkins	80db8bea6a	ceph: replace leading spaces with tabs Trivial formatting fix. Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Sage Weil	b61c27636f	libceph: don't complain on msgpool alloc failures The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	6ab00d465a	libceph: create messenger with client This simplifies the init/shutdown paths, and makes client->msgr available during the rest of the setup process. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00

1 2 3 4 5 ...

24450 Commits