linux/fs
Christopher Yeoh fcf634098c Cross Memory Attach
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
  written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently
  ptrace'ing the target and are still able to ptrace the target to read
  from the target. This check could possibly be moved to the open call,
  but its not clear exactly what race this restriction is stopping
  (reason  appears to have been lost)
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
  domain socket is a bit ugly from a userspace point of view,
  especially when you may have hundreds if not (eventually) thousands
  of processes  that all need to do this with each other
  - Doesn't allow for some future use of the interface we would like to
  consider adding in the future (see below)
  - Interestingly reading from /proc/pid/mem currently actually
  involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems.  Since you need the reader and writer working co-operatively if
the pipe is not drained then you block.  Which requires some wrapping to
do non blocking on the send side or polling on the receive.  In all to all
communication it requires ordering otherwise you can deadlock.  And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could.  For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy.  We don't need to keep a copy of the data
from the source.  I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI).  This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication.  And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:44 -07:00
..
9p net/9p: Convert net/9p protocol dumps to tracepoints 2011-10-24 11:13:12 -05:00
adfs Fix common misspellings 2011-03-31 11:26:23 -03:00
affs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
afs AFS: Fix silly characters in a comment 2011-07-20 20:48:03 -04:00
autofs4 autofs4: fix debug printk warning uncovered by cleanup 2011-08-08 12:02:43 -07:00
befs befs: Validate length of long symbolic links. 2011-08-17 13:31:24 -07:00
bfs bfs: remove unnecessary dentry_unhash on dir rename 2011-05-28 01:02:50 -04:00
btrfs Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-10-28 10:49:34 -07:00
cachefiles kill useless checks for sb->s_op == NULL 2011-07-20 01:44:21 -04:00
ceph libceph: fix double-free of page vector 2011-10-25 16:10:17 -07:00
cifs Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-10-28 10:49:34 -07:00
coda fs: Convert vmalloc/memset to vzalloc 2011-09-15 13:56:28 +02:00
configfs doc: fix broken references 2011-09-27 18:08:04 +02:00
cramfs cramfs: get_cramfs_inode() returns ERR_PTR() on failure 2011-07-17 23:22:02 -04:00
debugfs debugfs: Fix a comment mistake 2011-08-22 17:41:48 -07:00
devpts fs/devpts/inode.c: correctly check d_alloc_name() return code in devpts_pty_new() 2011-03-22 17:44:17 -07:00
dlm Merge branch 'for-3.1' of git://linux-nfs.org/~bfields/linux 2011-07-25 22:49:19 -07:00
ecryptfs Ecryptfs: Add mount option to check uid of device being mounted = expect uid 2011-08-09 23:29:01 -05:00
efs make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) 2011-07-20 01:44:26 -04:00
exofs ore: Enable RAID5 mounts 2011-10-24 17:22:29 -07:00
exportfs vfs: Add open by file handle support 2011-03-15 02:21:44 -04:00
ext2 Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next 2011-08-09 10:31:03 +10:00
ext3 Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security 2011-10-25 09:45:31 +02:00
ext4 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-10-28 10:49:34 -07:00
fat Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2011-08-18 14:16:13 -07:00
freevxfs treewide: fix a few typos in comments 2011-05-10 10:16:21 +02:00
fscache FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loop 2011-07-21 10:59:16 -07:00
fuse fuse: fix memory leak 2011-09-12 11:47:10 -07:00
gfs2 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-10-28 10:49:34 -07:00
hfs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
hfsplus hfsplus: fix filesystem size checks 2011-09-15 09:03:17 -07:00
hostfs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
hpfs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
hppfs hppfs: missing include 2011-07-27 22:21:58 -04:00
hugetlbfs lockdep: Add helper function for dir vs file i_mutex annotation 2011-08-25 10:50:18 -07:00
isofs isofs: Remove global fs lock 2011-07-22 19:42:12 -04:00
jbd jbd: Use WRITE_SYNC in journal checkpoint. 2011-06-28 00:06:41 +02:00
jbd2 jbd2: remove jbd2_dev_to_name() from jbd2 tracepoints 2011-07-10 22:05:08 -04:00
jffs2 Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next 2011-08-09 10:31:03 +10:00
jfs Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security 2011-10-25 09:45:31 +02:00
lockd SUNRPC: Replace svc_addr_u by sockaddr_storage 2011-09-14 08:21:48 -04:00
logfs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
minix minix_getattr(): don't bother with ->d_parent 2011-07-20 20:47:53 -04:00
ncpfs fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
nfs Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue 2011-10-28 10:49:34 -07:00
nfs_common Fix common misspellings 2011-03-31 11:26:23 -03:00
nfsd nfs41: implement DESTROY_CLIENTID operation 2011-10-24 04:24:30 -04:00
nilfs2 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
nls
notify atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
ntfs atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
ocfs2 Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next 2011-08-09 10:31:03 +10:00
omfs omfs: fix (mode & S_IFDIR) abuse 2011-07-26 13:05:28 -04:00
openpromfs fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
partitions Merge branch 'for-linus' into for-3.1/core 2011-07-01 16:17:13 +02:00
proc /proc/self/numa_maps: restore "huge" tag for hugetlb vmas 2011-10-31 17:30:44 -07:00
pstore pstore: Allow the user to explicitly choose a backend 2011-07-22 16:14:39 -07:00
qnx4 block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
quota VFS: Fix the remaining automounter semantics regressions 2011-09-26 19:16:46 -07:00
ramfs ramfs: fix memleak on no-mmu arch 2011-04-14 16:06:56 -07:00
reiserfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-10-25 12:11:02 +02:00
romfs romfs: fix romfs_get_unmapped_area() argument check 2011-06-27 18:00:12 -07:00
squashfs doc: fix broken references 2011-09-27 18:08:04 +02:00
sysfs sysfs: Remove support for tagged directories with untagged members (again) 2011-10-25 15:10:28 +02:00
sysv sysv: remove unnecessary dentry_unhash from rmdir, dir rename 2011-05-28 01:02:50 -04:00
ubifs UBIFS: not build debug messages with CONFIG_UBIFS_FS_DEBUG disabled 2011-08-19 18:58:58 +03:00
udf switch udf_ioctl() to inode_permission() 2011-07-20 01:43:07 -04:00
ufs make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) 2011-07-20 01:44:26 -04:00
xfs Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs 2011-10-28 10:31:42 -07:00
Kconfig tmpfs: expand "help" to explain value of TMPFS_POSIX_ACL 2011-08-03 14:25:25 -10:00
Kconfig.binfmt
Makefile fs/Makefile: Stupid typo breakage of exofs inclusion 2011-10-27 08:36:51 +02:00
aio.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
anon_inodes.c vfs: dont chain pipe/anon/socket on superblock s_inodes list 2011-07-26 12:57:09 -04:00
attr.c Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next 2011-08-09 10:31:03 +10:00
bad_inode.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
binfmt_aout.c
binfmt_elf.c consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling 2011-07-20 01:43:10 -04:00
binfmt_elf_fdpic.c consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling 2011-07-20 01:43:10 -04:00
binfmt_em86.c
binfmt_flat.c CRED: Fix load_flat_shared_library() to initialise bprm correctly 2011-05-03 10:10:51 +10:00
binfmt_misc.c consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling 2011-07-20 01:43:10 -04:00
binfmt_script.c
binfmt_som.c
bio-integrity.c block: Require subsystems to explicitly allocate bio_set integrity mempool 2011-03-17 11:11:05 +01:00
bio.c block: improve the bio_add_page() and bio_add_pc_page() descriptions 2011-05-28 14:44:46 +02:00
block_dev.c Avoid dereferencing a 'request_queue' after last close. 2011-09-10 17:20:21 +10:00
buffer.c cleanup: vfs: small comment fix for block_invalidatepage 2011-10-28 13:55:08 +02:00
char_dev.c Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block 2011-01-13 10:45:01 -08:00
compat.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
compat_binfmt_elf.c
compat_ioctl.c compat_ioctl: add compat handler for PPPIOCGL2TPSTATS 2011-08-07 22:24:41 -07:00
dcache.c vfs: renumber DCACHE_xyz flags, remove some stale ones 2011-08-06 22:52:40 -07:00
dcookies.c oprofile, dcookies: Fix possible circular locking dependency 2011-05-31 16:33:35 +02:00
direct-io.c direct-io: merge direct_io_walker into __blockdev_direct_IO 2011-10-28 14:58:58 +02:00
drop_caches.c vmscan: change shrinker API by passing shrink_control struct 2011-05-25 08:39:26 -07:00
eventfd.c Docbook: add fs/eventfd.c and fix typos in it 2011-02-21 15:07:04 -08:00
eventpoll.c Merge branch 'master' into for-next 2011-09-15 15:08:18 +02:00
exec.c move RLIMIT_NPROC check from set_user() to do_execve_common() 2011-08-11 11:24:42 -07:00
fcntl.c userns: rename is_owner_or_cap to inode_owner_or_capable 2011-03-23 19:47:13 -07:00
fhandle.c fs/fhandle.c: add <linux/personality.h> for ia64 2011-04-14 16:06:56 -07:00
fifo.c Filesystem: fifo: Fixed coding style issue. 2011-03-21 00:16:09 -04:00
file.c vfs: avoid large kmalloc()s for the fdtable 2011-04-28 11:28:20 -07:00
file_table.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
filesystems.c fs: synchronize_rcu when unregister_filesystem success not failure 2011-04-17 10:42:01 -07:00
fs-writeback.c don't busy retry the inode on failed grab_super_passive() 2011-07-31 22:52:08 +08:00
fs_struct.c sanitize vfsmount refcounting changes 2011-01-16 13:47:07 -05:00
generic_acl.c switch posix_acl_equiv_mode() to umode_t * 2011-08-01 02:10:06 -04:00
inode.c vfs: fix spinning prevention in prune_icache_sb 2011-10-28 14:58:55 +02:00
internal.h superblock: move pin_sb_for_writeback() to fs/super.c 2011-07-20 01:44:38 -04:00
ioctl.c vfs: cleanup do_vfs_ioctl() 2011-03-21 00:16:08 -04:00
ioprio.c
libfs.c fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. 2011-07-22 19:42:11 -04:00
locks.c Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linux 2011-10-25 15:42:01 +02:00
mbcache.c vmscan: change shrinker API by passing shrink_control struct 2011-05-25 08:39:26 -07:00
mpage.c mm/fs: add hooks to support cleancache 2011-05-26 10:01:43 -06:00
namei.c leases: fix write-open/read-lease race 2011-10-28 14:59:00 +02:00
namespace.c vfs: add "device" tag to /proc/self/mountstats 2011-10-28 13:55:08 +02:00
no-block.c
open.c leases: fix write-open/read-lease race 2011-10-28 14:59:00 +02:00
pipe.c vfs: dont chain pipe/anon/socket on superblock s_inodes list 2011-07-26 12:57:09 -04:00
pnode.c fs: scale mntget/mntput 2011-01-07 17:50:33 +11:00
pnode.h
posix_acl.c vfs: pass all mask flags check_acl and posix_acl_permission 2011-10-28 14:58:54 +02:00
read_write.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
read_write.h
readdir.c
select.c select: remove unused MAX_SELECT_SECONDS 2011-03-21 00:16:08 -04:00
seq_file.c
signalfd.c
splice.c tmpfs: clone shmem_file_splice_read() 2011-07-25 20:57:11 -07:00
stack.c mm: a few small updates for radix-swap 2011-08-03 14:25:24 -10:00
stat.c vfs: remove LOOKUP_NO_AUTOMOUNT flag 2011-09-27 08:12:33 -07:00
statfs.c clean statfs-like syscalls up 2011-03-14 09:15:28 -04:00
super.c Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block 2011-07-25 10:33:36 -07:00
sync.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers 2011-07-20 20:47:59 -04:00
timerfd.c timerfd: Fix wakeup of processes when timer is cancelled on clock change 2011-06-14 11:46:14 +02:00
utimes.c userns: rename is_owner_or_cap to inode_owner_or_capable 2011-03-23 19:47:13 -07:00
xattr.c evm: evm_inode_post_removexattr 2011-07-18 12:29:43 -04:00
xattr_acl.c