Convert fs/9p/mux.c from semaphore to mutex.
NOTE: fixed locking bugs in the process - the code was using semaphores
the other way around.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Eric Van Hensbergen <ericvh@ericvh.myip.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When quota is being turned off we assumed that all the references to dquots
were already dropped. That need not be true as inodes being deleted are
not on superblock's inodes list and hence we need not reach it when
removing quota references from inodes. So invalidate_dquots() has to wait
for all the users of dquots (as quota is already marked as turned off, no
new references can be acquired and so this is bound to happen rather
early). When we do this, we can also remove the iprune_sem locking as it
was protecting us against exactly the same problem when freeing inodes
icache memory.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
platforms, lowering kmalloc() allocated space by 50%.
2) Reduce the size of (files_struct), using a special 32 bits (or
64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
close_on_exec_init and open_fds_init fields. This save some ram (248
bytes per task) as most tasks dont open more than 32 files. D-Cache
footprint for such tasks is also reduced to the minimum.
3) Reduce size of allocated fdset. Currently two full pages are
allocated, that is 32768 bits on x86 for example, and way too much. The
minimum is now L1_CACHE_BYTES.
UP and SMP should benefit from this patch, because most tasks will touch
only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
(next_fd, close_on_exec_init, open_fds_init, fd_array[0 .. 2] being in the
same cache line)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Linus points out that ext3_readdir's readahead only cuts in when
ext3_readdir() is operating at the very start of the directory. So for large
directories we end up performing no readahead at all and we suck.
So take it all out and use the core VM's page_cache_readahead(). This means
that ext3 directory reads will use all of readahead's dynamic sizing goop.
Note that we're using the directory's filp->f_ra to hold the readahead state,
but readahead is actually being performed against the underlying blockdev's
address_space. Fortunately the readahead code is all set up to handle this.
Tested with printk. It works. I was struggling to find a real workload which
actually cared.
(The patch also exports page_cache_readahead() to GPL modules)
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a duplicate block device line printed after the "Block device" header
in /proc/devices.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Centralize the page migration functions in anticipation of additional
tinkering. Creates a new file mm/migrate.c
1. Extract buffer_migrate_page() from fs/buffer.c
2. Extract central migration code from vmscan.c
3. Extract some components from mempolicy.c
4. Export pageout() and remove_from_swap() from vmscan.c
5. Make it possible to configure NUMA systems without page migration
and non-NUMA systems with page migration.
I had to so some #ifdeffing in mempolicy.c that may need a cleanup.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implementation of hugetlbfs_counter() is functionally equivalent to
atomic_inc_return(). Use the simpler atomic form.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These days, hugepages are demand-allocated at first fault time. There's a
somewhat dubious (and racy) heuristic when making a new mmap() to check if
there are enough available hugepages to fully satisfy that mapping.
A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the pages in between allocations. In this case the size of
each block is compared against the total number of available hugepages.
It's thus easy for the process to become overcommitted, because each block
mapping will succeed, although the total number of hugepages required by
all blocks exceeds the number available. In particular, this defeats such
a program which will detect a mapping failure and adjust its hugepage usage
downward accordingly.
The patch below addresses this problem, by strictly reserving a number of
physical hugepages for hugepage inodes which have been mapped, but not
instatiated. MAP_SHARED mappings are thus "safe" - they will fail on
mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
trigger an OOM. (Actually SHARED mappings can technically still OOM, but
only if the sysadmin explicitly reduces the hugepage pool between mapping
and instantiation)
This patch appears to address the problem at hand - it allows DB2 to start
correctly, for instance, which previously suffered the failure described
above.
This patch causes no regressions on the libhugetblfs testsuite, and makes a
test (designed to catch this problem) pass which previously failed (ppc64,
POWER5).
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that compound page handling is properly fixed in the VM, move nommu
over to using compound pages rather than rolling their own refcounting.
nommu vm page refcounting is broken anyway, but there is no need to have
divergent code in the core VM now, nor when it gets fixed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: David Howells <dhowells@redhat.com>
(Needs testing, please).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_NO_REAP is documented as an option that will cause this slab not to be
reaped under memory pressure. However, that is not what happens. The only
thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused
slab elements that were allocated in batch in cache_reap(). Cache_reap()
is run every few seconds independently of memory pressure.
Could we remove the whole thing? Its only used by three slabs anyways and
I cannot find a reason for having this option.
There is an additional problem with SLAB_NO_REAP. If set then the recovery
of objects from alien caches is switched off. Objects not freed on the
same node where they were initially allocated will only be reused if a
certain amount of objects accumulates from one alien node (not very likely)
or if the cache is explicitly shrunk. (Strangely __cache_shrink does not
check for SLAB_NO_REAP)
Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a file is not found in v9fs_vfs_lookup, the function creates negative
dentry, but doesn't assign any dentry ops. This leaves the negative entry
in the cache (there is no d_delete to mark it for removal). If the file is
created outside of the mounted v9fs filesystem, the file shows up in the
directory with weird permissions.
This patch assigns the default v9fs dentry ops to the negative dentry.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
rewrite clustering. This prevents writing excessive amounts of clean data
when doing random rewrites of a cached file.
SGI-PV: 951193
SGI-Modid: xfs-linux-melb:xfs-kern:25531a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
of xfs_itruncate_start().
SGI-PV: 947420
SGI-Modid: xfs-linux-melb:xfs-kern:25527a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
conversion and concurrent truncate operations. Use vn_iowait to wait for
the completion of any pending DIOs. Since the truncate requires exclusive
IOLOCK, so this blocks any further DIO operations since DIO write also
needs exclusive IOBLOCK. This serves as a barrier and prevent any
potential starvation.
SGI-PV: 947420
SGI-Modid: xfs-linux-melb:xfs-kern:208088a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
the trace.
SGI-PV: 948300
SGI-Modid: xfs-linux-melb:xfs-kern:208069a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Add default selection of CRYPTO_CAST5 when selecting RPCSEC_GSS_SPKM3.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nlmsvc_traverse_shares return value is always zero, hence useless.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note that we never return non-zero.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Thanks to Frank Filz for pointing out that we list system.nfs4_acl extended
attribute even on filesystems where we don't actually support nfs4_acl.
This is inconsistent with the e.g. ext3 POSIX ACL behaviour, and seems to
annoy cp.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently this will not happen if we exit before rpc_new_task() was called.
Also fix up rpc_run_task() to do the same (for consistency).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As pointed out by Oliver Neukum.
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
These functions should only be used by the kobject core, and if any
driver tries to use them, bad things happen. Unexport them to try to
prevent this from happening.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I wanted to export a binary blob via debugfs, and although it was pretty easy
it seems like it'd be easier if there was a helper for it. It's a pity we need
the wrapper struct but I can't see a cleaner way to do it.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The following patch checks for existing sysfs_dirent before
preparing new one while creating sysfs directories and files.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
this converts fs/sysfs to kzalloc() usage.
compile tested with make allyesconfig
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Convert the kobj_map code to use a mutex instead of a semaphore. It
converts the single two users as well, genhd.c and char_dev.c.
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When calling sysfs_remove_dir() don't allow any further sysfs functions
to work for this kobject anymore. This fixes a nasty USB cdc-acm oops
on disconnect.
Many thanks to Bob Copeland and Paul Fulghum for taking the time to
track this down.
Cc: Bob Copeland <email@bobcopeland.com>
Cc: Paul Fulghum <paulkf@microgate.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch augments the collection of inode info during syscall
processing. It represents part of the functionality that was provided
by the auditfs patch included in RHEL4.
Specifically, it:
- Collects information for target inodes created or removed during
syscalls. Previous code only collects information for the target
inode's parent.
- Adds the audit_inode() hook to syscalls that operate on a file
descriptor (e.g. fchown), enabling audit to do inode filtering for
these calls.
- Modifies filtering code to check audit context for either an inode #
or a parent inode # matching a given rule.
- Modifies logging to provide inode # for both parent and child.
- Protect debug info from NULL audit_names.name.
[AV: folded a later typo fix from the same author]
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The audit hooks (to be added shortly) will want to see dentry->d_inode
too, not just the name.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Kudos to Neil Brown for spotting the problem:
"in nfs_sync_inode, there is effectively the sequence:
nfs_wait_on_requests
nfs_flush_inode
nfs_commit_inode
This seems a bit racy to me as if the only requests are on the
->commit list, and nfs_commit_inode is called separately after
nfs_wait_on_requests completes, and before nfs_commit_inode start
(say: by nfs_write_inode) then none of these function will return
>0, yet there will be some pending request that aren't waited for."
The solution is to search for requests to wait upon, search for dirty
requests, and search for uncommitted requests while holding the
nfsi->req_lock
The patch also cleans up nfs_sync_inode(), getting rid of the redundant
FLUSH_WAIT flag. It turns out that we were always setting it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't need to set PG_private for readahead pages, since they never get
unlocked while I/O is in progress. However there is a small race in
nfs_readpage_release() whereby the page may be unlocked, and have
PG_private set.
Fix is to have PG_private set only for the case of writes...
Also fix a bug in nfs_clear_page_writeback(): Don't attempt to clear the
radix_tree tag if we've already deleted the radix tree entry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the callback daemon is signalled, but is unable to exit because it still
has users, then we need to flush signals. If not, then svc_recv() can
never sleep, and so we hang.
If we flush signals, then we also have to be prepared to resend them when
we want the thread to exit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently it returns NULL, which usually gets interpreted as ENOMEM. In
fact it can mean a host of issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The mount statistics patches introduced a call to nfs_free_iostats that is
not only redundant, but actually causes an oops.
Also fix a memory leak due to the lack of a call to nfs_free_iostats on
unmount.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>