gcc warns of an array subscript out of bounds in xfs_mod_sb().
The code is written in such a way that if the array subscript is
out of bounds, then it will assert fail. Rearrange the code to
avoid the bounds check warning.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Initialise the xfs_bmalloca_t structure to zero to avoid uninitialised
variable warnings. This is done by zeroing the arg structure rather than
using the uninitialised_var() trick so we know for certain that the
structure is correctly initialised as xfs_bmapi is a very complex
function and it is difficult to prove warnings are spurious.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The -fno-unsigned-char directive has no effect anymore as the
XFs build is clean. However, the kernel build hides pointer sign
differences so turn that back on so that we can clean up all the
mismatches prior to a userspace code resync.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We are now consistently using unsigned char strings for names
so fix up the remaining warnings in the dir2 code to complete
the cleanup.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
To be consistent with the directory code, the attr code should use
unsigned names. Convert the names from the vfs at the highest level
to unsigned, and ænsure they are consistenly used as unsigned down
to disk.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_buf_iomove() uses xfs_caddr_t as it's parameter types, but it doesn't
care about the signedness of the variables as it is just copying the
data. Change the prototype to use void * so that we don't get sign
warnings at call sites.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
To be consistent across the codebase, convert the dirnameops to pass
the directory names by unsigned char strings.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
dmops uses a signed char for it's namespace event. To be consistent
with the rest of the code, convert them to unsigned char for the
namespace string.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the struct xfs_name to use unsigned chars for the name
strings to match both what is stored on disk (__uint8_t) and what
the VFS expects (unsigned char).
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move xfsbdstrat and xfs_bdstrat_cb from xfs_lrw.c and xfs_bioerror
and xfs_bioerror_relse from xfs_rw.c into xfs_buf.c. This also
means xfs_bioerror and xfs_bioerror_relse can be marked static now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fold XFS_bwrite into it's only caller, xfs_bwrite and move it into
xfs_buf.c instead of leaving it as a fairly large inline function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Don't bother using XFS_bwrite as it doesn't provide much code for
our use case. Instead opencode it and fold xlog_bdstrat_cb into the
new xlog_bdstrat helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that the perag structure is allocated memory rather than held in
an array, we don't need to have the busy extent array external to
the structure. Embed it into the perag structure to avoid needing an
extra allocation when setting up.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Add proper error handling in case an error occurs while initializing
new perag structures for a mount point. The mount structure is
restored to its previous state by deleting and freeing any perag
structures added during the call.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The filestreams cache flush is not needed in the sync code as it
does not affect data writeback, and it is now not used by the growfs
code, either, so kill it.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Uninline xfs_perag_{get,put} so that tracepoints can be inserted
into them to speed debugging of reference count problems.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reference count the per-ag structures to ensure that we keep get/put
pairs balanced. Assert that the reference counts are zero at unmount
time to catch leaks. In future, reference counts will enable us to
safely remove perag structures by allowing us to detect when they
are no longer in use.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The use of an array for the per-ag structures requires reallocation
of the array when growing the filesystem. This requires locking
access to the array to avoid use after free situations, and the
locking is difficult to get right. To avoid needing to reallocate an
array, change the per-ag structures to an allocated object per ag
and index them using a tree structure.
The AGs are always densely indexed (hence the use of an array), but
the number supported is 2^32 and lookups tend to be random and hence
indexing needs to scale. A simple choice is a radix tree - it works
well with this sort of index. This change also removes another
large contiguous allocation from the mount/growfs path in XFS.
The growing process now needs to change to only initialise the new
AGs required for the extra space, and as such only needs to
exclusively lock the tree for inserts. The rest of the code only
needs to lock the tree while doing lookups, and hence this will
remove all the deadlocks that currently occur on the m_perag_lock as
it is now an innermost lock. The lock is also changed to a spinlock
from a read/write lock as the hold time is now extremely short.
To complete the picture, the per-ag structures will need to be
reference counted to ensure that we don't free/modify them while
they are still in use. This will be done in subsequent patch.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Convert the remaining direct lookups of the per ag structures to use
get/put accesses. Ensure that the loops across AGs and prior users
of the interface balance gets and puts correctly.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use xfs_perag_get() and xfs_perag_put() in the filestreams code.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Start abstracting the perag references so that the indexing of the
structures is not directly coded into all the places that uses the
perag structures. This will allow us to separate the use of the
perag structure and the way it is indexed and hence avoid the known
deadlocks related to growing a busy filesystem.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_get_perag is really getting the perag that an inode belongs to
based on it's inode number. Convert the use of this function to just
get the perag from a provided ag number. Use this new function to
obtain the per-ag structure when traversing the per AG inode trees
for sync and reclaim.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The xfsbufd wakes every xfsbufd_centisecs (once per second by
default) for each filesystem even when the filesystem is idle. If
the xfsbufd has nothing to do, put it into a long term sleep and
only wake it up when there is work pending (i.e. dirty buffers to
flush soon). This will make laptop power misers happy.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that the AIL push algorithm is traversal safe, we don't need a
watchdog function in the xfsaild to catch pushes that fail to make
progress. Remove the watchdog timeout and make pushes purely driven
by demand. This will remove the once-per-second wakeup that is seen
when the filesystem is idle and make laptop power misers happy.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the roll-your-own linked list operations.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Just minor housekeeping, a lot more functions can be trivially made
static; others could if we reordered things a bit...
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
The swap extent ioctl passes in a target inode and a temporary inode
which are clearly named in the ioctl structure. The code then
assigns temp to target and vice versa, making it extremely difficult
to work out which inode is which later in the code. Make this
consistent throughout the code.
Also make xfs_swap_extent static as there are no external users of
the function.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
To be able to diagnose whether the swap extents function is
detecting compatible inode data fork configurations for swapping
extents, add tracing points to the code to allow us to see the
format of the inode forks before and after the swap.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
When swapping extents, we can corrupt inodes by swapping data forks
that are in incompatible formats. This is caused by the two indoes
having different fork offsets due to the presence of an attribute
fork on an attr2 filesystem. xfs_fsr tries to be smart about
setting the fork offset, but the trick it plays only works on attr1
(old fixed format attribute fork) filesystems.
Changing the way xfs_fsr sets up the attribute fork will prevent
this situation from ever occurring, so in the kernel code we can get
by with a preventative fix - check that the data fork in the
defragmented inode is in a format valid for the inode it is being
swapped into. This will lead to files that will silently and
potentially repeatedly fail defragmentation, so issue a warning to
the log when this particular failure occurs to let us know that
xfs_fsr needs updating/fixing.
To help identify how to improve xfs_fsr to avoid this issue, add
trace points for the inodes being swapped so that we can determine
why the swap was rejected and to confirm that the code is making the
right decisions and modifications when swapping forks.
A further complication is even when the swap is allowed to proceed
when the fork offset is different between the two inodes then value
for the maximum number of extents the data fork can hold can be
wrong. Make sure these are also set correctly after the swap occurs.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When xfs_rtfind_forw() returns an error, the block is returned
uninitialised. xfs_rtfree_range() is not checking the error return,
so could be using an uninitialised block number for modifying bitmap
summary info.
The problem was found by gcc when compiling the *userspace* libxfs
code - it is an copy of the kernel code with the exact same bug.
gcc gives an uninitialised variable warning on the userspace code
but not on the kernel code. You gotta love the consistency (Mmmm,
slightly chewy today!).
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
When reclaiming stale inodes, we need to guarantee that inodes are
unpinned before returning with a "clean" status. If we don't we can
reclaim inodes that are pinned, leading to use after free in the
transaction subsystem as transactions complete.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
lockdep complains about a the lock not being initialised as we do an
ASSERT based check that the lock is not held before we initialise it
to catch inodes freed with the lock held.
lockdep does this check for us in the lock initialisation code, so
remove the ASSERT to stop the lockdep warning.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We cannot do direct inode reclaim without taking the flush lock to
ensure that we do not reclaim an inode under IO. We check the inode
is clean before doing direct reclaim, but this is not good enough
because the inode flush code marks the inode clean once it has
copied the in-core dirty state to the backing buffer.
It is the flush lock that determines whether the inode is still
under IO, even though it is marked clean, and the inode is still
required at IO completion so we can't reclaim it even though it is
clean in core. Hence the requirement that we need to take the flush
lock even on clean inodes because this guarantees that the inode
writeback IO has completed and it is safe to reclaim the inode.
With delayed write inode flushing, we coul dend up waiting a long
time on the flush lock even for a clean inode. The background
reclaim already handles this efficiently, so avoid all the problems
by killing the direct reclaim path altogether.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The reclaim code will handle flushing of dirty inodes before reclaim
occurs, so avoid them when determining whether an inode is a
candidate for flushing to disk when walking the radix trees. This
is based on a test patch from Christoph Hellwig.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make the inode tree reclaim walk exclusive to avoid races with
concurrent sync walkers and lookups. This is a version of a patch
posted by Christoph Hellwig that avoids all the code duplication.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When we search for and find a busy extent during allocation we
force the log out to ensure the extent free transaction is on
disk before the allocation transaction. The current implementation
has a subtle bug in it--it does not handle multiple overlapping
ranges.
That is, if we free lots of little extents into a single
contiguous extent, then allocate the contiguous extent, the busy
search code stops searching at the first extent it finds that
overlaps the allocated range. It then uses the commit LSN of the
transaction to force the log out to.
Unfortunately, the other busy ranges might have more recent
commit LSNs than the first busy extent that is found, and this
results in xfs_alloc_search_busy() returning before all the
extent free transactions are on disk for the range being
allocated. This can lead to potential metadata corruption or
stale data exposure after a crash because log replay won't replay
all the extent free transactions that cover the allocation range.
Modified-by: Alex Elder <aelder@sgi.com>
(Dropped the "found" argument from the xfs_alloc_busysearch trace
event.)
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Because inodes remain in cache much longer than inode buffers do
under memory pressure, we can get the situation where we have
stale, dirty inodes being reclaimed but the backing storage has
been freed. Hence we should never, ever flush XFS_ISTALE inodes
to disk as there is no guarantee that the backing buffer is in
cache and still marked stale when the flush occurs.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We currently have some rather odd code in xfs_setattr for
updating the a/c/mtime timestamps:
- first we do a non-transaction update if all three are updated
together
- second we implicitly update the ctime for various changes
instead of relying on the ATTR_CTIME flag
- third we set the timestamps to the current time instead of the
arguments in the iattr structure in many cases.
This patch makes sure we update it in a consistent way:
- always transactional
- ctime is only updated if ATTR_CTIME is set or we do a size
update, which is a special case
- always to the times passed in from the caller instead of the
current time
The only non-size caller of xfs_setattr that doesn't come from
the VFS is updated to set ATTR_CTIME and pass in a valid ctime
value.
Reported-by: Eric Blake <ebb9@byu.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Using DECLARE_EVENT_CLASS allows us to to use trace event code
instead of duplicating it in the binary. This was not available
before 2.6.33 so it had to be done as a separate step once the
prerequisite was merged.
This only requires changes to xfs_trace.h and the results are
rather impressive:
hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
text data bss dec hex filename
607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o
1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Randy Dunlap Reported printk() format-related warnings reported
on i386 builds in his environment. Dave Chinner provided this
patch to eliminate them.
Signed-off by: Dave Chinner <david@fromorbit.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
After I_SYNC was split from I_LOCK the leftover is always used together with
I_NEW and thus superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
XFS: Free buffer pages array unconditionally
xfs: kill xfs_bmbt_rec_32/64 types
xfs: improve metadata I/O merging in the elevator
xfs: check for not fully initialized inodes in xfs_ireclaim
The code in xfs_free_buf() only attempts to free the b_pages array if the
buffer is a page cache backed or page allocated buffer. The extra log buffer
that is used when the log wraps uses pages that are allocated to a different
log buffer, but it still has a b_pages array allocated when those pages
are associated to with the extra buffer in xfs_buf_associate_memory.
Hence we need to always attempt to free the b_pages array when tearing
down a buffer, not just on buffers that are explicitly marked as page bearing
buffers. This fixes a leak detected by the kernel memory leak code.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
For a long time we've always stored bmap btree records in the 64bit format,
so kill off the dead 32bit type, and make sure the 64bit type is named just
xfs_bmbt_rec everywhere, without any size postfix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Change all async metadata buffers to use [READ|WRITE]_META I/O types
so that the I/O doesn't get issued immediately. This allows merging of
adjacent metadata requests but still prioritises them over bulk data.
This shows a 10-15% improvement in sequential create speed of small
files.
Don't include the log buffers in this classification - leave them as
sync types so they are issued immediately.
Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Add an assert for inodes not added to the inode cache in xfs_ireclaim,
to make sure we're not going to introduce something like the
famous nfsd inode cache bug again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently the locking in blockdev_direct_IO is a mess, we have three different
locking types and very confusing checks for some of them. The most
complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
used.
This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
The difference is that DIO_NO_LOCKING always sets the create argument for
the get_blocks callback to zero, but we can easily move that to the actual
get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode:
gfs already ignores the create argument and thus is fine with the new
version, ocfs2 only errors out if create were ever set, and we can remove
this dead code now, the block device code only ever uses create for an
error message if we are fully beyond the device which can never happen,
and last but not least XFS will need the new behavour for writes.
Now we can replace the lock_type variable with a flags one, where no flag
means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
flag. Separate out the check for not allowing to fill holes into a separate
flag, although for now both flags always get set at the same time.
Also revamp the documentation of the locking scheme to actually make sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods. This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute. With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.
Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.
[with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently the locking in blockdev_direct_IO is a mess, we have three
different locking types and very confusing checks for some of them. The
most complicated one is DIO_OWN_LOCKING for reads, which happens to not
actually be used.
This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
case is unused anyway, and the write side is almost identical to
DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the
create argument for the get_blocks callback to zero, but we can easily
move that to the actual get_blocks callbacks. There are four users of the
DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
fine with the new version, ocfs2 only errors out if create were ever set,
and we can remove this dead code now, the block device code only ever uses
create for an error message if we are fully beyond the device which can
never happen, and last but not least XFS will need the new behavour for
writes.
Now we can replace the lock_type variable with a flags one, where no flag
means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
flag. Separate out the check for not allowing to fill holes into a
separate flag, although for now both flags always get set at the same
time.
Also revamp the documentation of the locking scheme to actually make
sense.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Elder <aelder@sgi.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>