403 Commits

Author SHA1 Message Date
Joern Engel 1c0eeaf569 introduce I_SYNC
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect.  One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu 1f7decf6d9 writeback: remove pages_skipped accounting in __block_write_full_page()
Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be produced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023

% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824  # generated by a debug printk
[  529.263184] redirtied inode 847824 line 548
[  564.250872] redirtied inode 847824 line 548
[  594.272797] redirtied inode 847824 line 548
[  629.231330] redirtied inode 847824 line 548
[  659.224674] redirtied inode 847824 line 548
[  689.219890] redirtied inode 847824 line 548
[  724.226655] redirtied inode 847824 line 548
[  759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543                 if (wbc->pages_skipped != pages_skipped) {
544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */
548                         redirty_tail(inode);
549                 }

More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basically no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

Also the comment in generic_sync_sb_inodes():

544                         /*
545                          * writeback is not making progress due to locked
546                          * buffers.  Skip this inode for now.
547                          */

and the comment in __block_write_full_page():

1713                 /*
1714                  * The page was marked dirty, but the buffers were
1715                  * clean.  Someone wrote them back by hand with
1716                  * ll_rw_block/submit_bh.  A rare case.
1717                  */

do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!
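The mismatch described above can be modeled in user space. This is a simplified sketch, not the kernel code: struct demo_wbc, demo_write_clean_page() and demo_sync_inode() are hypothetical names, and the account_skipped switch is an assumption that stands in for the accounting this patch removes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one writeback_control field that matters here. */
struct demo_wbc {
    long pages_skipped;
};

/*
 * Models writing a page whose buffers are already clean: no I/O is
 * submitted. With account_skipped set, the page is still counted as
 * skipped, which is the behaviour the commit message describes as the bug.
 */
static void demo_write_clean_page(struct demo_wbc *wbc, int account_skipped)
{
    assert(wbc != NULL);
    if (account_skipped)
        wbc->pages_skipped++;
}

/*
 * Models the caller's check: if pages_skipped changed while writing the
 * inode, treat it as "no progress" and redirty the inode.
 * Returns 1 when the inode would be redirtied, 0 otherwise.
 */
static int demo_sync_inode(struct demo_wbc *wbc, int account_skipped)
{
    long before;

    assert(wbc != NULL);
    before = wbc->pages_skipped;
    demo_write_clean_page(wbc, account_skipped);
    return wbc->pages_skipped != before;
}
```

With account_skipped set, every pass over the clean page redirties the inode, which matches the repeated "redirtied inode" lines in the dmesg output; with it cleared, the inode is not pushed back.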

This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.

These are the two possible new behaviors after the patch:

1) pretty nice: wait 30s and write ALL:)
2) not so good:
	- during the dd: ~16M
	- after 30s:      ~4M
	- after 5s:       ~4M
	- after 5s:     ~176M

The next patch will fix case (2).

Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Christoph Lameter 4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel
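A minimal user-space sketch of the new constructor shape follows. The struct kmem_cache here is a hypothetical stand-in with made-up fields, and demo_inode, demo_inode_ctor() and cache_init_object() are illustrative names only; only the ctor(struct kmem_cache *s, void *object) argument order is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct kmem_cache. */
struct kmem_cache {
    const char *name;
    size_t object_size;
    /* New-style constructor: cache pointer first, object second, no flags. */
    void (*ctor)(struct kmem_cache *s, void *object);
};

/* Illustrative object type managed by the cache. */
struct demo_inode {
    unsigned long state;
    int refcount;
};

/* Constructor written against the new prototype. */
static void demo_inode_ctor(struct kmem_cache *s, void *object)
{
    struct demo_inode *inode = object;

    memset(inode, 0, s->object_size);
    inode->refcount = 1;
}

/* Illustrative helper: run the cache's constructor on a fresh object. */
static void cache_init_object(struct kmem_cache *s, void *object)
{
    assert(s != NULL && object != NULL);
    if (s->ctor)
        s->ctor(s, object);
}
```

Putting the cache pointer first lines the constructor up with the other slab functions, which take the kmem_cache as their first argument.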

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Jeremy Fitzhardinge 7f01507234 [XFS] eagerly remove vmap mappings to avoid upsetting Xen
XFS leaves stray mappings around when it vmaps memory to make it virtually
contiguous. This upsets Xen if one of those pages is being recycled into a
pagetable, since it finds an extra writable mapping of the page.

This patch solves the problem in a brute force way, by making XFS always
eagerly unmap its mappings.

SGI-PV: 971902
SGI-Modid: xfs-linux-melb:xfs-kern:29886a

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-17 14:14:35 +10:00
Christoph Hellwig 6572bc28de [XFS] simplify validate_fields
Stop using xfs_getattr and an on-stack bhv_vattr_t just to get three fields
from the underlying inode and open-code copying from the inode fields
instead.

SGI-PV: 970662
SGI-Modid: xfs-linux-melb:xfs-kern:29711a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-17 11:10:14 +10:00
Nick Piggin d79689c703 xfs: convert to new aops
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Tim Shimmin 150f29ef2e [XFS] no longer using io_vnode, as was remaining from 23 cherrypick
Because we cherrypicked SGI-Modid xfs-linux-melb:xfs-kern:29675a
and it depended on the sgi mod which removed io_vnode (which was
not cherrypicked in 23) it was hand modified.
This fixes things back up (to the original mod) now we have moved
on again.

Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 16:20:12 +10:00
Tim Shimmin cd514bdaa8 [XFS] Put back the QUEUE_ORDERED_NONE test in the barrier check.
Put back the QUEUE_ORDERED_NONE test which caused us grief in sles when it
was taken out as, IIRC, it allowed md/lvm to be thought of as supporting
barriers when they weren't in some configurations. This patch will be
reverting what went in as part of a change for the SGI-pv 964544
(SGI-Modid: xfs-linux-melb:xfs-kern:28568a).

SGI-PV: 971783
SGI-Modid: xfs-linux-melb:xfs-kern:29882a

Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
2007-10-16 14:23:21 +10:00
Lachlan McIlroy e893bffd4c [XFS] avoid race in sync_inodes() that can fail to write out all dirty data
In xfs_fs_sync_super() treat a sync the same as a filesystem freeze. This
is needed to force the log to disk for inodes which are not marked dirty
in the Linux inode (the inodes are marked dirty on completion of the log
I/O) and so sync_inodes() will not flush them.

In xfs_fs_write_inode() a synchronous flush will not get an EAGAIN from
xfs_inode_flush() and if an asynchronous flush returns EAGAIN we should
pass it on to the caller. If we get an error while flushing the inode then
re-dirty it so we can try again later.

SGI-PV: 971670
SGI-Modid: xfs-linux-melb:xfs-kern:29860a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 14:22:28 +10:00
Christoph Hellwig 3e5daf05a0 [XFS] simplify xfs_create/mknod/symlink prototype
Simplify the prototype for xfs_create/xfs_mkdir/xfs_symlink by not passing
down a bhv_vattr_t that just hogs stack space. Instead pass down the mode
in a mode_t and in case of xfs_create the rdev as a scalar type as well.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29794a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 14:15:32 +10:00
Christoph Hellwig c83bfab1fa [XFS] avoid xfs_getattr in XFS_IOC_FSGETXATTR ioctl
No need to call into xfs_getattr and put a big bhv_vattr_t on the stack
just to get a little information from the XFS inode.

Add a helper called xfs_ioc_fsgetxattr instead that deals with retrieving
the information in a clean way.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29780a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 12:21:48 +10:00
Lachlan McIlroy 5903c4956f [XFS] ensure file size is logged on synchronous writes
Synchronous writes currently log inode changes before syncing pages to
disk. Since the file size is updated on I/O completion we won't be writing
out the updated file size and if we crash the file will have the wrong
size. This change moves the logging after the syncing of the pages to
ensure we log the correct file size.

SGI-PV: 970334
SGI-Modid: xfs-linux-melb:xfs-kern:29649a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 12:18:38 +10:00
Christoph Hellwig b267ce9952 [XFS] kill struct bhv_vfs
Now that struct bhv_vfs doesn't have any members left we can kill it and
go directly from the super_block to the xfs_mount everywhere.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29509a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 12:17:27 +10:00
Christoph Hellwig 7439449670 [XFS] move syncing related members from struct bhv_vfs to struct xfs_mount
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29508a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 12:16:35 +10:00
Christoph Hellwig bd186aa901 [XFS] kill the vfs_flags member in struct bhv_vfs
All flags are added to xfs_mount's m_flag instead. Note that the 32bit
inode flag was duplicated in both of them, but only cleared in the mount
when it was not necessary due to the filesystem being small enough. Two
flags are still required here - one to indicate the mount option setting,
and one to indicate if it applies or not.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29507a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:45:57 +10:00
Christoph Hellwig 0ce4cfd4f7 [XFS] kill the vfs_fsid and vfs_altfsid members in struct bhv_vfs
vfs_altfsid was just a pointer to mp->m_fixedfsid so we can trivially
replace it with the latter. vfs_fsid also was identical to m_fixedfsid
through rather obfuscated ways so we can kill it as well and simplify its
only user.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29506a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:45:02 +10:00
Christoph Hellwig 745f691912 [XFS] call common xfs vfs-level helpers directly and remove vfs operations
Also remove the now dead behavior code.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29505a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:44:08 +10:00
Christoph Hellwig 48c872a9f3 [XFS] decontaminate vfs operations from behavior details
All vfs ops now take struct xfs_mount pointers and the behaviour related
glue is split out into methods of its own.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29504a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:43:55 +10:00
Christoph Hellwig b09cc77109 [XFS] remove dependency of the quota module on behaviors
Mount options are now parsed by the main XFS module and rejected if quota
support is not available, and there are some new quota operations for the
quotactl syscall and calls to quota in the mount, unmount and sync
callchains.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29503a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:43:26 +10:00
Christoph Hellwig 293688ec42 [XFS] remove dependency of the dmapi module on behaviors
Mount options are now parsed by the main XFS module and rejected if dmapi
support is not available, and there is a new dm operation to send the
mount event.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29502a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:41:15 +10:00
Christoph Hellwig f541d270db [XFS] move freeing the mount structure from xfs_mount_free into the callers
In the next patch we need to look at the mount structure until just before
it's freed, so we need to be able to free it as the very last thing in
xfs_unmount.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29501a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:40:52 +10:00
Christoph Hellwig 0a74cd1964 [XFS] kill struct bhv_vnode
Now that struct bhv_vnode is empty we can just kill it. Retain bhv_vnode_t
as a typedef for struct inode for the time being until all the fallout is
cleaned up.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29500a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:40:24 +10:00
Christoph Hellwig 2aeaa258c0 [XFS] kill the v_number member in struct bhv_vnode
It's entirely unused except for ignored arguments in the mrlock
initialization, so remove it.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29499a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:39:42 +10:00
Christoph Hellwig 1543d79c45 [XFS] move v_trace from bhv_vnode to xfs_inode
struct bhv_vnode is on its way out, so move the trace buffer to the XFS
inode. Note that this makes the tracing macros rather misnamed, but this
kind of fallout will be fixed up incrementally later on.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29498a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:39:25 +10:00
Christoph Hellwig b677c210ce [XFS] move v_iocount from bhv_vnode to xfs_inode
struct bhv_vnode is on its way out, so move the I/O count to the XFS
inode.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29497a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:38:56 +10:00
Christoph Hellwig b3aea4edc2 [XFS] kill the v_flag member in struct bhv_vnode
All flags previously handled at the vnode level are now in the xfs_inode
where we already have a flags mechanism and free bits for flags
previously in the vnode.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29495a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:37:29 +10:00
Christoph Hellwig 2f6f7b3d9b [XFS] kill v_vfsp member from struct bhv_vnode
We can easily get at the vfsp through the super_block but it will soon be
gone anyway.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29494a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 11:23:43 +10:00
Christoph Hellwig 739bfb2a7d [XFS] call common xfs vnode-level helpers directly and remove vnode operations
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29493a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-16 10:40:00 +10:00
Christoph Hellwig 993386c19a [XFS] decontaminate vnode operations from behavior details
All vnode ops now take struct xfs_inode pointers and the behaviour related
glue is split out into methods of its own. This required fixing
xfs_create/mkdir/symlink to not mess with the inode pointer but rather use
a separate boolean for error handling. Thanks to Dave Chinner for that
fix.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29492a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:54:29 +10:00
David Chinner da353b0d64 [XFS] Radix tree based inode caching
One of the perpetual scaling problems XFS has is indexing its incore
inodes. We currently use hashes and the default hash sizes chosen can
only ever be a tradeoff between memory consumption and the maximum
realistic size of the cache.

As a result, anyone who has millions of inodes cached on a filesystem
needs to tune the size of the cache via the ihashsize mount option to
allow decent scalability with inode cache operations.

A further problem is the separate inode cluster hash, whose size is based
on the ihashsize but is smaller, and so under certain conditions (sparse
cluster cache population) this can become a limitation long before the
inode hash is causing issues.

The following patchset removes the inode hash and cluster hash and
replaces them with radix trees to avoid the scalability limitations of the
hashes. It also reduces the size of the inodes by 3 pointers....

SGI-PV: 969561
SGI-Modid: xfs-linux-melb:xfs-kern:29481a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:50:50 +10:00
Christoph Hellwig 39cd9f877e [XFS] kill move.[ch]
Kill uio related functions and defines now that they're unused.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29480a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:50:26 +10:00
Christoph Hellwig 804c83c376 [XFS] stop using uio in the readlink code
Simplify the readlink code to get rid of the last user of uio.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29479a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:50:13 +10:00
Christoph Hellwig 051e7cd44a [XFS] use filldir internally
Currently xfs has a rather complicated internal scheme to allow for
different directory formats in IRIX. This patch rips all code related to
this out and pushes usage of the Linux filldir callback into the lowlevel
directory code. This does not make the code any less portable because
filldir can be used to create dirents of all possible variations
(including the IRIX ones as proved by the IRIX binary emulation code under
arch/mips/).

This patch gets rid of an unnecessary copy in the readdir path, about 400
lines of code and one of the last two users of the uio structure.

This version is updated to deal with dmapi as well which greatly simplifies
the get_dirattrs code. The dmapi part has been tested using the
get_dirattrs tools from the xfstest dmapi suite1 with various small and
large directories.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29478a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:49:49 +10:00
Christoph Hellwig eb9df39daf [XFS] remove unnecessary vfs argument to DM_EVENT_ENABLED
SGI-PV: 968690
SGI-Modid: xfs-linux-melb:xfs-kern:29340a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:45:25 +10:00
Eric Sandeen af3a2e8a3f [XFS] move linux/log2.h header to xfs_linux.h
Generally we try not to directly include linux header files in core xfs
code; xfs_linux.h is the spot for that.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29326a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:40:46 +10:00
Eric Sandeen 6385f4d557 [XFS] Remove xfs_physmem
Now that nobody's using it, remove xfs_physmem & friends.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29325a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:40:14 +10:00
Eric Sandeen 40906630f1 [XFS] Remove m_nreadaheads
m_nreadaheads in the mount struct is never used; remove it and the various
macros assigned to it. Also remove a couple other unused macros in the
same areas.

Removes one user of xfs_physmem.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29322a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:37:46 +10:00
David Chinner 0bfefc46dc [XFS] Barriers need to be dynamically checked and switched off
If the underlying block device suddenly stops supporting barriers, we need
to handle the -EOPNOTSUPP error in a sane manner rather than shutting
down the filesystem. If we get this error, clear the barrier flag, reissue
the I/O, and tell the world bad things are occurring.
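The fallback described above can be sketched in user space as follows. Everything here is a hypothetical model: DEMO_MOUNT_BARRIER, struct demo_device, struct demo_mount, demo_submit() and demo_write_log() are illustrative names and are not XFS interfaces. Only the sequence (see -EOPNOTSUPP, clear the barrier flag, warn, reissue) is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_MOUNT_BARRIER 0x1u   /* hypothetical "use barriers" mount flag */

/* Hypothetical block device that may or may not accept barrier writes. */
struct demo_device {
    int supports_barriers;
};

/* Hypothetical mount structure carrying the barrier flag. */
struct demo_mount {
    unsigned int flags;
    struct demo_device *dev;
};

/* Model of an I/O submission: barrier writes fail on unsupporting devices. */
static int demo_submit(struct demo_device *dev, int use_barrier)
{
    assert(dev != NULL);
    if (use_barrier && !dev->supports_barriers)
        return -EOPNOTSUPP;
    return 0;
}

/*
 * Issue a write. If a barrier write is rejected with -EOPNOTSUPP, switch
 * barriers off for this mount, log a warning, and reissue without one
 * instead of treating the error as fatal.
 */
static int demo_write_log(struct demo_mount *mp)
{
    int use_barrier;
    int err;

    assert(mp != NULL);
    use_barrier = (mp->flags & DEMO_MOUNT_BARRIER) != 0;
    err = demo_submit(mp->dev, use_barrier);
    if (err == -EOPNOTSUPP && use_barrier) {
        mp->flags &= ~DEMO_MOUNT_BARRIER;
        fprintf(stderr, "demo: barriers not supported, disabling them\n");
        err = demo_submit(mp->dev, 0);
    }
    return err;
}
```

Because the flag is cleared on the mount, later writes in this model skip the barrier attempt entirely rather than failing and retrying each time.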

SGI-PV: 964544
SGI-Modid: xfs-linux-melb:xfs-kern:28568a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-10-15 16:23:45 +10:00
Al Viro 782e3b3b38 Fix up more bio fallout
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-12 00:29:50 -07:00
NeilBrown 6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the request is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.
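A rough user-space illustration of the resulting callback shape is given below. struct demo_bio, demo_end_io_t, demo_bio_endio() and demo_record_status() are hypothetical names that model the idea only: the completion callback returns void, takes no size argument, and the completion helper leaves the size field alone.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio;

/* Completion callback: returns void and receives no size argument. */
typedef void (demo_end_io_t)(struct demo_bio *bio, int error);

/* Hypothetical stand-in for a bio. */
struct demo_bio {
    unsigned int size;        /* request size; not adjusted on completion */
    demo_end_io_t *end_io;    /* called once when the request is complete */
    void *private_data;       /* opaque pointer for the callback's use */
};

/* Complete the request: invoke the callback once, do not touch size. */
static void demo_bio_endio(struct demo_bio *bio, int error)
{
    assert(bio != NULL);
    if (bio->end_io)
        bio->end_io(bio, error);
}

/* Example callback: store the completion status where the caller can see it. */
static void demo_record_status(struct demo_bio *bio, int error)
{
    *(int *)bio->private_data = error;
}
```

Since the callback runs exactly once per request in this model, it has no need to check for partial completion before acting on the status.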

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Lachlan McIlroy 776a75fa5c [XFS] Ensure file size updates have been completed before writing inode to disk.
SGI-PV: 968767
SGI-Modid: xfs-linux-melb:xfs-kern:29675a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-18 20:12:51 +10:00
Christoph Hellwig 265c1fac38 [XFS] fix sparse shadowed variable warnings
- in xfs_probe_cluster rename the inner len to pg_len. There's no harm
  here because the outer len isn't used after the inner len comes into
  existence but it keeps the code clean.
- in xfs_da_do_buf remove the inner i because they don't overlap
  and they are both the same type.

SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29311a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:50:26 +10:00
Christoph Hellwig 34521c5e49 [XFS] Fix sparse warning in kmem_shake_allow
We can't return a masked result of a __bitwise type. Compare it to 0 first
to keep the behaviour without the warning.
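The pattern can be shown with a small user-space sketch. demo_gfp_t, DEMO_GFP_WAIT and demo_shake_allow() are hypothetical stand-ins, and the flag value is an assumption chosen for illustration; sparse's __bitwise checking itself is not reproduced here.

```c
#include <assert.h>

/* Stand-in for a __bitwise flags type; plain unsigned int in user space. */
typedef unsigned int demo_gfp_t;

#define DEMO_GFP_WAIT 0x10u   /* hypothetical flag bit */

/*
 * Return a plain int truth value instead of the masked flags word.
 * Comparing against 0 keeps the same zero/non-zero behaviour for callers
 * while the returned value is no longer of the flags type.
 */
static int demo_shake_allow(demo_gfp_t gfp_mask)
{
    /* Sanity check: the illustrative flag is a single bit. */
    assert((DEMO_GFP_WAIT & (DEMO_GFP_WAIT - 1u)) == 0);
    return (gfp_mask & DEMO_GFP_WAIT) != 0;
}
```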

SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29309a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:48:00 +10:00
David Chinner 8da22d7a36 [XFS] Set filestreams object timeout to something sane.
SGI-PV: 968554
SGI-Modid: xfs-linux-melb:xfs-kern:29303a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:47:10 +10:00
Al Viro ad690ef9e6 xfs ioctl __user annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26 11:11:57 -07:00
Paul Mundt 20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Linus Torvalds fdb64f93b3 Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
  [XFS] Fix inode size update before data write in xfs_setattr
  [XFS] Allow punching holes to free space when at ENOSPC
  [XFS] Implement ->page_mkwrite in XFS.
  [FS] Implement block_page_mkwrite.

Manually fix up conflict with Nick's VM fault handling patches in
fs/xfs/linux-2.6/xfs_file.c

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 14:41:33 -07:00
Nick Piggin d0217ac04c mm: fault feedback #1
Change ->fault prototype.  We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things.  This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).

This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.

struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.

The page is now returned via a page pointer in the vm_fault struct.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin 54cb8821de mm: merge populate and nopage into fault (fixes nonlinear)
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.

->populate is a layering violation because the filesystem/pagecache code
should not need to know anything about the virtual memory mapping.  The hitch here
is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).

Having the populate handler install the pte itself is likewise a nasty thing
to be doing.

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn.  Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache.  Seems like a fringe functionality anyway.

NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
users have hit mainline yet.

[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin d00806b183 mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache.  Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.

The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
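The counter-and-retry protocol described above can be modeled single-threaded. This is a sketch only: struct demo_file, demo_truncate(), demo_fault() and the race hook are hypothetical, the page table lock is not modeled, and the hook merely simulates a truncation landing between the lookup and the re-check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields the protocol relies on. */
struct demo_file {
    long i_size;
    unsigned int truncate_count;
};

/* Truncation side: shrink i_size first, then bump the counter. */
static void demo_truncate(struct demo_file *f, long new_size)
{
    assert(f != NULL);
    f->i_size = new_size;
    f->truncate_count++;
}

/*
 * Fault side: sample the counter, look the page up if it lies within
 * i_size, then re-check the counter and retry if it changed.
 * 'race' may be NULL; when set, it runs once between lookup and re-check
 * to simulate a concurrent truncation.
 * Returns 1 if a mapping may be installed, 0 if the offset is beyond i_size.
 */
static int demo_fault(struct demo_file *f, long offset,
                      void (*race)(struct demo_file *))
{
    assert(f != NULL);
    for (;;) {
        unsigned int seen = f->truncate_count;

        if (offset >= f->i_size)
            return 0;
        if (race) {
            race(f);
            race = NULL;
        }
        if (seen == f->truncate_count)
            return 1;
        /* Counter moved: a truncation intervened, so back out and retry. */
    }
}

/* Example race hook: truncate the file to zero length. */
static void demo_race_truncate_to_zero(struct demo_file *f)
{
    demo_truncate(f, 0);
}
```

The model only covers truncation. An invalidation inside i_size changes neither field, so neither check here can notice it; that is the gap the following paragraph describes.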

Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size.  do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size).  The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data.  Valid mappings to the same
place will see a different page.

Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit.  He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight.  However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit.  Scalability is not an issue.

This patch implements this latter approach.  ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so).  do_no_page only unlocks the page after setting up the mapping
completely.  invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).

This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
David Chinner 4f57dbc6b5 [XFS] Implement ->page_mkwrite in XFS.
Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages
being written to. This allows us to do correct delayed allocation and
ENOSPC checking as well as remap unwritten extents so that they get
converted correctly during writeback. This is done via the generic
block_page_mkwrite code.

SGI-PV: 940392
SGI-Modid: xfs-linux-melb:xfs-kern:29149a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-19 19:51:21 +10:00
Christoph Hellwig a569425512 knfsd: exportfs: add exportfs.h header
Currently the export_operation structure and helpers related to it are in
fs.h.  fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.

[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:06 -07:00
Rafael J. Wysocki 8314418629 Freezer: make kernel threads nonfreezable by default
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves.  This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.

It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.

The patch causes all kernel threads to be nonfreezable by default (ie.  to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE.  It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear.  Additionally, it updates documentation to
describe the freezing of tasks more accurately.
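
The new default can be pictured with a small userspace model. It is a sketch only: the `model_` names and the bit value are assumptions made for illustration and do not match the kernel's definitions.

```c
#include <stdbool.h>

#define MODEL_PF_NOFREEZE 0x1u  /* illustrative bit value, not the kernel's */

struct model_task {
    unsigned flags;
    bool     is_kernel_thread;
};

/* Kernel threads start out nonfreezable; user tasks stay freezable. */
void model_task_init(struct model_task *t, bool is_kernel_thread)
{
    t->is_kernel_thread = is_kernel_thread;
    t->flags = is_kernel_thread ? MODEL_PF_NOFREEZE : 0;
}

/* Opt in: what a kernel thread that wants to be frozen calls once. */
void model_set_freezable(struct model_task *t)
{
    t->flags &= ~MODEL_PF_NOFREEZE;
}

bool model_is_freezable(const struct model_task *t)
{
    return !(t->flags & MODEL_PF_NOFREEZE);
}
```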

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Rusty Russell 8e1f936b73 mm: clean up and kernelify shrinker registration
I can never remember what the function to register to receive VM pressure
is called.  I have to trace down from __alloc_pages() to find it.

It's called "set_shrinker()", and it needs Your Help.

1) Don't hide struct shrinker.  It contains no magic.
2) Don't allocate "struct shrinker".  It's not helpful.
3) Call them "register_shrinker" and "unregister_shrinker".
4) Call the function "shrink" not "shrinker".
5) Reduce the 17 lines of waffly comments to 13, but document it properly.
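
The shape of the interface after points 1) to 4) can be sketched in userspace as follows. This is a hedged model, not the kernel API: the struct layout, the `model_` names and the single-argument callback are assumptions made to keep the example short.

```c
#include <stddef.h>

/* Visible, caller-owned struct: registration allocates nothing. */
struct model_shrinker {
    int (*shrink)(int nr_to_scan);   /* returns objects left in the cache */
    struct model_shrinker *next;
};

struct model_shrinker *model_shrinker_list;

void model_register_shrinker(struct model_shrinker *s)
{
    s->next = model_shrinker_list;
    model_shrinker_list = s;
}

void model_unregister_shrinker(struct model_shrinker *s)
{
    struct model_shrinker **pp = &model_shrinker_list;

    while (*pp && *pp != s)
        pp = &(*pp)->next;
    if (*pp)
        *pp = s->next;
}

/* Memory pressure: ask every registered cache to scan nr_to_scan objects. */
int model_shrink_all(int nr_to_scan)
{
    int remaining = 0;

    for (struct model_shrinker *s = model_shrinker_list; s; s = s->next)
        remaining += s->shrink(nr_to_scan);
    return remaining;
}

/* Example cache holding 10 objects. */
int model_cache_objects = 10;

int model_cache_shrink(int nr_to_scan)
{
    if (nr_to_scan > model_cache_objects)
        nr_to_scan = model_cache_objects;
    model_cache_objects -= nr_to_scan;
    return model_cache_objects;
}
```

A user embeds or statically defines the struct, fills in `shrink`, and passes its address to the register and unregister calls.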

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Chinner <dgc@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:00 -07:00
Michal Marek faa63e9584 [XFS] Fix XFS_IOC_FSBULKSTAT{,_SINGLE} & XFS_IOC_FSINUMBERS in compat mode
* 32bit struct xfs_fsop_bulkreq has different size and layout of
members, no matter the alignment. Move the code out of the #else
branch (why was it there in the first place?). Define _32 variants of
the ioctl constants.
* 32bit struct xfs_bstat is different because of time_t and on
i386 because of different padding. Make xfs_bulkstat_one() accept a
custom "output formatter" in the private_data argument which takes care
of the xfs_bulkstat_one_compat() that takes care of the different
layout in the compat case.
* i386 struct xfs_inogrp has different padding.
Add a similar "output formatter" mechanism to xfs_inumbers().
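
The "output formatter" idea can be illustrated with a reduced userspace sketch. The structs below are hypothetical two-field stand-ins for the real bulkstat record, and every `model_` name is an assumption; only the pattern (core routine takes a callback that knows the caller's layout) is taken from the description above.

```c
#include <stdint.h>
#include <string.h>

/* Native record with a 64-bit time field. */
struct model_bstat {
    uint64_t ino;
    int64_t  mtime_sec;
};

/* The same record as a 32-bit caller lays it out. */
struct model_bstat32 {
    uint64_t ino;
    int32_t  mtime_sec;
} __attribute__((packed));

/* Formatter: writes one record to the user buffer, returns bytes written. */
typedef size_t (*model_formatter)(void *ubuf, const struct model_bstat *in);

size_t model_fmt_native(void *ubuf, const struct model_bstat *in)
{
    memcpy(ubuf, in, sizeof(*in));
    return sizeof(*in);
}

size_t model_fmt_compat(void *ubuf, const struct model_bstat *in)
{
    struct model_bstat32 out = {
        .ino = in->ino,
        .mtime_sec = (int32_t)in->mtime_sec,  /* narrowed for 32-bit callers */
    };

    memcpy(ubuf, &out, sizeof(out));
    return sizeof(out);
}

/* The core routine stays layout-agnostic; the ioctl path picks the formatter. */
size_t model_bulkstat_one(void *ubuf, const struct model_bstat *in,
                          model_formatter fmt)
{
    return fmt(ubuf, in);
}
```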

SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29102a

Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:42:50 +10:00
Michal Marek 1fa503df66 [XFS] Compat ioctl handler for handle operations
32bit struct xfs_fsop_handlereq has different size and offsets (due to
pointers). TODO: case XFS_IOC_{FSSETDM,ATTRLIST,ATTRMULTI}_BY_HANDLE still
not handled.

SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29101a

Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:41:49 +10:00
Michal Marek 547e00c3c6 [XFS] Compat ioctl handler for XFS_IOC_FSGEOMETRY_V1.
i386 struct xfs_fsop_geom_v1 has no padding after the last member, so the
size is different.

SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29100a

Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:41:39 +10:00
David Chinner 2a82b8be8a [XFS] Concurrent Multi-File Data Streams
In media spaces, video is often stored in a frame-per-file format. When
dealing with uncompressed realtime HD video streams in this format, it is
crucial that files do not get fragmented and that multiple files are placed
contiguously on disk.

When multiple streams are being ingested and played out at the same time,
it is critical that the filesystem does not cross the streams and
interleave them together as this creates seek and readahead cache miss
latency and prevents both ingest and playout from meeting frame rate
targets.

This patch set creates a "stream of files" concept into the allocator to
place all the data from a single stream contiguously on disk so that RAID
array readahead can be used effectively. Each additional stream gets
placed in different allocation groups within the filesystem, thereby
ensuring that we don't cross any streams. When an AG fills up, we select a
new AG for the stream that is not in use.

The core of the functionality is the stream tracking - each inode that we
create in a directory needs to be associated with the directory's stream.
Hence every time we create a file, we look up the directory's stream
object and associate the new file with that object.

Once we have a stream object for a file, we use the AG that the stream
object points to for allocations. If we can't allocate in that AG (e.g. it
is full) we move the entire stream to another AG. Other inodes in the same
stream are moved to the new AG on their next allocation (i.e. lazy
update).
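
The AG selection and lazy update described above can be modelled in a few lines of userspace C. This is a sketch under assumptions: a fixed count of four AGs, a simple first-fit search, and hypothetical `model_` names; the real allocator is considerably more involved.

```c
#include <stdbool.h>

#define MODEL_AG_COUNT 4

struct model_fs {
    int  free_blocks[MODEL_AG_COUNT];
    bool in_use[MODEL_AG_COUNT];    /* AG currently owned by some stream */
};

struct model_stream {
    int ag;                         /* AG used by every file in the stream */
};

/* Pick an AG with enough room that no other stream is using; -1 if none. */
int model_pick_ag(const struct model_fs *fs, int need)
{
    for (int ag = 0; ag < MODEL_AG_COUNT; ag++)
        if (!fs->in_use[ag] && fs->free_blocks[ag] >= need)
            return ag;
    return -1;
}

/*
 * Allocate `need` blocks for a file in stream `st`. If the stream's AG
 * cannot satisfy the request, the whole stream moves; other files follow
 * lazily because they consult st->ag on their next allocation.
 * Returns the AG used, or -1 when no AG is available.
 */
int model_stream_alloc(struct model_fs *fs, struct model_stream *st, int need)
{
    if (fs->free_blocks[st->ag] < need) {
        int ag = model_pick_ag(fs, need);

        if (ag < 0)
            return -1;
        fs->in_use[st->ag] = false;
        fs->in_use[ag] = true;
        st->ag = ag;
    }
    fs->free_blocks[st->ag] -= need;
    return st->ag;
}
```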

Stream objects are kept in a cache and hold a reference on the inode.
Hence the inode cannot be reclaimed while there is an outstanding stream
reference. This means that on unlink we need to remove the stream
association and we also need to flush all the associations on certain
events that want to reclaim all unreferenced inodes (e.g. filesystem
freeze).

SGI-PV: 964469
SGI-Modid: xfs-linux-melb:xfs-kern:29096a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
2007-07-14 15:40:53 +10:00
Christoph Hellwig fbf3ce8d8e [XFS] XFS should not be looking at filp reference counts
A check for file_count is always a bad idea. Linux has the ->release
method to deal with cleanups on last close and ->flush is only for the
very rare case where we want to perform an operation on every drop of a
reference to a file struct.

This patch gets rid of vop_close and surrounding code in favour of simply
doing the page flushing from ->release.
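
The difference between the two methods can be shown with a small userspace model. The `model_` names and the counters are hypothetical; the sketch only illustrates that a per-drop hook runs on every close while a last-close hook runs once, with no need to inspect the reference count from filesystem code.

```c
struct model_file;

struct model_file_ops {
    void (*flush)(struct model_file *f);    /* on every drop of a reference */
    void (*release)(struct model_file *f);  /* only when the last one goes */
};

struct model_file {
    int refcount;
    const struct model_file_ops *ops;
    int flush_calls;
    int release_calls;
};

/* Drop one reference: ->flush each time, ->release only at zero. */
void model_close(struct model_file *f)
{
    if (f->ops->flush)
        f->ops->flush(f);
    if (--f->refcount == 0 && f->ops->release)
        f->ops->release(f);
}

void model_count_flush(struct model_file *f)   { f->flush_calls++; }
void model_count_release(struct model_file *f) { f->release_calls++; }

const struct model_file_ops model_ops = {
    .flush   = model_count_flush,
    .release = model_count_release,
};
```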

SGI-PV: 966562
SGI-Modid: xfs-linux-melb:xfs-kern:28952a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:37:37 +10:00
David Chinner 516b2e7c26 [XFS] Fix remount,readonly path to flush everything correctly.
The remount readonly path can fail to writeback properly because we still
have active transactions after calling xfs_quiesce_fs(). Further
investigation shows that this path is broken in the same ways that the xfs
freeze path was broken so fix it the same way.

SGI-PV: 964464
SGI-Modid: xfs-linux-melb:xfs-kern:28869a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:35:58 +10:00
David Chinner effd120edb [XFS] Map unwritten extents correctly for I/o completion processing
If we have multiple unwritten extents within a single page, we fail to
tell the I/O completion construction handlers we need a new handle for the
second and subsequent blocks in the page. While we still issue the I/O
correctly, we do not have the correct ranges recorded in the ioend
structures and hence when we go to convert the unwritten extents we screw
it up.

Make sure we start a new ioend every time the mapping changes so that we
convert the correct ranges on I/O completion.
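
The rule "start a new ioend every time the mapping changes" can be sketched as a walk over the blocks of one page. The types and the `extent_id` field are hypothetical simplifications, not the XFS structures.

```c
#include <stddef.h>

enum model_map_type { MODEL_UNWRITTEN, MODEL_DELALLOC, MODEL_MAPPED };

/* One block of a page and the extent backing it (hypothetical fields). */
struct model_block {
    enum model_map_type type;
    int extent_id;
};

/*
 * Count the ioends a page needs: a new one starts whenever the mapping
 * of the current block differs from that of the previous block.
 */
size_t model_count_ioends(const struct model_block *blk, size_t n)
{
    size_t ioends = 0;

    for (size_t i = 0; i < n; i++) {
        if (i == 0 ||
            blk[i].type != blk[i - 1].type ||
            blk[i].extent_id != blk[i - 1].extent_id)
            ioends++;
    }
    return ioends;
}
```

With two unwritten extents in one page, the second extent gets its own ioend, so each conversion covers only its own range.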

SGI-PV: 964647
SGI-Modid: xfs-linux-melb:xfs-kern:28797a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:32:49 +10:00
David Chinner b2826136a1 [XFS] Handle null returned from xfs_vtoi() in xfs_setfilesize().
SGI-PV: 965636
SGI-Modid: xfs-linux-melb:xfs-kern:28777a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:31:03 +10:00
David Chinner e927af90aa [XFS] Block on unwritten extent conversion during synchronous direct I/O.
Currently we do not wait on extent conversion to occur, and hence we can
return to userspace from a synchronous direct I/O write without having
completed all the actions in the write. Hence a read after the write may
see zeroes (unwritten extent) rather than the data that was written.

Block the I/O completion by triggering a synchronous workqueue flush to
ensure that the conversion has occurred before we return to userspace.

SGI-PV: 964092
SGI-Modid: xfs-linux-melb:xfs-kern:28775a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:30:52 +10:00
David Chinner f4a9f28a90 [XFS] Flush the block device before closing it on unmount.
SGI-PV: 965630
SGI-Modid: xfs-linux-melb:xfs-kern:28774a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:30:05 +10:00
David Chinner 92821e2ba4 [XFS] Lazy Superblock Counters
When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.

When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.

The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.

The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.

This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? The answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.

It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.
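
The recovery-time summation can be sketched as follows. This is a userspace model with hypothetical `model_` structs standing in for the AGF/AGI counters and the superblock; it leaves out the transaction that logs the corrected values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for the logged per-AG counters (AGF and AGI). */
struct model_ag {
    uint64_t free_blocks;
    uint64_t inodes;
    uint64_t free_inodes;
};

/* Stands in for the lazily updated superblock counters. */
struct model_sb {
    uint64_t fdblocks;
    uint64_t icount;
    uint64_t ifree;
};

/*
 * After a clean unmount the superblock values are trusted and the walk is
 * skipped; after a crash they are rebuilt by summing every AG.
 */
void model_recover_counters(struct model_sb *sb, const struct model_ag *ags,
                            size_t nags, bool clean_unmount)
{
    if (clean_unmount)
        return;

    sb->fdblocks = sb->icount = sb->ifree = 0;
    for (size_t i = 0; i < nags; i++) {
        sb->fdblocks += ags[i].free_blocks;
        sb->icount   += ags[i].inodes;
        sb->ifree    += ags[i].free_inodes;
    }
}
```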

One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means that when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.

As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....

SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:28:50 +10:00
Andrew Morton 3260f78ad6 [XFS] Use generic shrinker interfaces in XFS.
SGI-PV: 964986
SGI-Modid: xfs-linux-melb:xfs-kern:28642a

Signed-Off-By: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:23:53 +10:00
Christoph Hellwig ca165b8892 [XFS] Fix double free in xfs_buf_get_noaddr error handling path
SGI-PV: 964983
SGI-Modid: xfs-linux-melb:xfs-kern:28639a

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:22:50 +10:00
Christoph Hellwig 1fa40b01ae [XFS] Only use refcounted pages for I/O
Many block drivers (aoe, iscsi) really want refcountable pages in bios,
which is what almost everyone sends down. XFS unfortunately has a few
places where it sends down buffers that may come from kmalloc, which
breaks them.

Fix the places that use kmalloc()d buffers.

SGI-PV: 964546
SGI-Modid: xfs-linux-melb:xfs-kern:28562a

Signed-Off-By: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-14 15:21:14 +10:00
Jens Axboe 5ffc4ef45b sendfile: remove .sendfile from filesystems that use generic_file_sendfile()
They can use generic_file_splice_read() instead. Since sys_sendfile() now
prefers that, there should be no change in behaviour.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:13 +02:00
Christoph Hellwig 700716c846 [XFS] s/memclear_highpage_flush/zero_user_page/
SGI-PV: 957103
SGI-Modid: xfs-linux-melb:xfs-kern:28678a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-06-19 15:20:31 +10:00
David Chinner df3c724426 [XFS] Write at EOF may not update filesize correctly.
The recent fix for preventing NULL files from being left around does not
update the file size correctly in all cases. The missing case is a write
extending the file that does not need to allocate a block.

In that case we used a read mapping of the extent which forced the use of
the read I/O completion handler instead of the write I/O completion
handler. Hence the file size was not updated on I/O completion.

SGI-PV: 965068
SGI-Modid: xfs-linux-melb:xfs-kern:28657a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nscott@aconex.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-29 18:15:17 +10:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Linus Torvalds 60c9b2746f Merge git://oss.sgi.com:8090/xfs/xfs-2.6
* git://oss.sgi.com:8090/xfs/xfs-2.6:
  [XFS] Add lockdep support for XFS
  [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.
  [XFS] Get rid of redundant "required" in msg.
  [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.
  [XFS] Remove unused ilen variable and references.
  [XFS] Fix to prevent the notorious 'NULL files' problem after a crash.
  [XFS] Fix race condition in xfs_write().
  [XFS] Fix uquota and oquota enforcement problems.
  [XFS] propogate return codes from flush routines
  [XFS] Fix quotaon syscall failures for group enforcement requests.
  [XFS] Invalidate quotacheck when mounting without a quota type.
  [XFS] reducing the number of random number functions.
  [XFS] remove more misc. unused args
  [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it.
  [XFS] The last argument "lsn" of xfs_trans_commit() is always called with
2007-05-08 11:59:33 -07:00
Dmitriy Monakhov 0ceb331433 mm: move common segment checks to separate helper function
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Acked-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Lachlan McIlroy f7c66ce3f7 [XFS] Add lockdep support for XFS
SGI-PV: 963965
SGI-Modid: xfs-linux-melb:xfs-kern:28485a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08 13:50:19 +10:00
Lachlan McIlroy 71dfd5a396 [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.
In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA()
which means that the file could change from not-cached to cached and we
need to redo the direct I/O checks. We should also redo the direct I/O
checks when the file size changes regardless if O_APPEND is set or not.

SGI-PV: 963483
SGI-Modid: xfs-linux-melb:xfs-kern:28440a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08 13:50:12 +10:00
Tim Shimmin e6a0e9cdff [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.
SGI-PV: 963465
SGI-Modid: xfs-linux-melb:xfs-kern:28414a

Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2007-05-08 13:49:59 +10:00
Lachlan McIlroy ba87ea699e [XFS] Fix to prevent the notorious 'NULL files' problem after a crash.
The problem that has been addressed is that of synchronising updates of
the file size with writes that extend a file. Without the fix the update
of a file's size, as a result of a write beyond eof, is independent of
when the cached data is flushed to disk. Often the file size update would
be written to the filesystem log before the data is flushed to disk. When
a system crashes between these two events and the filesystem log is
replayed on mount the file's size will be set but since the contents never
made it to disk the file is full of holes. If some of the cached data was
flushed to disk then it may just be a section of the file at the end that
has holes.

There are existing fixes to help alleviate this problem, particularly in
the case where a file has been truncated, that force cached data to be
flushed to disk when the file is closed. If the system crashes while the
file(s) are still open then this flushing will never occur.

The fix that we have implemented is to introduce a second file size,
called the in-memory file size, that represents the current file size as
viewed by the user. The existing file size, called the on-disk file size,
is the one that gets written to the filesystem log and we only update it
when it is safe to do so. When we write to a file beyond eof we only
update the in-memory file size in the write operation. Later, when the I/O
operation that flushes the cached data to disk completes, an I/O
completion routine will update the on-disk file size. The on-disk file
size will be updated to the maximum offset of the I/O or to the value of
the in-memory file size if the I/O includes eof.
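
The two-size scheme can be modelled in userspace as below. The field and function names are hypothetical; the sketch only encodes the rule stated above, that the write path moves the in-memory size and I/O completion moves the on-disk size, clamped to the in-memory size when the I/O includes eof.

```c
#include <stdint.h>

struct model_inode {
    uint64_t size;       /* in-memory size, what the user sees */
    uint64_t disk_size;  /* on-disk size, the one that gets logged */
};

/* Write path: an extending write only moves the in-memory size. */
void model_write(struct model_inode *ip, uint64_t off, uint64_t len)
{
    if (off + len > ip->size)
        ip->size = off + len;
}

/* I/O completion: the data for [off, off + len) is now on disk. */
void model_io_done(struct model_inode *ip, uint64_t off, uint64_t len)
{
    uint64_t end = off + len;

    if (end > ip->size)       /* the I/O includes eof: clamp to it */
        end = ip->size;
    if (end > ip->disk_size)  /* the on-disk size only ever grows here */
        ip->disk_size = end;
}
```

A crash between `model_write()` and `model_io_done()` leaves `disk_size` at its old value, so log replay in this model cannot expose a size that covers data which never reached disk.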

SGI-PV: 958522
SGI-Modid: xfs-linux-melb:xfs-kern:28322a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08 13:49:46 +10:00
Lachlan McIlroy 2a32963130 [XFS] Fix race condition in xfs_write().
This change addresses a race in xfs_write() where, for direct I/O, the
flags need_i_mutex and need_flush are setup before the iolock is acquired.
The logic used to setup the flags may change between setting the flags and
acquiring the iolock resulting in these flags having incorrect values. For
example, if a file is not currently cached then need_i_mutex is set to
zero and then if the file is cached before the iolock is acquired we will
fail to do the flushinval before the direct write.

The flush (and also the call to xfs_zero_eof()) need to be done with the
iolock held exclusive so we need to acquire the iolock before checking for
cached data (or if the write begins after eof) to prevent this state from
changing. For direct I/O I've chosen to always acquire the iolock in
shared mode initially and if there is a need to promote it then drop it
and reacquire it.

There's also some other tidy-ups including removing the O_APPEND offset
adjustment since that work is done in generic_write_checks() (and we don't
use offset as an input parameter anywhere).

SGI-PV: 962170
SGI-Modid: xfs-linux-melb:xfs-kern:28319a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08 13:49:39 +10:00
Lachlan McIlroy d3cf209476 [XFS] propogate return codes from flush routines
This patch handles error return values in fs_flush_pages and
fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we
can propagate the errors and handle them at higher layers. I also modified
xfs_itruncate_start so that it could propagate the error further.
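
The pattern is the usual one of returning an int up the call chain instead of dropping the error. A minimal sketch, with hypothetical `model_` stand-ins for the three functions and a `fail` parameter to simulate an I/O error:

```c
#include <errno.h>

/* Stand-in for the low-level flush; fails when `fail` is non-zero. */
int model_flush_pages(int fail)
{
    return fail ? -EIO : 0;
}

/* Returns int rather than void so that the flush error is not lost. */
int model_flushinval_pages(int fail)
{
    int error = model_flush_pages(fail);

    if (error)
        return error;
    /* in this sketch, invalidation would only follow a successful flush */
    return 0;
}

/* The caller passes the error further up instead of ignoring it. */
int model_itruncate_start(int fail)
{
    return model_flushinval_pages(fail);
}
```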

SGI-PV: 961990
SGI-Modid: xfs-linux-melb:xfs-kern:28231a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Stewart Smith <stewart@flamingspork.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-05-08 13:49:27 +10:00
Christoph Lameter 50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code that
manipulates the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLAB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Rafael J. Wysocki b43376927a [PATCH] Make XFS workqueues nonfreezable
Since freezable workqueues are broken in 2.6.21-rc
(cf. http://marc.theaimsgroup.com/?l=linux-kernel&m=116855740612755,
http://marc.theaimsgroup.com/?l=linux-kernel&m=117261312523921&w=2)
it's better to change the only user of them, which is XFS, to use "normal"
nonfreezable workqueues.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
Andrew Morton 5085b607fb [PATCH] xfs warning fix
fs/xfs/linux-2.6/xfs_super.c:903: warning: 'noinline' attribute ignored

Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Eric W. Biederman 0b4d414714 [PATCH] sysctl: remove insert_at_head from register_sysctl
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name, which is
a pain for caching and which the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (except for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancements harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Tim Schmielau cd354f1ae7 [PATCH] remove many unneeded #includes of sched.h
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there.  Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm.  I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:54 -08:00
Arjan van de Ven c5ef1c42c5 [PATCH] mark struct inode_operations const 3
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.
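
What the const marking buys can be seen in a reduced userspace example. The struct here is a hypothetical one-method stand-in for struct inode_operations; the point is only that a const table cannot be assigned to and that inodes hold a pointer-to-const.

```c
struct model_inode;

struct model_inode_operations {
    int (*getattr)(const struct model_inode *inode);
};

struct model_inode {
    int mode;
    const struct model_inode_operations *i_op;  /* pointer to const table */
};

int model_getattr(const struct model_inode *inode)
{
    return inode->mode;
}

/* const: the table can be placed in read-only data by the toolchain. */
const struct model_inode_operations model_iops = {
    .getattr = model_getattr,
};

/*
 * An accidental write such as
 *     model_iops.getattr = 0;
 * is rejected at compile time because the object is const.
 */
```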

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
David Chinner 6ab8eb1cff [PATCH] Make XFS use BH_Unwritten and BH_Delay correctly
Don't hide buffer_unwritten behind buffer_delay() and remove the hack that
clears unexpected buffer_unwritten() states now that it can't happen.

Signed-off-by: Dave Chinner <dgc@sgi.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
David Chinner 33a266dda9 [PATCH] Make BH_Unwritten a first class bufferhead flag V2
Currently, XFS uses BH_PrivateStart for flagging unwritten extent state in a
bufferhead.  Recently, I found the long standing mmap/unwritten extent
conversion bug, and it was to do with partial page invalidation not clearing
the unwritten flag from bufferheads attached to the page but beyond EOF.  See
here for a full explanation:

http://oss.sgi.com/archives/xfs/2006-12/msg00196.html

The solution I have checked into the XFS dev tree involves duplicating code
from block_invalidatepage to clear the unwritten flag from the bufferhead(s),
and then calling block_invalidatepage() to do the rest.

Christoph suggested that this would be better solved by pushing the unwritten
flag into the common buffer head flags and just adding the call to
discard_buffer():

http://oss.sgi.com/archives/xfs/2006-12/msg00239.html

The following patch makes BH_Unwritten a first class citizen.

Signed-off-by: Dave Chinner <dgc@sgi.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
David Chinner e7ff6aed87 [XFS] Don't use kmap in xfs_iozero.
kmap() is inefficient and does not scale well. kmap_atomic() is a better
choice. Use the generic wrapper function instead of open coding the
kmap-memset-dcache flush-kunmap stuff.

SGI-PV: 960904
SGI-Modid: xfs-linux-melb:xfs-kern:28041a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:37:46 +11:00
Eric Sandeen 7bc5306d74 [XFS] Remove unused header files for MAC and CAP checking functionality.
xfs_mac.h and xfs_cap.h provide definitions and macros that aren't used
anywhere in XFS at all. They are left-overs from "to be implemented at some
point in the future" functionality that Irix XFS has. If this
functionality ever goes into Linux, it will be provided at a different
layer, most likely through the security hooks in the kernel so we will
never need this functionality in XFS.

Patch provided by Eric Sandeen (sandeen@sandeen.net).

SGI-PV: 960895
SGI-Modid: xfs-linux-melb:xfs-kern:28036a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:37:28 +11:00
David Chinner 3c0dc77b42 [XFS] Make freeze code a little cleaner.
Fixes a few small issues (mostly cosmetic) that were picked up during the
review cycle for the last set of freeze path changes.

SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28035a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:37:22 +11:00
Eric Sandeen 39058a0e12 [XFS] Clean up use of VFS attr flags
Use the generic VFS attr flags where appropriate instead of open
coding them to the same values.

Patch provided by Eric Sandeen.

SGI-PV: 960868
SGI-Modid: xfs-linux-melb:xfs-kern:28033a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:37:10 +11:00
Ralf Baechle 4cf3b52080 [XFS] Remove useless memory barrier
wake_up's implementation does an implicit memory barrier so the explicit
memory barrier is not needed in vfs_sync_worker.

Patch provided by Ralf Baechle.

SGI-PV: 960867
SGI-Modid: xfs-linux-melb:xfs-kern:28032a

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:37:04 +11:00
Eric W. Biederman 3a68cbfe02 [XFS] XFS sysctl cleanups
Removes unneeded sysctl insert at head behaviour. Cleans up sysctl
definitions to use C99 initialisers. Patch provided by Eric W. Biederman.

SGI-PV: 960192
SGI-Modid: xfs-linux-melb:xfs-kern:28031a

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:36:59 +11:00
Lachlan McIlroy 6816016137 [XFS] Fix callers of xfs_iozero() to zero the correct range.
The problem is the two callers of xfs_iozero() are rounding out the range
to be zeroed to the end of a fsb and in some cases this extends past the
new eof. The call to commit_write() in xfs_iozero() will cause the Linux
inode's file size to be set too high.
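
The arithmetic behind this fix can be sketched as follows. This is a minimal illustration, not the XFS code: the helper zero_len_clamped and its parameters are hypothetical, and it assumes byte offsets and a fixed filesystem block size. It rounds the zeroing range out to the end of the block containing offset, then clamps it so that it never extends past the new EOF.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes to zero starting at 'offset'.
 * The range ends at the end of the filesystem block containing 'offset',
 * or at 'new_eof', whichever comes first.
 */
static uint64_t zero_len_clamped(uint64_t offset, uint64_t block_size,
                                 uint64_t new_eof)
{
    assert(block_size > 0);

    if (offset >= new_eof)
        return 0;               /* nothing inside the file to zero */

    /* End of the block that contains 'offset'. */
    uint64_t block_end = (offset / block_size + 1) * block_size;

    /* Clamp so the zeroed range never passes the new end of file. */
    uint64_t end = block_end < new_eof ? block_end : new_eof;

    return end - offset;
}
```

With a 4096-byte block, an offset of 5000 and a new EOF of 6000, the unclamped range would run to byte 8192; the clamped length is 1000 bytes, so the range stops at the new EOF.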

SGI-PV: 960788
SGI-Modid: xfs-linux-melb:xfs-kern:28013a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:36:47 +11:00
David Chinner 2823945fda [XFS] Ensure a frozen filesystem has a clean log before writing the dummy
record.

The current Linux XFS freeze code is a mess. We flush the metadata buffers
out while we are still allowing new transactions to start and then fail to
flush the dirty buffers back out before writing the unmount and dummy
records to the log.

This leads to problems when the frozen filesystem is used for snapshots -
we do log recovery on a readonly image and often it appears that the log
image in the snapshot is not correct. Hence we end up with hangs, oops and
mount failures when trying to mount a snapshot image that has been created
when the filesystem has not been correctly frozen.

To fix this, we need to move the metadata flush to after we wait for all
current transactions to complete in the second stage of the freeze. This
means that when we write the final log records, the log should be clean
and recovery should never occur on a snapshot image created from a frozen
filesystem.

SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28010a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:36:40 +11:00
David Chinner 549054afad [XFS] Fix sub-block zeroing for buffered writes into unwritten extents.
When writing less than a filesystem block of data into an unwritten extent
via buffered I/O, __xfs_get_blocks fails to set the buffer new flag. As a
result, the generic code will not zero either edge of the block resulting
in garbage being written to disk either side of the real data. Set the
buffer new state on buffered writes to unwritten extents to ensure that
zeroing occurs.
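
What "zero either edge of the block" means can be shown with a small userspace sketch. It is not the generic kernel code: the helper zero_block_edges and its parameters are hypothetical, and it assumes the written data occupies bytes [from, to) of a single in-memory block. Everything outside that range is cleared so stale contents never reach disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: the caller is about to write bytes [from, to)
 * of a freshly allocated block. Zero the leading and trailing edges so
 * only the real data and zeroes end up in the block.
 */
static void zero_block_edges(unsigned char *block, size_t block_size,
                             size_t from, size_t to)
{
    assert(from <= to && to <= block_size);

    memset(block, 0, from);                  /* leading edge:  [0, from) */
    memset(block + to, 0, block_size - to);  /* trailing edge: [to, block_size) */
}
```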

SGI-PV: 960328
SGI-Modid: xfs-linux-melb:xfs-kern:28000a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:36:35 +11:00
Lachlan McIlroy 5180602e6f [XFS] remove unused filp from ioctl functions
SGI-PV: 959140
SGI-Modid: xfs-linux-melb:xfs-kern:27712a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:35:46 +11:00
Lachlan McIlroy a3227fb996 [XFS] mraccessf & mrupdatef are supposed to be the "flags" versions of the
functions, but they

a) ignore the flags parameter completely, and b) are never called
directly, only via the flag-less defines anyway

So, drop the #define indirection, and rename mraccessf to mraccess, etc.

SGI-PV: 959138
SGI-Modid: xfs-linux-melb:xfs-kern:27711a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:35:40 +11:00
Lachlan McIlroy e5eb7f202b [XFS] use struct kvec in struct uio
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:27701a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-02-10 18:35:21 +11:00