Commit Graph

15590 Commits

Author SHA1 Message Date
Linus Torvalds 0efe5e32c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix data space leak fix
  Btrfs: remove duplicates of filemap_ helpers
  Btrfs: take i_mutex before generic_write_checks
  Btrfs: fix arguments to btrfs_wait_on_page_writeback_range
  Btrfs: fix deadlock with free space handling and user transactions
  Btrfs: fix error cases for ioctl transactions
  Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code
  Btrfs: introduce missing kfree
  Btrfs: Fix setting umask when POSIX ACLs are not enabled
  Btrfs: proper -ENOSPC handling
2009-10-01 20:23:15 -07:00
Christoph Hellwig 80e50be422 afs: remove cache.h
It's just a wrapper for <linux/fscache.h>, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01 16:11:16 -07:00
Alexey Dobriyan 828c09509b const: constify remaining file_operations
[akpm@linux-foundation.org: fix KVM]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-01 16:11:11 -07:00
Chris Mason 9c2693c924 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus 2009-10-01 17:24:44 -04:00
Josef Bacik fbf1908744 Btrfs: fix data space leak fix
There is a problem where page_mkwrite can be called on a dirtied page that
already has a delalloc range associated with it.  The fix is to clear any
delalloc bits for the range we are dirtying so the space accounting gets
handled properly.  This is the same thing we do in the normal write case, so we
are consistent across the board.  With this patch we no longer leak reserved
space.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01 17:10:23 -04:00
Christoph Hellwig 8aa38c31b7 Btrfs: remove duplicates of filemap_ helpers
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
local copies of the functions.  For filemap_fdatawait_range that
also means replacing the awkward old wait_on_page_writeback_range
calling convention with the regular filemap byte offsets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01 12:58:30 -04:00
Chris Mason 25472b880c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus 2009-10-01 12:58:13 -04:00
Chris Mason ab93dbecfb Btrfs: take i_mutex before generic_write_checks
btrfs_file_write was incorrectly calling generic_write_checks without
taking i_mutex.  This lead to problems with racing around i_size when
doing O_APPEND writes.

The fix here is to move i_mutex higher.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01 12:29:10 -04:00
Christoph Hellwig 35d62a942d Btrfs: fix arguments to btrfs_wait_on_page_writeback_range
wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes
a pagecache offset, not a byte offset into the file.  Shift the arguments
around to wait for the correct range

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01 10:27:01 -04:00
Linus Torvalds 9abf47f11b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix missing initialization of i_dir_start_lookup member
  nilfs2: fix missing zero-fill initialization of btree node cache
2009-09-30 09:42:24 -07:00
Linus Torvalds 9f44fdc518 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix time encoding with extra epoch bits
  ext4: Add a stub for mpage_da_data in the trace header
  jbd2: Use tracepoints for history file
  ext4: Use tracepoints for mb_history trace file
  ext4, jbd2: Drop unneeded printks at mount and unmount time
  ext4: Handle nested ext4_journal_start/stop calls without a journal
  ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
  ext4: Avoid updating the inode table bh twice in no journal mode
  ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
  ext4: async direct IO for holes and fallocate support
  ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
  ext4: Split uninitialized extents for direct I/O
  ext4: release reserved quota when block reservation for delalloc retry
  ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
  ext4: Fix hueristic which avoids group preallocation for closed files
  ext4: Use ext4_msg() for ext4_da_writepage() errors
  ext4: Update documentation about quota mount options
2009-09-30 09:32:30 -07:00
Linus Torvalds 4c8f1cb266 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: Check s_dirt in fat_sync_fs()
  vfat: change the default from shortname=lower to shortname=mixed
  fat/nls: Fix handling of utf8 invalid char
2009-09-30 09:31:14 -07:00
Theodore Ts'o c1fccc0696 ext4: Fix time encoding with extra epoch bits
"Looking at ext4.h, I think the setting of extra time fields forgets to
mask the epoch bits so the epoch part overwrites nsec part. The second
change is only for coherency (2 -> EXT4_EPOCH_BITS)."

Thanks to Damien Guibouret for pointing out this problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 01:13:55 -04:00
Theodore Ts'o bf6993276f jbd2: Use tracepoints for history file
The /proc/fs/jbd2/<dev>/history was maintained manually; by using
tracepoints, we can get all of the existing functionality of the /proc
file plus extra capabilities thanks to the ftrace infrastructure.  We
save memory as a bonus.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:06 -04:00
Theodore Ts'o 296c355cd6 ext4: Use tracepoints for mb_history trace file
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.  

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:42 -04:00
Sage Weil dd7e0b7b02 Btrfs: fix deadlock with free space handling and user transactions
If an ioctl-initiated transaction is open, we can't force a commit during
the free space checks in order to free up pinned extents or else we
deadlock.  Just ENOSPC instead.

A more satisfying solution that reserves space for the entire user
transaction up front is forthcoming...

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 19:50:07 -04:00
Sage Weil 1ab86aedbc Btrfs: fix error cases for ioctl transactions
Fix leak of vfsmount write reference and open_ioctl_trans reference on
ENOMEM.  Clean up the error paths while we're at it.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 18:38:44 -04:00
Theodore Ts'o 90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
Chris Ball 3baf0bed0a Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code
We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're
currently not using it and are testing CONFIG_FS_POSIX_ACL instead.
CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs".

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 13:51:05 -04:00
Julia Lawall fd2696f399 Btrfs: introduce missing kfree
Error handling code following a kzalloc should free the allocated data.

The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,f1,l;
position p1,p2;
expression *ptr != NULL;
@@

x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
<... when != x
     when != if (...) { <+...x...+> }
(
x->f1 = E
|
 (x->f1 == NULL || ...)
|
 f(...,x->f1,...)
)
...>
(
 return \(0\|<+...x...+>\|ptr\);
|
 return@p2 ...;
)

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 13:51:04 -04:00
Chris Ball 49cf6f4529 Btrfs: Fix setting umask when POSIX ACLs are not enabled
We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is
incorrect -- it tells the VFS that it shouldn't set umask because we
will, yet we don't set it ourselves if we aren't using POSIX ACLs, so
the umask ends up ignored.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 13:51:04 -04:00
Curt Wohlgemuth d3d1faf6a7 ext4: Handle nested ext4_journal_start/stop calls without a journal
This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 11:01:03 -04:00
Curt Wohlgemuth f3dc272fd5 ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
This patch a problem that ext4_dirty_inode() was not calling
ext4_mark_inode_dirty() if the current_handle is not valid, which it
is the case in no journal mode.

It also removes a test for non-matching transaction which can never
happen.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 16:06:01 -04:00
Frank Mayhar 830156c79b ext4: Avoid updating the inode table bh twice in no journal mode
This is a cleanup of commit 91ac6f4.  Since ext4_mark_inode_dirty()
has already called ext4_mark_iloc_dirty(), which in turn calls
ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
call ext4_do_update_inode() in no journal mode.  Indeed, it would be
duplicated work.

Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 10:07:47 -04:00
Ryusuke Konishi 3cc811bffd nilfs2: fix missing initialization of i_dir_start_lookup member
The i_dir_start_lookup field in nilfs_inode_info objects should be
cleared when the objects are allocated, but the the initialization was
missing in case of reading from disk.  This adds the initialization.

Since the variable just gives a start page on directory lookups, the
bug was nonfatal until now.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-09-29 20:32:13 +09:00
Ryusuke Konishi 1f28fcd925 nilfs2: fix missing zero-fill initialization of btree node cache
This will fix file system corruption which infrequently happens after
mount.  The problem was reported from users with the title "[NILFS
users] Fail to mount NILFS." (Message-ID:
<200908211918.34720.yuri@itinteg.net>), and so forth.  I've also
experienced the corruption multiple times on kernel 2.6.30 and 2.6.31.

The problem turned out to be caused due to discordance between
mapping->nrpages of a btree node cache and the actual number of pages
hung on the cache; if the mapping->nrpages becomes zero even as it has
pages, truncate_inode_pages() returns without doing anything.  Usually
this is harmless except it may cause page leak, but garbage collection
fairly infrequently sees a stale page remained in the btree node cache
of DAT (i.e. disk address translation file of nilfs), and induces the
corruption.

I identified a missing initialization in btree node caches was the
root cause.  This corrects the bug.

I've tested this for kernel 2.6.30 and 2.6.31.

Reported-by: Yuri Chislov <yuri@itinteg.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable <stable@kernel.org>
2009-09-29 20:12:56 +09:00
Josef Bacik 9ed74f2dba Btrfs: proper -ENOSPC handling
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying.  Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents.  This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code.  This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic.  This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation.  It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook.  These help
us keep track of when we merge delalloc extents together and split them
up.  Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have.  Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28 16:29:42 -04:00
Theodore Ts'o f3ce8064b3 ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
Move the check to make sure the original and donor inodes are
different earlier, to avoid a potential deadlock by trying to lock the
same inode twice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:58:29 -04:00
Mingming Cao 8d5d02e6b1 ext4: async direct IO for holes and fallocate support
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.

But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.

Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue.  When fsync() is called, it will go
through the list and do the conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-28 15:48:29 -04:00
Mingming Cao 4c0425ff68 ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
Currently the DIO VFS code passes create = 0 when writing to the
middle of file.  It does this to avoid block allocation for holes, so
as not to expose stale data out when there is a parallel buffered read
(which does not hold the i_mutex lock).  Direct I/O writes into holes
falls back to buffered IO for this reason.

Since preallocated extents are treated as holes when doing a
get_block() look up (buffer is not mapped), direct IO over fallocate
also falls back to buffered IO.  Thus ext4 actually silently falls
back to buffered IO in above two cases, which is undesirable.

To fix this, this patch creates unitialized extents when a direct I/O
write into holes in sparse files, and registering an end_io callback which
converts the uninitialized extent to an initialized extent after the
I/O is completed.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:48:41 -04:00
Mingming Cao 0031462b5b ext4: Split uninitialized extents for direct I/O
When writing into an unitialized extent via direct I/O, and the direct
I/O doesn't exactly cover the unitialized extent, split the extent
into uninitialized and initialized extents before submitting the I/O.
This avoids needing to deal with an ENOSPC error in the end_io
callback that gets used for direct I/O.

When the IO is complete, the written extent will be marked as initialized.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:49:08 -04:00
Mingming Cao 9f0ccfd8e0 ext4: release reserved quota when block reservation for delalloc retry
ext4_da_reserve_space() can reserve quota blocks multiple times if
ext4_claim_free_blocks() fail and we retry the allocation. We should
release the quota reservation before restarting.

Bug found by Jan Kara.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:49:52 -04:00
Theodore Ts'o 55138e0bc2 ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small.  This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time.  So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode.  We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 13:31:31 -04:00
Theodore Ts'o 7178057730 ext4: Fix hueristic which avoids group preallocation for closed files
The hueristic was designed to avoid using locality group preallocation
when writing the last segment of a closed file.  Fix it by move
setting size to the maximum of size and isize until after we check
whether size == isize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 00:06:20 -04:00
Alexey Dobriyan f0f37e2f77 const: mark struct vm_struct_operations
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code

But leave TTM code alone, something is fishy there with global vm_ops
being used.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-27 11:39:25 -07:00
Theodore Ts'o 1693918e0b ext4: Use ext4_msg() for ext4_da_writepage() errors
This allows the user to see what filesystem was involved with a
particular ext4_da_writepage() error.  Also, use KERN_CRIT which is
more appropriate than KERN_EMERG.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-26 17:43:59 -04:00
Linus Torvalds bfebb14063 Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
  writeback: pass in super_block to bdi_start_writeback()
2009-09-26 10:11:13 -07:00
Linus Torvalds 07e2e6ba27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix locking and list handling code in cifs_open and its helper
  [CIFS] Remove build warning
  cifs: fix problems with last two commits
  [CIFS] Fix build break when keys support turned off
  cifs: eliminate cifs_init_private
  cifs: convert oplock breaks to use slow_work facility (try #4)
  cifs: have cifsFileInfo hold an extra inode reference
  cifs: take read lock on GlobalSMBSes_lock in is_valid_oplock_break
  cifs: remove cifsInodeInfo.oplockPending flag
  cifs: fix oplock request handling in posix codepath
  [CIFS] Re-enable Lanman security
2009-09-26 10:10:35 -07:00
Jens Axboe a72bfd4dea writeback: pass in super_block to bdi_start_writeback()
Sometimes we only want to write pages from a specific super_block,
so allow that to be passed in.

This fixes a problem with commit 56a131dcf7
causing writeback on all super_blocks on a bdi, where we only really
want to sync a specific sb from writeback_inodes_sb().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-26 00:10:40 +02:00
Jeff Layton 3321b791b2 cifs: fix locking and list handling code in cifs_open and its helper
The patch to remove cifs_init_private introduced a locking imbalance. It
didn't remove the leftover list addition code and the unlocking in that
function. cifs_new_fileinfo does the list addition now, so there should
be no need to do it outside of that function.

pCifsInode will never be NULL, so we don't need to check for that. This
patch also gets rid of the ugly locking and unlocking across function
calls.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-09-25 17:59:31 +00:00
Linus Torvalds 6d7f18f6ea Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
  writeback: writeback_inodes_sb() should use bdi_start_writeback()
  writeback: don't delay inodes redirtied by a fast dirtier
  writeback: make the super_block pinning more efficient
  writeback: don't resort for a single super_block in move_expired_inodes()
  writeback: move inodes from one super_block together
  writeback: get rid to incorrect references to pdflush in comments
  writeback: improve readability of the wb_writeback() continue/break logic
  writeback: cleanup writeback_single_inode()
  writeback: kupdate writeback shall not stop when more io is possible
  writeback: stop background writeback when below background threshold
  writeback: balance_dirty_pages() shall write more than dirtied pages
  fs: Fix busyloop in wb_writeback()
2009-09-25 09:27:30 -07:00
Jens Axboe 56a131dcf7 writeback: writeback_inodes_sb() should use bdi_start_writeback()
Pointless to iterate other devices looking for a super, when
we have a bdi mapping.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:26 +02:00
Wu Fengguang b3af9468ae writeback: don't delay inodes redirtied by a fast dirtier
Debug traces show that in per-bdi writeback, the inode under writeback
almost always get redirtied by a busy dirtier.  We used to call
redirty_tail() in this case, which could delay inode for up to 30s.

This is unacceptable because it now happens so frequently for plain cp/dd,
that the accumulated delays could make writeback of big files very slow.

So let's distinguish between data redirty and metadata only redirty.
The first one is caused by a busy dirtier, while the latter one could
happen in XFS, NFS, etc. when they are doing delalloc or updating isize.

The inode being busy dirtied will now be requeued for next io, while
the inode being redirtied by fs will continue to be delayed to avoid
repeated IO.

CC: Jan Kara <jack@suse.cz>
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:26 +02:00
Jens Axboe 9ecc2738ac writeback: make the super_block pinning more efficient
Currently we pin the inode->i_sb for every single inode. This
increases cache traffic on sb->s_umount sem. Lets instead
cache the inode sb pin state and keep the super_block pinned
for as long as keep writing out inodes from the same
super_block.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:26 +02:00
Jens Axboe cf137307cd writeback: don't resort for a single super_block in move_expired_inodes()
If we only moved inodes from a single super_block to the temporary
list, there's no point in doing a resort for multiple super_blocks.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:26 +02:00
Shaohua Li 5c03449d34 writeback: move inodes from one super_block together
__mark_inode_dirty adds inode to wb dirty list in random order. If a disk has
several partitions, writeback might keep spindle moving between partitions.
To reduce the move, better write big chunk of one partition and then move to
another. Inodes from one fs usually are in one partion, so idealy move indoes
from one fs together should reduce spindle move. This patch tries to address
this. Before per-bdi writeback is added, the behavior is write indoes
from one fs first and then another, so the patch restores previous behavior.
The loop in the patch is a bit ugly, should we add a dirty list for each
superblock in bdi_writeback?

Test in a two partition disk with attached fio script shows about 3% ~ 6%
improvement.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00
Jens Axboe 5b0830cb90 writeback: get rid to incorrect references to pdflush in comments
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00
Jens Axboe 71fd05a887 writeback: improve readability of the wb_writeback() continue/break logic
And throw some comments in there, too.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00
Wu Fengguang ae1b7f7d4b writeback: cleanup writeback_single_inode()
Make the if-else straight in writeback_single_inode().
No behavior change.

Cc: Jan Kara <jack@suse.cz>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00
Wu Fengguang 7fbdea3232 writeback: kupdate writeback shall not stop when more io is possible
Fix the kupdate case, which disregards wbc.more_io and stop writeback
prematurely even when there are more inodes to be synced.

wbc.more_io should always be respected.

Also remove the pages_skipped check. It will set when some page(s) of some
inode(s) cannot be written for now. Such inodes will be delayed for a while.
This variable has nothing to do with whether there are other writeable inodes.

CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00