Commit Graph

23076 Commits

Author SHA1 Message Date
Wu Fengguang e185dda89d writeback: avoid extra sync work at enqueue time
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
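
A minimal sketch of the enqueue-time filter, following the shape of
move_expired_inodes() (illustrative, not the actual diff):

	/*
	 * b_dirty is kept in dirtied_when order, so the scan can stop at
	 * the first inode dirtied after the cutoff; everything newer
	 * simply stays on b_dirty and is never enqueued for this work.
	 */
	if (older_than_this && inode_dirtied_after(inode, *older_than_this))
		break;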

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:22 +08:00
Wu Fengguang e8dfc30582 writeback: elevate queue_io() into wb_writeback()
Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb() (see the sketch below)
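
Roughly the resulting shape - a simplified sketch, not the actual diff:

	void writeback_inodes_wb(struct bdi_writeback *wb,
				 struct writeback_control *wbc)
	{
		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, wbc->older_than_this);
		__writeback_inodes_wb(wb, wbc);
		spin_unlock(&wb->list_lock);
	}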

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Christoph Hellwig f758eeabeb writeback: split inode_wb_list_lock into bdi_writeback.list_lock
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.
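
The core of the change, sketched (the full patch also converts every
locking site):

	/* before: one global lock serializing all writeback lists */
	spinlock_t inode_wb_list_lock;

	/* after: a per-bdi_writeback lock, next to the lists it protects */
	struct bdi_writeback {
		...
		spinlock_t list_lock;	/* protects b_dirty/b_io/b_more_io */
		struct list_head b_dirty;
		struct list_head b_io;
		struct list_head b_more_io;
	};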

It reduces lock contention to about 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with 'new' the same as 'old'
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang 424b351fe1 writeback: refill b_io iff empty
There is no point in carrying different refill policies for for_kupdate
and the other types of work. Use a consistent "refill b_io iff empty"
policy, which guarantees fairness in an easy-to-understand way.

A b_io refill will set up a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.

This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once per round, so no inode can
be starved.

This change relies on wb_writeback() to keep retrying as long as some
progress was made cleaning pages and/or inodes. Without that ability, the
old logic for background works had to rely on aggressively queuing all
eligible inodes into b_io every time - which is not a guarantee anyway.
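
The policy boils down to one check at the top of the queueing path,
sketched:

	/* refill b_io iff empty: start a new work set only after the
	 * current one has been fully walked */
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc->older_than_this);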

The test script below completes slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analysis from Dave Chinner, in great detail:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang ba9aa8399f writeback: the kupdate expire timestamp should be a moving target
Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may come to contain
  only _freshly_ dirtied pages. Ignoring newly expired dirty inodes then
  risks delaying the expired dirty pages until they reach the end of the
  LRU lists, triggering the evil pageout(). Nevertheless this patch
  merely addresses part of the problem.
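
A sketch of the v2 shape, with the cutoff recomputed inside the
wb_writeback() loop on every iteration (simplified):

	if (work->for_kupdate) {
		/* refresh the expire cutoff each time b_io is refilled */
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		wbc.older_than_this = &oldest_jif;
	}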

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang e6fb6da2e1 writeback: try more writeback as long as something was written
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they populate possibly only a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they
just return, possibly with all queued inode pages written but
wbc.nr_to_write still > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty by the time the current set of b_io inodes is
completed. So it is necessary to try another round of writeback as long
as we made some progress in this round (see the sketch below). When
there are no more eligible inodes, no more inodes will be enqueued in
queue_io(), hence nothing could/will be synced and we may safely bail.
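
As a sketch, the retry condition in the wb_writeback() loop becomes
something like:

	wbc.nr_to_write = MAX_WRITEBACK_PAGES;
	writeback_inodes_wb(wb, &wbc);
	if (wbc.nr_to_write == MAX_WRITEBACK_PAGES)
		break;		/* wrote nothing: no more eligible inodes */
	/* made some progress: try another round */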

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and moved to b_io for
IO. When they finish successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will still be > 0. Then wb_writeback()
will quit the background work (w/o this patch) while it's still over the
background threshold. This will be a fairly normal/frequent case I guess.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Wu Fengguang cb9bd1159c writeback: introduce writeback_control.inodes_written
The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also mis-interpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes that
get cleaned from the VFS POV.  A non-zero value means there is some
progress on writeback, in which case more writeback can be tried.
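
A minimal sketch of the accounting, assuming the counter is bumped in
writeback_single_inode() once the inode comes out clean:

	if (!(inode->i_state & I_DIRTY))
		wbc->inodes_written++;	/* progress, even with 0 pages written */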

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Wu Fengguang 94c3dcbb0b writeback: update dirtied_when for synced inode to prevent livelock
Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

It can prevent both of the following livelock schemes:

- while true; do echo data >> f; done
- while true; do touch f;        done (in theory)

The exact livelock condition is, during sync(1):

(1) no new inodes are dirtied
(2) an inode is being actively dirtied

On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
When finished, it will be redirty_tail()ed because it's still dirty
and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
on condition (1). The sync work will then revisit it on the next
queue_io() and find it eligible again because its old ->dirtied_when
predates the sync work start time.
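
A sketch of the fix in writeback_single_inode(), simplified:

	/*
	 * The inode was tagged and synced in one shot; stamp it so the
	 * next queue_io() round no longer sees it as expired.
	 */
	if ((inode->i_state & I_DIRTY) &&
	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
		inode->dirtied_when = jiffies;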

We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
b9543dac5b ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
no longer be enough to stop sync livelock.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Wu Fengguang 6e6938b6d3 writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.
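
In write_cache_pages() terms, the livelock-proof tagged walk now covers
both stages; a simplified sketch:

	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag = PAGECACHE_TAG_TOWRITE;	/* tag once, walk the tag */
	else
		tag = PAGECACHE_TAG_DIRTY;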

Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daaea9,
that may be due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called not only by sync(); all its
callers are treated the same because they also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Linus Torvalds e6ece70732 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  btrfs: fix uninitialized variable warning
  btrfs: add helper for fs_info->closing
  Btrfs: add mount -o inode_cache
  btrfs: scrub: add explicit plugging
  btrfs: use btrfs_ino to access inode number
  Btrfs: don't save the inode cache if we are deleting this root
  btrfs: false BUG_ON when degraded
  Btrfs: don't save the inode cache in non-FS roots
  Btrfs: make sure we don't overflow the free space cache crc page
  Btrfs: fix uninit variable in the delayed inode code
  btrfs: scrub: don't reuse bios and pages
  Btrfs: leave spinning on lookup and map the leaf
  Btrfs: check for duplicate entries in the free space cache
  Btrfs: don't try to allocate from a block group that doesn't have enough space
  Btrfs: don't always do readahead
  Btrfs: try not to sleep as much when doing slow caching
  Btrfs: kill BTRFS_I(inode)->block_group
  Btrfs: don't look at the extent buffer level 3 times in a row
  Btrfs: map the node block when looking for readahead targets
  Btrfs: set range_start to the right start in count_range_bits
  ...
2011-06-05 06:17:23 +09:00
David Sterba aa0467d8d2 btrfs: fix uninitialized variable warning
With Linus' tree, today's linux-next build (powerpc ppc64_defconfig)
produced this warning:

fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function

Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").

This fixes a bug in btrfs_update_inode(): if the value returned from
btrfs_delayed_update_inode() is nonzero garbage, the inode stat data are
not updated and several call paths may hit a BUG_ON or fail with a
strange error code.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-06-04 08:11:38 -04:00
David Sterba 7841cb2898 btrfs: add helper for fs_info->closing
Wrap the checking of the filesystem 'closing' flag and fix a few missing
memory barriers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-06-04 08:11:22 -04:00
Chris Mason 4b9465cb9e Btrfs: add mount -o inode_cache
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:47 -04:00
Arne Jansen e7786c3ae5 btrfs: scrub: add explicit plugging
With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
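
A minimal sketch of per-chunk plugging with the block layer's explicit
plugging API:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* ... submit all scrub bios for this chunk ... */
	blk_finish_plug(&plug);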

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:46 -04:00
David Sterba a4689d2bd3 btrfs: use btrfs_ino to access inode number
Commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses the inode
number directly while it should use the btrfs_ino() helper, now that the
new inode number allocator is in place.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:46 -04:00
Josef Bacik d132a538d2 Btrfs: don't save the inode cache if we are deleting this root
With xfstest 254 I can panic the box every time with the inode number caching
stuff on.  This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free.  To stop this from
happening, just return 0 if refs is 0 (and we're not the tree root, since
the tree root always has refs of 0).  With this fix 254 no longer panics.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:45 -04:00
Arne Jansen 5f3f302a6f btrfs: false BUG_ON when degraded
In degraded mode, the struct btrfs_device of a missing dev doesn't have
device->name set. A kstrdup() of NULL correctly returns NULL. Don't
BUG() in this case.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:44 -04:00
liubo ca456ae280 Btrfs: don't save the inode cache in non-FS roots
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree.  It
prevents crashes during balancing operations.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:44 -04:00
Chris Mason 211f96c24f Btrfs: make sure we don't overflow the free space cache crc page
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page.  This adds a check to
enforce that restriction.
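
As a rough worked example (assuming 4 KiB pages and 32-bit crcs, one crc
per cached page): 4096 / 4 = 1024 crcs fit in the first page, which caps
the cache file at 1024 * 4 KiB = 4 MiB.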

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:43 -04:00
Chris Mason 17aca1c987 Btrfs: fix uninit variable in the delayed inode code
The nitems counter needs to start at zero.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:43 -04:00
Arne Jansen 1bc8779349 btrfs: scrub: don't reuse bios and pages
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:17 -04:00
Linus Torvalds 4f1ba49efa Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block:
  block: Use hlist_entry() for io_context.cic_list.first
  cfq-iosched: Remove bogus check in queue_fail path
  xen/blkback: potential null dereference in error handling
  xen/blkback: don't call vbd_size() if bd_disk is NULL
  block: blkdev_get() should access ->bd_disk only after success
  CFQ: Fix typo and remove unnecessary semicolon
  block: remove unwanted semicolons
  Revert "block: Remove extra discard_alignment from hd_struct."
  nbd: adjust 'max_part' according to part_shift
  nbd: limit module parameters to a sane value
  nbd: pass MSG_* flags to kernel_recvmsg()
  block: improve the bio_add_page() and bio_add_pc_page() descriptions
2011-06-04 08:11:26 +09:00
Linus Torvalds 3af91a1256 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix-up free space earlier
  UBIFS: initialize LPT earlier
  UBIFS: assert no fixup when writing a node
  UBIFS: fix clean znode counter corruption in error cases
  UBIFS: fix memory leak on error path
  UBIFS: fix shrinker object count reports
  UBIFS: fix recovery broken by the previous recovery fix
  UBIFS: amend ubifs_recover_leb interface
  UBIFS: introduce a "grouped" journal head flag
  UBIFS: suppress false error messages
2011-06-04 07:59:32 +09:00
Ben Gardiner 098011940a UBIFS: fix-up free space earlier
The free space fixup is currently initiated during mount, after the call to
ubifs_write_master(), which results in a write to PEBs; this has been observed
with the patch 'assert no fixup when writing a node' applied.

Move the free space fixup on mount to before the calls to
ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
assertions with the previously mentioned patch applied.

Artem: tweaked the patch a bit

Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-03 18:12:31 +03:00
Ben Gardiner 781c5717a9 UBIFS: initialize LPT earlier
The current 'mount_ubifs()' implementation does not initialize the LPT until
the master node is marked dirty. Move the LPT initialization to before marking
the master node dirty. This is a preparation for the next patch which will move
the free-space-fixup check to before marking the master node dirty, because we
have to fix-up the free space before doing any writes.

Artem: massaged the patch and commit message.

Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-03 18:12:31 +03:00
Ben Gardiner 4f1ab9b01d UBIFS: assert no fixup when writing a node
The current free space fixup can result in some writing to the UBI volume
when the space_fixup flag is set.

To catch instances where UBIFS is writing to the NAND while the space_fixup
flag is set, add an assert to ubifs_write_node().

Artem: tweaked the patch, added similar assertion to the write buffer
       write path.

Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-03 18:12:31 +03:00
Artem Bityutskiy 8370723770 UBIFS: fix clean znode counter corruption in error cases
UBIFS maintains per-filesystem and global clean znode counters
('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.

However, in case of failures during commit the counters were corrupted. E.g.,
if a failure happens in the middle of 'write_index()', then some nodes in the
commit list ('c->cnext') are marked as clean, and some are marked as dirty.
Then 'ubifs_destroy_tnc_subtree()' does not return the correct count of freed
znodes, and we end up with a non-zero 'c->clean_zn_cnt' when unmounting. This
means that if we have 2 file-systems and one of them fails and we unmount it,
'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-03 18:12:31 +03:00
Artem Bityutskiy 812eb25831 UBIFS: fix memory leak on error path
UBIFS leaks memory on the error path in 'ubifs_jnl_update()' in case of a
write failure, because it forgets to free the 'struct ubifs_dent_node *dent'
object.
Although the object is small, the alignment can make it large - e.g., 2KiB
if the min. I/O unit is 2KiB.
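
The shape of the fix, sketched (label name illustrative):

	if (err)
		goto out_free_dent;
	...

	out_free_dent:
		kfree(dent);	/* was missing: leaked on the write-failure path */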

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-06-03 18:12:31 +03:00
Artem Bityutskiy cf610bf419 UBIFS: fix shrinker object count reports
Sometimes the VM asks the shrinker to return the amount of objects it can
shrink, and we return ubifs_clean_zn_cnt in that case. However, it is
possible that this counter is negative for a short period of time, due to
the way the UBIFS TNC code updates it. And I can observe the following
warnings sometimes:

shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788

This patch makes sure UBIFS never returns a negative count of objects.
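
One way to guarantee a non-negative report, sketched (illustrative):

	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);

	/* the counter can be transiently negative; never report that */
	return clean_zn_cnt >= 0 ? clean_zn_cnt : 0;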

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-06-03 18:12:24 +03:00
Artem Bityutskiy da8b94ea61 UBIFS: fix recovery broken by the previous recovery fix
Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
(UBIFS: fix extremely rare mount failure) broke recovery. That commit makes
UBIFS drop the last min. I/O unit in all journal heads, but this is needed
only for the GC head, and it does not work for non-GC heads. For example,
suppose we have min. I/O units A and B, and A contains a valid node X, which
was fsynced, followed by a group of nodes Y which spans the rest of A and B.
In this case we'll drop not only Y, but also X, which is obviously incorrect.

This patch fixes the issue and additionally makes recovery drop the last
min. I/O unit only for the GC head, leaving things as they have been for
ages for the other heads - this is safer.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-01 12:29:06 +03:00
Artem Bityutskiy efcfde54ca UBIFS: amend ubifs_recover_leb interface
Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells
whether the nodes are grouped in the LEB to recover, pass the journal head
number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.

This patch is a preparation to a further fix where we'll need to know the
journal head number for other purposes.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-01 12:29:06 +03:00
Artem Bityutskiy 1a0b06997c UBIFS: introduce a "grouped" journal head flag
Journal heads differ in the way UBIFS writes nodes to them. All normal
journal heads receive grouped nodes, while the GC journal head receives
ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
describes this property.

This patch is a preparation to a further recovery fix.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-01 12:29:06 +03:00
Artem Bityutskiy ab75950b11 UBIFS: suppress false error messages
Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
it introduced a regression - now UBIFS prints scary error messages during
recovery on all corrupted nodes, even though the corruptions are expected
(due to a power cut). This patch fixes the issue.

Additionally, fix a typo in a comment introduced by the same commit.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-06-01 12:29:05 +03:00
Tejun Heo 4c49ff3fe1 block: blkdev_get() should access ->bd_disk only after success
d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if the open failed, which can lead to an oops.

Test the flag after checking that the open was successful, not before.
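
The reordering, as a simplified sketch of the control flow:

	res = __blkdev_get(bdev, mode, 0);
	if (res)
		return res;		/* bdev->bd_disk may be NULL here */

	/* only now is it safe to look at bd_disk */
	if (!(bdev->bd_disk->flags & GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE)) {
		/* ... block events for the exclusive writer ... */
	}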

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: David Miller <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-01 08:28:47 +02:00
Al Viro c7427d23f7 autofs4: bogus dentry_unhash() added in ->unlink()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-30 01:50:53 -04:00
Sage Weil 3cebde2413 vfs: shrink_dcache_parent before rmdir, dir rename
The dentry_unhash push-down series missed that shrink_dcache_parent needs
to be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and
allow efficient dentry reclaim.
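
Roughly, in vfs_rmdir() and the directory-rename path:

	shrink_dcache_parent(dentry);	/* drop unused children, clearing
					   DCACHE_REFERENCED for reclaim */
	dentry_unhash(dentry);
	error = dir->i_op->rmdir(dir, dentry);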

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-30 01:48:27 -04:00
Jens Axboe a1706ac4c0 Revert "block: Remove extra discard_alignment from hd_struct."
It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We first ran
into a NULL pointer deref, and after fixing that we sometimes see
invalid disk->queue pointer values.

Since discard is the only one of the bunch actually looking into
the queue, just revert the change.

This reverts commit 23ceb5b771.

Conflicts:
	fs/partitions/check.c
2011-05-30 07:42:51 +02:00
Linus Torvalds bd1bfe40ac Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  eCryptfs: Remove ecryptfs_header_cache_2
  eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose()
  eCryptfs: Return useful code from contains_ecryptfs_marker
  eCryptfs: Fix new inode race condition
  eCryptfs: Cleanup inode initialization code
  eCryptfs: Consolidate inode functions into inode.c
2011-05-29 14:13:25 -07:00
Linus Torvalds cd1acdf172 Merge branch 'pnfs-submit' of git://git.open-osd.org/linux-open-osd
* 'pnfs-submit' of git://git.open-osd.org/linux-open-osd: (32 commits)
  pnfs-obj: pg_test check for max_io_size
  NFSv4.1: define nfs_generic_pg_test
  NFSv4.1: use pnfs_generic_pg_test directly by layout driver
  NFSv4.1: change pg_test return type to bool
  NFSv4.1: unify pnfs_pageio_init functions
  pnfs-obj: objlayout_encode_layoutcommit implementation
  pnfs: encode_layoutcommit
  pnfs-obj: report errors and .encode_layoutreturn Implementation.
  pnfs: encode_layoutreturn
  pnfs: layoutret_on_setattr
  pnfs: layoutreturn
  pnfs-obj: osd raid engine read/write implementation
  pnfs: support for non-rpc layout drivers
  pnfs-obj: define per-inode private structure
  pnfs: alloc and free layout_hdr layoutdriver methods
  pnfs-obj: objio_osd device information retrieval and caching
  pnfs-obj: decode layout, alloc/free lseg
  pnfs-obj: pnfs_osd XDR client implementation
  pnfs-obj: pnfs_osd XDR definitions
  pnfs-obj: objlayoutdriver module skeleton
  ...
2011-05-29 14:10:13 -07:00
Tyler Hicks 3063287053 eCryptfs: Remove ecryptfs_header_cache_2
Now that ecryptfs_lookup_interpose() is no longer using
ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be
removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to
ecryptfs_header_cache.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-05-29 14:24:25 -05:00
Tyler Hicks 778aeb42a7 eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose()
ecryptfs_lookup_interpose() has turned into spaghetti code over the
years. This is an effort to clean it up.

 - Shorten overly descriptive variable names such as ecryptfs_dentry
 - Simplify gotos and error paths
 - Create helper function for reading plaintext i_size from metadata

It also includes an optimization when reading i_size from the metadata.
A complete page-sized kmem_cache_alloc() was being done to read in 16
bytes of metadata. The buffer for that is now statically declared.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-05-29 14:24:24 -05:00
Tyler Hicks 7a86617e55 eCryptfs: Return useful code from contains_ecryptfs_marker
Instead of having the calling functions translate the true/false return
code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or
-EINVAL so that the calling functions can just reuse the return code.

Also, rename the function to ecryptfs_validate_marker() to avoid callers
mistakenly thinking that it returns true/false codes.
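
A sketch of the resulting shape (simplified):

	static int ecryptfs_validate_marker(char *data)
	{
		u32 m_1 = get_unaligned_be32(data);
		u32 m_2 = get_unaligned_be32(data + 4);

		/* the second word must be the first xor'ed with the magic */
		if ((m_1 ^ MAGIC_ECRYPTFS_MARKER) == m_2)
			return 0;
		return -EINVAL;
	}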

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-05-29 14:24:24 -05:00
Tyler Hicks 3b06b3ebf4 eCryptfs: Fix new inode race condition
Only unlock and d_add() new inodes after the plaintext inode size has
been read from the lower filesystem. This fixes a race condition that
was sometimes seen during a multi-job kernel build in an eCryptfs mount.

https://bugzilla.kernel.org/show_bug.cgi?id=36002

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Reported-by: David <david@unsolicited.net>
Tested-by: David <david@unsolicited.net>
2011-05-29 14:23:39 -05:00
Linus Torvalds 57ed609d4b Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile: more /proc and /sys file support
2011-05-29 11:29:28 -07:00
Linus Torvalds a74d70b63f Merge branch 'for-2.6.40' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
  nfsd: make local functions static
  NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
  NFSD: Check status from nfsd4_map_bcts_dir()
  NFSD: Remove setting unused variable in nfsd_vfs_read()
  nfsd41: error out on repeated RECLAIM_COMPLETE
  nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
  nfsd v4.1 lOCKT clientid field must be ignored
  nfsd41: add flag checking for create_session
  nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
  nfsd4: fix wrongsec handling for PUTFH + op cases
  nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
  nfsd4: introduce OPDESC helper
  nfsd4: allow fh_verify caller to skip pseudoflavor checks
  nfsd: distinguish functions of NFSD_MAY_* flags
  svcrpc: complete svsk processing on cb receive failure
  svcrpc: take advantage of tcp autotuning
  SUNRPC: Don't wait for full record to receive tcp data
  svcrpc: copy cb reply instead of pages
  svcrpc: close connection if client sends short packet
  svcrpc: note network-order types in svc_process_calldir
  ...
2011-05-29 11:21:12 -07:00
Linus Torvalds f1d1c9fa8f Merge branch 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: Support for RPC over AF_LOCAL transports
  SUNRPC: Remove obsolete comment
  SUNRPC: Use AF_LOCAL for rpcbind upcalls
  SUNRPC: Clean up use of curly braces in switch cases
  NFS: Revert NFSROOT default mount options
  SUNRPC: Rename xs_encode_tcp_fragment_header()
  nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
  nfs41: Correct offset for LAYOUTCOMMIT
  NFS: nfs_update_inode: print current and new inode size in debug output
  NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
  NFSv4: Handle expired stateids when the lease is still valid
  SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
2011-05-29 11:20:02 -07:00
Linus Torvalds 2ff55e98d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: Fix sanity check patches on big-endian systems
2011-05-29 11:19:45 -07:00
Al Viro ef1d57599d cifs/ubifs: Fix shrinker API change fallout
Commit 1495f230fa ("vmscan: change shrinker API by passing
shrink_control struct") changed the API of ->shrink(), but missed ubifs
and cifs instances.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-29 11:17:34 -07:00
Boaz Harrosh 9342077011 pnfs-obj: pg_test check for max_io_size
Implement the pg_test vector to test for max IO sizes. We calculate a
max_io_size member only once, and cache it in the lseg so as not to
recompute it on every page insert.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[simplify logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 21:03:08 +03:00
Boaz Harrosh 5b36c7dc41 NFSv4.1: define nfs_generic_pg_test
By default, unless pnfs is used, coalesce pages until pg_bsize
(rsize or wsize) is reached.

pnfs layout drivers define their own pg_test methods that use
pnfs_generic_pg_test and need to define their own I/O size
limits (e.g. based on the file stripe size).
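
A simplified sketch of the default test:

	static bool nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
					struct nfs_page *prev,
					struct nfs_page *req)
	{
		/* coalesce only while the next request still fits in pg_bsize */
		return desc->pg_count + req->wb_bytes <= desc->pg_bsize;
	}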

[Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 21:02:42 +03:00