Rework the transaction lookup and allocation code in
xlog_recovery_process_ophdr() to fold two related call-once
helper functions into a single helper. Then fold in all the
XLOG_START_TRANS logic to that helper to clean up the remaining
logic in xlog_recovery_process_ophdr().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The code for managing transactions anf the items for recovery is
spread across 3 different locations in the file. Move them all
together so that it is easy to read the code without needing to jump
long distances in the file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When an error occurs during buffer submission in
xlog_recover_commit_trans(), we free the trans structure twice. Fix
it by only freeing the structure in the caller regardless of the
success or failure of the function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The XLOG_UNMOUNT_TRANS case skips the transaction, despite the fact
an unmount record is always in a standalone transaction. Hence
whenever we come across one of these we need to free the transaction
structure associated with it as there is no commit record that
follows it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Clean up xlog_recover_process_data() structure in preparation for
fixing the allocation and freeing context of the transaction being
recovered.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
On a sub-page sized filesystem, truncating a mapped region down
leaves us in a world of hurt. We truncate the pagecache, zeroing the
newly unused tail, then punch blocks out from under the page. If we
then truncate the file back up immediately, we expose that unmapped
hole to a dirty page mapped into the user application, and that's
where it all goes wrong.
In truncating the page cache, we avoid unmapping the tail page of
the cache because it still contains valid data. The problem is that
it also contains a hole after the truncate, but nobody told the mm
subsystem that. Therefore, if the page is dirty before the truncate,
we'll never get a .page_mkwrite callout after we extend the file and
the application writes data into the hole on the page. Hence when
we come to writing that region of the page, it has no blocks and no
delayed allocation reservation and hence we toss the data away.
This patch adds code to the truncate up case to solve it, by
ensuring the partial page at the old EOF is always cleaned after we
do any zeroing and move the EOF upwards. We can't actually serialise
the page writeback and truncate against page faults (yes, that
problem AGAIN) so this is really just a best effort and assumes it
is extremely unlikely that someone is concurrently writing to the
page at the EOF while extending the file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Fix sparse warning introduced by commit 4ef897a ("xfs: flush both
inodes in xfs_swap_extents").
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_quota.h was included twice.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs
if it's invalid:
ASSERT(type < XFS_DIR3_FT_MAX);
We shouldn't ASSERT on bad values read from disk. V3 dirs are
CRC-protected, but V2 dirs + ftype are not.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When running a tight mount/unmount loop on an older kernel, RedHat
QE found that unmount would occasionally hang in
xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
debug work by Eric Sandeen indicated that it was hanging on the
writing of the superblock during unmount immediately after logging
the superblock counters in a synchronous transaction. Further debug
indicated that the synchronous transaction was not waiting for
completion correctly, and we narrowed it down to
xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
the transaction in the iclog buffer to disk correctly.
While this unmount superblock write code is now very different in
mainline kernels, the xlog_cil_force_lsn() code is identical, and it
was bisected to the backport of commit f876e44 ("xfs: always do log
forces via the workqueue"). This commit made the CIL push
asynchronous for log forces and hence exposed a race condition that
couldn't occur on a synchronous push.
Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
that the sequence push would be complete by the time
xlog_cil_push_now() returned, resulting in the context being pushed
being in the committing list. When it was made asynchronous, it was
recognised that there was a race condition in detecting whether an
asynchronous push has started or not and code was added to handle
it.
Unfortunately, the fix was not quite right and left a race condition
where it it would detect an empty CIL while a push was in progress
before the context had been added to the committing list. This was
incorrectly seen as a "nothing to do" condition and so would tell
xfs_log_force_lsn() that there is nothing to wait for, and hence it
would push the iclogbufs in memory.
The fix is simple, but explaining the logic and the race condition
is a lot more complex. The fix is to add the context to the
committing list before we start emptying the CIL. This allows us to
detect the difference between an empty "do nothing" push and a push
that has not started by adding a discrete "emptying the CIL" state
to avoid the transient, incorrect "empty" condition that the
(unchanged) waiting code was seeing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes and truncates the page cache from
the start offset of the free to EOF.
Modify xfs_free_file_space() to write back and truncate page cache of
just the range being freed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based it can sustain changes in the
extent list during the shift sequence without disruption.
Modify xfs_collapse_file_space() to writeback and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space free
need only affect the region of the file that is freed and this could
change in the future.
Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as a best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyways, we just run it first to ensure the file remains in a
consistent state throughout the collapse.
Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents and the caller is expected to prevent this scenario in advance
as is done by collapse.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates index decrement on merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.
This is a code refactor only. The behavior of extent shift and collapse
range is not modified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.
Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking is adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.
This is a code refactor only. The behavior of extent shift and collapse
range is not modified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail mid
operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.
To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of ths function can remain based on
extent index as ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.
The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.
Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.
This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.
cc: stable@vger.kernel.org # dependent on 1c8349a171
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
rbpp is always passed into xfs_rtmodify_summary
and xfs_rtget_summary, so there is no need to
test for it in xfs_rtmodify_summary_int.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_rtmodify_summary and xfs_rtget_summary are almost identical;
fold them into xfs_rtmodify_summary_int(), with wrappers for each of
the original calls.
The _int function modifies if a delta is passed, and returns a
summary pointer if *sum is passed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_dir_canenter and xfs_dir_createname are
almost identical.
Fold the former into the latter, with a helpful
wrapper for the former. If createname is called without
an inode number, it now only checks for space, and does
not actually add the entry.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the resblks test out of the xfs_dir_canenter,
and into the caller.
This makes a little more sense on the face of it;
xfs_dir_canenter immediately returns if resblks !=0;
and given some of the comments preceding the calls:
* Check for ability to enter directory entry, if no space reserved.
even more so.
It also facilitates the next patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In xlog_do_recovery_pass(), there are 2 distinct cases:
non-wrapped and wrapped log recovery.
If we find a wrapped log, we recover around the end
of the log, and then handle the rest of recovery
exactly as in the non-wrapped case - using exactly the same
(duplicated) code.
Rather than having the same code in both cases, we can
get the wrapped portion out of the way first if needed,
and then recover the non-wrapped portion of the log.
There should be no functional change here, just code
reorganization & deduplication.
The patch looks a bit bigger than it really is; the last
hunk is whitespace changes (un-indenting).
Tested with xfstests "check -g log" on a stock configuration.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
For some reason, the older commit:
965c8e5 lseek: the "whence" argument is called "whence"
lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.
Fix most of the sites.
left out xfs. So fix xfs.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_seek_hole & xfs_seek_data are remarkably similar;
so much so that they can be combined, saving a fair
bit of semi-complex code duplication.
The following patch passes generic/285 and generic/286,
which specifically test seek behavior.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
XFS log recovery has been discovered to have race conditions with
buffers when I/O errors occur. External tools are available to simulate
I/O errors to XFS, but this alone is not sufficient for testing log
recovery. XFS unconditionally resets the inactive region of the log
prior to log recovery to avoid confusion over processing any partially
written log records that might have been written before an unclean
shutdown. Therefore, unconditional write I/O failures at mount time are
caught by the reset sequence rather than log recovery and hinder the
ability to test the latter.
The device-mapper dm-flakey module uses an up/down timer to define a
cycle for when to fail I/Os. Create a pre log recovery delay tunable
that can be used to coordinate XFS log recovery with I/O errors
simulated by dm-flakey. This facilitates coordination in userspace that
allows the reset of stale log blocks to succeed and writes due to log
recovery to fail. For example, define a dm-flakey instance with an
uptime long enough to allow log reset to succeed and a log recovery
delay long enough to allow the dm-flakey uptime to expire.
The 'log_recovery_delay' sysfs tunable is exported under
/sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug
mode. The value is exported in units of seconds and allows for a delay
of up to 60 seconds. Note that this is for XFS debug and test
instrumentation purposes only and should not be used by applications. No
delay is enabled by default.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a top-level debug directory for global debug sysfs attributes.
This directory is added and removed on XFS module initialization and
removal respectively for DEBUG mode kernels only. It typically resides
at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
hierarchy as attributes might define global behavior or behavior that
must be configured before an xfs mount is available (e.g., log recovery
behavior).
Define the global debug kobject that represents the debug sysfs
directory and add generic attribute show/store helpers to support future
attributes. No debug attributes are exported as of yet.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These were exposed by fsfuzzer runs; without them we fail
in various exciting and sometimes convoluted ways when we
encounter disk corruption.
Without the MAXLEVELS tests we tend to walk off the end of
an array in a loop like this:
for (i = 0; i < cur->bc_nlevels; i++) {
if (cur->bc_bufs[i])
Without the dirblklog test we try to allocate more memory
than we could possibly hope for and loop forever:
xfs_dabuf_map()
nfsb = mp->m_dir_geo->fsbcount;
irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
As for the logbsize check, that's the convoluted one.
If logbsize is specified at mount time, it's sanitized
in xfs_parseargs; in particular it makes sure that it's
not > XLOG_MAX_RECORD_BSIZE.
If not specified at mount time, it comes from the superblock
via sb_logsunit; this is limited to 256k at mkfs time as well;
it's copied into m_logbsize in xfs_finish_flags().
However, if for some reason the on-disk value is corrupt and
too large, nothing catches it. It's a circuitous path, but
that size eventually finds its way to places that make the kernel
very unhappy, leading to oopses in xlog_pack_data() because we
use the size as an index into iclog->ic_data, but the array
is not necessarily that big.
Anyway - bounds checking when we read from disk is a good thing!
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Workqueues must be explicitly set as freezable to ensure they are frozen
in the assocated part of the hibernation/suspend sequence. Freezing of
workqueues and kernel threads is important to ensure that modifications
are not made on-disk after the hibernation image has been created.
Otherwise, the in-memory state can become inconsistent with what is on
disk and eventually lead to filesystem corruption. We have reports of
free space btree corruptions that occur immediately after restore from
hibernate that suggest the xfs-eofblocks workqueue could be causing
such problems if it races with hibernation.
Mark all of the internal XFS workqueues as freezable to ensure nothing
changes on-disk once the freezer infrastructure freezes kernel threads
and creates the hibernation image.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.
The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.
As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.
However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....
And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.
The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.
The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.
Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.
Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.
truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.
buffered reads will find an up to date page with zeros instead of
the data actually on disk.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
[dchinner: catch error and warn if it fails. Comment.]
cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.
The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.
cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Highlights:
- More fixes for read/write codepath regressions
- Sleeping while holding the inode lock
- Stricter enforcement of page contiguity when coalescing requests
- Fix up error handling in the page coalescing code
- Don't busy wait on SIGKILL in the file locking code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT+0LpAAoJEGcL54qWCgDyWfsP/imrpge47aZywi95chV8vgjM
O85ITZbupTFwXbB7kE63CrcaxRGhFrSStk4UDhDCDkHfFb1ksjZaPR1mnkwvkR2p
4+JUoq0fkPfeX21+rqKCYmnhstpne/N8K8FJBsEs3/TqiCBWxWOelLXdyWun4H5B
9JBYQ7FYitUazeSiSiDXcl7Di/E09cFPi0H5VPKRyuNdYxySabnsBOELBE/28iXr
egW1I9UKQR2EtBrvgazBbWE5XmB9XAm4X3sD1l0QD65mfSNkbnNhPFSiCdT7f/d6
9uxECR0Y4wNYgYAfVLBew5/MXJajcv03BFMKmTUeGj9fOQzycpBT4Dx2KxEWqfnt
Xk2nNbISxBnO0koMflmo+LPv2lv+Br3kQ+eZCHHKknvBrX2a6bJdTCZkwACVtND9
LdbAveFQpdaeLrm/28TnRoE927r+VeAVM19yOSG8sNAskFFg4Yy51tR0e1GivkJT
+qmmTRx+l78HjHvoPXOYdNgBC954r6APH5ST7su/7WxNClM36fEK6XxA9xbDLJWm
wUzlGKvpwEeBJJhgjbQLwuU8BiksjFz/CaiObNvPOpc/d2GoKIhnTg19kNhg2R//
UCDa2d5fep4z0Bo9p0s1KZm9pSBkkLjvRp9dm8WEIxLcdaF1jBK3dJECepm6ccvw
dmEmEfjbMudVdt/ZhapJ
=2wRt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights:
- more fixes for read/write codepath regressions
* sleeping while holding the inode lock
* stricter enforcement of page contiguity when coalescing requests
* fix up error handling in the page coalescing code
- don't busy wait on SIGKILL in the file locking code"
* tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
nfs: can_coalesce_requests must enforce contiguity
nfs: disallow duplicate pages in pgio page vectors
nfs: don't sleep with inode lock in lock_and_join_requests
nfs: fix error handling in lock_and_join_requests
nfs: use blocking page_group_lock in add_request
nfs: fix nonblocking calls to nfs_page_group_lock
nfs: change nfs_page_group_lock argument
* Confine SH_INTC to platforms that need it
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT+oswAAoJENfPZGlqN0++/A8QAK2yp96zl7gABtddgkunNJiV
xWCm4wcP+7dIrLpwizxtt/6HMPj9ZJHYtcxRRKVlzquwcukeVCIkfZ3Qvn2JilWM
+b4rVRGzQ+z0M7SCpNGjtgFc1IFrVrzuxZpPwgseQ5I6HQsZJKUi3qn5rRJSEax2
w+ANis2eZfmSBdu3Qsx2QBDXvS7ZPlLpimoAE+rjr60dZnljcrvBrVHQvSgpEHxn
4rXPbYrrkIee32J4lxLRDPtn5fvAsrr+vcLXEYqyFr2292U2PAxIrjZgrRzZYNtD
L6xmhZAhdmIpC0gLDLyKQhwj/9pfGFxFErZX9jeGdPT2KTEEYPkSxb75kdTwmxP4
DEtZG9Nuu6CQCBldrMt6kpujn+XCYRl94cSyUffJ5KVhNUiqB+e+JP80Yj9HB7QU
nGUfs2hQ/azY8XEiwvnls0314thkB5gN7KHxmlaFa9Wz7cgYFwWAguHH8qrq839U
+i+7dUg7nlmVoIAvzYIIA+L7x3xX4gEwsf88x3i04DPp0A8/oDAwGrqM5QzsaZpD
ghodnKRG7foc+29RLYka5CxF/MqKzpJu675sSBCVB3uW9NZpoGSwml/NN9yUu1JD
uvU2FECcbxF5OU6RHfq8FFql0fAOVeVWsMCEXgkjpZOXynKU15VTf1K2d+qkUZ/Y
gV45Jj+yCZhYs0wbXUml
=QSyy
-----END PGP SIGNATURE-----
Merge tag 'renesas-sh-drivers-for-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas
Pull SH driver fix from Simon Horman:
"Confine SH_INTC to platforms that need it"
* tag 'renesas-sh-drivers-for-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
sh: intc: Confine SH_INTC to platforms that need it
Pull MIPS fixes from Ralf Baechle:
"Pretty much all across the field so with this we should be in
reasonable shape for the upcoming -rc2"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: OCTEON: make get_system_type() thread-safe
MIPS: CPS: Initialize EVA before bringing up VPEs from secondary cores
MIPS: Malta: EVA: Rename 'eva_entry' to 'platform_eva_init'
MIPS: EVA: Add new EVA header
MIPS: scall64-o32: Fix indirect syscall detection
MIPS: syscall: Fix AUDIT value for O32 processes on MIPS64
MIPS: Loongson: Fix COP2 usage for preemptible kernel
MIPS: NL: Fix nlm_xlp_defconfig build error
MIPS: Remove race window in page fault handling
MIPS: Malta: Improve system memory detection for '{e, }memsize' >= 2G
MIPS: Alchemy: Fix db1200 PSC clock enablement
MIPS: BCM47XX: Fix reboot problem on BCM4705/BCM4785
MIPS: Remove duplicated include from numa.c
MIPS: Add common plat_irq_dispatch declaration
MIPS: MSP71xx: remove unused plat_irq_dispatch() argument
MIPS: GIC: Remove useless parens from GICBIS().
MIPS: perf: Mark pmu interupt IRQF_NO_THREAD
separate trampolines had a design flaw with the interaction between
the function and function_graph tracers.
The main flaw was the simplification of the use of multiple tracers having
the same filter (like function and function_graph, that use the
set_ftrace_filter file to filter their code). The design assumed that the
two tracers could never run simultaneously as only one tracer can be
used at a time. The problem with this assumption was that the function
profiler could be implemented on top of the function graph tracer, and
the function profiler could run at the same time as the function tracer.
This caused the assumption to be broken and when ftrace detected this
failed assumpiton it would spit out a nasty warning and shut itself down.
Instead of using a single ftrace_ops that switches between the function
and function_graph callbacks, the two tracers can again use their own
ftrace_ops. But instead of having a complex hierarchy of ftrace_ops,
the filter fields are placed in its own structure and the ftrace_ops
can carefully use the same filter. This change took a bit to be able
to allow for this and currently only the global_ops can share the same
filter, but this new design can easily be modified to allow for any
ftrace_ops to share its filter with another ftrace_ops.
The first four patches deal with the change of allowing the ftrace_ops
to share the filter (and this needs to go to 3.16 as well).
The fifth patch fixes a bug that was also caused by the new changes
but only for archs other than x86, and only if those archs implement
a direct call to the function_graph tracer which they do not do yet
but will in the future. It does not need to go to stable, but needs
to be fixed before the other archs update their code to allow direct
calls to the function_graph trampoline.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT+hqSAAoJEKQekfcNnQGulvcH/0O4NMXX4HH1dQlYgKEaSYxE
Nh8WdiewopF5iaeNvo+8Nzdq8D2k3KgMOqSlzJ4JVmzd7gjOBSGeKDfqFwR+IbTk
9LcaJJCI3oG3MEf6m7gZMdjKPKyxkeYHDtG7kRHo8z94eliV9pKC6fUnEWayQO3o
Kv6IBupdkF8ICAiKRae5Uo0c9wjZ9YP0bZS7fxI2hJw3h/NMFnhnhUL03URIx8e3
dqgpweYg+P3KPfp2Jz6safdJqLTPK9rqqhkZhylbDl7o78xEzRN7wCyB6Nak00xz
swRgsW6vFP7ci/YSNx+B6HCIf7NTm3WLDrrIhitNHcJUZwUMU3CRO9IJHGsTuEE=
=J5lZ
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull fix for ftrace function tracer/profiler conflict from Steven Rostedt:
"The rewrite of the ftrace code that makes it possible to allow for
separate trampolines had a design flaw with the interaction between
the function and function_graph tracers.
The main flaw was the simplification of the use of multiple tracers
having the same filter (like function and function_graph, that use the
set_ftrace_filter file to filter their code). The design assumed that
the two tracers could never run simultaneously as only one tracer can
be used at a time. The problem with this assumption was that the
function profiler could be implemented on top of the function graph
tracer, and the function profiler could run at the same time as the
function tracer. This caused the assumption to be broken and when
ftrace detected this failed assumpiton it would spit out a nasty
warning and shut itself down.
Instead of using a single ftrace_ops that switches between the
function and function_graph callbacks, the two tracers can again use
their own ftrace_ops. But instead of having a complex hierarchy of
ftrace_ops, the filter fields are placed in its own structure and the
ftrace_ops can carefully use the same filter. This change took a bit
to be able to allow for this and currently only the global_ops can
share the same filter, but this new design can easily be modified to
allow for any ftrace_ops to share its filter with another ftrace_ops.
The first four patches deal with the change of allowing the ftrace_ops
to share the filter (and this needs to go to 3.16 as well).
The fifth patch fixes a bug that was also caused by the new changes
but only for archs other than x86, and only if those archs implement a
direct call to the function_graph tracer which they do not do yet but
will in the future. It does not need to go to stable, but needs to be
fixed before the other archs update their code to allow direct calls
to the function_graph trampoline"
* tag 'trace-fixes-v3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Use current addr when converting to nop in __ftrace_replace_code()
ftrace: Fix function_profiler and function tracer together
ftrace: Fix up trampoline accounting with looping on hash ops
ftrace: Update all ftrace_ops for a ftrace_hash_ops update
ftrace: Allow ftrace_ops to use the hashes from other ops
Pull x86 fixes from Ingo Molnar:
"A couple of EFI fixes, plus misc fixes all around the map"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/arm64: Store Runtime Services revision
firmware: Do not use WARN_ON(!spin_is_locked())
x86_32, entry: Clean up sysenter_badsys declaration
x86/doc: Fix the 'tlb_single_page_flush_ceiling' sysconfig path
x86/mm: Fix sparse 'tlb_single_page_flush_ceiling' warning and make the variable read-mostly
x86/mm: Fix RCU splat from new TLB tracepoints
Pull perf fixes from Ingo Molnar:
"A kprobes and a perf compat ioctl fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Handle compat ioctl
kprobes: Skip kretprobe hit in NMI context to avoid deadlock
A collection of fixes from this week, it's been pretty quiet and nothing
really stands out as particularly noteworthy here -- mostly minor fixes
across the field:
- ODROID booting was fixed due to PMIC interrupts missing in DT
- A collection of i.MX fixes
- Minor Tegra fix for regulators
- Rockchip fix and addition of SoC-specific mailing list to make it
easier to find posted patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJT+jEaAAoJEIwa5zzehBx3gPIP/2EMvkRiFL73z6u67d6AhqCR
UPpv65Lk0FzvKZYvpNdNDdPatS48SpEh4Ppj58HegK31jaZIYhsvGHFsAwrHpZXF
2D2H8Vqy6wl+UL+BQGTHJ2rhqgg40PSZQh1KksaBrjb9MDWUbAI0V9AysoT6ecIP
/02mzRkNPL7V5ISikksa82FWbZwI36KJGTM19ZDp6CYQnV2L4eG0LefGjgEzdE07
iur1PO/TUi0ibQex3It9D+ynNlza0sZJDR0AsGFTS/96fvtX6SQzvxtEsr4jUfyT
jceqD5KIWS9N1OmRudJ+e3awpzGGuRIkdq36eiJbhSe426LHgDNbIS4RU+YRNFIf
9/bK4blcxGNnddsDTLUIyi+vykAm1ObAfGNNrKeA4z9lDw0QVuoG5VwrgNKcIk3J
kugj3RLUQ5yd9iIFJyrPxlpB5mTo5SaPCSxjuDKzftwDQtF+SJI58V//Nde0Ocfw
K7VpmY26uYUmf6AltiyFxOCUASzUC3Bp+/cf0lYTWE1+iIuOTJRlYmYwKDMrJ+c0
mtz3uCiQhTzaHje6AksA1wlKhv3KS/opGN1oNVSILgjExiRGh94MSkROLqtJ35cE
WV/nuzZUPdmZ6tGQ6b7FEa0elbEElaioDiQraOeSehKijEjfhK+tHNJczoO94s2R
Swfff4NwRe346ZMk/L/2
=+oYu
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A collection of fixes from this week, it's been pretty quiet and
nothing really stands out as particularly noteworthy here -- mostly
minor fixes across the field:
- ODROID booting was fixed due to PMIC interrupts missing in DT
- a collection of i.MX fixes
- minor Tegra fix for regulators
- Rockchip fix and addition of SoC-specific mailing list to make it
easier to find posted patches"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
bus: arm-ccn: Fix warning message
ARM: shmobile: koelsch: Remove non-existent i2c6 pinmux
ARM: tegra: apalis/colibri t30: fix on-module 5v0 supplies
MAINTAINERS: add new Rockchip SoC list
ARM: dts: rockchip: readd missing mmc0 pinctrl settings
ARM: dts: ODROID i2c improvements
ARM: dts: Enable PMIC interrupts on ODROID
ARM: dts: imx6sx: fix the pad setting for uart CTS_B
ARM: dts: i.MX53: fix apparent bug in VPU clks
ARM: imx: correct gpu2d_axi and gpu3d_axi clock setting
ARM: dts: imx6: edmqmx6: change enet reset pin
ARM: dts: vf610-twr: Fix pinctrl_esdhc1 pin definitions.
ARM: imx: remove unnecessary ARCH_HAS_OPP select
ARM: imx: fix TLB missing of IOMUXC base address during suspend
ARM: imx6: fix SMP compilation again
ARM: dt: sun6i: Add #address-cells and #size-cells to i2c controller nodes
- A largeish fix for the IRQ handling in the new Zynq driver.
The quite verbose commit message gives the exact details.
- Move some defines for gpiod flags outside an ifdef to make
stub functions work again.
- Various minor fixes that we can accept for -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT+ZCiAAoJEEEQszewGV1zOZEQALkGrQacC9pZdhw93+wumj/d
goA4B1VSwi+vlQ3DcF7oXbTcbaYfGu58ZAlShc9XhVINaWEkVfc1GnBCq9rsnjNh
/rPaqDpqqBo7tfghjsMIZFdlNH151u/4rCAFMUB/STOAlUckSCiqWdxd7wMhAPlZ
yLD5oKqm3mWu/PX2n334rkLb1UxT1vIniB3DgQRhUtrKeYarH9OpzJnWnCCmspCr
GeVWpq4Z851fHMok5D76EbLL57JykTCC7VjHyK9jrmbjpMFX6+lYmWWMCOgUHypi
zA5ZEL7PjhhrZlwTN5yn8QbuSfbmT14tTUP/OLtItE0CYjF1nheyxueKErHT+DP4
piLDZgdv79ZYBSbvfbpjeO5VLkZz3m7EAGp/0kWy89uGcwjVsk+zsDSDc0v09tTk
hYRhu5/gc8zp3rd7v4DspSBbpxK+iI7uMzaGuv32AEdPvFzDVy4SJmRN/qTt/W9E
KVAdk4OJdx7xb9ee9Z2OfdbteOt0zFQjmTIneZ0urlqOZCbvbFH4PvuPB4L81SCV
XQ023OZqbBYb72Kz3Fw46NjMvRql3rkDph62oikEtfSKmSziMDY9z6bY42b0PCTq
mSovoQduN06wRDy5VQA8H1h7VF3k8NXjTI2hMEV85VhxlCzhabukBf+HY37TQIhT
pLQp0fwKmf8iZWtuF4Of
=7iYA
-----END PGP SIGNATURE-----
Merge tag 'gpio-v3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull gpio fixes from Linus Walleij:
- a largeish fix for the IRQ handling in the new Zynq driver. The
quite verbose commit message gives the exact details.
- move some defines for gpiod flags outside an ifdef to make stub
functions work again.
- various minor fixes that we can accept for -rc1.
* tag 'gpio-v3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio-lynxpoint: enable input sensing in resume
gpio: move GPIOD flags outside #ifdef
gpio: delete unneeded test before of_node_put
gpio: zynq: Fix IRQ handlers
gpiolib: devres: use correct structure type name in sizeof
MAINTAINERS: Change maintainer for gpio-bcm-kona.c
Pull drm fixes from Dave Airlie:
"Intel and radeon fixes.
Post KS/LC git requests from i915 and radeon stacked up. They are all
fixes along with some new pci ids for radeon, and one maintainers file
entry.
- i915: display fixes and irq fixes
- radeon: pci ids, and misc gpuvm, dpm and hdp cache"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (29 commits)
MAINTAINERS: Add entry for Renesas DRM drivers
drm/radeon: add additional SI pci ids
drm/radeon: add new bonaire pci ids
drm/radeon: add new KV pci id
Revert "drm/radeon: Use write-combined CPU mappings of ring buffers with PCIe"
drm/radeon: fix active_cu mask on SI and CIK after re-init (v3)
drm/radeon: fix active cu count for SI and CIK
drm/radeon: re-enable selective GPUVM flushing
drm/radeon: Sync ME and PFP after CP semaphore waits v4
drm/radeon: fix display handling in radeon_gpu_reset
drm/radeon: fix pm handling in radeon_gpu_reset
drm/radeon: Only flush HDP cache for indirect buffers from userspace
drm/radeon: properly document reloc priority mask
drm/i915: don't try to retrain a DP link on an inactive CRTC
drm/i915: make sure VDD is turned off during system suspend
drm/i915: cancel hotplug and dig_port work during suspend and unload
drm/i915: fix HPD IRQ reenable work cancelation
drm/i915: take display port power domain in DP HPD handler
drm/i915: Don't try to enable cursor from setplane when crtc is disabled
drm/i915: Skip load detect when intel_crtc->new_enable==true
...